├── nbjekyll ├── nb_git │ ├── tests │ │ ├── __init__.py │ │ ├── test_repo.py │ │ ├── base.py │ │ └── files │ │ │ └── Tutorial.md │ ├── __init__.py │ └── nb_git.py ├── jekyllconvert │ ├── __init__.py │ ├── templates │ │ └── Jekyll_template.tpl │ └── jekyll_export.py ├── __init__.py └── convert_nbs.py ├── setup.cfg ├── MANIFEST.in ├── testenv.yml ├── .gitignore ├── .travis.yml ├── LICENSE ├── License.txt ├── setup.py └── README.md /nbjekyll/nb_git/tests/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /nbjekyll/jekyllconvert/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | description_file = README.md -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include nbjekyll/jekyllconvert/templates/* 2 | -------------------------------------------------------------------------------- /nbjekyll/nb_git/__init__.py: -------------------------------------------------------------------------------- 1 | from .nb_git import nb_repo -------------------------------------------------------------------------------- /nbjekyll/__init__.py: -------------------------------------------------------------------------------- 1 | from . import jekyllconvert 2 | from . 
import nb_git -------------------------------------------------------------------------------- /testenv.yml: -------------------------------------------------------------------------------- 1 | name: testenv 2 | dependencies: 3 | - python=3.6 4 | - pytest 5 | - pytz 6 | - pip 7 | - pip: 8 | - nbval 9 | - nbconvert 10 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | \.DS_Store 3 | 4 | \.idea/ 5 | 6 | __pycache__/ 7 | 8 | 9 | 10 | # Compiled python modules. 11 | *.pyc 12 | 13 | # Setuptools distribution folder. 14 | /dist/ 15 | 16 | # Python egg metadata, regenerated from source files by setuptools. 17 | /*.egg-info 18 | /*.egg 19 | 20 | \.cache/v/cache/ 21 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | 3 | python: 4 | - 3.6 5 | 6 | env: 7 | global: 8 | - PACKAGENAME="nbjekyll" 9 | 10 | before_install: 11 | # Here we download miniconda and create our env 12 | - export MINICONDA=$HOME/miniconda 13 | - export PATH="$MINICONDA/bin:$PATH" 14 | - hash -r 15 | - wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh 16 | - bash miniconda.sh -b -f -p $MINICONDA 17 | - conda config --set always_yes yes 18 | - conda update conda 19 | - conda info -a 20 | - conda env create -f testenv.yml -v 21 | - source activate testenv 22 | - conda install -c conda-forge pygit2 23 | 24 | install: 25 | - python setup.py install 26 | 27 | script: 28 | - pytest $PACKAGENAME 29 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Tania Allard 4 | 5 | Permission is hereby granted, free of charge, to any person 
obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /License.txt: -------------------------------------------------------------------------------- 1 | The MIT License 2 | Copyright (c) 2018 Tania Allard 3 | Permission is hereby granted, free of charge, to any person obtaining 4 | a copy of this software and associated documentation files (the 5 | "Software"), to deal in the Software without restriction, including 6 | without limitation the rights to use, copy, modify, merge, publish, 7 | distribute, sublicense, and/or sell copies of the Software, and to 8 | permit persons to whom the Software is furnished to do so, subject to 9 | the following conditions: 10 | The above copyright notice and this permission notice shall be 11 | included in all copies or substantial portions of the Software. 
12 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 13 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 14 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 15 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 16 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 17 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 18 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 19 | -------------------------------------------------------------------------------- /nbjekyll/nb_git/tests/test_repo.py: -------------------------------------------------------------------------------- 1 | """Tests the nb_git functions """ 2 | import os 3 | import unittest 4 | import pygit2 5 | import pytest 6 | import time 7 | import datetime 8 | 9 | from . import base 10 | 11 | from ..nb_git import nb_repo 12 | 13 | here = os.getcwd() 14 | 15 | class RepoTest(base.NoRepoTestCase): 16 | 17 | def test_discover_repo(self): 18 | repo = pygit2.init_repository(self._temp_dir, False) 19 | subdir = os.path.join(self._temp_dir, "test1", "test2") 20 | os.makedirs(subdir) 21 | self.assertEqual(repo.path, pygit2.discover_repository(subdir)) 22 | 23 | 24 | def test_nb_repo(): 25 | """ Checks a repo is found """ 26 | repo = nb_repo(here) 27 | assert repo.repo.path == pygit2.discover_repository(here) 28 | 29 | def test_find_notebooks(): 30 | """Checks that the method finds the notebooks 31 | in this repository """ 32 | notebooks = nb_repo(os.getcwd()).find_notebooks() 33 | assert len(notebooks) == 1 34 | assert notebooks[0] == 'Tutorial.ipynb' 35 | 36 | def test_get_commit(): 37 | """Tests the get commit function, does it return a commit? 
38 | """ 39 | last_commit = nb_repo(os.getcwd()).get_commit() 40 | repository = pygit2.Repository(pygit2.discover_repository(here)) 41 | assert last_commit['sha1'] == repository.revparse_single('HEAD').hex[0:7] 42 | 43 | def test_convert_time(): 44 | now = time.time() 45 | conv_time = nb_repo(os.getcwd()).convert_time(now) 46 | now_full = time.strftime("%d-%m-%Y", time.gmtime(now)) 47 | assert conv_time == now_full 48 | 49 | 50 | if __name__ == '__main__': 51 | unittest.main() -------------------------------------------------------------------------------- /nbjekyll/nb_git/tests/base.py: -------------------------------------------------------------------------------- 1 | """ Base TestCase for testing nb_git""" 2 | import gc 3 | import os 4 | import shutil 5 | import unittest 6 | import tempfile 7 | 8 | class NoRepoTestCase(unittest.TestCase): 9 | 10 | def setUp(self): 11 | self._temp_dir = tempfile.mkdtemp() 12 | self.repo = None 13 | 14 | def tearDown(self): 15 | del self.repo 16 | gc.collect() 17 | rmtree(self._temp_dir) 18 | 19 | def assertRaisesAssign(self, exc_class, instance, name, value): 20 | try: 21 | setattr(instance, name, value) 22 | except Exception as exc: 23 | self.assertEqual(exc_class, type(exc)) 24 | 25 | def assertAll(self, func, entries): 26 | return self.assertTrue(all(func(x) for x in entries)) 27 | 28 | def assertAny(self, func, entries): 29 | return self.assertTrue(any(func(x) for x in entries)) 30 | 31 | def assertRaisesWithArg(self, exc_class, arg, func, *args, **kwargs): 32 | try: 33 | func(*args, **kwargs) 34 | except exc_class as exc_value: 35 | self.assertEqual((arg,), exc_value.args) 36 | else: 37 | self.fail('%s(%r) not raised' % (exc_class.__name__, arg)) 38 | 39 | def assertEqualSignature(self, a, b): 40 | # XXX Remove this once equality test is supported by Signature 41 | self.assertEqual(a.name, b.name) 42 | self.assertEqual(a.email, b.email) 43 | self.assertEqual(a.time, b.time) 44 | self.assertEqual(a.offset, b.offset) 45 | 46 | 47 | 
def rmtree(path): 48 | """In Windows a read-only file cannot be removed, and shutil.rmtree fails. 49 | So we implement our own version of rmtree to address this issue. 50 | """ 51 | if os.path.exists(path): 52 | shutil.rmtree(path, onerror=force_rm_handle) 53 | 54 | 55 | def force_rm_handle(remove_path, path, excinfo): 56 | """Make the path writable, then retry the removal that failed.""" 57 | os.chmod(path, 0o700) 58 | remove_path(path) 59 | -------------------------------------------------------------------------------- /nbjekyll/jekyllconvert/templates/Jekyll_template.tpl: -------------------------------------------------------------------------------- 1 | {% extends 'markdown.tpl' %} 2 | 3 | {# custom header for jekyll post #} 4 | {%- block header -%} 5 | --- 6 | layout: notebook 7 | title: "{{resources['metadata']['name']}}" 8 | tags: 9 | update_date: [-date-] 10 | code_version: [-sha1-] 11 | author: [-author-] 12 | validation_pass: '[-validated-]' 13 | badge: "https://img.shields.io/badge/notebook-[-badge-]" 14 | --- 15 |
16 | 17 | 18 | {%- if "widgets" in nb.metadata -%} 19 | 20 | {%- endif-%} 21 | {%- endblock header -%} 22 | 23 | 24 | {% block in_prompt -%} 25 | {%- if cell.execution_count is defined -%} 26 | {%- if resources.global_content_filter.include_input_prompt-%} 27 | In [{{ cell.execution_count|replace(None, " ") }}]: 28 | {%- else -%} 29 | In [ ]: 30 | {%- endif -%} 31 | {%- endif -%} 32 | {%- endblock in_prompt %} 33 | 34 | 35 | {# Images will be saved in the custom path #} 36 | {% block data_svg %} 37 | svg 38 | {% endblock data_svg %} 39 | 40 | {% block data_png %} 41 | png 42 | {% endblock data_png %} 43 | 44 | {% block data_jpg %} 45 | jpeg 46 | {% endblock data_jpg %} 47 | 48 | {# cells containing markdown text only #} 49 | {% block markdowncell scoped %} 50 | {{ cell.source | wrap_text(80) }} 51 | {% endblock markdowncell %} 52 | 53 | {# headings #} 54 | {% block headingcell scoped %} 55 | {{ '#' * cell.level }} {{ cell.source | replace('\n', ' ') }} 56 | {% endblock headingcell %} 57 | 58 | {% block stream -%} 59 | {% endblock stream %} 60 | 61 | {# latex data block#} 62 | {% block data_latex %} 63 | {{ output.data['text/latex'] }} 64 | {% endblock data_latex %} 65 | 66 | {% block data_text scoped %} 67 | {{ output.data['text/plain'] | indent }} 68 | {% endblock data_text %} 69 | 70 | {% block data_html scoped -%} 71 | 72 | {{ output.data['text/html'] }} 73 | 74 | {%- endblock data_html %} 75 | 76 | {% block data_markdown scoped -%} 77 | {{ output.data['text/markdown'] | markdown2html }} 78 | 79 | {%- endblock data_markdown %} 80 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from setuptools import setup, find_packages 4 | 5 | name = 'nbjekyll' 6 | 7 | pkg_root = os.path.join(os.path.dirname(__file__), name) 8 | here = os.path.dirname(__file__) 9 | 10 | setup_args = dict(name = name, 11 | version = '0.1.1', 12 | 13 
| description = 'Package used for easy conversion from Jupyter notebook to Jekyll posts', 14 | long_description = open('README.md').read(), 15 | 16 | url = 'https://github.com/trallard/nbconvert-jekyllconvert.git', 17 | download_url ='https://github.com/trallard/nbjekyll/archive/v0.1.1.tar.gz', 18 | 19 | # Author details 20 | author = 'Tania Allard', 21 | author_email = 'taniar.allard@gmail.com', 22 | 23 | license = 'MIT', 24 | include_package_data = True, 25 | 26 | # You can just specify the packages manually here if your project is 27 | # simple. Or you can use find_packages(). 28 | packages = find_packages(), 29 | zip_safe = False, 30 | 31 | install_requires = ['pygit2', 'nbval', 'nbconvert >= 5.0','pytz'], 32 | 33 | keywords = 'jupyter, jekyll, teaching, dissemination, open science', 34 | 35 | classifiers = [ 36 | # Specify the Python versions you support here. In particular, ensure 37 | # that you indicate whether you support Python 2, Python 3 or both. 38 | 'Programming Language :: Python :: 2', 39 | 'Programming Language :: Python :: 2.6', 40 | 'Programming Language :: Python :: 2.7', 41 | 'Programming Language :: Python :: 3', 42 | 'Programming Language :: Python :: 3.2', 43 | 'Programming Language :: Python :: 3.3', 44 | 'Programming Language :: Python :: 3.4', 45 | 'Programming Language :: Python :: 3.5', 46 | 'Programming Language :: Python :: 3.6', 47 | "Development Status :: 3 - Alpha", 48 | "Intended Audience :: Education", 49 | "License :: OSI Approved :: MIT License" ] ) 50 | 51 | if __name__ == '__main__': 52 | setup(**setup_args) 53 | -------------------------------------------------------------------------------- /nbjekyll/convert_nbs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | Script for conversion of .ipynb files into a format suitable 4 | for Jekyll blog posts 5 | """ 6 | 7 | import os 8 | 9 | from pathlib import Path 10 | from string import Template 11 | import 
pytest 12 | import argparse 13 | 14 | from .nb_git.nb_git import nb_repo 15 | from .jekyllconvert import jekyll_export 16 | 17 | 18 | #----------------------------------------------------------------------------- 19 | #Classes and functions 20 | #----------------------------------------------------------------------------- 21 | 22 | def validate_nb(nb): 23 | """ 24 | Run pytest with nbval on the notebooks 25 | :param nb: 26 | :return: pytest exit code 27 | see https://docs.pytest.org/en/latest/usage.html?%20main 28 | """ 29 | print("[nbjekyll] Running test on {}".format(os.path.split(nb)[1])) 30 | return validation_code(pytest.main([nb, '--nbval-lax'])) 31 | 32 | 33 | def validation_code(exit_code): 34 | """ 35 | Check the exit code and pass the value to 36 | the dictionary containing the commit information 37 | :param exit_code: 38 | :return: validation status 39 | """ 40 | if exit_code == 0: 41 | validated = 'yes' 42 | badge = 'validated-brightgreen.svg' 43 | elif exit_code == 1: 44 | validated = 'no' 45 | badge = 'validation%20failed-red.svg' 46 | else: 47 | validated = 'unknown' 48 | badge = 'unknown%20status-yellow.svg' 49 | return [validated, badge] 50 | 51 | def format_template(commit_info, nb): 52 | """ 53 | Replace the template data with the information 54 | collected from the commit before 55 | :param commit_info: 56 | :param nb: 57 | :return: modified .md for the notebook previously 58 | converted 59 | """ 60 | 61 | nb_path = os.path.splitext(os.path.abspath(nb))[0] + '.md' 62 | with open(nb_path, 'r+') as file: 63 | template = NbTemplate(file.read()) 64 | updated = template.substitute(commit_info) 65 | file.seek(0) 66 | file.write(updated) 67 | file.truncate() 68 | 69 | 70 | class NbTemplate(Template): 71 | """ 72 | Subclass of Template, this uses [- -] as the delimiter sequence 73 | to replace the template variables instead of the default $, ${}, $$ 74 | as this causes problems when the notebooks use the R kernel 75 | """ 76 | delimiter = '[-' 77 | pattern 
= r''' 78 | \[-(?: 79 | (?P<escaped>-) | # Expression [-- will become [- 80 | (?P<named>[^\[\]\n-]+)-\] | # -, [, ], and \n can't be used in names 81 | \b\B(?P<braced>) | # Braced names disabled 82 | (?P<invalid>) # 83 | ) 84 | ''' 85 | 86 | def parse_path(): 87 | arg_parser = argparse.ArgumentParser(description="Convert Jupyter notebooks to Jekyll posts") 88 | arg_parser.add_argument('-p', '--path', 89 | help="Custom path to save the Notebook images. The path in the" 90 | " output markdown will be modified accordingly") 91 | 92 | return arg_parser.parse_args() 93 | 94 | if __name__ == '__main__': 95 | args = parse_path() 96 | if args.path: 97 | img_path = args.path 98 | else: 99 | img_path = './images/notebook_images' 100 | 101 | print('[nbjekyll] Images will be saved in [{}]'.format(img_path)) 102 | 103 | here = os.getcwd() 104 | 105 | # Step one: find if this is a repository 106 | repository = nb_repo(here) 107 | 108 | # Find the notebooks that have been added to the repo 109 | # or that have been updated in the last commit 110 | notebooks = repository.check_log() 111 | # Convert each of the notebooks using nbconvert 112 | # then add repo specific information 113 | for nb in notebooks['notebooks']: 114 | nb_path = Path(nb).resolve() 115 | if os.path.exists(nb_path): 116 | # convert the notebook in a .md 117 | print('[nbjekyll] Converting [{}]'.format(nb)) 118 | jekyll_export.convert_single_nb(nb_path, img_path) 119 | # use nbval for the notebook 120 | test = validate_nb(nb_path) 121 | notebooks['validated'] = test[0] 122 | notebooks['badge'] = test[1] 123 | 124 | # substitute header 125 | format_template(notebooks, nb) 126 | print('[nbjekyll] Finalising conversion of [{}]'.format(nb)) 127 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # nbjekyll 2 | 3 | | Release | Usage | Development | 4 | 
|---------|-------|-------------| 5 | | [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active) | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) | [![Build Status](https://travis-ci.org/trallard/nbjekyll.svg?branch=master)](https://travis-ci.org/trallard/nbjekyll) | 6 | | [![PyPI](https://img.shields.io/pypi/v/nine.svg)](https://pypi.python.org/pypi/nbjekyll) | [![PyPI](https://img.shields.io/pypi/pyversions/Django.svg)]() | | 7 | 8 | An experimental tool to convert Jupyter notebooks to .md files that can be passed straight to Jekyll for publishing. 9 | 10 | Jupyter comes with support for generating .md files through its own exporters and templates. This is a very robust approach, but far from ideal for .md conversion for Jekyll static blogs. 11 | 12 | nbjekyll uses the nbconvert markdown exporter but ensures that the plots generated in the notebooks are saved in a separate directory and that the paths to such plots can be easily interpreted by Jekyll. 13 | The path for the plots is by default specified as `./images/notebook_images/{Notebook_name}` but it can be modified if needed. For more details see [Usage](#usage). 14 | 15 | 16 | nbjekyll uses [nbval](https://github.com/computationalmodelling/nbval) to test the notebooks. 
Depending on the status code (see [pytest exit codes](https://docs.pytest.org/en/latest/usage.html)) the validation status and the appropriate badge are added: 17 | 18 | ![](https://img.shields.io/badge/notebook-validated-brightgreen.svg) 19 | 20 | 21 | 22 | ![](https://img.shields.io/badge/notebook-unknown%20status-yellow.svg) 23 | 24 | It returns a .md file with the yaml header needed for Jekyll. It contains the mandatory fields of `title` and `layout` (by default set to notebook). It also adds other fields such as the sha1 and author of the last commit associated with the notebook, the last update date, and a badge indicating if the notebook passed the validation with nbval. 25 | 26 | ```yaml 27 | --- 28 | layout: notebook 29 | title: "Classify_demo" 30 | tags: 31 | update_date: 17-01-2018 32 | code_version: 19e3e29 33 | author: Tania Allard 34 | validation_pass: 'yes' 35 | badge: "https://img.shields.io/badge/notebook-validated-brightgreen.svg" 36 | --- 37 | ``` 38 | 39 | You can see a Jekyll site using the converted notebooks [here](http://bitsandchips.me/Modules-template/) ✨⚡️ 40 | 41 | ## Install 42 | nbjekyll is available from [PyPI](https://pypi.python.org/pypi/nbjekyll) so you can install it using pip like so: 43 | ```bash 44 | pip install nbjekyll 45 | ``` 46 | 47 | ## Usage 48 | Once the package is installed you can start using it directly from 49 | your Jekyll site directory. 50 | 51 | 1. Add the Jupyter notebook you want to add to your blog 52 | 2. Commit the notebook or notebooks to Git 53 | 3. Run the Jekyll converter from the terminal. Make sure to run it from the 54 | main directory of your Jekyll blog: 55 | 56 | ```bash 57 | python -m nbjekyll.convert_nbs 58 | ``` 59 | If you want your output images to be in a different path you can use the flag `-p` or `--path` like so: 60 | 61 | ```bash 62 | python -m nbjekyll.convert_nbs -p ./site_images 63 | ``` 64 | 4. Make sure to modify the layout in your .md yaml header! 
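The yaml header shown earlier starts out as a template with `[-name-]` placeholders, which the converter fills in with details of the last commit and the validation result. The following is a minimal sketch of that substitution, assuming a `string.Template` subclass similar to the one in `convert_nbs.py`. The class name `HeaderTemplate` and the sample values are illustrative only and are not part of the package.

```python
from string import Template


class HeaderTemplate(Template):
    """Hypothetical stand-in: placeholders look like [-name-] instead of $name."""
    delimiter = '[-'
    pattern = r'''
    \[-(?:
        (?P<escaped>-)             |  # two dashes after the bracket: escaped delimiter
        (?P<named>[^\[\]\n-]+)-\]  |  # a named placeholder, closed by a dash and bracket
        \b\B(?P<braced>)           |  # braced form disabled (this branch never matches)
        (?P<invalid>)                 # anything else after the delimiter is invalid
    )
    '''


# Header as it might look straight after conversion (sample text, assumed)
header = """---
layout: notebook
update_date: [-date-]
code_version: [-sha1-]
author: [-author-]
validation_pass: '[-validated-]'
badge: "https://img.shields.io/badge/notebook-[-badge-]"
---
"""

# Sample commit details; in the package these come from the Git repository
commit_info = {
    'date': '17-01-2018',
    'sha1': '19e3e29',
    'author': 'Tania Allard',
    'validated': 'yes',
    'badge': 'validated-brightgreen.svg',
}

filled = HeaderTemplate(header).substitute(commit_info)
print(filled)
```

Using a custom delimiter rather than the default `$` keeps notebook content that contains dollar signs (R code, for example) from being mistaken for placeholders.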
65 | 66 | ## Important things to consider 67 | - **You need to commit your notebooks to Git _right_ before using nbjekyll** 68 | 69 | At the moment nbjekyll checks the last commit in your repository and converts the notebooks associated with that commit. 70 | 71 | We are looking into changing this to allow for more flexibility in the near future. 72 | 73 | - **What are the prerequisites?** 74 | - Python > 3.4 75 | - pytest 76 | - nbval 77 | - nbconvert >= 5.0 78 | - pygit2 (if you use conda the easiest way to get this installed is by doing `conda install -c conda-forge pygit2`) 79 | -------------------------------------------------------------------------------- /nbjekyll/nb_git/nb_git.py: -------------------------------------------------------------------------------- 1 | """ 2 | Functions used to get details on the git repository 3 | and its commits. 4 | It is used to find which notebooks were modified in a specific 5 | """ 6 | 7 | import pygit2 8 | import os 9 | import fnmatch 10 | import glob 11 | from pathlib import Path 12 | from datetime import datetime 13 | import pytz 14 | 15 | class nb_repo(object): 16 | """ Class containing methods used to 17 | identify the notebooks committed to the 18 | repository and add the SHA to the Jinja template""" 19 | 20 | def __init__(self, here): 21 | """ Find if the current location is a 22 | Git repository, if so it will return a 23 | repository object """ 24 | try: 25 | repo_path = pygit2.discover_repository(here) 26 | repo = pygit2.Repository(repo_path) 27 | self.repo = repo 28 | self.here = here 29 | except Exception: 30 | raise RuntimeError("[nbjekyll] This does not seem to be a repository, " 31 | "make sure you are in an initialized repo") 32 | 33 | 34 | def check_log(self): 35 | """ Check the number of commits in the repository, 36 | if there is only one commit it will find all the notebooks 37 | inside the repository. 
38 | Otherwise, it will find the notebooks in the latest commit only 39 | Returns 40 | ------- 41 | notebooks: dictionary containing the sha1 for the commit, 42 | the list of found notebooks, author, and the date when 43 | the notebooks were last updated 44 | """ 45 | all_commits = [commit for commit in self.repo.head.log()] 46 | if len(all_commits) <= 1: 47 | print('[nbjekyll] Only one commit: converting all the notebooks in the repo') 48 | # calls function find_notebooks 49 | notebooks = self.find_notebooks() 50 | commit_info = self.get_commit() 51 | commit_info['notebooks'] = notebooks 52 | 53 | return commit_info 54 | else: 55 | print(("[nbjekyll] There are notebooks already in version control," 56 | " finding the notebooks passed in the last commit")) 57 | # calls function last_commit 58 | notebooks = self.last_commit() 59 | return notebooks 60 | 61 | def find_notebooks(self): 62 | """ Find all the notebooks in the repo, but excludes those 63 | in the _site folder. 64 | Returns 65 | ------- 66 | notebooks: list with the file names of the notebooks found 67 | (an empty list if there are none) 68 | """ 69 | basePath = os.getcwd() 70 | #notebooksAll = [nb for nb in glob.glob('**/*.ipynb')] 71 | notebooksAll = list() 72 | for root, dirs, files in os.walk(basePath): 73 | # skip anything inside the generated _site folder 74 | for file in files: 75 | if file.endswith('.ipynb') and '_site' not in Path(root).relative_to(basePath).parts: 76 | notebooksAll.append(file) 77 | 78 | # notebooks under _site were already skipped during the walk 79 | notebooks = notebooksAll 80 | 81 | if not notebooks: 82 | print('[nbjekyll] There were no notebooks found') 83 | 84 | return notebooks 85 | 86 | def last_commit(self): 87 | """ Find the notebooks modified in the last commit, but excludes those 88 | in the _site folder. 
89 | Returns 90 | ------- 91 | notebooks: dictionary containing the sha1, notebooks name, 92 | and date of the commit 93 | if no notebooks were modified in the last commit then 94 | the list notebooks is an empty list 95 | """ 96 | 97 | commit_info = self.get_commit() 98 | parent_commit = self.repo.get(commit_info['parent']) 99 | #notebooks = [nb.name for nb in self.repo.revparse_single('HEAD').tree if '.ipynb' in nb.name] 100 | diff = self.repo.revparse_single('HEAD').tree.diff_to_tree(parent_commit.tree) 101 | patches = [p for p in diff] 102 | notebooks = [patch.delta.new_file.path for patch in patches if 'ipynb' in patch.delta.new_file.path] 103 | commit_info['notebooks'] = notebooks 104 | del commit_info['parent'] 105 | 106 | return commit_info 107 | 108 | 109 | def convert_time(self, epoch): 110 | """ 111 | Takes the epoch date from the last commit and 112 | returns it in a human readable format 113 | :param epoch: 114 | :return: commit date in a dd-mm-YYYY format 115 | """ 116 | time_zone = pytz.timezone('GMT') 117 | dt = datetime.fromtimestamp(epoch, time_zone) 118 | 119 | return dt.strftime('%d-%m-%Y') 120 | 121 | def get_commit(self): 122 | """ 123 | Get the information for the last commit in the repository 124 | :return: dictionary with the sha1, date, and author 125 | of the last commit. 126 | """ 127 | last = self.repo.revparse_single('HEAD') 128 | sha1 = last.hex[0:7] 129 | author = last.author.name 130 | # the very first commit has no parent 131 | parent_commit_id = last.parents[0].id if last.parents else None 132 | 133 | date = self.convert_time(last.author.time) 134 | commit_info = {'sha1': sha1, 135 | 'date': date, 136 | 'author': author, 137 | 'parent': parent_commit_id} 138 | 139 | return commit_info 140 | -------------------------------------------------------------------------------- /nbjekyll/jekyllconvert/jekyll_export.py: -------------------------------------------------------------------------------- 1 | """ 2 | This uses the Jekyll markdown exporter to convert .ipynb files 3 | to .md files. 
4 | It ensures that the images are not saved as base64 but as separate 5 | .png, .jpg, or .svg files in an images directory and that the 6 | path to this is accurately updated using a custom jekyllpath filter 7 | 8 | It also uses Beautiful Soup to do some basic HTML parsing 9 | """ 10 | import os 11 | import io 12 | 13 | from traitlets.config import Config 14 | from nbconvert import MarkdownExporter 15 | from ipython_genutils.path import ensure_dir_exists 16 | 17 | from bs4 import BeautifulSoup 18 | 19 | def init_nb_resources(notebook_filename, img_path): 20 | """Step 1: Initialize resources 21 | This initializes the resources dictionary for a single notebook. 22 | Returns 23 | ------- 24 | resources dictionary for a single notebook that MUST include: 25 | - unique_key: notebook name 26 | """ 27 | resources = {} 28 | basename = os.path.basename(notebook_filename) 29 | notebook_name = basename[:basename.rfind('.')] 30 | resources['unique_key'] = notebook_name 31 | #resources['output_files_dir'] = './images/notebook_images/{}'.format(notebook_name) 32 | resources['output_files_dir'] = img_path + '/' + notebook_name 33 | return resources 34 | 35 | def export_notebook(notebook_filename, resources): 36 | """Step 2: Export the notebook 37 | Exports the notebook to a particular format according to the specified 38 | exporter. This function returns the output and (possibly modified) 39 | resources from the exporter. 40 | Parameters 41 | ---------- 42 | notebook_filename : str 43 | name of notebook file. 
44 | resources : dict 45 | Returns 46 | ------- 47 | output 48 | dict 49 | resources (possibly modified) 50 | """ 51 | config = Config() 52 | basePath = os.path.dirname(__file__) 53 | exporter = MarkdownExporter(config = config, 54 | template_path = [os.path.join(basePath,'templates/')], 55 | template_file = 'Jekyll_template.tpl', 56 | filters = {'jekyllpath': jekyllpath}) 57 | content, resources = exporter.from_filename(notebook_filename, resources = resources) 58 | content = parse_html(content) 59 | return content, resources 60 | 61 | def parse_html(content): 62 | """ This step is included in Step 2: this will use beautiful soup to 63 | modify certain tags of the returned content 64 | Parameters 65 | ---------- 66 | content : returned from the notebook export 67 | Returns 68 | ------ 69 | soup (parsed html content) 70 | """ 71 | soup = BeautifulSoup(content, 'html.parser') 72 | if soup.table: 73 | for tag in soup.find_all('table'): 74 | tag['class'] = 'table-responsive table-striped' 75 | tag['border'] = '0' 76 | return soup 77 | 78 | 79 | def jekyllpath(path): 80 | """ Take the filepath of an image output by the ExportOutputProcessor 81 | and convert it into a URL we can use with Jekyll. This is passed to the exporter 82 | as a filter to the exporter. 83 | Note that this will be directly taken from the Jekyll _config.yml file 84 | """ 85 | return path.replace("./", "{{site.url}}{{site.baseurl}}/") 86 | 87 | def write_outputs(content, resources): 88 | """Step 3: Write the notebook to file 89 | This writes output from the exporter to file using the specified writer. 90 | It returns the results from the writer. 
91 | Parameters 92 | ---------- 93 | output : 94 | resources : dict 95 | resources for a single notebook including name, config directory 96 | and directory to save output 97 | Returns 98 | ------- 99 | file 100 | results from the specified writer output of exporter 101 | """ 102 | 103 | # various paths and variables needed for the module 104 | notebook_namefull = resources['metadata']['name'] + resources.get('output_extension') 105 | outdir_nb = resources['metadata']['path'] 106 | outfile = os.path.join(outdir_nb, notebook_namefull) 107 | imgs_outdir = resources.get('output_files_dir') 108 | ensure_dir_exists(imgs_outdir) 109 | 110 | # write file in the appropriate format 111 | with io.open(outfile, 'w', encoding = "utf-8") as fout: 112 | body = content.prettify(formatter='html') 113 | fout.write(body) 114 | 115 | # if the content has images then they are returned and saved 116 | if resources['outputs']: 117 | save_imgs(resources, imgs_outdir) 118 | 119 | def save_imgs(resources, imgs_outdir): 120 | """ If the notebook had plots or figures, then they are saved in the appropriate 121 | directory""" 122 | items = resources.get('outputs', {}).items() 123 | if not os.path.exists(imgs_outdir): 124 | os.mkdir(imgs_outdir) 125 | for filename, data in items: 126 | dest = filename 127 | with io.open(dest, 'wb+') as f: 128 | f.write(data) 129 | 130 | def convert_single_nb(notebook_filename, img_path): 131 | """Convert a single notebook. 132 | Performs the following steps: 133 | 1. Initialize notebook resources 134 | 2. Export the notebook to a particular format 135 | 3. 
Write the exported notebook to file as well as complementary images 136 | Parameters 137 | ---------- 138 | notebook_filename : str 139 | img_path : str 140 | """ 141 | resources = init_nb_resources(notebook_filename, img_path) 142 | content, resources = export_notebook(notebook_filename, resources) 143 | write_outputs(content, resources) 144 | -------------------------------------------------------------------------------- /nbjekyll/nb_git/tests/files/Tutorial.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: notebook 3 | title: "Tutorial" 4 | tags: 5 | update_date: 12-01-2018 6 | code_version: d2de4b0 7 | author: Tania Allard 8 | --- 9 |
10 | 11 |
12 | # BAD Day 1: Tutorial 13 | 14 | # 0. Source/install the needed packages 15 | 16 | In [1]: 17 | 18 | ```R 19 | # In case you need to install the packages 20 | install.packages("xlsx") 21 | install.packages("gdata") 22 | install.packages("ape") 23 | ``` 24 | 25 | In [2]: 26 | 27 | ```R 28 | source("http://bioconductor.org/biocLite.R"); 29 | biocLite("multtest"); 30 | ``` 31 | 32 | 33 | # 1. Exploratory data analysis 34 | 35 | We will be using the gene expression dataset from **Golub et al. (1999)**. The 36 | gene expression data collected by Golub et al. (1999) are among the most 37 | classical in bioinformatics. A selection of the set is called `golub`, which is 38 | contained in the `multtest` package installed above. 39 | 40 | 41 | The data consist of gene expression values of 3051 genes (rows) from 38 leukemia 42 | patients. Pre-processing was done as described in Dudoit et al. (2002). The R 43 | code for pre-processing is available in the file ../doc/golub.R. 44 | 45 | **Source**: 46 | Golub et al. (1999). Molecular classification of cancer: class discovery and 47 | class prediction by gene expression monitoring, Science, Vol. 286:531-537. 48 | (http://www-genome.wi.mit.edu/MPR/). 49 | 50 | In [3]: 51 | 52 | ```R 53 | require(multtest); 54 | 55 | # Usage 56 | data(golub); 57 | 58 | # If you need more information on the data set just 59 | # uncomment the line below 60 | # ?golub 61 | ``` 62 | 63 | Data set values: 64 | - `golub`: matrix of gene expression levels for the 38 tumor mRNA samples; rows 65 | correspond to genes (3051 genes) and columns to mRNA samples. 66 | - `golub.cl`: numeric vector indicating the tumor class, 27 acute lymphoblastic 67 | leukemia (ALL) cases (code 0) and 11 acute myeloid leukemia (AML) cases (code 68 | 1). 69 | - `golub.gnames`: a matrix containing the names of the 3051 genes for the 70 | expression matrix `golub`. The three columns correspond to the gene index, ID, and 71 | Name, respectively.
72 | 73 | In [4]: 74 | 75 | ```R 76 | # Checking the dimension of the data 77 | dim(golub) 78 | ``` 79 |
    80 |
  1. 81 | 3051 82 | 83 |
  2. 84 | 38 85 | 86 |
87 | 88 | In [5]: 89 | 90 | ```R 91 | # we will have a look at the first rows contained in the data set 92 | head(golub) 93 | ``` 94 | 95 | 96 | 97 | 100 | 103 | 106 | 109 | 112 | 115 | 118 | 121 | 124 | 127 | 130 | 133 | 136 | 139 | 142 | 145 | 148 | 151 | 154 | 157 | 160 | 161 | 162 | 165 | 168 | 171 | 174 | 177 | 180 | 183 | 186 | 189 | 192 | 195 | 198 | 201 | 204 | 207 | 210 | 213 | 216 | 219 | 222 | 225 | 226 | 227 | 230 | 233 | 236 | 239 | 242 | 245 | 248 | 251 | 254 | 257 | 260 | 263 | 266 | 269 | 272 | 275 | 278 | 281 | 284 | 287 | 290 | 291 | 292 | 295 | 298 | 301 | 304 | 307 | 310 | 313 | 316 | 319 | 322 | 325 | 328 | 331 | 334 | 337 | 340 | 343 | 346 | 349 | 352 | 355 | 356 | 357 | 360 | 363 | 366 | 369 | 372 | 375 | 378 | 381 | 384 | 387 | 390 | 393 | 396 | 399 | 402 | 405 | 408 | 411 | 414 | 417 | 420 | 421 | 422 | 425 | 428 | 431 | 434 | 437 | 440 | 443 | 446 | 449 | 452 | 455 | 458 | 461 | 464 | 467 | 470 | 473 | 476 | 479 | 482 | 485 | 486 | 487 |
98 | -1.45769 99 | 101 | -1.39420 102 | 104 | -1.42779 105 | 107 | -1.40715 108 | 110 | -1.42668 111 | 113 | -1.21719 114 | 116 | -1.37386 117 | 119 | -1.36832 120 | 122 | -1.47649 123 | 125 | -1.21583 126 | 128 | ⋯ 129 | 131 | -1.08902 132 | 134 | -1.29865 135 | 137 | -1.26183 138 | 140 | -1.44434 141 | 143 | 1.10147 144 | 146 | -1.34158 147 | 149 | -1.22961 150 | 152 | -0.75919 153 | 155 | 0.84905 156 | 158 | -0.66465 159 |
163 | -0.75161 164 | 166 | -1.26278 167 | 169 | -0.09052 170 | 172 | -0.99596 173 | 175 | -1.24245 176 | 178 | -0.69242 179 | 181 | -1.37386 182 | 184 | -0.50803 185 | 187 | -1.04533 188 | 190 | -0.81257 191 | 193 | ⋯ 194 | 196 | -1.08902 197 | 199 | -1.05094 200 | 202 | -1.26183 203 | 205 | -1.25918 206 | 208 | 0.97813 209 | 211 | -0.79357 212 | 214 | -1.22961 215 | 217 | -0.71792 218 | 220 | 0.45127 221 | 223 | -0.45804 224 |
228 | 0.45695 229 | 231 | -0.09654 232 | 234 | 0.90325 235 | 237 | -0.07194 238 | 240 | 0.03232 241 | 243 | 0.09713 244 | 246 | -0.11978 247 | 249 | 0.23381 250 | 252 | 0.23987 253 | 255 | 0.44201 256 | 258 | ⋯ 259 | 261 | -0.43377 262 | 264 | -0.10823 265 | 267 | -0.29385 268 | 270 | 0.05067 271 | 273 | 1.69430 274 | 276 | -0.12472 277 | 279 | 0.04609 280 | 282 | 0.24347 283 | 285 | 0.90774 286 | 288 | 0.46509 289 |
293 | 3.13533 294 | 296 | 0.21415 297 | 299 | 2.08754 300 | 302 | 2.23467 303 | 305 | 0.93811 306 | 308 | 2.24089 309 | 311 | 3.36576 312 | 314 | 1.97859 315 | 317 | 2.66468 318 | 320 | -1.21583 321 | 323 | ⋯ 324 | 326 | 0.29598 327 | 329 | -1.29865 330 | 332 | 2.76869 333 | 335 | 2.08960 336 | 338 | 0.70003 339 | 341 | 0.13854 342 | 344 | 1.75908 345 | 347 | 0.06151 348 | 350 | 1.30297 351 | 353 | 0.58186 354 |
358 | 2.76569 359 | 361 | -1.27045 362 | 364 | 1.60433 365 | 367 | 1.53182 368 | 370 | 1.63728 371 | 373 | 1.85697 374 | 376 | 3.01847 377 | 379 | 1.12853 380 | 382 | 2.17016 383 | 385 | -1.21583 386 | 388 | ⋯ 389 | 391 | -1.08902 392 | 394 | -1.29865 395 | 397 | 2.00518 398 | 400 | 1.17454 401 | 403 | -1.47218 404 | 406 | -1.34158 407 | 409 | 1.55086 410 | 412 | -1.18107 413 | 415 | 1.01596 416 | 418 | 0.15788 419 |
423 | 2.64342 424 | 426 | 1.01416 427 | 429 | 1.70477 430 | 432 | 1.63845 433 | 435 | -0.36075 436 | 438 | 1.73451 439 | 441 | 3.36576 442 | 444 | 0.96870 445 | 447 | 2.72368 448 | 450 | -1.21583 451 | 453 | ⋯ 454 | 456 | -1.08902 457 | 459 | -1.29865 460 | 462 | 1.73780 463 | 465 | 0.89347 466 | 468 | -0.52883 469 | 471 | -1.22168 472 | 474 | 0.90832 475 | 477 | -1.39906 478 | 480 | 0.51266 481 | 483 | 1.36249 484 |
488 | The gene names are collected in the matrix `golub.gnames` of which the columns 489 | correspond to the gene index, ID, and Name, respectively. 490 | 491 | In [6]: 492 | 493 | ```R 494 | # Adding 3051 gene names 495 | row.names(golub) = golub.gnames[,3] 496 | 497 | head(golub) 498 | ``` 499 | 500 | 501 | 502 | 505 | 508 | 511 | 514 | 517 | 520 | 523 | 526 | 529 | 532 | 535 | 538 | 541 | 544 | 547 | 550 | 553 | 556 | 559 | 562 | 565 | 568 | 569 | 570 | 573 | 576 | 579 | 582 | 585 | 588 | 591 | 594 | 597 | 600 | 603 | 606 | 609 | 612 | 615 | 618 | 621 | 624 | 627 | 630 | 633 | 636 | 637 | 638 | 641 | 644 | 647 | 650 | 653 | 656 | 659 | 662 | 665 | 668 | 671 | 674 | 677 | 680 | 683 | 686 | 689 | 692 | 695 | 698 | 701 | 704 | 705 | 706 | 709 | 712 | 715 | 718 | 721 | 724 | 727 | 730 | 733 | 736 | 739 | 742 | 745 | 748 | 751 | 754 | 757 | 760 | 763 | 766 | 769 | 772 | 773 | 774 | 777 | 780 | 783 | 786 | 789 | 792 | 795 | 798 | 801 | 804 | 807 | 810 | 813 | 816 | 819 | 822 | 825 | 828 | 831 | 834 | 837 | 840 | 841 | 842 | 845 | 848 | 851 | 854 | 857 | 860 | 863 | 866 | 869 | 872 | 875 | 878 | 881 | 884 | 887 | 890 | 893 | 896 | 899 | 902 | 905 | 908 | 909 | 910 |
503 | AFFX-HUMISGF3A/M97935_MA_at 504 | 506 | -1.45769 507 | 509 | -1.39420 510 | 512 | -1.42779 513 | 515 | -1.40715 516 | 518 | -1.42668 519 | 521 | -1.21719 522 | 524 | -1.37386 525 | 527 | -1.36832 528 | 530 | -1.47649 531 | 533 | -1.21583 534 | 536 | ⋯ 537 | 539 | -1.08902 540 | 542 | -1.29865 543 | 545 | -1.26183 546 | 548 | -1.44434 549 | 551 | 1.10147 552 | 554 | -1.34158 555 | 557 | -1.22961 558 | 560 | -0.75919 561 | 563 | 0.84905 564 | 566 | -0.66465 567 |
571 | AFFX-HUMISGF3A/M97935_MB_at 572 | 574 | -0.75161 575 | 577 | -1.26278 578 | 580 | -0.09052 581 | 583 | -0.99596 584 | 586 | -1.24245 587 | 589 | -0.69242 590 | 592 | -1.37386 593 | 595 | -0.50803 596 | 598 | -1.04533 599 | 601 | -0.81257 602 | 604 | ⋯ 605 | 607 | -1.08902 608 | 610 | -1.05094 611 | 613 | -1.26183 614 | 616 | -1.25918 617 | 619 | 0.97813 620 | 622 | -0.79357 623 | 625 | -1.22961 626 | 628 | -0.71792 629 | 631 | 0.45127 632 | 634 | -0.45804 635 |
639 | AFFX-HUMISGF3A/M97935_3_at 640 | 642 | 0.45695 643 | 645 | -0.09654 646 | 648 | 0.90325 649 | 651 | -0.07194 652 | 654 | 0.03232 655 | 657 | 0.09713 658 | 660 | -0.11978 661 | 663 | 0.23381 664 | 666 | 0.23987 667 | 669 | 0.44201 670 | 672 | ⋯ 673 | 675 | -0.43377 676 | 678 | -0.10823 679 | 681 | -0.29385 682 | 684 | 0.05067 685 | 687 | 1.69430 688 | 690 | -0.12472 691 | 693 | 0.04609 694 | 696 | 0.24347 697 | 699 | 0.90774 700 | 702 | 0.46509 703 |
707 | AFFX-HUMRGE/M10098_5_at 708 | 710 | 3.13533 711 | 713 | 0.21415 714 | 716 | 2.08754 717 | 719 | 2.23467 720 | 722 | 0.93811 723 | 725 | 2.24089 726 | 728 | 3.36576 729 | 731 | 1.97859 732 | 734 | 2.66468 735 | 737 | -1.21583 738 | 740 | ⋯ 741 | 743 | 0.29598 744 | 746 | -1.29865 747 | 749 | 2.76869 750 | 752 | 2.08960 753 | 755 | 0.70003 756 | 758 | 0.13854 759 | 761 | 1.75908 762 | 764 | 0.06151 765 | 767 | 1.30297 768 | 770 | 0.58186 771 |
775 | AFFX-HUMRGE/M10098_M_at 776 | 778 | 2.76569 779 | 781 | -1.27045 782 | 784 | 1.60433 785 | 787 | 1.53182 788 | 790 | 1.63728 791 | 793 | 1.85697 794 | 796 | 3.01847 797 | 799 | 1.12853 800 | 802 | 2.17016 803 | 805 | -1.21583 806 | 808 | ⋯ 809 | 811 | -1.08902 812 | 814 | -1.29865 815 | 817 | 2.00518 818 | 820 | 1.17454 821 | 823 | -1.47218 824 | 826 | -1.34158 827 | 829 | 1.55086 830 | 832 | -1.18107 833 | 835 | 1.01596 836 | 838 | 0.15788 839 |
843 | AFFX-HUMRGE/M10098_3_at 844 | 846 | 2.64342 847 | 849 | 1.01416 850 | 852 | 1.70477 853 | 855 | 1.63845 856 | 858 | -0.36075 859 | 861 | 1.73451 862 | 864 | 3.36576 865 | 867 | 0.96870 868 | 870 | 2.72368 871 | 873 | -1.21583 874 | 876 | ⋯ 877 | 879 | -1.08902 880 | 882 | -1.29865 883 | 885 | 1.73780 886 | 888 | 0.89347 889 | 891 | -0.52883 892 | 894 | -1.22168 895 | 897 | 0.90832 898 | 900 | -1.39906 901 | 903 | 0.51266 904 | 906 | 1.36249 907 |
911 | 912 | In [7]: 913 | 914 | ```R 915 | # Let's have a look at the first 20 gene IDs contained in golub.gnames 916 | head(golub.gnames[,2], n = 20) 917 | ``` 918 |
    919 |
  1. 920 | 'AFFX-HUMISGF3A/M97935_MA_at (endogenous control)' 921 | 922 |
  2. 923 | 'AFFX-HUMISGF3A/M97935_MB_at (endogenous control)' 924 | 925 |
  3. 926 | 'AFFX-HUMISGF3A/M97935_3_at (endogenous control)' 927 | 928 |
  4. 929 | 'AFFX-HUMRGE/M10098_5_at (endogenous control)' 930 | 931 |
  5. 932 | 'AFFX-HUMRGE/M10098_M_at (endogenous control)' 933 | 934 |
  6. 935 | 'AFFX-HUMRGE/M10098_3_at (endogenous control)' 936 | 937 |
  7. 938 | 'AFFX-HUMGAPDH/M33197_5_at (endogenous control)' 939 | 940 |
  8. 941 | 'AFFX-HUMGAPDH/M33197_M_at (endogenous control)' 942 | 943 |
  9. 944 | 'AFFX-HSAC07/X00351_5_at (endogenous control)' 945 | 946 |
  10. 947 | 'AFFX-HSAC07/X00351_M_at (endogenous control)' 948 | 949 |
  11. 950 | 'AFFX-HUMTFRR/M11507_5_at (endogenous control)' 951 | 952 |
  12. 953 | 'AFFX-HUMTFRR/M11507_M_at (endogenous control)' 954 | 955 |
  13. 956 | 'AFFX-HUMTFRR/M11507_3_at (endogenous control)' 957 | 958 |
  14. 959 | 'AFFX-M27830_5_at (endogenous control)' 960 | 961 |
  15. 962 | 'AFFX-M27830_M_at (endogenous control)' 963 | 964 |
  16. 965 | 'AFFX-M27830_3_at (endogenous control)' 966 | 967 |
  17. 968 | 'AFFX-HSAC07/X00351_3_st (endogenous control)' 969 | 970 |
  18. 971 | 'AFFX-HUMGAPDH/M33197_M_st (endogenous control)' 972 | 973 |
  19. 974 | 'AFFX-HUMGAPDH/M33197_3_st (endogenous control)' 975 | 976 |
  20. 977 | 'AFFX-HSAC07/X00351_M_st (endogenous control)' 978 | 979 |
980 | Twenty-seven patients are diagnosed as acute lymphoblastic leukemia (ALL) and 981 | eleven as acute myeloid leukemia (AML). The tumor class is given by the numeric 982 | vector golub.cl, where ALL is indicated by 0 and AML by 1. 983 | 984 | In [8]: 985 | 986 | ```R 987 | colnames(golub) = golub.cl 988 | 989 | head(golub) 990 | ``` 991 | 992 | 993 | 994 | 996 | 999 | 1002 | 1005 | 1008 | 1011 | 1014 | 1017 | 1020 | 1023 | 1026 | 1029 | 1032 | 1035 | 1038 | 1041 | 1044 | 1047 | 1050 | 1053 | 1056 | 1059 | 1060 | 1061 | 1062 | 1063 | 1066 | 1069 | 1072 | 1075 | 1078 | 1081 | 1084 | 1087 | 1090 | 1093 | 1096 | 1099 | 1102 | 1105 | 1108 | 1111 | 1114 | 1117 | 1120 | 1123 | 1126 | 1129 | 1130 | 1131 | 1134 | 1137 | 1140 | 1143 | 1146 | 1149 | 1152 | 1155 | 1158 | 1161 | 1164 | 1167 | 1170 | 1173 | 1176 | 1179 | 1182 | 1185 | 1188 | 1191 | 1194 | 1197 | 1198 | 1199 | 1202 | 1205 | 1208 | 1211 | 1214 | 1217 | 1220 | 1223 | 1226 | 1229 | 1232 | 1235 | 1238 | 1241 | 1244 | 1247 | 1250 | 1253 | 1256 | 1259 | 1262 | 1265 | 1266 | 1267 | 1270 | 1273 | 1276 | 1279 | 1282 | 1285 | 1288 | 1291 | 1294 | 1297 | 1300 | 1303 | 1306 | 1309 | 1312 | 1315 | 1318 | 1321 | 1324 | 1327 | 1330 | 1333 | 1334 | 1335 | 1338 | 1341 | 1344 | 1347 | 1350 | 1353 | 1356 | 1359 | 1362 | 1365 | 1368 | 1371 | 1374 | 1377 | 1380 | 1383 | 1386 | 1389 | 1392 | 1395 | 1398 | 1401 | 1402 | 1403 | 1406 | 1409 | 1412 | 1415 | 1418 | 1421 | 1424 | 1427 | 1430 | 1433 | 1436 | 1439 | 1442 | 1445 | 1448 | 1451 | 1454 | 1457 | 1460 | 1463 | 1466 | 1469 | 1470 | 1471 |
995 | 997 | 0 998 | 1000 | 0 1001 | 1003 | 0 1004 | 1006 | 0 1007 | 1009 | 0 1010 | 1012 | 0 1013 | 1015 | 0 1016 | 1018 | 0 1019 | 1021 | 0 1022 | 1024 | 0 1025 | 1027 | ⋯ 1028 | 1030 | 1 1031 | 1033 | 1 1034 | 1036 | 1 1037 | 1039 | 1 1040 | 1042 | 1 1043 | 1045 | 1 1046 | 1048 | 1 1049 | 1051 | 1 1052 | 1054 | 1 1055 | 1057 | 1 1058 |
1064 | AFFX-HUMISGF3A/M97935_MA_at 1065 | 1067 | -1.45769 1068 | 1070 | -1.39420 1071 | 1073 | -1.42779 1074 | 1076 | -1.40715 1077 | 1079 | -1.42668 1080 | 1082 | -1.21719 1083 | 1085 | -1.37386 1086 | 1088 | -1.36832 1089 | 1091 | -1.47649 1092 | 1094 | -1.21583 1095 | 1097 | ⋯ 1098 | 1100 | -1.08902 1101 | 1103 | -1.29865 1104 | 1106 | -1.26183 1107 | 1109 | -1.44434 1110 | 1112 | 1.10147 1113 | 1115 | -1.34158 1116 | 1118 | -1.22961 1119 | 1121 | -0.75919 1122 | 1124 | 0.84905 1125 | 1127 | -0.66465 1128 |
1132 | AFFX-HUMISGF3A/M97935_MB_at 1133 | 1135 | -0.75161 1136 | 1138 | -1.26278 1139 | 1141 | -0.09052 1142 | 1144 | -0.99596 1145 | 1147 | -1.24245 1148 | 1150 | -0.69242 1151 | 1153 | -1.37386 1154 | 1156 | -0.50803 1157 | 1159 | -1.04533 1160 | 1162 | -0.81257 1163 | 1165 | ⋯ 1166 | 1168 | -1.08902 1169 | 1171 | -1.05094 1172 | 1174 | -1.26183 1175 | 1177 | -1.25918 1178 | 1180 | 0.97813 1181 | 1183 | -0.79357 1184 | 1186 | -1.22961 1187 | 1189 | -0.71792 1190 | 1192 | 0.45127 1193 | 1195 | -0.45804 1196 |
1200 | AFFX-HUMISGF3A/M97935_3_at 1201 | 1203 | 0.45695 1204 | 1206 | -0.09654 1207 | 1209 | 0.90325 1210 | 1212 | -0.07194 1213 | 1215 | 0.03232 1216 | 1218 | 0.09713 1219 | 1221 | -0.11978 1222 | 1224 | 0.23381 1225 | 1227 | 0.23987 1228 | 1230 | 0.44201 1231 | 1233 | ⋯ 1234 | 1236 | -0.43377 1237 | 1239 | -0.10823 1240 | 1242 | -0.29385 1243 | 1245 | 0.05067 1246 | 1248 | 1.69430 1249 | 1251 | -0.12472 1252 | 1254 | 0.04609 1255 | 1257 | 0.24347 1258 | 1260 | 0.90774 1261 | 1263 | 0.46509 1264 |
1268 | AFFX-HUMRGE/M10098_5_at 1269 | 1271 | 3.13533 1272 | 1274 | 0.21415 1275 | 1277 | 2.08754 1278 | 1280 | 2.23467 1281 | 1283 | 0.93811 1284 | 1286 | 2.24089 1287 | 1289 | 3.36576 1290 | 1292 | 1.97859 1293 | 1295 | 2.66468 1296 | 1298 | -1.21583 1299 | 1301 | ⋯ 1302 | 1304 | 0.29598 1305 | 1307 | -1.29865 1308 | 1310 | 2.76869 1311 | 1313 | 2.08960 1314 | 1316 | 0.70003 1317 | 1319 | 0.13854 1320 | 1322 | 1.75908 1323 | 1325 | 0.06151 1326 | 1328 | 1.30297 1329 | 1331 | 0.58186 1332 |
1336 | AFFX-HUMRGE/M10098_M_at 1337 | 1339 | 2.76569 1340 | 1342 | -1.27045 1343 | 1345 | 1.60433 1346 | 1348 | 1.53182 1349 | 1351 | 1.63728 1352 | 1354 | 1.85697 1355 | 1357 | 3.01847 1358 | 1360 | 1.12853 1361 | 1363 | 2.17016 1364 | 1366 | -1.21583 1367 | 1369 | ⋯ 1370 | 1372 | -1.08902 1373 | 1375 | -1.29865 1376 | 1378 | 2.00518 1379 | 1381 | 1.17454 1382 | 1384 | -1.47218 1385 | 1387 | -1.34158 1388 | 1390 | 1.55086 1391 | 1393 | -1.18107 1394 | 1396 | 1.01596 1397 | 1399 | 0.15788 1400 |
1404 | AFFX-HUMRGE/M10098_3_at 1405 | 1407 | 2.64342 1408 | 1410 | 1.01416 1411 | 1413 | 1.70477 1414 | 1416 | 1.63845 1417 | 1419 | -0.36075 1420 | 1422 | 1.73451 1423 | 1425 | 3.36576 1426 | 1428 | 0.96870 1429 | 1431 | 2.72368 1432 | 1434 | -1.21583 1435 | 1437 | ⋯ 1438 | 1440 | -1.08902 1441 | 1443 | -1.29865 1444 | 1446 | 1.73780 1447 | 1449 | 0.89347 1450 | 1452 | -0.52883 1453 | 1455 | -1.22168 1456 | 1458 | 0.90832 1459 | 1461 | -1.39906 1462 | 1464 | 0.51266 1465 | 1467 | 1.36249 1468 |
1472 | Note that sometimes it is better to construct a factor which indicates the tumor 1473 | class of the patients. Such a factor could be used for instance to separate the 1474 | tumor groups for plotting purposes. The factor (`gol.fac`) can be constructed as 1475 | follows. 1476 | 1477 | In [9]: 1478 | 1479 | ```R 1480 | gol.fac <- factor(golub.cl, levels = 0:1, labels = c("ALL", "AML")) 1481 | ``` 1482 | 1483 | The labels correspond to the two tumor classes. The evaluation of `gol.fac == "ALL"` 1484 | returns 1485 | TRUE for the first twenty-seven values and FALSE for the remaining eleven, 1486 | which is useful as a column index for selecting the expression values of the 1487 | ALL patients. The expression values of gene CCND3 Cyclin D3 from the 1488 | ALL patients can now be printed to the screen, as follows. 1489 | 1490 | In [10]: 1491 | 1492 | ```R 1493 | golub[1042, gol.fac == "ALL"] 1494 | ``` 1495 |
1496 |
1497 | 1 1498 |
1499 |
1500 | 0.88941 1501 |
1502 |
1503 | 1 1504 |
1505 |
1506 | 1.45014 1507 |
1508 |
1509 | 1 1510 |
1511 |
1512 | 0.42904 1513 |
1514 |
1515 | 1 1516 |
1517 |
1518 | 0.82667 1519 |
1520 |
1521 | 1 1522 |
1523 |
1524 | 0.63637 1525 |
1526 |
1527 | 1 1528 |
1529 |
1530 | 1.0225 1531 |
1532 |
1533 | 1 1534 |
1535 |
1536 | 0.12758 1537 |
1538 |
1539 | 1 1540 |
1541 |
1542 | -0.74333 1543 |
1544 |
1545 | 1 1546 |
1547 |
1548 | 0.73784 1549 |
1550 |
1551 | 1 1552 |
1553 |
1554 | 0.4947 1555 |
1556 |
1557 | 1 1558 |
1559 |
1560 | 1.12058 1561 |
1562 |
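The factor-based column selection described above can be tried out on a small example. The following is only a sketch on made-up data: `toy.cl`, `toy.fac` and `toy.expr` are hypothetical stand-ins for `golub.cl`, `gol.fac` and `golub`, so it runs without the `multtest` package.

```R
# Toy stand-in for golub.cl: 3 ALL cases (code 0) followed by 2 AML cases (code 1).
# These values are made up for illustration only.
toy.cl <- c(0, 0, 0, 1, 1)

# Labels are matched to the levels in order: 0 -> "ALL", 1 -> "AML"
toy.fac <- factor(toy.cl, levels = 0:1, labels = c("ALL", "AML"))

# Logical vector that can be used as a column index
toy.fac == "ALL"

# Toy expression matrix: 2 genes (rows) by 5 patients (columns), filled column-wise
toy.expr <- matrix(c(1.2, 0.8, 1.5, 0.9, 1.1, 1.0, -0.3, 0.2, -0.5, 0.1), nrow = 2)

# Expression values of the first gene for the ALL patients only
toy.expr[1, toy.fac == "ALL"]

# Number of patients per class
table(toy.fac)
```

Because the labels are assigned to the levels in order, swapping them in the call to `factor` would silently select the other patient group, so it is worth checking `table(toy.fac)` against the known class sizes.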
1563 | ## Creating the exploratory plots 1564 | 1565 | ### 1.1\. Plotting the value of gene (CCND3) in all mRNA samples (M92287_at) 1566 | 1567 | We shall first have a look at the expression values of a gene with manufacturer 1568 | name `M92287_at`, which is known in biology as "CCND3 Cyclin D3". 1569 | 1570 | The expression values of this gene are collected in row 1042 of golub. To 1571 | extract the expression values of this gene for all 38 patients, 1572 | use the following: 1573 | 1574 | In [11]: 1575 | 1576 | ```R 1577 | mygene <- golub[1042, ] 1578 | ``` 1579 | 1580 | The expression values are now stored in the vector `mygene`. We will now plot the 1581 | expression values of the gene CCND3 Cyclin D3. 1582 | 1583 | In [12]: 1584 | 1585 | ```R 1586 | plot(mygene) 1587 | ``` 1588 | png 1589 | In the previous plot we just used the default plotting preferences within R base 1590 | plotting. We can make some improvements so that the plot is more easily understood. 1591 | 1592 | In [13]: 1593 | 1594 | ```R 1595 | plot(mygene, pch = 15, col = 'slateblue', ylab = 'Expression value of gene: CCND3', 1596 | main = ' Gene expression values of CCND3 Cyclin D3') 1597 | ``` 1598 | png 1599 | In this plot the vertical axis corresponds to the size of the expression values 1600 | and the horizontal axis to the index of the patients. 1601 | 1602 | ### 1.2\. Gene expression between patient 1 (ALL) and patient 38 (AML) 1603 | 1604 | In [14]: 1605 | 1606 | ```R 1607 | plot(golub[,1], golub[,38]) 1608 | ``` 1609 | png 1610 | Adding a diagonal line to the plot and changing the axis labels: 1611 | 1612 | In [15]: 1613 | 1614 | ```R 1615 | plot(golub[,1], golub[,38], xlab = 'Patient 1 (ALL)', ylab = 'Patient 38 (AML)') 1616 | abline(a = 0, b = 1, col = 'mediumpurple', lwd =3) 1617 | ``` 1618 | png 1619 | ### 1.3\.
Scatter plots to detect independence 1620 | 1621 | In [16]: 1622 | 1623 | ```R 1624 | mysamplist <- golub[, c(1:15)] 1625 | colnames(mysamplist) = c(1:15) 1626 | ``` 1627 | 1628 | In [17]: 1629 | 1630 | ```R 1631 | plot(as.data.frame(mysamplist), pch='.') 1632 | ``` 1633 | png 1634 | ### 1.4\. Bar plot of the expression values of 4 cyclin genes in 3 ALL and 3 AML patients 1635 | 1636 | We will analyse the expression values of the `D13639_at, M92287_at, U11791_at, 1637 | Z36714_at` genes in three chosen ALL and three chosen AML patients. 1638 | 1639 | In [18]: 1640 | 1641 | ```R 1642 | mygenelist <- golub[c(85, 1042, 1212, 2240), c(1:3, 36:38)] 1643 | 1644 | # having a look at the data set chosen 1645 | mygenelist 1646 | ``` 1647 | 1648 | 1649 | 1650 | 1652 | 1655 | 1658 | 1661 | 1664 | 1667 | 1670 | 1671 | 1672 | 1673 | 1674 | 1677 | 1680 | 1683 | 1686 | 1689 | 1692 | 1695 | 1696 | 1697 | 1700 | 1703 | 1706 | 1709 | 1712 | 1715 | 1718 | 1719 | 1720 | 1723 | 1726 | 1729 | 1732 | 1735 | 1738 | 1741 | 1742 | 1743 | 1746 | 1749 | 1752 | 1755 | 1758 | 1761 | 1764 | 1765 | 1766 |
1651 | 1653 | 0 1654 | 1656 | 0 1657 | 1659 | 0 1660 | 1662 | 1 1663 | 1665 | 1 1666 | 1668 | 1 1669 |
1675 | D13639_at 1676 | 1678 | 2.09511 1679 | 1681 | 1.71953 1682 | 1684 | -1.46227 1685 | 1687 | -0.92935 1688 | 1690 | -0.11091 1691 | 1693 | 1.15591 1694 |
1698 | M92287_at 1699 | 1701 | 2.10892 1702 | 1704 | 1.52405 1705 | 1707 | 1.96403 1708 | 1710 | 0.73784 1711 | 1713 | 0.49470 1714 | 1716 | 1.12058 1717 |
1721 | U11791_at 1722 | 1724 | -0.11439 1725 | 1727 | -0.72887 1728 | 1730 | -0.39674 1731 | 1733 | -0.94364 1734 | 1736 | 0.05047 1737 | 1739 | 0.05905 1740 |
1744 | Z36714_at 1745 | 1747 | -1.45769 1748 | 1750 | -1.39420 1751 | 1753 | -1.46227 1754 | 1756 | -1.39906 1757 | 1759 | -1.34579 1760 | 1762 | -1.32403 1763 |
1767 | 1768 | In [19]: 1769 | 1770 | ```R 1771 | barplot(mygenelist) 1772 | box() 1773 | ``` 1774 | png 1775 | The plot is not very easy to read, so we will add some colours and a legend so 1776 | that we know which gene each bar segment corresponds to. 1777 | 1778 | In [20]: 1779 | 1780 | ```R 1781 | # custom colours 1782 | colours = c('lightblue2', 'slateblue', '#BD7BB8', '#2B377A') 1783 | 1784 | barplot(mygenelist, col = colours, legend = TRUE, border = 'white') 1785 | box() 1786 | ``` 1787 | png 1788 | In this case the patients are indicated on the `X` axis (labelled 0 for ALL and 1 for AML) 1789 | while the gene expression level is indicated on the `Y` axis. 1790 | 1791 | We can make some improvements to the plots. 1792 | Let's have a look at the `barplot` arguments: 1793 | 1794 | In [21]: 1795 | 1796 | ```R 1797 | ?barplot 1798 | ``` 1799 | 1800 | We are going to focus on only a few of the `barplot` arguments: 1801 | - `beside`: `TRUE` for the bars to be displayed as juxtaposed bars, `FALSE` for 1802 | stacked bars 1803 | - `horiz` : `FALSE` for bars displayed vertically with the first bar to the left, 1804 | `TRUE` for bars displayed horizontally with the first at the bottom. 1805 | - `ylim`, `xlim` : limits for the y and x axes 1806 | - `col`: colour choices 1807 | 1808 | In [22]: 1809 | 1810 | ```R 1811 | barplot(mygenelist, horiz = TRUE, col = colours, legend = TRUE, 1812 | ylab = 'Patient', border = 'white', 1813 | xlab = 'Gene expression level', main = 'Cyclin genes expression') 1814 | box() 1815 | ``` 1816 | png 1817 | In the plot above we presented the barplots horizontally and added some colours, 1818 | which makes it easier to understand the data presented. 1819 | You can also use barplots to represent the mean and standard error, which we 1820 | will do in the following sections. 1821 | 1822 | ### 1.5\. Plotting the mean 1823 | 1824 | In the following we will compute the mean for the expression values of both the 1825 | ALL and AML patients.
We will be using the same 4 cyclin genes used in the 1826 | example above. 1827 | 1828 | First we will compute the mean expression of each gene over all the ALL patients and over all the AML patients. Once the means are 1829 | computed they are combined into a single matrix. 1830 | 1831 | Finally, the means are plotted using the `barplot` function. 1832 | 1833 | In [23]: 1834 | 1835 | ```R 1836 | # Calculating the mean of the chosen genes from patient 1 to 27 and 28 to 38 1837 | ALLmean <- rowMeans(golub[c(85,1042,1212,2240),c(1:27)]) 1838 | AMLmean <- rowMeans(golub[c(85,1042,1212,2240),c(28:38)]) 1839 | 1840 | # Combining the mean vectors previously calculated 1841 | dataheight <- cbind(ALLmean, AMLmean) 1842 | 1843 | # Plotting 1844 | barx <- barplot(dataheight, beside=T, horiz=F, col= colours, ylim=c(-2,2.5), 1845 | legend = TRUE,border = 'white' , 1846 | ylab = 'Gene expression level', main = 'Cyclin genes mean expression 1847 | in AML and ALL patients') 1848 | box() 1849 | ``` 1850 | png 1851 | ### 1.6\. Adding error bars to the previous plot 1852 | 1853 | 1854 | In the previous section we computed the mean expression level of 4 cyclin 1855 | genes for the AML and ALL patients. Sometimes it is useful to add error bars 1856 | to plots (such as the one above) to convey the uncertainty in the data presented. 1857 | 1858 | For such a purpose we often use the **Standard Deviation**: 1859 | 1860 | 1861 | $$ \sigma = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i -\mu \right)^2}{n}}$$ 1862 | 1863 | 1864 | where $\mu$ is the mean of the $n$ values in a group. It tells us how much the values in that group tend to deviate 1865 | from their mean value. 1866 | 1867 | Let's start by calculating the Standard Deviation of the data.
1868 | 1869 | In [24]: 1870 | 1871 | ```R 1872 | # Calculating the SD 1873 | ALLsd <- apply(golub[c(85,1042,1212,2240),c(1:27)], 1, sd) 1874 | nALL=length(c(1:27)) 1875 | AMLsd <- apply(golub[c(85,1042,1212,2240),c(28:38)], 1, sd) 1876 | nAML=length(c(28:38)) 1877 | 1878 | # Combining the data 1879 | datasd <- cbind(ALLsd, AMLsd) 1880 | 1881 | 1882 | ``` 1883 | 1884 | Another measure used to quantify the deviation is the **standard error**, which 1885 | quantifies the variability in the **_means_** of our groups instead of reporting 1886 | the variability among the data points. 1887 | 1888 | A relatively straightforward way to compute this is to assume that, if we were to 1889 | repeat a given experiment many times, the resulting group means would roughly follow a 1890 | normal distribution. **Note: this is a big assumption**. Hence, if we assume 1891 | that the means follow a normal distribution, then the standard error (_a.k.a. 1892 | variability of group means_) can be defined as: 1893 | 1894 | $$ SE = \frac{SD}{\sqrt{n}} $$ 1895 | 1896 | which in layman's terms can be read as “take the general variability of the 1897 | points around their group means (the standard deviation), and scale this number 1898 | by the square root of the number of points that you’ve collected”. 1899 | 1900 | Since we have already computed the SD we can now compute the standard error 1901 | (SE). 1902 | 1903 | In [25]: 1904 | 1905 | ```R 1906 | datase <- cbind(ALLsd/sqrt(nALL), AMLsd/sqrt(nAML)) 1907 | ``` 1908 | 1909 | Now we can create a plot of the mean data as well as the SE and SD.
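As a quick check before plotting, the SD and SE definitions above can be worked through by hand on a small sample. This is a minimal sketch on made-up numbers: the vector `x` and the names `sd.pop`, `sd.sample` and `se` are hypothetical and not part of the tutorial data. It also shows that R's built-in `sd` divides by `n - 1` rather than by the number of values, so it is slightly larger than the formula given earlier.

```R
# Made-up sample of expression values (hypothetical, not taken from golub)
x <- c(2.1, 1.5, 2.0, 0.7, 0.5, 1.1)
n <- length(x)

# Standard deviation computed by dividing by the number of values
sd.pop <- sqrt(sum((x - mean(x))^2) / n)

# R's built-in sd() divides by n - 1 instead (sample standard deviation)
sd.sample <- sd(x)

# Standard error of the mean: SD scaled by the square root of the sample size
se <- sd.sample / sqrt(n)

# Show the three values side by side
c(sd.pop = sd.pop, sd.sample = sd.sample, se = se)
```

With more than one value in the sample, `se` is smaller than `sd.sample`, which is why the SE error bars end up shorter than the SD ones.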
1910 | 1911 | In [26]: 1912 | 1913 | ```R 1914 | # creating a panel of 2 plots displayed in 1 row 1915 | par(mfrow = c(1,2)) 1916 | 1917 | # Plot with the SD 1918 | datasdend<-abs(dataheight) + abs(datasd) 1919 | datasdend[c(3,4),] = - datasdend[c(3,4),] 1920 | barx <- barplot(dataheight, beside=T, horiz=F, col = colours, ylim=c(-2,2.5), 1921 | main = 'Data + SD', border = 'white') 1922 | abline(a = 0 , b = 0, h = 0) 1923 | arrows(barx, dataheight, barx, datasdend, angle=90, lwd = 2, length = 0.15, 1924 | col = 'navyblue') 1925 | box() 1926 | 1927 | # Plot with the se: error associated to the mean! 1928 | datasdend<-abs(dataheight) + abs(datase) 1929 | datasdend[c(3,4),] = -datasdend[c(3,4),] 1930 | barx <- barplot(dataheight, beside=T, horiz=F, col = colours, ylim=c(-2,2.5), 1931 | main = 'Data + SE', border = 'white') 1932 | abline(a = 0 , b = 0, h = 0) 1933 | arrows(barx, dataheight, barx, datasdend, angle=90, lwd = 2, length = 0.15, 1934 | col = 'navyblue') 1935 | box() 1936 | ``` 1937 | png 1938 | Note that the error bars for the SE are smaller than those for the SD. This is 1939 | no coincidence! 1940 | 1941 | As we increase n (in the SE equation), we divide the SD by a larger number and the error decreases. Hence, for any group with more than one sample, the 1942 | standard error will be smaller than the SD. 1943 | 1944 | ## 2. Data representation 1945 | This section presents some essential ways to display and visualize data. 1946 | 1947 | ### 2.1 Frequency table 1948 | Discrete data occur when the values naturally fall into categories. A frequency 1949 | table simply gives the number of occurrences within a category. 1950 | 1951 | A gene consists of a sequence of nucleotides (A, C, G, T). 1952 | 1953 | The number of each nucleotide can be displayed in a frequency table. 1954 | 1955 | This will be illustrated by the Zyxin gene, which plays an important role in cell 1956 | adhesion. The accession number (X94991.1) of one of its variants can be found in 1957 | a database like NCBI (UniGene).
The code below illustrates how to read the 1958 | sequence "X94991.1" of the species Homo sapiens from GenBank and how to construct a 1959 | pie chart from a frequency table of the four nucleotides. 1960 | 1961 | In [27]: 1962 | 1963 | ```R 1964 | library('ape') 1965 | ``` 1966 | 1967 | In [29]: 1968 | 1969 | ```R 1970 | v = read.GenBank(c("X94991.1"),as.character = TRUE) 1971 | 1972 | pie(table(v$X94991.1), col = colours, border = 'white') 1973 | 1974 | # prints the nucleotide counts as a table, reusing the sequence read above 1975 | table(v$X94991.1) 1976 | ``` 1977 | 1978 | 1979 | 1980 | a c g t 1981 | 410 789 573 394 1982 | png 1983 | ### 2.2 Stripcharts 1984 | 1985 | An elementary method to visualize data is by using a so-called stripchart, 1986 | in which the values of the data are represented as e.g. small boxes. 1987 | It is useful in combination with a factor that distinguishes members from 1988 | different experimental conditions or patient groups. 1989 | 1990 | Once again we use the CCND3 Cyclin D3 data to generate the plots. 1991 | 1992 | In [30]: 1993 | 1994 | ```R 1995 | # data(golub, package = "multtest") 1996 | gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML")) 1997 | 1998 | stripchart(golub[1042,] ~ gol.fac, method = "jitter", 1999 | col = c('slateblue', 'darkgrey'), pch = 16) 2000 | 2001 | ``` 2002 | png 2003 | From the above figure, it can be observed that the CCND3 Cyclin D3 expression 2004 | values of the ALL patients tend to be larger than those of 2005 | the AML patients. 2006 | 2007 | 2008 | ### 2.3 Histograms 2009 | 2010 | Another method to visualize data is by dividing the range of data values into 2011 | a number of intervals and plotting the frequency per interval as a bar. Such 2012 | a plot is called a histogram. 2013 | 2014 | We will now generate a histogram of the expression values of gene CCND3 Cyclin 2015 | D3 as well as of all the genes for the AML and ALL patients contained in the Golub 2016 | dataset.
2017 | 2018 | In [31]: 2019 | 2020 | ```R 2021 | par(mfrow=c(2,2)) 2022 | 2023 | hist(golub[1042, gol.fac == "ALL"], 2024 | col = 'slateblue', border = 'white', 2025 | main = 'Golub[1042], ALL', xlab = 'ALL') 2026 | box() 2027 | 2028 | hist(golub,breaks = 10, 2029 | col = 'slateblue', border = 'white', 2030 | main = 'Golub') 2031 | box() 2032 | 2033 | hist(golub[, gol.fac == "AML"],breaks = 10, 2034 | col = 'slateblue', border = 'white', 2035 | main = 'Golub, AML', xlab = 'AML') 2036 | box() 2037 | 2038 | hist(golub[, gol.fac == "ALL"],breaks = 10, 2039 | col = 'slateblue', border = 'white', 2040 | main = 'Golub, ALL', xlab = 'ALL') 2041 | box() 2042 | ``` 2043 | png 2044 | ### 2.4 Boxplots 2045 | 2046 | A popular method to display data is by 2047 | drawing a box around the 1st and the 3rd quartile (with a bold line segment 2048 | for the median), and smaller line segments (whiskers) for the smallest and 2049 | the largest data values. 2050 | 2051 | Such a data display is known as a box-and-whisker plot. 2052 | 2053 | We will start by creating a vector with gene expression values sorted in 2054 | ascending order (using the `sort` function). 2055 | 2056 | In [32]: 2057 | 2058 | ```R 2059 | # Sort the values of one gene 2060 | x <- sort(golub[1042, gol.fac=="ALL"], decreasing = FALSE) 2061 | 2062 | # printing the first five values 2063 | x[1:5] 2064 | ``` 2065 |
2066 |
2067 | 0.45827 1.10546 1.27645 1.32551 1.36844 2068 |
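The quantities a box-and-whisker plot draws can also be inspected numerically. A minimal sketch, using a small hypothetical vector rather than the Golub data: `fivenum()` returns the minimum, lower hinge, median, upper hinge and maximum, and `boxplot.stats()` returns the statistics that `boxplot()` works from.

```R
# Hypothetical expression values; not taken from the Golub data
x <- c(1.37, 0.46, 1.93, 1.11, 2.77, 1.28, 2.18, 1.33, 1.80)

# Sort in ascending order, as in the text
x_sorted <- sort(x, decreasing = FALSE)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x_sorted)
print(five)

# Statistics used by boxplot(): $stats holds the whisker ends, hinges and
# median; $out holds any points drawn individually as outliers
bs <- boxplot.stats(x_sorted)
print(bs$stats)
print(bs$out)
```

Sorting is not required by either function; it is kept here only to mirror the step shown in the cell above.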
2097 | A view of the distribution of the gene expression values of the `ALL` and `AML` 2098 | patients on gene CCND3 Cyclin D3 can be obtained by generating two separate 2099 | boxplots adjacent to each other: 2100 | 2101 | In [41]: 2102 | 2103 | ```R 2104 | # Even though we are creating two boxplots we only need a single plotting panel 2105 | par(mfrow = c(1, 1)) 2106 | boxplot(golub[1042,] ~ gol.fac, col = c('lightblue2', 'mediumpurple')) 2107 | 2108 | ``` 2109 | png 2110 | It can be observed that the gene expression values for ALL are larger than those 2111 | for AML. Furthermore, since the two sub-boxes around the median are more or less 2112 | equally wide, the data are quite symmetrically distributed around the median. 2113 | 2114 | We can also create a histogram of the expression values of gene CCND3 Cyclin D3 2115 | across all patients, overlaid with a kernel density estimate: 2116 | 2117 | In [110]: 2118 | 2119 | ```R 2120 | hist(golub[1042,], col = 'lightblue', border = 'black', breaks = 6, freq = FALSE, 2121 | main = 'Expression values of gene CCND3 Cyclin D3') 2122 | lines(density(golub[1042,]), col = 'slateblue', lwd = 3) 2123 | box() 2124 | ``` 2125 | png 2126 | Now we can observe the distribution of all gene expression values in all 38 2127 | patients: 2128 | 2129 | In [113]: 2130 | 2131 | ```R 2132 | boxplot(golub, col = 'lightblue2', lwd = 1, border = "black", pch = 18) 2133 | ``` 2134 | png 2135 | To compute the exact values of the quartiles we need a sequence running from 0 2136 | to 1 in steps of 0.25: 2137 | 2138 | In [114]: 2139 | 2140 | ```R 2141 | pvec <- seq(0, 1, 0.25) 2142 | quantile(golub[1042, gol.fac == 'ALL'], pvec) 2143 | ``` 2144 |
2145 |
2146 | 0% 25% 50% 75% 100% 2147 | 0.45827 1.796065 1.92776 2.178705 2.7661 2148 |
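The quartiles returned by `quantile()` are enough to derive the interquartile range and the conventional 1.5 × IQR fences that boxplots use to flag extreme points. The following is a sketch on a hypothetical vector (the values and the name `x` are illustrative only). Note that `boxplot()` itself works from the hinges of `fivenum()`, which can differ slightly from these quantiles.

```R
# Hypothetical values with one deliberately extreme point (9.0)
x <- c(1.1, 1.3, 1.4, 1.8, 1.9, 2.0, 2.2, 2.3, 9.0)

# First and third quartiles (R's default quantile type 7)
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]          # same value as IQR(x)

# Fences at 1.5 * IQR below the first and above the third quartile
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Points outside the fences
flagged <- x[x < lower | x > upper]
print(c(lower = lower, upper = upper))
print(flagged)
```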
2176 | Outliers are data points lying far apart from the pattern set by the majority of 2177 | the data values. The implementation in R of the boxplot draws such outliers as 2178 | small circles. 2179 | 2180 | A data point `x` is defined (graphically, not statistically) as an outlier point 2181 | if $$x < x_{0.25} - 1.5\left(x_{0.75} - x_{0.25}\right) \quad \text{or} \quad x > x_{0.75} + 1.5\left(x_{0.75} - x_{0.25}\right),$$ 2182 | where $x_{0.25}$ and $x_{0.75}$ denote the first and third quartiles. 2183 | 2184 | ### 2.4 Q-Q plots (Quantile-quantile plots) 2185 | 2186 | A method to visualize the distribution of gene expression values is by the 2187 | so-called quantile-quantile (Q-Q) plot. In such a plot the quantiles of the gene 2188 | expression values are displayed against the corresponding quantiles of the 2189 | normal distribution (bell-shaped). 2190 | 2191 | A straight line is added to represent the points which 2192 | correspond exactly to the quantiles of the normal distribution. By observing 2193 | how closely the points follow the line, one can judge to 2194 | what degree the data are normally distributed. That is, the closer the gene 2195 | expression values appear to the line, the more likely it is that the data are 2196 | normally distributed. 2197 | 2198 | To produce a Q-Q plot of the ALL gene expression values of CCND3 Cyclin D3 one 2199 | may use the following. 2200 | 2201 | In [116]: 2202 | 2203 | ```R 2204 | qqnorm(golub[1042, gol.fac == 'ALL']) 2205 | qqline(golub[1042, gol.fac == 'ALL'], col = 'slateblue', lwd = 2) 2206 | ``` 2207 | png 2208 | It can be seen that most of the data points are on or near the straight line, 2209 | while a few others are further away. The above example illustrates a case where 2210 | the degree of non-normality is moderate so that a clear conclusion cannot be 2211 | drawn. 2212 | 2213 | 2214 | ## 3.
Loading tab-delimited data 2215 | 2216 | In [117]: 2217 | 2218 | ```R 2219 | mydata <- read.delim("./NeuralStemCellData.tab.txt", row.names = 1, header = TRUE) 2220 | ``` 2221 | 2222 | In [118]: 2223 | 2224 | ```R 2225 | class(mydata) 2226 | ``` 2227 | 2228 | 'data.frame' 2229 | 2230 | ### Now try and do some exploratory analysis of your own on this data! 2231 | 2232 | 2233 | GvHD flow cytometry data 2234 | 2235 | Only extract the CD3 positive cells 2236 | 2237 | In [119]: 2238 | 2239 | ```R 2240 | cor(mydata[, 1], mydata[, 2]) 2241 | plot(mydata[, 1], mydata[, 3]) 2242 | ``` 2243 | 2244 | 0.956021382271511 2245 | png 2246 | --------------------------------------------------------------------------------