├── .gitignore ├── GastricCancer_NMR.xlsx ├── MTBLS290db.xlsx ├── README.md ├── Tutorial1.html ├── Tutorial1.ipynb ├── Tutorial2.html ├── Tutorial2.ipynb ├── _config.yml ├── environment.yml └── images ├── R2Q2.png ├── R2Q2_ab.png ├── bulb.png ├── cog2.png ├── logo_text.png └── mouse.png /.gitignore: -------------------------------------------------------------------------------- 1 | # ignore Jupyter checkpoints 2 | .ipynb_checkpoints 3 | 4 | # ignore result files produced by notebooks 5 | modelPLS.xlsx 6 | stats.xlsx -------------------------------------------------------------------------------- /GastricCancer_NMR.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CIMCB/MetabWorkflowTutorial/dadb9eff6dbe3c21cb5b21f6fb38d6ae1606d970/GastricCancer_NMR.xlsx -------------------------------------------------------------------------------- /MTBLS290db.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CIMCB/MetabWorkflowTutorial/dadb9eff6dbe3c21cb5b21f6fb38d6ae1606d970/MTBLS290db.xlsx -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # README.md - `SI_Mendez_etal_2019` 4 | 5 |

6 | This repository contains the supplementary information for the "Toward Collaborative Open Data Science in Metabolomics using Jupyter Notebooks and Cloud Computing" tutorial review. The focus is on experiential learning through an example interactive metabolomics data analysis workflow deployed using a combination of Python, Jupyter Notebooks, and Binder. The three-step pedagogical process of understanding, implementation, and deployment is broken down into five tutorials. 7 |

8 | 9 |

10 | The tutorials take you through the process of using interactive notebooks to produce a shareable, reproducible data analysis workflow that connects the study design to the reported biological conclusions in an interactive document. The implemented workflow includes a discrete set of interactive and interlinked procedures: data cleaning, univariate statistics, multivariate machine learning, feature selection, and data visualisation. 11 |

12 | 13 |
14 | 15 | ## Quick Start 16 | 17 | #### *To launch the tutorial environment in the cloud:* [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/cimcb/MetabWorkflowTutorial/master?urlpath=tree) 18 | 19 | #### Tutorial 1: 20 | - [Tutorial 1 static notebook](https://cimcb.github.io/MetabWorkflowTutorial/Tutorial1.html) 21 | - [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/CIMCB/MetabWorkflowTutorial/master?filepath=Tutorial1.ipynb) 22 | 23 | #### Tutorial 2: 24 | - [Tutorial 2 static notebook](https://cimcb.github.io/MetabWorkflowTutorial/Tutorial2.html) 25 | - [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/CIMCB/MetabWorkflowTutorial/master?filepath=Tutorial2.ipynb) 26 | 27 | #### Tutorial 4: 28 | - [Tutorial 4 static notebook](https://cimcb.github.io/MetabSimpleQcViz/Tutorial4.html) 29 | - [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/cimcb/MetabSimpleQcViz/master?filepath=Tutorial4.ipynb) 30 | 31 |
32 | 33 | ## Tutorials 34 | 35 | 1. [Launching and using a Jupyter notebook on Binder](#one) 36 | 2. [Interacting with and editing a Jupyter notebook on Binder](#two) 37 | 3. [Downloading and installing a Jupyter notebook on a local machine](#three) 38 | 4. [Creating a new Jupyter notebook on a local computer](#four) 39 | 5. [Deploying a Jupyter notebook on Binder via GitHub](#five) 40 | 41 |
42 | 43 | 44 | ## Tutorial 1: Launching and using a Jupyter notebook on Binder 45 |

46 | In this tutorial we will step through a metabolomics computational workflow that has already been implemented as a Jupyter Notebook and deployed on Binder. In this workflow we will interrogate a published (Chan et al., 2016) NMR urine data set (deconvolved and annotated) used to discriminate between samples from gastric cancer patients and healthy controls. The data set is available in the Metabolomics Workbench data repository (Project ID PR000699) and can be accessed directly via its project DOI:10.21228/M8B10B. 47 |

48 | 49 |

Prior to beginning this tutorial, it is recommended to view the provided static version of the Jupyter analysis notebook: [Tutorial 1: Static Notebook](https://cimcb.github.io/MetabWorkflowTutorial/Tutorial1.html) 50 |

51 | 52 | #### Tutorial 1 steps: 53 | 1. Launch Binder by clicking the "Launch Binder" icon: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/cimcb/MetabWorkflowTutorial/master) 54 | 2. Click on the Tutorial1.ipynb filename to open the Jupyter Notebook 55 | 3. Run the code cells: 56 | 3.1. Click anywhere within the cell, which will then be outlined by a green box (or blue for text cells) 57 | 3.2. Click on the “Run” button in the top menu to execute the code in the cell 58 | 4. **Alternatively**, run multiple cells: 59 | 4.1. In the "Cell" menu, choose "Run All" (runs all cells in the notebook, from top to bottom) 60 | 4.2. In the "Cell" menu, choose "Run All Below" (runs all cells below the current selection) 61 | 5. Download the notebook (changes to the notebook are lost when the Binder session ends): 62 | 5.1. Return to the Jupyter landing page by choosing "File" then "Open..." 63 | 5.2. Click the checkbox next to each file you wish to download 64 | 5.3. Click the ‘Download’ button from the top menu 65 | 66 |
67 | 68 | 69 | ## Tutorial 2: Interacting with and editing a Jupyter notebook on Binder 70 |

71 | The functionality of the notebook in Tutorial 2 is identical to that of Tutorial 1, but the text cells have now been expanded into a comprehensive interactive tutorial. Text cells with a yellow background provide the metabolomics context and describe the purpose of the code in the following code cell. Additional coloured text boxes are placed throughout the workflow to help novice users navigate and understand the interactive principles of a Jupyter Notebook: 72 |

73 | 74 | | Box | Description | Purpose | 75 | | ------------- | ------------- | -----| 76 | | Action | red background with ‘gears’ icon | Suggestions for changing the functionality of the subsequent code cell by editing (or substituting) a line of code | 77 | | Interaction | green background with ‘mouse’ icon | Suggestions for interacting with the visual results generated by a code cell | 78 | | Notes | blue background with ‘lightbulb’ icon | Further information about the theoretical reasoning behind the block of code or a given visualisation | 79 | 80 |

81 | A second data set, converted to our standardised Tidy Data format, is included with this tutorial. This data set was previously published by Gardlo et al. (2016) in Metabolomics. Urine samples collected from newborns with perinatal asphyxia were analysed, and the deconvolved and annotated file is deposited at the MetaboLights data repository (Project ID MTBLS290). 82 |

83 | 84 |

85 | Prior to beginning this tutorial, it is recommended to view the provided static version of the Jupyter analysis notebook: [Tutorial 2: Static Notebook](https://cimcb.github.io/MetabWorkflowTutorial/Tutorial2.html) 86 |

87 | 88 | #### Tutorial 2 steps: 89 | 1. Launch Binder by clicking the "Launch Binder" icon: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/cimcb/MetabWorkflowTutorial/master) 90 | 2. Click on the Tutorial2.ipynb filename to open the Jupyter Notebook 91 | 3. Run the code cells: 92 | 3.1. Click anywhere within the cell, which will then be outlined by a green box (or blue for text cells) 93 | 3.2. Click on the “Run” button in the top menu to execute the code in the cell 94 | 4. **Modify and run code cells**: 95 | 4.1. When prompted, complete one (or more) of the modifications suggested in each ‘action’ box. 96 | 4.2. Click “Run All Below” from the “Cell” dropdown menu, observing the changes in cell output for all the subsequent cells. 97 | 5. Download the notebook (changes to the notebook are lost when the Binder session ends): 98 | 5.1. Return to the Jupyter landing page by choosing "File" then "Open..." 99 | 5.2. Click the checkbox next to each file you wish to download 100 | 5.3. Click the ‘Download’ button from the top menu 101 | 102 | **Note: Further guidance for step 4 is included in the notebook itself.** 103 | 104 |
105 | 106 | 107 | ## Tutorial 3: Downloading and installing a Jupyter notebook on a local machine. 108 |

109 | In Tutorial 3, we will step through the process of installing Python and Jupyter, and reproduce Tutorial 1 on your own machine. The Anaconda distribution provides a unified framework for running notebooks and managing Conda virtual environments that is consistent across operating systems, so for convenience we will use the Anaconda interface in these tutorials. 110 |

111 | 112 | #### Tutorial 3 steps: 113 | 114 | ##### Part A) Install Jupyter and Python using Anaconda 115 | 116 | 1. Go to the [Official Anaconda Website](https://www.anaconda.com/distribution/) and click the 'Download' button. 117 | 2. Press the 'Download' button under the bold 'Python 3.x version' heading to download the graphical installer for your OS 118 | 3. After the download has finished, open (double-click) the installer to begin installing the Anaconda Distribution 119 | 4. Follow the prompts in the graphical installer to complete the installation 120 | 121 | ##### Part B) Create a virtual environment and start Tutorial 1. 122 | 123 | **Note: If you are using Windows, you first need to install git: [Git for Windows](https://gitforwindows.org/)** 124 | 125 | 1. Open Terminal on Linux/macOS or Command Prompt on Windows 126 | 2. Enter the following commands into the console (one line at a time): 127 | 128 | ```console 129 | git clone https://github.com/cimcb/MetabWorkflowTutorial 130 | cd MetabWorkflowTutorial 131 | conda env create -f environment.yml 132 | conda activate MetabWorkflowTutorial 133 | jupyter notebook 134 | ``` 135 | 136 |
137 | 138 | 139 | ## Tutorial 4: Creating a new Jupyter notebook on a local computer 140 |

141 | The Jupyter notebook we will generate for this data set demonstrates the use of the visualisation methods available in Anaconda Python, without the need to install additional third-party packages. This tutorial is available in the GitHub repository at https://github.com/cimcb/MetabSimpleQcViz. In the notebook, we will load the data set and produce four graphical outputs: 142 |

143 | 144 | 1. A histogram of the distribution of QCRSD across the data set. 145 | 2. A kernel density plot of QCRSD vs. D-ratio across the data set. 146 | 3. A PCA scores plot of the data set labelled by sample type. 147 | 4. A bubble scatter plot of molecular mass vs. retention time, with bubble size proportional to QCRSD. 148 | 149 | *Prior to beginning this tutorial, it is recommended to view the provided static version of the `Jupyter` analysis notebook: [Tutorial 4: Static Notebook](https://cimcb.github.io/MetabSimpleQcViz/Tutorial4.html)* 150 | 151 | #### Tutorial 4 steps: 152 | 1. Download and unzip the [repository](https://github.com/cimcb/MetabSimpleQcViz) to a folder on your own computer, as in Tutorial 3. 153 | 2. Create a new notebook: 154 | 2.1. Ensure “[base (root)]” is selected in the “Applications on” dropdown list of the main panel 155 | 2.2. Launch Jupyter Notebook 156 | 2.3. Navigate to the repository root (the “MetabSimpleQcViz” folder) 157 | 2.4. Click on the “New” button in the top right corner of the page 158 | 2.5. Select “Python 3” from the list of supported languages (a blank Jupyter Notebook called "Untitled" will open) 159 | 2.6. Rename the notebook by clicking on the text “Untitled” and replacing it with “myQCviz” 160 | 3. Edit the notebook (using the template file as a guide) ... 161 | 4. Save and close the notebook: 162 | 4.1. Save by clicking on the floppy disk icon (far left on the menu) 163 | 4.2. Close by clicking “File” and then “Close and Halt” from the top Jupyter menu 164 | 4.3. The Jupyter session can be closed by clicking on “Quit” on the Jupyter landing page tab of your web browser. 165 | 166 |
167 | 168 | 169 | ## Tutorial 5: Deploying a Jupyter notebook on Binder via GitHub 170 | 1. Create a GitHub account (please follow the instructions on GitHub at https://help.github.com/en/articles/signing-up-for-a-new-github-account) 171 | 2. Create a new repository 172 | 2.1. Click on the “Repositories” link at the top of the page 173 | 2.2. Enter “Jupyter_metabQC” into the “Repository name” field 174 | 2.3. Select the ‘Public’ option for the repository 175 | 2.4. Select the checkbox to “Initialize this repository with a README” 176 | 2.5. Add a license (https://choosealicense.com/) 177 | 2.6. Click the “Create repository” button 178 | 3. Add files (the new Jupyter notebook and the Excel data file) to the repository 179 | 3.1. Click on the ‘Upload files’ button 180 | 3.2. Drag the files (‘myQCviz.ipynb’ and ‘data.xlsx’) from your computer 181 | 3.3. Enter the text “Add data and notebook via upload” in the top field under “Commit changes” 182 | 3.4. To commit the files to the repository, click on the “Commit changes” button 183 | 4. Build and launch a Binder virtual machine for this repository 184 | 4.1. Open https://mybinder.org in a modern web browser 185 | 4.2. Enter the path to the home page of your repository (https://github.com/account_name/Jupyter_metabQC) 186 | 4.3. Click the ‘Launch’ button 187 | 4.4. The URL shown in the field “Copy the URL below and share your Binder with others” can be shared with colleagues 188 |

190 | Congratulations, you have created your first Binder notebook! Now share it with your colleagues! 191 |

192 | -------------------------------------------------------------------------------- /Tutorial1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "
\n", 8 |     " To begin: Click anywhere in this cell and press Run on the menu bar. This executes the current cell and then highlights the next cell. There are two types of cells: a text cell and a code cell. When you Run a text cell (we are in a text cell now), you advance to the next cell without executing any code. When you Run a code cell (identified by In[ ]: to the left of the cell) you advance to the next cell after executing all the Python code within that cell. Any visual results produced by the code (text/figures) are reported directly below that cell. Press Run again. Repeat this process until the end of the notebook. NOTE: All the cells in this notebook can be automatically executed sequentially by clicking Kernel, then Restart and Run All. Should anything crash, restart the Jupyter kernel by clicking Kernel, then Restart, and start again from the top.\n", 9 |     " \n", 10 |     "
" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "
\n", 18 | "

\n", 19 | "\n", 20 | "

Tutorial 1: Basic Metabolomics Data Analysis Workflow

\n", 21 | "\n", 22 | "


\n", 23 | "
\n", 24 | "
\n", 25 | "
\n", 26 | "\n", 27 | "

This Jupyter notebook describes a typical metabolomics data analysis workflow for a study with a binary classification outcome. The main steps - importing the required packages, loading the data, data cleaning, PCA quality assessment, univariate statistics, and machine learning - are described in the numbered sections below.

\n", 28 | "\n", 29 | "\n", 53 | "\n", 54 | "

The study used in this tutorial was previously published as an open access article by Chan et al. (2016) in the British Journal of Cancer, and the deconvolved and annotated data file is deposited at the Metabolomics Workbench data repository (Project ID PR000699). The data can be accessed directly via its project DOI:10.21228/M8B10B. 1H-NMR spectra were acquired at Canada’s National High Field Nuclear Magnetic Resonance Centre (NANUC) using a 600 MHz Varian Inova spectrometer. Spectral deconvolution and metabolite annotation were performed using the Chenomx NMR Suite v7.6. Unfortunately, the raw NMR data are unavailable.

\n", 55 | "\n", 56 | "
" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": { 62 | "toc-hr-collapsed": false 63 | }, 64 | "source": [ 65 | "
\n", 66 | " \n", 67 | "

1. Import Packages/Modules

\n", 68 | "\n", 69 | "

The first code cell of this tutorial (below this text box) imports packages and modules into the Jupyter environment. Packages and modules provide additional functions and tools that extend the basic functionality of the Python language. We will need the following tools to analyse the data in this tutorial: the numpy and pandas packages, the train_test_split function from scikit-learn, and the cimcb_lite helper library.

\n", 70 | "\n", 71 | "\n", 85 | "\n", 86 | "

Run the cell by clicking anywhere in the cell (the cell will be surrounded by a blue box) and then clicking Run in the Menu.
\n", 87 | "When successfully executed the cell will print All packages successfully loaded in the notebook below the cell.

\n", 88 | "
" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "import numpy as np\n", 98 | "import pandas as pd\n", 99 | "\n", 100 | "from sklearn.model_selection import train_test_split\n", 101 | "\n", 102 | "import cimcb_lite as cb\n", 103 | "\n", 104 | "print('All packages successfully loaded')" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "
\n", 112 | " \n", 113 | "

2. Load Data and Peak sheet

\n", 114 | "\n", 115 | "

This workflow requires data to be uploaded as a Microsoft Excel file, using the Tidy Data framework (i.e. each column is a variable, and each row is an observation). As such, the Excel file should contain a Data Sheet and a Peak Sheet. The Data Sheet contains all the metabolite concentrations and the metadata associated with each observation (requiring the inclusion of the columns: Idx, SampleID, and Class). The Peak Sheet contains all the metadata pertaining to each measured metabolite (requiring the inclusion of the columns: Idx, Name, and Label). Please inspect the Excel file used in this tutorial before proceeding.

\n", 116 | "\n", 117 | "
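For illustration, the two-sheet layout described above can be sketched with pandas. Only the required column names (Idx, SampleID, Class, Name, Label) come from this tutorial; the metabolite columns M1/M2 and all values below are invented:

```python
import pandas as pd

# Hypothetical Data sheet: one row per observation (sample).
# Only Idx, SampleID, and Class are required by the tutorial;
# the metabolite columns (M1, M2) and all values are invented.
data_sheet = pd.DataFrame({
    "Idx": [1, 2, 3],
    "SampleID": ["S001", "S002", "S003"],
    "Class": ["GC", "HE", "QC"],
    "M1": [0.12, 0.15, 0.13],
    "M2": [1.04, 0.98, 1.01],
})

# Hypothetical Peak sheet: one row per measured metabolite.
peak_sheet = pd.DataFrame({
    "Idx": [1, 2],
    "Name": ["M1", "M2"],
    "Label": ["Citrate", "Creatinine"],
})

# The kind of sanity check a loader can perform on the two sheets:
assert {"Idx", "SampleID", "Class"}.issubset(data_sheet.columns)
assert {"Idx", "Name", "Label"}.issubset(peak_sheet.columns)

# Writing both sheets into a single Excel file (requires openpyxl):
# with pd.ExcelWriter("example.xlsx") as writer:
#     data_sheet.to_excel(writer, sheet_name="Data", index=False)
#     peak_sheet.to_excel(writer, sheet_name="Peak", index=False)
```

Note that the Name entries in the Peak sheet match the metabolite column headers in the Data sheet; this is what lets the workflow look up concentrations by peak name later on.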

The code cell below loads the Data and Peak sheets from an Excel file, using the CIMCB helper function load_dataXL(). When this is complete, you should see confirmation that Peak (stored in the Peak worksheet in the Excel file) and Data (stored in the Data worksheet in the Excel file) tables have been loaded:

\n", 118 | "\n", 119 | "
Loadings PeakFile: Peak\n",
120 |     "Loadings DataFile: Data\n",
121 |     "Data Table & Peak Table is suitable.\n",
122 |     "TOTAL SAMPLES: 140 TOTAL PEAKS: 149\n",
123 |     "Done!\n",
124 |     "
\n", 125 | "\n", 126 | "

Once loaded, the data is available for use in variables called dataTable and peakTable.

\n", 127 | "\n", 128 | "
" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "# The path to the input file (Excel spreadsheet)\n", 138 | "filename = 'GastricCancer_NMR.xlsx'\n", 139 | "\n", 140 | "# Load Peak and Data tables into two variables\n", 141 | "dataTable, peakTable = cb.utils.load_dataXL(filename, DataSheet='Data', PeakSheet='Peak') " 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "
\n", 149 | "\n", 150 | "

2.1 Display the Data table

\n", 151 | "\n", 152 | "

\n", 153 | " The dataTable table can be displayed interactively so we can inspect and check the imported values. To do this, we use the display() function. For this example the imported data consists of 140 samples and 149 metabolites. \n", 154 | "

\n", 155 | "\n", 156 | "

Note that each row describes a single urine sample: the Idx and SampleID columns identify the sample, the Class column gives its clinical group, and the remaining columns hold the measured metabolite concentrations.

\n", 157 | "\n", 158 | "\n", 163 | "\n", 164 | "
" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": { 171 | "scrolled": false 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "display(dataTable) # View and check the dataTable " 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "
\n", 183 | "

2.2 Display the Peak table

\n", 184 | "\n", 185 | "

The peakTable table can also be displayed interactively so we can inspect and check the imported values. To do this, we again use the display() function. For this example the imported data consists of 149 metabolites (the same as in the dataTable data).

\n", 186 | "\n", 187 | "

Each row describes a single metabolite: the Idx, Name, and Label columns identify and annotate the metabolite, and the remaining columns (such as QC_RSD and Perc_missing) hold its pre-computed quality statistics.

\n", 188 | "\n", 189 | "\n", 202 | "
" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "display(peakTable) # View and check PeakTable" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "
\n", 219 | "\n", 220 | "

3. Data Cleaning

\n", 221 | "\n", 222 | "

It is good practice to assess the quality of your data, and to remove (clean out) any poorly measured metabolites, before performing any statistical or machine learning modelling (Broadhurst et al. 2018). For the Gastric Cancer NMR data set used in this example we have already calculated some basic statistics for each metabolite and stored them in the Peak table. In this notebook we keep only metabolites with a QC-RSD below 20% and with fewer than 10% missing values.

\n", 223 | "\n", 224 | "\n", 229 | "\n", 230 | "

When the data is cleaned, the number of remaining peaks will be reported.

\n", 231 | "\n", 232 | "
" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "# Create a clean peak table \n", 242 | "\n", 243 | "rsd = peakTable['QC_RSD'] \n", 244 | "percMiss = peakTable['Perc_missing'] \n", 245 | "peakTableClean = peakTable[(rsd < 20) & (percMiss < 10)] \n", 246 | "\n", 247 | "print(\"Number of peaks remaining: {}\".format(len(peakTableClean)))" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "
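The QC_RSD and Perc_missing columns filtered on above ship pre-computed in the Peak sheet, so their derivation is not shown in this repository. A minimal sketch of how such statistics are commonly obtained, assuming a hypothetical SampleType column flagging the QC samples and NaN-coded missing values (all numbers invented):

```python
import numpy as np
import pandas as pd

# Toy data: 6 samples (2 of them QCs) x 2 metabolites; NaN = missing.
# The SampleType column and all numbers are invented for illustration.
data = pd.DataFrame({
    "SampleType": ["QC", "Sample", "Sample", "QC", "Sample", "Sample"],
    "M1": [10.0, 12.0, np.nan, 10.5, 11.0, 9.0],
    "M2": [5.0, np.nan, np.nan, 6.0, 5.5, 4.5],
})

qc = data[data["SampleType"] == "QC"]
for name in ["M1", "M2"]:
    # QC-RSD: relative standard deviation across the QC samples, in percent
    qc_rsd = 100 * qc[name].std(ddof=1) / qc[name].mean()
    # Perc_missing: percentage of missing values across all samples
    perc_missing = 100 * data[name].isna().mean()
    print(f"{name}: QC_RSD = {qc_rsd:.1f}%, Perc_missing = {perc_missing:.1f}%")
```

Whether cimcb_lite computes these statistics in exactly this way is an assumption; the sketch only illustrates the idea behind the two quality thresholds.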
\n", 255 | "\n", 256 | "

4. PCA - Quality Assessment

\n", 257 | "\n", 258 | "

To provide a multivariate assessment of the quality of the cleaned data set it is good practice to perform a simple Principal Component Analysis (PCA), after suitable transformation & scaling. The PCA score plot is typically labelled by sample type (i.e. quality control (QC) or biological sample (Sample)). Data of high quality will have QCs that cluster tightly compared to the biological samples (Broadhurst et al. 2018).

\n", 259 | "\n", 260 | "

First the metabolite data matrix is extracted from the dataTable, log transformed (base-10), scaled (auto scaling in this example), and any missing values are imputed using k-nearest neighbours (k=3).

\n", 261 | "\n", 262 | "\n", 273 | "\n", 274 | "

The transformed & scaled dataset Xknn is used as input to PCA, using the helper function cb.plot.pca(). This returns plots of PCA scores and PCA loadings, for interpretation and quality assessment.

\n", 275 | "\n", 276 | "
" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "# Extract and scale the metabolite data from the dataTable \n", 286 | "\n", 287 | "peaklist = peakTableClean['Name'] # Set peaklist to the metabolite names in the peakTableClean\n", 288 | "X = dataTable[peaklist].values # Extract X matrix from dataTable using peaklist\n", 289 | "Xlog = np.log10(X) # Log scale (base-10)\n", 290 | "Xscale = cb.utils.scale(Xlog, method='auto') # methods include auto, pareto, vast, and level\n", 291 | "Xknn = cb.utils.knnimpute(Xscale, k=3) # missing value imputation (knn - 3 nearest neighbors)\n", 292 | "\n", 293 | "print(\"Xknn: {} rows & {} columns\".format(*Xknn.shape))\n", 294 | "\n", 295 | "cb.plot.pca(Xknn,\n", 296 | " pcx=1, # pc for x-axis\n", 297 | " pcy=2, # pc for y-axis\n", 298 | " group_label=dataTable['SampleType']) # labels for Hover in PCA loadings plot" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "
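The cb.plot.pca() helper bundles the decomposition and the interactive plots into one call. For readers without cimcb_lite, the scores and loadings behind such a plot can be computed with scikit-learn alone; a rough equivalent on stand-in random data (the 140-sample shape mirrors the tutorial, but the values are synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the scaled, imputed matrix (random values; this only
# illustrates the mechanics, not the tutorial's actual data).
rng = np.random.default_rng(0)
Xknn = rng.normal(size=(140, 52))

pca = PCA(n_components=2)
scores = pca.fit_transform(Xknn)     # one (PC1, PC2) pair per sample
loadings = pca.components_.T         # one (PC1, PC2) pair per metabolite

print("scores:", scores.shape, "loadings:", loadings.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

Plotting `scores` coloured by the SampleType column would reproduce the QC-vs-Sample clustering check described above.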
\n", 306 | " \n", 307 | "

5. Univariate Statistics for comparison of Gastric Cancer (GC) vs Healthy Controls (HE)

\n", 308 | "\n", 309 | "

The data set uploaded into dataTable describes the 1H-NMR urine metabolite profiles of individuals classified into three distinct groups: GC (gastric cancer), BN (benign), and HE (healthy). For this workflow we are interested in comparing only the differences in profiles between individuals classified as GC or HE.

\n", 310 | "\n", 311 | "

The helper function cb.utils.univariate_2class() will take as input a data table where the observations represent data from two groups, and a corresponding table of metabolite peak information, and produce as output summary statistics of univariate comparisons between the two groups. The output is returned as a pandas dataframe, describing output from statistical tests such as Student's t-test and the Shapiro-Wilk test, and summaries of data quality, like the number and percentage of missing values.

\n", 312 | "\n", 313 | "

First, we reduce the data in dataTable to only those observations for GC and HE samples, and we define the GC class to be a positive outcome, in the variable pos_outcome. Next, we pass the reduced dataset and the cleaned peakTable to cb.utils.univariate_2class(), and store the returned dataframe in a new variable called statsTable. This is then displayed as before for interactive inspection and interpretation.

\n", 314 | "\n", 315 | "
" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "# Select subset of Data for statistical comparison\n", 325 | "dataTable2 = dataTable[(dataTable.Class == \"GC\") | (dataTable.Class == \"HE\")] # Reduce data table only to GC and HE class members\n", 326 | "pos_outcome = \"GC\" \n", 327 | "\n", 328 | "# Calculate basic statistics and create a statistics table.\n", 329 | "statsTable = cb.utils.univariate_2class(dataTable2,\n", 330 | " peakTableClean,\n", 331 | " group='Class', # Column used to determine the groups\n", 332 | " posclass=pos_outcome, # Value of posclass in the group column\n", 333 | " parametric=True) # Set parametric = True or False\n", 334 | "\n", 335 | "# View and check StatsTable\n", 336 | "display(statsTable)" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "
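Internally, a two-class univariate comparison of this kind boils down to a per-metabolite hypothesis test. A minimal sketch of a single such comparison using SciPy (toy values, not the tutorial data; whether cimcb_lite uses exactly this test is an assumption):

```python
import numpy as np
from scipy import stats

# Toy concentrations of a single metabolite in each group (invented values)
gc = np.array([3.1, 2.8, 3.5, 3.9, 3.3, 3.6])
he = np.array([2.1, 2.4, 1.9, 2.2, 2.6, 2.0])

# Welch's two-sample t-test (equal variances not assumed)
t_stat, p_value = stats.ttest_ind(gc, he, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A table like statsTable would repeat such a test for every metabolite,
# alongside normality checks (e.g. stats.shapiro) and missing-value counts.
```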
\n", 344 | "\n", 345 | "

It is useful to have this interactive view, but the output will disappear when we close the notebook. To store the output in a more persistent format, such as an Excel spreadsheet, we can use the methods that are built in to pandas dataframes.

\n", 346 | "\n", 347 | "

To save a pandas dataframe to an Excel spreadsheet file as a single sheet, we use the dataframe's .to_excel() method, and provide the name of the file we want to write to (and optionally a name for the sheet). We do not want to keep the dataframe's own index column, so we also set index=False.

\n", 348 | "\n", 349 | "

The code in the cell below will write the contents of statsTable to the file stats.xlsx.

\n", 350 | "\n", 351 | "
" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "# Save StatsTable to Excel\n", 361 | "statsTable.to_excel(\"stats.xlsx\", sheet_name='StatsTable', index=False)\n", 362 | "print(\"done!\")" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "
\n", 370 | " \n", 371 | "

6. Machine Learning

\n", 372 | "\n", 373 | "

\n", 374 |     "The remainder of this tutorial will describe the use of a 2-class Partial Least Squares-Discriminant Analysis (PLS-DA) model to identify metabolites which, when combined in a linear equation, are able to classify unknown samples as either GC or HE with a measurable degree of certainty.\n", 375 |     "

\n", 376 | "\n", 377 | "

6.1 Splitting data into Training and Test sets.

\n", 378 | "\n", 379 | "

\n", 380 |     "Multivariate predictive models are prone to overfitting. In order to provide some level of independent evaluation it is common practice to split the source data set into two parts: a training set and a test set. The model is then optimised using the training data and independently evaluated using the test data. The true effectiveness of a model can only be assessed using the test data (Westerhuis et al. 2008, Xia et al. 2012). It is vitally important that both the training and test data are equally representative of the sample population (in our example the urine metabotype of Gastric Cancer and the urine metabotype of Healthy Control). It is typical to split the data using a ratio of 2:1 (⅔ training, ⅓ test) using stratified random selection. If the purpose of model-building is exploratory, or sample numbers are small, this step is often ignored; however, care must be taken in interpreting a model that has not been tested on a dataset independent of the data it was trained on.\n", 381 |     "

\n", 382 | "\n", 383 | "

\n", 384 |     "We use the dataTable2 dataframe created above, which contains a subset of the complete data suitable for a 2-class comparison (GC vs HE). Our goal is to split this dataframe into a training subset (dataTrain) which will be used to train our model, and a test set (dataTest), which will be used to evaluate the trained model. We will split the data such that the number of test set samples is 25% of the total. To do this, we will use the scikit-learn module's train_test_split() function.\n", 385 |     "

\n", 386 | "\n", 387 | "

\n", 388 | "First, we need to ensure that the sample split - though random - is stratified so that the class membership is balanced to the same proportions in both the test and training sets. In order to do this, we need to supply a binary vector indicating stratification group membership.\n", 389 | "

\n", 390 | "\n", 391 | "

\n", 392 | "The train_test_split() function expects a binary (1/0) list of positive/negative outcome indicators, not the GC/HE classes that we have. We convert the class information for each sample in dataTable2 into Y, a list of 1/0 values, in the code cell below.\n", 393 | "

\n", 394 | "\n", 395 | "
" 396 |    ] 397 |   }, 398 |   { 399 |    "cell_type": "code", 400 |    "execution_count": null, 401 |    "metadata": {}, 402 |    "outputs": [], 403 |    "source": [ 404 |     "# Create a binary Y vector for stratifying the samples\n", 405 |     "outcomes = dataTable2['Class'] # Column that corresponds to Y class (should be 2 groups)\n", 406 |     "Y = [1 if outcome == 'GC' else 0 for outcome in outcomes] # Change Y into binary (GC = 1, HE = 0) \n", 407 |     "Y = np.array(Y) # Convert the binary list into a numpy array" 408 |    ] 409 |   }, 410 |   { 411 |    "cell_type": "markdown", 412 |    "metadata": {}, 413 |    "source": [
\n", 415 | "\n", 416 | "

Now that we have the dataset (dataTable2) and the list of binary outcomes (Y) for stratification, we can use the train_test_split() function in the code cell below.

\n", 417 | "\n", 418 | "

Once the training and test sets have been created, summary output will be printed:

\n", 419 | "\n", 420 | "
DataTrain = 62 samples with 32 positive cases.\n",
421 |     "DataTest = 21 samples with 11 positive cases.\n",
422 |     "
\n", 423 | "\n", 424 | "

Two new dataframes (dataTrain and dataTest, holding the training and test observations) and two new outcome arrays (Ytrain and Ytest, holding the corresponding binary outcomes) are created.

\n", 425 | "\n", 426 | "\n", 435 | "\n", 436 | "
" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "metadata": {}, 443 | "outputs": [], 444 | "source": [ 445 | "# Split dataTable2 and Y into train and test (with stratification)\n", 446 | "dataTrain, dataTest, Ytrain, Ytest = train_test_split(dataTable2, Y, test_size=0.25, stratify=Y,random_state=10)\n", 447 | "\n", 448 | "print(\"DataTrain = {} samples with {} positive cases.\".format(len(Ytrain),sum(Ytrain)))\n", 449 | "print(\"DataTest = {} samples with {} positive cases.\".format(len(Ytest),sum(Ytest)))" 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": {}, 455 | "source": [ 456 | "
\n", 457 | " \n", 458 | "

6.2. Determine optimal number of components for PLS-DA model

\n", 459 | "\n", 460 | "

The most common method to determine the optimal PLS-DA model configuration without overfitting is to use k-fold cross-validation. For PLS-DA, this will be a linear search of models having $1$ to $N$ latent variables (components).\n", 461 | " \n", 462 | "First, each PLS-DA configuration is trained using all the available data (XTknn and Ytrain). The generalised predictive ability of that model is then evaluated using the same data - typically by calculating the coefficient of determination $R^2$. This will generate $N$ evaluation scores ($R^2_1,R^2_2 ... R^2_N$).\n", 463 | "\n", 464 | "The training data is then split into $k$ equally sized subsets (folds). For each of the PLS-DA configurations, $k$ models are built, such that each model is trained on $k-1$ folds, the remaining fold is applied to the model, and the model predictions are recorded. The modelling process is implemented such that after $k$ models each fold will have been *'held-out'* exactly once.\n", 465 | "\n", 466 | "The generalised predictive ability of the model is then evaluated by comparing the *'held-out'* model predictions to the expected classification (cross-validated coefficient of determination - $Q^2$). This will generate $N$ cross-validated evaluation scores ($Q^2_1,Q^2_2 ... Q^2_N$). If the values for $R^2$ and $Q^2$ are plotted against model complexity (number of latent variables), typically the value of $Q^2$ will be seen to rise and then fall. The point at which the $Q^2$ value begins to diverge from the $R^2$ value is considered the point at which the optimal number of components has been met without overfitting.

\n", 467 | "\n", 468 | "

In this section, we perform 5-fold cross-validation using the training set we created above (dataTrain) to determine the optimal number of components to use in our PLS-DA model.

\n", 469 | "\n", 470 | "

First, in the cell below we extract and scale the training data in dataTrain the same way as we did for PCA quality assessment in section 4 (log-transformation, scaling, and k-nearest-neighbour imputation of missing values).

\n", 471 | "\n", 472 | "
" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": null, 478 | "metadata": {}, 479 | "outputs": [], 480 | "source": [ 481 | "# Extract and scale the metabolite data from the dataTable\n", 482 | "peaklist = peakTableClean['Name'] # Set peaklist to the metabolite names in the peakTableClean\n", 483 | "XT = dataTrain[peaklist] # Extract X matrix from DataTrain using peaklist\n", 484 | "XTlog = np.log(XT) # Log scale (base-10)\n", 485 | "XTscale = cb.utils.scale(XTlog, method='auto') # methods include auto, pareto, vast, and level\n", 486 | "XTknn = cb.utils.knnimpute(XTscale, k=3) # missing value imputation (knn - 3 nearest neighbors)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "
\n", 494 | " \n", 495 | "

Now we use the cb.cross_val.kfold() helper function to carry out 5-fold cross-validation of a set of PLS-DA models configured with different numbers of latent variables (1 to 6). This helper function is generally applicable, and the values being passed here are:

\n", 496 | "\n", 497 | "\n", 510 | "\n", 511 | "

The cb.cross_val.kfold() function returns an object that we store in the cv variable. To actually run the cross-validation, we call the cv.run() method of this object. When the cell is run, a progress bar will appear:

\n", 512 | "\n", 513 | "
Kfold: 100%|██████████| 100/100 [00:02<00:00, 33.71it/s]\n",
514 |     "
\n", 515 | "\n", 516 | "
" 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": null, 522 | "metadata": {}, 523 | "outputs": [], 524 | "source": [ 525 | "# initalise cross_val kfold (stratified) \n", 526 | "cv = cb.cross_val.kfold(model=cb.model.PLS_SIMPLS, # model; we are using the PLS_SIMPLS model\n", 527 | " X=XTknn, \n", 528 | " Y=Ytrain, \n", 529 | " param_dict={'n_components': [1,2,3,4,5,6]}, # The numbers of latent variables to search \n", 530 | " folds=5, # folds; for the number of splits (k-fold)\n", 531 | " bootnum=100) # num bootstraps for the Confidence Intervals\n", 532 | "\n", 533 | "# run the cross validation\n", 534 | "cv.run() " 535 | ] 536 | }, 537 | { 538 | "cell_type": "markdown", 539 | "metadata": {}, 540 | "source": [ 541 | "
\n", 542 | "\n", 543 | "

The object stored in the cv variable also has a .plot() method, which renders two views of $R^2$ and $Q^2$ statistics: difference ($R^2 - Q^2$), and absolute values of both metrics against the number of components, to aid in selecting the optimal number of components.

\n", 544 | "\n", 545 | "

The point at which the $Q^2$ value begins to diverge from the $R^2$ value is considered to be the point at which the optimal number of components has been met without overfitting. In this case, the plots clearly indicate that the optimal number of latent variables in our model is two.

\n", 546 | "\n", 547 | "
" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": null, 553 | "metadata": { 554 | "scrolled": false 555 | }, 556 | "outputs": [], 557 | "source": [ 558 | "cv.plot() # plot cross validation statistics" 559 | ] 560 | }, 561 | { 562 | "cell_type": "markdown", 563 | "metadata": {}, 564 | "source": [ 565 | "\n", 566 | "
\n", 567 | " \n", 568 | "

6.3 Train and evaluate PLS-DA model

\n", 569 | "\n", 570 | "

Now that we have determined that the optimal number of components for this example data set is 2, we can create a PLS-DA model with 2 latent variables and evaluate its predictive ability. The implementation of PLS we use is the PLS_SIMPLS class from the CIMCB helper module. We first create a PLS model object with two components, in the variable modelPLS:

\n", 571 | "\n", 572 | "
" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": null, 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [ 581 | "modelPLS = cb.model.PLS_SIMPLS(n_components=2) # Initalise the model with n_components = 2" 582 | ] 583 | }, 584 | { 585 | "cell_type": "markdown", 586 | "metadata": {}, 587 | "source": [ 588 | "
\n", 589 | "\n", 590 | "

Next we fit the model on the XTknn training dataset, with the values in Ytrain as the known response variables. We do this by calling the model's .train() method, with the predictor and response variables.

\n", 591 | "\n", 592 | "

This returns a list of values that are the predicted response variables, after model fitting.

\n", 593 | "\n", 594 | "
" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": null, 600 | "metadata": {}, 601 | "outputs": [], 602 | "source": [ 603 | "Ypred = modelPLS.train(XTknn, Ytrain) # Train the model " 604 | ] 605 | }, 606 | { 607 | "cell_type": "markdown", 608 | "metadata": {}, 609 | "source": [ 610 | "
\n", 611 | "\n", 612 | "

Finally, we call the trained model's .evaluate() method, passing a classification cutoff score from which a standard set of model evaluations will be calculated from the model predictions ($R^2$, Mann-Whitney p-value, Area under ROC curve, Accuracy, Precision, Sensitivity, Specificity). The model performance is also visualised using the following plots:

\n", 613 | "\n", 614 | "\n", 621 | "\n", 622 | "

From these plots and the table we find that the trained classifier performs acceptably well.

\n", 623 | "\n", 624 | "
" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "metadata": { 631 | "scrolled": false 632 | }, 633 | "outputs": [], 634 | "source": [ 635 | "# Evaluate the model \n", 636 | "modelPLS.evaluate(cutoffscore=0.5) " 637 | ] 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "metadata": {}, 642 | "source": [ 643 | "
\n", 644 | " \n", 645 | "

6.4. Perform a permutation test for the PLS-DA model

\n", 646 | "\n", 647 | "

The reliability of our trained model can be assessed using a permutation test. In this test, the original data is randomised (permuted or 'shuffled') so that the predictor variables and response variables are mixed, and a new model is then trained and tested on the shuffled data. This is repeated many times so that the behaviour of models constructed from \"random\" data can be fairly assessed.

\n", 648 | "\n", 649 | "

We can be confident that our model is being trained on relevant and meaningful features of the original dataset if the $R^2$ and $Q^2$ values generated from these models (with randomised data) are much lower than those found for our model trained on the original data.

\n", 650 | "\n", 651 | "

The PLS model we are using from the CIMCB module has a .permutation_test() method that can perform this analysis for us. It returns a pair of graphs that can be used to interpret model performance.

\n", 652 | "\n", 653 | "\n", 658 | "\n", 659 | "

We see that the models trained on randomised/shuffled data have much lower $R^2$ and $Q^2$ values than the models trained on the original data, so we can be confident that the model represents meaningful features in the original dataset.

\n", 660 | "\n", 661 | "
" 662 | ] 663 | }, 664 | { 665 | "cell_type": "code", 666 | "execution_count": null, 667 | "metadata": { 668 | "scrolled": false 669 | }, 670 | "outputs": [], 671 | "source": [ 672 | "modelPLS.permutation_test(nperm=100) #nperm refers to the number of permutations" 673 | ] 674 | }, 675 | { 676 | "cell_type": "markdown", 677 | "metadata": {}, 678 | "source": [ 679 | "\n", 680 | "
\n", 681 | " \n", 682 | "

6.5. Plot latent variable projections for PLS-DA model

\n", 683 | "\n", 684 | "

The PLS model also provides a .plot_projections() method, so we can visually inspect characteristics of the fitted latent variables. This returns a grid of plots:

\n", 685 | "\n", 686 | "\n", 693 | "\n", 694 | "

If only one latent variable is fitted, a plot similar to that produced by the .evaluate() method is shown, with the addition of a scatterplot of the latent variable scores.

\n", 695 | "\n", 696 | "
" 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": null, 702 | "metadata": {}, 703 | "outputs": [], 704 | "source": [ 705 | "modelPLS.plot_projections(label=dataTrain[['Idx','SampleID']], size=12) # size changes circle size" 706 | ] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "metadata": {}, 711 | "source": [ 712 | "\n", 713 | "
\n", 714 | "\n", 715 | "

6.6. Plot feature importance (Coefficient plot and VIP) for PLS-DA model

\n", 716 | "\n", 717 | "

Now that we have built a model and established that it represents meaningful features of the dataset, we determine the importance of specific peaks to the model's discriminatory power.

\n", 718 | "\n", 719 | "

To do this, in the cell below we use the PLS model's plot_featureimportance() method to render scatterplots of the PLS regression coefficient values for each metabolite, and Variable Importance in Projection (VIP) plots. The coefficient values provide information about the contribution of the peak to either a negative or positive classification for the sample, and peaks with VIP greater than unity (1) are considered to be \"important\" in the model.

\n", 720 | "\n", 721 | "

We could generate these plots for the model as it was trained, but we would prefer to have an estimate of the robustness of these values, so we generate bootstrapped confidence intervals with the model's .calc_bootci() method. Any metabolite coefficient with a confidence interval crossing the zero line is considered non-significant, and thus not \"important\" to the model.

\n", 722 | "\n", 723 | "

The .plot_featureimportance() method renders the two scatterplots, and also returns a new dataframe reporting these values, and their confidence intervals, which we capture in the variable peakSheet.

\n", 724 | "\n", 725 | "
" 726 | ] 727 | }, 728 | { 729 | "cell_type": "code", 730 | "execution_count": null, 731 | "metadata": {}, 732 | "outputs": [], 733 | "source": [ 734 | "# Calculate the bootstrapped confidence intervals \n", 735 | "modelPLS.calc_bootci(type='bca', bootnum=200) # decrease bootnum if it this takes too long on your machine\n", 736 | "\n", 737 | "# Plot the feature importance plots, and return a new Peaksheet \n", 738 | "peakSheet = modelPLS.plot_featureimportance(peakTableClean,\n", 739 | " peaklist,\n", 740 | " ylabel='Label', # change ylabel to 'Name' \n", 741 | " sort=False) # change sort to False" 742 | ] 743 | }, 744 | { 745 | "cell_type": "markdown", 746 | "metadata": {}, 747 | "source": [ 748 | "
\n", 749 | " \n", 750 | "

6.7. Test model with new data (using test set from section 6.1)

\n", 751 | "\n", 752 | "

So far, we have trained and tested our PLS classifier on a single training dataset. This risks overfitting: we may have optimised the model's performance on this particular dataset to the point that it cannot generalise, i.e. it may not perform as well on a dataset that it has not already seen.

\n", 753 | "\n", 754 | "

To see if the model can generalise, we must test our trained model using a new dataset that it has not already encountered. In section 6.1 we divided our original complete dataset into four components: dataTrain, Ytrain, dataTest and Ytest. Our trained model has not seen the dataTest and Ytest values that we have held out, so these can be used to evaluate model performance on new data.

\n", 755 | "\n", 756 | "

We begin by transforming and scaling this holdout dataset in the same way as we did for the training data. To do this, we first find the mean (mu) and standard deviation (sigma) of our transformed training data set XTlog with the cb.utils.scale() function, so that we can use these values to scale the holdout data.

\n", 757 | "\n", 758 | "
" 759 | ] 760 | }, 761 | { 762 | "cell_type": "code", 763 | "execution_count": null, 764 | "metadata": {}, 765 | "outputs": [], 766 | "source": [ 767 | "# Get mu and sigma from the training dataset to use for the Xtest scaling\n", 768 | "mu, sigma = cb.utils.scale(XTlog, return_mu_sigma=True) " 769 | ] 770 | }, 771 | { 772 | "cell_type": "markdown", 773 | "metadata": {}, 774 | "source": [ 775 | "
\n", 776 | " \n", 777 | "

Next, we extract the peak data for our holdout dataTest set, and put this in the variable XV. As before, we take the log transform (XVlog), scale the data in the same way as the training data (XVscale; note that we specify mu and sigma as calculated above), and impute missing values to give the final holdout test set XVknn.

\n", 778 | "\n", 779 | "
" 780 | ] 781 | }, 782 | { 783 | "cell_type": "code", 784 | "execution_count": null, 785 | "metadata": {}, 786 | "outputs": [], 787 | "source": [ 788 | "# Pull of Xtest from DataTest using peaklist ('Name' column in PeakTable)\n", 789 | "peaklist = peakTableClean.Name \n", 790 | "XV = dataTest[peaklist].values\n", 791 | "\n", 792 | "# Log transform, unit-scale and knn-impute missing values for Xtest\n", 793 | "XVlog = np.log(XV)\n", 794 | "XVscale = cb.utils.scale(XVlog, method='auto', mu=mu, sigma=sigma) \n", 795 | "XVknn = cb.utils.knnimpute(XVscale, k=3)" 796 | ] 797 | }, 798 | { 799 | "cell_type": "markdown", 800 | "metadata": {}, 801 | "source": [ 802 | "
\n", 803 | "\n", 804 | "

Now we predict a new set of response variables from XVknn as input, using our trained model and its .test() method, and then evaluate the performance of model prediction against the known values in Ytest using the .evaluate() method (as in section 6.3).

\n", 805 | "\n", 806 | "

Three plots are generated, showing comparisons of the performance of the model on training and holdout test datasets.

\n", 807 | "\n", 808 | "\n", 815 | "\n", 816 | "

A table of performance metrics for both datasets is shown below the figures.

\n", 817 | "\n", 818 | "
" 819 | ] 820 | }, 821 | { 822 | "cell_type": "code", 823 | "execution_count": null, 824 | "metadata": { 825 | "scrolled": false 826 | }, 827 | "outputs": [], 828 | "source": [ 829 | "# Calculate Ypredicted score using modelPLS.test\n", 830 | "YVpred = modelPLS.test(XVknn)\n", 831 | "\n", 832 | "# Evaluate Ypred against Ytest\n", 833 | "evals = [Ytest, YVpred] # alternative formats: (Ytest, Ypred) or np.array([Ytest, Ypred])\n", 834 | "#modelPLS.evaluate(evals, specificity=0.9)\n", 835 | "modelPLS.evaluate(evals, cutoffscore=0.5) " 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "metadata": {}, 841 | "source": [ 842 | "\n", 843 | "
\n", 844 | " \n", 845 | "

6.8. Export results to Excel

\n", 846 | "\n", 847 | "

Finally, we will save our results in a persistent Excel spreadsheet.

\n", 848 | "\n", 849 | "

Unlike section 5, we want to save two sheets in a single Excel workbook called modelPLS.xlsx. We want to save one sheet showing the holdout test data (with results from YVpred), and a separate sheet showing the peaks with their regression coefficients and VIP scores.

\n", 850 | "\n", 851 | "

Firstly, we generate a dataframe containing the test dataset and the model's predictions. This will have columns for

\n", 852 | "\n", 853 | "\n", 862 | "\n", 863 | "
" 864 | ] 865 | }, 866 | { 867 | "cell_type": "code", 868 | "execution_count": null, 869 | "metadata": {}, 870 | "outputs": [], 871 | "source": [ 872 | "# Save DataSheet as 'Idx', 'SampleID', and 'Class' from DataTest\n", 873 | "dataSheet = dataTest[[\"Idx\", \"SampleID\", \"Class\"]].copy() \n", 874 | "\n", 875 | "# Add 'Ypred' to Datasheet\n", 876 | "dataSheet['Ypred'] = YVpred \n", 877 | " \n", 878 | "display(dataSheet) # View and check the dataTable " 879 | ] 880 | }, 881 | { 882 | "cell_type": "markdown", 883 | "metadata": {}, 884 | "source": [ 885 | "
\n", 886 | "\n", 887 | "

In section 5 we saved a single dataframe to an Excel workbook, as a single worksheet. Here, we want to save two worksheets. This means we can't use the .to_excel() method of a dataframe directly to write twice to the same file. Instead, we must create a pd.ExcelWriter object, and add each dataframe in turn to this object. When we are finished adding dataframes, we can use the object's .save() method to write the Excel workbook with several worksheets (one per dataframe) to a single file.

\n", 888 | "\n", 889 | "
\n" 890 | ] 891 | }, 892 | { 893 | "cell_type": "code", 894 | "execution_count": null, 895 | "metadata": {}, 896 | "outputs": [], 897 | "source": [ 898 | "# Create an empty excel workbook\n", 899 | "writer = pd.ExcelWriter(\"modelPLS.xlsx\") # provide the filename for the Excel file\n", 900 | "\n", 901 | "# Add each dataframe to the workbook in turn, as a separate worksheet\n", 902 | "dataSheet.to_excel(writer, sheet_name='Datasheet', index=False)\n", 903 | "peakSheet.to_excel(writer, sheet_name='Peaksheet', index=False)\n", 904 | "\n", 905 | "# Write the Excel workbook to disk\n", 906 | "writer.save()\n", 907 | "\n", 908 | "print(\"Done!\")" 909 | ] 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "metadata": {}, 914 | "source": [ 915 | "
\n", 916 | "\n", 917 | "

Congratulations! You have reached the end of tutorial 1.

\n", 918 | "\n", 919 | "
" 920 | ] 921 | }, 922 | { 923 | "cell_type": "code", 924 | "execution_count": null, 925 | "metadata": {}, 926 | "outputs": [], 927 | "source": [] 928 | } 929 | ], 930 | "metadata": { 931 | "kernelspec": { 932 | "display_name": "Python 3", 933 | "language": "python", 934 | "name": "python3" 935 | }, 936 | "language_info": { 937 | "codemirror_mode": { 938 | "name": "ipython", 939 | "version": 3 940 | }, 941 | "file_extension": ".py", 942 | "mimetype": "text/x-python", 943 | "name": "python", 944 | "nbconvert_exporter": "python", 945 | "pygments_lexer": "ipython3", 946 | "version": "3.6.8" 947 | }, 948 | "toc": { 949 | "base_numbering": 1, 950 | "nav_menu": { 951 | "height": "338px", 952 | "width": "315px" 953 | }, 954 | "number_sections": false, 955 | "sideBar": true, 956 | "skip_h1_title": false, 957 | "title_cell": "Table of Contents", 958 | "title_sidebar": "Contents", 959 | "toc_cell": false, 960 | "toc_position": { 961 | "height": "calc(100% - 180px)", 962 | "left": "10px", 963 | "top": "150px", 964 | "width": "184px" 965 | }, 966 | "toc_section_display": true, 967 | "toc_window_display": false 968 | }, 969 | "toc-autonumbering": false, 970 | "toc-showmarkdowntxt": false, 971 | "widgets": { 972 | "application/vnd.jupyter.widget-state+json": { 973 | "state": {}, 974 | "version_major": 2, 975 | "version_minor": 0 976 | } 977 | } 978 | }, 979 | "nbformat": 4, 980 | "nbformat_minor": 2 981 | } 982 | -------------------------------------------------------------------------------- /Tutorial2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "
\n", 8 | " To begin: Click anywhere in this cell and press Run on the menu bar. This executes the current cell and then highlights the next cell. There are two types of cell. A text cell and a code cell. When you Run a text cell (we are in a text cell now), you advance to the next cell without executing any code. When you Run a code cell (identified by In[ ]: to the left of the cell) you advance to the next cell after executing all the Python code within that cell. Any visual results produced by the code (text/figures) are reported directly below that cell. Press Run again. Repeat this process until the end of the notebook. NOTE: All the cells in this notebook can be automatically executed sequentially by clicking KernelRestart and Run All. Should anything crash then restart the Jupyter Kernel by clicking KernelRestart, and start again from the top.\n", 9 | " \n", 10 | "
" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "\n", 18 | "\n", 19 | "
\n", 20 | "\n", 21 | "\n", 22 | "

Tutorial 2: Interactive Metabolomics Data Analysis Workflow

\n", 23 | "\n", 24 | "


\n", 25 | "
\n", 26 | "
\n", 27 | "
\n", 28 | "The functionality of this notebook is identical to Tutorial 1, but now the text cells have been expanded into a comprehensive interactive tutorial. As before, text cells provide the metabolomics context and describe the purpose of the code in the following code cell; however, this has now been simplified to avoid complete reptition of Tutorial 1. Additional coloured text boxes are now placed throughout the workflow to help novice users navigate and understand the interactive principles of a Jupyter Notebook.\n", 29 | "

\n", 30 | "
\n", 31 | "\n", 32 | "
\n", 33 | "\n", 34 | "
\n", 35 | "Red boxes (cog icon) provide suggestions for changing the functionality of the subsequent code cell by editing (or substituting) one or more lines of code.

\n", 36 | "
\n", 37 | "\n", 38 | "
\n", 39 | "\n", 40 | "
\n", 41 | " Green boxes (mouse icon) provide suggestions for interacting with the visual results generated by a code cell. For example, the first green box in the notebook describes how to sort and colour data in the embedded data tables.
\n", 42 | "
\n", 43 | "\n", 44 | "
\n", 45 | "\n", 46 | "
\n", 47 | "Blue boxes (lightbulb icon) provide further information about the theoretical reasoning behind a block of code or visualisation. This information is not essential to understand Jupyter notebooks but may be of general educational utility and interest to new metabolomics data scientists.
\n", 48 | "
\n", 49 | "\n" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": { 55 | "toc-hr-collapsed": false 56 | }, 57 | "source": [ 58 | "
\n", 59 | " \n", 60 | "

1. Import Packages/Modules

\n", 61 | "\n", 62 | "

The first code cell of this tutorial (below this text box) imports packages and modules into the Jupyter environment. Packages and modules provide additional functions and tools that extend the basic functionality of the Python language.\n", 63 | "

\n", 64 | "
\n", 65 | "\n", 66 | "
\n", 67 | "\n", 68 | "
\n", 69 | "\n", 70 | "
    \n", 71 | "
  • All the code embedded in this example notebook is written using the Python programming language (python.org) and is based upon extensions of popular open source packages with high levels of support. \n", 72 | " \n", 73 | "Note: a tutorial on the python programming language in itself is beyond the scope of this notebook. For more information on using Python and Jupyter Notebooks please refer to the excellent: \n", 74 | "Python Data Science Handbook (Jake VanderPlas, 2016), which is in itself a Jupyter Notebook deployed via Binder.
  • \n", 75 | "
\n", 76 | "
\n", 77 | "\n" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "import numpy as np\n", 87 | "import pandas as pd\n", 88 | "\n", 89 | "from sklearn.model_selection import train_test_split\n", 90 | "\n", 91 | "import cimcb_lite as cb\n", 92 | "\n", 93 | "print('All packages successfully loaded')" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "
\n", 101 | "\n", 102 | "

2. Load Data and Peak sheet

\n", 103 | "\n", 104 | "

The code cell below loads the Data and Peak sheets from an Excel file, using the CIMCB helper function load_dataXL(). When this is complete, you should see confirmation that Peak (stored in the Peak worksheet in the Excel file) and Data (stored in the Data worksheet in the Excel file) tables have been loaded.

\n", 105 | "
\n", 106 | "\n", 107 | "
\n", 108 | "\n", 109 | "
\n", 110 | "\n", 111 | "
    \n", 112 | "
  • There is a second dataset included with this tutorial which has been converted to the standardised Tidy Data format. This data has been previously published by Gardlo et al. (2016) in Metabolomics. \n", 113 | " \n", 114 | "Urine samples collected from newborns with perinatal asphyxia were analysed using a Dionex UltiMate 3000 RS system coupled to a triple quadrupole QTRAP 5500 tandem mass spectrometer. The deconvoluted and annotated file is deposited at the MetaboLights data repository (Project ID MTBLS290). \n", 115 | "\n", 116 | "Please inspect the Excel file before using it in this tutorial. To change the data set to be loaded into the notebook, replace filename = 'GastricCancer_NMR.xlsx' with filename = 'MTBLS290db.xlsx', and press Run on the menu bar.\n", 117 | "\n", 118 | "Note: if you change the name of the file in this code cell, you will also have to make changes to Section 5 and Section 6 (as indicated in the text cell above each) for the correct models to be built. It is probably best to come back to this exercise after finishing an initial walk-through of the complete tutorial using the default data set.
  • \n", 119 | "
\n", 120 | "
" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "# The path to the input file (Excel spreadsheet)\n", 130 | "filename = 'GastricCancer_NMR.xlsx'\n", 131 | "#filename = 'MTBLS290db.xlsx'\n", 132 | "\n", 133 | "# Load Peak and Data tables into two variables\n", 134 | "dataTable, peakTable = cb.utils.load_dataXL(filename, DataSheet='Data', PeakSheet='Peak') " 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "
\n", 142 | "

2.1 Display the Data table

\n", 143 | "\n", 144 | "

The dataTable table can be displayed interactively so we can inspect and check the imported values. To do this, we use the display() function.\n", 145 | "

\n", 146 | "
\n", 147 | "\n", 148 | "
\n", 149 | "\n", 150 | "
\n", 151 | "\n", 152 | "
    \n", 153 | "
  • Scroll up/down & left/right using the scroll bars
  • \n", 154 | "
  • Click on any column header to sort by that column (sort alternates between ascending and descending order)
  • \n", 155 | "
  • Click on the left side of a header column for further options \n", 156 | "
      \n", 157 | "
    • for column Class click on 'color by unique'
    • \n", 158 | "
    • for column SampleType click on 'sort ascending' to group all the QC samples together.
    \n", 159 | "
  • \n", 160 | "
  • Click on column header index to sort back into the original order.
  • \n", 161 | "
\n", 162 | "
" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | "scrolled": false 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "display(dataTable) # View and check the dataTable " 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "
\n", 181 | "\n", 182 | "

2.2. Display the Peak sheet

\n", 183 | "\n", 184 | "

The peakTable table can be displayed interactively in the same way.\n", 185 | "

\n", 186 | "
\n", 187 | "\n", 188 | "
\n", 189 | "\n", 190 | "
\n", 191 | "\n", 192 | "
    \n", 193 | "
  • Click on the column header QC_RSD to sort the peaks by ascending value
  • \n", 194 | "
  • Click on the left edge of the column header QC_RSD and select 'heatmap'
  • \n", 195 | "
  • Scroll up/down to see how the \"quality\" of the peaks increase/decrease
  • \n", 196 | "
\n", 197 | "
" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "display(peakTable) # View and check PeakTable" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "
\n", 214 | "\n", 215 | "

3. Data Cleaning

\n", 216 | "
\n", 217 | "
\n", 218 | "\n", 219 | "
\n", 220 | " \n", 221 | "
    \n", 222 | "
  • Replace the code: peakTableClean = peakTable[(rsd < 20) & (percMiss < 10)] with: peakTableClean = peakTable[(rsd < 10) & (percMiss < 5)]. In doing this you will see the effect of making the data cleaning criteria more stringent. This will change the number of 'clean' metabolites.
  • \n", 223 | "
\n", 224 | "
\n", 225 | "
\n", 226 | "\n", 227 | "
\n", 228 | "
    \n", 229 | "
  • Note: Changing the number of clean metabolites will significantly change the outputs from all subsequent code cells.
    So be sure to click on Cell → Run All Below then scroll down the notebook to see how changing this setting has changed all the cell outputs.
  • \n", 230 | "
\n", 231 | "
" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "# Create a clean peak table \n", 241 | "\n", 242 | "rsd = peakTable['QC_RSD'] \n", 243 | "percMiss = peakTable['Perc_missing'] \n", 244 | "peakTableClean = peakTable[(rsd < 20) & (percMiss < 10)] \n", 245 | "\n", 246 | "print(\"Number of peaks remaining: {}\".format(len(peakTableClean)))" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "
\n", 254 | "\n", 255 | "

4. PCA - Quality Assessment

\n", 256 | "\n", 257 | "

To provide a multivariate assessment of the quality of the cleaned data set it is good practice to perform a simple Principal Component Analysis (PCA) after suitable transformation & scaling. The PCA score plot is typically labelled by sample type (i.e. quality control (QC) or biological sample (Sample)). Data of high quality will have QCs that cluster tightly compared to the biological samples (Broadhurst et al. 2018).

\n", 258 | "\n", 259 | "
\n", 260 | "\n", 261 | "
\n", 262 | "\n", 263 | "
\n", 264 | "\n", 265 | "
    \n", 266 | "
  • Hover over points in the PCA Score Plot to reveal corresponding sample information ('Idx' and 'SampleType').
  • \n", 267 | "\n", 268 | "
  • Hover over points in the PCA Loading Plot to reveal corresponding metabolite information ('Name','Label', and 'QC_RSD').
  • \n", 269 | "\n", 270 | "
  • In the menu at the top right corner of the figure click on the 'disk' icon to save the images.
  • \n", 271 | "\n", 272 | "
  • In the menu at the top right corner of the figure click on the 'magnifying glass' icon to select a zoom area.
  • \n", 273 | "
\n", 274 | "\n", 275 | "
\n", 276 | "
\n", 277 | "\n", 278 | "
\n", 279 | "\n", 280 | "\n", 281 | "
    \n", 282 | "
  • Replace the code: Xscale = cb.utils.scale(Xlog, method='auto') with: Xscale = cb.utils.scale(Xlog, method='pareto'). This will change the type of X column scaling.
  • \n", 283 | "\n", 284 | "
  • In the PCA function call cb.plot.pca replace the code: pcy=2 with: pcy=3 to change the plot from (PC1 vs. PC2) to (PC1 vs. PC3)
  • \n", 285 | "\n", 286 | "
  • Replace the code: group_label=dataTable['SampleType'] with: group_label=dataTable['Class']. The PCA scores plot will now be grouped by the data in column Class of the dataTable.
  • \n", 287 | "
\n", 288 | "
\n", 289 | "\n", 290 | "
\n", 291 | "\n", 292 | "
\n", 293 | "\n", 294 | "\n", 295 | "
    \n", 296 | "
  • There are five types of scaling supported by the function cb.utils.scale: 'auto', 'range', 'pareto', 'vast', and 'level'. In the context of metabolomics these are comprehensively reviewed by van den Berg et al. 2006.
  • \n", 297 | "
\n", 298 | "
\n", 299 | "
" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": {}, 306 | "outputs": [], 307 | "source": [ 308 | "# Extract and scale the metabolite data from the dataTable \n", 309 | "\n", 310 | "peaklist = peakTableClean['Name'] # Set peaklist to the metabolite names in the peakTableClean\n", 311 | "X = dataTable[peaklist].values # Extract X matrix from dataTable using peaklist\n", 312 | "Xlog = np.log10(X) # Log scale (base-10)\n", 313 | "Xscale = cb.utils.scale(Xlog, method='auto') # methods include auto, range, pareto, vast, and level\n", 314 | "Xknn = cb.utils.knnimpute(Xscale, k=3) # missing value imputation (knn - 3 nearest neighbors)\n", 315 | "\n", 316 | "print(\"Xknn: {} rows & {} columns\".format(*Xknn.shape))\n", 317 | "\n", 318 | "cb.plot.pca(Xknn,\n", 319 | " pcx=1, # pc for x-axis\n", 320 | " pcy=2, # pc for y-axis\n", 321 | " group_label=dataTable['SampleType']) # labels for Hover in PCA loadings plot" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "
\n", 329 | "\n", 330 | "

5. Univariate Statistics for comparison of Gastric Cancer (GC) vs Healthy Controls (HE)

\n", 331 | "\n", 332 | "

The data set uploaded into dataTable describes the 1H-NMR urine metabolite profiles of individuals classified into three distinct groups: GC (gastric cancer), BN (benign), and HE (healthy). For this specific workflow we are interested in comparing only the differences in profiles between individuals classified as GC and HE.\n", 333 | "

\n", 334 | "
\n", 335 | "\n", 336 | "
\n", 337 | "\n", 338 | "
\n", 339 | "\n", 340 | "
    \n", 341 | "
  • Scroll up/down using the scroll bars.
  • \n", 342 | "
  • Click on the column header to sort by that column (sort alternates between ascending and descending order).
  • \n", 343 | "
  • Click on the left side of a header column for further options, e.g.:\n", 344 | "
      \n", 345 | "
    • For column TtestStat click on Data Bars.
    • \n", 346 | "
    • For column ShapiroPvalue click on Format -> exponential 5 (converts to scientific notation).
    • \n", 347 | "
  • \n", 348 | "
\n", 349 | "\n", 350 | "\n", 351 | "
\n", 352 | "\n", 353 | "
\n", 354 | "\n", 355 | "\n", 356 | "
    \n", 357 | "
  • For data set GastricCancer_NMR.xlsx replace the code: dataTable[(dataTable.Class == \"GC\") | (dataTable.Class == \"HE\")] with:
    dataTable[(dataTable.Class == \"BN\") | (dataTable.Class == \"HE\")] and replace pos_outcome = \"GC\" with: pos_outcome = \"BN\". This will allow you to perform a 2-class statistical comparison between the patients with benign tumors and healthy controls.
  • \n", 358 | "\n", 359 | "
  • OR for data set MTBLS290db.xlsx replace the code: dataTable[(dataTable.Class == \"GC\") | (dataTable.Class == \"HE\")] with: dataTable[(dataTable.Class == \"Patient\") | (dataTable.Class == \"Control\")] and replace pos_outcome = \"GC\" with: pos_outcome = \"Patient\". You will now perform a 2-class statistical comparison between the unhealthy patients and healthy controls.
  • \n", 360 | "\n", 361 | "
  • In the statistical function call cb.utils.univariate_2class replace the code: parametric=True with: parametric=False to change the statistical test to a non-parametric Wilcoxon rank-sum test.
  • \n", 362 | "
\n", 363 | "
\n", 364 | "\n", 365 | "
\n", 366 | "\n", 367 | "
\n", 368 | "\n", 369 | "\n", 370 | "
    \n", 371 | "
  • Note: Changing the outcome comparison will significantly affect the output of subsequent code cells.
    So be sure to click on Cell -> Run All Below, then scroll down the notebook to see how changing this setting has changed all the cell outputs.
  • \n", 372 | "
\n", 373 | "
" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": null, 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "# Select subset of Data for statistical comparison\n", 383 | "dataTable2 = dataTable[(dataTable.Class == \"GC\") | (dataTable.Class == \"HE\")] # Reduce data table only to GC and HE class members\n", 384 | "pos_outcome = \"GC\" \n", 385 | "\n", 386 | "# Calculate basic statistics and create a statistics table.\n", 387 | "statsTable = cb.utils.univariate_2class(dataTable2,\n", 388 | " peakTableClean,\n", 389 | " group='Class', # Column used to determine the groups\n", 390 | " posclass=pos_outcome, # Value of posclass in the group column\n", 391 | " parametric=True) # Set parametric = True or False\n", 392 | "\n", 393 | "# View and check StatsTable\n", 394 | "display(statsTable)" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "\n", 402 | "
\n", 403 | "
\n", 404 | "\n", 405 | "
\n", 406 | "\n", 407 | "
\n", 408 | "\n", 409 | "
    \n", 410 | "
  • Replace the filename "stats.xlsx" with "my_stats.xlsx"
  • \n", 411 | "
  • AND/OR replace sheet_name='StatsTable' with sheet_name='myStatsTable'
  • \n", 412 | "
\n", 413 | "
" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "# Save StatsTable to Excel\n", 423 | "statsTable.to_excel(\"stats.xlsx\", sheet_name='StatsTable', index=False)\n", 424 | "print(\"done!\")" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "
\n", 432 | "

\n", 433 | " \n", 434 | "

6. Machine Learning

\n", 435 | "\n", 436 | "

The remainder of this tutorial will describe the use of a 2-class Partial Least Squares-Discriminant Analysis (PLS-DA) model to identify metabolites which, when combined in a linear equation, are able to classify unknown samples as either GC or HE with a measurable degree of certainty.

\n", 437 | "\n", 438 | "\n", 439 | "

6.1 Splitting data into Training and Test sets.

\n", 440 | "
\n", 441 | "\n", 442 | "
\n", 443 | "\n", 444 | "
\n", 445 | "\n", 446 | "\n", 447 | "
    \n", 448 | "
  • If you have changed the comparison groups in the default data to benign tumors (BN) vs. healthy controls (HE) then replace the code: outcome == 'GC' with: outcome == 'BN'.
  • \n", 449 | "\n", 450 | "
  • For data set MTBLS290db.xlsx replace the code: outcome == 'GC' with: outcome == 'Patient'.
  • \n", 451 | "\n", 452 | "
  • Replace the code: train_test_split(dataTable2, Y, test_size=0.25, stratify=Y) with: train_test_split(dataTable2, Y, test_size=0.1, stratify=Y). This will decrease the number of samples in the test set. How does this affect the results?
  • \n", 453 | "
\n", 454 | "
\n", 455 | "\n", 456 | "
\n", 457 | "\n", 458 | "
\n", 459 | "\n", 460 | "\n", 461 | "
    \n", 462 | "
  • Note: If you change any of the code in the following Machine Learning sections you will change the performance of all the subsequent code cells.
    So be sure to click on Cell -> Run All Below, then scroll down the notebook to see how changing this setting has changed all the cell outputs.
  • \n", 463 | "
\n", 464 | "
" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "metadata": {}, 471 | "outputs": [], 472 | "source": [ 473 | "# Create a Binary Y vector for stratifiying the samples\n", 474 | "outcomes = dataTable2['Class'] # Column that corresponds to Y class (should be 2 groups)\n", 475 | "Y = [1 if outcome == 'GC' else 0 for outcome in outcomes] # Change Y into binary (GC = 1, HE = 0) \n", 476 | "Y = np.array(Y) # convert boolean list into to a numpy array\n", 477 | "\n", 478 | "# Split DataTable2 and Y into train and test (with stratification)\n", 479 | "dataTrain, dataTest, Ytrain, Ytest = train_test_split(dataTable2, Y, test_size=0.25, stratify=Y, random_state=10)\n", 480 | "\n", 481 | "print(\"DataTrain = {} samples with {} postive cases.\".format(len(Ytrain),sum(Ytrain)))\n", 482 | "print(\"DataTest = {} samples with {} postive cases.\".format(len(Ytest),sum(Ytest)))" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "
\n", 490 | " \n", 491 | "

6.2. Determine optimal number of components for PLS-DA model

\n", 492 | "\n", 493 | "

In this section, we will perform 5-fold cross-validation using the training set we created above (dataTrain) to determine the optimal number of components to use in our PLS-DA model. First, we extract and scale the training data in dataTrain the same way as we did for PCA quality assessment in section 4 (log-transformation, scaling, and k-nearest-neighbour imputation of missing values).

\n", 494 | "
\n", 495 | "\n", 496 | "
\n", 497 | "\n", 498 | "
\n", 499 | "\n", 500 | "\n", 501 | "
    \n", 502 | "
  • Replace the code: cb.utils.scale(XTlog, method='auto') with: cb.utils.scale(XTlog, method='pareto'). This will change the type of X column scaling.
  • \n", 503 | "\n", 504 | "
  • Replace the code: cb.utils.scale(XTlog, method='auto') with: cb.utils.scale(XT, method='auto'). This change will ignore the log-transformed data (XTlog) and scale the raw XT data instead (thus skipping the log transformation step of the data preprocessing).
  • \n", 505 | "
\n", 506 | "
" 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "execution_count": null, 512 | "metadata": {}, 513 | "outputs": [], 514 | "source": [ 515 | "# Extract and scale the metabolite data from the dataTable\n", 516 | "peaklist = peakTableClean['Name'] # Set peaklist to the metabolite names in the peakTableClean\n", 517 | "XT = dataTrain[peaklist] # Extract X matrix from DataTrain using peaklist\n", 518 | "XTlog = np.log(XT) # Log scale (base-10)\n", 519 | "XTscale = cb.utils.scale(XTlog, method='auto') # methods include auto, pareto, vast, and level\n", 520 | "XTknn = cb.utils.knnimpute(XTscale, k=3) # missing value imputation (knn - 3 nearest neighbors)" 521 | ] 522 | }, 523 | { 524 | "cell_type": "markdown", 525 | "metadata": {}, 526 | "source": [ 527 | "
\n", 528 | "\n", 529 | "

We use the cb.cross_val.kfold() helper function to carry out 5-fold cross-validation of a set of PLS-DA models configured with different numbers of latent variables.

\n", 530 | "
\n", 531 | "\n", 532 | "
\n", 533 | "\n", 534 | "
\n", 535 | "\n", 536 | "\n", 537 | "
    \n", 538 | "
  • Replace the code: param_dict={'n_components': [1,2,3,4,5,6]} with: param_dict={'n_components': [1,2,3,4,5,6,7,8,9,10]}. This will increase the range of latent variables used to build PLS-DA models from a PLS-DA model with 1 latent variable to a PLS-DA model with 10 latent variables.
  • \n", 539 | "\n", 540 | "
  • Replace the code: folds=5 with: folds=10. This will change the number of folds in the k-fold cross validation.
  • \n", 541 | "\n", 542 | "
  • Replace the code: bootnum=100 with: bootnum=500. This will change the number of bootstrap samples used to calculate the 95% confidence interval for the $R^2$ and $Q^2$ curves. This will drastically slow down the code execution.
  • \n", 543 | "
\n", 544 | "
\n", 545 | "\n", 546 | "
\n", 547 | "\n", 548 | "\n", 549 | "
\n", 550 | "\n", 551 | "
    \n", 552 | "
  • For more information on the PLS SIMPLS algorithm refer to: De Jong, S., 1993. SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18: 251–263
  • \n", 553 | "
  • Although it is common practice to assume the optimal number of components for the PLS-DA model is reached when $Q^2$ is at its apex (A), this is incorrect. Overtraining starts as soon as $Q^2$ significantly deviates from the $R^2$ trajectory. If the distance between $R^2$ and $Q^2$ gets large (>0.2, or the 95% CIs stop overlapping) then one has to assume that the model is already overtrained. The point at which the $Q^2$ value begins to diverge from the $R^2$ value is considered the point at which the optimal number of components has been reached without overfitting (B). The $R^2$ vs. $(R^2 - Q^2)$ plot is provided to aid decision making.
  • \n", 554 | "
\n", 555 | "
\n", 556 | "\n", 557 | "
\n", 558 | "\n", 559 | "
\n", 560 | "\n", 561 | "
    \n", 562 | "
  • Hover over the green data points in each of the plots to view the corresponding $R^2$ and $Q^2$ values.
  • \n", 563 | "\n", 564 | "
  • Click on a point in one of the green plots. Notice that the two plots are linked.
  • \n", 565 | "\n", 566 | "
  • Use the menu bar at the top right of the figure to save, scroll and zoom.
  • \n", 567 | "
\n", 568 | "
\n" 569 | ] 570 | }, 571 | { 572 | "cell_type": "code", 573 | "execution_count": null, 574 | "metadata": {}, 575 | "outputs": [], 576 | "source": [ 577 | "# initalise cross_val kfold (stratified) \n", 578 | "cv = cb.cross_val.kfold(model=cb.model.PLS_SIMPLS, # model; we are using the PLS_SIMPLS model\n", 579 | " X=XTknn, \n", 580 | " Y=Ytrain, \n", 581 | " param_dict={'n_components': [1,2,3,4,5,6]}, # The numbers of latent variables to search \n", 582 | " folds=5, # folds; for the number of splits (k-fold)\n", 583 | " bootnum=100) # num bootstraps for the Confidence Intervals\n", 584 | "\n", 585 | "\n", 586 | "cv.run() # run the cross validation\n", 587 | "cv.plot() # plot cross validation statistics" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "\n", 595 | "
\n", 596 | " \n", 597 | "

6.3 Train and evaluate PLS-DA model

\n", 598 | "\n", 599 | "

Now that we have determined the optimal number of components for this data set using k-fold cross-validation, we create a new PLS-DA model with the requisite number of latent variables, train the model using XTknn and Ytrain, then evaluate its predictive ability.

\n", 600 | "
\n", 601 | "\n", 602 | "
\n", 603 | "\n", 604 | "
\n", 605 | "\n", 606 | "\n", 607 | "
    \n", 608 | "
  • Replace the code: n_components=2 with: n_components=3. This will increase the number of latent variables used in the PLS-DA model. Notice how this changes the apparent predictive ability of the model.
  • \n", 609 | "\n", 610 | "
  • Replace the code: cutoffscore=0.5 with: cutoffscore=0.4. This will change the decision boundary for the classifier and alter the resulting performance statistics.
  • \n", 611 | "
\n", 612 | "
\n", 613 | "\n", 614 | "
\n", 615 | "\n", 616 | "
\n", 617 | "\n", 618 | "\n", 619 | "
    \n", 620 | "
  • Hover over the green data points in each of the plots to view extra information.
  • \n", 621 | "\n", 622 | "
  • Use the menu bar at the right of the figures to save, scroll and zoom.
  • \n", 623 | "
\n", 624 | "\n", 625 | "
" 626 | ] 627 | }, 628 | { 629 | "cell_type": "code", 630 | "execution_count": null, 631 | "metadata": {}, 632 | "outputs": [], 633 | "source": [ 634 | "modelPLS = cb.model.PLS_SIMPLS(n_components=2) # Initalise the model with n_components = 2\n", 635 | "\n", 636 | "Ypred = modelPLS.train(XTknn, Ytrain) # Train the model \n", 637 | "\n", 638 | "modelPLS.evaluate(cutoffscore=0.5) # Evaluate the model\n", 639 | "\n", 640 | "modelPLS.permutation_test(nperm=100) #nperm denotes to the number of permutations" 641 | ] 642 | }, 643 | { 644 | "cell_type": "markdown", 645 | "metadata": {}, 646 | "source": [ 647 | "\n", 648 | "
\n", 649 | " \n", 650 | "

6.4. Plot latent variable projections for PLS-DA model

\n", 651 | "\n", 652 | "

The PLS model also provides a .plot_projections() method, so we can visually inspect characteristics of the fitted latent variables. This returns a grid of plots:

\n", 653 | "
\n", 654 | "\n", 655 | "
\n", 656 | "\n", 657 | "
\n", 658 | "\n", 659 | "\n", 660 | "
    \n", 661 | "
  • These plots are useful to visualise to what degree each model component (latent variable) contributes to the model's discriminative ability. In the gastric cancer example each individual component does not perform well in isolation; it is only when they are combined that a good predictive ability is revealed. In the bottom-left figure the projection scores plot includes a solid diagonal line describing the direction of prediction and a dashed line describing the orthogonal variance. In the method orthogonal partial least squares (O-PLS) this rotation is performed automatically to aid interpretation. However, these changes only improve the interpretability, not the predictivity, of PLS models (see Fiehnlab for further discussion).\n", 662 | "
\n", 663 | "
\n", 664 | "
" 665 | ] 666 | }, 667 | { 668 | "cell_type": "code", 669 | "execution_count": null, 670 | "metadata": {}, 671 | "outputs": [], 672 | "source": [ 673 | "modelPLS.plot_projections(label=\n", 674 | " dataTrain[['Idx','SampleID']], size=12) # size changes circle size" 675 | ] 676 | }, 677 | { 678 | "cell_type": "markdown", 679 | "metadata": {}, 680 | "source": [ 681 | "\n", 682 | "
\n", 683 | " \n", 684 | "

6.5. Plot feature importance (Coefficient plot and VIP) for PLS-DA model

\n", 685 | "\n", 686 | "

Now that we have built a model and established that it represents meaningful features of the dataset, we determine the importance of specific peaks to the model's discriminatory power. To do this, in the cell below we use the PLS model's plot_featureimportance() method to render scatterplots of the PLS regression coefficient values for each metabolite, and Variable Importance in Projection (VIP) plots. The coefficient values provide information about the contribution of the peak to either a negative or positive classification for the sample, and peaks with VIP greater than unity (1) are considered to be \"important\" in the model.

\n", 687 | "
\n", 688 | "\n", 689 | "
\n", 690 | "\n", 691 | "
\n", 692 | "\n", 693 | "
    \n", 694 | "
  • In statistics, the bootstrap procedure involves drawing random samples with replacement from a data set and calculating some statistic on those samples. The range of sample estimates you obtain enables you to establish the uncertainty of the quantity you are estimating. Sampling with replacement means that each observation in a sample is selected (and recorded) at random from the original dataset and then replaced, so it is possible for an observation to be selected multiple times. If the original data set contains N observations then each bootstrap sample also contains N randomly selected observations. On average, approximately 2/3 of the original observations appear in each bootstrap sample (the remaining slots being filled by repeats). Here we use bootstrap resampling to calculate confidence intervals for the coefficients in the PLS-DA model using the 'bootstrapping of observations' method.\n", 695 | "
\n", 696 | "
\n", 697 | "\n", 698 | "
\n", 699 | "\n", 700 | "

\n", 701 | "\n", 702 | "\n", 707 | "\n", 708 | "
" 709 | ] 710 | }, 711 | { 712 | "cell_type": "code", 713 | "execution_count": null, 714 | "metadata": {}, 715 | "outputs": [], 716 | "source": [ 717 | "# Calculate the bootstrapped confidence intervals \n", 718 | "modelPLS.calc_bootci(type='bca', bootnum=200) # decrease bootnum if it this takes too long on your machine\n", 719 | "\n", 720 | "# Plot the feature importance plots, and return a new Peaksheet \n", 721 | "peakSheet = modelPLS.plot_featureimportance(peakTableClean,\n", 722 | " peaklist,\n", 723 | " ylabel='Label', # change ylabel to 'Name' \n", 724 | " sort=False) # change sort to False" 725 | ] 726 | }, 727 | { 728 | "cell_type": "markdown", 729 | "metadata": {}, 730 | "source": [ 731 | "
\n", 732 | " \n", 733 | "

6.6. Test model with new data (using test set from section 6.1)

\n", 734 | "\n", 735 | "

So far, we have trained and tested our PLS classifier on a single training dataset. This risks overfitting: we may have optimised the model's performance on this dataset to the point that it cannot generalise, i.e. it may not perform as well on data it has not already seen. To see if the model can generalise, we must test our trained model on a new dataset that it has not already encountered. In section 6.1 we divided our original complete dataset into four components: dataTrain, Ytrain, dataTest and Ytest. Our trained model has not seen the dataTest and Ytest values that we held out, so these can be used to evaluate model performance on new data.

\n", 736 | "
\n", 737 | "\n", 738 | "
\n", 739 | "\n", 740 | "
\n", 741 | "\n", 742 | "\n", 743 | "
    \n", 744 | "
  • Note: It is important that the test data is transformed and scaled using the same parameters as the training data. If the training data is log transformed then the test data must also be log transformed, otherwise the test predictions will be inappropriate, and likely highly imprecise. Equally, the scaling must be performed using the scaling factors derived from the training data (e.g. mean-centred to the training data mean, and normalised to the training data standard deviation).
  • \n", 745 | "
\n", 746 | "
\n", 747 | "
" 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": null, 753 | "metadata": {}, 754 | "outputs": [], 755 | "source": [ 756 | "# Get mu and sigma from the training dataset to use for the Xtest scaling\n", 757 | "mu, sigma = cb.utils.scale(XTlog, return_mu_sigma=True) \n", 758 | "\n", 759 | "# Pull of Xtest from DataTest using peaklist ('Name' column in PeakTable)\n", 760 | "peaklist = peakTableClean.Name \n", 761 | "XV = dataTest[peaklist].values\n", 762 | "\n", 763 | "# Log transform, unit-scale and knn-impute missing values for Xtest\n", 764 | "XVlog = np.log(XV)\n", 765 | "XVscale = cb.utils.scale(XVlog, method='auto', mu=mu, sigma=sigma) \n", 766 | "XVknn = cb.utils.knnimpute(XVscale, k=3)" 767 | ] 768 | }, 769 | { 770 | "cell_type": "markdown", 771 | "metadata": {}, 772 | "source": [ 773 | "
\n", 774 | "\n", 775 | "

Now we predict a new set of response variables from XVknn as input, using our trained model and its .test() method, and then evaluate the performance of the prediction against the known values in Ytest using the .evaluate() method (as in section 6.3).

\n", 776 | "
\n", 777 | "\n", 778 | "
\n", 779 | "\n", 780 | "
\n", 781 | "\n", 782 | "\n", 783 | "
    \n", 784 | "
  • Note: Although the calculated bootstrap confidence intervals for prediction will give an estimate of the uncertainty of prediction, the only way to definitively evaluate any model is with an independent test set, as shown in this plot.
  • \n", 785 | "
\n", 786 | "
\n", 787 | "
\n" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": null, 793 | "metadata": { 794 | "scrolled": false 795 | }, 796 | "outputs": [], 797 | "source": [ 798 | "# Calculate Ypredicted score using modelPLS.test\n", 799 | "YVpred = modelPLS.test(XVknn)\n", 800 | "\n", 801 | "# Evaluate Ypred against Ytest\n", 802 | "evals = [Ytest, YVpred] # alternative formats: (Ytest, Ypred) or np.array([Ytest, Ypred])\n", 803 | "#modelPLS.evaluate(evals, specificity=0.9)\n", 804 | "modelPLS.evaluate(evals, cutoffscore=0.5) " 805 | ] 806 | }, 807 | { 808 | "cell_type": "markdown", 809 | "metadata": {}, 810 | "source": [ 811 | "\n", 812 | "
\n", 813 | " \n", 814 | "

6.7. Export results to Excel

\n", 815 | "\n", 816 | "

Finally, we copy our model predictions into a table and save them in a persistent Excel spreadsheet.

\n", 817 | "
\n", 818 | "\n", 819 | "
\n", 820 | "\n", 821 | "
\n", 822 | "\n", 823 | "\n", 824 | "
    \n", 825 | "
  • Replace the filename "modelPLS.xlsx" with "myModelPLS.xlsx"
  • \n", 826 | "\n", 827 | "
  • AND/OR change sheet_name='Datasheet' / sheet_name='PeakSheet' as appropriate
  • \n", 828 | "
\n", 829 | "
\n" 830 | ] 831 | }, 832 | { 833 | "cell_type": "code", 834 | "execution_count": null, 835 | "metadata": {}, 836 | "outputs": [], 837 | "source": [ 838 | "# Save DataSheet as 'Idx', 'SampleID', and 'Class' from DataTest\n", 839 | "dataSheet = dataTest[[\"Idx\", \"SampleID\", \"Class\"]].copy() \n", 840 | "\n", 841 | "# Add 'Ypred' to Datasheet\n", 842 | "dataSheet['Ypred'] = YVpred \n", 843 | " \n", 844 | "# Create an empty excel workbook\n", 845 | "writer = pd.ExcelWriter(\"modelPLS.xlsx\") # provide the filename for the Excel file\n", 846 | "\n", 847 | "# Add each dataframe to the workbook in turn, as a separate worksheet\n", 848 | "dataSheet.to_excel(writer, sheet_name='Datasheet', index=False)\n", 849 | "peakSheet.to_excel(writer, sheet_name='Peaksheet', index=False)\n", 850 | "\n", 851 | "# Write the Excel workbook to disk\n", 852 | "writer.save()\n", 853 | "\n", 854 | "print(\"Done!\")" 855 | ] 856 | }, 857 | { 858 | "cell_type": "markdown", 859 | "metadata": {}, 860 | "source": [ 861 | "
\n", 862 | "\n", 863 | "

Congratulations! You have completed tutorial 2.

\n", 864 | "\n", 865 | "
" 866 | ] 867 | }, 868 | { 869 | "cell_type": "code", 870 | "execution_count": null, 871 | "metadata": {}, 872 | "outputs": [], 873 | "source": [] 874 | } 875 | ], 876 | "metadata": { 877 | "kernelspec": { 878 | "display_name": "Python 3", 879 | "language": "python", 880 | "name": "python3" 881 | }, 882 | "language_info": { 883 | "codemirror_mode": { 884 | "name": "ipython", 885 | "version": 3 886 | }, 887 | "file_extension": ".py", 888 | "mimetype": "text/x-python", 889 | "name": "python", 890 | "nbconvert_exporter": "python", 891 | "pygments_lexer": "ipython3", 892 | "version": "3.6.8" 893 | }, 894 | "toc": { 895 | "base_numbering": 1, 896 | "nav_menu": { 897 | "height": "338px", 898 | "width": "315px" 899 | }, 900 | "number_sections": false, 901 | "sideBar": true, 902 | "skip_h1_title": false, 903 | "title_cell": "Table of Contents", 904 | "title_sidebar": "Contents", 905 | "toc_cell": false, 906 | "toc_position": { 907 | "height": "calc(100% - 180px)", 908 | "left": "10px", 909 | "top": "150px", 910 | "width": "184px" 911 | }, 912 | "toc_section_display": true, 913 | "toc_window_display": false 914 | }, 915 | "toc-autonumbering": false, 916 | "toc-showmarkdowntxt": false, 917 | "widgets": { 918 | "application/vnd.jupyter.widget-state+json": { 919 | "state": {}, 920 | "version_major": 2, 921 | "version_minor": 0 922 | } 923 | } 924 | }, 925 | "nbformat": 4, 926 | "nbformat_minor": 2 927 | } 928 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-minimal 2 | title: MetabWorkflowTutorial 3 | description: "Supplementary information for Mendez et al. 
(2019) DOI: 10.1007/s11306-019-1588-0" 4 | show_downloads: True 5 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: MetabWorkflowTutorial 2 | channels: 3 | - defaults 4 | - conda-forge 5 | - cimcb 6 | dependencies: 7 | - python=3.7.3 8 | - jupyter=1.0.0 9 | - notebook=6.4.7 10 | - bokeh=1.1.0 11 | - numpy=1.16.3 12 | - pandas=0.24.2 13 | - openpyxl=2.6.1 14 | - cimcb_lite=1.0.3 15 | - xlrd=1.2.0 16 | - Jinja2==3.0.3 17 | -------------------------------------------------------------------------------- /images/R2Q2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CIMCB/MetabWorkflowTutorial/dadb9eff6dbe3c21cb5b21f6fb38d6ae1606d970/images/R2Q2.png -------------------------------------------------------------------------------- /images/R2Q2_ab.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CIMCB/MetabWorkflowTutorial/dadb9eff6dbe3c21cb5b21f6fb38d6ae1606d970/images/R2Q2_ab.png -------------------------------------------------------------------------------- /images/bulb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CIMCB/MetabWorkflowTutorial/dadb9eff6dbe3c21cb5b21f6fb38d6ae1606d970/images/bulb.png -------------------------------------------------------------------------------- /images/cog2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CIMCB/MetabWorkflowTutorial/dadb9eff6dbe3c21cb5b21f6fb38d6ae1606d970/images/cog2.png -------------------------------------------------------------------------------- /images/logo_text.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/CIMCB/MetabWorkflowTutorial/dadb9eff6dbe3c21cb5b21f6fb38d6ae1606d970/images/logo_text.png -------------------------------------------------------------------------------- /images/mouse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CIMCB/MetabWorkflowTutorial/dadb9eff6dbe3c21cb5b21f6fb38d6ae1606d970/images/mouse.png --------------------------------------------------------------------------------