├── criugm.jpg ├── favicon.png ├── requirements.txt ├── MAIN_BASC064_subsamp_features.npz ├── _config.yml ├── data_fetch.sh ├── Dockerfile ├── sklearn └── requirements.txt ├── LICENSE ├── README.md ├── course-outline.md ├── dl-course-outline.md ├── MAIN_tutorial_intro_to_nilearn.ipynb ├── plot_variance_linear_regr.ipynb ├── MAIN_tutorial_machine_learning_with_nilearn.ipynb └── model_validation.ipynb /criugm.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brainhack101/introML/HEAD/criugm.jpg -------------------------------------------------------------------------------- /favicon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brainhack101/introML/HEAD/favicon.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | git+https://github.com/KamalakerDadi/nilearn.git@16f2df26401e5b6a16ff134f4da71ff920e3ac40 2 | -------------------------------------------------------------------------------- /MAIN_BASC064_subsamp_features.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brainhack101/introML/HEAD/MAIN_BASC064_subsamp_features.npz -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-minimal 2 | title: IntroML 3 | logo: http://www.crm.umontreal.ca/2018/MAIN2018/img/MAIN2018_Poster_FR_web.jpg 4 | -------------------------------------------------------------------------------- /data_fetch.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | python -c "import nilearn; from nilearn import datasets; datasets.fetch_main(); 
datasets.fetch_atlas_basc_multiscale_2015()" 3 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM jupyter/datascience-notebook 2 | 3 | LABEL maintainer="Pierre Bellec " 4 | 5 | USER jovyan 6 | 7 | # Copying the repository inside the container 8 | COPY . /home/jovyan 9 | 10 | # Installing Kamalaker's main fetcher 11 | RUN pip install -r requirements.txt 12 | 13 | # Downloading the data 14 | RUN ["/bin/bash", "/home/jovyan/data_fetch.sh"] 15 | -------------------------------------------------------------------------------- /sklearn/requirements.txt: -------------------------------------------------------------------------------- 1 | appnope==0.1.0 2 | backcall==0.1.0 3 | bleach==3.0.2 4 | cycler==0.10.0 5 | decorator==4.3.0 6 | defusedxml==0.5.0 7 | entrypoints==0.2.3 8 | ipykernel==5.1.0 9 | ipython==7.2.0 10 | ipython-genutils==0.2.0 11 | ipywidgets==7.4.2 12 | jedi==0.13.1 13 | Jinja2==2.10 14 | jsonschema==2.6.0 15 | jupyter==1.0.0 16 | jupyter-client==5.2.3 17 | jupyter-console==6.0.0 18 | jupyter-core==4.4.0 19 | kiwisolver==1.0.1 20 | MarkupSafe==1.1.0 21 | matplotlib==3.0.2 22 | mistune==0.8.4 23 | nbconvert==5.4.0 24 | nbformat==4.4.0 25 | notebook==5.7.2 26 | numpy==1.15.4 27 | pandocfilters==1.4.2 28 | parso==0.3.1 29 | pexpect==4.6.0 30 | pickleshare==0.7.5 31 | prometheus-client==0.4.2 32 | prompt-toolkit==2.0.7 33 | ptyprocess==0.6.0 34 | Pygments==2.3.0 35 | pyparsing==2.3.0 36 | python-dateutil==2.7.5 37 | pyzmq==17.1.2 38 | qtconsole==4.4.3 39 | scikit-learn==0.20.1 40 | scipy==1.1.0 41 | Send2Trash==1.5.0 42 | six==1.11.0 43 | terminado==0.8.1 44 | testpath==0.4.2 45 | tornado==5.1.1 46 | traitlets==4.3.2 47 | wcwidth==0.1.7 48 | webencodings==0.5.1 49 | widgetsnbextension==3.4.2 50 | -------------------------------------------------------------------------------- /LICENSE: 
-------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 BrainHack 101 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## IntroML Resources @ MAIN 2018 2 | 3 | Welcome to the educational workshops @ MAIN 2018! On this page you'll find resources for the courses entitled, 4 | 5 | "[Machine learning for neuroimaging with Scikit-learn and nilearn](./course-outline.md)," 6 | 7 | and, 8 | 9 | "[Deep Learning for Neuroimaging](./dl-course-outline.md)" 10 | 11 | Register on [EventBrite](https://www.eventbrite.ca/e/deep-learning-in-neuroimaging-machine-learning-scikit-learn-nilearn-tickets-53388406160){: .btn} ! 
12 | 13 | Join the [Brainhack Slack](https://brainhack-slack-invite.herokuapp.com/) and the #main-dl-2018 (Dec 11th) and/or #main-nilearn-2018 (Dec 12th) channels. 14 | 15 | Breakfast (8:30 am) and lunch are included. The training sessions will run from 9 am to 5 pm both days. All training sessions will be at the Groupe Maurice amphitheatre, at the [centre de recherche de l'institut de gériatrie de Montréal](https://goo.gl/maps/ouhdXKKWtko). 4545 Queen Mary Rd, Montreal, QC H3W 1W6, Canada. Metro stations: Snowdon (orange/blue lines), Côte-des-Neiges (blue line). 16 | ![CRIUGM](criugm.jpg) 17 | 18 | ### Usage 19 | 20 | To use the Docker image, clone the repository, `cd` into it, and build the image: 21 | ``` 22 | sudo docker build --tag=introml . 23 | ``` 24 | You can now run the container: 25 | ``` 26 | sudo docker run -p 8888:8888 -it introml jupyter notebook --no-browser --ip=0.0.0.0 27 | ``` 28 | -------------------------------------------------------------------------------- /course-outline.md: -------------------------------------------------------------------------------- 1 | ## Machine learning for neuroimaging ... 2 | ## ... with Scikit-learn and nilearn 3 | 4 | **Team:** Pierre Bellec, Elizabeth DuPre, Greg Kiar, Jacob Vogel 5 | 6 | **Date:** December 12th, 9h-17h. Breakfast/registration at 8h30. 7 | 8 | **Location:** Amphithéâtre “le groupe Maurice”, CRIUGM 9 | 10 | **Summary:** This course will be a hands-on/type-along introduction to machine learning for neuroimaging problems with scikit-learn and nilearn. 11 | 12 | ### Morning (9h-12h30): introduction to machine-learning with scikit-learn 13 | 14 | This part of the course will follow the scikit-learn chapter of the scipy-lectures, found [here](http://www.scipy-lectures.org/packages/scikit-learn/index.html). 
This includes: 15 | - Basic principles 16 | - Supervised learning: classification, the example of handwritten digits 17 | - Supervised learning: regression, the example of housing data 18 | - Measuring prediction performance 19 | - Unsupervised learning: dimension reduction and visualization 20 | - Chaining estimators: the example of eigenfaces 21 | - Parameter selection, validation, and testing 22 | 23 | ### Afternoon (13h30-17h): introduction to nilearn 24 | 25 | This part of the course will provide a general introduction to nilearn, building off of [several example analyses](http://nilearn.github.io/auto_examples/index.html#general-examples). 26 | 27 | ### Prerequisites 28 | 29 | - Basic familiarity with Python would be preferable 30 | - You will need enough space for Anaconda and all the course data (~4GB). 31 | 32 | If you are already savvy with Python and just want a tl;dr summary, here’s all you need to know: 33 | 34 | 1. Join the [Brainhack Slack](https://brainhack-slack-invite.herokuapp.com/) group and join the [main-nilearn-2018](https://brainhack.slack.com/messages/CEQB7U15M/) channel 35 | 2. Download and install python with the full-suite 64-bit [Anaconda](https://www.anaconda.com/download/) distribution 36 | 3. Download the [data](https://osf.io/5hju4/files/) and remember where you store it! 37 | 4. Download or clone the Intro to ML [repository](https://github.com/brainhack101/introML) 38 | 5. Install the necessary packages: `pip install -U nilearn scipy matplotlib scikit-learn jupyter pandas seaborn` 39 | 6. Test everything by opening one of the MAIN-tutorial .ipynb notebooks and running the first few cells 40 | 41 | For detailed instructions, view the full [installation instructions](https://docs.google.com/document/d/1G0QHtkZDklE5EEwbtTSruSijHhAoFIXoeDxk0AyVjM0/edit?usp=sharing). 
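Before opening a notebook for step 6, it can help to confirm that the packages from step 5 are importable. The snippet below is a minimal sketch, not part of the course material: the `missing_packages` helper and the `REQUIRED` mapping are hypothetical names, and the import names (`sklearn` for scikit-learn, `jupyter_core` as a stand-in for the `jupyter` metapackage) are assumptions.

```python
from importlib.util import find_spec

# pip name -> import name. Where the two differ, the import name is an
# assumption (e.g. `jupyter_core` stands in for the `jupyter` metapackage).
REQUIRED = {
    "nilearn": "nilearn",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
    "jupyter": "jupyter_core",
    "pandas": "pandas",
    "seaborn": "seaborn",
}


def missing_packages(required=REQUIRED):
    """Return the sorted pip names whose import name cannot be located."""
    return sorted(
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None  # None means "not installed"
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
        print("Try: pip install -U " + " ".join(missing))
    else:
        print("All course packages were found.")
```

`find_spec` only locates a module without importing it, so the check stays fast and does not fail when a package is absent; it does not verify package versions.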
42 | 43 | -------------------------------------------------------------------------------- /dl-course-outline.md: -------------------------------------------------------------------------------- 1 | ## Deep Learning for Neuroimaging 2 | 3 | **Team:** Andrew Doyle, Joseph Paul Cohen, Thomas Funck, Christopher Beckham 4 | 5 | **Date:** December 11th, 9h-17h. Breakfast/registration at 8h30. 6 | 7 | **Location:** Amphithéâtre “le groupe Maurice”, CRIUGM 8 | 9 | **Summary:** Deep learning is one of the most promising avenues towards achieving artificial general intelligence, and a strong new tool for the analysis of neuroimaging data. This course will offer an introduction to the theory behind how representations are automatically learned from data, and introduce students to the Keras library for formulating and solving a variety of deep learning problems through hands-on examples. 10 | 11 | **Learning Objectives:** 12 | * Understand how **representations are learned** in deep neural networks 13 | * Implement a **convolutional neural network** in Keras on neuroimaging data 14 | * Learn how **embeddings** can be learned 15 | 16 | 17 | ### Schedule: 18 | 19 | #### Morning (9h-12h30): Introduction & Segmentation with Deep Learning 20 | 21 | 9:00 am – 10:00 am: 22 | [Introduction to Deep Learning for Neuroimaging](https://www.dropbox.com/s/bju7auqwjslhhwy/IntroDL%20MAIN.pdf?dl=0) (Andrew Doyle) 23 | 24 | 10:00 am – 11:00 am: 25 | Deep Learning in Keras – [Hands-on Defacing Detector](https://colab.research.google.com/drive/1EgdnWZeNqmzqEmnSR9PUnYXlTjeu1wAU) (Andrew Doyle) 26 | 27 | 11:00 am – 11:15 am: 28 | Break 29 | 30 | 11:15 am – 12:30 pm: 31 | Deep Learning for Segmentation - with [hands-on U-Net](https://colab.research.google.com/github/tfunck/minc_keras/blob/master/main2018.ipynb) (Thomas Funck) 32 | 33 | 12:30 pm - 1:30 pm: 34 | Lunch 35 | 36 | #### Afternoon (13h30-17h00): Getting Deeper 37 | 1:30 pm – 2:45 pm: 38 | [Looking Inside the Black 
Box](https://www.dropbox.com/s/dgmzittmthc41um/Interpretability.pdf?dl=0) - with Interpretability Hands-on (Andrew Doyle) 39 | 40 | 2:45 pm – 4:00 pm: 41 | [Clinical data successes using machine learning](https://docs.google.com/presentation/d/155oZORo29kpr1MNTwYbO2qEoYOIzHeBsDZCbI0NBmx8/edit) - with [Word2vec hands-on](https://colab.research.google.com/drive/1g4zvEg921sLQK-VsBk5mMb2-h4goCGyd) 42 | (Joseph Paul Cohen) 43 | 44 | 4:00 pm – 5:00 pm: 45 | [Generative Adversarial Networks](https://www.dropbox.com/s/vy5vdmubowv9g2h/gans.pdf?dl=0) - with [hands-on GAN](https://colab.research.google.com/drive/1KN0E_sORG-Bi7evOVtl6jONphI05ZiVL) (Christopher Beckham) 46 | 47 | ### Requirements 48 | * Basic familiarity with programming in Python is an asset, but not a requirement. 49 | * Examples will be presented in Google Colaboratory, and participants should create a Google account to run and write code along with the instructors: https://colab.research.google.com. 50 | * For students who wish to continue their analyses after the course, Python 3.x should be installed (ideally through [Anaconda](https://www.anaconda.com/)) with the [Keras](https://keras.io/) package. 
51 | -------------------------------------------------------------------------------- /MAIN_tutorial_intro_to_nilearn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%matplotlib inline\n", 10 | "import nilearn" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "# Let's keep our notebook clean, so it's a little more readable!\n", 20 | "import warnings\n", 21 | "warnings.filterwarnings('ignore')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## [Understanding neuroimaging data](http://nilearn.github.io/manipulating_images/input_output.html)\n", 29 | "\n", 30 | "### Text files: phenotype or behavior\n", 31 | "\n", 32 | "Phenotypic or behavioral data are often provided as a text or CSV (Comma Separated Values) file. They can be loaded with the [pandas package](https://pandas.pydata.org/) but you may have to specify some options. 
Here, we'll specify the `sep` field, since our data is tab-delimited rather than comma-delimited.\n", 33 | "\n", 34 | "For our dataset, let's load participant level information:" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "import os\n", 44 | "from nilearn import datasets\n", 45 | "import pandas as pd\n", 46 | "\n", 47 | "data_dir = '/home/jovyan/nilearn_data/main/main/'\n", 48 | "participants = 'participants.tsv'\n", 49 | "phenotypic_data = pd.read_csv(os.path.join(data_dir, participants), sep='\\t')\n", 50 | "phenotypic_data.head()" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "### Nifti data\n", 58 | "\n", 59 | "For volumetric data, nilearn works with data stored in the [Nifti structure](http://nipy.org/nibabel/nifti_images.html) (via the [nibabel package](http://nipy.org/nibabel/)).\n", 60 | "\n", 61 | "The NifTi data structure (also used in Analyze files) is the standard way of sharing data in neuroimaging research. 
Three main components are:\n", 62 | "\n", 63 | " * data:\traw scans in the form of a numpy array: \n", 64 | " `data = img.get_data()`\n", 65 | " * affine:\tthe transformation matrix that maps from voxel indices of the `numpy` array to actual real-world locations of the brain: \n", 66 | " `affine = img.affine`\n", 67 | " * header:\tlow-level information about the data (slice duration, etc.): \n", 68 | " `header = img.header`\n", 69 | "\n", 70 | "It is important to appreciate that the representation of MRI data we'll be using is a big 4D matrix (3D MRI + 1D for time), stored in a single Nifti file.\n", 71 | "\n", 72 | "### Niimg-like objects\n", 73 | "\n", 74 | "Nilearn functions take as input what we call \"Niimg-like objects\":\n", 75 | "\n", 76 | "Niimg: A Niimg-like object can be one of the following:\n", 77 | "\n", 78 | " * A string with a file path to a Nifti image\n", 79 | " * A SpatialImage from `nibabel`, i.e., an object exposing the get_data() method and affine attribute, typically a Nifti1Image from `nibabel`.\n", 80 | "\n", 81 | "Niimg-4D: Similarly, some functions require 4D Nifti-like data, which we call Niimgs or Niimg-4D. Accepted input arguments are:\n", 82 | "\n", 83 | " * A path to a 4D Nifti image\n", 84 | " * List of paths to 3D Nifti images\n", 85 | " * 4D Nifti-like object\n", 86 | " * List of 3D Nifti-like objects\n", 87 | "\n", 88 | "**Note:** If you provide a sequence of Nifti images, all of them must have the same affine !" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "## [Manipulating and looking at data](http://nilearn.github.io/auto_examples/plot_nilearn_101.html#sphx-glr-auto-examples-plot-nilearn-101-py)\n", 96 | "\n", 97 | "There is a whole section of the [Nilearn documentation](http://nilearn.github.io/plotting/index.html#plotting) on making pretty plots for neuroimaging data ! But let's start with a simple one."
98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "# Let's use a Nifti file that is shipped with nilearn\n", 107 | "from nilearn import datasets\n", 108 | "\n", 109 | "# Note that the variable MNI152_FILE_PATH is just a path to a Nifti file\n", 110 | "print('Path to MNI152 template: {}'.format(datasets.MNI152_FILE_PATH))" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "In the above, MNI152_FILE_PATH is nothing more than a string with a path pointing to a nifti image. You can replace it with a string pointing to a file on your disk. Note that it should be a 3D volume, and not a 4D volume." 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "from nilearn import plotting\n", 127 | "plotting.plot_img(datasets.MNI152_FILE_PATH)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "We can also directly manipulate these images using Nilearn ! As an example, let's try smoothing this image." 
135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "from nilearn import image\n", 144 | "smooth_anat_img = image.smooth_img(datasets.MNI152_FILE_PATH, fwhm=6)\n", 145 | "\n", 146 | "# While we are giving a file name as input, the function returns\n", 147 | "# an in-memory object:\n", 148 | "print(smooth_anat_img)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "plotting.plot_img(smooth_anat_img)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "We can then save this in-memory image to disk as follows:" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "smooth_anat_img.to_filename('smooth_anat_img.nii.gz')\n", 174 | "os.getcwd() # We'll check our \"current working directory\" (cwd) to see where the file was saved" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "## [Visualizing neuroimaging volumes](https://nilearn.github.io/auto_examples/01_plotting/plot_visualization.html#visualization)\n", 182 | "\n", 183 | "What if we want to view not a structural MRI image, but a functional one ?\n", 184 | "No problem ! Let's try loading one:" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "fmri_filename = 'downsampled_derivatives:fmriprep:sub-pixar109:sub-pixar109_task-pixar_run-001_swrf_bold.nii.gz'\n", 194 | "plotting.plot_epi(os.path.join(data_dir, fmri_filename))" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "Uh-oh, what happened ?! 
Let's look back at the error message:\n", 202 | "\n", 203 | "> DimensionError: Input data has incompatible dimensionality: Expected dimension is 3D and you provided a 4D image. See http://nilearn.github.io/manipulating_images/input_output.html.\n", 204 | "\n", 205 | "We can fix that ! Let's take an average of the EPI image and plot that instead:" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "from nilearn.image import mean_img\n", 215 | "\n", 216 | "plotting.view_img(mean_img(os.path.join(data_dir, fmri_filename)))" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "## [Convert the fMRI volumes to a data matrix](http://nilearn.github.io/auto_examples/plot_decoding_tutorial.html#convert-the-fmri-volume-s-to-a-data-matrix)\n", 224 | "\n", 225 | "These are some really lovely images, but for machine learning we want matrices so that we can use all of the techniques we learned this morning !\n", 226 | "\n", 227 | "To transform our Nifti images into matrices, we'll use the `nilearn.input_data.NiftiMasker` to extract the fMRI data from a mask and convert it to data series.\n", 228 | "\n", 229 | "First, let's do the simplest possible mask—a mask of the whole brain. We'll use a mask that ships with Nilearn and matches the MNI152 template we plotted earlier." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "brain_mask = datasets.load_mni152_brain_mask()\n", 239 | "plotting.plot_roi(brain_mask, cmap='Paired')" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "from nilearn.input_data import NiftiMasker\n", 249 | "masker = NiftiMasker(mask_img=brain_mask, standardize=True)\n", 250 | "masker" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "# We give the masker a filename and retrieve a 2D array ready\n", 260 | "# for machine learning with scikit-learn !\n", 261 | "fmri_masked = masker.fit_transform(os.path.join(data_dir, fmri_filename))\n", 262 | "print(fmri_masked)" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "print(fmri_masked.shape)" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "One way to think about what just happened is to look at it visually:\n", 279 | "\n", 280 | "![](http://nilearn.github.io/_images/masking.jpg)\n", 281 | "\n", 282 | "Essentially, we can think about overlaying a 3D grid on an image. Then, our mask tells us which cubes or \"voxels\" (like 3D pixels) to sample from. Since our Nifti images are 4D files, we can't overlay a single grid -- instead, we use a series of 3D grids (one for each volume in the 4D file), so we can get a measurement for each voxel at each timepoint. These are reflected in the shape of the matrix ! 
You can check this by checking the number of positive voxels in our brain mask.\n", 283 | "\n", 284 | "There are many other strategies in Nilearn [for masking data and for generating masks](http://nilearn.github.io/manipulating_images/manipulating_images.html#computing-and-applying-spatial-masks). I'd encourage you to spend some time exploring the documentation for these !" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "We can also [display this time series](http://nilearn.github.io/auto_examples/03_connectivity/plot_adhd_spheres.html#display-time-series) to get an intuition of how the whole brain signal is changing over time.\n", 292 | "\n", 293 | "We'll display the first three voxels by sub-selecting values from the matrix. You can also find more information on [how to slice arrays here](https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html#basic-slicing-and-indexing)." 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": {}, 300 | "outputs": [], 301 | "source": [ 302 | "import matplotlib.pyplot as plt\n", 303 | "plt.plot(fmri_masked[5:150, :3])\n", 304 | "\n", 305 | "plt.title('Voxel Time Series')\n", 306 | "plt.xlabel('Scan number')\n", 307 | "plt.ylabel('Normalized signal')\n", 308 | "plt.tight_layout()" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "## [Extracting signals from a brain parcellation](http://nilearn.github.io/auto_examples/03_connectivity/plot_signal_extraction.html#extracting-signals-from-a-brain-parcellation)\n", 316 | "\n", 317 | "Now that we've seen how to create a data series from a single region-of-interest (ROI), we can start to scale up ! What if, instead of wanting to extract signal from one ROI, we want to define several ROIs and extract signal from all of them ? Nilearn can help us with that, too ! 
🎉\n", 318 | "\n", 319 | "For this, we'll use `nilearn.input_data.NiftiLabelsMasker`. `NiftiLabelsMasker` works like `NiftiMasker` except that it's for labelled data rather than a binary mask. That is, since we have more than one ROI, we need more than one value ! Now that each ROI gets its own value, these values are treated as labels." 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "# First, let's load a parcellation that we'd like to use\n", 329 | "multiscale = datasets.fetch_atlas_basc_multiscale_2015(resume=True)\n", 330 | "print('Atlas ROIs are located at: %s' % multiscale.scale064)" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "plotting.plot_roi(multiscale.scale064)" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "from nilearn.input_data import NiftiLabelsMasker\n", 349 | "label_masker = NiftiLabelsMasker(labels_img=multiscale.scale064, standardize=True)\n", 350 | "label_masker" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "metadata": {}, 357 | "outputs": [], 358 | "source": [ 359 | "fmri_matrix = label_masker.fit_transform(os.path.join(data_dir, fmri_filename))\n", 360 | "print(fmri_matrix)" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [ 369 | "print(fmri_matrix.shape)" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "### [Compute and display a correlation matrix](http://nilearn.github.io/auto_examples/03_connectivity/plot_signal_extraction.html#compute-and-display-a-correlation-matrix)\n", 377 | "\n", 378 | "Now that we have a matrix, we'd like to create a 
_connectome_. A connectome is a map of the connections in the brain. Since we're working with functional data, however, we don't have access to actual connections. Instead, we'll use a measure of statistical dependency to infer the (possible) presence of a connection.\n", 379 | "\n", 380 | "Here, we'll use Pearson's correlation as our measure of statistical dependency and compare how all of our ROIs from our chosen parcellation relate to one another." 381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": null, 386 | "metadata": {}, 387 | "outputs": [], 388 | "source": [ 389 | "from nilearn import connectome\n", 390 | "correlation_measure = connectome.ConnectivityMeasure(kind='correlation')\n", 391 | "correlation_measure" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "correlation_matrix = correlation_measure.fit_transform([fmri_matrix])\n", 401 | "correlation_matrix" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "import numpy as np\n", 411 | "\n", 412 | "correlation_matrix = correlation_matrix[0]\n", 413 | "# Mask the main diagonal for visualization:\n", 414 | "# np.fill_diagonal(correlation_matrix, 0)\n", 415 | "plotting.plot_matrix(correlation_matrix)" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "### [The importance of specifying confounds](http://nilearn.github.io/auto_examples/03_connectivity/plot_signal_extraction.html#same-thing-without-confounds-to-stress-the-importance-of-confounds)\n", 423 | "\n", 424 | "In fMRI, we're collecting a noisy signal. We have artifacts like physiological noise (from heartbeats, respiration) and head motion which can impact our estimates. 
Therefore, it's strongly recommended that you control for these and related measures when deriving your connectome measures. Here, we'll repeat the correlation matrix example, but this time we'll control for confounds. " 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "conf_filename = 'sub-pixar109_task-pixar_run-001_ART_and_CompCor_nuisance_regressors.tsv'\n", 434 | "clean_fmri_matrix = label_masker.fit_transform(os.path.join(data_dir, fmri_filename),\n", 435 | " confounds=os.path.join(data_dir, conf_filename))\n", 436 | "clean_correlation_matrix = correlation_measure.fit_transform([clean_fmri_matrix])[0]\n", 437 | "np.fill_diagonal(clean_correlation_matrix, 0)\n", 438 | "plotting.plot_matrix(clean_correlation_matrix)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "That looks a little different !\n", 446 | "\n", 447 | "Looking more closely, we can see that our correlation matrix is symmetrical; that is, that both sides of the diagonal contain the same information. We don't want to feed duplicate information into our machine learning classifier, and Nilearn has a really easy way to remove this redundancy ! " 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": null, 453 | "metadata": {}, 454 | "outputs": [], 455 | "source": [ 456 | "vectorized_correlation = connectome.ConnectivityMeasure(kind='correlation',\n", 457 | " vectorize=True, discard_diagonal=True)\n", 458 | "clean_vectorized_correlation = vectorized_correlation.fit_transform([clean_fmri_matrix])[0]\n", 459 | "clean_vectorized_correlation.shape # Why is this value not 64 * 64 ?" 
460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "## [Interactive connectome plotting](http://nilearn.github.io/plotting/index.html#d-plots-of-connectomes)\n", 467 | "\n", 468 | "It can also be helpful to project these connection weightings back onto the brain, to visualize these connectomes ! Here, we'll use the interactive connectome plotting in Nilearn." 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": null, 474 | "metadata": {}, 475 | "outputs": [], 476 | "source": [ 477 | "coords = plotting.find_parcellation_cut_coords(multiscale.scale064)\n", 478 | "plotting.view_connectome(clean_correlation_matrix, coords=coords)" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "?plotting.view_connectome" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": {}, 494 | "outputs": [], 495 | "source": [] 496 | } 497 | ], 498 | "metadata": { 499 | "kernelspec": { 500 | "display_name": "Python 3", 501 | "language": "python", 502 | "name": "python3" 503 | }, 504 | "language_info": { 505 | "codemirror_mode": { 506 | "name": "ipython", 507 | "version": 3 508 | }, 509 | "file_extension": ".py", 510 | "mimetype": "text/x-python", 511 | "name": "python", 512 | "nbconvert_exporter": "python", 513 | "pygments_lexer": "ipython3", 514 | "version": "3.6.7" 515 | } 516 | }, 517 | "nbformat": 4, 518 | "nbformat_minor": 2 519 | } 520 | -------------------------------------------------------------------------------- /plot_variance_linear_regr.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%matplotlib inline" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | 
"source": [ 16 | "\n", 17 | "# Plot variance and regularization in linear models\n", 18 | "\n", 19 | "\n", 20 | "\n", 21 | "\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import numpy as np\n", 31 | "\n", 32 | "# Smaller figures\n", 33 | "from matplotlib import pyplot as plt\n", 34 | "plt.rcParams['figure.figsize'] = (3, 2)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "We consider the situation where we have only 2 data points\n", 42 | "\n" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 3, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "X = np.c_[ .5, 1].T\n", 52 | "y = [.5, 1]\n", 53 | "X_test = np.c_[ 0, 2].T" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "Without noise, linear regression fits the data perfectly\n", 61 | "\n" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 4, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "data": { 71 | "text/plain": [ 72 | "[]" 73 | ] 74 | }, 75 | "execution_count": 4, 76 | "metadata": {}, 77 | "output_type": "execute_result" 78 | }, 79 | { 80 | "data": { 81 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAANAAAACPCAYAAACLfFVWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAELlJREFUeJzt3Xl8VPW5x/HPAwREQVCD7BBR6oKg0IggFfeFiERRK6WiKK0vFxSs6EWtYrn2umCpIoJF4FbcW1AaMIqgXKkUKGGRHQwqsil72AJkee4fvxHSEMiEOTO/mczzfr3ycibnvOY8HvP1zPzmd56fqCrGmGNTxXcBxiQyC5AxEbAAGRMBC5AxEbAAGRMBC5AxEbAAGRMBC5AxEbAAGROBar4OnJqaqmlpab4Ob8xRzZs3b4uq1itvP28BSktLIycnx9fhjTkqEVkTzn7lvoUTkaYiMl1ElonIUhHpV8Y+IiLDRCRXRBaJSLtjKdqYRBPOFagQeFhV54tIbWCeiExV1WUl9ukCtAz9XAiMDP3TmPikCiIRv0y5VyBV3aiq80OPdwHLgcaldssExqkzG6grIg0jrs6YoB3YA588BtMGBfJyFRqFE5E0oC0wp9SmxsDaEs/XcXjIEJG7RSRHRHI2b95csUqNidQ3/wcjOsLsEVCwz12FIhR2gESkFjAB6K+qO4/lYKo6SlXTVTW9Xr1yBziMCUb+DvhHXxiXCVWqQe9syHghkLdwYY3CiUgKLjxvq+oHZeyyHmha4nmT0O+M8WvFRzD5d7BnM3TqD5cOhJSagb18uQESEQHGAMtVdegRdssC+orIe7jBgzxV3RhYlcZU1O5N8PGjsPRDqN8aer4HjdoGfphwrkCdgF7AYhFZGPrd40AzAFV9DcgGMoBcYC9wZ+CVGhMOVVj0Pnwy0A0YXP4kdOoHVVOicrhyA6SqXwJHfbOorrHC/UEVZcwx2bEWJj8EuVOhSXvIHA71zozqIb3NRDAmMMXFkDMGpj3trkBdXoALfgNVqkb90BYgk9i25ELWA/D9v6DFZXD9y3BS85gd3gJkElNRIcx6BaY/CynHQeYIOL9nIEPTFWEBMoln4yLI6gsbv4Kzr4eMP0Ht+l5KsQCZxFGwD2a8AF++BMefAr8cB+dkei3JAmQSw/dz3FVnyyo4rydc80c4/mTfVVmATJzbvxs+Gwz/HgV1msBtE+CMK31XdZAFyMSv3M9gUn/IWwvtfwtXPAU1avuu6j9YgEz8yd8OU56AhW/DKS3hrk+gWQffVZXJAmTiy7IsyB4Ae7bAxQ9D50fdMHWcsgCZ+LDrRxec5VnQoA38ejw0bOO7qnJZgIxfqrDwHZjyOBTkwxWD4KIHojb5M2gWIOPP9jUwuT+s/hyadYRur0BqS99VVYgFyMRecTHMfR2m/cFNvcl4EdL7QJXE6/NpATKxtXmVm/y5drb7Pqfrn6FuM99VHbNw+sKNFZFNIrLkCNsvFZE8EVkY+nkq+DJNwisqgBkvwmudYMtKuPEvbqAggcMD4V2B/goMB8YdZZ9/qmrXQCoylc+GhW4azg+L4ZwbIGMI1DrVd1WBCOeO1BmhdlbGVExBPnzxPMwcBiekwq1vudnTlUhQn4E6ishXwAZggKouDeh1TaJaM8tddbbmQtvb4OpnoOZJvqsKXBABmg80V9XdIpIBTMS1+D2MiNwN3A3QrFliv/c1R7B/lxtdm/u6+3zTayKcfpnvqqIm4nFDVd2pqrtDj7OBFBFJPcK+1lixMvt6muv8OXc0dLgP7p1VqcMDAVyBRKQB8KOqqoi0x4Vya8SVmcSxd5ubSfDVu5B6JvT5FJq2911VTITTWPFd4FIgVUTWAYOAFDjYE+5m4F4RKQTygR6hNlemslOFZRMh+xE3g7rzo9B5AFSr4buymAlnFO5X5WwfjhvmNslk1w/w0cOwYjI0PB96fQgNWvuuKuZsJoKpGFVY8Ja7X6doP1w
1GDrcD1WT808pOf+tzbHZ/h1M6ueWCWneCa4fBqln+K7KKwuQKV9xketJ8NlgkKpw3VD4+Z0JOfkzaBYgc3SbVrgvRNfNhZZXu8mfdZr4ripuWIBM2QoPwMyXYMYQqF4Lur8OrW+JeefPeGcBModbP9/dcvDjEjj3Jrj2eahlX3yXxQJkDinIh+n/A7OGQ6360ONdOCvDd1VxzQJknO++dFedbd9Auzvc8HTNur6rinsWoGS3b6db8j1nLJyUBrdnQYtLfFeVMCxAyWzVFLei266N0LEvXPYEVD/ed1UJxQKUjPZsdWuILv4b1DvbrXLQJN13VQnJApRMVGHJBLd69b6dcMlA1/2zWnXflSUsC1Cy2LnBTf5cmQ2N2rkFeOu38l1VwrMAVXaqMP8N+PRJ1xnn6j9Ch3tjsgBvMrAAVWbbvoGsB+G7f0LaxdBtGJzcwndVlYoFqDIqLoLZI+HzZ1yP6etfdt/t2DScwIVzR+pYoCuwSVXPLWO7AC8DGcBeoLeqzg+6UFO2iQvWM2TKSjbsyKdR3ZoM7liFK1YNhvXz4GddoOtQOLGR7zIrrSAaK3bBdeFpCVwIjAz900TZxAXreeyDxeQXFJFCITfveouLP5/I/honUuOmMW4em111oqrcGzpUdQaw7Si7ZALj1JkN1BWRhkEVaI5syJSV5BcUcZ7kMqn6EzyUMoHs4gu5UV6C1jdbeGIgiM9AjYG1JZ6vC/1uY+kdrS9csLbt2MHj1cbTp2o2mziJuw4M4PPidkie78qSR0wHEVR1FDAKID093Tr3ROLbGUyrOZDG+iNvFV7Bc4W/YjduGk6jujU9F5c8ggjQeqBpiedNQr8z0bAvz32nM/8N6pzQjNt3DmJG4ZkHN9dMqcoj15x5lBcwQQripvYs4HZxOgB5qnrY2zcTgJUfw6sXwoI34aIHqdVvDt2730rjujURoHHdmjzbvTU3tG3su9KkEURjxWzcEHYubhj7zmgVm7T2bHHz15ZMgFNbQY93oHE7AG5oe7wFxqMgGisqcH9gFZlDVGHxeBee/bvc7Qad+tvkzzhiMxHiVd46mPw7+HoKNLnALcB76tm+qzKlWIDiTXExzPtfmDoItAiufQ7a322TP+OUBSiebF3tJn+u+RJOu8TNYTv5NN9VmaOwAMWDokKY/arriFO1BnQb7lZ1s5kEcc8C5NsPS1znzw0L4Mzr4Lo/wYk2EypRWIB8Kdzvln3/cqhbO/SWv7oVrO2qk1AsQD6sneuuOptXQJsecO2zcPzJvqsyx8ACFEsH9rib3GaPhBMbw6/HQ8urfFdlImABipXV02HSg7Dje7jgt3DlIKhR23dVJkIWoGjL3wGfPuFWdTv5dLjzY2h+ke+qTEAsQNG0fLJrJbVnM/ziIbjkvyDFbjWoTCxA0bB7k1u5etlEqN8aer4Hjdr6rspEgQUoSKqw6H3XNvfAHrj8SejUz3XGMZWSBSgoO9bC5P6QOw2aXuhmE9T7me+qTJRZgCJVXAw5Y2Da0+4K1OUFN8pmC/AmBQtQJLZ87Ral+n4WtLjMTf48qbnvqkwMhRUgEbkW1zyxKjBaVZ8rtb03MIRDvRCGq+roAOv0rmQDw6Z1qjOyxUxarRoBKcdB5gg4v6dNw0lC4dzSXRV4FbgK17JqrohkqeqyUru+r6p9o1CjdyUbGJ4j3/F8/ihaLf+ODQ2volHPV6F2fd8lGk/CuQK1B3JV9RsAEXkP10yxdIAqrSFTVlJckM+Aah9yT9VJbKc29xzoz+LtlzDTwpPUwglQWY0Ty2rde5OIdAZWAQ+p6trSOyRqY8WGeQt5o/rrnFFlA38v7MwzhbeRRy1kR77v0oxnQQ0VTQLSVLUNMBV4o6ydVHWUqqaranq9evUCOnQU7d8N2Y/ytxqDOU4O0OvAQB4pvIc8agHWwNCEdwUqt3Giqm4t8XQ08ELkpXmW+xlM6g95a/n2tJ78MvdqthYf+kLUGhgaCO8KNBdoKSKniUh
1oAeumeJBpZrJdwOWB1dijO3dBhPvg7e6Q7UacNcnnH7HCJ7sfoE1MDSHCacvXKGI9AWm4Iaxx6rqUhEZDOSoahbwoIh0AwpxKzn0jmLN0bPsH/DRANi71S2+2/lRN0wN3NC2sQXGHEZcX8TYS09P15ycHC/HPsyuHyF7ACzPggZtIPNVaNjGd1XGIxGZp6rp5e2X3DMRVGHhOzDlcSjIhyufho4PQNXkPi0mfMn7l7J9DUzqB99Mh2YdXefP1Ja+qzIJJvkCVFwMc1+HaX9wU28yXoT0Pjb50xyT5ArQ5pVu8ufaOXDGldD1z1A3cb7QNfEnOQJUVAAzX4YvnofqJ8CNf4E2t9rkTxOxyh+gDQtdD7YfFkOrG939OrVO9V2VqSQqb4AK8t0VZ+YwOCEVbn0bzu7quypTyVTOAK35l/usszUX2vaCq//btc81JmCVK0D7d7lbq+eOdoMDvSbC6Zf5rspUYpUnQF9PdZM/d66HDvfB5b93AwbGRFHiB2jvNvjkMVj0HtQ7C/p8Ck3b+67KJInEDZCqa1yY/Qjkb3cTPzsPcDOojYmRxAzQzo1u8ueKydDwfPdZp8G5vqsySSixAqQKC96EKb+Hov1w1WDocL9N/jTeJM5f3rZv3eTPb7+A5p3c5M9TTvddlUlyYc2gFJFrRWSliOSKyMAyttcQkfdD2+eISFpgFRYXwawRMPIiWD8frhsKd0y28Ji4EFRfuD7AdlU9Q0R6AM8Dt0Zc3aYVbhrOurnQ8hroOhTqNIn4ZY0JSlB94TKBp0OPxwPDRUT0WG93LTwAM1+CGUOgei3oPhpa32yTP03cCaov3MF9Qj0U8oBTgC0VrmhfHoztApuWwrk3ucmfJ6RW+GWMiYWYDiKE1VjxuDqQ1snNJDgrI4bVGVNx4QwilNsXruQ+IlINqANsLbVP+I0VM4ZYeExCCKQvXOj5HaHHNwOfH/PnH2MSSFB94cYAb4pILq4vXI9oFm1MvAjrM5CqZgPZpX73VInH+4Bbgi3NmPjnrbGiiGwG1hxll1SOZRQv+qyuiknUupqrarkrIHgLUHlEJCeczpCxZnVVTGWvy5qhGRMBC5AxEYjnAI3yXcARWF0VU6nritvPQMYkgni+AhkT97wHyOu9RpHV1VtENovIwtDPb2JQ01gR2SQiS46wXURkWKjmRSLSLto1hVnXpSKSV+JcPVXWflGoq6mITBeRZSKyVET6lbFPZOdMVb394GY2rAZaANWBr4BzSu1zH/Ba6HEP4P04qas3MDzG56sz0A5YcoTtGcDHgAAdgDlxUtelwGQPf18NgXahx7VxK8iX/u8Y0TnzfQU6eK+Rqh4AfrrXqKRMDq36PR64QiTqNwaFU1fMqeoM3FSpI8kExqkzG6hbav1aX3V5oaobVXV+6PEu3Nq9pdfpjOic+Q5QWfcalf4X/I97jYCf7jXyXRfATaHL/ngRaVrG9lgLt24fOorIVyLysYi0ivXBQ2/92wJzSm2K6Jz5DlAimwSkqWobYCqHrpLmcPNxU2POA14BJsby4CJSC5gA9FfVnUG+tu8ABXavUazrUtWtqro/9HQ08PMo1xSOcM5nzKnqTlXdHXqcDaSISExuMxaRFFx43lbVD8rYJaJz5jtA8XqvUbl1lXqf3A33/tq3LOD20MhSByBPVTf6LkpEGvz0uVVE2uP+7qL9P0FCxxwDLFfVoUfYLbJzFuuRkTJGSjJwoyOrgSdCvxsMdAs9Pg74O5AL/BtoESd1PQssxY3QTQfOikFN7wIbgQLce/U+wD3APaHtguugtBpYDKTH6FyVV1ffEudqNnBRjOr6BaDAImBh6CcjyHNmMxGMiYDvt3DGJDQLkDERsAAZEwELkDERsAAZEwELkDERsAAZEwELkDER+H9TzlTozKDfbQAAAABJRU5ErkJggg==\n", 82 | "text/plain": [ 83 | "
" 84 | ] 85 | }, 86 | "metadata": {}, 87 | "output_type": "display_data" 88 | } 89 | ], 90 | "source": [ 91 | "from sklearn import linear_model\n", 92 | "regr = linear_model.LinearRegression()\n", 93 | "regr.fit(X, y)\n", 94 | "plt.plot(X, y, 'o')\n", 95 | "plt.plot(X_test, regr.predict(X_test))" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "In real life situation, we have noise (e.g. measurement noise) in our data:\n", 103 | "\n" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 5, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "image/png": "\n", 114 | "text/plain": [ 115 | "
" 116 | ] 117 | }, 118 | "metadata": {}, 119 | "output_type": "display_data" 120 | } 121 | ], 122 | "source": [ 123 | "np.random.seed(0)\n", 124 | "for _ in range(6):\n", 125 | " noisy_X = X + np.random.normal(loc=0, scale=.1, size=X.shape)\n", 126 | " plt.plot(noisy_X, y, 'o')\n", 127 | " regr.fit(noisy_X, y)\n", 128 | " plt.plot(X_test, regr.predict(X_test))" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "As we can see, our linear model captures and amplifies the noise in the\n", 136 | "data. It displays a lot of variance.\n", 137 | "\n", 138 | "We can use another linear estimator that uses regularization, the\n", 139 | ":class:`~sklearn.linear_model.Ridge` estimator. This estimator\n", 140 | "regularizes the coefficients by shrinking them to zero, under the\n", 141 | "assumption that very high correlations are often spurious. The alpha\n", 142 | "parameter controls the amount of shrinkage used.\n", 143 | "\n" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 6, 149 | "metadata": {}, 150 | "outputs": [ 151 | { 152 | "data": { 153 | "image/png": "\n", 154 | "text/plain": [ 155 | "
" 156 | ] 157 | }, 158 | "metadata": {}, 159 | "output_type": "display_data" 160 | } 161 | ], 162 | "source": [ 163 | "regr = linear_model.Ridge(alpha=.1)\n", 164 | "np.random.seed(0)\n", 165 | "for _ in range(6):\n", 166 | " noisy_X = X + np.random.normal(loc=0, scale=.1, size=X.shape)\n", 167 | " plt.plot(noisy_X, y, 'o')\n", 168 | " regr.fit(noisy_X, y)\n", 169 | " plt.plot(X_test, regr.predict(X_test))\n", 170 | "\n", 171 | "plt.show()" 172 | ] 173 | } 174 | ], 175 | "metadata": { 176 | "kernelspec": { 177 | "display_name": "Python 3", 178 | "language": "python", 179 | "name": "python3" 180 | }, 181 | "language_info": { 182 | "codemirror_mode": { 183 | "name": "ipython", 184 | "version": 3 185 | }, 186 | "file_extension": ".py", 187 | "mimetype": "text/x-python", 188 | "name": "python", 189 | "nbconvert_exporter": "python", 190 | "pygments_lexer": "ipython3", 191 | "version": "3.5.2" 192 | } 193 | }, 194 | "nbformat": 4, 195 | "nbformat_minor": 1 196 | } 197 | -------------------------------------------------------------------------------- /MAIN_tutorial_machine_learning_with_nilearn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "# Let's keep our notebook clean, so it's a little more readable!\n", 12 | "import warnings\n", 13 | "warnings.filterwarnings('ignore')" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": null, 19 | "metadata": { 20 | "collapsed": true 21 | }, 22 | "outputs": [], 23 | "source": [ 24 | "%matplotlib inline" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "collapsed": true 31 | }, 32 | "source": [ 33 | "# Section 2: Machine learning to predict age from rs-fmri\n", 34 | "\n", 35 | "We will integrate what we've learned in the previous sections to extract data from *several* rs-fmri 
images, and use that data as features in a machine learning model\n", 36 | "\n", 37 | "The dataset consists of 50 children (ages 3-13) and 33 young adults (ages 18-39). We will use rs-fmri data to try to predict who are adults and who are children." 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### Load the data\n" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": true 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "# change this to the location where you downloaded the data\n", 56 | "wdir = '/Users/jakevogel/Science/Nilearn_tutorial/reduced/'\n" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "# Now fetch the data\n", 68 | "\n", 69 | "from glob import glob\n", 70 | "import os\n", 71 | "data = sorted(glob(os.path.join(wdir,'*.gz')))\n", 72 | "confounds = sorted(glob(os.path.join(wdir,'*regressors.tsv')))" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "How many individual subjects do we have?" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "#len(data.func)\n", 89 | "len(data)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "### Extract features" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "Here, we are going to use the same techniques we learned in the previous tutorial to extract rs-fmri connectivity features from every subject.\n", 104 | "\n", 105 | "How are we going to do that? 
With a for loop.\n", 106 | "\n", 107 | "Don't worry, it's not as scary as it sounds" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "# Here is a really simple for loop\n", 117 | "\n", 118 | "for i in range(10):\n", 119 | "    print('the number is', i)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "container = []\n", 129 | "for i in range(10):\n", 130 | "    container.append(i)\n", 131 | "\n", 132 | "container" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "Now let's construct a more complicated loop to do what we want" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "First, we handle the setup that doesn't need to happen inside the loop. Let's reload our atlas and re-initialize our masker and correlation_measure" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "from nilearn.input_data import NiftiLabelsMasker\n", 156 | "from nilearn.connectome import ConnectivityMeasure\n", 157 | "from nilearn import datasets\n", 158 | "\n", 159 | "# load atlas\n", 160 | "multiscale = datasets.fetch_atlas_basc_multiscale_2015()\n", 161 | "atlas_filename = multiscale.scale064\n", 162 | "\n", 163 | "# initialize masker (change verbosity)\n", 164 | "masker = NiftiLabelsMasker(labels_img=atlas_filename, standardize=True, \n", 165 | "                           memory='nilearn_cache', verbose=0)\n", 166 | "\n", 167 | "# initialize correlation measure, set to vectorize\n", 168 | "correlation_measure = ConnectivityMeasure(kind='correlation', vectorize=True,\n", 169 | "                                          discard_diagonal=True)" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "Okay -- now that we have that 
taken care of, let's run our big loop!" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "**NOTE**: On a laptop, this might take a few minutes." 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "all_features = [] # here is where we will put the data (a container)\n", 193 | "\n", 194 | "for i,sub in enumerate(data):\n", 195 | "    # extract the timeseries from the ROIs in the atlas\n", 196 | "    time_series = masker.fit_transform(sub, confounds=confounds[i])\n", 197 | "    # compute the vectorized region x region correlations (vectorize=True)\n", 198 | "    correlation_matrix = correlation_measure.fit_transform([time_series])[0]\n", 199 | "    # add to our container\n", 200 | "    all_features.append(correlation_matrix)\n", 201 | "    # keep track of status\n", 202 | "    print('finished %s of %s'%(i+1,len(data)))" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": { 209 | "collapsed": true 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "# Let's save the data to disk\n", 214 | "import numpy as np\n", 215 | "\n", 216 | "np.savez_compressed('MAIN_BASC064_subsamp_features',a = all_features)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "In case you do not want to run the full loop on your computer, you can load the output of the loop here!" 
224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": { 230 | "collapsed": true 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "feat_file = 'MAIN_BASC064_subsamp_features.npz'\n", 235 | "X_features = np.load(feat_file)['a']" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "X_features.shape" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "Okay so we've got our features." 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "We can visualize our feature matrix" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "import matplotlib.pyplot as plt\n", 268 | "\n", 269 | "plt.imshow(X_features, aspect='auto')\n", 270 | "plt.colorbar()\n", 271 | "plt.title('feature matrix')\n", 272 | "plt.xlabel('features')\n", 273 | "plt.ylabel('subjects')" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "### Get Y (our target) and assess its distribution" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": { 287 | "collapsed": true 288 | }, 289 | "outputs": [], 290 | "source": [ 291 | "# Let's load the phenotype data\n", 292 | "\n", 293 | "pheno_path = os.path.join(wdir, 'participants.tsv')" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": {}, 300 | "outputs": [], 301 | "source": [ 302 | "from pandas import read_csv\n", 303 | "\n", 304 | "pheno = read_csv(pheno_path, sep='\\t').sort_values('participant_id')\n", 305 | "pheno.head()" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "Looks like there is a column labeling children and 
adults. Let's capture it in a variable" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "y_ageclass = pheno['Child_Adult']\n", 322 | "y_ageclass.head()" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "Maybe we should have a look at the distribution of our target variable" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "import matplotlib.pyplot as plt\n", 339 | "import seaborn as sns\n", 340 | "sns.countplot(y_ageclass)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "We are a bit unbalanced -- there seem to be more children than adults" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "pheno.Child_Adult.value_counts()" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "### Prepare data for machine learning\n", 364 | "\n", 365 | "Here, we will define a \"training sample\" where we can play around with our models. We will also set aside a \"test\" sample that we will not touch until the end" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "We want to be sure that our training and test samples are matched! We can do that with a \"stratified split\". Specifically, we will stratify by age class." 
373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "from sklearn.model_selection import train_test_split\n", 382 | "\n", 383 | "# Split the sample to training/test with a 60/40 ratio, and \n", 384 | "# stratify by age class, and also shuffle the data.\n", 385 | "\n", 386 | "X_train, X_test, y_train, y_test = train_test_split(\n", 387 | " X_features, # x\n", 388 | " y_ageclass, # y\n", 389 | " test_size = 0.4, # 60%/40% split \n", 390 | " shuffle = True, # shuffle dataset\n", 391 | " # before splitting\n", 392 | " stratify = y_ageclass, # keep\n", 393 | " # distribution\n", 394 | " # of ageclass\n", 395 | " # consistent\n", 396 | " # betw. train\n", 397 | " # & test sets.\n", 398 | " random_state = 123 # same shuffle each\n", 399 | " # time\n", 400 | " )\n", 401 | "\n", 402 | "# print the size of our training and test groups\n", 403 | "print('training:', len(X_train),\n", 404 | " 'testing:', len(X_test))" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "Let's visualize the distributions to be sure they are matched" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": {}, 418 | "outputs": [], 419 | "source": [ 420 | "fig,(ax1,ax2) = plt.subplots(2)\n", 421 | "sns.countplot(y_train, ax=ax1, order=['child','adult'])\n", 422 | "ax1.set_title('Train')\n", 423 | "sns.countplot(y_test, ax=ax2, order=['child','adult'])\n", 424 | "ax2.set_title('Test')" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "### Run your first model!\n", 432 | "\n", 433 | "Machine learning can get pretty fancy pretty quickly. We'll start with a very standard classification model called a Support Vector Classifier (SVC). \n", 434 | "\n", 435 | "While this may seem unambitious, simple models can be very robust. 
And we don't have enough data to create more complex models.\n", 436 | "\n", 437 | "For more information, see this excellent resource:\n", 438 | "https://hal.inria.fr/hal-01824205" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "First, a quick review of SVM!\n", 446 | "![](https://docs.opencv.org/2.4/_images/optimal-hyperplane.png)" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "metadata": {}, 452 | "source": [ 453 | "Let's fit our first model!" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "from sklearn.svm import SVC\n", 463 | "l_svc = SVC(kernel='linear') # define the model\n", 464 | "\n", 465 | "l_svc.fit(X_train, y_train) # fit the model" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "Well... that was easy. Let's see how well the model learned the data!\n", 473 | "\n", 474 | "We can judge our model on several criteria:\n", 475 | "* Accuracy: The proportion of predictions that were correct overall.\n", 476 | "* Precision: The proportion of cases predicted as positive that are actually positive\n", 477 | "* Recall: The proportion of actual positives that were correctly predicted as positive\n", 478 | "* f1 score: The harmonic mean of precision and recall\n", 479 | "\n", 480 | "Or, for a more visual explanation...\n", 481 | "\n", 482 | "![](https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg)" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": { 489 | "collapsed": true 490 | }, 491 | "outputs": [], 492 | "source": [ 493 | "from sklearn.metrics import classification_report, confusion_matrix, precision_score, f1_score\n", 494 | "\n", 495 | "# predict the training data based on the model\n", 496 | "y_pred = l_svc.predict(X_train) \n", 497 | "\n", 498 | "# calculate the model accuracy\n", 499 | "acc = l_svc.score(X_train, 
y_train) \n", 500 | "\n", 501 | "# calculate the model precision, recall and f1, all in one convenient report!\n", 502 | "cr = classification_report(y_true=y_train,\n", 503 | " y_pred = y_pred)\n", 504 | "\n", 505 | "# get a table to help us break down these scores\n", 506 | "cm = confusion_matrix(y_true=y_train, y_pred = y_pred) \n" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": {}, 512 | "source": [ 513 | "Let's view our results and plot them all at once!" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "metadata": {}, 520 | "outputs": [], 521 | "source": [ 522 | "import itertools\n", 523 | "from pandas import DataFrame\n", 524 | "\n", 525 | "# print results\n", 526 | "print('accuracy:', acc)\n", 527 | "print(cr)\n", 528 | "\n", 529 | "# plot confusion matrix\n", 530 | "cmdf = DataFrame(cm, index = ['Adult','Child'], columns = ['Adult','Child'])\n", 531 | "sns.heatmap(cmdf, cmap = 'RdBu_r')\n", 532 | "plt.xlabel('Predicted')\n", 533 | "plt.ylabel('Observed')\n", 534 | "# label cells in matrix\n", 535 | "for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", 536 | " plt.text(j+0.5, i+0.5, format(cm[i, j], 'd'),\n", 537 | " horizontalalignment=\"center\",\n", 538 | " color=\"white\")" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "![](https://sebastianraschka.com/images/faq/multiclass-metric/conf_mat.png)" 546 | ] 547 | }, 548 | { 549 | "cell_type": "markdown", 550 | "metadata": {}, 551 | "source": [ 552 | "HOLY COW! Machine learning is amazing!!! Almost a perfect fit!\n", 553 | "\n", 554 | "...which means there's something wrong. What's the problem here?" 
555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": null, 560 | "metadata": { 561 | "collapsed": true 562 | }, 563 | "outputs": [], 564 | "source": [ 565 | "from sklearn.model_selection import cross_val_predict, cross_val_score\n", 566 | "\n", 567 | "# predict\n", 568 | "y_pred = cross_val_predict(l_svc, X_train, y_train, \n", 569 | " groups=y_train, cv=10)\n", 570 | "# scores\n", 571 | "acc = cross_val_score(l_svc, X_train, y_train, \n", 572 | " groups=y_train, cv=10)" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "We can look at the accuracy of the predictions for each fold of the cross-validation" 580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": null, 585 | "metadata": {}, 586 | "outputs": [], 587 | "source": [ 588 | "for i in range(10):\n", 589 | " print('Fold %s -- Acc = %s'%(i, acc[i]))" 590 | ] 591 | }, 592 | { 593 | "cell_type": "markdown", 594 | "metadata": {}, 595 | "source": [ 596 | "We can also look at the overall accuracy of the model" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": null, 602 | "metadata": {}, 603 | "outputs": [], 604 | "source": [ 605 | "from sklearn.metrics import accuracy_score\n", 606 | "overall_acc = accuracy_score(y_pred = y_pred, y_true = y_train)\n", 607 | "overall_cr = classification_report(y_pred = y_pred, y_true = y_train)\n", 608 | "overall_cm = confusion_matrix(y_pred = y_pred, y_true = y_train)\n", 609 | "print('Accuracy:',overall_acc)\n", 610 | "print(overall_cr)\n" 611 | ] 612 | }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": null, 616 | "metadata": {}, 617 | "outputs": [], 618 | "source": [ 619 | "thresh = overall_cm.max() / 2\n", 620 | "cmdf = DataFrame(overall_cm, index = ['Adult','Child'], columns = ['Adult','Child'])\n", 621 | "sns.heatmap(cmdf, cmap='copper')\n", 622 | "plt.xlabel('Predicted')\n", 623 | "plt.ylabel('Observed')\n", 624 | "for i, j in 
itertools.product(range(overall_cm.shape[0]), range(overall_cm.shape[1])):\n", 625 | " plt.text(j+0.5, i+0.5, format(overall_cm[i, j], 'd'),\n", 626 | " horizontalalignment=\"center\",\n", 627 | " color=\"white\")" 628 | ] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "metadata": {}, 633 | "source": [ 634 | "Not too bad at all!" 635 | ] 636 | }, 637 | { 638 | "cell_type": "markdown", 639 | "metadata": {}, 640 | "source": [ 641 | "### Tweak your model\n", 642 | "\n", 643 | "It's very important to learn when and where it's appropriate to \"tweak\" your model.\n", 644 | "\n", 645 | "Since we have done all of the previous analysis in our training data, it's fine to try different models. But we **absolutely cannot** \"test\" it on our left-out data. If we do, we are in great danger of overfitting.\n", 646 | "\n", 647 | "We could try other models, or tweak hyperparameters, but we are probably not powered sufficiently to do so, and would once again risk overfitting.\n" 648 | ] 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "metadata": {}, 653 | "source": [ 654 | "But as a demonstration, we could see the impact of \"scaling\" our data. Certain machine learning algorithms perform better when all the input data is transformed to a uniform range of values. This is often between 0 and 1, or mean-centered with unit variance. 
We can perhaps look at the performance of the model after scaling the data" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": null, 660 | "metadata": { 661 | "collapsed": true 662 | }, 663 | "outputs": [], 664 | "source": [ 665 | "# Scale the training data\n", 666 | "from sklearn.preprocessing import MinMaxScaler\n", 667 | "scaler = MinMaxScaler().fit(X_train)\n", 668 | "X_train_scl = scaler.transform(X_train)" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": null, 674 | "metadata": {}, 675 | "outputs": [], 676 | "source": [ 677 | "plt.imshow(X_train, aspect='auto')\n", 678 | "plt.colorbar()\n", 679 | "plt.title('Training Data')\n", 680 | "plt.xlabel('features')\n", 681 | "plt.ylabel('subjects')" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "metadata": {}, 688 | "outputs": [], 689 | "source": [ 690 | "plt.imshow(X_train_scl, aspect='auto')\n", 691 | "plt.colorbar()\n", 692 | "plt.title('Scaled Training Data')\n", 693 | "plt.xlabel('features')\n", 694 | "plt.ylabel('subjects')" 695 | ] 696 | }, 697 | { 698 | "cell_type": "code", 699 | "execution_count": null, 700 | "metadata": {}, 701 | "outputs": [], 702 | "source": [ 703 | "# repeat the steps above to re-fit the model \n", 704 | "# and assess its performance\n", 705 | "\n", 706 | "# don't forget to switch X_train to X_train_scl\n", 707 | "\n", 708 | "# predict\n", 709 | "y_pred = cross_val_predict(l_svc, X_train_scl, y_train, \n", 710 | " groups=y_train, cv=10)\n", 711 | "\n", 712 | "# get scores\n", 713 | "overall_acc = accuracy_score(y_pred = y_pred, y_true = y_train)\n", 714 | "overall_cr = classification_report(y_pred = y_pred, y_true = y_train)\n", 715 | "overall_cm = confusion_matrix(y_pred = y_pred, y_true = y_train)\n", 716 | "print('Accuracy:',overall_acc)\n", 717 | "print(overall_cr)\n", 718 | "\n", 719 | "# plot\n", 720 | "thresh = overall_cm.max() / 2\n", 721 | "cmdf = DataFrame(overall_cm, index = 
['Adult','Child'], columns = ['Adult','Child'])\n", 722 | "sns.heatmap(cmdf, cmap='copper')\n", 723 | "plt.xlabel('Predicted')\n", 724 | "plt.ylabel('Observed')\n", 725 | "for i, j in itertools.product(range(overall_cm.shape[0]), range(overall_cm.shape[1])):\n", 726 | " plt.text(j+0.5, i+0.5, format(overall_cm[i, j], 'd'),\n", 727 | " horizontalalignment=\"center\",\n", 728 | " color=\"white\")" 729 | ] 730 | }, 731 | { 732 | "cell_type": "markdown", 733 | "metadata": {}, 734 | "source": [ 735 | "What do you think about the results of this model compared to the non-transformed model?" 736 | ] 737 | }, 738 | { 739 | "cell_type": "markdown", 740 | "metadata": {}, 741 | "source": [ 742 | "**Exercise:** Try fitting a new SVC model and tweak one of the many parameters. Run cross-validation and see how well it goes. Make a new cell and type `SVC?` to see the possible hyperparameters" 743 | ] 744 | }, 745 | { 746 | "cell_type": "code", 747 | "execution_count": null, 748 | "metadata": { 749 | "collapsed": true 750 | }, 751 | "outputs": [], 752 | "source": [ 753 | "# new_model = SVC() " 754 | ] 755 | }, 756 | { 757 | "cell_type": "markdown", 758 | "metadata": {}, 759 | "source": [ 760 | "### Can our model distinguish children from adults in completely unseen data?\n", 761 | "Now that we've fit a model we think has possibly learned how to decode childhood vs adulthood based on rs-fmri signal, let's put it to the test. We will train our model on all of the training data, and try to predict the age class of the subjects we left out at the beginning of this section." 
762 | ] 763 | }, 764 | { 765 | "cell_type": "markdown", 766 | "metadata": {}, 767 | "source": [ 768 | "Because we performed a transformation on our training data, we will need to transform our testing data using the *same information!* \n" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": null, 774 | "metadata": { 775 | "collapsed": true 776 | }, 777 | "outputs": [], 778 | "source": [ 779 | "# Notice how we use the Scaler that was fit to X_train and apply to X_test,\n", 780 | "# rather than creating a new Scaler for X_test\n", 781 | "X_test_scl = scaler.transform(X_test)" 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "metadata": {}, 787 | "source": [ 788 | "And now for the moment of truth! \n", 789 | "\n", 790 | "No cross-validation needed here. We simply fit the model with the training data and use it to predict the testing data\n", 791 | "\n", 792 | "I'm so nervous. Let's just do it all in one cell" 793 | ] 794 | }, 795 | { 796 | "cell_type": "code", 797 | "execution_count": null, 798 | "metadata": {}, 799 | "outputs": [], 800 | "source": [ 801 | "l_svc.fit(X_train_scl, y_train) # fit to training data\n", 802 | "y_pred = l_svc.predict(X_test_scl) # classify age class using testing data\n", 803 | "acc = l_svc.score(X_test_scl, y_test) # get accuracy\n", 804 | "cr = classification_report(y_pred=y_pred, y_true=y_test) # get prec., recall & f1\n", 805 | "cm = confusion_matrix(y_pred=y_pred, y_true=y_test) # get confusion matrix\n", 806 | "\n", 807 | "# print results\n", 808 | "print('accuracy =', acc)\n", 809 | "print(cr)\n", 810 | "\n", 811 | "# plot results\n", 812 | "thresh = cm.max() / 2\n", 813 | "cmdf = DataFrame(cm, index = ['Adult','Child'], columns = ['Adult','Child'])\n", 814 | "sns.heatmap(cmdf, cmap='RdBu_r')\n", 815 | "plt.xlabel('Predicted')\n", 816 | "plt.ylabel('Observed')\n", 817 | "for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", 818 | " plt.text(j+0.5, i+0.5, format(cm[i, j], 
'd'),\n", 819 | " horizontalalignment=\"center\",\n", 820 | " color=\"white\")" 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "metadata": {}, 826 | "source": [ 827 | "***Wow!!*** Congratulations. You just trained a machine learning model that used real rs-fMRI data to predict the age class of real humans.\n", 828 | "\n", 829 | "It seems that something in this data is systematically related to age ... but what? " 830 | ] 831 | }, 832 | { 833 | "cell_type": "markdown", 834 | "metadata": {}, 835 | "source": [ 836 | "### Interpreting model feature importances\n", 837 | "Interpreting the feature importances of a machine learning model is a real can of worms. This is an area of active research. Unfortunately, it's hard to trust the feature importances of some models. \n", 838 | "\n", 839 | "You can find a whole tutorial on this subject here:\n", 840 | "http://gael-varoquaux.info/interpreting_ml_tuto/index.html\n", 841 | "\n", 842 | "For now, we'll just eschew better judgement and take a look at our feature importances." 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "We can access the feature importances (weights) used by the model" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": null, 855 | "metadata": {}, 856 | "outputs": [], 857 | "source": [ 858 | "l_svc.coef_" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": {}, 864 | "source": [ 865 | "Let's plot these weights to see their distribution better" 866 | ] 867 | }, 868 | { 869 | "cell_type": "code", 870 | "execution_count": null, 871 | "metadata": {}, 872 | "outputs": [], 873 | "source": [ 874 | "plt.bar(range(l_svc.coef_.shape[-1]),l_svc.coef_[0])\n", 875 | "plt.title('feature importances')\n", 876 | "plt.xlabel('feature')\n", 877 | "plt.ylabel('weight')" 878 | ] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": {}, 883 | "source": [ 884 | "Or perhaps it will be easier to 
visualize this information as a matrix similar to the one we started with\n", 885 | "\n", 886 | "We can use the correlation measure from before to perform an inverse transform" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "execution_count": null, 892 | "metadata": {}, 893 | "outputs": [], 894 | "source": [ 895 | "correlation_measure.inverse_transform(l_svc.coef_).shape" 896 | ] 897 | }, 898 | { 899 | "cell_type": "code", 900 | "execution_count": null, 901 | "metadata": {}, 902 | "outputs": [], 903 | "source": [ 904 | "from nilearn import plotting\n", 905 | "\n", 906 | "feat_exp_matrix = correlation_measure.inverse_transform(l_svc.coef_)[0]\n", 907 | "\n", 908 | "plotting.plot_matrix(feat_exp_matrix, figure=(10, 8), \n", 909 | " labels=range(feat_exp_matrix.shape[0]),\n", 910 | " reorder=False,\n", 911 | " tri='lower')" 912 | ] 913 | }, 914 | { 915 | "cell_type": "markdown", 916 | "metadata": {}, 917 | "source": [ 918 | "Let's see if we can throw those features onto an actual brain.\n", 919 | "\n", 920 | "First, we'll need to gather the coordinates of each ROI of our atlas" 921 | ] 922 | }, 923 | { 924 | "cell_type": "code", 925 | "execution_count": null, 926 | "metadata": { 927 | "collapsed": true 928 | }, 929 | "outputs": [], 930 | "source": [ 931 | "coords = plotting.find_parcellation_cut_coords(atlas_filename)" 932 | ] 933 | }, 934 | { 935 | "cell_type": "markdown", 936 | "metadata": {}, 937 | "source": [ 938 | "And now we can use our feature matrix and the wonders of nilearn to create a connectome map where each node is an ROI, and each connection is weighted by the importance of the feature to the model" 939 | ] 940 | }, 941 | { 942 | "cell_type": "code", 943 | "execution_count": null, 944 | "metadata": {}, 945 | "outputs": [], 946 | "source": [ 947 | "plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True)" 948 | ] 949 | }, 950 | { 951 | "cell_type": "markdown", 952 | "metadata": {}, 953 | "source": [ 954 | "Whoa!! 
That's...a lot to process. Maybe let's threshold the edges so that only the most important connections are visualized." 955 | ] 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": null, 960 | "metadata": {}, 961 | "outputs": [], 962 | "source": [ 963 | "plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True, edge_threshold=0.04)" 964 | ] 965 | }, 966 | { 967 | "cell_type": "markdown", 968 | "metadata": {}, 969 | "source": [ 970 | "That's definitely an improvement, but it's still a bit hard to see what's going on.\n", 971 | "Nilearn has a new feature that lets us view this data interactively!" 972 | ] 973 | }, 974 | { 975 | "cell_type": "code", 976 | "execution_count": null, 977 | "metadata": {}, 978 | "outputs": [], 979 | "source": [ 980 | "plotting.view_connectome(feat_exp_matrix, coords, threshold='90%')" 981 | ] 982 | }, 983 | { 984 | "cell_type": "code", 985 | "execution_count": null, 986 | "metadata": { 987 | "collapsed": true 988 | }, 989 | "outputs": [], 990 | "source": [ 991 | "#view = plotting.view_connectome(feat_exp_matrix, coords, threshold='90%') \n", 992 | "#view.open_in_browser() " 993 | ] 994 | } 995 | ], 996 | "metadata": { 997 | "anaconda-cloud": {}, 998 | "kernelspec": { 999 | "display_name": "Python 3", 1000 | "language": "python", 1001 | "name": "python3" 1002 | }, 1003 | "language_info": { 1004 | "codemirror_mode": { 1005 | "name": "ipython", 1006 | "version": 3 1007 | }, 1008 | "file_extension": ".py", 1009 | "mimetype": "text/x-python", 1010 | "name": "python", 1011 | "nbconvert_exporter": "python", 1012 | "pygments_lexer": "ipython3", 1013 | "version": "3.6.2" 1014 | } 1015 | }, 1016 | "nbformat": 4, 1017 | "nbformat_minor": 1 1018 | } 1019 | -------------------------------------------------------------------------------- /model_validation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | 
"metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%matplotlib inline" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## Validation Curves\n", 17 | "\n", 18 | "Let us create an example dataset:" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import numpy as np \n", 28 | "def generating_func(x, err=0.5):\n", 29 | " return np.random.normal(10 - 1. / (x + 0.1), err)\n", 30 | "\n", 31 | "# randomly sample 200 noisy data points\n", 32 | "np.random.seed(1)\n", 33 | "x = np.random.random(size=200)\n", 34 | "y = generating_func(x, err=1.)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "To quantify the bias and variance of a model, it is central to apply it to test data that is sampled from the same distribution as the training data but carries independent noise:" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 5, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "from sklearn.model_selection import train_test_split\n", 51 | "xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.4)" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "Let's visualize the data:" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 14, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "" 70 | ] 71 | }, 72 | "execution_count": 14, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | }, 76 | { 77 | "data": { 78 | "image/png": "\n", 79 | "text/plain": [ 80 | "
" 81 | ] 82 | }, 83 | "metadata": {}, 84 | "output_type": "display_data" 85 | } 86 | ], 87 | "source": [ 88 | "import matplotlib.pyplot as plt\n", 89 | "plt.plot(xtrain,ytrain,'ro')\n", 90 | "plt.plot(xtest,ytest,'bo')\n", 91 | "plt.legend(['train','test'])" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "**Validation curve** A validation curve is obtained by varying a model parameter that controls its complexity (here the degree of the polynomial) and measuring the model's score on both the training data and held-out data (e.g. with cross-validation). The model parameter is then chosen so that the validation score is as high as possible:\n", 99 | "\n", 100 | "We use `sklearn.model_selection.validation_curve()` to compute the training and validation scores, and plot them:" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 16, 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "name": "stderr", 110 | "output_type": "stream", 111 | "text": [ 112 | "/home/pbellec/env/nilearn/lib/python3.5/site-packages/sklearn/model_selection/_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.\n", 113 | " warnings.warn(CV_WARNING, FutureWarning)\n" 114 | ] 115 | }, 116 | { 117 | "data": { 118 | "text/plain": [ 119 | "" 120 | ] 121 | }, 122 | "execution_count": 16, 123 | "metadata": {}, 124 | "output_type": "execute_result" 125 | }, 126 | { 127 | "data": { 128 | "image/png": "\n", 129 | "text/plain": [ 130 | "
" 131 | ] 132 | }, 133 | "metadata": {}, 134 | "output_type": "display_data" 135 | } 136 | ], 137 | "source": [ 138 | "from sklearn.model_selection import validation_curve\n", 139 | "from sklearn.preprocessing import PolynomialFeatures\n", 140 | "from sklearn.linear_model import LinearRegression\n", 141 | "from sklearn.pipeline import make_pipeline\n", 142 | "degrees = np.arange(1, 21)\n", 143 | "\n", 144 | "model = make_pipeline(PolynomialFeatures(), LinearRegression())\n", 145 | "\n", 146 | "# Vary the \"degrees\" on the pipeline step \"polynomialfeatures\"\n", 147 | "train_scores, validation_scores = validation_curve(\n", 148 | " model, x[:, np.newaxis], y,\n", 149 | " param_name='polynomialfeatures__degree',\n", 150 | " param_range=degrees)\n", 151 | "\n", 152 | "# Plot the mean train score and validation score across folds\n", 153 | "plt.plot(degrees, validation_scores.mean(axis=1), label='cross-validation') \n", 154 | "\n", 155 | "plt.plot(degrees, train_scores.mean(axis=1), label='training') \n", 156 | "\n", 157 | "plt.legend(loc='best') " 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "This figure shows why validation is important. On the left side of the plot, we have a very low-degree polynomial, which under-fits the data. This leads to a low explained variance for both the training set and the validation set. On the far right side of the plot, we have a very high-degree polynomial, which over-fits the data. This can be seen in the fact that the training explained variance is very high, while on the validation set it is low. Choosing d around 4 or 5 gets us the best tradeoff. The astute reader will realize that something is amiss here: in the above plot, d = 4 gives the best results. But in the previous plot, we found that d = 6 vastly over-fits the data. What’s going on here? The difference is the number of training points used. In the previous example, there were only eight training points. 
In this example, we have 200 samples, about two thirds of which are used for training in each cross-validation fold. As a general rule of thumb, the more training points are available, the more complex the model that can be used. But how can you determine for a given model whether more training points will be helpful? A useful diagnostic for this is the learning curve." 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "## Learning Curves" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "A learning curve shows the training and validation score as a function of the number of training points. Note that when we train on a subset of the training data, the training score is computed using this subset, not the full training set. This curve gives a quantitative view into how beneficial it will be to add training samples. scikit-learn provides `sklearn.model_selection.learning_curve()`:" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 39, 184 | "metadata": {}, 185 | "outputs": [ 186 | { 187 | "name": "stderr", 188 | "output_type": "stream", 189 | "text": [ 190 | "/home/pbellec/env/nilearn/lib/python3.5/site-packages/sklearn/model_selection/_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.\n", 191 | " warnings.warn(CV_WARNING, FutureWarning)\n" 192 | ] 193 | }, 194 | { 195 | "data": { 196 | "text/plain": [ 197 | "(0, 0.9)" 198 | ] 199 | }, 200 | "execution_count": 39, 201 | "metadata": {}, 202 | "output_type": "execute_result" 203 | }, 204 | { 205 | "data": { 206 | "image/png": "\n", 207 | "text/plain": [ 208 | "
" 209 | ] 210 | }, 211 | "metadata": {}, 212 | "output_type": "display_data" 213 | } 214 | ], 215 | "source": [ 216 | "from sklearn.model_selection import learning_curve\n", 217 | "model.set_params(polynomialfeatures__degree=1)\n", 218 | "train_sizes, train_scores, validation_scores = learning_curve(\n", 219 | " model, x[:, np.newaxis], y, train_sizes=np.logspace(-1, 0, 20))\n", 220 | "\n", 221 | "# Plot the mean train score and validation score across folds\n", 222 | "plt.plot(train_sizes, validation_scores.mean(axis=1), label='cross-validation') \n", 223 | "plt.plot(train_sizes, train_scores.mean(axis=1), label='training') \n", 224 | "plt.title('degree=1')\n", 225 | "plt.ylim(0, 0.9)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "\n", 233 | "Note that the validation score generally increases with a growing training set, while the training score generally decreases with a growing training set. As the training size increases, they will converge to a single value.\n", 234 | "\n", 235 | "From the above discussion, we know that d = 1 is a high-bias estimator which under-fits the data. This is indicated by the fact that both the training and validation scores are low. When confronted with this type of learning curve, we can expect that adding more training data will not help: both lines converge to a relatively low score.\n", 236 | "\n", 237 | "**When the learning curves have converged to a low score, we have a high bias model.**\n", 238 | "\n", 239 | "A high-bias model can be improved by:\n", 240 | "\n", 241 | "Using a more sophisticated model (i.e. in this case, increasing d).\n", 242 | "Gathering more features for each sample.\n", 243 | "Decreasing regularization in a regularized model.\n", 244 | "Increasing the number of samples, however, does not improve a high-bias model.\n", 245 | "\n", 246 | "Now let’s look at a high-variance (i.e. 
over-fit) model:" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": 42, 252 | "metadata": {}, 253 | "outputs": [ 254 | { 255 | "name": "stderr", 256 | "output_type": "stream", 257 | "text": [ 258 | "/home/pbellec/env/nilearn/lib/python3.5/site-packages/sklearn/model_selection/_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.\n", 259 | " warnings.warn(CV_WARNING, FutureWarning)\n" 260 | ] 261 | }, 262 | { 263 | "data": { 264 | "text/plain": [ 265 | "(0, 0.9)" 266 | ] 267 | }, 268 | "execution_count": 42, 269 | "metadata": {}, 270 | "output_type": "execute_result" 271 | }, 272 | { 273 | "data": { 274 | "image/png": "\n", 275 | "text/plain": [ 276 | "
" 277 | ] 278 | }, 279 | "metadata": {}, 280 | "output_type": "display_data" 281 | } 282 | ], 283 | "source": [ 284 | "model.set_params(polynomialfeatures__degree=15)\n", 285 | "train_sizes, train_scores, validation_scores = learning_curve(\n", 286 | " model, x[:, np.newaxis], y, train_sizes=np.logspace(-1, 0, 20))\n", 287 | "\n", 288 | "# Plot the mean train score and validation score across folds\n", 289 | "plt.plot(train_sizes, validation_scores.mean(axis=1), label='cross-validation') \n", 290 | "\n", 291 | "plt.plot(train_sizes, train_scores.mean(axis=1), label='training') \n", 292 | "plt.title('degree=15')\n", 293 | "plt.ylim(0, 0.9)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "Here we show the learning curve for d = 15. From the above discussion, we know that d = 15 is a high-variance estimator which over-fits the data. This is indicated by the fact that the training score is much higher than the validation score. As we add more samples to this training set, the training score will continue to decrease, while the cross-validation score will continue to increase, until they meet in the middle.\n", 301 | "\n", 302 | "\n", 303 | "**Learning curves that have not yet converged with the full training set indicate a high-variance, over-fit model.**\n", 304 | "\n", 305 | "A high-variance model can be improved by:\n", 306 | "\n", 307 | "Gathering more training samples.\n", 308 | "Using a less-sophisticated model (i.e. in this case, making d smaller).\n", 309 | "Increasing regularization.\n", 310 | "In particular, gathering more features for each sample will not help the results."
311 | ] 312 | } 313 | ], 314 | "metadata": { 315 | "kernelspec": { 316 | "display_name": "Python 3", 317 | "language": "python", 318 | "name": "python3" 319 | }, 320 | "language_info": { 321 | "codemirror_mode": { 322 | "name": "ipython", 323 | "version": 3 324 | }, 325 | "file_extension": ".py", 326 | "mimetype": "text/x-python", 327 | "name": "python", 328 | "nbconvert_exporter": "python", 329 | "pygments_lexer": "ipython3", 330 | "version": "3.5.2" 331 | } 332 | }, 333 | "nbformat": 4, 334 | "nbformat_minor": 2 335 | } 336 | --------------------------------------------------------------------------------