├── .gitignore ├── LICENSE ├── README.md ├── command_list.sh ├── data └── README.md ├── imaging_preprocessing_ANTs ├── README.md ├── antsBrainExtraction.sh ├── antsRegistrationSyNQuick.sh ├── mni_icbm152_t1_tal_nlin_sym_09c.nii ├── mni_icbm152_t1_tal_nlin_sym_09c_brain.nii.gz ├── mni_icbm152_t1_tal_nlin_sym_09c_mask.nii └── sge_job_script_ants.sh ├── notebooks └── Boston_feature_importance.ipynb ├── outputs └── README.md ├── requirements.txt └── src ├── README.md ├── comparison ├── README.md ├── comparison_calculate_mean_predictions.py ├── comparison_fs_data_train_gp.py ├── comparison_fs_data_train_rvm.py ├── comparison_fs_data_train_svm.py ├── comparison_pca_data_train_gp.py ├── comparison_pca_data_train_rvm.py ├── comparison_pca_data_train_svm.py ├── comparison_statistical_analysis.py ├── comparison_voxel_data_rvm_relevance_vectors_weights.py ├── comparison_voxel_data_svm_primal_weights.py ├── comparison_voxel_data_train_rvm.py └── comparison_voxel_data_train_svm.py ├── download ├── README.md ├── download_ants_data.py └── download_data.py ├── generalisation ├── README.md ├── generalisation_calculate_mean_predictions.py ├── generalisation_test_fs_data.py ├── generalisation_test_pca_data.py ├── generalisation_test_voxel_data_rvm.py └── generalisation_test_voxel_data_svm.py ├── misc ├── README.md ├── misc_svm_hyperparameters_analysis.py └── misc_univariate_analysis.py ├── preprocessing ├── README.md ├── clean_data.py ├── compute_kernel_matrix.py ├── compute_kernel_matrix_general.py ├── compute_pca_variance_explained.py ├── compute_principal_components.py ├── create_pca_models.py ├── homogenize_gender.py └── quality_control.py ├── sample_size ├── README.md ├── sample_size_create_figures.py ├── sample_size_create_ids.py ├── sample_size_fs_data_gp_analysis.py ├── sample_size_fs_data_rvm_analysis.py ├── sample_size_fs_data_svm_analysis.py ├── sample_size_pca_data_gp_analysis.py ├── sample_size_pca_data_rvm_analysis.py ├── sample_size_pca_data_svm_analysis.py ├── sample_size_voxel_data_rvm_analysis.py └── sample_size_voxel_data_svm_analysis.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Environments 2 | venv/ 3 | 4 | # Pycharm 5 | .idea/ 6 | 7 | # Byte-compiled / optimized / DLL files 8 | __pycache__/ 9 | 10 | # Project's dataset 11 | data/ 12 | !data/README.md -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Jessica Daflon, Lea Baecker, Pedro Ferreira da Costa, and 4 | Walter Hugo Lopez Pinaya 5 | 6 | Permission is hereby granted, free of charge, to any person obtaining a copy 7 | of this software and associated documentation files (the "Software"), to deal 8 | in the Software without restriction, including without limitation the rights 9 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 10 | copies of the Software, and to permit persons to whom the Software is 11 | furnished to do so, subject to the following conditions: 12 | 13 | The above copyright notice and this permission notice shall be included in all 14 | copies or substantial portions of the Software. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 17 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 18 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 19 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 20 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 21 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 22 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Brain age prediction: A comparison between machine learning models using region- and voxel-based morphometric data 2 | [![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg)](https://github.com/MLMH-Lab/Brain-age-prediction/blob/master/LICENSE) 3 | 4 | Official script for the paper "Brain age prediction: A comparison between machine learning models using region- and voxel-based morphometric data". 5 | 6 | ## Abstract 7 | Brain age prediction can be used to detect abnormalities in the ageing trajectory of an individual and their associated health issues. Existing studies on brain age vary widely in terms of their methods and type of data, so at present the most accurate and generalisable methodological approach is unclear. We used the UK Biobank dataset (N = 10,814) to compare the performance of the machine learning models support vector regression (SVR), relevance vector regression (RVR), and Gaussian process regression (GPR) on whole-brain region-based or voxel-based structural Magnetic Resonance Imaging data with or without dimensionality reduction through principal component analysis (PCA). Performance was assessed in the validation set through cross-validation as well as an independent test set. The models achieved mean absolute errors between 3.7 and 4.7 years, with those trained on voxel-level data with PCA performing best. There was little difference in performance between models trained on the same data type, indicating that the type of input data has greater impact on performance than model choice. Furthermore, dataset size analysis revealed that RVR required around half the sample size than SVR and GPR to yield generalisable results (approx. 120 subjects). Our results illustrated that the most suitable methodological approach for a brain age study depends on the sample size and the available computational and time resources. We are making all of our scripts open source in the hope that this will aid future research. 8 | 9 | ## Test our models online 10 | 11 | ## Citation 12 | If you find this code useful for your research, please cite: 13 | 14 | Baecker L, Dafflon J, da Costa PF, Garcia-Dias R, Vieira S, Scarpazza C, Calhoun VD, Sato JR, Mechelli A*, Pinaya WHL* (in press). Brain age prediction: A comparison between machine learning models using region- and voxel-based morphometric data. Human Brain Mapping. 
* These authors contributed equally to this work 15 | -------------------------------------------------------------------------------- /command_list.sh: -------------------------------------------------------------------------------- 1 | ## Initiate virtual environment 2 | #source venv/bin/activate 3 | # 4 | ## Make all files executable 5 | #chmod -R +x ./ 6 | # 7 | export PYTHONPATH=$PYTHONPATH:./src 8 | ## Run python scripts 9 | ## ----------------------------- Getting data ------------------------------------- 10 | ## Download data from network-attached storage (MLMH lab use only) 11 | ./src/download/download_data.py -N "/run/user/1000/gvfs/smb-share:server=kc-deeplab.local,share=deeplearning/" 12 | ./src/download/download_ants_data.py -N "/run/user/1000/gvfs/smb-share:server=kc-deeplab.local,share=deeplearning/" -S "BIOBANK-SCANNER01" -O "/media/kcl_1/SSD2" 13 | ./src/download/download_ants_data.py -N "/run/user/1000/gvfs/smb-share:server=kc-deeplab.local,share=deeplearning/" -S "BIOBANK-SCANNER02" -O "/media/kcl_1/HDD/DATASETS/BIOBANK" 14 | 15 | # ----------------------------- Preprocessing ------------------------------------ 16 | # Clean UK Biobank data 17 | ./src/preprocessing/clean_data.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 18 | ./src/preprocessing/clean_data.py -E "biobank_scanner2" -S "BIOBANK-SCANNER02" 19 | 20 | # Perform quality control 21 | ./src/preprocessing/quality_control.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 22 | ./src/preprocessing/quality_control.py -E "biobank_scanner2" -S "BIOBANK-SCANNER02" 23 | 24 | # Make gender homogeneous along age range 25 | # This was only performed in scanner1 because we were concerned not to create a biased regressor 26 | ./src/preprocessing/homogenize_gender.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 27 | 28 | # Create kernel matrix for voxel-based analysis 29 | ./src/preprocessing/compute_kernel_matrix.py -P "/media/kcl_1/SSD2/BIOBANK" -E "biobank_scanner1" 30 | ./src/preprocessing/compute_kernel_matrix_general.py -P "/media/kcl_1/SSD2/BIOBANK" -E "biobank_scanner1" -P2 "/media/kcl_1/HDD/DATASETS/BIOBANK/BIOBANK" -E2 "biobank_scanner2" 31 | 32 | # Create pca models 33 | ./src/preprocessing/create_pca_models.py -P "/media/kcl_1/SSD2/BIOBANK" -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 34 | ./src/preprocessing/compute_principal_components.py -P "/media/kcl_1/SSD2/BIOBANK" -E "biobank_scanner1" 35 | ./src/preprocessing/compute_principal_components.py -P "/media/kcl_1/HDD/DATASETS/BIOBANK/BIOBANK" -E "biobank_scanner2" -I "cleaned_ids.csv" -S "_general" 36 | 37 | # ----------------------------- Regressor comparison ------------------------------------ 38 | ./src/comparison/comparison_fs_data_train_svm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 39 | ./src/comparison/comparison_fs_data_train_rvm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 40 | ./src/comparison/comparison_fs_data_train_gp.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 41 | 42 | ./src/comparison/comparison_voxel_data_train_svm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 43 | ./src/comparison/comparison_voxel_data_train_rvm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 44 | 45 | ./src/comparison/comparison_pca_data_train_rvm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 46 | ./src/comparison/comparison_pca_data_train_svm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 47 | ./src/comparison/comparison_pca_data_train_gp.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 48 | 49 | ./src/comparison/comparison_statistical_analysis.py -E 
"biobank_scanner1" -S "_all" -M "SVM" "RVM" "GPR" "voxel_SVM" "voxel_RVM" "pca_RVM" "pca_SVM" "pca_GPR" 50 | 51 | ./src/comparison/comparison_voxel_data_svm_primal_weights.py -E "biobank_scanner1" -P "/media/kcl_1/SSD2/BIOBANK" 52 | 53 | ./src/comparison/comparison_voxel_data_rvm_relevance_vectors_weights.py -E "biobank_scanner1" -P "/media/kcl_1/SSD2/BIOBANK" 54 | 55 | ./comparison_feature_importance_visualisation.py 56 | 57 | ## ----------------------------- Generalisation comparison ----------------------- 58 | ./src/generalisation/generalisation_test_fs_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "SVM" -I "cleaned_ids.csv" 59 | ./src/generalisation/generalisation_test_fs_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "RVM" -I "cleaned_ids.csv" 60 | ./src/generalisation/generalisation_test_fs_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "GPR" -I "cleaned_ids.csv" 61 | 62 | ./src/generalisation/generalisation_test_voxel_data_svm.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "voxel_SVM" -P "/media/kcl_1/HDD/DATASETS/BIOBANK/BIOBANK" 63 | ./src/generalisation/generalisation_test_voxel_data_rvm.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "voxel_RVM" -P "/media/kcl_1/HDD/DATASETS/BIOBANK/BIOBANK" 64 | 65 | ./src/generalisation/generalisation_test_pca_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "pca_RVM" -I "cleaned_ids.csv" 66 | ./src/generalisation/generalisation_test_pca_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "pca_SVM" -I "cleaned_ids.csv" 67 | ./src/generalisation/generalisation_test_pca_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "pca_GPR" -I "cleaned_ids.csv" 68 | 69 | ./src/comparison/comparison_statistical_analysis.py -E "biobank_scanner2" -S "_generalization" -M "SVM" "RVM" "GPR" "voxel_SVM" "voxel_RVM" "pca_RVM" "pca_SVM" "pca_GPR" 70 | 71 | # ----------------------------- Training set size analysis ------------------------------------ 72 | ./src/sample_size/sample_size_create_ids.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 73 | 74 | ./src/sample_size/sample_size_fs_data_svm_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -G "biobank_scanner2" -C "BIOBANK-SCANNER02" 75 | ./src/sample_size/sample_size_fs_data_gp_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -G "biobank_scanner2" -C "BIOBANK-SCANNER02" 76 | ./src/sample_size/sample_size_fs_data_rvm_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -G "biobank_scanner2" -C "BIOBANK-SCANNER02" 77 | 78 | ./src/sample_size/sample_size_voxel_data_svm_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -G "biobank_scanner2" -C "BIOBANK-SCANNER02" 79 | ./src/sample_size/sample_size_voxel_data_rvm_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -G "biobank_scanner2" -C "BIOBANK-SCANNER02" 80 | 81 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "SVM" 82 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "RVM" 83 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "GPR" 84 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "pca_RVM" -F 3 -R 10 85 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "pca_SVM" -F 3 -R 10 86 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "pca_GPR" -F 3 -R 10 87 | 
./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "voxel_SVM" 88 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "voxel_RVM" 89 | 90 | # ----------------------------- Miscellaneous ------------------------------------ 91 | # Univariate analysis on FreeSurfer data 92 | ./src/misc/misc_univariate_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 93 | 94 | ./misc_classifier_train_svm.py 95 | ./misc_classifier_regressor_comparison.py 96 | 97 | # Performance of different values of the SVM hyperparameter (C) 98 | ./src/misc/misc_svm_hyperparameters_analysis.py -E "biobank_scanner1" 99 | 100 | # ----------------------------- Exploratory Data Analysis ------------------------------------ 101 | ./src/eda/eda_demographic_data.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -U "_homogenized" -I 'homogenized_ids.csv' 102 | ./src/eda/eda_demographic_data.py -E "biobank_scanner2" -S "BIOBANK-SCANNER02" -U "_cleaned" -I 'cleaned_ids.csv' 103 | ./src/eda/eda_education_age.py -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MLMH-Lab/Brain-age-prediction/91afccf9e96187b51fdc0c710e40d9c393a2ceee/data/README.md -------------------------------------------------------------------------------- /imaging_preprocessing_ANTs/README.md: -------------------------------------------------------------------------------- 1 | # ANTS preprocessing -------------------------------------------------------------------------------- /imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c.nii: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MLMH-Lab/Brain-age-prediction/91afccf9e96187b51fdc0c710e40d9c393a2ceee/imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c.nii -------------------------------------------------------------------------------- /imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c_brain.nii.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MLMH-Lab/Brain-age-prediction/91afccf9e96187b51fdc0c710e40d9c393a2ceee/imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c_brain.nii.gz -------------------------------------------------------------------------------- /imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c_mask.nii: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MLMH-Lab/Brain-age-prediction/91afccf9e96187b51fdc0c710e40d9c393a2ceee/imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c_mask.nii -------------------------------------------------------------------------------- /imaging_preprocessing_ANTs/sge_job_script_ants.sh: -------------------------------------------------------------------------------- 1 | #!/bin/tcsh 2 | #$ -o $HOME/prep_temp/dataset/logs/ 3 | #$ -e $HOME/prep_temp/dataset/logs/ 4 | #$ -q global 5 | #$ -N dataset_job 6 | #$ -l h_vmem=6G 7 | 8 | # First unload any modules loaded by ~/.cshrc then load the defaults 9 | module purge 10 | module load nan/default 11 | module load sge 12 | # Load in script dependent modules here 13 | module load ants/2.2.0 14 | 15 | # set the working variables 16 | # template and mask downloaded from http://nist.mni.mcgill.ca/?p=904 (ICBM 2009c Nonlinear Symmetric) 17 | set ants_template = 
$HOME/prep_temp/dataset/mni_icbm152_t1_tal_nlin_sym_09c.nii 18 | set ants_brain_mask = $HOME/prep_temp/dataset/mni_icbm152_t1_tal_nlin_sym_09c_mask.nii 19 | 20 | setenv working_data $HOME/prep_temp/dataset 21 | setenv sge_index ${working_data}/sge_index 22 | 23 | 24 | # Search the file for the SGE_TASK_ID number as a line number 25 | set file="`awk 'FNR==$SGE_TASK_ID' ${sge_index}`" 26 | 27 | # Used the tcsh :t to find last part of path, 28 | # then :r to remove .gz then :r to remove .nii 29 | set file_name=${file:t:r:r} 30 | 31 | # Based on https://github.com/ANTsX/ANTs/blob/master/Scripts/antsBrainExtraction.sh 32 | # https://github.com/ntustison/BasicBrainMapping 33 | # https://github.com/ntustison/antsBrainExtractionExample/blob/master/antsBrainExtractionCommand.sh 34 | # https://github.com/ANTsX/ANTs/blob/master/Scripts/antsRegistrationSyN.sh 35 | # https://sourceforge.net/p/advants/discussion/840261/thread/ca08a5aa74/?limit=25 36 | 37 | bash antsBrainExtraction.sh \ 38 | -d 3 \ 39 | -a ${file} \ 40 | -e ${ants_template} \ 41 | -m ${ants_brain_mask} \ 42 | -o ${working_data}/subjects_output/${file_name}_ 43 | 44 | bash antsRegistrationSyNQuick.sh \ 45 | #bash antsRegistrationSyN.sh \ 46 | -d 3 \ 47 | -f ${ants_template} \ 48 | -m ${working_data}/subjects_output/${file_name}_BrainExtractionBrain.nii.gz \ 49 | -x ${ants_brain_mask} \ 50 | -t s \ 51 | -o ${working_data}/subjects_output/${file_name}_ 52 | 53 | 54 | rm ${working_data}/subjects_output/${file_name}_InverseWarped.nii.gz 55 | rm ${working_data}/subjects_output/${file_name}_BrainExtractionPrior0GenericAffine.mat 56 | rm ${working_data}/subjects_output/${file_name}_BrainExtractionMask.nii.gz 57 | rm ${working_data}/subjects_output/${file_name}_BrainExtractionBrain.nii.gz 58 | rm ${working_data}/subjects_output/${file_name}_1Warp.nii.gz 59 | rm ${working_data}/subjects_output/${file_name}_1InverseWarp.nii.gz 60 | rm ${working_data}/subjects_output/${file_name}_0GenericAffine.mat 61 | -------------------------------------------------------------------------------- /outputs/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MLMH-Lab/Brain-age-prediction/91afccf9e96187b51fdc0c710e40d9c393a2ceee/outputs/README.md -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | imageio==2.6.1 2 | joblib==0.14.1 3 | matplotlib==3.1.2 4 | nibabel==3.0.0 5 | nilearn==0.6.0 6 | numpy==1.18.1 7 | pandas==0.25.3 8 | scikit-learn==0.22.1 9 | scipy==1.4.1 10 | sklearn-rvm==0.1 11 | tqdm==4.41.1 12 | statsmodels==0.10.2 -------------------------------------------------------------------------------- /src/README.md: -------------------------------------------------------------------------------- 1 | # Source code 2 | 3 | In this directory, we stored all the script files used in our analysis. 4 | These scripts were divided into subdirectories according to their main 5 | functionality: 6 | 7 | 1. [Comparison between machine learning methods](comparison) 8 | 2. [Measuring the generalization of trained regressors](generalisation) 9 | 3. 
[Performance of models by the size of training set](sample_size) 10 | -------------------------------------------------------------------------------- /src/comparison/README.md: -------------------------------------------------------------------------------- 1 | # Comparison of machine learning methods for brain age prediction 2 | 3 | Here, we assessed differences in prediction performance between machine learning approaches 4 | trained using voxel-based or region-based morphometric MRI data 5 | preprocessed with the ANTs and FreeSurfer software, respectively. 6 | In our analysis, we included the methods most commonly used in the brain age literature: 7 | 8 | Using voxel-based data: 9 | 1. [Support Vector Machine]() 10 | 2. [Relevance Vector Machine]() 11 | 12 | Using principal components from voxel-based data: 13 | 1. [Support Vector Machine]() 14 | 2. [Relevance Vector Machine]() 15 | 3. [Gaussian process model]() 16 | 17 | Using region-based data: 18 | 1. [Support Vector Machine]() 19 | 2. [Relevance Vector Machine]() 20 | 3. [Gaussian process model]() 21 | 22 | Each approach has its advantages and weaknesses. 23 | Voxel-based data preserve most of the information in the raw data with minimal preprocessing. 24 | However, this minimal preprocessing may retain noise and information that is irrelevant to 25 | the task of brain age prediction. The presence of irrelevant features can have a negative impact on 26 | performance (as captured by the common machine learning adage 27 | "garbage in, garbage out"). Such features are especially harmful for shallow machine learning methods, 28 | such as those used in this study. 29 | For this reason, feature engineering steps are commonly applied. These steps 30 | can include feature selection, dimensionality reduction, feature extraction, etc. 31 | Here, we performed dimensionality reduction using principal component analysis (PCA). In addition to 32 | this approach, we transformed the raw data using the surface-based morphometry analysis that 33 | the FreeSurfer software offers and worked with features from the 101 selected regions of interest. 34 | 35 | To compare our approaches, we assessed all methods using the same subjects in the training 36 | set and in the test set. These sets were defined using a resampling method, 10 times repeated 10-fold 37 | cross-validation (CV), which resulted in each model being evaluated 100 times. We chose this resampling method 38 | to obtain a more reliable estimate for each approach and to avoid results driven by chance (a lucky selection of the training set). 39 | 40 | Performance metrics were obtained from the test set (to avoid biased results, a problem known as double dipping), 41 | and we used the most common brain age prediction metrics from the literature: 42 | - Mean Absolute Error (MAE) 43 | - Root Mean Squared Error (RMSE) 44 | - R-squared 45 | - Correlation between prediction error and age (or 'age bias') 46 | 47 | Finally, we assessed whether these performance metrics differed significantly between approaches through 48 | statistical testing, using the corrected paired t-test for the hypothesis tests.
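
As an illustration of this test, the sketch below implements a corrected resampled paired t-test in the spirit of Nadeau and Bengio; it is a generic example (not the exact code in `comparison_statistical_analysis.py`) and assumes two arrays holding the per-fold scores of two models over the same 100 CV folds.

```python
import numpy as np
from scipy import stats


def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Corrected resampled paired t-test (Nadeau & Bengio, 2003).

    scores_a, scores_b: per-fold scores (e.g. MAE) of two models obtained
    on the same folds of the 10 times 10-fold CV (100 values each).
    """
    differences = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(differences)
    # The correction term (n_test / n_train) accounts for the overlap
    # between training sets across CV folds.
    corrected_var = (1.0 / n + n_test / n_train) * np.var(differences, ddof=1)
    t_stat = np.mean(differences) / np.sqrt(corrected_var)
    p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 1)
    return t_stat, p_value
```

For 10-fold CV the ratio `n_test / n_train` is 1/9, independent of the total sample size.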
49 | -------------------------------------------------------------------------------- /src/comparison/comparison_calculate_mean_predictions.py: -------------------------------------------------------------------------------- 1 | """Script to create csv file with mean predictions across model repetitions""" 2 | 3 | import pandas as pd 4 | from pathlib import Path 5 | 6 | PROJECT_ROOT = Path.cwd() 7 | 8 | 9 | def main(): 10 | experiment_name = 'biobank_scanner1' 11 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 12 | 13 | model_ls = ['SVM', 'RVM', 'GPR', 14 | 'voxel_SVM', 'voxel_RVM', 15 | 'pca_SVM', 'pca_RVM', 'pca_GPR'] 16 | 17 | # Create df with subject IDs and chronological age 18 | # All mean model predictions will be added to this df in the loop 19 | # Based on an age_predictions csv file from model training to have the 20 | # same order of subjects 21 | example_file = pd.read_csv(experiment_dir / 'SVM' / 'age_predictions.csv') 22 | age_predictions_all = pd.DataFrame(example_file.loc[:, 'image_id':'Age']) 23 | 24 | # Loop over all models, calculate mean predictions across repetitions 25 | for model_name in model_ls: 26 | model_dir = experiment_dir / model_name 27 | file_name = model_dir / 'age_predictions.csv' 28 | try: 29 | model_data = pd.read_csv(file_name) 30 | except FileNotFoundError: 31 | print(f'No age prediction file for {model_name}.') 32 | raise 33 | 34 | repetition_cols = model_data.loc[:, 35 | 'Prediction repetition 00' : 'Prediction repetition 09'] 36 | 37 | # get mean predictions across repetitions 38 | model_data['prediction_mean'] = repetition_cols.mean(axis=1) 39 | 40 | # get those into one file for all models 41 | age_predictions_all[model_name] = model_data['prediction_mean'] 42 | 43 | # Calculate brainAGE for all models and add to age_predictions_all df 44 | # brainAGE = predicted age - chronological age 45 | for model_name in model_ls: 46 | brainage_model = age_predictions_all[model_name] - \ 47 | age_predictions_all['Age'] 48 | brainage_col_name = model_name + '_brainAGE' 49 | age_predictions_all[brainage_col_name] = brainage_model 50 | 51 | # Export age_predictions_all as csv 52 | age_predictions_all.to_csv(experiment_dir / 'age_predictions_allmodels.csv') 53 | 54 | 55 | if __name__ == '__main__': 56 | main() -------------------------------------------------------------------------------- /src/comparison/comparison_fs_data_train_gp.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Gaussian Processes on FreeSurfer data. 3 | 4 | We trained the Gaussian Processes (GP) [1] in 10 repetitions of 5 | 10 stratified k-fold cross-validation (CV) (stratified by age). 6 | 7 | References 8 | ---------- 9 | [1] - Williams, Christopher KI, and Carl Edward Rasmussen. 10 | "Gaussian processes for regression." Advances in neural 11 | information processing systems. 1996. 
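
Examples
--------
A minimal, illustrative use of the regressor trained in this script (a
scikit-learn GaussianProcessRegressor with a linear DotProduct kernel);
the synthetic data below are for demonstration only::

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import DotProduct

    rng = np.random.RandomState(42)
    x = rng.normal(size=(50, 5))                  # 50 subjects, 5 features
    y = x @ np.array([1., 2., 0., 0., 3.]) + 60.  # synthetic "ages"

    model = GaussianProcessRegressor(kernel=DotProduct(), random_state=0)
    model.fit(x, y)
    print(model.predict(x[:3]))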
12 | """ 13 | import argparse 14 | import random 15 | import warnings 16 | from math import sqrt 17 | from pathlib import Path 18 | 19 | import numpy as np 20 | from joblib import dump 21 | from scipy import stats 22 | from sklearn.gaussian_process import GaussianProcessRegressor 23 | from sklearn.gaussian_process.kernels import DotProduct 24 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 25 | from sklearn.model_selection import StratifiedKFold 26 | from sklearn.preprocessing import RobustScaler 27 | 28 | from utils import COLUMNS_NAME, load_freesurfer_dataset 29 | 30 | PROJECT_ROOT = Path.cwd() 31 | 32 | warnings.filterwarnings('ignore') 33 | 34 | parser = argparse.ArgumentParser() 35 | 36 | parser.add_argument('-E', '--experiment_name', 37 | dest='experiment_name', 38 | help='Name of the experiment.') 39 | 40 | parser.add_argument('-S', '--scanner_name', 41 | dest='scanner_name', 42 | help='Name of the scanner.') 43 | 44 | parser.add_argument('-I', '--input_ids_file', 45 | dest='input_ids_file', 46 | default='homogenized_ids.csv', 47 | help='Filename indicating the ids to be used.') 48 | 49 | args = parser.parse_args() 50 | 51 | 52 | def main(experiment_name, scanner_name, input_ids_file): 53 | # ---------------------------------------------------------------------------------------- 54 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 55 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 56 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 57 | ids_path = experiment_dir / input_ids_file 58 | 59 | model_dir = experiment_dir / 'GPR' 60 | model_dir.mkdir(exist_ok=True) 61 | cv_dir = model_dir / 'cv' 62 | cv_dir.mkdir(exist_ok=True) 63 | 64 | dataset = load_freesurfer_dataset(participants_path, ids_path, freesurfer_path) 65 | 66 | # ---------------------------------------------------------------------------------------- 67 | # Initialise random seed 68 | np.random.seed(42) 69 | random.seed(42) 70 | 71 | # Normalise regional volumes by total intracranial volume (tiv) 72 | regions = dataset[COLUMNS_NAME].values 73 | 74 | tiv = dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 75 | 76 | regions_norm = np.true_divide(regions, tiv) 77 | age = dataset['Age'].values 78 | 79 | # CV variables 80 | cv_r = [] 81 | cv_r2 = [] 82 | cv_mae = [] 83 | cv_rmse = [] 84 | cv_age_error_corr = [] 85 | 86 | # Create DataFrame to hold actual and predicted ages 87 | age_predictions = dataset[['image_id', 'Age']] 88 | age_predictions = age_predictions.set_index('image_id') 89 | 90 | n_repetitions = 10 91 | n_folds = 10 92 | 93 | for i_repetition in range(n_repetitions): 94 | # Create new empty column in age_predictions df to save age predictions of this repetition 95 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 96 | age_predictions[repetition_column_name] = np.nan 97 | 98 | # Create 10-fold CV scheme stratified by age 99 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 100 | for i_fold, (train_index, test_index) in enumerate(skf.split(regions_norm, age)): 101 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 102 | 103 | x_train, x_test = regions_norm[train_index], regions_norm[test_index] 104 | y_train, y_test = age[train_index], age[test_index] 105 | 106 | # Scaling using inter-quartile range 107 | scaler = RobustScaler() 108 | x_train = scaler.fit_transform(x_train) 109 | x_test = 
scaler.transform(x_test) 110 | 111 | model = GaussianProcessRegressor(kernel=DotProduct(), random_state=0) 112 | 113 | model.fit(x_train, y_train) 114 | 115 | predictions = model.predict(x_test) 116 | 117 | mae = mean_absolute_error(y_test, predictions) 118 | rmse = sqrt(mean_squared_error(y_test, predictions)) 119 | r, _ = stats.pearsonr(y_test, predictions) 120 | r2 = r2_score(y_test, predictions) 121 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 122 | 123 | cv_r.append(r) 124 | cv_r2.append(r2) 125 | cv_mae.append(mae) 126 | cv_rmse.append(rmse) 127 | cv_age_error_corr.append(age_error_corr) 128 | 129 | # ---------------------------------------------------------------------------------------- 130 | # Save output files 131 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 132 | 133 | # Save scaler and model 134 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 135 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 136 | 137 | # Save model scores 138 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 139 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 140 | 141 | # ---------------------------------------------------------------------------------------- 142 | # Add predictions per test_index to age_predictions 143 | for row, value in zip(test_index, predictions): 144 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 145 | 146 | # Print results of the CV fold 147 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 148 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 149 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 150 | 151 | # Save predictions 152 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 153 | 154 | # Variables for mean scores of performance metrics of CV folds across all repetitions 155 | print('') 156 | print('Mean values:') 157 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 158 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 159 | 160 | 161 | if __name__ == '__main__': 162 | main(args.experiment_name, args.scanner_name, 163 | args.input_ids_file) 164 | -------------------------------------------------------------------------------- /src/comparison/comparison_fs_data_train_rvm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Relevant Vector Machines on FreeSurfer data. 3 | 4 | We trained the Relevant Vector Machines (RVMs) [1] in 10 repetitions of 5 | 10 stratified k-fold cross-validation (stratified by age). 6 | 7 | References 8 | ---------- 9 | [1] - Tipping, Michael E. "The relevance vector machine." 10 | Advances in neural information processing systems. 2000. 
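
Examples
--------
A minimal, illustrative use of the regressor trained in this script (EMRVR
from the sklearn-rvm package, with the same linear-kernel settings as used
below); the synthetic data are for demonstration only::

    import numpy as np
    from sklearn_rvm import EMRVR

    rng = np.random.RandomState(42)
    x = rng.normal(size=(50, 5))                  # 50 subjects, 5 features
    y = x @ np.array([1., 2., 0., 0., 3.]) + 60.  # synthetic "ages"

    model = EMRVR(kernel='linear', threshold_alpha=1e9)
    model.fit(x, y)
    print(model.predict(x[:3]))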
11 | """ 12 | import argparse 13 | import random 14 | import warnings 15 | from math import sqrt 16 | from pathlib import Path 17 | 18 | import numpy as np 19 | from joblib import dump 20 | from scipy import stats 21 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 22 | from sklearn.model_selection import StratifiedKFold 23 | from sklearn.preprocessing import RobustScaler 24 | from sklearn_rvm import EMRVR 25 | 26 | from utils import COLUMNS_NAME, load_freesurfer_dataset 27 | 28 | PROJECT_ROOT = Path.cwd() 29 | 30 | warnings.filterwarnings('ignore') 31 | 32 | parser = argparse.ArgumentParser() 33 | 34 | parser.add_argument('-E', '--experiment_name', 35 | dest='experiment_name', 36 | help='Name of the experiment.') 37 | 38 | parser.add_argument('-S', '--scanner_name', 39 | dest='scanner_name', 40 | help='Name of the scanner.') 41 | 42 | parser.add_argument('-I', '--input_ids_file', 43 | dest='input_ids_file', 44 | default='homogenized_ids.csv', 45 | help='Filename indicating the ids to be used.') 46 | 47 | args = parser.parse_args() 48 | 49 | 50 | def main(experiment_name, scanner_name, input_ids_file): 51 | # ---------------------------------------------------------------------------------------- 52 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 53 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 54 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 55 | ids_path = experiment_dir / input_ids_file 56 | 57 | model_dir = experiment_dir / 'RVM' 58 | model_dir.mkdir(exist_ok=True) 59 | cv_dir = model_dir / 'cv' 60 | cv_dir.mkdir(exist_ok=True) 61 | 62 | dataset = load_freesurfer_dataset(participants_path, ids_path, freesurfer_path) 63 | 64 | # ---------------------------------------------------------------------------------------- 65 | # Initialise random seed 66 | np.random.seed(42) 67 | random.seed(42) 68 | 69 | # Normalise regional volumes by total intracranial volume (tiv) 70 | regions = dataset[COLUMNS_NAME].values 71 | 72 | tiv = dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 73 | 74 | regions_norm = np.true_divide(regions, tiv) 75 | age = dataset['Age'].values 76 | 77 | # Cross validation variables 78 | cv_r = [] 79 | cv_r2 = [] 80 | cv_mae = [] 81 | cv_rmse = [] 82 | cv_age_error_corr = [] 83 | 84 | # Create DataFrame to hold actual and predicted ages 85 | age_predictions = dataset[['image_id', 'Age']] 86 | age_predictions = age_predictions.set_index('image_id') 87 | 88 | n_repetitions = 10 89 | n_folds = 10 90 | 91 | for i_repetition in range(n_repetitions): 92 | # Create new empty column in age_predictions df to save age predictions of this repetition 93 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 94 | age_predictions[repetition_column_name] = np.nan 95 | 96 | # Create 10-fold cross-validation scheme stratified by age 97 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 98 | for i_fold, (train_index, test_index) in enumerate(skf.split(regions_norm, age)): 99 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 100 | 101 | x_train, x_test = regions_norm[train_index], regions_norm[test_index] 102 | y_train, y_test = age[train_index], age[test_index] 103 | 104 | # Scaling using inter-quartile range 105 | scaler = RobustScaler() 106 | x_train = scaler.fit_transform(x_train) 107 | x_test = scaler.transform(x_test) 108 | 109 | model = EMRVR(kernel='linear', 
threshold_alpha=1e9) 110 | 111 | model.fit(x_train, y_train) 112 | 113 | predictions = model.predict(x_test) 114 | 115 | mae = mean_absolute_error(y_test, predictions) 116 | rmse = sqrt(mean_squared_error(y_test, predictions)) 117 | r, _ = stats.pearsonr(y_test, predictions) 118 | r2 = r2_score(y_test, predictions) 119 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 120 | 121 | cv_r.append(r) 122 | cv_r2.append(r2) 123 | cv_mae.append(mae) 124 | cv_rmse.append(rmse) 125 | cv_age_error_corr.append(age_error_corr) 126 | 127 | # ---------------------------------------------------------------------------------------- 128 | # Save output files 129 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 130 | 131 | # Save scaler and model 132 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 133 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 134 | 135 | # Save model scores 136 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 137 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 138 | 139 | # ---------------------------------------------------------------------------------------- 140 | # Add predictions per test_index to age_predictions 141 | for row, value in zip(test_index, predictions): 142 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 143 | 144 | # Print results of the CV fold 145 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 146 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 147 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 148 | 149 | # Save predictions 150 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 151 | 152 | # Variables for mean scores of performance metrics of CV folds across all repetitions 153 | print('') 154 | print('Mean values:') 155 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 156 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 157 | 158 | 159 | if __name__ == '__main__': 160 | main(args.experiment_name, args.scanner_name, 161 | args.input_ids_file) 162 | -------------------------------------------------------------------------------- /src/comparison/comparison_fs_data_train_svm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Support Vector Machines on FreeSurfer data. 3 | 4 | We trained the Support Vector Machines (SVMs) [1] in 10 repetitions of 5 | 10 stratified k-fold cross-validation (CV) (stratified by age). 6 | The hyperparameter tuning was performed in an automatic way using 7 | nested CV. 8 | 9 | References 10 | ---------- 11 | [1] - Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." 12 | Machine learning 20.3 (1995): 273-297. 
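
Examples
--------
A minimal, illustrative sketch of the nested hyperparameter search used in
this script (LinearSVR with the C grid explored via GridSearchCV); the
synthetic data are for demonstration only::

    import numpy as np
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import LinearSVR

    rng = np.random.RandomState(42)
    y = np.repeat([50, 55, 60, 65, 70], 20)     # synthetic integer "ages"
    x = y[:, None] + rng.normal(size=(100, 5))  # 5 noisy features per subject

    search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0,
                          2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]}
    nested_skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    gridsearch = GridSearchCV(LinearSVR(loss='epsilon_insensitive'),
                              param_grid=search_space,
                              scoring='neg_mean_absolute_error',
                              refit=True, cv=nested_skf)
    gridsearch.fit(x, y)
    print(gridsearch.best_params_)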
13 | """ 14 | import argparse 15 | import random 16 | import warnings 17 | from math import sqrt 18 | from pathlib import Path 19 | 20 | import numpy as np 21 | from joblib import dump 22 | from scipy import stats 23 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 24 | from sklearn.model_selection import GridSearchCV 25 | from sklearn.model_selection import StratifiedKFold 26 | from sklearn.preprocessing import RobustScaler 27 | from sklearn.svm import LinearSVR 28 | 29 | from utils import COLUMNS_NAME, load_freesurfer_dataset 30 | 31 | PROJECT_ROOT = Path.cwd() 32 | 33 | warnings.filterwarnings('ignore') 34 | 35 | parser = argparse.ArgumentParser() 36 | 37 | parser.add_argument('-E', '--experiment_name', 38 | dest='experiment_name', 39 | help='Name of the experiment.') 40 | 41 | parser.add_argument('-S', '--scanner_name', 42 | dest='scanner_name', 43 | help='Name of the scanner.') 44 | 45 | parser.add_argument('-I', '--input_ids_file', 46 | dest='input_ids_file', 47 | default='homogenized_ids.csv', 48 | help='Filename indicating the ids to be used.') 49 | 50 | args = parser.parse_args() 51 | 52 | 53 | def main(experiment_name, scanner_name, input_ids_file): 54 | # ---------------------------------------------------------------------------------------- 55 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 56 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 57 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 58 | ids_path = experiment_dir / input_ids_file 59 | 60 | model_dir = experiment_dir / 'SVM' 61 | model_dir.mkdir(exist_ok=True) 62 | cv_dir = model_dir / 'cv' 63 | cv_dir.mkdir(exist_ok=True) 64 | 65 | dataset = load_freesurfer_dataset(participants_path, ids_path, freesurfer_path) 66 | 67 | # ---------------------------------------------------------------------------------------- 68 | # Initialise random seed 69 | np.random.seed(42) 70 | random.seed(42) 71 | 72 | # Normalise regional volumes by total intracranial volume (tiv) 73 | regions = dataset[COLUMNS_NAME].values 74 | 75 | tiv = dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 76 | 77 | regions_norm = np.true_divide(regions, tiv) 78 | age = dataset['Age'].values 79 | 80 | # CV variables 81 | cv_r = [] 82 | cv_r2 = [] 83 | cv_mae = [] 84 | cv_rmse = [] 85 | cv_age_error_corr = [] 86 | 87 | # Create DataFrame to hold actual and predicted ages 88 | age_predictions = dataset[['image_id', 'Age']] 89 | age_predictions = age_predictions.set_index('image_id') 90 | 91 | n_repetitions = 10 92 | n_folds = 10 93 | n_nested_folds = 5 94 | 95 | for i_repetition in range(n_repetitions): 96 | # Create new empty column in age_predictions df to save age predictions of this repetition 97 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 98 | age_predictions[repetition_column_name] = np.nan 99 | 100 | # Create 10-fold CV scheme stratified by age 101 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 102 | for i_fold, (train_index, test_index) in enumerate(skf.split(regions_norm, age)): 103 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 104 | 105 | x_train, x_test = regions_norm[train_index], regions_norm[test_index] 106 | y_train, y_test = age[train_index], age[test_index] 107 | 108 | # Scaling using inter-quartile range 109 | scaler = RobustScaler() 110 | x_train = scaler.fit_transform(x_train) 111 | x_test = scaler.transform(x_test) 112 
| 113 | model_type = LinearSVR(loss='epsilon_insensitive') 114 | 115 | # Systematic search for best hyperparameters 116 | search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0, 2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]} 117 | nested_skf = StratifiedKFold(n_splits=n_nested_folds, shuffle=True, random_state=i_repetition) 118 | gridsearch = GridSearchCV(model_type, 119 | param_grid=search_space, 120 | scoring='neg_mean_absolute_error', 121 | refit=True, cv=nested_skf, 122 | verbose=3, n_jobs=1) 123 | 124 | gridsearch.fit(x_train, y_train) 125 | 126 | model = gridsearch.best_estimator_ 127 | 128 | params_results = {'means': gridsearch.cv_results_['mean_test_score'], 129 | 'params': gridsearch.cv_results_['params']} 130 | 131 | predictions = model.predict(x_test) 132 | 133 | mae = mean_absolute_error(y_test, predictions) 134 | rmse = sqrt(mean_squared_error(y_test, predictions)) 135 | r, _ = stats.pearsonr(y_test, predictions) 136 | r2 = r2_score(y_test, predictions) 137 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 138 | 139 | cv_r.append(r) 140 | cv_r2.append(r2) 141 | cv_mae.append(mae) 142 | cv_rmse.append(rmse) 143 | cv_age_error_corr.append(age_error_corr) 144 | 145 | # ---------------------------------------------------------------------------------------- 146 | # Save output files 147 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 148 | 149 | # Save scaler and model 150 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 151 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 152 | dump(params_results, cv_dir / f'{output_prefix}_params.joblib') 153 | 154 | # Save model scores 155 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 156 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 157 | 158 | # ---------------------------------------------------------------------------------------- 159 | # Add predictions per test_index to age_predictions 160 | for row, value in zip(test_index, predictions): 161 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 162 | 163 | # Print results of the CV fold 164 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 165 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 166 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 167 | 168 | # Save predictions 169 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 170 | 171 | # Variables for mean scores of performance metrics of CV folds across all repetitions 172 | print('') 173 | print('Mean values:') 174 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 175 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 176 | 177 | 178 | if __name__ == '__main__': 179 | main(args.experiment_name, args.scanner_name, 180 | args.input_ids_file) 181 | -------------------------------------------------------------------------------- /src/comparison/comparison_pca_data_train_gp.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Gaussian Processes on voxel-level data 3 | with reduced dimensionality through Principal Component Analysis (PCA). 4 | 5 | We trained the Gaussian Processes (GP) [1] in 10 repetitions of 6 | 10 stratified k-fold cross-validation (CV) (stratified by age). 7 | 8 | References 9 | ---------- 10 | [1] - Williams, Christopher KI, and Carl Edward Rasmussen. 11 | "Gaussian processes for regression." 
Advances in neural 12 | information processing systems. 1996. 13 | """ 14 | import argparse 15 | import random 16 | import warnings 17 | from math import sqrt 18 | from pathlib import Path 19 | 20 | import numpy as np 21 | from joblib import dump 22 | from scipy import stats 23 | from sklearn.gaussian_process import GaussianProcessRegressor 24 | from sklearn.gaussian_process.kernels import DotProduct 25 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 26 | from sklearn.model_selection import StratifiedKFold 27 | from sklearn.preprocessing import RobustScaler 28 | import pandas as pd 29 | from utils import load_demographic_data 30 | PROJECT_ROOT = Path.cwd() 31 | 32 | warnings.filterwarnings('ignore') 33 | 34 | parser = argparse.ArgumentParser() 35 | 36 | parser.add_argument('-E', '--experiment_name', 37 | dest='experiment_name', 38 | help='Name of the experiment.') 39 | 40 | parser.add_argument('-S', '--scanner_name', 41 | dest='scanner_name', 42 | help='Name of the scanner.') 43 | 44 | parser.add_argument('-I', '--input_ids_file', 45 | dest='input_ids_file', 46 | default='homogenized_ids.csv', 47 | help='Filename indicating the ids to be used.') 48 | 49 | args = parser.parse_args() 50 | 51 | def main(experiment_name, scanner_name, input_ids_file): 52 | # ---------------------------------------------------------------------------------------- 53 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 54 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 55 | pca_dir = PROJECT_ROOT / 'outputs' / 'pca' 56 | ids_path = experiment_dir / input_ids_file 57 | 58 | model_dir = experiment_dir / 'pca_GPR' 59 | model_dir.mkdir(exist_ok=True) 60 | cv_dir = model_dir / 'cv' 61 | cv_dir.mkdir(exist_ok=True) 62 | 63 | participants_df = load_demographic_data(participants_path, ids_path) 64 | 65 | # ---------------------------------------------------------------------------------------- 66 | # Initialise random seed 67 | np.random.seed(42) 68 | random.seed(42) 69 | 70 | age = participants_df['Age'].values 71 | 72 | # CV variables 73 | cv_r = [] 74 | cv_r2 = [] 75 | cv_mae = [] 76 | cv_rmse = [] 77 | cv_age_error_corr = [] 78 | 79 | # Create DataFrame to hold actual and predicted ages 80 | age_predictions = participants_df[['image_id', 'Age']] 81 | age_predictions = age_predictions.set_index('image_id') 82 | 83 | n_repetitions = 10 84 | n_folds = 10 85 | 86 | for i_repetition in range(n_repetitions): 87 | # Create new empty column in age_predictions df to save age predictions of this repetition 88 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 89 | age_predictions[repetition_column_name] = np.nan 90 | 91 | # Create 10-fold CV scheme stratified by age 92 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 93 | for i_fold, (train_index, test_index) in enumerate(skf.split(age, age)): 94 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 95 | 96 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 97 | pca_path = pca_dir / f'{output_prefix}_pca_components.csv' 98 | 99 | pca_df = pd.read_csv(pca_path) 100 | pca_df['image_id']=pca_df['image_id'].str.replace('/media/kcl_1/SSD2/BIOBANK/','') #TODO: put this in relation to project_root? 
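            # The PCA components CSV stores each subject's full image path
            # (e.g. '/media/kcl_1/SSD2/BIOBANK/<image_id>_Warped.nii.gz').
            # The line above strips the directory prefix and the line below
            # strips the '_Warped.nii.gz' suffix, leaving bare image IDs that
            # match participants_df for the merge a few lines further down.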
101 | pca_df['image_id']=pca_df['image_id'].str.replace('_Warped.nii.gz', '') 102 | 103 | dataset_df = pd.merge(pca_df, participants_df, on='image_id') 104 | x_values = dataset_df[dataset_df.columns.difference(participants_df.columns)].values 105 | 106 | x_train, x_test = x_values[train_index], x_values[test_index] 107 | y_train, y_test = age[train_index], age[test_index] 108 | 109 | # Scaling using inter-quartile range 110 | scaler = RobustScaler() 111 | x_train = scaler.fit_transform(x_train) 112 | x_test = scaler.transform(x_test) 113 | 114 | model = GaussianProcessRegressor(kernel=DotProduct(), random_state=0) 115 | 116 | model.fit(x_train, y_train) 117 | 118 | predictions = model.predict(x_test) 119 | 120 | mae = mean_absolute_error(y_test, predictions) 121 | rmse = sqrt(mean_squared_error(y_test, predictions)) 122 | r, _ = stats.pearsonr(y_test, predictions) 123 | r2 = r2_score(y_test, predictions) 124 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 125 | 126 | cv_r.append(r) 127 | cv_r2.append(r2) 128 | cv_mae.append(mae) 129 | cv_rmse.append(rmse) 130 | cv_age_error_corr.append(age_error_corr) 131 | 132 | # ---------------------------------------------------------------------------------------- 133 | # Save output files 134 | 135 | # Save scaler and model 136 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 137 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 138 | 139 | # Save model scores 140 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 141 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 142 | 143 | # ---------------------------------------------------------------------------------------- 144 | # Add predictions per test_index to age_predictions 145 | for row, value in zip(test_index, predictions): 146 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 147 | 148 | # Print results of the CV fold 149 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 150 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 151 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 152 | 153 | # Save predictions 154 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 155 | 156 | # Variables for mean scores of performance metrics of CV folds across all repetitions 157 | print('') 158 | print('Mean values:') 159 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 160 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 161 | 162 | 163 | if __name__ == '__main__': 164 | main(args.experiment_name, args.scanner_name, 165 | args.input_ids_file) 166 | -------------------------------------------------------------------------------- /src/comparison/comparison_pca_data_train_rvm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Relevant Vector Machines on voxel-level data 3 | with reduced dimensionality through Principal Component Analysis (PCA). 4 | 5 | 6 | We trained the Relevant Vector Machines (RVMs) [1] in 10 repetitions of 7 | 10 stratified k-fold cross-validation (CV) (stratified by age). 8 | 9 | References 10 | ---------- 11 | [1] - Tipping, Michael E. "The relevance vector machine." 12 | Advances in neural information processing systems. 2000. 
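
Notes
-----
The principal components used by this script are precomputed elsewhere in
the pipeline (see ``src/preprocessing``) and loaded here from per-fold CSV
files. As a generic illustration only, a PCA reduction of voxel data with
scikit-learn could look like::

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(42)
    voxels = rng.normal(size=(100, 5000))    # 100 subjects, 5000 "voxels"

    pca = PCA(n_components=50)               # keep the first 50 components
    components = pca.fit_transform(voxels)   # shape (100, 50)
    print(components.shape, pca.explained_variance_ratio_.sum())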
13 | """ 14 | import argparse 15 | import random 16 | import warnings 17 | from math import sqrt 18 | from pathlib import Path 19 | 20 | import numpy as np 21 | from joblib import dump 22 | from scipy import stats 23 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 24 | from sklearn.model_selection import StratifiedKFold 25 | from sklearn.preprocessing import RobustScaler 26 | from sklearn_rvm import EMRVR 27 | import pandas as pd 28 | from utils import load_demographic_data 29 | PROJECT_ROOT = Path.cwd() 30 | 31 | warnings.filterwarnings('ignore') 32 | 33 | parser = argparse.ArgumentParser() 34 | 35 | parser.add_argument('-E', '--experiment_name', 36 | dest='experiment_name', 37 | help='Name of the experiment.') 38 | 39 | parser.add_argument('-S', '--scanner_name', 40 | dest='scanner_name', 41 | help='Name of the scanner.') 42 | 43 | parser.add_argument('-I', '--input_ids_file', 44 | dest='input_ids_file', 45 | default='homogenized_ids.csv', 46 | help='Filename indicating the ids to be used.') 47 | 48 | args = parser.parse_args() 49 | 50 | def main(experiment_name, scanner_name, input_ids_file): 51 | # ---------------------------------------------------------------------------------------- 52 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 53 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 54 | pca_dir = PROJECT_ROOT / 'outputs' / 'pca' 55 | ids_path = experiment_dir / input_ids_file 56 | 57 | model_dir = experiment_dir / 'pca_RVM' 58 | model_dir.mkdir(exist_ok=True) 59 | cv_dir = model_dir / 'cv' 60 | cv_dir.mkdir(exist_ok=True) 61 | 62 | participants_df = load_demographic_data(participants_path, ids_path) 63 | 64 | # ---------------------------------------------------------------------------------------- 65 | # Initialise random seed 66 | np.random.seed(42) 67 | random.seed(42) 68 | 69 | age = participants_df['Age'].values 70 | 71 | # CV variables 72 | cv_r = [] 73 | cv_r2 = [] 74 | cv_mae = [] 75 | cv_rmse = [] 76 | cv_age_error_corr = [] 77 | 78 | # Create DataFrame to hold actual and predicted ages 79 | age_predictions = participants_df[['image_id', 'Age']] 80 | age_predictions = age_predictions.set_index('image_id') 81 | 82 | n_repetitions = 10 83 | n_folds = 10 84 | 85 | for i_repetition in range(n_repetitions): 86 | # Create new empty column in age_predictions df to save age predictions of this repetition 87 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 88 | age_predictions[repetition_column_name] = np.nan 89 | 90 | # Create 10-fold CV scheme stratified by age 91 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 92 | for i_fold, (train_index, test_index) in enumerate(skf.split(age, age)): 93 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 94 | 95 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 96 | pca_path = pca_dir / f'{output_prefix}_pca_components.csv' 97 | 98 | pca_df = pd.read_csv(pca_path) 99 | pca_df['image_id']=pca_df['image_id'].str.replace('/media/kcl_1/SSD2/BIOBANK/','') #TODO: fix path? 
100 | pca_df['image_id']=pca_df['image_id'].str.replace('_Warped.nii.gz', '') 101 | 102 | dataset_df = pd.merge(pca_df, participants_df, on='image_id') 103 | x_values = dataset_df[dataset_df.columns.difference(participants_df.columns)].values 104 | 105 | x_train, x_test = x_values[train_index], x_values[test_index] 106 | y_train, y_test = age[train_index], age[test_index] 107 | 108 | # Scaling using inter-quartile range 109 | scaler = RobustScaler() 110 | x_train = scaler.fit_transform(x_train) 111 | x_test = scaler.transform(x_test) 112 | 113 | model = EMRVR(kernel='linear', threshold_alpha=1e9) 114 | 115 | model.fit(x_train, y_train) 116 | 117 | predictions = model.predict(x_test) 118 | 119 | mae = mean_absolute_error(y_test, predictions) 120 | rmse = sqrt(mean_squared_error(y_test, predictions)) 121 | r, _ = stats.pearsonr(y_test, predictions) 122 | r2 = r2_score(y_test, predictions) 123 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 124 | 125 | cv_r.append(r) 126 | cv_r2.append(r2) 127 | cv_mae.append(mae) 128 | cv_rmse.append(rmse) 129 | cv_age_error_corr.append(age_error_corr) 130 | 131 | # ---------------------------------------------------------------------------------------- 132 | # Save output files 133 | 134 | # Save scaler and model 135 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 136 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 137 | 138 | # Save model scores 139 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 140 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 141 | 142 | # ---------------------------------------------------------------------------------------- 143 | # Add predictions per test_index to age_predictions 144 | for row, value in zip(test_index, predictions): 145 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 146 | 147 | # Print results of the CV fold 148 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 149 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 150 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 151 | 152 | # Save predictions 153 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 154 | 155 | # Variables for mean scores of performance metrics of CV folds across all repetitions 156 | print('') 157 | print('Mean values:') 158 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 159 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 160 | 161 | 162 | if __name__ == '__main__': 163 | main(args.experiment_name, args.scanner_name, 164 | args.input_ids_file) 165 | -------------------------------------------------------------------------------- /src/comparison/comparison_pca_data_train_svm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Support Vector Machines on voxel-level data 3 | with reduced dimensionality through Principal Component Analysis (PCA). 4 | 5 | We trained the Support Vector Machines (SVMs) [1] in 10 repetitions of 6 | 10 stratified k-fold cross-validation (CV) (stratified by age). 7 | The hyperparameter tuning was performed in an automatic way using 8 | nested CV. 9 | 10 | References 11 | ---------- 12 | [1] - Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." 13 | Machine learning 20.3 (1995): 273-297. 
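
Notes
-----
The grid-search results saved for each CV fold can be inspected after
training; for example (the path assumes the default experiment name
``biobank_scanner1`` and the first repetition/fold)::

    from joblib import load

    params = load('outputs/biobank_scanner1/pca_SVM/cv/00_00_params.joblib')
    # 'means' holds the mean test score per candidate C (negative MAE,
    # so higher is better); 'params' holds the corresponding C values.
    for c_value, mean_score in zip(params['params'], params['means']):
        print(c_value, mean_score)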
14 | """ 15 | import argparse 16 | import random 17 | import warnings 18 | from math import sqrt 19 | from pathlib import Path 20 | 21 | import numpy as np 22 | from joblib import dump 23 | from scipy import stats 24 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 25 | from sklearn.model_selection import GridSearchCV 26 | from sklearn.model_selection import StratifiedKFold 27 | from sklearn.preprocessing import RobustScaler 28 | from sklearn.svm import LinearSVR 29 | import pandas as pd 30 | from utils import load_demographic_data 31 | PROJECT_ROOT = Path.cwd() 32 | 33 | warnings.filterwarnings('ignore') 34 | 35 | parser = argparse.ArgumentParser() 36 | 37 | parser.add_argument('-E', '--experiment_name', 38 | dest='experiment_name', 39 | help='Name of the experiment.') 40 | 41 | parser.add_argument('-S', '--scanner_name', 42 | dest='scanner_name', 43 | help='Name of the scanner.') 44 | 45 | parser.add_argument('-I', '--input_ids_file', 46 | dest='input_ids_file', 47 | default='homogenized_ids.csv', 48 | help='Filename indicating the ids to be used.') 49 | 50 | args = parser.parse_args() 51 | 52 | def main(experiment_name, scanner_name, input_ids_file): 53 | # ---------------------------------------------------------------------------------------- 54 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 55 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 56 | pca_dir = PROJECT_ROOT / 'outputs' / 'pca' 57 | ids_path = experiment_dir / input_ids_file 58 | 59 | model_dir = experiment_dir / 'pca_SVM' 60 | model_dir.mkdir(exist_ok=True) 61 | cv_dir = model_dir / 'cv' 62 | cv_dir.mkdir(exist_ok=True) 63 | 64 | participants_df = load_demographic_data(participants_path, ids_path) 65 | 66 | # ---------------------------------------------------------------------------------------- 67 | # Initialise random seed 68 | np.random.seed(42) 69 | random.seed(42) 70 | 71 | age = participants_df['Age'].values 72 | 73 | # CV variables 74 | cv_r = [] 75 | cv_r2 = [] 76 | cv_mae = [] 77 | cv_rmse = [] 78 | cv_age_error_corr = [] 79 | 80 | # Create DataFrame to hold actual and predicted ages 81 | age_predictions = participants_df[['image_id', 'Age']] 82 | age_predictions = age_predictions.set_index('image_id') 83 | 84 | n_repetitions = 10 85 | n_folds = 10 86 | n_nested_folds = 5 87 | 88 | for i_repetition in range(n_repetitions): 89 | # Create new empty column in age_predictions df to save age predictions of this repetition 90 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 91 | age_predictions[repetition_column_name] = np.nan 92 | 93 | # Create 10-fold CV scheme stratified by age 94 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 95 | for i_fold, (train_index, test_index) in enumerate(skf.split(age, age)): 96 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 97 | 98 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 99 | pca_path = pca_dir / f'{output_prefix}_pca_components.csv' 100 | 101 | pca_df = pd.read_csv(pca_path) 102 | pca_df['image_id']=pca_df['image_id'].str.replace('/media/kcl_1/SSD2/BIOBANK/','') 103 | pca_df['image_id']=pca_df['image_id'].str.replace('_Warped.nii.gz', '') 104 | 105 | dataset_df = pd.merge(pca_df, participants_df, on='image_id') 106 | x_values = dataset_df[dataset_df.columns.difference(participants_df.columns)].values 107 | 108 | x_train, x_test = x_values[train_index], x_values[test_index] 109 | y_train, y_test = 
age[train_index], age[test_index] 110 | 111 | # Scaling using inter-quartile range 112 | scaler = RobustScaler() 113 | x_train = scaler.fit_transform(x_train) 114 | x_test = scaler.transform(x_test) 115 | 116 | model_type = LinearSVR(loss='epsilon_insensitive') 117 | 118 | # Systematic search for best hyperparameters 119 | search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0, 2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]} 120 | nested_skf = StratifiedKFold(n_splits=n_nested_folds, shuffle=True, random_state=i_repetition) 121 | gridsearch = GridSearchCV(model_type, 122 | param_grid=search_space, 123 | scoring='neg_mean_absolute_error', 124 | refit=True, cv=nested_skf, 125 | verbose=3, n_jobs=1) 126 | 127 | gridsearch.fit(x_train, y_train) 128 | 129 | model = gridsearch.best_estimator_ 130 | 131 | params_results = {'means': gridsearch.cv_results_['mean_test_score'], 132 | 'params': gridsearch.cv_results_['params']} 133 | 134 | predictions = model.predict(x_test) 135 | 136 | mae = mean_absolute_error(y_test, predictions) 137 | rmse = sqrt(mean_squared_error(y_test, predictions)) 138 | r, _ = stats.pearsonr(y_test, predictions) 139 | r2 = r2_score(y_test, predictions) 140 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 141 | 142 | cv_r.append(r) 143 | cv_r2.append(r2) 144 | cv_mae.append(mae) 145 | cv_rmse.append(rmse) 146 | cv_age_error_corr.append(age_error_corr) 147 | 148 | # ---------------------------------------------------------------------------------------- 149 | # Save output files 150 | 151 | # Save scaler and model 152 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 153 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 154 | dump(params_results, cv_dir / f'{output_prefix}_params.joblib') 155 | 156 | # Save model scores 157 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 158 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 159 | 160 | # ---------------------------------------------------------------------------------------- 161 | # Add predictions per test_index to age_predictions 162 | for row, value in zip(test_index, predictions): 163 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 164 | 165 | # Print results of the CV fold 166 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 167 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 168 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 169 | 170 | # Save predictions 171 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 172 | 173 | # Variables for mean scores of performance metrics of CV folds across all repetitions 174 | print('') 175 | print('Mean values:') 176 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 177 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 178 | 179 | 180 | if __name__ == '__main__': 181 | main(args.experiment_name, args.scanner_name, 182 | args.input_ids_file) 183 | -------------------------------------------------------------------------------- /src/comparison/comparison_statistical_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to statistically assess the performance of machine learning models; 3 | Specifically, the script: 4 | i) creates a summary of performance scores of the 100 iterations of each model type 5 | ii) compares the performance of models using a version of 6 | the paired Student’s t-test that is corrected 
for the violation of the 7 | independence assumption from repeated k-fold cross-validation 8 | when training the model [1-3] 9 | 10 | References: 11 | ----------- 12 | [1] - https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/ 13 | 14 | [2] - Bouckaert, Remco R., and Eibe Frank. "Evaluating the replicability of significance tests for comparing 15 | learning algorithms." Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, 16 | Heidelberg, 2004. 17 | 18 | [3] - https://github.com/BayesianTestsML/tutorial/blob/9fb0bf75b4435d61d42935be4d0bfafcc43e77b9/Python/bayesiantests.py 19 | """ 20 | import argparse 21 | from pathlib import Path 22 | import itertools 23 | 24 | import numpy as np 25 | import pandas as pd 26 | 27 | from utils import ttest_ind_corrected 28 | 29 | PROJECT_ROOT = Path.cwd() 30 | 31 | parser = argparse.ArgumentParser() 32 | 33 | parser.add_argument('-E', '--experiment_name', 34 | dest='experiment_name', 35 | help='Experiment name where the model predictions are stored.') 36 | 37 | parser.add_argument('-S', '--suffix', 38 | dest='suffix', 39 | help='Suffix to add on the output file regressors_comparison_suffix.csv.') 40 | 41 | parser.add_argument('-M', '--model_list', 42 | dest='model_list', 43 | nargs='+', 44 | help='Names of models to analyse.') 45 | 46 | args = parser.parse_args() 47 | 48 | 49 | def main(experiment_name, suffix, model_list): 50 | # Create summary of the performance scores across the 100 iterations 51 | # of each model type (10 times 10-fold CV) 52 | n_repetitions = 10 53 | n_folds = 10 54 | 55 | for model_name in model_list: 56 | model_dir = PROJECT_ROOT / 'outputs' / experiment_name / model_name 57 | cv_dir = model_dir / 'cv' 58 | 59 | r_list = [] 60 | r2_list = [] 61 | mae_list = [] 62 | rmse_list = [] 63 | age_error_corr_list = [] 64 | 65 | for i_repetition in range(n_repetitions): 66 | for i_fold in range(n_folds): 67 | r, r2, mae, rmse, age_error_corr = np.load(cv_dir / f'{i_repetition:02d}_{i_fold:02d}_scores.npy') 68 | r_list.append(r) 69 | r2_list.append(r2) 70 | mae_list.append(mae) 71 | rmse_list.append(rmse) 72 | age_error_corr_list.append(age_error_corr) 73 | 74 | results = pd.DataFrame(columns=['Measure', 'Value']) 75 | results = results.append({'Measure': 'mean_r', 'Value': np.mean(r_list)}, ignore_index=True) 76 | results = results.append({'Measure': 'std_r', 'Value': np.std(r_list)}, ignore_index=True) 77 | results = results.append({'Measure': 'mean_r2', 'Value': np.mean(r2_list)}, ignore_index=True) 78 | results = results.append({'Measure': 'std_r2', 'Value': np.std(r2_list)}, ignore_index=True) 79 | results = results.append({'Measure': 'mean_mae', 'Value': np.mean(mae_list)}, ignore_index=True) 80 | results = results.append({'Measure': 'std_mae', 'Value': np.std(mae_list)}, ignore_index=True) 81 | results = results.append({'Measure': 'mean_rmse', 'Value': np.mean(rmse_list)}, ignore_index=True) 82 | results = results.append({'Measure': 'std_rmse', 'Value': np.std(rmse_list)}, ignore_index=True) 83 | results = results.append({'Measure': 'mean_age_error_corr', 'Value': np.mean(age_error_corr_list)}, 84 | ignore_index=True) 85 | results = results.append({'Measure': 'std_age_error_corr', 'Value': np.std(age_error_corr_list)}, 86 | ignore_index=True) 87 | 88 | results.to_csv(model_dir / f'{model_name}_scores_summary.csv', index=False) 89 | 90 | # Perform the statistical comparison of the summary performance metrics from different models 91 | combinations = 
list(itertools.combinations(model_list, 2)) 92 | 93 | # Create new significance threshold based on Bonferroni correction for multiple comparisons 94 | corrected_alpha = 0.05 / len(combinations) 95 | 96 | results_df = pd.DataFrame(columns=['regressors', 'p-value', 'stats']) 97 | 98 | for classifier_a, classifier_b in combinations: 99 | classifier_a_dir = PROJECT_ROOT / 'outputs' / experiment_name / classifier_a 100 | classifier_b_dir = PROJECT_ROOT / 'outputs' / experiment_name / classifier_b 101 | 102 | mae_a = [] 103 | mae_b = [] 104 | 105 | for i_repetition in range(n_repetitions): 106 | for i_fold in range(n_folds): 107 | performance_a = np.load(classifier_a_dir / 'cv' / f'{i_repetition:02d}_{i_fold:02d}_scores.npy')[1] 108 | performance_b = np.load(classifier_b_dir / 'cv' / f'{i_repetition:02d}_{i_fold:02d}_scores.npy')[1] 109 | 110 | mae_a.append(performance_a) 111 | mae_b.append(performance_b) 112 | 113 | statistic, pvalue = ttest_ind_corrected(np.asarray(mae_a), np.asarray(mae_b), k=n_folds, r=n_repetitions) 114 | 115 | print(f'{classifier_a} vs. {classifier_b} pvalue: {pvalue:6.3}', end='') 116 | if pvalue <= corrected_alpha: 117 | print('*') 118 | else: 119 | print('') 120 | 121 | results_df = results_df.append({'regressors': f'{classifier_a} vs. {classifier_b}', 122 | 'p-value': pvalue, 123 | 'stats': statistic}, 124 | ignore_index=True) 125 | 126 | results_df.to_csv(PROJECT_ROOT / 'outputs' / experiment_name / f'regressors_comparison{suffix}.csv', 127 | index=False) 128 | 129 | 130 | if __name__ == '__main__': 131 | main(args.experiment_name, args.suffix, args.model_list) 132 | -------------------------------------------------------------------------------- /src/comparison/comparison_voxel_data_rvm_relevance_vectors_weights.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to calculate the relevance vectors of the Relevance Vector Machine 3 | (RVM) approach for voxel-level data. 
4 | 5 | """ 6 | import argparse 7 | import random 8 | from pathlib import Path 9 | 10 | import nibabel as nib 11 | import numpy as np 12 | import pandas as pd 13 | from joblib import load 14 | from nilearn.masking import apply_mask 15 | from tqdm import tqdm 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | parser = argparse.ArgumentParser() 20 | 21 | parser.add_argument('-E', '--experiment_name', 22 | dest='experiment_name', 23 | help='Name of the experiment.') 24 | 25 | parser.add_argument('-P', '--input_path', 26 | dest='input_path_str', 27 | help='Path to the local folder with preprocessed images.') 28 | 29 | parser.add_argument('-I', '--input_ids_file', 30 | dest='input_ids_file', 31 | default='homogenized_ids.csv', 32 | help='File name indicating the subject IDs to be used.') 33 | 34 | parser.add_argument('-D', '--input_data_type', 35 | dest='input_data_type', 36 | default='.nii.gz', 37 | help='Input data type') 38 | 39 | parser.add_argument('-M', '--mask_filename', 40 | dest='mask_filename', 41 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 42 | help='File name of brain mask') 43 | 44 | args = parser.parse_args() 45 | 46 | 47 | def main(experiment_name, input_path_str, input_ids_file, input_data_type, mask_filename): 48 | # ---------------------------------------------------------------------------------------- 49 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 50 | dataset_path = Path(input_path_str) 51 | 52 | model_dir = experiment_dir / 'voxel_RVM' 53 | cv_dir = model_dir / 'cv' 54 | 55 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 56 | ids_df = pd.read_csv(ids_path) 57 | 58 | # Load the mask image 59 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 60 | mask_img = nib.load(str(brain_mask)) 61 | 62 | # Initialise random seed 63 | np.random.seed(42) 64 | random.seed(42) 65 | 66 | n_repetitions = 10 67 | n_folds = 10 68 | coef_list = [] 69 | index_list = [] 70 | 71 | for i_repetition in range(n_repetitions): 72 | for i_fold in range(n_folds): 73 | # Load model 74 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 75 | model = load(cv_dir / f'{prefix}_regressor.joblib') 76 | 77 | # Load train index 78 | train_index = np.load(cv_dir / f'{prefix}_train_index.npy') 79 | 80 | coef_list.append(model.mu_[1:]) 81 | index_list.append(train_index[model.relevance_]) 82 | 83 | # Get the number of voxels in the mask 84 | mask_data = mask_img.get_fdata() 85 | n_voxels = sum(sum(sum(mask_data > 0))) 86 | n_models = 100 87 | weights = np.zeros((n_models, n_voxels)) 88 | 89 | relevance_vector_dict = dict((el, []) for el in range(100)) 90 | 91 | for i, subject_id in enumerate(tqdm(ids_df['image_id'])): 92 | # Check if subject is support vector in any model before loading the image 93 | is_support_vector = False 94 | for support_index in index_list: 95 | if i in support_index: 96 | is_support_vector = True 97 | break 98 | 99 | if is_support_vector == False: 100 | continue 101 | 102 | subject_path = dataset_path / f'{subject_id}_Warped{input_data_type}' 103 | 104 | try: 105 | img = nib.load(str(subject_path)) 106 | except FileNotFoundError: 107 | print(f'No image file {subject_path}.') 108 | raise 109 | 110 | # Extract only the brain voxels. This will create a 1D array. 
111 | img = apply_mask(img, mask_img) 112 | img = np.asarray(img, dtype='float64') 113 | img = np.nan_to_num(img) 114 | 115 | for j, (dual_coef, support_index) in enumerate(zip(coef_list, index_list)): 116 | if i in support_index: 117 | selected_dual_coef = dual_coef[np.argwhere(support_index == i)] 118 | weights[j, :] = weights[j, :] + selected_dual_coef * img 119 | 120 | relevance_vector_dict[j].append(img.astype('float16')) 121 | 122 | coords = np.argwhere(mask_data > 0) 123 | i = 0 124 | for i_repetition in range(n_repetitions): 125 | for i_fold in range(n_folds): 126 | importance_map = np.zeros_like(mask_data) 127 | for xyz, importance in zip(coords, weights[i, :]): 128 | importance_map[tuple(xyz)] = importance 129 | 130 | importance_map_nifti = nib.Nifti1Image(importance_map, mask_img.affine) 131 | nib.save(importance_map_nifti, str(cv_dir / f'{i_repetition:02d}_{i_fold:02d}_importance.nii.gz')) 132 | i = i + 1 133 | 134 | i = 0 135 | for i_repetition in range(n_repetitions): 136 | for i_fold in range(n_folds): 137 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 138 | np.savez_compressed(cv_dir / f'{prefix}_relevance_vectors.npz', 139 | relevance_vectors_=np.array(relevance_vector_dict[i])) 140 | i = i + 1 141 | 142 | 143 | if __name__ == '__main__': 144 | main(args.experiment_name, args.input_path_str, 145 | args.input_ids_file, 146 | args.input_data_type, args.mask_filename) 147 | -------------------------------------------------------------------------------- /src/comparison/comparison_voxel_data_svm_primal_weights.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ Script to calculate the primal weights of the Support Vector Machine 3 | (SVM) approach for voxel-level data. 4 | 5 | """ 6 | import argparse 7 | import random 8 | from pathlib import Path 9 | 10 | import nibabel as nib 11 | import numpy as np 12 | import pandas as pd 13 | from joblib import load 14 | from nilearn.masking import apply_mask 15 | from tqdm import tqdm 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | parser = argparse.ArgumentParser() 20 | 21 | parser.add_argument('-E', '--experiment_name', 22 | dest='experiment_name', 23 | help='Name of the experiment.') 24 | 25 | parser.add_argument('-P', '--input_path', 26 | dest='input_path_str', 27 | help='Path to the local folder with preprocessed images.') 28 | 29 | parser.add_argument('-I', '--input_ids_file', 30 | dest='input_ids_file', 31 | default='homogenized_ids.csv', 32 | help='File name indicating the subject IDs to be used.') 33 | 34 | parser.add_argument('-D', '--input_data_type', 35 | dest='input_data_type', 36 | default='.nii.gz', 37 | help='Input data type') 38 | 39 | parser.add_argument('-M', '--mask_filename', 40 | dest='mask_filename', 41 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 42 | help='Input data type') 43 | 44 | args = parser.parse_args() 45 | 46 | 47 | def main(experiment_name, input_path_str, input_ids_file, input_data_type, mask_filename): 48 | # ---------------------------------------------------------------------------------------- 49 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 50 | dataset_path = Path(input_path_str) 51 | 52 | model_dir = experiment_dir / 'voxel_SVM' 53 | cv_dir = model_dir / 'cv' 54 | 55 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 56 | ids_df = pd.read_csv(ids_path) 57 | 58 | # Load the mask image 59 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 60 | mask_img = 
nib.load(str(brain_mask)) 61 | 62 | # Initialise random seed 63 | np.random.seed(42) 64 | random.seed(42) 65 | 66 | n_repetitions = 10 67 | n_folds = 10 68 | coef_list = [] 69 | index_list = [] 70 | 71 | for i_repetition in range(n_repetitions): 72 | for i_fold in range(n_folds): 73 | # Load model 74 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 75 | model = load(cv_dir / f'{prefix}_regressor.joblib') 76 | 77 | # Load train index 78 | train_index = np.load(cv_dir / f'{prefix}_train_index.npy') 79 | 80 | coef_list.append(model.dual_coef_[0]) 81 | index_list.append(train_index[model.support_]) 82 | 83 | # Get the number of voxels in the mask 84 | mask_data = mask_img.get_fdata() 85 | n_voxels = sum(sum(sum(mask_data > 0))) 86 | n_models = 100 87 | weights = np.zeros((n_models, n_voxels)) 88 | 89 | for i, subject_id in enumerate(tqdm(ids_df['image_id'])): 90 | # Check if subject is a support vector in any model before loading the image. 91 | is_support_vector = False 92 | for support_index in index_list: 93 | if i in support_index: 94 | is_support_vector = True 95 | break 96 | 97 | if not is_support_vector: 98 | continue 99 | 100 | subject_path = dataset_path / f'{subject_id}_Warped{input_data_type}' 101 | 102 | try: 103 | img = nib.load(str(subject_path)) 104 | except FileNotFoundError: 105 | print(f'No image file {subject_path}.') 106 | raise 107 | 108 | # Extract only the brain voxels. This will create a 1D array. 109 | img = apply_mask(img, mask_img) 110 | img = np.asarray(img, dtype='float64') 111 | img = np.nan_to_num(img) 112 | 113 | for j, (dual_coef, support_index) in enumerate(zip(coef_list, index_list)): 114 | if i in support_index: 115 | selected_dual_coef = dual_coef[np.argwhere(support_index == i)] 116 | weights[j, :] = weights[j, :] + selected_dual_coef * img 117 | 118 | coords = np.argwhere(mask_data > 0) 119 | i = 0 120 | for i_repetition in range(n_repetitions): 121 | for i_fold in range(n_folds): 122 | importance_map = np.zeros_like(mask_data) 123 | for xyz, importance in zip(coords, weights[i, :]): 124 | importance_map[tuple(xyz)] = importance 125 | 126 | importance_map_nifti = nib.Nifti1Image(importance_map, mask_img.affine) 127 | nib.save(importance_map_nifti, str(cv_dir / f'{i_repetition:02d}_{i_fold:02d}_importance.nii.gz')) 128 | i = i + 1 129 | 130 | 131 | if __name__ == '__main__': 132 | main(args.experiment_name, args.input_path_str, 133 | args.input_ids_file, 134 | args.input_data_type, args.mask_filename) 135 | -------------------------------------------------------------------------------- /src/comparison/comparison_voxel_data_train_rvm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Relevance Vector Machines on voxel-level data. 3 | 4 | We trained the Relevance Vector Machines (RVMs) [1] in 10 repetitions of 5 | 10 stratified k-fold cross-validation (CV) (stratified by age). 6 | 7 | This script assumes that a kernel has been already pre-computed. 8 | To compute the kernel use the script `src/preprocessing/compute_kernel_matrix.py` 9 | 10 | References 11 | ---------- 12 | [1] - Tipping, Michael E. "The relevance vector machine." 13 | Advances in neural information processing systems. 2000.
14 | """ 15 | import argparse 16 | import random 17 | import warnings 18 | from math import sqrt 19 | from pathlib import Path 20 | 21 | import numpy as np 22 | import pandas as pd 23 | from joblib import dump 24 | from scipy import stats 25 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 26 | from sklearn.model_selection import StratifiedKFold 27 | from sklearn_rvm import EMRVR 28 | 29 | from utils import load_demographic_data 30 | 31 | PROJECT_ROOT = Path.cwd() 32 | 33 | warnings.filterwarnings('ignore') 34 | 35 | parser = argparse.ArgumentParser() 36 | 37 | parser.add_argument('-E', '--experiment_name', 38 | dest='experiment_name', 39 | help='Name of the experiment.') 40 | 41 | parser.add_argument('-S', '--scanner_name', 42 | dest='scanner_name', 43 | help='Name of the scanner.') 44 | 45 | parser.add_argument('-I', '--input_ids_file', 46 | dest='input_ids_file', 47 | default='homogenized_ids.csv', 48 | help='File name indicating the subject IDs to be used.') 49 | 50 | args = parser.parse_args() 51 | 52 | 53 | def main(experiment_name, scanner_name, input_ids_file): 54 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 55 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 56 | ids_path = experiment_dir / input_ids_file 57 | 58 | model_dir = experiment_dir / 'voxel_RVM' 59 | model_dir.mkdir(exist_ok=True) 60 | cv_dir = model_dir / 'cv' 61 | cv_dir.mkdir(exist_ok=True) 62 | 63 | # Load demographics 64 | demographics = load_demographic_data(participants_path, ids_path) 65 | 66 | # Load the Gram matrix 67 | kernel_path = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel.csv' 68 | kernel = pd.read_csv(kernel_path, header=0, index_col=0) 69 | 70 | # ---------------------------------------------------------------------------------------- 71 | # Initialise random seed 72 | np.random.seed(42) 73 | random.seed(42) 74 | 75 | age = demographics['Age'].values 76 | 77 | # CV variables 78 | cv_r = [] 79 | cv_r2 = [] 80 | cv_mae = [] 81 | cv_rmse = [] 82 | cv_age_error_corr = [] 83 | 84 | # Create DataFrame to hold actual and predicted ages 85 | age_predictions = demographics[['image_id', 'Age']] 86 | age_predictions = age_predictions.set_index('image_id') 87 | 88 | n_repetitions = 10 89 | n_folds = 10 90 | 91 | for i_repetition in range(n_repetitions): 92 | # Create new empty column in age_predictions df to save age predictions of this repetition 93 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 94 | age_predictions[repetition_column_name] = np.nan 95 | 96 | # Create 10-fold CV scheme stratified by age 97 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 98 | for i_fold, (train_index, test_index) in enumerate(skf.split(kernel, age)): 99 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 100 | 101 | x_train = kernel.iloc[train_index, train_index].values 102 | x_test = kernel.iloc[test_index, train_index].values 103 | y_train, y_test = age[train_index], age[test_index] 104 | 105 | model = EMRVR(kernel='precomputed', 106 | alpha_max=1e11, threshold_alpha=1e10) 107 | 108 | model.fit(x_train, y_train) 109 | 110 | predictions = model.predict(x_test) 111 | 112 | mae = mean_absolute_error(y_test, predictions) 113 | rmse = sqrt(mean_squared_error(y_test, predictions)) 114 | r, _ = stats.pearsonr(y_test, predictions) 115 | r2 = r2_score(y_test, predictions) 116 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 117 | 118 | cv_r.append(r) 
119 | cv_r2.append(r2) 120 | cv_mae.append(mae) 121 | cv_rmse.append(rmse) 122 | cv_age_error_corr.append(age_error_corr) 123 | 124 | # ---------------------------------------------------------------------------------------- 125 | # Save output files 126 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 127 | 128 | # Save model 129 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 130 | 131 | # Save model scores 132 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 133 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 134 | 135 | # Save train index 136 | np.save(cv_dir / f'{output_prefix}_train_index.npy', train_index) 137 | 138 | # ---------------------------------------------------------------------------------------- 139 | # Add predictions per test_index to age_predictions 140 | for row, value in zip(test_index, predictions): 141 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 142 | 143 | # Print results of the CV fold 144 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 145 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 146 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 147 | 148 | # Save predictions 149 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 150 | 151 | # Variables for mean scores of performance metrics of CV folds across all repetitions 152 | print('') 153 | print('Mean values:') 154 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 155 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 156 | 157 | 158 | if __name__ == '__main__': 159 | main(args.experiment_name, args.scanner_name, 160 | args.input_ids_file) 161 | -------------------------------------------------------------------------------- /src/comparison/comparison_voxel_data_train_svm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Support Vector Machines on voxel-level data. 3 | 4 | We trained the Support Vector Machines (SVMs) [1] in 10 repetitions of 5 | 10 stratified k-fold cross-validation (CV) (stratified by age). 6 | The hyperparameter tuning was performed in an automatic way using 7 | a nested CV. 8 | 9 | This script assumes that a kernel has been already pre-computed. 10 | To compute the kernel use the script `src/preprocessing/compute_kernel_matrix.py` 11 | 12 | References 13 | ---------- 14 | [1] - Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." 15 | Machine learning 20.3 (1995): 273-297. 
16 | """ 17 | import argparse 18 | import random 19 | import warnings 20 | from math import sqrt 21 | from pathlib import Path 22 | 23 | import numpy as np 24 | import pandas as pd 25 | from joblib import dump 26 | from scipy import stats 27 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 28 | from sklearn.model_selection import GridSearchCV 29 | from sklearn.model_selection import StratifiedKFold 30 | from sklearn.svm import SVR 31 | 32 | from utils import load_demographic_data 33 | 34 | PROJECT_ROOT = Path.cwd() 35 | 36 | warnings.filterwarnings('ignore') 37 | 38 | parser = argparse.ArgumentParser() 39 | 40 | parser.add_argument('-E', '--experiment_name', 41 | dest='experiment_name', 42 | help='Name of the experiment.') 43 | 44 | parser.add_argument('-S', '--scanner_name', 45 | dest='scanner_name', 46 | help='Name of the scanner.') 47 | 48 | parser.add_argument('-I', '--input_ids_file', 49 | dest='input_ids_file', 50 | default='homogenized_ids.csv', 51 | help='Filename indicating the ids to be used.') 52 | 53 | args = parser.parse_args() 54 | 55 | 56 | def main(experiment_name, scanner_name, input_ids_file): 57 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 58 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 59 | ids_path = experiment_dir / input_ids_file 60 | 61 | model_dir = experiment_dir / 'voxel_SVM' 62 | model_dir.mkdir(exist_ok=True) 63 | cv_dir = model_dir / 'cv' 64 | cv_dir.mkdir(exist_ok=True) 65 | 66 | # Load demographics 67 | demographics = load_demographic_data(participants_path, ids_path) 68 | 69 | # Load the Gram matrix 70 | kernel_path = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel.csv' 71 | kernel = pd.read_csv(kernel_path, header=0, index_col=0) 72 | 73 | # ---------------------------------------------------------------------------------------- 74 | # Initialise random seed 75 | np.random.seed(42) 76 | random.seed(42) 77 | 78 | age = demographics['Age'].values 79 | 80 | # CV variables 81 | cv_r = [] 82 | cv_r2 = [] 83 | cv_mae = [] 84 | cv_rmse = [] 85 | cv_age_error_corr = [] 86 | 87 | # Create DataFrame to hold actual and predicted ages 88 | age_predictions = demographics[['image_id', 'Age']] 89 | age_predictions = age_predictions.set_index('image_id') 90 | 91 | n_repetitions = 10 92 | n_folds = 10 93 | n_nested_folds = 5 94 | 95 | for i_repetition in range(n_repetitions): 96 | # Create new empty column in age_predictions df to save age predictions of this repetition 97 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 98 | age_predictions[repetition_column_name] = np.nan 99 | 100 | # Create 10-fold CV scheme stratified by age 101 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 102 | for i_fold, (train_index, test_index) in enumerate(skf.split(age, age)): 103 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 104 | 105 | x_train = kernel.iloc[train_index, train_index].values 106 | x_test = kernel.iloc[test_index, train_index].values 107 | y_train, y_test = age[train_index], age[test_index] 108 | 109 | model_type = SVR(kernel='precomputed') 110 | 111 | # Systematic search for best hyperparameters 112 | search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0, 2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]} 113 | nested_skf = StratifiedKFold(n_splits=n_nested_folds, shuffle=True, 114 | random_state=i_repetition) 115 | gridsearch = GridSearchCV(model_type, 116 | param_grid=search_space, 117 | 
scoring='neg_mean_absolute_error', 118 | refit=True, cv=nested_skf, 119 | verbose=3, n_jobs=1) 120 | 121 | gridsearch.fit(x_train, y_train) 122 | 123 | model = gridsearch.best_estimator_ 124 | 125 | params_results = {'means': gridsearch.cv_results_['mean_test_score'], 126 | 'params': gridsearch.cv_results_['params']} 127 | 128 | predictions = model.predict(x_test) 129 | 130 | mae = mean_absolute_error(y_test, predictions) 131 | rmse = sqrt(mean_squared_error(y_test, predictions)) 132 | r, _ = stats.pearsonr(y_test, predictions) 133 | r2 = r2_score(y_test, predictions) 134 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 135 | 136 | cv_r.append(r) 137 | cv_r2.append(r2) 138 | cv_mae.append(mae) 139 | cv_rmse.append(rmse) 140 | cv_age_error_corr.append(age_error_corr) 141 | 142 | # ---------------------------------------------------------------------------------------- 143 | # Save output files 144 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 145 | 146 | # Save scaler and model 147 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 148 | dump(params_results, cv_dir / f'{output_prefix}_params.joblib') 149 | 150 | # Save model scores 151 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 152 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 153 | 154 | # Save train index 155 | np.save(cv_dir / f'{output_prefix}_train_index.npy', train_index) 156 | 157 | # ---------------------------------------------------------------------------------------- 158 | # Add predictions per test_index to age_predictions 159 | for row, value in zip(test_index, predictions): 160 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 161 | 162 | # Print results of the CV fold 163 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 164 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 165 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 166 | 167 | # Save predictions 168 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 169 | 170 | # Variables for mean scores of performance metrics of CV folds across all repetitions 171 | print('') 172 | print('Mean values:') 173 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 174 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 175 | 176 | 177 | if __name__ == '__main__': 178 | main(args.experiment_name, args.scanner_name, 179 | args.input_ids_file) 180 | -------------------------------------------------------------------------------- /src/download/README.md: -------------------------------------------------------------------------------- 1 | # Download 2 | 3 | The scripts in this folder were used to download the data 4 | (i.e., `participants.tsv`, `freesurferData.csv`, `mriqc_prob.csv`, 5 | `qoala_prob.csv`, imaging_preprocessing_ANTs files) from the Network 6 | Attached Storage System. The scripts are included in this repo to illustrate the 7 | structure in which the files were stored. 8 | -------------------------------------------------------------------------------- /src/download/download_ants_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script used to download the ANTs data from the storage server. 3 | 4 | Script to download all the UK BIOBANK files preprocessed using the 5 | scripts available at the imaging_preprocessing_ANTs folder. 
6 | 7 | NOTE: Only for internal use at the Machine Learning in Mental Health Lab. 8 | """ 9 | import argparse 10 | from pathlib import Path 11 | from shutil import copyfile 12 | 13 | PROJECT_ROOT = Path.cwd() 14 | 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('-N', '--nas_path', 18 | dest='nas_path_str', 19 | help='Path to the Network Attached Storage system.') 20 | 21 | parser.add_argument('-S', '--scanner_name', 22 | dest='scanner_name', 23 | help='Name of the scanner.') 24 | 25 | parser.add_argument('-O', '--output_path', 26 | dest='output_path_str', 27 | help='Path to the local output folder.') 28 | 29 | args = parser.parse_args() 30 | 31 | 32 | def main(nas_path_str, scanner_name, output_path_str): 33 | """Perform download of selected datasets from the network-attached storage.""" 34 | nas_path = Path(nas_path_str) 35 | output_path = Path(output_path_str) 36 | 37 | dataset_name = 'BIOBANK' 38 | 39 | dataset_output_path = output_path / dataset_name 40 | dataset_output_path.mkdir(exist_ok=True) 41 | 42 | selected_path = nas_path / 'ANTS_NonLinear_preprocessed' / dataset_name / scanner_name 43 | 44 | for file_path in selected_path.glob('*.nii.gz'): 45 | print(file_path) 46 | copyfile(str(file_path), str(dataset_output_path / file_path.name)) 47 | 48 | 49 | if __name__ == '__main__': 50 | main(args.nas_path_str, args.scanner_name, args.output_path_str) 51 | -------------------------------------------------------------------------------- /src/download/download_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script used to download the demographic, FreeSurfer, and quality metrics data from the storage server. 3 | 4 | Script to download all the participants.tsv, freesurferData.csv, 5 | and quality metrics into the data folder. 6 | 7 | NOTE: Only for internal use at the Machine Learning in Mental Health Lab. 8 | """ 9 | import argparse 10 | from pathlib import Path 11 | from shutil import copyfile 12 | 13 | PROJECT_ROOT = Path.cwd() 14 | 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('-N', '--nas_path', 18 | dest='nas_path_str', 19 | help='Path to the Network Attached Storage system.') 20 | 21 | args = parser.parse_args() 22 | 23 | 24 | def download_files(data_dir, selected_path, dataset_prefix_path, nas_path): 25 | """Download the files necessary for the study from network-attached storage. 26 | These files include: 27 | - participants.tsv: Demographic data 28 | - freesurferData.csv: Neuroimaging data 29 | - mriqc_prob.csv: Raw data quality metrics 30 | - qoala_prob.csv: Freesurfer data quality metrics 31 | 32 | Parameters 33 | ---------- 34 | data_dir: PosixPath 35 | Path indicating local path to store the data. 36 | selected_path: PosixPath 37 | Path indicating external path with the data. 38 | dataset_prefix_path: str 39 | Dataset prefix path. 40 | nas_path: PosixPath 41 | Path indicating NAS system.
42 | """ 43 | 44 | dataset_path = data_dir / dataset_prefix_path 45 | dataset_path.mkdir(exist_ok=True, parents=True) 46 | 47 | copyfile(str(selected_path / 'participants.tsv'), 48 | str(dataset_path / 'participants.tsv')) 49 | 50 | try: 51 | copyfile(str(nas_path / 'FreeSurfer_preprocessed' / dataset_prefix_path / 'freesurferData.csv'), 52 | str(dataset_path / 'freesurferData.csv')) 53 | except: 54 | print(f'{dataset_prefix_path} does not have freesurferData.csv') 55 | 56 | try: 57 | mriqc_prob_path = next((nas_path / 'MRIQC' / dataset_prefix_path).glob('*unseen_pred.csv')) 58 | copyfile(str(mriqc_prob_path), str(dataset_path / 'mriqc_prob.csv')) 59 | except: 60 | print(f'{dataset_prefix_path} does not have *unseen_pred.csv') 61 | 62 | try: 63 | qoala_prob_path = next((nas_path / 'Qoala' / dataset_prefix_path).glob('Qoala*')) 64 | copyfile(str(qoala_prob_path), str(dataset_path / 'qoala_prob.csv')) 65 | except: 66 | print(f'{dataset_prefix_path} does not have Qoala*') 67 | 68 | 69 | def main(nas_path_str): 70 | """Perform download of selected datasets from the network-attached storage.""" 71 | nas_path = Path(nas_path_str) 72 | data_dir = PROJECT_ROOT / 'data' 73 | 74 | dataset_name = 'BIOBANK' 75 | selected_path = nas_path / 'BIDS_data' / dataset_name 76 | 77 | for subdirectory_selected_path in selected_path.iterdir(): 78 | if not subdirectory_selected_path.is_dir(): 79 | continue 80 | 81 | print(subdirectory_selected_path) 82 | 83 | scanner_name = subdirectory_selected_path.stem 84 | if (subdirectory_selected_path / 'participants.tsv').is_file(): 85 | download_files(data_dir, subdirectory_selected_path, dataset_name + 86 | '/' + scanner_name, nas_path) 87 | 88 | 89 | if __name__ == '__main__': 90 | main(args.nas_path_str) 91 | -------------------------------------------------------------------------------- /src/generalisation/README.md: -------------------------------------------------------------------------------- 1 | # Measuring the generalization of trained regressors 2 | 3 | The scripts in this folder are used to test the generalization performance of 4 | the models trained in the scripts in the 'comparison' subdirectory. This means 5 | that the models are applied to a new, independent dataset that was acquired 6 | on a different MRI scanner. 7 | 8 | Measuring generalization performance in an independent dataset eliminates 9 | sample bias from the performance measures, and it provides a more realistic 10 | representation of brain age as a biomarker in clinical practice or the like. 
-------------------------------------------------------------------------------- /src/generalisation/generalisation_calculate_mean_predictions.py: -------------------------------------------------------------------------------- 1 | """Script to create csv file with mean predictions across model repetitions""" 2 | 3 | import pandas as pd 4 | from pathlib import Path 5 | PROJECT_ROOT = Path.cwd() 6 | 7 | def main(): 8 | experiment_name = 'biobank_scanner2' 9 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 10 | 11 | model_ls = ['SVM', 'RVM', 'GPR', 12 | 'voxel_SVM', 'voxel_RVM', 13 | 'pca_SVM', 'pca_RVM', 'pca_GPR'] 14 | 15 | # Create df with subject IDs and chronological age 16 | # All mean model predictions will be added to this df in the loop 17 | # Based on an age_predictions csv file from model training to have the 18 | # same order of subjects 19 | example_file = pd.read_csv(experiment_dir / 'SVM' / 'age_predictions_test.csv') 20 | age_predictions_all = pd.DataFrame(example_file.loc[:, 'image_id':'Age']) 21 | 22 | # Loop over all models, calculate mean predictions across repetitions 23 | for model_name in model_ls: 24 | model_dir = experiment_dir / model_name 25 | file_name = model_dir / 'age_predictions_test.csv' 26 | try: 27 | model_data = pd.read_csv(file_name) 28 | except FileNotFoundError: 29 | print(f'No age prediction file for {model_name}.') 30 | raise 31 | 32 | repetition_cols = model_data.loc[:, 33 | 'Prediction 00_00' : 'Prediction 09_09'] 34 | 35 | # get mean predictions across repetitions 36 | model_data['prediction_mean'] = repetition_cols.mean(axis=1) 37 | 38 | # get those into one file for all models 39 | age_predictions_all[model_name] = model_data['prediction_mean'] 40 | 41 | # Calculate brainAGE for all models and add to age_predictions_all df 42 | # brainAGE = predicted age - chronological age 43 | for model_name in model_ls: 44 | brainage_model = age_predictions_all[model_name] - \ 45 | age_predictions_all['Age'] 46 | brainage_col_name = model_name + '_brainAGE' 47 | age_predictions_all[brainage_col_name] = brainage_model 48 | 49 | # Export age_predictions_all as csv 50 | age_predictions_all.to_csv(experiment_dir / 'age_predictions_test_allmodels.csv') 51 | 52 | 53 | if __name__ == '__main__': 54 | main() -------------------------------------------------------------------------------- /src/generalisation/generalisation_test_fs_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Tests models developed using FreeSurfer data from Biobank Scanner1 4 | on previously unseen data from Biobank Scanner2 to predict brain age. 5 | 6 | The script loops over the 100 models created in comparison_fs_data_train_svm.py 7 | and comparison_fs_data_train_rvm.py, loads their regressors, applies 8 | them to the Scanner2 data and saves all predictions per subjects 9 | in age_predictions_test.csv. 
10 | """ 11 | import argparse 12 | import random 13 | from math import sqrt 14 | from pathlib import Path 15 | 16 | import numpy as np 17 | from joblib import load 18 | from scipy import stats 19 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 20 | 21 | from utils import COLUMNS_NAME, load_freesurfer_dataset 22 | 23 | PROJECT_ROOT = Path.cwd() 24 | 25 | parser = argparse.ArgumentParser() 26 | 27 | parser.add_argument('-T', '--training_experiment_name', 28 | dest='training_experiment_name', 29 | help='Name of the experiment.') 30 | 31 | parser.add_argument('-G', '--test_experiment_name', 32 | dest='test_experiment_name', 33 | help='Name of the experiment.') 34 | 35 | parser.add_argument('-S', '--scanner_name', 36 | dest='scanner_name', 37 | help='Name of the scanner.') 38 | 39 | parser.add_argument('-M', '--model_name', 40 | dest='model_name', 41 | help='Name of the model.') 42 | 43 | parser.add_argument('-I', '--input_ids_file', 44 | dest='input_ids_file', 45 | default='cleaned_ids.csv', 46 | help='Filename indicating the ids to be used.') 47 | 48 | args = parser.parse_args() 49 | 50 | 51 | def main(training_experiment_name, test_experiment_name, scanner_name, model_name, input_ids_file): 52 | # ---------------------------------------------------------------------------------------- 53 | training_experiment_dir = PROJECT_ROOT / 'outputs' / training_experiment_name 54 | test_experiment_dir = PROJECT_ROOT / 'outputs' / test_experiment_name 55 | 56 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 57 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 58 | ids_path = PROJECT_ROOT / 'outputs' / test_experiment_name / input_ids_file 59 | 60 | # Create experiment's output directory 61 | test_model_dir = test_experiment_dir / model_name 62 | test_model_dir.mkdir(exist_ok=True) 63 | 64 | training_cv_dir = training_experiment_dir / model_name / 'cv' 65 | test_cv_dir = test_model_dir / 'cv' 66 | test_cv_dir.mkdir(exist_ok=True) 67 | 68 | dataset = load_freesurfer_dataset(participants_path, ids_path, freesurfer_path) 69 | 70 | # ---------------------------------------------------------------------------------------- 71 | # Initialise random seed 72 | np.random.seed(42) 73 | random.seed(42) 74 | 75 | # Normalise regional volumes in testing dataset by total intracranial volume (tiv) 76 | regions = dataset[COLUMNS_NAME].values 77 | 78 | tiv = dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 79 | 80 | regions_norm = np.true_divide(regions, tiv) 81 | age = dataset['Age'].values 82 | 83 | # Create dataframe to hold actual and predicted ages 84 | age_predictions = dataset[['image_id', 'Age']] 85 | age_predictions = age_predictions.set_index('image_id') 86 | 87 | n_repetitions = 10 88 | n_folds = 10 89 | 90 | for i_repetition in range(n_repetitions): 91 | for i_fold in range(n_folds): 92 | 93 | # Load model and scaler 94 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 95 | 96 | model = load(training_cv_dir / f'{prefix}_regressor.joblib') 97 | scaler = load(training_cv_dir / f'{prefix}_scaler.joblib') 98 | 99 | # Use RobustScaler to transform testing data 100 | x_test = scaler.transform(regions_norm) 101 | 102 | # Apply model to scaled data 103 | predictions = model.predict(x_test) 104 | 105 | mae = mean_absolute_error(age, predictions) 106 | rmse = sqrt(mean_squared_error(age, predictions)) 107 | r, _ = stats.pearsonr(age, predictions) 108 | r2 = r2_score(age, predictions) 109 | 
age_error_corr, _ = stats.spearmanr((predictions - age), age) 110 | 111 | # Save prediction per model in df 112 | age_predictions[f'Prediction {i_repetition:02d}_{i_fold:02d}'] = predictions 113 | 114 | # Save model scores 115 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 116 | np.save(test_cv_dir / f'{prefix}_scores.npy', scores_array) 117 | 118 | # Save predictions 119 | age_predictions.to_csv(test_model_dir / 'age_predictions_test.csv') 120 | 121 | 122 | if __name__ == '__main__': 123 | main(args.training_experiment_name, args.test_experiment_name, 124 | args.scanner_name, args.model_name, 125 | args.input_ids_file) 126 | -------------------------------------------------------------------------------- /src/generalisation/generalisation_test_pca_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Tests models trained using voxel-level data from Biobank Scanner1 4 | with reduced dimensionality through Principal Component Analysis (PCA), 5 | on previously unseen data from Biobank Scanner2 to predict brain age. 6 | 7 | The script loops over the 100 models created in comparison_pca_data_train_svm.py 8 | and comparison_pca_data_train_rvm.py, loads their regressors, applies them to the 9 | Scanner2 data and saves all predictions per subjects in age_predictions_test.csv 10 | """ 11 | import argparse 12 | import random 13 | from math import sqrt 14 | from pathlib import Path 15 | 16 | import numpy as np 17 | from joblib import load 18 | from scipy import stats 19 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 20 | from utils import load_demographic_data 21 | import pandas as pd 22 | from tqdm import tqdm 23 | 24 | PROJECT_ROOT = Path.cwd() 25 | 26 | parser = argparse.ArgumentParser() 27 | 28 | parser.add_argument('-T', '--training_experiment_name', 29 | dest='training_experiment_name', 30 | help='Name of the experiment.') 31 | 32 | parser.add_argument('-G', '--test_experiment_name', 33 | dest='test_experiment_name', 34 | help='Name of the experiment.') 35 | 36 | parser.add_argument('-S', '--scanner_name', 37 | dest='scanner_name', 38 | help='Name of the scanner.') 39 | 40 | parser.add_argument('-M', '--model_name', 41 | dest='model_name', 42 | help='Name of the model.') 43 | 44 | parser.add_argument('-I', '--input_ids_file', 45 | dest='input_ids_file', 46 | default='cleaned_ids.csv', 47 | help='Filename indicating the ids to be used.') 48 | 49 | args = parser.parse_args() 50 | 51 | 52 | def main(training_experiment_name, test_experiment_name, scanner_name, model_name, input_ids_file): 53 | # ---------------------------------------------------------------------------------------- 54 | training_experiment_dir = PROJECT_ROOT / 'outputs' / training_experiment_name 55 | test_experiment_dir = PROJECT_ROOT / 'outputs' / test_experiment_name 56 | 57 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 58 | pca_dir = PROJECT_ROOT / 'outputs' / 'pca' 59 | ids_path = PROJECT_ROOT / 'outputs' / test_experiment_name / input_ids_file 60 | # Create experiment's output directory 61 | test_model_dir = test_experiment_dir / model_name 62 | test_model_dir.mkdir(exist_ok=True) 63 | 64 | training_cv_dir = training_experiment_dir / model_name / 'cv' 65 | test_cv_dir = test_model_dir / 'cv' 66 | test_cv_dir.mkdir(exist_ok=True) 67 | 68 | participants_df = load_demographic_data(participants_path, ids_path) 69 | 70 | # 
---------------------------------------------------------------------------------------- 71 | # Initialise random seed 72 | np.random.seed(42) 73 | random.seed(42) 74 | 75 | # Normalise regional volumes in testing dataset by total intracranial volume (tiv) 76 | age = participants_df['Age'].values 77 | 78 | # Create dataframe to hold actual and predicted ages 79 | age_predictions = participants_df[['image_id', 'Age']] 80 | age_predictions = age_predictions.set_index('image_id') 81 | 82 | n_repetitions = 10 83 | n_folds = 10 84 | 85 | for i_repetition in range(n_repetitions): 86 | print(f'Repetition : {i_repetition}') 87 | for i_fold in tqdm(range(n_folds)): 88 | 89 | # Load model and scaler 90 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 91 | 92 | model = load(training_cv_dir / f'{prefix}_regressor.joblib') 93 | scaler = load(training_cv_dir / f'{prefix}_scaler.joblib') 94 | 95 | # Use RobustScaler to transform testing data 96 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 97 | pca_path = pca_dir / f'{output_prefix}_pca_components_general.csv' 98 | 99 | pca_df = pd.read_csv(pca_path) 100 | pca_df['image_id']=pca_df['image_id'].str.replace('/media/kcl_1/HDD/DATASETS/BIOBANK/BIOBANK/','') 101 | pca_df['image_id']=pca_df['image_id'].str.replace('_Warped.nii.gz', '') 102 | 103 | dataset_df = pd.merge(pca_df, participants_df, on='image_id') 104 | pca_components = dataset_df[dataset_df.columns.difference(participants_df.columns)].values 105 | x_test = scaler.transform(pca_components) 106 | 107 | # Apply model to scaled data and measure error 108 | predictions = model.predict(x_test) 109 | 110 | mae = mean_absolute_error(age, predictions) 111 | rmse = sqrt(mean_squared_error(age, predictions)) 112 | r, _ = stats.pearsonr(age, predictions) 113 | r2 = r2_score(age, predictions) 114 | age_error_corr, _ = stats.spearmanr((predictions - age), age) 115 | 116 | # Save prediction per model in df 117 | age_predictions[f'Prediction {i_repetition:02d}_{i_fold:02d}'] = predictions 118 | 119 | # Save model scores 120 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 121 | np.save(test_cv_dir / f'{prefix}_scores.npy', scores_array) 122 | 123 | # Save predictions 124 | age_predictions.to_csv(test_model_dir / 'age_predictions_test.csv') 125 | 126 | 127 | if __name__ == '__main__': 128 | main(args.training_experiment_name, args.test_experiment_name, 129 | args.scanner_name, args.model_name, 130 | args.input_ids_file) 131 | -------------------------------------------------------------------------------- /src/generalisation/generalisation_test_voxel_data_rvm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Tests RVM models developed using voxel data from Biobank Scanner1 4 | on previously unseen data from Biobank Scanner2 to predict brain age. 5 | 6 | The script loops over the 100 RVM models created in comparison_voxel_data_train_rvm.py, 7 | loads their regressors, applies them to the Scanner2 data and saves all predictions 8 | per subjects in age_predictions_test.csv. 
9 | """ 10 | import argparse 11 | import random 12 | from math import sqrt 13 | from pathlib import Path 14 | 15 | import nibabel as nib 16 | import numpy as np 17 | from joblib import load 18 | from nilearn.masking import apply_mask 19 | from scipy import stats 20 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 21 | from sklearn.metrics.pairwise import pairwise_kernels 22 | from tqdm import tqdm 23 | 24 | from utils import load_demographic_data 25 | 26 | PROJECT_ROOT = Path.cwd() 27 | 28 | parser = argparse.ArgumentParser() 29 | 30 | parser.add_argument('-T', '--training_experiment_name', 31 | dest='training_experiment_name', 32 | help='Name of the experiment.') 33 | 34 | parser.add_argument('-G', '--test_experiment_name', 35 | dest='test_experiment_name', 36 | help='Name of the experiment.') 37 | 38 | parser.add_argument('-S', '--scanner_name', 39 | dest='scanner_name', 40 | help='Name of the scanner.') 41 | 42 | parser.add_argument('-P', '--input_path', 43 | dest='input_path_str', 44 | help='Path to the local folder with preprocessed images.') 45 | 46 | parser.add_argument('-M', '--model_name', 47 | dest='model_name', 48 | help='Name of the model.') 49 | 50 | parser.add_argument('-I', '--input_ids_file', 51 | dest='input_ids_file', 52 | default='cleaned_ids.csv', 53 | help='Filename indicating the ids to be used.') 54 | 55 | parser.add_argument('-N', '--mask_filename', 56 | dest='mask_filename', 57 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 58 | help='Input data type') 59 | 60 | parser.add_argument('-D', '--input_data_type', 61 | dest='input_data_type', 62 | default='.nii.gz', 63 | help='Input data type') 64 | 65 | args = parser.parse_args() 66 | 67 | 68 | def main(training_experiment_name, 69 | test_experiment_name, 70 | scanner_name, 71 | input_path_str, 72 | model_name, 73 | input_ids_file, 74 | input_data_type, 75 | mask_filename): 76 | # ---------------------------------------------------------------------------------------- 77 | input_path = Path(input_path_str) 78 | training_experiment_dir = PROJECT_ROOT / 'outputs' / training_experiment_name 79 | test_experiment_dir = PROJECT_ROOT / 'outputs' / test_experiment_name 80 | 81 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 82 | ids_path = PROJECT_ROOT / 'outputs' / test_experiment_name / input_ids_file 83 | 84 | # Create experiment's output directory 85 | test_model_dir = test_experiment_dir / model_name 86 | test_model_dir.mkdir(exist_ok=True) 87 | 88 | training_cv_dir = training_experiment_dir / model_name / 'cv' 89 | test_cv_dir = test_model_dir / 'cv' 90 | test_cv_dir.mkdir(exist_ok=True) 91 | 92 | demographic = load_demographic_data(participants_path, ids_path) 93 | 94 | # ---------------------------------------------------------------------------------------- 95 | # Initialise random seed 96 | np.random.seed(42) 97 | random.seed(42) 98 | 99 | # Load the mask image 100 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 101 | mask_img = nib.load(str(brain_mask)) 102 | 103 | age = demographic['Age'].values 104 | 105 | # Create dataframe to hold actual and predicted ages 106 | age_predictions = demographic[['image_id', 'Age']] 107 | age_predictions = age_predictions.set_index('image_id') 108 | 109 | n_repetitions = 10 110 | n_folds = 10 111 | 112 | pbar = tqdm(total=100) 113 | for i_repetition in range(n_repetitions): 114 | for i_fold in range(n_folds): 115 | 116 | # Load model 117 | prefix = 
f'{i_repetition:02d}_{i_fold:02d}'
118 | pbar.set_description(f'{prefix}')
119 | relevance_vector = np.load(training_cv_dir / f'{prefix}_relevance_vectors.npz')['relevance_vectors_']
120 | model = load(training_cv_dir / f'{prefix}_regressor.joblib')
121 | 
122 | pbar2 = tqdm(demographic['image_id'])
123 | for i, subject_id in enumerate(pbar2):
124 | subject_path = input_path / f"{subject_id}_Warped{input_data_type}"
125 | pbar2.set_description(f'{subject_path.name}')
126 | 
127 | try:
128 | img = nib.load(str(subject_path))
129 | except FileNotFoundError:
130 | print(f'No image file {subject_path}.')
131 | raise
132 | 
133 | img = apply_mask(img, mask_img)
134 | img = np.asarray(img, dtype='float64')
135 | img = np.nan_to_num(img)
136 | 
137 | try:
138 | K = pairwise_kernels(img[None, :], Y=relevance_vector, metric='linear')
139 | except Exception:
140 | K = np.zeros((1, relevance_vector.shape[0]))  # fall back to a zero kernel row if the kernel cannot be computed
141 | K = K / model._scale
142 | K = np.hstack((np.ones((1, 1)), K))
143 | 
144 | prediction = K @ model.mu_
145 | 
146 | # Save prediction per model in df
147 | age_predictions.loc[subject_id, f'Prediction {prefix}'] = prediction
148 | 
149 | # Get and save scores
150 | predictions = age_predictions[f'Prediction {prefix}'].values
151 | 
152 | mae = mean_absolute_error(age, predictions)
153 | rmse = sqrt(mean_squared_error(age, predictions))
154 | r, _ = stats.pearsonr(age, predictions)
155 | r2 = r2_score(age, predictions)
156 | age_error_corr, _ = stats.spearmanr((predictions - age), age)
157 | 
158 | # Save model scores
159 | scores_array = np.array([r, r2, mae, rmse, age_error_corr])
160 | np.save(test_cv_dir / f'{prefix}_scores.npy', scores_array)
161 | 
162 | pbar.update(1)
163 | pbar.close()
164 | # Save predictions
165 | age_predictions.to_csv(test_model_dir / 'age_predictions_test.csv')
166 | 
167 | 
168 | if __name__ == '__main__':
169 | main(args.training_experiment_name, args.test_experiment_name, args.scanner_name,
170 | args.input_path_str, args.model_name, args.input_ids_file,
171 | args.input_data_type, args.mask_filename)
172 | 
--------------------------------------------------------------------------------
/src/generalisation/generalisation_test_voxel_data_svm.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | """
3 | Tests SVM models developed using voxel data from Biobank Scanner1
4 | on previously unseen data from Biobank Scanner2 to predict brain age.
5 | 
6 | The script loops over the 100 SVM models created in
7 | comparison_voxel_data_train_svm.py, loads their regressors,
8 | applies them to the Scanner2 data and saves all predictions per subject in a csv file.
9 | 
10 | This script assumes that a nifti file with the feature weights has already been
11 | pre-computed.
To compute the weights use the script 12 | `src/comparison/comparison_voxel_data_svm_primal_weights.py` 13 | """ 14 | import argparse 15 | import random 16 | from math import sqrt 17 | from pathlib import Path 18 | 19 | import nibabel as nib 20 | import numpy as np 21 | from joblib import load 22 | from nilearn.masking import apply_mask 23 | from scipy import stats 24 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 25 | from tqdm import tqdm 26 | 27 | from utils import load_demographic_data 28 | 29 | PROJECT_ROOT = Path.cwd() 30 | 31 | parser = argparse.ArgumentParser() 32 | 33 | parser.add_argument('-T', '--training_experiment_name', 34 | dest='training_experiment_name', 35 | help='Name of the experiment.') 36 | 37 | parser.add_argument('-G', '--test_experiment_name', 38 | dest='test_experiment_name', 39 | help='Name of the experiment.') 40 | 41 | parser.add_argument('-S', '--scanner_name', 42 | dest='scanner_name', 43 | help='Name of the scanner.') 44 | 45 | parser.add_argument('-M', '--model_name', 46 | dest='model_name', 47 | help='Name of the model.') 48 | 49 | parser.add_argument('-P', '--input_path', 50 | dest='input_path_str', 51 | help='Path to the local folder with preprocessed images.') 52 | 53 | parser.add_argument('-I', '--input_ids_file', 54 | dest='input_ids_file', 55 | default='cleaned_ids.csv', 56 | help='Filename indicating the ids to be used.') 57 | 58 | parser.add_argument('-N', '--mask_filename', 59 | dest='mask_filename', 60 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 61 | help='Input data type') 62 | 63 | parser.add_argument('-D', '--input_data_type', 64 | dest='input_data_type', 65 | default='.nii.gz', 66 | help='Input data type') 67 | 68 | args = parser.parse_args() 69 | 70 | 71 | def main(training_experiment_name, 72 | test_experiment_name, 73 | scanner_name, 74 | input_path_str, 75 | model_name, 76 | input_ids_file, 77 | input_data_type, 78 | mask_filename): 79 | # ---------------------------------------------------------------------------------------- 80 | input_path = Path(input_path_str) 81 | training_experiment_dir = PROJECT_ROOT / 'outputs' / training_experiment_name 82 | test_experiment_dir = PROJECT_ROOT / 'outputs' / test_experiment_name 83 | 84 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 85 | ids_path = PROJECT_ROOT / 'outputs' / test_experiment_name / input_ids_file 86 | 87 | # Create experiment's output directory 88 | test_model_dir = test_experiment_dir / model_name 89 | test_model_dir.mkdir(exist_ok=True) 90 | 91 | training_cv_dir = training_experiment_dir / model_name / 'cv' 92 | test_cv_dir = test_model_dir / 'cv' 93 | test_cv_dir.mkdir(exist_ok=True) 94 | 95 | demographic = load_demographic_data(participants_path, ids_path) 96 | 97 | # ---------------------------------------------------------------------------------------- 98 | # Initialise random seed 99 | np.random.seed(42) 100 | random.seed(42) 101 | 102 | # Load the mask image 103 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 104 | mask_img = nib.load(str(brain_mask)) 105 | 106 | age = demographic['Age'].values 107 | 108 | # Create dataframe to hold actual and predicted ages 109 | age_predictions = demographic[['image_id', 'Age']] 110 | age_predictions = age_predictions.set_index('image_id') 111 | 112 | n_repetitions = 10 113 | n_folds = 10 114 | 115 | for i, subject_id in enumerate(tqdm(demographic['image_id'])): 116 | subject_path = input_path / 
f"{subject_id}_Warped{input_data_type}" 117 | print(subject_path) 118 | 119 | try: 120 | img = nib.load(str(subject_path)) 121 | except FileNotFoundError: 122 | print(f'No image file {subject_path}.') 123 | raise 124 | 125 | img = apply_mask(img, mask_img) 126 | img = np.asarray(img, dtype='float64') 127 | img = np.nan_to_num(img) 128 | 129 | for i_repetition in range(n_repetitions): 130 | for i_fold in range(n_folds): 131 | 132 | # Load model and scaler 133 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 134 | 135 | model = load(training_cv_dir / f'{prefix}_regressor.joblib') 136 | weights_path = training_cv_dir / f'{prefix}_importance.nii.gz' 137 | weights_img = nib.load(str(weights_path)) 138 | 139 | weights_img = weights_img.get_fdata() 140 | weights_img = nib.Nifti1Image(weights_img, mask_img.affine) 141 | 142 | weights_img = apply_mask(weights_img, mask_img) 143 | weights_img = np.asarray(weights_img, dtype='float64') 144 | weights_img = np.nan_to_num(weights_img) 145 | 146 | prediction = np.dot(weights_img, img.T) + model.intercept_ 147 | 148 | # Save prediction per model in df 149 | age_predictions.loc[subject_id, f'Prediction {prefix}'] = prediction 150 | 151 | # Save predictions 152 | age_predictions.to_csv(test_model_dir / 'age_predictions_test.csv') 153 | 154 | # Get and save scores 155 | for i_repetition in range(n_repetitions): 156 | for i_fold in range(n_folds): 157 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 158 | predictions = age_predictions[f'Prediction {prefix}'].values 159 | 160 | mae = mean_absolute_error(age, predictions) 161 | rmse = sqrt(mean_squared_error(age, predictions)) 162 | r, _ = stats.pearsonr(age, predictions) 163 | r2 = r2_score(age, predictions) 164 | age_error_corr, _ = stats.spearmanr((predictions - age), age) 165 | 166 | # Save model scores 167 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 168 | np.save(test_cv_dir / f'{prefix}_scores.npy', scores_array) 169 | 170 | 171 | if __name__ == '__main__': 172 | main(args.training_experiment_name, args.test_experiment_name, args.scanner_name, 173 | args.input_path_str, args.model_name, args.input_ids_file, 174 | args.input_data_type, args.mask_filename) 175 | -------------------------------------------------------------------------------- /src/misc/README.md: -------------------------------------------------------------------------------- 1 | # MISC 2 | The scripts in this folder measure the performance of the models by the 3 | size of the training set. They measure performance through univariate 4 | analysis of each normalised brain region and pairwise t-test analysis 5 | between SVM models with different hyperparameters C. 6 | 7 | This analysis was not presented in the manuscript. 8 | 9 | -------------------------------------------------------------------------------- /src/misc/misc_svm_hyperparameters_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Compares pairwise performance through independent t-test 4 | of SVM models with different hyperparameters C. 
5 | Saves results into `svm_params_ttest.csv` and `svm_params_values.csv` 6 | """ 7 | import argparse 8 | import itertools 9 | from pathlib import Path 10 | 11 | import numpy as np 12 | import pandas as pd 13 | from joblib import load 14 | 15 | from utils import ttest_ind_corrected 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | parser = argparse.ArgumentParser() 20 | 21 | parser.add_argument('-E', '--experiment_name', 22 | dest='experiment_name', 23 | help='Name of the experiment.') 24 | 25 | args = parser.parse_args() 26 | 27 | 28 | def main(experiment_name): 29 | """Pairwise comparison of SVM classifier performances with different hyperparameters C.""" 30 | # ---------------------------------------------------------------------------------------- 31 | svm_dir = PROJECT_ROOT / 'outputs' / experiment_name / 'SVM' 32 | cv_dir = svm_dir / 'cv' 33 | 34 | n_repetitions = 10 35 | n_folds = 10 36 | 37 | search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0, 2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]} 38 | 39 | scores_params = [] 40 | for i_repetition in range(n_repetitions): 41 | for i_fold in range(n_folds): 42 | params_dict = load(cv_dir / f'{i_repetition:02d}_{i_fold:02d}_params.joblib') 43 | scores_params.append(params_dict['means']) 44 | 45 | scores_params = np.array(scores_params) 46 | 47 | combinations = list(itertools.combinations(range(scores_params.shape[1]), 2)) 48 | 49 | # Bonferroni correction for multiple comparisons 50 | corrected_alpha = 0.05 / len(combinations) 51 | 52 | results_df = pd.DataFrame(columns=['params', 'p-value', 'stats']) 53 | 54 | # Corrected repeated k-fold cv test to compare pairwise performance of the SVM classifiers 55 | # through independent t-test 56 | for param_a, param_b in combinations: 57 | statistic, pvalue = ttest_ind_corrected(scores_params[:, param_a], scores_params[:, param_b], 58 | k=n_folds, r=n_repetitions) 59 | 60 | print(f"{search_space['C'][param_a]} vs. {search_space['C'][param_b]} pvalue: {pvalue:6.3}", end='') 61 | if pvalue <= corrected_alpha: 62 | print('*') 63 | else: 64 | print('') 65 | 66 | results_df = results_df.append({'params': f"{search_space['C'][param_a]} vs. {search_space['C'][param_b]}", 67 | 'p-value': pvalue, 68 | 'stats': statistic}, 69 | ignore_index=True) 70 | # Output to csv 71 | results_df.to_csv(svm_dir / 'svm_params_ttest.csv', index=False) 72 | 73 | values_df = pd.DataFrame(columns=['measures'] + list(search_space['C'])) 74 | 75 | scores_params_mean = np.mean(scores_params, axis=0) 76 | scores_params_std = np.std(scores_params, axis=0) 77 | 78 | values_df.loc[0] = ['mean'] + list(scores_params_mean) 79 | values_df.loc[1] = ['std'] + list(scores_params_std) 80 | 81 | values_df.to_csv(svm_dir / 'svm_params_values.csv', index=False) 82 | 83 | 84 | if __name__ == '__main__': 85 | main(args.experiment_name) 86 | -------------------------------------------------------------------------------- /src/misc/misc_univariate_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Implements univariate analysis based on [1], regresses for age and volume per region: 4 | 1. normalise each brain region 5 | 2. creates df with normalised brain region (dependent variable) and age of participant 6 | (independent variable) (+ quadratic and cubic age) 7 | 3. outputs coefficient per subject 8 | 9 | References: 10 | [1] - Zhao, Lu, et al. (2018) Age-Related Differences in Brain Morphology and the Modifiers 11 | in Middle-Aged and Older Adults. Cerebral Cortex. 
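Concretely, for each region the fitted OLS model is (notation illustrative):

    norm_vol = b0 + b1*Age + b2*Age^2 + b3*Age^3 + error

where norm_vol is the regional volume divided by the estimated total intracranial volume
(multiplied by 100), as computed by `normalise_region_df` below.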
12 | """ 13 | import argparse 14 | from pathlib import Path 15 | 16 | import numpy as np 17 | import pandas as pd 18 | import statsmodels.api as sm 19 | 20 | from utils import COLUMNS_NAME, load_freesurfer_dataset 21 | 22 | PROJECT_ROOT = Path.cwd() 23 | 24 | parser = argparse.ArgumentParser() 25 | 26 | parser.add_argument('-E', '--experiment_name', 27 | dest='experiment_name', 28 | help='Name of the experiment.') 29 | 30 | parser.add_argument('-S', '--scanner_name', 31 | dest='scanner_name', 32 | help='Name of the scanner.') 33 | 34 | parser.add_argument('-I', '--input_ids_file', 35 | dest='input_ids_file', 36 | default='homogenized_ids.csv', 37 | help='Filename indicating the ids to be used.') 38 | 39 | args = parser.parse_args() 40 | 41 | 42 | def normalise_region_df(df, region_name): 43 | """Normalise region by total intracranial volume 44 | 45 | Parameters 46 | ---------- 47 | df: dataframe 48 | Data to be normalized 49 | region_name: str 50 | Region of interest 51 | 52 | Returns 53 | ------- 54 | float 55 | Normalised region 56 | """ 57 | return df[region_name] / df['EstimatedTotalIntraCranialVol'] * 100 58 | 59 | 60 | def linear_regression(df, region_name): 61 | """Perform linear regression using ordinary least squares (OLS) method 62 | 63 | Parameters 64 | ---------- 65 | df: dataframe 66 | Dataset to be regressed 67 | region_name: str 68 | Region of interest 69 | 70 | Returns 71 | ------- 72 | OLS_results.params: ndarray 73 | Estimated parameters 74 | OLS_results.bse: float 75 | Standard error of the parameter estimates 76 | OLS_results.tvalues: parameter 77 | t-statistic of parameter estimates 78 | OLS_results.pvalues: float 79 | Two-tailed p-values of the t-statistics of the parameters 80 | """ 81 | 82 | endog = df['Norm_vol_' + region_name].values 83 | exog = sm.add_constant(df[['Age', 'Age^2', 'Age^3']].values) 84 | 85 | OLS_model = sm.OLS(endog, exog) 86 | 87 | OLS_results = OLS_model.fit() 88 | 89 | return OLS_results.params, OLS_results.bse, OLS_results.tvalues, OLS_results.pvalues 90 | 91 | 92 | def main(experiment_name, scanner_name, input_ids_file): 93 | # ---------------------------------------------------------------------------------------- 94 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 95 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 96 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 97 | ids_path = experiment_dir / input_ids_file 98 | 99 | # Create experiment's output directory 100 | univariate_dir = experiment_dir / 'univariate_analysis' 101 | univariate_dir.mkdir(exist_ok=True) 102 | 103 | dataset = load_freesurfer_dataset(participants_path, ids_path, freesurfer_path) 104 | 105 | # Create new df to add normalised regional volumes 106 | normalised_df = dataset[['participant_id', 'Diagn', 'Gender', 'Age']] 107 | normalised_df['Age^2'] = normalised_df['Age'] ** 2 108 | normalised_df['Age^3'] = normalised_df['Age'] ** 3 109 | 110 | # Create empty df for regression output; regions to be added 111 | regression_output = pd.DataFrame({'Row_labels_stat': ['Coeff', 'Coeff', 'Coeff', 'Coeff', 112 | 'std_err', 'std_err', 'std_err', 'std_err', 113 | 't_stats', 't_stats', 't_stats', 't_stats', 114 | 'p_val', 'p_val', 'p_val', 'p_val'], 115 | 116 | 'Row_labels_exog': ['Constant', 'Age', 'Age2', 'Age3', 117 | 'Constant', 'Age', 'Age2', 'Age3', 118 | 'Constant', 'Age', 'Age2', 'Age3', 119 | 'Constant', 'Age', 'Age2', 'Age3']}) 120 | 121 | 
regression_output = regression_output.set_index(['Row_labels_stat', 'Row_labels_exog'])
122 | 
123 | for region_name in COLUMNS_NAME:
124 | print(region_name)
125 | normalised_df['Norm_vol_' + region_name] = normalise_region_df(dataset, region_name)
126 | 
127 | # Linear regression - ordinary least squares (OLS)
128 | coeff, std_err, t_value, p_value = linear_regression(normalised_df, region_name)
129 | 
130 | regression_output[region_name] = np.concatenate((coeff, std_err, t_value, p_value), axis=0)
131 | 
132 | # Output to csv (the row labels are kept as the index)
133 | regression_output.to_csv(univariate_dir / 'OLS_result.csv')
134 | 
135 | 
136 | if __name__ == '__main__':
137 | main(args.experiment_name, args.scanner_name,
138 | args.input_ids_file)
139 | 
--------------------------------------------------------------------------------
/src/preprocessing/README.md:
--------------------------------------------------------------------------------
1 | # Preprocessing
2 | The scripts in this folder perform the preprocessing necessary
3 | to conduct the analysis for brain age prediction.
4 | 
5 | This includes:
6 | - cleaning the data and removing subjects with incomplete information,
7 | - performing quality control of the raw and segmented data,
8 | - computing the kernel (Gram) matrices for the voxel data,
9 | - performing dimensionality reduction on the voxel data using PCA
--------------------------------------------------------------------------------
/src/preprocessing/clean_data.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | """ Clean UK Biobank data.
3 | 
4 | Most of the subjects are white, and some ages have a very low number of subjects (<100).
5 | Ethnic minorities and under-represented ages are removed from further analysis,
6 | as well as subjects with any mental or brain disorder.
7 | """
8 | import argparse
9 | from pathlib import Path
10 | 
11 | from utils import load_demographic_data
12 | 
13 | PROJECT_ROOT = Path.cwd()
14 | 
15 | parser = argparse.ArgumentParser()
16 | 
17 | parser.add_argument('-E', '--experiment_name',
18 | dest='experiment_name',
19 | help='Name of the experiment.')
20 | 
21 | parser.add_argument('-S', '--scanner_name',
22 | dest='scanner_name',
23 | help='Name of the scanner.')
24 | 
25 | parser.add_argument('-I', '--input_ids_file',
26 | dest='input_ids_file',
27 | default='freesurferData.csv',
28 | help='Filename indicating the ids to be used.')
29 | 
30 | args = parser.parse_args()
31 | 
32 | 
33 | def main(experiment_name, scanner_name, input_ids_file):
34 | """Clean UK Biobank data."""
35 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv'
36 | ids_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / input_ids_file
37 | 
38 | output_ids_filename = 'cleaned_ids_noqc.csv'
39 | # ----------------------------------------------------------------------------------------
40 | # Create experiment's output directory
41 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name
42 | experiment_dir.mkdir(exist_ok=True)
43 | 
44 | dataset = load_demographic_data(participants_path, ids_path)
45 | 
46 | # Exclude subjects outside [47, 73] interval (ages with <100 participants).
47 | dataset = dataset.loc[(dataset['Age'] >= 47) & (dataset['Age'] <= 73)] 48 | 49 | # Exclude non-white ethnicities due to small subgroups 50 | dataset = dataset[dataset['Ethnicity'] == 'White'] 51 | 52 | # Exclude patients 53 | dataset = dataset[dataset['Diagn'] == 1] 54 | 55 | output_ids_df = dataset[['image_id']] 56 | 57 | assert sum(output_ids_df.duplicated(subset='image_id')) == 0 58 | 59 | output_ids_df.to_csv(experiment_dir / output_ids_filename, index=False) 60 | 61 | 62 | if __name__ == '__main__': 63 | main(args.experiment_name, args.scanner_name, 64 | args.input_ids_file) 65 | -------------------------------------------------------------------------------- /src/preprocessing/compute_kernel_matrix.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ Script to create the Kernel matrix (Gram matrix). 3 | 4 | The Kernel matrix will be used on the analysis with voxel data. 5 | """ 6 | import argparse 7 | from pathlib import Path 8 | 9 | import nibabel as nib 10 | import numpy as np 11 | import pandas as pd 12 | from nilearn.masking import apply_mask 13 | from tqdm import tqdm 14 | 15 | PROJECT_ROOT = Path.cwd() 16 | 17 | parser = argparse.ArgumentParser() 18 | 19 | parser.add_argument('-P', '--input_path', 20 | dest='input_path_str', 21 | help='Path to the local folder with preprocessed images.') 22 | 23 | parser.add_argument('-E', '--experiment_name', 24 | dest='experiment_name', 25 | help='Name of the experiment.') 26 | 27 | parser.add_argument('-I', '--input_ids_file', 28 | dest='input_ids_file', 29 | default='homogenized_ids.csv', 30 | help='Filename indicating the ids to be used.') 31 | 32 | parser.add_argument('-D', '--input_data_type', 33 | dest='input_data_type', 34 | default='.nii.gz', 35 | help='Input data type') 36 | 37 | parser.add_argument('-M', '--mask_filename', 38 | dest='mask_filename', 39 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 40 | help='Input data type') 41 | 42 | args = parser.parse_args() 43 | 44 | 45 | def calculate_gram_matrix(subjects_path, mask_img, step_size=1000): 46 | """Calculate the Gram matrix. """ 47 | n_samples = len(subjects_path) 48 | gram_matrix = np.float64(np.zeros((n_samples, n_samples))) 49 | 50 | # Outer loop 51 | outer_pbar = tqdm(range(int(np.ceil(n_samples / np.float(step_size))))) 52 | for ii in outer_pbar: 53 | outer_pbar.set_description(f'Processing outer loop {ii}') 54 | # Generate indices and then paths for this block 55 | start_ind_1 = ii * step_size 56 | stop_ind_1 = min(start_ind_1 + step_size, n_samples) 57 | block_paths_1 = subjects_path[start_ind_1:stop_ind_1] 58 | 59 | # Read in the images in this block 60 | images_1 = [] 61 | images_1_pbar = tqdm(block_paths_1) 62 | for path in images_1_pbar: 63 | images_1_pbar.set_description(f'Loading outer image {path}') 64 | try: 65 | img = nib.load(str(path)) 66 | except FileNotFoundError: 67 | print(f'No image file {path}.') 68 | raise 69 | 70 | # Extract only the brain voxels. This will create a 1D array. 
71 | img = apply_mask(img, mask_img) 72 | img = np.asarray(img, dtype='float64') 73 | img = np.nan_to_num(img) 74 | images_1.append(img) 75 | del img 76 | images_1 = np.array(images_1) 77 | 78 | # Inner loop 79 | inner_pbar = tqdm(range(ii + 1)) 80 | for jj in inner_pbar: 81 | 82 | # If ii = jj, then sets of image data are the same - no need to load 83 | if ii == jj: 84 | start_ind_2 = start_ind_1 85 | stop_ind_2 = stop_ind_1 86 | images_2 = images_1 87 | 88 | # If ii !=jj, read in a different block of images 89 | else: 90 | # Generate indices and then paths for this block 91 | start_ind_2 = jj * step_size 92 | stop_ind_2 = min(start_ind_2 + step_size, n_samples) 93 | block_paths_2 = subjects_path[start_ind_2:stop_ind_2] 94 | 95 | images_2 = [] 96 | images_2_pbar = tqdm(block_paths_2) 97 | for path in images_2_pbar: 98 | images_2_pbar.set_description(f'Loading inner image {path}') 99 | try: 100 | img = nib.load(str(path)) 101 | except FileNotFoundError: 102 | print(f'No image file {path}.') 103 | raise 104 | 105 | img = apply_mask(img, mask_img) 106 | img = np.asarray(img, dtype='float64') 107 | img = np.nan_to_num(img) 108 | images_2.append(img) 109 | del img 110 | images_2 = np.array(images_2) 111 | 112 | block_K = np.dot(images_1, np.transpose(images_2)) 113 | gram_matrix[start_ind_1:stop_ind_1, start_ind_2:stop_ind_2] = block_K 114 | gram_matrix[start_ind_2:stop_ind_2, start_ind_1:stop_ind_1] = np.transpose(block_K) 115 | 116 | return gram_matrix 117 | 118 | 119 | def main(input_path_str, experiment_name, input_ids_file, input_data_type, mask_filename): 120 | """""" 121 | dataset_path = Path(input_path_str) 122 | 123 | output_path = PROJECT_ROOT / 'outputs' / 'kernels' 124 | output_path.mkdir(exist_ok=True, parents=True) 125 | 126 | ids_df = pd.read_csv(PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file) 127 | 128 | # Get list of subjects included in the analysis 129 | subjects_path = [str(dataset_path / f'{subject_id}_Warped{input_data_type}') for subject_id in ids_df['image_id']] 130 | 131 | print(f'Total number of images: {len(ids_df)}') 132 | 133 | # ---------------------------------------------------------------------------------------- 134 | # Load the mask image 135 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 136 | mask_img = nib.load(str(brain_mask)) 137 | 138 | gram_matrix = calculate_gram_matrix(subjects_path, mask_img) 139 | 140 | gram_df = pd.DataFrame(columns=ids_df['image_id'].tolist(), data=gram_matrix) 141 | gram_df['image_id'] = ids_df['image_id'] 142 | gram_df = gram_df.set_index('image_id') 143 | 144 | gram_df.to_csv(output_path / 'kernel.csv') 145 | 146 | 147 | if __name__ == '__main__': 148 | main(args.input_path_str, args.experiment_name, 149 | args.input_ids_file, 150 | args.input_data_type, args.mask_filename) 151 | -------------------------------------------------------------------------------- /src/preprocessing/compute_kernel_matrix_general.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ Script to create the Kernel matrix (Gram matrix) for the generalisation analysis. 3 | 4 | In this script, we measure the kernel function between subjects from site 1 and 2. 
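The resulting matrix is rectangular rather than square: entry (i, j) is the linear kernel (dot product)
between the masked voxel vector of subject i from site 1 and that of subject j from site 2. As in the
square case, it is computed block by block (500 images at a time) to keep memory usage bounded.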
5 | """ 6 | import argparse 7 | from pathlib import Path 8 | 9 | import nibabel as nib 10 | import numpy as np 11 | import pandas as pd 12 | from nilearn.masking import apply_mask 13 | from tqdm import tqdm 14 | 15 | PROJECT_ROOT = Path.cwd() 16 | 17 | parser = argparse.ArgumentParser() 18 | 19 | parser.add_argument('-P', '--input_path', 20 | dest='input_path_str', 21 | help='Path to the local folder with preprocessed images.') 22 | 23 | parser.add_argument('-E', '--experiment_name', 24 | dest='experiment_name', 25 | help='Name of the experiment.') 26 | 27 | parser.add_argument('-I', '--input_ids_file', 28 | dest='input_ids_file', 29 | default='homogenized_ids.csv', 30 | help='Filename indicating the ids to be used.') 31 | 32 | parser.add_argument('-D', '--input_data_type', 33 | dest='input_data_type', 34 | default='.nii.gz', 35 | help='Input data type') 36 | 37 | parser.add_argument('-M', '--mask_filename', 38 | dest='mask_filename', 39 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 40 | help='Input data type') 41 | 42 | parser.add_argument('-P2', '--input_path_2', 43 | dest='input_path_str_2', 44 | help='Path to the local folder with preprocessed images.') 45 | 46 | parser.add_argument('-E2', '--experiment_name_2', 47 | dest='experiment_name_2', 48 | help='Name of the experiment.') 49 | 50 | parser.add_argument('-I2', '--input_ids_file_2', 51 | dest='input_ids_file_2', 52 | default='cleaned_ids.csv', 53 | help='Filename indicating the ids to be used.') 54 | 55 | parser.add_argument('-S', '--output_suffix', 56 | dest='output_suffix', 57 | default='_general', 58 | help='Filename indicating the ids to be used.') 59 | 60 | args = parser.parse_args() 61 | 62 | 63 | def calculate_gram_matrix(subjects_path, mask_img, subjects_path_2, step_size=500): 64 | """Calculate the Gram matrix. """ 65 | n_samples = len(subjects_path) 66 | n_samples_2 = len(subjects_path_2) 67 | gram_matrix = np.float64(np.zeros((n_samples, n_samples_2))) 68 | 69 | # Outer loop 70 | outer_pbar = tqdm(range(int(np.ceil(n_samples / np.float(step_size))))) 71 | for ii in outer_pbar: 72 | outer_pbar.set_description(f'Processing outer loop {ii}') 73 | # Generate indices and then paths for this block 74 | start_ind_1 = ii * step_size 75 | stop_ind_1 = min(start_ind_1 + step_size, n_samples) 76 | block_paths_1 = subjects_path[start_ind_1:stop_ind_1] 77 | 78 | # Read in the images in this block 79 | images_1 = [] 80 | images_1_pbar = tqdm(block_paths_1) 81 | for path in images_1_pbar: 82 | images_1_pbar.set_description(f'Loading outer image {path}') 83 | try: 84 | img = nib.load(str(path)) 85 | except FileNotFoundError: 86 | print(f'No image file {path}.') 87 | raise 88 | 89 | # Extract only the brain voxels. This will create a 1D array. 
90 | img = apply_mask(img, mask_img) 91 | img = np.asarray(img, dtype='float64') 92 | img = np.nan_to_num(img) 93 | images_1.append(img) 94 | del img 95 | images_1 = np.array(images_1) 96 | 97 | # Inner loop 98 | inner_pbar = tqdm(range(int(np.ceil(n_samples_2 / np.float(step_size))))) 99 | for jj in inner_pbar: 100 | # Generate indices and then paths for this block 101 | start_ind_2 = jj * step_size 102 | stop_ind_2 = min(start_ind_2 + step_size, n_samples_2) 103 | block_paths_2 = subjects_path_2[start_ind_2:stop_ind_2] 104 | 105 | images_2 = [] 106 | images_2_pbar = tqdm(block_paths_2) 107 | for path in images_2_pbar: 108 | images_2_pbar.set_description(f'Loading inner image {path}') 109 | try: 110 | img = nib.load(str(path)) 111 | except FileNotFoundError: 112 | print(f'No image file {path}.') 113 | raise 114 | 115 | img = apply_mask(img, mask_img) 116 | img = np.asarray(img, dtype='float64') 117 | img = np.nan_to_num(img) 118 | images_2.append(img) 119 | del img 120 | images_2 = np.array(images_2) 121 | 122 | block_K = np.dot(images_1, np.transpose(images_2)) 123 | gram_matrix[start_ind_1:stop_ind_1, start_ind_2:stop_ind_2] = block_K 124 | 125 | return gram_matrix 126 | 127 | 128 | def main(input_path_str, experiment_name, input_ids_file, input_data_type, mask_filename, 129 | input_path_str_2, experiment_name_2, input_ids_file_2, output_suffix): 130 | """""" 131 | dataset_path = Path(input_path_str) 132 | 133 | output_path = PROJECT_ROOT / 'outputs' / 'kernels' 134 | output_path.mkdir(exist_ok=True, parents=True) 135 | 136 | ids_df = pd.read_csv(PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file) 137 | 138 | # Get list of subjects included in the analysis 139 | subjects_path = [str(dataset_path / f'{subject_id}_Warped{input_data_type}') for subject_id in ids_df['image_id']] 140 | 141 | print(f'Total number of images: {len(ids_df)}') 142 | 143 | # Dataset_2 144 | dataset_path_2 = Path(input_path_str_2) 145 | ids_df_2 = pd.read_csv(PROJECT_ROOT / 'outputs' / experiment_name_2 / input_ids_file_2) 146 | subjects_path_2 = [str(dataset_path_2 / f'{subject_id}_Warped{input_data_type}') for subject_id in ids_df_2['image_id']] 147 | 148 | print(f'Total number of images: {len(ids_df_2)}') 149 | 150 | # ---------------------------------------------------------------------------------------- 151 | # Load the mask image 152 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 153 | mask_img = nib.load(str(brain_mask)) 154 | 155 | gram_matrix = calculate_gram_matrix(subjects_path, mask_img, subjects_path_2) 156 | 157 | gram_df = pd.DataFrame(columns=ids_df_2['image_id'].tolist(), data=gram_matrix) 158 | gram_df['image_id'] = ids_df['image_id'] 159 | gram_df = gram_df.set_index('image_id') 160 | 161 | gram_df.to_csv(output_path / f'kernel{output_suffix}.csv') 162 | 163 | 164 | if __name__ == '__main__': 165 | main(args.input_path_str, args.experiment_name, 166 | args.input_ids_file, 167 | args.input_data_type, args.mask_filename, 168 | args.input_path_str_2, args.experiment_name_2, args.input_ids_file_2, args.output_suffix) 169 | -------------------------------------------------------------------------------- /src/preprocessing/compute_pca_variance_explained.py: -------------------------------------------------------------------------------- 1 | """Script to calculate the % variance explained from the PCA models""" 2 | 3 | import pandas as pd 4 | from joblib import load 5 | from pathlib import Path 6 | 7 | PROJECT_ROOT = Path.cwd() 8 | 9 | 10 | def main(): 11 | pca_path 
= PROJECT_ROOT / 'outputs' / 'pca' / 'models'
12 | 
13 | # Get list of file names for pca models
14 | n_repetitions = 10
15 | n_folds = 10
16 | 
17 | pca_name_ls = []
18 | for i_repetition in range(n_repetitions):
19 | for i_fold in range(n_folds):
20 | pca_name = f'{i_repetition:02d}_{i_fold:02d}_pca.joblib'
21 | pca_name_ls.append(pca_name)
22 | 
23 | # Loop over pca model file names, load models and get variance explained
24 | pca_var_ls = []
25 | for i_model in pca_name_ls:
26 | print(i_model)
27 | pca_model = load(pca_path / i_model)
28 | var_explained = pca_model.explained_variance_ratio_.sum()
29 | pca_var_ls.append(var_explained)
30 | 
31 | # Create df for % variance explained per model iteration
32 | pca_var_df = pd.DataFrame({'variance_explained': pca_var_ls})
33 | 
34 | # Get mean and standard deviation for % variance explained across iterations
35 | var_mean = pca_var_df['variance_explained'].mean()
36 | var_std = pca_var_df['variance_explained'].std()
37 | print(var_mean, var_std)
38 | 
39 | # Save % variance explained per model
40 | file_name = 'pca_variance_explained.csv'
41 | file_path = PROJECT_ROOT / 'outputs' / 'pca'
42 | pca_var_df.to_csv(file_path / file_name)
43 | 
44 | 
45 | 
46 | if __name__ == '__main__':
47 | main()
--------------------------------------------------------------------------------
/src/preprocessing/compute_principal_components.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | """ Script to extract the PCA components from the participants' data. """
3 | import argparse
4 | from pathlib import Path
5 | 
6 | import nibabel as nib
7 | import numpy as np
8 | import pandas as pd
9 | from nilearn.masking import apply_mask
10 | from joblib import load
11 | from tqdm import tqdm
12 | 
13 | 
14 | PROJECT_ROOT = Path.cwd()
15 | 
16 | parser = argparse.ArgumentParser()
17 | 
18 | parser.add_argument('-P', '--input_path',
19 | dest='input_path_str',
20 | help='Path to the local folder with preprocessed images.')
21 | 
22 | parser.add_argument('-E', '--experiment_name',
23 | dest='experiment_name',
24 | help='Name of the experiment.')
25 | 
26 | parser.add_argument('-I', '--input_ids_file',
27 | dest='input_ids_file',
28 | default='homogenized_ids.csv',
29 | help='Filename indicating the ids to be used.')
30 | 
31 | parser.add_argument('-D', '--input_data_type',
32 | dest='input_data_type',
33 | default='.nii.gz',
34 | help='Input data type')
35 | 
36 | parser.add_argument('-M', '--mask_filename',
37 | dest='mask_filename',
38 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii',
39 | help='Brain mask filename')
40 | 
41 | parser.add_argument('-S', '--output_suffix',
42 | dest='output_suffix',
43 | default='',
44 | help='Suffix appended to the output filenames.')
45 | 
46 | args = parser.parse_args()
47 | 
48 | def load_all_subjects(subjects_path, mask_img):
49 | imgs = []
50 | subj_pbar = tqdm(subjects_path)
51 | for subject_path in subj_pbar:
52 | subj_pbar.set_description(f'Loading image {subject_path}')
53 | # Read in this subject's image
54 | try:
55 | img = nib.load(str(subject_path))
56 | except FileNotFoundError:
57 | print(f'No image file {subject_path}.')
58 | raise
59 | 
60 | # Extract only the brain voxels. This will create a 1D array.
61 | img = apply_mask(img, mask_img) 62 | img = np.asarray(img, dtype='float32') 63 | img = np.nan_to_num(img) 64 | imgs.append(img) 65 | return imgs 66 | 67 | def main(input_path_str, experiment_name, input_ids_file, input_data_type, mask_filename, output_suffix): 68 | dataset_path = Path(input_path_str) 69 | output_path = PROJECT_ROOT / 'outputs' / 'pca' 70 | models_output_path = output_path / 'models' 71 | 72 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 73 | ids_df = pd.read_csv(ids_path) 74 | 75 | # Get list of subjects included in the analysis 76 | subjects_path = [str(dataset_path / f'{subject_id}_Warped{input_data_type}') for subject_id in ids_df['image_id']] 77 | 78 | print(f'Total number of images: {len(ids_df)}') 79 | 80 | # ---------------------------------------------------------------------------------------- 81 | # Load the mask image 82 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 83 | mask_img = nib.load(str(brain_mask)) 84 | 85 | imgs = load_all_subjects(subjects_path, mask_img) 86 | 87 | n_components = 150 88 | n_repetitions = 10 89 | n_folds = 10 90 | for i_repetition in range(n_repetitions): 91 | for i_fold in range(n_folds): 92 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 93 | print(f'{prefix}') 94 | 95 | components = np.zeros((len(subjects_path), n_components)) 96 | model = load(models_output_path / f'{prefix}_pca.joblib') 97 | 98 | for i_img, img in enumerate(tqdm(imgs)): 99 | components[i_img, :] = model.transform(img[None, :]) 100 | 101 | pca_df = pd.DataFrame(data=components) 102 | pca_df['image_id'] = subjects_path 103 | pca_df.to_csv(output_path / f'{prefix}_pca_components{output_suffix}.csv', index=False) 104 | 105 | 106 | if __name__ == '__main__': 107 | main(args.input_path_str, args.experiment_name, 108 | args.input_ids_file, 109 | args.input_data_type, args.mask_filename, args.output_suffix) 110 | -------------------------------------------------------------------------------- /src/preprocessing/create_pca_models.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ Script to create the PCA models. 3 | 4 | Note: We calculate only 150 components due to resources limitations. 
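One PCA model is fitted per cross-validation fold (10 repetitions x 10 folds, stratified by age).
Because the full voxel matrix does not fit in memory, each model uses scikit-learn's IncrementalPCA:
the training images of the fold are streamed in blocks of 400 and passed to partial_fit, and the
fitted model is saved as '<repetition>_<fold>_pca.joblib' (e.g. '00_00_pca.joblib').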
5 | """ 6 | import argparse 7 | import warnings 8 | from pathlib import Path 9 | 10 | import nibabel as nib 11 | import numpy as np 12 | import pandas as pd 13 | from joblib import dump 14 | from nilearn.masking import apply_mask 15 | from sklearn.decomposition import IncrementalPCA 16 | from sklearn.model_selection import StratifiedKFold 17 | from tqdm import tqdm 18 | 19 | from utils import load_demographic_data 20 | 21 | PROJECT_ROOT = Path.cwd() 22 | 23 | parser = argparse.ArgumentParser() 24 | 25 | parser.add_argument('-P', '--input_path', 26 | dest='input_path_str', 27 | help='Path to the local folder with preprocessed images.') 28 | 29 | parser.add_argument('-E', '--experiment_name', 30 | dest='experiment_name', 31 | help='Name of the experiment.') 32 | 33 | parser.add_argument('-S', '--scanner_name', 34 | dest='scanner_name', 35 | help='Name of the scanner.') 36 | 37 | parser.add_argument('-I', '--input_ids_file', 38 | dest='input_ids_file', 39 | default='homogenized_ids.csv', 40 | help='Filename indicating the ids to be used.') 41 | 42 | parser.add_argument('-D', '--input_data_type', 43 | dest='input_data_type', 44 | default='.nii.gz', 45 | help='Input data type') 46 | 47 | parser.add_argument('-M', '--mask_filename', 48 | dest='mask_filename', 49 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 50 | help='Input data type') 51 | 52 | args = parser.parse_args() 53 | 54 | 55 | def main(input_path_str, experiment_name, input_ids_file, scanner_name, input_data_type, mask_filename): 56 | dataset_path = Path(input_path_str) 57 | 58 | output_path = PROJECT_ROOT / 'outputs' / 'pca' 59 | output_path.mkdir(exist_ok=True) 60 | 61 | models_output_path = output_path / 'models' 62 | models_output_path.mkdir(exist_ok=True) 63 | 64 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 65 | ids_df = pd.read_csv(ids_path) 66 | 67 | # Get list of subjects included in the analysis 68 | subjects_path = [str(dataset_path / f'{subject_id}_Warped{input_data_type}') for subject_id in 69 | ids_df['image_id'].str.rstrip('/')] 70 | 71 | print(f'Total number of images: {len(ids_df)}') 72 | 73 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 74 | 75 | dataset = load_demographic_data(participants_path, ids_path) 76 | 77 | age = dataset['Age'].values 78 | 79 | # ---------------------------------------------------------------------------------------- 80 | # Load the mask image 81 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 82 | mask_img = nib.load(str(brain_mask)) 83 | 84 | n_repetitions = 10 85 | n_folds = 10 86 | step_size = 400 87 | for i_repetition in range(n_repetitions): 88 | # Create 10-fold cross-validation scheme stratified by age 89 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 90 | for i_fold, (train_index, test_index) in enumerate(skf.split(age, age)): 91 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 92 | print(train_index.shape) 93 | n_samples = len(subjects_path) 94 | pca = IncrementalPCA(n_components=150, copy=False) 95 | 96 | for i in tqdm(range(int(np.ceil(n_samples / np.float(step_size))))): 97 | # Generate indices and then paths for this block 98 | start_ind = i * step_size 99 | stop_ind = min(start_ind + step_size, n_samples) 100 | block_paths = subjects_path[start_ind:stop_ind] 101 | 102 | # Read in the images in this block 103 | images = [] 104 | for path in tqdm(block_paths): 105 | try: 106 | img = nib.load(str(path)) 107 | except 
FileNotFoundError: 108 | print(f'No image file {path}.') 109 | raise 110 | 111 | # Extract only the brain voxels. This will create a 1D array. 112 | img = apply_mask(img, mask_img) 113 | img = np.asarray(img, dtype='float32') 114 | img = np.nan_to_num(img) 115 | images.append(img) 116 | del img 117 | images = np.array(images, dtype='float32') 118 | 119 | selected_index = train_index[(train_index >= start_ind) & (train_index < stop_ind)] - start_ind 120 | images_selected = images[selected_index] 121 | try: 122 | pca.partial_fit(images_selected) 123 | except ValueError: 124 | warnings.warn('n_components higher than number of subjects.') 125 | 126 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 127 | dump(pca, models_output_path / f'{prefix}_pca.joblib') 128 | 129 | 130 | if __name__ == '__main__': 131 | main(args.input_path_str, args.experiment_name, 132 | args.input_ids_file, args.scanner_name, 133 | args.input_data_type, args.mask_filename) 134 | -------------------------------------------------------------------------------- /src/preprocessing/homogenize_gender.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Homogenize dataset. 3 | 4 | We homogenize the dataset scanner_1 to not have a significant difference 5 | between the proportion of men and women along the age. We used the 6 | chi square test for homogeneity to verify if there is a difference. 7 | """ 8 | import argparse 9 | import itertools 10 | from pathlib import Path 11 | 12 | import numpy as np 13 | import pandas as pd 14 | import scipy.stats as stats 15 | 16 | from utils import load_demographic_data 17 | 18 | PROJECT_ROOT = Path.cwd() 19 | 20 | parser = argparse.ArgumentParser() 21 | 22 | parser.add_argument('-E', '--experiment_name', 23 | dest='experiment_name', 24 | help='Name of the experiment.') 25 | 26 | parser.add_argument('-S', '--scanner_name', 27 | dest='scanner_name', 28 | help='Name of the scanner.') 29 | 30 | parser.add_argument('-I', '--input_ids_file', 31 | dest='input_ids_file', 32 | default='cleaned_ids.csv', 33 | help='Filename indicating the ids to be used.') 34 | 35 | args = parser.parse_args() 36 | 37 | 38 | def check_balance_across_groups(crosstab_df): 39 | """Verify if which age pair have gender imbalance.""" 40 | combinations = list(itertools.combinations(crosstab_df.columns, 2)) 41 | significance_level = 0.05 / len(combinations) 42 | 43 | for group1, group2 in combinations: 44 | contingency_table = crosstab_df[[group1, group2]] 45 | _, p_value, _, _ = stats.chi2_contingency(contingency_table, correction=False) 46 | 47 | if p_value < significance_level: 48 | return False, [group1, group2] 49 | 50 | return True, [None] 51 | 52 | 53 | def get_problematic_group(crosstab_df): 54 | """Perform contingency analysis of the subjects gender.""" 55 | balance_flag, problematic_groups = check_balance_across_groups(crosstab_df) 56 | 57 | if balance_flag: 58 | return None 59 | 60 | conditions_proportions = crosstab_df.apply(lambda r: r / r.sum(), axis=0) 61 | median_proportion = np.median(conditions_proportions.values[0, :]) 62 | problematic_proportion = conditions_proportions[problematic_groups].values[0, :] 63 | 64 | problematic_group = problematic_groups[np.argmax(np.abs(problematic_proportion - median_proportion))] 65 | 66 | return problematic_group 67 | 68 | 69 | def get_balanced_dataset(dataset_df): 70 | """Script to perform gender balancing across the subjects' age range.""" 71 | 72 | while True: 73 | crosstab_df = 
pd.crosstab(dataset_df['Gender'], dataset_df['Age']) 74 | 75 | problematic_group = get_problematic_group(crosstab_df) 76 | 77 | if problematic_group is None: 78 | break 79 | 80 | condition_imbalanced = crosstab_df[problematic_group].idxmax() 81 | 82 | problematic_group_mask = (dataset_df['Age'] == problematic_group) & \ 83 | (dataset_df['Gender'] == condition_imbalanced) 84 | 85 | list_to_drop = list(dataset_df[problematic_group_mask].sample(1).index) 86 | print('Dropping {:}'.format(dataset_df['image_id'].iloc[list_to_drop].values[0])) 87 | dataset_df = dataset_df.drop(list_to_drop, axis=0) 88 | 89 | return dataset_df 90 | 91 | 92 | def main(experiment_name, scanner_name, input_ids_file): 93 | """Perform the exploratory data analysis.""" 94 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 95 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 96 | 97 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 98 | 99 | # Define random seed for sampling methods 100 | np.random.seed(42) 101 | 102 | dataset_df = load_demographic_data(participants_path, ids_path) 103 | 104 | dataset_balanced = get_balanced_dataset(dataset_df) 105 | 106 | homogeneous_ids_df = dataset_balanced[['image_id']] 107 | homogeneous_ids_df.to_csv(experiment_dir / 'homogenized_ids.csv', index=False) 108 | 109 | 110 | if __name__ == '__main__': 111 | main(args.experiment_name, args.scanner_name, 112 | args.input_ids_file) 113 | -------------------------------------------------------------------------------- /src/preprocessing/quality_control.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Perform quality control. 3 | 4 | This script removes participants that did not pass the quality 5 | control performed using MRIQC [1] for raw MRI data and Qoala [2] for 6 | FreeSurfer-preprocessed data. These analyses were performed separately and the 7 | results are applied to the Biobank data in this script. 8 | 9 | In Qoala, higher numbers indicate a higher chance of being a high quality scan 10 | (Source: https://qoala-t.shinyapps.io/qoala-t_app/). 11 | 12 | In MRIQC, higher values indicates a higher probability of being from MRIQC's class 1 ('exclude') 13 | (Source: https://github.com/poldracklab/mriqc/blob/98610ad7596b586966413b01d10f4eb68366a038/mriqc/classifier/helper.py) 14 | 15 | References 16 | ---------- 17 | [1] - Esteban, Oscar, et al. "MRIQC: Advancing the automatic prediction 18 | of image quality in MRI from unseen sites." PloS one 12.9 (2017): e0184661. 19 | 20 | [2] - Klapwijk, Eduard T., et al. "Qoala-T: A supervised-learning tool for 21 | quality control of FreeSurfer segmented MRI data." NeuroImage 189 (2019): 116-129. 
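Example invocation (the experiment and scanner names are illustrative; both thresholds default to 0.5):

    python src/preprocessing/quality_control.py -E biobank_scanner1 -S SCANNER01 -M 0.5 -Q 0.5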
22 | """ 23 | import argparse 24 | from pathlib import Path 25 | 26 | import pandas as pd 27 | 28 | PROJECT_ROOT = Path.cwd() 29 | 30 | parser = argparse.ArgumentParser() 31 | 32 | parser.add_argument('-E', '--experiment_name', 33 | dest='experiment_name', 34 | help='Name of the experiment.') 35 | 36 | parser.add_argument('-S', '--scanner_name', 37 | dest='scanner_name', 38 | help='Name of the scanner.') 39 | 40 | parser.add_argument('-I', '--input_ids_file', 41 | dest='input_ids_file', 42 | default='cleaned_ids_noqc.csv', 43 | help='Filename indicating the ids to be used.') 44 | 45 | parser.add_argument('-M', '--mriqc_threshold', 46 | dest='mriqc_threshold', 47 | nargs='?', 48 | type=float, default=0.5, 49 | help='Threshold value for MRIQC.') 50 | 51 | parser.add_argument('-Q', '--qoala_threshold', 52 | dest='qoala_threshold', 53 | nargs='?', 54 | type=float, default=0.5, 55 | help='Threshold value for Qoala.') 56 | 57 | args = parser.parse_args() 58 | 59 | 60 | def main(experiment_name, scanner_name, input_ids_file, mriqc_threshold, qoala_threshold): 61 | """Remove UK Biobank participants that did not pass quality checks.""" 62 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 63 | mriqc_prob_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'mriqc_prob.csv' 64 | qoala_prob_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'qoala_prob.csv' 65 | 66 | qc_output_filename = 'cleaned_ids.csv' 67 | 68 | # ---------------------------------------------------------------------------------------- 69 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 70 | 71 | ids_df = pd.read_csv(ids_path) 72 | prob_mriqc_df = pd.read_csv(mriqc_prob_path) 73 | prob_qoala_df = pd.read_csv(qoala_prob_path) 74 | 75 | prob_mriqc_df = prob_mriqc_df.rename(columns={'prob_y': 'mriqc_prob'}) 76 | prob_mriqc_df = prob_mriqc_df[['image_id', 'mriqc_prob']] 77 | 78 | prob_qoala_df = prob_qoala_df.rename(columns={'prob_qoala': 'qoala_prob'}) 79 | prob_qoala_df = prob_qoala_df[['image_id', 'qoala_prob']] 80 | 81 | qc_df = pd.merge(prob_mriqc_df, prob_qoala_df, on='image_id') 82 | 83 | selected_subjects = qc_df[(qc_df['mriqc_prob'] < mriqc_threshold) | (qc_df['qoala_prob'] < qoala_threshold)] 84 | 85 | ids_qc_df = pd.merge(ids_df, selected_subjects[['image_id']], on='image_id') 86 | 87 | ids_qc_df.to_csv(experiment_dir / qc_output_filename, index=False) 88 | 89 | 90 | if __name__ == '__main__': 91 | main(args.experiment_name, args.scanner_name, 92 | args.input_ids_file, 93 | args.mriqc_threshold, args.qoala_threshold) 94 | -------------------------------------------------------------------------------- /src/sample_size/README.md: -------------------------------------------------------------------------------- 1 | # Performance of models by the size of training set 2 | This folder includes the scripts to perform the analysis of the 3 | impact of the sample size of the training set for brain age prediction. 4 | 5 | The scripts use bootstrapping to assess the robustness of performance and 6 | determine the minimum training set size required for model performance 7 | above chance level. Performance is measured in terms of the model's 8 | mean absolute error (MAE). 
-------------------------------------------------------------------------------- /src/sample_size/sample_size_create_figures.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Plot results of bootstrap analysis 3 | 4 | Ref: 5 | https://machinelearningmastery.com/calculate-bootstrap-confidence-intervals-machine-learning-results-python/ 6 | """ 7 | import argparse 8 | from pathlib import Path 9 | 10 | import matplotlib.pyplot as plt 11 | import numpy as np 12 | 13 | PROJECT_ROOT = Path.cwd() 14 | 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('-E', '--experiment_name', 18 | dest='experiment_name', 19 | help='Name of the experiment.') 20 | 21 | parser.add_argument('-M', '--model_name', 22 | dest='model_name', 23 | help='Name of the model.') 24 | 25 | parser.add_argument('-N', '--n_bootstrap', 26 | dest='n_bootstrap', 27 | type=int, default=1000, 28 | help='Number of bootstrap iterations.') 29 | 30 | parser.add_argument('-F', '--n_min_pair', 31 | dest='n_min_pair', 32 | type=int, default=1, 33 | help='Number minimum of pairs.') 34 | 35 | parser.add_argument('-R', '--n_max_pair', 36 | dest='n_max_pair', 37 | type=int, default=20, 38 | help='Number maximum of pairs.') 39 | 40 | args = parser.parse_args() 41 | 42 | 43 | def main(experiment_name, model_name, n_bootstrap, n_min_pair, n_max_pair): 44 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 45 | 46 | i_n_subject_pairs_list = range(n_min_pair, n_max_pair + 1) 47 | 48 | scores_i_n_subject_pairs = [] 49 | train_scores_i_n_subject_pairs = [] 50 | general_scores_i_n_subject_pairs = [] 51 | 52 | for i_n_subject_pairs in i_n_subject_pairs_list: 53 | ids_with_n_subject_pairs_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' 54 | scores_dir = ids_with_n_subject_pairs_dir / 'scores' 55 | scores_bootstrap = [] 56 | train_scores_bootstrap = [] 57 | general_scores_bootstrap = [] 58 | for i_bootstrap in range(n_bootstrap): 59 | filepath_scores = scores_dir / f'scores_{i_bootstrap:04d}_{model_name}.npy' 60 | scores_bootstrap.append(np.load(str(filepath_scores))[1]) 61 | 62 | train_filepath_scores = scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_train.npy' 63 | train_scores_bootstrap.append(np.load(str(train_filepath_scores))[1]) 64 | 65 | general_filepath_scores = scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_general.npy' 66 | general_scores_bootstrap.append(np.load(str(general_filepath_scores))[1]) 67 | 68 | scores_i_n_subject_pairs.append(scores_bootstrap) 69 | train_scores_i_n_subject_pairs.append(train_scores_bootstrap) 70 | general_scores_i_n_subject_pairs.append(general_scores_bootstrap) 71 | 72 | age_min = 47 73 | age_max = 73 74 | std_uniform_dist = np.sqrt(((age_max - age_min) ** 2) / 12) 75 | 76 | plt.figure(figsize=(10, 5)) 77 | 78 | # Draw lines 79 | plt.plot(i_n_subject_pairs_list, 80 | np.median(scores_i_n_subject_pairs, axis=1), 81 | linewidth=1.0, 82 | color='r', label=model_name + ' test performance') 83 | 84 | plt.plot(i_n_subject_pairs_list, 85 | np.median(train_scores_i_n_subject_pairs, axis=1), 86 | linewidth=1.0, 87 | color='g', label=model_name + ' train performance') 88 | 89 | plt.plot(i_n_subject_pairs_list, 90 | np.median(general_scores_i_n_subject_pairs, axis=1), 91 | linewidth=1.0, 92 | color='b', label=model_name + ' generalisation performance') 93 | 94 | plt.plot(range(1, 21), 95 | std_uniform_dist * np.ones_like(range(1, 21)), '--', 96 | linewidth=1.0, 97 | color='#111111', label='Chance line') 
98 | 99 | # Draw bands 100 | plt.fill_between(i_n_subject_pairs_list, 101 | np.percentile(scores_i_n_subject_pairs, 2.5, axis=1), 102 | np.percentile(scores_i_n_subject_pairs, 97.5, axis=1), 103 | color='r', alpha=0.1) 104 | 105 | plt.fill_between(i_n_subject_pairs_list, 106 | np.percentile(train_scores_i_n_subject_pairs, 2.5, axis=1), 107 | np.percentile(train_scores_i_n_subject_pairs, 97.5, axis=1), 108 | color='g', alpha=0.1) 109 | 110 | plt.fill_between(i_n_subject_pairs_list, 111 | np.percentile(general_scores_i_n_subject_pairs, 2.5, axis=1), 112 | np.percentile(general_scores_i_n_subject_pairs, 97.5, axis=1), 113 | color='b', alpha=0.1) 114 | 115 | # Create plot 116 | plt.xlabel('Number of subjects') 117 | plt.xticks(range(1, 21), np.multiply(range(1, 21), 2 * ((73 - 47) + 1))) 118 | plt.xlim(0.04999999999999993, 20.95) 119 | plt.ylabel('Mean Absolute Error') 120 | plt.legend(loc='best') 121 | plt.tight_layout() 122 | plt.savefig(experiment_dir / 'sample_size' / f'sample_size_{model_name}.eps', format='eps') 123 | 124 | 125 | if __name__ == '__main__': 126 | main(args.experiment_name, args.model_name, 127 | args.n_bootstrap, args.n_min_pair, args.n_max_pair) 128 | -------------------------------------------------------------------------------- /src/sample_size/sample_size_create_ids.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to create files with subjects' ids to perform sample size analysis 3 | 4 | This script creates sex-homogeneous bootstrapped datasets. 5 | Creates 20 bootstrap samples of increasing size 6 | """ 7 | import argparse 8 | from pathlib import Path 9 | 10 | import numpy as np 11 | import pandas as pd 12 | 13 | from utils import load_demographic_data 14 | 15 | PROJECT_ROOT = Path.cwd() 16 | 17 | parser = argparse.ArgumentParser() 18 | 19 | parser.add_argument('-E', '--experiment_name', 20 | dest='experiment_name', 21 | help='Name of the experiment.') 22 | 23 | parser.add_argument('-S', '--scanner_name', 24 | dest='scanner_name', 25 | help='Name of the scanner.') 26 | 27 | parser.add_argument('-I', '--input_ids_file', 28 | dest='input_ids_file', 29 | default='homogenized_ids.csv', 30 | help='File name indicating the ids to be used.') 31 | 32 | parser.add_argument('-N', '--n_bootstrap', 33 | dest='n_bootstrap', 34 | type=int, default=1000, 35 | help='Number of bootstrap iterations.') 36 | 37 | parser.add_argument('-R', '--n_max_pair', 38 | dest='n_max_pair', 39 | type=int, default=20, 40 | help='Maximum number of pairs.') 41 | 42 | args = parser.parse_args() 43 | 44 | 45 | def main(experiment_name, scanner_name, input_ids_file, n_bootstrap, n_max_pair): 46 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 47 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 48 | ids_path = experiment_dir / input_ids_file 49 | 50 | sample_size_dir = experiment_dir / 'sample_size' 51 | sample_size_dir.mkdir(exist_ok=True) 52 | 53 | # ---------------------------------------------------------------------------------------- 54 | # Set random seed for random sampling of subjects 55 | np.random.seed(42) 56 | 57 | dataset = load_demographic_data(participants_path, ids_path) 58 | 59 | # Find range of ages in homogeneous dataset 60 | age_min = int(dataset['Age'].min()) # 47 61 | age_max = int(dataset['Age'].max()) # 73 62 | 63 | # Loop to create 20 bootstrap samples that each contain up to 20 gender-balanced subject pairs per age group/year 64 | # Create a 
out-of-bag set (~test set) 65 | for i_n_subject_pairs in range(1, n_max_pair + 1): 66 | print(i_n_subject_pairs) 67 | ids_with_n_subject_pairs_dir = sample_size_dir / f'{i_n_subject_pairs:02d}' 68 | ids_with_n_subject_pairs_dir.mkdir(exist_ok=True) 69 | ids_dir = ids_with_n_subject_pairs_dir / 'ids' 70 | ids_dir.mkdir(exist_ok=True) 71 | 72 | # Loop to create 1000 random subject samples of the same size (with replacement) per bootstrap sample 73 | for i_bootstrap in range(n_bootstrap): 74 | # Create empty df to add bootstrap subjects to 75 | dataset_bootstrap_train = pd.DataFrame(columns=['image_id']) 76 | dataset_bootstrap_test = pd.DataFrame(columns=['image_id']) 77 | 78 | # Loop over ages (27 in total) 79 | for age in range(age_min, (age_max + 1)): 80 | 81 | # Get dataset for specific age only 82 | age_group = dataset.groupby('Age').get_group(age) 83 | 84 | # Loop over genders (0: female, 1:male) 85 | for gender in range(2): 86 | gender_group = age_group.groupby('Gender').get_group(gender) 87 | 88 | # Extract random subject of that gender and add to dataset_bootstrap_train 89 | random_sample_train = gender_group.sample(n=i_n_subject_pairs, replace=True) 90 | dataset_bootstrap_train = pd.concat([dataset_bootstrap_train, random_sample_train[['image_id']]]) 91 | 92 | # Sample test set with always the same size 93 | not_sampled = ~gender_group['image_id'].isin(random_sample_train['image_id']) 94 | random_sample_test = gender_group[not_sampled].sample(n=20, replace=False) 95 | dataset_bootstrap_test = pd.concat([dataset_bootstrap_test, random_sample_test[['image_id']]]) 96 | 97 | # Export dataset_bootstrap_train as csv 98 | output_prefix = f'{i_bootstrap:04d}_{i_n_subject_pairs:02d}' 99 | dataset_bootstrap_train.to_csv(ids_dir / f'{output_prefix}_train.csv', index=False) 100 | dataset_bootstrap_test.to_csv(ids_dir / f'{output_prefix}_test.csv', index=False) 101 | 102 | 103 | if __name__ == '__main__': 104 | main(args.experiment_name, args.scanner_name, 105 | args.input_ids_file, 106 | args.n_bootstrap, args.n_max_pair) 107 | -------------------------------------------------------------------------------- /src/sample_size/sample_size_fs_data_gp_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to perform the sample size analysis using Gaussian Processes. 
""" 3 | import argparse 4 | import random 5 | from math import sqrt 6 | from pathlib import Path 7 | 8 | import numpy as np 9 | from scipy import stats 10 | from sklearn.gaussian_process import GaussianProcessRegressor 11 | from sklearn.gaussian_process.kernels import DotProduct 12 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 13 | from sklearn.preprocessing import RobustScaler 14 | 15 | from utils import COLUMNS_NAME, load_freesurfer_dataset 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | parser = argparse.ArgumentParser() 20 | 21 | parser.add_argument('-E', '--experiment_name', 22 | dest='experiment_name', 23 | help='Name of the experiment.') 24 | 25 | parser.add_argument('-S', '--scanner_name', 26 | dest='scanner_name', 27 | help='Name of the scanner.') 28 | 29 | parser.add_argument('-N', '--n_bootstrap', 30 | dest='n_bootstrap', 31 | type=int, default=1000, 32 | help='Number of bootstrap iterations.') 33 | 34 | parser.add_argument('-R', '--n_max_pair', 35 | dest='n_max_pair', 36 | type=int, default=20, 37 | help='Number maximum of pairs.') 38 | 39 | parser.add_argument('-G', '--general_experiment_name', 40 | dest='general_experiment_name', 41 | help='Name of the experiment.') 42 | 43 | parser.add_argument('-C', '--general_scanner_name', 44 | dest='general_scanner_name', 45 | help='Name of the scanner for generalization.') 46 | 47 | parser.add_argument('-I', '--general_input_ids_file', 48 | dest='general_input_ids_file', 49 | default='cleaned_ids.csv', 50 | help='Filename indicating the ids to be used.') 51 | 52 | args = parser.parse_args() 53 | 54 | 55 | def main(experiment_name, scanner_name, n_bootstrap, n_max_pair, 56 | general_experiment_name, general_scanner_name, general_input_ids_file): 57 | model_name = 'GPR' 58 | 59 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 60 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 61 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 62 | 63 | general_participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'participants.tsv' 64 | general_freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'freesurferData.csv' 65 | 66 | general_ids_path = PROJECT_ROOT / 'outputs' / general_experiment_name / general_input_ids_file 67 | general_dataset = load_freesurfer_dataset(general_participants_path, general_ids_path, general_freesurfer_path) 68 | 69 | # Normalise regional volumes by total intracranial volume (tiv) 70 | general_regions = general_dataset[COLUMNS_NAME].values 71 | 72 | general_tiv = general_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 73 | 74 | x_general = np.true_divide(general_regions, general_tiv) 75 | y_general = general_dataset['Age'].values 76 | 77 | # ---------------------------------------------------------------------------------------- 78 | 79 | # Loop over the 20 bootstrap samples with up to 20 gender-balanced subject pairs per age group/year 80 | for i_n_subject_pairs in range(1, n_max_pair + 1): 81 | print(f'Bootstrap number of subject pairs: {i_n_subject_pairs}') 82 | ids_with_n_subject_pairs_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'ids' 83 | 84 | scores_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'scores' 85 | scores_dir.mkdir(exist_ok=True) 86 | 87 | # Loop over the 1000 random subject samples per bootstrap 88 | for i_bootstrap in range(n_bootstrap): 89 | print(f'Sample number within 
bootstrap: {i_bootstrap}') 90 | 91 | prefix = f'{i_bootstrap:04d}_{i_n_subject_pairs:02d}' 92 | train_dataset = load_freesurfer_dataset(participants_path, 93 | ids_with_n_subject_pairs_dir / f'{prefix}_train.csv', 94 | freesurfer_path) 95 | test_dataset = load_freesurfer_dataset(participants_path, 96 | ids_with_n_subject_pairs_dir / f'{prefix}_test.csv', 97 | freesurfer_path) 98 | 99 | # Initialise random seed 100 | np.random.seed(42) 101 | random.seed(42) 102 | 103 | # Normalise regional volumes by total intracranial volume (tiv) 104 | regions = train_dataset[COLUMNS_NAME].values 105 | 106 | tiv = train_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 107 | 108 | x_train = np.true_divide(regions, tiv) 109 | y_train = train_dataset['Age'].values 110 | 111 | test_tiv = test_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 112 | test_regions = test_dataset[COLUMNS_NAME].values 113 | 114 | x_test = np.true_divide(test_regions, test_tiv) 115 | y_test = test_dataset['Age'].values 116 | 117 | # Robust scaling (centre on the median, scale by the interquartile range) 118 | scaler = RobustScaler() 119 | x_train = scaler.fit_transform(x_train) 120 | x_test = scaler.transform(x_test) 121 | 122 | gpr = GaussianProcessRegressor(kernel=DotProduct(), random_state=0) 123 | 124 | gpr.fit(x_train, y_train) 125 | 126 | # Test data 127 | predictions = gpr.predict(x_test) 128 | mae = mean_absolute_error(y_test, predictions) 129 | rmse = sqrt(mean_squared_error(y_test, predictions)) 130 | r2 = r2_score(y_test, predictions) 131 | age_error_corr, _ = stats.spearmanr(np.abs(y_test - predictions), y_test) 132 | 133 | scores = np.array([r2, mae, rmse, age_error_corr]) 134 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}.npy'), scores) 135 | 136 | print(f'R2: {r2:0.3f} MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 137 | 138 | # Train data 139 | train_predictions = gpr.predict(x_train) 140 | train_mae = mean_absolute_error(y_train, train_predictions) 141 | train_rmse = sqrt(mean_squared_error(y_train, train_predictions)) 142 | train_r2 = r2_score(y_train, train_predictions) 143 | train_age_error_corr, _ = stats.spearmanr(np.abs(y_train - train_predictions), y_train) 144 | 145 | train_scores = np.array([train_r2, train_mae, train_rmse, train_age_error_corr]) 146 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_train.npy'), train_scores) 147 | 148 | # Generalisation data 149 | x_general_norm = scaler.transform(x_general) 150 | general_predictions = gpr.predict(x_general_norm) 151 | general_mae = mean_absolute_error(y_general, general_predictions) 152 | general_rmse = sqrt(mean_squared_error(y_general, general_predictions)) 153 | general_r2 = r2_score(y_general, general_predictions) 154 | general_age_error_corr, _ = stats.spearmanr(np.abs(y_general - general_predictions), y_general) 155 | 156 | general_scores = np.array([general_r2, general_mae, general_rmse, general_age_error_corr]) 157 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_general.npy'), general_scores) 158 | 159 | 160 | if __name__ == '__main__': 161 | main(args.experiment_name, args.scanner_name, 162 | args.n_bootstrap, args.n_max_pair, 163 | args.general_experiment_name, args.general_scanner_name, args.general_input_ids_file) 164 | -------------------------------------------------------------------------------- /src/sample_size/sample_size_fs_data_rvm_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to perform the sample size 
analysis using Relevance Vector Machine. """ 3 | import argparse 4 | import random 5 | import warnings 6 | from math import sqrt 7 | from pathlib import Path 8 | 9 | import numpy as np 10 | from scipy import stats 11 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 12 | from sklearn.preprocessing import RobustScaler 13 | from sklearn_rvm import EMRVR 14 | 15 | from utils import COLUMNS_NAME, load_freesurfer_dataset 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | warnings.filterwarnings('ignore') 20 | 21 | parser = argparse.ArgumentParser() 22 | 23 | parser.add_argument('-E', '--experiment_name', 24 | dest='experiment_name', 25 | help='Name of the experiment.') 26 | 27 | parser.add_argument('-S', '--scanner_name', 28 | dest='scanner_name', 29 | help='Name of the scanner.') 30 | 31 | parser.add_argument('-N', '--n_bootstrap', 32 | dest='n_bootstrap', 33 | type=int, default=1000, 34 | help='Number of bootstrap iterations.') 35 | 36 | parser.add_argument('-R', '--n_max_pair', 37 | dest='n_max_pair', 38 | type=int, default=20, 39 | help='Maximum number of pairs.') 40 | 41 | parser.add_argument('-G', '--general_experiment_name', 42 | dest='general_experiment_name', 43 | help='Name of the generalization experiment.') 44 | 45 | parser.add_argument('-C', '--general_scanner_name', 46 | dest='general_scanner_name', 47 | help='Name of the scanner for generalization.') 48 | 49 | parser.add_argument('-I', '--general_input_ids_file', 50 | dest='general_input_ids_file', 51 | default='cleaned_ids.csv', 52 | help='Filename indicating the ids to be used.') 53 | 54 | args = parser.parse_args() 55 | 56 | 57 | def main(experiment_name, scanner_name, n_bootstrap, n_max_pair, 58 | general_experiment_name, general_scanner_name, general_input_ids_file): 59 | model_name = 'RVM' 60 | 61 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 62 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 63 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 64 | 65 | general_participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'participants.tsv' 66 | general_freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'freesurferData.csv' 67 | 68 | general_ids_path = PROJECT_ROOT / 'outputs' / general_experiment_name / general_input_ids_file 69 | general_dataset = load_freesurfer_dataset(general_participants_path, general_ids_path, general_freesurfer_path) 70 | 71 | # Normalise regional volumes by total intracranial volume (tiv) 72 | general_regions = general_dataset[COLUMNS_NAME].values 73 | 74 | general_tiv = general_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 75 | 76 | x_general = np.true_divide(general_regions, general_tiv) 77 | y_general = general_dataset['Age'].values 78 | 79 | # ---------------------------------------------------------------------------------------- 80 | 81 | # Loop over the 20 bootstrap samples with up to 20 gender-balanced subject pairs per age group/year 82 | for i_n_subject_pairs in range(1, n_max_pair + 1): 83 | print(f'Bootstrap number of subject pairs: {i_n_subject_pairs}') 84 | ids_with_n_subject_pairs_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'ids' 85 | 86 | scores_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'scores' 87 | scores_dir.mkdir(exist_ok=True) 88 | 89 | # Loop over the 1000 random subject samples per bootstrap 90 | for i_bootstrap in range(n_bootstrap): 91 | print(f'Sample number 
within bootstrap: {i_bootstrap}') 92 | 93 | prefix = f'{i_bootstrap:04d}_{i_n_subject_pairs:02d}' 94 | train_dataset = load_freesurfer_dataset(participants_path, 95 | ids_with_n_subject_pairs_dir / f'{prefix}_train.csv', 96 | freesurfer_path) 97 | test_dataset = load_freesurfer_dataset(participants_path, 98 | ids_with_n_subject_pairs_dir / f'{prefix}_test.csv', 99 | freesurfer_path) 100 | 101 | # Initialise random seed 102 | np.random.seed(42) 103 | random.seed(42) 104 | 105 | # Normalise regional volumes by total intracranial volume (tiv) 106 | regions = train_dataset[COLUMNS_NAME].values 107 | 108 | tiv = train_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 109 | 110 | x_train = np.true_divide(regions, tiv) 111 | y_train = train_dataset['Age'].values 112 | 113 | test_tiv = test_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 114 | test_regions = test_dataset[COLUMNS_NAME].values 115 | 116 | x_test = np.true_divide(test_regions, test_tiv) 117 | y_test = test_dataset['Age'].values 118 | 119 | # Robust scaling (centre on the median, scale by the interquartile range) 120 | scaler = RobustScaler() 121 | x_train = scaler.fit_transform(x_train) 122 | x_test = scaler.transform(x_test) 123 | 124 | # Fit the RVM (its hyperparameters are estimated automatically, so no grid search is needed) 125 | rvm = EMRVR(kernel='linear', threshold_alpha=1e9) 126 | rvm.fit(x_train, y_train) 127 | 128 | # Test data 129 | predictions = rvm.predict(x_test) 130 | mae = mean_absolute_error(y_test, predictions) 131 | rmse = sqrt(mean_squared_error(y_test, predictions)) 132 | r2 = r2_score(y_test, predictions) 133 | age_error_corr, _ = stats.spearmanr(np.abs(y_test - predictions), y_test) 134 | 135 | scores = np.array([r2, mae, rmse, age_error_corr]) 136 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}.npy'), scores) 137 | 138 | print(f'R2: {r2:0.3f} MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 139 | 140 | # Train data 141 | train_predictions = rvm.predict(x_train) 142 | train_mae = mean_absolute_error(y_train, train_predictions) 143 | train_rmse = sqrt(mean_squared_error(y_train, train_predictions)) 144 | train_r2 = r2_score(y_train, train_predictions) 145 | train_age_error_corr, _ = stats.spearmanr(np.abs(y_train - train_predictions), y_train) 146 | 147 | train_scores = np.array([train_r2, train_mae, train_rmse, train_age_error_corr]) 148 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_train.npy'), train_scores) 149 | 150 | # Generalisation data 151 | x_general_norm = scaler.transform(x_general) 152 | general_predictions = rvm.predict(x_general_norm) 153 | general_mae = mean_absolute_error(y_general, general_predictions) 154 | general_rmse = sqrt(mean_squared_error(y_general, general_predictions)) 155 | general_r2 = r2_score(y_general, general_predictions) 156 | general_age_error_corr, _ = stats.spearmanr(np.abs(y_general - general_predictions), y_general) 157 | 158 | general_scores = np.array([general_r2, general_mae, general_rmse, general_age_error_corr]) 159 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_general.npy'), general_scores) 160 | 161 | 162 | if __name__ == '__main__': 163 | main(args.experiment_name, args.scanner_name, 164 | args.n_bootstrap, args.n_max_pair, 165 | args.general_experiment_name, args.general_scanner_name, args.general_input_ids_file) 166 | -------------------------------------------------------------------------------- /src/sample_size/sample_size_voxel_data_rvm_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 
| """Script to perform the sample size analysis using RVM on bootstrap datasets of UK BIOBANK Scanner1.""" 3 | import argparse 4 | import random 5 | import warnings 6 | from math import sqrt 7 | from pathlib import Path 8 | 9 | import numpy as np 10 | import pandas as pd 11 | from scipy import stats 12 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 13 | from sklearn_rvm import EMRVR 14 | 15 | from utils import load_demographic_data 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | warnings.filterwarnings('ignore') 20 | 21 | parser = argparse.ArgumentParser() 22 | 23 | parser.add_argument('-E', '--experiment_name', 24 | dest='experiment_name', 25 | help='Name of the experiment.') 26 | 27 | parser.add_argument('-S', '--scanner_name', 28 | dest='scanner_name', 29 | help='Name of the scanner.') 30 | 31 | parser.add_argument('-N', '--n_bootstrap', 32 | dest='n_bootstrap', 33 | type=int, default=1000, 34 | help='Number of bootstrap iterations.') 35 | 36 | parser.add_argument('-R', '--n_max_pair', 37 | dest='n_max_pair', 38 | type=int, default=20, 39 | help='Maximum number of pairs.') 40 | 41 | parser.add_argument('-G', '--general_experiment_name', 42 | dest='general_experiment_name', 43 | help='Name of the generalization experiment.') 44 | 45 | parser.add_argument('-C', '--general_scanner_name', 46 | dest='general_scanner_name', 47 | help='Name of the scanner for generalization.') 48 | 49 | parser.add_argument('-I', '--general_input_ids_file', 50 | dest='general_input_ids_file', 51 | default='cleaned_ids.csv', 52 | help='Filename indicating the ids to be used.') 53 | 54 | args = parser.parse_args() 55 | 56 | 57 | def main(experiment_name, scanner_name, n_bootstrap, n_max_pair, 58 | general_experiment_name, general_scanner_name, general_input_ids_file): 59 | model_name = 'voxel_RVM' 60 | 61 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 62 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 63 | 64 | # Load the Gram matrix 65 | kernel_path = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel.csv' 66 | kernel = pd.read_csv(kernel_path, header=0, index_col=0) 67 | 68 | general_participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'participants.tsv' 69 | general_ids_path = PROJECT_ROOT / 'outputs' / general_experiment_name / general_input_ids_file 70 | 71 | kernel_path_general = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel_general.csv' 72 | kernel_general = pd.read_csv(kernel_path_general, header=0, index_col=0) 73 | general_dataset = load_demographic_data(general_participants_path, general_ids_path) 74 | 75 | y_general = general_dataset['Age'].values 76 | 77 | # ---------------------------------------------------------------------------------------- 78 | # Loop over the 20 bootstrap samples with up to 20 gender-balanced subject pairs per age group/year 79 | for i_n_subject_pairs in range(1, n_max_pair + 1): 80 | print(f'Bootstrap number of subject pairs: {i_n_subject_pairs}') 81 | ids_with_n_subject_pairs_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'ids' 82 | 83 | scores_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'scores' 84 | scores_dir.mkdir(exist_ok=True) 85 | 86 | # Loop over the 1000 random subject samples per bootstrap 87 | for i_bootstrap in range(n_bootstrap): 88 | print(f'Sample number within bootstrap: {i_bootstrap}') 89 | 90 | prefix = f'{i_bootstrap:04d}_{i_n_subject_pairs:02d}' 91 | train_dataset = load_demographic_data(participants_path, 92 | 
ids_with_n_subject_pairs_dir / f'{prefix}_train.csv') 93 | test_dataset = load_demographic_data(participants_path, 94 | ids_with_n_subject_pairs_dir / f'{prefix}_test.csv') 95 | 96 | # Initialise random seed 97 | np.random.seed(42) 98 | random.seed(42) 99 | 100 | train_index = train_dataset['image_id'] 101 | test_index = test_dataset['image_id'] 102 | 103 | x_train = kernel.loc[train_index, train_index].values 104 | x_test = kernel.loc[test_index, train_index].values 105 | 106 | y_train = train_dataset['Age'].values 107 | y_test = test_dataset['Age'].values 108 | 109 | model = EMRVR(kernel='precomputed', threshold_alpha=1e9) 110 | model.fit(x_train, y_train) 111 | predictions = model.predict(x_test) 112 | 113 | mae = mean_absolute_error(y_test, predictions) 114 | rmse = sqrt(mean_squared_error(y_test, predictions)) 115 | r2 = r2_score(y_test, predictions) 116 | age_error_corr, _ = stats.spearmanr(np.abs(y_test - predictions), y_test) 117 | 118 | scores = np.array([r2, mae, rmse, age_error_corr]) 119 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}.npy'), scores) 120 | 121 | print(f'R2: {r2:0.3f} MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 122 | 123 | # Train data 124 | train_predictions = model.predict(x_train) 125 | train_mae = mean_absolute_error(y_train, train_predictions) 126 | train_rmse = sqrt(mean_squared_error(y_train, train_predictions)) 127 | train_r2 = r2_score(y_train, train_predictions) 128 | train_age_error_corr, _ = stats.spearmanr(np.abs(y_train - train_predictions), y_train) 129 | 130 | train_scores = np.array([train_r2, train_mae, train_rmse, train_age_error_corr]) 131 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_train.npy'), train_scores) 132 | 133 | # Generalisation data (kernel between the generalisation subjects and the training subjects) 134 | x_general = kernel_general.loc[train_index, :].T.values 135 | general_predictions = model.predict(x_general) 136 | general_mae = mean_absolute_error(y_general, general_predictions) 137 | general_rmse = sqrt(mean_squared_error(y_general, general_predictions)) 138 | general_r2 = r2_score(y_general, general_predictions) 139 | general_age_error_corr, _ = stats.spearmanr(np.abs(y_general - general_predictions), y_general) 140 | 141 | general_scores = np.array([general_r2, general_mae, general_rmse, general_age_error_corr]) 142 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_general.npy'), general_scores) 143 | 144 | 145 | if __name__ == '__main__': 146 | main(args.experiment_name, args.scanner_name, 147 | args.n_bootstrap, args.n_max_pair, 148 | args.general_experiment_name, args.general_scanner_name, args.general_input_ids_file) 149 | -------------------------------------------------------------------------------- /src/sample_size/sample_size_voxel_data_svm_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to perform the sample size analysis using SVM (linear SVR) on bootstrap datasets of UK BIOBANK Scanner1.""" 3 | import argparse 4 | import random 5 | import warnings 6 | from math import sqrt 7 | from pathlib import Path 8 | 9 | import numpy as np 10 | import pandas as pd 11 | from scipy import stats 12 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 13 | from sklearn.model_selection import GridSearchCV, KFold 14 | from sklearn.svm import SVR 15 | 16 | from utils import load_demographic_data 17 | 18 | PROJECT_ROOT = Path.cwd() 19 | 20 | warnings.filterwarnings('ignore') 21 | 22 | parser = argparse.ArgumentParser() 23 | 24 | 
parser.add_argument('-E', '--experiment_name', 25 | dest='experiment_name', 26 | help='Name of the experiment.') 27 | 28 | parser.add_argument('-S', '--scanner_name', 29 | dest='scanner_name', 30 | help='Name of the scanner.') 31 | 32 | parser.add_argument('-N', '--n_bootstrap', 33 | dest='n_bootstrap', 34 | type=int, default=1000, 35 | help='Number of bootstrap iterations.') 36 | 37 | parser.add_argument('-R', '--n_max_pair', 38 | dest='n_max_pair', 39 | type=int, default=20, 40 | help='Number maximum of pairs.') 41 | 42 | parser.add_argument('-G', '--general_experiment_name', 43 | dest='general_experiment_name', 44 | help='Name of the experiment.') 45 | 46 | parser.add_argument('-C', '--general_scanner_name', 47 | dest='general_scanner_name', 48 | help='Name of the scanner for generalization.') 49 | 50 | parser.add_argument('-I', '--general_input_ids_file', 51 | dest='general_input_ids_file', 52 | default='cleaned_ids.csv', 53 | help='Filename indicating the ids to be used.') 54 | 55 | args = parser.parse_args() 56 | 57 | 58 | def main(experiment_name, scanner_name, n_bootstrap, n_max_pair, 59 | general_experiment_name, general_scanner_name, general_input_ids_file): 60 | # ---------------------------------------------------------------------------------------- 61 | model_name = 'voxel_SVM' 62 | 63 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 64 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 65 | 66 | # Load the Gram matrix 67 | kernel_path = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel.csv' 68 | kernel = pd.read_csv(kernel_path, header=0, index_col=0) 69 | 70 | general_participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'participants.tsv' 71 | general_ids_path = PROJECT_ROOT / 'outputs' / general_experiment_name / general_input_ids_file 72 | 73 | kernel_path_general = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel_general.csv' 74 | kernel_general = pd.read_csv(kernel_path_general, header=0, index_col=0) 75 | general_dataset = load_demographic_data(general_participants_path, general_ids_path) 76 | 77 | y_general = general_dataset['Age'].values 78 | 79 | # ---------------------------------------------------------------------------------------- 80 | # Loop over the 20 bootstrap samples with up to 20 gender-balanced subject pairs per age group/year 81 | for i_n_subject_pairs in range(1, n_max_pair + 1): 82 | print(f'Bootstrap number of subject pairs: {i_n_subject_pairs}') 83 | ids_with_n_subject_pairs_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'ids' 84 | 85 | scores_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'scores' 86 | scores_dir.mkdir(exist_ok=True) 87 | 88 | # Loop over the 1000 random subject samples per bootstrap 89 | for i_bootstrap in range(n_bootstrap): 90 | print(f'Sample number within bootstrap: {i_bootstrap}') 91 | 92 | prefix = f'{i_bootstrap:04d}_{i_n_subject_pairs:02d}' 93 | train_dataset = load_demographic_data(participants_path, 94 | ids_with_n_subject_pairs_dir / f'{prefix}_train.csv') 95 | test_dataset = load_demographic_data(participants_path, 96 | ids_with_n_subject_pairs_dir / f'{prefix}_test.csv') 97 | 98 | # Initialise random seed 99 | np.random.seed(42) 100 | random.seed(42) 101 | 102 | train_index = train_dataset['image_id'] 103 | test_index = test_dataset['image_id'] 104 | 105 | x_train = kernel.loc[train_index, train_index].values 106 | x_test = kernel.loc[test_index, train_index].values 107 | 108 | y_train = 
train_dataset['Age'].values 109 | y_test = test_dataset['Age'].values 110 | 111 | model = SVR(kernel='precomputed') 112 | 113 | # Systematic search for best hyperparameters 114 | search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0, 2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]} 115 | n_nested_folds = 5 116 | nested_kf = KFold(n_splits=n_nested_folds, shuffle=True, random_state=i_bootstrap) 117 | gridsearch = GridSearchCV(model, 118 | param_grid=search_space, 119 | scoring='neg_mean_absolute_error', 120 | refit=True, cv=nested_kf, 121 | verbose=0, n_jobs=1) 122 | 123 | gridsearch.fit(x_train, y_train) 124 | 125 | best_model = gridsearch.best_estimator_ 126 | 127 | # Test data 128 | predictions = best_model.predict(x_test) 129 | mae = mean_absolute_error(y_test, predictions) 130 | rmse = sqrt(mean_squared_error(y_test, predictions)) 131 | r2 = r2_score(y_test, predictions) 132 | age_error_corr, _ = stats.spearmanr(np.abs(y_test - predictions), y_test) 133 | 134 | scores = np.array([r2, mae, rmse, age_error_corr]) 135 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}.npy'), scores) 136 | 137 | print(f'R2: {r2:0.3f} MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 138 | 139 | # Train data 140 | train_predictions = best_model.predict(x_train) 141 | train_mae = mean_absolute_error(y_train, train_predictions) 142 | train_rmse = sqrt(mean_squared_error(y_train, train_predictions)) 143 | train_r2 = r2_score(y_train, train_predictions) 144 | train_age_error_corr, _ = stats.spearmanr(np.abs(y_train - train_predictions), y_train) 145 | 146 | train_scores = np.array([train_r2, train_mae, train_rmse, train_age_error_corr]) 147 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_train.npy'), train_scores) 148 | 149 | # Generalisation data (kernel between the generalisation subjects and the training subjects) 150 | x_general = kernel_general.loc[train_index, :].T.values 151 | general_predictions = best_model.predict(x_general) 152 | general_mae = mean_absolute_error(y_general, general_predictions) 153 | general_rmse = sqrt(mean_squared_error(y_general, general_predictions)) 154 | general_r2 = r2_score(y_general, general_predictions) 155 | general_age_error_corr, _ = stats.spearmanr(np.abs(y_general - general_predictions), y_general) 156 | 157 | general_scores = np.array([general_r2, general_mae, general_rmse, general_age_error_corr]) 158 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_general.npy'), general_scores) 159 | 160 | 161 | if __name__ == '__main__': 162 | main(args.experiment_name, args.scanner_name, 163 | args.n_bootstrap, args.n_max_pair, 164 | args.general_experiment_name, args.general_scanner_name, args.general_input_ids_file) 165 | -------------------------------------------------------------------------------- /src/utils.py: -------------------------------------------------------------------------------- 1 | """Helper functions and constants.""" 2 | import pandas as pd 3 | import numpy as np 4 | from scipy import stats 5 | 6 | 7 | def load_freesurfer_dataset(participants_path, ids_path, freesurfer_path): 8 | """Load dataset.""" 9 | demographic_data = load_demographic_data(participants_path, ids_path) 10 | 11 | freesurfer_df = pd.read_csv(freesurfer_path) 12 | 13 | dataset_df = pd.merge(freesurfer_df, demographic_data, on='image_id') 14 | 15 | return dataset_df 16 | 17 | 18 | def load_demographic_data(participants_path, ids_path): 19 | """Load dataset using selected ids.""" 20 | participants_df = pd.read_csv(participants_path, sep='\t') 21 | participants_df = participants_df.dropna() 22 | 23 | 
ids_df = pd.read_csv(ids_path, usecols=['image_id']) 24 | 25 | ids_df['participant_id'] = ids_df['image_id'].str.split('_').str[0] 26 | 27 | dataset_df = pd.merge(ids_df, participants_df, on='participant_id') 28 | 29 | return dataset_df 30 | 31 | 32 | def ttest_ind_corrected(performance_a, performance_b, k=10, r=10): 33 | """Corrected repeated k-fold cv test. 34 | The test assumes that the classifiers were evaluated using cross validation. 35 | 36 | Ref: 37 | Bouckaert, Remco R., and Eibe Frank. "Evaluating the replicability of significance tests for comparing learning 38 | algorithms." Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg, 2004 39 | 40 | Args: 41 | performance_a: performances from classifier A 42 | performance_b: performances from classifier B 43 | k: number of folds 44 | r: number of repetitions 45 | 46 | Returns: 47 | t: t-statistic of the corrected test. 48 | prob: p-value of the corrected test. 49 | """ 50 | df = k * r - 1 51 | 52 | x = performance_a - performance_b 53 | m = np.mean(x) 54 | 55 | sigma_2 = np.var(x, ddof=1) 56 | denom = np.sqrt((1 / (k * r) + 1 / (k - 1)) * sigma_2)  # corrected variance term: 1/(k*r) + n_test/n_train, with n_test/n_train = 1/(k-1) for k-fold CV 57 | 58 | with np.errstate(divide='ignore', invalid='ignore'): 59 | t = np.divide(m, denom) 60 | 61 | prob = stats.t.sf(np.abs(t), df) * 2 62 | 63 | return t, prob 64 | 65 | 66 | COLUMNS_NAME = ['Left-Lateral-Ventricle', 67 | 'Left-Inf-Lat-Vent', 68 | 'Left-Cerebellum-White-Matter', 69 | 'Left-Cerebellum-Cortex', 70 | 'Left-Thalamus-Proper', 71 | 'Left-Caudate', 72 | 'Left-Putamen', 73 | 'Left-Pallidum', 74 | '3rd-Ventricle', 75 | '4th-Ventricle', 76 | 'Brain-Stem', 77 | 'Left-Hippocampus', 78 | 'Left-Amygdala', 79 | 'CSF', 80 | 'Left-Accumbens-area', 81 | 'Left-VentralDC', 82 | 'Right-Lateral-Ventricle', 83 | 'Right-Inf-Lat-Vent', 84 | 'Right-Cerebellum-White-Matter', 85 | 'Right-Cerebellum-Cortex', 86 | 'Right-Thalamus-Proper', 87 | 'Right-Caudate', 88 | 'Right-Putamen', 89 | 'Right-Pallidum', 90 | 'Right-Hippocampus', 91 | 'Right-Amygdala', 92 | 'Right-Accumbens-area', 93 | 'Right-VentralDC', 94 | 'CC_Posterior', 95 | 'CC_Mid_Posterior', 96 | 'CC_Central', 97 | 'CC_Mid_Anterior', 98 | 'CC_Anterior', 99 | 'lh_bankssts_volume', 100 | 'lh_caudalanteriorcingulate_volume', 101 | 'lh_caudalmiddlefrontal_volume', 102 | 'lh_cuneus_volume', 103 | 'lh_entorhinal_volume', 104 | 'lh_fusiform_volume', 105 | 'lh_inferiorparietal_volume', 106 | 'lh_inferiortemporal_volume', 107 | 'lh_isthmuscingulate_volume', 108 | 'lh_lateraloccipital_volume', 109 | 'lh_lateralorbitofrontal_volume', 110 | 'lh_lingual_volume', 111 | 'lh_medialorbitofrontal_volume', 112 | 'lh_middletemporal_volume', 113 | 'lh_parahippocampal_volume', 114 | 'lh_paracentral_volume', 115 | 'lh_parsopercularis_volume', 116 | 'lh_parsorbitalis_volume', 117 | 'lh_parstriangularis_volume', 118 | 'lh_pericalcarine_volume', 119 | 'lh_postcentral_volume', 120 | 'lh_posteriorcingulate_volume', 121 | 'lh_precentral_volume', 122 | 'lh_precuneus_volume', 123 | 'lh_rostralanteriorcingulate_volume', 124 | 'lh_rostralmiddlefrontal_volume', 125 | 'lh_superiorfrontal_volume', 126 | 'lh_superiorparietal_volume', 127 | 'lh_superiortemporal_volume', 128 | 'lh_supramarginal_volume', 129 | 'lh_frontalpole_volume', 130 | 'lh_temporalpole_volume', 131 | 'lh_transversetemporal_volume', 132 | 'lh_insula_volume', 133 | 'rh_bankssts_volume', 134 | 'rh_caudalanteriorcingulate_volume', 135 | 'rh_caudalmiddlefrontal_volume', 136 | 'rh_cuneus_volume', 137 | 'rh_entorhinal_volume', 138 | 'rh_fusiform_volume', 139 | 
'rh_inferiorparietal_volume', 140 | 'rh_inferiortemporal_volume', 141 | 'rh_isthmuscingulate_volume', 142 | 'rh_lateraloccipital_volume', 143 | 'rh_lateralorbitofrontal_volume', 144 | 'rh_lingual_volume', 145 | 'rh_medialorbitofrontal_volume', 146 | 'rh_middletemporal_volume', 147 | 'rh_parahippocampal_volume', 148 | 'rh_paracentral_volume', 149 | 'rh_parsopercularis_volume', 150 | 'rh_parsorbitalis_volume', 151 | 'rh_parstriangularis_volume', 152 | 'rh_pericalcarine_volume', 153 | 'rh_postcentral_volume', 154 | 'rh_posteriorcingulate_volume', 155 | 'rh_precentral_volume', 156 | 'rh_precuneus_volume', 157 | 'rh_rostralanteriorcingulate_volume', 158 | 'rh_rostralmiddlefrontal_volume', 159 | 'rh_superiorfrontal_volume', 160 | 'rh_superiorparietal_volume', 161 | 'rh_superiortemporal_volume', 162 | 'rh_supramarginal_volume', 163 | 'rh_frontalpole_volume', 164 | 'rh_temporalpole_volume', 165 | 'rh_transversetemporal_volume', 166 | 'rh_insula_volume'] 167 | --------------------------------------------------------------------------------
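For illustration, a minimal usage sketch of the ttest_ind_corrected helper defined in src/utils.py above, assuming it is run from the src directory so that utils is importable; the MAE arrays, fold count, and repetition count are hypothetical placeholders rather than values taken from the repository.

import numpy as np

from utils import ttest_ind_corrected

# Hypothetical MAE scores from r = 10 repetitions of k = 10-fold cross-validation,
# i.e. k * r = 100 scores per model, one per fold and repetition.
rng = np.random.default_rng(0)
mae_model_a = rng.normal(loc=4.5, scale=0.3, size=100)
mae_model_b = rng.normal(loc=4.3, scale=0.3, size=100)

# Corrected paired t-test that accounts for the dependence between CV folds
t_stat, p_value = ttest_ind_corrected(mae_model_a, mae_model_b, k=10, r=10)
print(f't = {t_stat:.3f}, p = {p_value:.3f}')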