├── .gitignore ├── LICENSE ├── README.md ├── command_list.sh ├── data └── README.md ├── imaging_preprocessing_ANTs ├── README.md ├── antsBrainExtraction.sh ├── antsRegistrationSyNQuick.sh ├── mni_icbm152_t1_tal_nlin_sym_09c.nii ├── mni_icbm152_t1_tal_nlin_sym_09c_brain.nii.gz ├── mni_icbm152_t1_tal_nlin_sym_09c_mask.nii └── sge_job_script_ants.sh ├── notebooks └── Boston_feature_importance.ipynb ├── outputs └── README.md ├── requirements.txt └── src ├── README.md ├── comparison ├── README.md ├── comparison_calculate_mean_predictions.py ├── comparison_fs_data_train_gp.py ├── comparison_fs_data_train_rvm.py ├── comparison_fs_data_train_svm.py ├── comparison_pca_data_train_gp.py ├── comparison_pca_data_train_rvm.py ├── comparison_pca_data_train_svm.py ├── comparison_statistical_analysis.py ├── comparison_voxel_data_rvm_relevance_vectors_weights.py ├── comparison_voxel_data_svm_primal_weights.py ├── comparison_voxel_data_train_rvm.py └── comparison_voxel_data_train_svm.py ├── download ├── README.md ├── download_ants_data.py └── download_data.py ├── generalisation ├── README.md ├── generalisation_calculate_mean_predictions.py ├── generalisation_test_fs_data.py ├── generalisation_test_pca_data.py ├── generalisation_test_voxel_data_rvm.py └── generalisation_test_voxel_data_svm.py ├── misc ├── README.md ├── misc_svm_hyperparameters_analysis.py └── misc_univariate_analysis.py ├── preprocessing ├── README.md ├── clean_data.py ├── compute_kernel_matrix.py ├── compute_kernel_matrix_general.py ├── compute_pca_variance_explained.py ├── compute_principal_components.py ├── create_pca_models.py ├── homogenize_gender.py └── quality_control.py ├── sample_size ├── README.md ├── sample_size_create_figures.py ├── sample_size_create_ids.py ├── sample_size_fs_data_gp_analysis.py ├── sample_size_fs_data_rvm_analysis.py ├── sample_size_fs_data_svm_analysis.py ├── sample_size_pca_data_gp_analysis.py ├── sample_size_pca_data_rvm_analysis.py ├── sample_size_pca_data_svm_analysis.py ├── sample_size_voxel_data_rvm_analysis.py └── sample_size_voxel_data_svm_analysis.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Environments 2 | venv/ 3 | 4 | # Pycharm 5 | .idea/ 6 | 7 | # Byte-compiled / optimized / DLL files 8 | __pycache__/ 9 | 10 | # Project's dataset 11 | data/ 12 | !data/README.md -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Jessica Daflon, Lea Baecker, Pedro Ferreira da Costa, and 4 | Walter Hugo Lopez Pinaya 5 | 6 | Permission is hereby granted, free of charge, to any person obtaining a copy 7 | of this software and associated documentation files (the "Software"), to deal 8 | in the Software without restriction, including without limitation the rights 9 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 10 | copies of the Software, and to permit persons to whom the Software is 11 | furnished to do so, subject to the following conditions: 12 | 13 | The above copyright notice and this permission notice shall be included in all 14 | copies or substantial portions of the Software. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 17 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 18 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 19 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 20 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 21 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 22 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Brain age prediction: A comparison between machine learning models using region- and voxel-based morphometric data 2 | [![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg)](https://github.com/MLMH-Lab/Brain-age-prediction/blob/master/LICENSE) 3 | 4 | Official script for the paper "Brain age prediction: A comparison between machine learning models using region- and voxel-based morphometric data". 5 | 6 | ## Abstract 7 | Brain age prediction can be used to detect abnormalities in the ageing trajectory of an individual and their associated health issues. Existing studies on brain age vary widely in terms of their methods and type of data, so at present the most accurate and generalisable methodological approach is unclear. We used the UK Biobank dataset (N = 10,814) to compare the performance of the machine learning models support vector regression (SVR), relevance vector regression (RVR), and Gaussian process regression (GPR) on whole-brain region-based or voxel-based structural Magnetic Resonance Imaging data with or without dimensionality reduction through principal component analysis (PCA). Performance was assessed in the validation set through cross-validation as well as an independent test set. The models achieved mean absolute errors between 3.7 and 4.7 years, with those trained on voxel-level data with PCA performing best. There was little difference in performance between models trained on the same data type, indicating that the type of input data has greater impact on performance than model choice. Furthermore, dataset size analysis revealed that RVR required around half the sample size than SVR and GPR to yield generalisable results (approx. 120 subjects). Our results illustrated that the most suitable methodological approach for a brain age study depends on the sample size and the available computational and time resources. We are making all of our scripts open source in the hope that this will aid future research. 8 | 9 | ## Test our models online 10 | 11 | ## Citation 12 | If you find this code useful for your research, please cite: 13 | 14 | Baecker L, Dafflon J, da Costa PF, Garcia-Dias R, Vieira S, Scarpazza C, Calhoun VD, Sato JR, Mechelli A*, Pinaya WHL* (in press). Brain age prediction: A comparison between machine learning models using region- and voxel-based morphometric data. Human Brain Mapping. 
* These authors contributed equally to this work 15 | -------------------------------------------------------------------------------- /command_list.sh: -------------------------------------------------------------------------------- 1 | ## Initiate virtual environment 2 | #source venv/bin/activate 3 | # 4 | ## Make all files executable 5 | #chmod -R +x ./ 6 | # 7 | export PYTHONPATH=$PYTHONPATH:./src 8 | ## Run python scripts 9 | ## ----------------------------- Getting data ------------------------------------- 10 | ## Download data from network-attached storage (MLMH lab use only) 11 | ./src/download/download_data.py -N "/run/user/1000/gvfs/smb-share:server=kc-deeplab.local,share=deeplearning/" 12 | ./src/download/download_ants_data.py -N "/run/user/1000/gvfs/smb-share:server=kc-deeplab.local,share=deeplearning/" -S "BIOBANK-SCANNER01" -O "/media/kcl_1/SSD2" 13 | ./src/download/download_ants_data.py -N "/run/user/1000/gvfs/smb-share:server=kc-deeplab.local,share=deeplearning/" -S "BIOBANK-SCANNER02" -O "/media/kcl_1/HDD/DATASETS/BIOBANK" 14 | 15 | # ----------------------------- Preprocessing ------------------------------------ 16 | # Clean UK Biobank data 17 | ./src/preprocessing/clean_data.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 18 | ./src/preprocessing/clean_data.py -E "biobank_scanner2" -S "BIOBANK-SCANNER02" 19 | 20 | # Perform quality control 21 | ./src/preprocessing/quality_control.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 22 | ./src/preprocessing/quality_control.py -E "biobank_scanner2" -S "BIOBANK-SCANNER02" 23 | 24 | # Make gender homogeneous along age range 25 | # This was only performed in scanner1 because we were concerned not to create a biased regressor 26 | ./src/preprocessing/homogenize_gender.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 27 | 28 | # Create kernel matrix for voxel-based analysis 29 | ./src/preprocessing/compute_kernel_matrix.py -P "/media/kcl_1/SSD2/BIOBANK" -E "biobank_scanner1" 30 | ./src/preprocessing/compute_kernel_matrix_general.py -P "/media/kcl_1/SSD2/BIOBANK" -E "biobank_scanner1" -P2 "/media/kcl_1/HDD/DATASETS/BIOBANK/BIOBANK" -E2 "biobank_scanner2" 31 | 32 | # Create pca models 33 | ./src/preprocessing/create_pca_models.py -P "/media/kcl_1/SSD2/BIOBANK" -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 34 | ./src/preprocessing/compute_principal_components.py -P "/media/kcl_1/SSD2/BIOBANK" -E "biobank_scanner1" 35 | ./src/preprocessing/compute_principal_components.py -P "/media/kcl_1/HDD/DATASETS/BIOBANK/BIOBANK" -E "biobank_scanner2" -I "cleaned_ids.csv" -S "_general" 36 | 37 | # ----------------------------- Regressor comparison ------------------------------------ 38 | ./src/comparison/comparison_fs_data_train_svm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 39 | ./src/comparison/comparison_fs_data_train_rvm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 40 | ./src/comparison/comparison_fs_data_train_gp.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 41 | 42 | ./src/comparison/comparison_voxel_data_train_svm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 43 | ./src/comparison/comparison_voxel_data_train_rvm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 44 | 45 | ./src/comparison/comparison_pca_data_train_rvm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 46 | ./src/comparison/comparison_pca_data_train_svm.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 47 | ./src/comparison/comparison_pca_data_train_gp.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 48 | 49 | ./src/comparison/comparison_statistical_analysis.py -E 
"biobank_scanner1" -S "_all" -M "SVM" "RVM" "GPR" "voxel_SVM" "voxel_RVM" "pca_RVM" "pca_SVM" "pca_GPR" 50 | 51 | ./src/comparison/comparison_voxel_data_svm_primal_weights.py -E "biobank_scanner1" -P "/media/kcl_1/SSD2/BIOBANK" 52 | 53 | ./src/comparison/comparison_voxel_data_rvm_relevance_vectors_weights.py -E "biobank_scanner1" -P "/media/kcl_1/SSD2/BIOBANK" 54 | 55 | ./comparison_feature_importance_visualisation.py 56 | 57 | ## ----------------------------- Generalisation comparison ----------------------- 58 | ./src/generalisation/generalisation_test_fs_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "SVM" -I "cleaned_ids.csv" 59 | ./src/generalisation/generalisation_test_fs_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "RVM" -I "cleaned_ids.csv" 60 | ./src/generalisation/generalisation_test_fs_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "GPR" -I "cleaned_ids.csv" 61 | 62 | ./src/generalisation/generalisation_test_voxel_data_svm.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "voxel_SVM" -P "/media/kcl_1/HDD/DATASETS/BIOBANK/BIOBANK" 63 | ./src/generalisation/generalisation_test_voxel_data_rvm.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "voxel_RVM" -P "/media/kcl_1/HDD/DATASETS/BIOBANK/BIOBANK" 64 | 65 | ./src/generalisation/generalisation_test_pca_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "pca_RVM" -I "cleaned_ids.csv" 66 | ./src/generalisation/generalisation_test_pca_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "pca_SVM" -I "cleaned_ids.csv" 67 | ./src/generalisation/generalisation_test_pca_data.py -T "biobank_scanner1" -G "biobank_scanner2" -S "BIOBANK-SCANNER02" -M "pca_GPR" -I "cleaned_ids.csv" 68 | 69 | ./src/comparison/comparison_statistical_analysis.py -E "biobank_scanner2" -S "_generalization" -M "SVM" "RVM" "GPR" "voxel_SVM" "voxel_RVM" "pca_RVM" "pca_SVM" "pca_GPR" 70 | 71 | # ----------------------------- Training set size analysis ------------------------------------ 72 | ./src/sample_size/sample_size_create_ids.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 73 | 74 | ./src/sample_size/sample_size_fs_data_svm_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -G "biobank_scanner2" -C "BIOBANK-SCANNER02" 75 | ./src/sample_size/sample_size_fs_data_gp_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -G "biobank_scanner2" -C "BIOBANK-SCANNER02" 76 | ./src/sample_size/sample_size_fs_data_rvm_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -G "biobank_scanner2" -C "BIOBANK-SCANNER02" 77 | 78 | ./src/sample_size/sample_size_voxel_data_svm_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -G "biobank_scanner2" -C "BIOBANK-SCANNER02" 79 | ./src/sample_size/sample_size_voxel_data_rvm_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -G "biobank_scanner2" -C "BIOBANK-SCANNER02" 80 | 81 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "SVM" 82 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "RVM" 83 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "GPR" 84 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "pca_RVM" -F 3 -R 10 85 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "pca_SVM" -F 3 -R 10 86 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "pca_GPR" -F 3 -R 10 87 | 
./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "voxel_SVM" 88 | ./src/sample_size/sample_size_create_figures.py -E "biobank_scanner1" -M "voxel_RVM" 89 | 90 | # ----------------------------- Miscellaneous ------------------------------------ 91 | # Univariate analysis on FreeSurfer data 92 | ./src/misc/misc_univariate_analysis.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" 93 | 94 | ./misc_classifier_train_svm.py 95 | ./misc_classifier_regressor_comparison.py 96 | 97 | # Performance of different values of the SVM hyperparameter (C) 98 | ./src/misc/misc_svm_hyperparameters_analysis.py -E "biobank_scanner1" 99 | 100 | # ----------------------------- Exploratory Data Analysis ------------------------------------ 101 | ./src/eda/eda_demographic_data.py -E "biobank_scanner1" -S "BIOBANK-SCANNER01" -U "_homogenized" -I 'homogenized_ids.csv' 102 | ./src/eda/eda_demographic_data.py -E "biobank_scanner2" -S "BIOBANK-SCANNER02" -U "_cleaned" -I 'cleaned_ids.csv' 103 | ./src/eda/eda_education_age.py -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MLMH-Lab/Brain-age-prediction/91afccf9e96187b51fdc0c710e40d9c393a2ceee/data/README.md -------------------------------------------------------------------------------- /imaging_preprocessing_ANTs/README.md: -------------------------------------------------------------------------------- 1 | # ANTS preprocessing -------------------------------------------------------------------------------- /imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c.nii: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MLMH-Lab/Brain-age-prediction/91afccf9e96187b51fdc0c710e40d9c393a2ceee/imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c.nii -------------------------------------------------------------------------------- /imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c_brain.nii.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MLMH-Lab/Brain-age-prediction/91afccf9e96187b51fdc0c710e40d9c393a2ceee/imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c_brain.nii.gz -------------------------------------------------------------------------------- /imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c_mask.nii: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MLMH-Lab/Brain-age-prediction/91afccf9e96187b51fdc0c710e40d9c393a2ceee/imaging_preprocessing_ANTs/mni_icbm152_t1_tal_nlin_sym_09c_mask.nii -------------------------------------------------------------------------------- /imaging_preprocessing_ANTs/sge_job_script_ants.sh: -------------------------------------------------------------------------------- 1 | #!/bin/tcsh 2 | #$ -o $HOME/prep_temp/dataset/logs/ 3 | #$ -e $HOME/prep_temp/dataset/logs/ 4 | #$ -q global 5 | #$ -N dataset_job 6 | #$ -l h_vmem=6G 7 | 8 | # First unload any modules loaded by ~/.cshrc then load the defaults 9 | module purge 10 | module load nan/default 11 | module load sge 12 | # Load in script dependent modules here 13 | module load ants/2.2.0 14 | 15 | # set the working variables 16 | # template and mask downloaded from http://nist.mni.mcgill.ca/?p=904 (ICBM 2009c Nonlinear Symmetric) 17 | set ants_template = 
$HOME/prep_temp/dataset/mni_icbm152_t1_tal_nlin_sym_09c.nii 18 | set ants_brain_mask = $HOME/prep_temp/dataset/mni_icbm152_t1_tal_nlin_sym_09c_mask.nii 19 | 20 | setenv working_data $HOME/prep_temp/dataset 21 | setenv sge_index ${working_data}/sge_index 22 | 23 | 24 | # Search the file for the SGE_TASK_ID number as a line number 25 | set file="`awk 'FNR==$SGE_TASK_ID' ${sge_index}`" 26 | 27 | # Used the tcsh :t to find last part of path, 28 | # then :r to remove .gz then :r to remove .nii 29 | set file_name=${file:t:r:r} 30 | 31 | # Based on https://github.com/ANTsX/ANTs/blob/master/Scripts/antsBrainExtraction.sh 32 | # https://github.com/ntustison/BasicBrainMapping 33 | # https://github.com/ntustison/antsBrainExtractionExample/blob/master/antsBrainExtractionCommand.sh 34 | # https://github.com/ANTsX/ANTs/blob/master/Scripts/antsRegistrationSyN.sh 35 | # https://sourceforge.net/p/advants/discussion/840261/thread/ca08a5aa74/?limit=25 36 | 37 | bash antsBrainExtraction.sh \ 38 | -d 3 \ 39 | -a ${file} \ 40 | -e ${ants_template} \ 41 | -m ${ants_brain_mask} \ 42 | -o ${working_data}/subjects_output/${file_name}_ 43 | 44 | bash antsRegistrationSyNQuick.sh \ 45 | #bash antsRegistrationSyN.sh \ 46 | -d 3 \ 47 | -f ${ants_template} \ 48 | -m ${working_data}/subjects_output/${file_name}_BrainExtractionBrain.nii.gz \ 49 | -x ${ants_brain_mask} \ 50 | -t s \ 51 | -o ${working_data}/subjects_output/${file_name}_ 52 | 53 | 54 | rm ${working_data}/subjects_output/${file_name}_InverseWarped.nii.gz 55 | rm ${working_data}/subjects_output/${file_name}_BrainExtractionPrior0GenericAffine.mat 56 | rm ${working_data}/subjects_output/${file_name}_BrainExtractionMask.nii.gz 57 | rm ${working_data}/subjects_output/${file_name}_BrainExtractionBrain.nii.gz 58 | rm ${working_data}/subjects_output/${file_name}_1Warp.nii.gz 59 | rm ${working_data}/subjects_output/${file_name}_1InverseWarp.nii.gz 60 | rm ${working_data}/subjects_output/${file_name}_0GenericAffine.mat 61 | -------------------------------------------------------------------------------- /outputs/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MLMH-Lab/Brain-age-prediction/91afccf9e96187b51fdc0c710e40d9c393a2ceee/outputs/README.md -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | imageio==2.6.1 2 | joblib==0.14.1 3 | matplotlib==3.1.2 4 | nibabel==3.0.0 5 | nilearn==0.6.0 6 | numpy==1.18.1 7 | pandas==0.25.3 8 | scikit-learn==0.22.1 9 | scipy==1.4.1 10 | sklearn-rvm==0.1 11 | tqdm==4.41.1 12 | statsmodels==0.10.2 -------------------------------------------------------------------------------- /src/README.md: -------------------------------------------------------------------------------- 1 | # Source code 2 | 3 | In this directory, we stored all the script files used in our analysis. 4 | These scripts were divided into subdirectories according to their main 5 | functionality: 6 | 7 | 1. [Comparison between machine learning methods](comparison) 8 | 2. [Measuring the generalization of trained regressors](generalisation) 9 | 3. 
[Performance of models by the size of training set](sample_size) 10 | -------------------------------------------------------------------------------- /src/comparison/README.md: -------------------------------------------------------------------------------- 1 | # Comparison of machine learning methods for brain age prediction 2 | 3 | Here, we assessed differences in prediction performance between machine learning approaches 4 | trained using voxel-based or region-based morphometric MRI data 5 | preprocessed with the ANTs and FreeSurfer software, respectively. 6 | In our analysis, we included the methods most commonly used in the brain age literature: 7 | 8 | Using voxel-based data: 9 | 1. [Support Vector Machine]() 10 | 2. [Relevance Vector Machine]() 11 | 12 | Using principal components from voxel-based data: 13 | 1. [Support Vector Machine]() 14 | 2. [Relevance Vector Machine]() 15 | 3. [Gaussian process model]() 16 | 17 | Using region-based data: 18 | 1. [Support Vector Machine]() 19 | 2. [Relevance Vector Machine]() 20 | 3. [Gaussian process model]() 21 | 22 | Each approach has its advantages and weaknesses. 23 | Voxel-based data preserve most of the information in the raw data with minimal preprocessing. 24 | However, this minimal preprocessing may retain noise and information that is irrelevant to 25 | the task of brain age prediction. The presence of irrelevant features can have a negative impact on 26 | performance (as captured by the common machine learning adage 27 | "garbage in, garbage out"). Such features are especially harmful for shallow machine learning methods, 28 | such as those used in this study. 29 | For this reason, feature engineering steps are commonly applied. These steps 30 | can include feature selection, dimensionality reduction, feature extraction, etc. 31 | Here, we performed dimensionality reduction using principal component analysis (PCA). In addition to 32 | this approach, we transformed the raw data using the surface-based morphometry analysis that 33 | the FreeSurfer software offers and worked with features from the 101 selected regions of interest. 34 | 35 | To compare our approaches, we assessed all methods using the same subjects in the training 36 | set and in the test set. These sets were defined using a resampling method, 10 times repeated 10-fold 37 | cross-validation (CV), which resulted in each model being evaluated 100 times. We chose this resampling method 38 | to obtain a more reliable estimate for each approach and to avoid results driven by chance (a lucky selection of the training set). 39 | 40 | Performance metrics were obtained from the test set (to avoid biased results, a problem known as double dipping), 41 | and we used the most common brain age prediction metrics from the literature: 42 | - Mean Absolute Error (MAE) 43 | - Root Mean Squared Error (RMSE) 44 | - R-squared 45 | - Correlation between prediction error and age (or 'age bias') 46 | 47 | Finally, we assessed whether these performance metrics differed significantly between approaches through 48 | statistical testing, using the corrected paired t-test for the hypothesis tests.
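
As an illustration of this test, the sketch below implements a corrected resampled paired t-test in the spirit of Nadeau and Bengio; it is a generic example (not the exact code in `comparison_statistical_analysis.py`) and assumes two arrays holding the per-fold scores of two models over the same 100 CV folds.

```python
import numpy as np
from scipy import stats


def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Corrected resampled paired t-test (Nadeau & Bengio, 2003).

    scores_a, scores_b: per-fold scores (e.g. MAE) of two models obtained
    on the same folds of the 10 times 10-fold CV (100 values each).
    """
    differences = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(differences)
    # The correction term (n_test / n_train) accounts for the overlap
    # between training sets across CV folds.
    corrected_var = (1.0 / n + n_test / n_train) * np.var(differences, ddof=1)
    t_stat = np.mean(differences) / np.sqrt(corrected_var)
    p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 1)
    return t_stat, p_value
```

For 10-fold CV the ratio `n_test / n_train` is 1/9, independent of the total sample size.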
49 | -------------------------------------------------------------------------------- /src/comparison/comparison_calculate_mean_predictions.py: -------------------------------------------------------------------------------- 1 | """Script to create csv file with mean predictions across model repetitions""" 2 | 3 | import pandas as pd 4 | from pathlib import Path 5 | 6 | PROJECT_ROOT = Path.cwd() 7 | 8 | 9 | def main(): 10 | experiment_name = 'biobank_scanner1' 11 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 12 | 13 | model_ls = ['SVM', 'RVM', 'GPR', 14 | 'voxel_SVM', 'voxel_RVM', 15 | 'pca_SVM', 'pca_RVM', 'pca_GPR'] 16 | 17 | # Create df with subject IDs and chronological age 18 | # All mean model predictions will be added to this df in the loop 19 | # Based on an age_predictions csv file from model training to have the 20 | # same order of subjects 21 | example_file = pd.read_csv(experiment_dir / 'SVM' / 'age_predictions.csv') 22 | age_predictions_all = pd.DataFrame(example_file.loc[:, 'image_id':'Age']) 23 | 24 | # Loop over all models, calculate mean predictions across repetitions 25 | for model_name in model_ls: 26 | model_dir = experiment_dir / model_name 27 | file_name = model_dir / 'age_predictions.csv' 28 | try: 29 | model_data = pd.read_csv(file_name) 30 | except FileNotFoundError: 31 | print(f'No age prediction file for {model_name}.') 32 | raise 33 | 34 | repetition_cols = model_data.loc[:, 35 | 'Prediction repetition 00' : 'Prediction repetition 09'] 36 | 37 | # get mean predictions across repetitions 38 | model_data['prediction_mean'] = repetition_cols.mean(axis=1) 39 | 40 | # get those into one file for all models 41 | age_predictions_all[model_name] = model_data['prediction_mean'] 42 | 43 | # Calculate brainAGE for all models and add to age_predictions_all df 44 | # brainAGE = predicted age - chronological age 45 | for model_name in model_ls: 46 | brainage_model = age_predictions_all[model_name] - \ 47 | age_predictions_all['Age'] 48 | brainage_col_name = model_name + '_brainAGE' 49 | age_predictions_all[brainage_col_name] = brainage_model 50 | 51 | # Export age_predictions_all as csv 52 | age_predictions_all.to_csv(experiment_dir / 'age_predictions_allmodels.csv') 53 | 54 | 55 | if __name__ == '__main__': 56 | main() -------------------------------------------------------------------------------- /src/comparison/comparison_fs_data_train_gp.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Gaussian Processes on FreeSurfer data. 3 | 4 | We trained the Gaussian Processes (GP) [1] in 10 repetitions of 5 | 10 stratified k-fold cross-validation (CV) (stratified by age). 6 | 7 | References 8 | ---------- 9 | [1] - Williams, Christopher KI, and Carl Edward Rasmussen. 10 | "Gaussian processes for regression." Advances in neural 11 | information processing systems. 1996. 
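
Examples
--------
A minimal, illustrative use of the regressor trained in this script (a
scikit-learn GaussianProcessRegressor with a linear DotProduct kernel);
the synthetic data below are for demonstration only::

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import DotProduct

    rng = np.random.RandomState(42)
    x = rng.normal(size=(50, 5))                  # 50 subjects, 5 features
    y = x @ np.array([1., 2., 0., 0., 3.]) + 60.  # synthetic "ages"

    model = GaussianProcessRegressor(kernel=DotProduct(), random_state=0)
    model.fit(x, y)
    print(model.predict(x[:3]))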
12 | """ 13 | import argparse 14 | import random 15 | import warnings 16 | from math import sqrt 17 | from pathlib import Path 18 | 19 | import numpy as np 20 | from joblib import dump 21 | from scipy import stats 22 | from sklearn.gaussian_process import GaussianProcessRegressor 23 | from sklearn.gaussian_process.kernels import DotProduct 24 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 25 | from sklearn.model_selection import StratifiedKFold 26 | from sklearn.preprocessing import RobustScaler 27 | 28 | from utils import COLUMNS_NAME, load_freesurfer_dataset 29 | 30 | PROJECT_ROOT = Path.cwd() 31 | 32 | warnings.filterwarnings('ignore') 33 | 34 | parser = argparse.ArgumentParser() 35 | 36 | parser.add_argument('-E', '--experiment_name', 37 | dest='experiment_name', 38 | help='Name of the experiment.') 39 | 40 | parser.add_argument('-S', '--scanner_name', 41 | dest='scanner_name', 42 | help='Name of the scanner.') 43 | 44 | parser.add_argument('-I', '--input_ids_file', 45 | dest='input_ids_file', 46 | default='homogenized_ids.csv', 47 | help='Filename indicating the ids to be used.') 48 | 49 | args = parser.parse_args() 50 | 51 | 52 | def main(experiment_name, scanner_name, input_ids_file): 53 | # ---------------------------------------------------------------------------------------- 54 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 55 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 56 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 57 | ids_path = experiment_dir / input_ids_file 58 | 59 | model_dir = experiment_dir / 'GPR' 60 | model_dir.mkdir(exist_ok=True) 61 | cv_dir = model_dir / 'cv' 62 | cv_dir.mkdir(exist_ok=True) 63 | 64 | dataset = load_freesurfer_dataset(participants_path, ids_path, freesurfer_path) 65 | 66 | # ---------------------------------------------------------------------------------------- 67 | # Initialise random seed 68 | np.random.seed(42) 69 | random.seed(42) 70 | 71 | # Normalise regional volumes by total intracranial volume (tiv) 72 | regions = dataset[COLUMNS_NAME].values 73 | 74 | tiv = dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 75 | 76 | regions_norm = np.true_divide(regions, tiv) 77 | age = dataset['Age'].values 78 | 79 | # CV variables 80 | cv_r = [] 81 | cv_r2 = [] 82 | cv_mae = [] 83 | cv_rmse = [] 84 | cv_age_error_corr = [] 85 | 86 | # Create DataFrame to hold actual and predicted ages 87 | age_predictions = dataset[['image_id', 'Age']] 88 | age_predictions = age_predictions.set_index('image_id') 89 | 90 | n_repetitions = 10 91 | n_folds = 10 92 | 93 | for i_repetition in range(n_repetitions): 94 | # Create new empty column in age_predictions df to save age predictions of this repetition 95 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 96 | age_predictions[repetition_column_name] = np.nan 97 | 98 | # Create 10-fold CV scheme stratified by age 99 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 100 | for i_fold, (train_index, test_index) in enumerate(skf.split(regions_norm, age)): 101 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 102 | 103 | x_train, x_test = regions_norm[train_index], regions_norm[test_index] 104 | y_train, y_test = age[train_index], age[test_index] 105 | 106 | # Scaling using inter-quartile range 107 | scaler = RobustScaler() 108 | x_train = scaler.fit_transform(x_train) 109 | x_test = 
scaler.transform(x_test) 110 | 111 | model = GaussianProcessRegressor(kernel=DotProduct(), random_state=0) 112 | 113 | model.fit(x_train, y_train) 114 | 115 | predictions = model.predict(x_test) 116 | 117 | mae = mean_absolute_error(y_test, predictions) 118 | rmse = sqrt(mean_squared_error(y_test, predictions)) 119 | r, _ = stats.pearsonr(y_test, predictions) 120 | r2 = r2_score(y_test, predictions) 121 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 122 | 123 | cv_r.append(r) 124 | cv_r2.append(r2) 125 | cv_mae.append(mae) 126 | cv_rmse.append(rmse) 127 | cv_age_error_corr.append(age_error_corr) 128 | 129 | # ---------------------------------------------------------------------------------------- 130 | # Save output files 131 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 132 | 133 | # Save scaler and model 134 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 135 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 136 | 137 | # Save model scores 138 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 139 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 140 | 141 | # ---------------------------------------------------------------------------------------- 142 | # Add predictions per test_index to age_predictions 143 | for row, value in zip(test_index, predictions): 144 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 145 | 146 | # Print results of the CV fold 147 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 148 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 149 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 150 | 151 | # Save predictions 152 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 153 | 154 | # Variables for mean scores of performance metrics of CV folds across all repetitions 155 | print('') 156 | print('Mean values:') 157 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 158 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 159 | 160 | 161 | if __name__ == '__main__': 162 | main(args.experiment_name, args.scanner_name, 163 | args.input_ids_file) 164 | -------------------------------------------------------------------------------- /src/comparison/comparison_fs_data_train_rvm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Relevant Vector Machines on FreeSurfer data. 3 | 4 | We trained the Relevant Vector Machines (RVMs) [1] in 10 repetitions of 5 | 10 stratified k-fold cross-validation (stratified by age). 6 | 7 | References 8 | ---------- 9 | [1] - Tipping, Michael E. "The relevance vector machine." 10 | Advances in neural information processing systems. 2000. 
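
Examples
--------
A minimal, illustrative use of the regressor trained in this script (EMRVR
from the sklearn-rvm package, with the same linear-kernel settings as used
below); the synthetic data are for demonstration only::

    import numpy as np
    from sklearn_rvm import EMRVR

    rng = np.random.RandomState(42)
    x = rng.normal(size=(50, 5))                  # 50 subjects, 5 features
    y = x @ np.array([1., 2., 0., 0., 3.]) + 60.  # synthetic "ages"

    model = EMRVR(kernel='linear', threshold_alpha=1e9)
    model.fit(x, y)
    print(model.predict(x[:3]))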
11 | """ 12 | import argparse 13 | import random 14 | import warnings 15 | from math import sqrt 16 | from pathlib import Path 17 | 18 | import numpy as np 19 | from joblib import dump 20 | from scipy import stats 21 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 22 | from sklearn.model_selection import StratifiedKFold 23 | from sklearn.preprocessing import RobustScaler 24 | from sklearn_rvm import EMRVR 25 | 26 | from utils import COLUMNS_NAME, load_freesurfer_dataset 27 | 28 | PROJECT_ROOT = Path.cwd() 29 | 30 | warnings.filterwarnings('ignore') 31 | 32 | parser = argparse.ArgumentParser() 33 | 34 | parser.add_argument('-E', '--experiment_name', 35 | dest='experiment_name', 36 | help='Name of the experiment.') 37 | 38 | parser.add_argument('-S', '--scanner_name', 39 | dest='scanner_name', 40 | help='Name of the scanner.') 41 | 42 | parser.add_argument('-I', '--input_ids_file', 43 | dest='input_ids_file', 44 | default='homogenized_ids.csv', 45 | help='Filename indicating the ids to be used.') 46 | 47 | args = parser.parse_args() 48 | 49 | 50 | def main(experiment_name, scanner_name, input_ids_file): 51 | # ---------------------------------------------------------------------------------------- 52 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 53 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 54 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 55 | ids_path = experiment_dir / input_ids_file 56 | 57 | model_dir = experiment_dir / 'RVM' 58 | model_dir.mkdir(exist_ok=True) 59 | cv_dir = model_dir / 'cv' 60 | cv_dir.mkdir(exist_ok=True) 61 | 62 | dataset = load_freesurfer_dataset(participants_path, ids_path, freesurfer_path) 63 | 64 | # ---------------------------------------------------------------------------------------- 65 | # Initialise random seed 66 | np.random.seed(42) 67 | random.seed(42) 68 | 69 | # Normalise regional volumes by total intracranial volume (tiv) 70 | regions = dataset[COLUMNS_NAME].values 71 | 72 | tiv = dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 73 | 74 | regions_norm = np.true_divide(regions, tiv) 75 | age = dataset['Age'].values 76 | 77 | # Cross validation variables 78 | cv_r = [] 79 | cv_r2 = [] 80 | cv_mae = [] 81 | cv_rmse = [] 82 | cv_age_error_corr = [] 83 | 84 | # Create DataFrame to hold actual and predicted ages 85 | age_predictions = dataset[['image_id', 'Age']] 86 | age_predictions = age_predictions.set_index('image_id') 87 | 88 | n_repetitions = 10 89 | n_folds = 10 90 | 91 | for i_repetition in range(n_repetitions): 92 | # Create new empty column in age_predictions df to save age predictions of this repetition 93 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 94 | age_predictions[repetition_column_name] = np.nan 95 | 96 | # Create 10-fold cross-validation scheme stratified by age 97 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 98 | for i_fold, (train_index, test_index) in enumerate(skf.split(regions_norm, age)): 99 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 100 | 101 | x_train, x_test = regions_norm[train_index], regions_norm[test_index] 102 | y_train, y_test = age[train_index], age[test_index] 103 | 104 | # Scaling using inter-quartile range 105 | scaler = RobustScaler() 106 | x_train = scaler.fit_transform(x_train) 107 | x_test = scaler.transform(x_test) 108 | 109 | model = EMRVR(kernel='linear', 
threshold_alpha=1e9) 110 | 111 | model.fit(x_train, y_train) 112 | 113 | predictions = model.predict(x_test) 114 | 115 | mae = mean_absolute_error(y_test, predictions) 116 | rmse = sqrt(mean_squared_error(y_test, predictions)) 117 | r, _ = stats.pearsonr(y_test, predictions) 118 | r2 = r2_score(y_test, predictions) 119 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 120 | 121 | cv_r.append(r) 122 | cv_r2.append(r2) 123 | cv_mae.append(mae) 124 | cv_rmse.append(rmse) 125 | cv_age_error_corr.append(age_error_corr) 126 | 127 | # ---------------------------------------------------------------------------------------- 128 | # Save output files 129 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 130 | 131 | # Save scaler and model 132 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 133 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 134 | 135 | # Save model scores 136 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 137 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 138 | 139 | # ---------------------------------------------------------------------------------------- 140 | # Add predictions per test_index to age_predictions 141 | for row, value in zip(test_index, predictions): 142 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 143 | 144 | # Print results of the CV fold 145 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 146 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 147 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 148 | 149 | # Save predictions 150 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 151 | 152 | # Variables for mean scores of performance metrics of CV folds across all repetitions 153 | print('') 154 | print('Mean values:') 155 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 156 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 157 | 158 | 159 | if __name__ == '__main__': 160 | main(args.experiment_name, args.scanner_name, 161 | args.input_ids_file) 162 | -------------------------------------------------------------------------------- /src/comparison/comparison_fs_data_train_svm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Support Vector Machines on FreeSurfer data. 3 | 4 | We trained the Support Vector Machines (SVMs) [1] in 10 repetitions of 5 | 10 stratified k-fold cross-validation (CV) (stratified by age). 6 | The hyperparameter tuning was performed in an automatic way using 7 | nested CV. 8 | 9 | References 10 | ---------- 11 | [1] - Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." 12 | Machine learning 20.3 (1995): 273-297. 
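
Examples
--------
A minimal, illustrative sketch of the nested hyperparameter search used in
this script (LinearSVR with the C grid explored via GridSearchCV); the
synthetic data are for demonstration only::

    import numpy as np
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import LinearSVR

    rng = np.random.RandomState(42)
    y = np.repeat([50, 55, 60, 65, 70], 20)     # synthetic integer "ages"
    x = y[:, None] + rng.normal(size=(100, 5))  # 5 noisy features per subject

    search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0,
                          2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]}
    nested_skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    gridsearch = GridSearchCV(LinearSVR(loss='epsilon_insensitive'),
                              param_grid=search_space,
                              scoring='neg_mean_absolute_error',
                              refit=True, cv=nested_skf)
    gridsearch.fit(x, y)
    print(gridsearch.best_params_)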
13 | """ 14 | import argparse 15 | import random 16 | import warnings 17 | from math import sqrt 18 | from pathlib import Path 19 | 20 | import numpy as np 21 | from joblib import dump 22 | from scipy import stats 23 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 24 | from sklearn.model_selection import GridSearchCV 25 | from sklearn.model_selection import StratifiedKFold 26 | from sklearn.preprocessing import RobustScaler 27 | from sklearn.svm import LinearSVR 28 | 29 | from utils import COLUMNS_NAME, load_freesurfer_dataset 30 | 31 | PROJECT_ROOT = Path.cwd() 32 | 33 | warnings.filterwarnings('ignore') 34 | 35 | parser = argparse.ArgumentParser() 36 | 37 | parser.add_argument('-E', '--experiment_name', 38 | dest='experiment_name', 39 | help='Name of the experiment.') 40 | 41 | parser.add_argument('-S', '--scanner_name', 42 | dest='scanner_name', 43 | help='Name of the scanner.') 44 | 45 | parser.add_argument('-I', '--input_ids_file', 46 | dest='input_ids_file', 47 | default='homogenized_ids.csv', 48 | help='Filename indicating the ids to be used.') 49 | 50 | args = parser.parse_args() 51 | 52 | 53 | def main(experiment_name, scanner_name, input_ids_file): 54 | # ---------------------------------------------------------------------------------------- 55 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 56 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 57 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 58 | ids_path = experiment_dir / input_ids_file 59 | 60 | model_dir = experiment_dir / 'SVM' 61 | model_dir.mkdir(exist_ok=True) 62 | cv_dir = model_dir / 'cv' 63 | cv_dir.mkdir(exist_ok=True) 64 | 65 | dataset = load_freesurfer_dataset(participants_path, ids_path, freesurfer_path) 66 | 67 | # ---------------------------------------------------------------------------------------- 68 | # Initialise random seed 69 | np.random.seed(42) 70 | random.seed(42) 71 | 72 | # Normalise regional volumes by total intracranial volume (tiv) 73 | regions = dataset[COLUMNS_NAME].values 74 | 75 | tiv = dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 76 | 77 | regions_norm = np.true_divide(regions, tiv) 78 | age = dataset['Age'].values 79 | 80 | # CV variables 81 | cv_r = [] 82 | cv_r2 = [] 83 | cv_mae = [] 84 | cv_rmse = [] 85 | cv_age_error_corr = [] 86 | 87 | # Create DataFrame to hold actual and predicted ages 88 | age_predictions = dataset[['image_id', 'Age']] 89 | age_predictions = age_predictions.set_index('image_id') 90 | 91 | n_repetitions = 10 92 | n_folds = 10 93 | n_nested_folds = 5 94 | 95 | for i_repetition in range(n_repetitions): 96 | # Create new empty column in age_predictions df to save age predictions of this repetition 97 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 98 | age_predictions[repetition_column_name] = np.nan 99 | 100 | # Create 10-fold CV scheme stratified by age 101 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 102 | for i_fold, (train_index, test_index) in enumerate(skf.split(regions_norm, age)): 103 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 104 | 105 | x_train, x_test = regions_norm[train_index], regions_norm[test_index] 106 | y_train, y_test = age[train_index], age[test_index] 107 | 108 | # Scaling using inter-quartile range 109 | scaler = RobustScaler() 110 | x_train = scaler.fit_transform(x_train) 111 | x_test = scaler.transform(x_test) 112 
| 113 | model_type = LinearSVR(loss='epsilon_insensitive') 114 | 115 | # Systematic search for best hyperparameters 116 | search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0, 2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]} 117 | nested_skf = StratifiedKFold(n_splits=n_nested_folds, shuffle=True, random_state=i_repetition) 118 | gridsearch = GridSearchCV(model_type, 119 | param_grid=search_space, 120 | scoring='neg_mean_absolute_error', 121 | refit=True, cv=nested_skf, 122 | verbose=3, n_jobs=1) 123 | 124 | gridsearch.fit(x_train, y_train) 125 | 126 | model = gridsearch.best_estimator_ 127 | 128 | params_results = {'means': gridsearch.cv_results_['mean_test_score'], 129 | 'params': gridsearch.cv_results_['params']} 130 | 131 | predictions = model.predict(x_test) 132 | 133 | mae = mean_absolute_error(y_test, predictions) 134 | rmse = sqrt(mean_squared_error(y_test, predictions)) 135 | r, _ = stats.pearsonr(y_test, predictions) 136 | r2 = r2_score(y_test, predictions) 137 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 138 | 139 | cv_r.append(r) 140 | cv_r2.append(r2) 141 | cv_mae.append(mae) 142 | cv_rmse.append(rmse) 143 | cv_age_error_corr.append(age_error_corr) 144 | 145 | # ---------------------------------------------------------------------------------------- 146 | # Save output files 147 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 148 | 149 | # Save scaler and model 150 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 151 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 152 | dump(params_results, cv_dir / f'{output_prefix}_params.joblib') 153 | 154 | # Save model scores 155 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 156 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 157 | 158 | # ---------------------------------------------------------------------------------------- 159 | # Add predictions per test_index to age_predictions 160 | for row, value in zip(test_index, predictions): 161 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 162 | 163 | # Print results of the CV fold 164 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 165 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 166 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 167 | 168 | # Save predictions 169 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 170 | 171 | # Variables for mean scores of performance metrics of CV folds across all repetitions 172 | print('') 173 | print('Mean values:') 174 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 175 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 176 | 177 | 178 | if __name__ == '__main__': 179 | main(args.experiment_name, args.scanner_name, 180 | args.input_ids_file) 181 | -------------------------------------------------------------------------------- /src/comparison/comparison_pca_data_train_gp.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Gaussian Processes on voxel-level data 3 | with reduced dimensionality through Principal Component Analysis (PCA). 4 | 5 | We trained the Gaussian Processes (GP) [1] in 10 repetitions of 6 | 10 stratified k-fold cross-validation (CV) (stratified by age). 7 | 8 | References 9 | ---------- 10 | [1] - Williams, Christopher KI, and Carl Edward Rasmussen. 11 | "Gaussian processes for regression." 
Advances in neural 12 | information processing systems. 1996. 13 | """ 14 | import argparse 15 | import random 16 | import warnings 17 | from math import sqrt 18 | from pathlib import Path 19 | 20 | import numpy as np 21 | from joblib import dump 22 | from scipy import stats 23 | from sklearn.gaussian_process import GaussianProcessRegressor 24 | from sklearn.gaussian_process.kernels import DotProduct 25 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 26 | from sklearn.model_selection import StratifiedKFold 27 | from sklearn.preprocessing import RobustScaler 28 | import pandas as pd 29 | from utils import load_demographic_data 30 | PROJECT_ROOT = Path.cwd() 31 | 32 | warnings.filterwarnings('ignore') 33 | 34 | parser = argparse.ArgumentParser() 35 | 36 | parser.add_argument('-E', '--experiment_name', 37 | dest='experiment_name', 38 | help='Name of the experiment.') 39 | 40 | parser.add_argument('-S', '--scanner_name', 41 | dest='scanner_name', 42 | help='Name of the scanner.') 43 | 44 | parser.add_argument('-I', '--input_ids_file', 45 | dest='input_ids_file', 46 | default='homogenized_ids.csv', 47 | help='Filename indicating the ids to be used.') 48 | 49 | args = parser.parse_args() 50 | 51 | def main(experiment_name, scanner_name, input_ids_file): 52 | # ---------------------------------------------------------------------------------------- 53 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 54 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 55 | pca_dir = PROJECT_ROOT / 'outputs' / 'pca' 56 | ids_path = experiment_dir / input_ids_file 57 | 58 | model_dir = experiment_dir / 'pca_GPR' 59 | model_dir.mkdir(exist_ok=True) 60 | cv_dir = model_dir / 'cv' 61 | cv_dir.mkdir(exist_ok=True) 62 | 63 | participants_df = load_demographic_data(participants_path, ids_path) 64 | 65 | # ---------------------------------------------------------------------------------------- 66 | # Initialise random seed 67 | np.random.seed(42) 68 | random.seed(42) 69 | 70 | age = participants_df['Age'].values 71 | 72 | # CV variables 73 | cv_r = [] 74 | cv_r2 = [] 75 | cv_mae = [] 76 | cv_rmse = [] 77 | cv_age_error_corr = [] 78 | 79 | # Create DataFrame to hold actual and predicted ages 80 | age_predictions = participants_df[['image_id', 'Age']] 81 | age_predictions = age_predictions.set_index('image_id') 82 | 83 | n_repetitions = 10 84 | n_folds = 10 85 | 86 | for i_repetition in range(n_repetitions): 87 | # Create new empty column in age_predictions df to save age predictions of this repetition 88 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 89 | age_predictions[repetition_column_name] = np.nan 90 | 91 | # Create 10-fold CV scheme stratified by age 92 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 93 | for i_fold, (train_index, test_index) in enumerate(skf.split(age, age)): 94 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 95 | 96 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 97 | pca_path = pca_dir / f'{output_prefix}_pca_components.csv' 98 | 99 | pca_df = pd.read_csv(pca_path) 100 | pca_df['image_id']=pca_df['image_id'].str.replace('/media/kcl_1/SSD2/BIOBANK/','') #TODO: put this in relation to project_root? 
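            # The PCA components CSV stores each subject's full image path
            # (e.g. '/media/kcl_1/SSD2/BIOBANK/<image_id>_Warped.nii.gz').
            # The line above strips the directory prefix and the line below
            # strips the '_Warped.nii.gz' suffix, leaving bare image IDs that
            # match participants_df for the merge a few lines further down.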
101 | pca_df['image_id']=pca_df['image_id'].str.replace('_Warped.nii.gz', '') 102 | 103 | dataset_df = pd.merge(pca_df, participants_df, on='image_id') 104 | x_values = dataset_df[dataset_df.columns.difference(participants_df.columns)].values 105 | 106 | x_train, x_test = x_values[train_index], x_values[test_index] 107 | y_train, y_test = age[train_index], age[test_index] 108 | 109 | # Scaling using inter-quartile range 110 | scaler = RobustScaler() 111 | x_train = scaler.fit_transform(x_train) 112 | x_test = scaler.transform(x_test) 113 | 114 | model = GaussianProcessRegressor(kernel=DotProduct(), random_state=0) 115 | 116 | model.fit(x_train, y_train) 117 | 118 | predictions = model.predict(x_test) 119 | 120 | mae = mean_absolute_error(y_test, predictions) 121 | rmse = sqrt(mean_squared_error(y_test, predictions)) 122 | r, _ = stats.pearsonr(y_test, predictions) 123 | r2 = r2_score(y_test, predictions) 124 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 125 | 126 | cv_r.append(r) 127 | cv_r2.append(r2) 128 | cv_mae.append(mae) 129 | cv_rmse.append(rmse) 130 | cv_age_error_corr.append(age_error_corr) 131 | 132 | # ---------------------------------------------------------------------------------------- 133 | # Save output files 134 | 135 | # Save scaler and model 136 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 137 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 138 | 139 | # Save model scores 140 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 141 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 142 | 143 | # ---------------------------------------------------------------------------------------- 144 | # Add predictions per test_index to age_predictions 145 | for row, value in zip(test_index, predictions): 146 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 147 | 148 | # Print results of the CV fold 149 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 150 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 151 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 152 | 153 | # Save predictions 154 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 155 | 156 | # Variables for mean scores of performance metrics of CV folds across all repetitions 157 | print('') 158 | print('Mean values:') 159 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 160 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 161 | 162 | 163 | if __name__ == '__main__': 164 | main(args.experiment_name, args.scanner_name, 165 | args.input_ids_file) 166 | -------------------------------------------------------------------------------- /src/comparison/comparison_pca_data_train_rvm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Relevant Vector Machines on voxel-level data 3 | with reduced dimensionality through Principal Component Analysis (PCA). 4 | 5 | 6 | We trained the Relevant Vector Machines (RVMs) [1] in 10 repetitions of 7 | 10 stratified k-fold cross-validation (CV) (stratified by age). 8 | 9 | References 10 | ---------- 11 | [1] - Tipping, Michael E. "The relevance vector machine." 12 | Advances in neural information processing systems. 2000. 
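
Notes
-----
The principal components used by this script are precomputed elsewhere in
the pipeline (see ``src/preprocessing``) and loaded here from per-fold CSV
files. As a generic illustration only, a PCA reduction of voxel data with
scikit-learn could look like::

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(42)
    voxels = rng.normal(size=(100, 5000))    # 100 subjects, 5000 "voxels"

    pca = PCA(n_components=50)               # keep the first 50 components
    components = pca.fit_transform(voxels)   # shape (100, 50)
    print(components.shape, pca.explained_variance_ratio_.sum())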
13 | """ 14 | import argparse 15 | import random 16 | import warnings 17 | from math import sqrt 18 | from pathlib import Path 19 | 20 | import numpy as np 21 | from joblib import dump 22 | from scipy import stats 23 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 24 | from sklearn.model_selection import StratifiedKFold 25 | from sklearn.preprocessing import RobustScaler 26 | from sklearn_rvm import EMRVR 27 | import pandas as pd 28 | from utils import load_demographic_data 29 | PROJECT_ROOT = Path.cwd() 30 | 31 | warnings.filterwarnings('ignore') 32 | 33 | parser = argparse.ArgumentParser() 34 | 35 | parser.add_argument('-E', '--experiment_name', 36 | dest='experiment_name', 37 | help='Name of the experiment.') 38 | 39 | parser.add_argument('-S', '--scanner_name', 40 | dest='scanner_name', 41 | help='Name of the scanner.') 42 | 43 | parser.add_argument('-I', '--input_ids_file', 44 | dest='input_ids_file', 45 | default='homogenized_ids.csv', 46 | help='Filename indicating the ids to be used.') 47 | 48 | args = parser.parse_args() 49 | 50 | def main(experiment_name, scanner_name, input_ids_file): 51 | # ---------------------------------------------------------------------------------------- 52 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 53 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 54 | pca_dir = PROJECT_ROOT / 'outputs' / 'pca' 55 | ids_path = experiment_dir / input_ids_file 56 | 57 | model_dir = experiment_dir / 'pca_RVM' 58 | model_dir.mkdir(exist_ok=True) 59 | cv_dir = model_dir / 'cv' 60 | cv_dir.mkdir(exist_ok=True) 61 | 62 | participants_df = load_demographic_data(participants_path, ids_path) 63 | 64 | # ---------------------------------------------------------------------------------------- 65 | # Initialise random seed 66 | np.random.seed(42) 67 | random.seed(42) 68 | 69 | age = participants_df['Age'].values 70 | 71 | # CV variables 72 | cv_r = [] 73 | cv_r2 = [] 74 | cv_mae = [] 75 | cv_rmse = [] 76 | cv_age_error_corr = [] 77 | 78 | # Create DataFrame to hold actual and predicted ages 79 | age_predictions = participants_df[['image_id', 'Age']] 80 | age_predictions = age_predictions.set_index('image_id') 81 | 82 | n_repetitions = 10 83 | n_folds = 10 84 | 85 | for i_repetition in range(n_repetitions): 86 | # Create new empty column in age_predictions df to save age predictions of this repetition 87 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 88 | age_predictions[repetition_column_name] = np.nan 89 | 90 | # Create 10-fold CV scheme stratified by age 91 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 92 | for i_fold, (train_index, test_index) in enumerate(skf.split(age, age)): 93 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 94 | 95 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 96 | pca_path = pca_dir / f'{output_prefix}_pca_components.csv' 97 | 98 | pca_df = pd.read_csv(pca_path) 99 | pca_df['image_id']=pca_df['image_id'].str.replace('/media/kcl_1/SSD2/BIOBANK/','') #TODO: fix path? 
100 | pca_df['image_id']=pca_df['image_id'].str.replace('_Warped.nii.gz', '') 101 | 102 | dataset_df = pd.merge(pca_df, participants_df, on='image_id') 103 | x_values = dataset_df[dataset_df.columns.difference(participants_df.columns)].values 104 | 105 | x_train, x_test = x_values[train_index], x_values[test_index] 106 | y_train, y_test = age[train_index], age[test_index] 107 | 108 | # Scaling using inter-quartile range 109 | scaler = RobustScaler() 110 | x_train = scaler.fit_transform(x_train) 111 | x_test = scaler.transform(x_test) 112 | 113 | model = EMRVR(kernel='linear', threshold_alpha=1e9) 114 | 115 | model.fit(x_train, y_train) 116 | 117 | predictions = model.predict(x_test) 118 | 119 | mae = mean_absolute_error(y_test, predictions) 120 | rmse = sqrt(mean_squared_error(y_test, predictions)) 121 | r, _ = stats.pearsonr(y_test, predictions) 122 | r2 = r2_score(y_test, predictions) 123 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 124 | 125 | cv_r.append(r) 126 | cv_r2.append(r2) 127 | cv_mae.append(mae) 128 | cv_rmse.append(rmse) 129 | cv_age_error_corr.append(age_error_corr) 130 | 131 | # ---------------------------------------------------------------------------------------- 132 | # Save output files 133 | 134 | # Save scaler and model 135 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 136 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 137 | 138 | # Save model scores 139 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 140 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 141 | 142 | # ---------------------------------------------------------------------------------------- 143 | # Add predictions per test_index to age_predictions 144 | for row, value in zip(test_index, predictions): 145 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 146 | 147 | # Print results of the CV fold 148 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 149 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 150 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 151 | 152 | # Save predictions 153 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 154 | 155 | # Variables for mean scores of performance metrics of CV folds across all repetitions 156 | print('') 157 | print('Mean values:') 158 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 159 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 160 | 161 | 162 | if __name__ == '__main__': 163 | main(args.experiment_name, args.scanner_name, 164 | args.input_ids_file) 165 | -------------------------------------------------------------------------------- /src/comparison/comparison_pca_data_train_svm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Support Vector Machines on voxel-level data 3 | with reduced dimensionality through Principal Component Analysis (PCA). 4 | 5 | We trained the Support Vector Machines (SVMs) [1] in 10 repetitions of 6 | 10 stratified k-fold cross-validation (CV) (stratified by age). 7 | The hyperparameter tuning was performed in an automatic way using 8 | nested CV. 9 | 10 | References 11 | ---------- 12 | [1] - Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." 13 | Machine learning 20.3 (1995): 273-297. 
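
Notes
-----
The grid-search results saved for each CV fold can be inspected after
training; for example (the path assumes the default experiment name
``biobank_scanner1`` and the first repetition/fold)::

    from joblib import load

    params = load('outputs/biobank_scanner1/pca_SVM/cv/00_00_params.joblib')
    # 'means' holds the mean test score per candidate C (negative MAE,
    # so higher is better); 'params' holds the corresponding C values.
    for c_value, mean_score in zip(params['params'], params['means']):
        print(c_value, mean_score)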
14 | """ 15 | import argparse 16 | import random 17 | import warnings 18 | from math import sqrt 19 | from pathlib import Path 20 | 21 | import numpy as np 22 | from joblib import dump 23 | from scipy import stats 24 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 25 | from sklearn.model_selection import GridSearchCV 26 | from sklearn.model_selection import StratifiedKFold 27 | from sklearn.preprocessing import RobustScaler 28 | from sklearn.svm import LinearSVR 29 | import pandas as pd 30 | from utils import load_demographic_data 31 | PROJECT_ROOT = Path.cwd() 32 | 33 | warnings.filterwarnings('ignore') 34 | 35 | parser = argparse.ArgumentParser() 36 | 37 | parser.add_argument('-E', '--experiment_name', 38 | dest='experiment_name', 39 | help='Name of the experiment.') 40 | 41 | parser.add_argument('-S', '--scanner_name', 42 | dest='scanner_name', 43 | help='Name of the scanner.') 44 | 45 | parser.add_argument('-I', '--input_ids_file', 46 | dest='input_ids_file', 47 | default='homogenized_ids.csv', 48 | help='Filename indicating the ids to be used.') 49 | 50 | args = parser.parse_args() 51 | 52 | def main(experiment_name, scanner_name, input_ids_file): 53 | # ---------------------------------------------------------------------------------------- 54 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 55 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 56 | pca_dir = PROJECT_ROOT / 'outputs' / 'pca' 57 | ids_path = experiment_dir / input_ids_file 58 | 59 | model_dir = experiment_dir / 'pca_SVM' 60 | model_dir.mkdir(exist_ok=True) 61 | cv_dir = model_dir / 'cv' 62 | cv_dir.mkdir(exist_ok=True) 63 | 64 | participants_df = load_demographic_data(participants_path, ids_path) 65 | 66 | # ---------------------------------------------------------------------------------------- 67 | # Initialise random seed 68 | np.random.seed(42) 69 | random.seed(42) 70 | 71 | age = participants_df['Age'].values 72 | 73 | # CV variables 74 | cv_r = [] 75 | cv_r2 = [] 76 | cv_mae = [] 77 | cv_rmse = [] 78 | cv_age_error_corr = [] 79 | 80 | # Create DataFrame to hold actual and predicted ages 81 | age_predictions = participants_df[['image_id', 'Age']] 82 | age_predictions = age_predictions.set_index('image_id') 83 | 84 | n_repetitions = 10 85 | n_folds = 10 86 | n_nested_folds = 5 87 | 88 | for i_repetition in range(n_repetitions): 89 | # Create new empty column in age_predictions df to save age predictions of this repetition 90 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 91 | age_predictions[repetition_column_name] = np.nan 92 | 93 | # Create 10-fold CV scheme stratified by age 94 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 95 | for i_fold, (train_index, test_index) in enumerate(skf.split(age, age)): 96 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 97 | 98 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 99 | pca_path = pca_dir / f'{output_prefix}_pca_components.csv' 100 | 101 | pca_df = pd.read_csv(pca_path) 102 | pca_df['image_id']=pca_df['image_id'].str.replace('/media/kcl_1/SSD2/BIOBANK/','') 103 | pca_df['image_id']=pca_df['image_id'].str.replace('_Warped.nii.gz', '') 104 | 105 | dataset_df = pd.merge(pca_df, participants_df, on='image_id') 106 | x_values = dataset_df[dataset_df.columns.difference(participants_df.columns)].values 107 | 108 | x_train, x_test = x_values[train_index], x_values[test_index] 109 | y_train, y_test = 
age[train_index], age[test_index] 110 | 111 | # Scaling using inter-quartile range 112 | scaler = RobustScaler() 113 | x_train = scaler.fit_transform(x_train) 114 | x_test = scaler.transform(x_test) 115 | 116 | model_type = LinearSVR(loss='epsilon_insensitive') 117 | 118 | # Systematic search for best hyperparameters 119 | search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0, 2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]} 120 | nested_skf = StratifiedKFold(n_splits=n_nested_folds, shuffle=True, random_state=i_repetition) 121 | gridsearch = GridSearchCV(model_type, 122 | param_grid=search_space, 123 | scoring='neg_mean_absolute_error', 124 | refit=True, cv=nested_skf, 125 | verbose=3, n_jobs=1) 126 | 127 | gridsearch.fit(x_train, y_train) 128 | 129 | model = gridsearch.best_estimator_ 130 | 131 | params_results = {'means': gridsearch.cv_results_['mean_test_score'], 132 | 'params': gridsearch.cv_results_['params']} 133 | 134 | predictions = model.predict(x_test) 135 | 136 | mae = mean_absolute_error(y_test, predictions) 137 | rmse = sqrt(mean_squared_error(y_test, predictions)) 138 | r, _ = stats.pearsonr(y_test, predictions) 139 | r2 = r2_score(y_test, predictions) 140 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 141 | 142 | cv_r.append(r) 143 | cv_r2.append(r2) 144 | cv_mae.append(mae) 145 | cv_rmse.append(rmse) 146 | cv_age_error_corr.append(age_error_corr) 147 | 148 | # ---------------------------------------------------------------------------------------- 149 | # Save output files 150 | 151 | # Save scaler and model 152 | dump(scaler, cv_dir / f'{output_prefix}_scaler.joblib') 153 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 154 | dump(params_results, cv_dir / f'{output_prefix}_params.joblib') 155 | 156 | # Save model scores 157 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 158 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 159 | 160 | # ---------------------------------------------------------------------------------------- 161 | # Add predictions per test_index to age_predictions 162 | for row, value in zip(test_index, predictions): 163 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 164 | 165 | # Print results of the CV fold 166 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 167 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 168 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 169 | 170 | # Save predictions 171 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 172 | 173 | # Variables for mean scores of performance metrics of CV folds across all repetitions 174 | print('') 175 | print('Mean values:') 176 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 177 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 178 | 179 | 180 | if __name__ == '__main__': 181 | main(args.experiment_name, args.scanner_name, 182 | args.input_ids_file) 183 | -------------------------------------------------------------------------------- /src/comparison/comparison_statistical_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to statistically assess the performance of machine learning models; 3 | Specifically, the script: 4 | i) creates a summary of performance scores of the 100 iterations of each model type 5 | ii) compares the performance of models using a version of 6 | the paired Student’s t-test that is corrected 
for the violation of the 7 | independence assumption from repeated k-fold cross-validation 8 | when training the model [1-3] 9 | 10 | References: 11 | ----------- 12 | [1] - https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/ 13 | 14 | [2] - Bouckaert, Remco R., and Eibe Frank. "Evaluating the replicability of significance tests for comparing 15 | learning algorithms." Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, 16 | Heidelberg, 2004. 17 | 18 | [3] - https://github.com/BayesianTestsML/tutorial/blob/9fb0bf75b4435d61d42935be4d0bfafcc43e77b9/Python/bayesiantests.py 19 | """ 20 | import argparse 21 | from pathlib import Path 22 | import itertools 23 | 24 | import numpy as np 25 | import pandas as pd 26 | 27 | from utils import ttest_ind_corrected 28 | 29 | PROJECT_ROOT = Path.cwd() 30 | 31 | parser = argparse.ArgumentParser() 32 | 33 | parser.add_argument('-E', '--experiment_name', 34 | dest='experiment_name', 35 | help='Experiment name where the model predictions are stored.') 36 | 37 | parser.add_argument('-S', '--suffix', 38 | dest='suffix', 39 | help='Suffix to add on the output file regressors_comparison_suffix.csv.') 40 | 41 | parser.add_argument('-M', '--model_list', 42 | dest='model_list', 43 | nargs='+', 44 | help='Names of models to analyse.') 45 | 46 | args = parser.parse_args() 47 | 48 | 49 | def main(experiment_name, suffix, model_list): 50 | # Create summary of the performance scores across the 100 iterations 51 | # of each model type (10 times 10-fold CV) 52 | n_repetitions = 10 53 | n_folds = 10 54 | 55 | for model_name in model_list: 56 | model_dir = PROJECT_ROOT / 'outputs' / experiment_name / model_name 57 | cv_dir = model_dir / 'cv' 58 | 59 | r_list = [] 60 | r2_list = [] 61 | mae_list = [] 62 | rmse_list = [] 63 | age_error_corr_list = [] 64 | 65 | for i_repetition in range(n_repetitions): 66 | for i_fold in range(n_folds): 67 | r, r2, mae, rmse, age_error_corr = np.load(cv_dir / f'{i_repetition:02d}_{i_fold:02d}_scores.npy') 68 | r_list.append(r) 69 | r2_list.append(r2) 70 | mae_list.append(mae) 71 | rmse_list.append(rmse) 72 | age_error_corr_list.append(age_error_corr) 73 | 74 | results = pd.DataFrame(columns=['Measure', 'Value']) 75 | results = results.append({'Measure': 'mean_r', 'Value': np.mean(r_list)}, ignore_index=True) 76 | results = results.append({'Measure': 'std_r', 'Value': np.std(r_list)}, ignore_index=True) 77 | results = results.append({'Measure': 'mean_r2', 'Value': np.mean(r2_list)}, ignore_index=True) 78 | results = results.append({'Measure': 'std_r2', 'Value': np.std(r2_list)}, ignore_index=True) 79 | results = results.append({'Measure': 'mean_mae', 'Value': np.mean(mae_list)}, ignore_index=True) 80 | results = results.append({'Measure': 'std_mae', 'Value': np.std(mae_list)}, ignore_index=True) 81 | results = results.append({'Measure': 'mean_rmse', 'Value': np.mean(rmse_list)}, ignore_index=True) 82 | results = results.append({'Measure': 'std_rmse', 'Value': np.std(rmse_list)}, ignore_index=True) 83 | results = results.append({'Measure': 'mean_age_error_corr', 'Value': np.mean(age_error_corr_list)}, 84 | ignore_index=True) 85 | results = results.append({'Measure': 'std_age_error_corr', 'Value': np.std(age_error_corr_list)}, 86 | ignore_index=True) 87 | 88 | results.to_csv(model_dir / f'{model_name}_scores_summary.csv', index=False) 89 | 90 | # Perform the statistical comparison of the summary performance metrics from different models 91 | combinations = 
list(itertools.combinations(model_list, 2)) 92 | 93 | # Create new significance threshold based on Bonferroni correction for multiple comparisons 94 | corrected_alpha = 0.05 / len(combinations) 95 | 96 | results_df = pd.DataFrame(columns=['regressors', 'p-value', 'stats']) 97 | 98 | for classifier_a, classifier_b in combinations: 99 | classifier_a_dir = PROJECT_ROOT / 'outputs' / experiment_name / classifier_a 100 | classifier_b_dir = PROJECT_ROOT / 'outputs' / experiment_name / classifier_b 101 | 102 | mae_a = [] 103 | mae_b = [] 104 | 105 | for i_repetition in range(n_repetitions): 106 | for i_fold in range(n_folds): 107 | performance_a = np.load(classifier_a_dir / 'cv' / f'{i_repetition:02d}_{i_fold:02d}_scores.npy')[1] 108 | performance_b = np.load(classifier_b_dir / 'cv' / f'{i_repetition:02d}_{i_fold:02d}_scores.npy')[1] 109 | 110 | mae_a.append(performance_a) 111 | mae_b.append(performance_b) 112 | 113 | statistic, pvalue = ttest_ind_corrected(np.asarray(mae_a), np.asarray(mae_b), k=n_folds, r=n_repetitions) 114 | 115 | print(f'{classifier_a} vs. {classifier_b} pvalue: {pvalue:6.3}', end='') 116 | if pvalue <= corrected_alpha: 117 | print('*') 118 | else: 119 | print('') 120 | 121 | results_df = results_df.append({'regressors': f'{classifier_a} vs. {classifier_b}', 122 | 'p-value': pvalue, 123 | 'stats': statistic}, 124 | ignore_index=True) 125 | 126 | results_df.to_csv(PROJECT_ROOT / 'outputs' / experiment_name / f'regressors_comparison{suffix}.csv', 127 | index=False) 128 | 129 | 130 | if __name__ == '__main__': 131 | main(args.experiment_name, args.suffix, args.model_list) 132 | -------------------------------------------------------------------------------- /src/comparison/comparison_voxel_data_rvm_relevance_vectors_weights.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to calculate the relevance vectors of the Relevance Vector Machine 3 | (RVM) approach for voxel-level data. 
4 | 5 | """ 6 | import argparse 7 | import random 8 | from pathlib import Path 9 | 10 | import nibabel as nib 11 | import numpy as np 12 | import pandas as pd 13 | from joblib import load 14 | from nilearn.masking import apply_mask 15 | from tqdm import tqdm 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | parser = argparse.ArgumentParser() 20 | 21 | parser.add_argument('-E', '--experiment_name', 22 | dest='experiment_name', 23 | help='Name of the experiment.') 24 | 25 | parser.add_argument('-P', '--input_path', 26 | dest='input_path_str', 27 | help='Path to the local folder with preprocessed images.') 28 | 29 | parser.add_argument('-I', '--input_ids_file', 30 | dest='input_ids_file', 31 | default='homogenized_ids.csv', 32 | help='File name indicating the subject IDs to be used.') 33 | 34 | parser.add_argument('-D', '--input_data_type', 35 | dest='input_data_type', 36 | default='.nii.gz', 37 | help='Input data type') 38 | 39 | parser.add_argument('-M', '--mask_filename', 40 | dest='mask_filename', 41 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 42 | help='File name of brain mask') 43 | 44 | args = parser.parse_args() 45 | 46 | 47 | def main(experiment_name, input_path_str, input_ids_file, input_data_type, mask_filename): 48 | # ---------------------------------------------------------------------------------------- 49 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 50 | dataset_path = Path(input_path_str) 51 | 52 | model_dir = experiment_dir / 'voxel_RVM' 53 | cv_dir = model_dir / 'cv' 54 | 55 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 56 | ids_df = pd.read_csv(ids_path) 57 | 58 | # Load the mask image 59 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 60 | mask_img = nib.load(str(brain_mask)) 61 | 62 | # Initialise random seed 63 | np.random.seed(42) 64 | random.seed(42) 65 | 66 | n_repetitions = 10 67 | n_folds = 10 68 | coef_list = [] 69 | index_list = [] 70 | 71 | for i_repetition in range(n_repetitions): 72 | for i_fold in range(n_folds): 73 | # Load model 74 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 75 | model = load(cv_dir / f'{prefix}_regressor.joblib') 76 | 77 | # Load train index 78 | train_index = np.load(cv_dir / f'{prefix}_train_index.npy') 79 | 80 | coef_list.append(model.mu_[1:]) 81 | index_list.append(train_index[model.relevance_]) 82 | 83 | # Get the number of voxels in the mask 84 | mask_data = mask_img.get_fdata() 85 | n_voxels = sum(sum(sum(mask_data > 0))) 86 | n_models = 100 87 | weights = np.zeros((n_models, n_voxels)) 88 | 89 | relevance_vector_dict = dict((el, []) for el in range(100)) 90 | 91 | for i, subject_id in enumerate(tqdm(ids_df['image_id'])): 92 | # Check if subject is support vector in any model before loading the image 93 | is_support_vector = False 94 | for support_index in index_list: 95 | if i in support_index: 96 | is_support_vector = True 97 | break 98 | 99 | if is_support_vector == False: 100 | continue 101 | 102 | subject_path = dataset_path / f'{subject_id}_Warped{input_data_type}' 103 | 104 | try: 105 | img = nib.load(str(subject_path)) 106 | except FileNotFoundError: 107 | print(f'No image file {subject_path}.') 108 | raise 109 | 110 | # Extract only the brain voxels. This will create a 1D array. 
111 | img = apply_mask(img, mask_img) 112 | img = np.asarray(img, dtype='float64') 113 | img = np.nan_to_num(img) 114 | 115 | for j, (dual_coef, support_index) in enumerate(zip(coef_list, index_list)): 116 | if i in support_index: 117 | selected_dual_coef = dual_coef[np.argwhere(support_index == i)] 118 | weights[j, :] = weights[j, :] + selected_dual_coef * img 119 | 120 | relevance_vector_dict[j].append(img.astype('float16')) 121 | 122 | coords = np.argwhere(mask_data > 0) 123 | i = 0 124 | for i_repetition in range(n_repetitions): 125 | for i_fold in range(n_folds): 126 | importance_map = np.zeros_like(mask_data) 127 | for xyz, importance in zip(coords, weights[i, :]): 128 | importance_map[tuple(xyz)] = importance 129 | 130 | importance_map_nifti = nib.Nifti1Image(importance_map, mask_img.affine) 131 | nib.save(importance_map_nifti, str(cv_dir / f'{i_repetition:02d}_{i_fold:02d}_importance.nii.gz')) 132 | i = i + 1 133 | 134 | i = 0 135 | for i_repetition in range(n_repetitions): 136 | for i_fold in range(n_folds): 137 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 138 | np.savez_compressed(cv_dir / f'{prefix}_relevance_vectors.npz', 139 | relevance_vectors_=np.array(relevance_vector_dict[i])) 140 | i = i + 1 141 | 142 | 143 | if __name__ == '__main__': 144 | main(args.experiment_name, args.input_path_str, 145 | args.input_ids_file, 146 | args.input_data_type, args.mask_filename) 147 | -------------------------------------------------------------------------------- /src/comparison/comparison_voxel_data_svm_primal_weights.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ Script to calculate the primal weights of the Support Vector Machine 3 | (SVM) approach for voxel-level data. 4 | 5 | """ 6 | import argparse 7 | import random 8 | from pathlib import Path 9 | 10 | import nibabel as nib 11 | import numpy as np 12 | import pandas as pd 13 | from joblib import load 14 | from nilearn.masking import apply_mask 15 | from tqdm import tqdm 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | parser = argparse.ArgumentParser() 20 | 21 | parser.add_argument('-E', '--experiment_name', 22 | dest='experiment_name', 23 | help='Name of the experiment.') 24 | 25 | parser.add_argument('-P', '--input_path', 26 | dest='input_path_str', 27 | help='Path to the local folder with preprocessed images.') 28 | 29 | parser.add_argument('-I', '--input_ids_file', 30 | dest='input_ids_file', 31 | default='homogenized_ids.csv', 32 | help='File name indicating the subject IDs to be used.') 33 | 34 | parser.add_argument('-D', '--input_data_type', 35 | dest='input_data_type', 36 | default='.nii.gz', 37 | help='Input data type') 38 | 39 | parser.add_argument('-M', '--mask_filename', 40 | dest='mask_filename', 41 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 42 | help='Input data type') 43 | 44 | args = parser.parse_args() 45 | 46 | 47 | def main(experiment_name, input_path_str, input_ids_file, input_data_type, mask_filename): 48 | # ---------------------------------------------------------------------------------------- 49 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 50 | dataset_path = Path(input_path_str) 51 | 52 | model_dir = experiment_dir / 'voxel_SVM' 53 | cv_dir = model_dir / 'cv' 54 | 55 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 56 | ids_df = pd.read_csv(ids_path) 57 | 58 | # Load the mask image 59 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 60 | mask_img = 
nib.load(str(brain_mask)) 61 | 62 | # Initialise random seed 63 | np.random.seed(42) 64 | random.seed(42) 65 | 66 | n_repetitions = 10 67 | n_folds = 10 68 | coef_list = [] 69 | index_list = [] 70 | 71 | for i_repetition in range(n_repetitions): 72 | for i_fold in range(n_folds): 73 | # Load model 74 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 75 | model = load(cv_dir / f'{prefix}_regressor.joblib') 76 | 77 | # Load train index 78 | train_index = np.load(cv_dir / f'{prefix}_train_index.npy') 79 | 80 | coef_list.append(model.dual_coef_[0]) 81 | index_list.append(train_index[model.support_]) 82 | 83 | # Get the number of voxels in the mask 84 | mask_data = mask_img.get_fdata() 85 | n_voxels = sum(sum(sum(mask_data > 0))) 86 | n_models = 100 87 | weights = np.zeros((n_models, n_voxels)) 88 | 89 | for i, subject_id in enumerate(tqdm(ids_df['image_id'])): 90 | # Check if subject is a support vector in any model before loading the image. 91 | is_support_vector = False 92 | for support_index in index_list: 93 | if i in support_index: 94 | is_support_vector = True 95 | break 96 | 97 | if not is_support_vector: 98 | continue 99 | 100 | subject_path = dataset_path / f'{subject_id}_Warped{input_data_type}' 101 | 102 | try: 103 | img = nib.load(str(subject_path)) 104 | except FileNotFoundError: 105 | print(f'No image file {subject_path}.') 106 | raise 107 | 108 | # Extract only the brain voxels. This will create a 1D array. 109 | img = apply_mask(img, mask_img) 110 | img = np.asarray(img, dtype='float64') 111 | img = np.nan_to_num(img) 112 | 113 | for j, (dual_coef, support_index) in enumerate(zip(coef_list, index_list)): 114 | if i in support_index: 115 | selected_dual_coef = dual_coef[np.argwhere(support_index == i)] 116 | weights[j, :] = weights[j, :] + selected_dual_coef * img 117 | 118 | coords = np.argwhere(mask_data > 0) 119 | i = 0 120 | for i_repetition in range(n_repetitions): 121 | for i_fold in range(n_folds): 122 | importance_map = np.zeros_like(mask_data) 123 | for xyz, importance in zip(coords, weights[i, :]): 124 | importance_map[tuple(xyz)] = importance 125 | 126 | importance_map_nifti = nib.Nifti1Image(importance_map, mask_img.affine) 127 | nib.save(importance_map_nifti, str(cv_dir / f'{i_repetition:02d}_{i_fold:02d}_importance.nii.gz')) 128 | i = i + 1 129 | 130 | 131 | if __name__ == '__main__': 132 | main(args.experiment_name, args.input_path_str, 133 | args.input_ids_file, 134 | args.input_data_type, args.mask_filename) 135 | -------------------------------------------------------------------------------- /src/comparison/comparison_voxel_data_train_rvm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Relevance Vector Machines on voxel-level data. 3 | 4 | We trained the Relevance Vector Machines (RVMs) [1] in 10 repetitions of 5 | 10 stratified k-fold cross-validation (CV) (stratified by age). 6 | 7 | This script assumes that a kernel has been already pre-computed. 8 | To compute the kernel use the script `src/preprocessing/compute_kernel_matrix.py` 9 | 10 | References 11 | ---------- 12 | [1] - Tipping, Michael E. "The relevance vector machine." 13 | Advances in neural information processing systems. 2000.
14 | """ 15 | import argparse 16 | import random 17 | import warnings 18 | from math import sqrt 19 | from pathlib import Path 20 | 21 | import numpy as np 22 | import pandas as pd 23 | from joblib import dump 24 | from scipy import stats 25 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 26 | from sklearn.model_selection import StratifiedKFold 27 | from sklearn_rvm import EMRVR 28 | 29 | from utils import load_demographic_data 30 | 31 | PROJECT_ROOT = Path.cwd() 32 | 33 | warnings.filterwarnings('ignore') 34 | 35 | parser = argparse.ArgumentParser() 36 | 37 | parser.add_argument('-E', '--experiment_name', 38 | dest='experiment_name', 39 | help='Name of the experiment.') 40 | 41 | parser.add_argument('-S', '--scanner_name', 42 | dest='scanner_name', 43 | help='Name of the scanner.') 44 | 45 | parser.add_argument('-I', '--input_ids_file', 46 | dest='input_ids_file', 47 | default='homogenized_ids.csv', 48 | help='File name indicating the subject IDs to be used.') 49 | 50 | args = parser.parse_args() 51 | 52 | 53 | def main(experiment_name, scanner_name, input_ids_file): 54 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 55 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 56 | ids_path = experiment_dir / input_ids_file 57 | 58 | model_dir = experiment_dir / 'voxel_RVM' 59 | model_dir.mkdir(exist_ok=True) 60 | cv_dir = model_dir / 'cv' 61 | cv_dir.mkdir(exist_ok=True) 62 | 63 | # Load demographics 64 | demographics = load_demographic_data(participants_path, ids_path) 65 | 66 | # Load the Gram matrix 67 | kernel_path = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel.csv' 68 | kernel = pd.read_csv(kernel_path, header=0, index_col=0) 69 | 70 | # ---------------------------------------------------------------------------------------- 71 | # Initialise random seed 72 | np.random.seed(42) 73 | random.seed(42) 74 | 75 | age = demographics['Age'].values 76 | 77 | # CV variables 78 | cv_r = [] 79 | cv_r2 = [] 80 | cv_mae = [] 81 | cv_rmse = [] 82 | cv_age_error_corr = [] 83 | 84 | # Create DataFrame to hold actual and predicted ages 85 | age_predictions = demographics[['image_id', 'Age']] 86 | age_predictions = age_predictions.set_index('image_id') 87 | 88 | n_repetitions = 10 89 | n_folds = 10 90 | 91 | for i_repetition in range(n_repetitions): 92 | # Create new empty column in age_predictions df to save age predictions of this repetition 93 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 94 | age_predictions[repetition_column_name] = np.nan 95 | 96 | # Create 10-fold CV scheme stratified by age 97 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 98 | for i_fold, (train_index, test_index) in enumerate(skf.split(kernel, age)): 99 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 100 | 101 | x_train = kernel.iloc[train_index, train_index].values 102 | x_test = kernel.iloc[test_index, train_index].values 103 | y_train, y_test = age[train_index], age[test_index] 104 | 105 | model = EMRVR(kernel='precomputed', 106 | alpha_max=1e11, threshold_alpha=1e10) 107 | 108 | model.fit(x_train, y_train) 109 | 110 | predictions = model.predict(x_test) 111 | 112 | mae = mean_absolute_error(y_test, predictions) 113 | rmse = sqrt(mean_squared_error(y_test, predictions)) 114 | r, _ = stats.pearsonr(y_test, predictions) 115 | r2 = r2_score(y_test, predictions) 116 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 117 | 118 | cv_r.append(r) 
119 | cv_r2.append(r2) 120 | cv_mae.append(mae) 121 | cv_rmse.append(rmse) 122 | cv_age_error_corr.append(age_error_corr) 123 | 124 | # ---------------------------------------------------------------------------------------- 125 | # Save output files 126 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 127 | 128 | # Save model 129 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 130 | 131 | # Save model scores 132 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 133 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 134 | 135 | # Save train index 136 | np.save(cv_dir / f'{output_prefix}_train_index.npy', train_index) 137 | 138 | # ---------------------------------------------------------------------------------------- 139 | # Add predictions per test_index to age_predictions 140 | for row, value in zip(test_index, predictions): 141 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 142 | 143 | # Print results of the CV fold 144 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 145 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 146 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 147 | 148 | # Save predictions 149 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 150 | 151 | # Variables for mean scores of performance metrics of CV folds across all repetitions 152 | print('') 153 | print('Mean values:') 154 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 155 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 156 | 157 | 158 | if __name__ == '__main__': 159 | main(args.experiment_name, args.scanner_name, 160 | args.input_ids_file) 161 | -------------------------------------------------------------------------------- /src/comparison/comparison_voxel_data_train_svm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to train Support Vector Machines on voxel-level data. 3 | 4 | We trained the Support Vector Machines (SVMs) [1] in 10 repetitions of 5 | 10 stratified k-fold cross-validation (CV) (stratified by age). 6 | The hyperparameter tuning was performed in an automatic way using 7 | a nested CV. 8 | 9 | This script assumes that a kernel has been already pre-computed. 10 | To compute the kernel use the script `src/preprocessing/compute_kernel_matrix.py` 11 | 12 | References 13 | ---------- 14 | [1] - Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." 15 | Machine learning 20.3 (1995): 273-297. 
16 | """ 17 | import argparse 18 | import random 19 | import warnings 20 | from math import sqrt 21 | from pathlib import Path 22 | 23 | import numpy as np 24 | import pandas as pd 25 | from joblib import dump 26 | from scipy import stats 27 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 28 | from sklearn.model_selection import GridSearchCV 29 | from sklearn.model_selection import StratifiedKFold 30 | from sklearn.svm import SVR 31 | 32 | from utils import load_demographic_data 33 | 34 | PROJECT_ROOT = Path.cwd() 35 | 36 | warnings.filterwarnings('ignore') 37 | 38 | parser = argparse.ArgumentParser() 39 | 40 | parser.add_argument('-E', '--experiment_name', 41 | dest='experiment_name', 42 | help='Name of the experiment.') 43 | 44 | parser.add_argument('-S', '--scanner_name', 45 | dest='scanner_name', 46 | help='Name of the scanner.') 47 | 48 | parser.add_argument('-I', '--input_ids_file', 49 | dest='input_ids_file', 50 | default='homogenized_ids.csv', 51 | help='Filename indicating the ids to be used.') 52 | 53 | args = parser.parse_args() 54 | 55 | 56 | def main(experiment_name, scanner_name, input_ids_file): 57 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 58 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 59 | ids_path = experiment_dir / input_ids_file 60 | 61 | model_dir = experiment_dir / 'voxel_SVM' 62 | model_dir.mkdir(exist_ok=True) 63 | cv_dir = model_dir / 'cv' 64 | cv_dir.mkdir(exist_ok=True) 65 | 66 | # Load demographics 67 | demographics = load_demographic_data(participants_path, ids_path) 68 | 69 | # Load the Gram matrix 70 | kernel_path = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel.csv' 71 | kernel = pd.read_csv(kernel_path, header=0, index_col=0) 72 | 73 | # ---------------------------------------------------------------------------------------- 74 | # Initialise random seed 75 | np.random.seed(42) 76 | random.seed(42) 77 | 78 | age = demographics['Age'].values 79 | 80 | # CV variables 81 | cv_r = [] 82 | cv_r2 = [] 83 | cv_mae = [] 84 | cv_rmse = [] 85 | cv_age_error_corr = [] 86 | 87 | # Create DataFrame to hold actual and predicted ages 88 | age_predictions = demographics[['image_id', 'Age']] 89 | age_predictions = age_predictions.set_index('image_id') 90 | 91 | n_repetitions = 10 92 | n_folds = 10 93 | n_nested_folds = 5 94 | 95 | for i_repetition in range(n_repetitions): 96 | # Create new empty column in age_predictions df to save age predictions of this repetition 97 | repetition_column_name = f'Prediction repetition {i_repetition:02d}' 98 | age_predictions[repetition_column_name] = np.nan 99 | 100 | # Create 10-fold CV scheme stratified by age 101 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 102 | for i_fold, (train_index, test_index) in enumerate(skf.split(age, age)): 103 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 104 | 105 | x_train = kernel.iloc[train_index, train_index].values 106 | x_test = kernel.iloc[test_index, train_index].values 107 | y_train, y_test = age[train_index], age[test_index] 108 | 109 | model_type = SVR(kernel='precomputed') 110 | 111 | # Systematic search for best hyperparameters 112 | search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0, 2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]} 113 | nested_skf = StratifiedKFold(n_splits=n_nested_folds, shuffle=True, 114 | random_state=i_repetition) 115 | gridsearch = GridSearchCV(model_type, 116 | param_grid=search_space, 117 | 
scoring='neg_mean_absolute_error', 118 | refit=True, cv=nested_skf, 119 | verbose=3, n_jobs=1) 120 | 121 | gridsearch.fit(x_train, y_train) 122 | 123 | model = gridsearch.best_estimator_ 124 | 125 | params_results = {'means': gridsearch.cv_results_['mean_test_score'], 126 | 'params': gridsearch.cv_results_['params']} 127 | 128 | predictions = model.predict(x_test) 129 | 130 | mae = mean_absolute_error(y_test, predictions) 131 | rmse = sqrt(mean_squared_error(y_test, predictions)) 132 | r, _ = stats.pearsonr(y_test, predictions) 133 | r2 = r2_score(y_test, predictions) 134 | age_error_corr, _ = stats.spearmanr((predictions - y_test), y_test) 135 | 136 | cv_r.append(r) 137 | cv_r2.append(r2) 138 | cv_mae.append(mae) 139 | cv_rmse.append(rmse) 140 | cv_age_error_corr.append(age_error_corr) 141 | 142 | # ---------------------------------------------------------------------------------------- 143 | # Save output files 144 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 145 | 146 | # Save scaler and model 147 | dump(model, cv_dir / f'{output_prefix}_regressor.joblib') 148 | dump(params_results, cv_dir / f'{output_prefix}_params.joblib') 149 | 150 | # Save model scores 151 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 152 | np.save(cv_dir / f'{output_prefix}_scores.npy', scores_array) 153 | 154 | # Save train index 155 | np.save(cv_dir / f'{output_prefix}_train_index.npy', train_index) 156 | 157 | # ---------------------------------------------------------------------------------------- 158 | # Add predictions per test_index to age_predictions 159 | for row, value in zip(test_index, predictions): 160 | age_predictions.iloc[row, age_predictions.columns.get_loc(repetition_column_name)] = value 161 | 162 | # Print results of the CV fold 163 | print(f'Repetition {i_repetition:02d} Fold {i_fold:02d} ' 164 | f'r: {r:0.3f}, R2: {r2:0.3f}, ' 165 | f'MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 166 | 167 | # Save predictions 168 | age_predictions.to_csv(model_dir / 'age_predictions.csv') 169 | 170 | # Variables for mean scores of performance metrics of CV folds across all repetitions 171 | print('') 172 | print('Mean values:') 173 | print(f'r: {np.mean(cv_r):0.3f} R2: {np.mean(cv_r2):0.3f} MAE: {np.mean(cv_mae):0.3f} ' 174 | f'RMSE: {np.mean(cv_rmse):0.3f} CORR: {np.mean(cv_age_error_corr):0.3f}') 175 | 176 | 177 | if __name__ == '__main__': 178 | main(args.experiment_name, args.scanner_name, 179 | args.input_ids_file) 180 | -------------------------------------------------------------------------------- /src/download/README.md: -------------------------------------------------------------------------------- 1 | # Download 2 | 3 | The scripts in this folder were used to download the data 4 | (i.e., `participants.tsv`, `freesurferData.csv`, `mriqc_prob.csv`, 5 | `qoala_prob.csv`, imaging_preprocessing_ANTs files) from the Network 6 | Attached Storage System. The scripts are included in this repo to illustrate the 7 | structure in which the files were stored. 8 | -------------------------------------------------------------------------------- /src/download/download_ants_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script used to download the ANTs data from the storage server. 3 | 4 | Script to download all the UK BIOBANK files preprocessed using the 5 | scripts available at the imaging_preprocessing_ANTs folder. 
6 | 7 | NOTE: Only for internal use at the Machine Learning in Mental Health Lab. 8 | """ 9 | import argparse 10 | from pathlib import Path 11 | from shutil import copyfile 12 | 13 | PROJECT_ROOT = Path.cwd() 14 | 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('-N', '--nas_path', 18 | dest='nas_path_str', 19 | help='Path to the Network Attached Storage system.') 20 | 21 | parser.add_argument('-S', '--scanner_name', 22 | dest='scanner_name', 23 | help='Name of the scanner.') 24 | 25 | parser.add_argument('-O', '--output_path', 26 | dest='output_path_str', 27 | help='Path to the local output folder.') 28 | 29 | args = parser.parse_args() 30 | 31 | 32 | def main(nas_path_str, scanner_name, output_path_str): 33 | """Perform download of selected datasets from the network-attached storage.""" 34 | nas_path = Path(nas_path_str) 35 | output_path = Path(output_path_str) 36 | 37 | dataset_name = 'BIOBANK' 38 | 39 | dataset_output_path = output_path / dataset_name 40 | dataset_output_path.mkdir(exist_ok=True) 41 | 42 | selected_path = nas_path / 'ANTS_NonLinear_preprocessed' / dataset_name / scanner_name 43 | 44 | for file_path in selected_path.glob('*.nii.gz'): 45 | print(file_path) 46 | copyfile(str(file_path), str(dataset_output_path / file_path.name)) 47 | 48 | 49 | if __name__ == '__main__': 50 | main(args.nas_path_str, args.scanner_name, args.output_path_str) 51 | -------------------------------------------------------------------------------- /src/download/download_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script used to download the demographic, FreeSurfer, and quality metrics data from the storage server. 3 | 4 | Script to download all the participants.tsv, freesurferData.csv, 5 | and quality metrics into the data folder. 6 | 7 | NOTE: Only for internal use at the Machine Learning in Mental Health Lab. 8 | """ 9 | import argparse 10 | from pathlib import Path 11 | from shutil import copyfile 12 | 13 | PROJECT_ROOT = Path.cwd() 14 | 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('-N', '--nas_path', 18 | dest='nas_path_str', 19 | help='Path to the Network Attached Storage system.') 20 | 21 | args = parser.parse_args() 22 | 23 | 24 | def download_files(data_dir, selected_path, dataset_prefix_path, nas_path): 25 | """Download the files necessary for the study from network-attached storage. 26 | These files include: 27 | - participants.tsv: Demographic data 28 | - freesurferData.csv: Neuroimaging data 29 | - mriqc_prob.csv: Raw data quality metrics 30 | - qoala_prob.csv: Freesurfer data quality metrics 31 | 32 | Parameters 33 | ---------- 34 | data_dir: PosixPath 35 | Path indicating local path to store the data. 36 | selected_path: PosixPath 37 | Path indicating external path with the data. 38 | dataset_prefix_path: str 39 | Dataset prefix path. 40 | nas_path: PosixPath 41 | Path indicating NAS system.
42 | """ 43 | 44 | dataset_path = data_dir / dataset_prefix_path 45 | dataset_path.mkdir(exist_ok=True, parents=True) 46 | 47 | copyfile(str(selected_path / 'participants.tsv'), 48 | str(dataset_path / 'participants.tsv')) 49 | 50 | try: 51 | copyfile(str(nas_path / 'FreeSurfer_preprocessed' / dataset_prefix_path / 'freesurferData.csv'), 52 | str(dataset_path / 'freesurferData.csv')) 53 | except: 54 | print(f'{dataset_prefix_path} does not have freesurferData.csv') 55 | 56 | try: 57 | mriqc_prob_path = next((nas_path / 'MRIQC' / dataset_prefix_path).glob('*unseen_pred.csv')) 58 | copyfile(str(mriqc_prob_path), str(dataset_path / 'mriqc_prob.csv')) 59 | except: 60 | print(f'{dataset_prefix_path} does not have *unseen_pred.csv') 61 | 62 | try: 63 | qoala_prob_path = next((nas_path / 'Qoala' / dataset_prefix_path).glob('Qoala*')) 64 | copyfile(str(qoala_prob_path), str(dataset_path / 'qoala_prob.csv')) 65 | except: 66 | print(f'{dataset_prefix_path} does not have Qoala*') 67 | 68 | 69 | def main(nas_path_str): 70 | """Perform download of selected datasets from the network-attached storage.""" 71 | nas_path = Path(nas_path_str) 72 | data_dir = PROJECT_ROOT / 'data' 73 | 74 | dataset_name = 'BIOBANK' 75 | selected_path = nas_path / 'BIDS_data' / dataset_name 76 | 77 | for subdirectory_selected_path in selected_path.iterdir(): 78 | if not subdirectory_selected_path.is_dir(): 79 | continue 80 | 81 | print(subdirectory_selected_path) 82 | 83 | scanner_name = subdirectory_selected_path.stem 84 | if (subdirectory_selected_path / 'participants.tsv').is_file(): 85 | download_files(data_dir, subdirectory_selected_path, dataset_name + 86 | '/' + scanner_name, nas_path) 87 | 88 | 89 | if __name__ == '__main__': 90 | main(args.nas_path_str) 91 | -------------------------------------------------------------------------------- /src/generalisation/README.md: -------------------------------------------------------------------------------- 1 | # Measuring the generalization of trained regressors 2 | 3 | The scripts in this folder are used to test the generalization performance of 4 | the models trained in the scripts in the 'comparison' subdirectory. This means 5 | that the models are applied to a new, independent dataset that was acquired 6 | on a different MRI scanner. 7 | 8 | Measuring generalization performance in an independent dataset eliminates 9 | sample bias from the performance measures, and it provides a more realistic 10 | representation of brain age as a biomarker in clinical practice or the like. 
-------------------------------------------------------------------------------- /src/generalisation/generalisation_calculate_mean_predictions.py: -------------------------------------------------------------------------------- 1 | """Script to create csv file with mean predictions across model repetitions""" 2 | 3 | import pandas as pd 4 | from pathlib import Path 5 | PROJECT_ROOT = Path.cwd() 6 | 7 | def main(): 8 | experiment_name = 'biobank_scanner2' 9 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 10 | 11 | model_ls = ['SVM', 'RVM', 'GPR', 12 | 'voxel_SVM', 'voxel_RVM', 13 | 'pca_SVM', 'pca_RVM', 'pca_GPR'] 14 | 15 | # Create df with subject IDs and chronological age 16 | # All mean model predictions will be added to this df in the loop 17 | # Based on an age_predictions csv file from model training to have the 18 | # same order of subjects 19 | example_file = pd.read_csv(experiment_dir / 'SVM' / 'age_predictions_test.csv') 20 | age_predictions_all = pd.DataFrame(example_file.loc[:, 'image_id':'Age']) 21 | 22 | # Loop over all models, calculate mean predictions across repetitions 23 | for model_name in model_ls: 24 | model_dir = experiment_dir / model_name 25 | file_name = model_dir / 'age_predictions_test.csv' 26 | try: 27 | model_data = pd.read_csv(file_name) 28 | except FileNotFoundError: 29 | print(f'No age prediction file for {model_name}.') 30 | raise 31 | 32 | repetition_cols = model_data.loc[:, 33 | 'Prediction 00_00' : 'Prediction 09_09'] 34 | 35 | # get mean predictions across repetitions 36 | model_data['prediction_mean'] = repetition_cols.mean(axis=1) 37 | 38 | # get those into one file for all models 39 | age_predictions_all[model_name] = model_data['prediction_mean'] 40 | 41 | # Calculate brainAGE for all models and add to age_predictions_all df 42 | # brainAGE = predicted age - chronological age 43 | for model_name in model_ls: 44 | brainage_model = age_predictions_all[model_name] - \ 45 | age_predictions_all['Age'] 46 | brainage_col_name = model_name + '_brainAGE' 47 | age_predictions_all[brainage_col_name] = brainage_model 48 | 49 | # Export age_predictions_all as csv 50 | age_predictions_all.to_csv(experiment_dir / 'age_predictions_test_allmodels.csv') 51 | 52 | 53 | if __name__ == '__main__': 54 | main() -------------------------------------------------------------------------------- /src/generalisation/generalisation_test_fs_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Tests models developed using FreeSurfer data from Biobank Scanner1 4 | on previously unseen data from Biobank Scanner2 to predict brain age. 5 | 6 | The script loops over the 100 models created in comparison_fs_data_train_svm.py 7 | and comparison_fs_data_train_rvm.py, loads their regressors, applies 8 | them to the Scanner2 data and saves all predictions per subjects 9 | in age_predictions_test.csv. 
10 | """ 11 | import argparse 12 | import random 13 | from math import sqrt 14 | from pathlib import Path 15 | 16 | import numpy as np 17 | from joblib import load 18 | from scipy import stats 19 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 20 | 21 | from utils import COLUMNS_NAME, load_freesurfer_dataset 22 | 23 | PROJECT_ROOT = Path.cwd() 24 | 25 | parser = argparse.ArgumentParser() 26 | 27 | parser.add_argument('-T', '--training_experiment_name', 28 | dest='training_experiment_name', 29 | help='Name of the experiment.') 30 | 31 | parser.add_argument('-G', '--test_experiment_name', 32 | dest='test_experiment_name', 33 | help='Name of the experiment.') 34 | 35 | parser.add_argument('-S', '--scanner_name', 36 | dest='scanner_name', 37 | help='Name of the scanner.') 38 | 39 | parser.add_argument('-M', '--model_name', 40 | dest='model_name', 41 | help='Name of the model.') 42 | 43 | parser.add_argument('-I', '--input_ids_file', 44 | dest='input_ids_file', 45 | default='cleaned_ids.csv', 46 | help='Filename indicating the ids to be used.') 47 | 48 | args = parser.parse_args() 49 | 50 | 51 | def main(training_experiment_name, test_experiment_name, scanner_name, model_name, input_ids_file): 52 | # ---------------------------------------------------------------------------------------- 53 | training_experiment_dir = PROJECT_ROOT / 'outputs' / training_experiment_name 54 | test_experiment_dir = PROJECT_ROOT / 'outputs' / test_experiment_name 55 | 56 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 57 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 58 | ids_path = PROJECT_ROOT / 'outputs' / test_experiment_name / input_ids_file 59 | 60 | # Create experiment's output directory 61 | test_model_dir = test_experiment_dir / model_name 62 | test_model_dir.mkdir(exist_ok=True) 63 | 64 | training_cv_dir = training_experiment_dir / model_name / 'cv' 65 | test_cv_dir = test_model_dir / 'cv' 66 | test_cv_dir.mkdir(exist_ok=True) 67 | 68 | dataset = load_freesurfer_dataset(participants_path, ids_path, freesurfer_path) 69 | 70 | # ---------------------------------------------------------------------------------------- 71 | # Initialise random seed 72 | np.random.seed(42) 73 | random.seed(42) 74 | 75 | # Normalise regional volumes in testing dataset by total intracranial volume (tiv) 76 | regions = dataset[COLUMNS_NAME].values 77 | 78 | tiv = dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 79 | 80 | regions_norm = np.true_divide(regions, tiv) 81 | age = dataset['Age'].values 82 | 83 | # Create dataframe to hold actual and predicted ages 84 | age_predictions = dataset[['image_id', 'Age']] 85 | age_predictions = age_predictions.set_index('image_id') 86 | 87 | n_repetitions = 10 88 | n_folds = 10 89 | 90 | for i_repetition in range(n_repetitions): 91 | for i_fold in range(n_folds): 92 | 93 | # Load model and scaler 94 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 95 | 96 | model = load(training_cv_dir / f'{prefix}_regressor.joblib') 97 | scaler = load(training_cv_dir / f'{prefix}_scaler.joblib') 98 | 99 | # Use RobustScaler to transform testing data 100 | x_test = scaler.transform(regions_norm) 101 | 102 | # Apply model to scaled data 103 | predictions = model.predict(x_test) 104 | 105 | mae = mean_absolute_error(age, predictions) 106 | rmse = sqrt(mean_squared_error(age, predictions)) 107 | r, _ = stats.pearsonr(age, predictions) 108 | r2 = r2_score(age, predictions) 109 | 
age_error_corr, _ = stats.spearmanr((predictions - age), age) 110 | 111 | # Save prediction per model in df 112 | age_predictions[f'Prediction {i_repetition:02d}_{i_fold:02d}'] = predictions 113 | 114 | # Save model scores 115 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 116 | np.save(test_cv_dir / f'{prefix}_scores.npy', scores_array) 117 | 118 | # Save predictions 119 | age_predictions.to_csv(test_model_dir / 'age_predictions_test.csv') 120 | 121 | 122 | if __name__ == '__main__': 123 | main(args.training_experiment_name, args.test_experiment_name, 124 | args.scanner_name, args.model_name, 125 | args.input_ids_file) 126 | -------------------------------------------------------------------------------- /src/generalisation/generalisation_test_pca_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Tests models trained using voxel-level data from Biobank Scanner1 4 | with reduced dimensionality through Principal Component Analysis (PCA), 5 | on previously unseen data from Biobank Scanner2 to predict brain age. 6 | 7 | The script loops over the 100 models created in comparison_pca_data_train_svm.py 8 | and comparison_pca_data_train_rvm.py, loads their regressors, applies them to the 9 | Scanner2 data and saves all predictions per subjects in age_predictions_test.csv 10 | """ 11 | import argparse 12 | import random 13 | from math import sqrt 14 | from pathlib import Path 15 | 16 | import numpy as np 17 | from joblib import load 18 | from scipy import stats 19 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 20 | from utils import load_demographic_data 21 | import pandas as pd 22 | from tqdm import tqdm 23 | 24 | PROJECT_ROOT = Path.cwd() 25 | 26 | parser = argparse.ArgumentParser() 27 | 28 | parser.add_argument('-T', '--training_experiment_name', 29 | dest='training_experiment_name', 30 | help='Name of the experiment.') 31 | 32 | parser.add_argument('-G', '--test_experiment_name', 33 | dest='test_experiment_name', 34 | help='Name of the experiment.') 35 | 36 | parser.add_argument('-S', '--scanner_name', 37 | dest='scanner_name', 38 | help='Name of the scanner.') 39 | 40 | parser.add_argument('-M', '--model_name', 41 | dest='model_name', 42 | help='Name of the model.') 43 | 44 | parser.add_argument('-I', '--input_ids_file', 45 | dest='input_ids_file', 46 | default='cleaned_ids.csv', 47 | help='Filename indicating the ids to be used.') 48 | 49 | args = parser.parse_args() 50 | 51 | 52 | def main(training_experiment_name, test_experiment_name, scanner_name, model_name, input_ids_file): 53 | # ---------------------------------------------------------------------------------------- 54 | training_experiment_dir = PROJECT_ROOT / 'outputs' / training_experiment_name 55 | test_experiment_dir = PROJECT_ROOT / 'outputs' / test_experiment_name 56 | 57 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 58 | pca_dir = PROJECT_ROOT / 'outputs' / 'pca' 59 | ids_path = PROJECT_ROOT / 'outputs' / test_experiment_name / input_ids_file 60 | # Create experiment's output directory 61 | test_model_dir = test_experiment_dir / model_name 62 | test_model_dir.mkdir(exist_ok=True) 63 | 64 | training_cv_dir = training_experiment_dir / model_name / 'cv' 65 | test_cv_dir = test_model_dir / 'cv' 66 | test_cv_dir.mkdir(exist_ok=True) 67 | 68 | participants_df = load_demographic_data(participants_path, ids_path) 69 | 70 | # 
---------------------------------------------------------------------------------------- 71 | # Initialise random seed 72 | np.random.seed(42) 73 | random.seed(42) 74 | 75 | # Normalise regional volumes in testing dataset by total intracranial volume (tiv) 76 | age = participants_df['Age'].values 77 | 78 | # Create dataframe to hold actual and predicted ages 79 | age_predictions = participants_df[['image_id', 'Age']] 80 | age_predictions = age_predictions.set_index('image_id') 81 | 82 | n_repetitions = 10 83 | n_folds = 10 84 | 85 | for i_repetition in range(n_repetitions): 86 | print(f'Repetition : {i_repetition}') 87 | for i_fold in tqdm(range(n_folds)): 88 | 89 | # Load model and scaler 90 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 91 | 92 | model = load(training_cv_dir / f'{prefix}_regressor.joblib') 93 | scaler = load(training_cv_dir / f'{prefix}_scaler.joblib') 94 | 95 | # Use RobustScaler to transform testing data 96 | output_prefix = f'{i_repetition:02d}_{i_fold:02d}' 97 | pca_path = pca_dir / f'{output_prefix}_pca_components_general.csv' 98 | 99 | pca_df = pd.read_csv(pca_path) 100 | pca_df['image_id']=pca_df['image_id'].str.replace('/media/kcl_1/HDD/DATASETS/BIOBANK/BIOBANK/','') 101 | pca_df['image_id']=pca_df['image_id'].str.replace('_Warped.nii.gz', '') 102 | 103 | dataset_df = pd.merge(pca_df, participants_df, on='image_id') 104 | pca_components = dataset_df[dataset_df.columns.difference(participants_df.columns)].values 105 | x_test = scaler.transform(pca_components) 106 | 107 | # Apply model to scaled data and measure error 108 | predictions = model.predict(x_test) 109 | 110 | mae = mean_absolute_error(age, predictions) 111 | rmse = sqrt(mean_squared_error(age, predictions)) 112 | r, _ = stats.pearsonr(age, predictions) 113 | r2 = r2_score(age, predictions) 114 | age_error_corr, _ = stats.spearmanr((predictions - age), age) 115 | 116 | # Save prediction per model in df 117 | age_predictions[f'Prediction {i_repetition:02d}_{i_fold:02d}'] = predictions 118 | 119 | # Save model scores 120 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 121 | np.save(test_cv_dir / f'{prefix}_scores.npy', scores_array) 122 | 123 | # Save predictions 124 | age_predictions.to_csv(test_model_dir / 'age_predictions_test.csv') 125 | 126 | 127 | if __name__ == '__main__': 128 | main(args.training_experiment_name, args.test_experiment_name, 129 | args.scanner_name, args.model_name, 130 | args.input_ids_file) 131 | -------------------------------------------------------------------------------- /src/generalisation/generalisation_test_voxel_data_rvm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Tests RVM models developed using voxel data from Biobank Scanner1 4 | on previously unseen data from Biobank Scanner2 to predict brain age. 5 | 6 | The script loops over the 100 RVM models created in comparison_voxel_data_train_rvm.py, 7 | loads their regressors, applies them to the Scanner2 data and saves all predictions 8 | per subjects in age_predictions_test.csv. 
9 | """ 10 | import argparse 11 | import random 12 | from math import sqrt 13 | from pathlib import Path 14 | 15 | import nibabel as nib 16 | import numpy as np 17 | from joblib import load 18 | from nilearn.masking import apply_mask 19 | from scipy import stats 20 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 21 | from sklearn.metrics.pairwise import pairwise_kernels 22 | from tqdm import tqdm 23 | 24 | from utils import load_demographic_data 25 | 26 | PROJECT_ROOT = Path.cwd() 27 | 28 | parser = argparse.ArgumentParser() 29 | 30 | parser.add_argument('-T', '--training_experiment_name', 31 | dest='training_experiment_name', 32 | help='Name of the experiment.') 33 | 34 | parser.add_argument('-G', '--test_experiment_name', 35 | dest='test_experiment_name', 36 | help='Name of the experiment.') 37 | 38 | parser.add_argument('-S', '--scanner_name', 39 | dest='scanner_name', 40 | help='Name of the scanner.') 41 | 42 | parser.add_argument('-P', '--input_path', 43 | dest='input_path_str', 44 | help='Path to the local folder with preprocessed images.') 45 | 46 | parser.add_argument('-M', '--model_name', 47 | dest='model_name', 48 | help='Name of the model.') 49 | 50 | parser.add_argument('-I', '--input_ids_file', 51 | dest='input_ids_file', 52 | default='cleaned_ids.csv', 53 | help='Filename indicating the ids to be used.') 54 | 55 | parser.add_argument('-N', '--mask_filename', 56 | dest='mask_filename', 57 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 58 | help='Input data type') 59 | 60 | parser.add_argument('-D', '--input_data_type', 61 | dest='input_data_type', 62 | default='.nii.gz', 63 | help='Input data type') 64 | 65 | args = parser.parse_args() 66 | 67 | 68 | def main(training_experiment_name, 69 | test_experiment_name, 70 | scanner_name, 71 | input_path_str, 72 | model_name, 73 | input_ids_file, 74 | input_data_type, 75 | mask_filename): 76 | # ---------------------------------------------------------------------------------------- 77 | input_path = Path(input_path_str) 78 | training_experiment_dir = PROJECT_ROOT / 'outputs' / training_experiment_name 79 | test_experiment_dir = PROJECT_ROOT / 'outputs' / test_experiment_name 80 | 81 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 82 | ids_path = PROJECT_ROOT / 'outputs' / test_experiment_name / input_ids_file 83 | 84 | # Create experiment's output directory 85 | test_model_dir = test_experiment_dir / model_name 86 | test_model_dir.mkdir(exist_ok=True) 87 | 88 | training_cv_dir = training_experiment_dir / model_name / 'cv' 89 | test_cv_dir = test_model_dir / 'cv' 90 | test_cv_dir.mkdir(exist_ok=True) 91 | 92 | demographic = load_demographic_data(participants_path, ids_path) 93 | 94 | # ---------------------------------------------------------------------------------------- 95 | # Initialise random seed 96 | np.random.seed(42) 97 | random.seed(42) 98 | 99 | # Load the mask image 100 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 101 | mask_img = nib.load(str(brain_mask)) 102 | 103 | age = demographic['Age'].values 104 | 105 | # Create dataframe to hold actual and predicted ages 106 | age_predictions = demographic[['image_id', 'Age']] 107 | age_predictions = age_predictions.set_index('image_id') 108 | 109 | n_repetitions = 10 110 | n_folds = 10 111 | 112 | pbar = tqdm(total=100) 113 | for i_repetition in range(n_repetitions): 114 | for i_fold in range(n_folds): 115 | 116 | # Load model 117 | prefix = 
f'{i_repetition:02d}_{i_fold:02d}'
118 | pbar.set_description(f'{prefix}')
119 | relevance_vector = np.load(training_cv_dir / f'{prefix}_relevance_vectors.npz')['relevance_vectors_']
120 | model = load(training_cv_dir / f'{prefix}_regressor.joblib')
121 | 
122 | pbar2 = tqdm(demographic['image_id'])
123 | for i, subject_id in enumerate(pbar2):
124 | subject_path = input_path / f"{subject_id}_Warped{input_data_type}"
125 | pbar2.set_description(f'{subject_path.name}')
126 | 
127 | try:
128 | img = nib.load(str(subject_path))
129 | except FileNotFoundError:
130 | print(f'No image file {subject_path}.')
131 | raise
132 | 
133 | img = apply_mask(img, mask_img)
134 | img = np.asarray(img, dtype='float64')
135 | img = np.nan_to_num(img)
136 | 
137 | try:
138 | K = pairwise_kernels(img[None, :], Y=relevance_vector, metric='linear')
139 | except Exception:
140 | K = np.zeros((1, relevance_vector.shape[0]))  # fall back to a zero kernel row if the kernel cannot be computed
141 | K = K / model._scale
142 | K = np.hstack((np.ones((1, 1)), K))
143 | 
144 | prediction = K @ model.mu_
145 | 
146 | # Save prediction per model in df
147 | age_predictions.loc[subject_id, f'Prediction {prefix}'] = prediction
148 | 
149 | # Get and save scores
150 | predictions = age_predictions[f'Prediction {prefix}'].values
151 | 
152 | mae = mean_absolute_error(age, predictions)
153 | rmse = sqrt(mean_squared_error(age, predictions))
154 | r, _ = stats.pearsonr(age, predictions)
155 | r2 = r2_score(age, predictions)
156 | age_error_corr, _ = stats.spearmanr((predictions - age), age)
157 | 
158 | # Save model scores
159 | scores_array = np.array([r, r2, mae, rmse, age_error_corr])
160 | np.save(test_cv_dir / f'{prefix}_scores.npy', scores_array)
161 | 
162 | pbar.update(1)
163 | pbar.close()
164 | # Save predictions
165 | age_predictions.to_csv(test_model_dir / 'age_predictions_test.csv')
166 | 
167 | 
168 | if __name__ == '__main__':
169 | main(args.training_experiment_name, args.test_experiment_name, args.scanner_name,
170 | args.input_path_str, args.model_name, args.input_ids_file,
171 | args.input_data_type, args.mask_filename)
172 | 
--------------------------------------------------------------------------------
/src/generalisation/generalisation_test_voxel_data_svm.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | """
3 | Tests SVM models developed using voxel data from Biobank Scanner1
4 | on previously unseen data from Biobank Scanner2 to predict brain age.
5 | 
6 | The script loops over the 100 SVM models created in
7 | comparison_voxel_data_train_svm.py, loads their regressors,
8 | applies them to the Scanner2 data and saves all predictions per subject in a csv file.
9 | 
10 | This script assumes that a nifti file with the feature weights has already been
11 | pre-computed.
To compute the weights use the script 12 | `src/comparison/comparison_voxel_data_svm_primal_weights.py` 13 | """ 14 | import argparse 15 | import random 16 | from math import sqrt 17 | from pathlib import Path 18 | 19 | import nibabel as nib 20 | import numpy as np 21 | from joblib import load 22 | from nilearn.masking import apply_mask 23 | from scipy import stats 24 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 25 | from tqdm import tqdm 26 | 27 | from utils import load_demographic_data 28 | 29 | PROJECT_ROOT = Path.cwd() 30 | 31 | parser = argparse.ArgumentParser() 32 | 33 | parser.add_argument('-T', '--training_experiment_name', 34 | dest='training_experiment_name', 35 | help='Name of the experiment.') 36 | 37 | parser.add_argument('-G', '--test_experiment_name', 38 | dest='test_experiment_name', 39 | help='Name of the experiment.') 40 | 41 | parser.add_argument('-S', '--scanner_name', 42 | dest='scanner_name', 43 | help='Name of the scanner.') 44 | 45 | parser.add_argument('-M', '--model_name', 46 | dest='model_name', 47 | help='Name of the model.') 48 | 49 | parser.add_argument('-P', '--input_path', 50 | dest='input_path_str', 51 | help='Path to the local folder with preprocessed images.') 52 | 53 | parser.add_argument('-I', '--input_ids_file', 54 | dest='input_ids_file', 55 | default='cleaned_ids.csv', 56 | help='Filename indicating the ids to be used.') 57 | 58 | parser.add_argument('-N', '--mask_filename', 59 | dest='mask_filename', 60 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 61 | help='Input data type') 62 | 63 | parser.add_argument('-D', '--input_data_type', 64 | dest='input_data_type', 65 | default='.nii.gz', 66 | help='Input data type') 67 | 68 | args = parser.parse_args() 69 | 70 | 71 | def main(training_experiment_name, 72 | test_experiment_name, 73 | scanner_name, 74 | input_path_str, 75 | model_name, 76 | input_ids_file, 77 | input_data_type, 78 | mask_filename): 79 | # ---------------------------------------------------------------------------------------- 80 | input_path = Path(input_path_str) 81 | training_experiment_dir = PROJECT_ROOT / 'outputs' / training_experiment_name 82 | test_experiment_dir = PROJECT_ROOT / 'outputs' / test_experiment_name 83 | 84 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 85 | ids_path = PROJECT_ROOT / 'outputs' / test_experiment_name / input_ids_file 86 | 87 | # Create experiment's output directory 88 | test_model_dir = test_experiment_dir / model_name 89 | test_model_dir.mkdir(exist_ok=True) 90 | 91 | training_cv_dir = training_experiment_dir / model_name / 'cv' 92 | test_cv_dir = test_model_dir / 'cv' 93 | test_cv_dir.mkdir(exist_ok=True) 94 | 95 | demographic = load_demographic_data(participants_path, ids_path) 96 | 97 | # ---------------------------------------------------------------------------------------- 98 | # Initialise random seed 99 | np.random.seed(42) 100 | random.seed(42) 101 | 102 | # Load the mask image 103 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 104 | mask_img = nib.load(str(brain_mask)) 105 | 106 | age = demographic['Age'].values 107 | 108 | # Create dataframe to hold actual and predicted ages 109 | age_predictions = demographic[['image_id', 'Age']] 110 | age_predictions = age_predictions.set_index('image_id') 111 | 112 | n_repetitions = 10 113 | n_folds = 10 114 | 115 | for i, subject_id in enumerate(tqdm(demographic['image_id'])): 116 | subject_path = input_path / 
f"{subject_id}_Warped{input_data_type}" 117 | print(subject_path) 118 | 119 | try: 120 | img = nib.load(str(subject_path)) 121 | except FileNotFoundError: 122 | print(f'No image file {subject_path}.') 123 | raise 124 | 125 | img = apply_mask(img, mask_img) 126 | img = np.asarray(img, dtype='float64') 127 | img = np.nan_to_num(img) 128 | 129 | for i_repetition in range(n_repetitions): 130 | for i_fold in range(n_folds): 131 | 132 | # Load model and scaler 133 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 134 | 135 | model = load(training_cv_dir / f'{prefix}_regressor.joblib') 136 | weights_path = training_cv_dir / f'{prefix}_importance.nii.gz' 137 | weights_img = nib.load(str(weights_path)) 138 | 139 | weights_img = weights_img.get_fdata() 140 | weights_img = nib.Nifti1Image(weights_img, mask_img.affine) 141 | 142 | weights_img = apply_mask(weights_img, mask_img) 143 | weights_img = np.asarray(weights_img, dtype='float64') 144 | weights_img = np.nan_to_num(weights_img) 145 | 146 | prediction = np.dot(weights_img, img.T) + model.intercept_ 147 | 148 | # Save prediction per model in df 149 | age_predictions.loc[subject_id, f'Prediction {prefix}'] = prediction 150 | 151 | # Save predictions 152 | age_predictions.to_csv(test_model_dir / 'age_predictions_test.csv') 153 | 154 | # Get and save scores 155 | for i_repetition in range(n_repetitions): 156 | for i_fold in range(n_folds): 157 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 158 | predictions = age_predictions[f'Prediction {prefix}'].values 159 | 160 | mae = mean_absolute_error(age, predictions) 161 | rmse = sqrt(mean_squared_error(age, predictions)) 162 | r, _ = stats.pearsonr(age, predictions) 163 | r2 = r2_score(age, predictions) 164 | age_error_corr, _ = stats.spearmanr((predictions - age), age) 165 | 166 | # Save model scores 167 | scores_array = np.array([r, r2, mae, rmse, age_error_corr]) 168 | np.save(test_cv_dir / f'{prefix}_scores.npy', scores_array) 169 | 170 | 171 | if __name__ == '__main__': 172 | main(args.training_experiment_name, args.test_experiment_name, args.scanner_name, 173 | args.input_path_str, args.model_name, args.input_ids_file, 174 | args.input_data_type, args.mask_filename) 175 | -------------------------------------------------------------------------------- /src/misc/README.md: -------------------------------------------------------------------------------- 1 | # MISC 2 | The scripts in this folder measure the performance of the models by the 3 | size of the training set. They measure performance through univariate 4 | analysis of each normalised brain region and pairwise t-test analysis 5 | between SVM models with different hyperparameters C. 6 | 7 | This analysis was not presented in the manuscript. 8 | 9 | -------------------------------------------------------------------------------- /src/misc/misc_svm_hyperparameters_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Compares pairwise performance through independent t-test 4 | of SVM models with different hyperparameters C. 
5 | Saves results into `svm_params_ttest.csv` and `svm_params_values.csv` 6 | """ 7 | import argparse 8 | import itertools 9 | from pathlib import Path 10 | 11 | import numpy as np 12 | import pandas as pd 13 | from joblib import load 14 | 15 | from utils import ttest_ind_corrected 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | parser = argparse.ArgumentParser() 20 | 21 | parser.add_argument('-E', '--experiment_name', 22 | dest='experiment_name', 23 | help='Name of the experiment.') 24 | 25 | args = parser.parse_args() 26 | 27 | 28 | def main(experiment_name): 29 | """Pairwise comparison of SVM classifier performances with different hyperparameters C.""" 30 | # ---------------------------------------------------------------------------------------- 31 | svm_dir = PROJECT_ROOT / 'outputs' / experiment_name / 'SVM' 32 | cv_dir = svm_dir / 'cv' 33 | 34 | n_repetitions = 10 35 | n_folds = 10 36 | 37 | search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0, 2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]} 38 | 39 | scores_params = [] 40 | for i_repetition in range(n_repetitions): 41 | for i_fold in range(n_folds): 42 | params_dict = load(cv_dir / f'{i_repetition:02d}_{i_fold:02d}_params.joblib') 43 | scores_params.append(params_dict['means']) 44 | 45 | scores_params = np.array(scores_params) 46 | 47 | combinations = list(itertools.combinations(range(scores_params.shape[1]), 2)) 48 | 49 | # Bonferroni correction for multiple comparisons 50 | corrected_alpha = 0.05 / len(combinations) 51 | 52 | results_df = pd.DataFrame(columns=['params', 'p-value', 'stats']) 53 | 54 | # Corrected repeated k-fold cv test to compare pairwise performance of the SVM classifiers 55 | # through independent t-test 56 | for param_a, param_b in combinations: 57 | statistic, pvalue = ttest_ind_corrected(scores_params[:, param_a], scores_params[:, param_b], 58 | k=n_folds, r=n_repetitions) 59 | 60 | print(f"{search_space['C'][param_a]} vs. {search_space['C'][param_b]} pvalue: {pvalue:6.3}", end='') 61 | if pvalue <= corrected_alpha: 62 | print('*') 63 | else: 64 | print('') 65 | 66 | results_df = results_df.append({'params': f"{search_space['C'][param_a]} vs. {search_space['C'][param_b]}", 67 | 'p-value': pvalue, 68 | 'stats': statistic}, 69 | ignore_index=True) 70 | # Output to csv 71 | results_df.to_csv(svm_dir / 'svm_params_ttest.csv', index=False) 72 | 73 | values_df = pd.DataFrame(columns=['measures'] + list(search_space['C'])) 74 | 75 | scores_params_mean = np.mean(scores_params, axis=0) 76 | scores_params_std = np.std(scores_params, axis=0) 77 | 78 | values_df.loc[0] = ['mean'] + list(scores_params_mean) 79 | values_df.loc[1] = ['std'] + list(scores_params_std) 80 | 81 | values_df.to_csv(svm_dir / 'svm_params_values.csv', index=False) 82 | 83 | 84 | if __name__ == '__main__': 85 | main(args.experiment_name) 86 | -------------------------------------------------------------------------------- /src/misc/misc_univariate_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Implements univariate analysis based on [1], regresses for age and volume per region: 4 | 1. normalise each brain region 5 | 2. creates df with normalised brain region (dependent variable) and age of participant 6 | (independent variable) (+ quadratic and cubic age) 7 | 3. outputs coefficient per subject 8 | 9 | References: 10 | [1] - Zhao, Lu, et al. (2018) Age-Related Differences in Brain Morphology and the Modifiers 11 | in Middle-Aged and Older Adults. Cerebral Cortex. 
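Concretely, for each region the fitted OLS model is (notation illustrative):

    norm_vol = b0 + b1*Age + b2*Age^2 + b3*Age^3 + error

where norm_vol is the regional volume divided by the estimated total intracranial volume
(multiplied by 100), as computed by `normalise_region_df` below.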
12 | """ 13 | import argparse 14 | from pathlib import Path 15 | 16 | import numpy as np 17 | import pandas as pd 18 | import statsmodels.api as sm 19 | 20 | from utils import COLUMNS_NAME, load_freesurfer_dataset 21 | 22 | PROJECT_ROOT = Path.cwd() 23 | 24 | parser = argparse.ArgumentParser() 25 | 26 | parser.add_argument('-E', '--experiment_name', 27 | dest='experiment_name', 28 | help='Name of the experiment.') 29 | 30 | parser.add_argument('-S', '--scanner_name', 31 | dest='scanner_name', 32 | help='Name of the scanner.') 33 | 34 | parser.add_argument('-I', '--input_ids_file', 35 | dest='input_ids_file', 36 | default='homogenized_ids.csv', 37 | help='Filename indicating the ids to be used.') 38 | 39 | args = parser.parse_args() 40 | 41 | 42 | def normalise_region_df(df, region_name): 43 | """Normalise region by total intracranial volume 44 | 45 | Parameters 46 | ---------- 47 | df: dataframe 48 | Data to be normalized 49 | region_name: str 50 | Region of interest 51 | 52 | Returns 53 | ------- 54 | float 55 | Normalised region 56 | """ 57 | return df[region_name] / df['EstimatedTotalIntraCranialVol'] * 100 58 | 59 | 60 | def linear_regression(df, region_name): 61 | """Perform linear regression using ordinary least squares (OLS) method 62 | 63 | Parameters 64 | ---------- 65 | df: dataframe 66 | Dataset to be regressed 67 | region_name: str 68 | Region of interest 69 | 70 | Returns 71 | ------- 72 | OLS_results.params: ndarray 73 | Estimated parameters 74 | OLS_results.bse: float 75 | Standard error of the parameter estimates 76 | OLS_results.tvalues: parameter 77 | t-statistic of parameter estimates 78 | OLS_results.pvalues: float 79 | Two-tailed p-values of the t-statistics of the parameters 80 | """ 81 | 82 | endog = df['Norm_vol_' + region_name].values 83 | exog = sm.add_constant(df[['Age', 'Age^2', 'Age^3']].values) 84 | 85 | OLS_model = sm.OLS(endog, exog) 86 | 87 | OLS_results = OLS_model.fit() 88 | 89 | return OLS_results.params, OLS_results.bse, OLS_results.tvalues, OLS_results.pvalues 90 | 91 | 92 | def main(experiment_name, scanner_name, input_ids_file): 93 | # ---------------------------------------------------------------------------------------- 94 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 95 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 96 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 97 | ids_path = experiment_dir / input_ids_file 98 | 99 | # Create experiment's output directory 100 | univariate_dir = experiment_dir / 'univariate_analysis' 101 | univariate_dir.mkdir(exist_ok=True) 102 | 103 | dataset = load_freesurfer_dataset(participants_path, ids_path, freesurfer_path) 104 | 105 | # Create new df to add normalised regional volumes 106 | normalised_df = dataset[['participant_id', 'Diagn', 'Gender', 'Age']] 107 | normalised_df['Age^2'] = normalised_df['Age'] ** 2 108 | normalised_df['Age^3'] = normalised_df['Age'] ** 3 109 | 110 | # Create empty df for regression output; regions to be added 111 | regression_output = pd.DataFrame({'Row_labels_stat': ['Coeff', 'Coeff', 'Coeff', 'Coeff', 112 | 'std_err', 'std_err', 'std_err', 'std_err', 113 | 't_stats', 't_stats', 't_stats', 't_stats', 114 | 'p_val', 'p_val', 'p_val', 'p_val'], 115 | 116 | 'Row_labels_exog': ['Constant', 'Age', 'Age2', 'Age3', 117 | 'Constant', 'Age', 'Age2', 'Age3', 118 | 'Constant', 'Age', 'Age2', 'Age3', 119 | 'Constant', 'Age', 'Age2', 'Age3']}) 120 | 121 | 
regression_output = regression_output.set_index(['Row_labels_stat', 'Row_labels_exog'])
122 | 
123 | for region_name in COLUMNS_NAME:
124 | print(region_name)
125 | normalised_df['Norm_vol_' + region_name] = normalise_region_df(dataset, region_name)
126 | 
127 | # Linear regression - ordinary least squares (OLS)
128 | coeff, std_err, t_value, p_value = linear_regression(normalised_df, region_name)
129 | 
130 | regression_output[region_name] = np.concatenate((coeff, std_err, t_value, p_value), axis=0)
131 | 
132 | # Output to csv (the row labels are kept as the index)
133 | regression_output.to_csv(univariate_dir / 'OLS_result.csv')
134 | 
135 | 
136 | if __name__ == '__main__':
137 | main(args.experiment_name, args.scanner_name,
138 | args.input_ids_file)
139 | 
--------------------------------------------------------------------------------
/src/preprocessing/README.md:
--------------------------------------------------------------------------------
1 | # Preprocessing
2 | The scripts in this folder perform the preprocessing necessary
3 | to conduct the analysis for brain age prediction.
4 | 
5 | This includes:
6 | - cleaning the data and removing subjects with incomplete information,
7 | - performing quality control of the raw and segmented data,
8 | - computing the kernel (Gram) matrices for the voxel data,
9 | - performing dimensionality reduction on the voxel data using PCA
--------------------------------------------------------------------------------
/src/preprocessing/clean_data.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | """ Clean UK Biobank data.
3 | 
4 | Most of the subjects are white, and some ages have a very low number of subjects (<100).
5 | Ethnic minorities and under-represented ages are removed from further analysis,
6 | as well as subjects with any mental or brain disorder.
7 | """
8 | import argparse
9 | from pathlib import Path
10 | 
11 | from utils import load_demographic_data
12 | 
13 | PROJECT_ROOT = Path.cwd()
14 | 
15 | parser = argparse.ArgumentParser()
16 | 
17 | parser.add_argument('-E', '--experiment_name',
18 | dest='experiment_name',
19 | help='Name of the experiment.')
20 | 
21 | parser.add_argument('-S', '--scanner_name',
22 | dest='scanner_name',
23 | help='Name of the scanner.')
24 | 
25 | parser.add_argument('-I', '--input_ids_file',
26 | dest='input_ids_file',
27 | default='freesurferData.csv',
28 | help='Filename indicating the ids to be used.')
29 | 
30 | args = parser.parse_args()
31 | 
32 | 
33 | def main(experiment_name, scanner_name, input_ids_file):
34 | """Clean UK Biobank data."""
35 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv'
36 | ids_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / input_ids_file
37 | 
38 | output_ids_filename = 'cleaned_ids_noqc.csv'
39 | # ----------------------------------------------------------------------------------------
40 | # Create experiment's output directory
41 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name
42 | experiment_dir.mkdir(exist_ok=True)
43 | 
44 | dataset = load_demographic_data(participants_path, ids_path)
45 | 
46 | # Exclude subjects outside [47, 73] interval (ages with <100 participants).
47 | dataset = dataset.loc[(dataset['Age'] >= 47) & (dataset['Age'] <= 73)] 48 | 49 | # Exclude non-white ethnicities due to small subgroups 50 | dataset = dataset[dataset['Ethnicity'] == 'White'] 51 | 52 | # Exclude patients 53 | dataset = dataset[dataset['Diagn'] == 1] 54 | 55 | output_ids_df = dataset[['image_id']] 56 | 57 | assert sum(output_ids_df.duplicated(subset='image_id')) == 0 58 | 59 | output_ids_df.to_csv(experiment_dir / output_ids_filename, index=False) 60 | 61 | 62 | if __name__ == '__main__': 63 | main(args.experiment_name, args.scanner_name, 64 | args.input_ids_file) 65 | -------------------------------------------------------------------------------- /src/preprocessing/compute_kernel_matrix.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ Script to create the Kernel matrix (Gram matrix). 3 | 4 | The Kernel matrix will be used on the analysis with voxel data. 5 | """ 6 | import argparse 7 | from pathlib import Path 8 | 9 | import nibabel as nib 10 | import numpy as np 11 | import pandas as pd 12 | from nilearn.masking import apply_mask 13 | from tqdm import tqdm 14 | 15 | PROJECT_ROOT = Path.cwd() 16 | 17 | parser = argparse.ArgumentParser() 18 | 19 | parser.add_argument('-P', '--input_path', 20 | dest='input_path_str', 21 | help='Path to the local folder with preprocessed images.') 22 | 23 | parser.add_argument('-E', '--experiment_name', 24 | dest='experiment_name', 25 | help='Name of the experiment.') 26 | 27 | parser.add_argument('-I', '--input_ids_file', 28 | dest='input_ids_file', 29 | default='homogenized_ids.csv', 30 | help='Filename indicating the ids to be used.') 31 | 32 | parser.add_argument('-D', '--input_data_type', 33 | dest='input_data_type', 34 | default='.nii.gz', 35 | help='Input data type') 36 | 37 | parser.add_argument('-M', '--mask_filename', 38 | dest='mask_filename', 39 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 40 | help='Input data type') 41 | 42 | args = parser.parse_args() 43 | 44 | 45 | def calculate_gram_matrix(subjects_path, mask_img, step_size=1000): 46 | """Calculate the Gram matrix. """ 47 | n_samples = len(subjects_path) 48 | gram_matrix = np.float64(np.zeros((n_samples, n_samples))) 49 | 50 | # Outer loop 51 | outer_pbar = tqdm(range(int(np.ceil(n_samples / np.float(step_size))))) 52 | for ii in outer_pbar: 53 | outer_pbar.set_description(f'Processing outer loop {ii}') 54 | # Generate indices and then paths for this block 55 | start_ind_1 = ii * step_size 56 | stop_ind_1 = min(start_ind_1 + step_size, n_samples) 57 | block_paths_1 = subjects_path[start_ind_1:stop_ind_1] 58 | 59 | # Read in the images in this block 60 | images_1 = [] 61 | images_1_pbar = tqdm(block_paths_1) 62 | for path in images_1_pbar: 63 | images_1_pbar.set_description(f'Loading outer image {path}') 64 | try: 65 | img = nib.load(str(path)) 66 | except FileNotFoundError: 67 | print(f'No image file {path}.') 68 | raise 69 | 70 | # Extract only the brain voxels. This will create a 1D array. 
71 | img = apply_mask(img, mask_img) 72 | img = np.asarray(img, dtype='float64') 73 | img = np.nan_to_num(img) 74 | images_1.append(img) 75 | del img 76 | images_1 = np.array(images_1) 77 | 78 | # Inner loop 79 | inner_pbar = tqdm(range(ii + 1)) 80 | for jj in inner_pbar: 81 | 82 | # If ii = jj, then sets of image data are the same - no need to load 83 | if ii == jj: 84 | start_ind_2 = start_ind_1 85 | stop_ind_2 = stop_ind_1 86 | images_2 = images_1 87 | 88 | # If ii !=jj, read in a different block of images 89 | else: 90 | # Generate indices and then paths for this block 91 | start_ind_2 = jj * step_size 92 | stop_ind_2 = min(start_ind_2 + step_size, n_samples) 93 | block_paths_2 = subjects_path[start_ind_2:stop_ind_2] 94 | 95 | images_2 = [] 96 | images_2_pbar = tqdm(block_paths_2) 97 | for path in images_2_pbar: 98 | images_2_pbar.set_description(f'Loading inner image {path}') 99 | try: 100 | img = nib.load(str(path)) 101 | except FileNotFoundError: 102 | print(f'No image file {path}.') 103 | raise 104 | 105 | img = apply_mask(img, mask_img) 106 | img = np.asarray(img, dtype='float64') 107 | img = np.nan_to_num(img) 108 | images_2.append(img) 109 | del img 110 | images_2 = np.array(images_2) 111 | 112 | block_K = np.dot(images_1, np.transpose(images_2)) 113 | gram_matrix[start_ind_1:stop_ind_1, start_ind_2:stop_ind_2] = block_K 114 | gram_matrix[start_ind_2:stop_ind_2, start_ind_1:stop_ind_1] = np.transpose(block_K) 115 | 116 | return gram_matrix 117 | 118 | 119 | def main(input_path_str, experiment_name, input_ids_file, input_data_type, mask_filename): 120 | """""" 121 | dataset_path = Path(input_path_str) 122 | 123 | output_path = PROJECT_ROOT / 'outputs' / 'kernels' 124 | output_path.mkdir(exist_ok=True, parents=True) 125 | 126 | ids_df = pd.read_csv(PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file) 127 | 128 | # Get list of subjects included in the analysis 129 | subjects_path = [str(dataset_path / f'{subject_id}_Warped{input_data_type}') for subject_id in ids_df['image_id']] 130 | 131 | print(f'Total number of images: {len(ids_df)}') 132 | 133 | # ---------------------------------------------------------------------------------------- 134 | # Load the mask image 135 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 136 | mask_img = nib.load(str(brain_mask)) 137 | 138 | gram_matrix = calculate_gram_matrix(subjects_path, mask_img) 139 | 140 | gram_df = pd.DataFrame(columns=ids_df['image_id'].tolist(), data=gram_matrix) 141 | gram_df['image_id'] = ids_df['image_id'] 142 | gram_df = gram_df.set_index('image_id') 143 | 144 | gram_df.to_csv(output_path / 'kernel.csv') 145 | 146 | 147 | if __name__ == '__main__': 148 | main(args.input_path_str, args.experiment_name, 149 | args.input_ids_file, 150 | args.input_data_type, args.mask_filename) 151 | -------------------------------------------------------------------------------- /src/preprocessing/compute_kernel_matrix_general.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ Script to create the Kernel matrix (Gram matrix) for the generalisation analysis. 3 | 4 | In this script, we measure the kernel function between subjects from site 1 and 2. 
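The resulting matrix is rectangular rather than square: entry (i, j) is the linear kernel (dot product)
between the masked voxel vector of subject i from site 1 and that of subject j from site 2. As in the
square case, it is computed block by block (500 images at a time) to keep memory usage bounded.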
5 | """ 6 | import argparse 7 | from pathlib import Path 8 | 9 | import nibabel as nib 10 | import numpy as np 11 | import pandas as pd 12 | from nilearn.masking import apply_mask 13 | from tqdm import tqdm 14 | 15 | PROJECT_ROOT = Path.cwd() 16 | 17 | parser = argparse.ArgumentParser() 18 | 19 | parser.add_argument('-P', '--input_path', 20 | dest='input_path_str', 21 | help='Path to the local folder with preprocessed images.') 22 | 23 | parser.add_argument('-E', '--experiment_name', 24 | dest='experiment_name', 25 | help='Name of the experiment.') 26 | 27 | parser.add_argument('-I', '--input_ids_file', 28 | dest='input_ids_file', 29 | default='homogenized_ids.csv', 30 | help='Filename indicating the ids to be used.') 31 | 32 | parser.add_argument('-D', '--input_data_type', 33 | dest='input_data_type', 34 | default='.nii.gz', 35 | help='Input data type') 36 | 37 | parser.add_argument('-M', '--mask_filename', 38 | dest='mask_filename', 39 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 40 | help='Input data type') 41 | 42 | parser.add_argument('-P2', '--input_path_2', 43 | dest='input_path_str_2', 44 | help='Path to the local folder with preprocessed images.') 45 | 46 | parser.add_argument('-E2', '--experiment_name_2', 47 | dest='experiment_name_2', 48 | help='Name of the experiment.') 49 | 50 | parser.add_argument('-I2', '--input_ids_file_2', 51 | dest='input_ids_file_2', 52 | default='cleaned_ids.csv', 53 | help='Filename indicating the ids to be used.') 54 | 55 | parser.add_argument('-S', '--output_suffix', 56 | dest='output_suffix', 57 | default='_general', 58 | help='Filename indicating the ids to be used.') 59 | 60 | args = parser.parse_args() 61 | 62 | 63 | def calculate_gram_matrix(subjects_path, mask_img, subjects_path_2, step_size=500): 64 | """Calculate the Gram matrix. """ 65 | n_samples = len(subjects_path) 66 | n_samples_2 = len(subjects_path_2) 67 | gram_matrix = np.float64(np.zeros((n_samples, n_samples_2))) 68 | 69 | # Outer loop 70 | outer_pbar = tqdm(range(int(np.ceil(n_samples / np.float(step_size))))) 71 | for ii in outer_pbar: 72 | outer_pbar.set_description(f'Processing outer loop {ii}') 73 | # Generate indices and then paths for this block 74 | start_ind_1 = ii * step_size 75 | stop_ind_1 = min(start_ind_1 + step_size, n_samples) 76 | block_paths_1 = subjects_path[start_ind_1:stop_ind_1] 77 | 78 | # Read in the images in this block 79 | images_1 = [] 80 | images_1_pbar = tqdm(block_paths_1) 81 | for path in images_1_pbar: 82 | images_1_pbar.set_description(f'Loading outer image {path}') 83 | try: 84 | img = nib.load(str(path)) 85 | except FileNotFoundError: 86 | print(f'No image file {path}.') 87 | raise 88 | 89 | # Extract only the brain voxels. This will create a 1D array. 
90 | img = apply_mask(img, mask_img) 91 | img = np.asarray(img, dtype='float64') 92 | img = np.nan_to_num(img) 93 | images_1.append(img) 94 | del img 95 | images_1 = np.array(images_1) 96 | 97 | # Inner loop 98 | inner_pbar = tqdm(range(int(np.ceil(n_samples_2 / np.float(step_size))))) 99 | for jj in inner_pbar: 100 | # Generate indices and then paths for this block 101 | start_ind_2 = jj * step_size 102 | stop_ind_2 = min(start_ind_2 + step_size, n_samples_2) 103 | block_paths_2 = subjects_path_2[start_ind_2:stop_ind_2] 104 | 105 | images_2 = [] 106 | images_2_pbar = tqdm(block_paths_2) 107 | for path in images_2_pbar: 108 | images_2_pbar.set_description(f'Loading inner image {path}') 109 | try: 110 | img = nib.load(str(path)) 111 | except FileNotFoundError: 112 | print(f'No image file {path}.') 113 | raise 114 | 115 | img = apply_mask(img, mask_img) 116 | img = np.asarray(img, dtype='float64') 117 | img = np.nan_to_num(img) 118 | images_2.append(img) 119 | del img 120 | images_2 = np.array(images_2) 121 | 122 | block_K = np.dot(images_1, np.transpose(images_2)) 123 | gram_matrix[start_ind_1:stop_ind_1, start_ind_2:stop_ind_2] = block_K 124 | 125 | return gram_matrix 126 | 127 | 128 | def main(input_path_str, experiment_name, input_ids_file, input_data_type, mask_filename, 129 | input_path_str_2, experiment_name_2, input_ids_file_2, output_suffix): 130 | """""" 131 | dataset_path = Path(input_path_str) 132 | 133 | output_path = PROJECT_ROOT / 'outputs' / 'kernels' 134 | output_path.mkdir(exist_ok=True, parents=True) 135 | 136 | ids_df = pd.read_csv(PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file) 137 | 138 | # Get list of subjects included in the analysis 139 | subjects_path = [str(dataset_path / f'{subject_id}_Warped{input_data_type}') for subject_id in ids_df['image_id']] 140 | 141 | print(f'Total number of images: {len(ids_df)}') 142 | 143 | # Dataset_2 144 | dataset_path_2 = Path(input_path_str_2) 145 | ids_df_2 = pd.read_csv(PROJECT_ROOT / 'outputs' / experiment_name_2 / input_ids_file_2) 146 | subjects_path_2 = [str(dataset_path_2 / f'{subject_id}_Warped{input_data_type}') for subject_id in ids_df_2['image_id']] 147 | 148 | print(f'Total number of images: {len(ids_df_2)}') 149 | 150 | # ---------------------------------------------------------------------------------------- 151 | # Load the mask image 152 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 153 | mask_img = nib.load(str(brain_mask)) 154 | 155 | gram_matrix = calculate_gram_matrix(subjects_path, mask_img, subjects_path_2) 156 | 157 | gram_df = pd.DataFrame(columns=ids_df_2['image_id'].tolist(), data=gram_matrix) 158 | gram_df['image_id'] = ids_df['image_id'] 159 | gram_df = gram_df.set_index('image_id') 160 | 161 | gram_df.to_csv(output_path / f'kernel{output_suffix}.csv') 162 | 163 | 164 | if __name__ == '__main__': 165 | main(args.input_path_str, args.experiment_name, 166 | args.input_ids_file, 167 | args.input_data_type, args.mask_filename, 168 | args.input_path_str_2, args.experiment_name_2, args.input_ids_file_2, args.output_suffix) 169 | -------------------------------------------------------------------------------- /src/preprocessing/compute_pca_variance_explained.py: -------------------------------------------------------------------------------- 1 | """Script to calculate the % variance explained from the PCA models""" 2 | 3 | import pandas as pd 4 | from joblib import load 5 | from pathlib import Path 6 | 7 | PROJECT_ROOT = Path.cwd() 8 | 9 | 10 | def main(): 11 | pca_path 
= PROJECT_ROOT / 'outputs' / 'pca' / 'models'
12 | 
13 | # Get list of file names for pca models
14 | n_repetitions = 10
15 | n_folds = 10
16 | 
17 | pca_name_ls = []
18 | for i_repetition in range(n_repetitions):
19 | for i_fold in range(n_folds):
20 | pca_name = f'{i_repetition:02d}_{i_fold:02d}_pca.joblib'
21 | pca_name_ls.append(pca_name)
22 | 
23 | # Loop over pca model file names, load models and get variance explained
24 | pca_var_ls = []
25 | for i_model in pca_name_ls:
26 | print(i_model)
27 | pca_model = load(pca_path / i_model)
28 | var_explained = pca_model.explained_variance_ratio_.sum()
29 | pca_var_ls.append(var_explained)
30 | 
31 | # Create df for % variance explained per model iteration
32 | pca_var_df = pd.DataFrame({'variance_explained': pca_var_ls})
33 | 
34 | # Get mean and standard deviation for % variance explained across iterations
35 | var_mean = pca_var_df['variance_explained'].mean()
36 | var_std = pca_var_df['variance_explained'].std()
37 | print(var_mean, var_std)
38 | 
39 | # Save % variance explained per model
40 | file_name = 'pca_variance_explained.csv'
41 | file_path = PROJECT_ROOT / 'outputs' / 'pca'
42 | pca_var_df.to_csv(file_path / file_name)
43 | 
44 | 
45 | 
46 | if __name__ == '__main__':
47 | main()
--------------------------------------------------------------------------------
/src/preprocessing/compute_principal_components.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | """ Script to extract the PCA components from the participants' data. """
3 | import argparse
4 | from pathlib import Path
5 | 
6 | import nibabel as nib
7 | import numpy as np
8 | import pandas as pd
9 | from nilearn.masking import apply_mask
10 | from joblib import load
11 | from tqdm import tqdm
12 | 
13 | 
14 | PROJECT_ROOT = Path.cwd()
15 | 
16 | parser = argparse.ArgumentParser()
17 | 
18 | parser.add_argument('-P', '--input_path',
19 | dest='input_path_str',
20 | help='Path to the local folder with preprocessed images.')
21 | 
22 | parser.add_argument('-E', '--experiment_name',
23 | dest='experiment_name',
24 | help='Name of the experiment.')
25 | 
26 | parser.add_argument('-I', '--input_ids_file',
27 | dest='input_ids_file',
28 | default='homogenized_ids.csv',
29 | help='Filename indicating the ids to be used.')
30 | 
31 | parser.add_argument('-D', '--input_data_type',
32 | dest='input_data_type',
33 | default='.nii.gz',
34 | help='Input data type')
35 | 
36 | parser.add_argument('-M', '--mask_filename',
37 | dest='mask_filename',
38 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii',
39 | help='Brain mask filename')
40 | 
41 | parser.add_argument('-S', '--output_suffix',
42 | dest='output_suffix',
43 | default='',
44 | help='Suffix appended to the output filenames.')
45 | 
46 | args = parser.parse_args()
47 | 
48 | def load_all_subjects(subjects_path, mask_img):
49 | imgs = []
50 | subj_pbar = tqdm(subjects_path)
51 | for subject_path in subj_pbar:
52 | subj_pbar.set_description(f'Loading image {subject_path}')
53 | # Read in this subject's image
54 | try:
55 | img = nib.load(str(subject_path))
56 | except FileNotFoundError:
57 | print(f'No image file {subject_path}.')
58 | raise
59 | 
60 | # Extract only the brain voxels. This will create a 1D array.
61 | img = apply_mask(img, mask_img) 62 | img = np.asarray(img, dtype='float32') 63 | img = np.nan_to_num(img) 64 | imgs.append(img) 65 | return imgs 66 | 67 | def main(input_path_str, experiment_name, input_ids_file, input_data_type, mask_filename, output_suffix): 68 | dataset_path = Path(input_path_str) 69 | output_path = PROJECT_ROOT / 'outputs' / 'pca' 70 | models_output_path = output_path / 'models' 71 | 72 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 73 | ids_df = pd.read_csv(ids_path) 74 | 75 | # Get list of subjects included in the analysis 76 | subjects_path = [str(dataset_path / f'{subject_id}_Warped{input_data_type}') for subject_id in ids_df['image_id']] 77 | 78 | print(f'Total number of images: {len(ids_df)}') 79 | 80 | # ---------------------------------------------------------------------------------------- 81 | # Load the mask image 82 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 83 | mask_img = nib.load(str(brain_mask)) 84 | 85 | imgs = load_all_subjects(subjects_path, mask_img) 86 | 87 | n_components = 150 88 | n_repetitions = 10 89 | n_folds = 10 90 | for i_repetition in range(n_repetitions): 91 | for i_fold in range(n_folds): 92 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 93 | print(f'{prefix}') 94 | 95 | components = np.zeros((len(subjects_path), n_components)) 96 | model = load(models_output_path / f'{prefix}_pca.joblib') 97 | 98 | for i_img, img in enumerate(tqdm(imgs)): 99 | components[i_img, :] = model.transform(img[None, :]) 100 | 101 | pca_df = pd.DataFrame(data=components) 102 | pca_df['image_id'] = subjects_path 103 | pca_df.to_csv(output_path / f'{prefix}_pca_components{output_suffix}.csv', index=False) 104 | 105 | 106 | if __name__ == '__main__': 107 | main(args.input_path_str, args.experiment_name, 108 | args.input_ids_file, 109 | args.input_data_type, args.mask_filename, args.output_suffix) 110 | -------------------------------------------------------------------------------- /src/preprocessing/create_pca_models.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ Script to create the PCA models. 3 | 4 | Note: We calculate only 150 components due to resources limitations. 
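One PCA model is fitted per cross-validation fold (10 repetitions x 10 folds, stratified by age).
Because the full voxel matrix does not fit in memory, each model uses scikit-learn's IncrementalPCA:
the training images of the fold are streamed in blocks of 400 and passed to partial_fit, and the
fitted model is saved as '<repetition>_<fold>_pca.joblib' (e.g. '00_00_pca.joblib').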
5 | """ 6 | import argparse 7 | import warnings 8 | from pathlib import Path 9 | 10 | import nibabel as nib 11 | import numpy as np 12 | import pandas as pd 13 | from joblib import dump 14 | from nilearn.masking import apply_mask 15 | from sklearn.decomposition import IncrementalPCA 16 | from sklearn.model_selection import StratifiedKFold 17 | from tqdm import tqdm 18 | 19 | from utils import load_demographic_data 20 | 21 | PROJECT_ROOT = Path.cwd() 22 | 23 | parser = argparse.ArgumentParser() 24 | 25 | parser.add_argument('-P', '--input_path', 26 | dest='input_path_str', 27 | help='Path to the local folder with preprocessed images.') 28 | 29 | parser.add_argument('-E', '--experiment_name', 30 | dest='experiment_name', 31 | help='Name of the experiment.') 32 | 33 | parser.add_argument('-S', '--scanner_name', 34 | dest='scanner_name', 35 | help='Name of the scanner.') 36 | 37 | parser.add_argument('-I', '--input_ids_file', 38 | dest='input_ids_file', 39 | default='homogenized_ids.csv', 40 | help='Filename indicating the ids to be used.') 41 | 42 | parser.add_argument('-D', '--input_data_type', 43 | dest='input_data_type', 44 | default='.nii.gz', 45 | help='Input data type') 46 | 47 | parser.add_argument('-M', '--mask_filename', 48 | dest='mask_filename', 49 | default='mni_icbm152_t1_tal_nlin_sym_09c_mask.nii', 50 | help='Input data type') 51 | 52 | args = parser.parse_args() 53 | 54 | 55 | def main(input_path_str, experiment_name, input_ids_file, scanner_name, input_data_type, mask_filename): 56 | dataset_path = Path(input_path_str) 57 | 58 | output_path = PROJECT_ROOT / 'outputs' / 'pca' 59 | output_path.mkdir(exist_ok=True) 60 | 61 | models_output_path = output_path / 'models' 62 | models_output_path.mkdir(exist_ok=True) 63 | 64 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 65 | ids_df = pd.read_csv(ids_path) 66 | 67 | # Get list of subjects included in the analysis 68 | subjects_path = [str(dataset_path / f'{subject_id}_Warped{input_data_type}') for subject_id in 69 | ids_df['image_id'].str.rstrip('/')] 70 | 71 | print(f'Total number of images: {len(ids_df)}') 72 | 73 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 74 | 75 | dataset = load_demographic_data(participants_path, ids_path) 76 | 77 | age = dataset['Age'].values 78 | 79 | # ---------------------------------------------------------------------------------------- 80 | # Load the mask image 81 | brain_mask = PROJECT_ROOT / 'imaging_preprocessing_ANTs' / mask_filename 82 | mask_img = nib.load(str(brain_mask)) 83 | 84 | n_repetitions = 10 85 | n_folds = 10 86 | step_size = 400 87 | for i_repetition in range(n_repetitions): 88 | # Create 10-fold cross-validation scheme stratified by age 89 | skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=i_repetition) 90 | for i_fold, (train_index, test_index) in enumerate(skf.split(age, age)): 91 | print(f'Running repetition {i_repetition:02d}, fold {i_fold:02d}') 92 | print(train_index.shape) 93 | n_samples = len(subjects_path) 94 | pca = IncrementalPCA(n_components=150, copy=False) 95 | 96 | for i in tqdm(range(int(np.ceil(n_samples / np.float(step_size))))): 97 | # Generate indices and then paths for this block 98 | start_ind = i * step_size 99 | stop_ind = min(start_ind + step_size, n_samples) 100 | block_paths = subjects_path[start_ind:stop_ind] 101 | 102 | # Read in the images in this block 103 | images = [] 104 | for path in tqdm(block_paths): 105 | try: 106 | img = nib.load(str(path)) 107 | except 
FileNotFoundError: 108 | print(f'No image file {path}.') 109 | raise 110 | 111 | # Extract only the brain voxels. This will create a 1D array. 112 | img = apply_mask(img, mask_img) 113 | img = np.asarray(img, dtype='float32') 114 | img = np.nan_to_num(img) 115 | images.append(img) 116 | del img 117 | images = np.array(images, dtype='float32') 118 | 119 | selected_index = train_index[(train_index >= start_ind) & (train_index < stop_ind)] - start_ind 120 | images_selected = images[selected_index] 121 | try: 122 | pca.partial_fit(images_selected) 123 | except ValueError: 124 | warnings.warn('n_components higher than number of subjects.') 125 | 126 | prefix = f'{i_repetition:02d}_{i_fold:02d}' 127 | dump(pca, models_output_path / f'{prefix}_pca.joblib') 128 | 129 | 130 | if __name__ == '__main__': 131 | main(args.input_path_str, args.experiment_name, 132 | args.input_ids_file, args.scanner_name, 133 | args.input_data_type, args.mask_filename) 134 | -------------------------------------------------------------------------------- /src/preprocessing/homogenize_gender.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Homogenize dataset. 3 | 4 | We homogenize the dataset scanner_1 to not have a significant difference 5 | between the proportion of men and women along the age. We used the 6 | chi square test for homogeneity to verify if there is a difference. 7 | """ 8 | import argparse 9 | import itertools 10 | from pathlib import Path 11 | 12 | import numpy as np 13 | import pandas as pd 14 | import scipy.stats as stats 15 | 16 | from utils import load_demographic_data 17 | 18 | PROJECT_ROOT = Path.cwd() 19 | 20 | parser = argparse.ArgumentParser() 21 | 22 | parser.add_argument('-E', '--experiment_name', 23 | dest='experiment_name', 24 | help='Name of the experiment.') 25 | 26 | parser.add_argument('-S', '--scanner_name', 27 | dest='scanner_name', 28 | help='Name of the scanner.') 29 | 30 | parser.add_argument('-I', '--input_ids_file', 31 | dest='input_ids_file', 32 | default='cleaned_ids.csv', 33 | help='Filename indicating the ids to be used.') 34 | 35 | args = parser.parse_args() 36 | 37 | 38 | def check_balance_across_groups(crosstab_df): 39 | """Verify if which age pair have gender imbalance.""" 40 | combinations = list(itertools.combinations(crosstab_df.columns, 2)) 41 | significance_level = 0.05 / len(combinations) 42 | 43 | for group1, group2 in combinations: 44 | contingency_table = crosstab_df[[group1, group2]] 45 | _, p_value, _, _ = stats.chi2_contingency(contingency_table, correction=False) 46 | 47 | if p_value < significance_level: 48 | return False, [group1, group2] 49 | 50 | return True, [None] 51 | 52 | 53 | def get_problematic_group(crosstab_df): 54 | """Perform contingency analysis of the subjects gender.""" 55 | balance_flag, problematic_groups = check_balance_across_groups(crosstab_df) 56 | 57 | if balance_flag: 58 | return None 59 | 60 | conditions_proportions = crosstab_df.apply(lambda r: r / r.sum(), axis=0) 61 | median_proportion = np.median(conditions_proportions.values[0, :]) 62 | problematic_proportion = conditions_proportions[problematic_groups].values[0, :] 63 | 64 | problematic_group = problematic_groups[np.argmax(np.abs(problematic_proportion - median_proportion))] 65 | 66 | return problematic_group 67 | 68 | 69 | def get_balanced_dataset(dataset_df): 70 | """Script to perform gender balancing across the subjects' age range.""" 71 | 72 | while True: 73 | crosstab_df = 
pd.crosstab(dataset_df['Gender'], dataset_df['Age']) 74 | 75 | problematic_group = get_problematic_group(crosstab_df) 76 | 77 | if problematic_group is None: 78 | break 79 | 80 | condition_imbalanced = crosstab_df[problematic_group].idxmax() 81 | 82 | problematic_group_mask = (dataset_df['Age'] == problematic_group) & \ 83 | (dataset_df['Gender'] == condition_imbalanced) 84 | 85 | list_to_drop = list(dataset_df[problematic_group_mask].sample(1).index) 86 | print('Dropping {:}'.format(dataset_df['image_id'].iloc[list_to_drop].values[0])) 87 | dataset_df = dataset_df.drop(list_to_drop, axis=0) 88 | 89 | return dataset_df 90 | 91 | 92 | def main(experiment_name, scanner_name, input_ids_file): 93 | """Perform the exploratory data analysis.""" 94 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 95 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 96 | 97 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 98 | 99 | # Define random seed for sampling methods 100 | np.random.seed(42) 101 | 102 | dataset_df = load_demographic_data(participants_path, ids_path) 103 | 104 | dataset_balanced = get_balanced_dataset(dataset_df) 105 | 106 | homogeneous_ids_df = dataset_balanced[['image_id']] 107 | homogeneous_ids_df.to_csv(experiment_dir / 'homogenized_ids.csv', index=False) 108 | 109 | 110 | if __name__ == '__main__': 111 | main(args.experiment_name, args.scanner_name, 112 | args.input_ids_file) 113 | -------------------------------------------------------------------------------- /src/preprocessing/quality_control.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Perform quality control. 3 | 4 | This script removes participants that did not pass the quality 5 | control performed using MRIQC [1] for raw MRI data and Qoala [2] for 6 | FreeSurfer-preprocessed data. These analyses were performed separately and the 7 | results are applied to the Biobank data in this script. 8 | 9 | In Qoala, higher numbers indicate a higher chance of being a high quality scan 10 | (Source: https://qoala-t.shinyapps.io/qoala-t_app/). 11 | 12 | In MRIQC, higher values indicates a higher probability of being from MRIQC's class 1 ('exclude') 13 | (Source: https://github.com/poldracklab/mriqc/blob/98610ad7596b586966413b01d10f4eb68366a038/mriqc/classifier/helper.py) 14 | 15 | References 16 | ---------- 17 | [1] - Esteban, Oscar, et al. "MRIQC: Advancing the automatic prediction 18 | of image quality in MRI from unseen sites." PloS one 12.9 (2017): e0184661. 19 | 20 | [2] - Klapwijk, Eduard T., et al. "Qoala-T: A supervised-learning tool for 21 | quality control of FreeSurfer segmented MRI data." NeuroImage 189 (2019): 116-129. 
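Example invocation (the experiment and scanner names are illustrative; both thresholds default to 0.5):

    python src/preprocessing/quality_control.py -E biobank_scanner1 -S SCANNER01 -M 0.5 -Q 0.5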
22 | """ 23 | import argparse 24 | from pathlib import Path 25 | 26 | import pandas as pd 27 | 28 | PROJECT_ROOT = Path.cwd() 29 | 30 | parser = argparse.ArgumentParser() 31 | 32 | parser.add_argument('-E', '--experiment_name', 33 | dest='experiment_name', 34 | help='Name of the experiment.') 35 | 36 | parser.add_argument('-S', '--scanner_name', 37 | dest='scanner_name', 38 | help='Name of the scanner.') 39 | 40 | parser.add_argument('-I', '--input_ids_file', 41 | dest='input_ids_file', 42 | default='cleaned_ids_noqc.csv', 43 | help='Filename indicating the ids to be used.') 44 | 45 | parser.add_argument('-M', '--mriqc_threshold', 46 | dest='mriqc_threshold', 47 | nargs='?', 48 | type=float, default=0.5, 49 | help='Threshold value for MRIQC.') 50 | 51 | parser.add_argument('-Q', '--qoala_threshold', 52 | dest='qoala_threshold', 53 | nargs='?', 54 | type=float, default=0.5, 55 | help='Threshold value for Qoala.') 56 | 57 | args = parser.parse_args() 58 | 59 | 60 | def main(experiment_name, scanner_name, input_ids_file, mriqc_threshold, qoala_threshold): 61 | """Remove UK Biobank participants that did not pass quality checks.""" 62 | ids_path = PROJECT_ROOT / 'outputs' / experiment_name / input_ids_file 63 | mriqc_prob_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'mriqc_prob.csv' 64 | qoala_prob_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'qoala_prob.csv' 65 | 66 | qc_output_filename = 'cleaned_ids.csv' 67 | 68 | # ---------------------------------------------------------------------------------------- 69 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 70 | 71 | ids_df = pd.read_csv(ids_path) 72 | prob_mriqc_df = pd.read_csv(mriqc_prob_path) 73 | prob_qoala_df = pd.read_csv(qoala_prob_path) 74 | 75 | prob_mriqc_df = prob_mriqc_df.rename(columns={'prob_y': 'mriqc_prob'}) 76 | prob_mriqc_df = prob_mriqc_df[['image_id', 'mriqc_prob']] 77 | 78 | prob_qoala_df = prob_qoala_df.rename(columns={'prob_qoala': 'qoala_prob'}) 79 | prob_qoala_df = prob_qoala_df[['image_id', 'qoala_prob']] 80 | 81 | qc_df = pd.merge(prob_mriqc_df, prob_qoala_df, on='image_id') 82 | 83 | selected_subjects = qc_df[(qc_df['mriqc_prob'] < mriqc_threshold) | (qc_df['qoala_prob'] < qoala_threshold)] 84 | 85 | ids_qc_df = pd.merge(ids_df, selected_subjects[['image_id']], on='image_id') 86 | 87 | ids_qc_df.to_csv(experiment_dir / qc_output_filename, index=False) 88 | 89 | 90 | if __name__ == '__main__': 91 | main(args.experiment_name, args.scanner_name, 92 | args.input_ids_file, 93 | args.mriqc_threshold, args.qoala_threshold) 94 | -------------------------------------------------------------------------------- /src/sample_size/README.md: -------------------------------------------------------------------------------- 1 | # Performance of models by the size of training set 2 | This folder includes the scripts to perform the analysis of the 3 | impact of the sample size of the training set for brain age prediction. 4 | 5 | The scripts use bootstrapping to assess the robustness of performance and 6 | determine the minimum training set size required for model performance 7 | above chance level. Performance is measured in terms of the model's 8 | mean absolute error (MAE). 
-------------------------------------------------------------------------------- /src/sample_size/sample_size_create_figures.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Plot results of bootstrap analysis 3 | 4 | Ref: 5 | https://machinelearningmastery.com/calculate-bootstrap-confidence-intervals-machine-learning-results-python/ 6 | """ 7 | import argparse 8 | from pathlib import Path 9 | 10 | import matplotlib.pyplot as plt 11 | import numpy as np 12 | 13 | PROJECT_ROOT = Path.cwd() 14 | 15 | parser = argparse.ArgumentParser() 16 | 17 | parser.add_argument('-E', '--experiment_name', 18 | dest='experiment_name', 19 | help='Name of the experiment.') 20 | 21 | parser.add_argument('-M', '--model_name', 22 | dest='model_name', 23 | help='Name of the model.') 24 | 25 | parser.add_argument('-N', '--n_bootstrap', 26 | dest='n_bootstrap', 27 | type=int, default=1000, 28 | help='Number of bootstrap iterations.') 29 | 30 | parser.add_argument('-F', '--n_min_pair', 31 | dest='n_min_pair', 32 | type=int, default=1, 33 | help='Number minimum of pairs.') 34 | 35 | parser.add_argument('-R', '--n_max_pair', 36 | dest='n_max_pair', 37 | type=int, default=20, 38 | help='Number maximum of pairs.') 39 | 40 | args = parser.parse_args() 41 | 42 | 43 | def main(experiment_name, model_name, n_bootstrap, n_min_pair, n_max_pair): 44 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 45 | 46 | i_n_subject_pairs_list = range(n_min_pair, n_max_pair + 1) 47 | 48 | scores_i_n_subject_pairs = [] 49 | train_scores_i_n_subject_pairs = [] 50 | general_scores_i_n_subject_pairs = [] 51 | 52 | for i_n_subject_pairs in i_n_subject_pairs_list: 53 | ids_with_n_subject_pairs_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' 54 | scores_dir = ids_with_n_subject_pairs_dir / 'scores' 55 | scores_bootstrap = [] 56 | train_scores_bootstrap = [] 57 | general_scores_bootstrap = [] 58 | for i_bootstrap in range(n_bootstrap): 59 | filepath_scores = scores_dir / f'scores_{i_bootstrap:04d}_{model_name}.npy' 60 | scores_bootstrap.append(np.load(str(filepath_scores))[1]) 61 | 62 | train_filepath_scores = scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_train.npy' 63 | train_scores_bootstrap.append(np.load(str(train_filepath_scores))[1]) 64 | 65 | general_filepath_scores = scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_general.npy' 66 | general_scores_bootstrap.append(np.load(str(general_filepath_scores))[1]) 67 | 68 | scores_i_n_subject_pairs.append(scores_bootstrap) 69 | train_scores_i_n_subject_pairs.append(train_scores_bootstrap) 70 | general_scores_i_n_subject_pairs.append(general_scores_bootstrap) 71 | 72 | age_min = 47 73 | age_max = 73 74 | std_uniform_dist = np.sqrt(((age_max - age_min) ** 2) / 12) 75 | 76 | plt.figure(figsize=(10, 5)) 77 | 78 | # Draw lines 79 | plt.plot(i_n_subject_pairs_list, 80 | np.median(scores_i_n_subject_pairs, axis=1), 81 | linewidth=1.0, 82 | color='r', label=model_name + ' test performance') 83 | 84 | plt.plot(i_n_subject_pairs_list, 85 | np.median(train_scores_i_n_subject_pairs, axis=1), 86 | linewidth=1.0, 87 | color='g', label=model_name + ' train performance') 88 | 89 | plt.plot(i_n_subject_pairs_list, 90 | np.median(general_scores_i_n_subject_pairs, axis=1), 91 | linewidth=1.0, 92 | color='b', label=model_name + ' generalisation performance') 93 | 94 | plt.plot(range(1, 21), 95 | std_uniform_dist * np.ones_like(range(1, 21)), '--', 96 | linewidth=1.0, 97 | color='#111111', label='Chance line') 
98 | 99 | # Draw bands 100 | plt.fill_between(i_n_subject_pairs_list, 101 | np.percentile(scores_i_n_subject_pairs, 2.5, axis=1), 102 | np.percentile(scores_i_n_subject_pairs, 97.5, axis=1), 103 | color='r', alpha=0.1) 104 | 105 | plt.fill_between(i_n_subject_pairs_list, 106 | np.percentile(train_scores_i_n_subject_pairs, 2.5, axis=1), 107 | np.percentile(train_scores_i_n_subject_pairs, 97.5, axis=1), 108 | color='g', alpha=0.1) 109 | 110 | plt.fill_between(i_n_subject_pairs_list, 111 | np.percentile(general_scores_i_n_subject_pairs, 2.5, axis=1), 112 | np.percentile(general_scores_i_n_subject_pairs, 97.5, axis=1), 113 | color='b', alpha=0.1) 114 | 115 | # Create plot 116 | plt.xlabel('Number of subjects') 117 | plt.xticks(range(1, 21), np.multiply(range(1, 21), 2 * ((73 - 47) + 1))) 118 | plt.xlim(0.04999999999999993, 20.95) 119 | plt.ylabel('Mean Absolute Error') 120 | plt.legend(loc='best') 121 | plt.tight_layout() 122 | plt.savefig(experiment_dir / 'sample_size' / f'sample_size_{model_name}.eps', format='eps') 123 | 124 | 125 | if __name__ == '__main__': 126 | main(args.experiment_name, args.model_name, 127 | args.n_bootstrap, args.n_min_pair, args.n_max_pair) 128 | -------------------------------------------------------------------------------- /src/sample_size/sample_size_create_ids.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to create files with subjects' ids to perform sample size analysis 3 | 4 | This script creates sex-homogeneous bootstrapped datasets. 5 | Creates 20 bootstrap samples of increasing size 6 | """ 7 | import argparse 8 | from pathlib import Path 9 | 10 | import numpy as np 11 | import pandas as pd 12 | 13 | from utils import load_demographic_data 14 | 15 | PROJECT_ROOT = Path.cwd() 16 | 17 | parser = argparse.ArgumentParser() 18 | 19 | parser.add_argument('-E', '--experiment_name', 20 | dest='experiment_name', 21 | help='Name of the experiment.') 22 | 23 | parser.add_argument('-S', '--scanner_name', 24 | dest='scanner_name', 25 | help='Name of the scanner.') 26 | 27 | parser.add_argument('-I', '--input_ids_file', 28 | dest='input_ids_file', 29 | default='homogenized_ids.csv', 30 | help='File name indicating the ids to be used.') 31 | 32 | parser.add_argument('-N', '--n_bootstrap', 33 | dest='n_bootstrap', 34 | type=int, default=1000, 35 | help='Number of bootstrap iterations.') 36 | 37 | parser.add_argument('-R', '--n_max_pair', 38 | dest='n_max_pair', 39 | type=int, default=20, 40 | help='Maximum number of pairs.') 41 | 42 | args = parser.parse_args() 43 | 44 | 45 | def main(experiment_name, scanner_name, input_ids_file, n_bootstrap, n_max_pair): 46 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 47 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 48 | ids_path = experiment_dir / input_ids_file 49 | 50 | sample_size_dir = experiment_dir / 'sample_size' 51 | sample_size_dir.mkdir(exist_ok=True) 52 | 53 | # ---------------------------------------------------------------------------------------- 54 | # Set random seed for random sampling of subjects 55 | np.random.seed(42) 56 | 57 | dataset = load_demographic_data(participants_path, ids_path) 58 | 59 | # Find range of ages in homogeneous dataset 60 | age_min = int(dataset['Age'].min()) # 47 61 | age_max = int(dataset['Age'].max()) # 73 62 | 63 | # Loop to create 20 bootstrap samples that each contain up to 20 gender-balanced subject pairs per age group/year 64 | # Create a 
out-of-bag set (~test set) 65 | for i_n_subject_pairs in range(1, n_max_pair + 1): 66 | print(i_n_subject_pairs) 67 | ids_with_n_subject_pairs_dir = sample_size_dir / f'{i_n_subject_pairs:02d}' 68 | ids_with_n_subject_pairs_dir.mkdir(exist_ok=True) 69 | ids_dir = ids_with_n_subject_pairs_dir / 'ids' 70 | ids_dir.mkdir(exist_ok=True) 71 | 72 | # Loop to create 1000 random subject samples of the same size (with replacement) per bootstrap sample 73 | for i_bootstrap in range(n_bootstrap): 74 | # Create empty df to add bootstrap subjects to 75 | dataset_bootstrap_train = pd.DataFrame(columns=['image_id']) 76 | dataset_bootstrap_test = pd.DataFrame(columns=['image_id']) 77 | 78 | # Loop over ages (27 in total) 79 | for age in range(age_min, (age_max + 1)): 80 | 81 | # Get dataset for specific age only 82 | age_group = dataset.groupby('Age').get_group(age) 83 | 84 | # Loop over genders (0: female, 1:male) 85 | for gender in range(2): 86 | gender_group = age_group.groupby('Gender').get_group(gender) 87 | 88 | # Extract random subject of that gender and add to dataset_bootstrap_train 89 | random_sample_train = gender_group.sample(n=i_n_subject_pairs, replace=True) 90 | dataset_bootstrap_train = pd.concat([dataset_bootstrap_train, random_sample_train[['image_id']]]) 91 | 92 | # Sample test set with always the same size 93 | not_sampled = ~gender_group['image_id'].isin(random_sample_train['image_id']) 94 | random_sample_test = gender_group[not_sampled].sample(n=20, replace=False) 95 | dataset_bootstrap_test = pd.concat([dataset_bootstrap_test, random_sample_test[['image_id']]]) 96 | 97 | # Export dataset_bootstrap_train as csv 98 | output_prefix = f'{i_bootstrap:04d}_{i_n_subject_pairs:02d}' 99 | dataset_bootstrap_train.to_csv(ids_dir / f'{output_prefix}_train.csv', index=False) 100 | dataset_bootstrap_test.to_csv(ids_dir / f'{output_prefix}_test.csv', index=False) 101 | 102 | 103 | if __name__ == '__main__': 104 | main(args.experiment_name, args.scanner_name, 105 | args.input_ids_file, 106 | args.n_bootstrap, args.n_max_pair) 107 | -------------------------------------------------------------------------------- /src/sample_size/sample_size_fs_data_gp_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to perform the sample size analysis using Gaussian Processes. 
""" 3 | import argparse 4 | import random 5 | from math import sqrt 6 | from pathlib import Path 7 | 8 | import numpy as np 9 | from scipy import stats 10 | from sklearn.gaussian_process import GaussianProcessRegressor 11 | from sklearn.gaussian_process.kernels import DotProduct 12 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 13 | from sklearn.preprocessing import RobustScaler 14 | 15 | from utils import COLUMNS_NAME, load_freesurfer_dataset 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | parser = argparse.ArgumentParser() 20 | 21 | parser.add_argument('-E', '--experiment_name', 22 | dest='experiment_name', 23 | help='Name of the experiment.') 24 | 25 | parser.add_argument('-S', '--scanner_name', 26 | dest='scanner_name', 27 | help='Name of the scanner.') 28 | 29 | parser.add_argument('-N', '--n_bootstrap', 30 | dest='n_bootstrap', 31 | type=int, default=1000, 32 | help='Number of bootstrap iterations.') 33 | 34 | parser.add_argument('-R', '--n_max_pair', 35 | dest='n_max_pair', 36 | type=int, default=20, 37 | help='Number maximum of pairs.') 38 | 39 | parser.add_argument('-G', '--general_experiment_name', 40 | dest='general_experiment_name', 41 | help='Name of the experiment.') 42 | 43 | parser.add_argument('-C', '--general_scanner_name', 44 | dest='general_scanner_name', 45 | help='Name of the scanner for generalization.') 46 | 47 | parser.add_argument('-I', '--general_input_ids_file', 48 | dest='general_input_ids_file', 49 | default='cleaned_ids.csv', 50 | help='Filename indicating the ids to be used.') 51 | 52 | args = parser.parse_args() 53 | 54 | 55 | def main(experiment_name, scanner_name, n_bootstrap, n_max_pair, 56 | general_experiment_name, general_scanner_name, general_input_ids_file): 57 | model_name = 'GPR' 58 | 59 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 60 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 61 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 62 | 63 | general_participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'participants.tsv' 64 | general_freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'freesurferData.csv' 65 | 66 | general_ids_path = PROJECT_ROOT / 'outputs' / general_experiment_name / general_input_ids_file 67 | general_dataset = load_freesurfer_dataset(general_participants_path, general_ids_path, general_freesurfer_path) 68 | 69 | # Normalise regional volumes by total intracranial volume (tiv) 70 | general_regions = general_dataset[COLUMNS_NAME].values 71 | 72 | general_tiv = general_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 73 | 74 | x_general = np.true_divide(general_regions, general_tiv) 75 | y_general = general_dataset['Age'].values 76 | 77 | # ---------------------------------------------------------------------------------------- 78 | 79 | # Loop over the 20 bootstrap samples with up to 20 gender-balanced subject pairs per age group/year 80 | for i_n_subject_pairs in range(1, n_max_pair + 1): 81 | print(f'Bootstrap number of subject pairs: {i_n_subject_pairs}') 82 | ids_with_n_subject_pairs_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'ids' 83 | 84 | scores_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'scores' 85 | scores_dir.mkdir(exist_ok=True) 86 | 87 | # Loop over the 1000 random subject samples per bootstrap 88 | for i_bootstrap in range(n_bootstrap): 89 | print(f'Sample number within 
bootstrap: {i_bootstrap}') 90 | 91 | prefix = f'{i_bootstrap:04d}_{i_n_subject_pairs:02d}' 92 | train_dataset = load_freesurfer_dataset(participants_path, 93 | ids_with_n_subject_pairs_dir / f'{prefix}_train.csv', 94 | freesurfer_path) 95 | test_dataset = load_freesurfer_dataset(participants_path, 96 | ids_with_n_subject_pairs_dir / f'{prefix}_test.csv', 97 | freesurfer_path) 98 | 99 | # Initialise random seed 100 | np.random.seed(42) 101 | random.seed(42) 102 | 103 | # Normalise regional volumes by total intracranial volume (tiv) 104 | regions = train_dataset[COLUMNS_NAME].values 105 | 106 | tiv = train_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 107 | 108 | x_train = np.true_divide(regions, tiv) 109 | y_train = train_dataset['Age'].values 110 | 111 | test_tiv = test_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 112 | test_regions = test_dataset[COLUMNS_NAME].values 113 | 114 | x_test = np.true_divide(test_regions, test_tiv) 115 | y_test = test_dataset['Age'].values 116 | 117 | # Robust scaling (centre on the median, scale by the interquartile range) 118 | scaler = RobustScaler() 119 | x_train = scaler.fit_transform(x_train) 120 | x_test = scaler.transform(x_test) 121 | 122 | gpr = GaussianProcessRegressor(kernel=DotProduct(), random_state=0) 123 | 124 | gpr.fit(x_train, y_train) 125 | 126 | # Test data 127 | predictions = gpr.predict(x_test) 128 | mae = mean_absolute_error(y_test, predictions) 129 | rmse = sqrt(mean_squared_error(y_test, predictions)) 130 | r2 = r2_score(y_test, predictions) 131 | age_error_corr, _ = stats.spearmanr(np.abs(y_test - predictions), y_test) 132 | 133 | scores = np.array([r2, mae, rmse, age_error_corr]) 134 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}.npy'), scores) 135 | 136 | print(f'R2: {r2:0.3f} MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 137 | 138 | # Train data 139 | train_predictions = gpr.predict(x_train) 140 | train_mae = mean_absolute_error(y_train, train_predictions) 141 | train_rmse = sqrt(mean_squared_error(y_train, train_predictions)) 142 | train_r2 = r2_score(y_train, train_predictions) 143 | train_age_error_corr, _ = stats.spearmanr(np.abs(y_train - train_predictions), y_train) 144 | 145 | train_scores = np.array([train_r2, train_mae, train_rmse, train_age_error_corr]) 146 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_train.npy'), train_scores) 147 | 148 | # Generalisation data 149 | x_general_norm = scaler.transform(x_general) 150 | general_predictions = gpr.predict(x_general_norm) 151 | general_mae = mean_absolute_error(y_general, general_predictions) 152 | general_rmse = sqrt(mean_squared_error(y_general, general_predictions)) 153 | general_r2 = r2_score(y_general, general_predictions) 154 | general_age_error_corr, _ = stats.spearmanr(np.abs(y_general - general_predictions), y_general) 155 | 156 | general_scores = np.array([general_r2, general_mae, general_rmse, general_age_error_corr]) 157 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_general.npy'), general_scores) 158 | 159 | 160 | if __name__ == '__main__': 161 | main(args.experiment_name, args.scanner_name, 162 | args.n_bootstrap, args.n_max_pair, 163 | args.general_experiment_name, args.general_scanner_name, args.general_input_ids_file) 164 | -------------------------------------------------------------------------------- /src/sample_size/sample_size_fs_data_rvm_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to perform the sample size 
analysis using Relevance Vector Machine. """ 3 | import argparse 4 | import random 5 | import warnings 6 | from math import sqrt 7 | from pathlib import Path 8 | 9 | import numpy as np 10 | from scipy import stats 11 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 12 | from sklearn.preprocessing import RobustScaler 13 | from sklearn_rvm import EMRVR 14 | 15 | from utils import COLUMNS_NAME, load_freesurfer_dataset 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | warnings.filterwarnings('ignore') 20 | 21 | parser = argparse.ArgumentParser() 22 | 23 | parser.add_argument('-E', '--experiment_name', 24 | dest='experiment_name', 25 | help='Name of the experiment.') 26 | 27 | parser.add_argument('-S', '--scanner_name', 28 | dest='scanner_name', 29 | help='Name of the scanner.') 30 | 31 | parser.add_argument('-N', '--n_bootstrap', 32 | dest='n_bootstrap', 33 | type=int, default=1000, 34 | help='Number of bootstrap iterations.') 35 | 36 | parser.add_argument('-R', '--n_max_pair', 37 | dest='n_max_pair', 38 | type=int, default=20, 39 | help='Maximum number of pairs.') 40 | 41 | parser.add_argument('-G', '--general_experiment_name', 42 | dest='general_experiment_name', 43 | help='Name of the generalization experiment.') 44 | 45 | parser.add_argument('-C', '--general_scanner_name', 46 | dest='general_scanner_name', 47 | help='Name of the scanner for generalization.') 48 | 49 | parser.add_argument('-I', '--general_input_ids_file', 50 | dest='general_input_ids_file', 51 | default='cleaned_ids.csv', 52 | help='Filename indicating the ids to be used.') 53 | 54 | args = parser.parse_args() 55 | 56 | 57 | def main(experiment_name, scanner_name, n_bootstrap, n_max_pair, 58 | general_experiment_name, general_scanner_name, general_input_ids_file): 59 | model_name = 'RVM' 60 | 61 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 62 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 63 | freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'freesurferData.csv' 64 | 65 | general_participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'participants.tsv' 66 | general_freesurfer_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'freesurferData.csv' 67 | 68 | general_ids_path = PROJECT_ROOT / 'outputs' / general_experiment_name / general_input_ids_file 69 | general_dataset = load_freesurfer_dataset(general_participants_path, general_ids_path, general_freesurfer_path) 70 | 71 | # Normalise regional volumes by total intracranial volume (tiv) 72 | general_regions = general_dataset[COLUMNS_NAME].values 73 | 74 | general_tiv = general_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 75 | 76 | x_general = np.true_divide(general_regions, general_tiv) 77 | y_general = general_dataset['Age'].values 78 | 79 | # ---------------------------------------------------------------------------------------- 80 | 81 | # Loop over the 20 bootstrap samples with up to 20 gender-balanced subject pairs per age group/year 82 | for i_n_subject_pairs in range(1, n_max_pair + 1): 83 | print(f'Bootstrap number of subject pairs: {i_n_subject_pairs}') 84 | ids_with_n_subject_pairs_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'ids' 85 | 86 | scores_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'scores' 87 | scores_dir.mkdir(exist_ok=True) 88 | 89 | # Loop over the 1000 random subject samples per bootstrap 90 | for i_bootstrap in range(n_bootstrap): 91 | print(f'Sample number 
within bootstrap: {i_bootstrap}') 92 | 93 | prefix = f'{i_bootstrap:04d}_{i_n_subject_pairs:02d}' 94 | train_dataset = load_freesurfer_dataset(participants_path, 95 | ids_with_n_subject_pairs_dir / f'{prefix}_train.csv', 96 | freesurfer_path) 97 | test_dataset = load_freesurfer_dataset(participants_path, 98 | ids_with_n_subject_pairs_dir / f'{prefix}_test.csv', 99 | freesurfer_path) 100 | 101 | # Initialise random seed 102 | np.random.seed(42) 103 | random.seed(42) 104 | 105 | # Normalise regional volumes by total intracranial volume (tiv) 106 | regions = train_dataset[COLUMNS_NAME].values 107 | 108 | tiv = train_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 109 | 110 | x_train = np.true_divide(regions, tiv) 111 | y_train = train_dataset['Age'].values 112 | 113 | test_tiv = test_dataset.EstimatedTotalIntraCranialVol.values[:, np.newaxis] 114 | test_regions = test_dataset[COLUMNS_NAME].values 115 | 116 | x_test = np.true_divide(test_regions, test_tiv) 117 | y_test = test_dataset['Age'].values 118 | 119 | # Robust scaling (centre on the median, scale by the interquartile range) 120 | scaler = RobustScaler() 121 | x_train = scaler.fit_transform(x_train) 122 | x_test = scaler.transform(x_test) 123 | 124 | # Fit the RVM (its hyperparameters are estimated automatically, so no grid search is needed) 125 | rvm = EMRVR(kernel='linear', threshold_alpha=1e9) 126 | rvm.fit(x_train, y_train) 127 | 128 | # Test data 129 | predictions = rvm.predict(x_test) 130 | mae = mean_absolute_error(y_test, predictions) 131 | rmse = sqrt(mean_squared_error(y_test, predictions)) 132 | r2 = r2_score(y_test, predictions) 133 | age_error_corr, _ = stats.spearmanr(np.abs(y_test - predictions), y_test) 134 | 135 | scores = np.array([r2, mae, rmse, age_error_corr]) 136 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}.npy'), scores) 137 | 138 | print(f'R2: {r2:0.3f} MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 139 | 140 | # Train data 141 | train_predictions = rvm.predict(x_train) 142 | train_mae = mean_absolute_error(y_train, train_predictions) 143 | train_rmse = sqrt(mean_squared_error(y_train, train_predictions)) 144 | train_r2 = r2_score(y_train, train_predictions) 145 | train_age_error_corr, _ = stats.spearmanr(np.abs(y_train - train_predictions), y_train) 146 | 147 | train_scores = np.array([train_r2, train_mae, train_rmse, train_age_error_corr]) 148 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_train.npy'), train_scores) 149 | 150 | # Generalisation data 151 | x_general_norm = scaler.transform(x_general) 152 | general_predictions = rvm.predict(x_general_norm) 153 | general_mae = mean_absolute_error(y_general, general_predictions) 154 | general_rmse = sqrt(mean_squared_error(y_general, general_predictions)) 155 | general_r2 = r2_score(y_general, general_predictions) 156 | general_age_error_corr, _ = stats.spearmanr(np.abs(y_general - general_predictions), y_general) 157 | 158 | general_scores = np.array([general_r2, general_mae, general_rmse, general_age_error_corr]) 159 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_general.npy'), general_scores) 160 | 161 | 162 | if __name__ == '__main__': 163 | main(args.experiment_name, args.scanner_name, 164 | args.n_bootstrap, args.n_max_pair, 165 | args.general_experiment_name, args.general_scanner_name, args.general_input_ids_file) 166 | -------------------------------------------------------------------------------- /src/sample_size/sample_size_voxel_data_rvm_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 
| """Script to perform the sample size analysis using RVM on bootstrap datasets of UK BIOBANK Scanner1.""" 3 | import argparse 4 | import random 5 | import warnings 6 | from math import sqrt 7 | from pathlib import Path 8 | 9 | import numpy as np 10 | import pandas as pd 11 | from scipy import stats 12 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 13 | from sklearn_rvm import EMRVR 14 | 15 | from utils import load_demographic_data 16 | 17 | PROJECT_ROOT = Path.cwd() 18 | 19 | warnings.filterwarnings('ignore') 20 | 21 | parser = argparse.ArgumentParser() 22 | 23 | parser.add_argument('-E', '--experiment_name', 24 | dest='experiment_name', 25 | help='Name of the experiment.') 26 | 27 | parser.add_argument('-S', '--scanner_name', 28 | dest='scanner_name', 29 | help='Name of the scanner.') 30 | 31 | parser.add_argument('-N', '--n_bootstrap', 32 | dest='n_bootstrap', 33 | type=int, default=1000, 34 | help='Number of bootstrap iterations.') 35 | 36 | parser.add_argument('-R', '--n_max_pair', 37 | dest='n_max_pair', 38 | type=int, default=20, 39 | help='Maximum number of pairs.') 40 | 41 | parser.add_argument('-G', '--general_experiment_name', 42 | dest='general_experiment_name', 43 | help='Name of the generalization experiment.') 44 | 45 | parser.add_argument('-C', '--general_scanner_name', 46 | dest='general_scanner_name', 47 | help='Name of the scanner for generalization.') 48 | 49 | parser.add_argument('-I', '--general_input_ids_file', 50 | dest='general_input_ids_file', 51 | default='cleaned_ids.csv', 52 | help='Filename indicating the ids to be used.') 53 | 54 | args = parser.parse_args() 55 | 56 | 57 | def main(experiment_name, scanner_name, n_bootstrap, n_max_pair, 58 | general_experiment_name, general_scanner_name, general_input_ids_file): 59 | model_name = 'voxel_RVM' 60 | 61 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 62 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 63 | 64 | # Load the Gram matrix 65 | kernel_path = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel.csv' 66 | kernel = pd.read_csv(kernel_path, header=0, index_col=0) 67 | 68 | general_participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'participants.tsv' 69 | general_ids_path = PROJECT_ROOT / 'outputs' / general_experiment_name / general_input_ids_file 70 | 71 | kernel_path_general = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel_general.csv' 72 | kernel_general = pd.read_csv(kernel_path_general, header=0, index_col=0) 73 | general_dataset = load_demographic_data(general_participants_path, general_ids_path) 74 | 75 | y_general = general_dataset['Age'].values 76 | 77 | # ---------------------------------------------------------------------------------------- 78 | # Loop over the 20 bootstrap samples with up to 20 gender-balanced subject pairs per age group/year 79 | for i_n_subject_pairs in range(1, n_max_pair + 1): 80 | print(f'Bootstrap number of subject pairs: {i_n_subject_pairs}') 81 | ids_with_n_subject_pairs_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'ids' 82 | 83 | scores_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'scores' 84 | scores_dir.mkdir(exist_ok=True) 85 | 86 | # Loop over the 1000 random subject samples per bootstrap 87 | for i_bootstrap in range(n_bootstrap): 88 | print(f'Sample number within bootstrap: {i_bootstrap}') 89 | 90 | prefix = f'{i_bootstrap:04d}_{i_n_subject_pairs:02d}' 91 | train_dataset = load_demographic_data(participants_path, 92 | 
ids_with_n_subject_pairs_dir / f'{prefix}_train.csv') 93 | test_dataset = load_demographic_data(participants_path, 94 | ids_with_n_subject_pairs_dir / f'{prefix}_test.csv') 95 | 96 | # Initialise random seed 97 | np.random.seed(42) 98 | random.seed(42) 99 | 100 | train_index = train_dataset['image_id'] 101 | test_index = test_dataset['image_id'] 102 | 103 | x_train = kernel.loc[train_index, train_index].values 104 | x_test = kernel.loc[test_index, train_index].values 105 | 106 | y_train = train_dataset['Age'].values 107 | y_test = test_dataset['Age'].values 108 | 109 | model = EMRVR(kernel='precomputed', threshold_alpha=1e9) 110 | model.fit(x_train, y_train) 111 | predictions = model.predict(x_test) 112 | 113 | mae = mean_absolute_error(y_test, predictions) 114 | rmse = sqrt(mean_squared_error(y_test, predictions)) 115 | r2 = r2_score(y_test, predictions) 116 | age_error_corr, _ = stats.spearmanr(np.abs(y_test - predictions), y_test) 117 | 118 | scores = np.array([r2, mae, rmse, age_error_corr]) 119 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}.npy'), scores) 120 | 121 | print(f'R2: {r2:0.3f} MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 122 | 123 | # Train data 124 | train_predictions = model.predict(x_train) 125 | train_mae = mean_absolute_error(y_train, train_predictions) 126 | train_rmse = sqrt(mean_squared_error(y_train, train_predictions)) 127 | train_r2 = r2_score(y_train, train_predictions) 128 | train_age_error_corr, _ = stats.spearmanr(np.abs(y_train - train_predictions), y_train) 129 | 130 | train_scores = np.array([train_r2, train_mae, train_rmse, train_age_error_corr]) 131 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_train.npy'), train_scores) 132 | 133 | # Generalisation data (kernel between the generalisation subjects and the training subjects) 134 | x_general = kernel_general.loc[train_index, :].T.values 135 | general_predictions = model.predict(x_general) 136 | general_mae = mean_absolute_error(y_general, general_predictions) 137 | general_rmse = sqrt(mean_squared_error(y_general, general_predictions)) 138 | general_r2 = r2_score(y_general, general_predictions) 139 | general_age_error_corr, _ = stats.spearmanr(np.abs(y_general - general_predictions), y_general) 140 | 141 | general_scores = np.array([general_r2, general_mae, general_rmse, general_age_error_corr]) 142 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_general.npy'), general_scores) 143 | 144 | 145 | if __name__ == '__main__': 146 | main(args.experiment_name, args.scanner_name, 147 | args.n_bootstrap, args.n_max_pair, 148 | args.general_experiment_name, args.general_scanner_name, args.general_input_ids_file) 149 | -------------------------------------------------------------------------------- /src/sample_size/sample_size_voxel_data_svm_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Script to perform the sample size analysis using SVM (linear SVR) on bootstrap datasets of UK BIOBANK Scanner1.""" 3 | import argparse 4 | import random 5 | import warnings 6 | from math import sqrt 7 | from pathlib import Path 8 | 9 | import numpy as np 10 | import pandas as pd 11 | from scipy import stats 12 | from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 13 | from sklearn.model_selection import GridSearchCV, KFold 14 | from sklearn.svm import SVR 15 | 16 | from utils import load_demographic_data 17 | 18 | PROJECT_ROOT = Path.cwd() 19 | 20 | warnings.filterwarnings('ignore') 21 | 22 | parser = argparse.ArgumentParser() 23 | 24 | 
parser.add_argument('-E', '--experiment_name', 25 | dest='experiment_name', 26 | help='Name of the experiment.') 27 | 28 | parser.add_argument('-S', '--scanner_name', 29 | dest='scanner_name', 30 | help='Name of the scanner.') 31 | 32 | parser.add_argument('-N', '--n_bootstrap', 33 | dest='n_bootstrap', 34 | type=int, default=1000, 35 | help='Number of bootstrap iterations.') 36 | 37 | parser.add_argument('-R', '--n_max_pair', 38 | dest='n_max_pair', 39 | type=int, default=20, 40 | help='Number maximum of pairs.') 41 | 42 | parser.add_argument('-G', '--general_experiment_name', 43 | dest='general_experiment_name', 44 | help='Name of the experiment.') 45 | 46 | parser.add_argument('-C', '--general_scanner_name', 47 | dest='general_scanner_name', 48 | help='Name of the scanner for generalization.') 49 | 50 | parser.add_argument('-I', '--general_input_ids_file', 51 | dest='general_input_ids_file', 52 | default='cleaned_ids.csv', 53 | help='Filename indicating the ids to be used.') 54 | 55 | args = parser.parse_args() 56 | 57 | 58 | def main(experiment_name, scanner_name, n_bootstrap, n_max_pair, 59 | general_experiment_name, general_scanner_name, general_input_ids_file): 60 | # ---------------------------------------------------------------------------------------- 61 | model_name = 'voxel_SVM' 62 | 63 | experiment_dir = PROJECT_ROOT / 'outputs' / experiment_name 64 | participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / scanner_name / 'participants.tsv' 65 | 66 | # Load the Gram matrix 67 | kernel_path = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel.csv' 68 | kernel = pd.read_csv(kernel_path, header=0, index_col=0) 69 | 70 | general_participants_path = PROJECT_ROOT / 'data' / 'BIOBANK' / general_scanner_name / 'participants.tsv' 71 | general_ids_path = PROJECT_ROOT / 'outputs' / general_experiment_name / general_input_ids_file 72 | 73 | kernel_path_general = PROJECT_ROOT / 'outputs' / 'kernels' / 'kernel_general.csv' 74 | kernel_general = pd.read_csv(kernel_path_general, header=0, index_col=0) 75 | general_dataset = load_demographic_data(general_participants_path, general_ids_path) 76 | 77 | y_general = general_dataset['Age'].values 78 | 79 | # ---------------------------------------------------------------------------------------- 80 | # Loop over the 20 bootstrap samples with up to 20 gender-balanced subject pairs per age group/year 81 | for i_n_subject_pairs in range(1, n_max_pair + 1): 82 | print(f'Bootstrap number of subject pairs: {i_n_subject_pairs}') 83 | ids_with_n_subject_pairs_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'ids' 84 | 85 | scores_dir = experiment_dir / 'sample_size' / f'{i_n_subject_pairs:02d}' / 'scores' 86 | scores_dir.mkdir(exist_ok=True) 87 | 88 | # Loop over the 1000 random subject samples per bootstrap 89 | for i_bootstrap in range(n_bootstrap): 90 | print(f'Sample number within bootstrap: {i_bootstrap}') 91 | 92 | prefix = f'{i_bootstrap:04d}_{i_n_subject_pairs:02d}' 93 | train_dataset = load_demographic_data(participants_path, 94 | ids_with_n_subject_pairs_dir / f'{prefix}_train.csv') 95 | test_dataset = load_demographic_data(participants_path, 96 | ids_with_n_subject_pairs_dir / f'{prefix}_test.csv') 97 | 98 | # Initialise random seed 99 | np.random.seed(42) 100 | random.seed(42) 101 | 102 | train_index = train_dataset['image_id'] 103 | test_index = test_dataset['image_id'] 104 | 105 | x_train = kernel.loc[train_index, train_index].values 106 | x_test = kernel.loc[test_index, train_index].values 107 | 108 | y_train = 
train_dataset['Age'].values 109 | y_test = test_dataset['Age'].values 110 | 111 | model = SVR(kernel='precomputed') 112 | 113 | # Systematic search for best hyperparameters 114 | search_space = {'C': [2 ** -7, 2 ** -5, 2 ** -3, 2 ** -1, 2 ** 0, 2 ** 1, 2 ** 3, 2 ** 5, 2 ** 7]} 115 | n_nested_folds = 5 116 | nested_kf = KFold(n_splits=n_nested_folds, shuffle=True, random_state=i_bootstrap) 117 | gridsearch = GridSearchCV(model, 118 | param_grid=search_space, 119 | scoring='neg_mean_absolute_error', 120 | refit=True, cv=nested_kf, 121 | verbose=0, n_jobs=1) 122 | 123 | gridsearch.fit(x_train, y_train) 124 | 125 | best_model = gridsearch.best_estimator_ 126 | 127 | # Test data 128 | predictions = best_model.predict(x_test) 129 | mae = mean_absolute_error(y_test, predictions) 130 | rmse = sqrt(mean_squared_error(y_test, predictions)) 131 | r2 = r2_score(y_test, predictions) 132 | age_error_corr, _ = stats.spearmanr(np.abs(y_test - predictions), y_test) 133 | 134 | scores = np.array([r2, mae, rmse, age_error_corr]) 135 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}.npy'), scores) 136 | 137 | print(f'R2: {r2:0.3f} MAE: {mae:0.3f} RMSE: {rmse:0.3f} CORR: {age_error_corr:0.3f}') 138 | 139 | # Train data 140 | train_predictions = best_model.predict(x_train) 141 | train_mae = mean_absolute_error(y_train, train_predictions) 142 | train_rmse = sqrt(mean_squared_error(y_train, train_predictions)) 143 | train_r2 = r2_score(y_train, train_predictions) 144 | train_age_error_corr, _ = stats.spearmanr(np.abs(y_train - train_predictions), y_train) 145 | 146 | train_scores = np.array([train_r2, train_mae, train_rmse, train_age_error_corr]) 147 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_train.npy'), train_scores) 148 | 149 | # Generalisation data (kernel between the generalisation subjects and the training subjects) 150 | x_general = kernel_general.loc[train_index, :].T.values 151 | general_predictions = best_model.predict(x_general) 152 | general_mae = mean_absolute_error(y_general, general_predictions) 153 | general_rmse = sqrt(mean_squared_error(y_general, general_predictions)) 154 | general_r2 = r2_score(y_general, general_predictions) 155 | general_age_error_corr, _ = stats.spearmanr(np.abs(y_general - general_predictions), y_general) 156 | 157 | general_scores = np.array([general_r2, general_mae, general_rmse, general_age_error_corr]) 158 | np.save(str(scores_dir / f'scores_{i_bootstrap:04d}_{model_name}_general.npy'), general_scores) 159 | 160 | 161 | if __name__ == '__main__': 162 | main(args.experiment_name, args.scanner_name, 163 | args.n_bootstrap, args.n_max_pair, 164 | args.general_experiment_name, args.general_scanner_name, args.general_input_ids_file) 165 | -------------------------------------------------------------------------------- /src/utils.py: -------------------------------------------------------------------------------- 1 | """Helper functions and constants.""" 2 | import pandas as pd 3 | import numpy as np 4 | from scipy import stats 5 | 6 | 7 | def load_freesurfer_dataset(participants_path, ids_path, freesurfer_path): 8 | """Load dataset.""" 9 | demographic_data = load_demographic_data(participants_path, ids_path) 10 | 11 | freesurfer_df = pd.read_csv(freesurfer_path) 12 | 13 | dataset_df = pd.merge(freesurfer_df, demographic_data, on='image_id') 14 | 15 | return dataset_df 16 | 17 | 18 | def load_demographic_data(participants_path, ids_path): 19 | """Load dataset using selected ids.""" 20 | participants_df = pd.read_csv(participants_path, sep='\t') 21 | participants_df = participants_df.dropna() 22 | 23 | 
ids_df = pd.read_csv(ids_path, usecols=['image_id']) 24 | 25 | ids_df['participant_id'] = ids_df['image_id'].str.split('_').str[0] 26 | 27 | dataset_df = pd.merge(ids_df, participants_df, on='participant_id') 28 | 29 | return dataset_df 30 | 31 | 32 | def ttest_ind_corrected(performance_a, performance_b, k=10, r=10): 33 | """Corrected repeated k-fold cv test. 34 | The test assumes that the classifiers were evaluated using cross validation. 35 | 36 | Ref: 37 | Bouckaert, Remco R., and Eibe Frank. "Evaluating the replicability of significance tests for comparing learning 38 | algorithms." Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg, 2004 39 | 40 | Args: 41 | performance_a: performances from classifier A 42 | performance_b: performances from classifier B 43 | k: number of folds 44 | r: number of repetitions 45 | 46 | Returns: 47 | t: t-statistic of the corrected test. 48 | prob: p-value of the corrected test. 49 | """ 50 | df = k * r - 1 51 | 52 | x = performance_a - performance_b 53 | m = np.mean(x) 54 | 55 | sigma_2 = np.var(x, ddof=1) 56 | denom = np.sqrt((1 / (k * r) + 1 / (k - 1)) * sigma_2)  # corrected variance term: 1/(k*r) + n_test/n_train, with n_test/n_train = 1/(k-1) for k-fold CV 57 | 58 | with np.errstate(divide='ignore', invalid='ignore'): 59 | t = np.divide(m, denom) 60 | 61 | prob = stats.t.sf(np.abs(t), df) * 2 62 | 63 | return t, prob 64 | 65 | 66 | COLUMNS_NAME = ['Left-Lateral-Ventricle', 67 | 'Left-Inf-Lat-Vent', 68 | 'Left-Cerebellum-White-Matter', 69 | 'Left-Cerebellum-Cortex', 70 | 'Left-Thalamus-Proper', 71 | 'Left-Caudate', 72 | 'Left-Putamen', 73 | 'Left-Pallidum', 74 | '3rd-Ventricle', 75 | '4th-Ventricle', 76 | 'Brain-Stem', 77 | 'Left-Hippocampus', 78 | 'Left-Amygdala', 79 | 'CSF', 80 | 'Left-Accumbens-area', 81 | 'Left-VentralDC', 82 | 'Right-Lateral-Ventricle', 83 | 'Right-Inf-Lat-Vent', 84 | 'Right-Cerebellum-White-Matter', 85 | 'Right-Cerebellum-Cortex', 86 | 'Right-Thalamus-Proper', 87 | 'Right-Caudate', 88 | 'Right-Putamen', 89 | 'Right-Pallidum', 90 | 'Right-Hippocampus', 91 | 'Right-Amygdala', 92 | 'Right-Accumbens-area', 93 | 'Right-VentralDC', 94 | 'CC_Posterior', 95 | 'CC_Mid_Posterior', 96 | 'CC_Central', 97 | 'CC_Mid_Anterior', 98 | 'CC_Anterior', 99 | 'lh_bankssts_volume', 100 | 'lh_caudalanteriorcingulate_volume', 101 | 'lh_caudalmiddlefrontal_volume', 102 | 'lh_cuneus_volume', 103 | 'lh_entorhinal_volume', 104 | 'lh_fusiform_volume', 105 | 'lh_inferiorparietal_volume', 106 | 'lh_inferiortemporal_volume', 107 | 'lh_isthmuscingulate_volume', 108 | 'lh_lateraloccipital_volume', 109 | 'lh_lateralorbitofrontal_volume', 110 | 'lh_lingual_volume', 111 | 'lh_medialorbitofrontal_volume', 112 | 'lh_middletemporal_volume', 113 | 'lh_parahippocampal_volume', 114 | 'lh_paracentral_volume', 115 | 'lh_parsopercularis_volume', 116 | 'lh_parsorbitalis_volume', 117 | 'lh_parstriangularis_volume', 118 | 'lh_pericalcarine_volume', 119 | 'lh_postcentral_volume', 120 | 'lh_posteriorcingulate_volume', 121 | 'lh_precentral_volume', 122 | 'lh_precuneus_volume', 123 | 'lh_rostralanteriorcingulate_volume', 124 | 'lh_rostralmiddlefrontal_volume', 125 | 'lh_superiorfrontal_volume', 126 | 'lh_superiorparietal_volume', 127 | 'lh_superiortemporal_volume', 128 | 'lh_supramarginal_volume', 129 | 'lh_frontalpole_volume', 130 | 'lh_temporalpole_volume', 131 | 'lh_transversetemporal_volume', 132 | 'lh_insula_volume', 133 | 'rh_bankssts_volume', 134 | 'rh_caudalanteriorcingulate_volume', 135 | 'rh_caudalmiddlefrontal_volume', 136 | 'rh_cuneus_volume', 137 | 'rh_entorhinal_volume', 138 | 'rh_fusiform_volume', 139 | 
'rh_inferiorparietal_volume', 140 | 'rh_inferiortemporal_volume', 141 | 'rh_isthmuscingulate_volume', 142 | 'rh_lateraloccipital_volume', 143 | 'rh_lateralorbitofrontal_volume', 144 | 'rh_lingual_volume', 145 | 'rh_medialorbitofrontal_volume', 146 | 'rh_middletemporal_volume', 147 | 'rh_parahippocampal_volume', 148 | 'rh_paracentral_volume', 149 | 'rh_parsopercularis_volume', 150 | 'rh_parsorbitalis_volume', 151 | 'rh_parstriangularis_volume', 152 | 'rh_pericalcarine_volume', 153 | 'rh_postcentral_volume', 154 | 'rh_posteriorcingulate_volume', 155 | 'rh_precentral_volume', 156 | 'rh_precuneus_volume', 157 | 'rh_rostralanteriorcingulate_volume', 158 | 'rh_rostralmiddlefrontal_volume', 159 | 'rh_superiorfrontal_volume', 160 | 'rh_superiorparietal_volume', 161 | 'rh_superiortemporal_volume', 162 | 'rh_supramarginal_volume', 163 | 'rh_frontalpole_volume', 164 | 'rh_temporalpole_volume', 165 | 'rh_transversetemporal_volume', 166 | 'rh_insula_volume'] 167 | --------------------------------------------------------------------------------
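For illustration, a minimal usage sketch of the ttest_ind_corrected helper defined in src/utils.py above, assuming it is run from the src directory so that utils is importable; the MAE arrays, fold count, and repetition count are hypothetical placeholders rather than values taken from the repository.

import numpy as np

from utils import ttest_ind_corrected

# Hypothetical MAE scores from r = 10 repetitions of k = 10-fold cross-validation,
# i.e. k * r = 100 scores per model, one per fold and repetition.
rng = np.random.default_rng(0)
mae_model_a = rng.normal(loc=4.5, scale=0.3, size=100)
mae_model_b = rng.normal(loc=4.3, scale=0.3, size=100)

# Corrected paired t-test that accounts for the dependence between CV folds
t_stat, p_value = ttest_ind_corrected(mae_model_a, mae_model_b, k=10, r=10)
print(f't = {t_stat:.3f}, p = {p_value:.3f}')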