├── Software License for Peak Learning.docx ├── Readme.txt ├── README.md ├── LearnPeaks.py ├── Computational_Model.py ├── msiPL_ForTesting.py ├── msiPL_Run.py └── msiPL_Run_CrossValid_3DKidney.py /Software License for Peak Learning.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wabdelmoula/msiPL/HEAD/Software License for Peak Learning.docx -------------------------------------------------------------------------------- /Readme.txt: -------------------------------------------------------------------------------- 1 | This readme file shows how to properly run the msiPL code (Abdelmoula et al,): 2 | Walid Abdelmoula et al, msiPL: Non-linear Manifold and Peak Learning of Mass Spectrometry Imaging Data Using Artificial Neural Networks, bioRxiv, 2020 3 | 4 | License: The Peak Learning software (msiPL) will be shared using the 3D Slicer Software License agreement. 5 | 6 | ---------------------- Installations: Software and Libraries -------------- 7 | We have implemented our machine learning model using the following software items: 8 | 1- Python(3.6.4) 9 | 2- Keras (2.1.5-tf) with a Tensorflow(1.8.0) backend. 10 | 3- Packages: numpy(1.14.2), sklearn(0.19.1), scipy(1.0.0), and h5py(2.7.1) 11 | 4- We implemented this model on Windows 10 PC workstation(Intel Xenon 3.3GHz, 512 GB RAM, 64-bit Windows, 2 GPUs NVIDIA TITAN Xp). 12 | ----------------------------------------------------------------------------- 13 | 14 | ---------------------------------- Demo -------------------------------------- 15 | How to run the code? 16 | 1- "msiPL_Run.py" is the main file that you should run first. The file should be running in a sequential manner, and we have 17 | provided required comments for instructions and guidance. In this file you will be able to: 18 | 1.1. Load a dataset. 19 | 1.2. Load the computational neural network architecture (VAE_BN). 20 | 1.3. Train the model. 21 | 1.3. 
Non-linear manifold learning and data visualization (non-linear dimensionality reduction) 22 | 1.4. Evaluate the learning quality by estimation and reconstruction of the original data 23 | 1.5. Peak Learning learning (Equation#4): to get a smaller list of informative peaks. 24 | 1.6. Perform data clustering (GMM). 25 | 1.7. Identify localized peaks within each cluster. 26 | 27 | 2- "Computational_Model.py": implementation of the fully connected variational autoencoder, and regularized 28 | with batch normalization. 29 | 30 | 3- "LearnPeaks.py": implementation of a function that identifies peaks of interest. 31 | It should be called after training the model, as instructed in "msiPL_Run.py". 32 | 33 | 4. "msiPL_ForTesting.py": ultra-fast analysis on test data without any prior peak picking. 34 | You will need first to load the trained model from step#1 ("msiPL_Run.py"). 35 | 36 | ------------------------------------------------------------------------------------ 37 | We provide a sample of a publicly available MSI data to train and test the model and to ensure reproducibility. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![DOI](https://zenodo.org/badge/287381023.svg)](https://zenodo.org/badge/latestdoi/287381023) 2 | 3 | **msiPL** 4 | --------- 5 | Deep Learning based implementation for analysis of mass spectrometry imaging data 6 | 7 | This readme file shows how to properly run the msiPL code 8 | 9 | **Paper:** Walid Abdelmoula et al, msiPL: Non-linear Manifold and Peak Learning of Mass Spectrometry Imaging Data Using Artificial Neural Networks, bioRxiv, 2020 10 | 11 | **License:** The msiPL code is shared under the 3D Slicer Software License agreement. 
12 | 
13 | **Installations: Software and Libraries**
14 | --------
15 | 
16 | We implemented our machine learning model using the following software:
17 | 
18 | 1- Python (3.6.4)
19 | 
20 | 2- Keras (2.1.5-tf) with a TensorFlow (1.8.0) backend.
21 | 
22 | 3- Packages: numpy (1.14.2), sklearn (0.19.1), scipy (1.0.0), and h5py (2.7.1)
23 | 
24 | 4- We implemented this model on a Windows 10 PC workstation (Intel Xeon 3.3 GHz, 512 GB RAM, 64-bit Windows, 2 NVIDIA TITAN Xp GPUs).
25 | 
26 | Demo
27 | ---------------
28 | 
29 | * How to run the code?
30 | 
31 | 1- "msiPL_Run.py" is the main file that you should run first. Run the file sequentially, section by section; we have
32 | provided comments for instructions and guidance. In this file you will be able to:
33 | 
34 | 1.1. Load a dataset.
35 | 1.2. Load the computational neural network architecture (VAE_BN).
36 | 1.3. Train the model.
37 | 1.4. Non-linear manifold learning and data visualization (non-linear dimensionality reduction)
38 | 1.5. Evaluate the learning quality by estimation and reconstruction of the original data
39 | 1.6. Peak learning (Equation#4): to get a smaller list of informative peaks.
40 | 1.7. Perform data clustering (GMM).
41 | 1.8. Identify localized peaks within each cluster.
42 | 
43 | 2- "Computational_Model.py": implementation of the fully connected variational autoencoder, regularized
44 | with batch normalization.
45 | 
46 | 3- "LearnPeaks.py": implementation of a function that identifies peaks of interest.
47 | It should be called after training the model, as instructed in "msiPL_Run.py".
48 | 
49 | 4- "msiPL_ForTesting.py": ultra-fast analysis on test data without any prior peak picking.
50 | You will first need to load the trained model from step#1 ("msiPL_Run.py").
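As a rough illustration of the thresholding idea behind "LearnPeaks.py", the sketch below keeps only the m/z bins whose normalized weight reaches mean + Beta*std, as the repository code does for each encoded feature. The function name `threshold_weights` and the toy arrays are hypothetical and exist only for this example; the real function also merges the bins across all encoded features and maps them to local maxima of the mean spectrum.

```python
import numpy as np

def threshold_weights(weights, std_spectra, beta):
    """Return indices of bins whose normalized weight is >= mean + beta*std.

    Hypothetical helper for illustration; not part of the msiPL code base.
    """
    # Normalize the weights by the per-bin standard deviation of the data
    normalized = std_spectra * weights
    # Threshold: mean + beta * std of the normalized weights
    threshold = normalized.mean() + beta * normalized.std()
    selected = np.flatnonzero(normalized >= threshold)
    # Rank the selected bins by descending normalized weight
    return selected[np.argsort(-normalized[selected])]

# Toy data (assumption): 100 bins with a small constant weight and two strong bins
weights = np.full(100, 0.01)
weights[10] = 1.0
weights[42] = 2.0
std_spectra = np.ones(100)

peak_ids = threshold_weights(weights, std_spectra, beta=2.5)
print(peak_ids)
```

A larger Beta gives a stricter threshold and therefore a shorter peak list.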
51 | 
52 | If you used this implementation:
53 | ------
54 | Please cite the paper by Abdelmoula et al., msiPL: https://www.biorxiv.org/content/10.1101/2020.08.13.250142v1.abstract
55 | 
--------------------------------------------------------------------------------
/LearnPeaks.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Implementation of msiPL (Abdelmoula et al): Identify informative peaks
4 | 
5 | - This function should be called after training the model
6 | - Briefly: this is a backpropagation-based threshold analysis on the neural network weights (see Equation#4).
7 |   This analysis identifies the m/z values that contributed strongly to the learned non-linear manifold (encoded structures).
8 | """
9 | import numpy as np
10 | from scipy.signal import argrelextrema
11 | 
12 | 
13 | def LearnPeaks(All_mz, W_enc, std_spectra, latent_dim,Beta,meanSpec_Orig):
14 |     W1 = W_enc[0] # Connected with the input layer
15 |     W2 = W_enc[6] # z_mean Layer
16 |     for EncID in range(latent_dim):
17 |         W2_EncFeat1 = W2[:,EncID]
18 |         Act_Neuron_W2 = np.argsort(-W2_EncFeat1) #Note: -ve is used to sort descending
19 |         W2_EncFeat1[Act_Neuron_W2[0]] # no-op: left for interactive inspection of the largest weight
20 |         Neuron_W1 = W1[:,Act_Neuron_W2[0]]
21 |         Weights_norm_W1 = std_spectra*Neuron_W1
22 |         ij = np.argsort(Weights_norm_W1)[::-1]
23 |         Weights_norm_W1 = np.sort(Weights_norm_W1)[::-1]
24 |         Weights_norm_W1[0] # no-op: left for interactive inspection of the largest normalized weight
25 | 
26 |         # ======== Threshold Weights mean + Beta*std:
27 |         T = np.mean(Weights_norm_W1) + Beta*np.std(Weights_norm_W1)
28 |         PeakID = ij[np.argwhere(Weights_norm_W1 >= T)]; PeakID = PeakID[:,0] #Ranked indices
29 | 
30 |         # ======== Get union list of m/z from all encoded features ========
31 |         Enc_mz = [All_mz[i] for i in PeakID]
32 |         if EncID==0:
33 |             Learned_mzBins = []
34 |             Common_PeakID = []
35 |         Learned_mzBins = list(set().union(Enc_mz , Learned_mzBins))
36 |         Common_PeakID = list(set().union(PeakID , Common_PeakID))
37 | 
38 |         if EncID==latent_dim-1:
39 |
Learned_mzBins = np.sort(Learned_mzBins) 40 | Common_PeakID = np.sort(Common_PeakID) 41 | 42 | LocalMax = np.squeeze(np.transpose(argrelextrema(meanSpec_Orig, np.greater))) 43 | mz_LocalMax = [All_mz[i] for i in LocalMax] 44 | Nearest_Peakindx = [np.argmin(np.abs(mz_LocalMax[:] - Learned_mzBins[i])) for i in range(len(Learned_mzBins))] 45 | Peak_Indx = np.unique(Nearest_Peakindx) 46 | Learned_mzPeaks = [mz_LocalMax[i] for i in Peak_Indx] 47 | Learned_mzPeaks = np.asarray(Learned_mzPeaks) 48 | 49 | Real_PeakIdx = [np.argmin(np.abs(All_mz[:] - Learned_mzPeaks[i])) for i in range(len(Learned_mzPeaks))] 50 | 51 | 52 | return Learned_mzBins, Learned_mzPeaks, Common_PeakID,Real_PeakIdx 53 | 54 | -------------------------------------------------------------------------------- /Computational_Model.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Implementation of msiPL (Abdelmoula et al): Neural Network Architecture (VAE_BN) 4 | 5 | Keras-based implementation of a fully connected variational autoecnoder 6 | equipped with Batch normalization to correct for covariate shift and improve learning stability 7 | 8 | """ 9 | 10 | import numpy as np 11 | from keras.layers import Lambda, Input, Dense, ReLU, BatchNormalization 12 | from keras.models import Model 13 | from keras.losses import categorical_crossentropy 14 | from keras.utils import plot_model 15 | from keras import backend as K 16 | 17 | 18 | class VAE_BN(object): 19 | 20 | def __init__ (self, nSpecFeatures, intermediate_dim, latent_dim): 21 | self.nSpecFeatures = nSpecFeatures 22 | self.intermediate_dim = intermediate_dim 23 | self.latent_dim = latent_dim 24 | 25 | def sampling(self, args): 26 | """ 27 | Reparameterization trick by sampling from a continuous function (Gaussian with an auxiliary variable ~N(0,1)). 
28 | [see Our methods and for more details see arXiv:1312.6114] 29 | """ 30 | self.z_mean, self.z_log_var = args 31 | self.batch = K.shape(self.z_mean)[0] 32 | self.dim = K.int_shape(self.z_mean)[1] 33 | self.epsilon = K.random_normal(shape=(self.batch, self.dim)) # random_normal (mean=0 and std=1) 34 | return self.z_mean + K.exp(0.5 * self.z_log_var) * self.epsilon 35 | 36 | 37 | def get_architecture(self): 38 | # =========== 1. Encoder Model================ 39 | input_shape = (self.nSpecFeatures, ) 40 | inputs = Input(shape=input_shape, name='encoder_input') 41 | h = Dense(self.intermediate_dim)(inputs) 42 | h = BatchNormalization()(h) 43 | h = ReLU()(h) 44 | z_mean = Dense(self.latent_dim, name = 'z_mean')(h) 45 | z_mean = BatchNormalization()(z_mean) 46 | z_log_var = Dense(self.latent_dim, name = 'z_log_var')(h) 47 | z_log_var = BatchNormalization()(z_log_var) 48 | 49 | # Reparametrization Tric: 50 | z = Lambda(self.sampling, output_shape = (self.latent_dim,), name='z')([z_mean, z_log_var]) 51 | encoder = Model(inputs, [z_mean, z_log_var, z], name = 'encoder') 52 | print("==== Encoder Architecture...") 53 | encoder.summary() 54 | # plot_model(encoder, to_file='VAE_BN_encoder.png', show_shapes=True) 55 | 56 | # =========== 2. 
Decoder Model================
57 |         latent_inputs = Input(shape = (self.latent_dim,), name='Latent_Space')
58 |         hdec = Dense(self.intermediate_dim)(latent_inputs)
59 |         hdec = BatchNormalization()(hdec)
60 |         hdec = ReLU()(hdec)
61 |         outputs = Dense(self.nSpecFeatures, activation = 'sigmoid')(hdec)
62 |         decoder = Model(latent_inputs, outputs, name = 'decoder')
63 |         print("==== Decoder Architecture...")
64 |         decoder.summary()
65 |         # plot_model(decoder, to_file='VAE_BN__decoder.png', show_shapes=True)
66 | 
67 |         #=========== VAE_BN: Encoder_Decoder ================
68 |         outputs = decoder(encoder(inputs)[2])
69 |         VAE_BN_model = Model(inputs, outputs, name='VAE_BN')
70 | 
71 |         # ====== Cost Function (Variational Lower Bound) ==============
72 |         "KL-div (regularizes encoder) and reconstruction loss (of the decoder): see equation(3) in our paper"
73 |         # 1. KL-Divergence:
74 |         kl_Loss = 1 + self.z_log_var - K.square(self.z_mean) - K.exp(self.z_log_var)
75 |         kl_Loss = K.sum(kl_Loss, axis=-1)
76 |         kl_Loss *= -0.5
77 |         # 2. Reconstruction Loss
78 |         reconstruction_loss = categorical_crossentropy(inputs,outputs) # Use sigmoid at output layer
79 |         reconstruction_loss *= self.nSpecFeatures
80 | 
81 |         # ========== Compile VAE_BN model ===========
82 |         model_Loss = K.mean(reconstruction_loss + kl_Loss)
83 |         VAE_BN_model.add_loss(model_Loss)
84 |         VAE_BN_model.compile(optimizer='adam')
85 |         return VAE_BN_model, encoder
86 | 
87 | 
88 | 
89 | 
90 | 
91 | 
92 | 
--------------------------------------------------------------------------------
/msiPL_ForTesting.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Implementation of msiPL (Abdelmoula et al): Ultra-fast Test Data Analysis
4 | 
5 | Our trained msiPL model is applied on new unseen test data which was withheld
6 | from a large 3D MSI datacube.
For the analysis of 3D MSI data, msiPL provides:
7 | - Ultra-fast Analysis (just a few seconds)
8 | - Memory efficient: unlike conventional methods there is no need to load
9 |   the full complex 3D MSI at once into the RAM.
10 | 
11 | """
12 | 
13 | from __future__ import absolute_import  # __future__ imports must precede all other statements
14 | from __future__ import division
15 | from __future__ import print_function
16 | import numpy as np
17 | np.random.seed(1337)
18 | from tensorflow import set_random_seed
19 | set_random_seed(2)
20 | 
21 | import os
22 | import h5py
23 | import matplotlib.pyplot as plt
24 | from sklearn.mixture import GaussianMixture
25 | from sklearn.metrics import mean_squared_error
26 | from scipy import stats
27 | from matplotlib.colors import LinearSegmentedColormap
28 | import time
29 | 
30 | 
31 | # ========= Load MSI Data without prior peak picking (hdf5 format) ==========
32 | f = h5py.File('Test_Data/MouseKindey_z73.h5','r')
33 | MSI_test = f["Data"]
34 | All_mz = f["mzArray"]
35 | nSpecFeatures = len(All_mz)
36 | xLocation = np.array(f["xLocation"]).astype(int)
37 | yLocation = np.array(f["yLocation"]).astype(int)
38 | col = max(np.unique(xLocation))
39 | row = max(np.unique(yLocation))
40 | im = np.zeros((col,row))
41 | mzId = np.argmin(np.abs(All_mz[:] - 6227.9))
42 | for i in range(len(xLocation)):
43 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_test[i,mzId] #image index starts at 0 not 1
44 | plt.imshow(im);plt.colorbar()
45 | 
46 | # ====== Load VAE_BN: a fully-connected neural network model ========
47 | from Computational_Model import *
48 | input_shape = (nSpecFeatures, )
49 | intermediate_dim = 512
50 | latent_dim = 5
51 | VAE_BN_Model = VAE_BN(nSpecFeatures, intermediate_dim, latent_dim)
52 | myModel, encoder = VAE_BN_Model.get_architecture()
53 | myModel.summary()
54 | 
55 | # ================ Load The Trained Model =====================
56 | myModel.load_weights('TrainedModel_Kidney_Z1.h5')
57 | 
58 | 
59 | #
***************************************************************************** 60 | # ****************** Ultra Fast Analysis on new unseen data ***************** 61 | 62 | # ============= 1. Manifold Learning and Model Predictions =============== 63 | start_time = time.time() 64 | encoded_imgs = encoder.predict(MSI_test) # Learned non-linear manifold 65 | decoded_imgs = myModel.predict(MSI_test) # Reconstructed Data 66 | print("--- %s seconds : Ultra-Fast, isn't it?" % (time.time() - start_time)) 67 | dec_TIC = np.sum(decoded_imgs, axis=-1) 68 | 69 | # ======= 2. Compare Original and Reconstructed (inferred) Data ======== 70 | mse = mean_squared_error(MSI_test,decoded_imgs) 71 | meanSpec_Rec = np.mean(decoded_imgs,axis=0) 72 | print('mean squared error(mse) = ', mse) 73 | meanSpec_Orig = np.mean(MSI_test,axis=0) # TIC-norm original MSI Data 74 | N_DecImg = decoded_imgs/dec_TIC[:,None] # TIC-norm reconstructed MSI Data 75 | meanSpec_RecTIC = np.mean(N_DecImg,axis=0) 76 | plt.plot(All_mz,meanSpec_Orig); plt.plot(All_mz,meanSpec_RecTIC,color = [1.0, 0.5, 0.25]); 77 | plt.title('TIC-norm distribution of average spectrum: Original and Predicted') 78 | 79 | 80 | # ======== 3. Model Parameters of the Latent Space ========== 81 | Latent_mean, Latent_var, Latent_z = encoded_imgs 82 | 83 | # ======== 4. Non-linear dimensionality Reduction ========== 84 | ndim = Latent_z.shape[1] 85 | plt.figure(figsize=(14, 14)) 86 | for j in range(ndim): 87 | for i in range(len(xLocation)): 88 | im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = Latent_z[i,j] 89 | ax = plt.subplot(1, ndim, j + 1) 90 | plt.imshow(im,cmap="hot"); # plt.colorbar() 91 | ax.get_xaxis().set_visible(False) 92 | ax.get_yaxis().set_visible(False) 93 | 94 | # ========= 5. 
Visualize Original & Reconstructed (inferred) m/z images ========== 95 | mzs = [2489.6,6627.9,8981.4,13961.2] 96 | directory = 'Results\\test' 97 | if not os.path.exists(directory): 98 | os.makedirs(directory) 99 | for indx in range(0,len(mzs)): 100 | mzId = np.argmin(np.abs(All_mz[:] - mzs[indx])) 101 | for i in range(len(xLocation)): 102 | im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = N_DecImg[i,mzId] # Reconstructed TIC-norm m/z image 103 | # im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_test[i,mzId] # Original TIC-norm m/z image 104 | ax = plt.subplot(1, len(mzs), indx + 1) 105 | plt.imshow(im); # plt.colorbar() 106 | ax.get_xaxis().set_visible(False) 107 | ax.get_yaxis().set_visible(False) 108 | plt.imsave(directory + '\\mz' + str(All_mz[mzId]) + '_Rec.jpg',im) 109 | # plt.imsave(directory + '\\mz' + str(All_mz[mzId]) + '_Orig.jpg',im) 110 | 111 | # ***************************************************************************** 112 | #********************* 6. 
Peak Learning (Manuscript Equation#4) ***************
113 | # Statistical analysis on the trained neural network weights
114 | from LearnPeaks import *
115 | W_enc = encoder.get_weights()
116 | # Normalize the weights by multiplying them by the std of the original data variables
117 | std_spectra = np.std(MSI_test, axis=0)
118 | Beta = 2.5
119 | Learned_mzBins, Learned_mzPeaks, mzBin_Indx, Real_PeakIdx = LearnPeaks(All_mz, W_enc,std_spectra,latent_dim,Beta,meanSpec_Orig)
120 | 
121 | 
122 | 
123 | # *****************************************************************************
124 | # ========= Color Map ==============
125 | def discrete_cmap(N, base_cmap=None):
126 |     """Create an N-bin discrete colormap from the specified input map"""
127 |     base = plt.cm.get_cmap(base_cmap)
128 |     color_list = base(np.linspace(0, 1, N))
129 |     cmap_name = base.name + str(N)
130 |     return base.from_list(cmap_name, color_list, N)
131 | 
132 | # *********************** Downstream Data Analysis ****************************
133 | # Data Clustering using GMM: applied on the encoded features "Latent_z"
134 | # Peak Localization for each cluster
135 | nClusters = 7
136 | gmm = GaussianMixture(n_components=nClusters,random_state=0).fit(Latent_z)
137 | labels = gmm.predict(Latent_z)
138 | labels +=1 # To avoid conflict with the natural background value of 0
139 | for i in range(len(xLocation)):
140 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = labels[i]
141 | MyCmap = discrete_cmap(nClusters+1, 'jet')
142 | plt.imshow(im,cmap=MyCmap);
143 | plt.colorbar(ticks=np.arange(0,nClusters+1,1))
144 | plt.axis('off')
145 | 
146 | 
147 | # ======= Select a cluster of interest and correlate with the Learned_mzPeaks ===============
148 | # 1.
Select Cluster:
149 | cluster_id = 6
150 | Kimg = labels==cluster_id
151 | Kimg = Kimg.astype(int)
152 | 
153 | for i in range(len(xLocation)):
154 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = Kimg[i]
155 | segCmp = [MyCmap(0),MyCmap(cluster_id)]
156 | cm = LinearSegmentedColormap.from_list('Walid_cmp',segCmp,N=2)
157 | plt.imshow(im, cmap=cm);
158 | plt.axis('off')
159 | 
160 | # 2. Correlate the selected cluster with the Learned_mzPeaks:
161 | Peaks_ID = [np.argmin(np.abs(All_mz[:] - Learned_mzPeaks[i])) for i in range(len(Learned_mzPeaks))]
162 | MSI_PeakList = MSI_test[:,Peaks_ID[:]] # keep MSI data only for the shortlisted learned m/z peaks
163 | Corr_Val = np.zeros(len(Learned_mzPeaks))
164 | for i in range(len(Learned_mzPeaks)):
165 |     Corr_Val[i] = stats.pearsonr(Kimg,MSI_PeakList[:,i])[0]
166 | id_mzCorr = np.argmax(Corr_Val)
167 | rank_ij = np.argsort(Corr_Val)[::-1]
168 | 
169 | for i in range(len(xLocation)):
170 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_PeakList[i,id_mzCorr]
171 | plt.imshow(im)
172 | plt.axis('off')
173 | print('m/z', Learned_mzPeaks[id_mzCorr])
174 | print('corr_Value = ', Corr_Val[id_mzCorr])
175 | 
176 | plt.plot(Learned_mzPeaks,Corr_Val)
177 | print(['%0.4f' % i for i in Learned_mzPeaks[rank_ij[0:10]]])
178 | print('Correlation Top 10 Ranked peaks:', end='')
179 | print(['%0.4f' % i for i in Corr_Val[rank_ij[0:10]]])
180 | 
--------------------------------------------------------------------------------
/msiPL_Run.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Implementation of msiPL (Abdelmoula et al): Model Training
4 | This is the main file to run
5 | This implementation is based on the public AI platforms of Keras and Tensorflow.
6 | 
7 | How to run the code?
8 | 1- "msiPL_Run.py" is the main file that you should run first.
The file should be running in a sequential manner, and we have 9 | provided required comments for instructions and guidance. In this file you will be able to: 10 | 1.1. Load a dataset. 11 | 1.2. Load the computational neural network architecture (VAE_BN). 12 | 1.3. Train the model. 13 | 1.3. Non-linear manifold learning and data visualization (non-linear dimensionality reduction) 14 | 1.4. Evaluate the learning quality by estimation and reconstruction of the original data 15 | 1.5. Peak Learning learning (Equation#4): to get a smaller list of informative peaks. 16 | 1.6. Perform data clustering (GMM): The number of clusters can be set by the users or automatically using the BIC method. 17 | 1.7. Identify localized peaks within each cluster. 18 | 19 | """ 20 | from __future__ import absolute_import 21 | from __future__ import division 22 | from __future__ import print_function 23 | 24 | import numpy as np 25 | np.random.seed(1337) 26 | from tensorflow import set_random_seed 27 | set_random_seed(2) 28 | 29 | import os 30 | import h5py 31 | import matplotlib.pyplot as plt 32 | from sklearn.mixture import GaussianMixture 33 | from sklearn.metrics import mean_squared_error 34 | from scipy import stats 35 | from matplotlib.colors import LinearSegmentedColormap 36 | import time 37 | 38 | # ========= Color Map ============== 39 | def discrete_cmap(N, base_cmap=None): 40 | """Create an N-bin discrete colormap from the specified input map""" 41 | base = plt.cm.get_cmap(base_cmap) 42 | color_list = base(np.linspace(0, 1, N)) 43 | cmap_name = base.name + str(N) 44 | return base.from_list(cmap_name, color_list, N) 45 | 46 | # ========= Load MSI Data without prior peak picking (hdf5 format) ========== 47 | """ The MSI data is loaded as hdf5 file to maintain efficiency. 48 | We have noticed high efficiency in memory usage and fast performance in accessing 49 | high dimensional data form huge files when dealing with HDF5 formats as oppose to the imzML. 
50 | You can convert the imzML to hdf5 using the following steps:
51 |     - first install the python package "h5py" and an imzML parser (e.g. "pyimzML")
52 |     - Load the imzML file to get:
53 |         a- spectral data, and let's say save it in a variable called "Spec_Data",
54 |         b- spatial information for each spectrum and save it in "XCoord" and "YCoord".
55 |     - use the h5py method called "create_dataset" to save your data in h5. For example:
56 |         - myHF = h5py.File("myData.h5",'w')
57 |         - myHF.create_dataset('Data', data=Spec_Data)
58 |         - myHF.create_dataset('xLocation', data=XCoord), and likewise for 'yLocation' (YCoord) and 'mzArray' (the m/z values), which are read below
59 |     - After you finish, close your h5 file: "myHF.close()"
60 | """
61 | f = h5py.File('Training_Data/MouseKindey_z1.h5','r')
62 | MSI_train = f["Data"] # spectral information.
63 | All_mz = f["mzArray"]
64 | nSpecFeatures = len(All_mz)
65 | if MSI_train.shape[1] != nSpecFeatures:
66 |     MSI_train = np.transpose(MSI_train)
67 | xLocation = np.array(f["xLocation"]).astype(int)
68 | yLocation = np.array(f["yLocation"]).astype(int)
69 | col = max(np.unique(xLocation))
70 | row = max(np.unique(yLocation))
71 | im = np.zeros((col,row))
72 | mzId = np.argmin(np.abs(All_mz[:] - 6227.9))
73 | for i in range(len(xLocation)):
74 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_train[i,mzId] #image index starts at 0 not 1
75 | plt.imshow(im);plt.colorbar()
76 | 
77 | # ====== Load VAE_BN: a fully-connected neural network model ========
78 | """ myModel represents the msiPL architecture """
79 | from Computational_Model import *
80 | input_shape = (nSpecFeatures, )
81 | intermediate_dim = 512 # Size of the first hidden layer
82 | # ---- dimensions of the latent space (i.e. encoded features)
83 | ans = int(input('The default value of the latent space dimensions is 5, would you like to set a different value?
Yes=1; No=0 …:')) 84 | if ans == 1: 85 | latent_dim = int (input('Please set the new dimensions of the latent space = ')) 86 | else: 87 | latent_dim = 5 88 | # ------------------------------------- 89 | # Compile the msiPL computational model 90 | VAE_BN_Model = VAE_BN(nSpecFeatures, intermediate_dim, latent_dim) 91 | myModel, encoder = VAE_BN_Model.get_architecture() 92 | myModel.summary() 93 | 94 | # ============= Model Training ================= 95 | """ The training processes involves: 96 | epochs: 100 iterations 97 | batch_size: a randomly-shuffled subset of 128 spectra is loaded at a time into the RAM 98 | This phase will run faster if a GPU is utilized 99 | """ 100 | try: 101 | start_time = time.time() 102 | history = myModel.fit(MSI_train, epochs=100, batch_size=128, shuffle="batch") 103 | plt.plot(history.history['loss']) 104 | plt.ylabel('loss'); plt.xlabel('epoch') 105 | print("--- %s seconds ---" % (time.time() - start_time)) 106 | myModel.save_weights('TrainedModel_Kidney_Z1.h5') 107 | except MemoryError as error: 108 | import psutil 109 | Memory_Information = psutil.virtual_memory() 110 | print('>>> There is a memory issue: and here are a few suggestions:') 111 | print('>>>>>> 1- Make sure that you are using python 64-bit.') 112 | print('>>>>>> 2- use a lower value for the batch_size (default is 128).') 113 | print('**** Here is some information about your memory (MB):', Memory_Information) 114 | 115 | 116 | # ============= Model Predictions =============== 117 | encoded_imgs = encoder.predict(MSI_train) # Learned non-linear manifold 118 | decoded_imgs = myModel.predict(MSI_train) # Reconstructed Data 119 | dec_TIC = np.sum(decoded_imgs, axis=-1) 120 | 121 | # ======= Calculate mse between orig & rec. 
data ===== 122 | """ The mean squared error (mse): 123 | the mse is used to evaluate the quality of the reconstructed data""" 124 | mse = mean_squared_error(MSI_train,decoded_imgs) 125 | meanSpec_Rec = np.mean(decoded_imgs,axis=0) 126 | print('mean squared error(mse) = ', mse) 127 | meanSpec_Orig = np.mean(MSI_train,axis=0) # TIC-norm original MSI Data 128 | N_DecImg = decoded_imgs/dec_TIC[:,None] # TIC-norm reconstructed MSI Data 129 | meanSpec_RecTIC = np.mean(N_DecImg,axis=0) 130 | plt.plot(All_mz,meanSpec_Orig); plt.plot(All_mz,meanSpec_RecTIC,color = [1.0, 0.5, 0.25]); 131 | plt.title('TIC-norm distribution of average spectrum: Original and Predicted') 132 | 133 | # ======== Model Parameters of the Latent Space ========== 134 | """ Capturing the learned latent variable: 135 | encoded features (Latent_z), and its mean and variance""" 136 | Latent_mean, Latent_var, Latent_z = encoded_imgs 137 | 138 | # ======== Visualize encoded Features (learned non-linear spectral manifold) ========== 139 | ndim = Latent_z.shape[1] 140 | plt.figure(figsize=(14, 14)) 141 | for j in range(ndim): 142 | for i in range(len(xLocation)): 143 | im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = Latent_z[i,j] 144 | ax = plt.subplot(1, ndim, j + 1) 145 | plt.imshow(im,cmap="hot"); # plt.colorbar() 146 | ax.get_xaxis().set_visible(False) 147 | ax.get_yaxis().set_visible(False) 148 | 149 | # ========= Visualize Original & Reconstructed m/z images ========== 150 | mzs = [2489.6,6627.9,8981.4,13961.2] 151 | directory = 'Results' 152 | if not os.path.exists(directory): 153 | os.makedirs(directory) 154 | for indx in range(0,len(mzs)): 155 | mzId = np.argmin(np.abs(All_mz[:] - mzs[indx])) 156 | for i in range(len(xLocation)): 157 | im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = N_DecImg[i,mzId] # Reconstructed TIC-norm m/z image 158 | # im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_train[i,mzId] # Original TIC-norm m/z image 159 | ax = 
plt.subplot(1, len(mzs), indx + 1)
160 |     plt.imshow(im); # plt.colorbar()
161 |     ax.get_xaxis().set_visible(False)
162 |     ax.get_yaxis().set_visible(False)
163 |     plt.imsave(os.path.join(directory, 'mz' + str(All_mz[mzId]) + '_Rec.jpg'),im)
164 |     # plt.imsave(os.path.join(directory, 'mz' + str(All_mz[mzId]) + '_Orig.jpg'),im)
165 | 
166 | #********************* Peak Learning (Manuscript Equation#4) ********************
167 | """ Statistical analysis on the trained neural network weights
168 | (see Equation (4) in the main manuscript)
169 | """
170 | from LearnPeaks import *
171 | W_enc = encoder.get_weights()
172 | # Normalize the weights by multiplying them by the std of the original data variables
173 | std_spectra = np.std(MSI_train, axis=0)
174 | Beta = 2.5 # This variable can be adjusted by the user. We have observed good performance within the range [1,2.5]
175 | Learned_mzBins, Learned_mzPeaks, mzBin_Indx, Real_PeakIdx = LearnPeaks(All_mz, W_enc,std_spectra,latent_dim,Beta,meanSpec_Orig)
176 | # save results of learned peaks in excel sheet
177 | directory = 'Results'
178 | if not os.path.exists(directory):
179 |     os.makedirs(directory)
180 | import pandas as pd
181 | df_1 = pd.DataFrame({'mz Peaks': Learned_mzPeaks})
182 | df_1.to_excel(directory+'/'+'Peaks_.xlsx', engine='xlsxwriter' , sheet_name='Sheet1')
183 | 
184 | # ******************* Downstream Data Analysis **************************
185 | """ Now that msiPL has been trained to learn a non-linear manifold, the
186 | clustering step can be efficiently applied.
187 | Data Clustering using GMM:
188 |     - Applied on the encoded features "Latent_z"
189 |     - Peak Localization within each cluster
190 |     - nClusters: this is the number of clusters that needs to be set before running the GMM.
191 |       "nClusters" can be set manually or automatically suggested based on an optimization process using the BIC algorithm.
192 | """ 193 | # ---- Bayesian Information Criterion (BIC) combined with the Kneedle algorithm for optimal model selection: 194 | """ The total number of K-clusters will be automatically suggested. 195 | - Different GMM models will generated using different number of K-clusters (e.g. K varies between[3,20]) 196 | - The BIC scores will be computed for each GMM model. 197 | - The Kneedle algorithm is applied on the BIC scores to identify the point of maximum curvature (knee point). 198 | - The knee point points to the best model and suggest the expected number of K-clusters. 199 | """ 200 | from kneed import KneeLocator 201 | # covariance_type = {'full', 'spherical', 'diag', 'tied'} 202 | cov_Type = 'full' 203 | n_components = np.arange(3, 20) 204 | models = [GaussianMixture(n, covariance_type=cov_Type, random_state=0).fit(Latent_z) 205 | for n in n_components] 206 | 207 | BIC_Scores = [m.bic(Latent_z) for m in models] 208 | kneedle_point = KneeLocator(n_components, BIC_Scores, curve='convex', direction='decreasing') 209 | print('The suggested number of clusters = ', kneedle_point.knee) 210 | Elbow_idx = np.where(BIC_Scores==kneedle_point.knee_y)[0] 211 | 212 | from matplotlib.ticker import MaxNLocator 213 | plt.plot(n_components, BIC_Scores,'-g', marker='o',markerfacecolor='blue',markeredgecolor='orange', 214 | markeredgewidth='2',markersize=10,markevery=Elbow_idx) 215 | plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True)) 216 | plt.legend(loc='best') 217 | plt.xlabel('Number of clusters'); 218 | plt.ylabel('BIC score'); 219 | plt.title('The suggested number of clusters = '+ np.str(kneedle_point.knee)) 220 | # plt.plot(n_components, [m.aic(Latent_z) for m in models], label='AIC') 221 | # Ref Kneedle algorithm [V. Satopaa et al., international conference on distributed computing systems workshops. IEEE, 2011.] 
222 | 
223 | # ======================== Apply GMM on Encoded Features =============
224 | start_time_gmm = time.time()
225 | nClusters = (kneedle_point.knee) # this variable is set automatically based on the BIC algorithm
226 | # nClusters = 7 # this variable could be tuned by the user
227 | gmm = GaussianMixture(n_components=nClusters,covariance_type=cov_Type,random_state=0).fit(Latent_z)
228 | labels = gmm.predict(Latent_z)
229 | labels +=1 # To avoid conflict with the natural background value of 0
230 | 
231 | # Spatial Clusters Distribution:
232 | for i in range(len(xLocation)):
233 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = labels[i]
234 | MyCmap = discrete_cmap(nClusters+1, 'jet')
235 | plt.imshow(im,cmap=MyCmap);
236 | plt.colorbar(ticks=np.arange(0,nClusters+1,1))
237 | plt.axis('off')
238 | print("Clustering time = %s seconds" % (time.time() - start_time_gmm))
239 | 
240 | # ======= Select a cluster of interest and correlate with the Learned_mzPeaks ===============
241 | # 1. Select a cluster:
242 | cluster_id = 2
243 | Kimg = labels==cluster_id
244 | Kimg = Kimg.astype(int)
245 | 
246 | for i in range(len(xLocation)):
247 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = Kimg[i]
248 | segCmp = [MyCmap(0),MyCmap(cluster_id)]
249 | cm = LinearSegmentedColormap.from_list('Walid_cmp',segCmp,N=2)
250 | plt.imshow(im, cmap=cm);
251 | plt.axis('off')
252 | 
253 | # 2. Correlate the selected cluster with the Learned_mzPeaks:
254 | # Note: it will also be fast to correlate the cluster with All_mz Data
255 | Peaks_ID = [np.argmin(np.abs(All_mz[:] - Learned_mzPeaks[i])) for i in range(len(Learned_mzPeaks))]
256 | MSI_PeakList = MSI_train[:,Peaks_ID[:]] # keep MSI data only for the shortlisted learned m/z peaks
257 | Corr_Val = np.zeros(len(Learned_mzPeaks))
258 | for i in range(len(Learned_mzPeaks)):
259 |     Corr_Val[i] = stats.pearsonr(Kimg,MSI_PeakList[:,i])[0]
260 | id_mzCorr = np.argmax(Corr_Val)
261 | rank_ij = np.argsort(Corr_Val)[::-1]
262 | 
263 | for i in range(len(xLocation)):
264 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_PeakList[i,id_mzCorr]
265 | plt.imshow(im)
266 | plt.axis('off')
267 | print('m/z', Learned_mzPeaks[id_mzCorr])
268 | print('corr_Value = ', Corr_Val[id_mzCorr])
269 | 
270 | plt.plot(Learned_mzPeaks,Corr_Val)
271 | print(['%0.4f' % i for i in Learned_mzPeaks[rank_ij[0:10]]])
272 | print('Correlation Top 10 Ranked peaks:', end='')
273 | print(['%0.4f' % i for i in Corr_Val[rank_ij[0:10]]])

--------------------------------------------------------------------------------
/msiPL_Run_CrossValid_3DKidney.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Implementation of msiPL (Abdelmoula et al): Model Cross-Validation Analysis
4 | 
5 | """
6 | from __future__ import absolute_import
7 | from __future__ import division
8 | from __future__ import print_function
9 | 
10 | import numpy as np
11 | np.random.seed(1337)
12 | from tensorflow import set_random_seed
13 | set_random_seed(2)
14 | 
15 | import os
16 | import h5py
17 | import matplotlib.pyplot as plt
18 | from sklearn.mixture import GaussianMixture
19 | from sklearn.metrics import mean_squared_error
20 | from scipy import stats
21 | from matplotlib.colors import LinearSegmentedColormap
22 | import matplotlib as mpl
23 | import nibabel as nib
24 | import pandas as pd
25 | import time
26 | 
27 | # ======= Directory Information:
28 | Cd = os.getcwd()
29 | Bd = os.path.dirname(Cd)
30 | 
31 | # ========= Color Map ==============
32 | def discrete_cmap(N, base_cmap=None):
33 |     """Create an N-bin discrete colormap from the specified input map"""
34 |     base = plt.cm.get_cmap(base_cmap)
35 |     color_list = base(np.linspace(0, 1, N))
36 |     cmap_name = base.name + str(N)
37 |     return base.from_list(cmap_name, color_list, N)
38 | 
39 | # ====== Visualize Image: From 1D vector to Image ==============
40 | def Image_Distribution(V,xLoc,yLoc):
41 |     col = max(np.unique(xLoc))
42 |     row = max(np.unique(yLoc))
43 |     Myimg = np.zeros((col,row))
44 |     for i in range(len(xLoc)):
45 |         Myimg[np.asscalar(xLoc[i])-1, np.asscalar(yLoc[i])-1] = V[i]
46 |     return Myimg
47 | 
48 | # ================= Correlate Cluster with MSI Data =============
49 | def Correlate_Cluster_MSI(cluster_id,Labels,MSI_D,Peak_Indx,ZCoord_cv,XCoord_cv,YCoord_cv):
50 |     Kimg = Labels==cluster_id
51 |     Kimg = Kimg.astype(int)
52 |     MSI_CleanPeaks = MSI_D[:,Peak_Indx[:]]
53 |     Corr_Val = np.zeros(len(Peak_Indx))
54 | 
55 |     for i in range(len(Peak_Indx)):
56 |         Corr_Val[i] = stats.pearsonr(Kimg,MSI_CleanPeaks[:,i])[0]
57 |     id_mzCorr = np.argmax(Corr_Val)
58 |     rank_ij = np.argsort(Corr_Val)[::-1]
59 |     return Corr_Val, rank_ij, MSI_CleanPeaks
60 | 
61 | # ========================== 3D mz image ============================
62 | def Get_3Dmz_nifti(MSI_CleanPeaks,mz_Peak,XCoord_cv,YCoord_cv,ZCoord_cv,directory):
63 |     mzSections = np.unique(ZCoord_cv)
64 |     Vol_mz = np.zeros((200,200,len(mzSections)))
65 |     nSections = len(mzSections)
66 |     directory_NIFT = directory + '\\mz_Vol\\Training'
67 |     if not os.path.exists(directory_NIFT):
68 |         os.makedirs(directory_NIFT)
69 |     for Zsec in range(len(mzSections)):
70 |         ij_r = np.argwhere(ZCoord_cv == mzSections[Zsec])
71 |         indx = ij_r[:,0]
72 |         xLoc = XCoord_cv[indx]
73 |         yLoc = YCoord_cv[indx]
74 |         MSI_2D = np.squeeze(MSI_CleanPeaks[indx])
75 |         for idx in range(len(xLoc)):
76 |             Vol_mz[np.asscalar(xLoc[idx])-1, np.asscalar(yLoc[idx])-1,Zsec] = MSI_2D[idx]
77 | 
78 |     I_nii = nib.Nifti1Image(Vol_mz,affine=np.eye(4))
79 |     nib.save(I_nii,directory_NIFT +'\\mz_' + str(mz_Peak) + '.nii')
80 | 
81 | #============= Spatial Distribution of Encoded Features =============
82 | def get_EncFeatures(Latent_z,Train_idx,myZCoord,xLocation,yLocation,directory,order):
83 |     myzSections = np.unique(myZCoord)
84 |     ndim = Latent_z.shape[1]
85 |     for zr in range(len(Train_idx)):
86 |         ij_r = np.argwhere(myZCoord == myzSections[zr])
87 |         indx = ij_r[:,0]
88 |         xLoc = xLocation[indx]
89 |         yLoc = yLocation[indx]
90 |         zSection_Latent_z = np.squeeze(Latent_z[indx,])
91 |         plt.figure(figsize=(14, 14))
92 |         for j in range(ndim):
93 |             EncFeat = zSection_Latent_z[:,j] #encoded_imgs[i,0] #image index starts at 0 not 1
94 |             im = Image_Distribution(EncFeat,xLoc,yLoc);
95 |             ax = plt.subplot(1, ndim, j + 1)
96 |             plt.imshow(im,cmap="hot"); # plt.colorbar()
97 |             ax.get_xaxis().set_visible(False)
98 |             ax.get_yaxis().set_visible(False)
99 | 
100 |         directory_Latz = directory+'//Latent//Training_'+str(order)
101 |         if not os.path.exists(directory_Latz):
102 |             os.makedirs(directory_Latz)
103 |         plt.savefig(directory_Latz + '\\EncFetaures_Tissue'+str(myzSections[zr])+'.png',bbox_inches='tight')
104 | 
105 | # ================== Get GMM Image for cv analysis ================
106 | def get_gmmImage(Train_idx,Features,nClusters,myZCoord,xLocation,yLocation,directory,order):
107 |     myzSections = np.unique(myZCoord); Zsec=0;
108 |     C_imgs = np.zeros((200,200,len(range(1,len(Train_idx)+1,1)),nClusters))
109 |     directoryGmm = directory+'//GMM//Training_'+str(order)
110 |     if not os.path.exists(directoryGmm):
111 |         os.makedirs(directoryGmm)
112 | 
113 |     for zr in range(len(Train_idx)):
114 |         im = []
115 |         ij_r = np.argwhere(myZCoord == myzSections[zr])
116 |         indx = ij_r[:,0]
117 |         xLoc = xLocation[indx]
118 |         yLoc = yLocation[indx]
119 |         zSection_labels = Features[indx]
120 |         im = zSection_labels
121 |         im = Image_Distribution(im,xLoc,yLoc);
122 |         MyCmap = discrete_cmap(nClusters, 'jet')
123 |         plt.imshow(im,cmap=MyCmap);
124 |         plt.colorbar(ticks=np.arange(0,nClusters,1))
125 |         plt.axis('off')
126 |         plt.show()
127 |         plt.imsave(directoryGmm + '\\gmm_Training_'+str(myzSections[zr])+'_K_' + str(nClusters) + '.png',im,cmap=MyCmap)
128 | 
129 |         # Save single clusters:
130 |         directory_SingleC =directoryGmm + '\\GMM_Section_'+str(myzSections[zr])
131 |         if not os.path.exists(directory_SingleC):
132 |             os.makedirs(directory_SingleC)
133 | 
134 |         for c in range(0,nClusters,1):
135 |             cluster_id = c
136 |             Kimg = zSection_labels[:]==cluster_id
137 |             Kimg = Kimg.astype(int)
138 |             for idx in range(len(xLoc)):
139 |                 C_imgs[np.asscalar(xLoc[idx])-1, np.asscalar(yLoc[idx])-1,Zsec,cluster_id] = Kimg[idx]
140 |             Kimg = Image_Distribution(Kimg,xLoc,yLoc);
141 | 
142 |             segCmp = [MyCmap(0),MyCmap(cluster_id)]
143 |             segCmp[0]= (0,0,0)
144 |             cm = LinearSegmentedColormap.from_list('Walid_cmp',segCmp,N=2)
145 |             plt.imshow(Kimg, cmap=cm);
146 |             plt.colorbar(ticks=np.arange(0,1,1))
147 |             plt.axis('off')
148 |             plt.show()
149 |             plt.imsave(directory_SingleC + '\\Cluster_' + str(cluster_id) + '.png',Kimg,cmap=cm)
150 |         Zsec +=1
151 |     # Save NIFTI
152 |     directory_NIFT = directoryGmm + '\\NIFTI'
153 |     if not os.path.exists(directory_NIFT):
154 |         os.makedirs(directory_NIFT)
155 |     for c in range(0,nClusters,1):
156 |         I_nii = nib.Nifti1Image(C_imgs[:,:,:,c],affine=np.eye(4))
157 |         nib.save(I_nii,directory_NIFT +'\\Label_' + str(c) + '.nii')
158 | 
159 | # ======== Save NIFTI image for each cluster ==================
160 | def Cluster_To_Nifti(directory,C_imgs,nClusters):
161 |     directory_NIFT = directory + '\\GMM_K'+str(nClusters)+'\\NIFTI'
162 |     if not os.path.exists(directory_NIFT):
163 |         os.makedirs(directory_NIFT)
164 |     for c in range(1,nClusters+1,1):
165 |         I_nii = nib.Nifti1Image(C_imgs[:,:,:,c],affine=np.eye(4))
166 |         nib.save(I_nii,directory_NIFT +'\\Label_' + str(c) + '.nii')
167 
| 168 | # =================== Load 3D MSI Data ========================#
169 | Combined_MSI = []; XCoord = []; YCoord = []; ZCoord = []
170 | TissueIDs = [x for x in range(1,74,1)]
171 | 
172 | for id in range(1,74,1):
173 |     f = h5py.File(Bd+'//hd5//MouseKindey_z' + str(id) + '.h5','r')
174 |     MSI_train = f["Data"]
175 |     mzList = f["mzArray"]
176 |     nSpecFeatures = len(mzList)
177 |     xLocation = np.array(f["xLocation"]).astype(int)
178 |     yLocation = np.array(f["yLocation"]).astype(int)
179 |     zLocation = np.full(len(yLocation),id)
180 |     col = max(np.unique(xLocation))
181 |     row = max(np.unique(yLocation))
182 |     im = np.zeros((col,row))
183 |     if id==1:
184 |         Combined_MSI = MSI_train
185 |         XCoord = xLocation
186 |         YCoord = yLocation
187 |         ZCoord = zLocation
188 |     else:
189 |         Combined_MSI = np.concatenate((MSI_train,Combined_MSI), axis=0)
190 |         XCoord = np.concatenate((xLocation,XCoord))
191 |         YCoord = np.concatenate((yLocation,YCoord))
192 |         ZCoord = np.concatenate((zLocation,ZCoord))
193 | 
194 | 
195 | # ============ KFold Cross Validation:
196 | from sklearn.model_selection import KFold
197 | from matplotlib.patches import Patch
198 | cmap_data = plt.cm.Paired
199 | cmap_cv = plt.cm.coolwarm
200 | 
201 | n_folds = 5
202 | kfold = KFold(n_folds, shuffle=True)
203 | fig, ax = plt.subplots()
204 | ij_Training = []; ij_Testing = []
205 | myHF = h5py.File('CV_Values//cv_Idx.h5', 'w')
206 | for ij, (Test_idx, Train_idx) in enumerate(kfold.split(TissueIDs)):  # NOTE: KFold.split yields (train, test); the names are swapped here, so Train_idx is the smaller 1/n_folds split
207 |     print("Training: %s Testing:%s" %(Train_idx, Test_idx))
208 |     ij_Training.append(Train_idx)
209 |     ij_Testing.append(Test_idx)
210 |     myHF.create_dataset('indx_Training'+str(ij), data=Train_idx)
211 |     myHF.create_dataset('indx_Testing'+str(ij), data=Test_idx)
212 | 
213 |     indices = np.array([np.nan] * len(TissueIDs))
214 |     indices[Train_idx] = 1
215 |     indices[Test_idx] = 0
216 |     # Visualize cross-validation behavior:
217 |     ax.scatter(range(1,len(indices)+1), [ij + 1] * len(indices),
218 |                c=indices, marker='_', lw=10, cmap=cmap_cv, vmin=-0.2, vmax=1.2)
219 | yticklabels = list(range(1,n_folds+1))
220 | ax.set(yticks=np.arange(1,n_folds+1) , yticklabels=yticklabels,
221 |        xlabel='2D MSI Sample index', ylabel="Iteration",
222 |        ylim=[n_folds+1.2,-.2], xlim=[-2, len(TissueIDs)+4])
223 | ax.set_title('KFold', fontsize=15)
224 | ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.1))],
225 |           ['Training set', 'Testing set'], loc=(1.02, .8))
226 | myHF.close()
227 | 
228 | # -------- A function to get the training data for one fold:
229 | def Get_cv_MSI(Combined_MSI,XCoord,YCoord,ZCoord,zSections,CV_idx):
230 |     ij_r=[]; ij_t=[]; MSI_train=[]; MSI_Test=[]
231 |     # Collect the spectra indices of every tissue section listed in CV_idx:
232 |     for jr,zr in enumerate(CV_idx):
233 |         if jr==0:
234 |             ij_r = np.argwhere(ZCoord == zSections[zr])
235 |         else:
236 |             ij_r = np.concatenate((ij_r, np.argwhere(ZCoord == zSections[zr])),axis=0)
237 |     MSI_Data = np.squeeze(Combined_MSI[ij_r,])
238 |     XCoord_cv = XCoord[ij_r]
239 |     YCoord_cv = YCoord[ij_r]
240 |     ZCoord_cv = ZCoord[ij_r]
241 | 
242 |     return MSI_Data,XCoord_cv,YCoord_cv,ZCoord_cv
243 | 
244 | # ------------------- Train and Test with CV:
245 | zSections = np.unique(ZCoord)
246 | directory = Cd+'/Results_CV'
247 | meanSpec_Orig_AllData = np.mean(Combined_MSI,axis=0)
248 | 
249 | # --------- Train model status:
250 | TrainStatus = int(input("Would you like to Train Your Model? Yes=1; No=0 ... :"))
251 | if TrainStatus== 0:
252 |     print('No Training')
253 | else:
254 |     print('Model Training ...>>>......')
255 | 
256 | hf_cv = h5py.File('CV_Values/cv_Idx.h5', 'r')
257 | for i in range(n_folds):
258 |     Train_idx = hf_cv['indx_Training'+str(i)][:]
259 |     #Test_idx = hf_cv['indx_Testing'+str(i)][:]
260 |     MSI_train,XCoord_cv,YCoord_cv,ZCoord_cv = Get_cv_MSI(Combined_MSI,XCoord,YCoord,ZCoord,zSections,Train_idx)
261 |     #MSI_Test,XCoord_cv,YCoord_cv,ZCoord_cv = Get_cv_MSI(Combined_MSI,XCoord,YCoord,ZCoord,zSections,Test_idx)
262 |     myzSections = np.unique(ZCoord_cv)
263 |     # ************************* Training **************************************
264 |     # 1. ====== Initialize the model:
265 |     from Computational_Model import *
266 |     input_shape = (nSpecFeatures, )
267 |     intermediate_dim = 512
268 |     latent_dim = 5
269 |     VAE_BN_Model = VAE_BN(nSpecFeatures, intermediate_dim, latent_dim)
270 |     myModel, encoder = VAE_BN_Model.get_architecture()
271 |     myModel.summary()
272 |     # 2. ====== Train the model:
273 |     if TrainStatus==1:
274 |         start_time = time.time()
275 |         history = myModel.fit(MSI_train, epochs=100, batch_size=128, shuffle="batch")
276 |         myModel.save_weights(directory+'//'+'TrainedModel_'+str(i)+'.h5')
277 |     else:
278 |         myModel.load_weights(directory+'//'+'TrainedModel_'+str(i)+'.h5');
279 |     # 3. ============= Model Predictions:
280 |     encoded_imgs = encoder.predict(MSI_train) # Learned non-linear manifold
281 |     decoded_imgs = myModel.predict(MSI_train) # Reconstructed Data
282 |     dec_TIC = np.sum(decoded_imgs, axis=-1)
283 |     Latent_mean, Latent_var, Latent_z = encoded_imgs
284 | 
285 |     get_EncFeatures(Latent_z,Train_idx,ZCoord_cv,XCoord_cv,YCoord_cv,directory,i)
286 | 
287 |     # 4. ============= Plot Average Spectrum:
288 |     mse = mean_squared_error(MSI_train,decoded_imgs)
289 |     meanSpec_Rec = np.mean(decoded_imgs,axis=0)
290 |     print('mean squared error(mse) = ', mse)
291 |     meanSpec_Orig = np.mean(MSI_train,axis=0) # TIC-norm original MSI Data
292 |     N_DecImg = decoded_imgs/dec_TIC[:,None] # TIC-norm reconstructed MSI Data
293 |     meanSpec_RecTIC = np.mean(N_DecImg,axis=0)
294 | 
295 |     fig, ax = plt.subplots()
296 |     if TrainStatus==1: plt.plot(history.history['loss'])  # history exists only when the model was trained in this run
297 |     plt.ylabel('loss'); plt.xlabel('epoch')
298 |     if TrainStatus==1: print("--- %s seconds ---" % (time.time() - start_time))  # start_time exists only when training
299 |     plt.savefig(directory+'/'+'Convergence_TrainedModel_'+str(i)+'.tif')
300 | 
301 |     fig, ax = plt.subplots()
302 |     #plt.figure(figsize=(10, 3))
303 |     plt.plot(mzList,meanSpec_Orig,color = [0, 1, 0,1]); plt.plot(mzList,meanSpec_RecTIC,color = [1, 0, 0,0.6]);
304 |     plt.savefig(directory+'/'+'Overlay_Training_'+str(i)+'mse_'+str(mse)+'.tif')
305 | 
306 |     # 5. ============== Learn Peaks:
307 |     from LearnPeaks import *
308 |     W_enc = encoder.get_weights()
309 |     # Normalize the weights by multiplying them by the std of the original data variables
310 |     std_spectra = np.std(MSI_train, axis=0)
311 |     Beta = 2.5
312 |     Learned_mzBins, Learned_mzPeaks, mzBin_Indx, Real_PeakIdx = LearnPeaks(mzList, W_enc,std_spectra,latent_dim,Beta,meanSpec_Orig_AllData)
313 |     xls_writer = pd.ExcelWriter(directory+'/'+'Peaks_Training_'+str(i)+'.xlsx', engine='xlsxwriter')
314 |     df_1 = pd.DataFrame({'mz Bins': Learned_mzBins})
315 |     df_2 = pd.DataFrame({'mz Peaks': Learned_mzPeaks})
316 |     df_1.to_excel(xls_writer, sheet_name='Sheet'+str(i))
317 |     df_2.to_excel(xls_writer, sheet_name='Sheet'+str(i),startcol=3)
318 |     workbook = xls_writer.book
319 |     worksheet = xls_writer.sheets['Sheet'+str(i)]
320 |     xls_writer.close()  # flush the workbook to disk; without this the .xlsx file is never written
321 |     # 6. ============== Apply Clustering:
322 |     nClusters = 8
323 |     gmm = GaussianMixture(n_components=nClusters,random_state=0).fit(np.squeeze(Latent_z))
324 |     Labels = gmm.predict(np.squeeze(Latent_z))
325 |     get_gmmImage(Train_idx,Labels,nClusters,ZCoord_cv,XCoord_cv,YCoord_cv,directory,i)
326 | 
327 | hf_cv.close()
328 | 
329 | # 7. ========== Correlate Clusters with MSI Data (uses the results of the last fold):
330 | cluster_id = 7
331 | Corr_Val, CorrRank_ij,MSI_CleanPeaks = Correlate_Cluster_MSI(cluster_id,Labels,MSI_train,Real_PeakIdx,ZCoord_cv,XCoord_cv,YCoord_cv)
332 | 
333 | print('m/z', Learned_mzPeaks[CorrRank_ij[0:5]])
334 | print('corr_Value = ', Corr_Val[CorrRank_ij[0:5]])
335 | plt.plot(Learned_mzPeaks,Corr_Val)
336 | 
337 | # Visualize Correlated Peak at section z:
338 | mzID = CorrRank_ij[0];
339 | myzSections = np.unique(ZCoord_cv)
340 | zr=0
341 | ij_r = np.argwhere(ZCoord_cv == myzSections[zr])
342 | indx = ij_r[:,0]
343 | xLoc = XCoord_cv[indx]
344 | yLoc = YCoord_cv[indx]
345 | MSI_2D = np.squeeze(MSI_train[indx,mzID])
346 | im_mz = Image_Distribution(MSI_2D,xLoc,yLoc);
347 | plt.imshow(im_mz);
348 | mz_Peak = Learned_mzPeaks[mzID]
349 | print('m/z Peak = ',mz_Peak)
350 | 
351 | # ============ Get 3D m/z image =================
352 | mzValue = 13972.1
353 | mzId = np.argmin(np.abs(mzList[:] - mzValue))
354 | Get_3Dmz_nifti(Combined_MSI[:,mzId],mzValue,XCoord,YCoord,ZCoord,directory)
355 | 
356 | # ============== Load the peaks learned by all trained models:
357 | myBeta = 1.5
358 | ALL_Peaks_Train = pd.read_excel(directory+'//Peaks_Learned//Beta_'+str(myBeta)+'//Peaks_AllModels_Trained.xlsx')
359 | ALL_Peaks_Train = np.squeeze(np.asarray(ALL_Peaks_Train))
360 | ALL_Peaks_Train = np.nan_to_num(ALL_Peaks_Train)
361 | 
362 | My_marker= ['v', '*','d','^','s']
363 | Point_Color = plt.cm.jet(np.linspace(0, 1, ALL_Peaks_Train.shape[1]))
364 | plt.figure(figsize=(25, 5))
365 | plt.plot(mzList,meanSpec_Orig_AllData,linewidth=3,c='black');
366 | for ij in range(ALL_Peaks_Train.shape[1]):
367 |     Peaks_Train = ALL_Peaks_Train[:,ij]
368 |     Peaks_Train = Peaks_Train[Peaks_Train !=0]
369 |     Train_Peaks_Loc = [np.argmin(np.abs(mzList[:] - Peaks_Train[idx])) for idx in range(len(Peaks_Train))]
370 |     Mean_PickedPeakst = np.mean(Combined_MSI[:,Train_Peaks_Loc],axis=0)
371 |     plt.scatter(Peaks_Train,Mean_PickedPeakst,marker=My_marker[ij],c=Point_Color[ij]);
372 | 
373 | 
374 | Peaks_Vector = ALL_Peaks_Train.reshape(ALL_Peaks_Train.shape[0]*ALL_Peaks_Train.shape[1])
375 | Peaks_Vector_NoZero = Peaks_Vector[Peaks_Vector !=0]
376 | U_Peaks = np.unique(Peaks_Vector_NoZero)
377 | L = len(np.unique(U_Peaks))
378 | #plt.figure(figsize=(20, 5))
379 | n, bins, patches = plt.hist(x=Peaks_Vector_NoZero, bins=U_Peaks); plt.show()
380 | 
381 | #=== Scatter Plot Frequency:
382 | from matplotlib.ticker import MaxNLocator
383 | New_n = np.append(n,1)
384 | fig, ax = plt.subplots()
385 | ax.yaxis.set_major_locator(MaxNLocator(integer=True))
386 | #plt.figure(figsize=(30, 5))
387 | plt.scatter(U_Peaks,New_n,c=New_n)
388 | plt.xlabel("m/z")
389 | plt.ylabel("Frequency")
390 | 
391 | # ====== Bar Plot: Counts of Peak Frequency
392 | N_Freq = [len(np.argwhere(n==ij)) for ij in np.unique(n)]
393 | xValue = [ij for ij in np.unique(n)]
394 | colors = plt.cm.jet(np.linspace(0, 1, len(np.unique(n))))
395 | plt.bar(xValue, N_Freq,color=colors)
396 | plt.xlabel("Frequency")
397 | plt.ylabel("Count")
398 | 
399 | 

--------------------------------------------------------------------------------
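Both scripts above rank m/z peaks by correlating a binary cluster mask with each peak's per-spectrum intensity (the `stats.pearsonr(Kimg, ...)` loops). The following is a minimal, dependency-free sketch of that step, added for illustration only. The helper names `pearson_r` and `rank_peaks_by_cluster` and the toy data are hypothetical, and, unlike `scipy.stats.pearsonr`, this sketch returns 0.0 for a constant vector as a simplifying assumption.

```python
import math


def pearson_r(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("inputs must have the same length (>= 2)")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def rank_peaks_by_cluster(labels, cluster_id, peak_images):
    """Rank peaks by correlation between their intensities and a cluster mask."""
    mask = [1 if lab == cluster_id else 0 for lab in labels]  # same role as Kimg
    scored = [(name, pearson_r(mask, values)) for name, values in peak_images.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: six spectra with cluster labels and three hypothetical peaks.
toy_labels = [2, 2, 1, 3, 2, 1]
toy_peaks = {
    "mz_a": [5, 6, 1, 0, 5, 1],   # high inside cluster 2
    "mz_b": [1, 0, 4, 5, 1, 5],   # high outside cluster 2
    "mz_c": [2, 2, 2, 3, 2, 2],   # nearly flat
}
ranked = rank_peaks_by_cluster(toy_labels, 2, toy_peaks)
for name, r in ranked:
    print("%s  r = %0.4f" % (name, r))
```

The first entry of `ranked` plays the role of `id_mzCorr` / `CorrRank_ij[0]` in the scripts: the peak whose spatial distribution is most co-localized with the selected cluster.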