├── Software License for Peak Learning.docx
├── Readme.txt
├── README.md
├── LearnPeaks.py
├── Computational_Model.py
├── msiPL_ForTesting.py
├── msiPL_Run.py
└── msiPL_Run_CrossValid_3DKidney.py


/Software License for Peak Learning.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wabdelmoula/msiPL/HEAD/Software License for Peak Learning.docx


--------------------------------------------------------------------------------
/Readme.txt:
--------------------------------------------------------------------------------
 1 | This readme file shows how to properly run the msiPL code (Abdelmoula et al.):
 2 | Walid Abdelmoula et al., msiPL: Non-linear Manifold and Peak Learning of Mass Spectrometry Imaging Data Using Artificial Neural Networks, bioRxiv, 2020 
 3 | 
 4 | License: The Peak Learning software (msiPL) will be shared using the 3D Slicer Software License agreement.
 5 | 
 6 | ---------------------- Installations: Software and Libraries --------------
 7 | We have implemented our machine learning model using the following software items:
 8 | 	1- Python(3.6.4)
 9 | 	2- Keras (2.1.5-tf) with a Tensorflow(1.8.0) backend.
10 | 	3- Packages: numpy(1.14.2), sklearn(0.19.1), scipy(1.0.0), and h5py(2.7.1)
11 | 	4- We implemented this model on a Windows 10 PC workstation (Intel Xeon 3.3GHz, 512 GB RAM, 64-bit Windows, 2 GPUs NVIDIA TITAN Xp).
12 | -----------------------------------------------------------------------------
13 | 
14 | ---------------------------------- Demo --------------------------------------
15 | How to run the code?
16 | 	1- "msiPL_Run.py" is the main file that you should run first. Run it sequentially; the comments in the file
17 | 	provide instructions and guidance. In this file you will be able to:
18 | 		1.1. Load a dataset.
19 | 		1.2. Load the computational neural network architecture (VAE_BN).
20 | 		1.3. Train the model.
21 | 		1.4. Non-linear manifold learning and data visualization (non-linear dimensionality reduction)
22 | 		1.5. Evaluate the learning quality by estimation and reconstruction of the original data
23 | 		1.6. Peak learning (Equation#4): to get a smaller list of informative peaks.
24 | 		1.7. Perform data clustering (GMM).
25 | 		1.8. Identify localized peaks within each cluster.
26 | 		
27 | 	2- "Computational_Model.py": implementation of the fully connected variational autoencoder, regularized
28 | 	    with batch normalization.
29 | 	
30 | 	3- "LearnPeaks.py": implementation of a function that identifies peaks of interest. 
31 | 		It should be called after training the model, as instructed in "msiPL_Run.py".
32 | 		
33 | 	4- "msiPL_ForTesting.py": ultra-fast analysis on test data without any prior peak picking.
34 | 		You first need to load the trained model produced in step#1 ("msiPL_Run.py").
35 | 
36 | ------------------------------------------------------------------------------------
37 | We provide a sample of publicly available MSI data to train and test the model and to ensure reproducibility.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | [![DOI](https://zenodo.org/badge/287381023.svg)](https://zenodo.org/badge/latestdoi/287381023)
 2 | 
 3 | **msiPL**
 4 | ---------
 5 | Deep Learning based implementation for analysis of mass spectrometry imaging data
 6 | 
 7 | This readme file shows how to properly run the msiPL code 
 8 | 
 9 | **Paper:** Walid Abdelmoula et al, msiPL: Non-linear Manifold and Peak Learning of Mass Spectrometry Imaging Data Using Artificial Neural Networks, bioRxiv, 2020 
10 | 
11 | **License:** The msiPL code is shared under the 3D Slicer Software License agreement.
12 | 
13 | **Installations: Software and Libraries** 
14 | --------
15 | 
16 | We have implemented our machine learning model using the following software items:
17 | 
18 | 1- Python(3.6.4)
19 | 
20 | 2- Keras (2.1.5-tf) with a Tensorflow(1.8.0) backend.
21 | 
22 | 3- Packages: numpy(1.14.2), sklearn(0.19.1), scipy(1.0.0), and h5py(2.7.1)
23 | 
24 | 4- We implemented this model on a Windows 10 PC workstation (Intel Xeon 3.3GHz, 512 GB RAM, 64-bit Windows, 2 GPUs NVIDIA TITAN Xp).
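
Because the code targets specific library versions, it can help to compare the installed packages against the list above before running anything. The following is a minimal sketch, not part of msiPL: the helper name `check_versions` and the `EXPECTED` table are hypothetical, and the keys are assumed to be the import names (`sklearn` rather than `scikit-learn`).

```python
import importlib

# Versions copied from the list above; keys are import names (an assumption of this sketch).
EXPECTED = {
    "numpy": "1.14.2",
    "sklearn": "0.19.1",
    "scipy": "1.0.0",
    "h5py": "2.7.1",
    "tensorflow": "1.8.0",
}


def check_versions(expected):
    """Return {name: (status, found)} where status is 'ok', 'mismatch' or 'missing'."""
    report = {}
    for name, wanted in expected.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package cannot be imported at all.
            report[name] = ("missing", None)
            continue
        # Modules without a __version__ attribute are reported as a mismatch.
        found = getattr(module, "__version__", None)
        report[name] = ("ok" if found == wanted else "mismatch", found)
    return report
```

Calling `check_versions(EXPECTED)` and printing the result gives a quick overview; a `mismatch` does not necessarily mean the code will fail, only that the environment differs from the one the authors used.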
25 | 	
26 |  Demo 
27 |  ---------------
28 |  
29 | * How to run the code?
30 | 
31 | 	1- "msiPL_Run.py" is the main file that you should run first. Run it sequentially; the comments in the file
32 | 	provide instructions and guidance. In this file you will be able to:
33 | 	
34 | 		1.1. Load a dataset.
35 | 		1.2. Load the computational neural network architecture (VAE_BN).
36 | 		1.3. Train the model.
37 | 		1.4. Non-linear manifold learning and data visualization (non-linear dimensionality reduction)
38 | 		1.5. Evaluate the learning quality by estimation and reconstruction of the original data
39 | 		1.6. Peak learning (Equation#4): to get a smaller list of informative peaks.
40 | 		1.7. Perform data clustering (GMM).
41 | 		1.8. Identify localized peaks within each cluster.
42 | 		
43 | 	2- "Computational_Model.py": implementation of the fully connected variational autoencoder, regularized
44 | 	    with batch normalization.
45 | 	
46 | 	3- "LearnPeaks.py": implementation of a function that identifies peaks of interest. 
47 | 		It should be called after training the model, as instructed in "msiPL_Run.py".
48 | 		
49 | 	4- "msiPL_ForTesting.py": ultra-fast analysis on test data without any prior peak picking.
50 | 		You first need to load the trained model produced in step#1 ("msiPL_Run.py").
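
The peak-learning step listed above keeps the m/z bins whose std-scaled encoder weight reaches `mean + Beta*std`, as done inside "LearnPeaks.py". The sketch below isolates only that thresholding rule on toy data; the function name `select_peaks` and the synthetic weights are illustrative assumptions, and the real function additionally loops over latent dimensions and snaps bins to local maxima of the mean spectrum.

```python
import numpy as np


def select_peaks(weights, std_spectra, beta=2.5):
    """Return indices of bins whose std-scaled weight is >= mean + beta*std, ranked descending."""
    scaled = np.asarray(std_spectra) * np.asarray(weights)
    threshold = scaled.mean() + beta * scaled.std()
    ranked = np.argsort(scaled)[::-1]  # indices sorted by descending scaled weight
    return ranked[scaled[ranked] >= threshold]


# Toy data (hypothetical): 100 bins with small weights and two strong bins at 10 and 42.
rng = np.random.RandomState(0)
weights = rng.normal(0.0, 0.01, size=100)
weights[10], weights[42] = 1.0, 0.8
peak_ids = select_peaks(weights, std_spectra=np.ones(100), beta=2.5)
```

A lower `beta` admits more bins; "msiPL_Run.py" uses 2.5 and notes that values in the range [1, 2.5] performed well.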
51 | 
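
The last step of the list, identifying localized peaks within a cluster, amounts to correlating a binary cluster mask with each learned peak's intensities and ranking the peaks by that correlation. The sketch below illustrates the idea on a tiny hand-made example; it uses `np.corrcoef` where the scripts use `scipy.stats.pearsonr`, and the name `rank_peaks_by_cluster` and the toy arrays are assumptions made for illustration.

```python
import numpy as np


def rank_peaks_by_cluster(labels, intensities, cluster_id):
    """Rank columns of `intensities` (pixels x peaks) by Pearson correlation with the cluster mask."""
    mask = (np.asarray(labels) == cluster_id).astype(float)
    corr = np.array([np.corrcoef(mask, intensities[:, j])[0, 1]
                     for j in range(intensities.shape[1])])
    order = np.argsort(corr)[::-1]  # most co-localized peak first
    return order, corr


# Toy data (hypothetical): 6 pixels, 3 peaks; peak 0 is bright only in cluster 2.
labels = np.array([1, 1, 2, 2, 3, 3])
intensities = np.array([
    [0.1, 0.9, 0.5],
    [0.2, 0.8, 0.4],
    [0.9, 0.1, 0.5],
    [1.0, 0.2, 0.6],
    [0.1, 0.9, 0.4],
    [0.2, 0.8, 0.6],
])
order, corr = rank_peaks_by_cluster(labels, intensities, cluster_id=2)
```

Here `order[0]` is the index of the peak most co-localized with cluster 2, and `corr` holds the per-peak correlation values.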
52 | If you used this implementation:
53 | ------
54 | please cite the paper by Abdelmoula et al, msiPL: https://www.biorxiv.org/content/10.1101/2020.08.13.250142v1.abstract
55 | 


--------------------------------------------------------------------------------
/LearnPeaks.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | """
 3 | Implementation of msiPL (Abdelmoula et al): Identify informative peaks
 4 | 
 5 |     - This function should be called after training the model
 6 |     - Briefly: this is a backpropagation-based threshold analysis of the trained neural network weights (see Equation#4).
 7 |       It identifies the m/z values that contribute strongly to the learned non-linear manifold (encoded features).
 8 | """
 9 | import numpy as np
10 | from scipy.signal import argrelextrema
11 | 
12 | 
13 | def LearnPeaks(All_mz, W_enc, std_spectra, latent_dim,Beta,meanSpec_Orig):
14 |     W1 = W_enc[0] # Connected with the input layer
15 |     W2 = W_enc[6] # z_mean Layer
16 |     for EncID in range(latent_dim):
17 |         W2_EncFeat1 = W2[:,EncID]          
18 |         Act_Neuron_W2 = np.argsort(-W2_EncFeat1) #Note: -ve is used to sort descending
19 |         W2_EncFeat1[Act_Neuron_W2[0]]
20 |         Neuron_W1 = W1[:,Act_Neuron_W2[0]]
21 |         Weights_norm_W1 = std_spectra*Neuron_W1
22 |         ij =  np.argsort(Weights_norm_W1)[::-1]
23 |         Weights_norm_W1 = np.sort(Weights_norm_W1)[::-1]
24 |         Weights_norm_W1[0]
25 |         
26 |     # ======== Threshold Weights mean + Beta*std:
27 |         T = np.mean(Weights_norm_W1) + Beta*np.std(Weights_norm_W1)
28 |         PeakID = ij[np.argwhere(Weights_norm_W1 >= T)]; PeakID = PeakID[:,0] #Ranked indices
29 |         
30 |     # ======== Get union list of m/z from all encoded features ========
31 |         Enc_mz = [All_mz[i] for i in PeakID]
32 |         if EncID==0:
33 |             Learned_mzBins = []
34 |             Common_PeakID = []
35 |         Learned_mzBins = list(set().union(Enc_mz , Learned_mzBins))
36 |         Common_PeakID = list(set().union(PeakID , Common_PeakID))
37 |         
38 |         if EncID==latent_dim-1:
39 |             Learned_mzBins = np.sort(Learned_mzBins)
40 |             Common_PeakID = np.sort(Common_PeakID)
41 |         
42 |     LocalMax = np.squeeze(np.transpose(argrelextrema(meanSpec_Orig, np.greater))) 
43 |     mz_LocalMax = [All_mz[i] for i in LocalMax]
44 |     Nearest_Peakindx = [np.argmin(np.abs(mz_LocalMax[:] - Learned_mzBins[i])) for i in  range(len(Learned_mzBins))]
45 |     Peak_Indx = np.unique(Nearest_Peakindx)
46 |     Learned_mzPeaks = [mz_LocalMax[i] for i in Peak_Indx]
47 |     Learned_mzPeaks = np.asarray(Learned_mzPeaks)
48 |     
49 |     Real_PeakIdx = [np.argmin(np.abs(All_mz[:] - Learned_mzPeaks[i])) for i in  range(len(Learned_mzPeaks))]
50 | 
51 | 
52 |     return Learned_mzBins, Learned_mzPeaks, Common_PeakID,Real_PeakIdx 
53 | 
54 | 


--------------------------------------------------------------------------------
/Computational_Model.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | """
 3 | Implementation of msiPL (Abdelmoula et al): Neural Network Architecture (VAE_BN)
 4 | 
 5 |     Keras-based implementation of a fully connected variational autoencoder
 6 |     equipped with Batch normalization to correct for covariate shift and improve learning stability
 7 | 
 8 | """
 9 | 
10 | import numpy as np
11 | from keras.layers import Lambda, Input, Dense, ReLU, BatchNormalization
12 | from keras.models import Model
13 | from keras.losses import  categorical_crossentropy
14 | from keras.utils import plot_model
15 | from keras import backend as K
16 | 
17 | 
18 | class VAE_BN(object):
19 |     
20 |     def __init__ (self, nSpecFeatures,  intermediate_dim, latent_dim):
21 |         self.nSpecFeatures = nSpecFeatures
22 |         self.intermediate_dim = intermediate_dim
23 |         self.latent_dim = latent_dim
24 |         
25 |     def sampling(self, args):
26 |         """
27 |         Reparameterization trick by sampling from a continuous function (Gaussian with an auxiliary variable ~N(0,1)).
28 |         [see Our methods and for more details see arXiv:1312.6114]
29 |         """
30 |         self.z_mean, self.z_log_var = args
31 |         self.batch = K.shape(self.z_mean)[0]
32 |         self.dim = K.int_shape(self.z_mean)[1]
33 |         self.epsilon = K.random_normal(shape=(self.batch, self.dim)) # random_normal (mean=0 and std=1)
34 |         return self.z_mean + K.exp(0.5 * self.z_log_var) * self.epsilon
35 |     
36 | 
37 |     def get_architecture(self):
38 |         # =========== 1. Encoder Model================
39 |         input_shape = (self.nSpecFeatures, )
40 |         inputs = Input(shape=input_shape, name='encoder_input')
41 |         h = Dense(self.intermediate_dim)(inputs)
42 |         h = BatchNormalization()(h)
43 |         h = ReLU()(h)
44 |         z_mean = Dense(self.latent_dim, name = 'z_mean')(h)
45 |         z_mean = BatchNormalization()(z_mean)
46 |         z_log_var = Dense(self.latent_dim, name = 'z_log_var')(h)
47 |         z_log_var = BatchNormalization()(z_log_var)
48 |         
49 |         # Reparameterization trick:
50 |         z = Lambda(self.sampling, output_shape = (self.latent_dim,), name='z')([z_mean, z_log_var])
51 |         encoder = Model(inputs, [z_mean, z_log_var, z], name = 'encoder')
52 |         print("==== Encoder Architecture...")
53 |         encoder.summary()
54 |         # plot_model(encoder, to_file='VAE_BN_encoder.png', show_shapes=True)
55 |         
56 |         # =========== 2. Decoder Model================
57 |         latent_inputs = Input(shape = (self.latent_dim,), name='Latent_Space')
58 |         hdec = Dense(self.intermediate_dim)(latent_inputs)
59 |         hdec = BatchNormalization()(hdec)
60 |         hdec = ReLU()(hdec)
61 |         outputs = Dense(self.nSpecFeatures, activation = 'sigmoid')(hdec)
62 |         decoder = Model(latent_inputs, outputs, name = 'decoder')
63 |         print("==== Decoder Architecture...")
64 |         decoder.summary()       
65 |         # plot_model(decoder, to_file='VAE_BN__decoder.png', show_shapes=True)
66 |         
67 |         #=========== VAE_BN: Encoder_Decoder ================
68 |         outputs = decoder(encoder(inputs)[2])
69 |         VAE_BN_model = Model(inputs, outputs, name='VAE_BN')
70 |         
71 |         # ====== Cost Function (Variational Lower Bound)  ==============
72 |         "KL-div (regularizes encoder) and reconstruction loss (of the decoder): see equation(3) in our paper"
73 |         # 1. KL-Divergence:
74 |         kl_Loss = 1 + self.z_log_var - K.square(self.z_mean) - K.exp(self.z_log_var)
75 |         kl_Loss = K.sum(kl_Loss, axis=-1)
76 |         kl_Loss *= -0.5
77 |         # 2. Reconstruction Loss
78 |         reconstruction_loss = categorical_crossentropy(inputs,outputs) # Use sigmoid at output layer
79 |         reconstruction_loss *= self.nSpecFeatures
80 |         
81 |         # ========== Compile VAE_BN model ===========
82 |         model_Loss = K.mean(reconstruction_loss + kl_Loss)
83 |         VAE_BN_model.add_loss(model_Loss)
84 |         VAE_BN_model.compile(optimizer='adam')
85 |         return VAE_BN_model, encoder
86 | 
87 | 
88 | 
89 | 
90 | 
91 | 
92 | 


--------------------------------------------------------------------------------
/msiPL_ForTesting.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | """
  3 | Implementation of msiPL (Abdelmoula et al): Ultra-fast Test Data Analysis
  4 | 
  5 |     Our trained msiPL model is applied on new unseen test data which was withheld
  6 |     from a large 3D MSI datacube. For the analysis of 3D MSI data, msiPL provides:
  7 |     - Ultra-fast Analysis (just a few seconds)
  8 |     - Memory efficient: unlike conventional methods there is no need to load 
  9 |       the full complex 3D MSI at once into the RAM. 
 10 |                         
 11 | """
 12 | 
 13 | from __future__ import absolute_import  # __future__ imports must come first, or Python raises a SyntaxError
 14 | from __future__ import division
 15 | from __future__ import print_function
 16 | import numpy as np
 17 | np.random.seed(1337)
 18 | from tensorflow import set_random_seed
 19 | set_random_seed(2)
 20 | 
 21 | import os
 22 | import h5py
 23 | import matplotlib.pyplot as plt
 24 | from sklearn.mixture import GaussianMixture
 25 | from sklearn.metrics import mean_squared_error
 26 | from scipy import stats
 27 | from matplotlib.colors import LinearSegmentedColormap
 28 | import time
 29 | 
 30 | 
 31 | # ========= Load MSI Data without prior peak picking (hdf5 format) ==========
 32 | f =  h5py.File('Test_Data/MouseKindey_z73.h5','r')  
 33 | MSI_test = f["Data"]  
 34 | All_mz = f["mzArray"]  
 35 | nSpecFeatures = len(All_mz)
 36 | xLocation = np.array(f["xLocation"]).astype(int)
 37 | yLocation = np.array(f["yLocation"]).astype(int)
 38 | col = max(np.unique(xLocation))
 39 | row = max(np.unique(yLocation))
 40 | im = np.zeros((col,row))
 41 | mzId = np.argmin(np.abs(All_mz[:] - 6227.9))
 42 | for i in range(len(xLocation)):
 43 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_test[i,mzId] #image index starts at 0 not 1
 44 | plt.imshow(im);plt.colorbar()
 45 | 
 46 | # ====== Load VAE_BN: a fully-connected neural network model ========
 47 | from Computational_Model import *
 48 | input_shape = (nSpecFeatures, )
 49 | intermediate_dim = 512
 50 | latent_dim = 5
 51 | VAE_BN_Model = VAE_BN(nSpecFeatures,  intermediate_dim, latent_dim)
 52 | myModel, encoder = VAE_BN_Model.get_architecture()
 53 | myModel.summary()
 54 | 
 55 | # ================ Load The Trained Model =====================
 56 | myModel.load_weights('TrainedModel_Kidney_Z1.h5')
 57 | 
 58 | 
 59 | # *****************************************************************************
 60 | # ****************** Ultra Fast Analysis on new unseen data *****************
 61 | 
 62 | # ============= 1. Manifold Learning and  Model Predictions ===============
 63 | start_time = time.time()
 64 | encoded_imgs = encoder.predict(MSI_test) # Learned non-linear manifold
 65 | decoded_imgs = myModel.predict(MSI_test) # Reconstructed Data
 66 | print("--- %s seconds : Ultra-Fast, isn't it?" % (time.time() - start_time))
 67 | dec_TIC = np.sum(decoded_imgs, axis=-1)
 68 | 
 69 | # ======= 2. Compare Original and Reconstructed (inferred) Data ========
 70 | mse = mean_squared_error(MSI_test,decoded_imgs)
 71 | meanSpec_Rec = np.mean(decoded_imgs,axis=0) 
 72 | print('mean squared error(mse)  = ', mse)
 73 | meanSpec_Orig = np.mean(MSI_test,axis=0) # TIC-norm original MSI Data
 74 | N_DecImg = decoded_imgs/dec_TIC[:,None]  # TIC-norm reconstructed MSI  Data
 75 | meanSpec_RecTIC = np.mean(N_DecImg,axis=0)
 76 | plt.plot(All_mz,meanSpec_Orig); plt.plot(All_mz,meanSpec_RecTIC,color = [1.0, 0.5, 0.25]); 
 77 | plt.title('TIC-norm distribution of average spectrum: Original and Predicted')
 78 | 
 79 | 
 80 | # ======== 3. Model Parameters of the Latent Space ==========
 81 | Latent_mean, Latent_var, Latent_z = encoded_imgs
 82 | 
 83 | # ======== 4. Non-linear dimensionality Reduction  ==========
 84 | ndim = Latent_z.shape[1]
 85 | plt.figure(figsize=(14, 14))
 86 | for j in range(ndim):
 87 |     for i in range(len(xLocation)):
 88 |         im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = Latent_z[i,j]  
 89 |     ax = plt.subplot(1, ndim, j + 1)    
 90 |     plt.imshow(im,cmap="hot");  # plt.colorbar()   
 91 |     ax.get_xaxis().set_visible(False)
 92 |     ax.get_yaxis().set_visible(False)
 93 |  
 94 | # ========= 5. Visualize Original & Reconstructed (inferred) m/z images ==========
 95 | mzs = [2489.6,6627.9,8981.4,13961.2]  
 96 | directory = 'Results\\test'          
 97 | if not os.path.exists(directory):
 98 |     os.makedirs(directory)    
 99 | for indx in range(0,len(mzs)):
100 |     mzId = np.argmin(np.abs(All_mz[:] - mzs[indx]))     
101 |     for i in range(len(xLocation)):
102 |         im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = N_DecImg[i,mzId] # Reconstructed TIC-norm m/z image
103 | #        im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_test[i,mzId] # Original TIC-norm m/z image
104 |     ax = plt.subplot(1, len(mzs), indx + 1)    
105 |     plt.imshow(im);  # plt.colorbar()   
106 |     ax.get_xaxis().set_visible(False)
107 |     ax.get_yaxis().set_visible(False)
108 |     plt.imsave(directory + '\\mz' + str(All_mz[mzId]) + '_Rec.jpg',im)
109 | #    plt.imsave(directory + '\\mz' + str(All_mz[mzId]) + '_Orig.jpg',im)
110 | 
111 | # *****************************************************************************
112 | #********************* 6. Peak Learning (Manuscript Equation#4) ***************    
113 | # Statistical analysis of the trained neural network weights (learned parameters)
114 | from LearnPeaks import *
115 | W_enc = encoder.get_weights()
116 | # Normalize Weights by multiplying it with std of original data variables
117 | std_spectra = np.std(MSI_test, axis=0) 
118 | Beta = 2.5
119 | Learned_mzBins, Learned_mzPeaks, mzBin_Indx, Real_PeakIdx = LearnPeaks(All_mz, W_enc,std_spectra,latent_dim,Beta,meanSpec_Orig)
120 | 
121 | 
122 | 
123 | # *****************************************************************************
124 | # ========= Color Map ==============                                      
125 | def discrete_cmap(N, base_cmap=None):
126 |     """Create an N-bin discrete colormap from the specified input map"""
127 |     base = plt.cm.get_cmap(base_cmap)
128 |     color_list = base(np.linspace(0, 1, N))
129 |     cmap_name = base.name + str(N)
130 |     return base.from_list(cmap_name, color_list, N)
131 | 
132 | # *********************** Downstream Data Analysis ****************************
133 | # Data Clustering using GMM: applied on the encoded features "Latent_z" 
134 | # Peak Localization for each cluster 
135 | nClusters = 7
136 | gmm = GaussianMixture(n_components=nClusters,random_state=0).fit(Latent_z)
137 | labels = gmm.predict(Latent_z)
138 | labels +=1 # To avoid conflict with the natural background value of 0
139 | for i in range(len(xLocation)):
140 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = labels[i]
141 | MyCmap = discrete_cmap(nClusters+1, 'jet')
142 | plt.imshow(im,cmap=MyCmap);
143 | plt.colorbar(ticks=np.arange(0,nClusters+1,1))
144 | plt.axis('off')
145 | 
146 | 
147 | # ======= Select a cluster of interest and correlate with the Learned_mzPeaks ===============
148 | # 1. Select cluster:
149 | cluster_id = 6
150 | Kimg = labels==cluster_id
151 | Kimg = Kimg.astype(int)
152 | 
153 | for i in range(len(xLocation)):
154 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = Kimg[i]
155 | segCmp = [MyCmap(0),MyCmap(cluster_id)]
156 | cm = LinearSegmentedColormap.from_list('Walid_cmp',segCmp,N=2)
157 | plt.imshow(im, cmap=cm);
158 | plt.axis('off')
159 | 
160 | # 2. Correlate the selected cluster with the Learned_mzPeaks:
161 | Peaks_ID = [np.argmin(np.abs(All_mz[:] - Learned_mzPeaks[i])) for i in  range(len(Learned_mzPeaks))]
162 | MSI_PeakList = MSI_test[:,Peaks_ID[:]] # keep MSI data only for the shortlisted learned m/z peaks
163 | Corr_Val =  np.zeros(len(Learned_mzPeaks))
164 | for i in range(len(Learned_mzPeaks)):
165 |     Corr_Val[i] = stats.pearsonr(Kimg,MSI_PeakList[:,i])[0]
166 | id_mzCorr = np.argmax(Corr_Val)
167 | rank_ij =  np.argsort(Corr_Val)[::-1]
168 | 
169 | for i in range(len(xLocation)):
170 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_PeakList[i,id_mzCorr]  
171 | plt.imshow(im)
172 | plt.axis('off')
173 | print('m/z', Learned_mzPeaks[id_mzCorr])
174 | print('corr_Value = ', Corr_Val[id_mzCorr])
175 | 
176 | plt.plot(Learned_mzPeaks,Corr_Val)
177 | print(['%0.4f' % i for i in Learned_mzPeaks[rank_ij[0:10]]])
178 | print('Correlation Top 10 Ranked peaks:', end='')
179 | print(['%0.4f' % i for i in Corr_Val[rank_ij[0:10]]])  
180 | 


--------------------------------------------------------------------------------
/msiPL_Run.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | """
  3 | Implementation of msiPL (Abdelmoula et al): Model Training
  4 | This is the main file to run
  5 | This implementation is based on the public AI platforms of Keras and Tensorflow.
  6 | 
  7 | How to run the code?
  8 | 	1- "msiPL_Run.py" is the main file that you should run first. The file should be running in a sequential manner, and we have
  9 | 	provided required comments for instructions and guidance. In this file you will be able to:
 10 | 		1.1. Load a dataset.
 11 | 		1.2. Load the computational neural network architecture (VAE_BN).
 12 | 		1.3. Train the model.
 13 | 		1.4. Non-linear manifold learning and data visualization (non-linear dimensionality reduction)
 14 | 		1.5. Evaluate the learning quality by estimation and reconstruction of the original data
 15 | 		1.6. Peak learning (Equation#4): to get a smaller list of informative peaks.
 16 | 		1.7. Perform data clustering (GMM): The number of clusters can be set by the users or automatically using the BIC method.
 17 | 		1.8. Identify localized peaks within each cluster.
 18 | 
 19 | """
 20 | from __future__ import absolute_import
 21 | from __future__ import division
 22 | from __future__ import print_function
 23 | 
 24 | import numpy as np
 25 | np.random.seed(1337)
 26 | from tensorflow import set_random_seed
 27 | set_random_seed(2)
 28 | 
 29 | import os
 30 | import h5py
 31 | import matplotlib.pyplot as plt
 32 | from sklearn.mixture import GaussianMixture
 33 | from sklearn.metrics import mean_squared_error
 34 | from scipy import stats
 35 | from matplotlib.colors import LinearSegmentedColormap
 36 | import time
 37 | 
 38 | # ========= Color Map ==============                                      
 39 | def discrete_cmap(N, base_cmap=None):
 40 |     """Create an N-bin discrete colormap from the specified input map"""
 41 |     base = plt.cm.get_cmap(base_cmap)
 42 |     color_list = base(np.linspace(0, 1, N))
 43 |     cmap_name = base.name + str(N)
 44 |     return base.from_list(cmap_name, color_list, N)
 45 | 
 46 | # ========= Load MSI Data without prior peak picking (hdf5 format) ==========
 47 | """ The MSI data is loaded as an hdf5 file to maintain efficiency.
 48 | 	We have noticed high efficiency in memory usage and fast performance in accessing
 49 | 	high dimensional data from huge files when dealing with HDF5 formats as opposed to imzML.
 50 | You can convert the imzML to hdf5 using the following steps:
 51 | 	- first install the python package "h5py" and an imzML parser (for example "pyimzML")
 52 | 	- Load the imzML file to get: 
 53 | 		a- spectral data, and save it in a variable called, for example, "Spec_Data", 
 54 | 		b- spatial information for each spectrum, and save it in "XCoord" and "YCoord".
 55 | 	- use the h5py method called "create_dataset" to save your data in h5. For example:
 56 | 		- myHF = h5py.File("myData.h5",'w')
 57 | 		- myHF.create_dataset('Data', data=Spec_Data)
 58 | 		- myHF.create_dataset('xLocation', data=XCoord)  (likewise 'yLocation' from YCoord and 'mzArray' from the m/z axis, which this script also reads)
 59 | 		- After you finish, close your h5 file: "myHF.close()"
 60 | """
 61 | f =  h5py.File('Training_Data/MouseKindey_z1.h5','r')  
 62 | MSI_train = f["Data"]  # spectral information.
 63 | All_mz = f["mzArray"]  
 64 | nSpecFeatures = len(All_mz)
 65 | if MSI_train.shape[1] != nSpecFeatures:
 66 |     MSI_train = np.transpose(MSI_train)
 67 | xLocation = np.array(f["xLocation"]).astype(int)
 68 | yLocation = np.array(f["yLocation"]).astype(int)
 69 | col = max(np.unique(xLocation))
 70 | row = max(np.unique(yLocation))
 71 | im = np.zeros((col,row))
 72 | mzId = np.argmin(np.abs(All_mz[:] - 6227.9))
 73 | for i in range(len(xLocation)):
 74 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_train[i,mzId] #image index starts at 0 not 1
 75 | plt.imshow(im);plt.colorbar()
 76 | 
 77 | # ====== Load VAE_BN: a fully-connected neural network model ========
 78 | """ myModel represents the msiPL architecture """
 79 | from Computational_Model import *
 80 | input_shape = (nSpecFeatures, )
 81 | intermediate_dim = 512 # Size of the first hidden layer
 82 | # ---- dimensions of the latent space (i.e. encoded features)
 83 | ans = int(input('The default value of the latent space dimensions is 5, would you like to set a different value? Yes=1; No=0 …:'))
 84 | if ans == 1:
 85 |     latent_dim = int (input('Please set the new dimensions of the latent space = '))
 86 | else:
 87 |     latent_dim = 5
 88 | # -------------------------------------
 89 | # Compile the msiPL computational model
 90 | VAE_BN_Model = VAE_BN(nSpecFeatures,  intermediate_dim, latent_dim)
 91 | myModel, encoder = VAE_BN_Model.get_architecture()
 92 | myModel.summary()
 93 | 
 94 | # ============= Model Training =================
 95 | """ The training process involves: 
 96 | 	epochs: 100 passes over the training data
 97 | 	batch_size: a randomly-shuffled subset of 128 spectra is loaded at a time into the RAM 
 98 | 	This phase will run faster if a GPU is utilized
 99 |  """
100 | try:
101 |     start_time = time.time()
102 |     history = myModel.fit(MSI_train, epochs=100, batch_size=128, shuffle="batch")   
103 |     plt.plot(history.history['loss'])
104 |     plt.ylabel('loss'); plt.xlabel('epoch')
105 |     print("--- %s seconds ---" % (time.time() - start_time))
106 |     myModel.save_weights('TrainedModel_Kidney_Z1.h5')
107 | except MemoryError as error:
108 |     import psutil
109 |     Memory_Information = psutil.virtual_memory()
110 |     print('>>> There is a memory issue: and here are a few suggestions:')
111 |     print('>>>>>> 1- Make sure that you are using  python 64-bit.')
112 |     print('>>>>>> 2- use a lower value for the batch_size (default is 128).')
113 |     print('**** Here is some information about your memory (MB):', Memory_Information)
114 | 
115 | 
116 | # ============= Model Predictions ===============
117 | encoded_imgs = encoder.predict(MSI_train) # Learned non-linear manifold
118 | decoded_imgs = myModel.predict(MSI_train) # Reconstructed Data
119 | dec_TIC = np.sum(decoded_imgs, axis=-1)
120 | 
121 | # ======= Calculate mse between orig & rec. data =====
122 | """ The mean squared error (mse): 
123 | 	the mse is used to evaluate the quality of the reconstructed data"""
124 | mse = mean_squared_error(MSI_train,decoded_imgs)
125 | meanSpec_Rec = np.mean(decoded_imgs,axis=0) 
126 | print('mean squared error(mse)  = ', mse)
127 | meanSpec_Orig = np.mean(MSI_train,axis=0) # TIC-norm original MSI Data
128 | N_DecImg = decoded_imgs/dec_TIC[:,None]  # TIC-norm reconstructed MSI  Data
129 | meanSpec_RecTIC = np.mean(N_DecImg,axis=0)
130 | plt.plot(All_mz,meanSpec_Orig); plt.plot(All_mz,meanSpec_RecTIC,color = [1.0, 0.5, 0.25]); 
131 | plt.title('TIC-norm distribution of average spectrum: Original and Predicted')
132 | 
133 | # ======== Model Parameters of the Latent Space ==========
134 | """ Capturing the learned latent variable:
135 | encoded features (Latent_z), and its mean and variance"""
136 | Latent_mean, Latent_var, Latent_z = encoded_imgs
137 | 
138 | # ======== Visualize encoded Features (learned non-linear spectral manifold) ==========
139 | ndim = Latent_z.shape[1]
140 | plt.figure(figsize=(14, 14))
141 | for j in range(ndim):
142 |     for i in range(len(xLocation)):
143 |         im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = Latent_z[i,j]  
144 |     ax = plt.subplot(1, ndim, j + 1)    
145 |     plt.imshow(im,cmap="hot");  # plt.colorbar()   
146 |     ax.get_xaxis().set_visible(False)
147 |     ax.get_yaxis().set_visible(False)
148 | 
149 | # ========= Visualize Original & Reconstructed m/z images ==========
150 | mzs = [2489.6,6627.9,8981.4,13961.2]  
151 | directory = 'Results'          
152 | if not os.path.exists(directory):
153 |     os.makedirs(directory)    
154 | for indx in range(0,len(mzs)):
155 |     mzId = np.argmin(np.abs(All_mz[:] - mzs[indx]))     
156 |     for i in range(len(xLocation)):
157 |         im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = N_DecImg[i,mzId] # Reconstructed TIC-norm m/z image
158 | #        im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_train[i,mzId] # Original TIC-norm m/z image
159 |     ax = plt.subplot(1, len(mzs), indx + 1)    
160 |     plt.imshow(im);  # plt.colorbar()   
161 |     ax.get_xaxis().set_visible(False)
162 |     ax.get_yaxis().set_visible(False)
163 |     plt.imsave(directory + '\\mz' + str(All_mz[mzId]) + '_Rec.jpg',im)
164 | #    plt.imsave(directory + '\\mz' + str(All_mz[mzId]) + '_Orig.jpg',im)
165 | 
166 | #********************* Peak Learning (Manuscript Equation#4) ********************    
167 | """ Statistical analysis of the trained neural network weights (learned parameters).
168 | 	See Equation (4) in the main manuscript.
169 | """
170 | from LearnPeaks import *
171 | W_enc = encoder.get_weights()
172 | # Normalize Weights by multiplying it with std of original data variables
173 | std_spectra = np.std(MSI_train, axis=0) 
174 | Beta = 2.5 # This variable can be adjusted by the user. We have observed good performance within the range [1,2.5] 
175 | Learned_mzBins, Learned_mzPeaks, mzBin_Indx, Real_PeakIdx = LearnPeaks(All_mz, W_enc,std_spectra,latent_dim,Beta,meanSpec_Orig)
176 | # save results of learned peaks in excel sheet
177 | directory = 'Results'
178 | if not os.path.exists(directory):
179 |     os.makedirs(directory)  # was "directory_mz", an undefined name (NameError)
180 | import pandas as pd
181 | df_1 = pd.DataFrame({'mz Peaks': Learned_mzPeaks})
182 | df_1.to_excel(directory+'/'+'Peaks_.xlsx', engine='xlsxwriter' , sheet_name='Sheet1')
183 | 
184 | # ******************* Downstream Data Analysis **************************
185 | """ msiPL has now been trained to learn a non-linear manifold, so the
186 | clustering step can be applied efficiently.
187 | Data Clustering using GMM: 
188 | 	- Applied on the encoded features "Latent_z" 
189 | 	- Peak Localization within each cluster 
190 | 	- nClusters: the number of clusters, which must be set before running the GMM.
191 | 	"nClusters" can be set manually or suggested automatically by an optimization process based on the BIC. 
192 |  """
193 | # ---- Bayesian Information Criterion (BIC) combined with the Kneedle algorithm for optimal model selection:
194 | """ The total number of K-clusters will be automatically suggested.
195 | 	- Different GMM models will be generated using different numbers of K-clusters (here K runs from 3 to 19, see n_components below).
196 | 	- The BIC score will be computed for each GMM model.
197 | 	- The Kneedle algorithm is applied to the BIC scores to identify the point of maximum curvature (knee point).
198 | 	- The knee point indicates the best model and suggests the expected number of K-clusters.
199 | """
200 | from kneed import KneeLocator
201 | # covariance_type = {'full', 'spherical', 'diag', 'tied'}
202 | cov_Type = 'full'
203 | n_components = np.arange(3, 20)
204 | models = [GaussianMixture(n, covariance_type=cov_Type, random_state=0).fit(Latent_z)
205 |           for n in n_components]
206 | 
207 | BIC_Scores = [m.bic(Latent_z) for m in models]
208 | kneedle_point = KneeLocator(n_components, BIC_Scores, curve='convex', direction='decreasing')
209 | print('The suggested number of clusters = ', kneedle_point.knee)
210 | Elbow_idx = np.where(np.asarray(BIC_Scores)==kneedle_point.knee_y)[0]  # array conversion keeps the comparison element-wise
211 | 
212 | from matplotlib.ticker import MaxNLocator
213 | plt.plot(n_components, BIC_Scores,'-g', marker='o',markerfacecolor='blue',markeredgecolor='orange',
214 |          markeredgewidth=2,markersize=10,markevery=Elbow_idx,label='BIC')  # label gives plt.legend an entry to show
215 | plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))
216 | plt.legend(loc='best')
217 | plt.xlabel('Number of clusters');
218 | plt.ylabel('BIC score');
219 | plt.title('The suggested number of clusters = '+ str(kneedle_point.knee))
220 | # plt.plot(n_components, [m.aic(Latent_z) for m in models], label='AIC')
221 | # Ref Kneedle algorithm [V. Satopaa et al., international conference on distributed computing systems workshops. IEEE, 2011.]
222 | 
223 | # ======================== Apply GMM on Encoded Features ============= 
224 | start_time_gmm = time.time()
225 | nClusters = (kneedle_point.knee) # this variable is set automatically based on the BIC algorithm
226 | # nClusters = 7 # this variable could be tuned by the user
227 | gmm = GaussianMixture(n_components=nClusters,covariance_type=cov_Type,random_state=0).fit(Latent_z)
228 | labels = gmm.predict(Latent_z)
229 | labels +=1 # To Avoid conflict with the natural background value of 0
230 | 
231 | # Spatial Clusters Distribution:
232 | for i in range(len(xLocation)):
233 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = labels[i]
234 | MyCmap = discrete_cmap(nClusters+1, 'jet')
235 | plt.imshow(im,cmap=MyCmap);
236 | plt.colorbar(ticks=np.arange(0,nClusters+1,1))
237 | plt.axis('off')
238 | print("Clustering time =  %s seconds" % (time.time() - start_time_gmm))
239 | 
240 | # ======= Select a cluster of interest and correlate with the Learned_mzPeaks ===============
241 | # 1. Select cluster:
242 | cluster_id = 2
243 | Kimg = labels==cluster_id
244 | Kimg = Kimg.astype(int)
245 | 
246 | for i in range(len(xLocation)):
247 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = Kimg[i]
248 | segCmp = [MyCmap(0),MyCmap(cluster_id)]
249 | cm = LinearSegmentedColormap.from_list('Walid_cmp',segCmp,N=2)
250 | plt.imshow(im, cmap=cm);
251 | plt.axis('off')
252 | 
253 | # 2. Correlate the selected cluster with the Learned_mzPeaks:
254 | # Note: correlating the cluster with the full All_mz data would also be fast
255 | Peaks_ID = [np.argmin(np.abs(All_mz[:] - Learned_mzPeaks[i])) for i in  range(len(Learned_mzPeaks))]
256 | MSI_PeakList = MSI_train[:,Peaks_ID[:]] # keep MSI data only for the shortlisted learned m/z peaks
257 | Corr_Val =  np.zeros(len(Learned_mzPeaks))
258 | for i in range(len(Learned_mzPeaks)):
259 |     Corr_Val[i] = stats.pearsonr(Kimg,MSI_PeakList[:,i])[0]
260 | id_mzCorr = np.argmax(Corr_Val)
261 | rank_ij =  np.argsort(Corr_Val)[::-1]
262 | 
263 | for i in range(len(xLocation)):
264 |     im[ np.asscalar(xLocation[i])-1, np.asscalar(yLocation[i])-1] = MSI_PeakList[i,id_mzCorr]  
265 | plt.imshow(im)
266 | plt.axis('off')
267 | print('m/z', Learned_mzPeaks[id_mzCorr])
268 | print('corr_Value = ', Corr_Val[id_mzCorr])
269 | 
270 | plt.plot(Learned_mzPeaks,Corr_Val)
271 | print('m/z of Top 10 Ranked peaks:', ['%0.4f' % i for i in Learned_mzPeaks[rank_ij[0:10]]])
272 | print('Correlation Top 10 Ranked peaks:', end='')
273 | print(['%0.4f' % i for i in Corr_Val[rank_ij[0:10]]])

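The cluster-to-peak correlation step at the end of msiPL_Run.py can be illustrated without NumPy or SciPy. The sketch below is a dependency-free illustration, not the repository's implementation: the names `pearson` and `rank_peaks` and the toy data are hypothetical. One deliberate difference is that a constant image gets a correlation of 0.0 here, whereas `scipy.stats.pearsonr` does not return a finite value for constant input.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption made for this sketch only
    return cov / math.sqrt(vx * vy)

def rank_peaks(labels, cluster_id, peak_images):
    """Return (peak_index, correlation) pairs, most correlated first."""
    # Binary mask of the selected cluster, like Kimg in the script
    mask = [1 if lab == cluster_id else 0 for lab in labels]
    scores = [pearson(mask, img) for img in peak_images]
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return [(k, scores[k]) for k in order]

# Toy data: 6 pixels and 3 hypothetical peaks (one intensity list per peak)
labels = [1, 2, 2, 1, 2, 1]
peak_images = [
    [0.9, 0.1, 0.2, 0.8, 0.1, 0.9],  # high where label == 1
    [0.1, 0.9, 0.8, 0.2, 0.9, 0.1],  # high where label == 2
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # flat image
]
ranking = rank_peaks(labels, 2, peak_images)
for peak_index, corr in ranking:
    print(peak_index, round(corr, 4))
```

The first entry of `ranking` plays the role of `id_mzCorr` in the script, and the full ordering corresponds to `rank_ij`.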

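To make the idea of a knee point on the BIC curve concrete, the sketch below picks the point that lies farthest from the straight line joining the first and last points of the normalised curve. This is a simplified stand-in and not the Kneedle algorithm as implemented in the `kneed` package; the function name `knee_by_chord_distance` and the BIC values are hypothetical, and the curve is assumed to be non-constant.

```python
def knee_by_chord_distance(xs, ys):
    """Return the x value whose point is farthest from the first-to-last chord."""
    y_min, y_max = min(ys), max(ys)
    if xs[-1] == xs[0] or y_max == y_min:
        raise ValueError("curve must span a non-zero range on both axes")
    # Normalise both axes to [0, 1] so the result does not depend on units
    xn = [(x - xs[0]) / (xs[-1] - xs[0]) for x in xs]
    yn = [(y - y_min) / (y_max - y_min) for y in ys]
    # Direction of the chord joining the first and the last point
    dx = xn[-1] - xn[0]
    dy = yn[-1] - yn[0]
    norm = (dx * dx + dy * dy) ** 0.5
    # Perpendicular distance of every point to that chord
    dist = [abs(dy * (x - xn[0]) - dx * (y - yn[0])) / norm
            for x, y in zip(xn, yn)]
    best = max(range(len(xs)), key=lambda k: dist[k])
    return xs[best]

# Hypothetical BIC scores for K = 3..10: steep drop, then a plateau
ks = [3, 4, 5, 6, 7, 8, 9, 10]
bic = [1000, 700, 500, 380, 360, 350, 345, 342]
knee = knee_by_chord_distance(ks, bic)
print("suggested number of clusters:", knee)
```

For a smooth, convex, decreasing curve this heuristic usually lands near the value that `KneeLocator(..., curve='convex', direction='decreasing')` reports, but the two are not guaranteed to agree.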
--------------------------------------------------------------------------------
/msiPL_Run_CrossValid_3DKidney.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | """
  3 | Implementation of msiPL (Abdelmoula et al): Model Cross-Validation Analysis
  4 | 
  5 | """
  6 | from __future__ import absolute_import
  7 | from __future__ import division
  8 | from __future__ import print_function
  9 | 
 10 | import numpy as np
 11 | np.random.seed(1337)
 12 | from tensorflow import set_random_seed
 13 | set_random_seed(2)
 14 | 
 15 | import os
 16 | import h5py
 17 | import matplotlib.pyplot as plt
 18 | from sklearn.mixture import GaussianMixture
 19 | from sklearn.metrics import mean_squared_error
 20 | from scipy import stats
 21 | from matplotlib.colors import LinearSegmentedColormap
 22 | import matplotlib as mpl
 23 | import nibabel as nib
 24 | import pandas as pd
 25 | import time
 26 | 
 27 | # ======= Directory Information:
 28 | Cd = os.getcwd()
 29 | Bd = os.path.dirname(Cd)
 30 | 
 31 | # ========= Color Map ==============                                      
 32 | def discrete_cmap(N, base_cmap=None):
 33 |     """Create an N-bin discrete colormap from the specified input map"""
 34 |     base = plt.cm.get_cmap(base_cmap)
 35 |     color_list = base(np.linspace(0, 1, N))
 36 |     cmap_name = base.name + str(N)
 37 |     return base.from_list(cmap_name, color_list, N)
 38 | 
 39 | # ====== Visualize Image: From 1D vector to Image ==============
 40 | def Image_Distribution(V,xLoc,yLoc):
 41 |     col = max(np.unique(xLoc))
 42 |     row = max(np.unique(yLoc))
 43 |     Myimg = np.zeros((col,row))
 44 |     for i in range(len(xLoc)):
 45 |         Myimg[np.asscalar(xLoc[i])-1, np.asscalar(yLoc[i])-1] = V[i]
 46 |     return Myimg
 47 | 
 48 | # ================= Correlate Cluster with MSI Data =============
 49 | def Correlate_Cluster_MSI(cluster_id,Labels,MSI_D,Peak_Indx,ZCoord_cv,XCoord_cv,YCoord_cv):
 50 |     Kimg = Labels==cluster_id
 51 |     Kimg = Kimg.astype(int)
 52 |     MSI_CleanPeaks = MSI_D[:,Peak_Indx[:]]
 53 |     Corr_Val =  np.zeros(len(Peak_Indx))
 54 |     
 55 |     for i in range(len(Peak_Indx)):
 56 |         Corr_Val[i] = stats.pearsonr(Kimg,MSI_CleanPeaks[:,i])[0]
 57 |     id_mzCorr = np.argmax(Corr_Val)
 58 |     rank_ij =  np.argsort(Corr_Val)[::-1]
 59 |     return Corr_Val, rank_ij, MSI_CleanPeaks
 60 | 
 61 | # ========================== 3D mz image ============================
 62 | def Get_3Dmz_nifti(MSI_CleanPeaks,mz_Peak,XCoord_cv,YCoord_cv,ZCoord_cv,directory):
 63 |     mzSections = np.unique(ZCoord_cv)
 64 |     Vol_mz = np.zeros((200,200,len(mzSections)))
 65 |     nSections = len(mzSections)
 66 |     directory_NIFT = directory + '\\mz_Vol\\Training'
 67 |     if not os.path.exists(directory_NIFT):
 68 |         os.makedirs(directory_NIFT)
 69 |     for Zsec in range(len(mzSections)):
 70 |         ij_r = np.argwhere(ZCoord_cv == mzSections[Zsec])
 71 |         indx = ij_r[:,0]
 72 |         xLoc = XCoord_cv[indx]
 73 |         yLoc = YCoord_cv[indx]
 74 |         MSI_2D = np.squeeze(MSI_CleanPeaks[indx])
 75 |         for idx in range(len(xLoc)):
 76 |             Vol_mz[np.asscalar(xLoc[idx])-1, np.asscalar(yLoc[idx])-1,Zsec] = MSI_2D[idx]  
 77 | 
 78 |     I_nii = nib.Nifti1Image(Vol_mz,affine=np.eye(4))
 79 |     nib.save(I_nii,directory_NIFT +'\\mz_' + str(mz_Peak) + '.nii')
 80 |     
 81 | #============= Spatial Distribution Encoded Features =============
 82 | def get_EncFeatures(Latent_z,Train_idx,myZCoord,xLocation,yLocation,directory,order):
 83 |     myzSections = np.unique(myZCoord)
 84 |     ndim = Latent_z.shape[1]
 85 |     for zr in range(len(Train_idx)):
 86 |         ij_r = np.argwhere(myZCoord == myzSections[zr])
 87 |         indx = ij_r[:,0]
 88 |         xLoc = xLocation[indx]
 89 |         yLoc = yLocation[indx]
 90 |         zSection_Latent_z = np.squeeze(Latent_z[indx,])        
 91 |         plt.figure(figsize=(14, 14))
 92 |         for j in range(ndim):
 93 |             EncFeat = zSection_Latent_z[:,j] #encoded_imgs[i,0] #image index starts at 0 not 1 
 94 |             im = Image_Distribution(EncFeat,xLoc,yLoc);
 95 |             ax = plt.subplot(1, ndim, j + 1)    
 96 |             plt.imshow(im,cmap="hot");  # plt.colorbar()   
 97 |             ax.get_xaxis().set_visible(False)
 98 |             ax.get_yaxis().set_visible(False)
 99 |             
100 |         directory_Latz = directory+'//Latent//Training_'+str(order)
101 |         if not os.path.exists(directory_Latz):
102 |             os.makedirs(directory_Latz)
103 |         plt.savefig(directory_Latz + '\\EncFetaures_Tissue'+str(myzSections[zr])+'.png',bbox_inches='tight')
104 |         
105 | # ================== Get GMM Image for cv analysis ================        
106 | def get_gmmImage(Train_idx,Features,nClusters,myZCoord,xLocation,yLocation,directory,order):
107 |     myzSections = np.unique(myZCoord); Zsec=0; 
108 |     C_imgs = np.zeros((200,200,len(Train_idx),nClusters))
109 |     directoryGmm = directory+'//GMM//Training_'+str(order)
110 |     if not os.path.exists(directoryGmm):
111 |         os.makedirs(directoryGmm)
112 |         
113 |     for zr in range(len(Train_idx)):
114 |         im = []
115 |         ij_r = np.argwhere(myZCoord == myzSections[zr])
116 |         indx = ij_r[:,0]
117 |         xLoc = xLocation[indx]
118 |         yLoc = yLocation[indx]
119 |         zSection_labels = Features[indx]
120 |         im = zSection_labels
121 |         im = Image_Distribution(im,xLoc,yLoc);
122 |         MyCmap = discrete_cmap(nClusters, 'jet')        
123 |         plt.imshow(im,cmap=MyCmap);
124 |         plt.colorbar(ticks=np.arange(0,nClusters,1))
125 |         plt.axis('off')
126 |         plt.show()        
127 |         plt.imsave(directoryGmm + '\\gmm_Training_'+str(myzSections[zr])+'_K_' + str(nClusters) + '.png',im,cmap=MyCmap)  
128 | 
129 |          # Save single clusters:
130 |         directory_SingleC =directoryGmm + '\\GMM_Section_'+str(myzSections[zr])
131 |         if not os.path.exists(directory_SingleC):
132 |             os.makedirs(directory_SingleC)
133 |             
134 |         for c in range(0,nClusters,1):
135 |             cluster_id = c
136 |             Kimg = zSection_labels[:]==cluster_id
137 |             Kimg = Kimg.astype(int)
138 |             for idx in range(len(xLoc)):
139 |                 C_imgs[np.asscalar(xLoc[idx])-1, np.asscalar(yLoc[idx])-1,Zsec,cluster_id] = Kimg[idx]  
140 |             Kimg = Image_Distribution(Kimg,xLoc,yLoc);
141 |             
142 |             segCmp = [MyCmap(0),MyCmap(cluster_id)]
143 |             segCmp[0]= (0,0,0)
144 |             cm = LinearSegmentedColormap.from_list('Walid_cmp',segCmp,N=2)
145 |             plt.imshow(Kimg, cmap=cm);
146 |             plt.colorbar(ticks=np.arange(0,1,1))
147 |             plt.axis('off')
148 |             plt.show()
149 |             plt.imsave(directory_SingleC + '\\Cluster_' + str(cluster_id) + '.png',Kimg,cmap=cm)  
150 |         Zsec +=1
151 |     #Save NIFTI
152 |     directory_NIFT = directoryGmm + '\\NIFTI'
153 |     if not os.path.exists(directory_NIFT):
154 |         os.makedirs(directory_NIFT)
155 |     for c in range(0,nClusters,1):
156 |         I_nii = nib.Nifti1Image(C_imgs[:,:,:,c],affine=np.eye(4))
157 |         nib.save(I_nii,directory_NIFT +'\\Label_' + str(c) + '.nii')
158 | 
159 | # ======== Save NIFTI image for each cluster ==================
160 | def Cluster_To_Nifti(directory,C_imgs,nClusters):
161 |     directory_NIFT = directory + '\\GMM_K'+str(nClusters)+'\\NIFTI'
162 |     if not os.path.exists(directory_NIFT):
163 |         os.makedirs(directory_NIFT) 
164 |     for c in range(1,nClusters+1,1):
165 |         I_nii = nib.Nifti1Image(C_imgs[:,:,:,c],affine=np.eye(4))
166 |         nib.save(I_nii,directory_NIFT +'\\Label_' + str(c) + '.nii')
167 |                   
168 | # =================== Load 3D MSI Data ========================# 
169 | Combined_MSI = []; XCoord = []; YCoord = []; ZCoord = []
170 | TissueIDs = [x for x in range(1,74,1)]
171 | 
172 | for id in range(1,74,1):
173 |     f =  h5py.File(Bd+'//hd5//MouseKindey_z' + str(id) + '.h5','r')
174 |     MSI_train = f["Data"]
175 |     mzList = f["mzArray"]
176 |     nSpecFeatures = len(mzList)
177 |     xLocation = np.array(f["xLocation"]).astype(int)
178 |     yLocation = np.array(f["yLocation"]).astype(int)
179 |     zLocation = np.full(len(yLocation),id)
180 |     col = max(np.unique(xLocation))
181 |     row = max(np.unique(yLocation))
182 |     im = np.zeros((col,row))
183 |     if id==1:
184 |         Combined_MSI = MSI_train
185 |         XCoord = xLocation
186 |         YCoord = yLocation
187 |         ZCoord = zLocation
188 |     else:
189 |         Combined_MSI = np.concatenate((MSI_train,Combined_MSI), axis=0)
190 |         XCoord = np.concatenate((xLocation,XCoord))
191 |         YCoord = np.concatenate((yLocation,YCoord))
192 |         ZCoord = np.concatenate((zLocation,ZCoord))
193 | 
194 | 
195 | # ============ KFold Cross Validation:
196 | from sklearn.model_selection import  KFold
197 | from matplotlib.patches import Patch
198 | cmap_data = plt.cm.Paired
199 | cmap_cv = plt.cm.coolwarm
200 | 
201 | n_folds = 5
202 | kfold = KFold(n_folds,  shuffle=True) 
203 | fig, ax = plt.subplots() 
204 | ij_Training = []; ij_Testing = []
205 | myHF = h5py.File('CV_Values//cv_Idx.h5', 'w')
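206 | for ij, (Test_idx, Train_idx) in enumerate(kfold.split(TissueIDs)):  # NOTE: KFold.split yields (train, test); unpacked in this order, Train_idx receives the smaller held-out fold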
207 |     print("Training: %s Testing:%s" %(Train_idx, Test_idx))
208 |     ij_Training.append(Train_idx)
209 |     ij_Testing.append(Test_idx)
210 |     myHF.create_dataset('indx_Training'+str(ij), data=Train_idx)
211 |     myHF.create_dataset('indx_Testing'+str(ij), data=Test_idx)   
212 |     
213 |     indices = np.array([np.nan] * len(TissueIDs))
214 |     indices[Train_idx] = 1
215 |     indices[Test_idx] = 0
216 |     # Visualize Cross Validation Behavior:
217 |     ax.scatter(range(1,len(indices)+1), [ij + 1] * len(indices),
218 |                c=indices, marker='_', lw=10, cmap=cmap_cv, vmin=-0.2, vmax=1.2)
219 |     yticklabels = list(range(1,n_folds+1))
220 |     ax.set(yticks=np.arange(1,n_folds+1) , yticklabels=yticklabels,
221 |            xlabel='2D MSI Sample index', ylabel="Iteration",
222 |            ylim=[n_folds+1.2,-.2], xlim=[-2, len(TissueIDs)+4])
223 |     ax.set_title('KFold', fontsize=15)
224 | ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.1))],
225 | ['Training set', 'Testing set'], loc=(1.02, .8))
226 | myHF.close()
227 | 
228 | # -------- A function to get training Data:
229 | def Get_cv_MSI(Combined_MSI,XCoord,YCoord,ZCoord,zSections,CV_idx):
230 |     ij_r=[]; ij_t=[]; MSI_train=[]; MSI_Test=[]
231 |     # Get Training Data:
232 |     for jr,zr in enumerate(CV_idx):        
233 |         if jr==0:
234 |             ij_r = np.argwhere(ZCoord == zSections[zr])
235 |         else:
236 |             ij_r = np.concatenate((ij_r, np.argwhere(ZCoord == zSections[zr])),axis=0)
237 |     MSI_Data = np.squeeze(Combined_MSI[ij_r,])
238 |     XCoord_cv = XCoord[ij_r]
239 |     YCoord_cv = YCoord[ij_r]
240 |     ZCoord_cv = ZCoord[ij_r]
241 | 
242 |     return MSI_Data,XCoord_cv,YCoord_cv,ZCoord_cv
243 | 
244 | # ------------------- Train and Test with CV:
245 | zSections = np.unique(ZCoord)
246 | directory = Cd+'/Results_CV'         
247 | meanSpec_Orig_AllData = np.mean(Combined_MSI,axis=0)
248 | 
249 | # --------- Train model status:
250 | TrainStatus =  int(input("Would you like to Train Your Model? Yes=1; No=0 ... :"))
251 | if TrainStatus== 0:
252 |     print('No Training')
253 | else:
254 |     print('Model Training ...>>>......')
255 |     
256 | hf_cv = h5py.File('CV_Values/cv_Idx.h5', 'r')       
257 | for i in range(n_folds):
258 |     Train_idx = hf_cv['indx_Training'+str(i)][:]
259 |     #Test_idx = hf_cv['indx_Testing'+str(i)][:]
260 |     MSI_train,XCoord_cv,YCoord_cv,ZCoord_cv = Get_cv_MSI(Combined_MSI,XCoord,YCoord,ZCoord,zSections,Train_idx)
261 |     #MSI_Test,XCoord_cv,YCoord_cv,ZCoord_cv = Get_cv_MSI(Combined_MSI,XCoord,YCoord,ZCoord,zSections,Test_idx)
262 |     myzSections = np.unique(ZCoord_cv)     
263 | # ************************* Training **************************************    
264 |     # 1. ====== Initialize the model:
265 |     from Computational_Model import *
266 |     input_shape = (nSpecFeatures, )
267 |     intermediate_dim = 512
268 |     latent_dim = 5
269 |     VAE_BN_Model = VAE_BN(nSpecFeatures,  intermediate_dim, latent_dim)
270 |     myModel, encoder = VAE_BN_Model.get_architecture()
271 |     myModel.summary()  
272 |     # 2. ====== Train the model:
273 |     if TrainStatus==1:
274 |         start_time = time.time()
275 |         history = myModel.fit(MSI_train, epochs=100, batch_size=128, shuffle="batch")   
276 |         myModel.save_weights(directory+'//'+'TrainedModel_'+str(i)+'.h5') 
277 |     else:
278 |         myModel.load_weights(directory+'//'+'TrainedModel_'+str(i)+'.h5');
279 |     # 3. ============= Model Predictions:
280 |     encoded_imgs = encoder.predict(MSI_train) # Learned non-linear manifold
281 |     decoded_imgs = myModel.predict(MSI_train) # Reconstructed Data
282 |     dec_TIC = np.sum(decoded_imgs, axis=-1)
283 |     Latent_mean, Latent_var, Latent_z = encoded_imgs
284 |     
285 |     get_EncFeatures(Latent_z,Train_idx,ZCoord_cv,XCoord_cv,YCoord_cv,directory,i)    
286 |     
287 |     # 4. ============= Plot Average Spectrum:
288 |     mse = mean_squared_error(MSI_train,decoded_imgs)
289 |     meanSpec_Rec = np.mean(decoded_imgs,axis=0) 
290 |     print('mean squared error(mse)  = ', mse)
291 |     meanSpec_Orig = np.mean(MSI_train,axis=0) # TIC-norm original MSI Data
292 |     N_DecImg = decoded_imgs/dec_TIC[:,None]  # TIC-norm reconstructed MSI  Data
293 |     meanSpec_RecTIC = np.mean(N_DecImg,axis=0)
294 |     
295 |     if TrainStatus==1:  # history and start_time exist only when the model was trained in this run
296 |         fig, ax = plt.subplots(); plt.plot(history.history['loss'])
297 |         plt.ylabel('loss'); plt.xlabel('epoch')
298 |         print("--- %s seconds ---" % (time.time() - start_time))
299 |         plt.savefig(directory+'/'+'Convergence_TrainedModel_'+str(i)+'.tif')
300 |     
301 |     fig, ax = plt.subplots() 
302 |     #plt.figure(figsize=(10, 3))
303 |     plt.plot(mzList,meanSpec_Orig,color = [0, 1, 0,1]); plt.plot(mzList,meanSpec_RecTIC,color = [1, 0, 0,0.6]); 
304 |     plt.savefig(directory+'/'+'Overlay_Training_'+str(i)+'mse_'+str(mse)+'.tif')
305 |     
306 |     # 5. ============== Learn Peaks:
307 |     from LearnPeaks import *
308 |     W_enc = encoder.get_weights()
309 |     # Normalize Weights by multiplying it with std of original data variables
310 |     std_spectra = np.std(MSI_train, axis=0) 
311 |     Beta = 2.5
312 |     Learned_mzBins, Learned_mzPeaks, mzBin_Indx, Real_PeakIdx = LearnPeaks(mzList, W_enc,std_spectra,latent_dim,Beta,meanSpec_Orig_AllData)
313 |     xls_writer = pd.ExcelWriter(directory+'/'+'Peaks_Training_'+str(i)+'.xlsx', engine='xlsxwriter')   
314 |     df_1 = pd.DataFrame({'mz Bins': Learned_mzBins})
315 |     df_2 = pd.DataFrame({'mz Peaks': Learned_mzPeaks})
316 |     df_1.to_excel(xls_writer, sheet_name='Sheet'+str(i))
317 |     df_2.to_excel(xls_writer, sheet_name='Sheet'+str(i),startcol=3)
318 |     workbook  = xls_writer.book
319 |     xls_writer.close()  # write the workbook to disk; without this the .xlsx file is not saved
320 |     
321 |     # 6. ============== Apply Clustering:
322 |     nClusters = 8
323 |     gmm = GaussianMixture(n_components=nClusters,random_state=0).fit(np.squeeze(Latent_z))
324 |     Labels = gmm.predict(np.squeeze(Latent_z))
325 |     get_gmmImage(Train_idx,Labels,nClusters,ZCoord_cv,XCoord_cv,YCoord_cv,directory,i)
326 |                              
327 | hf_cv.close()
328 | 
329 | # 7. ========== Correlate Clusters with MSI Data:
330 | cluster_id = 7
331 | Corr_Val, CorrRank_ij,MSI_CleanPeaks =  Correlate_Cluster_MSI(cluster_id,Labels,MSI_train,Real_PeakIdx,ZCoord_cv,XCoord_cv,YCoord_cv)
332 | 
333 | print('m/z', Learned_mzPeaks[CorrRank_ij[0:5]])
334 | print('corr_Value = ', Corr_Val[CorrRank_ij[0:5]])
335 | plt.plot(Learned_mzPeaks,Corr_Val)
336 | 
337 | # Visualize Correlated Peak at section z: 
338 | mzID = CorrRank_ij[0];
339 | myzSections = np.unique(ZCoord_cv)     
340 | zr=0                    
341 | ij_r = np.argwhere(ZCoord_cv == myzSections[zr])
342 | indx = ij_r[:,0]
343 | xLoc = XCoord_cv[indx]
344 | yLoc = YCoord_cv[indx]
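345 | MSI_2D = np.squeeze(MSI_CleanPeaks[indx,mzID])  # mzID indexes the learned-peak list, so read from MSI_CleanPeaks rather than the full spectrum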
346 | im_mz = Image_Distribution(MSI_2D,xLoc,yLoc);
347 | plt.imshow(im_mz);
348 | mz_Peak = Learned_mzPeaks[mzID]
349 | print('m/z Peak = ',mz_Peak)
350 | 
351 | # ============ Get 3D m/z image =================
352 | mzValue = 13972.1
353 | mzId = np.argmin(np.abs(mzList[:] - mzValue))
354 | Get_3Dmz_nifti(Combined_MSI[:,mzId],mzValue,XCoord,YCoord,ZCoord,directory)
355 | 
356 | # ============== Load the peaks learned by all trained models:
357 | myBeta = 1.5
358 | ALL_Peaks_Train = pd.read_excel(directory+'//Peaks_Learned//Beta_'+str(myBeta)+'//Peaks_AllModels_Trained.xlsx')
359 | ALL_Peaks_Train = np.squeeze(np.asarray(ALL_Peaks_Train))
360 | ALL_Peaks_Train = np.nan_to_num(ALL_Peaks_Train)
361 | 
362 | My_marker= ['v', '*','d','^','s']
363 | Point_Color = plt.cm.jet(np.linspace(0, 1, ALL_Peaks_Train.shape[1]))
364 | plt.figure(figsize=(25, 5))
365 | plt.plot(mzList,meanSpec_Orig_AllData,linewidth=3,c='black'); 
366 | for ij in range(ALL_Peaks_Train.shape[1]):
367 |     Peaks_Train = ALL_Peaks_Train[:,ij]
368 |     Peaks_Train = Peaks_Train[Peaks_Train !=0]
369 |     Train_Peaks_Loc = [np.argmin(np.abs(mzList[:] - Peaks_Train[idx])) for idx in  range(len(Peaks_Train))]
370 |     Mean_PickedPeakst = np.mean(Combined_MSI[:,Train_Peaks_Loc],axis=0)
371 |     plt.scatter(Peaks_Train,Mean_PickedPeakst,marker=My_marker[ij],c=Point_Color[ij]);
372 | 
373 | 
374 | Peaks_Vector = ALL_Peaks_Train.reshape(ALL_Peaks_Train.shape[0]*ALL_Peaks_Train.shape[1])
375 | Peaks_Vector_NoZero = Peaks_Vector[Peaks_Vector !=0]
376 | U_Peaks = np.unique(Peaks_Vector_NoZero)
377 | L = len(np.unique(U_Peaks))
378 | #plt.figure(figsize=(20, 5))
379 | n, bins, patches = plt.hist(x=Peaks_Vector_NoZero, bins=U_Peaks); plt.show()
380 | 
381 | #=== Scatter Plot Frequency:
382 | from matplotlib.ticker import MaxNLocator
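383 | New_n = np.append(n,1)  # plt.hist returns len(U_Peaks)-1 counts; pad with 1 to match U_Peaks (the closed last bin already counts the final peak)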
384 | fig, ax = plt.subplots()
385 | ax.yaxis.set_major_locator(MaxNLocator(integer=True))
386 | #plt.figure(figsize=(30, 5))
387 | plt.scatter(U_Peaks,New_n,c=New_n)
388 | plt.xlabel("m/z")
389 | plt.ylabel("Frequency")
390 | 
391 | # ====== Bar Plot: Counts of Peak Frequency
392 | N_Freq = [len(np.argwhere(n==ij)) for ij in np.unique(n)]
393 | xValue = [ij for ij in np.unique(n)]
394 | colors = plt.cm.jet(np.linspace(0, 1, len(np.unique(n))))
395 | plt.bar(xValue, N_Freq,color=colors)
396 | plt.xlabel("Frequency")
397 | plt.ylabel("Count")
398 | 
399 | 

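The cross-validation loop above depends on the order in which each split is unpacked. The sketch below is a dependency-free stand-in for a shuffled K-fold split that follows the same (train, test) convention as `sklearn.model_selection.KFold.split`; the function `kfold_indices` is hypothetical and its shuffling does not reproduce scikit-learn's. It shows that, for 73 sections and 5 folds, the held-out part of every split is the smaller one.

```python
import random

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train, test) index lists, one pair per fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # The first (n_samples % n_folds) folds receive one extra sample
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = sorted(idx[start:start + size])
        train = sorted(idx[:start] + idx[start + size:])
        start += size
        yield train, test

# 73 tissue sections split into 5 folds, as in the script
splits = list(kfold_indices(73, 5))
for fold, (train, test) in enumerate(splits):
    print("fold", fold, "train:", len(train), "test:", len(test))
```

Unpacking each pair as `(test, train)` instead, as the script does with `(Test_idx, Train_idx)`, would assign the 14 or 15 held-out sections to the training variable.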

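The peak-frequency plots at the end of the script count how many trained models selected each m/z peak. A minimal sketch of that counting with `collections.Counter` is shown below; the function names and the toy peak lists are hypothetical, zeros are treated as padding (as the script does after `np.nan_to_num`), and peaks are assumed to match exactly because they come from the same m/z axis.

```python
from collections import Counter

def peak_frequencies(peaks_per_model):
    """Map each non-zero peak to the number of models that selected it."""
    counts = Counter()
    for peaks in peaks_per_model:
        # A set ensures that one model contributes at most once per peak
        counts.update({p for p in peaks if p != 0})
    return dict(sorted(counts.items()))

def frequency_histogram(freqs):
    """Map each frequency value to the number of peaks that have it."""
    return dict(sorted(Counter(freqs.values()).items()))

# Three hypothetical models; 0 marks padding in the shorter peak lists
models = [
    [2489.6, 6627.9, 0],
    [2489.6, 8981.4, 6627.9],
    [2489.6, 0, 0],
]
freqs = peak_frequencies(models)
hist = frequency_histogram(freqs)
print(freqs)
print(hist)
```

Counting this way gives one frequency per unique peak directly, so no padding value is needed for the last peak, unlike a histogram whose bin edges are the unique peaks themselves.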
--------------------------------------------------------------------------------