├── .Rhistory
├── .ipynb_checkpoints
│   ├── DeepLearning_CyTOF-Copy1-checkpoint.ipynb
│   ├── DeepLearning_CyTOF-checkpoint.ipynb
│   └── Untitled-checkpoint.ipynb
├── Data
│   ├── Figure1.png
│   ├── Figure2.png
│   ├── Final_weights.hdf5
│   └── header.png
├── DeepLearning_CyTOF.html
├── DeepLearning_CyTOF.ipynb
├── FCS_to_Array
│   ├── .ipynb_checkpoints
│   │   └── FCS_to_Array-checkpoint.ipynb
│   ├── FCS_to_Array.ipynb
│   ├── allData.obj
│   ├── metaData.csv
│   ├── sample1.fcs
│   ├── sample2.fcs
│   └── sample3.fcs
├── FOCIS_deeplearning.pdf
├── ReadMe.md
├── Result
│   └── saved_weights.hdf5
└── requirements.txt
/.Rhistory:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/.Rhistory
--------------------------------------------------------------------------------
/Data/Figure1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/Data/Figure1.png
--------------------------------------------------------------------------------
/Data/Figure2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/Data/Figure2.png
--------------------------------------------------------------------------------
/Data/Final_weights.hdf5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/Data/Final_weights.hdf5
--------------------------------------------------------------------------------
/Data/header.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/Data/header.png
--------------------------------------------------------------------------------
/DeepLearning_CyTOF.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | " Diagnose Latent Cytomegalovirus Using CyTOF Data and Deep Learning
\n",
8 | "\n",
9 | "Zicheng Hu, Ph.D.\n",
10 | "Research Scientist\n",
11 | "ImmPort Team\n",
12 | "The Unversity of California, San Francisco\n",
13 | "\n",
14 | ""
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "### Introduction\n",
22 | "A deep neural network (a.k.a. deep learning) is an artificial neural network with multiple layers between the input and output layers. It was proven to be highly effective for a variety of predictive tasks. In health care, deep learning is quickly gaining popularity and has been implemented for applications such as image-based diagnosis and personalized drug recommendations. In this tutorial, we will build a tailored deep-learning model for CyTOF data to diagnosis latent Cytomegalovirus infection using Keras and TensorFlow. To run this tutorial, download the [github repository](https://github.com/hzc363/DeepLearningCyTOF) and run the [jupyter notebook](https://github.com/hzc363/DeepLearningCyTOF/blob/master/DeepLearning_CyTOF.ipynb). "
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "### Step 1: Import Functions\n",
30 | "Before we start, we first import functions that we will use in this tutorial from different libraries. "
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "##### Step 1: import functions #####\n",
40 | "from keras.layers import Dense, Flatten, BatchNormalization, Activation, Conv2D, AveragePooling2D, Input\n",
41 | "from keras.models import load_model, Model\n",
42 | "from keras.optimizers import Adam\n",
43 | "from keras.callbacks import ModelCheckpoint, EarlyStopping\n",
44 | "from keras import backend as K\n",
45 | "import pickle\n",
46 | "import pandas as pd\n",
47 | "import numpy as np\n",
48 | "from numpy.random import seed; seed(111)\n",
49 | "import random\n",
50 | "import matplotlib.pyplot as plt\n",
51 | "import seaborn as sns\n",
52 | "from tensorflow import set_random_seed; set_random_seed(111)\n",
53 | "from sklearn.metrics import roc_curve, auc\n",
54 | "from sklearn.externals.six import StringIO \n",
55 | "from sklearn.tree import export_graphviz, DecisionTreeRegressor\n",
56 | "from scipy.stats import ttest_ind\n",
57 | "from IPython.display import Image \n",
58 | "import pydotplus"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "### Step 2: Load data\n",
66 | "\n",
67 | "We load the data, which are stored in the \"allData.obj\" file. The data includes three parts, meta-data, CyTOF data, and marker names. \n",
68 | "\n",
69 | "* The **CyTOF data** contains the single-cell profile of 27 markers. For the convenience of this tutorial, we already downloaded the fcs files from ImmPort and preprocessed the data into Numpy arrays. See [an example](https://github.com/hzc363/DeepLearningCyTOF/tree/master/FCS_to_Array) for the preprocessing of the FCS files. The dimension of the Numpy array is 472 samples x 10000 cells x 27 markers.\n",
70 | "* The **metadata** contains the sample level information, including the study accession number for each sample and the ground truth of CMV infection. It is stored as a pandas data frame.\n",
71 | "* The **marker names** contain the name of the 27 markers."
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": null,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "##### Step 2: load data #####\n",
81 | "\n",
82 | "#Download data\n",
83 | "tutorial_files = ! ls Data\n",
84 | "if \"allData.obj\" not in tutorial_files:\n",
85 | " print(\"Downloading Data:\")\n",
86 | " ! wget https://figshare.com/ndownloader/files/38918480 -P ./Data\n",
87 | " \n",
88 | "#load data\n",
89 | "allData = pickle.load( open( \"Data/allData.obj\", \"rb\" ) )\n",
90 | "metaData = allData[\"metaData\"]\n",
91 | "cytoData = allData[\"cytoData\"]\n",
92 | "markerNames = allData[\"markerNames\"]\n",
93 | "\n",
94 | "# inspect the data\n",
95 | "print(\"\\nFirst 5 rows of metaData: \")\n",
96 | "print(metaData.head(),\"\\n\")\n",
97 | "\n",
98 | "print(\"Dimensions of cytoData: \",cytoData.shape,\"\\n\")\n",
99 | "print(\"Names of the 27 makers: \\n\",markerNames.values)"
100 | ]
101 | },
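{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the array layout concrete, the small sketch below prints the shape of a single sample. Each sample is a (cells, markers, 1) array; the trailing singleton axis is the channel dimension that the Conv2D layers in Step 4 expect."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one sample is a (cells, markers, 1) array; the trailing axis is the\n",
"# single \"channel\" that Conv2D expects\n",
"print(\"Shape of one sample: \", cytoData[0].shape)"
]
},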
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "### Step 3: Split data into training, validation and testing sets\n",
107 | "Now, lets split the data into training, validation, and testing sets. The training data is used to train the deep learning model. The validation dataset is used to select the best parameters for the model and to avoid overfitting. The test dataset is used to evaluate the performance of the final model.\n",
108 | "\n",
109 | "The CyTOF dataset contains samples from 9 studies available on ImmPort. We will use samples from the study SDY515 as a validation set, samples from the study SDY519 as a testing set, and the rest of the samples as a training set. "
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": null,
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "##### Step 3: split train, validation and test######\n",
119 | "y = metaData.CMV_Ab.values\n",
120 | "x = cytoData\n",
121 | "\n",
122 | "train_id = (metaData.study_accession.isin([\"SDY515\",\"SDY519\"])==False)\n",
123 | "valid_id = metaData.study_accession==\"SDY515\"\n",
124 | "test_id = metaData.study_accession ==\"SDY519\"\n",
125 | "\n",
126 | "x_train = x[train_id]; y_train = y[train_id]\n",
127 | "x_valid = x[valid_id]; y_valid = y[valid_id]\n",
128 | "x_test = x[test_id]; y_test = y[test_id]"
129 | ]
130 | },
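{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (a minimal sketch, using only the variables defined above): print the size and the fraction of CMV-positive samples in each split, to confirm that the three sets are non-empty and reasonably balanced."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional sanity check: size and class balance of each split\n",
"for name, yy in [(\"train\", y_train), (\"valid\", y_valid), (\"test\", y_test)]:\n",
"    print(name, \"n =\", len(yy), \"CMV+ fraction =\", round(float(np.mean(yy)), 2))"
]
},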
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "### Step 4: Define the deep learning model\n",
136 | "We will use a customized convolution neural network (CNN) to analyze the CyTOF data. For each sample, the CyTOF data is a matrix with rows as cells and columns as markers. It is crucial to notice that the CyTOF data is an unordered collection of cells (rows). For example, both matrix 1 and matrix 2 profiles the same sample in Figure 1A, even though they have different orders of rows. \n",
137 | "\n",
138 | "\n",
139 | "\n",
140 | "\n",
141 | "Based on the characteristics of the CyTOF data, we design a CNN model that is invariant to the permutation of rows. The model contains six layers: input layer, first and second convolution layer, pooling layer, dense layer, and output layer. \n",
142 | "\n",
143 | "* The **input layer** receives the CyTOF data matrix. \n",
144 | "\n",
145 | "* The **first convolution layer** uses three filters to scan each row of the CyTOF data. This layer extracts relevant information from the cell marker profile of each cell. \n",
146 | "\n",
147 | "* The **second convolution layer** uses three filters to scan each row of the first layer's output. Each filter combines information from the first layer for each cell. \n",
148 | "\n",
149 | "* The **pooling layers** averages the outputs of the second convolution layer. The purpose is to aggregate the cell level information into sample-level information. \n",
150 | "\n",
151 | "* The **dense layer** further extracts information from the pooling layer. \n",
152 | "\n",
153 | "* The **output layer** uses logistic regression to report the probability of CMV infection for each sample. \n"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "##### Step 4: define model #####\n",
163 | "\n",
164 | "# input\n",
165 | "model_input = Input(shape=x_train[0].shape)\n",
166 | "\n",
167 | "# first convolution layer\n",
168 | "model_output = Conv2D(3, kernel_size=(1, x_train.shape[2]),\n",
169 | " activation=None)(model_input)\n",
170 | "model_output = BatchNormalization()(model_output)\n",
171 | "model_output = Activation(\"relu\")(model_output)\n",
172 | "\n",
173 | "# sceond convolution layer\n",
174 | "model_output = Conv2D(3, (1, 1), activation=None)(model_output)\n",
175 | "model_output = BatchNormalization()(model_output)\n",
176 | "model_output = Activation(\"relu\")(model_output)\n",
177 | "\n",
178 | "# pooling layer\n",
179 | "model_output = AveragePooling2D(pool_size=(x_train.shape[1], 1))(model_output)\n",
180 | "model_output = Flatten()(model_output)\n",
181 | "\n",
182 | "# Dense layer\n",
183 | "model_output = Dense(3, activation=None)(model_output)\n",
184 | "model_output = BatchNormalization()(model_output)\n",
185 | "model_output = Activation(\"relu\")(model_output)\n",
186 | "\n",
187 | "# output layer\n",
188 | "model_output = Dense(1, activation=None)(model_output)\n",
189 | "model_output = BatchNormalization()(model_output)\n",
190 | "model_output = Activation(\"sigmoid\")(model_output)"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "### Step 5: Fit the model\n",
198 | "In this step, we will use the training data to fit the model. We will use the Adam algorithm, which is an extension of the gradient descent method to train our model. Adam algorithm will search the model space step by step (epochs) until the optimal model is identified. At each step, we will use validation data to evaluate the performance of the model. The best model will be saved. "
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "##### Step 5: Fit model #####\n",
208 | "\n",
209 | "# specify input and output\n",
210 | "model = Model(inputs=[model_input],\n",
211 | " outputs=model_output)\n",
212 | "\n",
213 | "# define loss function and optimizer\n",
214 | "model.compile(loss='binary_crossentropy',\n",
215 | " optimizer=Adam(lr=0.0001),\n",
216 | " metrics=['accuracy'])\n",
217 | "\n",
218 | "# save the best performing model\n",
219 | "checkpointer = ModelCheckpoint(filepath='Result/saved_weights.hdf5', \n",
220 | " monitor='val_loss', verbose=0, \n",
221 | " save_best_only=True)\n",
222 | "\n",
223 | "# model training\n",
224 | "model.fit([x_train], y_train,\n",
225 | " batch_size=60,\n",
226 | " epochs=500, \n",
227 | " verbose=1,\n",
228 | " callbacks=[checkpointer],\n",
229 | " validation_data=([x_valid], y_valid))"
230 | ]
231 | },
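{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, the minimal sketch below verifies the permutation-invariance claim from Step 4: it shuffles the rows (cells) of one sample and confirms that the trained model's prediction is essentially unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# check permutation invariance: shuffling the cells of a sample\n",
"# should not change the model's prediction (up to floating-point error)\n",
"idx = np.random.permutation(x_train.shape[1])\n",
"x_one = x_train[[0]]              # one sample, shape (1, cells, markers, 1)\n",
"x_shuffled = x_one[:, idx, :, :]  # same cells, different row order\n",
"p_orig = model.predict(x_one)\n",
"p_shuf = model.predict(x_shuffled)\n",
"print(\"max prediction difference:\", float(np.abs(p_orig - p_shuf).max()))"
]
},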
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": [
236 | "### Step 6: Plot the training history\n",
237 | "We can view the training history of the model by plotting the performance (value of the loss function) for training and validation data in each epoch. "
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {
244 | "scrolled": true
245 | },
246 | "outputs": [],
247 | "source": [
248 | "##### Step 6: plot train and validation loss #####\n",
249 | "plt.plot(model.history.history['loss'])\n",
250 | "plt.plot(model.history.history['val_loss'])\n",
251 | "plt.title('model train vs validation loss')\n",
252 | "plt.ylabel('loss')\n",
253 | "plt.xlabel('epoch')\n",
254 | "plt.legend(['train', 'validation'], loc='upper right')\n",
255 | "plt.show()"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "### Step 7: Evaluate the performance using test data\n",
263 | "We load the final model from a save file (Final_weights.hdf5) for the following analysis steps. We will use the test data, which has not been touched so far, to evaluate the performance of the final model. We will draw a Receiver Operator Characteristic(ROC) Curve and use Area Under the Curve (AUC) to measure performance. "
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": null,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": [
272 | "##### Step 7: test the final model #####\n",
273 | "\n",
274 | "# load final model\n",
275 | "final_model = load_model('Data/Final_weights.hdf5')\n",
276 | "\n",
277 | "# generate ROC and AUC\n",
278 | "y_scores = final_model.predict([x_test])\n",
279 | "fpr, tpr, _ = roc_curve(y_test, y_scores)\n",
280 | "roc_auc = auc(fpr, tpr)\n",
281 | "\n",
282 | "# plot ROC curve\n",
283 | "plt.plot(fpr, tpr)\n",
284 | "plt.plot([0, 1], [0, 1], 'k--')\n",
285 | "plt.xlabel('False Positive Rate')\n",
286 | "plt.ylabel('True Positive Rate')\n",
287 | "plt.title('AUC = {0:.2f}'.format(roc_auc))\n",
288 | "plt.show()"
289 | ]
290 | },
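{
"cell_type": "markdown",
"metadata": {},
"source": [
"The AUC above is threshold-free. If a single operating point is needed, one possible sketch (using the y_scores computed above and the conventional, but arbitrary, 0.5 cutoff) is:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# accuracy at a fixed 0.5 threshold (one possible operating point)\n",
"y_pred = y_scores.ravel() > 0.5\n",
"print(\"accuracy at 0.5 threshold:\", float(np.mean(y_pred == y_test)))"
]
},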
291 | {
292 | "cell_type": "markdown",
293 | "metadata": {
294 | "collapsed": true
295 | },
296 | "source": [
297 | "### Step 8: Interpret the deep learning model.\n",
298 | "We will use a permutation-based method to interpret the deep CNN model (Fig. 2). For each cell in cytometry data, we up-sampled the cell by copying it to replace other randomly chosen cells within the sample. We then applied the deep CNN model on both the original data and the permuted data. The difference in the model output (ΔΥ) quantifies the impact of the cell on the output of the deep learning model. Finally, we build a decision tree model to idenitfy cell subsets that have the highest ΔΥ. \n",
299 | "\n",
300 | ""
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": null,
306 | "metadata": {},
307 | "outputs": [],
308 | "source": [
309 | "##### Step 8: Interpret the deep learning model. #####\n",
310 | "\n",
311 | "# warning: may take a long time (around 30 mins) to run\n",
312 | "\n",
313 | "# Calculate the impact of each cell on the model output\n",
314 | "dY = np.zeros([x_test.shape[0],x_test.shape[1]])\n",
315 | "s1 = np.random.randint(0,(x_test.shape[1]-1),int(x_test.shape[1]*0.05))\n",
316 | "\n",
317 | "for i in range(x_test.shape[0]):\n",
318 | " pred_i = final_model.predict([x_test[[i],:,:,:]])\n",
319 | " for j in range(x_test.shape[1]):\n",
320 | " t1 = x_test[[i],:,:,:].copy()\n",
321 | " t1[:,s1,:,:] = t1[:,j,:,:]\n",
322 | " pred_j = final_model.predict([t1])\n",
323 | " dY[i,j] = pred_j-pred_i\n",
324 | "\n",
325 | "# reformat dY\n",
326 | "x_test2 = x_test.reshape((x_test.shape[0]*x_test.shape[1],27))\n",
327 | "dY = dY.reshape([x_test.shape[0]*x_test.shape[1]])\n",
328 | "\n",
329 | "# Build decision tree to identify cell subset with high dY\n",
330 | "regr_1 = DecisionTreeRegressor(max_depth=4)\n",
331 | "regr_1.fit(x_test2, dY)\n",
332 | "\n",
333 | "# Plot the decision tree\n",
334 | "dot_data = StringIO()\n",
335 | "export_graphviz(regr_1, out_file=dot_data, \n",
336 | " feature_names= markerNames,\n",
337 | " filled=True, rounded=True,\n",
338 | " special_characters=True)\n",
339 | "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n",
340 | "Image(graph.create_png())"
341 | ]
342 | },
343 | {
344 | "cell_type": "markdown",
345 | "metadata": {},
346 | "source": [
347 | "### Conclusion\n",
348 | "In this tutorial, we build a deep convolutional neural network (CNN) to analyze CyTOF data. The deep CNN model is able to diagnose latent CMV infection with high accuracy. In addition, we were able to interpret the deep learning model using a permutation-based method. We discovered that a CD3+ CD8+ CD27- CD94+ population that have the highest impact on the deep CNN model. "
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": null,
354 | "metadata": {},
355 | "outputs": [],
356 | "source": []
357 | }
358 | ],
359 | "metadata": {
360 | "kernelspec": {
361 | "display_name": "Python 3 (ipykernel)",
362 | "language": "python",
363 | "name": "python3"
364 | },
365 | "language_info": {
366 | "codemirror_mode": {
367 | "name": "ipython",
368 | "version": 3
369 | },
370 | "file_extension": ".py",
371 | "mimetype": "text/x-python",
372 | "name": "python",
373 | "nbconvert_exporter": "python",
374 | "pygments_lexer": "ipython3",
375 | "version": "3.10.4"
376 | }
377 | },
378 | "nbformat": 4,
379 | "nbformat_minor": 1
380 | }
381 |
--------------------------------------------------------------------------------
/FCS_to_Array/FCS_to_Array.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "### Prepare fcs files for deep learning\n",
8 | "This is a small example for formatting data from fcs files into numpy array, and save the metaData, marker names and the numpy array into allData.obj file. Use the script as a template to prepare your own fcs files for deep learning. "
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "metadata": {
15 | "collapsed": false
16 | },
17 | "outputs": [],
18 | "source": [
19 | "import numpy as np\n",
20 | "import scipy as sp\n",
21 | "import pandas as pd\n",
22 | "import rpy2 as rp\n",
23 | "from rpy2.robjects.packages import importr\n",
24 | "from rpy2.robjects import pandas2ri\n",
25 | "from rpy2.robjects.conversion import localconverter\n",
26 | "import os \n",
27 | "import rpy2.robjects as ro\n",
28 | "import pickle\n",
29 | "from collections import Counter\n",
30 | "\n",
31 | "\n",
32 | "# import R's \"flowCore\" package\n",
33 | "utils = importr('flowCore')"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 2,
39 | "metadata": {
40 | "collapsed": false
41 | },
42 | "outputs": [
43 | {
44 | "name": "stdout",
45 | "output_type": "stream",
46 | "text": [
47 | " name study_accession CMV_Ab\n",
48 | "0 sample1.fcs study1 True\n",
49 | "1 sample2.fcs study2 False\n",
50 | "2 sample3.fcs study3 True\n"
51 | ]
52 | }
53 | ],
54 | "source": [
55 | "##### list fcs files #####\n",
56 | "cytof_files = pd.read_csv(\"metaData.csv\")\n",
57 | "print(cytof_files)\n",
58 | "fn = [os.path.join(os.getcwd(),f) for f in cytof_files.name]"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 3,
64 | "metadata": {
65 | "collapsed": true
66 | },
67 | "outputs": [],
68 | "source": [
69 | "##### read fcs file using the flowCore R package #####\n",
70 | "# flowCore is a very well maintained R package for reading and analyzing fcs files\n",
71 | "# Many of the fcs file related packages in python are a little buggy to use\n",
72 | "# Therefore, it is worth the trouble to read the fcs files using R \n",
73 | "\n",
74 | "r = rp.robjects.r\n",
75 | "expr_list = []\n",
76 | "for i in range(0,len(fn)):\n",
77 | " fn_i = fn[i]\n",
78 | " r_code = (\"library(flowCore);\"+\n",
79 | " \"library(MetaCyto);\"+\n",
80 | " \"fn = '\"+ fn_i+ \"'; \"+\n",
81 | " \"fcs = read.FCS(fn,truncate_max_range = FALSE);\"+\n",
82 | " \"expr = fcs@exprs;\"+\n",
83 | " \"markers = markerFinder(fcs);\"+\n",
84 | " \"colnames(expr) = markers;\"+\n",
85 | " \"expr = as.data.frame(expr);\"+\n",
86 | " # subsample 10,000 cells\n",
87 | " \"expr = expr[sample(1:nrow(expr),10000,replace = TRUE),]\")\n",
88 | " expr = r(r_code)\n",
89 | " expr_list.append(expr)"
90 | ]
91 | },
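{
"cell_type": "markdown",
"metadata": {},
"source": [
"If installing R and rpy2 is inconvenient, a pure-Python alternative is sketched below using the fcsparser package. This is an assumption on our part: fcsparser is not in the tutorial's requirements, and unlike MetaCyto's markerFinder it does not standardize marker names, so the column names may need manual cleanup."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# alternative sketch (assumes `pip install fcsparser`): read one fcs file in pure Python\n",
"# note: column names come straight from the fcs metadata and are NOT\n",
"# standardized the way MetaCyto's markerFinder standardizes them\n",
"import fcsparser\n",
"meta, expr_py = fcsparser.parse(fn[0], reformat_meta=True)\n",
"expr_py = expr_py.sample(n=10000, replace=True)  # subsample 10,000 cells\n",
"print(expr_py.shape)"
]
},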
92 | {
93 | "cell_type": "code",
94 | "execution_count": 4,
95 | "metadata": {
96 | "collapsed": false
97 | },
98 | "outputs": [
99 | {
100 | "name": "stdout",
101 | "output_type": "stream",
102 | "text": [
103 | "['TIME', 'CD57', 'CD19', 'CD45RA', 'CD4', 'CD8', 'CD20', 'CD16', 'CD127', 'CD123', 'CXCR5', 'CD86', 'CD27', 'CD11C', 'CD14', 'CD56', 'CCR6', 'CD25', 'CCR7', 'CD3', 'CD38', 'CD161', 'CXCR3', 'HLADR', 'CD11B']\n"
104 | ]
105 | }
106 | ],
107 | "source": [
108 | "##### get common markers #####\n",
109 | "markers = []\n",
110 | "for i in range(len(expr_list)):\n",
111 | " markers.extend(expr_list[i].colnames)\n",
112 | "\n",
113 | "markers = Counter(markers)\n",
114 | "markers = [k for k, c in markers.items() if c == 3]\n",
115 | "print(markers)\n",
116 | "\n",
117 | "for i in range(0,len(expr_list)):\n",
118 | " t1 = expr_list[i] \n",
119 | " with localconverter(ro.default_converter + pandas2ri.converter):\n",
120 | " t1 = ro.conversion.rpy2py(t1)\n",
121 | " expr_list[i] = t1.loc[:,markers]"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 5,
127 | "metadata": {
128 | "collapsed": false
129 | },
130 | "outputs": [
131 | {
132 | "name": "stdout",
133 | "output_type": "stream",
134 | "text": [
135 | "The dimenstion of the data is: (3, 10000, 24, 1)\n"
136 | ]
137 | }
138 | ],
139 | "source": [
140 | "##### transform and format into numpy array\n",
141 | "def arcsinh(x):\n",
142 | " return(np.arcsinh(x/5))\n",
143 | "\n",
144 | "coln = expr_list[0].columns.drop(\"TIME\")\n",
145 | "for i in range(len(expr_list)):\n",
146 | " t1 = expr_list[i].drop(columns=\"TIME\")\n",
147 | " t1 = t1.apply(arcsinh)\n",
148 | " t1 = t1.values\n",
149 | " shape1 = list(t1.shape)+[1]\n",
150 | " t1 = t1.reshape(shape1)\n",
151 | " expr_list[i] = t1\n",
152 | " \n",
153 | "expr_list = np.stack(expr_list)\n",
154 | "print(\"The dimenstion of the data is: \", expr_list.shape)"
155 | ]
156 | },
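{
"cell_type": "markdown",
"metadata": {},
"source": [
"The arcsinh transform with cofactor 5 used above is the standard variance-stabilizing transform for CyTOF data: it is roughly linear near zero and roughly logarithmic for large intensities. A tiny numeric illustration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustration: arcsinh(x/5) is near-linear for small x and\n",
"# log-like for large x, which stabilizes CyTOF intensity variance\n",
"for v in [0, 1, 10, 100, 1000]:\n",
"    print(v, \"->\", round(float(np.arcsinh(v / 5)), 3))"
]
},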
157 | {
158 | "cell_type": "code",
159 | "execution_count": 6,
160 | "metadata": {
161 | "collapsed": true
162 | },
163 | "outputs": [],
164 | "source": [
165 | "allData = {\"cytof_files\":cytof_files, \n",
166 | " \"expr_list\" : expr_list,\n",
167 | " \"marker_names\" : coln}\n",
168 | "\n",
169 | "with open(\"allData.obj\", \"wb\") as f:\n",
170 | " pickle.dump(allData, f)"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": null,
176 | "metadata": {
177 | "collapsed": false
178 | },
179 | "outputs": [],
180 | "source": []
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": null,
185 | "metadata": {
186 | "collapsed": true
187 | },
188 | "outputs": [],
189 | "source": []
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": null,
194 | "metadata": {
195 | "collapsed": true
196 | },
197 | "outputs": [],
198 | "source": []
199 | }
200 | ],
201 | "metadata": {
202 | "kernelspec": {
203 | "display_name": "Python 3",
204 | "language": "python",
205 | "name": "python3"
206 | },
207 | "language_info": {
208 | "codemirror_mode": {
209 | "name": "ipython",
210 | "version": 3
211 | },
212 | "file_extension": ".py",
213 | "mimetype": "text/x-python",
214 | "name": "python",
215 | "nbconvert_exporter": "python",
216 | "pygments_lexer": "ipython3",
217 | "version": "3.6.7"
218 | }
219 | },
220 | "nbformat": 4,
221 | "nbformat_minor": 2
222 | }
223 |
--------------------------------------------------------------------------------
/FCS_to_Array/allData.obj:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/FCS_to_Array/allData.obj
--------------------------------------------------------------------------------
/FCS_to_Array/metaData.csv:
--------------------------------------------------------------------------------
1 | name,study_accession,CMV_Ab
2 | sample1.fcs,study1,TRUE
3 | sample2.fcs,study2,FALSE
4 | sample3.fcs,study3,TRUE
--------------------------------------------------------------------------------
/FCS_to_Array/sample1.fcs:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/FCS_to_Array/sample1.fcs
--------------------------------------------------------------------------------
/FCS_to_Array/sample2.fcs:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/FCS_to_Array/sample2.fcs
--------------------------------------------------------------------------------
/FCS_to_Array/sample3.fcs:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/FCS_to_Array/sample3.fcs
--------------------------------------------------------------------------------
/FOCIS_deeplearning.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/FOCIS_deeplearning.pdf
--------------------------------------------------------------------------------
/ReadMe.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # An End-to-End Deep Learning Model for CyTOF Data
4 | Zicheng Hu, Ph.D.
5 | Research Scientist
6 | ImmPort Team
7 | The University of California, San Francisco
8 |
9 | ### Introduction
10 | This GitHub repository contains a tutorial for creating deep learning models tailored to CyTOF data. We will apply the model to diagnose latent cytomegalovirus (CMV) infection. We will also use a decision tree-based method to identify cell subsets that are associated with the CMV infection.
11 |
12 |
13 |
14 | ### Install dependencies
15 | Navigate into the tutorial folder. Create a conda environment with the dependencies for the tutorial. Installation takes around 5 minutes.
16 |
17 | ```conda create --name CyTOF_DL --file requirements.txt```
18 |
19 |
20 |
21 | ### Main tutorial
22 | Activate the conda environment.
23 |
24 | ```conda activate CyTOF_DL```
25 |
26 | Run the main tutorial [DeepLearning_CyTOF.ipynb](https://github.com/hzc363/DeepLearningCyTOF/blob/master/DeepLearning_CyTOF.ipynb). It takes around 40 minutes to run on a laptop with a 2.5 GHz CPU (Intel Core i7) and 16 GB of memory.
27 |
28 | ```jupyter notebook DeepLearning_CyTOF.ipynb```
29 |
30 |
31 |
32 | ### CyTOF data processing
33 | Please see the script in [FCS_to_Array folder](https://github.com/hzc363/DeepLearningCyTOF/tree/master/FCS_to_Array) for detailed preprocessing steps of the CyTOF data.
34 |
35 |
36 |
37 | ### More information
38 | For more background information on deep learning and its use in cytometry data, see the [slides](https://github.com/hzc363/DeepLearningCyTOF/blob/master/FOCIS_deeplearning.pdf) from the FOCIS 2019 workshop.
39 |
40 |
41 |
--------------------------------------------------------------------------------
/Result/saved_weights.hdf5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hzc363/DeepLearningCyTOF/ac6dc09a427edf4f2050f9f9ffbacfe732df4245/Result/saved_weights.hdf5
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | python==3.6.7
2 | Keras==2.2.4
3 | Pandas==0.24.2
4 | numpy==1.16.4
5 | matplotlib==3.1.1
6 | seaborn==0.9.0
7 | tensorflow==1.13.1
8 | scikit-learn==0.21.3
9 | pydotplus==2.0.2
10 | jupyter
11 | graphviz
--------------------------------------------------------------------------------