├── DNAm_CNA_to_RNAseq.ipynb ├── DNAm_to_CNA.ipynb ├── Pre-processing.ipynb └── README.md /DNAm_CNA_to_RNAseq.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.8"},"colab":{"name":"DNAm_CNA_to_RNAseq.ipynb","provenance":[],"collapsed_sections":[],"toc_visible":true},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","metadata":{"id":"F7TT2a600Bjc","colab_type":"text"},"source":["# Deep denoising auto-encoder and MLP based multi-output regression on TCGA multi-omics data\n","## DNA Methylation and CNA to RNA-Seq"]},{"cell_type":"markdown","metadata":{"id":"AuzLXzrHWHwF","colab_type":"text"},"source":["# Setting environment"]},{"cell_type":"markdown","metadata":{"id":"Q6mV2MJR0Bjm","colab_type":"text"},"source":["Seeding the random number generators"]},{"cell_type":"code","metadata":{"trusted":true,"id":"We-k_eij0Bjp","colab_type":"code","colab":{}},"source":[" # have reproducible behavior for certain hash-based operations.\n","import os\n","os.environ['PYTHONHASHSEED'] = '0'\n","# The below is necessary for starting Numpy generated random numbers\n","# in a well-defined initial state.\n","import numpy as np\n","np.random.seed(42)\n","# The below is necessary for starting core Python generated random numbers\n","# in a well-defined state.\n","import random as rn\n","rn.seed(12345)\n","\n","# The below tf.set_random_seed() will make random number generation\n","# in the TensorFlow backend have a well-defined initial state.\n","import tensorflow as tf\n","tf.set_random_seed(1234)\n","\n","# Force TensorFlow to use single thread.\n","session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,\n"," 
inter_op_parallelism_threads=1)\n","from keras import backend as K\n","sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)\n","K.set_session(sess)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"HS4bSNm40Bj5","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"dStQVwd30Bj8","colab_type":"code","colab":{}},"source":["from keras.layers import Input, Dense, Dropout\n","from keras.models import Model\n","from keras import regularizers\n","from sklearn.preprocessing import MinMaxScaler\n","from sklearn.model_selection import train_test_split\n","from sklearn.metrics import mean_squared_error\n","import pandas as pd\n","import matplotlib\n","import matplotlib.pyplot as plt\n","import numpy as np\n","import tensorflow as tf\n","from sklearn.metrics import r2_score \n","from sklearn.metrics import classification_report\n","from sklearn.metrics import confusion_matrix\n","from imblearn.over_sampling import SMOTE"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"_iJym7v70BkF","colab_type":"code","colab":{}},"source":["def rSquared(true,predicted):\n"," cols = predicted.shape[1]\n"," rsq = np.zeros(shape=(cols), dtype = np.float32)\n"," for j in range(cols):\n"," rsq[j] = r2_score(true[:,j], predicted[:,j])\n"," return rsq"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"2kQvYtgqWOL2","colab_type":"text"},"source":["# Loading data"]},{"cell_type":"markdown","metadata":{"id":"tq5QN3Kq0BkP","colab_type":"text"},"source":["Importing data from pre-processed csv files (Change paths accordingly)"]},{"cell_type":"code","metadata":{"id":"cmJBB9flgeAU","colab_type":"code","colab":{}},"source":["from google.colab import drive\n","drive.mount('/content/drive')"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"of_LBPo4gsAu","colab_type":"code","colab":{}},"source":["#ls \"/content/drive/My 
Drive\""],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"2R8C7R7bhruf","colab_type":"code","colab":{}},"source":["preprocessed_DNAMeth = pd.read_csv('/content/drive/My Drive/TCGA Data/Preprocessed_Data/LIHC_preprocessed_DNAMeth.csv')\n","preprocessed_RNASeq = pd.read_csv('/content/drive/My Drive/TCGA Data/Preprocessed_Data/LIHC_preprocessed_RNASeq.csv')\n","preprocessed_CNA = pd.read_csv('/content/drive/My Drive/TCGA Data/Preprocessed_Data/LIHC_preprocessed_CNA.csv')\n","labels = pd.read_csv('/content/drive/My Drive/TCGA Data/Preprocessed_Data/LIHC_labels.csv')"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"Qz33rPHb0Bkg","colab_type":"code","colab":{}},"source":["x1 = preprocessed_DNAMeth\n","x2 = preprocessed_CNA \n","y = preprocessed_RNASeq"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Djlmve230Bkn","colab_type":"text"},"source":["Concatenating Methylation and CNV features"]},{"cell_type":"code","metadata":{"trusted":true,"id":"lXhl777S0Bkq","colab_type":"code","colab":{}},"source":["x1 = pd.DataFrame(x1)\n","x2 = pd.DataFrame(x2)\n","df = [x1, x2]\n","z = pd.concat(df,axis=1)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"tmC71-v40Bkw","colab_type":"text"},"source":["Splitting the data into training and testing datasets"]},{"cell_type":"code","metadata":{"trusted":true,"id":"pUNvrzVv0Bkx","colab_type":"code","colab":{}},"source":["x_train, x_test, y_train, y_test, labels_train, labels_test = train_test_split(z, y, labels, test_size=0.2)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"kP9vWgne0Bk4","colab_type":"text"},"source":["Scaling the data within [0-1] range"]},{"cell_type":"code","metadata":{"trusted":true,"id":"GHvZC4dW0Bk7","colab_type":"code","colab":{}},"source":["scalar = MinMaxScaler()\n","x_train = scalar.fit_transform(x_train)\n","x_test = 
scalar.transform(x_test)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"TjKrhg1p0BlB","colab_type":"text"},"source":["Adding gaussian noise"]},{"cell_type":"code","metadata":{"trusted":true,"id":"X_GPRDwb0BlD","colab_type":"code","colab":{}},"source":["noise_factor = 0.5\n","x_train_noisy = x_train + noise_factor * np.random.normal(0.0, 1.0, x_train.shape)\n","x_test_noisy = x_test + noise_factor * np.random.normal(0.0, 1.0, x_test.shape)\n","\n","x_train_noisy = np.clip(x_train_noisy, 0., 1.)\n","x_test_noisy = np.clip(x_test_noisy, 0., 1.)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"hMJDjLMm0BlQ","colab_type":"text"},"source":["# Dimension Reduction/Feature Extraction using DDAE"]},{"cell_type":"markdown","metadata":{"id":"GMDqA1_y0BlJ","colab_type":"text"},"source":["Setting the no. of input and output neurons"]},{"cell_type":"code","metadata":{"trusted":true,"id":"OZa0LGeq0BlK","colab_type":"code","colab":{}},"source":["num_in_neurons = z.shape[1]\n","num_out_neurons = y.shape[1]"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"Cf0X8WAg0BlS","colab_type":"code","colab":{}},"source":["# Auto-encoder to extract features from DNA Methylation and CNV data\n","\n","with tf.device('/gpu:0'):\n"," # this is the size of our encoded representations\n"," encoding_dim1 = 500\n"," encoding_dim2 = 200\n"," \n"," lambda_act = 0.0001\n"," lambda_weight = 0.001\n"," # this is our input placeholder\n"," input_data = Input(shape=(num_in_neurons,))\n"," # first encoded representation of the input\n"," encoded = Dense(encoding_dim1, activation='relu', activity_regularizer=regularizers.l1(lambda_act), kernel_regularizer=regularizers.l2(lambda_weight), name='encoder1')(input_data)\n"," # second encoded representation of the input\n"," encoded = Dense(encoding_dim2, activation='relu', activity_regularizer=regularizers.l1(lambda_act), 
kernel_regularizer=regularizers.l2(lambda_weight), name='encoder2')(encoded)\n"," # first lossy reconstruction of the input\n"," decoded = Dense(encoding_dim1, activation='relu', name='decoder1')(encoded)\n"," # the final lossy reconstruction of the input\n"," decoded = Dense(num_in_neurons, activation='sigmoid', name='decoder2')(decoded)\n"," \n"," # this model maps an input to its reconstruction\n"," autoencoder = Model(inputs=input_data, outputs=decoded)\n"," \n"," myencoder = Model(inputs=input_data, outputs=encoded)\n"," autoencoder.compile(optimizer='sgd', loss='mse')\n"," # training\n"," print('training the autoencoder')\n"," autoencoder.fit(x_train_noisy, x_train,\n"," epochs=25,\n"," batch_size=8,\n"," shuffle=True,\n"," validation_data=(x_test_noisy, x_test))\n"," autoencoder.trainable = False #freeze autoencoder weights"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"j6jCTcvb0BlX","colab_type":"text"},"source":["# Regression using MLP"]},{"cell_type":"code","metadata":{"scrolled":false,"trusted":true,"id":"HK5d2ubz0BlY","colab_type":"code","colab":{}},"source":["# MLP Multi-output Regression code goes here...\n","\n","num_hidden = encoding_dim2\n","with tf.device('/gpu:0'): \n"," x = autoencoder.get_layer('encoder2').output\n"," x = Dropout(0.2)(x) # adding 20% dropout\n"," h = Dense(int(num_hidden * 3), activation='relu', name='hidden1')(x)\n"," h = Dropout(0.5)(h) # adding 50% dropout\n"," h = Dense(int(num_hidden * 5), activation='relu', name='hidden2')(h)\n"," h = Dropout(0.5)(h) # adding 50% dropout\n"," y = Dense(num_out_neurons, activation='linear', name='prediction')(h)\n"," mlpRegressor = Model(inputs=autoencoder.inputs, outputs=y)\n","\n"," # Compile model\n"," mlpRegressor.compile(loss='mse', optimizer='adam', metrics=['accuracy']) # or loss='mae'\n"," \n"," # Fit the model\n"," print('training the MLP multi-output regressor')\n"," mlpRegressor.fit(x_train, y_train, epochs=50, batch_size=8)\n"," \n"," y_pred = 
mlpRegressor.predict(x_test)\n"," \n"," actual_mean = pd.DataFrame(y_test.mean(axis=0))\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"hHoR5G3B0Bld","colab_type":"text"},"source":["# Results"]},{"cell_type":"code","metadata":{"scrolled":true,"trusted":true,"id":"5cyG1HyB0Blf","colab_type":"code","colab":{}},"source":["print('MSE: (Actual Vs. Predicted)', mean_squared_error(y_test, y_pred))\n","print('r^2 value: (Mean of actual Vs. Mean of Predicted)', r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"wlEDWBYz0Blj","colab_type":"text"},"source":["Plotting predicted Vs. Actual"]},{"cell_type":"code","metadata":{"trusted":true,"id":"jFNyH3TE0Blk","colab_type":"code","colab":{}},"source":["act=actual_mean.values.flatten()\n","pred=pred_mean.values.flatten()\n","\n","s1 = pd.Series(act)\n","s2 = pd.Series(pred)\n","\n","plt.figure(figsize=(20,10))\n","ax = plt.subplot(111)\n","plt.title('Average of actual and predicted gene expression values across all samples')\n","plt.xlabel('No. of features (genes)')\n","plt.ylabel('Average of gene expression values across samples')\n","ax.plot(s1, 'b--', label='Actual')\n","ax.plot(s2, 'r--', label='Predicted')\n","ax.legend()\n","plt.grid(True)\n","plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"WGEmDVRk0Blp","colab_type":"text"},"source":["Plotting first 100 features"]},{"cell_type":"code","metadata":{"trusted":true,"id":"1WD-GAyl0Blr","colab_type":"code","colab":{}},"source":["act=actual_mean[0:100].values.flatten()\n","pred=pred_mean[0:100].values.flatten()\n","\n","s1 = pd.Series(act)\n","s2 = pd.Series(pred)\n","\n","plt.figure(figsize=(10,5))\n","ax = plt.subplot(111)\n","plt.title('DDAE-MLP')\n","plt.xlabel('No. 
of features (genes)')\n","plt.ylabel('Average of gene expression values across samples')\n","ax.plot(s1, 'b--', label='Actual')\n","ax.plot(s2, 'r--', label='Predicted')\n","ax.legend()\n","plt.grid(True)\n","plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"3waJ5f9n0Blv","colab_type":"text"},"source":["Plotting correlation scatter plot for mean of actual Vs. mean of predicted gene expressions"]},{"cell_type":"code","metadata":{"trusted":true,"id":"uO3k2InL0Blx","colab_type":"code","colab":{}},"source":["plt.figure(figsize=(20,10))\n","plt.scatter(actual_mean, pred_mean)\n","plt.title('Correlation between mean of actual and mean predicted gene expression across all samples')\n","plt.xlabel('Average of true gene expressions across samples')\n","plt.ylabel('Average of predicted gene expressions across samples')\n","plt.grid(True)\n","plt.show()\n"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"mz8F8tuZ0Bl1","colab_type":"text"},"source":["# Classification of Tumor and Normal samples using MLP"]},{"cell_type":"markdown","metadata":{"id":"Kayc51Va0Bl2","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"6i-ullSj0Bl3","colab_type":"code","colab":{}},"source":["from sklearn.metrics import classification_report\n","from sklearn.metrics import confusion_matrix\n","from sklearn.metrics import recall_score"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"__PGqiLD0Bl8","colab_type":"text"},"source":["MLP-classifier"]},{"cell_type":"code","metadata":{"trusted":true,"id":"ed6Xte-h0Bl9","colab_type":"code","colab":{}},"source":["num_hidden = encoding_dim2\n","with tf.device('/gpu:0'): \n"," x = autoencoder.get_layer('encoder2').output\n"," x = Dropout(0.2)(x) # adding 20% dropout\n"," h = Dense(int(num_hidden * 3), activation='relu', name='hidden1')(x)\n"," h = Dropout(0.5)(h) # adding 50% dropout\n"," h = Dense(int(num_hidden 
* 5), activation='relu', name='hidden2')(h)\n"," h = Dropout(0.5)(h) # adding 50% dropout\n"," y = Dense(1, activation='sigmoid', name='predictions')(h)\n","\n"," classifier = Model(inputs=autoencoder.inputs, outputs=y)\n"," # Compile model\n"," classifier.compile(loss='binary_crossentropy', optimizer='adam',\n"," metrics=['accuracy'])\n"," # Fit the model\n"," classifier.fit(x_train, labels_train, epochs=25, batch_size=8)\n","\n"," print('Now making predictions')\n"," predictions = classifier.predict(x_test)\n"," rounded_predictions = [round(x[0]) for x in predictions]"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"bgbNFdwI0BmB","colab_type":"text"},"source":["Evaluating the model"]},{"cell_type":"code","metadata":{"trusted":true,"id":"L3USITmc0BmC","colab_type":"code","colab":{}},"source":["_, train_acc = classifier.evaluate(x_train, labels_train, verbose=0)\n","_, test_acc = classifier.evaluate(x_test, labels_test, verbose=0)\n","print('\\nTraining accuracy: %.3f, Testing accuracy: %.3f' % (train_acc, test_acc))\n","print(\"Recall score = \",recall_score(labels_test, rounded_predictions))\n","cm = confusion_matrix(labels_test, rounded_predictions)\n","print(\"Confusion matrix:\")\n","print(cm)\n","report = classification_report(labels_test, rounded_predictions)\n","print(\"classification_report\")\n","print(report)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"icrX6klW0BmF","colab_type":"text"},"source":["# Comparing regression results with other standard methods"]},{"cell_type":"markdown","metadata":{"id":"xRygnqFe0BmH","colab_type":"text"},"source":["Evaluating the mean of actual y values"]},{"cell_type":"code","metadata":{"trusted":true,"id":"0Zks4RtD0BmJ","colab_type":"code","colab":{}},"source":["actual_mean = pd.DataFrame(y_test.mean(axis=0))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"h2IxiBv60BmO","colab_type":"text"},"source":["# 1. 
Linear Regression"]},{"cell_type":"markdown","metadata":{"id":"I9OhPUaF0BmQ","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"kP-q1S-r0BmR","colab_type":"code","colab":{}},"source":["from sklearn.metrics import r2_score \n","from sklearn.metrics import mean_squared_error\n","from sklearn.linear_model import LinearRegression"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1PCtFmEh0BmV","colab_type":"text"},"source":["Multi-output regression using Linear Regression (OLS) (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":true,"id":"xj0tTIcM0BmV","colab_type":"code","colab":{}},"source":["with tf.device('/gpu:0'):\n"," linear_Regr = LinearRegression(normalize=True)\n"," linear_Regr.fit(x_train, y_train)\n"," y_pred = linear_Regr.predict(x_test)\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n"," y_mse = mean_squared_error(y_test, y_pred)\n"," y_r2score = r2_score(actual_mean, pred_mean)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"3apegKi00BmZ","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error (y_test Vs. y_pred): \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"W5EAICAZ0Bmd","colab_type":"text"},"source":["Plotting first 100 features"]},{"cell_type":"code","metadata":{"trusted":true,"id":"rpI2SL5F0Bmd","colab_type":"code","colab":{}},"source":["act=actual_mean[0:100].values.flatten()\n","pred=pred_mean[0:100].values.flatten()\n","\n","s1 = pd.Series(act)\n","s2 = pd.Series(pred)\n","\n","plt.figure(figsize=(10,5))\n","ax = plt.subplot(111)\n","plt.title('Linear Regression')\n","plt.xlabel('No. 
of features (genes)')\n","plt.ylabel('Average of gene expression values across samples')\n","ax.plot(s1, 'b--', label='Actual')\n","ax.plot(s2, 'r--', label='Predicted')\n","ax.legend()\n","plt.grid(True)\n","plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"J3ZAbV6T0Bmh","colab_type":"text"},"source":["# 2. Lasso"]},{"cell_type":"markdown","metadata":{"id":"YXK8Zqxl0Bmi","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"9D2dAWfb0Bmj","colab_type":"code","colab":{}},"source":["from sklearn.metrics import r2_score \n","from sklearn.metrics import mean_squared_error\n","from sklearn.linear_model import Lasso"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"SQXYxyp00Bmn","colab_type":"text"},"source":["Multi-output regression using Lasso (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":true,"id":"Xyp0yST-0Bmo","colab_type":"code","colab":{}},"source":["y_mse=[]\n","y_r2score=[]\n","with tf.device('/gpu:0'):\n"," #for alp in [0.1, 0.2, 0.3, 0.4, 0.5]\n"," for alp in [0.01,0.1,0.5,1,5]:\n"," print('Working with alpha=',alp)\n"," Lasso_Regr = Lasso(alpha=alp, normalize=True, random_state=42)\n"," Lasso_Regr.fit(x_train, y_train)\n"," y_pred = Lasso_Regr.predict(x_test)\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n"," y_mse.append(mean_squared_error(y_test, y_pred))\n"," y_r2score.append(r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"fVU3HJIP0Bmr","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error (y_test Vs. y_pred): \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. 
y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"IXZ92BsA0Bmu","colab_type":"text"},"source":["Plotting first 100 features"]},{"cell_type":"code","metadata":{"trusted":true,"id":"Xt5NGL2f0Bmu","colab_type":"code","colab":{}},"source":["act=actual_mean[0:100].values.flatten()\n","pred=pred_mean[0:100].values.flatten()\n","\n","s1 = pd.Series(act)\n","s2 = pd.Series(pred)\n","\n","plt.figure(figsize=(10,5))\n","ax = plt.subplot(111)\n","# pred_mean holds the predictions of the last alpha tried in the loop above\n","plt.title('Lasso with alpha = {}'.format(alp))\n","plt.xlabel('No. of features (genes)')\n","plt.ylabel('Average of gene expression values across samples')\n","ax.plot(s1, 'b--', label='Actual')\n","ax.plot(s2, 'r--', label='Predicted')\n","ax.legend()\n","plt.grid(True)\n","plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"MNuYCwNJ0Bmy","colab_type":"text"},"source":["# 3. Ridge"]},{"cell_type":"markdown","metadata":{"id":"2u3njJvr0Bmz","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"2zLplmf20Bmz","colab_type":"code","colab":{}},"source":["from sklearn.linear_model import Ridge"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"WRgznSR-0Bm3","colab_type":"text"},"source":["Multi-output regression using Ridge (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":true,"id":"0nQzkzyK0Bm3","colab_type":"code","colab":{}},"source":["y_mse=[]\n","y_r2score=[]\n","with tf.device('/gpu:0'):\n","    for alp in [0.01,0.1,0.5,1,1.5]:\n","        print('Working with alpha = ',alp)\n","        Ridge_Regr = Ridge(alpha=alp, normalize=True)\n","        Ridge_Regr.fit(x_train, y_train)\n","        y_pred = Ridge_Regr.predict(x_test)\n","        pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n","        y_mse.append(mean_squared_error(y_test, y_pred))\n","        y_r2score.append(r2_score(actual_mean,
pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"dsO8jc480Bm7","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error (y_test Vs. y_pred): \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1EHmSjb50Bm-","colab_type":"text"},"source":["Plotting first 100 features"]},{"cell_type":"code","metadata":{"trusted":true,"id":"BpMX0T4-0BnA","colab_type":"code","colab":{}},"source":["act=actual_mean[0:100].values.flatten()\n","pred=pred_mean[0:100].values.flatten()\n","\n","s1 = pd.Series(act)\n","s2 = pd.Series(pred)\n","\n","plt.figure(figsize=(10,5))\n","ax = plt.subplot(111)\n","plt.title('Ridge Regression with alpha=1.5')\n","plt.xlabel('No. of features (genes)')\n","plt.ylabel('Average of gene expression values across samples')\n","ax.plot(s1, 'b--', label='Actual')\n","ax.plot(s2, 'r--', label='Predicted')\n","ax.legend()\n","plt.grid(True)\n","plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"VhlDwdPD0BnF","colab_type":"text"},"source":["# 4. 
PCA - Random Forest (PCA-RF)"]},{"cell_type":"markdown","metadata":{"id":"8ZLv1nDn0BnF","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"KBIpnfjS0BnH","colab_type":"code","colab":{}},"source":["from sklearn.ensemble import RandomForestRegressor\n","from sklearn.decomposition import PCA"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"6nQMUdNo0BnM","colab_type":"code","colab":{}},"source":["n=200\n","pca = PCA(n_components=n)\n","pca.fit(x_train)\n","x_train = pca.transform(x_train)\n","x_test = pca.transform(x_test)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"0idYs_oA0BnO","colab_type":"text"},"source":["Multi-output regression using Random Forest (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":true,"id":"Jjwgxai90BnP","colab_type":"code","colab":{}},"source":["y_mse=[]\n","y_r2score=[]\n","with tf.device('/gpu:0'):\n"," for est in [10,50,100,150,200]:\n"," print('estimators = ',est)\n"," rf_Regr = RandomForestRegressor(n_estimators=est, n_jobs=-1)\n"," rf_Regr.fit(x_train, y_train)\n"," y_pred = rf_Regr.predict(x_test)\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n"," y_mse.append(mean_squared_error(y_test, y_pred))\n"," y_r2score.append(r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"Sec2ULNv0BnR","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error: \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. 
y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"yiEV7OwE0BnV","colab_type":"text"},"source":["Plotting first 100 features"]},{"cell_type":"code","metadata":{"trusted":true,"id":"25x1RArE0Bnk","colab_type":"code","colab":{}},"source":["act=actual_mean[0:100].values.flatten()\n","pred=pred_mean[0:100].values.flatten()\n","\n","s1 = pd.Series(act)\n","s2 = pd.Series(pred)\n","\n","plt.figure(figsize=(10,5))\n","ax = plt.subplot(111)\n","# pred_mean holds the predictions of the last n_estimators value tried above\n","plt.title('PCA-RF with {} features and {} estimators'.format(n, est))\n","plt.xlabel('No. of features (genes)')\n","plt.ylabel('Average of gene expression values across samples')\n","ax.plot(s1, 'b--', label='Actual')\n","ax.plot(s2, 'r--', label='Predicted')\n","ax.legend()\n","plt.grid(True)\n","plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"UPwG6tW30Bnn","colab_type":"text"},"source":["# 5. k-Nearest Neighbor (kNN)"]},{"cell_type":"markdown","metadata":{"id":"IloQzOFC0Bno","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"M9u1lMOM0Bnp","colab_type":"code","colab":{}},"source":["from sklearn.neighbors import KNeighborsRegressor"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"xHB9yJyU0Bns","colab_type":"text"},"source":["Multi-output regression using kNN (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":true,"id":"YdNZ8A5h0Bnu","colab_type":"code","colab":{}},"source":["y_mse=[]\n","y_r2score=[]\n","with tf.device('/gpu:0'):\n","    for k in [5,10,15,20,25]:\n","        print('k=',k)\n","        knn_Regr = KNeighborsRegressor(n_neighbors=k, n_jobs=-1)\n","        knn_Regr.fit(x_train, y_train)\n","        y_pred = knn_Regr.predict(x_test)\n","        pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n","        y_mse.append(mean_squared_error(y_test, y_pred))\n","        y_r2score.append(r2_score(actual_mean,
pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"Rxk455040Bnx","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error: \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"LwPD9EDW0Bnz","colab_type":"text"},"source":["Plotting first 100 features"]},{"cell_type":"code","metadata":{"trusted":true,"id":"yQMRI5KJ0Bn0","colab_type":"code","colab":{}},"source":["act=actual_mean[0:100].values.flatten()\n","pred=pred_mean[0:100].values.flatten()\n","\n","s1 = pd.Series(act)\n","s2 = pd.Series(pred)\n","\n","plt.figure(figsize=(10,5))\n","ax = plt.subplot(111)\n","# pred_mean holds the predictions of the last k tried in the loop above\n","plt.title('k-NN with k={}'.format(k))\n","plt.xlabel('No. of features (genes)')\n","plt.ylabel('Average of gene expression values across samples')\n","ax.plot(s1, 'b--', label='Actual')\n","ax.plot(s2, 'r--', label='Predicted')\n","ax.legend()\n","plt.grid(True)\n","plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Pwd-AWax0Bn4","colab_type":"text"},"source":["# 6.
PCA - Support Vector Regression (PCA-SVR)"]},{"cell_type":"code","metadata":{"trusted":true,"id":"IMBKQ9Oj0Bn5","colab_type":"code","colab":{}},"source":["from sklearn.svm import SVR\n","from sklearn.multioutput import MultiOutputRegressor\n","from sklearn.decomposition import PCA"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"tDpMdvy30Bn_","colab_type":"code","colab":{}},"source":["n=200\n","pca = PCA(n_components=n)\n","pca.fit(x_train)\n","x_train = pca.transform(x_train)\n","x_test = pca.transform(x_test)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"Yvz69XKX0BoC","colab_type":"code","colab":{}},"source":["with tf.device('/gpu:0'):\n","    y_mse=[]\n","    y_r2score = []\n","    for k in ['linear','poly','rbf','sigmoid']:\n","        print('kernel = ',k)\n","        mo_svr = MultiOutputRegressor(SVR(kernel=k,gamma='auto'))\n","        mo_svr.fit(x_train, y_train)\n","        y_pred = mo_svr.predict(x_test)\n","        pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n","        y_mse.append(mean_squared_error(y_test, y_pred))\n","        y_r2score.append(r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"c-ICBluk0BoE","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error: \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"BXGpN6Vh0BoG","colab_type":"text"},"source":["Plotting first 100 features"]},{"cell_type":"code","metadata":{"trusted":true,"id":"MEUvOBqb0BoH","colab_type":"code","colab":{}},"source":["act=actual_mean[0:100].values.flatten()\n","pred=pred_mean[0:100].values.flatten()\n","\n","s1 = pd.Series(act)\n","s2 = pd.Series(pred)\n","\n","plt.figure(figsize=(10,5))\n","ax = plt.subplot(111)\n","# pred_mean holds the predictions of the last kernel tried in the loop above\n","plt.title('PCA-SVR with {} features and {} kernel'.format(n, k))\n","plt.xlabel('No.
of features (genes)')\n","plt.ylabel('Average of gene expression values across samples')\n","ax.plot(s1, 'b--', label='Actual')\n","ax.plot(s2, 'r--', label='Predicted')\n","ax.legend()\n","plt.grid(True)\n","plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"d_lQ7fJA0BoK","colab_type":"text"},"source":["# Comparing regression results from AE-MLP with PCA-MLP"]},{"cell_type":"markdown","metadata":{"id":"Uroy8EG20BoL","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":false,"id":"CuvDm03W0BoM","colab_type":"code","colab":{}},"source":["from sklearn.decomposition import PCA"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"DgMI_5sl0BoO","colab_type":"text"},"source":["PCA-MLP"]},{"cell_type":"code","metadata":{"trusted":false,"id":"6KM2qEmB0BoQ","colab_type":"code","colab":{}},"source":["n=100\n","with tf.device('/gpu:0'):\n","    # Reduce the input features to n principal components\n","    pca = PCA(n_components=n)\n","    pca.fit(x_train)\n","    x_train = pca.transform(x_train)\n","    x_test = pca.transform(x_test)\n","\n","    # MLP multi-output regression on the PCA features\n","    num = n\n","    input_data = Input(shape=(num,))\n","    x = Dropout(0.2)(input_data) # adding 20% dropout\n","    h = Dense(int(num * 3), activation='relu', name='hidden1')(x)\n","    h = Dropout(0.5)(h) # adding 50% dropout\n","    h = Dense(int(num * 5), activation='relu', name='hidden2')(h)\n","    h = Dropout(0.5)(h) # adding 50% dropout\n","    y = Dense(num_out_neurons, activation='linear', name='prediction')(h)\n","    mlpRegressor = Model(inputs=input_data, outputs=y)\n","\n","    # Compile model\n","    mlpRegressor.compile(loss='mse', optimizer='adam', metrics=['accuracy']) # or loss='mae'\n","    # Fit the model\n","    print('training the MLP multi-output regressor')\n","    mlpRegressor.fit(x_train, y_train, epochs=50, batch_size=8)\n","    y_pred = mlpRegressor.predict(x_test)\n","    actual_mean = pd.DataFrame(y_test.mean(axis=0))\n","    pred_mean =
pd.DataFrame(y_pred.mean(axis=0))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"aUkxWs2U0BoR","colab_type":"text"},"source":["Printing results"]},{"cell_type":"code","metadata":{"trusted":true,"id":"0fbbI4US0BoS","colab_type":"code","colab":{}},"source":["print('MSE: (Actual Vs. Predicted)', mean_squared_error(y_test, y_pred))\n","print('r^2 value: (Mean of actual Vs. Mean of Predicted)', r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]}]} -------------------------------------------------------------------------------- /DNAm_to_CNA.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.8"},"colab":{"name":"DNAm_to_CNA.ipynb","provenance":[],"collapsed_sections":[],"toc_visible":true},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","metadata":{"id":"4yPGgizG1KTK","colab_type":"text"},"source":["# Deep denoising auto-encoder and MLP based multi-output regression on TCGA multi-omics data\n","## DNA Methylation to Copy Number Alteration"]},{"cell_type":"markdown","metadata":{"id":"kJy_pIri1KTQ","colab_type":"text"},"source":["# Setting environment"]},{"cell_type":"markdown","metadata":{"id":"oJJ2zyZo1KTT","colab_type":"text"},"source":["Seeding the random number generators"]},{"cell_type":"code","metadata":{"trusted":true,"id":"XQ5DPaTo1KTX","colab_type":"code","colab":{}},"source":["# Fix the hash seed to have reproducible behavior for certain hash-based operations.\n","import os\n","os.environ['PYTHONHASHSEED'] = '0'\n","# The below is necessary for starting Numpy generated random numbers\n","# in a well-defined initial state.\n","import numpy as np\n","np.random.seed(42)\n","# The
below is necessary for starting core Python generated random numbers\n","# in a well-defined state.\n","import random as rn\n","rn.seed(12345)\n","\n","# The below tf.set_random_seed() will make random number generation\n","# in the TensorFlow backend have a well-defined initial state.\n","import tensorflow as tf\n","tf.set_random_seed(1234)\n","\n","# Force TensorFlow to use single thread.\n","session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,\n"," inter_op_parallelism_threads=1)\n","from keras import backend as K\n","sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)\n","K.set_session(sess)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"4Nl334Dw1KTj","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"4GDVVcfj1KTn","colab_type":"code","colab":{}},"source":["from keras.layers import Input, Dense, Dropout\n","from keras.models import Model\n","from sklearn.preprocessing import MinMaxScaler\n","from sklearn.model_selection import train_test_split\n","from sklearn.metrics import mean_squared_error\n","from keras import regularizers\n","import matplotlib\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","from sklearn.metrics import r2_score"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"eYGuFsqqZkMX","colab_type":"text"},"source":["# Loading data"]},{"cell_type":"markdown","metadata":{"id":"tq5QN3Kq0BkP","colab_type":"text"},"source":["Importing data from pre-processed csv files (Change paths accordingly)"]},{"cell_type":"code","metadata":{"id":"cmJBB9flgeAU","colab_type":"code","colab":{}},"source":["from google.colab import drive\n","drive.mount('/content/drive')"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"of_LBPo4gsAu","colab_type":"code","colab":{}},"source":["#ls \"/content/drive/My 
Drive\""],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"2R8C7R7bhruf","colab_type":"code","colab":{}},"source":["preprocessed_DNAMeth = pd.read_csv('/content/drive/My Drive/TCGA Data/Preprocessed_Data/LIHC_preprocessed_DNAMeth.csv')\n","preprocessed_CNA = pd.read_csv('/content/drive/My Drive/TCGA Data/Preprocessed_Data/LIHC_preprocessed_CNA.csv')"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"Sc8zl9ko1KUD","colab_type":"code","colab":{}},"source":["x = preprocessed_DNAMeth\n","y = preprocessed_CNA"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"psvXlpKr1KUK","colab_type":"text"},"source":["Splitting the data into training and testing datasets"]},{"cell_type":"code","metadata":{"trusted":true,"id":"_HrtO3s41KUM","colab_type":"code","colab":{}},"source":["x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"23LoYaMm1KUU","colab_type":"text"},"source":["Scaling the data within [0-1] range"]},{"cell_type":"code","metadata":{"trusted":true,"id":"e3da-HjV1KUW","colab_type":"code","colab":{}},"source":["scalar = MinMaxScaler()\n","x_train = scalar.fit_transform(x_train)\n","x_test = scalar.transform(x_test)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"vY3jlrsK1KUf","colab_type":"text"},"source":["Adding gaussian noise"]},{"cell_type":"code","metadata":{"trusted":true,"id":"GA9KkTC11KUi","colab_type":"code","colab":{}},"source":["noise_factor = 0.5\n","x_train_noisy = x_train + noise_factor * np.random.normal(0.0, 1.0, x_train.shape)\n","x_test_noisy = x_test + noise_factor * np.random.normal(0.0, 1.0, x_test.shape)\n","\n","x_train_noisy = np.clip(x_train_noisy, 0., 1.)\n","x_test_noisy = np.clip(x_test_noisy, 0., 1.)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"g3Hz3xNL1KUz","colab_type":"text"},"source":["# 
Dimension Reduction/Feature Extraction using DDAE"]},{"cell_type":"markdown","metadata":{"id":"FHTP3t0G1KUp","colab_type":"text"},"source":["Setting the no. of input and output neurons"]},{"cell_type":"code","metadata":{"trusted":true,"id":"xEq-HmEb1KUs","colab_type":"code","colab":{}},"source":["num_in_neurons = x.shape[1]\n","num_out_neurons = y.shape[1]"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"Nt8lJ5en1KU7","colab_type":"code","colab":{}},"source":["# Auto-encoder to extract features from DNA Methylation data\n","\n","with tf.device('/gpu:0'):\n"," \n"," # this is the size of our encoded representations\n"," encoding_dim1 = 500\n"," encoding_dim2 = 200\n"," \n"," lambda_act = 0.0001\n"," lambda_weight = 0.001\n"," # this is our input placeholder\n"," input_data = Input(shape=(num_in_neurons,))\n"," # first encoded representation of the input\n"," encoded = Dense(encoding_dim1, activation='relu', activity_regularizer=regularizers.l1(lambda_act), kernel_regularizer=regularizers.l2(lambda_weight), name='encoder1')(input_data)\n"," # second encoded representation of the input\n"," encoded = Dense(encoding_dim2, activation='relu', activity_regularizer=regularizers.l1(lambda_act), kernel_regularizer=regularizers.l2(lambda_weight), name='encoder2')(encoded)\n"," # first lossy reconstruction of the input\n"," decoded = Dense(encoding_dim1, activation='relu', name='decoder1')(encoded)\n"," # the final lossy reconstruction of the input\n"," decoded = Dense(num_in_neurons, activation='sigmoid', name='decoder2')(decoded)\n"," \n"," # this model maps an input to its reconstruction\n"," autoencoder = Model(inputs=input_data, outputs=decoded)\n"," \n"," myencoder = Model(inputs=input_data, outputs=encoded)\n"," autoencoder.compile(optimizer='sgd', loss='mse')\n"," # training\n"," print('training the autoencoder')\n"," autoencoder.fit(x_train_noisy, x_train,\n"," epochs=25,\n"," batch_size=8,\n"," shuffle=True,\n"," 
validation_data=(x_test_noisy, x_test))\n"," \n"," ae_train = myencoder.predict(x_train)\n"," ae_test = myencoder.predict(x_test)\n"," autoencoder.trainable = False # freeze autoencoder weights"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"exEaTXSS1KVE","colab_type":"text"},"source":["# Regression using MLP"]},{"cell_type":"code","metadata":{"trusted":true,"id":"-tlKltAL1KVG","colab_type":"code","colab":{}},"source":["# MLP multi-output regression on top of the trained encoder\n","\n","num_hidden = encoding_dim2\n","with tf.device('/gpu:0'):\n"," # create regression model\n"," x = autoencoder.get_layer('encoder2').output\n"," x = Dropout(0.2)(x) # adding 20% dropout\n"," h = Dense(int(num_hidden * 3), activation='relu', name='hidden1')(x)\n"," h = Dropout(0.5)(h) # adding 50% dropout\n"," h = Dense(int(num_hidden * 5), activation='relu', name='hidden2')(h)\n"," h = Dropout(0.5)(h) # adding 50% dropout\n"," y = Dense(num_out_neurons, activation='linear', name='prediction')(h)\n"," mlpRegressor = Model(inputs=autoencoder.inputs, outputs=y)\n","\n"," # Compile model (accuracy is not meaningful for regression, so MAE is tracked instead)\n"," mlpRegressor.compile(loss='mse', optimizer='adam', metrics=['mae']) # or loss='mae'\n"," # Fit the model\n"," print('training the MLP multi-output regressor')\n"," mlpRegressor.fit(x_train, y_train, epochs=50, batch_size=8)\n"," y_pred = mlpRegressor.predict(x_test)\n"," actual_mean = pd.DataFrame(y_test.mean(axis=0))\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"vEZjBeBd1KVK","colab_type":"text"},"source":["# Results"]},{"cell_type":"code","metadata":{"trusted":true,"id":"F8ekvk241KVM","colab_type":"code","colab":{}},"source":["print('MSE: (Actual Vs. Predicted)', mean_squared_error(y_test, y_pred))\n","print('r^2 value: (Mean of actual Vs. 
Mean of Predicted)', r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"YaOwP8Hs1KVU","colab_type":"text"},"source":["Plotting actual Vs. predicted CNVs"]},{"cell_type":"code","metadata":{"trusted":true,"id":"npFNBizz1KVV","colab_type":"code","colab":{}},"source":["act=actual_mean.values.flatten()\n","pred=pred_mean.values.flatten()\n","\n","s1 = pd.Series(act)\n","s2 = pd.Series(pred)\n","\n","plt.figure(figsize=(20,10))\n","ax = plt.subplot(111)\n","plt.title('Average of actual and predicted CNVs across all samples')\n","plt.xlabel('No. of features (genes)')\n","plt.ylabel('Average of CNVs across samples')\n","ax.plot(s1, 'b--', label='Actual')\n","ax.plot(s2, 'r--', label='Predicted')\n","ax.legend()\n","plt.grid(True)\n","plt.show()\n"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"x8P7_hQX1KVZ","colab_type":"text"},"source":["Plotting correlation scatter plot for mean of actual Vs. mean of predicted CNVs"]},{"cell_type":"code","metadata":{"trusted":true,"id":"oQosUCvP1KVc","colab_type":"code","colab":{}},"source":["plt.figure(figsize=(20,10))\n","plt.scatter(actual_mean, pred_mean)\n","plt.title('Correlation between mean of actual and mean of predicted CNVs across all samples')\n","plt.xlabel('Average of actual values of CNVs across samples')\n","plt.ylabel('Average of predicted values of CNVs across samples')\n","plt.grid(True)\n","plt.show()\n"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"PlMF1mzt1KVh","colab_type":"text"},"source":["# Comparing regression results with other standard methods"]},{"cell_type":"markdown","metadata":{"id":"PTZsfz7K1KVj","colab_type":"text"},"source":["Evaluating the mean of actual y values"]},{"cell_type":"code","metadata":{"trusted":true,"id":"doeAJ1Qv1KVl","colab_type":"code","colab":{}},"source":["actual_mean = 
pd.DataFrame(y_test.mean(axis=0))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1LrshVjR1KVt","colab_type":"text"},"source":["# 1. Linear Regression"]},{"cell_type":"markdown","metadata":{"id":"S6L9bGTX1KVu","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"sgdKvZuq1KVw","colab_type":"code","colab":{}},"source":["from sklearn.metrics import r2_score \n","from sklearn.metrics import mean_squared_error\n","from sklearn.linear_model import LinearRegression"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"qdOOAl6K1KV2","colab_type":"text"},"source":["Multi-output regression using Linear Regression (OLS) (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":true,"id":"NAlqajK31KV5","colab_type":"code","colab":{}},"source":["with tf.device('/gpu:0'):\n"," linear_Regr = LinearRegression(normalize=True)\n"," linear_Regr.fit(x_train, y_train)\n"," y_pred = linear_Regr.predict(x_test)\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n"," y_mse=mean_squared_error(y_test, y_pred)\n"," y_r2score=r2_score(actual_mean, pred_mean)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"i7aOibZk1KV_","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error (y_test Vs. y_pred): \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"XiQAkdOY1KWE","colab_type":"text"},"source":["# 2. 
Lasso"]},{"cell_type":"markdown","metadata":{"id":"lfoHiUTD1KWG","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"L0X4qs1W1KWK","colab_type":"code","colab":{}},"source":["from sklearn.metrics import r2_score \n","from sklearn.metrics import mean_squared_error\n","from sklearn.linear_model import Lasso"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"QKBXjHXS1KWN","colab_type":"text"},"source":["Multi-output regression using Lasso (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":true,"id":"rcjEMZNw1KWR","colab_type":"code","colab":{}},"source":["y_mse=[]\n","y_r2score=[]\n","with tf.device('/gpu:0'):\n"," for alp in [0.01,0.1,0.5,1,1.5]:\n"," #for alp in [0.001]:\n"," print('Working with alpha=',alp)\n"," Lasso_Regr = Lasso(alpha=alp)\n"," Lasso_Regr.fit(x_train, y_train)\n"," y_pred = Lasso_Regr.predict(x_test)\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n"," y_mse.append(mean_squared_error(y_test, y_pred))\n"," y_r2score.append(r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"JgKSXmFO1KWf","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error (y_test Vs. y_pred): \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"44ES6URt1KWl","colab_type":"text"},"source":["# 3. 
Ridge"]},{"cell_type":"markdown","metadata":{"id":"KdRZhf6E1KWn","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":false,"id":"UGv5FWKl1KWo","colab_type":"code","colab":{}},"source":["from sklearn.linear_model import Ridge"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"wDQ7Dwo01KWs","colab_type":"text"},"source":["Multi-output regression using Ridge (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":false,"id":"d0WKZRcM1KWu","colab_type":"code","colab":{}},"source":["y_mse=[]\n","y_r2score=[]\n","with tf.device('/gpu:0'):\n"," for alp in [0.01,0.1,0.5,1,1.5]:\n"," Ridge_Regr = Ridge(alpha=alp, normalize=True)\n"," Ridge_Regr.fit(x_train, y_train)\n"," y_pred = Ridge_Regr.predict(x_test)\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n"," y_mse.append(mean_squared_error(y_test, y_pred))\n"," y_r2score.append(r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"id":"HrGru9dd1KWy","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error (y_test Vs. y_pred): \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"V5ybPdbu1KW3","colab_type":"text"},"source":["# 4. 
PCA - Random Forest (PCA-RF)"]},{"cell_type":"markdown","metadata":{"id":"40adSHS01KW4","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"jFDisRpK1KW5","colab_type":"code","colab":{}},"source":["from sklearn.ensemble import RandomForestRegressor\n","from sklearn.decomposition import PCA"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"uTsE5_zz1KW-","colab_type":"code","colab":{}},"source":["n=200\n","pca = PCA(n_components=n)\n","pca.fit(x_train)\n","x_train = pca.transform(x_train)\n","x_test = pca.transform(x_test)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"E5zTXz2Y1KXB","colab_type":"text"},"source":["Multi-output regression using Random Forest (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":true,"id":"Y8c5Ze_Q1KXD","colab_type":"code","colab":{}},"source":["y_mse=[]\n","y_r2score=[]\n","with tf.device('/gpu:0'):\n"," for est in [10,50,100,150,200]:\n"," rf_Regr = RandomForestRegressor(n_estimators=est, n_jobs=-1)\n"," rf_Regr.fit(x_train, y_train)\n"," y_pred = rf_Regr.predict(x_test)\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n"," y_mse.append(mean_squared_error(y_test, y_pred))\n"," y_r2score.append(r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"0P-XM_A-1KXH","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error: \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"KdsphQq11KXL","colab_type":"text"},"source":["# 5. 
k-Nearest Neighbor (kNN)"]},{"cell_type":"markdown","metadata":{"id":"ZNBeKQxs1KXM","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":false,"id":"Tb3CcPgs1KXO","colab_type":"code","colab":{}},"source":["from sklearn.neighbors import KNeighborsRegressor"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Aos1QK7Y1KXQ","colab_type":"text"},"source":["Multi-output regression using kNN (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":false,"id":"Xw_lhX711KXS","colab_type":"code","colab":{}},"source":["y_mse=[]\n","y_r2score=[]\n","with tf.device('/gpu:0'):\n"," for k in [5,10,15,20,25]:\n"," knn_Regr = KNeighborsRegressor(n_neighbors=k, n_jobs=-1)\n"," knn_Regr.fit(x_train, y_train)\n"," y_pred = knn_Regr.predict(x_test)\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n"," y_mse.append(mean_squared_error(y_test, y_pred))\n"," y_r2score.append(r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"colab_type":"code","id":"j1EswA1rLP5_","colab":{}},"source":["print(\"Mean Squared Error: \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"8PRE5gcm1KXY","colab_type":"text"},"source":["# 6. 
PCA - Support Vector Regression (PCA-SVR)"]},{"cell_type":"markdown","metadata":{"id":"7ikI6Ykg1KXa","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"bAnKuDpC1KXb","colab_type":"code","colab":{}},"source":["from sklearn.svm import SVR\n","from sklearn.multioutput import MultiOutputRegressor\n","from sklearn.decomposition import PCA"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":true,"id":"Qz4nVUso1KXf","colab_type":"code","colab":{}},"source":["n=200\n","pca = PCA(n_components=n)\n","pca.fit(x_train)\n","x_train = pca.transform(x_train)\n","x_test = pca.transform(x_test)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"0KpX9Lkv1KXn","colab_type":"text"},"source":["Multi-output regression using SVR (sk-learn)"]},{"cell_type":"code","metadata":{"trusted":true,"id":"BM88x5Rr1KXp","colab_type":"code","colab":{}},"source":["with tf.device('/gpu:0'):\n"," y_mse=[]\n"," y_r2score = []\n"," for k in ['linear','poly','rbf','sigmoid']:\n"," print('kernel = ',k)\n"," mo_svr = MultiOutputRegressor(SVR(kernel=k,gamma='auto'))\n"," mo_svr.fit(x_train, y_train)\n"," y_pred = mo_svr.predict(x_test)\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))\n"," y_mse.append(mean_squared_error(y_test, y_pred))\n"," y_r2score.append(r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"id":"rhHTUwKh1KXu","colab_type":"code","colab":{}},"source":["print(\"Mean Squared Error: \", y_mse)\n","print(\"r2 Score (y_test_mean Vs. 
y_pred_mean): \", y_r2score)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"MMnVNr4b1KXx","colab_type":"text"},"source":["# Comparison of AE-MLP with PCA-MLP"]},{"cell_type":"markdown","metadata":{"id":"-YDvihGz1KXz","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":true,"id":"dfl4WGAe1KX0","colab_type":"code","colab":{}},"source":["from sklearn.decomposition import PCA"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"nK7X3r8a1KX5","colab_type":"text"},"source":["PCA-MLP"]},{"cell_type":"code","metadata":{"trusted":true,"id":"8whS3ow91KX7","colab_type":"code","colab":{}},"source":["n=200\n","with tf.device('/gpu:0'):\n"," #pca\n"," pca = PCA(n_components=n)\n"," pca.fit(x_train)\n"," x_train = pca.transform(x_train)\n"," x_test = pca.transform(x_test)\n"," \n"," # MLP multi-output regression on the PCA features\n"," # create regression model\n"," num = n\n"," input_data = Input(shape=(num,))\n"," x = Dropout(0.2)(input_data) # adding 20% dropout\n"," h = Dense(int(num * 3), activation='relu', name='hidden1')(x)\n"," h = Dropout(0.5)(h) # adding 50% dropout\n"," h = Dense(int(num * 5), activation='relu', name='hidden2')(h)\n"," h = Dropout(0.5)(h) # adding 50% dropout\n"," y = Dense(num_out_neurons, activation='linear', name='prediction')(h)\n"," mlpRegressor = Model(inputs=input_data, outputs=y)\n"," \n"," # Compile model (accuracy is not meaningful for regression, so MAE is tracked instead)\n"," mlpRegressor.compile(loss='mse', optimizer='adam', metrics=['mae']) # or loss='mae'\n"," # Fit the model\n"," print('training the MLP multi-output regressor')\n"," mlpRegressor.fit(x_train, y_train, epochs=50, batch_size=8)\n"," y_pred = mlpRegressor.predict(x_test)\n"," actual_mean = pd.DataFrame(y_test.mean(axis=0))\n"," pred_mean = pd.DataFrame(y_pred.mean(axis=0))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Z7NHrncA1KYB","colab_type":"text"},"source":["Printing 
results"]},{"cell_type":"code","metadata":{"trusted":true,"id":"5bvDjhAk1KYC","colab_type":"code","colab":{}},"source":["print('MSE: (Actual Vs. Predicted)', mean_squared_error(y_test, y_pred))\n","print('r^2 value: (Mean of actual Vs. Mean of Predicted)', r2_score(actual_mean, pred_mean))"],"execution_count":0,"outputs":[]}]} -------------------------------------------------------------------------------- /Pre-processing.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.8"},"colab":{"name":"Pre-processing_Ver_2.ipynb","provenance":[{"file_id":"1LHLqt70oFXTTnxJtZPYfCRBPqWPAO6T2","timestamp":1596702929681}],"collapsed_sections":[],"toc_visible":true},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","metadata":{"id":"UrM6jHdwy47-","colab_type":"text"},"source":["# Deep denoising auto-encoder and MLP based multi-output regression on TCGA multi-omics data\n","# Data Pre-processing"]},{"cell_type":"markdown","metadata":{"id":"hb-sT4V8y48J","colab_type":"text"},"source":["Note: You may skip this notebook if you already have the pre-processed data"]},{"cell_type":"markdown","metadata":{"id":"N2_z0jeay48M","colab_type":"text"},"source":["Importing libraries"]},{"cell_type":"code","metadata":{"trusted":false,"id":"_leYcysJy48P","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703444376,"user_tz":-330,"elapsed":1389,"user":{"displayName":"Dibyendu B. 
Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["import pandas as pd\n","import numpy as np"],"execution_count":1,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"id":"Dd1qV4aty48b","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703446015,"user_tz":-330,"elapsed":788,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["def intersection(list1, list2, list3):\n"," return list(set(list1) & set(list2) & set(list3))"],"execution_count":2,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"id":"V2ZeJlWZy48j","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703467007,"user_tz":-330,"elapsed":1287,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["def extractMatchedIndices(list1, list2):\n"," comm = []\n"," for i in list2:\n"," for j in list1:\n"," if i in j:\n"," comm.append(list1.index(j))\n"," break #added in Ver. 2 to remove duplicate samples\n"," return comm"],"execution_count":3,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"id":"RSWCxVCdy48r","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703467827,"user_tz":-330,"elapsed":1013,"user":{"displayName":"Dibyendu B. 
Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["def remrows(data):\n"," # keep rows (genes) with at least 80% non-missing values across the columns (samples)\n"," t = int(0.8*data.shape[1])\n"," data = data.dropna(thresh=t)\n"," #data = data[(data.T != 0).any()]\n"," return data"],"execution_count":4,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"id":"Vj8xAf3zy48x","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703472536,"user_tz":-330,"elapsed":1328,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["def remcolumns(data):\n"," # keep columns (samples) with at least 80% non-missing values across the rows (genes);\n"," # the threshold must therefore be based on the number of rows\n"," t = int(0.8*data.shape[0])\n"," data = data.dropna(thresh=t,axis=1)\n"," #data = data.loc[:, (data != 0).any(axis=0)]\n"," return data"],"execution_count":5,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"id":"njlD3x6ky483","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703476856,"user_tz":-330,"elapsed":1494,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["def tumor_normal_labels(barcodes):\n"," # characters 13:15 of a TCGA barcode hold the sample type code (01-09 = tumor)\n"," sample = [x[13:15] for x in barcodes]\n"," label = np.array([x in ['01','02','03','04','05','06','07','08','09'] for x in sample])\n"," label=1*label # converting boolean into int\n"," return label"],"execution_count":6,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"id":"qtITXlnpy489","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703476864,"user_tz":-330,"elapsed":858,"user":{"displayName":"Dibyendu B. 
Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["def myNormalize(data):\n"," (rows, cols) = data.shape\n"," mins = np.zeros(shape=(cols), dtype = np.float32)\n"," maxs = np.zeros(shape=(cols), dtype = np.float32)\n"," for j in range(cols):\n"," mins[j] = np.min(data[:,j])\n"," maxs[j] = np.max(data[:,j])\n"," \n"," result = np.copy(data)\n"," for i in range(rows):\n"," for j in range(cols):\n"," result[i,j] = (data[i,j] - mins[j]) / (maxs[j] - mins[j])\n"," return result"],"execution_count":7,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1y5bSW22y49E","colab_type":"text"},"source":["# Loading data"]},{"cell_type":"code","metadata":{"id":"cmJBB9flgeAU","colab_type":"code","colab":{"base_uri":"https://localhost:8080/","height":124},"executionInfo":{"status":"ok","timestamp":1596703501319,"user_tz":-330,"elapsed":23361,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}},"outputId":"572e8f7b-39ea-4c92-a090-0e72844fb006"},"source":["from google.colab import drive\n","drive.mount('/content/drive')"],"execution_count":8,"outputs":[{"output_type":"stream","text":["Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly\n","\n","Enter your authorization code:\n","··········\n","Mounted at 
/content/drive\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"of_LBPo4gsAu","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703505814,"user_tz":-330,"elapsed":1412,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["#ls \"/content/drive/My Drive\""],"execution_count":9,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"KkAP-qmsAGpf","colab_type":"text"},"source":["Change the file paths below to match your own setup"]},{"cell_type":"code","metadata":{"id":"2R8C7R7bhruf","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703519437,"user_tz":-330,"elapsed":14442,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["dnaMeth = pd.read_table('/content/drive/My Drive/TCGA Data/LIHC/LIHC_Methylation450__SingleValue__TSS1500__Both.txt',delimiter='\\t',index_col=0)\n","cna = pd.read_table('/content/drive/My Drive/TCGA Data/LIHC/LIHC__genome_wide_snp_6__GeneLevelCNA.txt',delimiter='\\t',index_col=0)\n","rnaSeq = pd.read_table('/content/drive/My Drive/TCGA Data/LIHC/LIHC_RNASeq__illuminahiseq_rnaseqv2__GeneExp.txt',delimiter='\\t',index_col=1) # Using Entrez ID as row identifier"],"execution_count":10,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"XOoX4JEyy49M","colab_type":"text"},"source":["Dropping redundant columns"]},{"cell_type":"code","metadata":{"trusted":false,"id":"aANBFACmy49O","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703521789,"user_tz":-330,"elapsed":1342,"user":{"displayName":"Dibyendu B. 
Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["dnaMeth = dnaMeth.drop(dnaMeth.columns[[0]], axis=1)\n","rnaSeq = rnaSeq.drop(rnaSeq.columns[[0]], axis=1)\n","cna = cna.drop(cna.columns[[0,1]], axis=1)"],"execution_count":11,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"4V99fvVhUjMs","colab_type":"text"},"source":["# Pre-processing"]},{"cell_type":"markdown","metadata":{"id":"ZoTL3vkfy49U","colab_type":"text"},"source":["Extracting sample names using TCGA barcode"]},{"cell_type":"code","metadata":{"trusted":false,"id":"r6TzRh3ly49W","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703525576,"user_tz":-330,"elapsed":1035,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["dnaMethSamples = list(dnaMeth)\n","rnaSamples = list(rnaSeq)\n","cnaSamples = list(cna)\n","methID = [x[8:16] for x in dnaMethSamples]\n","rnaID = [x[8:16] for x in rnaSamples]\n","cnaID = [x[8:16] for x in cnaSamples]"],"execution_count":12,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"_vTq7gEWy49f","colab_type":"text"},"source":["Removing duplicates"]},{"cell_type":"code","metadata":{"trusted":false,"id":"j8x8iSBDy49h","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703527348,"user_tz":-330,"elapsed":718,"user":{"displayName":"Dibyendu B. 
Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["methID=set(methID)\n","rnaID=set(rnaID)\n","cnaID=set(cnaID)"],"execution_count":13,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"9y8j7aoDy49m","colab_type":"text"},"source":["Reconverting into lists"]},{"cell_type":"code","metadata":{"trusted":false,"id":"RxUixMQ2y49o","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703529878,"user_tz":-330,"elapsed":1156,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["methID=list(methID)\n","rnaID=list(rnaID)\n","cnaID=list(cnaID)"],"execution_count":14,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"kxzms2pBy49u","colab_type":"text"},"source":["Finding out common samples"]},{"cell_type":"code","metadata":{"id":"bcmkmI6OstRa","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703533562,"user_tz":-330,"elapsed":1348,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["commonSamples = intersection(methID, rnaID, cnaID)\n","#commonMeth = extractMatchedIndices(methID,commonSamples)\n","#commonRNA = extractMatchedIndices(rnaID,commonSamples)\n","#commonCNA = extractMatchedIndices(cnaID,commonSamples)\n","\n","#changes made in Ver. 
2\n","commonMeth = extractMatchedIndices(list(dnaMeth),commonSamples)\n","commonRNA = extractMatchedIndices(list(rnaSeq),commonSamples)\n","commonCNA = extractMatchedIndices(list(cna),commonSamples)"],"execution_count":15,"outputs":[]},{"cell_type":"code","metadata":{"id":"qRfgeSmpx7mg","colab_type":"code","colab":{"base_uri":"https://localhost:8080/","height":34},"executionInfo":{"status":"ok","timestamp":1596703538584,"user_tz":-330,"elapsed":1043,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}},"outputId":"12c8aaa3-4fd5-4dbe-b6c3-670b03271483"},"source":["#print(len(commonSamples), len(commonMeth),len(commonRNA),len(commonCNA))"],"execution_count":16,"outputs":[{"output_type":"stream","text":["404 404 404 404\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"Flahz8pMy490","colab_type":"text"},"source":["Removing rows (genes) having more than 20% missing values across all samples (patients)"]},{"cell_type":"code","metadata":{"trusted":false,"id":"6iP7d2oYy492","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703542796,"user_tz":-330,"elapsed":1370,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["dnaMeth = remrows(dnaMeth)\n","rnaSeq = remrows(rnaSeq)\n","cna = remrows(cna)"],"execution_count":17,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"qjmI6R16y496","colab_type":"text"},"source":["Removing columns (samples) having more than 20% missing values across all rows (genes)"]},{"cell_type":"code","metadata":{"trusted":false,"id":"G5jmX3Pmy497","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703544827,"user_tz":-330,"elapsed":1387,"user":{"displayName":"Dibyendu B. 
Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["dnaMeth = remcolumns(dnaMeth)\n","rnaSeq = remcolumns(rnaSeq)\n","cna = remcolumns(cna)"],"execution_count":18,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"r7ovBPnVy4-A","colab_type":"text"},"source":["Reducing each omics data to common samples only"]},{"cell_type":"code","metadata":{"trusted":false,"id":"H5x0SPoHy4-B","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703546290,"user_tz":-330,"elapsed":893,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["dnaMeth = dnaMeth.iloc[:,commonMeth]\n","rnaSeq = rnaSeq.iloc[:,commonRNA]\n","cna = cna.iloc[:,commonCNA]"],"execution_count":19,"outputs":[]},{"cell_type":"code","metadata":{"id":"xzxRT25hiT80","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596703553779,"user_tz":-330,"elapsed":1309,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["#Validation\n","#df = pd.DataFrame(list(zip(*[commonSamples, list(dnaMeth),list(rnaSeq),list(cna)]))).add_prefix('Col')\n","#df.to_csv('/content/drive/My Drive/TCGA Data/LIHC/IDs_post_processing.csv', index=False)"],"execution_count":21,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"7AvbCr25y4-G","colab_type":"text"},"source":["Removing 1st quantile for rnaSeq"]},{"cell_type":"code","metadata":{"trusted":false,"id":"vL0ycXy1y4-I","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596704174054,"user_tz":-330,"elapsed":1217,"user":{"displayName":"Dibyendu B. 
Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["rnaSeq_rowsum = rnaSeq.sum(axis=1)\n","ind = pd.DataFrame(rnaSeq_rowsum > rnaSeq_rowsum.quantile(0.25))\n","rnaSeq = rnaSeq[ind.values]"],"execution_count":22,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1Hnr3rKky4-M","colab_type":"text"},"source":["Finding tumor and normal samples"]},{"cell_type":"code","metadata":{"trusted":false,"id":"0XEEHMSly4-N","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596704176840,"user_tz":-330,"elapsed":1400,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["labels = tumor_normal_labels(list(dnaMeth))"],"execution_count":23,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"IAyt-xYCy4-Q","colab_type":"text"},"source":["Imputing remaining missing values"]},{"cell_type":"code","metadata":{"id":"s1ImY6yzlFKt","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596704179390,"user_tz":-330,"elapsed":1324,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["from sklearn.impute import SimpleImputer"],"execution_count":24,"outputs":[]},{"cell_type":"code","metadata":{"id":"wfKrbg3Daoop","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596704181412,"user_tz":-330,"elapsed":1322,"user":{"displayName":"Dibyendu B. 
Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["imp = SimpleImputer(missing_values=np.NaN, strategy='mean', copy=True)\n","imputedDNAMeth = imp.fit_transform(dnaMeth)\n","imputedRNASeq = imp.fit_transform(rnaSeq)\n","imputedCNA = imp.fit_transform(cna)"],"execution_count":25,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"id":"pu_W1sUdy4-T","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596704184939,"user_tz":-330,"elapsed":1444,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["#imp = Imputer(missing_values='NaN', strategy='mean', axis=1, copy=True)\n","#imputedDNAMeth = imp.fit_transform(dnaMeth)\n","#imputedRNASeq = imp.fit_transform(rnaSeq)\n","#imputedCNA = imp.fit_transform(cna)"],"execution_count":26,"outputs":[]},{"cell_type":"code","metadata":{"trusted":false,"id":"rUiCY5wvy4-X","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596704184946,"user_tz":-330,"elapsed":1085,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["imputedDNAMeth=imputedDNAMeth.transpose()\n","imputedRNASeq=imputedRNASeq.transpose()\n","imputedCNA=imputedCNA.transpose()"],"execution_count":27,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"JMWVQovJy4-a","colab_type":"text"},"source":["Normalizing datasets using min-max normalization"]},{"cell_type":"code","metadata":{"trusted":false,"id":"t6lxChj8y4-b","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596704215966,"user_tz":-330,"elapsed":29860,"user":{"displayName":"Dibyendu B. 
Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["normalized_DNAMeth = myNormalize(imputedDNAMeth)\n","normalized_RNASeq = myNormalize(imputedRNASeq)\n","normalized_CNA = myNormalize(imputedCNA)"],"execution_count":28,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"v3BsYUqlAWJ8","colab_type":"text"},"source":["Final dimensions after pre-processing"]},{"cell_type":"code","metadata":{"id":"F-4Ihkgl_KV_","colab_type":"code","colab":{"base_uri":"https://localhost:8080/","height":34},"executionInfo":{"status":"ok","timestamp":1596704222003,"user_tz":-330,"elapsed":1830,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}},"outputId":"7765ec5e-8c2d-459c-9d1f-9857fa94dcfe"},"source":["print(dnaMeth.shape, rnaSeq.shape, cna.shape)"],"execution_count":29,"outputs":[{"output_type":"stream","text":["(18996, 404) (15397, 404) (23604, 404)\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"zUo-N3ImUr9a","colab_type":"text"},"source":["# Exporting data"]},{"cell_type":"markdown","metadata":{"id":"Y4Mlmy1Ly4-g","colab_type":"text"},"source":["Saving pre-processed files"]},{"cell_type":"code","metadata":{"trusted":false,"id":"3sibFUody4-h","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596704608399,"user_tz":-330,"elapsed":1280,"user":{"displayName":"Dibyendu B. 
Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["preprocessed_DNAMeth = pd.DataFrame(normalized_DNAMeth)\n","preprocessed_RNASeq = pd.DataFrame(normalized_RNASeq)\n","preprocessed_CNA = pd.DataFrame(normalized_CNA)\n","labels=pd.DataFrame(labels)"],"execution_count":31,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"lgVBld_Py4-k","colab_type":"text"},"source":["Exporting pre-processed data to csv files"]},{"cell_type":"code","metadata":{"trusted":true,"id":"DOKCZh_Jy4-n","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1596704658344,"user_tz":-330,"elapsed":48324,"user":{"displayName":"Dibyendu B. Seal","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GhPAguO61Uo4uNw9cuJjCQ-Nv8Nt-Q0CYP9qR7W=s64","userId":"09389361721160752492"}}},"source":["preprocessed_DNAMeth.to_csv('/content/drive/My Drive/TCGA Data/Preprocessed_Data/LIHC_preprocessed_DNAMeth.csv',index=False)\n","preprocessed_RNASeq.to_csv('/content/drive/My Drive/TCGA Data/Preprocessed_Data/LIHC_preprocessed_RNASeq.csv',index=False)\n","preprocessed_CNA.to_csv('/content/drive/My Drive/TCGA Data/Preprocessed_Data/LIHC_preprocessed_CNA.csv',index=False)\n","labels.to_csv('/content/drive/My Drive/TCGA Data/Preprocessed_Data/LIHC_labels.csv',index=False)"],"execution_count":32,"outputs":[]}]} -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # multiOmicsIntegration 2 | # Github Repository for multiOmics Integration project involving genomics, epigenomics and transcriptomics data 3 | The datasets used in the manuscript can be downloaded from Zenodo 4 | Seal, Dibyendu, Das, Vivek, Goswami, Saptarshi, & De, Rajat. (2020). Datasets used for Multi-omics integration using Deep Learning and other state-of-the-art regression models (Version v1) [Data set]. 
Zenodo. http://doi.org/10.5281/zenodo.3712496 5 | 6 | This GitHub repository currently contains 3 Python notebooks: 7 | 1. `Pre-processing.ipynb` 8 | 2. `DNAm_CNA_to_RNAseq.ipynb` 9 | 3. `DNAm_to_CNA.ipynb` 10 | 11 | These notebooks contain the code used for multi-omics integration with deep learning and state-of-the-art regression methods. 12 | --------------------------------------------------------------------------------