├── LICENSE
├── README.md
├── env
│   ├── epee_CPU.txt
│   └── epee_GPU.txt
├── script
│   ├── epee.py
│   └── run_epee.py
└── test
    └── tests.sh

/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2018, University of Texas Southwestern Medical Center. All rights reserved.
Contributors: Viren Amin, Murat Can Cobanoglu
Department: Lyda Hill Department of Bioinformatics

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# EPEE

Effectors and Perturbation Estimation Engine (EPEE) is a sparse linear model with graph-constrained lasso regularization for differential analysis of RNA-seq data. The inputs are transcriptomic data for the two conditions under comparison, and context-specific TF-gene networks. If the transcriptomic data is sequencing based, then it needs to be normalized to TPM, FPKM, or RPKM. EPEE is implemented in Python, using TensorFlow.

### Citation

Viren Amin, Didem Agac, Spencer D Barnes, Murat Can Cobanoglu, "Accurate differential analysis of transcription factor activity from gene expression", *Bioinformatics*, May 2019.

https://doi.org/10.1093/bioinformatics/btz398

### Inputs

- conditionA.txt and conditionB.txt

  EPEE requires an expression data matrix for each of the two conditions. Input is a tab-delimited file in which columns are the samples and rows are the genes. The first column of the file needs to contain the gene names.
  Please do not log-normalize the dataset before running EPEE. EPEE log-transforms the data to log(TPM/FPKM/RPKM + 1).

- networkA and networkB

  EPEE requires context-specific networks. Currently EPEE supports the 426 context-specific networks published by Marbach et al., Nature Methods, 2016.
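For illustration, here is the layout the two kinds of input files follow. The gene names, sample names, and values below are made up; only the structure matters (tab characters are shown as spaces). An expression matrix has a header row of sample names and gene names in the first column:

```
gene    sample1  sample2  sample3
GATA3   12.5     10.1     11.7
STAT6   8.3      7.9      9.2
```

A network file is a headerless, tab-delimited, three-column table of TF, target gene, and regulatory score, as parsed by `_network_df` in `script/epee.py`:

```
GATA3   IL4      0.85
STAT6   GATA3    0.42
```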
### Setup

1. Install [Anaconda](https://www.anaconda.com/download)

2. Download the [networks](http://regulatorycircuits.org/download.html) and example [data](https://github.com/Cobanoglu-Lab/EPEE/tree/master/test/data) to run EPEE.

3. Clone the git repository and set up the conda environment to run EPEE. We provide the environment files in the `env` directory. If your machine has a GPU card, then we recommend that you use the `epee_GPU.txt` file to create the environment; otherwise create the environment using `epee_CPU.txt`. These files are explicit package lists, so they go to `conda create --file` (as stated in their headers):
```
conda create -n epee --file epee_CPU.txt
```
Activate the new environment. Windows: `activate epee`; macOS and Linux: `source activate epee`

4. View the available human networks, and determine the network appropriate for your context.

5. Usage to run EPEE (the angle-bracketed values are placeholders for your own paths):

```
python run_epee.py -a <conditionA.txt>
                   -b <conditionB.txt>
                   -na <networkA.txt>
                   -nb <networkB.txt>
                   -o <output_directory>
```

### Example

##### CD4 Naive vs Th2 differential analysis

```
python run_epee.py -a ../data/rnaseq/immune/CD4_Naive.txt.gz \
                   -b ../data/rnaseq/immune/CD4_Th2.txt.gz \
                   -na ../data/network/cd4+_t_cells.txt.gz \
                   -nb ../data/network/cd4+_t_cells.txt.gz \
                   -o /path/to/output_directory/ \
                   --prefix Th2
```

##### Normal Colon vs Colorectal Adenocarcinoma (COAD) differential analysis

```
python run_epee.py -a ../data/rnaseq/tcga/TCGA_COAD_SolidTissueNormal_FPKM_UQ.txt \
                   -b ../data/rnaseq/tcga/TCGA_COAD_PrimaryTumor_FPKM_UQ.txt \
                   -na ../data/network/20_gastrointestinal_system.txt.gz \
                   -nb ../data/network/20_gastrointestinal_system.txt.gz \
                   -o /path/to/output_directory/ \
                   --prefix COAD
```
--------------------------------------------------------------------------------
/env/epee_CPU.txt:
--------------------------------------------------------------------------------
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
https://repo.continuum.io/pkgs/main/linux-64/anaconda-custom-py36hbbc8b67_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/bleach-1.5.0-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2017.11.5-0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/certifi-2017.11.5-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/html5lib-0.9999999-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/intel-openmp-2018.0.0-hc7b2577_8.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libgcc-ng-7.2.0-h7cc24e2_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libgfortran-ng-7.2.0-h9f7466a_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libprotobuf-3.4.1-h5b8497f_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libstdcxx-ng-7.2.0-h7a57d05_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/markdown-2.6.9-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/mkl-2018.0.1-h19d6760_4.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ncurses-5.9-10.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/numpy-1.14.0-py36h3dfced4_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/openssl-1.0.2n-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/pandas-0.22.0-py36hf484d3e_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/patsy-0.5.0-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/pip-9.0.1-py36_1.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/protobuf-3.4.1-py36h306e679_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/python-3.6.4-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/python-dateutil-2.6.1-py36h88d3b88_1.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/pytz-2017.3-py36h63b9c63_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/readline-7.0-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/scikit-learn-0.19.1-py36h7aa7ec6_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/scipy-1.0.0-py36hbf646e7_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/setuptools-38.4.0-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/six-1.11.0-py36h372c433_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/sqlite-3.20.1-2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/statsmodels-0.8.0-py36h8533d0b_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-1.4.1-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-base-1.4.1-py36hd00c003_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-tensorboard-0.1.5-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/tk-8.6.7-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/werkzeug-0.14.1-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/wheel-0.30.0-py36_2.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/xz-5.2.3-0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/zlib-1.2.11-0.tar.bz2
--------------------------------------------------------------------------------
/env/epee_GPU.txt:
--------------------------------------------------------------------------------
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
https://repo.continuum.io/pkgs/main/linux-64/anaconda-custom-py36hbbc8b67_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/bleach-1.5.0-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2017.11.5-0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/certifi-2017.11.5-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/cudatoolkit-8.0-3.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/cudnn-7.0.5-cuda8.0_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/html5lib-0.9999999-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/intel-openmp-2018.0.0-hc7b2577_8.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libgcc-ng-7.2.0-h7cc24e2_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libgfortran-ng-7.2.0-h9f7466a_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libprotobuf-3.4.1-h5b8497f_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libstdcxx-ng-7.2.0-h7a57d05_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/markdown-2.6.9-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/mkl-2018.0.1-h19d6760_4.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ncurses-5.9-10.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/numpy-1.14.0-py36h3dfced4_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/openssl-1.0.2n-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/pandas-0.22.0-py36hf484d3e_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/patsy-0.5.0-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/pip-9.0.1-py36_1.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/protobuf-3.4.1-py36h306e679_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/python-3.6.4-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/python-dateutil-2.6.1-py36h88d3b88_1.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/pytz-2017.3-py36h63b9c63_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/readline-7.0-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/scikit-learn-0.19.1-py36h7aa7ec6_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/scipy-1.0.0-py36hbf646e7_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/setuptools-38.4.0-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/six-1.11.0-py36h372c433_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/sqlite-3.20.1-2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/statsmodels-0.8.0-py36h8533d0b_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-1.4.1-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-base-1.4.1-py36hd00c003_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-gpu-1.4.1-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-gpu-base-1.4.1-py36h01caf0a_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-tensorboard-0.1.5-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/tk-8.6.7-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/werkzeug-0.14.1-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/wheel-0.30.0-py36_2.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/xz-5.2.3-0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/zlib-1.2.11-0.tar.bz2
--------------------------------------------------------------------------------
/script/epee.py:
--------------------------------------------------------------------------------
# Copyright 2018, Viren Amin, Murat Can Cobanoglu
# Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
# 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
# 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import pandas as pd
import tensorflow as tf
import numpy as np
import logging
import sklearn.preprocessing as pp
import glob
import statsmodels
import statsmodels.api as sm


def _normalizeweight(df, metric):
    """To normalize the TF-gene regulatory network weights.

    Parameters
    ----------
    df
        Pandas dataframe containing the network. Tab-delimited file where the
        first column contains the TF name, the second column contains the
        regulated gene name, and the third column contains the TF-gene
        regulatory score.
    metric
        metric used to normalize the TF-gene regulatory score. 'minmax':
        log-transforms the regulatory scores and then minmax-scales them
        between 1 and 2. 'log': performs natural logarithm transformation of
        the regulatory scores. 'log10': performs base 10 logarithmic
        transformation of the regulatory score. 'no': performs no
        normalization of the regulatory score.

    Returns
    -------
    df
        Pandas dataframe containing the normalized weight

    """
    if metric == 'minmax':
        weights = np.log(df['Score']).values.reshape(-1, 1)
        norm_score = pp.MinMaxScaler(feature_range=(1, 2)).fit_transform(
            weights)
    if metric == 'log':
        norm_score = np.log(df['Score']+1)
    if metric == 'log10':
        norm_score = np.log10(df['Score']+1)
    if metric == 'no':
        norm_score = df['Score']
    df['Score'] = norm_score
    return df
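
# Worked example of the default 'minmax' metric (hypothetical scores): raw
# scores [1, e, e**2] log-transform to [0, 1, 2], which then minmax-scale to
# the final weights [1.0, 1.5, 2.0].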


def _network_df(filename, Y, conditioning, weightNormalize='minmax'):
    """To take network filename and generate pandas dataframe object.

    Function takes the file path of the TF-gene regulatory network data and
    returns the pandas dataframe object containing rows as genes and columns
    as TFs. Each value in the table represents the putative regulation score
    of the TF regulating the gene.

    Parameters
    ----------
    filename
        Path to the TF-gene regulatory network data.
    Y
        Pandas dataframe containing expression data.
    conditioning
        Boolean. If conditioning is True, then TF-gene pairs with no putative
        regulation are given a small uniform score: 1/10 of the minimum
        TF-gene regulatory score.
    weightNormalize
        metric used to normalize the TF-gene regulatory score. 'minmax':
        performs the minmax scaling of regulatory scores between 1 and 2.
        'log': performs natural logarithmic transformation of the regulatory
        scores. 'log10': performs base 10 logarithmic transformation of the
        regulatory score. 'no': performs no normalization of the regulatory
        score.

    Returns
    -------
    S
        Pandas dataframe containing the network. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.
    min_val
        value used for the conditioning

    """
    S = pd.read_csv(filename, sep='\t', header=None)
    S.columns = ['TF', 'Target', 'Score']
    S = _normalizeweight(S, weightNormalize)
    if conditioning:
        min_val = np.min(S['Score'])/10
    else:
        min_val = 0
    S = pd.pivot_table(S, values='Score', index='Target', columns='TF',
                       fill_value=min_val)
    regmask = [tf in Y.index for tf in S.columns]

    S = S.ix[Y.index, regmask].fillna(min_val)
    S = S.astype(np.float32)
    return (S, min_val)


def _exp_df(filename):
    """To take expression data filename and generate pandas dataframe object.

    Function takes the file path of the expression data and returns the
    log(normalized expression data + 1).

    Parameters
    ----------
    filename
        Path to the expression data.

    Returns
    -------
    Y
        Pandas dataframe containing log(normalized expression data + 1)

    """
    Y = pd.read_csv(filename, sep='\t', index_col=0)
    Y = np.log(Y + 1)
    Y = Y.ix[Y.index.drop_duplicates(keep=False), :]
    Y = Y.astype(np.float32)
    return Y


def _eval_indices(Y1, Y2, S1, S2):
    """To evaluate whether the indices are the same.

    Evaluate whether the indices are the same for both conditions' expression
    data and the context-specific network dataframes.

    Parameters
    ----------
    Y1
        Pandas dataframe containing condition 1 expression data
    Y2
        Pandas dataframe containing condition 2 expression data
    S1
        Pandas dataframe containing network 1. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.
    S2
        Pandas dataframe containing network 2. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.

    Returns
    -------
    None

    """
    if all(Y1.index == Y2.index):
        logging.debug('Y index are equal')
    else:
        raise RuntimeError('Y1 Y2 index are not equal')

    if all(S1.index == S2.index):
        logging.debug('S index are equal')
    else:
        raise RuntimeError('S1 S2 index are not equal')

    if all(S1.index == Y1.index):
        logging.debug('S and Y are equal')
    else:
        raise RuntimeError('S Y index are not equal')


def _get_min_samples(Y1, Y2, X1, X2, min_samples, seed=0):
    """Select min samples among conditions.

    Function takes the expression dataframes of the two conditions and
    shuffles the sample labels. If the samples are not evenly distributed
    between the conditions, the output contains the minimum number of
    samples for each of the two conditions.

    Parameters
    ----------
    Y1
        Pandas dataframe containing condition 1 expression data
    Y2
        Pandas dataframe containing condition 2 expression data
    X1
        Pandas dataframe containing TF expression across condition 1 samples
    X2
        Pandas dataframe containing TF expression across condition 2 samples
    min_samples
        Minimum number of samples between condition 1 and condition 2
        expression data
    seed
        Seed for random sampling. Default = 0.

    Returns
    -------
    mY1
        Pandas dataframe containing subsampled expression data for
        condition 1
    mY2
        Pandas dataframe containing subsampled expression data for
        condition 2
    mX1
        Pandas dataframe containing subsampled TF expression data
        for condition 1
    mX2
        Pandas dataframe containing subsampled TF expression data
        for condition 2

    """
    Y1_samples = list(Y1.columns)
    Y2_samples = list(Y2.columns)
    np.random.shuffle(Y1_samples)
    np.random.shuffle(Y2_samples)
    mY1 = Y1.loc[:, Y1_samples[:min_samples]]
    mY2 = Y2.loc[:, Y2_samples[:min_samples]]
    mX1 = X1.loc[:, Y1_samples[:min_samples]]
    mX2 = X2.loc[:, Y2_samples[:min_samples]]
    return (mY1, mY2, mX1, mX2)


def _shuffled_inputs(Y1, Y2, X1, X2, seed=0, shuffleGenes=False):
    """Shuffle the labels of the input samples.

    Function takes the expression dataframes of the two conditions and
    shuffles the labels.

    Parameters
    ----------
    Y1
        Pandas dataframe containing condition 1 expression data
    Y2
        Pandas dataframe containing condition 2 expression data
    X1
        Pandas dataframe containing TF expression across condition 1 samples
    X2
        Pandas dataframe containing TF expression across condition 2 samples
    seed
        Seed for random sampling. Default = 0.
    shuffleGenes
        Boolean flag for shuffling genes instead of labels. Default = False

    Returns
    -------
    sY1
        Pandas dataframe containing shuffled-label expression data for
        condition 1
    sY2
        Pandas dataframe containing shuffled-label expression data for
        condition 2
    sX1
        Pandas dataframe containing shuffled-label TF expression data
        for condition 1
    sX2
        Pandas dataframe containing shuffled-label TF expression data
        for condition 2

    """
    np.random.seed(seed)

    Y1_2 = pd.merge(Y1, Y2, left_index=True, right_index=True, how='inner')
    X1_2 = pd.merge(X1, X2, left_index=True, right_index=True, how='inner')

    n_cols = Y1.shape[1]
    if not shuffleGenes:
        # We will shuffle labels
        samples = list(Y1_2.columns)
        np.random.shuffle(samples)
        sY1 = Y1_2.loc[:, samples[:n_cols]]
        sY2 = Y1_2.loc[:, samples[n_cols:]]
        sX1 = X1_2.loc[:, samples[:n_cols]]
        sX2 = X1_2.loc[:, samples[n_cols:]]
    else:
        # We will shuffle genes
        genes = list(Y1_2.index)
        np.random.shuffle(genes)
        sY1 = Y1_2.loc[genes, Y1.columns]
        sY2 = Y1_2.loc[genes, Y2.columns]
        for df in [sY1, sY2]:
            df.index = Y1_2.index
        tfs = list(X1_2.index)
        sX1 = sY1.loc[tfs, X1.columns]
        sX2 = sY2.loc[tfs, X2.columns]
        for df in [sX1, sX2]:
            df.index = tfs

    return (sY1, sY2, sX1, sX2)


def get_epee_inputs(c1, c2, n1, n2, conditioning=True, weightNormalize='minmax',
                    null=False, shuffleGenes=False, seed=0):
    """To generate inputs for EPEE.

    Function takes the network and expression data filenames and generates
    inputs for EPEE to run.

    Parameters
    ----------
    c1
        filename containing condition 1 expression data
    c2
        filename containing condition 2 expression data
    n1
        filename containing network 1
    n2
        filename containing network 2
    conditioning
        whether to give a small weight value to non-putative TF-gene
        interactions
    weightNormalize
        Three weight normalization metrics are implemented. 'minmax': log
        normalize the weights and scale them between 1 and 2. 'log': log
        normalize weight+1. 'log10': use log base 10 to normalize weight+1.
    null
        If flag is set, then samples in condition1 and condition2 are shuffled
    shuffleGenes
        If flag is set, we run null tests by shuffling the genes instead of
        the labels
    seed
        Seed for random sampling

    Returns
    -------
    Y1
        Pandas dataframe containing condition 1 expression data
    Y2
        Pandas dataframe containing condition 2 expression data
    X1
        Pandas dataframe containing TF expression across condition 1 samples
    X2
        Pandas dataframe containing TF expression across condition 2 samples
    S1
        Pandas dataframe containing network 1. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.
    S2
        Pandas dataframe containing network 2. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.
    conditioning_val
        Soft-thresholding value used for non-putative TF-gene interactions

    """
    Y1 = _exp_df(c1)
    Y2 = _exp_df(c2)

    S1, conditioning_val = _network_df(n1, Y1, conditioning,
                                       weightNormalize=weightNormalize)
    S2, _ = _network_df(n2, Y2, conditioning, weightNormalize=weightNormalize)

    if n1 != n2:
        genes = list(set(np.concatenate([S1.index, S2.index])))
        tfs = list(set(np.concatenate([S1.columns, S2.columns])))
        S1 = S1.ix[genes, tfs].fillna(conditioning_val)
        S2 = S2.ix[genes, tfs].fillna(conditioning_val)
        Y1 = Y1.ix[S1.index, :].fillna(0)
        Y2 = Y2.ix[S2.index, :].fillna(0)
    else:
        Y1 = Y1.ix[S1.index, :].fillna(0)
        Y2 = Y2.ix[S2.index, :].fillna(0)

    X1 = Y1.ix[S1.columns, :]
    X2 = Y2.ix[S2.columns, :]
    _eval_indices(Y1, Y2, S1, S2)
    if null:
        Y1, Y2, X1, X2 = _shuffled_inputs(Y1, Y2, X1, X2,
                                          seed=seed, shuffleGenes=shuffleGenes)
    return (Y1, Y2, X1, X2, S1, S2, conditioning_val)
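
# Example usage (hypothetical file paths) -- assembling the model inputs for
# two conditions that share one context-specific network:
#
#     Y1, Y2, X1, X2, S1, S2, cval = get_epee_inputs(
#         'CD4_Naive.txt.gz', 'CD4_Th2.txt.gz',
#         'cd4+_t_cells.txt.gz', 'cd4+_t_cells.txt.gz')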


def _gcp(x, r):
    """To calculate the graph-constrained penalty term.

    Function that tensorflow uses to fold through each column of the matrix.
    It goes through each TF column vector, finds the target genes, creates a
    weights vector of the target genes, calculates the pairwise differences
    of the weights, performs a tanh transformation of the differences, and
    then sums the differences.

    Parameters
    ----------
    x
        the accumulated penalty from the previous fold step
    r
        masked weights of one TF across all genes

    Returns
    -------
    float
        Value with the sum of the weight differences between the target genes

    """
    ind = tf.where(tf.abs(r) > 0)
    vec = tf.expand_dims(tf.gather_nd(r, ind), 0)
    pc = tf.tanh(tf.transpose(vec)-vec)
    val = tf.divide(tf.reduce_sum(tf.abs(pc)), 2)
    return x+val
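
# For intuition, a plain-NumPy re-implementation of the penalty that the fold
# above accumulates. This sketch is illustrative only (`_gcp_numpy` is not
# part of EPEE) and assumes `weights_by_tf` is an iterable of the per-TF
# masked weight vectors that `tf.foldl` feeds to `_gcp`:
#
#     def _gcp_numpy(weights_by_tf):
#         total = 0.0
#         for r in weights_by_tf:
#             vec = r[np.abs(r) > 0]                     # this TF's targets
#             pc = np.tanh(vec[:, None] - vec[None, :])  # pairwise diffs
#             total += np.abs(pc).sum() / 2              # count each pair once
#         return total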


def run_model(Y, X, S, step, itr, log_itr, seed,
              l_reg=1e-4, g_reg=1e-4, stopthreshold=0.01, val=0,
              model='epee-gcl'):
    """To run the sparse linear model.

    There are two sparse linear models implemented: lasso and
    graph-constrained lasso. By default, the method runs
    graph-constrained lasso.

    Parameters
    ----------
    Y
        Pandas dataframe containing expression data. Rows are genes and
        columns are samples. Values are log(RPKM/TPM/FPKM + 1)
    X
        Pandas dataframe containing expression data of TFs. Rows are TFs and
        columns are samples. Values are log(RPKM/TPM/FPKM + 1)
    S
        Pandas dataframe containing the network. Rows are genes and columns
        are TFs. Values are weights corresponding to the TF regulating a gene
    step
        learning rate for the optimizer
    itr
        maximum number of training iterations
    log_itr
        Iterations at which to log the loss and percent change
    seed
        Setting the tensorflow random seed
    l_reg
        lasso regularization constant
    g_reg
        graph-constrained regularization constant
    stopthreshold
        intended threshold for stopping training based on the loss change
        between the previous and the current logged iteration
    val
        weight given to the unknown TF-gene pairs
    model
        model to use for regulator and perturbed gene inference score

    Returns
    -------
    curr_y
        Inferred Y
    curr_w
        Inferred W
    loss_arr
        Loss per each iteration

    """
    tf.set_random_seed(seed)
    genes, samples = Y.shape
    regulators = X.shape[0]
    S_h = np.copy(S)
    S_h = np.float32(S_h > val)

    with tf.Graph().as_default():
        w = tf.Variable(
            tf.random_gamma([genes, regulators], alpha=1, beta=30,
                            dtype=tf.float32, seed=seed))

        if model == 'no-penalty':
            # least squares loss without any regularization
            y = tf.matmul(tf.multiply(w, S), X)

            # loss
            loss = tf.reduce_mean(tf.square(Y - y))

        if model == 'epee-l':
            # least squares loss along with L1 regularization
            y = tf.matmul(tf.multiply(w, S), X)

            # loss
            loss = tf.reduce_mean(
                tf.square(Y - y))+tf.multiply(tf.reduce_sum(tf.abs(w)), l_reg)

        if model == 'epee-gcl':
            y = tf.matmul(tf.multiply(w, S), X)
            wa = tf.transpose(
                tf.multiply(tf.expand_dims(w, 1), tf.expand_dims(S_h, 1)))

            gc = tf.foldl(
                _gcp, wa, initializer=tf.constant(0, dtype=tf.float32),
                parallel_iterations=10, back_prop=False, swap_memory=True)

            loss = tf.reduce_mean(
                tf.square(Y-y)) + tf.multiply(
                    tf.reduce_sum(tf.abs(w)), l_reg) + tf.multiply(gc, g_reg)

        # optimizer
        optimizer = tf.train.AdamOptimizer(learning_rate=step, epsilon=1e-8)

        train = optimizer.minimize(loss)

        # training loop
        init = tf.global_variables_initializer()  # before starting, init vars
        sess = tf.Session()  # launch the graph
        sess.run(init)  # initialize the variables

        loss_arr = []
        # outarr = []
        for s in range(itr):
            sess.run(train)
            if s % 100 == 0:
                curr_loss = sess.run(loss)
                if np.isnan(curr_loss):
                    raise RuntimeError(
                        'NaN value was computed for the loss. Make sure that '
                        'the inputs are RPKM-, TPM-, or FPKM-normalized '
                        'without any transformation (i.e. log). You can also '
                        'try lowering the learning rate.')
                loss_arr.append(curr_loss)
                if len(loss_arr) >= 2:
                    delta = loss_arr[-2] - loss_arr[-1]

                    logging.debug((s, curr_loss, delta))
                    if delta < 1:
                        logging.debug("TRAINING FINISHED")
                        break
                else:
                    logging.debug((s, curr_loss))

        curr_w, curr_y, curr_loss = sess.run([w, y, loss])
    return (curr_y, curr_w, loss_arr)
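
# Example call (hypothetical settings) -- one run of the default 'epee-gcl'
# model on condition-1 data, mirroring how `run_epee.py` invokes it with its
# default learning rate, iteration count, and regularization constants:
#
#     y1, w1, losses = run_model(np.array(Y1), np.array(X1), np.array(S1),
#                                step=1e-4, itr=100000, log_itr=5000, seed=0,
#                                l_reg=0.01, g_reg=0.01)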


def get_perturb_scores(Y1, y1, X1, w1, Y2, y2, X2, w2, S1, S2):
    """To get perturb scores.

    The function calculates the log-likelihood ratio of the error obtained by
    swapping the weights between conditions to the error within the same
    condition. The function returns a sorted dataframe containing the
    perturbation scores.

    Parameters
    ----------
    Y1
        Pandas dataframe containing condition 1 expression data
    y1
        Pandas dataframe containing inferred condition 1 expression data
    X1
        Pandas dataframe containing TF expression across condition 1 samples
    w1
        Pandas dataframe containing inferred TF-gene weights for condition 1
    Y2
        Pandas dataframe containing condition 2 expression data
    y2
        Pandas dataframe containing inferred condition 2 expression data
    X2
        Pandas dataframe containing TF expression across condition 2 samples
    w2
        Pandas dataframe containing inferred TF-gene weights for condition 2
    S1
        Pandas dataframe containing network 1. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.
    S2
        Pandas dataframe containing network 2. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.

    Returns
    -------
    scores_df_sorted
        Pandas dataframe containing perturb scores

    """
    err11 = np.square(Y1-y1).sum(axis=1)
    err22 = np.square(Y2-y2).sum(axis=1)
    err12 = np.square(Y1-np.dot(np.multiply(w2, S2), X1)).sum(axis=1)
    err21 = np.square(Y2-np.dot(np.multiply(w1, S1), X2)).sum(axis=1)

    base = err11+err22
    err = err21+err12
    scores = err/base

    # sort_scores_idx = sorted(range(len(scores)), key=lambda k: scores[k])
    scores_df = pd.DataFrame({'gene': scores.index, 'score': scores})
    scores_df_sorted = scores_df.sort_values(by='score', ascending=False)
    return scores_df_sorted


def get_summary_scoresdf(df, metric='sum'):
    """To calculate scores from multiple models.

    Each independent model generates perturb and regulator scores. The
    function calculates a summary score for each perturbed gene and assigns
    that score to the gene. The function returns a dataframe with the
    summary scores sorted.

    Parameters
    ----------
    df
        Pandas dataframe containing scores from multiple models
    metric
        Metric used to summarize the scores from multiple models.
        'sum', 'mean' and 'median' are valid options. Default = 'sum'

    Returns
    -------
    out_df_sort
        Pandas dataframe containing the summarized scores

    """
    if metric == 'median':
        df_score = df.iloc[:, 1:].median(axis=1)
    if metric == 'sum':
        df_score = df.iloc[:, 1:].sum(axis=1)
    if metric == 'mean':
        df_score = df.iloc[:, 1:].mean(axis=1)
    out_df = pd.DataFrame({'gene': df['gene'], 'score': df_score})
    out_df_sort = out_df.sort_values(by='score', ascending=False)
    out_df_sort.reset_index(inplace=True, drop=True)
    return out_df_sort


def get_weights_df(w, genes, tfs):
    """To name the rows and columns of the weight numpy ndarray object.

    Function converts the numpy ndarray object to a Pandas dataframe with
    labeled rows and columns.

    Parameters
    ----------
    w
        numpy ndarray containing the inferred TF-gene weights
    genes
        row names for w
    tfs
        column names for w

    Returns
    -------
    df
        Pandas dataframe containing TF-gene inferred weights W

    """
    df = pd.DataFrame(w)
    df.columns = tfs
    df.index = genes
    return df


def get_regulator_scores(perturb_genes, W):
    """To provide regulator scores given weights and perturbed genes.

    Function calculates the regulator score given a list of perturbed genes
    and the inferred W.

    Parameters
    ----------
    perturb_genes
        list of perturbed genes
    W
        Pandas dataframe containing TF-gene inferred weights W

    Returns
    -------
    sort_scores_df
        Pandas dataframe containing the regulator scores

    """
    # perturbed genes
    W_perturb = W.iloc[W.index.isin(perturb_genes), ]
    # not perturbed genes
    W_notperturb = W.iloc[~W.index.isin(perturb_genes), ]
    reg_score_outarr = []
    for reg in W.columns:
        num_score = np.sum(W_perturb[reg])
        not_reg_df = W_perturb.iloc[:, ~W_perturb.columns.isin([reg])]
        dem_score1 = not_reg_df.sum().sum()
        dem_score2 = np.sum(W_notperturb[reg])
        if dem_score1 == 0:
            dem_score1 = 1e-6
        if num_score == 0:
            score = 0
        else:
            score = (num_score/np.sqrt(dem_score1*dem_score2))
        reg_score_outarr.append(score)

    score_df = pd.DataFrame({'id': W.columns, 'score': reg_score_outarr})
    sort_scores_df = score_df.sort_values(by='score', ascending=False)
    sort_scores_df.reset_index(inplace=True, drop=True)
    return sort_scores_df


def get_diff_regulatory_activity(perturbed_genes, w1, w2, top_regs=10):
    """To provide differential regulator scores between conditions.

    Function calculates the differential regulator score given a list of
    perturbed genes and the inferred W from the two conditions.

    Parameters
    ----------
    perturbed_genes
        list of perturbed genes
    w1
        Pandas dataframe containing TF-gene inferred weights W from
        condition 1
    w2
        Pandas dataframe containing TF-gene inferred weights W from
        condition 2
    top_regs
        Number of top differential regulators to select

    Returns
    -------
    reg_score
        Pandas dataframe containing the differential regulator scores
    diff_reg
        Pandas dataframe containing the top differential regulators

    """
    reg_score_w1 = get_regulator_scores(perturbed_genes, np.abs(w1))
    reg_score_w2 = get_regulator_scores(perturbed_genes, np.abs(w2))
    merge_reg_score = pd.merge(reg_score_w1, reg_score_w2, on='id')
    merge_reg_score.columns = ['id', 'w1', 'w2']
    merge_reg_score['w2-1'] = merge_reg_score['w2']-merge_reg_score['w1']
    diff_sorted_reg_score = merge_reg_score.sort_values(
        by='w2-1', ascending=False)
    diff_sorted_reg_score.reset_index(drop=True, inplace=True)
    reg_score = diff_sorted_reg_score[['id', 'w2-1']]
    diff_reg = pd.concat([reg_score.head(top_regs), reg_score.tail(top_regs)])
    diff_reg.reset_index(drop=True, inplace=True)
    return (reg_score, diff_reg)


def get_significant_scores(df, nullvalues, two_sided=False):
    """To calculate score significance from the empirical null CDF.

    Function generates the empirical null CDF from the null values and
    calculates the significance of the scores from the empirical CDF.
    P-values are corrected for multiple hypothesis testing with the
    Benjamini-Hochberg procedure.

    Parameters
    ----------
    df
        Pandas dataframe containing the perturb or regulator scores
    nullvalues
        list of null values
    two_sided
        Boolean. Whether to use a two-sided or one-sided test.
        Default = False.

    Returns
    -------
    df
        Pandas dataframe containing the p-values

    """
    emcdf = sm.distributions.ECDF(nullvalues)
    median = np.median(nullvalues)
    pvals = []
    for score in df['score']:
        if two_sided:
            if score > median:
                pval = 1-emcdf(score)
            else:
                pval = emcdf(score)
        else:
            pval = 1-emcdf(score)
        pvals.append(pval)
    df['pvals'] = pvals
    df['fdr_bh'] = statsmodels.stats.multitest.multipletests(
        pvals, alpha=0.25, method='fdr_bh', returnsorted=False)[1]
    return df


def get_null_scores(dir):
    """To get null perturb scores.

    The function generates the null perturb score list.

    Parameters
    ----------
    dir
        Directory containing the perturbation scores from shuffled labels

    Returns
    -------
    null_scores_list
        list of null perturb scores

    """
    files = glob.glob('{}/pert*'.format(dir))
    null_scores = pd.DataFrame()
    for file in files:
        df = pd.read_csv(file, sep='\t', index_col=0)
        if null_scores.shape[0] == 0:
            null_scores = df
        else:
            null_scores = pd.merge(null_scores, df, on='gene')
    null_scores_list = null_scores.set_index('gene').values.flatten()
    return null_scores_list
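
# Example workflow (hypothetical paths) -- attaching empirical significance
# to true-label perturb scores, using null runs produced by `run_epee.py`
# with the `-null` flag:
#
#     null_vals = get_null_scores('output/null')
#     perturb_df = pd.read_csv('output/scores/perturb_scores.txt',
#                              sep='\t', index_col=0)
#     perturb_df = get_significant_scores(perturb_df, null_vals)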


def get_null_regscores(dir, perturbed_genes):
    """To get null regulator scores.

    The function generates null regulator scores from a directory containing
    the inferred W's from the shuffled labels.

    Parameters
    ----------
    dir
        Directory containing the shuffled-label inferred W's.
    perturbed_genes
        list of perturbed genes used from the comparisons with true labels

    Returns
    -------
    null_score_list
        list of null regulator scores

    """
    files = glob.glob('{}/WS1*'.format(dir))
    null_regscore_df = pd.DataFrame()
    for file in files:
        seed = file.split('_')[-1].split('.')[0]
        ws1 = pd.read_csv(file, sep='\t', index_col=0)
        ws2 = pd.read_csv(file.replace('WS1', 'WS2'), sep='\t', index_col=0)
        null_regscore, _ = get_diff_regulatory_activity(
            perturbed_genes,
            ws1, ws2, top_regs=20)
        null_regscore.columns = ['gene', seed]
        if null_regscore_df.shape[0] == 0:
            null_regscore_df = null_regscore
        else:
            null_regscore_df = pd.merge(null_regscore_df, null_regscore,
                                        on='gene')
    null_score_list = null_regscore_df.set_index('gene').values.flatten()
    return null_score_list


def get_median_rank_position(actual, predicted, n, model):
    """To get the median relative rank position.

    The method is used when a ground-truth reference is known and, given a
    query list, we want to know the median rank position of the reference
    objects.

    Parameters
    ----------
    actual
        array of genes that are the ground truth
    predicted
        array of genes to query
    n
        length of the query list
    model
        model used to get the query list

    Returns
    -------
    out
        median rank

    """
    predicted = pd.DataFrame(predicted).reset_index().set_index(0)
    predicted['index'] = (predicted['index']/n)*100
    rank = []
    for r in predicted.ix[actual, :].iterrows():
        val = r[1][0]
        rank.append(val)
    out = (np.median(rank), model)
    return out
--------------------------------------------------------------------------------
/script/run_epee.py:
--------------------------------------------------------------------------------
# Copyright 2018, Viren Amin, Murat Can Cobanoglu
# Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
# 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
# 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

# filter future warnings
import warnings
warnings.simplefilter("ignore", category=FutureWarning)

from epee import *
import numpy as np
import pandas as pd
import argparse
import logging
import time
import os
import itertools
import multiprocessing
from time import localtime, strftime

# set tensorflow verbosity
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'


parser = argparse.ArgumentParser()
parser.add_argument("-a", "--conditiona", help="RNA-seq data for Condition A",
                    type=str, required=True)
parser.add_argument("-b", "--conditionb", help="RNA-seq data for Condition B",
                    type=str, required=True)
parser.add_argument("-na", "--networka", help="Network for condition A",
                    type=str, required=True)
parser.add_argument("-nb", "--networkb", help="Network for condition B",
                    type=str, required=True)
# DEFAULTS
parser.add_argument("-o", "--output", help="output directory", type=str,
                    default='')
parser.add_argument("-reg1", "--lregularization", help="lasso regularization \
                    parameter", type=float, default=0.01)
parser.add_argument("-reg2", "--gregularization", help="graph constrained \
                    regularization parameter", type=float, default=0.01)
parser.add_argument("-s", "--step", help="optimizer learning-rate",
                    type=float, default=0.0001)
parser.add_argument("-c", "--conditioning", help="Weight for the interactions \
                    not known", type=bool, default=True)
parser.add_argument("-r", "--runs", help="Number of independent runs",
                    type=int, default=20)
parser.add_argument("-i", "--iterations", help="Number of iterations",
                    type=int, default=100000)
parser.add_argument("-ag", "--aggregation", help="""
                    Method for aggregating runs. Default: "sum"
                    Valid options: {"mean", "median", "sum"} """,
                    type=str, default='sum')
parser.add_argument("-n", "--normalize", help="""
                    Weight normalization strategy. Default: "minmax"
                    Valid options: {"minmax", "log", "log10", "no"} """,
                    type=str, default='minmax')
parser.add_argument("-m", "--model", help="""
                    Model regularization choice. Default: "epee-gcl"
                    Valid options: {"epee-gcl", "epee-l", "no-penalty"} """,
                    type=str, default='epee-gcl')
parser.add_argument("-v", "--verbose",
                    help="logging info levels 10, 20, or 30",
                    type=int, default=10)
# OPTIONAL SETTINGS
parser.add_argument("-eval", "--evaluate",
                    help="Evaluation mode available for Th1, Th2, Th17, \
                    Bmem, COAD, and AML",
                    type=str, default=None)
parser.add_argument("-pr", "--prefix",
                    help="Add prefix to the log",
                    type=str, default=strftime('%Y%m%d'))
# OPTIONAL FLAGS
parser.add_argument("-w", "--store_weights",
                    help="Store all the inferred weights",
                    action='store_true')
parser.add_argument("-mp", "--multiprocess",
                    help="multiprocess the calculation of perturb and \
                    regulator scores", action='store_true')
# NULL FLAG
parser.add_argument("-null", "--null",
                    help="Generate null scores by label permutation",
                    action='store_true')
# NULL SETTINGS
parser.add_argument("-d", "--seed", help="Starting seed number",
                    type=int, default=0)
parser.add_argument("-p", "--perturb",
                    help="True label perturb scores. Required when running \
                    permutations for the null model",
                    type=str, default=None)
parser.add_argument("-sg", "--shuffle_genes",
                    help="Generate null scores by gene permutation",
                    action='store_true')


def get_scores(sel):
    """To get perturb and regulator scores."""
    y1, w1, w1_df, y2, w2, w2_df, count = sel

    # Calculate perturb scores
    genescore_runi = get_perturb_scores(Y1, y1, X1, w1,
                                        Y2, y2, X2, w2, S1, S2)
    genescore_runi.columns = ['gene', 'set{}'.format(count)]

    if args.null:
        regscore_runi, diff_regs = get_diff_regulatory_activity(
            actual_perturb['gene'][:1000],
            w1_df, w2_df, top_regs=20)
    else:
        regscore_runi, diff_regs = get_diff_regulatory_activity(
            genescore_runi['gene'][:1000],
            w1_df, w2_df, top_regs=20)

    regscore_runi.columns = ['gene', 'set{}'.format(count)]

    return (genescore_runi, regscore_runi)


def run_epee():
    """To run EPEE with the specified inputs."""
    logging.info('SAMPLES: Y1: {} | Y2: {}'.format(Y1.shape[1], Y2.shape[1]))
    logging.info('Tensorflow: {}'.format(tf.__version__))
    logging.info('GENES: {}'.format(Y1.shape[0]))
    logging.info('TFs: {}'.format(S1.shape[1]))
    logging.info('MODEL LEARNING STARTED')
    genescore_df = pd.DataFrame()
    regscore_df = pd.DataFrame()
    loss_runs = []
    y1_s = []
    y2_s = []
    w1_s = []
    w2_s = []
    w1S1_s = []
    w2S2_s = []

    for rid in range(args.runs):
        start = time.time()
        logging.debug('Tensorflow: {}'.format(tf.__version__))
        logging.debug('MODEL: {} learning Y1'.format(rid))
        y1, w1, loss_arr1 = run_model(np.array(Y1), np.array(X1),
                                      np.array(S1),
                                      l_reg=args.lregularization,
                                      g_reg=args.gregularization,
                                      step=args.step,
                                      itr=args.iterations,
                                      log_itr=round(args.iterations/20),
                                      seed=rid+args.seed,
                                      model=args.model,
                                      val=condition_val)
        logging.debug('MODEL: {} learning Y2'.format(rid))
        y2, w2, loss_arr2 = run_model(np.array(Y2), np.array(X2),
                                      np.array(S2),
                                      l_reg=args.lregularization,
                                      g_reg=args.gregularization,
                                      step=args.step,
                                      itr=args.iterations,
                                      log_itr=round(args.iterations/20),
                                      seed=rid+args.seed,
                                      model=args.model,
                                      val=condition_val)

        loss_runs.append((rid, loss_arr1[-1], loss_arr2[-1]))

        # Calculate w1S1 and w2S2
        w1_s1 = np.multiply(w1, S1)
        w2_s2 = np.multiply(w2, S2)

        w1_df = get_weights_df(w1_s1, Y1.index, X1.index)
        w2_df = get_weights_df(w2_s2, Y2.index, X2.index)

        w1o_df = get_weights_df(w1, Y1.index, X1.index)
        w2o_df = get_weights_df(w2, Y2.index, X2.index)

        # Store dataframes
        y1_s.append(y1)
        y2_s.append(y2)
        w1_s.append(w1)
        w2_s.append(w2)
        w1S1_s.append(w1_df)
        w2S2_s.append(w2_df)

        # Output inferred weights if args.store_weights is True and
        # args.null is False
        if args.store_weights and not args.null:
            w1o_df.to_csv('{}/model/w1_{}.txt'.format(outdir, rid),
                          sep='\t')
            w2o_df.to_csv('{}/model/w2_{}.txt'.format(outdir, rid),
                          sep='\t')
            if rid == 0:
                S1.to_csv('{}/model/S1_input.txt'.format(outdir),
                          sep='\t')
                S2.to_csv('{}/model/S2_input.txt'.format(outdir),
                          sep='\t')
                X1.to_csv('{}/model/X1_input.txt'.format(outdir),
                          sep='\t')
                X2.to_csv('{}/model/X2_input.txt'.format(outdir),
                          sep='\t')
                Y1.to_csv('{}/model/Y1_input.txt'.format(outdir),
                          sep='\t')
                Y2.to_csv('{}/model/Y2_input.txt'.format(outdir),
                          sep='\t')

        end = time.time()

        logging.info('MODEL: {} RUNTIME: {} mins'.format(
            rid, round((end-start)/60, 2)))

    # For each pair of inferred weights, calculate perturb and regulator
    # scores
    # logging.info('CALCULATE PERTURB AND REGULATOR SCORES')
    logging.info('SCORES: pairwise comparison of all Y1 and Y2 models')

    list_runs = list(range(args.runs))
    pairs = list(itertools.product(list_runs, list_runs))
    score_inputs = []
    for count, p in enumerate(pairs):
        m1, m2 = p
        score_inputs.append((y1_s[m1], w1_s[m1], w1S1_s[m1],
                             y2_s[m2], w2_s[m2], w2S2_s[m2],
                             count))

    if args.multiprocess:
        cpu_count = multiprocessing.cpu_count()
        p = multiprocessing.Pool(int(cpu_count/2))
        out = p.map(get_scores, score_inputs)
    else:
        out = []
        for i in score_inputs:
            i_out = get_scores(i)
            out.append(i_out)

    for count, scores in enumerate(out):
        genescore_runi, regscore_runi = scores
        if count == 0:
            genescore_df = genescore_runi.copy()
            regscore_df = regscore_runi.copy()
        else:
            # if np.all(genescore_runi.index == genescore_df.index):
            #     genescore_df[genescore_runi.columns[1]] = genescore_runi.iloc[:, 1]
            # else:
            genescore_df = pd.merge(genescore_df, genescore_runi, on='gene')
            # if np.all(regscore_runi.index == regscore_df.index):
            #     regscore_df[regscore_runi.columns[1]] = regscore_runi.iloc[:, 1]
            # else:
            regscore_df = pd.merge(regscore_df, regscore_runi, on='gene')

    sum_genescore_df = get_summary_scoresdf(genescore_df, args.aggregation)
    sum_regscore_df = get_summary_scoresdf(regscore_df, args.aggregation)

    if args.null:
        sum_regscore_df.to_csv('{}/null/regulator_scores_{}.txt'.format(
                                   outdir, args.seed),
                               sep='\t')
        sum_genescore_df.to_csv('{}/null/perturb_scores_{}.txt'.format(
                                    outdir, args.seed),
                                sep='\t')
        regscore_df.to_csv('{}/null/all_regulator_scores_{}.txt'.format(
                               outdir, args.seed),
                           sep='\t')
        genescore_df.to_csv('{}/null/all_perturb_scores_{}.txt'.format(
                                outdir, args.seed),
                            sep='\t')
    else:
        sum_regscore_df.to_csv('{}/scores/regulator_scores.txt'.format(
            outdir), sep='\t')
        sum_genescore_df.to_csv('{}/scores/perturb_scores.txt'.format(
            outdir), sep='\t')
        regscore_df.to_csv('{}/scores/all_regulator_scores.txt'.format(
            outdir), sep='\t')
        genescore_df.to_csv('{}/scores/all_perturb_scores.txt'.format(
            outdir), sep='\t')

    loss_df = pd.DataFrame(loss_runs)
    loss1_df = pd.DataFrame(loss_arr1)
    loss2_df = pd.DataFrame(loss_arr2)
    loss_df.to_csv('{}/model/loss_runs.txt'.format(outdir),
                   sep='\t')
    loss1_df.to_csv('{}/model/loss1_arr_y1.txt'.format(outdir),
                    sep='\t')
    loss2_df.to_csv('{}/model/loss2_arr_y2.txt'.format(outdir),
                    sep='\t')

    if args.evaluate:

        known_regs = {'Th2': ['STAT6', 'GATA3'],
                      'Th1': ['TBX21', 'STAT1'],
                      'Th17': ['ARID5A', 'RORA', 'STAT3'],
                      'Bmem': ['STAT5A'],
                      'AML': ['WT1', 'MYB', 'ETV6', 'SOX4', 'CEBPA', 'RUNX1'],
                      'COAD': ['MYC', 'KLF4']
                      }
        regs = X1.index
        actual = known_regs[args.evaluate]
        line = [args.model, args.prefix, args.seed, args.lregularization,
                args.gregularization, len(Y1.columns), len(Y2.columns),
                args.conditiona, args.conditionb]

        for t in actual:
            line.append(list(regscore_df['gene']).index(t))

        logging.info(line)

        median_rank = get_median_rank_position(actual,
                                               list(regscore_df['gene']),
                                               len(regs), args.model)
        logging.info('RANK: {}'.format(median_rank[0]))
        print('RANK: {}'.format(median_rank[0]))

        line.append(median_rank[0])

        performance = '\t'.join(map(str, line))

        outfilename = '{}/{}_eval.txt'.format(args.output, args.evaluate)
        # write header if the file does not exist
        if not os.path.isfile(outfilename):
            with open(outfilename, 'a') as myfile:
                headerlist = ['MODEL', 'PREFIX', 'SEED', 'L1', 'L2',
                              'SamplesY1', 'SamplesY2', 'PathY1', 'PathY2']
                for t in actual:
                    headerlist.append('{}_RANK'.format(t))
                headerlist.append('MEDIAN_RANK')
                header = '\t'.join(headerlist)
                myfile.write('{}\n'.format(header))

        with open(outfilename, 'a') as myfile:
            myfile.write('{}\n'.format(performance))


if __name__ == '__main__':

    run_start = time.time()
    args = parser.parse_args()
    outdir = '{d}{p}_{m}_{l}_{g}_{s}'.format(d=args.output, p=args.prefix,
                                             m=args.model,
                                             l=args.lregularization,
                                             g=args.gregularization,
                                             s=args.seed)
    if args.null:
        if args.perturb is None:
            raise RuntimeError('Please provide the perturb scores generated '
                               'with the actual labels: --perturb <path>')
        os.makedirs(os.path.dirname('{}/null/'.format(outdir)),
                    exist_ok=True)
        logfile = '{}/null_log.txt'.format(outdir)
        actual_perturb = pd.read_csv(args.perturb, index_col=0, sep='\t')
    else:
        os.makedirs(os.path.dirname('{}/model/'.format(outdir)),
                    exist_ok=True)
        os.makedirs(os.path.dirname('{}/scores/'.format(outdir)),
                    exist_ok=True)
        logfile = '{}/log.txt'.format(outdir)

    for handler in logging.root.handlers[:]:
        logging.root.removeHandler(handler)
    logging.basicConfig(filename=logfile,
                        level=args.verbose)

    logging.info('####### {} STARTING ANALYSIS #######'.format(args.prefix))
    logging.debug('Multiprocessing: {}'.format(args.multiprocess))
    logging.info('EPEE: {}'.format(strftime("%a, %d %b %Y %H:%M:%S",
                                            localtime())))
    Y1, Y2, X1, X2, S1, S2, condition_val = get_epee_inputs(
        args.conditiona, args.conditionb,
        args.networka, args.networkb,
        conditioning=args.conditioning,
        weightNormalize=args.normalize,
        null=args.null,
        shuffleGenes=args.shuffle_genes,
        seed=args.seed)
    run_epee()
    run_end = time.time()

    logging.info('Time elapsed: {} mins'.format(
        round((run_end-run_start)/60, 2)))
    logging.info('####### {} ANALYSIS COMPLETED ########'.format(args.prefix))
--------------------------------------------------------------------------------
/test/tests.sh:
--------------------------------------------------------------------------------
# # testing scenario 1 without multiprocessing
# python ../script/run_epee.py --conditiona data/CD4_Naive.txt.gz --conditionb data/CD4_Th2.txt.gz --networka data/cd4+_t_cells.txt.gz --networkb data/cd4+_t_cells.txt.gz -r 2 -i 100 --prefix test_out
#
# # testing scenario 2 with multiprocessing
# python ../script/run_epee.py --conditiona data/CD4_Naive.txt.gz --conditionb data/CD4_Th2.txt.gz --networka data/cd4+_t_cells.txt.gz --networkb data/cd4+_t_cells.txt.gz -r 2 -i 100 --prefix test_out --multiprocess
#
# # testing scenario 3 null without multiprocessing
# python ../script/run_epee.py --conditiona data/CD4_Naive.txt.gz --conditionb data/CD4_Th2.txt.gz --networka data/cd4+_t_cells.txt.gz --networkb data/cd4+_t_cells.txt.gz -r 2 -i 100 --prefix test_out --perturb data/perturb_score.txt.gz -null
#
# # testing scenario 4 null with multiprocessing
# python ../script/run_epee.py --conditiona data/CD4_Naive.txt.gz --conditionb data/CD4_Th2.txt.gz --networka data/cd4+_t_cells.txt.gz --networkb data/cd4+_t_cells.txt.gz -r 2 -i 100 --prefix test_out --perturb data/perturb_score.txt.gz -null --multiprocess

# testing tfwrapper
../script/analysis/tfwrapper_paramsrunner.sh python ../script/run_epee.py --conditiona data/CD4_Naive.txt.gz --conditionb data/CD4_Th2.txt.gz --networka data/cd4+_t_cells.txt.gz --networkb data/cd4+_t_cells.txt.gz -r 2 -i 100 --prefix test_tfwrapper
--------------------------------------------------------------------------------