├── LICENSE
├── README.md
├── env
│   ├── epee_CPU.txt
│   └── epee_GPU.txt
├── script
│   ├── epee.py
│   └── run_epee.py
└── test
    └── tests.sh

/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2018, University of Texas Southwestern Medical Center. All rights reserved.
Contributors: Viren Amin, Murat Can Cobanoglu
Department: Lyda Hill Department of Bioinformatics

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# EPEE

Effectors and Perturbation Estimation Engine (EPEE) is a sparse linear model with graph-constrained lasso regularization for differential analysis of RNA-seq data. The inputs are transcriptomic data for the two conditions under comparison, and context-specific TF-gene networks. If the transcriptomic data is sequencing based, then it needs to be normalized to TPM, FPKM, or RPKM. EPEE is implemented in Python, using TensorFlow.

### Citation

Viren Amin, Didem Agac, Spencer D Barnes, Murat Can Cobanoglu, "Accurate differential analysis of transcription factor activity from gene expression", *Bioinformatics*, May 2019.

https://doi.org/10.1093/bioinformatics/btz398

### Inputs

- conditionA.txt and conditionB.txt

  EPEE requires an expression data matrix for each of the two conditions. Input is a tab-delimited file in which columns are the samples and rows are the genes. The first column of the file needs to contain the gene names.
  Please do not log-normalize the dataset before running EPEE. EPEE log-transforms the data to log(TPM/FPKM/RPKM + 1).

- networkA and networkB

  EPEE requires context-specific networks. Currently EPEE supports the 426 context-specific networks published by Marbach et al., Nature Methods, 2016.
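For illustration, here is the layout the two kinds of input files follow. The gene names, sample names, and values below are made up; only the structure matters (tab characters are shown as spaces). An expression matrix has a header row of sample names and gene names in the first column:

```
gene    sample1  sample2  sample3
GATA3   12.5     10.1     11.7
STAT6   8.3      7.9      9.2
```

A network file is a headerless, tab-delimited, three-column table of TF, target gene, and regulatory score, as parsed by `_network_df` in `script/epee.py`:

```
GATA3   IL4      0.85
STAT6   GATA3    0.42
```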
### Setup

1. Install [Anaconda](https://www.anaconda.com/download)

2. Download the [networks](http://regulatorycircuits.org/download.html) and example [data](https://github.com/Cobanoglu-Lab/EPEE/tree/master/test/data) to run EPEE.

3. Clone the git repository and set up the conda environment to run EPEE. We provide the environment files in the `env` directory. If your machine has a GPU card, then we recommend that you use the `epee_GPU.txt` file to create the environment; otherwise create the environment using `epee_CPU.txt`. These files are explicit package lists, so they go to `conda create --file` (as stated in their headers):
```
conda create -n epee --file epee_CPU.txt
```
Activate the new environment. Windows: `activate epee`; macOS and Linux: `source activate epee`

4. View the available human networks, and determine the network appropriate for your context.

5. Usage to run EPEE (the angle-bracketed values are placeholders for your own paths):

```
python run_epee.py -a <conditionA.txt>
                   -b <conditionB.txt>
                   -na <networkA.txt>
                   -nb <networkB.txt>
                   -o <output_directory>
```

### Example

##### CD4 Naive vs Th2 differential analysis

```
python run_epee.py -a ../data/rnaseq/immune/CD4_Naive.txt.gz \
                   -b ../data/rnaseq/immune/CD4_Th2.txt.gz \
                   -na ../data/network/cd4+_t_cells.txt.gz \
                   -nb ../data/network/cd4+_t_cells.txt.gz \
                   -o /path/to/output_directory/ \
                   --prefix Th2
```

##### Normal Colon vs Colorectal Adenocarcinoma (COAD) differential analysis

```
python run_epee.py -a ../data/rnaseq/tcga/TCGA_COAD_SolidTissueNormal_FPKM_UQ.txt \
                   -b ../data/rnaseq/tcga/TCGA_COAD_PrimaryTumor_FPKM_UQ.txt \
                   -na ../data/network/20_gastrointestinal_system.txt.gz \
                   -nb ../data/network/20_gastrointestinal_system.txt.gz \
                   -o /path/to/output_directory/ \
                   --prefix COAD
```
--------------------------------------------------------------------------------
/env/epee_CPU.txt:
--------------------------------------------------------------------------------
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
https://repo.continuum.io/pkgs/main/linux-64/anaconda-custom-py36hbbc8b67_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/bleach-1.5.0-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2017.11.5-0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/certifi-2017.11.5-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/html5lib-0.9999999-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/intel-openmp-2018.0.0-hc7b2577_8.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libgcc-ng-7.2.0-h7cc24e2_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libgfortran-ng-7.2.0-h9f7466a_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libprotobuf-3.4.1-h5b8497f_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libstdcxx-ng-7.2.0-h7a57d05_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/markdown-2.6.9-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/mkl-2018.0.1-h19d6760_4.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ncurses-5.9-10.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/numpy-1.14.0-py36h3dfced4_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/openssl-1.0.2n-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/pandas-0.22.0-py36hf484d3e_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/patsy-0.5.0-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/pip-9.0.1-py36_1.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/protobuf-3.4.1-py36h306e679_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/python-3.6.4-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/python-dateutil-2.6.1-py36h88d3b88_1.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/pytz-2017.3-py36h63b9c63_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/readline-7.0-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/scikit-learn-0.19.1-py36h7aa7ec6_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/scipy-1.0.0-py36hbf646e7_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/setuptools-38.4.0-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/six-1.11.0-py36h372c433_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/sqlite-3.20.1-2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/statsmodels-0.8.0-py36h8533d0b_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-1.4.1-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-base-1.4.1-py36hd00c003_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-tensorboard-0.1.5-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/tk-8.6.7-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/werkzeug-0.14.1-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/wheel-0.30.0-py36_2.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/xz-5.2.3-0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/zlib-1.2.11-0.tar.bz2
--------------------------------------------------------------------------------
/env/epee_GPU.txt:
--------------------------------------------------------------------------------
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
https://repo.continuum.io/pkgs/main/linux-64/anaconda-custom-py36hbbc8b67_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/bleach-1.5.0-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2017.11.5-0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/certifi-2017.11.5-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/cudatoolkit-8.0-3.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/cudnn-7.0.5-cuda8.0_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/html5lib-0.9999999-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/intel-openmp-2018.0.0-hc7b2577_8.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libgcc-ng-7.2.0-h7cc24e2_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libgfortran-ng-7.2.0-h9f7466a_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libprotobuf-3.4.1-h5b8497f_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/libstdcxx-ng-7.2.0-h7a57d05_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/markdown-2.6.9-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/mkl-2018.0.1-h19d6760_4.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ncurses-5.9-10.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/numpy-1.14.0-py36h3dfced4_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/openssl-1.0.2n-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/pandas-0.22.0-py36hf484d3e_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/patsy-0.5.0-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/pip-9.0.1-py36_1.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/protobuf-3.4.1-py36h306e679_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/python-3.6.4-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/python-dateutil-2.6.1-py36h88d3b88_1.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/pytz-2017.3-py36h63b9c63_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/readline-7.0-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/scikit-learn-0.19.1-py36h7aa7ec6_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/scipy-1.0.0-py36hbf646e7_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/setuptools-38.4.0-py36_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/six-1.11.0-py36h372c433_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/sqlite-3.20.1-2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/statsmodels-0.8.0-py36h8533d0b_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-1.4.1-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-base-1.4.1-py36hd00c003_2.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-gpu-1.4.1-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-gpu-base-1.4.1-py36h01caf0a_0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/tensorflow-tensorboard-0.1.5-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/tk-8.6.7-0.tar.bz2
https://conda.anaconda.org/anaconda/linux-64/werkzeug-0.14.1-py36_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/wheel-0.30.0-py36_2.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/xz-5.2.3-0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/zlib-1.2.11-0.tar.bz2
--------------------------------------------------------------------------------
/script/epee.py:
--------------------------------------------------------------------------------
# Copyright 2018, Viren Amin, Murat Can Cobanoglu
# Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
# 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
# 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import pandas as pd
import tensorflow as tf
import numpy as np
import logging
import sklearn.preprocessing as pp
import glob
import statsmodels
import statsmodels.api as sm


def _normalizeweight(df, metric):
    """To normalize the TF-gene regulatory network weights.

    Parameters
    ----------
    df
        Pandas dataframe containing the network. Tab-delimited file where the
        first column contains the TF name, the second column contains the
        regulated gene name, and the third column contains the TF-gene
        regulatory score.
    metric
        metric used to normalize the TF-gene regulatory score. 'minmax':
        log-transforms the regulatory scores and then minmax-scales them
        between 1 and 2. 'log': performs natural logarithm transformation of
        the regulatory scores. 'log10': performs base 10 logarithmic
        transformation of the regulatory score. 'no': performs no
        normalization of the regulatory score.

    Returns
    -------
    df
        Pandas dataframe containing the normalized weight

    """
    if metric == 'minmax':
        weights = np.log(df['Score']).values.reshape(-1, 1)
        norm_score = pp.MinMaxScaler(feature_range=(1, 2)).fit_transform(
            weights)
    if metric == 'log':
        norm_score = np.log(df['Score']+1)
    if metric == 'log10':
        norm_score = np.log10(df['Score']+1)
    if metric == 'no':
        norm_score = df['Score']
    df['Score'] = norm_score
    return df
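
# Worked example of the default 'minmax' metric (hypothetical scores): raw
# scores [1, e, e**2] log-transform to [0, 1, 2], which then minmax-scale to
# the final weights [1.0, 1.5, 2.0].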


def _network_df(filename, Y, conditioning, weightNormalize='minmax'):
    """To take network filename and generate pandas dataframe object.

    Function takes the file path of the TF-gene regulatory network data and
    returns the pandas dataframe object containing rows as genes and columns
    as TFs. Each value in the table represents the putative regulation score
    of the TF regulating the gene.

    Parameters
    ----------
    filename
        Path to the TF-gene regulatory network data.
    Y
        Pandas dataframe containing expression data.
    conditioning
        Boolean. If conditioning is True, then TF-gene pairs with no putative
        regulation are given a small uniform score: 1/10 of the minimum
        TF-gene regulatory score.
    weightNormalize
        metric used to normalize the TF-gene regulatory score. 'minmax':
        performs the minmax scaling of regulatory scores between 1 and 2.
        'log': performs natural logarithmic transformation of the regulatory
        scores. 'log10': performs base 10 logarithmic transformation of the
        regulatory score. 'no': performs no normalization of the regulatory
        score.

    Returns
    -------
    S
        Pandas dataframe containing the network. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.
    min_val
        value used for the conditioning

    """
    S = pd.read_csv(filename, sep='\t', header=None)
    S.columns = ['TF', 'Target', 'Score']
    S = _normalizeweight(S, weightNormalize)
    if conditioning:
        min_val = np.min(S['Score'])/10
    else:
        min_val = 0
    S = pd.pivot_table(S, values='Score', index='Target', columns='TF',
                       fill_value=min_val)
    regmask = [tf in Y.index for tf in S.columns]

    S = S.ix[Y.index, regmask].fillna(min_val)
    S = S.astype(np.float32)
    return (S, min_val)


def _exp_df(filename):
    """To take expression data filename and generate pandas dataframe object.

    Function takes the file path of the expression data and returns the
    log(normalized expression data + 1).

    Parameters
    ----------
    filename
        Path to the expression data.

    Returns
    -------
    Y
        Pandas dataframe containing log(normalized expression data + 1)

    """
    Y = pd.read_csv(filename, sep='\t', index_col=0)
    Y = np.log(Y + 1)
    Y = Y.ix[Y.index.drop_duplicates(keep=False), :]
    Y = Y.astype(np.float32)
    return Y


def _eval_indices(Y1, Y2, S1, S2):
    """To evaluate whether the indices are the same.

    Evaluate whether the indices are the same for both conditions' expression
    data and the context-specific network dataframes.

    Parameters
    ----------
    Y1
        Pandas dataframe containing condition 1 expression data
    Y2
        Pandas dataframe containing condition 2 expression data
    S1
        Pandas dataframe containing network 1. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.
    S2
        Pandas dataframe containing network 2. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.

    Returns
    -------
    None

    """
    if all(Y1.index == Y2.index):
        logging.debug('Y index are equal')
    else:
        raise RuntimeError('Y1 Y2 index are not equal')

    if all(S1.index == S2.index):
        logging.debug('S index are equal')
    else:
        raise RuntimeError('S1 S2 index are not equal')

    if all(S1.index == Y1.index):
        logging.debug('S and Y are equal')
    else:
        raise RuntimeError('S Y index are not equal')


def _get_min_samples(Y1, Y2, X1, X2, min_samples, seed=0):
    """Select min samples among conditions.

    Function takes the expression dataframes of the two conditions and
    shuffles the sample labels. If the samples are not evenly distributed
    between the conditions, the output contains the minimum number of
    samples for each of the two conditions.

    Parameters
    ----------
    Y1
        Pandas dataframe containing condition 1 expression data
    Y2
        Pandas dataframe containing condition 2 expression data
    X1
        Pandas dataframe containing TF expression across condition 1 samples
    X2
        Pandas dataframe containing TF expression across condition 2 samples
    min_samples
        Minimum number of samples between condition 1 and condition 2
        expression data
    seed
        Seed for random sampling. Default = 0.

    Returns
    -------
    mY1
        Pandas dataframe containing subsampled expression data for
        condition 1
    mY2
        Pandas dataframe containing subsampled expression data for
        condition 2
    mX1
        Pandas dataframe containing subsampled TF expression data
        for condition 1
    mX2
        Pandas dataframe containing subsampled TF expression data
        for condition 2

    """
    Y1_samples = list(Y1.columns)
    Y2_samples = list(Y2.columns)
    np.random.shuffle(Y1_samples)
    np.random.shuffle(Y2_samples)
    mY1 = Y1.loc[:, Y1_samples[:min_samples]]
    mY2 = Y2.loc[:, Y2_samples[:min_samples]]
    mX1 = X1.loc[:, Y1_samples[:min_samples]]
    mX2 = X2.loc[:, Y2_samples[:min_samples]]
    return (mY1, mY2, mX1, mX2)


def _shuffled_inputs(Y1, Y2, X1, X2, seed=0, shuffleGenes=False):
    """Shuffle the labels of the input samples.

    Function takes the expression dataframes of the two conditions and
    shuffles the labels.

    Parameters
    ----------
    Y1
        Pandas dataframe containing condition 1 expression data
    Y2
        Pandas dataframe containing condition 2 expression data
    X1
        Pandas dataframe containing TF expression across condition 1 samples
    X2
        Pandas dataframe containing TF expression across condition 2 samples
    seed
        Seed for random sampling. Default = 0.
    shuffleGenes
        Boolean flag for shuffling genes instead of labels. Default = False

    Returns
    -------
    sY1
        Pandas dataframe containing shuffled-label expression data for
        condition 1
    sY2
        Pandas dataframe containing shuffled-label expression data for
        condition 2
    sX1
        Pandas dataframe containing shuffled-label TF expression data
        for condition 1
    sX2
        Pandas dataframe containing shuffled-label TF expression data
        for condition 2

    """
    np.random.seed(seed)

    Y1_2 = pd.merge(Y1, Y2, left_index=True, right_index=True, how='inner')
    X1_2 = pd.merge(X1, X2, left_index=True, right_index=True, how='inner')

    n_cols = Y1.shape[1]
    if not shuffleGenes:
        # We will shuffle labels
        samples = list(Y1_2.columns)
        np.random.shuffle(samples)
        sY1 = Y1_2.loc[:, samples[:n_cols]]
        sY2 = Y1_2.loc[:, samples[n_cols:]]
        sX1 = X1_2.loc[:, samples[:n_cols]]
        sX2 = X1_2.loc[:, samples[n_cols:]]
    else:
        # We will shuffle genes
        genes = list(Y1_2.index)
        np.random.shuffle(genes)
        sY1 = Y1_2.loc[genes, Y1.columns]
        sY2 = Y1_2.loc[genes, Y2.columns]
        for df in [sY1, sY2]:
            df.index = Y1_2.index
        tfs = list(X1_2.index)
        sX1 = sY1.loc[tfs, X1.columns]
        sX2 = sY2.loc[tfs, X2.columns]
        for df in [sX1, sX2]:
            df.index = tfs

    return (sY1, sY2, sX1, sX2)


def get_epee_inputs(c1, c2, n1, n2, conditioning=True, weightNormalize='minmax',
                    null=False, shuffleGenes=False, seed=0):
    """To generate inputs for EPEE.

    Function takes the network and expression data filenames and generates
    inputs for EPEE to run.

    Parameters
    ----------
    c1
        filename containing condition 1 expression data
    c2
        filename containing condition 2 expression data
    n1
        filename containing network 1
    n2
        filename containing network 2
    conditioning
        whether to give a small weight value to non-putative TF-gene
        interactions
    weightNormalize
        Three weight normalization metrics are implemented. 'minmax': log
        normalize the weights and scale them between 1 and 2. 'log': log
        normalize weight+1. 'log10': use log base 10 to normalize weight+1.
    null
        If flag is set, then samples in condition1 and condition2 are shuffled
    shuffleGenes
        If flag is set, we run null tests by shuffling the genes instead of
        the labels
    seed
        Seed for random sampling

    Returns
    -------
    Y1
        Pandas dataframe containing condition 1 expression data
    Y2
        Pandas dataframe containing condition 2 expression data
    X1
        Pandas dataframe containing TF expression across condition 1 samples
    X2
        Pandas dataframe containing TF expression across condition 2 samples
    S1
        Pandas dataframe containing network 1. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.
    S2
        Pandas dataframe containing network 2. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.
    conditioning_val
        Soft-thresholding value used for non-putative TF-gene interactions

    """
    Y1 = _exp_df(c1)
    Y2 = _exp_df(c2)

    S1, conditioning_val = _network_df(n1, Y1, conditioning,
                                       weightNormalize=weightNormalize)
    S2, _ = _network_df(n2, Y2, conditioning, weightNormalize=weightNormalize)

    if n1 != n2:
        genes = list(set(np.concatenate([S1.index, S2.index])))
        tfs = list(set(np.concatenate([S1.columns, S2.columns])))
        S1 = S1.ix[genes, tfs].fillna(conditioning_val)
        S2 = S2.ix[genes, tfs].fillna(conditioning_val)
        Y1 = Y1.ix[S1.index, :].fillna(0)
        Y2 = Y2.ix[S2.index, :].fillna(0)
    else:
        Y1 = Y1.ix[S1.index, :].fillna(0)
        Y2 = Y2.ix[S2.index, :].fillna(0)

    X1 = Y1.ix[S1.columns, :]
    X2 = Y2.ix[S2.columns, :]
    _eval_indices(Y1, Y2, S1, S2)
    if null:
        Y1, Y2, X1, X2 = _shuffled_inputs(Y1, Y2, X1, X2,
                                          seed=seed, shuffleGenes=shuffleGenes)
    return (Y1, Y2, X1, X2, S1, S2, conditioning_val)
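
# Example usage (hypothetical file paths) -- assembling the model inputs for
# two conditions that share one context-specific network:
#
#     Y1, Y2, X1, X2, S1, S2, cval = get_epee_inputs(
#         'CD4_Naive.txt.gz', 'CD4_Th2.txt.gz',
#         'cd4+_t_cells.txt.gz', 'cd4+_t_cells.txt.gz')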


def _gcp(x, r):
    """To calculate the graph-constrained penalty term.

    Function that tensorflow uses to fold through each column of the matrix.
    It goes through each TF column vector, finds the target genes, creates a
    weights vector of the target genes, calculates the pairwise differences
    of the weights, performs a tanh transformation of the differences, and
    then sums the differences.

    Parameters
    ----------
    x
        the accumulated penalty from the previous fold step
    r
        masked weights of one TF across all genes

    Returns
    -------
    float
        Value with the sum of the weight differences between the target genes

    """
    ind = tf.where(tf.abs(r) > 0)
    vec = tf.expand_dims(tf.gather_nd(r, ind), 0)
    pc = tf.tanh(tf.transpose(vec)-vec)
    val = tf.divide(tf.reduce_sum(tf.abs(pc)), 2)
    return x+val
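
# For intuition, a plain-NumPy re-implementation of the penalty that the fold
# above accumulates. This sketch is illustrative only (`_gcp_numpy` is not
# part of EPEE) and assumes `weights_by_tf` is an iterable of the per-TF
# masked weight vectors that `tf.foldl` feeds to `_gcp`:
#
#     def _gcp_numpy(weights_by_tf):
#         total = 0.0
#         for r in weights_by_tf:
#             vec = r[np.abs(r) > 0]                     # this TF's targets
#             pc = np.tanh(vec[:, None] - vec[None, :])  # pairwise diffs
#             total += np.abs(pc).sum() / 2              # count each pair once
#         return total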


def run_model(Y, X, S, step, itr, log_itr, seed,
              l_reg=1e-4, g_reg=1e-4, stopthreshold=0.01, val=0,
              model='epee-gcl'):
    """To run the sparse linear model.

    There are two sparse linear models implemented: lasso and
    graph-constrained lasso. By default, the method runs
    graph-constrained lasso.

    Parameters
    ----------
    Y
        Pandas dataframe containing expression data. Rows are genes and
        columns are samples. Values are log(RPKM/TPM/FPKM + 1)
    X
        Pandas dataframe containing expression data of TFs. Rows are TFs and
        columns are samples. Values are log(RPKM/TPM/FPKM + 1)
    S
        Pandas dataframe containing the network. Rows are genes and columns
        are TFs. Values are weights corresponding to the TF regulating a gene
    step
        learning rate for the optimizer
    itr
        maximum number of training iterations
    log_itr
        Iterations at which to log the loss and percent change
    seed
        Setting the tensorflow random seed
    l_reg
        lasso regularization constant
    g_reg
        graph-constrained regularization constant
    stopthreshold
        intended threshold for stopping training based on the loss change
        between the previous and the current logged iteration
    val
        weight given to the unknown TF-gene pairs
    model
        model to use for regulator and perturbed gene inference score

    Returns
    -------
    curr_y
        Inferred Y
    curr_w
        Inferred W
    loss_arr
        Loss per each iteration

    """
    tf.set_random_seed(seed)
    genes, samples = Y.shape
    regulators = X.shape[0]
    S_h = np.copy(S)
    S_h = np.float32(S_h > val)

    with tf.Graph().as_default():
        w = tf.Variable(
            tf.random_gamma([genes, regulators], alpha=1, beta=30,
                            dtype=tf.float32, seed=seed))

        if model == 'no-penalty':
            # least squares loss without any regularization
            y = tf.matmul(tf.multiply(w, S), X)

            # loss
            loss = tf.reduce_mean(tf.square(Y - y))

        if model == 'epee-l':
            # least squares loss along with L1 regularization
            y = tf.matmul(tf.multiply(w, S), X)

            # loss
            loss = tf.reduce_mean(
                tf.square(Y - y))+tf.multiply(tf.reduce_sum(tf.abs(w)), l_reg)

        if model == 'epee-gcl':
            y = tf.matmul(tf.multiply(w, S), X)
            wa = tf.transpose(
                tf.multiply(tf.expand_dims(w, 1), tf.expand_dims(S_h, 1)))

            gc = tf.foldl(
                _gcp, wa, initializer=tf.constant(0, dtype=tf.float32),
                parallel_iterations=10, back_prop=False, swap_memory=True)

            loss = tf.reduce_mean(
                tf.square(Y-y)) + tf.multiply(
                    tf.reduce_sum(tf.abs(w)), l_reg) + tf.multiply(gc, g_reg)

        # optimizer
        optimizer = tf.train.AdamOptimizer(learning_rate=step, epsilon=1e-8)

        train = optimizer.minimize(loss)

        # training loop
        init = tf.global_variables_initializer()  # before starting, init vars
        sess = tf.Session()  # launch the graph
        sess.run(init)  # initialize the variables

        loss_arr = []
        # outarr = []
        for s in range(itr):
            sess.run(train)
            if s % 100 == 0:
                curr_loss = sess.run(loss)
                if np.isnan(curr_loss):
                    raise RuntimeError(
                        'NaN value was computed for the loss. Make sure that '
                        'the inputs are RPKM-, TPM-, or FPKM-normalized '
                        'without any transformation (i.e. log). You can also '
                        'try lowering the learning rate.')
                loss_arr.append(curr_loss)
                if len(loss_arr) >= 2:
                    delta = loss_arr[-2] - loss_arr[-1]

                    logging.debug((s, curr_loss, delta))
                    if delta < 1:
                        logging.debug("TRAINING FINISHED")
                        break
                else:
                    logging.debug((s, curr_loss))

        curr_w, curr_y, curr_loss = sess.run([w, y, loss])
    return (curr_y, curr_w, loss_arr)
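
# Example call (hypothetical settings) -- one run of the default 'epee-gcl'
# model on condition-1 data, mirroring how `run_epee.py` invokes it with its
# default learning rate, iteration count, and regularization constants:
#
#     y1, w1, losses = run_model(np.array(Y1), np.array(X1), np.array(S1),
#                                step=1e-4, itr=100000, log_itr=5000, seed=0,
#                                l_reg=0.01, g_reg=0.01)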


def get_perturb_scores(Y1, y1, X1, w1, Y2, y2, X2, w2, S1, S2):
    """To get perturb scores.

    The function calculates the log-likelihood ratio of the error obtained by
    swapping the weights between conditions to the error within the same
    condition. The function returns a sorted dataframe containing the
    perturbation scores.

    Parameters
    ----------
    Y1
        Pandas dataframe containing condition 1 expression data
    y1
        Pandas dataframe containing inferred condition 1 expression data
    X1
        Pandas dataframe containing TF expression across condition 1 samples
    w1
        Pandas dataframe containing inferred TF-gene weights for condition 1
    Y2
        Pandas dataframe containing condition 2 expression data
    y2
        Pandas dataframe containing inferred condition 2 expression data
    X2
        Pandas dataframe containing TF expression across condition 2 samples
    w2
        Pandas dataframe containing inferred TF-gene weights for condition 2
    S1
        Pandas dataframe containing network 1. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.
    S2
        Pandas dataframe containing network 2. Columns are TFs and rows are
        regulated genes. Values are the weights of the TF-gene regulation.

    Returns
    -------
    scores_df_sorted
        Pandas dataframe containing perturb scores

    """
    err11 = np.square(Y1-y1).sum(axis=1)
    err22 = np.square(Y2-y2).sum(axis=1)
    err12 = np.square(Y1-np.dot(np.multiply(w2, S2), X1)).sum(axis=1)
    err21 = np.square(Y2-np.dot(np.multiply(w1, S1), X2)).sum(axis=1)

    base = err11+err22
    err = err21+err12
    scores = err/base

    # sort_scores_idx = sorted(range(len(scores)), key=lambda k: scores[k])
    scores_df = pd.DataFrame({'gene': scores.index, 'score': scores})
    scores_df_sorted = scores_df.sort_values(by='score', ascending=False)
    return scores_df_sorted


def get_summary_scoresdf(df, metric='sum'):
    """To calculate scores from multiple models.

    Each independent model generates perturb and regulator scores. The
    function calculates a summary score for each perturbed gene and assigns
    that score to the gene. The function returns a dataframe with the
    summary scores sorted.

    Parameters
    ----------
    df
        Pandas dataframe containing scores from multiple models
    metric
        Metric used to summarize the scores from multiple models.
        'sum', 'mean' and 'median' are valid options. Default = 'sum'

    Returns
    -------
    out_df_sort
        Pandas dataframe containing the summarized scores

    """
    if metric == 'median':
        df_score = df.iloc[:, 1:].median(axis=1)
    if metric == 'sum':
        df_score = df.iloc[:, 1:].sum(axis=1)
    if metric == 'mean':
        df_score = df.iloc[:, 1:].mean(axis=1)
    out_df = pd.DataFrame({'gene': df['gene'], 'score': df_score})
    out_df_sort = out_df.sort_values(by='score', ascending=False)
    out_df_sort.reset_index(inplace=True, drop=True)
    return out_df_sort


def get_weights_df(w, genes, tfs):
    """To name the rows and columns of the weight numpy ndarray object.

    Function converts the numpy ndarray object to a Pandas dataframe with
    labeled rows and columns.

    Parameters
    ----------
    w
        numpy ndarray containing the inferred TF-gene weights
    genes
        row names for w
    tfs
        column names for w

    Returns
    -------
    df
        Pandas dataframe containing TF-gene inferred weights W

    """
    df = pd.DataFrame(w)
    df.columns = tfs
    df.index = genes
    return df


def get_regulator_scores(perturb_genes, W):
    """To provide regulator scores given weights and perturbed genes.

    Function calculates the regulator score given a list of perturbed genes
    and the inferred W.

    Parameters
    ----------
    perturb_genes
        list of perturbed genes
    W
        Pandas dataframe containing TF-gene inferred weights W

    Returns
    -------
    sort_scores_df
        Pandas dataframe containing the regulator scores

    """
    # perturbed genes
    W_perturb = W.iloc[W.index.isin(perturb_genes), ]
    # not perturbed genes
    W_notperturb = W.iloc[~W.index.isin(perturb_genes), ]
    reg_score_outarr = []
    for reg in W.columns:
        num_score = np.sum(W_perturb[reg])
        not_reg_df = W_perturb.iloc[:, ~W_perturb.columns.isin([reg])]
        dem_score1 = not_reg_df.sum().sum()
        dem_score2 = np.sum(W_notperturb[reg])
        if dem_score1 == 0:
            dem_score1 = 1e-6
        if num_score == 0:
            score = 0
        else:
            score = (num_score/np.sqrt(dem_score1*dem_score2))
        reg_score_outarr.append(score)

    score_df = pd.DataFrame({'id': W.columns, 'score': reg_score_outarr})
    sort_scores_df = score_df.sort_values(by='score', ascending=False)
    sort_scores_df.reset_index(inplace=True, drop=True)
    return sort_scores_df


def get_diff_regulatory_activity(perturbed_genes, w1, w2, top_regs=10):
    """To provide differential regulator scores between conditions.

    Function calculates the differential regulator score given a list of
    perturbed genes and the inferred W from the two conditions.

    Parameters
    ----------
    perturbed_genes
        list of perturbed genes
    w1
        Pandas dataframe containing TF-gene inferred weights W from
        condition 1
    w2
        Pandas dataframe containing TF-gene inferred weights W from
        condition 2
    top_regs
        Number of top differential regulators to select

    Returns
    -------
    reg_score
        Pandas dataframe containing the differential regulator scores
    diff_reg
        Pandas dataframe containing the top differential regulators

    """
    reg_score_w1 = get_regulator_scores(perturbed_genes, np.abs(w1))
    reg_score_w2 = get_regulator_scores(perturbed_genes, np.abs(w2))
    merge_reg_score = pd.merge(reg_score_w1, reg_score_w2, on='id')
    merge_reg_score.columns = ['id', 'w1', 'w2']
    merge_reg_score['w2-1'] = merge_reg_score['w2']-merge_reg_score['w1']
    diff_sorted_reg_score = merge_reg_score.sort_values(
        by='w2-1', ascending=False)
    diff_sorted_reg_score.reset_index(drop=True, inplace=True)
    reg_score = diff_sorted_reg_score[['id', 'w2-1']]
    diff_reg = pd.concat([reg_score.head(top_regs), reg_score.tail(top_regs)])
    diff_reg.reset_index(drop=True, inplace=True)
    return (reg_score, diff_reg)


def get_significant_scores(df, nullvalues, two_sided=False):
    """To calculate score significance from the empirical null CDF.

    Function generates the empirical null CDF from the null values and
    calculates the significance of the scores from the empirical CDF.
    P-values are corrected for multiple hypothesis testing with the
    Benjamini-Hochberg procedure.

    Parameters
    ----------
    df
        Pandas dataframe containing the perturb or regulator scores
    nullvalues
        list of null values
    two_sided
        Boolean. Whether to use a two-sided or one-sided test.
        Default = False.

    Returns
    -------
    df
        Pandas dataframe containing the p-values

    """
    emcdf = sm.distributions.ECDF(nullvalues)
    median = np.median(nullvalues)
    pvals = []
    for score in df['score']:
        if two_sided:
            if score > median:
                pval = 1-emcdf(score)
            else:
                pval = emcdf(score)
        else:
            pval = 1-emcdf(score)
        pvals.append(pval)
    df['pvals'] = pvals
    df['fdr_bh'] = statsmodels.stats.multitest.multipletests(
        pvals, alpha=0.25, method='fdr_bh', returnsorted=False)[1]
    return df


def get_null_scores(dir):
    """To get null perturb scores.

    The function generates the null perturb score list.

    Parameters
    ----------
    dir
        Directory containing the perturbation scores from shuffled labels

    Returns
    -------
    null_scores_list
        list of null perturb scores

    """
    files = glob.glob('{}/pert*'.format(dir))
    null_scores = pd.DataFrame()
    for file in files:
        df = pd.read_csv(file, sep='\t', index_col=0)
        if null_scores.shape[0] == 0:
            null_scores = df
        else:
            null_scores = pd.merge(null_scores, df, on='gene')
    null_scores_list = null_scores.set_index('gene').values.flatten()
    return null_scores_list
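
# Example workflow (hypothetical paths) -- attaching empirical significance
# to true-label perturb scores, using null runs produced by `run_epee.py`
# with the `-null` flag:
#
#     null_vals = get_null_scores('output/null')
#     perturb_df = pd.read_csv('output/scores/perturb_scores.txt',
#                              sep='\t', index_col=0)
#     perturb_df = get_significant_scores(perturb_df, null_vals)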


def get_null_regscores(dir, perturbed_genes):
    """To get null regulator scores.

    The function generates null regulator scores from a directory containing
    the inferred W's from the shuffled labels.

    Parameters
    ----------
    dir
        Directory containing the shuffled-label inferred W's.
    perturbed_genes
        list of perturbed genes used from the comparisons with true labels

    Returns
    -------
    null_score_list
        list of null regulator scores

    """
    files = glob.glob('{}/WS1*'.format(dir))
    null_regscore_df = pd.DataFrame()
    for file in files:
        seed = file.split('_')[-1].split('.')[0]
        ws1 = pd.read_csv(file, sep='\t', index_col=0)
        ws2 = pd.read_csv(file.replace('WS1', 'WS2'), sep='\t', index_col=0)
        null_regscore, _ = get_diff_regulatory_activity(
            perturbed_genes,
            ws1, ws2, top_regs=20)
        null_regscore.columns = ['gene', seed]
        if null_regscore_df.shape[0] == 0:
            null_regscore_df = null_regscore
        else:
            null_regscore_df = pd.merge(null_regscore_df, null_regscore,
                                        on='gene')
    null_score_list = null_regscore_df.set_index('gene').values.flatten()
    return null_score_list


def get_median_rank_position(actual, predicted, n, model):
    """To get the median relative rank position.

    The method is used when a ground-truth reference is known and, given a
    query list, we want to know the median rank position of the reference
    objects.

    Parameters
    ----------
    actual
        array of genes that are the ground truth
    predicted
        array of genes to query
    n
        length of the query list
    model
        model used to get the query list

    Returns
    -------
    out
        median rank

    """
    predicted = pd.DataFrame(predicted).reset_index().set_index(0)
    predicted['index'] = (predicted['index']/n)*100
    rank = []
    for r in predicted.ix[actual, :].iterrows():
        val = r[1][0]
        rank.append(val)
    out = (np.median(rank), model)
    return out
--------------------------------------------------------------------------------
/script/run_epee.py:
--------------------------------------------------------------------------------
# Copyright 2018, Viren Amin, Murat Can Cobanoglu
# Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
# 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
# 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

# filter future warnings
import warnings
warnings.simplefilter("ignore", category=FutureWarning)

from epee import *
import numpy as np
import pandas as pd
import argparse
import logging
import time
import os
import itertools
import multiprocessing
from time import localtime, strftime

# set tensorflow verbosity
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'


parser = argparse.ArgumentParser()
parser.add_argument("-a", "--conditiona", help="RNA-seq data for Condition A",
                    type=str, required=True)
parser.add_argument("-b", "--conditionb", help="RNA-seq data for Condition B",
                    type=str, required=True)
parser.add_argument("-na", "--networka", help="Network for condition A",
                    type=str, required=True)
parser.add_argument("-nb", "--networkb", help="Network for condition B",
                    type=str, required=True)
# DEFAULTS
parser.add_argument("-o", "--output", help="output directory", type=str,
                    default='')
parser.add_argument("-reg1", "--lregularization", help="lasso regularization \
                    parameter", type=float, default=0.01)
parser.add_argument("-reg2", "--gregularization", help="graph constrained \
                    regularization parameter", type=float, default=0.01)
parser.add_argument("-s", "--step", help="optimizer learning-rate",
                    type=float, default=0.0001)
parser.add_argument("-c", "--conditioning", help="Weight for the interactions \
                    not known", type=bool, default=True)
parser.add_argument("-r", "--runs", help="Number of independent runs",
                    type=int, default=20)
parser.add_argument("-i", "--iterations", help="Number of iterations",
                    type=int, default=100000)
parser.add_argument("-ag", "--aggregation", help="""
                    Method for aggregating runs. Default: "sum"
                    Valid options: {"mean", "median", "sum"} """,
                    type=str, default='sum')
parser.add_argument("-n", "--normalize", help="""
                    Weight normalization strategy. Default: "minmax"
                    Valid options: {"minmax", "log", "log10", "no"} """,
                    type=str, default='minmax')
parser.add_argument("-m", "--model", help="""
                    Model regularization choice. Default: "epee-gcl"
                    Valid options: {"epee-gcl", "epee-l", "no-penalty"} """,
                    type=str, default='epee-gcl')
parser.add_argument("-v", "--verbose",
                    help="logging info levels 10, 20, or 30",
                    type=int, default=10)
# OPTIONAL SETTINGS
parser.add_argument("-eval", "--evaluate",
                    help="Evaluation mode available for Th1, Th2, Th17, \
                    Bmem, COAD, and AML",
                    type=str, default=None)
parser.add_argument("-pr", "--prefix",
                    help="Add prefix to the log",
                    type=str, default=strftime('%Y%m%d'))
# OPTIONAL FLAGS
parser.add_argument("-w", "--store_weights",
                    help="Store all the inferred weights",
                    action='store_true')
parser.add_argument("-mp", "--multiprocess",
                    help="multiprocess the calculation of perturb and \
                    regulator scores", action='store_true')
# NULL FLAG
parser.add_argument("-null", "--null",
                    help="Generate null scores by label permutation",
                    action='store_true')
# NULL SETTINGS
parser.add_argument("-d", "--seed", help="Starting seed number",
                    type=int, default=0)
parser.add_argument("-p", "--perturb",
                    help="True label perturb scores. Required when running \
                    permutations for the null model",
                    type=str, default=None)
parser.add_argument("-sg", "--shuffle_genes",
                    help="Generate null scores by gene permutation",
                    action='store_true')


def get_scores(sel):
    """To get perturb and regulator scores."""
    y1, w1, w1_df, y2, w2, w2_df, count = sel

    # Calculate perturb scores
    genescore_runi = get_perturb_scores(Y1, y1, X1, w1,
                                        Y2, y2, X2, w2, S1, S2)
    genescore_runi.columns = ['gene', 'set{}'.format(count)]

    if args.null:
        regscore_runi, diff_regs = get_diff_regulatory_activity(
            actual_perturb['gene'][:1000],
            w1_df, w2_df, top_regs=20)
    else:
        regscore_runi, diff_regs = get_diff_regulatory_activity(
            genescore_runi['gene'][:1000],
            w1_df, w2_df, top_regs=20)

    regscore_runi.columns = ['gene', 'set{}'.format(count)]

    return (genescore_runi, regscore_runi)


def run_epee():
    """To run EPEE with the specified inputs."""
    logging.info('SAMPLES: Y1: {} | Y2: {}'.format(Y1.shape[1], Y2.shape[1]))
    logging.info('Tensorflow: {}'.format(tf.__version__))
    logging.info('GENES: {}'.format(Y1.shape[0]))
    logging.info('TFs: {}'.format(S1.shape[1]))
    logging.info('MODEL LEARNING STARTED')
    genescore_df = pd.DataFrame()
    regscore_df = pd.DataFrame()
    loss_runs = []
    y1_s = []
    y2_s = []
    w1_s = []
    w2_s = []
    w1S1_s = []
    w2S2_s = []

    for rid in range(args.runs):
        start = time.time()
        logging.debug('Tensorflow: {}'.format(tf.__version__))
        logging.debug('MODEL: {} learning Y1'.format(rid))
        y1, w1, loss_arr1 = run_model(np.array(Y1), np.array(X1),
                                      np.array(S1),
                                      l_reg=args.lregularization,
                                      g_reg=args.gregularization,
                                      step=args.step,
                                      itr=args.iterations,
                                      log_itr=round(args.iterations/20),
                                      seed=rid+args.seed,
                                      model=args.model,
                                      val=condition_val)
        logging.debug('MODEL: {} learning Y2'.format(rid))
        y2, w2, loss_arr2 = run_model(np.array(Y2), np.array(X2),
                                      np.array(S2),
                                      l_reg=args.lregularization,
                                      g_reg=args.gregularization,
                                      step=args.step,
                                      itr=args.iterations,
                                      log_itr=round(args.iterations/20),
                                      seed=rid+args.seed,
                                      model=args.model,
                                      val=condition_val)

        loss_runs.append((rid, loss_arr1[-1], loss_arr2[-1]))

        # Calculate w1S1 and w2S2
        w1_s1 = np.multiply(w1, S1)
        w2_s2 = np.multiply(w2, S2)

        w1_df = get_weights_df(w1_s1, Y1.index, X1.index)
        w2_df = get_weights_df(w2_s2, Y2.index, X2.index)

        w1o_df = get_weights_df(w1, Y1.index, X1.index)
        w2o_df = get_weights_df(w2, Y2.index, X2.index)

        # Store dataframes
        y1_s.append(y1)
        y2_s.append(y2)
        w1_s.append(w1)
        w2_s.append(w2)
        w1S1_s.append(w1_df)
        w2S2_s.append(w2_df)

        # Output inferred weights if args.store_weights is True and
        # args.null is False
        if args.store_weights and not args.null:
            w1o_df.to_csv('{}/model/w1_{}.txt'.format(outdir, rid),
                          sep='\t')
            w2o_df.to_csv('{}/model/w2_{}.txt'.format(outdir, rid),
                          sep='\t')
            if rid == 0:
                S1.to_csv('{}/model/S1_input.txt'.format(outdir),
                          sep='\t')
                S2.to_csv('{}/model/S2_input.txt'.format(outdir),
                          sep='\t')
                X1.to_csv('{}/model/X1_input.txt'.format(outdir),
                          sep='\t')
                X2.to_csv('{}/model/X2_input.txt'.format(outdir),
                          sep='\t')
                Y1.to_csv('{}/model/Y1_input.txt'.format(outdir),
                          sep='\t')
                Y2.to_csv('{}/model/Y2_input.txt'.format(outdir),
                          sep='\t')

        end = time.time()

        logging.info('MODEL: {} RUNTIME: {} mins'.format(
            rid, round((end-start)/60, 2)))

    # For each pair of inferred weights, calculate perturb and regulator
    # scores
    # logging.info('CALCULATE PERTURB AND REGULATOR SCORES')
    logging.info('SCORES: pairwise comparison of all Y1 and Y2 models')

    list_runs = list(range(args.runs))
    pairs = list(itertools.product(list_runs, list_runs))
    score_inputs = []
    for count, p in enumerate(pairs):
        m1, m2 = p
        score_inputs.append((y1_s[m1], w1_s[m1], w1S1_s[m1],
                             y2_s[m2], w2_s[m2], w2S2_s[m2],
                             count))

    if args.multiprocess:
        cpu_count = multiprocessing.cpu_count()
        p = multiprocessing.Pool(int(cpu_count/2))
        out = p.map(get_scores, score_inputs)
    else:
        out = []
        for i in score_inputs:
            i_out = get_scores(i)
            out.append(i_out)

    for count, scores in enumerate(out):
        genescore_runi, regscore_runi = scores
        if count == 0:
            genescore_df = genescore_runi.copy()
            regscore_df = regscore_runi.copy()
        else:
            # if np.all(genescore_runi.index == genescore_df.index):
            #     genescore_df[genescore_runi.columns[1]] = genescore_runi.iloc[:, 1]
            # else:
            genescore_df = pd.merge(genescore_df, genescore_runi, on='gene')
            # if np.all(regscore_runi.index == regscore_df.index):
            #     regscore_df[regscore_runi.columns[1]] = regscore_runi.iloc[:, 1]
            # else:
            regscore_df = pd.merge(regscore_df, regscore_runi, on='gene')

    sum_genescore_df = get_summary_scoresdf(genescore_df, args.aggregation)
    sum_regscore_df = get_summary_scoresdf(regscore_df, args.aggregation)

    if args.null:
        sum_regscore_df.to_csv('{}/null/regulator_scores_{}.txt'.format(
                                   outdir, args.seed),
                               sep='\t')
        sum_genescore_df.to_csv('{}/null/perturb_scores_{}.txt'.format(
                                    outdir, args.seed),
                                sep='\t')
        regscore_df.to_csv('{}/null/all_regulator_scores_{}.txt'.format(
                               outdir, args.seed),
                           sep='\t')
        genescore_df.to_csv('{}/null/all_perturb_scores_{}.txt'.format(
                                outdir, args.seed),
                            sep='\t')
    else:
        sum_regscore_df.to_csv('{}/scores/regulator_scores.txt'.format(
            outdir), sep='\t')
        sum_genescore_df.to_csv('{}/scores/perturb_scores.txt'.format(
            outdir), sep='\t')
        regscore_df.to_csv('{}/scores/all_regulator_scores.txt'.format(
            outdir), sep='\t')
        genescore_df.to_csv('{}/scores/all_perturb_scores.txt'.format(
            outdir), sep='\t')

    loss_df = pd.DataFrame(loss_runs)
    loss1_df = pd.DataFrame(loss_arr1)
    loss2_df = pd.DataFrame(loss_arr2)
    loss_df.to_csv('{}/model/loss_runs.txt'.format(outdir),
                   sep='\t')
    loss1_df.to_csv('{}/model/loss1_arr_y1.txt'.format(outdir),
                    sep='\t')
    loss2_df.to_csv('{}/model/loss2_arr_y2.txt'.format(outdir),
                    sep='\t')

    if args.evaluate:

        known_regs = {'Th2': ['STAT6', 'GATA3'],
                      'Th1': ['TBX21', 'STAT1'],
                      'Th17': ['ARID5A', 'RORA', 'STAT3'],
                      'Bmem': ['STAT5A'],
                      'AML': ['WT1', 'MYB', 'ETV6', 'SOX4', 'CEBPA', 'RUNX1'],
                      'COAD': ['MYC', 'KLF4']
                      }
        regs = X1.index
        actual = known_regs[args.evaluate]
        line = [args.model, args.prefix, args.seed, args.lregularization,
                args.gregularization, len(Y1.columns), len(Y2.columns),
                args.conditiona, args.conditionb]

        for t in actual:
            line.append(list(regscore_df['gene']).index(t))

        logging.info(line)

        median_rank = get_median_rank_position(actual,
                                               list(regscore_df['gene']),
                                               len(regs), args.model)
        logging.info('RANK: {}'.format(median_rank[0]))
        print('RANK: {}'.format(median_rank[0]))

        line.append(median_rank[0])

        performance = '\t'.join(map(str, line))

        outfilename = '{}/{}_eval.txt'.format(args.output, args.evaluate)
        # write header if the file does not exist
        if not os.path.isfile(outfilename):
            with open(outfilename, 'a') as myfile:
                headerlist = ['MODEL', 'PREFIX', 'SEED', 'L1', 'L2',
                              'SamplesY1', 'SamplesY2', 'PathY1', 'PathY2']
                for t in actual:
                    headerlist.append('{}_RANK'.format(t))
                headerlist.append('MEDIAN_RANK')
                header = '\t'.join(headerlist)
                myfile.write('{}\n'.format(header))

        with open(outfilename, 'a') as myfile:
            myfile.write('{}\n'.format(performance))


if __name__ == '__main__':

    run_start = time.time()
    args = parser.parse_args()
    outdir = '{d}{p}_{m}_{l}_{g}_{s}'.format(d=args.output, p=args.prefix,
                                             m=args.model,
                                             l=args.lregularization,
                                             g=args.gregularization,
                                             s=args.seed)
    if args.null:
        if args.perturb is None:
            raise RuntimeError('Please provide the perturb scores generated '
                               'with the actual labels: --perturb <path>')
        os.makedirs(os.path.dirname('{}/null/'.format(outdir)),
                    exist_ok=True)
        logfile = '{}/null_log.txt'.format(outdir)
        actual_perturb = pd.read_csv(args.perturb, index_col=0, sep='\t')
    else:
        os.makedirs(os.path.dirname('{}/model/'.format(outdir)),
                    exist_ok=True)
        os.makedirs(os.path.dirname('{}/scores/'.format(outdir)),
                    exist_ok=True)
        logfile = '{}/log.txt'.format(outdir)

    for handler in logging.root.handlers[:]:
        logging.root.removeHandler(handler)
    logging.basicConfig(filename=logfile,
                        level=args.verbose)

    logging.info('####### {} STARTING ANALYSIS #######'.format(args.prefix))
    logging.debug('Multiprocessing: {}'.format(args.multiprocess))
    logging.info('EPEE: {}'.format(strftime("%a, %d %b %Y %H:%M:%S",
                                            localtime())))
    Y1, Y2, X1, X2, S1, S2, condition_val = get_epee_inputs(
        args.conditiona, args.conditionb,
        args.networka, args.networkb,
        conditioning=args.conditioning,
        weightNormalize=args.normalize,
        null=args.null,
        shuffleGenes=args.shuffle_genes,
        seed=args.seed)
    run_epee()
    run_end = time.time()

    logging.info('Time elapsed: {} mins'.format(
        round((run_end-run_start)/60, 2)))
    logging.info('####### {} ANALYSIS COMPLETED ########'.format(args.prefix))
--------------------------------------------------------------------------------
/test/tests.sh:
--------------------------------------------------------------------------------
# # testing scenario 1 without multiprocessing
# python ../script/run_epee.py --conditiona data/CD4_Naive.txt.gz --conditionb data/CD4_Th2.txt.gz --networka data/cd4+_t_cells.txt.gz --networkb data/cd4+_t_cells.txt.gz -r 2 -i 100 --prefix test_out
#
# # testing scenario 2 with multiprocessing
# python ../script/run_epee.py --conditiona data/CD4_Naive.txt.gz --conditionb data/CD4_Th2.txt.gz --networka data/cd4+_t_cells.txt.gz --networkb data/cd4+_t_cells.txt.gz -r 2 -i 100 --prefix test_out --multiprocess
#
# # testing scenario 3 null without multiprocessing
# python ../script/run_epee.py --conditiona data/CD4_Naive.txt.gz --conditionb data/CD4_Th2.txt.gz --networka data/cd4+_t_cells.txt.gz --networkb data/cd4+_t_cells.txt.gz -r 2 -i 100 --prefix test_out --perturb data/perturb_score.txt.gz -null
#
# # testing scenario 4 null with multiprocessing
# python ../script/run_epee.py --conditiona data/CD4_Naive.txt.gz --conditionb data/CD4_Th2.txt.gz --networka data/cd4+_t_cells.txt.gz --networkb data/cd4+_t_cells.txt.gz -r 2 -i 100 --prefix test_out --perturb data/perturb_score.txt.gz -null --multiprocess

# testing tfwrapper
../script/analysis/tfwrapper_paramsrunner.sh python ../script/run_epee.py --conditiona data/CD4_Naive.txt.gz --conditionb data/CD4_Th2.txt.gz --networka data/cd4+_t_cells.txt.gz --networkb data/cd4+_t_cells.txt.gz -r 2 -i 100 --prefix test_tfwrapper
--------------------------------------------------------------------------------