├── README.md
├── _config.yml
├── eigenMT.py
└── example.tar.gz


/README.md:
--------------------------------------------------------------------------------
# eigenMT
An Efficient Multiple-Testing Adjustment for eQTL Studies that Accounts for Linkage Disequilibrium between Variants

Introduction
------------
eigenMT is a computationally efficient multiple testing correction method for cis-eQTL studies. Typically, modern cis-eQTL studies correct for multiple testing across thousands of variants for a single gene using permutations. This correction method, though accurate, carries a high computational cost, on the order of days to weeks for large numbers of permutations and/or large sample sizes. Our method reduces this computational burden by orders of magnitude while maintaining high accuracy when compared to permutations. To accomplish this, our method estimates the effective number of tests performed for a given gene by directly accounting for the LD relationship among the tested variants.

Download
------------
Begin by downloading or cloning this repository. eigenMT runs as a stand-alone python script. It requires a working installation of python 3 (version 3.3 or higher, because the script calls `print` with `flush=True`). To download and install a python distribution, there are a few convenient options:
- [Anaconda](https://store.continuum.io/cshop/anaconda/)
- [Enthought Canopy](https://www.enthought.com/products/canopy/)

These bundled installations already include a number of modules required for running eigenMT including:
- [numpy](http://www.numpy.org/)
- [scipy](http://www.scipy.org/)
- [pandas](https://pandas.pydata.org/)
- [scikit-learn](http://scikit-learn.org/stable/)

Again, these modules should come pre-packaged in one of the bundled python installations above. If you decide to install these packages yourself, please see the listed websites for detailed instructions.
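
As a quick sanity check before running eigenMT, the sketch below reports which of the modules imported at the top of `eigenMT.py` cannot be found by the current interpreter. The names `REQUIRED_MODULES` and `find_missing_modules` are illustrative only and are not part of eigenMT.

```python
import importlib.util

# Import names used at the top of eigenMT.py (scikit-learn is imported as "sklearn").
# This list is an assumption derived from the script's import statements.
REQUIRED_MODULES = ["numpy", "scipy", "pandas", "sklearn"]


def find_missing_modules(names=REQUIRED_MODULES):
    """Return the names in `names` that cannot be located by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_modules()
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are available.")
```

If anything is reported as missing, install it with your distribution's package manager before continuing.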

Finally, if you wish to run the Matrix eQTL example script, you will need to install the Matrix eQTL package, which can be done by opening an R session and running the following command:
```
install.packages('MatrixEQTL')
```

Input
------------
A typical run will have the following command line format:
```
python eigenMT.py --CHROM [chromosome ID] \
	--QTL [SNP-gene tests file] \
	--GEN [genotype matrix file] \
	--GENPOS [genotype position file] \
	--PHEPOS [phenotype position file] \
	--var_thresh [variance explained threshold, default 0.99] \
	--OUT [output filename] \
	--window [window size, default 200]
```

eigenMT was designed to fit within a pipeline utilizing [Matrix eQTL](https://cran.r-project.org/web/packages/MatrixEQTL/index.html) ([Shabalin, Bioinformatics 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3348564/)) for eQTL calling. The input fields are defined as follows.

Argument | Description
--------------------------- |-------------
CHROM | Chromosome ID. Indicates which chromosome the analysis will be performed on. Must match the ID used in the Matrix eQTL position files.
QTL | Filename for Matrix eQTL SNP-gene tests file. See `example/cis.eqtls.txt` for an example.
GEN | Genotype matrix in Matrix eQTL format. Can accept either hard-called or dosage-based genotypes. Missing genotypes must be encoded as NA and will be imputed to the mean genotype during correction. See `example/genotypes.txt` for an example.
GENPOS | Matrix eQTL genotype position file. See `example/gen.positions.txt` for an example.
PHEPOS | Matrix eQTL phenotype (gene) position file. See `example/phe.positions.txt` for an example.
var_thresh | Threshold for amount of variance explained in the genotype correlation matrix. Default is 99% variance explained. Increasing this threshold will increase estimates of effective number of tests (M_eff) and decrease accuracy of the approximation.
OUT | Output filename. Output format is described below.
window | Window size parameter. Determines the size of the disjoint windows into which the genotype matrix for each gene is split. Default is 200 SNPs. We recommend a window size between 50 and 200 SNPs to balance accuracy and speed.
cis_dist | Threshold for bp distance from the gene TSS to perform multiple testing correction. Default is 1e6.
external | Flag indicating that the provided genotype matrix is different from the one originally used to test for cis-eQTLs. Off by default.

Although we have designed our software to utilize output directly from Matrix eQTL, eigenMT can handle a more general QTL input file. This more general input format need only have three columns: 1. the SNP ID, 2. the gene ID, and 3. the cis-eQTL p-value. The SNP ID and gene ID must be the first and second columns, respectively. The file must contain a header line in which the p-value column is named `p-value` (`p.value` and `pvalue` are also recognized).

Output
------------
The output file is in tab-separated format with the following columns.

Column | Description
--------------- | ------------
1 | variant ID
2 | Gene ID
3 | estimate of effect size BETA from Matrix eQTL
4 | T-statistic from Matrix eQTL
5 | p-value from Matrix eQTL
6 | FDR estimate from Matrix eQTL
7 | eigenMT corrected p-value (header `BF`)
8 | estimated number of independent tests for the gene (header `TESTS`)

Note: The first 6 columns of the output correspond to the typical Matrix eQTL output. Also, each tested gene will appear once in the output file with its most significant SNP and the eigenMT corrected p-value.


Example
------------
We offer a small example dataset for testing. We provide genotype and normalized gene expression matrices with corresponding position files in Matrix eQTL format. This genotype matrix is for chromosome 19 for the EUR373 samples as part of the [GEUVADIS](http://www.nature.com/nature/journal/v501/n7468/full/nature12531.html?WT.ec_id=NATURE-20130926) study. The normalized expression matrix covers the first 100 genes on chromosome 19 with quantification in the GEUVADIS dataset.
The expression matrix has been quantile normalized to a standard normal distribution following variance stabilization by DESeq2 and removal of 30 PEER factors. We also provide a sample of the cis-eQTL tests performed for these samples using Matrix eQTL. This sample variant-gene tests file represents 100 genes. To extract the example dataset, run
```
tar -zxvf example.tar.gz
```

We provide an example R script for running Matrix eQTL. To learn more about using Matrix eQTL, please refer to the [package documentation on CRAN](https://cran.r-project.org/web/packages/MatrixEQTL/index.html) or the [original paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3348564/). To generate the file of cis-eQTL tests, use the following command:
```
Rscript example/MatrixEQTL.R --SNP example/genotypes.txt \
	--GE example/phenotypes.txt \
	--snploc example/gen.positions.txt \
	--geneloc example/phe.positions.txt \
	--OutputFilePath example/cis.eqtls.txt \
	--cisDist 1e6 \
	--ANOVA FALSE \
	--pValue 1 > example/matrix.eqtl.log
```

To run eigenMT on the example data, use the following command:
```
python eigenMT.py --CHROM 19 \
	--QTL example/cis.eqtls.txt \
	--GEN example/genotypes.txt \
	--GENPOS example/gen.positions.txt \
	--PHEPOS example/phe.positions.txt \
	--OUT example/exampleOut.txt
```
Note: this example uses the default settings for window size (200 SNPs) and variance explained threshold (0.99).

To visualize the results of the correction, use the following command:
```
Rscript example/compareToEmpirical.R
```

This R script will generate a PDF with three figures:

1. A plot of the eigenMT corrected P-values (y-axis) against the empirical P-values from 10000 permutations (x-axis).
2. The same plot as in (1) but with a -log10 transformation of the x- and y- axes.
3. The same plot as in (2) zoomed in to exclude empirical P-values = 1. In this final plot, the points should be strongly correlated and should fall slightly below the diagonal (for the most part).

External genotype data
------------
Our method offers the ability to perform multiple testing correction using a genotype matrix from a separate sample population than the one initially used for cis-eQTL testing. It is important to note that this external genotype matrix should come from the same background population as the one under study. For example, if cis-eQTL testing is performed in samples of European ancestry, then any external genotype matrix used should come from a similar European population. We have shown that using genotype data from studies with larger sample sizes can improve the accuracy of our method compared to using the genotype data for the study. Genotype data from larger studies will provide better estimates of the LD structure for variants around each gene, improving our estimates of the effective number of tests.

Population stratification and other covariates
------------
We recommend first removing the effects of population stratification (genotype PCs, population or ancestry assignments) and other covariates (age, gender, PEER factors, etc) from the expression matrix. We have shown that by first removing these effects and then performing cis-eQTL calling on the inverse rank normalized residuals, we maintain conservativeness and accuracy.


Citation
------------
Davis JR, Fresard L, Knowles DA, Pala M, Bustamante CD, Battle A, Montgomery SB (2015). An Efficient Multiple-Testing Adjustment for eQTL Studies that Accounts for Linkage Disequilibrium between Variants. The American Journal of Human Genetics, 98(1), 216–224.
doi: http://doi.org/10.1016/j.ajhg.2015.11.021

Contact
------------
- Joe Davis: joed3@stanford.edu
- Laure Fresard: lfresard@stanford.edu
- Stephen Montgomery: smontgom@stanford.edu

--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
theme: jekyll-theme-cayman
--------------------------------------------------------------------------------
/eigenMT.py:
--------------------------------------------------------------------------------
from __future__ import print_function
import os
import sys
import fileinput
import argparse
import numpy as np
import pandas as pd
import scipy.linalg as splin
import gc
import gzip
from sklearn import covariance

##############FUNCTIONS

def open_file(filename):
    """Function to open a (potentially gzipped) file."""
    with open(filename, 'rb') as file_connection:
        file_header = file_connection.readline()
    if file_header.startswith(b"\x1f\x8b\x08"):
        opener = gzip.open(filename, 'rt')
    else:
        opener = open(filename)
    return opener

def load_tensorqtl_output(tensorqtl_parquet, group_size_s=None):
    """Read tensorQTL output"""
    df = pd.read_parquet(tensorqtl_parquet)
    if 'gene_id' not in df:
        if ':' in df['phenotype_id'].iloc[0]:
            df['gene_id'] = df['phenotype_id'].apply(lambda x: x.rsplit(':',1)[1] if ':' in x else x)
        else:
            df.rename(columns={'phenotype_id':'gene_id'}, inplace=True)
    # eigenMT requires a 'p-value' column (see make_test_dict); first column must be variant, second gene/phenotype
    df = df[['variant_id', 'gene_id']+[i for i in df.columns if i not in ['variant_id', 'gene_id']]]
    # select p-value column
    if 'pval_nominal' in df.columns:
        df['p-value'] = df['pval_nominal'].copy()
    elif 'pval_gi' in df.columns: # interaction model
        df['p-value'] = df['pval_gi'].copy()
    if group_size_s is not None:
        print(' * adjusting p-values by phenotype group size')
        df['p-value'] = np.minimum(df['p-value']*df['gene_id'].map(group_size_s), 1.0)
    return df

def make_genpos_dict(POS_fh, CHROM):
    '''
    Function to read in SNPs and their positions and make a dict.
    Keys are SNP IDs; values are positions.
    Only stores the information of the SNPs from the given chromosome.
    '''
    pos_dict = {}
    with open_file(POS_fh) as POS:
        POS.readline() # skip header
        for line in POS:
            line = line.rstrip().split()
            if line[1] == CHROM:
                pos_dict[line[0]] = float(line[2])
    return pos_dict

def make_phepos_dict(POS_fh, CHROM):
    '''
    Function to read in phenotypes (probes, genes, peaks) with their start and end positions and make a dict.
    Keys are phenotype IDs; values are start and end positions.
    Only stores the information of the phenotypes from the given chromosome.
    '''
    pos_dict = {}
    with open_file(POS_fh) as POS:
        POS.readline() # skip header
        for line in POS:
            line = line.rstrip().split()
            if line[1] == CHROM:
                pos_array = np.array(line[2:4])
                pos_dict[line[0]] = np.float64(pos_array)
    return pos_dict

def make_gen_dict(GEN_fh, pos_dict, sample_ids=None):
    '''
    Function to read in genotype matrix from MatrixEQTL and make dict.
    Keys are SNP positions; values are genotypes.
    '''
    gen_dict = {}
    with open_file(GEN_fh) as GEN:
        header = GEN.readline().rstrip().split()
        if sample_ids is not None:
            ix = [header[1:].index(i) for i in sample_ids]
        for line in GEN: #Go through each line of the genotype matrix and add line to gen_dict
            line = line.rstrip().split()
            snp = pos_dict[line[0]]
            genos = np.array(line[1:])
            if sample_ids is not None:
                genos = genos[ix]
            genos[genos == 'NA'] = -1 # no effect if already -1
            genos = np.float64(genos)
            genos[genos == -1] = np.mean(genos[genos != -1])
            gen_dict[snp] = genos
    return gen_dict # pos->genotypes

def make_test_dict(QTL_fh, gen_dict, genpos_dict, phepos_dict, cis_dist):
    '''
    Function to make dict of SNP-gene tests. Also returns the header of the file.
    Keys are gene IDs. Values are dicts giving the list of tested SNPs, the best SNP, and its p-value.
    SNPs are identified by their positions.
    Assumes the first column of the input file is the SNP ID and the second column is the GENE.
    The column with the p-value is determined by the function, which allows for a more flexible input format.
    '''
    QTL = open_file(QTL_fh)
    header = QTL.readline().rstrip().split()
    # find the column with the p-value
    if 'p-value' in header:
        pvalIndex = header.index('p-value')
    elif 'p.value' in header:
        pvalIndex = header.index('p.value')
    elif 'pvalue' in header:
        pvalIndex = header.index('pvalue')
    else:
        sys.exit('Cannot find the p-value column in the tests file.')

    test_dict = {}

    for line in QTL:
        line = line.rstrip().split()
        if line[0] in genpos_dict:
            snp = genpos_dict[line[0]]
            gene = line[1]
            if snp in gen_dict and gene in phepos_dict:
                phepos = phepos_dict[gene]
                distance = min(abs(phepos - snp))
                if distance <= cis_dist:
                    pval = line[pvalIndex]
                    pval = float(pval)
                    if gene not in test_dict:
                        test_dict[gene] = {'snps' : [snp], 'best_snp' : snp, 'pval' : pval, 'line' : '\t'.join(line)}
                    else:
                        if pval < test_dict[gene]['pval']:
                            test_dict[gene]['best_snp'] = snp
                            test_dict[gene]['pval'] = pval
                            test_dict[gene]['line'] = '\t'.join(line)
                        test_dict[gene]['snps'].append(snp)

    QTL.close()
    return test_dict, "\t".join(header)

def make_test_dict_tensorqtl(QTL_fh, gen_dict, genpos_dict, cis_dist, group_size_s=None):
    """
    Same arguments and output as make_test_dict, for output from tensorQTL.
    QTL_fh: parquet file with variant-gene pair associations
    """
    qtl_df = load_tensorqtl_output(QTL_fh, group_size_s=group_size_s)
    qtl_df = qtl_df[qtl_df['tss_distance'].abs()<=cis_dist]

    gdf = qtl_df.groupby('gene_id')
    test_dict = {}
    for gene_id,g in gdf:
        g0 = g.loc[g['p-value'].idxmin()]
        test_dict[gene_id] = {
            'snps': [genpos_dict[i] for i in g['variant_id']], # variant positions
            'best_snp':genpos_dict[g0['variant_id']],
            'pval':g0['p-value'],
            'line':'\t'.join([i if isinstance(i, str) else '{:.6g}'.format(i) for i in g0.values])
        }
    return test_dict, '\t'.join(qtl_df.columns)

def make_test_dict_external(QTL_fh, gen_dict, genpos_dict, phepos_dict, cis_dist):
    '''
    Same arguments and output as make_test_dict.
    Main difference from make_test_dict is that it assumes the genotype matrix and position file
    are separate from those used in the Matrix-eQTL run. This function is to be used with the external option
    to allow calculation of the effective number of tests using a different, preferably larger, genotype sample.
    '''
    QTL = open_file(QTL_fh)
    header = QTL.readline().rstrip().split()
    # find the column with the p-value
    if 'p-value' in header:
        pvalIndex = header.index('p-value')
    elif 'p.value' in header:
        pvalIndex = header.index('p.value')
    elif 'pvalue' in header:
        pvalIndex = header.index('pvalue')
    else:
        sys.exit('Cannot find the p-value column in the tests file.')

    test_dict = {}

    for line in QTL:
        line = line.rstrip().split()
        if line[0] in genpos_dict:
            snp = genpos_dict[line[0]]
            gene = line[1]
            if snp in gen_dict and gene in phepos_dict:
                phepos = phepos_dict[gene]
                distance = min(abs(phepos - snp))
                if distance <= cis_dist:
                    pval = line[pvalIndex]
                    pval = float(pval)
                    if gene not in test_dict:
                        test_dict[gene] = {'best_snp' : snp, 'pval' : pval, 'line' : '\t'.join(line)}
                    else:
                        if pval < test_dict[gene]['pval']:
                            test_dict[gene]['best_snp'] = snp
                            test_dict[gene]['pval'] = pval
                            test_dict[gene]['line'] = '\t'.join(line)

    QTL.close()
    # list() is needed on Python 3, where dict.values() returns a view rather than a list
    snps = np.array(list(genpos_dict.values()))
    for gene in test_dict:
        phepos = phepos_dict[gene]
        #Calculate distances to phenotype start and end positions
        is_in_cis_start = abs(snps - phepos[0]) <= cis_dist
        is_in_cis_end = abs(snps - phepos[1]) <= cis_dist
        test_dict[gene]['snps'] = snps[is_in_cis_start | is_in_cis_end]
    return test_dict, "\t".join(header)


def bf_eigen_windows(test_dict, gen_dict, phepos_dict, OUT_fh, input_header, var_thresh, window):
    '''
    Function to process dict of SNP-gene tests.
    Calculates the genotype correlation matrix for the SNPs tested for each gene using windows around the gene.
    Will calculate a regularized correlation matrix from the Ledoit-Wolf estimator of the covariance matrix.
    Finds the eigenvalues for this matrix.
    Calculates how many eigenvalues are needed to reach the variance threshold.
    This final value will be the effective Bonferroni correction number.
    Will output to file corrected pvalue for best SNP per gene as it goes along.
    '''
    OUT = open(OUT_fh, 'w')
    OUT.write(input_header + '\tBF\tTESTS\n')

    counter = 1.0
    genes = test_dict.keys()
    numgenes = len(genes)
    TSSs = []
    for gene in genes:
        TSSs.append(phepos_dict[gene][0])
    TSSs, genes = [list(x) for x in zip(*sorted(zip(TSSs, genes), key=lambda p: p[0]))]
    for gene in genes:
        perc = (100 * counter / numgenes)
        if (counter % 100) == 0:
            print(str(counter) + ' out of ' + str(numgenes) + ' completed ' + '(' + str(round(perc, 3)) + '%)', flush=True)
        counter += 1
        snps = np.sort(test_dict[gene]['snps'])
        start = 0
        stop = window
        M = len(snps)
        m_eff = 0
        window_counter = 0
        while start < M:
            if stop - start == 1:
                m_eff += 1
                break ##can't compute eigenvalues for a scalar, so add 1 to m_eff and break from the while loop
            snps_window = snps[start:stop]
            genotypes = []
            for snp in snps_window:
                if snp in gen_dict:
                    genotypes.append(gen_dict[snp])
            # plain 2D array (SNPs x samples); recent scikit-learn releases reject np.matrix input
            genotypes = np.array(genotypes)
            m, n = np.shape(genotypes)
            gen_corr, alpha = lw_shrink(genotypes)
            window_counter += 1
            eigenvalues = splin.eigvalsh(gen_corr)
            eigenvalues[eigenvalues < 0] = 0
            m_eff += find_num_eigs(eigenvalues, m, var_thresh)
            start += window
            stop += window
            if stop > M:
                stop = M
        OUT.write(test_dict[gene]['line'] + '\t' + str(min(test_dict[gene]['pval'] * m_eff, 1)) + '\t' + str(m_eff) + '\n')
        OUT.flush()
        gc.collect()
    OUT.close()

def lw_shrink(genotypes):
    '''
    Function to obtain smoothed estimate of the genotype correlation matrix.
    Uses the method proposed by Ledoit and Wolf to estimate the shrinkage parameter alpha.
    Input: genotype matrix.
    Output: smoother correlation matrix and the estimated shrinkage parameter.
    '''
    lw = covariance.LedoitWolf()
    # work with a plain ndarray; np.mat is no longer available in NumPy 2
    genotypes = np.asarray(genotypes)
    m, n = np.shape(genotypes)
    try:
        fitted = lw.fit(genotypes.T)
        alpha = fitted.shrinkage_
        shrunk_cov = fitted.covariance_
        # rescale the shrunk covariance to a correlation matrix: D^(-1/2) * cov * D^(-1/2)
        inv_sd = np.diag(np.diag(shrunk_cov)**(-.5))
        shrunk_cor = inv_sd.dot(shrunk_cov).dot(inv_sd)
    except Exception: #Exception for handling case where SNPs in the window are all in perfect LD
        # fall back to an m x m matrix of ones (every SNP perfectly correlated with every other)
        shrunk_cor = np.ones((m, m))
        alpha = 'NA'
    return shrunk_cor, alpha

def find_num_eigs(eigenvalues, variance, var_thresh):
    '''
    Function to find the number of eigenvalues required to reach a certain threshold of variance explained.
    '''
    eigenvalues = np.sort(eigenvalues)[::-1]
    running_sum = 0
    counter = 0
    # the bound on counter guards against running past the last eigenvalue due to rounding error
    while running_sum < variance * var_thresh and counter < len(eigenvalues):
        running_sum += eigenvalues[counter]
        counter += 1
    return counter


##############MAIN
if __name__=='__main__':
    USAGE = """
    Takes in SNP-gene tests from MatrixEQTL output and performs gene level Bonferroni correction using eigenvalue decomposition of
    the genotype correlation matrix. Picks best SNP per gene.
    """

    parser = argparse.ArgumentParser(description = USAGE)
    parser.add_argument('--QTL', required = True, help = 'Matrix-EQTL output file for one chromosome')
    parser.add_argument('--GEN', required = True, help = 'genotype matrix file')
    parser.add_argument('--var_thresh', type=float, default = 0.99, help = 'variance threshold')
    parser.add_argument('--OUT', required = True, help = 'output filename')
    parser.add_argument('--window', type=int, default = 200, help = 'SNP window size')
    parser.add_argument('--GENPOS', required = True, help = 'map of genotype to chr and position (as required by Matrix-eQTL)')
    parser.add_argument('--PHEPOS', required = True, help = 'map of measured phenotypes to chr and position (eg. gene expression to CHROM and TSS; as required by Matrix-eQTL)')
    parser.add_argument('--CHROM', required = True, help = 'Chromosome that is being processed (must match format of chr in POS)')
    parser.add_argument('--cis_dist', type=float, default = 1e6, help = 'threshold for bp distance from the gene TSS to perform multiple testing correction (default = 1e6)')
    parser.add_argument('--external', action = 'store_true', help = 'indicates whether the provided genotype matrix is different from the one used to call cis-eQTLs initially (default = False)')
    parser.add_argument('--sample_list', default=None, help='File with sample IDs (one per line) to select from genotypes')
    parser.add_argument('--phenotype_groups', default=None, help='File with phenotype_id->group_id mapping')
    args = parser.parse_args()

    ##Make SNP position dict
    print('Processing genotype position file.', flush=True)
    genpos_dict = make_genpos_dict(args.GENPOS, args.CHROM)

    ##Make phenotype position dict
    print('Processing phenotype position file.', flush=True)
    phepos_dict = make_phepos_dict(args.PHEPOS, args.CHROM)

    ##Make genotype dict
    print('Processing genotype matrix.', flush=True)
    if args.sample_list is not None:
        with open(args.sample_list) as f:
            sample_ids = f.read().strip().split('\n')
        print(' * using subset of '+str(len(sample_ids))+' samples.')
    else:
        sample_ids = None
    gen_dict = make_gen_dict(args.GEN, genpos_dict, sample_ids)

    ##Make SNP-gene test dict
    if not args.external:
        if args.QTL.endswith('.parquet'):
            print('Processing tensorQTL tests file.', flush=True)
            if args.phenotype_groups is not None:
                # the squeeze=True argument of read_csv was removed in pandas 2.0; squeeze the single column afterwards
                group_s = pd.read_csv(args.phenotype_groups, sep='\t', index_col=0, header=None).squeeze('columns')
                group_size_s = group_s.value_counts()
            else:
                group_size_s = None
            test_dict, input_header = make_test_dict_tensorqtl(args.QTL, gen_dict, genpos_dict, args.cis_dist, group_size_s=group_size_s)
        else:
            print('Processing Matrix-eQTL tests file.', flush=True)
            test_dict, input_header = make_test_dict(args.QTL, gen_dict, genpos_dict, phepos_dict, args.cis_dist)
    else:
        print('Processing Matrix-eQTL tests file. External genotype matrix and position file assumed.', flush=True)
        test_dict, input_header = make_test_dict_external(args.QTL, gen_dict, genpos_dict, phepos_dict, args.cis_dist)

    ##Perform BF correction using eigenvalue decomposition of the correlation matrix
    print('Performing eigenMT correction.', flush=True)
    bf_eigen_windows(test_dict, gen_dict, phepos_dict, args.OUT, input_header, args.var_thresh, args.window)

--------------------------------------------------------------------------------
/example.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/joed3/eigenMT/f91e2837d34403cff48c9a248a7d4fb3fac98541/example.tar.gz
--------------------------------------------------------------------------------