├── .gitignore ├── README.md ├── pyproject.toml ├── setup.cfg └── src ├── .DS_Store └── sldp ├── .DS_Store ├── __init__.py ├── chunkstats.py ├── config.py ├── preprocessannot ├── preprocessannot.py ├── preprocesspheno ├── preprocesspheno.py ├── preprocessrefpanel ├── sldp ├── storyteller.py └── weights.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *.cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | 55 | # Sphinx documentation 56 | docs/_build/ 57 | 58 | # PyBuilder 59 | target/ 60 | 61 | # IPython Notebook 62 | .ipynb_checkpoints 63 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SLDP (Signed LD Profile) regression 2 | 3 | SLDP regression is a method for detecting a directional effect of a signed functional annotation on a heritable trait using GWAS summary statistics. This repository contains code for the SLDP regression method as well as the tools required to preprocess data for use with SLDP regression. 4 | 5 | ## Installation 6 | 7 | First, make sure you have a python distribution installed that includes scientific computing packages like numpy/scipy/pandas as well as the package manager pip; we recommend [Anaconda](https://store.continuum.io/cshop/anaconda/). 8 | 9 | To install `sldp`, type the following command. 10 | ``` 11 | pip install sldp 12 | ``` 13 | This should install both sldp and any required packages, such as [gprim](https://github.com/yakirr/gprim) and [ypy](https://github.com/yakirr/ypy). 14 | 15 | If you prefer to install `sldp` without pip, just clone this repository, together with [gprim](https://github.com/yakirr/gprim) and [ypy](https://github.com/yakirr/ypy), and add an entry for each into your python path. 16 | 17 | 18 | ## Getting started 19 | 20 | To verify that the installation succeeded, run 21 | ``` 22 | sldp -h 23 | ``` 24 | to print a list of all command-line options. If this command fails, there was a problem with the installation. 25 | 26 | Once this works, take a look at our [wiki](https://github.com/yakirr/sldp/wiki) for a short tutorial on how to use `sldp`. 27 | 28 | 29 | ## Where can I get signed LD profiles? 30 | 31 | You can download signed LD profiles (as well as raw signed functional annotations) for ENCODE ChIP-seq experiments from the [sldp data page](https://data.broadinstitute.org/alkesgroup/SLDP/). These signed LD profiles were created using 1000 Genomes Phase 3 Europeans as the reference panel. 32 | 33 | ## Where can I get reference panel information such as SVDs of LD blocks and LD scores?
34 | 35 | You can download all required reference panel information, computed using 1000 Genomes Phase 3 Europeans, from the [sldp data page](https://data.broadinstitute.org/alkesgroup/SLDP/). 36 | 37 | ## Errata 38 | ### Gene-set enrichment method for SLDP results 39 | In the published paper, we described a gene-set enrichment method for assessing whether a genome-wide signed relationship between an annotation and a trait is stronger in areas of the genome that are near a gene set of interest. The method was described in terms of a vector `s` that summarizes the gene set of interest and a vector `q` that summarizes the SLDP association of interest. Both `s` and `q` have one entry per LD block: the `i`-th entry of `s` contains the number of genes from the gene set that lie in the `i`-th LD block, and the `i`-th entry of `q` contains the estimated covariance across SNPs in the `i`-th LD block between the signed LD profile of the annotation in question and summary statistics of the trait in question. There are two errata related to this analysis, which we describe below. 40 | 41 | #### Description of gene-set enrichment analysis statistic 42 | In the paper, we stated that the statistic we compute is 43 | 44 | ![equation](https://latex.codecogs.com/png.latex?a%20%3A%3D%20%5Cfrac%7B%5Csum_i%20s_iq_i%7D%7B%5Csum_i%20s_i%7D) 45 | 46 | that is, the weighted average of `q` across the LD blocks with non-zero values of `s`. However, the statistic actually used is 47 | 48 | ![equation](https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Csum_i%20s_iq_i%7D%7B%5Csum_i%20s_i%7D%20-%20%5Cfrac%7B%5Csum_i%20%5Cmathbf%7B1%7D%28s_i%3D0%29%20q_i%7D%7B%5Csum_i%20%5Cmathbf%7B1%7D%28s_i%3D0%29%7D) 49 | 50 | that is, we take the difference between the weighted average of `q` across the LD blocks with non-zero values of `s` on the one hand, and the average of `q` across the LD blocks in which `s` is zero on the other hand. 51 | 52 | #### Computation of empirical p-values in gene-set enrichment analysis 53 | Our gene-set enrichment procedure computed p-values by shuffling `s` over LD blocks. However, the code that produced our published results computed a simple average of `q` rather than a weighted average when computing the statistic for the null distribution. Fixing the bug led to qualitatively similar but not identical results. For more detail, download the [corrected version of Supplementary Table 10](https://data.broadinstitute.org/alkesgroup/SLDP/errata/corrected_SuppTable10.xlsx), which lists the published and corrected p- and q-values of the gene-set enrichments highlighted in our publication. A minimal sketch of the corrected procedure appears at the end of this README. 54 | 55 | ## Citation 56 | 57 | If you use `sldp`, please cite 58 | 59 | [Reshef et al. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. 60 | Nature Genetics, 2018.](https://www.nature.com/articles/s41588-018-0196-7)
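
## Appendix: sketch of the corrected gene-set statistic

For concreteness, the snippet below illustrates the corrected statistic and permutation test described in the Errata section above. It is a minimal sketch for illustration only (the function names and the permutation count are ours); it is not the code that produced the published numbers.

```python
import numpy as np

def geneset_stat(s, q):
    # weighted average of q over LD blocks containing gene-set genes,
    # minus the plain average over LD blocks containing none
    return s.dot(q) / s.sum() - q[s == 0].mean()

def empirical_p(s, q, T=10000, seed=0):
    # shuffle s over LD blocks, recomputing the corrected (weighted) statistic
    rng = np.random.default_rng(seed)
    obs = geneset_stat(s, q)
    null = np.array([geneset_stat(rng.permutation(s), q) for _ in range(T)])
    return max(1, (np.abs(null) >= np.abs(obs)).sum()) / T
```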
61 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = [ 3 | "setuptools>=42", 4 | "wheel" 5 | ] 6 | build-backend = "setuptools.build_meta" 7 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | name = sldp 3 | version = 1.1.4 4 | author = Yakir Reshef 5 | author_email = yreshef@broadinstitute.org 6 | description = Signed LD profile regression 7 | long_description = file: README.md 8 | long_description_content_type = text/markdown 9 | url = http://github.com/yakirr/sldp 10 | project_urls = 11 | Bug Tracker = https://github.com/yakirr/sldp/issues 12 | Tutorial = https://github.com/yakirr/sldp/blob/master/README.md 13 | classifiers = 14 | Programming Language :: Python :: 3 15 | License :: OSI Approved :: MIT License 16 | Operating System :: OS Independent 17 | 18 | [options] 19 | package_dir = 20 | = src 21 | packages = find: 22 | python_requires = >=3 23 | install_requires = 24 | numpy 25 | pandas>=2 26 | scipy 27 | matplotlib 28 | gprim 29 | ypy 30 | 31 | [options.packages.find] 32 | where = src 33 | 34 | [options.entry_points] 35 | console_scripts = 36 | sldp = sldp.sldp:main 37 | preprocessannot = sldp.preprocessannot:main 38 | preprocesspheno = sldp.preprocesspheno:main 39 | preprocessrefpanel = sldp.preprocessrefpanel:main -------------------------------------------------------------------------------- /src/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yakirr/sldp/847ec3f29c2313e385ca1f2f269c89fd7310867d/src/.DS_Store -------------------------------------------------------------------------------- /src/sldp/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yakirr/sldp/847ec3f29c2313e385ca1f2f269c89fd7310867d/src/sldp/.DS_Store -------------------------------------------------------------------------------- /src/sldp/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yakirr/sldp/847ec3f29c2313e385ca1f2f269c89fd7310867d/src/sldp/__init__.py -------------------------------------------------------------------------------- /src/sldp/chunkstats.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | import gc 3 | import numpy as np 4 | import pandas as pd 5 | import scipy.stats as st 6 | import ypy.fs as fs 7 | 8 | # collapse LD blocks into approximately independent chunks 9 | def collapse_to_chunks(ldblocks, numerators, denominators, numblocks): 10 | # define endpoints of chunks 11 | ldblocks.M_H.fillna(0, inplace=True) 12 | totalM = ldblocks.M_H.sum() 13 | chunksize = totalM / numblocks 14 | avgldblocksize = totalM / (ldblocks.M_H != 0).sum() 15 | chunkendpoints = [0] 16 | currldblock = 0; currsize = 0 17 | while currldblock < len(ldblocks): 18 | while currsize <= max(1,chunksize-avgldblocksize/2) and currldblock < len(ldblocks): 19 | currsize += ldblocks.iloc[currldblock].M_H 20 | currldblock += 1 21 | currsize = 0 22 | chunkendpoints += [currldblock] 23 | 24 | # store SNP indices of begin- and end-points of chunks 25 | chunkinfo = pd.DataFrame()
26 | 27 | # collapse data within chunks 28 | chunk_nums = []; chunk_denoms = [] 29 | for n, (i,j) in enumerate(zip(chunkendpoints[:-1], chunkendpoints[1:])): 30 | ldblock_ind = [l for l in ldblocks.iloc[i:j].index if l in numerators.keys()] 31 | if len(ldblock_ind) > 0: 32 | chunk_nums.append(sum( 33 | [numerators[l] for l in ldblock_ind])) 34 | chunk_denoms.append(sum( 35 | [denominators[l] for l in ldblock_ind])) 36 | newrow_chunkinfo = { 37 | 'ldblock_begin':min(ldblock_ind), 38 | 'ldblock_end':max(ldblock_ind)+1, 39 | 'chr_begin':ldblocks.loc[min(ldblock_ind),'chr'], 40 | 'chr_end':ldblocks.loc[max(ldblock_ind),'chr'], 41 | 'bp_begin':ldblocks.loc[min(ldblock_ind),'start'], 42 | 'bp_end':ldblocks.loc[max(ldblock_ind),'end'], 43 | 'snpind_begin':ldblocks.loc[min(ldblock_ind),'snpind_begin'], 44 | 'snpind_end':ldblocks.loc[max(ldblock_ind),'snpind_end'], 45 | 'numsnps':sum(ldblocks.loc[ldblock_ind,'M_H'])} 46 | chunkinfo = pd.concat([chunkinfo, pd.DataFrame([newrow_chunkinfo])], ignore_index=True) 47 | 48 | ## compute leave-one-out sums 49 | loonumerators = []; loodenominators = [] 50 | for i in range(len(chunk_nums)): 51 | loonumerators.append(sum(chunk_nums[:i]+chunk_nums[(i+1):])) 52 | loodenominators.append(sum(chunk_denoms[:i]+chunk_denoms[(i+1):])) 53 | 54 | return chunk_nums, chunk_denoms, loonumerators, loodenominators, chunkinfo 55 | 56 | # compute estimate of effect size 57 | def get_est(num, denom, k, num_background): 58 | ind = list(range(num_background)) + [num_background+k] 59 | num = num[ind] 60 | denom = denom[ind][:,ind] 61 | try: 62 | return np.linalg.solve(denom, num)[-1] 63 | except np.linalg.LinAlgError: 64 | return np.nan 65 | 66 | # compute standard error of estimate using jackknife 67 | def jackknife_se(est, loonumerators, loodenominators, k, num_background): 68 | m = np.ones(len(loonumerators)) 69 | theta_notj = [get_est(nu, de, k, num_background) 70 | for nu, de in zip(loonumerators, loodenominators)] 71 | g = len(m) 72 | n = m.sum() 73 | h = n/m 74 | theta_J = g*est - ((n-m)/n*theta_notj).sum() 75 | tau = est*h - (h-1)*theta_notj 76 | return np.sqrt(np.mean((tau - theta_J)**2/(h-1))) 77 | 78 | # residualize the first num_background annotations out of the num_background+k-th 79 | # marginal annotation 80 | # q is the numerator of the regression after background annots are residualized out 81 | # r is the denominator of the regression after background annots are residualized out 82 | # mux is the vector of coefficients required to residualize the background out of the 83 | # marginal annotation in question 84 | # muy is the vector of coefficients required to residualize the background out of the 85 | # vector of GWAS summary statistics 86 | def residualize(chunk_nums, chunk_denoms, num_background, k): 87 | q = np.array([num[num_background+k] for num in chunk_nums]) 88 | r = np.array([denom[num_background+k,num_background+k] for denom in chunk_denoms]) 89 | 90 | if num_background > 0: 91 | num = sum(chunk_nums) 92 | denom = sum(chunk_denoms) 93 | ATA = denom[:num_background][:,:num_background] 94 | ATy = num[:num_background] 95 | ATx = denom[:num_background][:,num_background+k] 96 | muy = np.linalg.solve(ATA, ATy) 97 | mux = np.linalg.solve(ATA, ATx) 98 | xiaiT = np.array([d[num_background+k,:num_background] 99 | for d in chunk_denoms]) 100 | yiaiT = np.array([nu[:num_background] 101 | for nu in chunk_nums]) 102 | aiaiT = np.array([d[:num_background][:,:num_background] 103 | for d in chunk_denoms]) 104 | q = q - xiaiT.dot(muy) - yiaiT.dot(mux) + aiaiT.dot(muy).dot(mux)
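# (the update above expands, chunk by chunk, the inner product between the
# residualized focal annotation x_i - A_i*mux and the residualized statistics
# y_i - A_i*muy: the two cross terms and the quadratic term use only the stored
# moments xiaiT = x_i^T A_i, yiaiT = y_i^T A_i, and aiaiT = A_i^T A_i; the
# denominator r on the next line is expanded the same way with y replaced by x)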
105 | r = r - 2*xiaiT.dot(mux) + aiaiT.dot(mux).dot(mux) 106 | 107 | return q, r, mux, muy 108 | 109 | # do sign-flipping to get p-value 110 | def signflip(q, T, printmem=True, mode='sum'): 111 | def mask(a, t): 112 | a_ = a.copy() 113 | a_[np.abs(a_) < t] = 0 114 | return a_ 115 | 116 | print('before sign-flipping:', fs.mem(), 'MB') 117 | 118 | if mode == 'sum': # use sum of q as the test statistic 119 | score = q.sum() 120 | elif mode == 'medrank': # examine how far the rank of 0 deviates from the 50th percentile 121 | score = np.searchsorted(np.sort(q), 0)/len(q) - 0.5 122 | elif mode == 'thresh': # threshold q at some absolute magnitude threshold 123 | top = np.percentile(np.abs(q), 75) 124 | print(top) 125 | ts = np.arange(0, top, top/10) 126 | q_thresh = np.array([mask(q, t) for t in ts]).T 127 | q_thresh /= np.linalg.norm(q_thresh, axis=0) 128 | scores = np.sum(q_thresh, axis=0) 129 | score = scores[np.argmax(np.abs(scores))] 130 | else: 131 | print('ERROR: invalid mode') 132 | return None 133 | 134 | null = np.zeros(T); current = 0; block = 100000 135 | while current < len(null): 136 | s = (-1)**np.random.binomial(1,0.5,size=(block, len(q))) 137 | if mode == 'sum': 138 | null[current:current+block] = s.dot(q) 139 | elif mode == 'medrank': 140 | null_q = s[:,:]*q[None,:] 141 | null_q = np.sort(null_q, axis=1) 142 | null[current:current+block] = \ 143 | np.array([np.searchsorted(s, 0)/len(s) - 0.5 for s in null_q]) 144 | elif mode == 'thresh': 145 | null_q_thresh = s[:,:,None]*q_thresh[None,:,:] 146 | sums = np.sum(null_q_thresh, axis=1) 147 | null[current:current+block] = np.array([s[np.argmax(np.abs(s))] for s in sums]) 148 | 149 | current += block 150 | p = max(1, ((np.abs(null) >= np.abs(score)).sum())) / float(current) 151 | del s; gc.collect() 152 | if p >= 0.01: 153 | null = null[:current] 154 | break 155 | 156 | se = np.abs(score)/np.sqrt(st.chi2.isf(p,1)) 157 | del null; gc.collect() 158 | print('after sign-flipping:', fs.mem(), 'MB. p=', p)
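# (p is the empirical two-sided p-value; se was chosen just above so that
# score/se behaves like a z-score consistent with p under a chi2_1 reference)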
159 | 160 | return p, score/se 161 | -------------------------------------------------------------------------------- /src/sldp/config.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | def add_default_params(args): 4 | if args.config is not None: 5 | # read in config file and replace '-' with '_' to match argparse behavior 6 | with open(args.config, 'r') as f: 7 | config = { k.replace('-','_') : v for k,v in json.load(f).items()} 8 | # overwrite with any non-None entries, resolving conflicts in favor of args 9 | config.update({ 10 | k:v for k,v in args.__dict__.items() 11 | if (v is not None and v != []) or k not in config.keys() 12 | }) 13 | # replace information in args 14 | args.__dict__ = config 15 | --------------------------------------------------------------------------------
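A config file is a JSON dictionary mapping long-option names (without the leading dashes; internal dashes are converted to underscores, as in `add_default_params` above) to values. A minimal hypothetical example, with every path invented purely for illustration:

```json
{
    "bfile-chr": "refpanel/1000G.EUR.QC.",
    "bfile-reg-chr": "refpanel/1000G.EUR.hm3.",
    "print-snps": "refpanel/printsnps.txt",
    "ld-blocks": "refpanel/ld_blocks.bed",
    "svd-stem": "refpanel/svds_95percent/",
    "ldscores-chr": "refpanel/ldscores."
}
```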
/src/sldp/preprocessannot: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function, division 3 | import argparse, sys 4 | import sldp.config as config 5 | import ypy.pretty as pretty 6 | 7 | def main(): 8 | parser = argparse.ArgumentParser() 9 | # required arguments 10 | parser.add_argument('--sannot-chr', nargs='+', required=True, 11 | help='Multiple (space-delimited) paths to sannot.gz files, not including '+\ 12 | 'chromosome') 13 | 14 | # optional arguments 15 | parser.add_argument('--alpha', type=float, default=-1, 16 | help='scale annotation values by sqrt(2*maf(1-maf))^{alpha+1}. '+\ 17 | '-1 means assume annotation values are already per-normalized-genotype, '+\ 18 | '0 means assume they were per allele. Default is -1.') 19 | parser.add_argument('--chroms', nargs='+', default=range(1,23), type=int, 20 | help='Space-delimited list of chromosomes to analyze. Default is 1..22') 21 | 22 | # configurable arguments 23 | parser.add_argument('--config', default=None, 24 | help='Path to a json file with values for other parameters. ' +\ 25 | 'Values in this file will be overridden by any values passed ' +\ 26 | 'explicitly via the command line.') 27 | parser.add_argument('--bfile-chr', default=None, 28 | help='Path to plink bfile of reference panel to use, not including ' +\ 29 | 'chromosome number. If not supplied, will be read from config file.') 30 | parser.add_argument('--print-snps', default=None, 31 | help='Path to set of potentially typed SNPs. If not supplied, will be read '+\ 32 | 'from config file.') 33 | parser.add_argument('--ld-blocks', default=None, 34 | help='Path to UCSC bed file containing one bed interval per LD block. If '+\ 35 | 'not supplied, will be read from config file.') 36 | 37 | print('=====') 38 | print(' '.join(sys.argv)) 39 | print('=====') 40 | args = parser.parse_args() 41 | config.add_default_params(args) 42 | pretty.print_namespace(args) 43 | print('=====') 44 | 45 | import sldp.preprocessannot as preprocessannot 46 | preprocessannot.main(args) 47 | 48 | if __name__ == "__main__": 49 | main() -------------------------------------------------------------------------------- /src/sldp/preprocessannot.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | 3 | def main(args): 4 | print('initializing...') 5 | import gzip, gc, time 6 | import numpy as np 7 | import pandas as pd 8 | import gprim.annotation as ga 9 | import gprim.dataset as gd 10 | import ypy.memo as memo 11 | 12 | # basic initialization 13 | mhc_bp = [25684587, 35455756] 14 | refpanel = gd.Dataset(args.bfile_chr) 15 | annots = [ga.Annotation(annot) for annot in args.sannot_chr] 16 | 17 | # read in ld blocks, remove MHC, read SNPs to print 18 | ldblocks = pd.read_csv(args.ld_blocks, delim_whitespace=True, header=None, 19 | names=['chr','start', 'end']) 20 | mhcblocks = (ldblocks.chr == 'chr6') & \ 21 | (ldblocks.end > mhc_bp[0]) & \ 22 | (ldblocks.start < mhc_bp[1]) 23 | ldblocks = ldblocks[~mhcblocks] 24 | print(len(ldblocks), 'loci after removing MHC') 25 | print_snps = pd.read_csv(args.print_snps, header=None, names=['SNP']) 26 | print_snps['printsnp'] = True 27 | print(len(print_snps), 'print snps') 28 | 29 | # process annotations 30 | for annot in annots: 31 | t0 = time.time() 32 | for c in args.chroms: 33 | print(time.time()-t0, ': loading chr', c, 'of', args.chroms) 34 | # get refpanel snp metadata for this chromosome 35 | snps = refpanel.bim_df(c) 36 | snps = ga.smart_merge(snps, refpanel.frq_df(c)[['SNP','MAF']]) 37 | print(len(snps), 'snps in refpanel', 38 | len(snps.columns), 'columns, including metadata') 39 | 40 | # read in annotation 41 | print('reading annot', annot.filestem()) 42 | names = annot.names(c) # names of annotations 43 | namesR = [n+'.R' for n in names] # names of results 44 | a = annot.sannot_df(c) 45 | if 'SNP' in a.columns: 46 | print('not a thinannot => doing full reconciliation of snps and allele coding') 47 | snps = ga.reconciled_to(snps, a, names, missing_val=0) 48 | else: 49 | print('detected thinannot, so assuming that annotation is synched to refpanel') 50 | snps = pd.concat([snps, a[names]], axis=1) 51 | 52 | # add information on which snps to print 53 | print('merging in print_snps') 54 | snps = pd.merge(snps, print_snps, how='left', on='SNP') 55 | snps.printsnp.fillna(False, inplace=True) 56 | snps.printsnp = snps.printsnp.astype(bool) 57 | 58 | # put on per-normalized-genotype scale 59 | if args.alpha != -1: 60 | print('scaling by maf according to alpha=', args.alpha) 61 | snps[names] = snps[names].values*\ 62 | np.power(2*snps.MAF.values*(1-snps.MAF.values), 63 | (1.+args.alpha)/2)[:,None] 64 | 65 | # make room for RV and make sure annotation values are treated as floats 66 | snps = pd.concat( 67 | [snps, pd.DataFrame(np.zeros(snps[names].shape), columns=namesR)], 68 | axis=1) 69 | snps[names] = snps[names].astype(float) 70 | 71 | # compute simple statistics about annotation 72 | print('computing basic statistics and writing') 73 | info = pd.DataFrame( 74 | columns=['M', 'M_5_50', 'sqnorm', 'sqnorm_5_50', 'supp', 'supp_5_50']) 75 | info['name'] = names 76 | info.set_index('name', inplace=True) 77 | info['M'] = len(snps)
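# (sqnorm is the squared Euclidean norm of each annotation and supp its L0 norm,
# i.e. the number of SNPs with a nonzero value; the _5_50 versions restrict to
# SNPs with MAF >= 5%. sldp later reads these to turn mu into the rf and h2v
# estimates reported in the .gwresults file.)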
78 | info['sqnorm'] = np.linalg.norm(snps[names], axis=0)**2 79 | info['supp'] = np.linalg.norm(snps[names], ord=0, axis=0) 80 | M_5_50 = (snps.MAF >= 0.05).values 81 | info['M_5_50'] = M_5_50.sum() 82 | info['sqnorm_5_50'] = np.linalg.norm(snps.loc[M_5_50, names], axis=0)**2 83 | info['supp_5_50'] = np.linalg.norm(snps.loc[M_5_50, names], ord=0, axis=0) 84 | info.to_csv(annot.info_filename(c), sep='\t') 85 | 86 | # process ldblocks one by one 87 | for ldblock, X, meta, ind in refpanel.block_data(ldblocks, c, meta=snps): 88 | if meta.printsnp.sum() == 0: 89 | print('no print-snps in this block') 90 | continue 91 | print(meta.printsnp.sum(), 'print-snps') 92 | if (meta[names] == 0).values.all(): 93 | print('annotations are all 0 in this block') 94 | snps.loc[ind, namesR] = 0 95 | else: 96 | mask = meta.printsnp.values 97 | V = meta[names].values 98 | XV = X.dot(V) 99 | snps.loc[ind[mask], namesR] = \ 100 | X[:,mask].T.dot(XV[:,-len(names):]) / X.shape[0] 101 | 102 | # write 103 | print('writing output') 104 | with gzip.open(annot.RV_filename(c), 'wt') as f: # text mode so to_csv can write 105 | snps.loc[snps.printsnp,['SNP','A1','A2']+names+namesR].to_csv( 106 | f, index=False, sep='\t') 107 | 108 | del snps; memo.reset(); gc.collect() 109 | 110 | print('done') 111 | -------------------------------------------------------------------------------- /src/sldp/preprocesspheno: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function, division 3 | import argparse, sys 4 | import sldp.config as config 5 | import ypy.pretty as pretty 6 | 7 | def main(): 8 | parser = argparse.ArgumentParser() 9 | # required arguments 10 | parser.add_argument('--sumstats-stem', required=True, 11 | help='path to sumstats.gz files, not including ".sumstats.gz" extension') 12 | 13 | # optional arguments 14 | parser.add_argument('--refpanel-name', default='KG3.95', 15 | help='suffix added to the directory created for storing output. '+\ 16 | 'Default is KG3.95, corresponding to 1KG Phase 3 reference panel '+\ 17 | 'processed with default parameters by preprocessrefpanel.py.') 18 | parser.add_argument('-no-M-5-50', default=False, action='store_true', 19 | help='Dont filter to SNPs with MAF >= 0.05 when estimating heritabilities') 20 | parser.add_argument('--set-h2g', default=None, type=float, 21 | help='Scale Z-scores to achieve this approximate heritability') 22 | parser.add_argument('--chroms', nargs='+', default=range(1,23), type=int, 23 | help='Space-delimited list of chromosomes to analyze. Default is 1..22') 24 | 25 | # configurable arguments 26 | parser.add_argument('--config', default=None, 27 | help='Path to a json file with values for other parameters. ' +\ 28 | 'Values in this file will be overridden by any values passed ' +\ 29 | 'explicitly via the command line.') 30 | parser.add_argument('--bfile-chr', default=None, 31 | help='Path to plink bfile of reference panel to use, not including ' +\ 32 | 'chromosome number. If not supplied, will be read from config file.') 33 | parser.add_argument('--svd-stem', default=None, 34 | help='Path to directory containing truncated svds of reference panel, by LD '+\ 35 | 'block, as produced by preprocessrefpanel.py. If not supplied, will be '+\ 36 | 'read from config file.') 37 | parser.add_argument('--print-snps', default=None, 38 | help='Path to set of potentially typed SNPs. If not supplied, will be read '+\
39 | 'from config file.') 40 | parser.add_argument('--ldscores-chr', default=None, 41 | help='Path to LD scores at a smallish set of SNPs (~1M). LD should be computed '+\ 42 | 'to all potentially causal snps. Used for heritability estimation. '+\ 43 | 'If not supplied, will be read from config file.') 44 | parser.add_argument('--ld-blocks', default=None, 45 | help='Path to UCSC bed file containing one bed interval per LD block. If '+\ 46 | 'not supplied, will be read from config file.') 47 | 48 | print('=====') 49 | print(' '.join(sys.argv)) 50 | print('=====') 51 | args = parser.parse_args() 52 | config.add_default_params(args) 53 | pretty.print_namespace(args) 54 | print('=====') 55 | 56 | import sldp.preprocesspheno as preprocesspheno 57 | preprocesspheno.main(args) 58 | 59 | if __name__ == '__main__': 60 | main() -------------------------------------------------------------------------------- /src/sldp/preprocesspheno.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | 3 | def main(args): 4 | print('initializing...') 5 | import gzip, os, gc, time 6 | import numpy as np 7 | import pandas as pd 8 | import gprim.annotation as ga 9 | import gprim.dataset as gd 10 | import ypy.fs as fs 11 | import ypy.memo as memo 12 | import sldp.weights as weights 13 | 14 | # read in refpanel, ld blocks, and svd snps 15 | refpanel = gd.Dataset(args.bfile_chr) 16 | ldblocks = pd.read_csv(args.ld_blocks, delim_whitespace=True, header=None, 17 | names=['chr','start', 'end']) 18 | print_snps = pd.read_csv(args.print_snps, header=None, names=['SNP']) 19 | print_snps['printsnp'] = True 20 | print(len(print_snps), 'svd snps') 21 | 22 | # read sumstats 23 | print('reading sumstats', args.sumstats_stem) 24 | ss = pd.read_csv(args.sumstats_stem+'.sumstats.gz', sep='\t') 25 | ss = ss[ss.Z.notnull() & ss.N.notnull()] 26 | print('{} snps, {}-{} individuals (avg: {})'.format( 27 | len(ss), np.min(ss.N), np.max(ss.N), np.mean(ss.N))) 28 | ss = pd.merge(ss, print_snps[['SNP']], on='SNP', how='inner') 29 | print(len(ss), 'snps typed') 30 | 31 | # read ld scores 32 | print('reading in ld scores') 33 | ld = pd.concat([pd.read_csv(args.ldscores_chr+str(c)+'.l2.ldscore.gz', 34 | delim_whitespace=True) 35 | for c in range(1,23)], axis=0) 36 | if args.no_M_5_50: 37 | M = sum([int(open(args.ldscores_chr+str(c)+'.l2.M').readline()) 38 | for c in range(1,23)]) 39 | else: 40 | M = sum([int(open(args.ldscores_chr+str(c)+'.l2.M_5_50').readline()) 41 | for c in range(1,23)]) 42 | print(len(ld), 'snps with ld scores') 43 | ssld = pd.merge(ss, ld, on='SNP', how='left') 44 | print(len(ssld), 'hm3 snps with sumstats after merge.') 45 | 46 | # estimate heritability using aggregate estimator 47 | def esth2g(ssld): 48 | meanchi2 = (ssld.Z**2).mean() 49 | meanNl2 = (ssld.N*ssld.L2).mean() 50 | sigma2g = (meanchi2 - 1)/meanNl2 51 | h2g = sigma2g * M 52 | K = M/meanNl2 # h2g = K (meanchi2 - 1) 53 | return h2g, sigma2g, meanchi2, K 54 | h2g, sigma2g, meanchi2, K = esth2g(ssld) 55 | h2g = max(h2g, 0.03) #0.03 is an arbitrarily chosen minimum 56 | print('mean chi2:', meanchi2) 57 | print('h2g estimated at:', h2g, 'sigma2g =', sigma2g) 58 | if args.set_h2g: 59 | print('scaling Z-scores to achieve h2g of', args.set_h2g) 60 | norm = meanchi2 / (1 + args.set_h2g/K) 61 | print('dividing all z-scores by', np.sqrt(norm)) 62 | ssld.Z /= np.sqrt(norm) 63 | h2g, sigma2g, _, _ = esth2g(ssld) 64 | print('h2g is now', h2g) 65 |
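# (note on the rescaling above: under the aggregate model E[chi2] = 1 + h2g/K,
# so dividing the Z-scores by sqrt(meanchi2/(1 + set_h2g/K)) makes the
# re-estimated h2g come out at approximately set_h2g)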
66 | # write h2g results to file 67 | dirname = args.sumstats_stem + '.' + args.refpanel_name 68 | fs.makedir(dirname) 69 | if 1 in args.chroms: 70 | print('writing info file') 71 | info = pd.DataFrame([ 72 | {'pheno':args.sumstats_stem.split('/')[-1], 73 | 'h2g':h2g, 74 | 'sigma2g':sigma2g, 75 | 'Nbar':ss.N.mean()}]) 76 | info.to_csv(dirname+'/info', sep='\t', index=False) 77 | 78 | # preprocess ld blocks 79 | t0 = time.time() 80 | for c in args.chroms: 81 | print(time.time()-t0, ': loading chr', c, 'of', args.chroms) 82 | # get refpanel snp metadata for this chromosome 83 | snps = refpanel.bim_df(c) 84 | snps = pd.merge(snps, print_snps, on='SNP', how='left') 85 | snps.printsnp.fillna(False, inplace=True) 86 | print(len(snps), 'snps in refpanel', len(snps.columns), 'columns, including metadata') 87 | 88 | # merge annot and sumstats 89 | print('reconciling') 90 | snps = ga.reconciled_to(snps, ss, ['Z'], 91 | othercolnames=['N'], missing_val=np.nan) 92 | snps['typed'] = snps.Z.notnull() 93 | snps['ahat'] = snps.Z / np.sqrt(snps.N) 94 | 95 | # initialize result dataframe 96 | # I = no weights 97 | # h = heuristic weights, using R_o 98 | snps['Winv_ahat_I'] = np.nan # = W_o^{-1} ahat_o 99 | snps['R_Winv_ahat_I'] = np.nan # = R_{*o} W_o^{-1} ahat_o 100 | snps['Winv_ahat_h'] = np.nan # = W_o^{-1} ahat_o 101 | snps['R_Winv_ahat_h'] = np.nan # = R_{*o} W_o^{-1} ahat_o 102 | 103 | # restrict to ld blocks in this chr and process them in chunks 104 | for ldblock, X, meta, ind in refpanel.block_data(ldblocks, c, meta=snps): 105 | if meta.printsnp.sum() == 0 or \ 106 | not os.path.exists(args.svd_stem+str(ldblock.name)+'.R.npz'): 107 | print('no svd snps found in this block') 108 | continue 109 | print(meta.printsnp.sum(), 'svd snps', meta.typed.sum(), 'typed snps') 110 | if meta.typed.sum() == 0: 111 | print('no typed snps found in this block') 112 | snps.loc[ind, [ 113 | 'R_Winv_ahat_I', 114 | 'R_Winv_ahat_h' 115 | ]] = 0 116 | continue 117 | R = np.load(args.svd_stem+str(ldblock.name)+'.R.npz') 118 | R2 = np.load(args.svd_stem+str(ldblock.name)+'.R2.npz') 119 | N = meta[meta.typed.values].N.mean() 120 | meta_svd = meta[meta.printsnp.values] 121 | 122 | # multiply ahat by the weights 123 | x_I = snps.loc[ind[meta.printsnp],'Winv_ahat_I'] = weights.invert_weights( 124 | R, R2, sigma2g, N, meta_svd.ahat.values, mode='Winv_ahat_I') 125 | x_h = snps.loc[ind[meta.printsnp],'Winv_ahat_h'] = weights.invert_weights( 126 | R, R2, sigma2g, N, meta_svd.ahat.values, mode='Winv_ahat_h') 127 | 128 | print('writing processed sumstats') 129 | with gzip.open('{}/{}.pss.gz'.format(dirname, c), 'wt') as f: # text mode for to_csv 130 | snps.loc[snps.printsnp,['N', 131 | 'Winv_ahat_I', 132 | 'Winv_ahat_h' 133 | ]].to_csv( 134 | f, index=False, sep='\t') 135 | 136 | del snps; memo.reset(); gc.collect() 137 | 138 | print('done') 139 | -------------------------------------------------------------------------------- /src/sldp/preprocessrefpanel: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function, division 3 | import argparse, gc, sys 4 | import numpy as np 5 | import pandas as pd 6 | import gprim.dataset as gd 7 | import ypy.memo as memo 8 | import ypy.fs as fs 9 | import ypy.pretty as pretty 10 | import sldp.config as config 11 | 12 | 13 | def main(): 14 | parser = argparse.ArgumentParser() 15 | # optional arguments 16 | parser.add_argument('--spectrum-percent', type=float, default=95, 17 | help='Determines how many eigenvectors are kept in the truncated SVD. ' +\
18 | 'A value of x means that x percent of the eigenspectrum will be kept. '+\ 19 | 'Default value: 95') 20 | parser.add_argument('--chroms', nargs='+', default=range(1,23), type=int, 21 | help='Space-delimited list of chromosomes to analyze. Default is 1..22') 22 | 23 | # configurable arguments 24 | parser.add_argument('--config', default=None, 25 | help='Path to a json file with values for other parameters. ' +\ 26 | 'Values in this file will be overridden by any values passed ' +\ 27 | 'explicitly via the command line.') 28 | parser.add_argument('--svd-stem', default=None, 29 | help='Path to directory in which output files will be stored. ' +\ 30 | 'If not supplied, will be read from config file.') 31 | parser.add_argument('--bfile-chr', default=None, 32 | help='Path to plink bfile of reference panel to use, not including ' +\ 33 | 'chromosome number. If not supplied, will be read from config file.') 34 | parser.add_argument('--print-snps', default=None, 35 | help='Path to set of potentially typed SNPs. If not supplied, will be read '+\ 36 | 'from config file.') 37 | parser.add_argument('--ld-blocks', default=None, 38 | help='Path to UCSC bed file containing one bed interval per LD block. If '+\ 39 | 'not supplied, will be read from config file.') 40 | 41 | print('=====') 42 | print(' '.join(sys.argv)) 43 | print('=====') 44 | args = parser.parse_args() 45 | config.add_default_params(args) 46 | pretty.print_namespace(args) 47 | print('=====') 48 | 49 | # basic initialization 50 | mhc = [25684587, 35455756] 51 | refpanel = gd.Dataset(args.bfile_chr) 52 | fs.makedir_for_file(args.svd_stem) 53 | 54 | # read in ld blocks, remove MHC, read SNPs to print 55 | ldblocks = pd.read_csv(args.ld_blocks, delim_whitespace=True, header=None, 56 | names=['chr','start', 'end']) 57 | mhcblocks = (ldblocks.chr == 'chr6') & (ldblocks.end > mhc[0]) & (ldblocks.start < mhc[1]) 58 | ldblocks = ldblocks[~mhcblocks] 59 | print(len(ldblocks), 'loci after removing MHC') 60 | print_snps = pd.read_csv(args.print_snps, header=None, names=['SNP']) 61 | print_snps['printsnp'] = True 62 | print(len(print_snps), 'print snps') 63 | 64 | for c in args.chroms: 65 | print('loading chr', c, 'of', args.chroms) 66 | # get refpanel snp metadata for this chromosome 67 | snps = refpanel.bim_df(c) 68 | snps = pd.merge(snps, print_snps, on='SNP', how='left') 69 | snps.printsnp.fillna(False, inplace=True) 70 | print(len(snps), 'snps in refpanel', len(snps.columns), 'columns, including metadata') 71 | 72 | # iterate through ld blocks and process each one 73 | for ldblock, X, meta, _ in refpanel.block_data(ldblocks, c, meta=snps): 74 | if meta.printsnp.sum() == 0: 75 | print('no print snps found in this block') 76 | continue 77 | 78 | # restrict X to have only potentially typed SNPs along one axis 79 | mask = meta.printsnp.values 80 | X_ = X[:,mask] 81 | 82 | def bestsvd(A): 83 | try: 84 | U_, svs_, _ = np.linalg.svd(A.T); svs_ = svs_**2 / A.shape[0] 85 | except np.linalg.LinAlgError: 86 | print('\t\tresorting to svd of XTX') 87 | U_, svs_, _ = np.linalg.svd(A.T.dot(A)); svs_ = svs_ / A.shape[0] 88 | return U_, svs_ 89 | 90 | # compute the (right-hand) SVD of R 91 | print('\tcomputing SVD of R_print') 92 | U_, svs_ = bestsvd(X_) 93 | k = np.argmax(np.cumsum(svs_)/svs_.sum() >= args.spectrum_percent / 100.)
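# (np.argmax returns the index of the first True entry, so k counts the leading
# components strictly before the one at which the cumulative spectrum first
# reaches the threshold; note that a tiny block whose first eigenvalue already
# crosses the threshold would get k == 0 under this convention)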
94 | print('\treduced rank of', k, 'out of', meta.printsnp.sum(), 'printed snps') 95 | np.savez('{}{}.R'.format(args.svd_stem, ldblock.name), U=U_[:,:k], svs=svs_[:k]) 96 | 97 | # compute the SVD of R2 98 | print('\tcomputing R2_print') 99 | R2 = X_.T.dot(X.dot(X.T)).dot(X_) / X.shape[0]**2 100 | print('\tcomputing SVD of R2_print') 101 | R2_U, R2_svs, _ = np.linalg.svd(R2) 102 | k = np.argmax(np.cumsum(R2_svs)/R2_svs.sum() >= args.spectrum_percent / 100.) 103 | print('\treduced rank of', k, 'out of', meta.printsnp.sum(), 'printed snps') 104 | np.savez('{}{}.R2'.format(args.svd_stem, ldblock.name), 105 | U=R2_U[:,:k], svs=R2_svs[:k]) 106 | 107 | del snps; memo.reset(); gc.collect() 108 | print('done') 109 | 110 | 111 | if __name__ == '__main__': 112 | main() 113 | -------------------------------------------------------------------------------- /src/sldp/sldp: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function, division 3 | import argparse, sys 4 | import sldp.config as config 5 | import ypy.pretty as pretty 6 | 7 | def main(): 8 | parser = argparse.ArgumentParser() 9 | # required arguments 10 | parser.add_argument('--outfile-stem', required=True, 11 | help='Path to an output file stem.') 12 | pheno = parser.add_mutually_exclusive_group(required=True) 13 | pheno.add_argument('--pss-chr', default=None, 14 | help='Path to .pss.gz file, without chromosome number or .pss.gz extension. '+\ 15 | 'This is the phenotype that SLDP will analyze.') 16 | pheno.add_argument('--sumstats-stem', default=None, 17 | help='Path to a .sumstats.gz file, not including ".sumstats.gz" extension. '+\ 18 | 'SLDP will process this into a set of .pss.gz files before running.') 19 | parser.add_argument('--sannot-chr', nargs='+', required=True, 20 | help='One or more (space-delimited) paths to gzipped annot files, without '+\ 21 | 'chromosome number or .sannot.gz/.RV.gz extension. These are the '+\ 22 | 'annotations that SLDP will analyze against the phenotype.') 23 | 24 | # optional arguments 25 | parser.add_argument('--verbose-thresh', default=0., type=float, 26 | help='Print additional information about each association studied with a '+\ 27 | 'p-value below this number. (Default is 0.) This includes: '+\ 28 | 'the covariance in each independent block of genome (.chunks files), '+\ 29 | 'and the coefficients required to residualize any background '+\ 30 | 'annotations out of the other annotations being analyzed.') 31 | parser.add_argument('-fastp', default=False, action='store_true', 32 | help='Estimate p-values fast (without permutation)') 33 | parser.add_argument('-bothp', default=False, action='store_true', 34 | help='Print both fastp p-values (as p_fast) and normal p-values. '+\ 35 | 'Takes precedence over fastp') 36 | parser.add_argument('--tell-me-stories', default=0., type=float, 37 | help='!!Experimental!! For associations with a p-value less than this number, '+\ 38 | 'print information about loci that may be promising to study. '+\ 39 | 'This will produce plots of (potentially overlapping) loci where the '+\ 40 | 'signed LD profile is highly correlated with the GWAS signal in a '+\ 41 | 'direction consistent with the global effect. '+\
42 | 'Default value is 0.') 43 | parser.add_argument('--story-corr-thresh', default=0.8, type=float, 44 | help='The threshold to use for correlation between Rv and alphahat in order '+\ 45 | 'for a locus to be considered worthy of a story') 46 | parser.add_argument('-more-stats', default=False, action='store_true', 47 | help='Print additional statistics about q in results file') 48 | parser.add_argument('--T', type=int, default=1000000, 49 | help='number of times to sign flip for empirical p-values. Default is 10^6.') 50 | parser.add_argument('--jk-blocks', type=int, default=300, 51 | help='Number of jackknife blocks to use. Default is 300.') 52 | parser.add_argument('--weights', default='Winv_ahat_h', 53 | help='which set of regression weights to use. Default is Winv_ahat_h, '+\ 54 | 'corresponding to weights described in Reshef et al. 2017.') 55 | parser.add_argument('--seed', default=None, type=int, 56 | help='Seed random number generator to a certain value. Off by default.') 57 | parser.add_argument('--stat', default='sum', 58 | help='*experimental* Which statistic to use for hypothesis testing. Options '+\ 59 | 'are: sum, medrank, or thresh.') 60 | parser.add_argument('--chi2-thresh', default=0, type=float, 61 | help='only use SNPs with a chi2 above this number for the regression') 62 | parser.add_argument('--chroms', nargs='+', default=range(1,23), type=int, 63 | help='Space-delimited list of chromosomes to analyze. Default is 1..22') 64 | 65 | # configurable arguments 66 | parser.add_argument('--config', default=None, 67 | help='Path to a json file with values for other parameters. ' +\ 68 | 'Values in this file will be overridden by any values passed ' +\ 69 | 'explicitly via the command line.') 70 | parser.add_argument('--background-sannot-chr', nargs='+', default=[], 71 | help='One or more (space-delimited) paths to gzipped annot files, without '+\ 72 | 'chromosome number or .sannot.gz extension. These are the annotations '+\ 73 | 'that SLDP will control for.') 74 | parser.add_argument('--svd-stem', default=None, 75 | help='Path to directory containing truncated svds of the reference panel, by '+\ 76 | 'LD block, as produced by preprocessrefpanel. '+\ 77 | 'If not supplied, will be read from config file.') 78 | parser.add_argument('--bfile-reg-chr', default=None, 79 | help='Path to plink bfile of reference panel to use, not including ' +\ 80 | 'chromosome number. This bfile should contain only regression SNPs '+\ 81 | '(as opposed to, e.g., all potentially causal SNPs). '+\ 82 | 'If not supplied, will be read from config file.') 83 | parser.add_argument('--ld-blocks', default=None, 84 | help='Path to UCSC bed file containing one bed interval per LD block. If '+\ 85 | 'not supplied, will be read from config file.')
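# Example invocation (every path below is hypothetical, for illustration only;
# see the wiki at https://github.com/yakirr/sldp/wiki for a real walkthrough):
#
#   sldp --sumstats-stem sumstats/mytrait \
#        --sannot-chr annots/myannot/myannot. \
#        --config config.json \
#        --outfile-stem results/mytrait.myannot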
86 | 87 | print('=====') 88 | print(' '.join(sys.argv)) 89 | print('=====') 90 | args = parser.parse_args() 91 | config.add_default_params(args) 92 | pretty.print_namespace(args) 93 | print('=====') 94 | 95 | preprocess_sumstats(args) 96 | preprocess_sannots(args) 97 | 98 | print('initializing...') 99 | import os, time 100 | import numpy as np 101 | import pandas as pd 102 | import scipy.stats as st 103 | import gprim.annotation as ga 104 | import gprim.dataset as gd 105 | import ypy.memo as memo 106 | import sldp.weights as weights 107 | import sldp.chunkstats as cs 108 | import sldp.storyteller as storyteller 109 | 110 | # basic initialization 111 | mhc_bp = [25684587, 35455756] 112 | refpanel = gd.Dataset(args.bfile_reg_chr) 113 | pheno_name = args.pss_chr.split('/')[-2].replace('.KG3.95','') 114 | if args.seed is not None: 115 | np.random.seed(args.seed) 116 | print('random seed:', args.seed) 117 | 118 | # read in names of background annotations and marginal annotations 119 | annots = [ga.Annotation(annot) for annot in args.sannot_chr] 120 | marginal_names = sum([a.names(22, RV=True) for a in annots], []) 121 | marginal_names = [n for n in marginal_names if '.R' in n] 122 | marginal_infos = pd.concat([a.info_df(args.chroms) for a in annots], axis=0) 123 | backgroundannots = [ga.Annotation(annot) for annot in args.background_sannot_chr] 124 | background_names = sum([a.names(22, True) for a in backgroundannots], []) 125 | background_names = [n for n in background_names if '.R' in n] 126 | print('background annotations:', background_names) 127 | print('marginal annotations:', marginal_names) 128 | if len(set(background_names) & set(marginal_names)) > 0: 129 | raise ValueError('the background annotation names and the marginal annotation '+\ 130 | 'names must be disjoint sets') 131 | 132 | # read in ldblocks and remove ones that overlap mhc 133 | ldblocks = pd.read_csv(args.ld_blocks, delim_whitespace=True, header=None, 134 | names=['chr','start', 'end']) 135 | mhcblocks = (ldblocks.chr == 'chr6') & \ 136 | (ldblocks.end > mhc_bp[0]) & \ 137 | (ldblocks.start < mhc_bp[1]) 138 | ldblocks = ldblocks[~mhcblocks] 139 | 140 | # read information about sumstats 141 | sumstats_info = pd.read_csv(args.pss_chr+'info', sep='\t') 142 | sigma2g = sumstats_info.loc[0].sigma2g 143 | h2g = sumstats_info.loc[0].h2g 144 | 145 | # read in sumstats and annots, and compute numerator and denominator of regression for 146 | # each ldblock. These will later be processed and aggregated
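# (for each LD block b, numerators[b] stacks v^T W^{-1} alphahat for every
# background and marginal annotation column v -- the columns here are already
# the signed LD profiles Rv -- and denominators[b] stacks the matching
# v^T W^{-1} Rv cross-moments; the final estimate is solve(sum of denominators,
# sum of numerators), i.e. weighted least squares of alphahat on Rv)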
147 | numerators = dict(); denominators = dict() 148 | t0 = time.time() 149 | for c in args.chroms: 150 | print(time.time()-t0, ': loading chr', c, 'of', args.chroms) 151 | 152 | # get refpanel snp metadata 153 | snps = refpanel.bim_df(c) 154 | print(len(snps), 'snps in refpanel', len(snps.columns), 'columns, including metadata') 155 | 156 | # read sumstats 157 | print('reading sumstats') 158 | ss = pd.read_csv(args.pss_chr+str(c)+'.pss.gz', sep='\t') 159 | print(np.isnan(ss[args.weights]).sum(), 'sumstats nans out of', len(ss)) 160 | snps['Winv_ahat'] = ss[args.weights] 161 | snps['N'] = ss.N 162 | snps['typed'] = snps.Winv_ahat.notnull() 163 | if args.chi2_thresh > 0: 164 | print('applying chi2 threshold of', args.chi2_thresh) 165 | snps.typed &= (ss.Winv_ahat_I**2 * ss.N > args.chi2_thresh) 166 | print(snps.typed.sum(), 'typed snps left') 167 | 168 | # read annotations 169 | print('reading annotations') 170 | for annot in backgroundannots+annots: 171 | mynames = [n for n in annot.names(22, RV=True) if '.R' in n] #names of annotations 172 | print(time.time()-t0, ': reading annot', annot.filestem()) 173 | print('adding', mynames) 174 | snps = pd.concat([snps, annot.RV_df(c)[mynames]], axis=1) 175 | if (~np.isfinite(snps[mynames].values)).sum() > 0: 176 | raise ValueError('There should be no nans in the postprocessed annotation. '+\ 177 | 'But there are '+str((~np.isfinite(snps[mynames].values)).sum())) 178 | 179 | # make sure things are in the order we think they are 180 | if (np.array(background_names+marginal_names) != 181 | snps.columns.values[-len(background_names+marginal_names):]).any(): 182 | raise ValueError('Merged annotations are not in the right order') 183 | 184 | # perform computations 185 | for ldblock, _, meta, ind in refpanel.block_data( 186 | ldblocks, c, meta=snps, genos=False, verbose=0): 187 | if meta.typed.sum() == 0 or \ 188 | not os.path.exists(args.svd_stem+str(ldblock.name)+'.R.npz'): 189 | # no typed snps/hm3 snps in this block. set num snps to 0
190 | ldblocks.loc[ldblock.name, 'M_H'] = 0 191 | continue 192 | if (meta[background_names+marginal_names] == 0).values.all(): 193 | # annotations are all 0 in this block 194 | ldblocks.loc[ldblock.name, 'M_H'] = 0 195 | continue 196 | # record the number of typed snps in this block, and start- and end- snp indices 197 | ldblocks.loc[ldblock.name, 'M_H'] = meta.typed.sum() 198 | ldblocks.loc[ldblock.name, 'snpind_begin'] = min(meta.index) # for verbose 199 | ldblocks.loc[ldblock.name, 'snpind_end'] = max(meta.index)+1 # for verbose 200 | 201 | # load regression weights and prepare for regression computation 202 | meta_t = meta[meta.typed.values] 203 | N = meta_t.N.mean() 204 | if args.weights == 'Winv_ahat_h' or args.weights == 'Winv_ahat_hlN': 205 | R = np.load(args.svd_stem+str(ldblock.name)+'.R.npz') 206 | R2 = None 207 | if len(R['U']) != len(meta): 208 | raise ValueError('regression wgts dimension must match regression snps') 209 | elif args.weights == 'Winv_ahat_h2' or args.weights == 'Winv_ahat': 210 | R = np.load(args.svd_stem+str(ldblock.name)+'.R.npz') 211 | R2 = np.load(args.svd_stem+str(ldblock.name)+'.R2.npz') 212 | if len(R['U']) != len(meta) or len(R2['U']) != len(meta): 213 | raise ValueError('regression wgts dimension must match regression snps') 214 | else: 215 | R = None; R2 = None 216 | 217 | # multiply ahat by the weights 218 | Winv_RV = weights.invert_weights( 219 | R, R2, sigma2g, N, meta[background_names+marginal_names].values, 220 | typed=meta.typed.values, mode=args.weights) 221 | 222 | numerators[ldblock.name] = \ 223 | (meta_t[background_names+marginal_names].T.dot( 224 | meta_t.Winv_ahat)).values/1e6 225 | denominators[ldblock.name] = meta_t[background_names+marginal_names].T.dot( 226 | Winv_RV[meta.typed.values]).values/1e6 227 | memo.reset() 228 | 229 | # get data for jackknifing 230 | print('jackknifing') 231 | chunk_nums, chunk_denoms, loo_nums, loo_denoms, chunkinfo = cs.collapse_to_chunks( 232 | ldblocks, 233 | numerators, 234 | denominators, 235 | args.jk_blocks) 236 | 237 | # compute final results 238 | global q, results 239 | results = pd.DataFrame() 240 | for i, name in enumerate(marginal_names): 241 | print(i, name) 242 | # metadata about v 243 | sqnorm = marginal_infos.loc[name[:-2], 'sqnorm'] 244 | supp = marginal_infos.loc[name[:-2], 'supp'] 245 | 246 | # estimate mu and initialize results row 247 | k = marginal_names.index(name) 248 | mu = cs.get_est(sum(chunk_nums), sum(chunk_denoms), k, len(background_names)) 249 | newrow_results = { 250 | 'pheno':pheno_name, 251 | 'annot':name} 252 | results = pd.concat([results, pd.DataFrame([newrow_results])], ignore_index=True) 253 | 254 | # compute q 255 | q, r, mux, muy = cs.residualize(chunk_nums, chunk_denoms, len(background_names), k) 256 | 257 | 258 | # p-values 259 | if args.bothp or not args.fastp: 260 | p_emp, z_emp = cs.signflip(q, args.T, printmem=True, mode=args.stat) 261 | results.loc[i,'z'] = z_emp 262 | results.loc[i,'p'] = p_emp 263 | if args.bothp or args.fastp: 264 | z_fast = np.sum(q)/np.linalg.norm(q) 265 | p_fast = st.chi2.sf(z_fast**2, 1) 266 | results.loc[i, 'z_fast'] = z_fast 267 | results.loc[i, 'p_fast'] = p_fast 268 | if args.fastp and not args.bothp: # if only fastp, rename the output columns 269 | results.loc[i,'p'] = results.loc[i,'p_fast'] 270 | results.loc[i,'z'] = results.loc[i,'z_fast'] 271 | results.drop(['p_fast','z_fast'], axis=1, inplace=True) 272 | 273 | # print verbose information if required 274 | if results.loc[i].p < args.verbose_thresh: 275 | fname = args.outfile_stem+'.'+pheno_name+'.'+name
276 | print('writing verbose results to', fname) 277 | chunkinfo['q'] = q 278 | chunkinfo['r'] = r 279 | chunkinfo.to_csv(fname+'.chunks', sep='\t', index=False) 280 | coeffs = pd.DataFrame() 281 | coeffs['annot'] = background_names 282 | coeffs['mux'] = mux 283 | coeffs['muy'] = muy 284 | coeffs.to_csv(fname+'.coeffs', sep='\t', index=False) 285 | 286 | # nominate interesting loci if desired 287 | if results.loc[i].p < args.tell_me_stories: 288 | storyteller.write(args.outfile_stem+'.'+name+'.loci', 289 | args, name, background_names, mux, muy, results.loc[i].z, 290 | corr_thresh=args.story_corr_thresh) 291 | 292 | # jackknife (computed before the optional extra stats, which use se) 293 | se = cs.jackknife_se(mu, loo_nums, loo_denoms, k, len(background_names)) 294 | results.loc[i,'mu'] = mu 295 | results.loc[i,'se(mu)'] = se 296 | 297 | # add more output if desired 298 | if args.more_stats: 299 | results.loc[i,'qkurtosis'] = st.kurtosis(q) 300 | results.loc[i,'qstd'] = np.std(q) 301 | results.loc[i,'p_jk'] = st.chi2.sf((mu/se)**2,1) 302 | results.loc[i,'sqnorm'] = sqnorm 303 | 304 | # add estimates of rf, h2v, and other associated quantities 305 | M = marginal_infos.loc[name[:-2],'M'] 306 | results.loc[i,'h2g'] = h2g 307 | results.loc[i,'rf'] = mu * np.sqrt( 308 | sqnorm / h2g) 309 | results.loc[i,'h2v/h2g'] = results.loc[i].rf**2 - \ 310 | results.loc[i,'se(mu)']**2 * sqnorm / (M*sigma2g) 311 | results.loc[i,'h2v'] = results.loc[i,'h2v/h2g'] * h2g 312 | results.loc[i,'supp(v)/M'] = supp/M 313 | results.to_csv(args.outfile_stem + '.gwresults', sep='\t', index=False, na_rep='nan') # write progressively after each annotation 314 | 315 | print(results) 316 | print('writing results to', args.outfile_stem + '.gwresults') 317 | results.to_csv(args.outfile_stem + '.gwresults', sep='\t', index=False, na_rep='nan') 318 | print('done') 319 | 320 | # preprocess any sumstats that need preprocessing 321 | def preprocess_sumstats(args): 322 | import os 323 | if args.pss_chr is None: 324 | unprocessed_chroms = [ 325 | c for c in args.chroms 326 | if not os.path.exists( 327 | args.sumstats_stem + '.' + args.refpanel_name + '/'+str(c)+'.pss.gz') 328 | ] 329 | if len(unprocessed_chroms) > 0: 330 | print('Preprocessing', args.sumstats_stem+'.sumstats.gz', 'at chromosomes', unprocessed_chroms) 331 | if args.config is None: 332 | raise ValueError('automatic pre-processing of a sumstats file requires '+\ 333 | 'specification of a config file; otherwise I dont know what '+\ 334 | 'parameters to use. If you want, you can preprocess the sumstats '+\ 335 | 'without a config file by running preprocesspheno manually') 336 | print('Using config file', args.config, 'and default options') 337 | 338 | # run the command 339 | import sldp.preprocesspheno, copy 340 | args_ = copy.copy(args) 341 | args_.no_M_5_50 = False 342 | args_.set_h2g = None 343 | args_.chroms = unprocessed_chroms 344 | sldp.preprocesspheno.main(args_) 345 | 346 | # modify args to reflect the existence of the pss-chr files 347 | args.pss_chr = args.sumstats_stem + '.' + args.refpanel_name + '/'
348 | args.sumstats_stem = None 349 | print('== finished preprocessing sumstats ==') 350 | 351 | # preprocess any annotations that need preprocessing 352 | def preprocess_sannots(args): 353 | import os 354 | 355 | for sannot in args.sannot_chr: 356 | unprocessed_chroms = [ 357 | c for c in args.chroms 358 | if not (os.path.exists(sannot + str(c) + '.RV.gz') and 359 | os.path.exists(sannot + str(c) + '.info')) 360 | ] 361 | if len(unprocessed_chroms) > 0: 362 | print('Preprocessing', sannot, 'at chromosomes', unprocessed_chroms) 363 | if args.config is None: 364 | raise ValueError('automatic pre-processing of an annotation '+\ 365 | 'requires specification of a config file; otherwise I dont know what '+\ 366 | 'parameters to use. If you want, you can preprocess the annotation '+\ 367 | 'without a config file by running preprocessannot manually') 368 | print('Using config file', args.config, 'and default options') 369 | 370 | # run preprocessing command 371 | import sldp.preprocessannot, copy 372 | args_ = copy.copy(args) 373 | args_.alpha = -1 374 | args_.chroms = unprocessed_chroms 375 | sldp.preprocessannot.main(args_) 376 | 377 | print('== finished preprocessing annotation', sannot) 378 | 379 | 380 | if __name__ == '__main__': 381 | main() 382 | -------------------------------------------------------------------------------- /src/sldp/storyteller.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | import numpy as np 3 | import pandas as pd 4 | import gprim.annotation as ga 5 | import gprim.dataset as gd 6 | import ypy.fs as fs 7 | import matplotlib.pyplot as plt 8 | 9 | # find windows with genome-wide significant SNPs that are consistent with the global signal 10 | def write(folder, args, name, background_names, mux, muy, z, corr_thresh=0.8): 11 | print('STORYTELLING for ', name, 'z=', z) 12 | refpanel = gd.Dataset(args.bfile_reg_chr) 13 | annot = [ga.Annotation(a) for a in args.sannot_chr 14 | if name in ga.Annotation(a).names(22, RV=True)][0] 15 | 16 | backgroundannots = [ga.Annotation(a) for a in args.background_sannot_chr] 17 | print('focal annotation columns:', annot.names(22, True)) 18 | print('background annotations:', background_names) 19 | 20 | # get refpanel snp metadata 21 | print('re-reading snps') 22 | snps = pd.concat([refpanel.bim_df(c) for c in args.chroms], axis=0) 23 | 24 | # read sumstats 25 | print('re-reading sumstats') 26 | ss = pd.concat([ 27 | pd.read_csv(args.pss_chr+str(c)+'.pss.gz', sep='\t', usecols=['N','Winv_ahat_I']) 28 | for c in args.chroms]) 29 | snps['ahat'] = ss['Winv_ahat_I'] 30 | snps['N'] = ss['N'] 31 | del ss 32 | 33 | # read annotations 34 | print('re-reading background annotations') 35 | for a in backgroundannots: 36 | mynames = [n for n in a.names(22, RV=True) if '.R' in n] #names of annotations 37 | snps = pd.concat([snps, 38 | pd.concat([a.RV_df(c)[mynames] for c in args.chroms], axis=0) 39 | ], axis=1) 40 | 41 | print('reading focal annotation') 42 | snps = pd.concat([snps, 43 | pd.concat([annot.RV_df(c)[name] for c in args.chroms], axis=0) 44 | ], axis=1) 45 | 46 | print('residualizing background out of focal') 47 | A = snps[background_names] 48 | snps['chi2'] = snps.N * snps.ahat**2 49 | snps['Rv'] = snps[name] 50 | snps['ahat_resid'] = snps.ahat - A.values.dot(muy) 51 | snps['Rv_resid'] = snps.Rv - A.values.dot(mux) 52 | snps['typed'] = snps.ahat_resid.notnull() 53 | snps = snps[snps.typed].reset_index(drop=True)
54 | snps['significant'] = snps.chi2 > 29.716785 # chi2_1 threshold corresponding to p = 5e-8 (genome-wide significance) 55 | print(snps.significant.sum(), 'genome-wide significant SNPs') 56 | 57 | print('searching for good windows') 58 | # get endpoints of windows 59 | stride = 20 60 | windowsize_in_strides = 5 61 | windowsize = stride * windowsize_in_strides 62 | 63 | # find all starting points of windows containing GWAS-sig SNPs 64 | starts = np.concatenate([ 65 | [ int(i/stride)*stride - k*stride 66 | for k in range(0, windowsize_in_strides)] 67 | for i in np.where(snps.significant)[0]]) 68 | starts = np.array(sorted(list(set(starts)))) 69 | # compute corresponding endpoints 70 | ends = starts + windowsize 71 | 72 | # truncate any windows that extend past the ends of the genome 73 | starts = starts[ends < len(snps)] 74 | ends = ends[ends < len(snps)] 75 | ends = ends[starts >= 0] 76 | starts = starts[starts >= 0] 77 | 78 | print(len(starts), 'windows with GWAS hits found') 79 | 80 | # compute correlations 81 | numbers = pd.DataFrame( 82 | np.array([[i,j, 83 | np.max(snps.iloc[i:j].chi2), 84 | np.corrcoef(snps.iloc[i:j].Rv_resid, snps.iloc[i:j].ahat_resid)[0,1]] 85 | for i,j in zip(starts, ends)]), 86 | columns=['start','end','maxchi2','corr']) 87 | 88 | # keep only cases with strong correlations in the right direction 89 | numbers = numbers[numbers['corr']**2 >= corr_thresh] 90 | numbers = numbers[np.sign(numbers['corr']) == np.sign(z)] 91 | print(len(numbers), 'windows with GWAS hits and squared correlation with Rv >=', 92 | corr_thresh, 'in the right direction') 93 | for i,j in zip(numbers.start, numbers.end): 94 | i = int(i); j = int(j) 95 | print('saving', i,j) 96 | c = snps.iloc[i].CHR 97 | start = snps.iloc[i].BP 98 | end = snps.iloc[j].BP 99 | plt.figure() 100 | plt.scatter(snps.iloc[i:j].Rv_resid, 101 | snps.iloc[i:j].ahat_resid * np.sqrt(snps.iloc[i:j].N)) 102 | plt.title('chr{}:{}-{}'.format(c, start, end)) 103 | plt.xlabel(r'residual $Rv$') 104 | plt.ylabel(r'residual $Z$') 105 | 106 | filename = '{}/chr{}:{}-{}.pdf'.format(folder, c, start, end) 107 | fs.makedir_for_file(filename) 108 | plt.savefig(filename) 109 | plt.close() 110 | -------------------------------------------------------------------------------- /src/sldp/weights.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | import numpy as np 3 | 4 | # R: SVD of (R, restricted to regression SNPs) 5 | # R2: SVD of (R^2, restricted to regression SNPs) 6 | def invert_weights(R, R2, sigma2g, N, x, typed=None, mode='Winv_ahat_h'): 7 | if typed is None: 8 | typed = np.isfinite(x.reshape((len(x),-1)).sum(axis=1)) 9 | 10 | # trivial weights 11 | if mode == 'Winv_ahat_I': 12 | result = x 13 | # heuristic, with large-N approximation 14 | elif mode == 'Winv_ahat_hlN': 15 | U = R['U'][typed,:]; svs=R['svs'] 16 | result = np.full(x.shape, np.nan) 17 | result[typed] = (U/(svs**2)).dot(U.T.dot(x[typed])) 18 | # heuristic, no large N approximation, using (R_o)^2 to approximate (R2)_o 19 | elif mode == 'Winv_ahat_h': 20 | U = R['U'][typed,:]; svs=R['svs'] 21 | result = np.full(x.shape, np.nan) 22 | result[typed] = (U/(sigma2g*svs**2+svs/N)).dot(U.T.dot(x[typed])) 23 | # heuristic, no large N approximation, using R2 instead of R 24 | elif mode == 'Winv_ahat_h2': 25 | U = R2['U'][typed,:]; svs=R2['svs'] 26 | result = np.full(x.shape, np.nan) 27 | result[typed] = (U/(sigma2g*svs+np.sqrt(svs)/N)).dot(U.T.dot(x[typed])) 28 | # exact 29 | elif mode == 'Winv_ahat': 30 | R_ = (R['U'][typed]*R['svs']).dot(R['U'][typed].T) 31 | R2_
= (R2['U'][typed]*R2['svs']).dot(R2['U'][typed].T) 32 | W = R_/N + sigma2g*R2_ 33 | U, svs, _ = np.linalg.svd(W) 34 | k = np.argmax(np.cumsum(svs)/svs.sum() >= 0.95) 35 | # k = np.argmax(svs[:-1]/svs[1:] >= 1e5)+1 36 | print(R['U'].shape, R2['U'].shape, 'k=',k) 37 | U = U[:,:k]; svs = svs[:k] 38 | result = np.full(x.shape, np.nan) 39 | result[typed] = (U/svs).dot(U.T.dot(x[typed])) 40 | return result 41 | --------------------------------------------------------------------------------
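As a quick sanity check of the weighting interface, here is a minimal synthetic sketch (the array shapes and the `U`/`svs` keys follow the `np.savez` calls in `preprocessrefpanel`; all numbers are made up for illustration):

```python
import numpy as np
import sldp.weights as weights

rng = np.random.default_rng(0)

# rank-20 truncated SVD of an LD matrix over 100 regression SNPs
U, _ = np.linalg.qr(rng.standard_normal((100, 20)))   # orthonormal columns
svs = np.sort(rng.uniform(0.1, 1.0, size=20))[::-1]   # leading eigenvalues of R
R = {'U': U, 'svs': svs}

# marginal effect estimates, with a few untyped SNPs marked as nan
ahat = rng.standard_normal(100) / np.sqrt(1000.)
ahat[:10] = np.nan

# apply the heuristic weights; untyped entries stay nan in the output
x = weights.invert_weights(R, None, sigma2g=1e-6, N=1000., x=ahat,
                           mode='Winv_ahat_h')
print(np.isnan(x).sum(), 'untyped SNPs left unweighted')
```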