├── .gitignore ├── README.md ├── pyproject.toml ├── setup.cfg └── src ├── .DS_Store └── sldp ├── .DS_Store ├── __init__.py ├── chunkstats.py ├── config.py ├── preprocessannot ├── preprocessannot.py ├── preprocesspheno ├── preprocesspheno.py ├── preprocessrefpanel ├── sldp ├── storyteller.py └── weights.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *.cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | 55 | # Sphinx documentation 56 | docs/_build/ 57 | 58 | # PyBuilder 59 | target/ 60 | 61 | # IPython Notebook 62 | .ipynb_checkpoints 63 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SLDP (Signed LD Profile) regression 2 | 3 | SLDP regression is a method for detecting a directional effect of a signed functional annotation on a heritable trait using GWAS summary statistics. This repository contains code for the SLDP regression method as well as the tools required to preprocess data for use with SLDP regression. 4 | 5 | ## Installation 6 | 7 | First, make sure you have a python distribution installed that includes scientific computing packages like numpy/scipy/pandas as well as the package manager pip; we recommend [Anaconda](https://store.continuum.io/cshop/anaconda/). 8 | 9 | To install `sldp`, type the following command. 10 | ``` 11 | pip install sldp 12 | ``` 13 | This should install both sldp and any required packages, such as [gprim](https://github.com/yakirr/gprim) and [ypy](https://github.com/yakirr/ypy). 14 | 15 | If you prefer to install `sldp` without pip, just clone this repository, together with [gprim](https://github.com/yakirr/gprim) and [ypy](https://github.com/yakirr/ypy), and add an entry for each into your python path. 16 | 17 | 18 | ## Getting started 19 | 20 | To verify that the installation succeeded, run 21 | ``` 22 | sldp -h 23 | ``` 24 | to print a list of all command-line options. If this command fails, there was a problem with the installation. 25 | 26 | Once this works, take a look at our [wiki](https://github.com/yakirr/sldp/wiki) for a short tutorial on how to use `sldp`. 27 | 28 | 29 | ## Where can I get signed LD profiles? 30 | 31 | You can download signed LD profiles (as well as raw signed functional annotations) for ENCODE ChIP-seq experiments from the [sldp data page](https://data.broadinstitute.org/alkesgroup/SLDP/). These signed LD profiles were created using 1000 Genomes Phase 3 Europeans as the reference panel. 32 | 33 | ## Where can I get reference panel information such as SVDs of LD blocks and LD scores?
34 | 35 | You can download all required reference panel information, computed using 1000 Genomes Phase 3 Europeans, from the [sldp data page](https://data.broadinstitute.org/alkesgroup/SLDP/). 36 | 37 | ## Errata 38 | ### Gene-set enrichment method for SLDP results 39 | In the published paper, we described a gene-set enrichment method for assessing whether a genome-wide signed relationship between an annotation and a trait is stronger in areas of the genome that are near a gene set of interest. The method was described in terms of a vector `s` that summarizes the gene set of interest and a vector `q` that summarizes the SLDP association of interest. Both `s` and `q` have one entry per LD block: the `i`-th entry of `s` contains the number of genes from the gene set that lie in the `i`-th LD block, and the `i`-th entry of `q` contains the estimated covariance across SNPs in the `i`-th LD block between the signed LD profile of the annotation in question and summary statistics of the trait in question. There are two errata related to this analysis, which we describe below. 40 | 41 | #### Description of gene-set enrichment analysis statistic 42 | In the paper, we stated that the statistic we compute is 43 | 44 | ![equation](https://latex.codecogs.com/png.latex?a%20%3A%3D%20%5Cfrac%7B%5Csum_i%20s_iq_i%7D%7B%5Csum_i%20s_i%7D) 45 | 46 | that is, the weighted average of `q` across the LD blocks with non-zero values of `s`. However, the statistic actually used is 47 | 48 | ![equation](https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Csum_i%20s_iq_i%7D%7B%5Csum_i%20s_i%7D%20-%20%5Cfrac%7B%5Csum_i%20%5Cmathbf%7B1%7D%28s_i%3D0%29%20q_i%7D%7B%5Csum_i%20%5Cmathbf%7B1%7D%28s_i%3D0%29%7D) 49 | 50 | that is, we take the difference between the weighted average of `q` across the LD blocks with non-zero values of `s` on the one hand, and the average of `q` across the LD blocks in which `s` is zero on the other hand. 51 | 52 | #### Computation of empirical p-values in gene-set enrichment analysis 53 | Our gene-set enrichment procedure computed p-values by shuffling `s` over LD blocks. However, the code that produced our published results computed a simple average of `q` rather than a weighted average when computing the statistic for the null distribution. Fixing the bug led to qualitatively similar but not identical results. For more detail, download the [corrected version of Supplementary Table 10](https://data.broadinstitute.org/alkesgroup/SLDP/errata/corrected_SuppTable10.xlsx), which lists the published and corrected p- and q-values of the gene-set enrichments highlighted in our publication. A minimal sketch of the corrected procedure appears at the end of this README. 54 | 55 | ## Citation 56 | 57 | If you use `sldp`, please cite 58 | 59 | [Reshef et al. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. 60 | Nature Genetics, 2018.](https://www.nature.com/articles/s41588-018-0196-7)
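
## Appendix: sketch of the corrected gene-set statistic

For concreteness, the snippet below illustrates the corrected statistic and permutation test described in the Errata section above. It is a minimal sketch for illustration only (the function names and the permutation count are ours); it is not the code that produced the published numbers.

```python
import numpy as np

def geneset_stat(s, q):
    # weighted average of q over LD blocks containing gene-set genes,
    # minus the plain average over LD blocks containing none
    return s.dot(q) / s.sum() - q[s == 0].mean()

def empirical_p(s, q, T=10000, seed=0):
    # shuffle s over LD blocks, recomputing the corrected (weighted) statistic
    rng = np.random.default_rng(seed)
    obs = geneset_stat(s, q)
    null = np.array([geneset_stat(rng.permutation(s), q) for _ in range(T)])
    return max(1, (np.abs(null) >= np.abs(obs)).sum()) / T
```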
61 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = [ 3 | "setuptools>=42", 4 | "wheel" 5 | ] 6 | build-backend = "setuptools.build_meta" 7 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | name = sldp 3 | version = 1.1.4 4 | author = Yakir Reshef 5 | author_email = yreshef@broadinstitute.org 6 | description = Signed LD profile regression 7 | long_description = file: README.md 8 | long_description_content_type = text/markdown 9 | url = http://github.com/yakirr/sldp 10 | project_urls = 11 | Bug Tracker = https://github.com/yakirr/sldp/issues 12 | Tutorial = https://github.com/yakirr/sldp/blob/master/README.md 13 | classifiers = 14 | Programming Language :: Python :: 3 15 | License :: OSI Approved :: MIT License 16 | Operating System :: OS Independent 17 | 18 | [options] 19 | package_dir = 20 | = src 21 | packages = find: 22 | python_requires = >=3 23 | install_requires = 24 | numpy 25 | pandas>=2 26 | scipy 27 | matplotlib 28 | gprim 29 | ypy 30 | 31 | [options.packages.find] 32 | where = src 33 | 34 | [options.entry_points] 35 | console_scripts = 36 | sldp = sldp.sldp:main 37 | preprocessannot = sldp.preprocessannot:main 38 | preprocesspheno = sldp.preprocesspheno:main 39 | preprocessrefpanel = sldp.preprocessrefpanel:main -------------------------------------------------------------------------------- /src/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yakirr/sldp/847ec3f29c2313e385ca1f2f269c89fd7310867d/src/.DS_Store -------------------------------------------------------------------------------- /src/sldp/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yakirr/sldp/847ec3f29c2313e385ca1f2f269c89fd7310867d/src/sldp/.DS_Store -------------------------------------------------------------------------------- /src/sldp/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yakirr/sldp/847ec3f29c2313e385ca1f2f269c89fd7310867d/src/sldp/__init__.py -------------------------------------------------------------------------------- /src/sldp/chunkstats.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | import gc 3 | import numpy as np 4 | import pandas as pd 5 | import scipy.stats as st 6 | import ypy.fs as fs 7 | 8 | # collapse LD blocks into approximately independent chunks 9 | def collapse_to_chunks(ldblocks, numerators, denominators, numblocks): 10 | # define endpoints of chunks 11 | ldblocks.M_H.fillna(0, inplace=True) 12 | totalM = ldblocks.M_H.sum() 13 | chunksize = totalM / numblocks 14 | avgldblocksize = totalM / (ldblocks.M_H != 0).sum() 15 | chunkendpoints = [0] 16 | currldblock = 0; currsize = 0 17 | while currldblock < len(ldblocks): 18 | while currsize <= max(1,chunksize-avgldblocksize/2) and currldblock < len(ldblocks): 19 | currsize += ldblocks.iloc[currldblock].M_H 20 | currldblock += 1 21 | currsize = 0 22 | chunkendpoints += [currldblock] 23 | 24 | # store SNP indices of begin- and end-points of chunks 25 | chunkinfo = pd.DataFrame()
26 | 27 | # collapse data within chunks 28 | chunk_nums = []; chunk_denoms = [] 29 | for n, (i,j) in enumerate(zip(chunkendpoints[:-1], chunkendpoints[1:])): 30 | ldblock_ind = [l for l in ldblocks.iloc[i:j].index if l in numerators.keys()] 31 | if len(ldblock_ind) > 0: 32 | chunk_nums.append(sum( 33 | [numerators[l] for l in ldblock_ind])) 34 | chunk_denoms.append(sum( 35 | [denominators[l] for l in ldblock_ind])) 36 | newrow_chunkinfo = { 37 | 'ldblock_begin':min(ldblock_ind), 38 | 'ldblock_end':max(ldblock_ind)+1, 39 | 'chr_begin':ldblocks.loc[min(ldblock_ind),'chr'], 40 | 'chr_end':ldblocks.loc[max(ldblock_ind),'chr'], 41 | 'bp_begin':ldblocks.loc[min(ldblock_ind),'start'], 42 | 'bp_end':ldblocks.loc[max(ldblock_ind),'end'], 43 | 'snpind_begin':ldblocks.loc[min(ldblock_ind),'snpind_begin'], 44 | 'snpind_end':ldblocks.loc[max(ldblock_ind),'snpind_end'], 45 | 'numsnps':sum(ldblocks.loc[ldblock_ind,'M_H'])} 46 | chunkinfo = pd.concat([chunkinfo, pd.DataFrame([newrow_chunkinfo])], ignore_index=True) 47 | 48 | ## compute leave-one-out sums 49 | loonumerators = []; loodenominators = [] 50 | for i in range(len(chunk_nums)): 51 | loonumerators.append(sum(chunk_nums[:i]+chunk_nums[(i+1):])) 52 | loodenominators.append(sum(chunk_denoms[:i]+chunk_denoms[(i+1):])) 53 | 54 | return chunk_nums, chunk_denoms, loonumerators, loodenominators, chunkinfo 55 | 56 | # compute estimate of effect size 57 | def get_est(num, denom, k, num_background): 58 | ind = list(range(num_background)) + [num_background+k] 59 | num = num[ind] 60 | denom = denom[ind][:,ind] 61 | try: 62 | return np.linalg.solve(denom, num)[-1] 63 | except np.linalg.LinAlgError: 64 | return np.nan 65 | 66 | # compute standard error of estimate using jackknife 67 | def jackknife_se(est, loonumerators, loodenominators, k, num_background): 68 | m = np.ones(len(loonumerators)) 69 | theta_notj = [get_est(nu, de, k, num_background) 70 | for nu, de in zip(loonumerators, loodenominators)] 71 | g = len(m) 72 | n = m.sum() 73 | h = n/m 74 | theta_J = g*est - ((n-m)/n*theta_notj).sum() 75 | tau = est*h - (h-1)*theta_notj 76 | return np.sqrt(np.mean((tau - theta_J)**2/(h-1))) 77 | 78 | # residualize the first num_background annotations out of the num_background+k-th 79 | # marginal annotation 80 | # q is the numerator of the regression after background annots are residualized out 81 | # r is the denominator of the regression after background annots are residualized out 82 | # mux is the vector of coefficients required to residualize the background out of the 83 | # marginal annotation in question 84 | # muy is the vector of coefficients required to residualize the background out of the 85 | # vector of GWAS summary statistics 86 | def residualize(chunk_nums, chunk_denoms, num_background, k): 87 | q = np.array([num[num_background+k] for num in chunk_nums]) 88 | r = np.array([denom[num_background+k,num_background+k] for denom in chunk_denoms]) 89 | 90 | if num_background > 0: 91 | num = sum(chunk_nums) 92 | denom = sum(chunk_denoms) 93 | ATA = denom[:num_background][:,:num_background] 94 | ATy = num[:num_background] 95 | ATx = denom[:num_background][:,num_background+k] 96 | muy = np.linalg.solve(ATA, ATy) 97 | mux = np.linalg.solve(ATA, ATx) 98 | xiaiT = np.array([d[num_background+k,:num_background] 99 | for d in chunk_denoms]) 100 | yiaiT = np.array([nu[:num_background] 101 | for nu in chunk_nums]) 102 | aiaiT = np.array([d[:num_background][:,:num_background] 103 | for d in chunk_denoms]) 104 | q = q - xiaiT.dot(muy) - yiaiT.dot(mux) + aiaiT.dot(muy).dot(mux)
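# (the update above expands, chunk by chunk, the inner product between the
# residualized focal annotation x_i - A_i*mux and the residualized statistics
# y_i - A_i*muy: the two cross terms and the quadratic term use only the stored
# moments xiaiT = x_i^T A_i, yiaiT = y_i^T A_i, and aiaiT = A_i^T A_i; the
# denominator r on the next line is expanded the same way with y replaced by x)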
105 | r = r - 2*xiaiT.dot(mux) + aiaiT.dot(mux).dot(mux) 106 | 107 | return q, r, mux, muy 108 | 109 | # do sign-flipping to get p-value 110 | def signflip(q, T, printmem=True, mode='sum'): 111 | def mask(a, t): 112 | a_ = a.copy() 113 | a_[np.abs(a_) < t] = 0 114 | return a_ 115 | 116 | print('before sign-flipping:', fs.mem(), 'MB') 117 | 118 | if mode == 'sum': # use sum of q as the test statistic 119 | score = q.sum() 120 | elif mode == 'medrank': # examine how far the rank of 0 deviates from the 50th percentile 121 | score = np.searchsorted(np.sort(q), 0)/len(q) - 0.5 122 | elif mode == 'thresh': # threshold q at some absolute magnitude threshold 123 | top = np.percentile(np.abs(q), 75) 124 | print(top) 125 | ts = np.arange(0, top, top/10) 126 | q_thresh = np.array([mask(q, t) for t in ts]).T 127 | q_thresh /= np.linalg.norm(q_thresh, axis=0) 128 | scores = np.sum(q_thresh, axis=0) 129 | score = scores[np.argmax(np.abs(scores))] 130 | else: 131 | print('ERROR: invalid mode') 132 | return None 133 | 134 | null = np.zeros(T); current = 0; block = 100000 135 | while current < len(null): 136 | s = (-1)**np.random.binomial(1,0.5,size=(block, len(q))) 137 | if mode == 'sum': 138 | null[current:current+block] = s.dot(q) 139 | elif mode == 'medrank': 140 | null_q = s[:,:]*q[None,:] 141 | null_q = np.sort(null_q, axis=1) 142 | null[current:current+block] = \ 143 | np.array([np.searchsorted(s, 0)/len(s) - 0.5 for s in null_q]) 144 | elif mode == 'thresh': 145 | null_q_thresh = s[:,:,None]*q_thresh[None,:,:] 146 | sums = np.sum(null_q_thresh, axis=1) 147 | null[current:current+block] = np.array([s[np.argmax(np.abs(s))] for s in sums]) 148 | 149 | current += block 150 | p = max(1, ((np.abs(null) >= np.abs(score)).sum())) / float(current) 151 | del s; gc.collect() 152 | if p >= 0.01: 153 | null = null[:current] 154 | break 155 | 156 | se = np.abs(score)/np.sqrt(st.chi2.isf(p,1)) 157 | del null; gc.collect() 158 | print('after sign-flipping:', fs.mem(), 'MB. p=', p)
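# (p is the empirical two-sided p-value; se was chosen just above so that
# score/se behaves like a z-score consistent with p under a chi2_1 reference)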
159 | 160 | return p, score/se 161 | -------------------------------------------------------------------------------- /src/sldp/config.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | def add_default_params(args): 4 | if args.config is not None: 5 | # read in config file and replace '-' with '_' to match argparse behavior 6 | with open(args.config, 'r') as f: 7 | config = { k.replace('-','_') : v for k,v in json.load(f).items()} 8 | # overwrite with any non-None entries, resolving conflicts in favor of args 9 | config.update({ 10 | k:v for k,v in args.__dict__.items() 11 | if (v is not None and v != []) or k not in config.keys() 12 | }) 13 | # replace information in args 14 | args.__dict__ = config 15 | --------------------------------------------------------------------------------
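A config file is a JSON dictionary mapping long-option names (without the leading dashes; internal dashes are converted to underscores, as in `add_default_params` above) to values. A minimal hypothetical example, with every path invented purely for illustration:

```json
{
    "bfile-chr": "refpanel/1000G.EUR.QC.",
    "bfile-reg-chr": "refpanel/1000G.EUR.hm3.",
    "print-snps": "refpanel/printsnps.txt",
    "ld-blocks": "refpanel/ld_blocks.bed",
    "svd-stem": "refpanel/svds_95percent/",
    "ldscores-chr": "refpanel/ldscores."
}
```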
/src/sldp/preprocessannot: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function, division 3 | import argparse, sys 4 | import sldp.config as config 5 | import ypy.pretty as pretty 6 | 7 | def main(): 8 | parser = argparse.ArgumentParser() 9 | # required arguments 10 | parser.add_argument('--sannot-chr', nargs='+', required=True, 11 | help='Multiple (space-delimited) paths to sannot.gz files, not including '+\ 12 | 'chromosome') 13 | 14 | # optional arguments 15 | parser.add_argument('--alpha', type=float, default=-1, 16 | help='scale annotation values by sqrt(2*maf(1-maf))^{alpha+1}. '+\ 17 | '-1 means assume annotation values are already per-normalized-genotype, '+\ 18 | '0 means assume they were per allele. Default is -1.') 19 | parser.add_argument('--chroms', nargs='+', default=range(1,23), type=int, 20 | help='Space-delimited list of chromosomes to analyze. Default is 1..22') 21 | 22 | # configurable arguments 23 | parser.add_argument('--config', default=None, 24 | help='Path to a json file with values for other parameters. ' +\ 25 | 'Values in this file will be overridden by any values passed ' +\ 26 | 'explicitly via the command line.') 27 | parser.add_argument('--bfile-chr', default=None, 28 | help='Path to plink bfile of reference panel to use, not including ' +\ 29 | 'chromosome number. If not supplied, will be read from config file.') 30 | parser.add_argument('--print-snps', default=None, 31 | help='Path to set of potentially typed SNPs. If not supplied, will be read '+\ 32 | 'from config file.') 33 | parser.add_argument('--ld-blocks', default=None, 34 | help='Path to UCSC bed file containing one bed interval per LD block. If '+\ 35 | 'not supplied, will be read from config file.') 36 | 37 | print('=====') 38 | print(' '.join(sys.argv)) 39 | print('=====') 40 | args = parser.parse_args() 41 | config.add_default_params(args) 42 | pretty.print_namespace(args) 43 | print('=====') 44 | 45 | import sldp.preprocessannot as preprocessannot 46 | preprocessannot.main(args) 47 | 48 | if __name__ == "__main__": 49 | main() -------------------------------------------------------------------------------- /src/sldp/preprocessannot.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | 3 | def main(args): 4 | print('initializing...') 5 | import gzip, gc, time 6 | import numpy as np 7 | import pandas as pd 8 | import gprim.annotation as ga 9 | import gprim.dataset as gd 10 | import ypy.memo as memo 11 | 12 | # basic initialization 13 | mhc_bp = [25684587, 35455756] 14 | refpanel = gd.Dataset(args.bfile_chr) 15 | annots = [ga.Annotation(annot) for annot in args.sannot_chr] 16 | 17 | # read in ld blocks, remove MHC, read SNPs to print 18 | ldblocks = pd.read_csv(args.ld_blocks, delim_whitespace=True, header=None, 19 | names=['chr','start', 'end']) 20 | mhcblocks = (ldblocks.chr == 'chr6') & \ 21 | (ldblocks.end > mhc_bp[0]) & \ 22 | (ldblocks.start < mhc_bp[1]) 23 | ldblocks = ldblocks[~mhcblocks] 24 | print(len(ldblocks), 'loci after removing MHC') 25 | print_snps = pd.read_csv(args.print_snps, header=None, names=['SNP']) 26 | print_snps['printsnp'] = True 27 | print(len(print_snps), 'print snps') 28 | 29 | # process annotations 30 | for annot in annots: 31 | t0 = time.time() 32 | for c in args.chroms: 33 | print(time.time()-t0, ': loading chr', c, 'of', args.chroms) 34 | # get refpanel snp metadata for this chromosome 35 | snps = refpanel.bim_df(c) 36 | snps = ga.smart_merge(snps, refpanel.frq_df(c)[['SNP','MAF']]) 37 | print(len(snps), 'snps in refpanel', 38 | len(snps.columns), 'columns, including metadata') 39 | 40 | # read in annotation 41 | print('reading annot', annot.filestem()) 42 | names = annot.names(c) # names of annotations 43 | namesR = [n+'.R' for n in names] # names of results 44 | a = annot.sannot_df(c) 45 | if 'SNP' in a.columns: 46 | print('not a thinannot => doing full reconciliation of snps and allele coding') 47 | snps = ga.reconciled_to(snps, a, names, missing_val=0) 48 | else: 49 | print('detected thinannot, so assuming that annotation is synched to refpanel') 50 | snps = pd.concat([snps, a[names]], axis=1) 51 | 52 | # add information on which snps to print 53 | print('merging in print_snps') 54 | snps = pd.merge(snps, print_snps, how='left', on='SNP') 55 | snps.printsnp.fillna(False, inplace=True) 56 | snps.printsnp = snps.printsnp.astype(bool) 57 | 58 | # put on per-normalized-genotype scale 59 | if args.alpha != -1: 60 | print('scaling by maf according to alpha=', args.alpha) 61 | snps[names] = snps[names].values*\ 62 | np.power(2*snps.MAF.values*(1-snps.MAF.values), 63 | (1.+args.alpha)/2)[:,None] 64 | 65 | # make room for RV and make sure annotation values are treated as floats 66 | snps = pd.concat( 67 | [snps, pd.DataFrame(np.zeros(snps[names].shape), columns=namesR)], 68 | axis=1) 69 | snps[names] = snps[names].astype(float) 70 | 71 | # compute simple statistics about annotation 72 | print('computing basic statistics and writing') 73 | info = pd.DataFrame( 74 | columns=['M', 'M_5_50', 'sqnorm', 'sqnorm_5_50', 'supp', 'supp_5_50']) 75 | info['name'] = names 76 | info.set_index('name', inplace=True) 77 | info['M'] = len(snps)
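# (sqnorm is the squared Euclidean norm of each annotation and supp its L0 norm,
# i.e. the number of SNPs with a nonzero value; the _5_50 versions restrict to
# SNPs with MAF >= 5%. sldp later reads these to turn mu into the rf and h2v
# estimates reported in the .gwresults file.)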
78 | info['sqnorm'] = np.linalg.norm(snps[names], axis=0)**2 79 | info['supp'] = np.linalg.norm(snps[names], ord=0, axis=0) 80 | M_5_50 = (snps.MAF >= 0.05).values 81 | info['M_5_50'] = M_5_50.sum() 82 | info['sqnorm_5_50'] = np.linalg.norm(snps.loc[M_5_50, names], axis=0)**2 83 | info['supp_5_50'] = np.linalg.norm(snps.loc[M_5_50, names], ord=0, axis=0) 84 | info.to_csv(annot.info_filename(c), sep='\t') 85 | 86 | # process ldblocks one by one 87 | for ldblock, X, meta, ind in refpanel.block_data(ldblocks, c, meta=snps): 88 | if meta.printsnp.sum() == 0: 89 | print('no print-snps in this block') 90 | continue 91 | print(meta.printsnp.sum(), 'print-snps') 92 | if (meta[names] == 0).values.all(): 93 | print('annotations are all 0 in this block') 94 | snps.loc[ind, namesR] = 0 95 | else: 96 | mask = meta.printsnp.values 97 | V = meta[names].values 98 | XV = X.dot(V) 99 | snps.loc[ind[mask], namesR] = \ 100 | X[:,mask].T.dot(XV[:,-len(names):]) / X.shape[0] 101 | 102 | # write 103 | print('writing output') 104 | with gzip.open(annot.RV_filename(c), 'wt') as f: # text mode so to_csv can write 105 | snps.loc[snps.printsnp,['SNP','A1','A2']+names+namesR].to_csv( 106 | f, index=False, sep='\t') 107 | 108 | del snps; memo.reset(); gc.collect() 109 | 110 | print('done') 111 | -------------------------------------------------------------------------------- /src/sldp/preprocesspheno: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function, division 3 | import argparse, sys 4 | import sldp.config as config 5 | import ypy.pretty as pretty 6 | 7 | def main(): 8 | parser = argparse.ArgumentParser() 9 | # required arguments 10 | parser.add_argument('--sumstats-stem', required=True, 11 | help='path to sumstats.gz files, not including ".sumstats.gz" extension') 12 | 13 | # optional arguments 14 | parser.add_argument('--refpanel-name', default='KG3.95', 15 | help='suffix added to the directory created for storing output. '+\ 16 | 'Default is KG3.95, corresponding to 1KG Phase 3 reference panel '+\ 17 | 'processed with default parameters by preprocessrefpanel.py.') 18 | parser.add_argument('-no-M-5-50', default=False, action='store_true', 19 | help='Dont filter to SNPs with MAF >= 0.05 when estimating heritabilities') 20 | parser.add_argument('--set-h2g', default=None, type=float, 21 | help='Scale Z-scores to achieve this approximate heritability') 22 | parser.add_argument('--chroms', nargs='+', default=range(1,23), type=int, 23 | help='Space-delimited list of chromosomes to analyze. Default is 1..22') 24 | 25 | # configurable arguments 26 | parser.add_argument('--config', default=None, 27 | help='Path to a json file with values for other parameters. ' +\ 28 | 'Values in this file will be overridden by any values passed ' +\ 29 | 'explicitly via the command line.') 30 | parser.add_argument('--bfile-chr', default=None, 31 | help='Path to plink bfile of reference panel to use, not including ' +\ 32 | 'chromosome number. If not supplied, will be read from config file.') 33 | parser.add_argument('--svd-stem', default=None, 34 | help='Path to directory containing truncated svds of reference panel, by LD '+\ 35 | 'block, as produced by preprocessrefpanel.py. If not supplied, will be '+\ 36 | 'read from config file.') 37 | parser.add_argument('--print-snps', default=None, 38 | help='Path to set of potentially typed SNPs. If not supplied, will be read '+\
39 | 'from config file.') 40 | parser.add_argument('--ldscores-chr', default=None, 41 | help='Path to LD scores at a smallish set of SNPs (~1M). LD should be computed '+\ 42 | 'to all potentially causal snps. Used for heritability estimation. '+\ 43 | 'If not supplied, will be read from config file.') 44 | parser.add_argument('--ld-blocks', default=None, 45 | help='Path to UCSC bed file containing one bed interval per LD block. If '+\ 46 | 'not supplied, will be read from config file.') 47 | 48 | print('=====') 49 | print(' '.join(sys.argv)) 50 | print('=====') 51 | args = parser.parse_args() 52 | config.add_default_params(args) 53 | pretty.print_namespace(args) 54 | print('=====') 55 | 56 | import sldp.preprocesspheno as preprocesspheno 57 | preprocesspheno.main(args) 58 | 59 | if __name__ == '__main__': 60 | main() -------------------------------------------------------------------------------- /src/sldp/preprocesspheno.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | 3 | def main(args): 4 | print('initializing...') 5 | import gzip, os, gc, time 6 | import numpy as np 7 | import pandas as pd 8 | import gprim.annotation as ga 9 | import gprim.dataset as gd 10 | import ypy.fs as fs 11 | import ypy.memo as memo 12 | import sldp.weights as weights 13 | 14 | # read in refpanel, ld blocks, and svd snps 15 | refpanel = gd.Dataset(args.bfile_chr) 16 | ldblocks = pd.read_csv(args.ld_blocks, delim_whitespace=True, header=None, 17 | names=['chr','start', 'end']) 18 | print_snps = pd.read_csv(args.print_snps, header=None, names=['SNP']) 19 | print_snps['printsnp'] = True 20 | print(len(print_snps), 'svd snps') 21 | 22 | # read sumstats 23 | print('reading sumstats', args.sumstats_stem) 24 | ss = pd.read_csv(args.sumstats_stem+'.sumstats.gz', sep='\t') 25 | ss = ss[ss.Z.notnull() & ss.N.notnull()] 26 | print('{} snps, {}-{} individuals (avg: {})'.format( 27 | len(ss), np.min(ss.N), np.max(ss.N), np.mean(ss.N))) 28 | ss = pd.merge(ss, print_snps[['SNP']], on='SNP', how='inner') 29 | print(len(ss), 'snps typed') 30 | 31 | # read ld scores 32 | print('reading in ld scores') 33 | ld = pd.concat([pd.read_csv(args.ldscores_chr+str(c)+'.l2.ldscore.gz', 34 | delim_whitespace=True) 35 | for c in range(1,23)], axis=0) 36 | if args.no_M_5_50: 37 | M = sum([int(open(args.ldscores_chr+str(c)+'.l2.M').readline()) 38 | for c in range(1,23)]) 39 | else: 40 | M = sum([int(open(args.ldscores_chr+str(c)+'.l2.M_5_50').readline()) 41 | for c in range(1,23)]) 42 | print(len(ld), 'snps with ld scores') 43 | ssld = pd.merge(ss, ld, on='SNP', how='left') 44 | print(len(ssld), 'hm3 snps with sumstats after merge.') 45 | 46 | # estimate heritability using aggregate estimator 47 | def esth2g(ssld): 48 | meanchi2 = (ssld.Z**2).mean() 49 | meanNl2 = (ssld.N*ssld.L2).mean() 50 | sigma2g = (meanchi2 - 1)/meanNl2 51 | h2g = sigma2g * M 52 | K = M/meanNl2 # h2g = K (meanchi2 - 1) 53 | return h2g, sigma2g, meanchi2, K 54 | h2g, sigma2g, meanchi2, K = esth2g(ssld) 55 | h2g = max(h2g, 0.03) #0.03 is an arbitrarily chosen minimum 56 | print('mean chi2:', meanchi2) 57 | print('h2g estimated at:', h2g, 'sigma2g =', sigma2g) 58 | if args.set_h2g: 59 | print('scaling Z-scores to achieve h2g of', args.set_h2g) 60 | norm = meanchi2 / (1 + args.set_h2g/K) 61 | print('dividing all z-scores by', np.sqrt(norm)) 62 | ssld.Z /= np.sqrt(norm) 63 | h2g, sigma2g, _, _ = esth2g(ssld) 64 | print('h2g is now', h2g) 65 |
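# (note on the rescaling above: under the aggregate model E[chi2] = 1 + h2g/K,
# so dividing the Z-scores by sqrt(meanchi2/(1 + set_h2g/K)) makes the
# re-estimated h2g come out at approximately set_h2g)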
66 | # write h2g results to file 67 | dirname = args.sumstats_stem + '.' + args.refpanel_name 68 | fs.makedir(dirname) 69 | if 1 in args.chroms: 70 | print('writing info file') 71 | info = pd.DataFrame([ 72 | {'pheno':args.sumstats_stem.split('/')[-1], 73 | 'h2g':h2g, 74 | 'sigma2g':sigma2g, 75 | 'Nbar':ss.N.mean()}]) 76 | info.to_csv(dirname+'/info', sep='\t', index=False) 77 | 78 | # preprocess ld blocks 79 | t0 = time.time() 80 | for c in args.chroms: 81 | print(time.time()-t0, ': loading chr', c, 'of', args.chroms) 82 | # get refpanel snp metadata for this chromosome 83 | snps = refpanel.bim_df(c) 84 | snps = pd.merge(snps, print_snps, on='SNP', how='left') 85 | snps.printsnp.fillna(False, inplace=True) 86 | print(len(snps), 'snps in refpanel', len(snps.columns), 'columns, including metadata') 87 | 88 | # merge annot and sumstats 89 | print('reconciling') 90 | snps = ga.reconciled_to(snps, ss, ['Z'], 91 | othercolnames=['N'], missing_val=np.nan) 92 | snps['typed'] = snps.Z.notnull() 93 | snps['ahat'] = snps.Z / np.sqrt(snps.N) 94 | 95 | # initialize result dataframe 96 | # I = no weights 97 | # h = heuristic weights, using R_o 98 | snps['Winv_ahat_I'] = np.nan # = W_o^{-1} ahat_o 99 | snps['R_Winv_ahat_I'] = np.nan # = R_{*o} W_o^{-1} ahat_o 100 | snps['Winv_ahat_h'] = np.nan # = W_o^{-1} ahat_o 101 | snps['R_Winv_ahat_h'] = np.nan # = R_{*o} W_o^{-1} ahat_o 102 | 103 | # restrict to ld blocks in this chr and process them in chunks 104 | for ldblock, X, meta, ind in refpanel.block_data(ldblocks, c, meta=snps): 105 | if meta.printsnp.sum() == 0 or \ 106 | not os.path.exists(args.svd_stem+str(ldblock.name)+'.R.npz'): 107 | print('no svd snps found in this block') 108 | continue 109 | print(meta.printsnp.sum(), 'svd snps', meta.typed.sum(), 'typed snps') 110 | if meta.typed.sum() == 0: 111 | print('no typed snps found in this block') 112 | snps.loc[ind, [ 113 | 'R_Winv_ahat_I', 114 | 'R_Winv_ahat_h' 115 | ]] = 0 116 | continue 117 | R = np.load(args.svd_stem+str(ldblock.name)+'.R.npz') 118 | R2 = np.load(args.svd_stem+str(ldblock.name)+'.R2.npz') 119 | N = meta[meta.typed.values].N.mean() 120 | meta_svd = meta[meta.printsnp.values] 121 | 122 | # multiply ahat by the weights 123 | x_I = snps.loc[ind[meta.printsnp],'Winv_ahat_I'] = weights.invert_weights( 124 | R, R2, sigma2g, N, meta_svd.ahat.values, mode='Winv_ahat_I') 125 | x_h = snps.loc[ind[meta.printsnp],'Winv_ahat_h'] = weights.invert_weights( 126 | R, R2, sigma2g, N, meta_svd.ahat.values, mode='Winv_ahat_h') 127 | 128 | print('writing processed sumstats') 129 | with gzip.open('{}/{}.pss.gz'.format(dirname, c), 'wt') as f: # text mode for to_csv 130 | snps.loc[snps.printsnp,['N', 131 | 'Winv_ahat_I', 132 | 'Winv_ahat_h' 133 | ]].to_csv( 134 | f, index=False, sep='\t') 135 | 136 | del snps; memo.reset(); gc.collect() 137 | 138 | print('done') 139 | -------------------------------------------------------------------------------- /src/sldp/preprocessrefpanel: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function, division 3 | import argparse, gc, sys 4 | import numpy as np 5 | import pandas as pd 6 | import gprim.dataset as gd 7 | import ypy.memo as memo 8 | import ypy.fs as fs 9 | import ypy.pretty as pretty 10 | import sldp.config as config 11 | 12 | 13 | def main(): 14 | parser = argparse.ArgumentParser() 15 | # optional arguments 16 | parser.add_argument('--spectrum-percent', type=float, default=95, 17 | help='Determines how many eigenvectors are kept in the truncated SVD. ' +\
18 | 'A value of x means that x percent of the eigenspectrum will be kept. '+\ 19 | 'Default value: 95') 20 | parser.add_argument('--chroms', nargs='+', default=range(1,23), type=int, 21 | help='Space-delimited list of chromosomes to analyze. Default is 1..22') 22 | 23 | # configurable arguments 24 | parser.add_argument('--config', default=None, 25 | help='Path to a json file with values for other parameters. ' +\ 26 | 'Values in this file will be overridden by any values passed ' +\ 27 | 'explicitly via the command line.') 28 | parser.add_argument('--svd-stem', default=None, 29 | help='Path to directory in which output files will be stored. ' +\ 30 | 'If not supplied, will be read from config file.') 31 | parser.add_argument('--bfile-chr', default=None, 32 | help='Path to plink bfile of reference panel to use, not including ' +\ 33 | 'chromosome number. If not supplied, will be read from config file.') 34 | parser.add_argument('--print-snps', default=None, 35 | help='Path to set of potentially typed SNPs. If not supplied, will be read '+\ 36 | 'from config file.') 37 | parser.add_argument('--ld-blocks', default=None, 38 | help='Path to UCSC bed file containing one bed interval per LD block. If '+\ 39 | 'not supplied, will be read from config file.') 40 | 41 | print('=====') 42 | print(' '.join(sys.argv)) 43 | print('=====') 44 | args = parser.parse_args() 45 | config.add_default_params(args) 46 | pretty.print_namespace(args) 47 | print('=====') 48 | 49 | # basic initialization 50 | mhc = [25684587, 35455756] 51 | refpanel = gd.Dataset(args.bfile_chr) 52 | fs.makedir_for_file(args.svd_stem) 53 | 54 | # read in ld blocks, remove MHC, read SNPs to print 55 | ldblocks = pd.read_csv(args.ld_blocks, delim_whitespace=True, header=None, 56 | names=['chr','start', 'end']) 57 | mhcblocks = (ldblocks.chr == 'chr6') & (ldblocks.end > mhc[0]) & (ldblocks.start < mhc[1]) 58 | ldblocks = ldblocks[~mhcblocks] 59 | print(len(ldblocks), 'loci after removing MHC') 60 | print_snps = pd.read_csv(args.print_snps, header=None, names=['SNP']) 61 | print_snps['printsnp'] = True 62 | print(len(print_snps), 'print snps') 63 | 64 | for c in args.chroms: 65 | print('loading chr', c, 'of', args.chroms) 66 | # get refpanel snp metadata for this chromosome 67 | snps = refpanel.bim_df(c) 68 | snps = pd.merge(snps, print_snps, on='SNP', how='left') 69 | snps.printsnp.fillna(False, inplace=True) 70 | print(len(snps), 'snps in refpanel', len(snps.columns), 'columns, including metadata') 71 | 72 | # iterate through ld blocks and process each one 73 | for ldblock, X, meta, _ in refpanel.block_data(ldblocks, c, meta=snps): 74 | if meta.printsnp.sum() == 0: 75 | print('no print snps found in this block') 76 | continue 77 | 78 | # restrict X to have only potentially typed SNPs along one axis 79 | mask = meta.printsnp.values 80 | X_ = X[:,mask] 81 | 82 | def bestsvd(A): 83 | try: 84 | U_, svs_, _ = np.linalg.svd(A.T); svs_ = svs_**2 / A.shape[0] 85 | except np.linalg.LinAlgError: 86 | print('\t\tresorting to svd of XTX') 87 | U_, svs_, _ = np.linalg.svd(A.T.dot(A)); svs_ = svs_ / A.shape[0] 88 | return U_, svs_ 89 | 90 | # compute the (right-hand) SVD of R 91 | print('\tcomputing SVD of R_print') 92 | U_, svs_ = bestsvd(X_) 93 | k = np.argmax(np.cumsum(svs_)/svs_.sum() >= args.spectrum_percent / 100.)
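# (np.argmax returns the index of the first True entry, so k counts the leading
# components strictly before the one at which the cumulative spectrum first
# reaches the threshold; note that a tiny block whose first eigenvalue already
# crosses the threshold would get k == 0 under this convention)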
94 | print('\treduced rank of', k, 'out of', meta.printsnp.sum(), 'printed snps') 95 | np.savez('{}{}.R'.format(args.svd_stem, ldblock.name), U=U_[:,:k], svs=svs_[:k]) 96 | 97 | # compute the SVD of R2 98 | print('\tcomputing R2_print') 99 | R2 = X_.T.dot(X.dot(X.T)).dot(X_) / X.shape[0]**2 100 | print('\tcomputing SVD of R2_print') 101 | R2_U, R2_svs, _ = np.linalg.svd(R2) 102 | k = np.argmax(np.cumsum(R2_svs)/R2_svs.sum() >= args.spectrum_percent / 100.) 103 | print('\treduced rank of', k, 'out of', meta.printsnp.sum(), 'printed snps') 104 | np.savez('{}{}.R2'.format(args.svd_stem, ldblock.name), 105 | U=R2_U[:,:k], svs=R2_svs[:k]) 106 | 107 | del snps; memo.reset(); gc.collect() 108 | print('done') 109 | 110 | 111 | if __name__ == '__main__': 112 | main() 113 | -------------------------------------------------------------------------------- /src/sldp/sldp: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function, division 3 | import argparse, sys 4 | import sldp.config as config 5 | import ypy.pretty as pretty 6 | 7 | def main(): 8 | parser = argparse.ArgumentParser() 9 | # required arguments 10 | parser.add_argument('--outfile-stem', required=True, 11 | help='Path to an output file stem.') 12 | pheno = parser.add_mutually_exclusive_group(required=True) 13 | pheno.add_argument('--pss-chr', default=None, 14 | help='Path to .pss.gz file, without chromosome number or .pss.gz extension. '+\ 15 | 'This is the phenotype that SLDP will analyze.') 16 | pheno.add_argument('--sumstats-stem', default=None, 17 | help='Path to a .sumstats.gz file, not including ".sumstats.gz" extension. '+\ 18 | 'SLDP will process this into a set of .pss.gz files before running.') 19 | parser.add_argument('--sannot-chr', nargs='+', required=True, 20 | help='One or more (space-delimited) paths to gzipped annot files, without '+\ 21 | 'chromosome number or .sannot.gz/.RV.gz extension. These are the '+\ 22 | 'annotations that SLDP will analyze against the phenotype.') 23 | 24 | # optional arguments 25 | parser.add_argument('--verbose-thresh', default=0., type=float, 26 | help='Print additional information about each association studied with a '+\ 27 | 'p-value below this number. (Default is 0.) This includes: '+\ 28 | 'the covariance in each independent block of genome (.chunks files), '+\ 29 | 'and the coefficients required to residualize any background '+\ 30 | 'annotations out of the other annotations being analyzed.') 31 | parser.add_argument('-fastp', default=False, action='store_true', 32 | help='Estimate p-values fast (without permutation)') 33 | parser.add_argument('-bothp', default=False, action='store_true', 34 | help='Print both fastp p-values (as p_fast) and normal p-values. '+\ 35 | 'Takes precedence over fastp') 36 | parser.add_argument('--tell-me-stories', default=0., type=float, 37 | help='!!Experimental!! For associations with a p-value less than this number, '+\ 38 | 'print information about loci that may be promising to study. '+\ 39 | 'This will produce plots of (potentially overlapping) loci where the '+\ 40 | 'signed LD profile is highly correlated with the GWAS signal in a '+\ 41 | 'direction consistent with the global effect. '+\
42 | 'Default value is 0.') 43 | parser.add_argument('--story-corr-thresh', default=0.8, type=float, 44 | help='The threshold to use for correlation between Rv and alphahat in order '+\ 45 | 'for a locus to be considered worthy of a story') 46 | parser.add_argument('-more-stats', default=False, action='store_true', 47 | help='Print additional statistics about q in results file') 48 | parser.add_argument('--T', type=int, default=1000000, 49 | help='number of times to sign flip for empirical p-values. Default is 10^6.') 50 | parser.add_argument('--jk-blocks', type=int, default=300, 51 | help='Number of jackknife blocks to use. Default is 300.') 52 | parser.add_argument('--weights', default='Winv_ahat_h', 53 | help='which set of regression weights to use. Default is Winv_ahat_h, '+\ 54 | 'corresponding to weights described in Reshef et al. 2017.') 55 | parser.add_argument('--seed', default=None, type=int, 56 | help='Seed random number generator to a certain value. Off by default.') 57 | parser.add_argument('--stat', default='sum', 58 | help='*experimental* Which statistic to use for hypothesis testing. Options '+\ 59 | 'are: sum, medrank, or thresh.') 60 | parser.add_argument('--chi2-thresh', default=0, type=float, 61 | help='only use SNPs with a chi2 above this number for the regression') 62 | parser.add_argument('--chroms', nargs='+', default=range(1,23), type=int, 63 | help='Space-delimited list of chromosomes to analyze. Default is 1..22') 64 | 65 | # configurable arguments 66 | parser.add_argument('--config', default=None, 67 | help='Path to a json file with values for other parameters. ' +\ 68 | 'Values in this file will be overridden by any values passed ' +\ 69 | 'explicitly via the command line.') 70 | parser.add_argument('--background-sannot-chr', nargs='+', default=[], 71 | help='One or more (space-delimited) paths to gzipped annot files, without '+\ 72 | 'chromosome number or .sannot.gz extension. These are the annotations '+\ 73 | 'that SLDP will control for.') 74 | parser.add_argument('--svd-stem', default=None, 75 | help='Path to directory containing truncated svds of the reference panel, by '+\ 76 | 'LD block, as produced by preprocessrefpanel. '+\ 77 | 'If not supplied, will be read from config file.') 78 | parser.add_argument('--bfile-reg-chr', default=None, 79 | help='Path to plink bfile of reference panel to use, not including ' +\ 80 | 'chromosome number. This bfile should contain only regression SNPs '+\ 81 | '(as opposed to, e.g., all potentially causal SNPs). '+\ 82 | 'If not supplied, will be read from config file.') 83 | parser.add_argument('--ld-blocks', default=None, 84 | help='Path to UCSC bed file containing one bed interval per LD block. If '+\ 85 | 'not supplied, will be read from config file.')
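# Example invocation (every path below is hypothetical, for illustration only;
# see the wiki at https://github.com/yakirr/sldp/wiki for a real walkthrough):
#
#   sldp --sumstats-stem sumstats/mytrait \
#        --sannot-chr annots/myannot/myannot. \
#        --config config.json \
#        --outfile-stem results/mytrait.myannot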
86 | 87 | print('=====') 88 | print(' '.join(sys.argv)) 89 | print('=====') 90 | args = parser.parse_args() 91 | config.add_default_params(args) 92 | pretty.print_namespace(args) 93 | print('=====') 94 | 95 | preprocess_sumstats(args) 96 | preprocess_sannots(args) 97 | 98 | print('initializing...') 99 | import os, time 100 | import numpy as np 101 | import pandas as pd 102 | import scipy.stats as st 103 | import gprim.annotation as ga 104 | import gprim.dataset as gd 105 | import ypy.memo as memo 106 | import sldp.weights as weights 107 | import sldp.chunkstats as cs 108 | import sldp.storyteller as storyteller 109 | 110 | # basic initialization 111 | mhc_bp = [25684587, 35455756] 112 | refpanel = gd.Dataset(args.bfile_reg_chr) 113 | pheno_name = args.pss_chr.split('/')[-2].replace('.KG3.95','') 114 | if args.seed is not None: 115 | np.random.seed(args.seed) 116 | print('random seed:', args.seed) 117 | 118 | # read in names of background annotations and marginal annotations 119 | annots = [ga.Annotation(annot) for annot in args.sannot_chr] 120 | marginal_names = sum([a.names(22, RV=True) for a in annots], []) 121 | marginal_names = [n for n in marginal_names if '.R' in n] 122 | marginal_infos = pd.concat([a.info_df(args.chroms) for a in annots], axis=0) 123 | backgroundannots = [ga.Annotation(annot) for annot in args.background_sannot_chr] 124 | background_names = sum([a.names(22, True) for a in backgroundannots], []) 125 | background_names = [n for n in background_names if '.R' in n] 126 | print('background annotations:', background_names) 127 | print('marginal annotations:', marginal_names) 128 | if len(set(background_names) & set(marginal_names)) > 0: 129 | raise ValueError('the background annotation names and the marginal annotation '+\ 130 | 'names must be disjoint sets') 131 | 132 | # read in ldblocks and remove ones that overlap mhc 133 | ldblocks = pd.read_csv(args.ld_blocks, delim_whitespace=True, header=None, 134 | names=['chr','start', 'end']) 135 | mhcblocks = (ldblocks.chr == 'chr6') & \ 136 | (ldblocks.end > mhc_bp[0]) & \ 137 | (ldblocks.start < mhc_bp[1]) 138 | ldblocks = ldblocks[~mhcblocks] 139 | 140 | # read information about sumstats 141 | sumstats_info = pd.read_csv(args.pss_chr+'info', sep='\t') 142 | sigma2g = sumstats_info.loc[0].sigma2g 143 | h2g = sumstats_info.loc[0].h2g 144 | 145 | # read in sumstats and annots, and compute numerator and denominator of regression for 146 | # each ldblock. These will later be processed and aggregated
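# (for each LD block b, numerators[b] stacks v^T W^{-1} alphahat for every
# background and marginal annotation column v -- the columns here are already
# the signed LD profiles Rv -- and denominators[b] stacks the matching
# v^T W^{-1} Rv cross-moments; the final estimate is solve(sum of denominators,
# sum of numerators), i.e. weighted least squares of alphahat on Rv)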
147 | numerators = dict(); denominators = dict() 148 | t0 = time.time() 149 | for c in args.chroms: 150 | print(time.time()-t0, ': loading chr', c, 'of', args.chroms) 151 | 152 | # get refpanel snp metadata 153 | snps = refpanel.bim_df(c) 154 | print(len(snps), 'snps in refpanel', len(snps.columns), 'columns, including metadata') 155 | 156 | # read sumstats 157 | print('reading sumstats') 158 | ss = pd.read_csv(args.pss_chr+str(c)+'.pss.gz', sep='\t') 159 | print(np.isnan(ss[args.weights]).sum(), 'sumstats nans out of', len(ss)) 160 | snps['Winv_ahat'] = ss[args.weights] 161 | snps['N'] = ss.N 162 | snps['typed'] = snps.Winv_ahat.notnull() 163 | if args.chi2_thresh > 0: 164 | print('applying chi2 threshold of', args.chi2_thresh) 165 | snps.typed &= (ss.Winv_ahat_I**2 * ss.N > args.chi2_thresh) 166 | print(snps.typed.sum(), 'typed snps left') 167 | 168 | # read annotations 169 | print('reading annotations') 170 | for annot in backgroundannots+annots: 171 | mynames = [n for n in annot.names(22, RV=True) if '.R' in n] #names of annotations 172 | print(time.time()-t0, ': reading annot', annot.filestem()) 173 | print('adding', mynames) 174 | snps = pd.concat([snps, annot.RV_df(c)[mynames]], axis=1) 175 | if (~np.isfinite(snps[mynames].values)).sum() > 0: 176 | raise ValueError('There should be no nans in the postprocessed annotation. '+\ 177 | 'But there are '+str((~np.isfinite(snps[mynames].values)).sum())) 178 | 179 | # make sure things are in the order we think they are 180 | if (np.array(background_names+marginal_names) != 181 | snps.columns.values[-len(background_names+marginal_names):]).any(): 182 | raise ValueError('Merged annotations are not in the right order') 183 | 184 | # perform computations 185 | for ldblock, _, meta, ind in refpanel.block_data( 186 | ldblocks, c, meta=snps, genos=False, verbose=0): 187 | if meta.typed.sum() == 0 or \ 188 | not os.path.exists(args.svd_stem+str(ldblock.name)+'.R.npz'): 189 | # no typed snps/hm3 snps in this block. set num snps to 0
190 | ldblocks.loc[ldblock.name, 'M_H'] = 0 191 | continue 192 | if (meta[background_names+marginal_names] == 0).values.all(): 193 | # annotations are all 0 in this block 194 | ldblocks.loc[ldblock.name, 'M_H'] = 0 195 | continue 196 | # record the number of typed snps in this block, and start- and end- snp indices 197 | ldblocks.loc[ldblock.name, 'M_H'] = meta.typed.sum() 198 | ldblocks.loc[ldblock.name, 'snpind_begin'] = min(meta.index) # for verbose 199 | ldblocks.loc[ldblock.name, 'snpind_end'] = max(meta.index)+1 # for verbose 200 | 201 | # load regression weights and prepare for regression computation 202 | meta_t = meta[meta.typed.values] 203 | N = meta_t.N.mean() 204 | if args.weights == 'Winv_ahat_h' or args.weights == 'Winv_ahat_hlN': 205 | R = np.load(args.svd_stem+str(ldblock.name)+'.R.npz') 206 | R2 = None 207 | if len(R['U']) != len(meta): 208 | raise ValueError('regression wgts dimension must match regression snps') 209 | elif args.weights == 'Winv_ahat_h2' or args.weights == 'Winv_ahat': 210 | R = np.load(args.svd_stem+str(ldblock.name)+'.R.npz') 211 | R2 = np.load(args.svd_stem+str(ldblock.name)+'.R2.npz') 212 | if len(R['U']) != len(meta) or len(R2['U']) != len(meta): 213 | raise ValueError('regression wgts dimension must match regression snps') 214 | else: 215 | R = None; R2 = None 216 | 217 | # multiply ahat by the weights 218 | Winv_RV = weights.invert_weights( 219 | R, R2, sigma2g, N, meta[background_names+marginal_names].values, 220 | typed=meta.typed.values, mode=args.weights) 221 | 222 | numerators[ldblock.name] = \ 223 | (meta_t[background_names+marginal_names].T.dot( 224 | meta_t.Winv_ahat)).values/1e6 225 | denominators[ldblock.name] = meta_t[background_names+marginal_names].T.dot( 226 | Winv_RV[meta.typed.values]).values/1e6 227 | memo.reset() 228 | 229 | # get data for jackknifing 230 | print('jackknifing') 231 | chunk_nums, chunk_denoms, loo_nums, loo_denoms, chunkinfo = cs.collapse_to_chunks( 232 | ldblocks, 233 | numerators, 234 | denominators, 235 | args.jk_blocks) 236 | 237 | # compute final results 238 | global q, results 239 | results = pd.DataFrame() 240 | for i, name in enumerate(marginal_names): 241 | print(i, name) 242 | # metadata about v 243 | sqnorm = marginal_infos.loc[name[:-2], 'sqnorm'] 244 | supp = marginal_infos.loc[name[:-2], 'supp'] 245 | 246 | # estimate mu and initialize results row 247 | k = marginal_names.index(name) 248 | mu = cs.get_est(sum(chunk_nums), sum(chunk_denoms), k, len(background_names)) 249 | newrow_results = { 250 | 'pheno':pheno_name, 251 | 'annot':name} 252 | results = pd.concat([results, pd.DataFrame([newrow_results])], ignore_index=True) 253 | 254 | # compute q 255 | q, r, mux, muy = cs.residualize(chunk_nums, chunk_denoms, len(background_names), k) 256 | 257 | 258 | # p-values 259 | if args.bothp or not args.fastp: 260 | p_emp, z_emp = cs.signflip(q, args.T, printmem=True, mode=args.stat) 261 | results.loc[i,'z'] = z_emp 262 | results.loc[i,'p'] = p_emp 263 | if args.bothp or args.fastp: 264 | z_fast = np.sum(q)/np.linalg.norm(q) 265 | p_fast = st.chi2.sf(z_fast**2, 1) 266 | results.loc[i, 'z_fast'] = z_fast 267 | results.loc[i, 'p_fast'] = p_fast 268 | if args.fastp and not args.bothp: # if only fastp, rename the output columns 269 | results.loc[i,'p'] = results.loc[i,'p_fast'] 270 | results.loc[i,'z'] = results.loc[i,'z_fast'] 271 | results.drop(['p_fast','z_fast'], axis=1, inplace=True) 272 | 273 | # print verbose information if required 274 | if results.loc[i].p < args.verbose_thresh: 275 | fname = args.outfile_stem+'.'+pheno_name+'.'+name
276 | print('writing verbose results to', fname) 277 | chunkinfo['q'] = q 278 | chunkinfo['r'] = r 279 | chunkinfo.to_csv(fname+'.chunks', sep='\t', index=False) 280 | coeffs = pd.DataFrame() 281 | coeffs['annot'] = background_names 282 | coeffs['mux'] = mux 283 | coeffs['muy'] = muy 284 | coeffs.to_csv(fname+'.coeffs', sep='\t', index=False) 285 | 286 | # nominate interesting loci if desired 287 | if results.loc[i].p < args.tell_me_stories: 288 | storyteller.write(args.outfile_stem+'.'+name+'.loci', 289 | args, name, background_names, mux, muy, results.loc[i].z, 290 | corr_thresh=args.story_corr_thresh) 291 | 292 | # jackknife (computed before the optional extra stats, which use se) 293 | se = cs.jackknife_se(mu, loo_nums, loo_denoms, k, len(background_names)) 294 | results.loc[i,'mu'] = mu 295 | results.loc[i,'se(mu)'] = se 296 | 297 | # add more output if desired 298 | if args.more_stats: 299 | results.loc[i,'qkurtosis'] = st.kurtosis(q) 300 | results.loc[i,'qstd'] = np.std(q) 301 | results.loc[i,'p_jk'] = st.chi2.sf((mu/se)**2,1) 302 | results.loc[i,'sqnorm'] = sqnorm 303 | 304 | # add estimates of rf, h2v, and other associated quantities 305 | M = marginal_infos.loc[name[:-2],'M'] 306 | results.loc[i,'h2g'] = h2g 307 | results.loc[i,'rf'] = mu * np.sqrt( 308 | sqnorm / h2g) 309 | results.loc[i,'h2v/h2g'] = results.loc[i].rf**2 - \ 310 | results.loc[i,'se(mu)']**2 * sqnorm / (M*sigma2g) 311 | results.loc[i,'h2v'] = results.loc[i,'h2v/h2g'] * h2g 312 | results.loc[i,'supp(v)/M'] = supp/M 313 | results.to_csv(args.outfile_stem + '.gwresults', sep='\t', index=False, na_rep='nan') # write progressively after each annotation 314 | 315 | print(results) 316 | print('writing results to', args.outfile_stem + '.gwresults') 317 | results.to_csv(args.outfile_stem + '.gwresults', sep='\t', index=False, na_rep='nan') 318 | print('done') 319 | 320 | # preprocess any sumstats that need preprocessing 321 | def preprocess_sumstats(args): 322 | import os 323 | if args.pss_chr is None: 324 | unprocessed_chroms = [ 325 | c for c in args.chroms 326 | if not os.path.exists( 327 | args.sumstats_stem + '.' + args.refpanel_name + '/'+str(c)+'.pss.gz') 328 | ] 329 | if len(unprocessed_chroms) > 0: 330 | print('Preprocessing', args.sumstats_stem+'.sumstats.gz', 'at chromosomes', unprocessed_chroms) 331 | if args.config is None: 332 | raise ValueError('automatic pre-processing of a sumstats file requires '+\ 333 | 'specification of a config file; otherwise I dont know what '+\ 334 | 'parameters to use. If you want, you can preprocess the sumstats '+\ 335 | 'without a config file by running preprocesspheno manually') 336 | print('Using config file', args.config, 'and default options') 337 | 338 | # run the command 339 | import sldp.preprocesspheno, copy 340 | args_ = copy.copy(args) 341 | args_.no_M_5_50 = False 342 | args_.set_h2g = None 343 | args_.chroms = unprocessed_chroms 344 | sldp.preprocesspheno.main(args_) 345 | 346 | # modify args to reflect the existence of the pss-chr files 347 | args.pss_chr = args.sumstats_stem + '.' + args.refpanel_name + '/'
348 | args.sumstats_stem = None 349 | print('== finished preprocessing sumstats ==') 350 | 351 | # preprocess any annotations that need preprocessing 352 | def preprocess_sannots(args): 353 | import os 354 | 355 | for sannot in args.sannot_chr: 356 | unprocessed_chroms = [ 357 | c for c in args.chroms 358 | if not (os.path.exists(sannot + str(c) + '.RV.gz') and 359 | os.path.exists(sannot + str(c) + '.info')) 360 | ] 361 | if len(unprocessed_chroms) > 0: 362 | print('Preprocessing', sannot, 'at chromosomes', unprocessed_chroms) 363 | if args.config is None: 364 | raise ValueError('automatic pre-processing of an annotation '+\ 365 | 'requires specification of a config file; otherwise I dont know what '+\ 366 | 'parameters to use. If you want, you can preprocess the annotation '+\ 367 | 'without a config file by running preprocessannot manually') 368 | print('Using config file', args.config, 'and default options') 369 | 370 | # run preprocessing command 371 | import sldp.preprocessannot, copy 372 | args_ = copy.copy(args) 373 | args_.alpha = -1 374 | args_.chroms = unprocessed_chroms 375 | sldp.preprocessannot.main(args_) 376 | 377 | print('== finished preprocessing annotation', sannot) 378 | 379 | 380 | if __name__ == '__main__': 381 | main() 382 | -------------------------------------------------------------------------------- /src/sldp/storyteller.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | import numpy as np 3 | import pandas as pd 4 | import gprim.annotation as ga 5 | import gprim.dataset as gd 6 | import ypy.fs as fs 7 | import matplotlib.pyplot as plt 8 | 9 | # find windows with genome-wide significant SNPs that are consistent with the global signal 10 | def write(folder, args, name, background_names, mux, muy, z, corr_thresh=0.8): 11 | print('STORYTELLING for ', name, 'z=', z) 12 | refpanel = gd.Dataset(args.bfile_reg_chr) 13 | annot = [ga.Annotation(a) for a in args.sannot_chr 14 | if name in ga.Annotation(a).names(22, RV=True)][0] 15 | 16 | backgroundannots = [ga.Annotation(a) for a in args.background_sannot_chr] 17 | print('focal annotation columns:', annot.names(22, True)) 18 | print('background annotations:', background_names) 19 | 20 | # get refpanel snp metadata 21 | print('re-reading snps') 22 | snps = pd.concat([refpanel.bim_df(c) for c in args.chroms], axis=0) 23 | 24 | # read sumstats 25 | print('re-reading sumstats') 26 | ss = pd.concat([ 27 | pd.read_csv(args.pss_chr+str(c)+'.pss.gz', sep='\t', usecols=['N','Winv_ahat_I']) 28 | for c in args.chroms]) 29 | snps['ahat'] = ss['Winv_ahat_I'] 30 | snps['N'] = ss['N'] 31 | del ss 32 | 33 | # read annotations 34 | print('re-reading background annotations') 35 | for a in backgroundannots: 36 | mynames = [n for n in a.names(22, RV=True) if '.R' in n] #names of annotations 37 | snps = pd.concat([snps, 38 | pd.concat([a.RV_df(c)[mynames] for c in args.chroms], axis=0) 39 | ], axis=1) 40 | 41 | print('reading focal annotation') 42 | snps = pd.concat([snps, 43 | pd.concat([annot.RV_df(c)[name] for c in args.chroms], axis=0) 44 | ], axis=1) 45 | 46 | print('residualizing background out of focal') 47 | A = snps[background_names] 48 | snps['chi2'] = snps.N * snps.ahat**2 49 | snps['Rv'] = snps[name] 50 | snps['ahat_resid'] = snps.ahat - A.values.dot(muy) 51 | snps['Rv_resid'] = snps.Rv - A.values.dot(mux) 52 | snps['typed'] = snps.ahat_resid.notnull() 53 | snps = snps[snps.typed].reset_index(drop=True)
54 | snps['significant'] = snps.chi2 > 29.716785 # chi2_1 threshold corresponding to p = 5e-8 (genome-wide significance) 55 | print(snps.significant.sum(), 'genome-wide significant SNPs') 56 | 57 | print('searching for good windows') 58 | # get endpoints of windows 59 | stride = 20 60 | windowsize_in_strides = 5 61 | windowsize = stride * windowsize_in_strides 62 | 63 | # find all starting points of windows containing GWAS-sig SNPs 64 | starts = np.concatenate([ 65 | [ int(i/stride)*stride - k*stride 66 | for k in range(0, windowsize_in_strides)] 67 | for i in np.where(snps.significant)[0]]) 68 | starts = np.array(sorted(list(set(starts)))) 69 | # compute corresponding endpoints 70 | ends = starts + windowsize 71 | 72 | # truncate any windows that extend past the ends of the genome 73 | starts = starts[ends < len(snps)] 74 | ends = ends[ends < len(snps)] 75 | ends = ends[starts >= 0] 76 | starts = starts[starts >= 0] 77 | 78 | print(len(starts), 'windows with GWAS hits found') 79 | 80 | # compute correlations 81 | numbers = pd.DataFrame( 82 | np.array([[i,j, 83 | np.max(snps.iloc[i:j].chi2), 84 | np.corrcoef(snps.iloc[i:j].Rv_resid, snps.iloc[i:j].ahat_resid)[0,1]] 85 | for i,j in zip(starts, ends)]), 86 | columns=['start','end','maxchi2','corr']) 87 | 88 | # keep only cases with strong correlations in the right direction 89 | numbers = numbers[numbers['corr']**2 >= corr_thresh] 90 | numbers = numbers[np.sign(numbers['corr']) == np.sign(z)] 91 | print(len(numbers), 'windows with GWAS hits and squared correlation with Rv >=', 92 | corr_thresh, 'in the right direction') 93 | for i,j in zip(numbers.start, numbers.end): 94 | i = int(i); j = int(j) 95 | print('saving', i,j) 96 | c = snps.iloc[i].CHR 97 | start = snps.iloc[i].BP 98 | end = snps.iloc[j].BP 99 | plt.figure() 100 | plt.scatter(snps.iloc[i:j].Rv_resid, 101 | snps.iloc[i:j].ahat_resid * np.sqrt(snps.iloc[i:j].N)) 102 | plt.title('chr{}:{}-{}'.format(c, start, end)) 103 | plt.xlabel(r'residual $Rv$') 104 | plt.ylabel(r'residual $Z$') 105 | 106 | filename = '{}/chr{}:{}-{}.pdf'.format(folder, c, start, end) 107 | fs.makedir_for_file(filename) 108 | plt.savefig(filename) 109 | plt.close() 110 | -------------------------------------------------------------------------------- /src/sldp/weights.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | import numpy as np 3 | 4 | # R: SVD of (R, restricted to regression SNPs) 5 | # R2: SVD of (R^2, restricted to regression SNPs) 6 | def invert_weights(R, R2, sigma2g, N, x, typed=None, mode='Winv_ahat_h'): 7 | if typed is None: 8 | typed = np.isfinite(x.reshape((len(x),-1)).sum(axis=1)) 9 | 10 | # trivial weights 11 | if mode == 'Winv_ahat_I': 12 | result = x 13 | # heuristic, with large-N approximation 14 | elif mode == 'Winv_ahat_hlN': 15 | U = R['U'][typed,:]; svs=R['svs'] 16 | result = np.full(x.shape, np.nan) 17 | result[typed] = (U/(svs**2)).dot(U.T.dot(x[typed])) 18 | # heuristic, no large N approximation, using (R_o)^2 to approximate (R2)_o 19 | elif mode == 'Winv_ahat_h': 20 | U = R['U'][typed,:]; svs=R['svs'] 21 | result = np.full(x.shape, np.nan) 22 | result[typed] = (U/(sigma2g*svs**2+svs/N)).dot(U.T.dot(x[typed])) 23 | # heuristic, no large N approximation, using R2 instead of R 24 | elif mode == 'Winv_ahat_h2': 25 | U = R2['U'][typed,:]; svs=R2['svs'] 26 | result = np.full(x.shape, np.nan) 27 | result[typed] = (U/(sigma2g*svs+np.sqrt(svs)/N)).dot(U.T.dot(x[typed])) 28 | # exact 29 | elif mode == 'Winv_ahat': 30 | R_ = (R['U'][typed]*R['svs']).dot(R['U'][typed].T) 31 | R2_
= (R2['U'][typed]*R2['svs']).dot(R2['U'][typed].T) 32 | W = R_/N + sigma2g*R2_ 33 | U, svs, _ = np.linalg.svd(W) 34 | k = np.argmax(np.cumsum(svs)/svs.sum() >= 0.95) 35 | # k = np.argmax(svs[:-1]/svs[1:] >= 1e5)+1 36 | print(R['U'].shape, R2['U'].shape, 'k=',k) 37 | U = U[:,:k]; svs = svs[:k] 38 | result = np.full(x.shape, np.nan) 39 | result[typed] = (U/svs).dot(U.T.dot(x[typed])) 40 | return result 41 | --------------------------------------------------------------------------------
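As a quick sanity check of the weighting interface, here is a minimal synthetic sketch (the array shapes and the `U`/`svs` keys follow the `np.savez` calls in `preprocessrefpanel`; all numbers are made up for illustration):

```python
import numpy as np
import sldp.weights as weights

rng = np.random.default_rng(0)

# rank-20 truncated SVD of an LD matrix over 100 regression SNPs
U, _ = np.linalg.qr(rng.standard_normal((100, 20)))   # orthonormal columns
svs = np.sort(rng.uniform(0.1, 1.0, size=20))[::-1]   # leading eigenvalues of R
R = {'U': U, 'svs': svs}

# marginal effect estimates, with a few untyped SNPs marked as nan
ahat = rng.standard_normal(100) / np.sqrt(1000.)
ahat[:10] = np.nan

# apply the heuristic weights; untyped entries stay nan in the output
x = weights.invert_weights(R, None, sigma2g=1e-6, N=1000., x=ahat,
                           mode='Winv_ahat_h')
print(np.isnan(x).sum(), 'untyped SNPs left unweighted')
```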