├── Makefile ├── README.md ├── _config.yml ├── bin └── apex_Linux_x86_64.gz ├── doc ├── README.md └── logo.svg ├── get_dependencies.sh └── src ├── BRENT ├── brent_fmin.cpp └── brent_fmin.hpp ├── Main.cpp ├── Main.hpp ├── apexLMM.cpp ├── apexLMM.hpp ├── cisMapping.cpp ├── cisMapping.hpp ├── dataParser.hpp ├── eigenData.cpp ├── eigenData.hpp ├── fitModels.cpp ├── fitModels.hpp ├── fitUtils.cpp ├── fitUtils.hpp ├── genotypeData.cpp ├── genotypeData.hpp ├── htsWrappers.cpp ├── htsWrappers.hpp ├── mapID.cpp ├── mapID.hpp ├── mathStats.cpp ├── mathStats.hpp ├── metaAnalysis.cpp ├── metaAnalysis.hpp ├── miscUtils.cpp ├── miscUtils.hpp ├── processVCOV.cpp ├── processVCOV.hpp ├── readBED.cpp ├── readBED.hpp ├── readTable.cpp ├── readTable.hpp ├── scanSignals.cpp ├── setOptions.cpp ├── setOptions.hpp ├── setRegions.cpp ├── setRegions.hpp ├── transMapping.cpp ├── transMapping.hpp └── xzFormat.hpp /Makefile: -------------------------------------------------------------------------------- 1 | VERSION=0.3 2 | 3 | CURL_STATIC_LIBS := $(shell curl-config --static-libs) 4 | GSSAPI_STATIC_LIBS := $(shell krb5-config --libs gssapi) 5 | 6 | CXXFLAGS= -static -std=c++0x -fopenmp -fno-math-errno -flto -O3 -pipe -s -fuse-ld=gold -I$(cdir)/src/eigen -I$(cdir)/src/spectra/include -I$(cdir)/src/boost -I$(cdir)/src/htslib/include -L$(cdir)/src/htslib/lib -I$(cdir)/src/BRENT -I$(cdir)/src/shrinkwrap/include 7 | LDFLAGS= -lhts -static -fopenmp -flto -fuse-linker-plugin -O3 -pipe -Wno-unused-function -lgcov -lpthread -lz -lm -llzma -I$(cdir)/src/boost -I$(cdir)/src/htslib/include -L$(cdir)/src/htslib/lib -Wl,-rpath=$(cdir)/src/htslib/lib -I$(cdir)/src/BRENT -I$(cdir)/src/shrinkwrap/include -Wl,--no-as-needed 8 | cppsrc = $(wildcard src/*.cpp src/BRENT/*.cpp) 9 | csrc = $(wildcard src/*.c) 10 | 11 | objs = $(cppsrc:.cpp=.o) $(csrc:.c=.o) 12 | 13 | cdir = ${CURDIR} 14 | 15 | bin/apex: $(objs) 16 | @mkdir -p $(@D) 17 | $(CXX) -o $@ $^ $(LDFLAGS) 18 | 19 | .PHONY: clean 20 | 21 | clean: 22 | rm -f 
*.o src/*.o src/BRENT/*.o 23 | 24 | 25 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # 3 | 4 | APEX (All-in-one Package for Efficient Xqtl analysis) is a toolkit for analysis of molecular quantitative trait loci (xQTLs), such as mRNA expression or methylation levels. It performs cis and trans analysis, single- or multiple-variant analysis, and single-study or meta-analysis. 5 | 6 | To install APEX from source or download precompiled binaries, [**see installation guide below**](#installation). 7 | 8 | ## Analysis modes and documentation 9 | - [**xQTL factor analysis.**](https://corbinq.github.io/apex/doc/mode_factor/) `./apex factor {options}` 10 | - Rapid high-dimensional factor analysis to account for technical and biological variation in molecular trait data. 11 | - [**xQTL linear mixed model preprocessing.**](https://corbinq.github.io/apex/doc/mode_lmm/) `./apex lmm {options}` 12 | - Precompute null model terms for rapid LMM association analysis. 13 | - [**cis-xQTL analysis.**](https://corbinq.github.io/apex/doc/mode_cis/) `./apex cis {options}` 14 | - Analyze xQTL associations within a genomic window of each molecular trait. For example, analyze variant associations within 1 megabase (Mb) of each gene's transcription start site (TSS). 15 | - [**trans-xQTL analysis.**](https://corbinq.github.io/apex/doc/mode_trans/) `./apex trans {options}` 16 | - Analyze xQTL genome-wide associations for each molecular trait, optionally excluding variants within the cis region. 17 | - [**xQTL meta-analysis.**](https://corbinq.github.io/apex/doc/mode_meta/) `./apex meta {options}` 18 | - Meta-analyze xQTL association summary statistics across one or multiple studies. 
19 | - [**xQTL summary data storage.**](https://corbinq.github.io/apex/doc/mode_store/) `./apex store {options}` 20 | - Store genotypic variance-covariance (LD) data matrices for data-sharing or meta-analysis. These files enable accurate multiple-variant analysis from xQTL summary statistics, including joint and conditional regression analysis, Bayesian finemapping, aggregation tests, and penalized regression / variable selection. 21 | 22 | ## Additional documentation pages 23 | See above for documentation pages on each analysis mode (factor, lmm, cis, trans, meta, store). Additional documentation pages are listed below. 24 | - **[Input file formats for individual-level analysis](https://corbinq.github.io/apex/doc/input_files/)** in modes `apex cis`, `apex trans`, and `apex store`. 25 | - **APEX xQTL summary association data file formats** generated by `apex store` and used in `apex meta`. *Documentation coming soon.* 26 | - **Bayesian finemapping using APEX xQTL summary association data files** generated by `apex store`. *Documentation coming soon.* 27 | 28 | ## Release notes 29 | - apex-factor 30 | - Currently, apex-factor assumes sample size is less than the number of molecular traits. In the next release, we will provide optimizations for sample size > number of traits. 31 | - Currently, apex-factor calculates inferred covariates in a format suitable for OLS or GRM LMMs. To model factor covariates as random effects, inferred covariates must be calculated internally in apex-cis and apex-trans. In the next release, apex-factor will provide inferred factor covariates in a format suitable for low-rank LMMs in apex-cis and apex-trans. 32 | - apex-lmm 33 | - Currently, apex-lmm supports LMMs with GRMs, but not low-rank LMMs. Low-rank LMMs can be estimated directly in apex-cis and apex-trans. In the next release, apex-lmm will also support low-rank LMMs. 
34 | - apex-cis and apex-trans 35 | - Currently, precomputed variance component estimates from apex-lmm can be used in apex-cis and apex-trans, but LMM trait residuals and genotypic variance interpolation points are precalculated internally in apex-cis and apex-trans. In the next release, apex-cis and apex-trans will accept precomputed LMM trait residuals and genotypic variance interpolation points from apex-lmm. 36 | - apex-store 37 | - Currently, apex-store provides adjusted LD for OLS models. In the next release, apex-store will provide options for adjusted LD in LMMs. 38 | - apex-meta 39 | - Currently, apex-meta supports only cis-xQTL meta-analysis. In the next release, apex-meta will also support trans-xQTL meta-analysis. 40 | 41 | ## Installation 42 | APEX is primarily written in C++. To compile APEX from source, run: 43 | ``` 44 | git clone https://github.com/corbinq/apex.git 45 | cd apex 46 | bash get_dependencies.sh 47 | make 48 | ``` 49 | Precompiled binaries are also available for 64-bit Linux systems as follows: 50 | ``` 51 | git clone https://github.com/corbinq/apex.git 52 | cd apex/bin 53 | gunzip apex_Linux_x86_64.gz 54 | mv apex_Linux_x86_64 apex && chmod +x apex 55 | ./apex --help 56 | ``` 57 | 58 | ## Software dependencies 59 | 60 | - [Eigen C++ matrix library](http://eigen.tuxfamily.org/) 61 | - [HTSlib C library](http://www.htslib.org/) 62 | - [github.com/jonathonl/shrinkwrap](https://github.com/jonathonl/shrinkwrap) 63 | - [github.com/Taywee/args](https://github.com/Taywee/args) 64 | - [Boost math C++ library](https://www.boost.org/) 65 | 66 | ## Contact, feedback, bug reports 67 | APEX is a work in progress and under active development. We appreciate feedback and bug reports, and welcome any questions or feature requests. 68 | **Contact:** . 69 | 70 | ## Citation 71 | If you use APEX, please cite `Quick, C; Guan, L; Li, Z; Li, X; Dey, R; Liu, Y; Scott, L; Lin, X (2020). 
"A versatile toolkit for molecular QTL mapping and meta-analysis at scale" URL: https://github.com/corbinq/apex` (preprint pending). Thanks! 72 | 73 | ## Acknowledgements 74 | APEX is developed by Corbin Quick and Li Guan. Special thanks to 75 | 76 | - Yaowu Liu 77 | - Xihao Li 78 | - Zilin Li 79 | - Rounak Dey 80 | - Laura Scott 81 | - Hyun Min Kang 82 | - Xihong Lin 83 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | plugins: 2 | - jekyll-relative-links 3 | relative_links: 4 | enabled: true 5 | collections: true 6 | include: 7 | - README.md 8 | - doc/README.md 9 | - doc/logo.svg 10 | - doc/benchmarking/README.md 11 | - doc/input_files/README.md 12 | - doc/mode_cis/README.md 13 | - doc/mode_trans/README.md 14 | - doc/mode_meta/README.md 15 | -------------------------------------------------------------------------------- /bin/apex_Linux_x86_64.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/corbinq/apex/63b605e00766141693a69eee6d630ad8ec3b9de0/bin/apex_Linux_x86_64.gz -------------------------------------------------------------------------------- /doc/README.md: -------------------------------------------------------------------------------- 1 | 2 | # 3 | 4 | APEX is a toolkit for analysis of molecular quantitative trait loci (xQTLs), such as mRNA expression or methylation levels. It performs cis and trans analysis, single- or multiple-variant analysis, and single-study or meta-analysis. 5 | 6 | To install APEX from source or download precompiled binaries, [**see installation guide below**](#installation). 7 | 8 | ## Analysis modes and documentation 9 | - [**xQTL factor analysis.**](/apex/doc/mode_factor/) `./apex factor {options}` 10 | - Rapid high-dimensional factor analysis to account for technical and biological variation in molecular trait data. 
11 | - [**xQTL linear mixed model preprocessing.**](/apex/doc/mode_lmm/) `./apex lmm {options}` 12 | - Precompute null model terms for rapid LMM association analysis. 13 | - [**cis-xQTL analysis.**](/apex/doc/mode_cis/) `./apex cis {options}` 14 | - Analyze xQTL associations within a genomic window of each molecular trait. For example, analyze variant associations within 1 megabase (Mb) of each gene's transcription start site (TSS). 15 | - [**trans-xQTL analysis.**](/apex/doc/mode_trans/) `./apex trans {options}` 16 | - Analyze xQTL genome-wide associations for each molecular trait, optionally excluding variants within the cis region. 17 | - [**xQTL meta-analysis.**](/apex/doc/mode_meta/) `./apex meta {options}` 18 | - Meta-analyze xQTL association summary statistics across one or multiple studies. 19 | - [**xQTL summary data storage.**](/apex/doc/mode_store/) `./apex store {options}` 20 | - Store genotypic variance-covariance (LD) data matrices for data-sharing or meta-analysis. These files enable accurate multiple-variant analysis from xQTL summary statistics, including joint and conditional regression analysis, Bayesian finemapping, aggregation tests, and penalized regression / variable selection. 21 | 22 | ## Additional documentation pages 23 | See above for documentation pages on each analysis mode (factor, lmm, cis, trans, meta, store). Additional documentation pages are listed below. 24 | - **[Input file formats for individual-level analysis](/apex/doc/input_files)** in modes `apex cis`, `apex trans`, and `apex store`. 25 | - **APEX xQTL summary association data file formats** generated by `apex store` and used in `apex meta`. *Documentation coming soon.* 26 | - **Bayesian finemapping using APEX xQTL summary association data files** generated by `apex store`. *Documentation coming soon.* 27 | 28 | ## Release notes 29 | - apex-factor 30 | - Currently, apex-factor assumes sample size is less than the number of molecular traits. 
In the next release, we will provide optimizations for sample size > number of traits. 31 | - Currently, apex-factor calculates inferred covariates in a format suitable for OLS or GRM LMMs. To model factor covariates as random effects, inferred covariates must be calculated internally in apex-cis and apex-trans. In the next release, apex-factor will provide inferred factor covariates in a format suitable for low-rank LMMs in apex-cis and apex-trans. 32 | - apex-lmm 33 | - Currently, apex-lmm supports LMMs with GRMs, but not low-rank LMMs. Low-rank LMMs can be estimated directly in apex-cis and apex-trans. In the next release, apex-lmm will also support low-rank LMMs. 34 | - apex-cis and apex-trans 35 | - Currently, pre-computed variance component estimates from apex-lmm can be used in apex-cis and apex-trans, but LMM trait residuals and genotypic variance interpolation points are precalculated internally in apex-cis and apex-trans. In the next release, apex-cis and apex-trans will accept precomputed LMM trait residuals and genotypic variance interpolation points from apex-lmm. 36 | - apex-store 37 | - Currently, apex-store provides adjusted LD for OLS models. In the next release, apex-store will provide options for adjusted LD in LMMs. 38 | - apex-meta 39 | - Currently, apex-meta supports only cis-xQTL meta-analysis. In the next release, apex-meta will also support trans-xQTL meta-analysis. 40 | 41 | ## Installation 42 | APEX is primarily written in C++. 
To compile APEX from source, run: 43 | ``` 44 | git clone https://github.com/corbinq/apex.git 45 | cd apex 46 | bash get_dependencies.sh 47 | make 48 | ``` 49 | Precompiled binaries are also available for 64-bit Linux systems as follows: 50 | ``` 51 | git clone https://github.com/corbinq/apex.git 52 | cd apex/bin 53 | gunzip apex_Linux_x86_64.gz 54 | mv apex_Linux_x86_64 apex && chmod +x apex 55 | ./apex --help 56 | ``` 57 | 58 | ## Software dependencies 59 | 60 | - [Eigen C++ matrix library](http://eigen.tuxfamily.org/) 61 | - [HTSlib C library](http://www.htslib.org/) 62 | - [github.com/jonathonl/shrinkwrap](https://github.com/jonathonl/shrinkwrap) 63 | - [github.com/Taywee/args](https://github.com/Taywee/args) 64 | - [Boost math C++ library](https://www.boost.org/) 65 | 66 | ## Contact, feedback, bug reports 67 | APEX is a work in progress, and under active development. We appreciate feedback and bug reports, and welcome any questions or feature requests. **Contact:** . 68 | 69 | ## Citation 70 | If you use APEX, please cite `Quick, C; Guan, L; Lin, X (2020). URL: https://github.com/corbinq/apex` (preprint pending). Thanks! 71 | 72 | ## Acknowledgements 73 | APEX is developed by Corbin Quick and Li Guan. 
Special thanks to 74 | 75 | - Yaowu Liu 76 | - Xihao Li 77 | - Zilin Li 78 | - Rounak Dey 79 | - Laura Scott 80 | - Hyun Min Kang 81 | - Xihong Lin 82 | 83 | 84 | 85 | -------------------------------------------------------------------------------- /get_dependencies.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | # args library for processing command line arguments 5 | git clone https://github.com/Taywee/args.git src/args 6 | 7 | # Eigen library for linear algebra 8 | git clone https://gitlab.com/libeigen/eigen.git src/eigen 9 | 10 | # Spectra library for sparse, high-dimensional eigenvalue problems 11 | #git clone https://github.com/yixuan/spectra.git src/spectra 12 | git clone --single-branch --branch 0.9.x https://github.com/yixuan/spectra.git src/spectra 13 | 14 | git clone https://github.com/corbinq/shrinkwrap.git src/shrinkwrap 15 | # cd src/shrinkwrap && git reset --hard 5be49b4b62b86904191e7528040b296281fe5c26 && cd - 16 | 17 | # Brent's algorithm 18 | 19 | # mkdir -p BRENT && cd BRENT 20 | # wget --no-check-certificate https://people.sc.fsu.edu/~jburkardt/cpp_src/brent/brent.hpp 21 | # wget --no-check-certificate https://people.sc.fsu.edu/~jburkardt/cpp_src/brent/brent.cpp 22 | # cd - 23 | 24 | 25 | # try loading modules (whereis prints only "name:" when nothing is found, so more than 1 byte after the colon means a hit) 26 | if [ "$(whereis module | cut -d: -f2 | wc -c)" -gt 1 ]; then 27 | module load boost 28 | # module load htslib 29 | fi 30 | 31 | # boost::math library (install locally only when whereis finds no system copy) 32 | if [ "$(whereis boost | cut -d: -f2 | wc -c)" -le 1 ]; 33 | then 34 | git clone https://github.com/boostorg/boost src/boost 35 | echo "WARNING: Could not find boost library on system." 36 | echo " Preparing to install boost locally from source." 37 | echo "NOTE: Installing boost will take several minutes." 38 | sleep 2 39 | cd src/boost 40 | git submodule update --init 41 | bash bootstrap.sh 42 | ./b2 headers 43 | cd ../../ 44 | else 45 | echo "Found boost installed at $(whereis boost | cut -d: -f2)" 46 | fi 47 | 48 
| # htslib library, which we use for BCF/VCF access and indexing 49 | tbx_loc=$(echo '#include ' | cpp -H -o /dev/null 2>&1 | head -n1) 50 | 51 | # if [[ $tbx_loc == *"error"* ]]; then 52 | # echo "Could not find htslib on system." 53 | echo "Installing htslib locally." 54 | 55 | git clone https://github.com/samtools/htslib.git src/htslib 56 | cd src/htslib 57 | git submodule update --init --recursive 58 | autoreconf 59 | ./configure --disable-lzma --disable-bz2 --disable-libcurl 60 | make install prefix=$PWD 61 | cd ../../ 62 | # else 63 | # echo "Found htslib installed on system." 64 | # fi 65 | 66 | # GDS format : 67 | # git clone https://github.com/CoreArray/GDSFormat.git 68 | 69 | # module load boost 70 | 71 | 72 | -------------------------------------------------------------------------------- /src/BRENT/brent_fmin.cpp: -------------------------------------------------------------------------------- 1 | 2 | // Code for Brent's algorithm was copied from R's C source files. 3 | // It was modified slightly. 4 | 5 | /* 6 | * R : A Computer Language for Statistical Data Analysis 7 | * Copyright (C) 1995, 1996 Robert Gentleman and Ross Ihaka 8 | * Copyright (C) 2003-2004 The R Foundation 9 | * Copyright (C) 1998--2014 The R Core Team 10 | * 11 | * This program is free software; you can redistribute it and/or modify 12 | * it under the terms of the GNU General Public License as published by 13 | * the Free Software Foundation; either version 2 of the License, or 14 | * (at your option) any later version. 15 | * 16 | * This program is distributed in the hope that it will be useful, 17 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 18 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 | * GNU General Public License for more details. 
20 | * 21 | * You should have received a copy of the GNU General Public License 22 | * along with this program; if not, a copy is available at 23 | * https://www.R-project.org/Licenses/ 24 | */ 25 | 26 | #include "brent_fmin.hpp" 27 | 28 | double Brent_fmin(double ax, double bx, std::function& f, double tol) 29 | { 30 | /* c is the squared inverse of the golden ratio */ 31 | const double c = (3. - sqrt(5.)) * .5; 32 | 33 | /* Local variables */ 34 | double a, b, d, e, p, q, r, u, v, w, x; 35 | double t2, fu, fv, fw, fx, xm, eps, tol1, tol3; 36 | 37 | /* eps is approximately the square root of the relative machine precision. */ 38 | eps = DBL_EPSILON; 39 | tol1 = eps + 1.;/* the smallest 1.000... > 1 */ 40 | eps = sqrt(eps); 41 | 42 | a = ax; 43 | b = bx; 44 | v = a + c * (b - a); 45 | w = v; 46 | x = v; 47 | 48 | d = 0.;/* -Wall */ 49 | e = 0.; 50 | fx = f(x); 51 | fv = fx; 52 | fw = fx; 53 | tol3 = tol / 3.; 54 | 55 | /* main loop starts here ----------------------------------- */ 56 | 57 | for(;;) { 58 | xm = (a + b) * .5; 59 | tol1 = eps * fabs(x) + tol3; 60 | t2 = tol1 * 2.; 61 | 62 | /* check stopping criterion */ 63 | 64 | if (fabs(x - xm) <= t2 - (b - a) * .5) break; 65 | p = 0.; 66 | q = 0.; 67 | r = 0.; 68 | if (fabs(e) > tol1) { /* fit parabola */ 69 | 70 | r = (x - w) * (fx - fv); 71 | q = (x - v) * (fx - fw); 72 | p = (x - v) * q - (x - w) * r; 73 | q = (q - r) * 2.; 74 | if (q > 0.) 
p = -p; else q = -q; 75 | r = e; 76 | e = d; 77 | } 78 | 79 | if (fabs(p) >= fabs(q * .5 * r) || 80 | p <= q * (a - x) || p >= q * (b - x)) { /* a golden-section step */ 81 | 82 | if (x < xm) e = b - x; else e = a - x; 83 | d = c * e; 84 | } 85 | else { /* a parabolic-interpolation step */ 86 | 87 | d = p / q; 88 | u = x + d; 89 | 90 | /* f must not be evaluated too close to ax or bx */ 91 | 92 | if (u - a < t2 || b - u < t2) { 93 | d = tol1; 94 | if (x >= xm) d = -d; 95 | } 96 | } 97 | 98 | /* f must not be evaluated too close to x */ 99 | 100 | if (fabs(d) >= tol1) 101 | u = x + d; 102 | else if (d > 0.) 103 | u = x + tol1; 104 | else 105 | u = x - tol1; 106 | 107 | fu = f(u); 108 | 109 | /* update a, b, v, w, and x */ 110 | 111 | if (fu <= fx) { 112 | if (u < x) b = x; else a = x; 113 | v = w; w = x; x = u; 114 | fv = fw; fw = fx; fx = fu; 115 | } else { 116 | if (u < x) a = u; else b = u; 117 | if (fu <= fw || w == x) { 118 | v = w; fv = fw; 119 | w = u; fw = fu; 120 | } else if (fu <= fv || v == x || v == w) { 121 | v = u; fv = fu; 122 | } 123 | } 124 | } 125 | /* end of main loop */ 126 | 127 | return x; 128 | } // Brent_fmin() 129 | 130 | 131 | -------------------------------------------------------------------------------- /src/BRENT/brent_fmin.hpp: -------------------------------------------------------------------------------- 1 | 2 | // Code for Brent's algorithm was copied from R's C source files. 3 | // It was modified slightly. 4 | 5 | /* 6 | * R : A Computer Language for Statistical Data Analysis 7 | * Copyright (C) 1995, 1996 Robert Gentleman and Ross Ihaka 8 | * Copyright (C) 2003-2004 The R Foundation 9 | * Copyright (C) 1998--2014 The R Core Team 10 | * 11 | * This program is free software; you can redistribute it and/or modify 12 | * it under the terms of the GNU General Public License as published by 13 | * the Free Software Foundation; either version 2 of the License, or 14 | * (at your option) any later version. 
15 | * 16 | * This program is distributed in the hope that it will be useful, 17 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 18 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 | * GNU General Public License for more details. 20 | * 21 | * You should have received a copy of the GNU General Public License 22 | * along with this program; if not, a copy is available at 23 | * https://www.R-project.org/Licenses/ 24 | */ 25 | 26 | #ifndef BRENT_HPP 27 | #define BRENT_HPP 28 | 29 | #include 30 | #include 31 | #include /* DBL_EPSILON */ 32 | 33 | double Brent_fmin(double ax, double bx, std::function& f, double tol); 34 | 35 | #endif 36 | 37 | -------------------------------------------------------------------------------- /src/Main.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
14 | */ 15 | 16 | 17 | #ifndef APEX_HPP 18 | #define APEX_HPP 19 | 20 | #include 21 | #include 22 | #include 23 | #include 24 | #include 25 | 26 | #include 27 | 28 | #include 29 | #include 30 | #include 31 | 32 | #include "args/args.hxx" 33 | 34 | #include "setOptions.hpp" 35 | 36 | #include "htsWrappers.hpp" 37 | #include "setRegions.hpp" 38 | 39 | #include "readBED.hpp" 40 | #include "readTable.hpp" 41 | #include "genotypeData.hpp" 42 | 43 | #include "transMapping.hpp" 44 | #include "cisMapping.hpp" 45 | #include "metaAnalysis.hpp" 46 | #include "fitUtils.hpp" 47 | #include "mapID.hpp" 48 | 49 | #include "miscUtils.hpp" 50 | #include "mathStats.hpp" 51 | #include "processVCOV.hpp" 52 | 53 | 54 | #endif 55 | 56 | -------------------------------------------------------------------------------- /src/apexLMM.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
14 | */ 15 | 16 | 17 | #include "apexLMM.hpp" 18 | 19 | void fit_LMM_null_models(table& c_data, bed_data& e_data, Eigen::SparseMatrix& L, Eigen::VectorXd& GRM_lambda, const bool& rknorm_y, const bool& rknorm_r) 20 | { 21 | 22 | Eigen::SparseMatrix Q; 23 | Eigen::VectorXd Q_lambda; 24 | 25 | subset_eigen(L, GRM_lambda, Q, Q_lambda); 26 | 27 | std::cerr << "Reordering trait and covariate matrices ...\n"; 28 | 29 | Eigen::MatrixXd& Y = e_data.data_matrix; 30 | Eigen::MatrixXd& C = c_data.data_matrix; 31 | 32 | double n_traits = Y.cols(); 33 | double n_samples = Y.rows(); 34 | double n_covar = C.cols(); 35 | 36 | // std::cerr << "Started cis-QTL analysis ...\n"; 37 | 38 | if( rknorm_y ){ 39 | std::cerr << "Rank-normalizing expression traits ... \n"; 40 | rank_normalize(Y); 41 | } 42 | std::cerr << "Scaling expression traits ... \n"; 43 | std::vector y_scale; 44 | scale_and_center(Y, y_scale); 45 | 46 | std::cerr << "Calculating partial rotations ...\n"; 47 | Eigen::MatrixXd QtC = (Q.transpose() * C).eval(); 48 | Eigen::MatrixXd QtY = (Q.transpose() * Y).eval(); 49 | Eigen::MatrixXd CtY = (C.transpose() * Y).eval(); 50 | Eigen::MatrixXd CtC = (C.transpose() * C).eval(); 51 | Eigen::MatrixXd CtC_i = (CtC.inverse()).eval(); 52 | 53 | std::cerr << "Rotating expression and covariates ... "; 54 | //Y = (L.transpose() * Y).eval(); 55 | Eigen::MatrixXd X = (L.transpose() * C).eval(); 56 | std::cerr << "Done.\n"; 57 | 58 | std::string theta_file_path = global_opts::out_prefix + "." 
+ "theta" + ".gz"; 59 | 60 | BGZF* theta_file; 61 | theta_file = bgzf_open(theta_file_path.c_str(), "w"); 62 | 63 | std::string iter_cerr_suffix = " traits out of " + std::to_string(Y.cols()) + " total"; 64 | std::cerr << "Fit null model for "; 65 | print_iter_cerr(1, 0, iter_cerr_suffix); 66 | 67 | std::vector phi_v(Y.cols()); 68 | std::vector hsq_v(Y.cols()); 69 | std::vector sigma_v(Y.cols()); 70 | std::vector SSR_v(Y.cols()); 71 | 72 | e_data.stdev.resize(Y.cols()); 73 | 74 | for(int j = 0; j < Y.cols(); j++ ){ 75 | 76 | Y.col(j) = (L.transpose() * Y.col(j)).eval(); 77 | 78 | LMM_fitter fit(X, Y.col(j), GRM_lambda); 79 | fit.fit_REML(); 80 | 81 | const DiagonalXd& Vi = fit.Vi; 82 | const double& sigma2 = fit.sigma2; 83 | const double& phi = fit.phi; 84 | 85 | 86 | 87 | // DiagonalXd Vi = Eigen::VectorXd::Ones(Y.col(j).size()).asDiagonal(); 88 | // double sigma2 = 1.00; 89 | // double phi = 0.05; 90 | 91 | double tau2 = phi*sigma2; 92 | double hsq = tau2 / (tau2 + sigma2); 93 | double scale = std::sqrt(tau2 + sigma2); 94 | 95 | // Calculate "sum of squared residuals" 96 | 97 | DiagonalXd Psi = calc_Psi(phi, Q_lambda); 98 | Eigen::MatrixXd XtDX = (CtC - QtC.transpose() * Psi * QtC )/(1.00 + phi); 99 | Eigen::VectorXd XtDy = X.transpose() * Vi * Y.col(j); 100 | Eigen::VectorXd b = XtDX.colPivHouseholderQr().solve(XtDy); 101 | Eigen::VectorXd y_hat = X * b; 102 | 103 | Eigen::VectorXd y_res = Y.col(j) - y_hat; 104 | 105 | SSR_v[j] = y_res.dot(Vi * Y.col(j))/sigma2; 106 | 107 | 108 | // if( !e_data.is_residual ){ 109 | // DiagonalXd Psi = calc_Psi(phi, Q_lambda); 110 | // Eigen::MatrixXd XtDX = (CtC - QtC.transpose() * Psi * QtC )/(1.00 + phi); 111 | // Eigen::VectorXd XtDy = X.transpose() * Vi * Y.col(j); 112 | // Eigen::VectorXd b = XtDX.colPivHouseholderQr().solve(XtDy); 113 | // Eigen::VectorXd y_hat = X * b; 114 | 115 | // Eigen::VectorXd y_res = Y.col(j) - y_hat; 116 | 117 | // SSR_v[j] = y_res.dot(Vi * Y.col(j))/sigma2; 118 | 119 | // // Y now stores the 
rotated residuals. 120 | // Y.col(j) = (Vi * y_res/std::sqrt(sigma2)).eval(); 121 | // } 122 | 123 | std::stringstream theta_line; 124 | 125 | theta_line << 126 | clean_chrom(e_data.chr[j]) << "\t" << 127 | e_data.start[j] << "\t" << 128 | e_data.end[j] << "\t" << 129 | e_data.gene_id[j] << "\t" << 130 | sigma2 << "\t" << 131 | tau2 << "\t" << 132 | phi << "\t" << 133 | SSR_v[j] << "\t" << 134 | y_scale[j] << "\n"; 135 | 136 | write_to_bgzf(theta_line.str().c_str(), theta_file); 137 | 138 | if( global_opts::write_resid_mat ){ 139 | 140 | Y.col(j) = (Vi * y_res/std::sqrt(sigma2)).eval(); 141 | 142 | Y.col(j) = (L * Y.col(j)).eval(); 143 | } 144 | 145 | 146 | print_iter_cerr(j, j+1, iter_cerr_suffix); 147 | } 148 | 149 | 150 | bgzf_close(theta_file); 151 | build_tabix_index(theta_file_path, 1); 152 | 153 | 154 | if( global_opts::write_resid_mat ){ 155 | // Y = (L * Y).eval(); 156 | std::string bed_out = global_opts::out_prefix + ".lmm_resid.bed.gz"; 157 | std::string bed_header = "## APEX_LMM_RESID"; 158 | e_data.write_bed(bed_out, bed_header); 159 | } 160 | 161 | } 162 | 163 | 164 | void fit_LMM_null_models_low_rank(const int& n_fac, table& c_data, bed_data& e_data, const bool& rknorm_y, const bool& rknorm_r, Eigen::MatrixXd& Y_epc ) 165 | { 166 | 167 | Eigen::MatrixXd Q; 168 | Eigen::VectorXd Q_lambda; 169 | 170 | Eigen::MatrixXd& Y = e_data.data_matrix; 171 | Eigen::MatrixXd& C = c_data.data_matrix; 172 | 173 | double n_traits = Y.cols(); 174 | double n_samples = Y.rows(); 175 | double n_covar = C.cols(); 176 | 177 | if( rknorm_y ){ 178 | std::cerr << "Rank-normalizing expression traits ... \n"; 179 | rank_normalize(Y); 180 | rank_normalize(Y_epc); 181 | } 182 | std::cerr << "Scaling expression traits ... 
\n"; 183 | std::vector y_scale; 184 | scale_and_center(Y, y_scale); 185 | scale_and_center(Y_epc); 186 | 187 | // bool resid_ePC = true; 188 | 189 | // if( resid_ePC ){ 190 | 191 | // Eigen::MatrixXd U = get_half_hat_matrix(C); 192 | 193 | // std::cerr << "Calculating expression residuals...\n"; 194 | // make_resid_from_half_hat(Y_epc, U); 195 | // } 196 | 197 | std::cerr << "Estimating latent factors ... \n"; 198 | calc_eGRM_PCs(Q, Q_lambda, Y_epc, n_fac); 199 | std::cerr << "Done.\n"; 200 | 201 | Y_epc.resize(0,0); 202 | 203 | std::cerr << "Calculating partial rotations ...\n"; 204 | Eigen::MatrixXd QtC = (Q.transpose() * C).eval(); 205 | Eigen::MatrixXd QtY = (Q.transpose() * Y).eval(); 206 | Eigen::MatrixXd CtY = (C.transpose() * Y).eval(); 207 | Eigen::MatrixXd CtC = (C.transpose() * C).eval(); 208 | Eigen::MatrixXd CtC_i = (CtC.inverse()).eval(); 209 | 210 | std::vector hsq_vals{0.0, 0.5, 1.0}; 211 | 212 | std::string theta_file_path = global_opts::out_prefix + "." + "theta" + ".gz"; 213 | BGZF* theta_file; 214 | theta_file = bgzf_open(theta_file_path.c_str(), "w"); 215 | 216 | // ------------------------------------- 217 | // Begin fitting null models. 
218 | // ------------------------------------- 219 | 220 | std::vector phi_v(Y.cols()); 221 | std::vector hsq_v(Y.cols()); 222 | std::vector sigma_v(Y.cols()); 223 | std::vector SSR_v(Y.cols()); 224 | 225 | // Eigen::MatrixXd PY( Y.rows(), Y.cols() ); 226 | 227 | e_data.stdev.resize(Y.cols()); 228 | 229 | theta_data t_data; 230 | bool use_theta = false; 231 | 232 | // if( theta_path != "" ){ 233 | // std::cerr << "Set null model for "; 234 | // t_data.open(theta_path); 235 | // use_theta = true; 236 | // }else{ 237 | std::cerr << "Fit null model for "; 238 | // } 239 | 240 | std::string iter_cerr_suffix = " traits out of " + std::to_string(Y.cols()) + " total"; 241 | print_iter_cerr(1, 0, iter_cerr_suffix); 242 | 243 | int last_j = 0; 244 | for(int j = 0; j < Y.cols(); j++ ){ 245 | DiagonalXd Vi; 246 | double sigma2, phi, tau2; 247 | 248 | double yty = Y.col(j).squaredNorm(); 249 | double sample_size = Y.rows(); 250 | 251 | LMM_fitter_low_rank fit(CtC, QtC, CtY.col(j), QtY.col(j), yty, Q_lambda, sample_size); 252 | 253 | fit.fit_REML(); 254 | 255 | // Vi = fit.Vi; 256 | sigma2 = fit.sigma2; 257 | phi = fit.phi; 258 | tau2 = phi*sigma2; 259 | 260 | std::stringstream theta_line; 261 | 262 | theta_line << 263 | clean_chrom(e_data.chr[j]) << "\t" << 264 | e_data.start[j] << "\t" << 265 | e_data.end[j] << "\t" << 266 | e_data.gene_id[j] << "\t" << 267 | sigma2 << "\t" << 268 | tau2 << "\t" << 269 | phi << "\n"; 270 | 271 | write_to_bgzf(theta_line.str().c_str(), theta_file); 272 | 273 | if ( global_opts::write_resid_mat ){ 274 | 275 | double hsq = tau2 / (tau2 + sigma2); 276 | double scale = std::sqrt(tau2 + sigma2); 277 | 278 | DiagonalXd Psi = calc_Psi_low_rank(phi, Q_lambda); 279 | Eigen::MatrixXd XtDX = (CtC - QtC.transpose() * Psi * QtC ); 280 | 281 | Eigen::VectorXd XtDy = QtC.transpose() * Psi * QtY.col(j); 282 | XtDy = (CtY.col(j) - XtDy).eval(); 283 | Eigen::VectorXd b = XtDX.colPivHouseholderQr().solve(XtDy); 284 | Eigen::VectorXd y_fixeff = C * b; 285 | 286 | 
Eigen::VectorXd v1 = QtC * b; 287 | Eigen::VectorXd v2 = QtY.col(j) - v1; 288 | Eigen::VectorXd y_raneff = Q * Psi * v2; 289 | Eigen::VectorXd y_resid = Y.col(j) - y_raneff - y_fixeff; 290 | 291 | SSR_v[j] = (yty - QtY.col(j).dot(Psi * QtY.col(j)) - XtDy.dot(b))/sigma2; 292 | 293 | Y.col(j) = (y_resid/std::sqrt(sigma2)).eval(); 294 | 295 | phi_v[j] = phi; 296 | hsq_v[j] = hsq; 297 | sigma_v[j] = sigma2; 298 | 299 | e_data.stdev[j] = std::sqrt(sigma2); 300 | 301 | } 302 | print_iter_cerr(j, j+1, iter_cerr_suffix); 303 | } 304 | std::cerr << "\n"; 305 | 306 | bgzf_close(theta_file); 307 | build_tabix_index(theta_file_path, 1); 308 | 309 | if( global_opts::write_resid_mat ){ 310 | std::string bed_out = global_opts::out_prefix + ".lmm_resid.bed.gz"; 311 | std::string bed_header = "## APEX_LMM_RESID"; 312 | e_data.write_bed(bed_out); 313 | } 314 | } 315 | 316 | 317 | 318 | void calculate_V_anchor_points( Eigen::MatrixXd& V_mat, genotype_data& g_data, const Eigen::MatrixXd& C, const std::vector& hsq_vals, const Eigen::MatrixXd& CtC, const Eigen::MatrixXd& CtC_i, const Eigen::MatrixXd& QtG, Eigen::MatrixXd& QtC, const Eigen::VectorXd Q_lambda ){ 319 | 320 | BGZF* anchor_file; 321 | 322 | int n_hsq = hsq_vals.size(); 323 | int n_snps = g_data.genotypes.cols(); 324 | int n_samples = g_data.genotypes.rows(); 325 | int n_cov = CtC.rows(); 326 | 327 | if( global_opts::write_v_anchors ){ 328 | std::string anchor_path = global_opts::out_prefix + ".gvar_points.gz"; 329 | anchor_file = bgzf_open(anchor_path.c_str(), "w"); 330 | 331 | write_to_bgzf("##\t" + std::to_string(n_samples) + "\t" + std::to_string(n_snps) + "\t" + std::to_string(n_cov) + "\n", anchor_file); 332 | 333 | write_to_bgzf("#chrom\tpos\tref\talt", anchor_file); 334 | 335 | std::stringstream out_line; 336 | for(const double& val : hsq_vals){ 337 | out_line << "\tVR" << 100.0*val; 338 | } 339 | out_line << "\n"; 340 | write_to_bgzf(out_line.str().c_str(), anchor_file); 341 | } 342 | 343 | std::vector M_list; 344 | 
V_mat = Eigen::MatrixXd(n_snps, n_hsq); 345 | 346 | for( const double& hsq : hsq_vals ){ 347 | M_list.push_back( 348 | (CtC - (QtC.transpose() * calc_Psi(hsq, Q_lambda) * QtC)).inverse() 349 | ); 350 | } 351 | 352 | std::cerr << "Calculating genotype-covariate covariance ... \n"; 353 | Eigen::MatrixXd CtG = (C.transpose() * g_data.genotypes).eval(); 354 | 355 | std::vector GtG_diag(n_snps); 356 | 357 | std::cerr << "Calculating genotype residual variances ..."; 358 | 359 | for( int i = 0; i < n_snps; i++) 360 | { 361 | GtG_diag[i] = g_data.genotypes.col(i).squaredNorm(); 362 | g_data.var[i] = GtG_diag[i] - CtG.col(i).dot(CtC_i * CtG.col(i)); 363 | V_mat(i,0) = g_data.var[i]; 364 | } 365 | std::cerr << "Done.\n"; 366 | 367 | for(int j = 1; j < n_hsq; j++){ 368 | 369 | std::cerr << "Calculating genotype variance anchor points (" << j+1 << "/" << n_hsq << ") ... \n"; 370 | 371 | Eigen::MatrixXd Psi_a = calc_Psi(hsq_vals[j], Q_lambda) * QtG; 372 | Eigen::MatrixXd Res = CtG - QtC.transpose() * Psi_a; 373 | 374 | for( int i = 0; i < n_snps; i++) 375 | { 376 | V_mat(i,j) = ( 377 | GtG_diag[i] - QtG.col(i).dot(Psi_a.col(i)) - Res.col(i).dot(M_list[j] * Res.col(i)) 378 | ); 379 | } 380 | 381 | } 382 | 383 | if( global_opts::write_v_anchors ){ 384 | std::cerr << "Writing genotype variance anchor points ... 
"; 385 | for( int i = 0; i < n_snps; i++) 386 | { 387 | std::stringstream out_line; 388 | out_line << 389 | clean_chrom(g_data.chr[i]) << "\t" << 390 | g_data.pos[i] << "\t" << 391 | g_data.ref[i] << "\t" << 392 | g_data.alt[i]; 393 | for(int j = 0; j < n_hsq; j++){ 394 | out_line << "\t" << V_mat(i,j); 395 | } 396 | out_line << "\n"; 397 | write_to_bgzf(out_line.str().c_str(), anchor_file); 398 | } 399 | bgzf_close(anchor_file); 400 | } 401 | 402 | V_mat = (V_mat * (getPredParams(hsq_vals).transpose())).eval(); 403 | 404 | std::cerr << "Done.\n"; 405 | 406 | return; 407 | } 408 | 409 | 410 | void calculate_V_anchor_points_low_rank( Eigen::MatrixXd& V_mat, genotype_data& g_data, const Eigen::MatrixXd& C, const std::vector& hsq_vals, const Eigen::MatrixXd& CtC, const Eigen::MatrixXd& CtC_i, const Eigen::MatrixXd& QtG, Eigen::MatrixXd& QtC, const Eigen::VectorXd Q_lambda ){ 411 | 412 | 413 | BGZF* anchor_file; 414 | std::string anchor_path = global_opts::out_prefix + "." + "gvar_points.gz"; 415 | 416 | int n_hsq = hsq_vals.size(); 417 | int n_snps = g_data.genotypes.cols(); 418 | int n_samples = g_data.genotypes.rows(); 419 | int n_cov = CtC.rows(); 420 | 421 | if( global_opts::write_v_anchors ){ 422 | 423 | anchor_file = bgzf_open(anchor_path.c_str(), "w"); 424 | 425 | write_to_bgzf("##\t" + std::to_string(n_samples) + "\t" + std::to_string(n_snps) + "\t" + std::to_string(n_cov) + "\n", anchor_file); 426 | 427 | write_to_bgzf("#chrom\tpos\tref\talt", anchor_file); 428 | std::stringstream out_line; 429 | for(const double& val : hsq_vals){ 430 | out_line << "\tVR" << 100.0*val; 431 | } 432 | out_line << "\n"; 433 | write_to_bgzf(out_line.str().c_str(), anchor_file); 434 | } 435 | 436 | std::vector M_list; 437 | V_mat = Eigen::MatrixXd(n_snps, n_hsq); 438 | 439 | for( const double& hsq : hsq_vals ){ 440 | M_list.push_back( 441 | (CtC - (QtC.transpose() * calc_Psi_low_rank(hsq, Q_lambda) * QtC)).inverse() 442 | ); 443 | } 444 | 445 | std::cerr << "Calculating 
genotype-covariate covariance ... \n"; 446 | Eigen::MatrixXd CtG = (C.transpose() * g_data.genotypes).eval(); 447 | 448 | std::vector GtG_diag(n_snps); 449 | 450 | std::cerr << "Calculating genotype residual variances ..."; 451 | 452 | for( int i = 0; i < n_snps; i++) 453 | { 454 | GtG_diag[i] = g_data.genotypes.col(i).squaredNorm(); 455 | g_data.var[i] = GtG_diag[i] - CtG.col(i).dot(CtC_i * CtG.col(i)); 456 | V_mat(i,0) = g_data.var[i]; 457 | } 458 | std::cerr << "Done.\n"; 459 | 460 | for(int j = 1; j < n_hsq; j++){ 461 | std::cerr << "Calculating genotype variance anchor points (" << j+1 << "/" << n_hsq << ") ... \n"; 462 | 463 | Eigen::MatrixXd Psi_a = calc_Psi_low_rank(hsq_vals[j], Q_lambda) * QtG; 464 | Eigen::MatrixXd Res = CtG - QtC.transpose() * Psi_a; 465 | 466 | for( int i = 0; i < n_snps; i++) 467 | { 468 | V_mat(i,j) = ( 469 | GtG_diag[i] - QtG.col(i).dot(Psi_a.col(i)) - Res.col(i).dot(M_list[j] * Res.col(i)) 470 | ); 471 | } 472 | } 473 | 474 | if( global_opts::write_v_anchors ){ 475 | std::cerr << "Writing genotype variance anchor points ... 
"; 476 | for( int i = 0; i < n_snps; i++) 477 | { 478 | std::stringstream out_line; 479 | out_line << 480 | clean_chrom(g_data.chr[i]) << "\t" << 481 | g_data.pos[i] << "\t" << 482 | g_data.ref[i] << "\t" << 483 | g_data.alt[i]; 484 | for(int j = 0; j < n_hsq; j++){ 485 | out_line << "\t" << V_mat(i,j); 486 | } 487 | out_line << "\n"; 488 | write_to_bgzf(out_line.str().c_str(), anchor_file); 489 | } 490 | bgzf_close(anchor_file); 491 | build_tabix_index(anchor_path, 0); 492 | } 493 | 494 | V_mat = (V_mat * (getPredParams(hsq_vals).transpose())).eval(); 495 | 496 | std::cerr << "Done.\n"; 497 | 498 | return; 499 | } 500 | 501 | void read_V_anchor_points( const std::string& anchor_path, Eigen::MatrixXd& V_mat, genotype_data& g_data, const std::vector& hsq_vals) 502 | { 503 | 504 | int n_hsq = hsq_vals.size(); 505 | int n_snps = g_data.genotypes.cols(); 506 | 507 | V_mat = Eigen::MatrixXd(n_snps, n_hsq); 508 | 509 | std::vector chr; 510 | std::vector pos; 511 | std::vector ref; 512 | std::vector alt; 513 | 514 | data_parser dp; 515 | 516 | dp.add_field(chr, 0); 517 | dp.add_field(pos, 1); 518 | dp.add_field(ref, 2); 519 | dp.add_field(alt, 3); 520 | 521 | dp.add_matrix(V_mat, false, 4, n_hsq); 522 | 523 | dp.parse_file(anchor_path, global_opts::global_region); 524 | dp.clear(); 525 | 526 | if( chr.size() != n_snps ){ 527 | std::cerr << "Number of variants in gvar file does not match genotype matrix dimensions." 
<< "\n"; 528 | abort(); 529 | } 530 | 531 | V_mat = (V_mat * (getPredParams(hsq_vals).transpose())).eval(); 532 | 533 | return; 534 | } 535 | 536 | void save_V_anchor_points( bcf_srs_t*& sr, bcf_hdr_t*& hdr, genotype_data& g_data, table& c_data, Eigen::SparseMatrix& L, Eigen::VectorXd& GRM_lambda ) 537 | { 538 | 539 | Eigen::SparseMatrix Q; 540 | Eigen::VectorXd Q_lambda; 541 | 542 | subset_eigen(L, GRM_lambda, Q, Q_lambda); 543 | 544 | Eigen::MatrixXd& C = c_data.data_matrix; 545 | 546 | std::cerr << "Calculating partial rotations ...\n"; 547 | Eigen::MatrixXd QtY, CtY, QtC, QtG, CtC, CtC_i; 548 | 549 | QtG = (Q.transpose() * g_data.genotypes).eval(); 550 | 551 | QtC = (Q.transpose() * C).eval(); 552 | CtC = (C.transpose() * C).eval(); 553 | CtC_i = (CtC.inverse()).eval(); 554 | 555 | std::cerr << "Rotating expression and covariates ... "; 556 | 557 | 558 | std::cerr << "Done.\n"; 559 | 560 | std::vector hsq_vals{0.0, 0.5, 1.0}; 561 | 562 | Eigen::MatrixXd V_mat, X; 563 | 564 | calculate_V_anchor_points(V_mat, g_data, C, hsq_vals, CtC, CtC_i, QtG, QtC, Q_lambda); 565 | 566 | return; 567 | } 568 | -------------------------------------------------------------------------------- /src/apexLMM.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
*/


#ifndef APEXLMM_HPP
#define APEXLMM_HPP

#include "setOptions.hpp"
#include "readBED.hpp"
#include "setRegions.hpp"
#include "readTable.hpp"
#include "genotypeData.hpp"
#include "htsWrappers.hpp"
#include "fitUtils.hpp"
#include "fitModels.hpp"
#include "miscUtils.hpp"
#include "mathStats.hpp"

void fit_LMM_null_models(table& c_data, bed_data& e_data, Eigen::SparseMatrix<double>& L, Eigen::VectorXd& GRM_lambda, const bool& rknorm_y, const bool& rknorm_r);

void fit_LMM_null_models_low_rank(const int& n_fac, table& c_data, bed_data& e_data, const bool& rknorm_y, const bool& rknorm_r, Eigen::MatrixXd& Y_epc );

void calculate_V_anchor_points( Eigen::MatrixXd& V_mat, genotype_data& g_data, const Eigen::MatrixXd& C, const std::vector<double>& hsq_vals, const Eigen::MatrixXd& CtC, const Eigen::MatrixXd& CtC_i, const Eigen::MatrixXd& QtG, Eigen::MatrixXd& QtC, const Eigen::VectorXd Q_lambda );

void calculate_V_anchor_points_low_rank( Eigen::MatrixXd& V_mat, genotype_data& g_data, const Eigen::MatrixXd& C, const std::vector<double>& hsq_vals, const Eigen::MatrixXd& CtC, const Eigen::MatrixXd& CtC_i, const Eigen::MatrixXd& QtG, Eigen::MatrixXd& QtC, const Eigen::VectorXd Q_lambda );

void save_V_anchor_points(bcf_srs_t*& sr, bcf_hdr_t*& hdr, genotype_data& g_data, table& c_data, Eigen::SparseMatrix<double>& L, Eigen::VectorXd& GRM_lambda);

void read_V_anchor_points( const std::string& anchor_path, Eigen::MatrixXd& V_mat, genotype_data& g_data, const std::vector<double>& hsq_vals);

#endif

--------------------------------------------------------------------------------
/src/cisMapping.hpp:
--------------------------------------------------------------------------------
/*
    Copyright (C) 2020
    Author: Corbin Quick

    This file is a part of APEX.
6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | 17 | #ifndef CISMAPPING_HPP 18 | #define CISMAPPING_HPP 19 | 20 | #include "setOptions.hpp" 21 | #include "readBED.hpp" 22 | #include "setRegions.hpp" 23 | #include "readTable.hpp" 24 | #include "genotypeData.hpp" 25 | #include "htsWrappers.hpp" 26 | #include "fitUtils.hpp" 27 | #include "fitModels.hpp" 28 | #include "miscUtils.hpp" 29 | #include "mathStats.hpp" 30 | 31 | 32 | void run_cis_QTL_analysis(bcf_srs_t*& sr, bcf_hdr_t*& hdr, genotype_data& g_data, table& c_data, bed_data& e_data, block_intervals& bm, const bool& rknorm_y, const bool& rknorm_r, const bool& make_sumstat, const bool& make_long, const bool& just_long); 33 | 34 | void run_cis_QTL_analysis_LMM(bcf_srs_t*& sr, bcf_hdr_t*& hdr,genotype_data& g_data, table& c_data, bed_data& e_data, Eigen::SparseMatrix& L, Eigen::VectorXd& GRM_lambda, block_intervals& bm, const bool& rknorm_y, const bool& rknorm_r, const bool& make_sumstat, const bool& make_long, const bool& just_long, const std::string& theta_path = "", const std::string& anchor_path = ""); 35 | 36 | void run_cis_QTL_analysis_eLMM(const int& n_fac, const int& n_fac_fe, bcf_srs_t*& sr, bcf_hdr_t*& hdr,genotype_data& g_data, table& c_data, bed_data& e_data, block_intervals& bm, const bool& rknorm_y, const bool& rknorm_r, const bool& make_sumstat, const bool& make_long, const bool& just_long, Eigen::MatrixXd& Y_epc); 37 | 38 | void run_cis_QTL_analysis_eFE (const int& n_fac, bcf_srs_t*& sr, bcf_hdr_t*& hdr,genotype_data& g_data, table& c_data, bed_data& e_data, block_intervals& bm, const bool& rknorm_y, const bool& rknorm_r, const bool& make_sumstat, const bool& make_long, const bool& 
just_long, Eigen::MatrixXd& Y_epc );

void scan_signals(bcf_srs_t*& sr, bcf_hdr_t*& hdr,genotype_data& g_data, table& c_data, bed_data& e_data, block_intervals& bm, const bool& rknorm_y, const bool& rknorm_r);

void save_fa_covariates(const int& n_fac, genotype_data& g_data, table& c_data, bed_data& e_data, const bool& rknorm_y, const bool& rknorm_r);

#endif

--------------------------------------------------------------------------------
/src/dataParser.hpp:
--------------------------------------------------------------------------------
/*
    Copyright (C) 2020
    Author: Corbin Quick

    This file is a part of APEX.

    APEX is distributed "AS IS" in the hope that it will be
    useful, but WITHOUT ANY WARRANTY; without even the implied
    warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS
    FOR A PARTICULAR PURPOSE.

    The above copyright notice and disclaimer of warranty must
    be included in all copies or substantial portions of APEX.
*/


/*
    This header provides a generic interface for parsing multiple,
    heterogeneous fields from htsFiles into multiple, heterogeneous
    data objects and formats (std::vector and Eigen::Matrix).

    See readBED.cpp for an example.
23 | */ 24 | 25 | #ifndef DATAPARSER_HPP 26 | #define DATAPARSER_HPP 27 | 28 | #include 29 | 30 | #include "htsWrappers.hpp" 31 | #include "setOptions.hpp" 32 | 33 | 34 | static std::string nullstr = ""; 35 | 36 | static int int0 = 0; 37 | 38 | // class for filling Eigen matrix entries by reference from file 39 | class matrix_filler 40 | { 41 | public: 42 | matrix_filler(): n(-1), chunk_size(5000), unknown_dim(false) {}; 43 | 44 | // Unknown matrix size (unknown number of input columns) 45 | void set(Eigen::MatrixXd* new_mat, const bool is_trans, int i_offset){ 46 | mat = new_mat; 47 | start_at = i_offset; 48 | is_transposed = is_trans; 49 | n = 0; 50 | unknown_dim = true; 51 | } 52 | 53 | // Known matrix size; assign columns from interval range 54 | void set(Eigen::MatrixXd* new_mat, const bool is_trans, const int& start, const int& n_fields){ 55 | mat = new_mat; 56 | for( int i = start; i < start + n_fields; i++){ 57 | idx.push_back(i); 58 | } 59 | is_transposed = is_trans; 60 | n = 0; 61 | if( n_fields != ( is_trans ? mat->rows() : mat->cols() ) ){ 62 | std::cerr << "Fatal: matrix_filler error, mismatched dimensions\n"; 63 | std::cerr << n_fields << " != (" << mat->cols() << ", " << mat->rows() << ").\n"; 64 | abort(); 65 | } 66 | } 67 | 68 | // Known matrix size; assign columns from index vector 69 | void set(Eigen::MatrixXd* new_mat, const bool is_trans, const std::vector& kp_index, int i_offset = 0){ 70 | mat = new_mat; 71 | idx = kp_index; 72 | if( i_offset != 0 ){ 73 | for(int j = 0; j < idx.size(); j++){ 74 | idx[j] += i_offset; 75 | } 76 | } 77 | is_transposed = is_trans; 78 | n = 0; 79 | if( kp_index.size() != ( is_trans ? 
mat->rows() : mat->cols() ) ){
                std::cerr << "Fatal: matrix_filler error, mismatched dimensions\n";
                std::cerr << kp_index.size() << " != (" << mat->cols() << ", " << mat->rows() << ").\n";
                abort();
            }
        }

        // Parse one split line into the next row (or column, if transposed).
        void push_line(kstring_t& str, int*& offsets, int& n_fields){
            if( unknown_dim ){
                setSize(n_fields);
            }else{
                checkResize();
            }
            if( n >= 0 ){
                int idx_size = idx.size();
                for(int i = 0; i < idx_size; i++){
                    assign_value(str, offsets, idx[i], i);
                }
                n++;
            }
        }

        // Shrink the matrix to the number of lines actually read.
        void freeze(){
            if( is_transposed ){
                mat->conservativeResize(Eigen::NoChange, n);
            }else{
                mat->conservativeResize(n, Eigen::NoChange);
            }
        }

    private:
        Eigen::MatrixXd* mat;
        bool is_transposed, unknown_dim;
        std::vector<int> idx;
        int n, start_at, chunk_size;

        // On the first line, infer the number of columns from the field count.
        void setSize(int n_fields){
            if( unknown_dim && n == 0 ){
                int n_cols = n_fields - start_at;
                if( is_transposed ){
                    mat->resize(n_cols, n + chunk_size);
                }else{
                    mat->resize(n + chunk_size, n_cols);
                }
                for( int i = start_at; i < start_at + n_cols; i++){
                    idx.push_back(i);
                }
                unknown_dim = false;
            }
        }

        // Grow the matrix by chunk_size lines when it is full.
        void checkResize(){
            if( is_transposed ){
                if( mat->cols() < n + 1 ){
                    mat->conservativeResize(Eigen::NoChange, n + chunk_size);
                }
            }else{
                if( mat->rows() < n + 1 ){
                    mat->conservativeResize(n + chunk_size, Eigen::NoChange);
                }
            }
        }

        void assign_value(kstring_t& str, int*& offsets, const int& off_i, const int& mat_i){
            if( is_transposed ){
                mat->coeffRef(mat_i, n) = atof(str.s + offsets[off_i]);
            }else{
                mat->coeffRef(n, mat_i) = atof(str.s + offsets[off_i]);
            }
        }
};

class data_parser
{
    public:
        // Matrix of unknown dimension
        void add_matrix(Eigen::MatrixXd& new_mat, bool is_trans,
int start){ 156 | int nm = v_matrix.size(); 157 | v_matrix.push_back(matrix_filler()); 158 | v_matrix[nm].set(&new_mat, is_trans, start); 159 | } 160 | // Matrix of known dimension; keep range of columns 161 | void add_matrix(Eigen::MatrixXd& new_mat, const bool is_trans, const int& start, const int& n_fields){ 162 | int nm = v_matrix.size(); 163 | v_matrix.push_back(matrix_filler()); 164 | v_matrix[nm].set(&new_mat, is_trans, start, n_fields); 165 | } 166 | // Matrix of known dimension; keep fixed column indices 167 | void add_matrix(Eigen::MatrixXd& new_mat, const bool is_trans, const std::vector& kp_index, int i_offset = 0){ 168 | int nm = v_matrix.size(); 169 | v_matrix.push_back(matrix_filler()); 170 | v_matrix[nm].set(&new_mat, is_trans, kp_index, i_offset); 171 | } 172 | void add_field(std::vector& v, const int& i){ 173 | v_string.push_back(&v); 174 | i_string.push_back(i); 175 | } 176 | void add_field(std::vector& v, const int& i){ 177 | v_long.push_back(&v); 178 | i_long.push_back(i); 179 | } 180 | void add_field(std::vector& v, const int& i){ 181 | v_int.push_back(&v); 182 | i_int.push_back(i); 183 | } 184 | void add_field(std::vector& v, const int& i){ 185 | v_double.push_back(&v); 186 | i_double.push_back(i); 187 | } 188 | void add_header(std::vector& v, const int& i){ 189 | v_header.push_back(&v); 190 | i_header.push_back(i); 191 | } 192 | 193 | void add_target(std::vector& tg, const int& i, const bool& target_mode_include = true){ 194 | target_keys.push_back(&tg); 195 | target_cols.push_back(i); 196 | target_mode.push_back(target_mode_include); 197 | } 198 | 199 | void parse_header(kstring_t& str){ 200 | int n_fields; 201 | int *offsets = ksplit(&str, 0, &n_fields); 202 | for( int i = i_header[0]; i < n_fields; i++){ 203 | v_header[0]->push_back(std::string(str.s + offsets[i])); 204 | } 205 | } 206 | 207 | void parse_fields(kstring_t& str, int*& offsets, int& n_fields) 208 | { 209 | if( target_cols.size() > 0 ){ 210 | for( int i = 0; i < 
target_cols.size(); i++){ 211 | std::string key = std::string(str.s + offsets[target_cols[i]]); 212 | if( target_mode[i] ){ 213 | if( !has_element(*target_keys[i], key) ){ 214 | return; 215 | } 216 | }else{ 217 | if( has_element(*target_keys[i], key) ){ 218 | return; 219 | } 220 | } 221 | } 222 | } 223 | for( int i = 0; i < v_string.size(); i++){ 224 | v_string[i]->push_back(std::string(str.s + offsets[i_string[i]])); 225 | } 226 | for( int i = 0; i < v_long.size(); i++){ 227 | v_long[i]->push_back(std::atol(str.s + offsets[i_long[i]])); 228 | } 229 | for( int i = 0; i < v_int.size(); i++){ 230 | v_int[i]->push_back(std::atoi(str.s + offsets[i_int[i]])); 231 | } 232 | for( int i = 0; i < v_double.size(); i++){ 233 | v_double[i]->push_back(std::atof(str.s + offsets[i_double[i]])); 234 | } 235 | for( matrix_filler& mf : v_matrix ){ 236 | mf.push_line(str, offsets, n_fields); 237 | } 238 | } 239 | 240 | void parse_fields(kstring_t& str) 241 | { 242 | int n_fields; 243 | int *offsets = ksplit(&str, 0, &n_fields); 244 | parse_fields(str, offsets, n_fields); 245 | } 246 | 247 | template 248 | void parse_file(hts_file& htsf, int& n_rows = int0){ 249 | kstring_t str = {0,0,0}; 250 | bool skip_header = ( v_header.size() == 0 ); 251 | while( htsf.next_line(str) >= 0 ){ 252 | if( !str.l ) break; 253 | if ( str.s[0] == '#' ){ 254 | if( str.s[1] == '#' || (skip_header && n_rows == 0 ) ){ 255 | continue; 256 | } 257 | } 258 | if( n_rows == 0 && i_header.size() > 0 ){ 259 | parse_header(str); 260 | }else{ 261 | parse_fields(str); 262 | } 263 | n_rows++; 264 | } 265 | ks_free(&str); 266 | clear(); 267 | } 268 | void parse_file(const std::string& fn, int& n_rows = int0) 269 | { 270 | basic_hts_file htsf(fn); 271 | parse_file(htsf, n_rows); 272 | htsf.close(); 273 | } 274 | void parse_file(const std::string& fn, const std::string& region, int& n_rows = int0) 275 | { 276 | if( region != "" ){ 277 | indexed_hts_file htsf(fn, region); 278 | parse_file(htsf, n_rows); 279 | htsf.close(); 
280 | }else{ 281 | basic_hts_file htsf(fn); 282 | parse_file(htsf, n_rows); 283 | htsf.close(); 284 | } 285 | } 286 | 287 | void clear(){ 288 | for( matrix_filler& mf : v_matrix ){mf.freeze();} 289 | v_header.clear(); 290 | i_header.clear(); 291 | v_matrix.clear(); 292 | v_string.clear(); 293 | i_string.clear(); 294 | v_long.clear(); 295 | i_long.clear(); 296 | v_int.clear(); 297 | i_int.clear(); 298 | v_double.clear(); 299 | i_double.clear(); 300 | target_cols.clear(); 301 | target_keys.clear(); 302 | target_mode.clear(); 303 | }; 304 | 305 | private: 306 | std::vector*> v_header; 307 | std::vector i_header; 308 | std::vector v_matrix; 309 | std::vector*> v_string; 310 | std::vector i_string; 311 | std::vector*> v_long; 312 | std::vector i_long; 313 | std::vector*> v_int; 314 | std::vector i_int; 315 | std::vector*> v_double; 316 | std::vector i_double; 317 | 318 | std::vector target_cols; 319 | std::vector*> target_keys; 320 | std::vector target_mode; 321 | }; 322 | 323 | 324 | class file_list 325 | { 326 | public: 327 | std::vector file_paths; 328 | 329 | file_list( const std::string& in_string ){ 330 | 331 | // Find files from format "path/chr{1:22}.txt" 332 | std::size_t s_brack = in_string.find("{"); 333 | if( s_brack != std::string::npos ){ 334 | std::string prefix = in_string.substr(0, s_brack); 335 | std::string suffix = in_string.substr(in_string.find("}")+1); 336 | 337 | std::string rstr = in_string.substr(s_brack+1); 338 | 339 | int start = std::stoi(rstr.substr(0, rstr.find(":"))); 340 | rstr = rstr.substr(rstr.find(":")+1); 341 | 342 | int end = std::stoi(rstr.substr(0, rstr.find("}"))); 343 | 344 | for(int i = start; i <= end; i++){ 345 | file_paths.push_back(prefix + std::to_string(i) + suffix); 346 | } 347 | }else{ 348 | 349 | std::vector types = {".vcf", ".vcf.gz", ".bcf"}; 350 | 351 | for( const std::string& tp : types ){ 352 | if( in_string.substr(in_string.size() - tp.size()) == tp ){ 353 | file_paths.push_back(in_string); 354 | } 355 | } 356 | 
357 | if( file_paths.size() == 0 ){ 358 | // No match. Assume that in_string is a list of files. 359 | data_parser dp; 360 | dp.add_field(file_paths, 0); 361 | dp.parse_file(in_string); 362 | } 363 | 364 | } 365 | }; 366 | 367 | }; 368 | 369 | 370 | #endif -------------------------------------------------------------------------------- /src/eigenData.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | #include "eigenData.hpp" 17 | 18 | 19 | void read_eigen(const std::string& file_name, Eigen::SparseMatrix& eig_vec, Eigen::VectorXd& eig_val, const std::vector& kp_ids){ 20 | 21 | int n_samples, n_eigs, nnzero; 22 | 23 | // std::string file_name = file_prefix + ".eigen.gz"; 24 | // file_name = std::regex_replace(file_name, std::regex(".eigen.gz.eigen.gz"), std::string(".eigen.gz")); 25 | 26 | // ---------------------------------------------- 27 | // ---------------------------------------------- 28 | // Read file 29 | 30 | htsFile *htsf = hts_open(file_name.c_str(), "r"); 31 | kstring_t str = {0,0,0}; 32 | int n_fields; 33 | int *offsets; 34 | 35 | 36 | // ------------------------- 37 | // 1) First line contains number of samples, number of eigenvectors, number non-zero. 

    hts_getline(htsf, KS_SEP_LINE, &str);
    if( !str.l ){
        std::cerr << "ERROR: eigen file is incomplete.\n";
        abort();
    }
    offsets = ksplit(&str, 0, &n_fields);

    // Number of samples
    n_samples = std::atoi((str.s + offsets[0]));
    // Number of eigenvectors (retained based on eigenvalues)
    n_eigs = std::atoi((str.s + offsets[1]));
    // Number of non-zero entries (written from eig_vec.nonZeros())
    nnzero = std::atoi((str.s + offsets[2]));

    if( n_samples != kp_ids.size() ){
        std::cerr << "ERROR: number of samples in eigen file does not match input.\n";
        abort();
    }

    // Resize the vector and matrix objects
    eig_vec.resize(n_samples, n_eigs);
    eig_val.resize(n_eigs);


    // -------------------------
    // 2) Second line contains the list of sample IDs.

    hts_getline(htsf, KS_SEP_LINE, &str);
    if( !str.l ){
        std::cerr << "ERROR: eigen file is incomplete.\n";
        abort();
    }
    offsets = ksplit(&str, 0, &n_fields);

    if( n_fields != kp_ids.size() ){
        std::cerr << "ERROR: sample IDs in eigen file do not match input.\n";
        abort();
    }

    for( int i = 0; i < n_fields; i++)
    {
        if( std::string(str.s + offsets[i]) != kp_ids[i] ){
            std::cerr << "ERROR: sample IDs in eigen file do not match input.\n";
            abort();
        }
    }

    // -------------------------
    // 3) Third line contains eigenvalues.


    hts_getline(htsf, KS_SEP_LINE, &str);
    if( !str.l ){
        std::cerr << "ERROR: eigen file is incomplete.\n";
        abort();
    }
    offsets = ksplit(&str, 0, &n_fields);

    if( n_fields != n_eigs ){
        std::cerr << "ERROR: Incorrect number of eigenvalues.\n";
        abort();
    }

    for( int i = 0; i < n_fields; i++)
    {
        eig_val.coeffRef(i) = std::atof(str.s + offsets[i]);
    }


    // -------------------------
    // 4) Remaining lines contain eigenvector triplets.

    using td = Eigen::Triplet<double>;

    std::vector<td> triplets;
    triplets.reserve(nnzero);

    // hts_getline returns a negative value at end of file or on error.
    while( hts_getline(htsf, KS_SEP_LINE, &str) >= 0 ){
        if( !str.l ) break;

        offsets = ksplit(&str, 0, &n_fields);

        if( n_fields != 3 ){
            std::cerr << "ERROR: Malformed eigen file.\n";
            // std::cerr << "Line " << triplets.size() << "\n";
            // std::cerr << "No. fields " << n_fields << "\n";
            abort();
        }

        int ii = std::atoi(str.s + offsets[0]);
        int jj = std::atoi(str.s + offsets[1]);
        double val = std::atof(str.s + offsets[2]);

        triplets.push_back(td(ii,jj,val));
    }

    eig_vec.setFromTriplets(triplets.begin(), triplets.end());

    // -------------------------
    // 5) Close input file.
139 | 140 | ks_free(&str); 141 | hts_close(htsf); 142 | } 143 | 144 | 145 | void write_eigen(const std::string& file_prefix, Eigen::SparseMatrix& eig_vec, Eigen::VectorXd& eig_val, const std::vector& kp_ids){ 146 | 147 | // std::cerr << "Start write eigen.\n"; // NOTE: DEBUG LINE 148 | 149 | // std::cerr << "Prefix: " << file_prefix << "\n"; // NOTE: DEBUG LINE 150 | 151 | std::string file_name = file_prefix + ".eigen.gz"; 152 | 153 | 154 | // std::cerr << "Prefix: " << file_prefix << "\n"; // NOTE: DEBUG LINE 155 | 156 | // file_name = std::regex_replace(file_name, std::regex(".eigen.gz.eigen.gz"), std::string(".eigen.gz")); 157 | 158 | // std::cerr << "Processed file name.\n"; // NOTE: DEBUG LINE 159 | 160 | // std::cerr << file_name << "\n"; // NOTE: DEBUG LINE 161 | 162 | // ---------------------------------------------- 163 | // ---------------------------------------------- 164 | // Open output file 165 | 166 | BGZF* out_file = bgzf_open(file_name.c_str(), "w"); 167 | 168 | // std::cerr << "Opened file.\n"; // NOTE: DEBUG LINE 169 | 170 | // ------------------------- 171 | // 1) First line contains number of samples, number of eigenvectors, number non-zero. 
172 | 173 | write_to_bgzf(std::to_string( kp_ids.size() ), out_file ); 174 | write_to_bgzf("\t", out_file ); 175 | write_to_bgzf(std::to_string( eig_val.size() ), out_file ); 176 | write_to_bgzf("\t", out_file ); 177 | write_to_bgzf(std::to_string( eig_vec.nonZeros() ), out_file ); 178 | write_to_bgzf("\n", out_file ); 179 | 180 | // std::cerr << "1\n"; // NOTE: DEBUG LINE 181 | 182 | // ------------------------- 183 | // 2) Second line contains list of sample IDs 184 | 185 | write_to_bgzf(kp_ids[0], out_file); 186 | for( int i = 1; i < kp_ids.size(); i++ ){ 187 | write_to_bgzf("\t", out_file ); 188 | write_to_bgzf(kp_ids[i], out_file ); 189 | } 190 | write_to_bgzf("\n", out_file ); 191 | 192 | // std::cerr << "2\n"; // NOTE: DEBUG LINE 193 | 194 | // ------------------------- 195 | // 3) Third line contains eigenvalues. 196 | 197 | write_to_bgzf(std::to_string(eig_val.coeffRef(0)), out_file ); 198 | for( int i = 1; i < eig_val.size(); i++ ){ 199 | write_to_bgzf("\t", out_file ); 200 | write_to_bgzf(std::to_string(eig_val.coeffRef(i)), out_file ); 201 | } 202 | write_to_bgzf("\n", out_file); 203 | 204 | // std::cerr << "3\n"; // NOTE: DEBUG LINE 205 | 206 | // ------------------------- 207 | // 4) Remaining lines contain eigenvector triplets. 208 | 209 | for (int i = 0; i < eig_vec.outerSize(); i++) 210 | { 211 | for (Eigen::SparseMatrix::InnerIterator trip(eig_vec,i); trip; ++trip) 212 | { 213 | write_to_bgzf(std::to_string( trip.row() ), out_file ); 214 | write_to_bgzf("\t", out_file ); 215 | write_to_bgzf(std::to_string( trip.col() ), out_file ); 216 | write_to_bgzf("\t", out_file ); 217 | write_to_bgzf(std::to_string( trip.value() ), out_file ); 218 | write_to_bgzf("\n", out_file ); 219 | } 220 | } 221 | 222 | // ------------------------- 223 | // 5) Close the output file. 
224 | 225 | bgzf_close(out_file); 226 | 227 | } 228 | 229 | 230 | 231 | void update_blocks( int i, int j, std::vector& cluster_ids, std::vector>& clusters){ 232 | if( i == j ){ 233 | // A diagonal element, potentially block of size 1. 234 | // This does not affect block structure. 235 | return; 236 | } 237 | 238 | int c_i = cluster_ids[i]; 239 | int c_j = cluster_ids[j]; 240 | 241 | if( c_i > c_j ){ 242 | std::swap(c_i, c_j); 243 | std::swap(i, j); 244 | } 245 | 246 | if( c_i < 0 && c_j < 0 ){ 247 | 248 | // Neither i nor j has been assigned to a block. 249 | // Create a new block that comprises the pair. 250 | 251 | cluster_ids[i] = clusters.size(); 252 | cluster_ids[j] = clusters.size(); 253 | clusters.push_back({i,j}); 254 | 255 | }else if( c_i < 0 ){ 256 | 257 | // Since c_i < c_j, only i might never have been assigned. 258 | // If i was never assigned, then add it to j's block. 259 | 260 | cluster_ids[i] = c_j; 261 | clusters[c_j].insert(i); 262 | 263 | }else if(c_i != c_j){ 264 | 265 | // i and j were previously assigned to different blocks. 266 | // We must merge these two blocks together. 267 | 268 | // Add all members of j's block into i's block. 269 | clusters[c_i].insert(clusters[c_j].begin(), clusters[c_j].end()); 270 | 271 | // Set block ids of members of block c_j to c_i. 272 | for( const int& k : clusters[c_j] ){ 273 | cluster_ids[k] = c_i; 274 | } 275 | 276 | // Delete the block c_j, as it has been merged into c_i. 277 | clusters.erase (clusters.begin() + c_j ); 278 | 279 | // There is now one less block. 280 | // For all blocks > c_j, decrement block id by 1. 
281 | for( int& k : cluster_ids ){ 282 | if( k > c_j ){ 283 | k -= 1; 284 | } 285 | } 286 | 287 | } 288 | return; 289 | } 290 | 291 | 292 | void read_sparse_GRM(const std::string& filename, Eigen::SparseMatrix& GRM, const std::vector& kp_ids, const double& r_scale, const int& r_col, std::vector>& related) 293 | { 294 | int n = kp_ids.size(); 295 | 296 | GRM = Eigen::SparseMatrix(n,n); 297 | 298 | if( filename == "" ){ 299 | GRM.setIdentity(); 300 | return; 301 | } 302 | 303 | std::vector cluster_ids; 304 | std::vector> clusters; 305 | 306 | std::unordered_map id_map; 307 | for( int i = 0; i < n; i++ ){ 308 | id_map[ kp_ids[i] ] = i; 309 | cluster_ids.push_back(-1); 310 | } 311 | 312 | std::vector id1, id2; 313 | std::vector val; 314 | 315 | data_parser dp; 316 | dp.add_field(id1, 0); 317 | dp.add_field(id2, 1); 318 | dp.add_field(val, r_col - 1); 319 | 320 | int nr = 0; 321 | dp.parse_file(filename, nr); 322 | 323 | using td = Eigen::Triplet; 324 | 325 | std::vector triplets; 326 | 327 | bool add_diag = true; 328 | 329 | for( int i = 0; i < nr; i++ ) 330 | { 331 | const auto f1 = id_map.find(id1[i]); 332 | const auto f2 = id_map.find(id2[i]); 333 | 334 | if( add_diag ){ 335 | if( id1[i] == id2[i] ){ 336 | add_diag = false; 337 | } 338 | } 339 | 340 | if( f1 != id_map.end() && f2 != id_map.end() ) 341 | { 342 | int& ii = f1->second; 343 | int& jj = f2->second; 344 | 345 | update_blocks( ii, jj, cluster_ids, clusters); 346 | 347 | triplets.push_back(td(ii,jj,r_scale*val[i])); 348 | if( ii != jj ){ 349 | triplets.push_back(td(jj,ii,r_scale*val[i])); 350 | } 351 | } 352 | 353 | } 354 | if( add_diag ){ 355 | for(int i = 0; i < n; ++i) triplets.push_back(td(i,i,1.0)); 356 | } 357 | 358 | related.clear(); 359 | for( const auto& s : clusters ){ 360 | related.push_back(std::vector(s.begin(), s.end())); 361 | } 362 | 363 | GRM.setFromTriplets(triplets.begin(), triplets.end()); 364 | } 365 | -------------------------------------------------------------------------------- 
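A minimal, self-contained sketch of the block-merging logic in `update_blocks` above, which `read_sparse_GRM` uses to group related samples. The function name `merge_pair` is hypothetical, and the container types (`std::vector<int>` for block ids, `std::vector<std::set<int>>` for blocks) are assumptions, since the template arguments are not visible in this listing.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for update_blocks(); container types are assumed.
// cluster_ids[s] holds the block index of sample s, or -1 if unassigned.
void merge_pair(int i, int j,
                std::vector<int>& cluster_ids,
                std::vector<std::set<int>>& clusters)
{
    if (i == j) return;  // diagonal entry: block structure is unchanged

    int c_i = cluster_ids[i];
    int c_j = cluster_ids[j];
    if (c_i > c_j) {     // order the pair so that c_i <= c_j
        std::swap(c_i, c_j);
        std::swap(i, j);
    }

    if (c_i < 0 && c_j < 0) {
        // Neither sample is assigned: open a new block holding the pair.
        const int id = static_cast<int>(clusters.size());
        cluster_ids[i] = id;
        cluster_ids[j] = id;
        clusters.push_back({i, j});
    } else if (c_i < 0) {
        // Only i is unassigned: add it to j's block.
        cluster_ids[i] = c_j;
        clusters[c_j].insert(i);
    } else if (c_i != c_j) {
        // Both assigned to different blocks: merge block c_j into c_i.
        clusters[c_i].insert(clusters[c_j].begin(), clusters[c_j].end());
        for (int k : clusters[c_j]) cluster_ids[k] = c_i;
        clusters.erase(clusters.begin() + c_j);
        // Block indices above c_j shift down by one after the erase.
        for (int& k : cluster_ids) {
            if (k > c_j) --k;
        }
    }
}
```

Under these assumptions, feeding the pairs (0,1), (2,3) and then (1,2) for five samples leaves a single block {0,1,2,3} and sample 4 unassigned. Each merge rescans `cluster_ids`, so the cost grows with the number of samples per merge; a union-find structure would be an alternative design, though it is not what the source does.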
/src/eigenData.hpp: -------------------------------------------------------------------------------- 1 | #ifndef EIGENDATA_HPP 2 | #define EIGENDATA_HPP 3 | 4 | #include 5 | #include 6 | #include 7 | #include 8 | 9 | #include "setOptions.hpp" 10 | #include "mapID.hpp" 11 | #include "htsWrappers.hpp" 12 | #include "dataParser.hpp" 13 | #include "genotypeData.hpp" 14 | 15 | #include 16 | #include 17 | 18 | void update_blocks( int i, int j, std::vector& cluster_ids, std::vector>& clusters); 19 | 20 | void read_eigen(const std::string& file_name, Eigen::SparseMatrix& eig_vec, Eigen::VectorXd& eig_val, const std::vector& kp_ids); 21 | 22 | void write_eigen(const std::string& file_prefix, Eigen::SparseMatrix& eig_vec, Eigen::VectorXd& eig_val, const std::vector& kp_ids); 23 | 24 | void read_sparse_GRM(const std::string& filename, Eigen::SparseMatrix& GRM, const std::vector& kp_ids, const double& r_scale, const int& r_col, std::vector>& related); 25 | 26 | void read_dense_GRM(const std::string&, Eigen::MatrixXd&, std::vector&); 27 | 28 | #endif -------------------------------------------------------------------------------- /src/fitUtils.cpp: -------------------------------------------------------------------------------- 1 | /* fitUtils: 2 | 3 | Copyright (C) 2020 4 | Author: Corbin Quick 5 | 6 | This file is a part of APEX. 7 | 8 | APEX is distributed "AS IS" in the hope that it will be 9 | useful, but WITHOUT ANY WARRANTY; without even the implied 10 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 11 | FOR A PARTICULAR PURPOSE. 12 | 13 | The above copyright notice and disclaimer of warranty must 14 | be included in all copies or substantial portions of APEX. 
15 | */ 16 | 17 | #include "fitUtils.hpp" 18 | 19 | 20 | Eigen::MatrixXd resid( const Eigen::MatrixXd& Y, const Eigen::MatrixXd& X ){ 21 | return Y - X * (X.transpose() * X).ldlt().solve(X.transpose() * Y); 22 | } 23 | 24 | void rank_normalize(Eigen::MatrixXd& Y){ 25 | double m = Y.cols(); 26 | double n = Y.rows(); 27 | 28 | std::vector z((int) n); 29 | std::vector rk((int) n); 30 | 31 | double mu = 0; 32 | double sd = 0; 33 | for(int i = 0; i < n; ++i){ 34 | z[i] = qnorm( ((double)i+1.0)/((double)n+1.0) ); 35 | mu += z[i]; 36 | sd += z[i]*z[i]; 37 | } 38 | sd = std::sqrt( sd/(n - 1) - mu*mu/( n*(n - 1.0) ) ); 39 | mu = mu/n; 40 | for(int i = 0; i < n; ++i){ 41 | z[i] = (z[i] - mu)/sd; 42 | } 43 | 44 | for(int j = 0; j < m; ++j){ 45 | 46 | std::vector v(n); 47 | for(int i = 0; i< n; ++i){ 48 | v[i] = Y(i,j); 49 | } 50 | 51 | std::vector ranks = rank_vector(v); 52 | 53 | for(int i = 0; i< n; ++i){ 54 | Y(i,j) = z[ ranks[i]-1 ]; 55 | } 56 | } 57 | } 58 | 59 | void scale_and_center(Eigen::MatrixXd& Y, std::vector& sd_vec){ 60 | double m = Y.cols(); 61 | double n = Y.rows(); 62 | 63 | if( sd_vec.size() != m ) sd_vec = std::vector(m, 0.0); 64 | 65 | for(int j = 0; j < m; ++j){ 66 | double mu = 0; 67 | double sd = 0; 68 | for(int i = 0; i < n; ++i){ 69 | mu += Y(i,j); 70 | sd += Y(i,j)*Y(i,j); 71 | } 72 | sd = std::sqrt( sd/(n - 1) - mu*mu/( n*(n - 1.0) ) ); 73 | sd_vec[j] = sd; 74 | mu = mu/n; 75 | for(int i = 0; i < n; ++i){ 76 | Y(i,j) = (Y(i,j) - mu)/sd; 77 | } 78 | } 79 | } 80 | 81 | void scale_and_center(Eigen::MatrixXd& Y){ 82 | double m = Y.cols(); 83 | double n = Y.rows(); 84 | 85 | for(int j = 0; j < m; j++){ 86 | double mu = 0; 87 | double sd = 0; 88 | for(int i = 0; i < n; i++){ 89 | mu += Y(i,j); 90 | sd += Y(i,j)*Y(i,j); 91 | } 92 | sd = std::sqrt( sd/(n - 1) - mu*mu/( n*(n - 1.0) ) ); 93 | mu = mu/n; 94 | for(int i = 0; i < n; i++){ 95 | Y(i,j) = (Y(i,j) - mu)/sd; 96 | } 97 | } 98 | } 99 | 100 | Eigen::MatrixXd get_half_hat_matrix(const Eigen::MatrixXd& 
X){ 101 | Eigen::SelfAdjointEigenSolver XtX_es(X.transpose() * X); 102 | Eigen::VectorXd lambda = XtX_es.eigenvalues(); 103 | for( auto& a : lambda ){ 104 | if( a <= 0.00 ){ 105 | std::cerr << "ERROR: Covariate matrix is not full rank. Check input. \n"; 106 | abort(); 107 | } 108 | a = 1/std::sqrt(a); 109 | } 110 | Eigen::MatrixXd U = X * XtX_es.eigenvectors() * lambda.asDiagonal(); 111 | return U; 112 | } 113 | 114 | void make_half_hat_matrix(Eigen::MatrixXd& X){ 115 | Eigen::SelfAdjointEigenSolver XtX_es(X.transpose() * X); 116 | Eigen::VectorXd lambda = XtX_es.eigenvalues(); 117 | for( auto& a : lambda ){ 118 | if( a <= 0.00 ){ 119 | std::cerr << "ERROR: Covariate matrix is not full rank. Check input. \n"; 120 | abort(); 121 | } 122 | a = 1/std::sqrt(a); 123 | } 124 | X *= XtX_es.eigenvectors(); 125 | X *= lambda.asDiagonal(); 126 | return; 127 | } 128 | 129 | Eigen::MatrixXd resid_from_half_hat( const Eigen::MatrixXd& Y, const Eigen::MatrixXd& C ){ 130 | Eigen::MatrixXd CtY = C.transpose() * Y; 131 | return (Y - C * CtY).eval(); 132 | } 133 | 134 | void make_resid_from_half_hat( Eigen::MatrixXd& Y, const Eigen::MatrixXd& C ){ 135 | Eigen::MatrixXd CtY = C.transpose() * Y; 136 | Y.noalias() -= (C *CtY); 137 | return; 138 | } 139 | 140 | Eigen::VectorXd resid_vec_from_half_hat( const Eigen::VectorXd& Y, const Eigen::MatrixXd& C ){ 141 | Eigen::VectorXd CtY = C.transpose() * Y; 142 | Eigen::VectorXd Yhat = C * CtY; 143 | return (Y - Yhat).eval(); 144 | } 145 | 146 | void appendInterceptColumn( Eigen::MatrixXd &X ){ 147 | X.conservativeResize(Eigen::NoChange, X.cols()+1); 148 | X.col(X.cols()-1) = Eigen::VectorXd::Ones(X.rows()); 149 | return; 150 | } 151 | 152 | 153 | -------------------------------------------------------------------------------- /src/fitUtils.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 
6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | #ifndef FITUTILS_HPP 17 | #define FITUTILS_HPP 18 | 19 | #include 20 | #include 21 | #include 22 | 23 | #include 24 | #include 25 | 26 | #include "mathStats.hpp" 27 | 28 | 29 | inline void printMeanDiag( const Eigen::SparseMatrix& mat ){ 30 | std::cout << "Mean diagonal element = " << mat.diagonal().mean() << "\n"; 31 | } 32 | 33 | inline void printMeanDiag( const Eigen::MatrixXd& mat ){ 34 | std::cout << "Mean diagonal element = " << mat.diagonal().mean() << "\n"; 35 | } 36 | 37 | void appendInterceptColumn(Eigen::MatrixXd&); 38 | Eigen::MatrixXd get_half_hat_matrix(const Eigen::MatrixXd&); 39 | Eigen::MatrixXd resid_from_half_hat(const Eigen::MatrixXd&, const Eigen::MatrixXd&); 40 | 41 | Eigen::VectorXd resid_vec_from_half_hat(const Eigen::VectorXd& Y, const Eigen::MatrixXd&); 42 | void make_half_hat_matrix(Eigen::MatrixXd& X); 43 | void make_resid_from_half_hat( Eigen::MatrixXd& Y, const Eigen::MatrixXd& C ); 44 | 45 | 46 | void scale_and_center(Eigen::MatrixXd&, std::vector&); 47 | void scale_and_center(Eigen::MatrixXd&); 48 | void rank_normalize (Eigen::MatrixXd&); 49 | 50 | #endif 51 | 52 | -------------------------------------------------------------------------------- /src/genotypeData.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 
11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | #include "genotypeData.hpp" 17 | 18 | bool sp_geno_fmt = true; 19 | 20 | void genotype_data::read_bcf_variants(bcf_srs_t*& sr, bcf_hdr_t*& hdr, int& n_var, bool store_geno, bool scan_geno) 21 | { 22 | 23 | if( store_geno ){ 24 | std::cerr << "Processed genotype data for "; 25 | }else{ 26 | std::cerr << "Processed variant data for "; 27 | } 28 | std::string iter_cerr_suffix = " variants ... "; 29 | 30 | print_iter_cerr(1, 0, iter_cerr_suffix); 31 | int last = 0; 32 | 33 | initialize_genotypes(n_var); 34 | n_var = 0; 35 | geno_size = 0; 36 | 37 | while( bcf_sr_next_line(sr) ) 38 | { 39 | bcf1_t *rec = bcf_sr_get_line(sr,0); 40 | if( process_bcf_variant(rec, hdr, store_geno, scan_geno) ){ 41 | n_var++; 42 | if( store_geno ) geno_size++; 43 | } 44 | thinned_iter_cerr(last, n_var, iter_cerr_suffix, 2500); 45 | } 46 | print_iter_cerr(last, n_var, iter_cerr_suffix); 47 | 48 | geno_start = 0; 49 | 50 | get_ld_index(); 51 | 52 | return; 53 | } 54 | 55 | int genotype_data::get_ld_index() 56 | { 57 | int bp = global_opts::ld_window_bp; 58 | 59 | int max_entries = 0; 60 | 61 | int ii = 0; 62 | int ii_s = 0; 63 | int ii_e = 0; 64 | while( ii < pos.size() ){ 65 | while( pos[ii] - pos[ii_s] > bp && ii_s < pos.size() ){ 66 | ii_s++; 67 | } 68 | while( pos[ii_e] - pos[ii] <= bp && ii_e < pos.size() ){ 69 | ii_e++; 70 | } 71 | 72 | ld_index_range.push_back( std::make_pair(ii, ii_e) ); 73 | 74 | if( ii_e - ii - 1 > max_entries ){ 75 | max_entries = ii_e - ii - 1; 76 | } 77 | 78 | ii++; 79 | } 80 | return max_entries; 81 | } 82 | 83 | void genotype_data::clear_genotypes() 84 | { 85 | genotypes.setZero(); 86 | genotypes.data().squeeze(); 87 | } 88 | 89 | void genotype_data::clear() 90 | { 91 | n_variants = 0; 92 | 93 | chr.clear(); 94 | pos.clear(); 95 | ref.clear(); 96 | alt.clear(); 97 | 98 | mean.clear(); 99 | var.clear(); 100 | 101 | 
flipped.clear(); 102 | 103 | clear_genotypes(); 104 | } 105 | 106 | inline bool genotype_data::add_bcf_genotypes(int*& gt_rec, const int& col_n, double& mean_, double& var_, bool& flip_, const bool store_geno) 107 | { 108 | int n = 0; 109 | mean_ = 0; 110 | var_ = 0; 111 | flip_ = false; 112 | sparse_gt gts; 113 | std::vector missing; 114 | if( sp_geno_fmt ){ 115 | gts.resize(ids.idx.size()); 116 | }else{ 117 | if( dense_genotypes.rows() < ids.keep.size() ){ 118 | dense_genotypes.conservativeResize(ids.keep.size(), Eigen::NoChange); 119 | } 120 | if( dense_genotypes.cols() < col_n ){ 121 | dense_genotypes.conservativeResize(Eigen::NoChange, col_n + chunk_size); 122 | } 123 | } 124 | int n_obs = 0; 125 | 126 | int n_2 = 0; 127 | int n_1 = 0; 128 | 129 | for(const int& i: ids.idx){ 130 | 131 | int a1 = bcf_gt_allele(gt_rec[i*2+0]); 132 | int a2 = bcf_gt_allele(gt_rec[i*2+1]); 133 | if(a1 < 0 || a2 < 0) 134 | { 135 | if( global_opts::exclude_missing > 0 ){ 136 | return false; 137 | }else{ 138 | // a1 = a1 < 0 ? 0 : a1; 139 | // a2 = a2 < 0 ? 
0 : a2; 140 | if(store_geno){ 141 | if( sp_geno_fmt ){ 142 | gts.set_gt(n,-1); 143 | }else{ 144 | missing.push_back(i); 145 | } 146 | } 147 | } 148 | }else{ 149 | 150 | if( a1 + a2 > 0 ){ 151 | a1 += a2; 152 | if( a1 > 1){ 153 | n_2++; 154 | }else{ 155 | n_1++; 156 | } 157 | if(store_geno){ 158 | if( sp_geno_fmt ){ 159 | gts.set_gt(n,a1); 160 | }else{ 161 | // DO DENSE 162 | // std::cerr << "DO DENSE"; 163 | dense_genotypes(n, col_n) = a1; 164 | } 165 | } 166 | mean_ += a1; 167 | var_ += a1*a1; 168 | } 169 | n_obs++; 170 | } 171 | n++; 172 | } 173 | 174 | mean_ = mean_/((double) n_obs); 175 | var_ = (var_ - n_obs*mean_*mean_)/ ( (double) n_obs + 1 ); 176 | 177 | if( n_2 > n_obs - n_1 - n_2 ){ 178 | flip_ = true; 179 | mean_ = 2.0 - mean_; 180 | if( store_geno && sp_geno_fmt ) gts.flip(n); 181 | } 182 | if( store_geno && sp_geno_fmt ){ 183 | // checkResize(col_n); 184 | gts.add_gt_sparsemat(genotypes, col_n); 185 | } 186 | if( store_geno && !sp_geno_fmt && missing.size() > 0 ){ 187 | for(const int& i : missing ){ 188 | dense_genotypes(i, col_n) = mean_; 189 | } 190 | } 191 | 192 | return true; 193 | } 194 | 195 | inline bool genotype_data::add_bcf_dosages(float*& ds_rec, const int& col_n, double& mean_, double& var_, bool& flip_, const bool store_geno) 196 | { 197 | int n = 0; 198 | int n_obs = 0; 199 | mean_ = 0; 200 | var_ = 0; 201 | flip_ = false; 202 | sparse_ds sp_ds; 203 | std::vector missing; 204 | if( sp_geno_fmt ){ 205 | sp_ds.resize(ids.idx.size()); 206 | }else{ 207 | if( dense_genotypes.rows() < ids.keep.size() ){ 208 | dense_genotypes.conservativeResize(ids.keep.size(), Eigen::NoChange); 209 | } 210 | if( dense_genotypes.cols() < col_n ){ 211 | dense_genotypes.conservativeResize(Eigen::NoChange, col_n + chunk_size); 212 | } 213 | } 214 | 215 | int n_2 = 0; 216 | int n_0 = 0; 217 | 218 | for(const int& i: ids.idx) 219 | { 220 | float ds = ds_rec[i]; 221 | if( ds < 0 ){ 222 | if(store_geno){ 223 | if( sp_geno_fmt ){ 224 | sp_ds.set_ds(n,-1.00); 225 | 
}else{ 226 | missing.push_back(i); 227 | } 228 | } 229 | }else{ 230 | if( ds <= global_opts::dosage_thresh ){ 231 | ds = 0.00; 232 | n_0++; 233 | }else if( ds >= 2.00 - global_opts::dosage_thresh ){ 234 | ds = 2.00; 235 | n_2++; 236 | } 237 | if(store_geno){ 238 | if( sp_geno_fmt ){ 239 | sp_ds.set_ds(n,ds); 240 | }else{ 241 | // std::cerr << "DO DENSE"; 242 | dense_genotypes(n, col_n) = ds; 243 | } 244 | } 245 | mean_ += ds; 246 | var_ += ds*ds; 247 | n_obs++; 248 | } 249 | n++; 250 | } 251 | 252 | mean_ = mean_/((double) n_obs); 253 | var_ = (var_ - n_obs*mean_*mean_)/ ( (double) n_obs + 1 ); 254 | 255 | if( n_2 > n_0 ){ 256 | flip_ = true; 257 | mean_ = 2.0 - mean_; 258 | if( store_geno && sp_geno_fmt ) sp_ds.flip(); 259 | } 260 | if( store_geno && sp_geno_fmt ){ 261 | // checkResize(col_n); 262 | sp_ds.add_ds_sparsemat(genotypes, col_n); 263 | } 264 | if( store_geno && !sp_geno_fmt && missing.size() > 0 ){ 265 | for(const int& i : missing ){ 266 | dense_genotypes(i, col_n) = mean_; 267 | } 268 | } 269 | 270 | return true; 271 | } 272 | 273 | void genotype_data::read_bcf_header(bcf_hdr_t* hdr){ 274 | 275 | n_samples = bcf_hdr_nsamples(hdr); 276 | 277 | for(int i = 0; i < n_samples; i++){ 278 | ids.file.push_back(hdr->samples[i]); 279 | } 280 | if( ids.keep.size() == 0 ){ 281 | ids.setKeepIDs(ids.file); 282 | } 283 | } 284 | 285 | void genotype_data::freeze_genotypes(){ 286 | genotypes.conservativeResize(ids.keep.size(), geno_size); 287 | genotypes.finalize(); 288 | genotypes.makeCompressed(); 289 | } 290 | 291 | inline bool genotype_data::process_bcf_variant(bcf1_t*& rec, bcf_hdr_t*& hdr, bool store_geno, bool scan_geno){ 292 | 293 | if( rec->n_allele != 2 ){ 294 | std::cerr << "failed rec->n_allele" << rec->n_allele << "\n"; 295 | return false; 296 | } 297 | 298 | if( n_samples <= 0 ){ 299 | if( ids.keep.size() == 0 ){ 300 | read_bcf_header(hdr); 301 | }else{ 302 | n_samples = ids.keep.size(); 303 | } 304 | } 305 | 306 | double mean_ = -1; 307 | double var_ = 
-1; 308 | 309 | bool flip_ = false; 310 | bool keep_ = true; 311 | 312 | // std::string snp_id(rec->d.id); 313 | 314 | if( scan_geno ){ 315 | 316 | if( global_opts::use_dosages ){ 317 | int nds_arr = 0; 318 | float *ds = NULL; 319 | bcf_get_format_float(hdr, rec, "DS", &ds, &nds_arr)/n_samples; 320 | keep_ = add_bcf_dosages(ds, n_variants, mean_, var_, flip_, store_geno); 321 | delete ds; 322 | }else{ 323 | int ngt_arr = 0; 324 | int *gt = NULL; 325 | bcf_get_format_int32(hdr, rec, "GT", &gt, &ngt_arr)/n_samples; 326 | keep_ = add_bcf_genotypes(gt, n_variants, mean_, var_, flip_, store_geno); 327 | delete gt; 328 | } 329 | 330 | if( !keep_ ){ 331 | //genotypes.conservativeResize(n_variants, Eigen::NoChange); 332 | std::cerr << "Warning: Failed to add genotypes.\n"; 333 | return false; 334 | } 335 | 336 | } 337 | 338 | n_variants++; 339 | 340 | chr.push_back(bcf_hdr_id2name(hdr, rec->rid)); 341 | pos.push_back(rec->pos + 1); 342 | //rsid.push_back(rsid_); 343 | ref.push_back(rec->d.allele[0]); 344 | alt.push_back(rec->d.allele[1]); 345 | 346 | flipped.push_back(flip_); 347 | keep.push_back(keep_); 348 | 349 | mean.push_back(mean_); 350 | var.push_back(var_); 351 | 352 | return true; 353 | } 354 | 355 | bool genotype_data::record_matches(bcf1_t*& rec, bcf_hdr_t*& hdr, const int& i){ 356 | //cerr << bcf_hdr_id2name(hdr, rec->rid) << "\t" << (rec->pos + 1) << "\t" 357 | if( bcf_hdr_id2name(hdr, rec->rid) == chr[i] ){ 358 | if( (rec->pos + 1) == pos[i] ){ 359 | if( rec->d.allele[0] == ref[i] ){ 360 | if( rec->d.allele[1] == alt[i] ){ 361 | return true; 362 | } 363 | } 364 | } 365 | } 366 | return false; 367 | } 368 | 369 | void genotype_data::read_genotypes(bcf_srs_t*& sr, bcf_hdr_t*& hdr, const int& i_s, const int& n_s) 370 | { 371 | clear_genotypes(); 372 | resize_genotypes(n_s); 373 | 374 | int i_e = i_s + n_s - 1; 375 | if( chr[i_s] != chr[i_e] ){ 376 | std::cerr << "Fatal: regional chr mismatch in genotype_data::read_genotypes. 
\n"; 377 | exit(1); 378 | } 379 | bcf_seek(sr, chr[i_s], pos[i_s]); 380 | 381 | int i_i = i_s; 382 | 383 | geno_start = i_s; 384 | geno_size = 0; 385 | 386 | int r_i = 0; 387 | 388 | while( bcf_sr_next_line(sr) && r_i < n_s ) 389 | { 390 | bcf1_t *rec = bcf_sr_get_line(sr,0); 391 | 392 | if( record_matches(rec, hdr, i_i) ){ 393 | 394 | if( keep[i_i] ){ 395 | 396 | bool is_flipped = false; 397 | 398 | 399 | if( global_opts::use_dosages ){ 400 | int nds_arr = 0; 401 | float *ds = NULL; 402 | bcf_get_format_float(hdr, rec, "DS", &ds, &nds_arr)/n_samples; 403 | keep[i_i] = add_bcf_dosages(ds, r_i, mean[i_i], var[i_i], is_flipped, true); 404 | delete ds; 405 | }else{ 406 | int ngt_arr = 0; 407 | int *gt = NULL; 408 | int n_gts = bcf_get_format_int32(hdr, rec, "GT", &gt, &ngt_arr)/n_samples; 409 | keep[i_i] = add_bcf_genotypes(gt, r_i, mean[i_i], var[i_i], is_flipped, true); 410 | delete gt; 411 | } 412 | 413 | flipped[i_i] = is_flipped; 414 | 415 | 416 | } 417 | 418 | r_i++; 419 | i_i++; 420 | geno_size++; 421 | } 422 | } 423 | if( i_i < i_e ){ 424 | std::cerr << "Fatal: Failed to fill region in genotype_data::read_genotypes. \n"; 425 | std::cerr << i_s << ", " << i_i << ", " << i_e << "\n"; 426 | exit(1); 427 | } 428 | freeze_genotypes(); 429 | } 430 | -------------------------------------------------------------------------------- /src/genotypeData.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
14 | */ 15 | 16 | 17 | #ifndef GENOTYPEDATA_HPP 18 | #define GENOTYPEDATA_HPP 19 | 20 | #include 21 | #include 22 | 23 | #include "setOptions.hpp" 24 | #include "mapID.hpp" 25 | #include "htsWrappers.hpp" 26 | #include "dataParser.hpp" 27 | #include "eigenData.hpp" 28 | 29 | #include 30 | #include 31 | 32 | class sparse_gt 33 | { 34 | private: 35 | std::vector> entries; 36 | double sum_gt; 37 | int n_samples; 38 | int n_missing; 39 | int last; 40 | public: 41 | sparse_gt(const int& ns) : sum_gt(0.00), n_samples(ns), n_missing(0), last(-1) {}; 42 | 43 | sparse_gt() : sum_gt(0.00), n_samples(0), n_missing(0), last(-1) {}; 44 | 45 | void resize(const int& ns){n_samples = ns;}; 46 | 47 | int size(){ 48 | return entries.size(); 49 | }; 50 | void set_gt(const int& n, const int& gt){ 51 | if( n > last ){ 52 | if( gt < 0 ){ 53 | entries.push_back(std::make_pair(n,-1)); 54 | n_missing++; 55 | }else if( gt > 0 ){ 56 | entries.push_back(std::make_pair(n,gt)); 57 | sum_gt += gt; 58 | } 59 | last = n; 60 | }else{ 61 | std::cerr << "Fatal: Tried to add gt out of order!\n"; 62 | exit(2); 63 | } 64 | }; 65 | void flip(const int& n){ 66 | int j = 0, i = 0; 67 | std::vector> flipped; 68 | while(i < n){ 69 | if( j < entries.size() ){ 70 | while( i < entries[j].first && i < n ){ 71 | flipped.push_back( std::make_pair(i, 2) ); 72 | i++; 73 | } 74 | if( i == entries[j].first ){ 75 | if( entries[j].second < 0 ){ 76 | flipped.push_back(entries[j]); 77 | }else if( entries[j].second == 1 ){ 78 | flipped.push_back(entries[j]); 79 | }else if(entries[j].second != 2){ 80 | std::cerr <<"\n\n" << entries[j].second << "\n\n"; 81 | std::cerr << "entries[j].second != 2\n"; 82 | exit(1); 83 | } 84 | i++; 85 | j++; 86 | }else if( j > i ){ 87 | std::cerr << "This should never happen!\n"; 88 | exit(2); 89 | } 90 | }else{ 91 | flipped.push_back( std::make_pair(i, 2) ); 92 | i++; 93 | } 94 | } 95 | sum_gt = 2.00 * ( (double) (n_samples - n_missing) ) - sum_gt; 96 | entries = flipped; 97 | }; 98 | void 
add_gt_sparsemat(Eigen::SparseMatrix& smat, const int& col_n){ 99 | smat.startVec(col_n); 100 | double missing_val = sum_gt/( (double) (n_samples - n_missing) ); 101 | for( const auto& x : entries){ 102 | if( x.second > 0 ){ 103 | smat.insertBack(x.first, col_n) = x.second; 104 | }else{ 105 | smat.insertBack(x.first, col_n) = missing_val; 106 | } 107 | } 108 | }; 109 | }; 110 | 111 | 112 | class sparse_ds 113 | { 114 | public: 115 | std::vector dosages; 116 | int last; 117 | double sum_dos; 118 | int n_missing; 119 | int n_samples; 120 | 121 | sparse_ds(const int& N) : sum_dos(0.00), n_missing(0), n_samples(N), last(-1) {dosages.resize(N);}; 122 | 123 | sparse_ds() : sum_dos(0.00), n_missing(0), n_samples(0), last(-1) {}; 124 | 125 | void resize(const int& N){n_samples = N; dosages.resize(N); }; 126 | 127 | 128 | void set_ds(const int& n, const float& ds){ 129 | if( n > last ){ 130 | if( ds < 0 ){ 131 | n_missing++; 132 | }else{ 133 | sum_dos += ds; 134 | } 135 | dosages[n] = ds; 136 | last = n; 137 | }else{ 138 | std::cerr << "Fatal: Tried to add ds out of order!\n"; 139 | exit(2); 140 | } 141 | }; 142 | void flip(){ 143 | for( float& ds: dosages ){ 144 | if(ds >= 0){ 145 | ds = 2.00 - ds; 146 | } 147 | } 148 | sum_dos = 2.00 * ( (double) (n_samples - n_missing) ) - sum_dos; 149 | }; 150 | void add_ds_sparsemat(Eigen::SparseMatrix& smat, const int& col_n){ 151 | double missing_val = sum_dos/( (double) (n_samples - n_missing) ); 152 | smat.startVec(col_n); 153 | for(int i = 0; i < dosages.size(); i++){ 154 | if( dosages[i] < 0.00 ){ 155 | if( missing_val > global_opts::dosage_thresh ){ 156 | smat.insertBack(i, col_n) = missing_val; 157 | } 158 | }else if( dosages[i] > global_opts::dosage_thresh ){ 159 | smat.insertBack(i, col_n) = (double) dosages[i]; 160 | } 161 | } 162 | }; 163 | }; 164 | 165 | 166 | class genotype_data 167 | { 168 | public: 169 | genotype_data(): chunk_size(1000), nz_frac(0.1), n_variants(0), n_samples(0) {}; 170 | 171 | int chunk_size; 172 | 
double nz_frac; 173 | 174 | int n_samples; 175 | int n_variants; 176 | 177 | id_map ids; 178 | 179 | std::vector chr; 180 | std::vector pos; 181 | //std::vector rsid; 182 | std::vector ref; 183 | std::vector alt; 184 | 185 | std::vector> ld_index_range; 186 | 187 | std::vector mean; 188 | std::vector var; 189 | 190 | std::vector flipped; 191 | std::vector keep; 192 | 193 | int geno_start; 194 | int geno_size; 195 | 196 | Eigen::SparseMatrix genotypes; 197 | Eigen::MatrixXd dense_genotypes; 198 | 199 | void read_bcf_variants(bcf_srs_t*&, bcf_hdr_t*&, int&, bool store_geno = true, bool scan_geno = true); 200 | 201 | int get_ld_index(); 202 | inline bool process_bcf_variant(bcf1_t*&, bcf_hdr_t*&, bool store_geno = true, bool scan_geno = true); 203 | void read_bcf_header(bcf_hdr_t*); 204 | void freeze_genotypes(); 205 | void resize_genotypes(const int& nv){ 206 | genotypes.resize(ids.keep.size(), nv); 207 | genotypes.reserve( (int) (nz_frac * genotypes.rows() * genotypes.cols() ) ); 208 | }; 209 | void initialize_genotypes(const int& nv){ 210 | resize_genotypes(nv); 211 | }; 212 | void print_genotypes(){ 213 | std::cout << genotypes << "\n"; 214 | }; 215 | 216 | void clear(); 217 | void clear_genotypes(); 218 | void read_genotypes(bcf_srs_t*&, bcf_hdr_t*&, const int&, const int&); 219 | 220 | bool record_matches(bcf1_t*& rec, bcf_hdr_t*& hdr, const int& i); 221 | 222 | private: 223 | inline bool add_bcf_genotypes(int*&, const int&, double&, double&, bool&, const bool); 224 | inline bool add_bcf_dosages(float*& ds_rec, const int& col_n, double& mean_, double& var_, bool& flip_, const bool store_geno); 225 | 226 | void checkResize(const int& col_n){ 227 | if( col_n >= genotypes.cols() ){ 228 | genotypes.conservativeResize(ids.keep.size(), col_n + chunk_size); 229 | } 230 | }; 231 | 232 | }; 233 | 234 | 235 | #endif 236 | 237 | -------------------------------------------------------------------------------- /src/htsWrappers.cpp: 
-------------------------------------------------------------------------------- 1 | /* htsWrappers: 2 | 3 | Copyright (C) 2020 4 | Author: Corbin Quick 5 | 6 | This file is a part of APEX. 7 | 8 | APEX is distributed "AS IS" in the hope that it will be 9 | useful, but WITHOUT ANY WARRANTY; without even the implied 10 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 11 | FOR A PARTICULAR PURPOSE. 12 | 13 | The above copyright notice and disclaimer of warranty must 14 | be included in all copies or substantial portions of APEX. 15 | */ 16 | 17 | #include "htsWrappers.hpp" 18 | 19 | 20 | void indexed_hts_file::open(const std::string& prefix, const std::vector& reg) 21 | { 22 | regions = reg; 23 | 24 | htsf = hts_open(prefix.c_str(), "r"); 25 | tbx = tbx_index_load(prefix.c_str()); 26 | 27 | itr = NULL; 28 | 29 | c_reg = 0; 30 | n_reg = regions.size(); 31 | 32 | if( n_reg == 0 ){ 33 | const char** chroms = tbx_seqnames(tbx, &n_reg); 34 | for( int i = 0; i < n_reg; ++i ){ 35 | regions.push_back(chroms[i]); 36 | } 37 | free(chroms); 38 | } 39 | } 40 | 41 | int indexed_hts_file::next_line(kstring_t& str) 42 | { 43 | if( !itr ){ 44 | itr = tbx_itr_querys(tbx, regions[c_reg].c_str() ); 45 | } 46 | if( tbx_itr_next(htsf, tbx, itr, &str) < 0 ){ 47 | if( c_reg + 1 < n_reg ){ 48 | c_reg++; 49 | itr = tbx_itr_querys(tbx, regions[c_reg].c_str() ); 50 | return tbx_itr_next(htsf, tbx, itr, &str); 51 | }else{ 52 | return -1; 53 | } 54 | } 55 | return 1; 56 | } 57 | 58 | void bcf_seek(bcf_srs_t* sr, const std::string& chr, const int& start, const int& end){ 59 | 60 | std::string region = chr; 61 | 62 | if( start > 0 ){ 63 | region += ":" + std::to_string(start); 64 | } 65 | if( end > 0 ){ 66 | region += "-" + std::to_string(end); 67 | } 68 | 69 | bcf_sr_regions_t *reg = bcf_sr_regions_init(region.c_str(),0,0,1,-2); 70 | 71 | bcf_sr_seek(sr, reg->seq_names[0], reg->start); 72 | 73 | return; 74 | } 75 | 76 | void bgzip_file(std::string file_name, int index){ 77 | 78 | // 
index == 0 => Do not build any index. 79 | // index == 1 => Build gzi byte index. 80 | // index == 2 => Build csi position-based index. 81 | 82 | BGZF *fp; 83 | void *buffer; 84 | 85 | int f_src = open(file_name.c_str(), O_RDONLY); 86 | 87 | std::string file_name_gz = file_name + ".gz"; 88 | fp = bgzf_open(file_name_gz.c_str(), "w\0"); 89 | 90 | if( index == 1 ) bgzf_index_build_init(fp); 91 | 92 | buffer = malloc(BUFFER_SIZE); 93 | 94 | int c; 95 | 96 | while ((c = read(f_src, buffer, BUFFER_SIZE)) > 0){ 97 | if (bgzf_write(fp, buffer, c) < 0){ 98 | std::cerr << "Fatal error: Couldn't write to " << file_name << ".gz\n"; 99 | exit(1); 100 | } 101 | } 102 | 103 | if( index == 1 ){ 104 | if (bgzf_index_dump(fp, file_name.c_str(), ".gz.gzi") < 0){ 105 | std::cerr << "Fatal error: Couldn't create index " << file_name << ".gz.gzi\n"; 106 | exit(1); 107 | } 108 | } 109 | 110 | if (bgzf_close(fp) < 0){ 111 | std::cerr << "Fatal error: Couldn't close " << file_name << ".gz\n"; 112 | exit(1); 113 | } 114 | 115 | if( index == 2 ){ 116 | if ( tbx_index_build(file_name_gz.c_str(), 14, &tbx_conf_vcf)!=0 ) 117 | { 118 | std::cerr << "Fatal error: Couldn't create index " << file_name << ".gz.csi\n"; 119 | exit(1); 120 | } 121 | } 122 | 123 | unlink(file_name.c_str()); 124 | free(buffer); 125 | close(f_src); 126 | } 127 | 128 | std::vector get_chroms(std::string file_name, std::vector& n_variants){ 129 | 130 | // get a list of chromosomes in the file 131 | // this code is directly adapted from bcftools: 132 | // https://github.com/samtools/bcftools/blob/develop/vcfindex.c 133 | 134 | n_variants.clear(); 135 | std::vector chroms; 136 | 137 | const char **seq; 138 | int nseq, stats; 139 | tbx_t *tbx = NULL; 140 | hts_idx_t *idx = NULL; 141 | 142 | htsFile *fp = hts_open(file_name.c_str(),"r"); 143 | if ( !fp ) { fprintf(stderr,"Could not read %s\n", file_name.c_str()); abort(); } 144 | bcf_hdr_t *hdr = NULL; 145 | 146 | if ( hts_get_format(fp)->format==bcf ) 147 | { 148 | hdr = 
bcf_hdr_read(fp); 149 | if ( !hdr ) { fprintf(stderr,"Could not read the header: %s\n", file_name.c_str()); abort(); } 150 | idx = bcf_index_load(file_name.c_str()); 151 | if ( !idx ) { fprintf(stderr,"Could not load index for %s\n", file_name.c_str()); abort(); } 152 | }else{ 153 | if( hts_get_format(fp)->format==vcf ){ 154 | hdr = bcf_hdr_read(fp); 155 | if ( !hdr ) { fprintf(stderr,"Could not read the header: %s\n", file_name.c_str()); abort(); } 156 | } 157 | tbx = tbx_index_load(file_name.c_str()); 158 | if ( !tbx ) { fprintf(stderr,"Could not load index for %s\n", file_name.c_str()); abort(); } 159 | } 160 | 161 | seq = tbx ? tbx_seqnames(tbx, &nseq) : bcf_index_seqnames(idx, hdr, &nseq); 162 | for (int i=0; i<nseq; i++) 163 | { 164 | uint64_t records, v; 165 | hts_idx_get_stat(tbx ? tbx->idx : idx, i, &records, &v); 166 | if (stats&2 || !records) continue; 167 | chroms.push_back(std::string(seq[i])); 168 | n_variants.push_back( (int) records ); 169 | } 170 | free(seq); 171 | 172 | hts_close(fp); 173 | bcf_hdr_destroy(hdr); 174 | if (tbx) tbx_destroy(tbx); 175 | if (idx) hts_idx_destroy(idx); 176 | 177 | return chroms; 178 | } 179 | -------------------------------------------------------------------------------- /src/htsWrappers.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | 17 | /* 18 | Generic wrappers for HTSLIB BCF/VCF file parsing, as well as for BGZF reading/writing. 
19 | */ 20 | 21 | #ifndef HTSWRAPPERS_HPP 22 | #define HTSWRAPPERS_HPP 23 | 24 | #include 25 | #include 26 | 27 | #include 28 | #include 29 | #include 30 | #include 31 | 32 | #include "htslib/tbx.h" 33 | #include "htslib/hts.h" 34 | #include "htslib/bgzf.h" 35 | #include "htslib/kseq.h" 36 | #include "htslib/kstring.h" 37 | #include "htslib/synced_bcf_reader.h" 38 | 39 | #include 40 | 41 | #include "setOptions.hpp" 42 | #include "miscUtils.hpp" 43 | 44 | 45 | static const int BUFFER_SIZE = 64 * 1024; 46 | 47 | static int MINUS99 = -99; 48 | 49 | static const std::string null_string = ""; 50 | 51 | static Eigen::MatrixXd NULL_MATRIX = Eigen::MatrixXd::Zero(0,0); 52 | 53 | static std::vector null_vector_int = std::vector(0); 54 | 55 | class basic_hts_file 56 | { 57 | public: 58 | void open(const std::string& prefix){htsf = hts_open(prefix.c_str(), "r");}; 59 | int next_line(kstring_t& str){ return hts_getline(htsf, KS_SEP_LINE, &str); }; 60 | 61 | basic_hts_file(){}; 62 | basic_hts_file(const std::string& prefix){ open(prefix); }; 63 | 64 | void close(){ hts_close(htsf);}; 65 | 66 | private: 67 | htsFile *htsf; 68 | }; 69 | 70 | class indexed_hts_file 71 | { 72 | public: 73 | 74 | void open(const std::string& prefix, const std::vector& reg); 75 | void open(const std::string& prefix, const std::string& reg = null_string){ 76 | std::vector tmp_reg; 77 | if( reg != "" ) tmp_reg.push_back(reg); 78 | open(prefix, tmp_reg); 79 | }; 80 | 81 | int next_line(kstring_t& str); 82 | 83 | indexed_hts_file(){}; 84 | indexed_hts_file(const std::string& prefix, const std::string& reg = null_string){ 85 | open(prefix, reg); 86 | }; 87 | indexed_hts_file(const std::string& prefix, const std::vector& reg){ 88 | open(prefix, reg); 89 | }; 90 | 91 | void close(){ 92 | if (itr) hts_itr_destroy(itr); 93 | hts_close(htsf); 94 | }; 95 | 96 | private: 97 | int n_reg; 98 | int c_reg; 99 | std::vector regions; 100 | 101 | htsFile *htsf; 102 | tbx_t *tbx; 103 | hts_itr_t *itr; 104 | }; 105 | 106 
| void bcf_seek(bcf_srs_t*, const std::string&, const int& start = MINUS99, const int& end = MINUS99); 107 | 108 | void bgzip_file(std::string, int); 109 | 110 | inline void write_to_bgzf(const std::string& inp, BGZF *fp){ 111 | if( bgzf_write(fp, (const void*) inp.c_str(), inp.length()) < 0 ){ 112 | std::cerr << "Error: Could not write to output file\n"; 113 | exit(1); 114 | } 115 | }; 116 | 117 | inline void write_to_bgzf(const char* inp, BGZF *fp){ 118 | if( bgzf_write(fp, (const void*) inp, strlen(inp)) < 0 ){ 119 | std::cerr << "Error: Could not write to output file\n"; 120 | exit(1); 121 | } 122 | }; 123 | 124 | inline void build_tabix_index(std::string file_name, int type = 0){ 125 | if(tbx_index_build(file_name.c_str(), 14, type==0? &tbx_conf_vcf : &tbx_conf_bed )!=0) 126 | { 127 | std::cerr << "Fatal error: Couldn't create index for " << file_name << "\n"; 128 | exit(1); 129 | } 130 | }; 131 | 132 | inline void operator<< (BGZF& fp, const char*& inp) 133 | { 134 | BGZF* fp_pt = &fp; 135 | if( bgzf_write(fp_pt, (const void*) inp, strlen(inp)) < 0 ){ 136 | std::cerr << "Error: Could not write to output file\n"; 137 | exit(1); 138 | } 139 | return; 140 | } 141 | 142 | inline void operator<< (BGZF& fp, const std::string& inp) 143 | { 144 | BGZF* fp_pt = &fp; 145 | if( bgzf_write(fp_pt, (const void*) inp.c_str(), inp.length()) < 0 ){ 146 | std::cerr << "Error: Could not write to output file\n"; 147 | exit(1); 148 | } 149 | return; 150 | } 151 | 152 | 153 | std::vector get_chroms(std::string file_name, std::vector& n_variants = null_vector_int); 154 | 155 | #endif 156 | 157 | -------------------------------------------------------------------------------- /src/mapID.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 
6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | 17 | /* 18 | mapID source files and the 'id_map' class are for subsetting, merging, 19 | and mapping row and column IDs in across multiple files (expression, 20 | covariate, and genotype). These classes help reduce redundancy for 21 | common tasks across file types. 22 | */ 23 | 24 | #include "mapID.hpp" 25 | 26 | 27 | std::vector intersect_ids(std::vector a, std::vector b) 28 | { 29 | std::vector out; 30 | sort(a.begin(), a.end()); 31 | sort(b.begin(), b.end()); 32 | set_intersection(a.begin(), a.end(), b.begin(), b.end(), back_inserter(out)); 33 | return out; 34 | } 35 | 36 | int id_map::n() 37 | { 38 | return keep.size(); 39 | } 40 | 41 | void id_map::setFileIDs(std::vector& ids) 42 | { 43 | file = ids; 44 | 45 | if( keep.size() > 0 ){ 46 | makeIndex(); 47 | } 48 | } 49 | 50 | bool id_map::tryKeep(std::string& id) 51 | { 52 | bool should_keep = true; 53 | if( keep.size() > 0 ){ 54 | if( keep_set.size() == 0 ){ 55 | copy(keep.begin(),keep.end(),inserter(keep_set,keep_set.end())); 56 | } 57 | should_keep = keep_set.find(id) != keep_set.end(); 58 | } 59 | if( should_keep ){ 60 | file.push_back(id); 61 | } 62 | return should_keep; 63 | } 64 | 65 | void id_map::makeIndex() 66 | { 67 | idx.clear(); 68 | idx_f2k.clear(); 69 | 70 | std::unordered_map file_id_map; 71 | std::unordered_map keep_id_map; 72 | 73 | int ii = 0; 74 | for( std::string& id : file ) 75 | { 76 | file_id_map[id] = ii; 77 | ii++; 78 | } 79 | ii = 0; 80 | std::vector not_in_file; 81 | for( std::string& id : keep ) 82 | { 83 | if( file_id_map.find(id) == file_id_map.end() ){ 84 | not_in_file.push_back(ii); 85 | }else{ 86 | 
idx.push_back(file_id_map[id]); 87 | } 88 | ii++; 89 | } 90 | for(auto j = not_in_file.rbegin(); j != not_in_file.rend(); ++j) 91 | { 92 | keep.erase(keep.begin() + *j); 93 | } 94 | 95 | 96 | ii = 0; 97 | for( std::string& id : keep ) 98 | { 99 | keep_id_map[id] = ii; 100 | ii++; 101 | } 102 | for( std::string& id : file ) 103 | { 104 | if( keep_id_map.find(id) == keep_id_map.end() ){ 105 | idx_f2k.push_back(-1); 106 | }else{ 107 | idx_f2k.push_back(keep_id_map[id]); 108 | } 109 | } 110 | } 111 | 112 | void id_map::setKeepIDs(std::vector& ids) 113 | { 114 | keep = ids; 115 | 116 | // map the ids we want to keep to the ids in the file 117 | if( file.size() > 0 ) 118 | { 119 | makeIndex(); 120 | }else 121 | { 122 | // if we set 'keep' without setting file ids, we will make 123 | // a hash to check whether each file id should be kept. 124 | copy(keep.begin(),keep.end(),inserter(keep_set,keep_set.end())); 125 | } 126 | } 127 | 128 | -------------------------------------------------------------------------------- /src/mapID.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | /* 17 | mapID source files and the 'id_map' class are for subsetting, merging, 18 | and mapping row and column IDs in across multiple files (expression, 19 | covariate, and genotype). These classes help reduce redundancy for 20 | common tasks across file types. 
21 | */ 22 | 23 | #ifndef MAPID_HPP 24 | #define MAPID_HPP 25 | 26 | #include 27 | #include 28 | #include 29 | 30 | #include "miscUtils.hpp" 31 | #include "setOptions.hpp" 32 | 33 | 34 | std::vector intersect_ids(std::vector, std::vector); 35 | 36 | class id_map 37 | { 38 | public: 39 | std::unordered_set keep_set; 40 | 41 | std::vector file; 42 | std::vector keep; 43 | std::vector idx; 44 | std::vector idx_f2k; 45 | 46 | void makeIndex(); 47 | bool tryKeep(std::string&); 48 | void setFileIDs(std::vector&); 49 | void setKeepIDs(std::vector&); 50 | 51 | int n(); 52 | }; 53 | 54 | #endif 55 | 56 | 57 | 58 | 59 | -------------------------------------------------------------------------------- /src/mathStats.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | 17 | #include "mathStats.hpp" 18 | 19 | 20 | double p_bd = 1e-300; 21 | double q_bd = 3e+299; 22 | 23 | double qnorm(double p, bool lower){ 24 | boost::math::normal N01(0.0, 1.0); 25 | if( lower ) return boost::math::quantile(boost::math::complement(N01, p)); 26 | return boost::math::quantile(N01, p); 27 | } 28 | 29 | double pnorm(double x, bool lower){ 30 | boost::math::normal N01(0.0, 1.0); 31 | if( lower ) return boost::math::cdf(boost::math::complement(N01, x)); 32 | return boost::math::cdf(N01, x); 33 | } 34 | 35 | double qcauchy(double p, bool lower){ 36 | p = p > p_bd ? p : p_bd; 37 | p = p < 1 - p_bd ? 
p : 1 - p_bd; 38 | 39 | boost::math::cauchy C01(0.0, 1.0); 40 | if( lower ) return boost::math::quantile(boost::math::complement(C01, p)); 41 | return boost::math::quantile(C01, p); 42 | } 43 | 44 | double pcauchy(double x, bool lower){ 45 | x = x < q_bd ? x : q_bd; 46 | x = x > -q_bd ? x : -q_bd; 47 | 48 | boost::math::cauchy C01(0.0, 1.0); 49 | if( lower ) return boost::math::cdf(boost::math::complement(C01, x)); 50 | return boost::math::cdf(C01, x); 51 | } 52 | 53 | double qt(double p, double df, bool lower){ 54 | boost::math::students_t TDIST(df); 55 | if( lower ) return boost::math::quantile(boost::math::complement(TDIST, p)); 56 | return boost::math::quantile(TDIST, p); 57 | } 58 | 59 | double pt(double x, double df, bool lower){ 60 | boost::math::students_t TDIST(df); 61 | if( lower ) return boost::math::cdf(boost::math::complement(TDIST, x)); 62 | return boost::math::cdf(TDIST, x); 63 | } 64 | 65 | double qf(double p, double df1, double df2, bool lower){ 66 | boost::math::fisher_f FDIST(df1, df2); 67 | if( lower ) return boost::math::quantile(boost::math::complement(FDIST, p)); 68 | return boost::math::quantile(FDIST, p); 69 | } 70 | 71 | double pf(double x, double df1, double df2, bool lower){ 72 | boost::math::fisher_f FDIST(df1, df2); 73 | if( lower ) return boost::math::cdf(boost::math::complement(FDIST, x)); 74 | return boost::math::cdf(FDIST, x); 75 | } 76 | 77 | double qchisq(double p, double df, bool lower){ 78 | boost::math::chi_squared CHISQ(df); 79 | if( lower ) return boost::math::quantile(boost::math::complement(CHISQ, p)); 80 | return boost::math::quantile(CHISQ, p); 81 | } 82 | 83 | double pchisq(double x, double df, bool lower){ 84 | boost::math::chi_squared CHISQ(df); 85 | if( lower ) return boost::math::cdf(boost::math::complement(CHISQ, x)); 86 | return boost::math::cdf(CHISQ, x); 87 | } 88 | 89 | double ACAT(const std::vector& pvals){ 90 | long double sum_c = 0.0; 91 | double n = pvals.size(); 92 | for( const double& p: pvals ){ 93 | if( p >=
1 ){ 94 | sum_c += (qcauchy(1 - 1/n, true)/n); 95 | }else if( p <= 0 ){ 96 | std::cerr << "ACAT failed; input pval <= 0. \n"; 97 | exit(1); 98 | }else{ 99 | sum_c += (qcauchy(p, true)/n); 100 | } 101 | } 102 | return pcauchy(sum_c, true); 103 | } 104 | 105 | double ACAT(const std::vector& pvals,const std::vector& weights){ 106 | long double sum_c = 0.0; 107 | long double denom = 0.0; 108 | double n = pvals.size(); 109 | int i = 0; 110 | for( const double& w: weights ){ 111 | denom += w; 112 | } 113 | if( denom <= 0){ 114 | return -99; 115 | } 116 | for( const double& p: pvals ){ 117 | if( p >= 1 ){ 118 | sum_c += (weights[i] * qcauchy(1 - 1/n, true) / denom); 119 | }else if( p <= 0 ){ 120 | std::cerr << "ACAT failed; input pval <= 0. \n"; 121 | exit(1); 122 | }else{ 123 | sum_c += (weights[i] * qcauchy(p, true) / denom); 124 | } 125 | i++; 126 | } 127 | return pcauchy(sum_c, true); 128 | } 129 | 130 | std::vector filter_lt( const std::vector& p, double thresh){ 131 | std::vector out; 132 | for( const double& x : p ){ 133 | if( x > thresh ) out.push_back(x); 134 | } 135 | return out; 136 | } 137 | 138 | double ACAT_non_missing( const std::vector& pvals, const std::vector& dist ){ 139 | if( dist.size() == 0 || global_opts::exp_weight_val <= 0 ){ 140 | long double sum_c = 0.0; 141 | double n = 0; 142 | double n_p1 = 0; 143 | double max_p = 0; 144 | for( const double& p: pvals ){ 145 | if( !std::isnan(p) ){ 146 | if( p >= 1 ){ 147 | n_p1 += 1; 148 | n += 1; 149 | }else if( p > 0 ){ 150 | sum_c += qcauchy(p, true); 151 | n += 1; 152 | if( p > max_p ){ 153 | max_p = p; 154 | } 155 | } 156 | } 157 | } 158 | if( n_p1 > 0 ){ 159 | max_p = 0.5 + 0.5 * max_p; 160 | sum_c += n_p1 * qcauchy(max_p, true); 161 | }else if(n < 1){ 162 | n = 1; 163 | } 164 | return pcauchy(sum_c/n, true); 165 | }else{ 166 | long double sum_c = 0.0; 167 | double denom = 0; 168 | double n_p1 = 0; 169 | double w_p1 = 0; 170 | double max_p = 0; 171 | for( int i = 0; i < pvals.size(); i++ ){ 172
| const double& p = pvals[i]; 173 | double ww = std::exp(-std::abs(global_opts::exp_weight_val*dist[i])); 174 | if( !std::isnan(p) ){ 175 | if( p >= 1 ){ 176 | w_p1 += ww; denom += ww; // accumulate the weight of p >= 1 entries, mirroring n_p1 in the unweighted branch 177 | }else if( p > 0 ){ 178 | sum_c += ww*qcauchy(p, true); 179 | denom += ww; 180 | if( max_p < p ){ 181 | max_p = p; 182 | } 183 | } 184 | } 185 | } 186 | if( w_p1 > 0 ){ 187 | max_p = 0.5 + 0.5 * max_p; 188 | sum_c += w_p1 * qcauchy(max_p, true); 189 | }else if( denom <= 0 ){ 190 | denom = 1; 191 | } 192 | return pcauchy(sum_c/denom, true); 193 | } 194 | } 195 | 196 | 197 | std::vector rank_vector(const std::vector& v) 198 | { 199 | // This code is adapted from stackoverflow: 200 | // https://stackoverflow.com/questions/30822729/create-ranking-for-vector-of-double 201 | 202 | std::vector w(v.size()); 203 | iota(begin(w), end(w), 0); 204 | sort(begin(w), end(w), 205 | [&v](size_t i, size_t j) { return v[i] < v[j]; }); 206 | 207 | std::vector r(w.size()); 208 | for (size_t n, i = 0; i < w.size(); i += n) 209 | { 210 | n = 1; 211 | while (i + n < w.size() && v[w[i]] == v[w[i+n]]) ++n; 212 | for (size_t k = 0; k < n; ++k) 213 | { 214 | r[w[i+k]] = i + (n + 1) / 2.0; // average rank of n tied values 215 | // r[w[i+k]] = i + 1; // min 216 | // r[w[i+k]] = i + n; // max 217 | // r[w[i+k]] = i + k + 1; // random order 218 | } 219 | } 220 | return r; 221 | } 222 | -------------------------------------------------------------------------------- /src/mathStats.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX.
14 | */ 15 | 16 | 17 | #ifndef MATHSTATS_HPP 18 | #define MATHSTATS_HPP 19 | 20 | #include 21 | #include 22 | 23 | #include "setOptions.hpp" 24 | 25 | #include 26 | #include 27 | #include 28 | #include 29 | #include 30 | 31 | #include 32 | #include 33 | #include 34 | #include 35 | 36 | #include 37 | 38 | // ------------------------------------ 39 | // Eigen matrix printing formats 40 | // ------------------------------------ 41 | const static Eigen::IOFormat EigenCSV(Eigen::StreamPrecision, Eigen::DontAlignCols, ",", "\n"); 42 | const static Eigen::IOFormat EigenTSV(Eigen::StreamPrecision, Eigen::DontAlignCols, "\t", "\n"); 43 | 44 | // ------------------------------------ 45 | // R-like pdf and cdf (TODO: switch to libRmath) 46 | // ------------------------------------ 47 | double qnorm(double, bool lower = false); 48 | double pnorm(double, bool lower = false); 49 | double qt(double, double, bool lower = false); 50 | double pt(double, double, bool lower = false); 51 | double qf(double, double, double, bool lower = false); 52 | double pf(double, double, double, bool lower = false); 53 | double qchisq(double, double, bool lower = false); 54 | double pchisq(double, double, bool lower = false); 55 | double qcauchy(double, bool lower = false); 56 | double pcauchy(double, bool lower = false); 57 | 58 | double ACAT(const std::vector&); 59 | double ACAT(const std::vector&,const std::vector&); 60 | 61 | static const std::vector v0(0); 62 | 63 | std::vector filter_lt( const std::vector&, double); 64 | double ACAT_non_missing( const std::vector& pvals, const std::vector& dist = v0); 65 | 66 | std::vector rank_vector(const std::vector&); 67 | 68 | #endif 69 | 70 | -------------------------------------------------------------------------------- /src/metaAnalysis.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Authors: Corbin Quick 4 | Li Guan 5 | 6 | This file is a part of APEX. 
7 | 8 | APEX is distributed "AS IS" in the hope that it will be 9 | useful, but WITHOUT ANY WARRANTY; without even the implied 10 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 11 | FOR A PARTICULAR PURPOSE. 12 | 13 | The above copyright notice and disclaimer of warranty must 14 | be included in all copies or substantial portions of APEX. 15 | */ 16 | 17 | 18 | #ifndef METAANALYSIS_HPP 19 | #define METAANALYSIS_HPP 20 | 21 | #include 22 | #include 23 | 24 | #include "setOptions.hpp" 25 | #include "processVCOV.hpp" 26 | #include "miscUtils.hpp" 27 | #include "fitModels.hpp" 28 | #include "dataParser.hpp" 29 | 30 | #include 31 | #include 32 | 33 | 34 | class cis_sumstat_data 35 | { 36 | public: 37 | std::string file_prefix; 38 | std::string region; 39 | 40 | lindex ln; 41 | 42 | std::vector chr; 43 | std::vector start; 44 | std::vector end; 45 | std::vector gene_id; 46 | 47 | std::vector egene_pval; 48 | 49 | std::vector NS; 50 | std::vector NC; 51 | std::vector SD; 52 | 53 | std::vector S_CIS; 54 | std::vector N_CIS; 55 | 56 | std::vector score; 57 | 58 | int n_genes; 59 | 60 | cis_sumstat_data(const std::string&,const std::string& reg = ""); 61 | void open(const std::string&,const std::string& reg = ""); 62 | }; 63 | 64 | class vcov_meta_data 65 | { 66 | public: 67 | std::vector vc; 68 | 69 | std::vector chr; 70 | std::vector pos; 71 | std::vector ref; 72 | std::vector alt; 73 | 74 | std::vector mac; 75 | std::vector var; 76 | 77 | std::vector> var_perStudy; 78 | 79 | // indexes for shared variants within each study 80 | // si[i][j] = the index of the jth shared snp 81 | // within the list of snps for study i 82 | std::vector> si; 83 | 84 | // one vector per study, indicating whether flipped relative to study #1 85 | std::vector> sflipped; 86 | 87 | vcov_meta_data() {}; 88 | vcov_meta_data(const std::vector& v_fn, const std::string& reg = ""){ 89 | open(v_fn, reg); 90 | }; 91 | 92 | void open(const std::vector& v_fn, const std::string& reg = ""); 93 | 
94 | Eigen::MatrixXd getGtG(const std::vector&, const std::vector& w = std::vector(0), const std::vector& sl = std::vector(0) ); 95 | Eigen::MatrixXd getV(const std::vector&, const std::vector& w = std::vector(0), const std::vector& sl = std::vector(0) ); 96 | 97 | Eigen::MatrixXd getGtG(const std::vector&,const std::vector&, const std::vector& w = std::vector(0), const std::vector& sl = std::vector(0) ); 98 | Eigen::MatrixXd getV(const std::vector&,const std::vector&, const std::vector& w = std::vector(0), const std::vector& sl = std::vector(0) ); 99 | 100 | Eigen::MatrixXd getGtG_perStudy(const int&, const std::vector&); 101 | Eigen::MatrixXd getV_perStudy(const int&, const std::vector&); 102 | 103 | Eigen::MatrixXd getGtG_perStudy(const int&, const std::vector&,const std::vector&); 104 | Eigen::MatrixXd getV_perStudy(const int&, const std::vector&,const std::vector&); 105 | 106 | private: 107 | void build_index(); 108 | }; 109 | 110 | 111 | class cis_meta_data 112 | { 113 | public: 114 | std::vector ss; 115 | 116 | vcov_meta_data vc; 117 | 118 | std::vector chr; 119 | std::vector start; 120 | std::vector end; 121 | std::vector gene_id; 122 | 123 | std::vector> study_list; 124 | 125 | std::vector egene_pval; 126 | 127 | std::vector N; 128 | std::vector DF; 129 | std::vector SD; 130 | std::vector ADJ; 131 | 132 | std::vector S_CIS; 133 | std::vector N_CIS; 134 | 135 | std::vector> ivw; 136 | std::vector> SD_perStudy; 137 | std::vector> SSR_perStudy; 138 | std::vector> DF_perStudy; 139 | 140 | std::vector score; 141 | std::vector var_score; 142 | 143 | std::vector> score_perStudy; 144 | std::vector> var_score_perStudy; 145 | 146 | cis_meta_data() {}; 147 | cis_meta_data(const std::vector& v_fn, const std::string& reg = "" ){ 148 | vc.open(v_fn, reg); 149 | open(v_fn, reg); 150 | merge(vc.si, vc.sflipped); 151 | }; 152 | 153 | void merge(const std::vector>&, const std::vector>&); 154 | void meta_analyze(); 155 | 156 | void conditional_analysis(const int&, 
std::ostream&, std::ostream&, std::ostream&); 157 | void conditional_analysis(); 158 | void conditional_analysis_het(const int&, std::ostream&, std::ostream&); 159 | void conditional_analysis_het(); 160 | 161 | void get_vcov_gene(const int& gene_index, const bool& centered = true); 162 | 163 | void add_study(const std::string& fn, const std::string& reg = ""){ 164 | ss.push_back( cis_sumstat_data(fn, reg) ); 165 | }; 166 | void open(const std::vector& v_fn, const std::string& reg = ""){ 167 | for( const std::string& fn : v_fn ){ add_study(fn,reg); } 168 | }; 169 | 170 | private: 171 | // const double& diagV(const int& i){return vc.var[i];}; 172 | // const int& pos(const int& i){ return vc.pos[i];}; 173 | 174 | double& diagV_perStudy(const int& g, const int& i, const int& s){ 175 | return vc.var_perStudy[S_CIS[g] + i][s]; 176 | } 177 | 178 | double diagV(const int& g, const int& i){ 179 | double dv_val = 0; 180 | for( const int& s : study_list[g] ){ 181 | // std::cout << SD_perStudy[g][s] << ", " << vc.var_perStudy[S_CIS[g] + i][s] << "\n"; 182 | dv_val += diagV_perStudy(g,i,s) * ivw[g][s]; 183 | } 184 | // std::cout << dv_val << "\n"; 185 | return dv_val; 186 | } 187 | 188 | Eigen::VectorXd dV_perStudy(const int& g, const int& s){ 189 | Eigen::VectorXd out(N_CIS[g]); 190 | for( int i = 0; i < N_CIS[g]; i++){ 191 | out(i) = diagV_perStudy(g, i, s); 192 | } 193 | return out; 194 | } 195 | 196 | std::vector ivw_weight_H1(const int& g, const int& i){ 197 | std::vector ivw_H1(ss.size()); 198 | for( const int& s : study_list[g] ){ 199 | double sc = score_perStudy[g][s](i); 200 | ivw_H1[s] = (DF_perStudy[g][s]-1)/(SSR_perStudy[g][s] - sc*sc/diagV_perStudy(g,i,s) ); 201 | } 202 | return ivw_H1; 203 | } 204 | 205 | const int& pos(const int& i){ return vc.pos[i];}; 206 | const std::string& ref(const int& i){ return vc.ref[i];}; 207 | const std::string& alt(const int& i){ return vc.alt[i];}; 208 | }; 209 | 210 | 211 | class vcov_getter 212 | { 213 | private: 214 | 
std::vector& ivw; 215 | vcov_meta_data& vc; 216 | int s_var; 217 | int n_var; 218 | 219 | std::vector use_studies; 220 | 221 | std::vector R_CIS; 222 | std::vector add(std::vector inp, const int& s){ 223 | for( int &i : inp ){i += s;} 224 | return inp; 225 | } 226 | public: 227 | vcov_getter(vcov_meta_data& vc_, std::vector& ivw_, int s_var_, int n_var_) : 228 | vc(vc_), ivw(ivw_), s_var(s_var_), n_var(n_var_) 229 | { 230 | for(int i = 0; i < n_var; i++){ 231 | R_CIS.push_back(s_var + i); 232 | } 233 | for(int s = 0; s < vc.si.size(); s++ ){ 234 | use_studies.push_back(s); 235 | } 236 | }; 237 | vcov_getter(vcov_meta_data& vc_, std::vector& ivw_, int s_var_, int n_var_, std::vector use_studies_) : 238 | vc(vc_), ivw(ivw_), s_var(s_var_), n_var(n_var_), use_studies(use_studies_) 239 | { 240 | for(int i = 0; i < n_var; i++){ 241 | R_CIS.push_back(s_var + i); 242 | } 243 | }; 244 | Eigen::VectorXd Covar(const int& i){ return vc.getV(R_CIS, std::vector(1,s_var + i), ivw, use_studies);}; 245 | Eigen::MatrixXd Var(const std::vector& wh){ return vc.getV(add(wh, s_var), ivw, use_studies);}; 246 | Eigen::MatrixXd Var_uncentered(const std::vector& wh){ return vc.getGtG(add(wh, s_var), ivw, use_studies);}; 247 | 248 | Eigen::VectorXd Covar_perStudy(const int& s, const int& i){ return vc.getV_perStudy(use_studies[s], R_CIS, std::vector(1,s_var + i));}; 249 | Eigen::MatrixXd Var_perStudy(const int& s, const std::vector& wh){ return vc.getV_perStudy(use_studies[s], add(wh, s_var) );}; 250 | }; 251 | 252 | 253 | #endif 254 | 255 | -------------------------------------------------------------------------------- /src/miscUtils.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 
6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | 17 | #include "miscUtils.hpp" 18 | 19 | 20 | bool all_lt( const std::vector& ii, const std::vector& nn ){ 21 | for(int i = 0; i < ii.size(); ++i){ 22 | if( ii[i] >= nn[i] ){ 23 | return false; 24 | } 25 | } 26 | return true; 27 | } 28 | 29 | bool any_lt( const std::vector& ii, const std::vector& nn ){ 30 | for(int i = 0; i < ii.size(); ++i){ 31 | if( ii[i] < nn[i] ){ 32 | return true; 33 | } 34 | } 35 | return false; 36 | } 37 | 38 | std::vector which_lt(const std::vector& ii, const std::vector& nn){ 39 | std::vector out; 40 | for(int i = 0; i < ii.size(); ++i){ 41 | if( ii[i] < nn[i] ){ 42 | out.push_back(i); 43 | } 44 | } 45 | return out; 46 | } 47 | 48 | 49 | void remove_gene_version_number( std::vector& ids){ 50 | for( std::string& id : ids ){ 51 | id = id.substr(0, id.find(".")); 52 | } 53 | return; 54 | } 55 | 56 | std::vector seq_int(const int& n){ 57 | std::vector out; 58 | for(int i = 0; i < n; i++){ 59 | out.push_back(i); 60 | } 61 | return out; 62 | } 63 | 64 | std::vector split_string(const std::string& input, const char delim) 65 | { 66 | std::stringstream ss(input); 67 | std::string field; 68 | std::vector out; 69 | 70 | while( getline(ss, field, delim) ){ 71 | out.push_back(field); 72 | } 73 | return out; 74 | } 75 | 76 | std::string clean_chrom(const std::string& x){ 77 | if( x.length() >= 3 ){ 78 | if( x.substr(0,3) == "chr" || x.substr(0,3) == "CHR" ){ 79 | return x.substr(3, x.length() - 3); 80 | } 81 | } 82 | return x; 83 | } 84 | 85 | int i_chrom( const std::string& chrom ) 86 | { 87 | auto f_chrom = global_opts::i_chrom_map.find(chrom); 88 | if( f_chrom == 
global_opts::i_chrom_map.end() ){ 89 | return -1; 90 | }else{ 91 | return f_chrom->second; 92 | } 93 | } 94 | 95 | std::vector sort_chroms( std::vector chroms ){ 96 | sort(chroms.begin(), chroms.end(), 97 | [](const std::string& x, const std::string& y){ return i_chrom(x) < i_chrom(y); } 98 | ); 99 | return chroms; 100 | } 101 | 102 | 103 | bool ambiguous_snv( const std::string& ref, const std::string& alt ) 104 | { 105 | // A/T and C/G SNVs are strand-ambiguous: each allele is the complement 106 | // of the other. Both orientations are tested explicitly (no recursion). 107 | if( ref == "A" && alt == "T" ) return true; 108 | if( ref == "T" && alt == "A" ) return true; 109 | if( ref == "C" && alt == "G" ) return true; 110 | if( ref == "G" && alt == "C" ) return true; 111 | 112 | return false; 113 | } 114 | 115 | std::string flip_nucleotide(const std::string& ref) 116 | { 117 | if( ref == "A" ) return "T"; 118 | if( ref == "T" ) return "A"; 119 | if( ref == "C" ) return "G"; 120 | if( ref == "G" ) return "C"; 121 | return ""; 122 | } 123 | 124 | void print_iter_cerr(int i_last, int i_curr, std::string& suffix){ 125 | std::string s_last = std::to_string(i_last); 126 | std::string s_curr = std::to_string(i_curr); 127 | if( i_last < i_curr ){ 128 | move_back_cerr(s_last.length()); 129 | } 130 | std::cerr << i_curr; 131 | if( s_last.length() != s_curr.length() || i_last > i_curr ){ 132 | std::cerr << suffix; 133 | move_back_cerr(suffix.length()); 134 | } 135 | return; 136 | } 137 | 138 | void thinned_iter_cerr(int& i_last, const int& i_curr, std::string& suffix, const int& print_every){ 139 | if( rand() % print_every == 0 ){ 140 | std::string s_last = std::to_string(i_last); 141 | std::string s_curr = std::to_string(i_curr); 142 | if( i_last < i_curr ){ 143 | move_back_cerr(s_last.length()); 144 | } 145 | std::cerr << i_curr; 146 | if( s_last.length() != s_curr.length() || i_last > i_curr ){ 147 | std::cerr << suffix; 148 | move_back_cerr(suffix.length()); 149 | } 150 | i_last = i_curr; 151 | } 152 | return; 153 | } 154 | 155 | void print_header(const std::vector& cn, std::ostream& os){ 156 | bool is_first = true; 157 |
for( const std::string& s : cn ){ 158 | if( is_first ){ 159 | is_first = false; 160 | }else{ 161 | os << "\t"; 162 | } 163 | os << s; 164 | } 165 | os << "\n"; 166 | return; 167 | } 168 | 169 | void lindex::build(){ 170 | 171 | double m_xy = 0; 172 | double m_x = 0; 173 | double m_xx = 0; 174 | double m_y = 0; 175 | n = 0; 176 | 177 | double sc = 0.000001; 178 | for( const int& x : vals){ 179 | double xs = x * sc; 180 | m_x += xs; 181 | m_xx += xs*xs; 182 | m_y += n; 183 | m_xy += n*xs; 184 | n++; 185 | } 186 | m_y /= n; 187 | m_x /= n; 188 | m_xy /= n; 189 | m_xx /= n; 190 | 191 | b = ( m_xy - m_x*m_y )/( m_xx - m_x*m_x ); 192 | a = m_y - b*m_x; 193 | b = b * sc; 194 | } 195 | 196 | int lindex::index(const int& x, bool left){ 197 | int i = round(a + x*b); 198 | int first = i; 199 | i = i < n ? i : n-1; 200 | i = i > 0 ? i : 0; 201 | 202 | if( left ){ 203 | while( vals[i] >= x && i > 0 ){ 204 | i--; 205 | } 206 | while( vals[i] < x ){ 207 | if( i + 1 < n ){ 208 | if( vals[i + 1] <= x ){ 209 | i++; 210 | }else{ 211 | break; 212 | } 213 | }else{ 214 | break; 215 | } 216 | } 217 | }else{ 218 | while( vals[i] > x && i > 0 ){ 219 | i--; 220 | } 221 | while( vals[i] <= x ){ 222 | if( i + 1 < n ){ 223 | if( vals[i + 1] <= x ){ 224 | i++; 225 | }else{ 226 | break; 227 | } 228 | }else{ 229 | break; 230 | } 231 | } 232 | } 233 | 234 | return i; 235 | } 236 | 237 | -------------------------------------------------------------------------------- /src/miscUtils.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 
11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | 17 | #ifndef MISCUTILS_HPP 18 | #define MISCUTILS_HPP 19 | 20 | #include 21 | #include 22 | #include 23 | #include 24 | #include 25 | 26 | #include "setOptions.hpp" 27 | 28 | 29 | static const int print_every_default = 1000; 30 | static std::vector null_vec = std::vector(0); 31 | 32 | int i_chrom(const std::string&); 33 | bool ambiguous_snv(const std::string&, const std::string&); 34 | std::string flip_nucleotide(const std::string&); 35 | 36 | std::vector sort_chroms(std::vector); 37 | 38 | std::vector seq_int(const int& n); 39 | 40 | template 41 | inline bool has_element(const std::vector& v, const T& x){ 42 | return find(v.begin(), v.end(), x) != v.end(); 43 | }; 44 | 45 | template 46 | inline bool has_element(const std::unordered_set& v, const T& x){ 47 | return v.find(x) != v.end(); 48 | }; 49 | 50 | void remove_gene_version_number( std::vector& ); 51 | 52 | std::string clean_chrom(const std::string&); 53 | std::vector split_string(const std::string&, const char); 54 | 55 | inline void restore_cursor(void){ std::cerr << "\e[?25h"; }; 56 | inline void hide_cursor(void){ std::cerr << "\e[?25l"; }; 57 | inline void clear_line_cerr(void){ std::cerr << "\33[2K\r"; }; 58 | 59 | inline void move_back_cerr(const int n){ std::cerr << std::string(n, '\b'); }; 60 | 61 | bool all_lt( const std::vector& ii, const std::vector& nn ); 62 | bool any_lt( const std::vector& ii, const std::vector& nn ); 63 | std::vector which_lt(const std::vector& ii, const std::vector& nn); 64 | 65 | void print_iter_cerr(int, int, std::string&); 66 | void thinned_iter_cerr(int& i_last, const int& i_curr, std::string& suffix, const int& print_every = print_every_default); 67 | 68 | void print_header(const std::vector& cn, std::ostream& os); 69 | 70 | class lindex 71 | { 72 | public: 73 | void set(std::vector& a){ vals = a; build(); }; 74 | 75 | 
lindex() : vals(null_vec), n(-1), a(-1), b(-1) {}; 76 | lindex(std::vector& a) : vals(a) { build(); }; 77 | 78 | int index(const int&, bool left = true); 79 | private: 80 | int n; 81 | double a; 82 | double b; 83 | std::vector& vals; 84 | void build(); 85 | }; 86 | 87 | /* 88 | class block_lindex 89 | { 90 | public: 91 | //lindex(); 92 | block_lindex(std::vector& chr, std::vector& pos) : blks(chr),vals(pos) { build(); }; 93 | int index(const std::string&, const int&); 94 | private: 95 | std::vector& blks; 96 | std::vector& vals; 97 | void build(); 98 | }; 99 | */ 100 | 101 | 102 | #endif 103 | 104 | -------------------------------------------------------------------------------- /src/processVCOV.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
14 | */ 15 | 16 | 17 | #ifndef PROCESSVCOV_HPP 18 | #define PROCESSVCOV_HPP 19 | 20 | #include 21 | 22 | #include "setOptions.hpp" 23 | #include "readTable.hpp" 24 | #include "genotypeData.hpp" 25 | #include "fitUtils.hpp" 26 | #include "miscUtils.hpp" 27 | #include "dataParser.hpp" 28 | #include "xzFormat.hpp" 29 | 30 | #include 31 | #include 32 | #include 33 | 34 | 35 | typedef uint16_t vbin_t; 36 | static const long double VBIN_T_MAX = (pow(2,8*sizeof(vbin_t)) - 1); 37 | 38 | static const bool xz_mode = true; 39 | static const size_t XZ_BUFFER_SIZE = 2 * 1024 * 1024; 40 | 41 | class vcov_bin_file 42 | { 43 | public: 44 | std::string file_name; 45 | 46 | BGZF *fp; 47 | xzReader fp_xz; 48 | 49 | bool is_xz; 50 | 51 | vcov_bin_file() {}; 52 | vcov_bin_file(std::string); 53 | void open(std::string); 54 | void close(); 55 | std::vector get_data(long,long); 56 | }; 57 | 58 | class vcov_data 59 | { 60 | public: 61 | std::string file_prefix; 62 | std::string region; 63 | 64 | std::vector chr; 65 | std::vector pos; 66 | //std::vector rsid; 67 | std::vector ref; 68 | std::vector alt; 69 | std::vector flipped; 70 | 71 | std::vector mac; 72 | std::vector var; 73 | 74 | std::vector rid; 75 | std::vector bin_start_idx; 76 | std::vector bin_nvals; 77 | 78 | vcov_bin_file bin; 79 | 80 | Eigen::MatrixXd GtU; 81 | 82 | vcov_data() {}; 83 | vcov_data(std::string,std::string); 84 | 85 | void open(std::string,std::string); 86 | void close(); 87 | 88 | Eigen::MatrixXd getGtG(const std::vector& v_i, const std::vector& v_j = std::vector(0)); 89 | Eigen::MatrixXd getV(const std::vector& v_i, const std::vector& v_j = std::vector(0)); 90 | 91 | Eigen::MatrixXd getGtG(int, int); 92 | Eigen::MatrixXd getV(int, int); 93 | 94 | Eigen::MatrixXd getColGtG(int, int, int); 95 | Eigen::MatrixXd getColV(int, int, int); 96 | 97 | Eigen::MatrixXd getColGtG(int, const std::vector& ); 98 | Eigen::MatrixXd getColV(int, const std::vector& ); 99 | 100 | double getPairV(int, int); 101 | 102 | 
Eigen::MatrixXd getV_i(const std::vector&, const int offset = 0); 103 | }; 104 | 105 | class vcov_bin_gz_out 106 | { 107 | public: 108 | vcov_bin_gz_out(std::string fn) : 109 | file_name(fn), 110 | fp_xz(fn.c_str()) 111 | { 112 | file_name = fn; 113 | bytes_c = 0; 114 | 115 | if( xz_mode ){ 116 | buf_size = XZ_BUFFER_SIZE; 117 | std::cerr << "\nWriting data to " << file_name << "\n"; 118 | }else{ 119 | buf_size = BUFFER_SIZE; 120 | std::string file_name_gz = file_name + ".gz"; 121 | fp = bgzf_open(file_name_gz.c_str(), "w\0"); 122 | 123 | bgzf_index_build_init(fp); 124 | } 125 | }; 126 | 127 | void open(std::string); 128 | void add_to_queue(vbin_t); 129 | void close(); 130 | 131 | private: 132 | std::vector queue; 133 | 134 | size_t buf_size; 135 | 136 | BGZF *fp; 137 | shrinkwrap::xz::ostream fp_xz; 138 | 139 | std::string file_name; 140 | 141 | void write_queue(); 142 | 143 | size_t bytes_c; 144 | }; 145 | 146 | 147 | void read_LD_gz_bytes(std::string); 148 | 149 | void write_vcov_files(bcf_srs_t*& sr, bcf_hdr_t*& hdr, genotype_data&, table&); 150 | 151 | #endif 152 | 153 | -------------------------------------------------------------------------------- /src/readBED.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
14 | */ 15 | 16 | 17 | #include "readBED.hpp" 18 | 19 | 20 | void bed_data::readBedHeader(const char *file_name) 21 | { 22 | if( ids.file.size() != 0 ) return; 23 | 24 | htsFile *htsf = hts_open(file_name, "r"); 25 | 26 | kstring_t str = {0,0,0}; 27 | 28 | int line_no = 0; 29 | 30 | is_residual = false; 31 | 32 | while( hts_getline(htsf, KS_SEP_LINE, &str) >= 0 ) 33 | { 34 | line_no++; 35 | if ( str.s[0] == '#' && str.s[1] == '#' ){ 36 | 37 | if( strstr (str.s, "APEX_LMM_RESID") != NULL ){ 38 | std::cerr << "Molecular traits are residualized.\n"; 39 | is_residual = true; 40 | } 41 | 42 | continue; 43 | } 44 | 45 | header_line = line_no; 46 | int n_fields; 47 | int *offsets = ksplit(&str, 0, &n_fields); 48 | for( int i = 4; i < n_fields; i++) 49 | { 50 | ids.file.push_back(std::string(str.s + offsets[i])); 51 | } 52 | break; 53 | } 54 | 55 | ks_free(&str); 56 | hts_close(htsf); 57 | 58 | } 59 | 60 | void bed_data::readBedFile(const char *file_name, const std::vector& regions) 61 | { 62 | if( ids.file.size() == 0 ){ 63 | readBedHeader(file_name); 64 | } 65 | if( ids.keep.size() == 0 ){ 66 | ids.setKeepIDs(ids.file); 67 | } 68 | 69 | std::string file_str = std::string(file_name); 70 | 71 | indexed_hts_file htsf(file_str, regions); 72 | 73 | int line_no = 0; 74 | n_genes = 0; 75 | 76 | data_matrix = Eigen::MatrixXd(ids.idx.size(), 50000); 77 | 78 | data_parser dp; 79 | dp.add_field(chr,0); 80 | dp.add_field(start,1); 81 | dp.add_field(end,2); 82 | dp.add_field(gene_id,3); 83 | dp.add_matrix(data_matrix, true, ids.idx, 4); 84 | 85 | kstring_t str = {0,0,0}; 86 | 87 | while( htsf.next_line(str) >= 0 ) 88 | { 89 | if( !str.l ) break; 90 | 91 | line_no++; 92 | 93 | if ( str.s[0] == '#' ){ 94 | continue; 95 | }else if( regions.size() == 0 && line_no <= header_line){ 96 | continue; 97 | } 98 | 99 | dp.parse_fields(str); 100 | 101 | if( global_opts::trim_gene_ids ) 102 | { 103 | gene_id[n_genes] = gene_id[n_genes].substr(0, gene_id[n_genes].find(".")); 104 | } 105 | 106 | 
n_genes++; 107 | 108 | if(n_genes % 500 == 0 ) { 109 | std::cerr << "Processed expression data for " << n_genes << " genes ... \r"; 110 | } 111 | } 112 | 113 | ks_free(&str); 114 | 115 | htsf.close(); 116 | 117 | data_matrix.conservativeResize(Eigen::NoChange, n_genes); 118 | } 119 | 120 | -------------------------------------------------------------------------------- /src/readBED.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | #ifndef READBED_HPP 17 | #define READBED_HPP 18 | 19 | #include 20 | #include 21 | #include 22 | #include 23 | #include 24 | 25 | #include "setOptions.hpp" 26 | #include "mapID.hpp" 27 | #include "dataParser.hpp" 28 | 29 | #include 30 | 31 | class bed_data 32 | { 33 | public: 34 | int header_line; 35 | 36 | int n_genes; 37 | 38 | id_map ids; 39 | 40 | Eigen::MatrixXd data_matrix; 41 | 42 | bool is_residual; 43 | 44 | std::vector chr; 45 | std::vector start; 46 | std::vector end; 47 | std::vector gene_id; 48 | 49 | std::vector block_s; 50 | std::vector block_e; 51 | 52 | std::vector v_s; 53 | std::vector v_e; 54 | 55 | std::vector stdev; 56 | 57 | std::vector n_var; 58 | std::vector pval; 59 | 60 | void setKeepIDs(std::vector& kp_ids){ids.setKeepIDs(kp_ids);}; 61 | 62 | void readBedFile(const char*, const std::vector&); 63 | void readBedHeader(const char*); 64 | 65 | void write_bed(const std::string& out_file_path, const std::string& bed_header = nullstr){ 66 | 67 | BGZF* out_file = bgzf_open(out_file_path.c_str(), "w"); 68 | std::stringstream 
out_line; 69 | 70 | for(int j = 0; j < gene_id.size(); j++){ 71 | std::stringstream out_line; 72 | 73 | if( j == 0 ){ 74 | if( bed_header != nullstr ){ 75 | out_line << bed_header << "\n"; 76 | } 77 | 78 | out_line << "#chr" << "\t" << "start" << "\t" << "end" << "\t" << "gene"; 79 | for( const std::string& id : ids.keep ){ 80 | out_line << "\t" << id; 81 | } 82 | out_line << "\n"; 83 | } 84 | 85 | out_line << chr[j] << "\t" << start[j] << "\t" << end[j] << "\t" << gene_id[j]; 86 | 87 | for(int i = 0; i < ids.keep.size(); i++){ 88 | out_line << "\t" << data_matrix(i,j); 89 | } 90 | out_line << "\n"; 91 | 92 | write_to_bgzf(out_line.str().c_str(), out_file); 93 | } 94 | 95 | bgzf_close(out_file); 96 | build_tabix_index(out_file_path, 1); 97 | 98 | }; 99 | }; 100 | 101 | #endif 102 | 103 | -------------------------------------------------------------------------------- /src/readTable.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
14 | */ 15 | 16 | #include "readTable.hpp" 17 | 18 | void table::setRows(std::vector& keep){ 19 | rows.setKeepIDs(keep); 20 | return; 21 | } 22 | 23 | void table::setCols(std::vector& keep){ 24 | cols.setKeepIDs(keep); 25 | return; 26 | } 27 | 28 | void table::readHeader(const char *file_name) 29 | { 30 | htsFile *htsf = hts_open(file_name, "r"); 31 | kstring_t str = {0,0,0}; 32 | int line_no = 0; 33 | 34 | while( hts_getline(htsf, KS_SEP_LINE, &str) >= 0 ) 35 | { 36 | if( !str.l ) break; 37 | line_no++; 38 | if ( str.s[0] == '#' && str.s[1] == '#' ) continue; 39 | 40 | int n_fields; 41 | header_line = line_no; 42 | int *offsets = ksplit(&str, 0, &n_fields); 43 | 44 | // Note: The first column is assumed to be "ID" 45 | 46 | for( int i = 1; i < n_fields; i++) 47 | { 48 | cols.file.push_back(std::string(str.s + offsets[i])); 49 | } 50 | break; 51 | } 52 | ks_free(&str); 53 | hts_close(htsf); 54 | } 55 | 56 | void table::readFile(const char *file_name) 57 | { 58 | if( cols.file.size() == 0 ){ 59 | readHeader(file_name); 60 | } 61 | if( cols.keep.size() == 0 ){ 62 | std::vector all_cols = cols.file; 63 | all_cols.erase (all_cols.begin()); 64 | cols.setKeepIDs(all_cols); 65 | }else if( cols.idx.size() == 0 ){ 66 | cols.makeIndex(); 67 | } 68 | 69 | n_cols = cols.keep.size(); 70 | n_rows = 0; 71 | 72 | data_matrix = Eigen::MatrixXd(n_cols, 50000); 73 | 74 | data_parser dp; 75 | dp.add_matrix(data_matrix, true, cols.idx, 1); 76 | 77 | htsFile *htsf = hts_open(file_name, "r"); 78 | 79 | kstring_t str = {0,0,0}; 80 | 81 | int line_no = 0; 82 | 83 | while( hts_getline(htsf, KS_SEP_LINE, &str) >= 0 ) 84 | { 85 | if( !str.l ) break; 86 | 87 | line_no++; 88 | if ( str.s[0] == '#' || line_no <= header_line ){ 89 | continue; 90 | } 91 | 92 | int n_fields; 93 | int* offsets = ksplit(&str, 0, &n_fields); 94 | 95 | std::string id = std::string(str.s + offsets[0]); 96 | 97 | if ( rows.tryKeep(id) ){ 98 | dp.parse_fields(str, offsets, n_fields); 99 | n_rows++; 100 | } 101 | } 102 | 103 | 
rows.keep = rows.file; 104 | 105 | data_matrix.conservativeResize(Eigen::NoChange, n_rows); 106 | 107 | ks_free(&str); 108 | hts_close(htsf); 109 | 110 | // rows.makeIndex(); 111 | } 112 | 113 | -------------------------------------------------------------------------------- /src/readTable.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | #ifndef READTABLE_HPP 17 | #define READTABLE_HPP 18 | 19 | #include 20 | 21 | #include "setOptions.hpp" 22 | #include "dataParser.hpp" 23 | #include "mapID.hpp" 24 | 25 | #include 26 | 27 | class table 28 | { 29 | public: 30 | int header_line; 31 | 32 | int id_column; 33 | std::string id_column_name; 34 | 35 | int n_rows; 36 | int n_cols; 37 | 38 | id_map rows; 39 | id_map cols; 40 | 41 | Eigen::MatrixXd data_matrix; 42 | 43 | void setRows(std::vector&); 44 | void setCols(std::vector&); 45 | void readFile(const char*); 46 | void readHeader(const char*); 47 | 48 | void write_table(const std::string& out_file_path, const std::string& file_header = nullstr){ 49 | 50 | BGZF* out_file = bgzf_open(out_file_path.c_str(), "w"); 51 | std::stringstream out_line; 52 | 53 | 54 | int dj = 0; 55 | if( data_matrix.cols() > rows.keep.size() ){ 56 | // has intercept column. 
57 | dj = 1; 58 | } 59 | 60 | 61 | for(int j = 0; j < rows.keep.size(); j++){ 62 | std::stringstream out_line; 63 | 64 | if( j == 0 ){ 65 | if( file_header != nullstr ){ 66 | out_line << file_header << "\n"; 67 | } 68 | 69 | out_line << "#id"; 70 | for( const std::string& id : cols.keep ){ 71 | out_line << "\t" << id; 72 | } 73 | out_line << "\n"; 74 | } 75 | 76 | out_line << rows.keep[j]; 77 | 78 | for(int i = 0; i < cols.keep.size(); i++){ 79 | out_line << "\t" << data_matrix(i, j + dj); 80 | } 81 | out_line << "\n"; 82 | 83 | write_to_bgzf(out_line.str().c_str(), out_file); 84 | } 85 | 86 | bgzf_close(out_file); 87 | }; 88 | }; 89 | 90 | 91 | #endif 92 | 93 | -------------------------------------------------------------------------------- /src/scanSignals.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | 17 | #include "Main.hpp" 18 | 19 | void scan_signals(bcf_srs_t*& sr, bcf_hdr_t*& hdr,genotype_data& g_data, table& c_data, bed_data& e_data, block_intervals& bm, const bool& rknorm_y, const bool& rknorm_r) 20 | { 21 | 22 | Eigen::MatrixXd &Y = e_data.data_matrix; 23 | Eigen::MatrixXd &X = c_data.data_matrix; 24 | 25 | std::cerr << "Started cis-QTL analysis ...\n"; 26 | 27 | if( rknorm_y ){ 28 | std::cerr << "Rank-normalizing expression traits ... \n"; 29 | rank_normalize(Y); 30 | } 31 | std::cerr << "Scaling expression traits ... 
\n"; 32 | std::vector y_scale; 33 | scale_and_center(Y, y_scale); 34 | 35 | Eigen::MatrixXd S = get_half_hat_matrix(X); 36 | Eigen::VectorXd dV(g_data.var.size()); 37 | 38 | if( !global_opts::low_mem ){ 39 | 40 | std::cerr << "Calculating genotype-covariate covariance...\n"; 41 | 42 | Eigen::MatrixXd StG = S.transpose() * g_data.genotypes; 43 | std::cerr << "Calculating genotype residual variances ..."; 44 | //Eigen::VectorXd SD_vec(StG.cols()); 45 | for( int i = 0; i < StG.cols(); ++i) 46 | { 47 | 48 | dV(i) = g_data.genotypes.col(i).squaredNorm() - StG.col(i).squaredNorm(); 49 | //SD_vec[i] = std::sqrt(g_data.var[i]); 50 | } 51 | std::cerr << "Done.\n"; 52 | } 53 | 54 | std::cerr << "Calculating expression residuals...\n"; 55 | Eigen::MatrixXd Y_res = resid_from_half_hat(Y, S); 56 | 57 | std::cerr << "Scaling expression residuals ...\n"; 58 | scale_and_center(Y_res, e_data.stdev); 59 | 60 | if( rknorm_r ){ 61 | std::cerr << "Rank-normalizing expression residuals ...\n"; 62 | rank_normalize(Y_res); 63 | std::cerr << "Re-residualizing transformed residuals ...\n"; 64 | Eigen::MatrixXd tmp = resid_from_half_hat(Y_res, S); 65 | Y_res = tmp; 66 | scale_and_center(Y_res); 67 | } 68 | 69 | // std::cout << Y_res.format(EigenTSV) << "\n"; 70 | // return 0; 71 | 72 | Y_res.transposeInPlace(); 73 | 74 | double n_samples = X.rows(); 75 | double n_covar = X.cols(); 76 | 77 | // ------------------------------------------ 78 | 79 | std::string stat_file_path = global_opts::out_prefix + "." + "cis_signal_stats" + ".txt.gz"; 80 | std::string data_file_path = global_opts::out_prefix + "." 
+ "cis_signal_data" + ".txt.gz"; 81 | 82 | BGZF* stat_file = bgzf_open(stat_file_path.c_str(), "w"); 83 | BGZF* data_file = bgzf_open(data_file_path.c_str(), "w"); 84 | 85 | // Write header for stats file 86 | std::stringstream os; 87 | 88 | std::string stat_header = "#gene\tnsignal\tsnp\tbeta\tse\tp_joint\tp_acat\tp_marginal\tp_sequential\n"; 89 | 90 | write_to_bgzf(stat_header.c_str(), stat_file); 91 | 92 | // Write header for data file 93 | 94 | os.str(std::string()); 95 | os.clear(); 96 | 97 | // os << "##RN_Y=" << rknorm_y << ";RN_R=" << rknorm_r << ";NS=" << (int)n_samples << ";NC=" << (int) n_covar << "\n"; 98 | 99 | os << "#chr" << "\t" << 100 | "start" << "\t" << 101 | "end" << "\t" << 102 | "gene" << "\t" << 103 | "snp"; 104 | for( const std::string& sample_id : g_data.ids.keep ){ 105 | os << "\t" << sample_id; 106 | } 107 | os << "\n"; 108 | 109 | write_to_bgzf(os.str().c_str(), data_file); 110 | 111 | // ------------------------------------------ 112 | // Forward selection and write output 113 | // ------------------------------------------ 114 | 115 | int bl = 0; 116 | 117 | std::string iter_cerr_suffix = " blocks out of " + std::to_string(bm.size()) + " total"; 118 | std::cerr << "Processed "; 119 | print_iter_cerr(1, 0, iter_cerr_suffix); 120 | 121 | for( int i = 0; i < bm.size(); i++ ){ 122 | 123 | int n_e = bm.bed_e[i] - bm.bed_s[i] + 1; 124 | n_e = n_e < Y_res.rows() ? n_e : Y_res.rows()-1; 125 | 126 | int n_g = bm.bcf_e[i] - bm.bcf_s[i] + 1; 127 | n_g = n_g < g_data.n_variants ? 
n_g : g_data.n_variants-1; 128 | 129 | if( n_g > 0 && n_e > 0 && bm.bed_s[i] < Y_res.rows() && bm.bcf_s[i] < g_data.n_variants ){ 130 | 131 | Eigen::MatrixXd out_mat; 132 | 133 | if( global_opts::low_mem ){ 134 | 135 | g_data.read_genotypes(sr, hdr, bm.bcf_s[i], n_g ); 136 | 137 | Eigen::SparseMatrix& G = g_data.genotypes; 138 | 139 | Eigen::VectorXd StG_block_sqnm = (S.transpose() * G).colwise().squaredNorm().eval(); // G.middleCols(bm.bcf_s[i], n_g); 140 | 141 | for( int si = bm.bcf_s[i], ii = 0; si < bm.bcf_s[i] + n_g; ++si, ++ii) 142 | { 143 | dV(si) = G.col(ii).squaredNorm() - StG_block_sqnm(ii); 144 | //cout << g_data.var[i] << "\n"; 145 | } 146 | } 147 | 148 | int G_start = bm.bcf_s[i]; 149 | if( global_opts::low_mem ){ 150 | G_start = 0; 151 | } 152 | 153 | // std::cerr << "Get G block\n"; 154 | 155 | const Eigen::SparseMatrix& G = g_data.genotypes.middleCols(G_start, n_g); 156 | // std::cerr << g_data.genotypes.rows() << "\t" << g_data.genotypes.cols() << "\n"; 157 | 158 | // std::cerr << "Get outmat\n"; 159 | 160 | out_mat = (Y_res.middleRows(bm.bed_s[i], n_e)* G).transpose().eval(); 161 | 162 | Eigen::MatrixXd StG = S.transpose() * G; 163 | 164 | for( int jj = bm.bed_s[i], jm = 0; jj < bm.bed_s[i] + n_e; jj++, jm++){ 165 | 166 | // std::cerr << "Start U V\n"; 167 | 168 | Eigen::VectorXd U = out_mat.col(jm); 169 | Eigen::VectorXd V = dV.segment(bm.bcf_s[i], n_g); 170 | 171 | // --------------------------------- 172 | // Messy work-around: 173 | // Set "U" and "V" elements to NaN for SNPs 174 | // outside the cis-gene window. 175 | // This can occur because there is a little 176 | // excess overlap from the SNP-gene "blocks". 177 | // Therefore they must be copied here ... 
178 | // ---------------------------------- 179 | 180 | 181 | const int& start_jj = e_data.start[jj]; 182 | const int& end_jj = e_data.end[jj]; 183 | for( int snp_i =0; snp_i < U.size(); snp_i++ ){ 184 | int pos_i = g_data.pos[bm.bcf_s[i] + snp_i]; 185 | 186 | // If the SNP is outside the cis window, set U and V to NaN so it is never selected. 187 | if( pos_i < e_data.start[jj] - global_opts::cis_window_bp ){ 188 | U(snp_i) = std::numeric_limits<double>::quiet_NaN(); 189 | V(snp_i) = std::numeric_limits<double>::quiet_NaN(); 190 | }else if( pos_i > e_data.end[jj] + global_opts::cis_window_bp ){ 191 | U(snp_i) = std::numeric_limits<double>::quiet_NaN(); 192 | V(snp_i) = std::numeric_limits<double>::quiet_NaN(); 193 | } 194 | } 195 | 196 | // distance to TSS (approximated here by the gene midpoint) 197 | double tss_pos = 0.5*(start_jj + end_jj); 198 | std::vector<double> dtss(U.size()); 199 | for(int snp_i = 0; snp_i < U.size(); snp_i++){ 200 | dtss[snp_i] = ( tss_pos - (double) g_data.pos[bm.bcf_s[i] + snp_i] ); 201 | } 202 | 203 | // --------------------------------- 204 | 205 | // std::cerr << "out_mat: " << out_mat.rows() << "\t" << out_mat.cols() << "\n"; 206 | 207 | 208 | // std::cerr << U.size() << "\t" << V.size() << "\n"; 209 | 210 | // std::cerr << "Start vget\n"; 211 | 212 | indiv_vcov_getter vget(G, StG); 213 | 214 | // std::cerr << G.rows() << "\t" << G.cols() << "\n"; 215 | // std::cerr << UtG.rows() << "\t" << UtG.cols() << "\n"; 216 | 217 | double y_sc = y_scale[jj] * e_data.stdev[jj]; 218 | 219 | // std::cerr << "Start forward\n"; 220 | 221 | // double SSR = e_data.stdev[jj] * e_data.stdev[jj] * (n_samples - 1); 222 | // double IVW = (n_samples - n_covar)/SSR; 223 | // double adj = e_data.stdev[jj] * IVW; 224 | 225 | double adj = std::sqrt(n_samples - n_covar)/std::sqrt(n_samples - 1); 226 | 227 | forward_lm flm(adj * U, V, n_samples, n_samples - n_covar, e_data.stdev[jj]/adj, vget, global_opts::LM_ALPHA, dtss); 228 | 229 | // std::cerr << "End forward\n"; 230 | 231 | int bcf_s_b = bm.bcf_s[i]; 232 | 233 | auto snp = [&](const int& kk ){ int j 
= bcf_s_b + kk; return clean_chrom(g_data.chr[j]) + "_" + std::to_string(g_data.pos[j]) + "_" + g_data.ref[j] + "_" + g_data.alt[j];}; 234 | 235 | if( flm.beta.size() > 0 ){ 236 | 237 | std::stringstream os; 238 | 239 | for(int bi = 0; bi < flm.beta.size(); bi++) 240 | { 241 | double slope = flm.beta[bi]; 242 | if( g_data.flipped[bcf_s_b + flm.keep[bi]] ){ 243 | slope *= -1.00; 244 | } 245 | // os.precision(4); 246 | os << e_data.gene_id[jj]; 247 | os << "\t" << bi+1 << ":" << flm.beta.size() << "\t" << snp(flm.keep[bi]) << "\t" << 248 | slope << "\t" << flm.se[bi] << "\t" << flm.pval_joint[bi] << "\t" << flm.pval_adj[bi] << "\t"; 249 | // os.precision(2); 250 | os << flm.pval_0[bi] << "\t" << flm.pval_seq[bi] << "\n"; 251 | } 252 | 253 | write_to_bgzf(os.str().c_str(), stat_file); 254 | 255 | os.str(std::string()); 256 | os.clear(); 257 | 258 | for(const int& k : flm.keep ){ 259 | os << clean_chrom(e_data.chr[jj]) << "\t" << 260 | e_data.start[jj] << "\t" << 261 | e_data.end[jj] << "\t" << 262 | e_data.gene_id[jj] << "\t" << 263 | snp(k); 264 | Eigen::VectorXd G_k = G.col(k); 265 | for(int bi = 0; bi < G_k.size(); bi++){ 266 | os << "\t" << G_k(bi); 267 | } 268 | os << "\n"; 269 | } 270 | 271 | 272 | write_to_bgzf(os.str().c_str(), data_file); 273 | } 274 | } 275 | 276 | }else{ 277 | std::cerr << "\nERROR: " << bl << "; " << bm.bed_s[i] << ", " << n_e << "; " << bm.bcf_s[i] << ", " << n_g << "\n"; 278 | abort(); 279 | } 280 | 281 | print_iter_cerr(bl, bl+1, iter_cerr_suffix); 282 | 283 | bl++; 284 | } 285 | std::cerr << "\n"; 286 | 287 | bgzf_close(stat_file); 288 | bgzf_close(data_file); 289 | 290 | return; 291 | } 292 | -------------------------------------------------------------------------------- /src/setOptions.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 
6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | 17 | /* setOptions is for setting global options, which are set at 18 | runtime and used throughout (potentially) all source files*/ 19 | 20 | #include "setOptions.hpp" 21 | 22 | 23 | 24 | namespace global_opts 25 | { 26 | 27 | // GENERAL OPTIONS 28 | std::string out_prefix = ""; 29 | bool low_mem = false; 30 | 31 | // INPUT OPTIONS 32 | 33 | std::string global_region = ""; 34 | 35 | // GENOTYPE OPTIONS 36 | // bool set_from_ 37 | int exclude_missing = 0; 38 | double dosage_thresh = 0.01; 39 | double minimum_maf = 0; 40 | int minimum_mac = 1; 41 | bool use_dosages = false; 42 | 43 | // COVARIATE OPTIONS 44 | bool filter_covariates = false; 45 | // use_covariates; 46 | 47 | // SAMPLE SUBSETTING 48 | bool filter_iids = false; 49 | // include_iids; 50 | // exclude_iids; 51 | 52 | bool trim_gene_ids = false; 53 | 54 | // ANALYSIS OPTIONS 55 | 56 | int n_fa_iter = 3; 57 | double fa_p = 0.001; 58 | double fa_tau = 1.00; 59 | 60 | bool write_resid_mat = false; 61 | 62 | // stepwise options 63 | double exp_weight_val = 5e-6; 64 | int max_signals = 10; 65 | int max_steps = 100; 66 | 67 | // "Sloppy" covariate adjustment 68 | bool sloppy_covar = false; 69 | 70 | double RSQ_PRUNE; 71 | double RSQ_BUDDY = 1.00; 72 | 73 | // TESTING OPTIONS 74 | bool het_use_hom = true; 75 | bool het_use_het = true; 76 | bool het_use_acat = true; 77 | 78 | bool step_marginal = false; 79 | 80 | // VARIANT MERGE OPTIONS 81 | bool biallelic_only = true; 82 | 83 | // LMM OPTIONS 84 | bool use_grm = false; 85 | bool ml_not_reml = false; 86 | bool write_v_anchors = false; 87 | 88 | // ANALYSIS MODE 89 | char meta_weight_method = '0'; 90 
| bool conditional_analysis = false; 91 | bool trans_eqtl_mode = false; 92 | double backward_thresh = 1.00; 93 | 94 | // CIS-QTL OPTIONS 95 | double LM_ALPHA = 0.05; 96 | int cis_window_bp = 1000000; 97 | bool cis_window_gene_body = false; 98 | 99 | // LD OPTIONS 100 | int ld_window_bp = 1000000; 101 | 102 | // VARIANT MATCHING OPTIONS 103 | 104 | double freq_tol = 0.05; 105 | bool try_match_ambiguous_snv = true; 106 | 107 | // GENE SUBSETTING OPTIONS 108 | 109 | bool filter_genes; 110 | std::vector target_genes; 111 | 112 | // GLOBAL VALUES 113 | 114 | std::unordered_map i_chrom_map({ 115 | {"1",1},{"2",2},{"3",3},{"4",4}, 116 | {"5",5},{"6",6},{"7",7},{"8",8}, 117 | {"9",9},{"10",10},{"11",11},{"12",12}, 118 | {"13",13},{"14",14},{"15",15},{"16",16}, 119 | {"17",17},{"18",18},{"19",19},{"20",20}, 120 | {"21",21},{"22",22},{"X",23},{"Y",24}, 121 | {"M",25},{"MT",25}, 122 | {"chr1",1},{"chr2",2},{"chr3",3},{"chr4",4}, 123 | {"chr5",5},{"chr6",6},{"chr7",7},{"chr8",8}, 124 | {"chr9",9},{"chr10",10},{"chr11",11},{"chr12",12}, 125 | {"chr13",13},{"chr14",14},{"chr15",15},{"chr16",16}, 126 | {"chr17",17},{"chr18",18},{"chr19",19},{"chr20",20}, 127 | {"chr21",21},{"chr22",22},{"chrX",23},{"chrY",24}, 128 | {"chrM",25},{"chrMT",25} 129 | }); 130 | } 131 | 132 | bool global_opts::process_global_opts( const std::string& pfx, const bool& low_memory, const double& rsq_buddy, const double& rsq, const double& pthresh, const int& window, const std::vector& tg, const char& ivw_mode, const bool& use_ds, const bool& trim, const double& backward, const bool& h_hom, const bool& h_het, const bool& h_acat, const bool& step_marg ){ 133 | out_prefix = pfx; 134 | low_mem = low_memory; 135 | cis_window_bp = window; 136 | ld_window_bp = 2*window; 137 | RSQ_BUDDY = rsq_buddy; 138 | RSQ_PRUNE = rsq; 139 | LM_ALPHA = pthresh; 140 | filter_genes = (tg.size() > 0); 141 | target_genes = tg; 142 | meta_weight_method = ivw_mode; 143 | use_dosages = use_ds; 144 | trim_gene_ids = trim; 145 | 
backward_thresh = backward; 146 | het_use_hom = h_hom; 147 | het_use_het = h_het; 148 | het_use_acat = h_acat; 149 | step_marginal = step_marg; 150 | 151 | return true; 152 | }; 153 | 154 | bool global_opts::set_lmm_options(const bool& wap){ 155 | write_v_anchors = wap; 156 | return true; 157 | } 158 | 159 | void global_opts::set_exp_weight(const double& w){ 160 | exp_weight_val = w; 161 | return; 162 | } 163 | 164 | void global_opts::use_sloppy_covar(){ 165 | sloppy_covar = true; 166 | return; 167 | } 168 | 169 | void global_opts::set_factor_par(const int& nf_, const double& fp_, const double& ft_){ 170 | n_fa_iter = nf_; 171 | fa_p = fp_; 172 | fa_tau = ft_; 173 | return; 174 | } 175 | 176 | void global_opts::set_max_signals(const int& ms){ 177 | max_signals = ms; 178 | return; 179 | } 180 | 181 | void global_opts::save_residuals(const bool& wb){ 182 | write_resid_mat = wb; 183 | return; 184 | } 185 | 186 | void global_opts::set_global_region(const std::string& reg){ 187 | global_region = reg; 188 | return; 189 | } 190 | 191 | 192 | 193 | 194 | -------------------------------------------------------------------------------- /src/setOptions.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
14 | */ 15 | 16 | 17 | /* 18 | setOptions is for setting global options, which are set at 19 | runtime and used throughout multiple source files 20 | */ 21 | 22 | #ifndef SETOPTIONS_HPP 23 | #define SETOPTIONS_HPP 24 | 25 | #include 26 | #include 27 | #include 28 | 29 | namespace global_opts 30 | { 31 | // GENERAL OPTIONS 32 | extern std::string out_prefix; 33 | extern bool low_mem; 34 | 35 | // INPUT OPTIONS 36 | 37 | extern std::string global_region; 38 | 39 | // GENOTYPE OPTIONS 40 | extern int exclude_missing; 41 | extern double dosage_thresh; 42 | extern double minimum_maf; 43 | extern int minimum_mac; 44 | extern bool use_dosages; 45 | 46 | // COVARIATE OPTIONS 47 | extern bool filter_covariates; 48 | extern std::vector use_covariates; 49 | 50 | // SAMPLE SUBSETTING 51 | extern bool filter_iids; 52 | extern std::vector include_iids; 53 | extern std::vector exclude_iids; 54 | 55 | // GENE OPTIONS 56 | extern bool trim_gene_ids; 57 | 58 | // ANALYSIS OPTIONS 59 | 60 | extern bool write_resid_mat; 61 | 62 | extern int n_fa_iter; 63 | extern double fa_p; 64 | extern double fa_tau; 65 | 66 | // stepwise options 67 | extern int max_signals; 68 | extern int max_steps; 69 | 70 | extern double exp_weight_val; 71 | 72 | // "Sloppy" covariate adjustment 73 | extern bool sloppy_covar; 74 | 75 | extern double RSQ_PRUNE; 76 | extern double RSQ_BUDDY; 77 | 78 | // TESTING OPTIONS 79 | extern bool het_use_hom; 80 | extern bool het_use_het; 81 | extern bool het_use_acat; 82 | 83 | extern bool step_marginal; 84 | 85 | // VARIANT MERGE OPTIONS 86 | extern bool biallelic_only; 87 | 88 | // LMM OPTIONS 89 | extern bool use_grm; 90 | extern bool ml_not_reml; 91 | extern bool write_v_anchors; 92 | 93 | // ANALYSIS MODE 94 | extern char meta_weight_method; 95 | extern bool conditional_analysis; 96 | extern bool trans_eqtl_mode; 97 | extern double backward_thresh; 98 | 99 | // CIS-QTL OPTIONS 100 | extern double LM_ALPHA; 101 | extern int cis_window_bp; 102 | extern bool 
cis_window_gene_body; 103 | 104 | // LD OPTIONS 105 | extern int ld_window_bp; 106 | 107 | // GENE SUBSETTING OPTIONS 108 | 109 | extern bool filter_genes; 110 | extern std::vector target_genes; 111 | 112 | // VARIANT MATCHING OPTIONS 113 | 114 | extern double freq_tol; 115 | extern bool try_match_ambiguous_snv; 116 | 117 | // GLOBAL VARIABLES 118 | 119 | extern std::unordered_map i_chrom_map; 120 | 121 | // PROCESS OPTIONS 122 | 123 | bool process_global_opts(const std::string& pfx, const bool& low_memory, const double& rsq_buddy, const double& rsq, const double& pthresh, const int& window, const std::vector& tg, const char& ivw_mode, const bool& use_ds, const bool& trim, const double& backward, const bool& h_hom, const bool& h_het, const bool& h_acat, const bool& step_marg); 124 | 125 | bool set_lmm_options(const bool& wap); 126 | 127 | void set_max_signals(const int& ms); 128 | 129 | void use_sloppy_covar(); 130 | 131 | void set_exp_weight(const double&); 132 | 133 | void save_residuals(const bool&); 134 | 135 | void set_factor_par(const int&, const double&, const double&); 136 | 137 | void set_global_region(const std::string& ); 138 | 139 | } 140 | 141 | # endif 142 | 143 | -------------------------------------------------------------------------------- /src/setRegions.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
14 | */ 15 | 16 | 17 | 18 | #include "setRegions.hpp" 19 | 20 | 21 | void seek_to(int& i, const std::vector& chr, const std::vector& pos, const std::string& target_chr, const int target_pos){ 22 | while( chr[i] != target_chr && i < chr.size() - 1 ){ 23 | i++; 24 | } 25 | while( chr[i] == target_chr && pos[i] < target_pos && i < chr.size() - 1 ){ 26 | if( chr[i + 1] == target_chr && pos[i + 1] < target_pos ){ 27 | i++; 28 | }else{ 29 | break; 30 | } 31 | } 32 | } 33 | 34 | void block_intervals::make_blocks(bed_data& bdat, genotype_data& gdat, const int& ws){ 35 | 36 | std::vector& chr_b = bdat.chr; 37 | std::vector& start_b = bdat.start; 38 | std::vector& end_b = bdat.end; 39 | 40 | std::vector& chr_g = gdat.chr; 41 | std::vector& pos_g = gdat.pos; 42 | 43 | std::vector& b_s_b = bdat.block_s; 44 | std::vector& b_e_b = bdat.block_e; 45 | 46 | std::vector& v_s_b = bdat.v_s; 47 | std::vector& v_e_b = bdat.v_e; 48 | 49 | 50 | bool block_by_variant = false; 51 | 52 | // int block_size = bs; 53 | int window = ws; 54 | 55 | int block_id = 0; 56 | 57 | int n_b = chr_b.size(); 58 | int n_g = chr_g.size(); 59 | 60 | bcf_id_s = std::vector(n_g,-1); 61 | bcf_id_e = std::vector(n_g,-1); 62 | 63 | bed_id_s = std::vector(n_b,-1); 64 | bed_id_e = std::vector(n_b,-1); 65 | 66 | b_s_b = std::vector(n_b,-1); 67 | b_e_b = std::vector(n_b,-1); 68 | 69 | v_s_b = std::vector(n_b,-1); 70 | v_e_b = std::vector(n_b,-1); 71 | 72 | int i_b = 0; int i_g = 0; 73 | int i_b_e = 0; int i_g_e = 0; 74 | 75 | double magic_fraction = 0.8; 76 | 77 | while( i_b < n_b ){ 78 | 79 | i_b_e = i_b; 80 | 81 | seek_to(i_b_e, chr_b, start_b, chr_b[i_b], end_b[i_b] + magic_fraction * window); 82 | seek_to(i_g, chr_g, pos_g, chr_b[i_b], start_b[i_b] - window); 83 | 84 | if( chr_g[i_g] == chr_b[i_b] ){ 85 | 86 | i_g_e = i_g_e > i_g ? 
i_g_e : i_g; 87 | seek_to(i_g_e, chr_g, pos_g, chr_b[i_b], end_b[i_b_e] + window + 1); 88 | 89 | if( i_b_e >= i_b && pos_g[i_g] - end_b[i_b] <= window && chr_g[i_g] == chr_b[i_b] ){ 90 | 91 | bed_s.push_back(i_b); 92 | bed_e.push_back(i_b_e); 93 | bcf_s.push_back(i_g); 94 | bcf_e.push_back(i_g_e); 95 | 96 | for( int j = i_g; j <= i_g_e; ++j ){ 97 | if( bcf_id_s[j] < 0 ){ 98 | bcf_id_s[j] = block_id; 99 | } 100 | bcf_id_e[j] = block_id; 101 | } 102 | 103 | std::string region = chr_g[i_g] + ":" + std::to_string(pos_g[i_g]) + "-" + std::to_string(pos_g[i_g_e]); 104 | 105 | bcf_regions.push_back(region); 106 | 107 | for( int i = i_b; i <= i_b_e; ++i ){ 108 | if( bed_id_s[i] < 0 ){ 109 | bed_id_s[i] = block_id; 110 | b_s_b[i] = block_id; 111 | } 112 | bed_id_e[i] = block_id; 113 | b_e_b[i] = block_id; 114 | 115 | v_s_b[i] = (i == i_b ? i_g : v_s_b[i-1]); 116 | seek_to(v_s_b[i], chr_g, pos_g, chr_b[i_b], start_b[i] - window); 117 | 118 | v_e_b[i] = (i == i_b ? i_g : v_e_b[i-1]); 119 | 120 | seek_to(v_e_b[i], chr_g, pos_g, chr_b[i_b], end_b[i] + window + 1); 121 | } 122 | block_id++; 123 | } 124 | } 125 | i_b = i_b_e + 1; 126 | // std::cerr << block_id << " blocks ...\n"; 127 | } 128 | } 129 | 130 | 131 | -------------------------------------------------------------------------------- /src/setRegions.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
14 | */ 15 | 16 | 17 | 18 | #ifndef SETREGIONS_HPP 19 | #define SETREGIONS_HPP 20 | 21 | #include 22 | 23 | #include "miscUtils.hpp" 24 | #include "setOptions.hpp" 25 | #include "readBED.hpp" 26 | #include "genotypeData.hpp" 27 | 28 | 29 | class block_intervals 30 | { 31 | public: 32 | std::vector bed_s; 33 | std::vector bed_e; 34 | std::vector bcf_s; 35 | std::vector bcf_e; 36 | 37 | std::vector bcf_id_s; 38 | std::vector bcf_id_e; 39 | 40 | std::vector bed_id_s; 41 | std::vector bed_id_e; 42 | 43 | std::vector bcf_regions; 44 | //std::vector gene_bcf_regions; 45 | 46 | block_intervals(){}; 47 | block_intervals(bed_data& bdat, genotype_data& gdat, const int& ws){ 48 | make_blocks(bdat, gdat, ws); 49 | }; 50 | int size(){ return bed_s.size();}; 51 | 52 | void make_blocks(bed_data&, genotype_data&, const int&); 53 | 54 | }; 55 | 56 | 57 | #endif 58 | 59 | 60 | 61 | 62 | -------------------------------------------------------------------------------- /src/transMapping.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 
14 | */ 15 | 16 | 17 | #include "transMapping.hpp" 18 | 19 | 20 | void parse_cis_signal_data(const std::string& fn, const std::string& region, Eigen::MatrixXd& X, std::vector& samples, std::vector& genes){ 21 | data_parser dp; 22 | dp.add_header(samples, 5); 23 | dp.add_field(genes, 3); 24 | dp.add_matrix(X, true, 5); 25 | dp.parse_file(fn, region); 26 | return; 27 | } 28 | 29 | Eigen::VectorXd getVectorXd(const std::vector& v, const int& s, const int& n){ 30 | Eigen::VectorXd ev(n); 31 | for(int i = 0, ii = s; i < n; i++, ii++){ 32 | ev(i) = v[ii]; 33 | } 34 | return ev; 35 | } 36 | 37 | 38 | double u_stat_pval(const double& u_stat, const double& m, const double& n){ 39 | double f_stat = u_stat * u_stat; 40 | f_stat = (n - m - 1)*f_stat/(n - 1 - f_stat); 41 | if( f_stat < 0.00 || std::isnan(f_stat) ){ 42 | std::cerr << "Warning: F statistic = " << f_stat << "\n"; 43 | return -1.00; 44 | } 45 | return pf(f_stat, 1, n - m - 1, true); 46 | } 47 | 48 | double usq_stat_pval(const double& usq_stat, const double& m, const double& n){ 49 | double f_stat = (n - m - 1)*usq_stat/(n - 1 - usq_stat); 50 | if( f_stat < 0.00 || std::isnan(f_stat) ){ 51 | std::cerr << "Warning: F statistic = " << f_stat << "\n"; 52 | return -1.00; 53 | } 54 | return pf(f_stat, 1, n - m - 1, true); 55 | } 56 | 57 | void run_trans_QTL_analysis(bcf_srs_t*& sr, bcf_hdr_t*& hdr, genotype_data& g_data, table& c_data, bed_data& e_data, const bool& rknorm_y, const bool& rknorm_r, const bool& make_sumstat, const bool& make_long, const int& chunk_size, const std::string& cis_fp, const std::string& cis_fp_region) 58 | { 59 | 60 | struct Zsn {int s; int n; }; 61 | 62 | Eigen::MatrixXd Z; 63 | std::vector Z_genes; 64 | std::vector Z_samples; 65 | std::vector Z_map; 66 | bool adj_cis = false; 67 | bool adj_cis_sloppy = global_opts::sloppy_covar; 68 | 69 | if( cis_fp != "" ){ 70 | adj_cis = true; 71 | 72 | parse_cis_signal_data(cis_fp, "", Z, Z_samples, Z_genes); 73 | 74 | std::unordered_map Z_map0; 75 | 76 | 
std::string last_gene = Z_genes[0]; 77 | int ni = 0; 78 | for(int i = 0; i < Z_genes.size(); i++){ 79 | std::string& gene_i = Z_genes[i]; 80 | if( gene_i == last_gene ){ 81 | ni++; 82 | }else{ 83 | Z_map0[last_gene] = {i - ni, ni}; 84 | ni = 1; 85 | last_gene = gene_i; 86 | } 87 | if( i + 1 == Z_genes.size() ){ 88 | Z_map0[gene_i] = {i + 1 - ni, ni}; // final run ends at i (inclusive) 89 | } 90 | } 91 | 92 | for( const std::string& gene_i : e_data.gene_id ){ 93 | if( Z_map0.find(gene_i) == Z_map0.end() ){ 94 | Z_map.push_back({-1,0}); 95 | }else{ 96 | // std::cerr << "\n\nMatch!\n\n"; 97 | Z_map.push_back(Z_map0[gene_i]); 98 | } 99 | } 100 | } 101 | 102 | Eigen::MatrixXd &Y = e_data.data_matrix; 103 | Eigen::MatrixXd &U = c_data.data_matrix; 104 | 105 | make_half_hat_matrix(U); 106 | 107 | std::cerr << "Started trans-QTL analysis ...\n"; 108 | 109 | if( rknorm_y ){ 110 | std::cerr << "Rank-normalizing expression traits ... \n"; 111 | rank_normalize(Y); 112 | } 113 | std::cerr << "Scaling expression traits ... \n"; 114 | std::vector y_scale; 115 | scale_and_center(Y, y_scale); 116 | 117 | if( !global_opts::low_mem ){ 118 | 119 | std::cerr << "Calculating genotype-covariate covariance...\n"; 120 | 121 | Eigen::MatrixXd UtG = U.transpose() * g_data.genotypes; 122 | std::cerr << "Calculating genotype residual variances ..."; 123 | for( int i = 0; i < UtG.cols(); i++) 124 | { 125 | 126 | g_data.var[i] = g_data.genotypes.col(i).squaredNorm() - UtG.col(i).squaredNorm(); 127 | } 128 | std::cerr << "Done.\n"; 129 | } 130 | 131 | std::cerr << "Calculating expression residuals...\n"; 132 | make_resid_from_half_hat(Y, U); 133 | 134 | if( rknorm_r ){ 135 | std::cerr << "Rank-normalizing expression residuals ...\n"; 136 | rank_normalize(Y); 137 | std::cerr << "Re-residualizing transformed residuals ...\n"; 138 | make_resid_from_half_hat(Y, U); 139 | scale_and_center(Y); 140 | } 141 | 142 | if( adj_cis ){ 143 | std::cerr << "Residualizing cis signals...\n"; 144 | make_resid_from_half_hat(Z, U); 145 | std::cerr <<
"Re-residualizing expression ...\n"; 146 | for( int i = 0; i < e_data.gene_id.size(); i++){ 147 | const int& s_i = Z_map[i].s; 148 | const int& n_i = Z_map[i].n; 149 | if( n_i > 0 ){ 150 | Eigen::MatrixXd Z_i = get_half_hat_matrix(Z.middleCols(s_i, n_i)); 151 | Z.middleCols(s_i, n_i) = Z_i; 152 | 153 | Eigen::VectorXd y_res_z = resid_vec_from_half_hat(Y.col(i), Z_i); 154 | Y.col(i) = y_res_z; 155 | } 156 | } 157 | } 158 | 159 | std::cerr << "Scaling expression residuals ...\n"; 160 | scale_and_center(Y, e_data.stdev); 161 | 162 | Y.transposeInPlace(); 163 | 164 | double n_samples = U.rows(); 165 | double n_covar = U.cols(); 166 | 167 | int n_genes = Y.rows(); 168 | 169 | std::vector gene_max_val(n_genes, 0.0); 170 | std::vector gene_max_idx(n_genes, 0); 171 | 172 | std::string block_file_path = global_opts::out_prefix + "." + "trans_sumstats" + ".txt.gz"; 173 | std::string gene_file_path = global_opts::out_prefix + "." + "trans_gene_table" + ".txt.gz"; 174 | std::string long_file_path = global_opts::out_prefix + "." 
+ "trans_long_table" + ".txt.gz"; 175 | 176 | BGZF* block_file; 177 | BGZF* gene_file; 178 | BGZF* long_file; 179 | 180 | gene_file = bgzf_open(gene_file_path.c_str(), "w"); 181 | write_to_bgzf("#chrom\tpos\tref\talt\tgene_chrom\tgene_id\tbeta\tse\tpval\n", gene_file); 182 | 183 | long_file = bgzf_open(long_file_path.c_str(), "w"); 184 | write_to_bgzf("#chrom\tpos\tref\talt\tgene_chrom\tgene_id\tbeta\tse\tpval\n", long_file); 185 | 186 | int bl = 0; 187 | 188 | int n_blocks = ceil( (double) g_data.n_variants / chunk_size ); // divide as double so ceil() can round up 189 | 190 | std::string iter_cerr_suffix = " genotype blocks out of " + std::to_string(n_blocks) + " total"; 191 | std::cerr << "Processed "; 192 | print_iter_cerr(1, 0, iter_cerr_suffix); 193 | 194 | double F_crit = qf(global_opts::LM_ALPHA, 1, n_samples - n_covar - 1, true); 195 | double P_crit = std::sqrt(F_crit * (n_samples - 1)/( F_crit + n_samples - n_covar - 1)); 196 | 197 | for( ; ; bl++ ){ 198 | 199 | int s_g = bl * chunk_size; 200 | int n_g = chunk_size; 201 | n_g = n_g < g_data.n_variants ? n_g : g_data.n_variants; 202 | n_g = n_g < g_data.n_variants - s_g ?
n_g : g_data.n_variants - s_g; 203 | 204 | if( s_g >= g_data.n_variants || n_g <= 0 ){ 205 | break; 206 | } 207 | 208 | if( n_g > 0 && s_g < g_data.n_variants ){ 209 | 210 | Eigen::MatrixXd StdScore; 211 | Eigen::MatrixXd Denom; 212 | Eigen::VectorXd dV; 213 | 214 | int s_G, n_G; 215 | if( global_opts::low_mem ){ 216 | 217 | g_data.read_genotypes(sr, hdr, s_g, n_g ); 218 | 219 | Eigen::SparseMatrix& G = g_data.genotypes; 220 | 221 | Eigen::VectorXd UtG_block_sqnm = (U.transpose() * G).colwise().squaredNorm().eval(); 222 | 223 | for( int si = s_g, ii = 0; si < s_g + n_g; si++, ii++) 224 | { 225 | g_data.var[si] = G.col(ii).squaredNorm() - UtG_block_sqnm(ii); 226 | //cout << g_data.var[i] << "\n"; 227 | } 228 | s_G = 0; 229 | n_G = g_data.genotypes.cols(); 230 | }else{ 231 | s_G = s_g; 232 | n_G = n_g; 233 | } 234 | 235 | const Eigen::SparseMatrix& G = g_data.genotypes.middleCols(s_G, n_G); 236 | 237 | dV = getVectorXd(g_data.var, s_g, n_g); 238 | 239 | if( (!adj_cis) || adj_cis_sloppy ){ 240 | StdScore = dV.cwiseSqrt().asDiagonal().inverse() * (Y * G).transpose().eval(); 241 | }else{ 242 | StdScore = (Y * G).transpose(); 243 | Eigen::MatrixXd GtZ = G.transpose() * Z; 244 | 245 | Denom.resize( StdScore.rows(), StdScore.cols() ); 246 | 247 | for( int gi = 0; gi < e_data.gene_id.size(); gi++){ 248 | const int& s_i = Z_map[gi].s; 249 | const int& n_i = Z_map[gi].n; 250 | if( n_i > 0 ){ 251 | Eigen::VectorXd dV_adj = dV - GtZ.middleCols(s_i, n_i).cwiseAbs2().rowwise().sum(); 252 | Denom.col(gi) = dV - GtZ.middleCols(s_i, n_i).cwiseAbs2().rowwise().sum(); 253 | }else{ 254 | Denom.col(gi) = dV; 255 | } 256 | } 257 | StdScore = StdScore.cwiseQuotient(Denom.cwiseSqrt()); 258 | } 259 | 260 | std::stringstream long_line; 261 | 262 | for(int j = 0; j < StdScore.cols(); j++){ 263 | 264 | double n_covar_j = n_covar; 265 | 266 | if( adj_cis && !adj_cis_sloppy ){ 267 | n_covar_j += Z_map[j].n; 268 | } 269 | 270 | for(int i = 0; i < StdScore.rows(); i++){ 271 | const auto& val = 
StdScore(i,j); 272 | 273 | if( std::abs(val) > std::abs(gene_max_val[j]) ){ 274 | gene_max_val[j] = val; 275 | gene_max_idx[j] = s_g + i; 276 | } 277 | 278 | if( val > P_crit || val < - P_crit ){ 279 | 280 | double V; 281 | if( adj_cis && !adj_cis_sloppy ){ 282 | V = Denom(i,j); 283 | if( V/dV(i) < 0.001 || V <= 0 || std::isnan(V) ){ 284 | continue; 285 | } 286 | }else{ 287 | V = dV(i); 288 | } 289 | 290 | const double& scale = e_data.stdev[j]; 291 | double U = val * std::sqrt(V); 292 | double beta = U/V; 293 | double beta_se = std::sqrt( ((n_samples - 1)/V- beta*beta)/(n_samples - n_covar_j - 1) ); 294 | 295 | 296 | double pval_esnp = u_stat_pval(val, n_covar_j, n_samples); 297 | 298 | int ii = s_g + i; 299 | if( g_data.flipped[ii] ) beta *= (-1.00); 300 | 301 | long_line << 302 | clean_chrom(g_data.chr[ii]) << "\t" << 303 | g_data.pos[ii] << "\t" << 304 | g_data.ref[ii] << "\t" << 305 | g_data.alt[ii] << "\t" << 306 | clean_chrom(e_data.chr[j]) << "\t" << 307 | e_data.gene_id[j] << "\t" << 308 | y_scale[j]*scale*beta << "\t" << 309 | y_scale[j]*scale*beta_se << "\t" << 310 | pval_esnp << "\n"; 311 | } 312 | } 313 | } 314 | 315 | write_to_bgzf(long_line.str().c_str(), long_file); 316 | 317 | }else{ 318 | std::cerr << "\nERROR: " <& L, Eigen::VectorXd& GRM_lambda, const bool& rknorm_y, const bool& rknorm_r, const bool& make_sumstat, const bool& make_long, const int& chunk_size, const std::string& theta_path, const std::string& anchor_path) 364 | { 365 | 366 | Eigen::SparseMatrix Q; 367 | Eigen::VectorXd Q_lambda; 368 | 369 | subset_eigen(L, GRM_lambda, Q, Q_lambda); 370 | 371 | Eigen::MatrixXd& Y = e_data.data_matrix; 372 | Eigen::MatrixXd& C = c_data.data_matrix; 373 | 374 | double n_traits = Y.cols(); 375 | double n_samples = Y.rows(); 376 | double n_snps = g_data.n_variants; 377 | double n_covar = C.cols(); 378 | 379 | if( rknorm_y ){ 380 | if( e_data.is_residual ){ 381 | std::cerr << "IGNORING instruction to rank-normalize LMM residuals ... 
\n"; 382 | }else{ 383 | std::cerr << "Rank-normalizing expression traits ... \n"; 384 | rank_normalize(Y); 385 | } 386 | } 387 | 388 | std::vector y_scale; 389 | 390 | Eigen::MatrixXd QtY, CtY, QtC, QtG, CtC, CtC_i; 391 | bool gvar_precomputed = ( anchor_path != "" ); 392 | 393 | if( !e_data.is_residual ){ 394 | 395 | std::cerr << "Scaling expression traits ... \n"; 396 | 397 | scale_and_center(Y, y_scale); 398 | 399 | QtY = (Q.transpose() * Y).eval(); 400 | CtY = (C.transpose() * Y).eval(); 401 | Y = (L.transpose() * Y).eval(); 402 | }else{ 403 | 404 | y_scale = std::vector( (int) n_traits, 1.00 ); 405 | } 406 | 407 | if( !gvar_precomputed || !e_data.is_residual ){ 408 | std::cerr << "Calculating partial rotations ...\n"; 409 | 410 | QtC = (Q.transpose() * C).eval(); 411 | CtC = (C.transpose() * C).eval(); 412 | 413 | if( !gvar_precomputed ){ 414 | QtG = (Q.transpose() * g_data.genotypes).eval(); 415 | CtC_i = (CtC.inverse()).eval(); 416 | } 417 | 418 | std::cerr << "Done.\n"; 419 | } 420 | 421 | std::vector hsq_vals{0.0, 0.5, 1.0}; 422 | 423 | Eigen::MatrixXd V_mat, X; 424 | 425 | if( !gvar_precomputed ){ 426 | calculate_V_anchor_points(V_mat, g_data, C, hsq_vals, CtC, CtC_i, QtG, QtC, Q_lambda); 427 | QtG.resize(0,0); 428 | CtC_i.resize(0,0); 429 | }else{ 430 | read_V_anchor_points(anchor_path, V_mat, g_data, hsq_vals); 431 | } 432 | 433 | std::vector gene_max_val((int) n_traits, 0.0); 434 | std::vector gene_max_idx((int) n_traits, 0); 435 | 436 | std::string block_file_path = global_opts::out_prefix + "." + "trans_sumstats" + ".txt.gz"; 437 | std::string gene_file_path = global_opts::out_prefix + "." + "trans_gene_table" + ".txt.gz"; 438 | std::string long_file_path = global_opts::out_prefix + "." 
+ "trans_long_table" + ".txt.gz"; 439 | 440 | BGZF* block_file; 441 | BGZF* gene_file; 442 | BGZF* long_file; 443 | 444 | gene_file = bgzf_open(gene_file_path.c_str(), "w"); 445 | write_to_bgzf("#chrom\tpos\tref\talt\tgene_chrom\tgene_id\tbeta\tse\tpval\n", gene_file); 446 | 447 | long_file = bgzf_open(long_file_path.c_str(), "w"); 448 | write_to_bgzf("#chrom\tpos\tref\talt\tgene_chrom\tgene_id\tbeta\tse\tpval\n", long_file); 449 | 450 | 451 | std::vector phi_v(Y.cols()); 452 | std::vector hsq_v(Y.cols()); 453 | std::vector sigma_v(Y.cols()); 454 | std::vector SSR_v(Y.cols()); 455 | 456 | e_data.stdev.resize(Y.cols()); 457 | 458 | theta_data t_data; 459 | bool use_theta = ( theta_path != "" ); 460 | 461 | if( use_theta ){ 462 | std::cerr << "Set null model for "; 463 | t_data.open(theta_path); 464 | }else{ 465 | std::cerr << "Fit null model for "; 466 | } 467 | 468 | X = (L.transpose() * C).eval(); 469 | 470 | std::string iter_cerr_suffix = " traits out of " + std::to_string(Y.cols()) + " total"; 471 | print_iter_cerr(1, 0, iter_cerr_suffix); 472 | 473 | int last_j = 0; 474 | for(int j = 0; j < (int) n_traits; j++ ){ 475 | DiagonalXd Vi; 476 | double sigma2; 477 | double phi; 478 | double tau2; 479 | 480 | if( use_theta ){ 481 | t_data.getTheta(j, e_data.gene_id[j], sigma2, tau2, phi, SSR_v[j], y_scale[j]); 482 | Vi = calc_Vi(phi, GRM_lambda); 483 | }else{ 484 | 485 | if( e_data.is_residual ){ 486 | std::cerr << "Error: --theta must be specified when --bed contains LMM residuals.\n"; 487 | abort(); 488 | } 489 | 490 | LMM_fitter fit(X, Y.col(j), GRM_lambda); 491 | fit.fit_REML(); 492 | 493 | Vi = fit.Vi; 494 | sigma2 = fit.sigma2; 495 | phi = fit.phi; 496 | tau2 = phi*sigma2; 497 | } 498 | 499 | double hsq = tau2 / (tau2 + sigma2); 500 | double scale = std::sqrt(tau2 + sigma2); 501 | 502 | if( !e_data.is_residual ){ 503 | DiagonalXd Psi = calc_Psi(phi, Q_lambda); 504 | Eigen::MatrixXd XtDX = (CtC - QtC.transpose() * Psi * QtC )/(1.00 + phi); 505 | Eigen::VectorXd 
XtDy = X.transpose() * Vi * Y.col(j); 506 | Eigen::VectorXd b = XtDX.colPivHouseholderQr().solve(XtDy); 507 | Eigen::VectorXd y_hat = X * b; 508 | 509 | Eigen::VectorXd y_res = Y.col(j) - y_hat; 510 | 511 | SSR_v[j] = y_res.dot(Vi * Y.col(j))/sigma2; 512 | 513 | // Y now stores the rotated residuals. 514 | Y.col(j) = (Vi * y_res/std::sqrt(sigma2)).eval(); 515 | } 516 | 517 | phi_v[j] = phi; 518 | hsq_v[j] = hsq; 519 | sigma_v[j] = sigma2; 520 | 521 | e_data.stdev[j] = std::sqrt(sigma2); 522 | 523 | thinned_iter_cerr(last_j, j+1, iter_cerr_suffix, 20); 524 | } 525 | print_iter_cerr(last_j, Y.cols(), iter_cerr_suffix); 526 | 527 | if( !e_data.is_residual ){ 528 | Y = (L * Y).eval(); 529 | } 530 | 531 | Eigen::MatrixXd VBeta = getVBeta(hsq_v, phi_v, hsq_vals.size()); 532 | 533 | std::cerr << "\n"; 534 | 535 | int bl = 0; 536 | 537 | int n_blocks = ceil( (double) g_data.n_variants / chunk_size ); // divide as double so ceil() can round up 538 | 539 | iter_cerr_suffix = " genotype blocks out of " + std::to_string(n_blocks) + " total"; 540 | std::cerr << "Processed "; 541 | print_iter_cerr(1, 0, iter_cerr_suffix); 542 | 543 | double F_crit = qf(global_opts::LM_ALPHA, 1, n_samples - n_covar - 1, true); 544 | double P_crit = std::sqrt(F_crit * (n_samples - 1)/( F_crit + n_samples - n_covar - 1)); 545 | double Psq_crit = P_crit*P_crit; 546 | 547 | 548 | for( ; ; bl++ ){ 549 | 550 | int s_g = bl * chunk_size; 551 | int n_g = chunk_size; 552 | n_g = n_g < g_data.n_variants ? n_g : g_data.n_variants; 553 | n_g = n_g < g_data.n_variants - s_g ?
n_g : g_data.n_variants - s_g; 554 | 555 | if( s_g >= g_data.n_variants || n_g <= 0 ){ 556 | break; 557 | } 558 | 559 | if( n_g > 0 && s_g < g_data.n_variants ){ 560 | 561 | Eigen::MatrixXd StdScore; 562 | Eigen::VectorXd dV; 563 | 564 | if( global_opts::low_mem ){ 565 | std::cerr << "Low mem trans LMM not supported.\n"; 566 | abort(); 567 | } 568 | 569 | const Eigen::SparseMatrix& G = g_data.genotypes.middleCols(s_g, n_g); 570 | 571 | Eigen::MatrixXd U_b = G.transpose() * Y; 572 | Eigen::MatrixXd V_b = V_mat.middleRows(s_g, n_g) * VBeta; 573 | 574 | Eigen::VectorXd scale_vec(U_b.cols()); 575 | for(int i = 0; i < U_b.cols(); i++){ 576 | scale_vec(i) = (n_samples - 1)/SSR_v[i]; 577 | } 578 | 579 | Eigen::MatrixXd StdScore2 = (U_b.cwiseAbs2() * scale_vec.asDiagonal()).cwiseQuotient(V_b); 580 | 581 | std::stringstream long_line; 582 | 583 | for(int j = 0; j < U_b.cols(); j++){ 584 | for(int i = 0; i < U_b.rows(); i++){ 585 | 586 | const double& val2 = StdScore2(i,j); 587 | 588 | if( val2 > gene_max_val[j] ){ 589 | gene_max_val[j] = val2; 590 | gene_max_idx[j] = s_g + i; 591 | } 592 | 593 | if( val2 > Psq_crit ){ 594 | const double& V = V_b(i,j); 595 | const double& scale = e_data.stdev[j]; 596 | const double& U = U_b(i,j); 597 | double beta = U/V; 598 | double beta_se = std::sqrt( (SSR_v[j]/V- beta*beta)/(n_samples - n_covar - 1) ); 599 | 600 | double pval_esnp = usq_stat_pval(val2, n_covar, n_samples); 601 | 602 | int ii = s_g + i; 603 | 604 | if( g_data.flipped[ii] ) beta *= (-1.00); 605 | 606 | long_line << 607 | clean_chrom(g_data.chr[ii]) << "\t" << 608 | g_data.pos[ii] << "\t" << 609 | g_data.ref[ii] << "\t" << 610 | g_data.alt[ii] << "\t" << 611 | clean_chrom(e_data.chr[j]) << "\t" << 612 | e_data.gene_id[j] << "\t" << 613 | y_scale[j]*scale*beta << "\t" << 614 | y_scale[j]*scale*beta_se << "\t" << 615 | pval_esnp << "\n"; 616 | } 617 | } 618 | } 619 | 620 | write_to_bgzf(long_line.str().c_str(), long_file); 621 | }else{ 622 | std::cerr << "\nERROR: " < 4 | 5 |
This file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | 17 | #ifndef TRANSMAPPING_HPP 18 | #define TRANSMAPPING_HPP 19 | 20 | #include "setOptions.hpp" 21 | #include "readBED.hpp" 22 | #include "setRegions.hpp" 23 | #include "readTable.hpp" 24 | #include "genotypeData.hpp" 25 | #include "htsWrappers.hpp" 26 | #include "fitUtils.hpp" 27 | #include "fitModels.hpp" 28 | #include "miscUtils.hpp" 29 | #include "mathStats.hpp" 30 | 31 | #include "apexLMM.hpp" 32 | #include "cisMapping.hpp" 33 | 34 | // static std::vector Z0(0); 35 | 36 | void run_trans_QTL_analysis(bcf_srs_t*& sr, bcf_hdr_t*& hdr, genotype_data& g_data, table& c_data, bed_data& e_data, const bool& rknorm_y, const bool& rknorm_r, const bool& make_sumstat, const bool& make_long, const int& chunk_size, const std::string& cis_fp = "", const std::string& cis_fp_region = "" ); 37 | 38 | void run_trans_QTL_analysis_LMM(bcf_srs_t*& sr, bcf_hdr_t*& hdr, genotype_data& g_data, table& c_data, bed_data& e_data, Eigen::SparseMatrix& L, Eigen::VectorXd& GRM_lambda, const bool& rknorm_y, const bool& rknorm_r, const bool& make_sumstat, const bool& make_long, const int& chunk_size, const std::string& theta_path = "", const std::string& anchor_path = ""); 39 | 40 | 41 | void parse_cis_signal_data(const std::string& fn, const std::string& region, Eigen::MatrixXd& X, std::vector& samples, std::vector& genes); 42 | 43 | 44 | #endif 45 | 46 | -------------------------------------------------------------------------------- /src/xzFormat.hpp: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright (C) 2020 3 | Author: Corbin Quick 4 | 5 | This 
file is a part of APEX. 6 | 7 | APEX is distributed "AS IS" in the hope that it will be 8 | useful, but WITHOUT ANY WARRANTY; without even the implied 9 | warranty of MERCHANTABILITY, NON-INFRINGEMENT, or FITNESS 10 | FOR A PARTICULAR PURPOSE. 11 | 12 | The above copyright notice and disclaimer of warranty must 13 | be included in all copies or substantial portions of APEX. 14 | */ 15 | 16 | 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include 22 | #include 23 | #include 24 | 25 | 26 | // Class to record recent int-value function calls 27 | // for FIFO-type caching / temporary memoization. 28 | class call_tracker 29 | { 30 | 31 | private: 32 | std::vector most_recent; 33 | int n; 34 | public: 35 | call_tracker(const int& size) : n(size) {}; 36 | call_tracker() : n(0) {}; 37 | void set_size(const int& n_recent){ 38 | n = n_recent; 39 | } 40 | 41 | int push_new(const int& x){ 42 | most_recent.push_back(x); 43 | if( most_recent.size() < n ){ 44 | // Don't remove anything from cache. 45 | return -1; 46 | } 47 | int out = most_recent.front(); 48 | most_recent.erase(most_recent.begin()); 49 | return out; 50 | } 51 | 52 | int check_new( const int& x ){ 53 | if( most_recent.size() < n ){ 54 | most_recent.push_back(x); 55 | return -1; 56 | } 57 | int x_i = -1; 58 | for(int i = (int) most_recent.size() - 1; i >= 0; i-- ){ 59 | if( most_recent[i] == x ){ 60 | x_i = i; 61 | break; 62 | } 63 | } 64 | if( x_i >= 0 ){ 65 | std::swap(most_recent[most_recent.size()-1], most_recent[x_i]); 66 | return -1; 67 | }else{ 68 | most_recent.push_back(x); 69 | int out = most_recent.front(); 70 | most_recent.erase(most_recent.begin()); 71 | return out; 72 | } 73 | } 74 | }; 75 | 76 | 77 | 78 | // The code for the xzReader class is partly 79 | // derived from the following GitHub repository: 80 | // https://github.com/CTSRD-CHERI/cheritrace.git 81 | // The license issued by the original authors 82 | // (Alfredo Mazzinghi and David T.
Chisnall) for this code is here: 83 | // http://www.beri-open-systems.org/legal/license-1-0.txt 84 | 85 | class xzReader 86 | { 87 | protected: 88 | struct offsets 89 | { 90 | off_t c_start; size_t c_size; 91 | off_t u_start; size_t u_size; 92 | }; 93 | size_t read_compressed(void *buffer, off_t start, size_t n_bytes){ 94 | if ( start < 0 || start > compressed_file_size ){ 95 | return 0; 96 | } 97 | if (start + n_bytes > compressed_file_size){ 98 | n_bytes = compressed_file_size - start; 99 | } 100 | size_t n_completed = 0; 101 | while (n_bytes > 0) 102 | { 103 | ssize_t n_bytes_i = pread(compressed_file, buffer, n_bytes, start); 104 | if (n_bytes_i < 0) break; 105 | n_completed += n_bytes_i; 106 | start += n_bytes_i; 107 | n_bytes -= n_bytes_i; 108 | buffer = (void*)((char*)buffer + n_bytes_i); 109 | } 110 | return n_completed; 111 | } 112 | 113 | call_tracker cache_czar; 114 | std::vector> cache; 115 | 116 | int compressed_file; 117 | 118 | std::vector block_offsets; 119 | size_t compressed_file_size = 0; 120 | 121 | lzma_stream_flags stream_flags; 122 | 123 | public: 124 | 125 | void open(const std::string& filepath) 126 | { 127 | 128 | compressed_file = ::open(filepath.c_str(), O_RDONLY); 129 | compressed_file_size = lseek(compressed_file, 0, SEEK_END); 130 | 131 | uint8_t buffer[12]; 132 | read_compressed((void*)buffer, compressed_file_size-12, 12); 133 | 134 | if ( lzma_stream_footer_decode(&stream_flags, buffer) != LZMA_OK ){ 135 | std::cerr << "ERROR: Cannot read xz footer.\n"; 136 | close(compressed_file); 137 | abort(); 138 | } 139 | 140 | std::unique_ptr index_buffer(new uint8_t[stream_flags.backward_size]); 141 | read_compressed((void*)index_buffer.get(), 142 | compressed_file_size - stream_flags.backward_size - 12, 143 | stream_flags.backward_size); 144 | lzma_index *idx; 145 | uint64_t mem = UINT64_MAX; 146 | size_t pos = 0; 147 | 148 | if ( lzma_index_buffer_decode(&idx, &mem, nullptr,index_buffer.get(), &pos, stream_flags.backward_size) != LZMA_OK 
){ 149 | std::cerr << "ERROR: Cannot read xz index.\n"; 150 | close(compressed_file); 151 | abort(); 152 | } 153 | lzma_index_iter iter; 154 | lzma_index_iter_init(&iter, idx); 155 | while (!lzma_index_iter_next(&iter, LZMA_INDEX_ITER_ANY)) 156 | { 157 | struct offsets block; 158 | block.c_start = iter.block.compressed_file_offset; 159 | block.c_size = iter.block.total_size; 160 | block.u_start = iter.block.uncompressed_file_offset; 161 | block.u_size = iter.block.uncompressed_size; 162 | block_offsets.push_back(block); 163 | } 164 | lzma_index_end(idx, nullptr); 165 | offsets &b = block_offsets.back(); 166 | 167 | cache_czar.set_size(100); 168 | cache.resize(block_offsets.size()); 169 | } 170 | xzReader(const std::string& filepath){ 171 | open(filepath); 172 | } 173 | xzReader() {}; 174 | 175 | size_t read(void *buffer, off_t start, size_t length) 176 | { 177 | int block_idx = get_block_from_offset(start); 178 | if (block_idx < 0) 179 | { 180 | return 0; 181 | } 182 | size_t copied = 0; 183 | while (length > 0) 184 | { 185 | if (block_idx >= (int)block_offsets.size()) 186 | { 187 | break; 188 | } 189 | if( cache_block(block_idx) < 0 ){ 190 | std::cerr << "ERROR: Cannot read block " << block_idx << "\n"; break; // stop here: the block failed to decode, so do not copy from it 191 | } 192 | const auto& data = cache[block_idx]; 193 | auto &b = block_offsets[block_idx++]; 194 | size_t copy_start = start - b.u_start; 195 | size_t copy_length = b.u_size - copy_start; 196 | copy_length = std::min(copy_length, length); 197 | memcpy(buffer, data.get()+copy_start, copy_length); 198 | copied += copy_length; 199 | start += copy_length; 200 | length -= copy_length; 201 | buffer = (void*)((char*)buffer + copy_length); 202 | } 203 | return copied; 204 | } 205 | 206 | int get_block_from_offset(off_t off) 207 | { 208 | int i = 0; 209 | for(const offsets& bi : block_offsets){ 210 | if( (bi.u_start <= off) && (bi.u_start + bi.u_size > off) ){ 211 | return i; 212 | } 213 | i++; 214 | } 215 | return -1; 216 | } 217 | 218 | int cache_block(const int& i) 219
    {
        if( cache[i] != nullptr ){
            return 1;
        }else{
            int prune_cache = cache_czar.push_new(i);
            if( prune_cache > 0 ){
                cache[prune_cache] = nullptr;
            }
        }

        offsets& b = block_offsets[i];

        // Read the compressed block (header included) into memory.
        std::unique_ptr<uint8_t[]> buffer(new uint8_t[b.c_size]);

        read_compressed((void*)buffer.get(), b.c_start, b.c_size);
        lzma_block block;
        lzma_filter filters[LZMA_FILTERS_MAX + 1];

        filters[0].id = LZMA_VLI_UNKNOWN;
        block.filters = filters;
        block.version = 1;
        block.check = stream_flags.check;
        // The first byte of the block encodes the header size.
        block.header_size = lzma_block_header_size_decode(buffer[0]);

        if ( lzma_block_header_decode(&block, nullptr, buffer.get()) != LZMA_OK )
        {
            return -1;
        }

        cache[i].reset(new uint8_t[b.u_size]);
        size_t in_pos = block.header_size;
        size_t out_pos = 0;

        if ( lzma_block_buffer_decode(&block, nullptr, buffer.get(), &in_pos, b.c_size, cache[i].get(), &out_pos, b.u_size) != LZMA_OK)
        {
            // Do not leave a partially decoded block in the cache.
            cache[i] = nullptr;
            return -1;
        }

        return 1;
    }
};


// Class for writing xz files from an output stream.
// We want to pass small chunks of data at a time
// (for example, results calculated for a given gene or
// genomic region) to be written to file via writeBytes,
// and take care of blocking, indexing, and other
// details internally.

// Uncompressed data is kept in buffer. Once buffer reaches
// capacity based on the specified bytes_per_block, we
// compress the data in buffer and write a new xz block
// to the file. Then, we clear buffer and continue.

// At the end, we want to write any remaining data as the final
// xz block, and then write the xz block index to file.

/*
class xzWriter
{
    protected:
        // Buffer storing uncompressed data.
        void* buffer;
        // How many bytes have been copied into the buffer.
        size_t buffer_n;

        // How many (uncompressed) bytes will be stored per block.
        size_t bytes_per_block;

        // Output file descriptor.
        int xz_out_file;

        int writeStreamHeader();
        int writeBlock();
        int writeIndex();
        int writeStreamFooter();
    public:
        xzWriter() {};
        xzWriter( const std::string& fp, const int& nb) { open(fp, nb); };
        int open(const std::string& filepath, const int& nb)
        {
            bytes_per_block = nb;
            buffer_n = 0;
            buffer = malloc(bytes_per_block);
            // Create the output file if needed; overwrite any old contents.
            xz_out_file = ::open(filepath.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
            return writeStreamHeader();
        }
        int writeBytes(void* data, size_t input_size){
        //    // Copy data to buffer.
        //    size_t bytes_written = 0;
        //    while( bytes_written < input_size ){
        //        size_t remaining = input_size - bytes_written;
        //        if( remaining + buffer_n >= bytes_per_block ){
        //            // A. Copy first (bytes_per_block - buffer_n)
        //            // bytes to buffer. Update the counts.
        //            bytes_written += bytes_per_block - buffer_n;
        //            buffer_n = bytes_per_block;
        //
        //            // B. Compress buffer and write block to file.
        //            writeBlock();
        //            // C. Clear buffer. Update buffer_n.
        //            buffer_n = 0;
        //        }else{
        //            // Copy remaining data to buffer. Update the counts.
        //            bytes_written += remaining;
        //            buffer_n += remaining;
        //        }
        //    }
            return 0;
        }
        int close(){
            // Write any remaining data to final block.
            writeBlock();
            // Free the buffer.
            free(buffer);
            // Write the XZ index and footer.
            writeIndex();
            writeStreamFooter();
            // Close the file.
            if( ::close(xz_out_file) < 0 ){
                std::cerr << "ERROR: Failed to close xz file.\n";
            }
            return 0;
        }
};
*/

--------------------------------------------------------------------------------