├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── batch ├── README.md ├── build_annotator │ ├── Dockerfile.dbNSFP │ ├── Dockerfile.vep │ ├── README.md │ ├── build_databases.sh │ ├── dbNSFP_container.yaml │ ├── download_dbNSFP.sh │ └── vep_container.yaml ├── run_annotator │ ├── README.md │ ├── run_vep_remote.sh │ ├── vep_into_bigquery_for_docker.sh │ └── vep_schema.json └── vep │ ├── Dockerfile │ ├── README.md │ ├── build_vep_cache.sh │ ├── run_script_with_watchdog.sh │ ├── run_vep.sh │ └── sample_pipeline.yaml ├── curation ├── README.md ├── allPossibleSNPs │ ├── ESP_AA.sql │ ├── ESP_EA.sql │ ├── README.md │ ├── all_possible_snps.sql │ ├── check_joined_annotations.sql │ ├── clinvar.sql │ ├── dbSNP.sql │ ├── fasta_to_kv.py │ ├── join_annotations.sql │ ├── render_templated_sql.py │ └── thousandGenomes.sql └── tables │ ├── AddBigQueryDescriptions.md │ ├── Dockerfile │ ├── README.md │ ├── import_vcf_to_bigquery.py │ ├── launch_import_vcf_to_bigquery.sh │ ├── schema_update_utils.py │ ├── update_variants_schema.py │ ├── vcf_manifest.tsv │ └── vcf_to_bigquery_utils.py └── interactive ├── InteractiveVariantAnnotation.ipynb └── README.md /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | Want to contribute? Great! First, read this page (including the small print at the end). 2 | 3 | ### Before you contribute 4 | Before we can use your code, you must sign 5 | the 6 | [Google Individual Contributor License Agreement](https://cla.developers.google.com/about/google-individual) (CLA), 7 | which you can do online. The CLA is necessary mainly because you own the 8 | copyright to your changes, even after your contribution becomes part of our 9 | codebase, so we need your permission to use and distribute your code. We also 10 | need to be sure of various other things—for instance that you'll tell us if you 11 | know that your code infringes on other people's patents. You don't have to sign 12 | the CLA until after you've submitted your code for review and a member has 13 | approved it, but you must do it before we can put your code into our codebase. 14 | Before you start working on a larger contribution, you should get in touch with 15 | us first through the issue tracker with your idea so that we can help out and 16 | possibly guide you. Coordinating up front makes it much easier to avoid 17 | frustration later on. 18 | 19 | ### Code reviews 20 | All submissions, including submissions by project members, require review. We 21 | use GitHub pull requests for this purpose. 22 | 23 | ### The small print 24 | Contributions made by corporations are covered by a different agreement than the 25 | one above, 26 | the 27 | [Software Grant and Corporate Contributor License Agreement](https://cla.developers.google.com/about/google-corporate). 28 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 
15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. 
Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 
135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 
194 | You may obtain a copy of the License at 
195 | 
196 | http://www.apache.org/licenses/LICENSE-2.0 
197 | 
198 | Unless required by applicable law or agreed to in writing, software 
199 | distributed under the License is distributed on an "AS IS" BASIS, 
200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
201 | See the License for the specific language governing permissions and 
202 | limitations under the License. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 
1 | ### Disclaimer 
2 | 
3 | This is a forked version of 
4 | [verilylifesciences/variant-annotation]( 
5 | https://github.com/verilylifesciences/variant-annotation). The intention is to 
6 | gradually update various pieces of this repo for variant annotation needs we 
7 | have in [googlegenomics/gcp-variant-transforms]( 
8 | https://github.com/googlegenomics/gcp-variant-transforms) but at the same time 
9 | provide annotation-related tools/documentation that: 
10 | - can be used independently, 
11 | - are actively maintained with proper test harnesses. 
12 | 
13 | Any README file that has the following warning message indicates a sub-directory 
14 | that has not been updated yet: 
15 | 
16 | **WARNING: Not actively maintained!** 
17 | 
18 | This is the list of modules that *are* actively maintained: 
19 | * `batch/vep` 
20 | 
21 | 
22 | variant-annotation 
23 | ================== 
24 | 
25 | This repository contains code to annotate human sequence variants using 
26 | cloud technology to perform analyses in parallel. 
27 | 
28 | Sub-projects: 
29 | 
30 | * [batch annotation](./batch) code for annotating a particular batch of variants 
31 | using annotation resources available at a particular point in time 
32 | * [interactive annotation](./interactive) queries and code to annotate variants 
33 | interactively with new annotation resources as they become available 
34 | * [annotation curation](./curation) code for ingesting and reformatting raw 
35 | annotation resources for use in interactive annotation 
36 | 
37 | The code in this repository is designed for use with genomic variants stored 
38 | in [Google BigQuery](https://cloud.google.com/bigquery/) in a 
39 | particular 
40 | [variant table format](https://cloud.google.com/genomics/v1/bigquery-variants-schema). 
41 | 
42 | Processing 
43 | uses 
44 | [Google Container Builder](https://cloud.google.com/container-builder/), 
45 | [Docker](https://www.docker.com/), 
46 | and [dsub](https://cloud.google.com/genomics/v1alpha2/dsub) for batch 
47 | processing. We suggest working through the introductory materials for each tool 
48 | before working with the code in this repository. 
49 | 
50 | For interactive annotation, parallelism is achieved through the use of 
51 | BigQuery. For batch annotation, parallelism is achieved through the use of 
52 | dsub to run annotation in parallel on small shards of the input file(s). 
53 | -------------------------------------------------------------------------------- /batch/README.md: -------------------------------------------------------------------------------- 
1 | **WARNING: Not actively maintained!** 
2 | 
3 | Batch Variant Annotation 
4 | ======================== 
5 | 
6 | Given a set of variants, the code here will allow you to annotate a batch of 
7 | variants using annotation resources available at a particular point in time.
8 | (In comparison, using [interactive annotation](../interactive), variants can be 
9 | annotated on the fly with new annotation resources as they become available.) 
10 | 
11 | This code uses 
12 | Ensembl's 
13 | [Variant Effect Predictor]( 
14 | http://www.ensembl.org/info/docs/tools/vep/index.html) (VEP) 
15 | from McLaren et al. 2016 
16 | ([doi:10.1186/s13059-016-0974-4]( 
17 | https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0974-4)) 
18 | to annotate variants either in VCF files or in a BigQuery table. 
19 | 
20 | To annotate VCF files, check the [documentation](./vep/README.md) in the `vep` 
21 | directory on how to build docker images containing VEP, how to create the VEP cache 
22 | for those images, and how to run VEP. 
23 | 
24 | **WARNING: The build_annotator and run_annotator pieces below are not actively 
25 | maintained in this repo yet!** 
26 | 
27 | Annotating variants in BigQuery tables is horizontally scalable through the use 
28 | of [dsub](https://cloud.google.com/genomics/v1alpha2/dsub). A separate instance 
29 | of VEP is run by dsub for each shard of each of the files passed on the command 
30 | line. VEP is also configured to run with as many threads as the number of cores 
31 | on the virtual machine instantiated by dsub. 
32 | 
33 | ## Status of this sub-project 
34 | 
35 | VEP can be configured in many ways and can use as input a large variety of 
36 | annotation sources. This code illustrates one possible configuration and could 
37 | be modified to accommodate other configurations. 
38 | 
39 | All steps are run in the cloud, but each individual step is launched manually. 
40 | 
41 | ## Overview 
42 | 
43 | ### Build the annotator 
44 | 
45 | The first step involves building the Docker container holding VEP and cached 
46 | annotations for the desired build of the human genome reference. 
47 | 
48 | A second container is built to curate and 
49 | cache [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) in Cloud Storage. 
50 | This is done because dbNSFP is quite a large annotation resource and therefore 
51 | we choose not to add it to the same Docker container that includes VEP. 
52 | 
53 | [Follow the tutorial](./build_annotator/README.md) to build the tools needed to 
54 | annotate GRCh37 or GRCh38 of the human genome reference. 
55 | 
56 | ### Run the annotator 
57 | 
58 | After the annotator has been built for the desired build of the human reference 
59 | genome, it can be used to annotate variants from a single genome or a cohort of 
60 | genomes in a BigQuery 
61 | [variant table](https://cloud.google.com/genomics/v1/bigquery-variants-schema). 
62 | 
63 | [Follow the tutorial](./run_annotator/README.md) to 
64 | annotate 
65 | [Platinum Genomes variants called by DeepVariant](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/platinum_genomes_deepvariant.html) and 
66 | aligned to build GRCh38. 
67 | -------------------------------------------------------------------------------- /batch/build_annotator/Dockerfile.dbNSFP: -------------------------------------------------------------------------------- 
1 | # Start from this container so that gcloud and all its dependencies 
2 | # are already available. 
3 | FROM gcr.io/cloud-builders/gcloud 
4 | 
5 | RUN apt-get -y update && apt-get install -y \ 
6 | apt-transport-https \ 
7 | build-essential \ 
8 | ca-certificates \ 
9 | curl \ 
10 | tabix \ 
11 | unzip \ 
12 | wget \ 
13 | zlib1g-dev 
14 | 
15 | # Install newer version of bgzip so that the --threads option is available.
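# (bgzip's --threads flag and the tabix indexer installed here are what
# build_databases.sh later uses to block-compress and index the concatenated
# dbNSFP tables.)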
16 | RUN cd /opt && \ 17 | export VER=1.3.1 && \ 18 | export NAMEVER=htslib-${VER} && \ 19 | wget https://github.com/samtools/htslib/releases/download/${VER}/${NAMEVER}.tar.bz2 && \ 20 | tar xjf ${NAMEVER}.tar.bz2 && \ 21 | cd ${NAMEVER} && \ 22 | make -j$(nproc) && \ 23 | make install && \ 24 | ldconfig && \ 25 | rm -rf /opt/${NAMEVER}.tar.bz2 /opt/${NAMEVER} 26 | 27 | COPY download_dbNSFP.sh /opt/download_dbNSFP.sh 28 | COPY build_databases.sh /opt/build_databases.sh 29 | 30 | ENTRYPOINT ["bash"] 31 | -------------------------------------------------------------------------------- /batch/build_annotator/Dockerfile.vep: -------------------------------------------------------------------------------- 1 | # Example: 2 | # 3 | # docker build \ 4 | # --build-arg ENSEMBL_RELEASE=88 \ 5 | # --build-arg GENOME_ASSEMBLY=GRCh38 \ 6 | # --build-arg DBNSFP_BASE=gs://my-bucket/dbNSFP \ 7 | # --tag vep_docker_test \ 8 | # . 9 | # 10 | # This requires the dbNSFP databases to be loaded to GCS via the 11 | # build_databases.sh script, for example: 12 | # 13 | # ./build_databases.sh gs://my-bucket/dbNSFP_GRCh38/ dbNSFPv3.4c.zip 14 | # 15 | # TODO: Determine an alternate strategy for the dbNSFP 16 | # dependency. This currently uses a hardcoded environment variable to 17 | # hold the cloud storage path to dbNSFP because it is too large to 18 | # include within the Docker image. The script that invokes VEP will 19 | # currently fail if dbNSFP is not available at the path indicated by 20 | # the environment variable. 21 | 22 | # Start from this container so that gcloud and all its dependencies 23 | # are already available. 24 | FROM gcr.io/cloud-builders/gcloud 25 | 26 | ARG ENSEMBL_RELEASE=88 27 | ARG GENOME_ASSEMBLY=GRCh38 28 | # No default value for this argument. See build_databases.sh for code that 29 | # loads these tables. 30 | ARG DBNSFP_BASE 31 | 32 | # Make this build argument available in the container as an environment 33 | # variable. 34 | ENV DBNSFP_BASE="${DBNSFP_BASE}" 35 | ENV GENOME_ASSEMBLY="${GENOME_ASSEMBLY}" 36 | ENV VEP_SPECIES="homo_sapiens" 37 | ENV VEP_BASE=/opt/variant_effect_predictor 38 | 39 | RUN apt-get -y update && apt-get install -y \ 40 | build-essential \ 41 | curl \ 42 | gawk \ 43 | git \ 44 | libarchive-zip-perl \ 45 | libdbd-mysql-perl \ 46 | libdbi-perl \ 47 | libfile-copy-recursive-perl \ 48 | libhts0 \ 49 | libjson-perl \ 50 | libmodule-build-perl \ 51 | tabix \ 52 | unzip \ 53 | wget \ 54 | zlib1g-dev 55 | 56 | # Install VEP per the instructions on 57 | # http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#installer 58 | RUN git clone https://github.com/Ensembl/ensembl-vep.git ${VEP_BASE} 59 | 60 | WORKDIR ${VEP_BASE} 61 | 62 | RUN git checkout release/${ENSEMBL_RELEASE} 63 | 64 | # Download the cache database separately. Downloading via vep installation 65 | # option -c results in timeout errors. 66 | RUN mkdir -p $HOME/.vep && \ 67 | cd $HOME/.vep && \ 68 | curl -O "ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/variation/VEP/${VEP_SPECIES}_vep_${ENSEMBL_RELEASE}_${GENOME_ASSEMBLY}.tar.gz" && \ 69 | tar xzf "${VEP_SPECIES}_vep_${ENSEMBL_RELEASE}_${GENOME_ASSEMBLY}.tar.gz" 70 | 71 | RUN perl INSTALL.pl \ 72 | --AUTO afl \ 73 | --SPECIES "${VEP_SPECIES}" \ 74 | --ASSEMBLY "${GENOME_ASSEMBLY}" 75 | 76 | # Configure Condel plugin. 
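# Condel combines the SIFT and PolyPhen-2 predictions into a single consensus
# deleteriousness score; the sed call below points its config file at the
# local plugin directory.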
77 | RUN curl -Lk \ 78 | "https://github.com/Ensembl/VEP_plugins/archive/release/${ENSEMBL_RELEASE}.tar.gz" | \ 79 | tar xz --strip-components=2 \ 80 | "VEP_plugins-release-${ENSEMBL_RELEASE}/config/Condel" && \ 81 | sed -i "s#path/to/config#${VEP_BASE}#" \ 82 | Condel/config/condel_SP.conf 83 | 84 | # Install Condel plugin. 85 | RUN perl INSTALL.pl --AUTO p --PLUGINS Condel 86 | 87 | # Install dbNSFP plugin. 88 | RUN perl INSTALL.pl --AUTO p --PLUGINS dbNSFP 89 | 90 | ENTRYPOINT ["bash"] 91 | -------------------------------------------------------------------------------- /batch/build_annotator/README.md: -------------------------------------------------------------------------------- 1 | **WARNING: Not actively maintained!** 2 | 3 | Build the annotator 4 | =================== 5 | 6 | This tutorial builds the tools needed to annotate GRCh37 or GRCh38 of the human 7 | genome reference. 8 | 9 | [Container Builder](https://cloud.google.com/container-builder/docs/overview) 10 | and [dsub](https://github.com/googlegenomics/dsub) are used to run all of these 11 | steps in the cloud. 12 | 13 | ## (1) Configure project variables. 14 | 15 | Set a few environment variables to facilitate cutting and pasting the subsequent 16 | commands. 17 | 18 | ``` bash 19 | # The Google Cloud Platform project id in which the Docker containers 20 | # will be built and stored. 21 | PROJECT_ID=your-project-id 22 | # The bucket name (with the gs:// prefix) where the cached version of 23 | # dbNSFP should be stored. 24 | BUCKET=gs://your-bucket-name 25 | ``` 26 | 27 | ## (2) Build the VEP Docker container. 28 | 29 | Run one of the commands below to build a Docker container that is configured to 30 | run VEP on human genetic variants in GRCh37 or GRCh38 coordinates with 31 | annotations including dbNSFP, SIFT, and many others. Both of these commands can 32 | be run in parallel if you wish to annotate using both reference genomes. 33 | 34 | ### GRCh37 35 | 36 | ``` bash 37 | gcloud container builds submit \ 38 | --substitutions=_GENOME_ASSEMBLY=GRCh37,_ENSEMBL_RELEASE=89,_DBNSFP_BASE=${BUCKET}/dbNSFPv2.9.3/dbNSFP,_CONTAINER_SUFFIX=_89_grch37:latest \ 39 | --config=vep_container.yaml \ 40 | . 41 | ``` 42 | 43 | ### GRCh38 44 | 45 | ``` bash 46 | gcloud container builds submit \ 47 | --substitutions=_GENOME_ASSEMBLY=GRCh38,_ENSEMBL_RELEASE=89,_DBNSFP_BASE=${BUCKET}/dbNSFPv3.4c/dbNSFP,_CONTAINER_SUFFIX=_89_grch38:latest \ 48 | --config=vep_container.yaml \ 49 | . 50 | ``` 51 | ## (3) Cache dbNSFP. 52 | 53 | First run the command below to create the Docker container with tools and 54 | scripts needed to process dbNSFP. This Docker container can be used for any 55 | version of dbNSFP. 56 | 57 | ``` bash 58 | gcloud --project ${PROJECT_ID} container builds submit \ 59 | --substitutions=_CONTAINER_TAG=:latest \ 60 | --config=dbNSFP_container.yaml \ 61 | . 62 | ``` 63 | 64 | Then run the container via dsub to download and cache dbNSFP annotations in 65 | Cloud Storage. These commands can be run in parallel if you wish to annotate 66 | using both reference genomes. 67 | 68 | * Note that it can take several hours for the job to complete. 69 | * The values for `FILEID` in the commands came 70 | from [dbNSFP documentation](https://sites.google.com/site/jpopgen/dbNSFP) 71 | where you can also get detail on other available versions of dbNSFP. 
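After either of the dsub jobs below completes, a quick sanity check is to list
the cached files in Cloud Storage (a minimal sketch; it assumes the
`OUTPUT_PATH` values used in the commands below):

``` bash
# Expect dbNSFP.gz, dbNSFP.gz.tbi, and the dbNSFP readme file
# (adjust the path for the dbNSFP version you built).
gsutil ls -l ${BUCKET}/dbNSFPv3.4c/
```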
72 | 73 | ### GRCh37 74 | 75 | ``` bash 76 | dsub \ 77 | --project ${PROJECT_ID} \ 78 | --image gcr.io/${PROJECT_ID}/dbnsfp_cache_builder:latest \ 79 | --zones "us-central1-*" \ 80 | --disk-size 200 \ 81 | --min-cores 8 \ 82 | --logging ${BUCKET}/dbNSFPv3.4c/dbNSFPv3.4c.log \ 83 | --env FILEID=0B60wROKy6OqcaWJ4Y0xvR2k1aUU \ 84 | --output-recursive OUTPUT_PATH=${BUCKET}/dbNSFPv3.4c/ \ 85 | --command '/opt/download_dbNSFP.sh && 86 | /opt/build_databases.sh' 87 | ``` 88 | 89 | ### GRCh38 90 | 91 | ``` bash 92 | dsub \ 93 | --project ${PROJECT_ID} \ 94 | --image gcr.io/${PROJECT_ID}/dbnsfp_cache_builder:latest \ 95 | --zones "us-central1-*" \ 96 | --disk-size 200 \ 97 | --min-cores 8 \ 98 | --logging ${BUCKET}/dbNSFPv2.9.3/dbNSFPv2.9.3.log \ 99 | --env FILEID=0B60wROKy6OqceTNZRkZnaERWREk \ 100 | --output-recursive OUTPUT_PATH=${BUCKET}/dbNSFPv2.9.3/ \ 101 | --command '/opt/download_dbNSFP.sh && 102 | /opt/build_databases.sh' 103 | ``` 104 | -------------------------------------------------------------------------------- /batch/build_annotator/build_databases.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | 18 | # This script will preprocess the dbNSFP database according to the 19 | # instructions for the VEP plugin, given here: 20 | # https://github.com/Ensembl/VEP_plugins/blob/release/87/dbNSFP.pm. 21 | # 22 | # Args: 23 | # $1: Output path. 24 | # $2: The dbNSFP input filename. 25 | 26 | set -o nounset 27 | set -o errexit 28 | 29 | readonly DEFAULT_OUTPUT_PATH=${OUTPUT_PATH:-} 30 | readonly DBNSFP_BASE=${1:-$DEFAULT_OUTPUT_PATH} 31 | readonly DBNSFP_ZIP_FILE=${2:-dbNSFP.zip} 32 | 33 | # Process a list of dbNSFP database files by sorting and indexing them with 34 | # tabix. We sort each chromosome separately, leaving the comment line at the 35 | # top. Additionally, add "chr" to the start of the chromosome names (column 1). 36 | # Concatenate the resulting files and run tabix to index the final file. 37 | # 38 | # Args: 39 | # $1: The combined output file name. 40 | # ...: Input files, typically one per chromosome. 41 | function process_tables() { 42 | local -r gzip_file="$1" 43 | shift 1 44 | local file 45 | for file in $(printf '%s\n' "$@" | sort); do 46 | # Write the comment lines for this file. 47 | awk '/^#/' "${file}" 48 | # Write the position-sorted non-comment lines for this file. 49 | awk '!/^#/{print "chr"$0}' "${file}" | \ 50 | sort --key=2,2n --stable --parallel=8 51 | done | \ 52 | bgzip --threads 8 -c > \ 53 | "${gzip_file}" 54 | tabix -s 1 -b 2 -e 2 "${gzip_file}" 55 | # TODO: Determine if "chr" needs to be prepended for GRCh37. 56 | } 57 | 58 | main() { 59 | local -r gzip_file="dbNSFP.gz" 60 | local -r readme_file=("dbNSFP"*"readme.txt") 61 | 62 | unzip "${DBNSFP_ZIP_FILE}" 63 | 64 | process_tables "${gzip_file}" "dbNSFP"*"chr"* 65 | 66 | if [[ ! 
-z "${DBNSFP_BASE}" ]] ; then 67 | # Move the processed files to the output directory. 68 | mkdir -p "${DBNSFP_BASE}" 69 | mv "${gzip_file}" "${DBNSFP_BASE}" 70 | mv "${gzip_file}.tbi" "${DBNSFP_BASE}" 71 | if [[ -f "${readme_file[0]}" ]] ; then 72 | mv "${readme_file[0]}" "${DBNSFP_BASE}" 73 | else 74 | # TODO: Understand why the readme file is sometimes not found. 75 | ls 76 | ls "${readme_file[0]}" 77 | fi 78 | fi 79 | } 80 | 81 | main "$@" 82 | -------------------------------------------------------------------------------- /batch/build_annotator/dbNSFP_container.yaml: -------------------------------------------------------------------------------- 1 | steps: 2 | - name: 'gcr.io/cloud-builders/docker' 3 | # Build a Docker container holding the tools and scripts needed 4 | # to create a cache of dbNSFP for use with VEP. 5 | args: ['build', '-f', 'Dockerfile.dbNSFP', 6 | '-t', 'gcr.io/$PROJECT_ID/dbnsfp_cache_builder${_CONTAINER_SUFFIX}', '.'] 7 | 8 | # Push the container to the Google Container Registry. 9 | images: 10 | - 'gcr.io/$PROJECT_ID/dbnsfp_cache_builder${_CONTAINER_SUFFIX}' 11 | -------------------------------------------------------------------------------- /batch/build_annotator/download_dbNSFP.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | 18 | # Download dbNSFP from Google Drive per 19 | # https://sites.google.com/site/jpopgen/dbNSFP 20 | # 21 | # Args: 22 | # $1: The GoogleDrive file id. 23 | # $2: The destination filename. 24 | 25 | set -o nounset 26 | set -o errexit 27 | 28 | # Default value is for dbNSFPv3.4c.zip 29 | readonly DEFAULT_FILEID=${FILEID:-0B60wROKy6OqcaWJ4Y0xvR2k1aUU} 30 | readonly DRIVE_FILEID=${1:-$DEFAULT_FILEID} 31 | readonly DESTINATION=${2:-dbNSFP.zip} 32 | 33 | # The following code will download a large world-readable file from Google 34 | # Drive. Implementation adapted from http://stackoverflow.com/a/43478623 35 | curl -c /tmp/cookie -L -o /tmp/probe.bin \ 36 | "https://drive.google.com/uc?export=download&id=${DRIVE_FILEID}" 37 | confirm=$(tr ';' '\n' ", "<*>") 53 | -- Remove this line to include the entire genome. 54 | AND reference_name IN ('chr22', '22') 55 | ) 56 | SELECT 57 | -- http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#input 58 | CONCAT( 59 | chrom, '\t', 60 | CAST(pos AS STRING), '\t', 61 | CAST(`end` AS STRING), '\t', 62 | ref, '/', alt, '\t', 63 | '+') AS vep_input 64 | FROM 65 | variants 66 | ``` 67 | 68 | ## (2) Extract the VEP input to Cloud Storage. 69 | 70 | [Export](https://cloud.google.com/bigquery/docs/exporting-data) the contents of 71 | the table created in the prior step as a gzipped CSV file to Cloud Storage (for 72 | example to path `gs://your-bucket-name/platinum-genomes-grch38-chr22.csv.gz`). 73 | 74 | ## (3) Run VEP on the variants. 
75 | 
76 | Run [run_vep_remote.sh](./run_vep_remote.sh) to 
77 | launch [dsub](https://github.com/googlegenomics/dsub) jobs that will use the VEP 
78 | Docker container to run VEP on the variants, writing the resulting annotations 
79 | as a BigQuery table. 
80 | 
81 | * Run `./run_vep_remote.sh --help` for additional documentation on its command 
82 | line parameters. 
83 | 
84 | ``` bash 
85 | # The Google Cloud Platform project id in which the docker containers 
86 | # are stored. 
87 | PROJECT_ID=your-project-id 
88 | # The bucket name (with the gs:// prefix) to hold temp files and logs. 
89 | BUCKET=gs://your-bucket-name 
90 | # The full path to the VEP input file. 
91 | INPUT_FILE=${BUCKET}/platinum-genomes-grch38-chr22.csv.gz 
92 | 
93 | # Kick off annotation. 
94 | ./run_vep_remote.sh \ 
95 | --project_id ${PROJECT_ID} \ 
96 | --bucket ${BUCKET}/temp \ 
97 | --docker_image gcr.io/${PROJECT_ID}/vep_grch38 \ 
98 | --table_name platinum_genomes_grch38_chr22_annotations \ 
99 | --shards_per_file 10 \ 
100 | ${INPUT_FILE} 
101 | ``` 
102 | 
103 | ## (4) Check the annotations. 
104 | 
105 | Use the following query to do some basic checks on the annotations. The query 
106 | below assumes the annotations are in table 
107 | `vep.platinum_genomes_grch38_chr22_annotations` in your project. 
108 | 
109 | ``` sql 
110 | #standardSQL 
111 | -- 
112 | -- Count the number of variants per chromosome for both the variants and 
113 | -- VEP output table. This basic QC metric ensures that every chromosome 
114 | -- we expected completed successfully. 
115 | -- 
116 | SELECT chrom, variants_count, vep_count 
117 | FROM ( 
118 | SELECT LTRIM(reference_name, "chr") AS chrom, COUNT(1) AS variants_count 
119 | FROM 
120 | `genomics-public-data.platinum_genomes_deepvariant.single_sample_genome_calls`, 
121 | UNNEST(alternate_bases) AS alt -- flatten multi-allelic sites 
122 | WHERE 
123 | -- Include only sites of variation (exclude non-variant segments). 
124 | alt IS NOT NULL AND alt NOT IN ("", "<*>") 
125 | GROUP BY reference_name) 
126 | FULL JOIN ( 
127 | SELECT LTRIM(seq_region_name, "chr") AS chrom, COUNT(1) AS vep_count 
128 | FROM `vep.platinum_genomes_grch38_chr22_annotations` 
129 | GROUP BY seq_region_name) 
130 | USING (chrom) 
131 | WHERE vep_count IS NOT NULL 
132 | ORDER BY chrom 
133 | ``` 
134 | 
135 | We expect a result of 210,279 variants and VEP annotations for chromosome 22. 
136 | 
137 | ## (5) Optional: Reshape the annotations table for easier JOINs. 
138 | 
139 | The annotations table created by VEP is a little different from 
140 | the 
141 | [variant tables](https://cloud.google.com/genomics/v1/bigquery-variants-schema): 
142 | 
143 | * VEP uses 1-based coordinates whereas 
144 | the 
145 | [variants tables](https://cloud.google.com/genomics/v1/bigquery-variants-schema) use 
146 | 0-based coordinates per [GA4GH](http://ga4gh.org/). 
147 | * VEP rewrites some of the fields. For example field `start` 
148 | excludes the first base for a deletion. 
149 | 
150 | Run a query like the following to reshape the annotations data and materialize 
151 | the result to a new table. 
152 | 
153 | ``` sql 
154 | #standardSQL 
155 | -- 
156 | -- Add additional columns to the VEP table to facilitate easier JOINs 
157 | -- with variant tables.
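-- (The `- 1` on the start coordinate below converts VEP's 1-based start to
-- the 0-based start used by the variants tables.)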
158 | -- 159 | SELECT 160 | SPLIT(input, "\t")[OFFSET(0)] AS reference_name, 161 | CAST(SPLIT(input, "\t")[OFFSET(1)] AS INT64) - 1 AS start_0_based_coords, 162 | SPLIT(SPLIT(input, "\t")[OFFSET(3)], '/')[OFFSET(0)] AS reference_bases, 163 | SPLIT(SPLIT(input, "\t")[OFFSET(3)], '/')[OFFSET(1)] AS alternate_bases, 164 | * 165 | FROM 166 | `vep.platinum_genomes_grch38_chr22_annotations` 167 | ``` 168 | -------------------------------------------------------------------------------- /batch/run_annotator/run_vep_remote.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | FLAGS_HELP=" 18 | Runs VEP on one or more files using the pipelines api (via dsub.py) and 19 | writes the output to BigQuery. Positional parameters must be files stored 20 | on GCS (may be gzipped). Files must be in ensembl format or VCF format. 21 | 22 | USAGE: ./run_vep_remote.sh [flags] args 23 | " 24 | 25 | if [[ ! -e ./shflags ]]; then 26 | echo This script assumes https://github.com/kward/shflags is located \ 27 | in the current working directory. To obtain the file: \ 28 | curl -O https://raw.githubusercontent.com/kward/shflags/master/src/shflags 29 | exit 1 30 | fi 31 | 32 | source ./shflags 33 | 34 | DEFINE_string project_id "" \ 35 | "The Cloud Platform project id to use." 36 | 37 | DEFINE_string bucket "" \ 38 | "Bucket to use for temporary files and logging." 39 | 40 | DEFINE_string dataset "vep" \ 41 | "BigQuery destination dataset name. The dataset will be created, if needed." 42 | 43 | DEFINE_string table_name "testing" \ 44 | "BigQuery destination table name. This table will be created." 45 | 46 | DEFINE_string vep_schema_file "./vep_schema.json" \ 47 | "BigQuery schema for VEP annotations." 48 | 49 | # The reference genome versions must match between the image and the database. 50 | DEFINE_string docker_image "" \ 51 | "VEP docker image corresponding to the reference genome of the input." 52 | 53 | DEFINE_integer shards_per_file 1 \ 54 | "The number of concurrent dsub jobs to run, each working on a separate shard of the input file(s)." 55 | 56 | DEFINE_string zones "us-*" \ 57 | "Compute engine zones in which to run dsub." 58 | 59 | DEFINE_integer disk_size "200" \ 60 | "Size of dsub data disk." 61 | 62 | DEFINE_integer boot_disk_size "30" \ 63 | "Size of dsub boot disk." 64 | 65 | DEFINE_integer min_gb_ram 8 \ 66 | "Minimum amount of RAM for dsub." 67 | 68 | DEFINE_string docker_script "./vep_into_bigquery_for_docker.sh" \ 69 | "Script that will be run by dsub." 70 | 71 | function main() { 72 | if [[ -z "${FLAGS_project_id}" ]] ; then 73 | echo "--project_id is required." 74 | exit 1 75 | fi 76 | 77 | if [[ -z "${FLAGS_bucket}" ]] ; then 78 | echo "--bucket is required." 79 | exit 1 80 | fi 81 | 82 | if [[ -z "${FLAGS_docker_image}" ]] ; then 83 | echo "--docker_image is required." 
84 | exit 1 85 | fi 86 | 87 | local -r description="VEP pipeline on $* using ${FLAGS_docker_image}" 88 | 89 | gsutil \ 90 | cp \ 91 | "${FLAGS_vep_schema_file}" \ 92 | "${FLAGS_bucket}/schema.json" 93 | 94 | bq \ 95 | --project_id "${FLAGS_project_id}" \ 96 | mk -f "${FLAGS_dataset}" 97 | 98 | # Note: this will cause the script to fail if the table already exists. 99 | bq \ 100 | --project_id "${FLAGS_project_id}" \ 101 | mk --table \ 102 | "${FLAGS_dataset}.${FLAGS_table_name}" 103 | 104 | bq \ 105 | --project_id "${FLAGS_project_id}" \ 106 | update \ 107 | --table \ 108 | --description "${description}" \ 109 | "${FLAGS_dataset}.${FLAGS_table_name}" 110 | 111 | local -r temp_dir=$(mktemp -d) 112 | 113 | # Create TSV file to pass into dsub. 114 | # Will run VEP in parallel for each of the INPUT_FILE and put the result in 115 | # BQ_DATASET_NAME.BQ_TABLE_NAME. 116 | # SHARD_INDEX is 1..NUM_SHARDS 117 | ( 118 | # Pass input file flags using "=" since the later logic changes spaces to 119 | # tabs. dsub wants spaces, so we convert the "=" characters after 120 | # converting spaces. 121 | echo --input=SCHEMA_FILE \ 122 | BQ_DATASET_NAME \ 123 | BQ_TABLE_NAME \ 124 | --input=INPUT_FILE \ 125 | NUM_SHARDS \ 126 | SHARD_INDEX \ 127 | | tr '= ' ' \t' 128 | 129 | local file 130 | for file in "$@"; do 131 | local -i shard_index 132 | for shard_index in $(seq "${FLAGS_shards_per_file}"); do 133 | echo "${FLAGS_bucket}/schema.json" \ 134 | "${FLAGS_dataset}" \ 135 | "${FLAGS_table_name}" \ 136 | "${file}" \ 137 | "${FLAGS_shards_per_file}" \ 138 | "${shard_index}" 139 | done 140 | done | tr ' ' '\t' 141 | ) > "${temp_dir}/table.tsv" 142 | 143 | dsub \ 144 | --wait \ 145 | --project "${FLAGS_project_id}" \ 146 | --zones "${FLAGS_zones}" \ 147 | --logging "${FLAGS_bucket}/logging" \ 148 | --image "${FLAGS_docker_image}" \ 149 | --min-ram "${FLAGS_min_gb_ram}" \ 150 | --disk-size "${FLAGS_disk_size}" \ 151 | --boot-disk-size "${FLAGS_boot_disk_size}" \ 152 | --tasks "${temp_dir}/table.tsv" \ 153 | --script "${FLAGS_docker_script}" 154 | } 155 | 156 | 157 | # Parse the command-line. 158 | FLAGS "$@" || exit $? 159 | eval set -- "${FLAGS_ARGV}" 160 | 161 | set -o xtrace 162 | set -o nounset 163 | set -o errexit 164 | 165 | main "$@" 166 | -------------------------------------------------------------------------------- /batch/run_annotator/vep_into_bigquery_for_docker.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | # Runs VEP on an ensembl format file or a VCF file using the pipelines api 18 | # (via dsub.py) and writes the output to BigQuery. Positional parameters must 19 | # be files stored on GCS (may be gzipped). 20 | 21 | # This script is meant to be run within a docker container to run VEP on a 22 | # single shard of an ensembl format or VCF file. 
23 | # 24 | # Takes flags as the environment variables: 25 | # SCHEMA_FILE BQ_DATASET_NAME BQ_TABLE_NAME INPUT_FILE NUM_SHARDS SHARD_INDEX 26 | # 27 | # The following environment variables should be specified in the Docker image, 28 | # since they are properties of the downloaded databases which are specific to 29 | # the image (there is a one-to-one relationship): 30 | # GENOME_ASSEMBLY VEP_SPECIES DBNSFP_BASE 31 | # 32 | # We also allow for including as a module so that the functions (in 33 | # particular apply_shard_file) can be tested. 34 | 35 | # Applies shard to $1 in-place. 36 | # 37 | # Args: 38 | # file: file to be sharded (in-place) 39 | # shard_count: the number of shards 40 | # cur_shard: integer from 1...num_shards (inclusive) 41 | # 42 | # Comment lines (starting with #) are always included. Of the remaining lines, 43 | # the index chunk of ceil(num_non_comment_lines / num_shards) lines 44 | # is kept, with the final shard (cur_shard=shard_count) potentially having 45 | # fewer lines. 46 | # 47 | # We use shard_count and cur_shard here because using num_shards and shard_index 48 | # confuses the linter because they are too close to NUM_SHARDS and SHARD_INDEX, 49 | # which aren't actually defined here. 50 | function apply_shard_file() { 51 | local -r file=$1 52 | local -ri shard_count=$2 53 | local -ri cur_shard=$3 54 | 55 | if [[ "${shard_count}" -eq 1 ]]; then 56 | return 57 | fi 58 | 59 | local -r temp_sharded_file="${file}.sharded" 60 | 61 | # Pass through the file twice; once to get the number of non-comment lines 62 | # and then to use that to output the specific shard. 63 | gawk -vnum_shards="${shard_count}" -vshard_index="${cur_shard}" ' 64 | # Note: line_count and non_comment_line_index refer only to non-comment 65 | # lines. 66 | 67 | ARGIND==1 { 68 | if (!/^#/) 69 | line_count++ 70 | next 71 | } 72 | 73 | ARGIND==2 && FNR==1 { 74 | lines_per_shard = int((line_count + num_shards - 1) / num_shards) 75 | 76 | # If num_shards > line_count, then this could be greater than the number 77 | # of lines, which will lead to empty output files. 78 | first_line = (shard_index - 1) * lines_per_shard + 1 79 | 80 | # Note: this could be greater than line_count when num_shards > line_count 81 | # or for num_shards == shard_index. The later will lead to a file with 82 | # fewer than lines_per_shard non-comment lines. 83 | last_line = first_line + lines_per_shard - 1 84 | } 85 | 86 | !/^#/{line_index++} 87 | 88 | /^#/ || (line_index >= first_line && line_index <= last_line) { 89 | print 90 | }' "${file}"{,} > "${temp_sharded_file}" 91 | 92 | mv "${temp_sharded_file}" "${file}" 93 | } 94 | 95 | if [[ -z "${INPUT_FILE}" ]]; then 96 | echo 'Running script in bash library mode.' 97 | else 98 | set -o xtrace 99 | set -o nounset 100 | set -o errexit 101 | 102 | # Localize dbNSFP database files. We can't use dsub to do this for us because 103 | # the current version (specified by filename or bucket) is only known inside 104 | # the container. 105 | gsutil -q cp "${DBNSFP_BASE}.gz" "${TMPDIR}/dbNSFP.gz" 106 | gsutil -q cp "${DBNSFP_BASE}.gz.tbi" "${TMPDIR}/dbNSFP.gz.tbi" 107 | 108 | if [[ $INPUT_FILE == *.vcf.gz || $INPUT_FILE == *.vcf ]]; then 109 | # The cut operaton removes any genotype information from the input VCF files 110 | # (which, in the case of 1k genomes, takes up ~75% of the output JSON file). 
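# Columns 1-8 of a VCF are CHROM, POS, ID, REF, ALT, QUAL, FILTER, and INFO;
# everything from FORMAT onward (the per-sample genotype columns) is dropped.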
111 | gunzip -cf "${INPUT_FILE}" | cut -f1-8 > /mnt/data/input_file 112 | readonly FORMAT="vcf" 113 | else 114 | gunzip -cf "${INPUT_FILE}" > /mnt/data/input_file 115 | readonly FORMAT="ensembl" 116 | fi 117 | 118 | rm "${INPUT_FILE}" 119 | 120 | apply_shard_file /mnt/data/input_file "${NUM_SHARDS}" "${SHARD_INDEX}" 121 | 122 | readonly NUM_CORES=$(grep --count --word-regexp "^processor" /proc/cpuinfo) 123 | 124 | cd "${VEP_BASE}" 125 | 126 | # Depending on the version of dbNSFP used, not all the columns 127 | # listed below may be available. VEP will issue a warning about 128 | # those missing columns and run successfully. 129 | "${VEP_BASE}/vep" \ 130 | --cache \ 131 | --offline \ 132 | --no_stats \ 133 | --allele_number \ 134 | --force_overwrite \ 135 | --fork "${NUM_CORES}" \ 136 | --json \ 137 | --species "${VEP_SPECIES}" \ 138 | --assembly "${GENOME_ASSEMBLY}" \ 139 | --sift b \ 140 | --polyphen b \ 141 | --hgvs \ 142 | --plugin Condel,Condel/config,b \ 143 | --plugin "dbNSFP,${TMPDIR}/dbNSFP.gz,ExAC_Adj_AC,ExAC_Adj_AF,ExAC_nonTCGA_Adj_AC,ExAC_nonTCGA_Adj_AF,ExAC_nonpsych_Adj_AC,ExAC_nonpsych_Adj_AF,GenoCanyon_score,phyloP100way_vertebrate,phyloP20way_mammalian,phastCons100way_vertebrate,phastCons20way_mammalian,SiPhy_29way_logOdds,TWINSUK_AC,TWINSUK_AF,clinvar_rs,Ensembl_geneid,Ensembl_transcriptid,Ensembl_proteinid,LRT_score,ALSPAC_AC,ALSPAC_AF,ESP6500_AA_AC,ESP6500_AA_AF,ESP6500_EA_AC,ESP6500_EA_AF,clinvar_trait,GTEx_V6_gene,GTEx_V6_tissue" \ 144 | --format "${FORMAT}" \ 145 | -i /mnt/data/input_file \ 146 | -o /mnt/data/output.json 147 | 148 | if [[ -s /mnt/data/output.json ]]; then 149 | bq \ 150 | --quiet \ 151 | load \ 152 | --source_format NEWLINE_DELIMITED_JSON \ 153 | "${BQ_DATASET_NAME}.${BQ_TABLE_NAME}" \ 154 | /mnt/data/output.json \ 155 | "${SCHEMA_FILE}" 156 | else 157 | echo "VEP output file empty." >&2 158 | fi 159 | if [[ -s /mnt/data/output.json_warnings.txt ]]; then 160 | # Record any VEP export errors in stdout. These are typically complaints 161 | # about unmatched "random" or alternate haplotype contigs in the database. 
162 | echo "JSON warnings reported:" 163 | cat /mnt/data/output.json_warnings.txt 164 | fi 165 | 166 | fi 167 | 168 | -------------------------------------------------------------------------------- /batch/run_annotator/vep_schema.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "name": "input", 4 | "type": "string", 5 | "mode": "required", 6 | "description": "original vcf input line" 7 | }, 8 | { 9 | "name": "id", 10 | "type": "string", 11 | "mode": "required" 12 | }, 13 | { 14 | "name": "seq_region_name", 15 | "type": "string", 16 | "mode": "required" 17 | }, 18 | { 19 | "name": "start", 20 | "type": "integer", 21 | "mode": "required" 22 | }, 23 | { 24 | "name": "end", 25 | "type": "integer", 26 | "mode": "required" 27 | }, 28 | { 29 | "name": "strand", 30 | "type": "integer", 31 | "mode": "required" 32 | }, 33 | { 34 | "name": "assembly_name", 35 | "type": "string", 36 | "mode": "required" 37 | }, 38 | { 39 | "name": "allele_string", 40 | "type": "string", 41 | "mode": "nullable" 42 | }, 43 | { 44 | "name": "most_severe_consequence", 45 | "type": "string", 46 | "mode": "required" 47 | }, 48 | { 49 | "name": "variant_class", 50 | "type": "string", 51 | "mode": "nullable" 52 | }, 53 | { 54 | "name": "transcript_consequences", 55 | "type": "record", 56 | "mode": "repeated", 57 | "fields": [ 58 | { 59 | "name": "transcript_id", 60 | "type": "string", 61 | "mode": "required" 62 | }, 63 | { 64 | "name": "gene_id", 65 | "type": "string", 66 | "mode": "required" 67 | }, 68 | { 69 | "name": "impact", 70 | "type": "string", 71 | "mode": "nullable" 72 | }, 73 | { 74 | "name": "consequence_terms", 75 | "type": "string", 76 | "mode": "repeated" 77 | }, 78 | { 79 | "name": "variant_allele", 80 | "type": "string", 81 | "mode": "nullable" 82 | }, 83 | { 84 | "name": "allele_num", 85 | "type": "integer", 86 | "mode": "nullable" 87 | }, 88 | { 89 | "name": "strand", 90 | "type": "integer", 91 | "mode": "required" 92 | }, 93 | { 94 | "name": "codons", 95 | "type": "string", 96 | "mode": "nullable" 97 | }, 98 | { 99 | "name": "amino_acids", 100 | "type": "string", 101 | "mode": "nullable" 102 | }, 103 | { 104 | "name": "cds_start", 105 | "type": "integer", 106 | "mode": "nullable" 107 | }, 108 | { 109 | "name": "cds_end", 110 | "type": "integer", 111 | "mode": "nullable" 112 | }, 113 | { 114 | "name": "flags", 115 | "type": "string", 116 | "mode": "repeated" 117 | }, 118 | { 119 | "name": "cdna_start", 120 | "type": "integer", 121 | "mode": "nullable" 122 | }, 123 | { 124 | "name": "cdna_end", 125 | "type": "integer", 126 | "mode": "nullable" 127 | }, 128 | { 129 | "name": "protein_start", 130 | "type": "integer", 131 | "mode": "nullable" 132 | }, 133 | { 134 | "name": "protein_end", 135 | "type": "integer", 136 | "mode": "nullable" 137 | }, 138 | { 139 | "name": "distance", 140 | "type": "integer", 141 | "mode": "nullable" 142 | }, 143 | { 144 | "name": "bp_overlap", 145 | "type": "integer", 146 | "mode": "nullable" 147 | }, 148 | { 149 | "name": "percentage_overlap", 150 | "type": "float", 151 | "mode": "nullable" 152 | }, 153 | { 154 | "name": "hgvsc", 155 | "type": "string", 156 | "mode": "nullable" 157 | }, 158 | { 159 | "name": "hgvsp", 160 | "type": "string", 161 | "mode": "nullable" 162 | }, 163 | { 164 | "name": "hgvs_offset", 165 | "type": "integer", 166 | "mode": "nullable" 167 | }, 168 | { 169 | "name": "polyphen_prediction", 170 | "type": "string", 171 | "mode": "nullable" 172 | }, 173 | { 174 | "name": "polyphen_score", 175 | "type": 
"float", 176 | "mode": "nullable" 177 | }, 178 | { 179 | "name": "sift_prediction", 180 | "type": "string", 181 | "mode": "nullable" 182 | }, 183 | { 184 | "name": "sift_score", 185 | "type": "float", 186 | "mode": "nullable" 187 | }, 188 | { 189 | "name": "condel", 190 | "type": "string", 191 | "mode": "nullable" 192 | }, 193 | { 194 | "name": "exac_adj_ac", 195 | "type": "integer", 196 | "mode": "nullable" 197 | }, 198 | { 199 | "name": "exac_adj_af", 200 | "type": "float", 201 | "mode": "nullable" 202 | }, 203 | { 204 | "name": "exac_nontcga_adj_ac", 205 | "type": "integer", 206 | "mode": "nullable" 207 | }, 208 | { 209 | "name": "exac_nontcga_adj_af", 210 | "type": "float", 211 | "mode": "nullable" 212 | }, 213 | { 214 | "name": "exac_nonpsych_adj_ac", 215 | "type": "integer", 216 | "mode": "nullable" 217 | }, 218 | { 219 | "name": "exac_nonpsych_adj_af", 220 | "type": "float", 221 | "mode": "nullable" 222 | }, 223 | { 224 | "name": "genocanyon_score", 225 | "type": "float", 226 | "mode": "nullable" 227 | }, 228 | { 229 | "name": "phylop100way_vertebrate", 230 | "type": "float", 231 | "mode": "nullable" 232 | }, 233 | { 234 | "name": "phylop20way_mammalian", 235 | "type": "float", 236 | "mode": "nullable" 237 | }, 238 | { 239 | "name": "phastcons100way_vertebrate", 240 | "type": "float", 241 | "mode": "nullable" 242 | }, 243 | { 244 | "name": "phastcons20way_mammalian", 245 | "type": "float", 246 | "mode": "nullable" 247 | }, 248 | { 249 | "name": "siphy_29way_logodds", 250 | "type": "float", 251 | "mode": "nullable" 252 | }, 253 | { 254 | "name": "twinsuk_ac", 255 | "type": "integer", 256 | "mode": "nullable" 257 | }, 258 | { 259 | "name": "twinsuk_af", 260 | "type": "float", 261 | "mode": "nullable" 262 | }, 263 | { 264 | "name": "clinvar_rs", 265 | "type": "string", 266 | "mode": "nullable" 267 | }, 268 | { 269 | "name": "clinvar_trait", 270 | "type": "string", 271 | "mode": "nullable" 272 | }, 273 | { 274 | "name": "ensembl_geneid", 275 | "type": "string", 276 | "mode": "nullable" 277 | }, 278 | { 279 | "name": "ensembl_transcriptid", 280 | "type": "string", 281 | "mode": "nullable" 282 | }, 283 | { 284 | "name": "ensembl_proteinid", 285 | "type": "string", 286 | "mode": "nullable" 287 | }, 288 | { 289 | "name": "lrt_score", 290 | "type": "float", 291 | "mode": "nullable" 292 | }, 293 | { 294 | "name": "rvis", 295 | "type": "float", 296 | "mode": "nullable" 297 | }, 298 | { 299 | "name": "gdi", 300 | "type": "float", 301 | "mode": "nullable" 302 | }, 303 | { 304 | "name": "gtex_v6_gene", 305 | "type": "string", 306 | "mode": "nullable" 307 | }, 308 | { 309 | "name": "gtex_v6_tissue", 310 | "type": "string", 311 | "mode": "nullable" 312 | }, 313 | { 314 | "name": "alspac_ac", 315 | "type": "integer", 316 | "mode": "nullable" 317 | }, 318 | { 319 | "name": "alspac_af", 320 | "type": "float", 321 | "mode": "nullable" 322 | }, 323 | { 324 | "name": "esp6500_aa_ac", 325 | "type": "integer", 326 | "mode": "nullable" 327 | }, 328 | { 329 | "name": "esp6500_aa_af", 330 | "type": "float", 331 | "mode": "nullable" 332 | }, 333 | { 334 | "name": "esp6500_ea_ac", 335 | "type": "integer", 336 | "mode": "nullable" 337 | }, 338 | { 339 | "name": "esp6500_ea_af", 340 | "type": "float", 341 | "mode": "nullable" 342 | } 343 | ] 344 | }, 345 | { 346 | "name": "intergenic_consequences", 347 | "type": "record", 348 | "mode": "repeated", 349 | "fields": [ 350 | { 351 | "name": "impact", 352 | "type": "string", 353 | "mode": "nullable" 354 | }, 355 | { 356 | "name": "consequence_terms", 357 | "type": 
"string", 358 | "mode": "repeated" 359 | }, 360 | { 361 | "name": "variant_allele", 362 | "type": "string", 363 | "mode": "nullable" 364 | }, 365 | { 366 | "name": "allele_num", 367 | "type": "integer", 368 | "mode": "nullable" 369 | } 370 | ] 371 | } 372 | ] 373 | -------------------------------------------------------------------------------- /batch/vep/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # 15 | # This is branched from batch/build_annotator/Dockerfile.vep file and is meant 16 | # to replace that eventually. 17 | # 18 | # Example: 19 | # 20 | # docker build . --build-arg ENSEMBL_RELEASE=104 --tag vep:104 21 | # 22 | # To run vep through containers created by this file, the VEP cache has to be 23 | # downloaded separately and made available through command line arguments. 24 | # The script for doing so is build_vep_cache.sh, see README.md for details. 25 | 26 | # The pipelines-io container provides a wrapper around gsutil with additional 27 | # retry logic. 28 | FROM gcr.io/cloud-genomics-pipelines/io 29 | 30 | ARG ENSEMBL_RELEASE=104 31 | ARG VEP_BASE=/opt/variant_effect_predictor 32 | 33 | RUN apt-get -y update && apt-get install -y procps\ 34 | build-essential \ 35 | git \ 36 | libarchive-zip-perl \ 37 | libbz2-dev \ 38 | liblzma-dev \ 39 | libdbd-mysql-perl \ 40 | libdbi-perl \ 41 | libfile-copy-recursive-perl \ 42 | libhts1 \ 43 | libjson-perl \ 44 | libmodule-build-perl \ 45 | tabix \ 46 | unzip \ 47 | zlib1g-dev 48 | 49 | # Install VEP per the instructions at: 50 | # http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#installer 51 | RUN git clone https://github.com/Ensembl/ensembl-vep.git ${VEP_BASE} 52 | 53 | WORKDIR ${VEP_BASE} 54 | 55 | RUN git checkout release/${ENSEMBL_RELEASE} 56 | 57 | RUN perl INSTALL.pl \ 58 | --AUTO a \ 59 | --NO_UPDATE 60 | 61 | ADD run_vep.sh ${VEP_BASE}/run_vep.sh 62 | 63 | ADD run_script_with_watchdog.sh ${VEP_BASE}/run_script_with_watchdog.sh 64 | 65 | ENTRYPOINT [] 66 | -------------------------------------------------------------------------------- /batch/vep/README.md: -------------------------------------------------------------------------------- 1 | # Annotating input files with VEP 2 | 3 | This directory includes tools and utilities for running 4 | [Ensembl's Variant Effect Predictor]( 5 | https://ensembl.org/info/docs/tools/vep/index.html) (VEP) on input VCF files 6 | of [Variant Transforms](../README.md). 7 | 8 | ## Overview 9 | 10 | With tools provided in this directory, one can: 11 | * Create a docker image of VEP. 12 | * Download and package VEP's database (a.k.a. 13 | [cache](https://ensembl.org/info/docs/tools/vep/script/vep_cache.html)) for 14 | different species, reference sequences and versions of VEP. 15 | * Run VEP on VCF input files and create output VCF files that are annotated. 
16 | 17 | Note that, this is a useful standalone tool for running VEP in the cloud but the 18 | main goal is to be able to run VEP as a preprocessor through Variant Transforms 19 | and then import the annotated variants into BigQuery with proper handling of 20 | annotations. 21 | 22 | ## How to create and push VEP docker images 23 | 24 | Inside this directory, run: 25 | 26 | `docker build . -t [IMAGE_TAG]` 27 | 28 | This will download the source from 29 | [VEP GitHub repo](https://github.com/Ensembl/ensembl-vep) and build VEP from 30 | that source. By default, it uses version 104 of VEP. This can be changed by 31 | `ENSEMBL_RELEASE` build argument, e.g., 32 | 33 | `docker build . -t [IMAGE_TAG] --build-arg ENSEMBL_RELEASE=104` 34 | 35 | Let's say we want to push this image to the 36 | [Container Registry](https://cloud.google.com/container-registry/) of 37 | `my-project` on Google Cloud, so we can pick `[IMAGE_TAG]` as 38 | `gcr.io/my-project/vep:104`. Then push this image by: 39 | 40 | `gcloud docker -- push gcr.io/my-project/vep:104` 41 | 42 | **TODO**: Add `cloudbuild.yaml` files for both easy push and integration test. 43 | 44 | ## How to download and package VEP databases 45 | 46 | Choose a local directory with enough space (e.g., ~20GB for homo_sapiens) to 47 | download and integrate different pieces of the VEP database or cache files. 48 | Then from within that directory run the 49 | [`build_vep_cache.sh`](build_vep_cache.sh) script. By default this script 50 | creates the database for human (homo_sapiens), referenec sequence `GRCh38`, 51 | and release 104 of VEP. These values can be overwritten by the following 52 | environment variables (note you should use the same VEP release 53 | that you used for creating VEP docker image above): 54 | 55 | * `VEP_SPECIES` 56 | * `GENOME_ASSEMBLY` 57 | * `ENSEMBL_RELEASE` 58 | 59 | ## How to run VEP on GCP 60 | 61 | There is the helper script [`run_vep.sh`](run_vep.sh) that is added to the VEP 62 | docker image and can be used to run VEP. One way of running it on 63 | Google Cloud Platform (GCP) is through the [Pipelines API]( 64 | https://cloud.google.com/genomics/v1alpha2/pipelines-api-command-line). For a 65 | sample `yaml` job description check 66 | [`sample_pipeline.yaml`](sample_pipeline.yaml). 67 | Here is a sample `gcloud` command that uses that file: 68 | 69 | ``` 70 | gcloud alpha genomics pipelines run \ 71 | --project my-project \ 72 | --pipeline-file sample_pipeline.yaml \ 73 | --logging gs://my_bucket/logs \ 74 | --inputs VCF_INFO_FILED=CSQ_RERUN 75 | ``` 76 | 77 | Note the `vep_cache_homo_sapiens_GRCh38_104.tar.gz` file that is referenced in 78 | the sample `yaml` file, is the output file that you get from the above database 79 | creation step. 80 | 81 | The [`run_vep.sh`](run_vep.sh) script relies on several environment variables 82 | that can be set to change the default behaviour. In the above example 83 | `VCF_INFO_FILED` is changed to `CSQ_RERUN` (the default is `CSQ_VT`). 84 | 85 | This is the full list of supported environment variables: 86 | 87 | * `SPECIES`: default is `homo_sapiens` 88 | * `GENOME_ASSEMBLY`: default is `GRCh38` 89 | * `NUM_FORKS`: The value to be set for 90 | [`--fork` option of VEP]( 91 | http://ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_fork). 92 | default is 1. 
93 | * `OTHER_VEP_OPTS`: Other options to be set for the VEP invocation, default is 94 | [`--everything`]( 95 | http://ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_everything) 96 | * `VCF_INFO_FILED`: The name of the info field to be used for annotations, 97 | default is `CSQ_VT`. See 98 | [`--vcf_info_field`]( 99 | http://ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_vcf_info_field) 100 | 101 | The following environment variables have to be set and point to valid storage 102 | locations: 103 | 104 | * `VEP_CACHE`: Where the tar.gz file, created in the above database creation 105 | step, is located. 106 | * `INPUT_FILE`: Note this can be either a VCF file or a compressed VCF file 107 | (`.gz` or `.bgz`). Treatment of compressed and uncompressed files is the same, 108 | i.e., the input file is directly fed into VEP. 109 | * `OUTPUT_VCF`: The name of the output file which is always a VCF file. 110 | -------------------------------------------------------------------------------- /batch/vep/build_vep_cache.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Copyright 2018 Google Inc. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | # 17 | # This is a script for downloading VEP cache files, decompressing and placing 18 | # them in the appropriate directory structure that is expected by VEP script. 19 | # At the end, the whole structure is compressed to generate a single tar.gz 20 | # file that can be used in run_vep.sh invocations. 21 | # 22 | # This script creates a 'vep_cache' sub-directory and does every other file 23 | # operations and downloads inside that directory. The final cache file will be 24 | # stored in that directory as well. 25 | # 26 | # Capital letter variables refer to environment variables that can be set from 27 | # outside. Internal variables have small letters. All environment variables 28 | # have a default value as well to set up cache for homo_sapiens with reference 29 | # GRCh38 and release 104 of VEP. 30 | # 31 | # More details on cache files can be found here: 32 | # https://ensembl.org/info/docs/tools/vep/script/vep_cache.html 33 | 34 | set -euo pipefail 35 | 36 | readonly release="${ENSEMBL_RELEASE:-104}" 37 | readonly species="${VEP_SPECIES:-homo_sapiens}" # or "${VEP_SPECIES:-mus_musculus}" 38 | readonly assembly="${GENOME_ASSEMBLY:-GRCh38}" # or "${GENOME_ASSEMBLY:-GRCh37}" for homo_sapiens or "${GENOME_ASSEMBLY:-GRCm39}" for mus_musculus 39 | readonly work_dir="vep_cache" 40 | 41 | mkdir -p "${work_dir}" 42 | pushd "${work_dir}" 43 | readonly cache_file="${species}_vep_${release}_${assembly}.tar.gz" 44 | readonly ftp_base="ftp://ftp.ensembl.org/pub/release-${release}" 45 | 46 | # The fasta file name depends on the species and assembly but not the version. 47 | # Also the first letter of the file is capital while it is small for the actual 48 | # cache file (above). 
For example: "Homo_sapiens.GRCh38.dna.toplevel.fa.gz" 49 | readonly fasta_file="${species^?}.${assembly}.dna.toplevel.fa.gz" 50 | if [[ $species == "homo_sapiens" ]] && [[ $assembly == "GRCh37" ]]; then 51 | if [[ ! `command -v samtools` ]]; then 52 | echo "ERROR: samtools is needed to create the .fai index." 53 | echo "It can be installed by:" 54 | echo "sudo apt-get install samtools" 55 | echo "Or it can be downloaded from:" 56 | echo "http://www.htslib.org/download/" 57 | exit 1 58 | fi 59 | if [ ! `command -v bgzip` ]; then 60 | echo "ERROR: bgzip is needed to create the .gzi index." 61 | echo "It can be installed by:" 62 | echo "sudo apt-get install tabix" 63 | exit 1 64 | fi 65 | readonly ftp_GRCh37="ftp://ftp.ensembl.org/pub/grch37/release-${release}" 66 | readonly remote_fasta="${ftp_GRCh37}/fasta/${species}/dna/${fasta_file}" 67 | echo "Downloading ${remote_fasta}" 68 | curl -O "${remote_fasta}" 69 | echo "Decompressing fasta file..." 70 | gzip -d "${fasta_file}" 71 | echo "Block compressing fasta file and creating .gzi index..." 72 | readonly num_cores=`nproc --all` 73 | bgzip --index --threads "$num_cores" "${fasta_file%.*}" 74 | echo "Creating .fai index..." 75 | samtools faidx "${fasta_file}" 76 | else 77 | readonly remote_fasta="${ftp_base}/fasta/${species}/dna_index/${fasta_file}" 78 | echo "Downloading ${remote_fasta} and its index files ..." 79 | curl -O "${remote_fasta}" 80 | curl -O "${remote_fasta}.fai" 81 | curl -O "${remote_fasta}.gzi" 82 | fi 83 | 84 | # The path naming convention changed from "VEP" to "vep" after build 95. 85 | if (( release <= 95 )); then 86 | readonly remote_cache="${ftp_base}/variation/VEP/${cache_file}" 87 | else 88 | readonly remote_cache="${ftp_base}/variation/vep/${cache_file}" 89 | fi 90 | echo "Downloading ${remote_cache} ..." 91 | curl -O "${remote_cache}" 92 | echo "Decompressing cache files ..." 93 | tar xzf "${cache_file}" 94 | 95 | echo "Moving fasta files to the cache structure ..." 96 | mv ${fasta_file}* "${species}/${release}_${assembly}" 97 | 98 | echo "Creating single tar.gz file for the whole cache ..." 99 | readonly output_cache="vep_cache_${species}_${assembly}_${release}.tar.gz" 100 | tar czf "${output_cache}" "${species}" 101 | if [[ -r "${output_cache}" ]]; then 102 | echo "Cleaning up ..." 103 | rm -rf "${species}" 104 | rm -f "${cache_file}" 105 | fi 106 | popd 107 | 108 | if [[ -r "${work_dir}/${output_cache}" ]]; then 109 | echo "Successfully created cache file at ${work_dir}/${output_cache}" 110 | else 111 | echo "ERROR: Something went wrong when creating ${work_dir}/${output_cache} !" 112 | fi 113 | 114 | # TODO(bashir2): Experiment with the convert_cache.pl script of VEP and measure 115 | # performance improvements. If the change is significant then this script has to 116 | # run convert_cache.pl too. 117 | -------------------------------------------------------------------------------- /batch/vep/run_script_with_watchdog.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Copyright 2019 Google Inc. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 
7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | # This script runs `script_to_run` (first argument) in the background. Every 18 | # `watchdog_file_update_interval` seconds (second argument), it checks the last 19 | # update time of `watchdog_file` (third argument). Once the watchdog file is 20 | # found to be stale, the background process will be killed. The arguments needed 21 | # for `script_to_run` are passed after the third argument. 22 | 23 | set -euo pipefail 24 | 25 | ################################################# 26 | # Returns the generation number of a GCS file. 27 | # Arguments: 28 | # $1: The GCS file. 29 | ################################################# 30 | function get_last_update_time { 31 | gsutil stat $1 | awk '$1 == "Generation:" {print $2}' 32 | } 33 | 34 | function main { 35 | if [[ $# < 3 ]]; then 36 | echo "Usage: $0 " 37 | exit 1 38 | fi 39 | script_to_run="$1" 40 | watchdog_file_update_interval="$2" 41 | watchdog_file="$3" 42 | script_to_run_args="${@:4}" 43 | watchdog_file_allowed_stale_time="$((4*watchdog_file_update_interval))" 44 | 45 | ${script_to_run} ${script_to_run_args} & 46 | 47 | background_pid="$!" 48 | while ps -p "${background_pid}" > /dev/null 49 | do 50 | last_update_sec="$(($(get_last_update_time ${watchdog_file})/1000000))" 51 | declare -i now_sec 52 | now_sec="$(date +%s)" 53 | last_update_age_sec="$((now_sec-last_update_sec))" 54 | echo "The watchdog file is updated ${last_update_age_sec} seconds ago." 55 | if (("${last_update_age_sec}">"${watchdog_file_allowed_stale_time}")); then 56 | echo "ERROR: The watchdog file is stale, and running of ${script_to_run} has been killed." 57 | kill "${background_pid}" 58 | exit 1 59 | else 60 | sleep "${watchdog_file_update_interval}" 61 | fi 62 | done 63 | wait "${background_pid}" 64 | if [[ $? -ne 0 ]]; then 65 | echo "Running of ${script_to_run} failed." 66 | exit 1 67 | else 68 | echo "Running of ${script_to_run} succeed." 69 | exit 0 70 | fi 71 | } 72 | 73 | main "$@" 74 | -------------------------------------------------------------------------------- /batch/vep/run_vep.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Copyright 2018 Google Inc. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | # 17 | # This script is intended to be used in the Docker image built for VEP. 18 | # Note that except the single input and output files, all other arguments are 19 | # passed through environment variables. 
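#
# As a rough invocation sketch (the paths and values below are examples only,
# not requirements):
#
#   export VEP_CACHE=/mnt/data/vep_cache_homo_sapiens_GRCh38_104.tar.gz
#   export NUM_FORKS=12
#   ./run_vep.sh /mnt/data/input.vcf /mnt/data/output.vcf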
20 | # 21 | # The only environment variable that has to be set is (others are optional): 22 | # 23 | # VEP_CACHE: The path of the VEP cache which is a single .tar.gz file. 24 | # 25 | # The first argument is the input file (might be a VCF or a compressed VCF) and 26 | # the second is the output file which is always a VCF file. 27 | # 28 | # For the full list of supported environment variables and their documentation 29 | # check README.md. 30 | # Capital letter variables refer to environment variables that can be set from 31 | # outside. Internal variables have small letters. 32 | 33 | set -euo pipefail 34 | 35 | if [[ $# -ne 2 ]]; then 36 | echo "Usage: $0 input_file output_file" 37 | exit 1 38 | fi 39 | 40 | readonly species="${SPECIES:-homo_sapiens}" 41 | readonly assembly="${GENOME_ASSEMBLY:-GRCh38}" 42 | readonly fork_opt="--fork ${NUM_FORKS:-1}" 43 | readonly other_vep_opts="${OTHER_VEP_OPTS:---everything \ 44 | --check_ref --allow_non_variant}" 45 | readonly annotation_field_name="${VCF_INFO_FILED:-CSQ}" 46 | 47 | if [[ ! -r "${VEP_CACHE:?VEP_CACHE is not set!}" ]]; then 48 | echo "ERRPR: Cannot read ${VEP_CACHE}" 49 | exit 1 50 | fi 51 | 52 | # Check that the input file is readable. 53 | readonly input_file=${1} 54 | if [[ ! -r "${input_file}" ]]; then 55 | echo "ERRPR: Cannot read ${input_file}" 56 | exit 1 57 | fi 58 | 59 | echo "Checking the input file at $(date)" 60 | ls -l "${input_file}" 61 | 62 | # Make sure output file does not exist and can be written. 63 | readonly output_file=${2} 64 | if [[ -e ${output_file} ]]; then 65 | echo "ERROR: ${output_file} already exist!" 66 | exit 1 67 | fi 68 | mkdir -p $(dirname ${output_file}) 69 | touch ${output_file} 70 | rm ${output_file} 71 | 72 | readonly vep_cache_dir="$(dirname ${VEP_CACHE})" 73 | readonly vep_cache_file="$(basename ${VEP_CACHE})" 74 | pushd ${vep_cache_dir} 75 | if [[ -d "${species}" ]]; then 76 | echo "The cache is already decompressed; found ${species} at $(date)" 77 | else 78 | echo "Decompressing the cache file ${vep_cache_file} started at $(date)" 79 | tar xzvf "${vep_cache_file}" 80 | if [[ ! -d "${species}" ]]; then 81 | echo "Cannot find directory ${species} after decompressing ${vep_cache_file}!" 82 | exit 1 83 | fi 84 | fi 85 | popd 86 | 87 | readonly vep_command="./vep -i ${input_file} -o ${output_file} \ 88 | --dir ${vep_cache_dir} --offline --species ${species} --assembly ${assembly} \ 89 | --vcf --allele_number --vcf_info_field ${annotation_field_name} ${fork_opt} \ 90 | ${other_vep_opts}" 91 | echo "VEP command is: ${vep_command}" 92 | 93 | echo "Running vep started at $(date)" 94 | # The next line should not be quoted since we want word splitting to happen. 95 | ${vep_command} 96 | -------------------------------------------------------------------------------- /batch/vep/sample_pipeline.yaml: -------------------------------------------------------------------------------- 1 | # TODO(bashir2): Update this example with v2alpha1 required changes. 
2 | name: run-vep 3 | resources: 4 | disks: 5 | - name: datadisk 6 | mountPoint: /mnt/data 7 | type: PERSISTENT_HDD 8 | sizeGb: 100 9 | minimumCpuCores: 12 10 | inputParameters: 11 | - name: VEP_CACHE 12 | defaultValue: gs://my_bucket/vep_cache_homo_sapiens_GRCh38_104.tar.gz 13 | localCopy: 14 | disk: datadisk 15 | path: vep_cache_104.tar.gz 16 | - name: INPUT_FILE 17 | defaultValue: gs://my_bucket/input.vcf 18 | localCopy: 19 | disk: datadisk 20 | path: input.vcf 21 | - name: VCF_INFO_FILED 22 | defaultValue: CSQ_VT 23 | - name: NUM_FORKS 24 | defaultValue: "12" 25 | outputParameters: 26 | - name: OUTPUT_FILE 27 | defaultValue: gs://my_bucket/output.vcf 28 | localCopy: 29 | disk: datadisk 30 | path: output.vcf 31 | docker: 32 | imageName: gcr.io/my-project/vep:104 33 | cmd: /opt/variant_effect_predictor/run_vep.sh ${INPUT_FILE} ${OUTPUT_FILE} 34 | -------------------------------------------------------------------------------- /curation/README.md: -------------------------------------------------------------------------------- 1 | **WARNING: Not actively maintained!** 2 | 3 | Curation Scripts 4 | ================ 5 | 6 | The scripts in this portion of the repository were used to ingest, reshape, and 7 | store variant annotations in a cloud analysis-ready format. You can use them to 8 | bring in a fresh copy of an annotation resource or as a starting point for 9 | curation of a new annotation resource. 10 | 11 | ## Status of this sub-project 12 | 13 | This code currently works with annotation resources such 14 | as [dbSNP](https://www.ncbi.nlm.nih.gov/projects/SNP/) 15 | and [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) along with variant allele 16 | frequencies 17 | from 18 | [NHLBI GO Exome Sequencing Project (ESP)](http://evs.gs.washington.edu/EVS/), 19 | [1000 Genomes](http://www.internationalgenome.org/), 20 | [ExAC](http://exac.broadinstitute.org/), 21 | and [Genome Aggregation Database (gnomAD)](http://gnomad.broadinstitute.org/) 22 | but similar techniques could be applied to other annotation resources. 23 | 24 | All steps are run in the cloud, but each individual step is launched manually. 25 | 26 | ## Overview 27 | 28 | ### Curate Individual Annotation Sources 29 | 30 | Many variant annotation sources are encoded as VCF files. Therefore we can 31 | use [Google Genomics](https://cloud.google.com/genomics/) to import the resource 32 | and export it to BigQuery. 33 | 34 | [Follow the tutorial](./tables) to run 35 | a [dsub](https://github.com/googlegenomics/dsub) script to create individual 36 | tables holding dbSNP, ClinVar, ESP, etc. 37 | 38 | ### Create an "All-Possible SNPs" Table 39 | 40 | A table with annotations for all possible SNPs of a particular genome reference 41 | is useful for: 42 | 43 | * Examining SNP variation across different regions of the genome. 44 | * Quickly annotating the SNPs for a cohort using a simple JOIN. 45 | * Generating synthetic sequence variant datasets using the SNP allele 46 | frequencies from this table. 47 | 48 | [Follow the tutorial](./allPossibleSNPs) to create an all-possible-SNPs tables 49 | for build 38 of the human genome reference. 50 | 51 | ### Add Column Descriptions to a BigQuery Table 52 | 53 | The `variants` table generated by performing 54 | an 55 | [export from Google Genomics](https://cloud.google.com/genomics/reference/rest/v1/variantsets/export) does 56 | not include the field descriptions for the fields. 
57 | 58 | See [add BigQuery descriptions](./tables/AddBigQueryDescriptions.md) for 59 | instructions on how to automatically populate the BigQuery schema 60 | description with the information from the VCF header. 61 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/ESP_AA.sql: -------------------------------------------------------------------------------- 1 | -- 2 | -- Prepare ESP AA for the JOIN. 3 | -- 4 | ESP_AA AS ( 5 | SELECT 6 | reference_name, 7 | start, 8 | `end`, 9 | reference_bases, 10 | alternate_bases, 11 | AF AS ESP_AA_AF, 12 | -- Used to check for correctness of the JOIN. 13 | names[OFFSET(0)] AS ESP_AA_rsid 14 | FROM 15 | `{{ ESP_AA_TABLE }}` v, 16 | v.alternate_bases alternate_bases ) 17 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/ESP_EA.sql: -------------------------------------------------------------------------------- 1 | -- 2 | -- Prepare ESP EA for the JOIN. 3 | -- 4 | ESP_EA AS ( 5 | SELECT 6 | reference_name, 7 | start, 8 | `end`, 9 | reference_bases, 10 | alternate_bases, 11 | AF AS ESP_EA_AF, 12 | -- Used to check for correctness of the JOIN. 13 | names[OFFSET(0)] AS ESP_EA_rsid 14 | FROM 15 | `{{ ESP_EA_TABLE }}` v, 16 | v.alternate_bases alternate_bases ) 17 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/README.md: -------------------------------------------------------------------------------- 1 | **WARNING: Not actively maintained!** 2 | 3 | Create an All-Possible SNPs Table 4 | ================================= 5 | 6 | This tutorial combines a particular reference genome with individual annotation 7 | resources to create an "all possible SNPs" table. 8 | 9 | [dsub](https://cloud.google.com/genomics/v1alpha2/dsub) 10 | and [BigQuery](https://cloud.google.com/bigquery/) are used to run all of these 11 | steps in the cloud. 12 | 13 | ## Status of this tutorial 14 | 15 | This is a work-in-progress. Next steps are to: 16 | 17 | 1. add more annotation resources to the JOIN 18 | 2. add examples that make use of the all possible SNPs GRCh38 table to analyze 19 | SNPs from 20 | the 21 | [Platinum Genomes DeepVariant](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/platinum_genomes_deepvariant.html) cohort. These 22 | examples would be similar to 23 | https://github.com/googlegenomics/bigquery-examples/tree/master/platinumGenomes. 24 | 25 | ## (1) Configure project variables. 26 | 27 | Set a few environment variables to facilitate cutting and pasting the subsequent 28 | commands. 29 | 30 | ``` bash 31 | # The Google Cloud Platform project id in which to process the annotations. 32 | PROJECT_ID=your-project-id 33 | # The bucket name (with the gs:// prefix) for logs and temp files. 34 | BUCKET=gs://your-bucket-name 35 | # The BigQuery dataset, which must already exist, in which to store annotations. 36 | DATASET=your_bigquery_dataset_name 37 | ``` 38 | ## (2) Identify the reference genome you wish to annotate. 39 | 40 | In this tutorial we're specifically working 41 | with 42 | [Verily’s version of GRCh38](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/reference_genomes.html#verily-s-grch38). 43 | 44 | Note that instead you could: 45 | 46 | * Use one of the other reference genomes are already available in Cloud Storage. 
47 | See 48 | [Reference Genomes](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/reference_genomes.html) for 49 | the list and Cloud Storage paths. 50 | * Copy the FASTA file for the desired reference genome to cloud storage. For 51 | more detail, 52 | see 53 | [Copying large files to a bucket](https://cloud.google.com/storage/docs/working-with-big-data#copy-large-file). 54 | 55 | ## (3) Convert the FASTA file. 56 | 57 | Run the following [dsub](https://github.com/googlegenomics/dsub) command to 58 | convert the FASTA file for the reference genome into a format ammenable to 59 | BigQuery. 60 | 61 | ``` bash 62 | # Copy the script dsub will run to Cloud Storage. 63 | gsutil cp fasta_to_kv.py ${BUCKET} 64 | 65 | # Run the conversion operation. 66 | dsub \ 67 | --project ${PROJECT_ID} \ 68 | --zones "us-central1-*" \ 69 | --logging ${BUCKET}/fasta_to_kv.log \ 70 | --image python:2.7-slim \ 71 | --input FASTA=gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa \ 72 | --input CONVERTER=${BUCKET}/fasta_to_kv.py \ 73 | --output KV=${BUCKET}/GRCh38_Verily_v1.genome.txt \ 74 | --command 'cat "${FASTA}" | python "${CONVERTER}" > "${KV}"' \ 75 | --wait 76 | ``` 77 | 78 | ## (4) Load the sequences into BigQuery. 79 | 80 | Use the bq command line tool to load the sequences into BigQuery. 81 | 82 | ``` bash 83 | bq --project ${PROJECT_ID} load \ 84 | -F '>' \ 85 | --schema unused:string,chr:string,sequence_start:integer,sequence:string \ 86 | ${DATASET}.VerilyGRCh38_sequences \ 87 | ${BUCKET}/GRCh38_Verily_v1.genome.txt 88 | ``` 89 | 90 | ## (5) Reshape the sequences into SNPs and JOIN with annotations. 91 | 92 | Run script [render_templated_sql.py](./render_templated_sql.py) to create the 93 | SQL that will perform the JOIN. 94 | 95 | ``` bash 96 | python ./render_templated_sql.py \ 97 | --sequence_table ${DATASET}.VerilyGRCh38_sequences \ 98 | --b38 99 | ``` 100 | 101 | Then run the generated SQL via the BigQuery web UI or the bq command line tool 102 | and materialize the result to a new table. 103 | 104 | See `render_templated_sql.py --help` for more details. 105 | 106 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/all_possible_snps.sql: -------------------------------------------------------------------------------- 1 | -- 2 | -- Create a table containing all possible SNPs for a reference genome. 3 | -- 4 | -- Split the sequences from the FASTA file. 5 | -- 6 | base_pairs AS ( 7 | SELECT 8 | chr, 9 | sequence_start, 10 | SPLIT(sequence, '') AS bps 11 | FROM 12 | `{{ SEQUENCE_TABLE }}` 13 | # Use this replacement to test on small amount of data. Otherwise replace 14 | # it with the empty string. 15 | {{ SEQUENCE_FILTER }} ), 16 | -- 17 | -- Expand the data to one row per base pair. Also upper case the 18 | -- base pair and compute the end position. 19 | -- 20 | all_refs AS ( 21 | SELECT 22 | chr AS original_reference_name, 23 | SUBSTR(chr, 4) AS reference_name, 24 | sequence_start + base_pair_offset AS start, 25 | sequence_start + base_pair_offset + 1 AS `end`, 26 | UPPER(base_pair) AS reference_bases, 27 | base_pair AS original_reference_bases 28 | FROM 29 | base_pairs, 30 | base_pairs.bps base_pair 31 | WITH 32 | OFFSET 33 | base_pair_offset), 34 | -- 35 | -- Create a table holding the four possible values for 36 | -- alternate_bases. 
37 | -- 38 | all_alternate_bases AS ( 39 | SELECT 40 | 'A' AS alternate_bases 41 | UNION ALL 42 | SELECT 43 | 'C' AS alternate_bases 44 | UNION ALL 45 | SELECT 46 | 'G' AS alternate_bases 47 | UNION ALL 48 | SELECT 49 | 'T' AS alternate_bases ), 50 | all_possible_snps AS ( 51 | -- 52 | -- CROSS JOIN with all possible mutations for the base pair. Note 53 | -- that 'N' will result in four possible mutations. 54 | -- 55 | SELECT 56 | reference_name, 57 | original_reference_name, 58 | start, 59 | `end`, 60 | reference_bases, 61 | original_reference_bases, 62 | alternate_bases 63 | FROM 64 | all_refs 65 | CROSS JOIN 66 | all_alternate_bases 67 | ) 68 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/check_joined_annotations.sql: -------------------------------------------------------------------------------- 1 | #standardSQL 2 | -- 3 | -- Compare the dbSNP ids retrieved from a variety of annotation sources 4 | -- to ensure that the multiple sources were joined correctly. 5 | -- 6 | -- Replace `YOUR_NEWLY_CREATED_ANNOTATIONS_TABLE` with the table to which the 7 | -- JOINed annotations were materialized. 8 | -- 9 | SELECT 10 | {% for source in annot_sources %} 11 | COUNTIF({{source}}_rsid = dbSNP_rsid)/COUNTIF({{source}}_rsid IS NOT NULL) AS {{source}}_matched, 12 | COUNTIF({{source}}_rsid IS NOT NULL) AS {{source}}_compared, 13 | {% endfor %} 14 | COUNT(dbSNP_rsid) AS num_in_dbSNP 15 | FROM `YOUR_NEWLY_CREATED_ANNOTATIONS_TABLE` 16 | WHERE 17 | dbSNP_rsid IS NOT NULL 18 | 19 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/clinvar.sql: -------------------------------------------------------------------------------- 1 | -- 2 | -- Prepare ClinVar for the JOIN. 3 | -- 4 | clinvar AS ( 5 | SELECT 6 | reference_name, 7 | start, 8 | `end`, 9 | reference_bases, 10 | alternate_bases, 11 | -- Used to check for correctness of the JOIN. 12 | CONCAT('rs', CAST(RS AS STRING)) AS clinvar_rsid, 13 | -- ClinVar uses field CLNALLE to indicate "variant alleles from REF 14 | -- or ALT columns. 0 is REF, 1 is the first ALT allele, etc. This 15 | -- is used to match alleles with other corresponding clinical (CLN) 16 | -- INFO tags. A value of -1 indicates that no allele was found to 17 | -- match a corresponding HGVS allele name." 18 | CLNDBN[OFFSET(clnalle_offset)] AS CLNDBN, 19 | CLNACC[OFFSET(clnalle_offset)] AS CLNACC, 20 | CLNDSDB[OFFSET(clnalle_offset)] AS CLNDSDB, 21 | CLNDSDBID[OFFSET(clnalle_offset)] AS CLNDSDBID, 22 | CLNREVSTAT[OFFSET(clnalle_offset)] AS CLNREVSTAT, 23 | CLNSIG[OFFSET(clnalle_offset)] AS CLNSIG 24 | FROM 25 | `{{ CLINVAR_TABLE }}` v, 26 | UNNEST(ARRAY_CONCAT([reference_bases], v.alternate_bases)) AS alternate_bases WITH OFFSET alt_offset, 27 | v.CLNALLE clnalle WITH OFFSET clnalle_offset 28 | WHERE 29 | clnalle = alt_offset) 30 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/dbSNP.sql: -------------------------------------------------------------------------------- 1 | -- 2 | -- Prepare dbSNP for the JOIN. 3 | -- 4 | -- http://varianttools.sourceforge.net/Annotation/DbSNP 5 | -- Multiple alternate alleles sometimes correspond to the same rsid. 6 | -- Some variants have multiple rsids. 
7 | -- 8 | dbSNP AS ( 9 | SELECT 10 | reference_name, 11 | start, 12 | `end`, 13 | reference_bases, -- on the + strand 14 | alternate_bases, -- on the + strand 15 | names AS rs_names, 16 | RS, 17 | -- Used to check for correctness of the JOIN. 18 | CONCAT('rs', CAST(RS AS STRING)) AS dbSNP_rsid 19 | FROM 20 | `{{ DBSNP_TABLE }}` v, 21 | v.alternate_bases alternate_bases ) 22 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/fasta_to_kv.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | r"""Convert FASTA files to a map-reduceable format. 17 | 18 | Example Input: 19 | >chr22 20 | CAAGG 21 | TTAGC 22 | CCCCC 23 | 24 | Example Output: 25 | >chr22>0>CAAGG 26 | >chr22>5>TTAGC 27 | >chr22>10>CCCCC 28 | 29 | It is very fast (~2 minutes for a 3 GB FASTA) when run on Compute Engine 30 | utilizing streaming download and upload. 31 | https://cloud.google.com/storage/docs/gsutil/commands/cp#streaming-transfers 32 | 33 | For uncompressed FASTA files: 34 | 35 | gsutil cat \ 36 | gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa 37 | \ 38 | | \ 39 | ./fasta_to_kv.py \ 40 | | \ 41 | gsutil cp - gs://MY-BUCKET/refs/GRCh38_Verily_v1.genome.txt 42 | 43 | For compressed FASTA files, use the appropriate command to unzip the file 44 | before passing it to this script: 45 | 46 | gsutil cat \ 47 | gs://genomics-public-data/references/hg19/*fa.gz \ 48 | | \ 49 | gunzip \ 50 | | \ 51 | ./fasta_to_kv.py \ 52 | | \ 53 | gsutil cp - gs://MY-BUCKET/refs/hg19.txt 54 | """ 55 | 56 | import sys 57 | 58 | sequence = "" 59 | position = 0 60 | 61 | for line in sys.stdin: 62 | trimmed = line.strip() 63 | if not trimmed: 64 | break 65 | 66 | if trimmed.startswith(";"): 67 | # Skip comment lines. 68 | continue 69 | 70 | if trimmed.startswith(">"): 71 | # We've started a new sequence. Reset the state. 72 | sequence = trimmed 73 | position = 0 74 | continue 75 | 76 | # Write out the sequence with a prefix indicating its context. Use '>' 77 | # as the delimiter since its a safe character to use in the file. 
78 | sys.stdout.write(sequence + ">" + str(position) + ">" + trimmed + "\n") 79 | position += len(trimmed) 80 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/join_annotations.sql: -------------------------------------------------------------------------------- 1 | #standardSQL 2 | WITH 3 | {% if 'dbSNP' in annot_sources %} 4 | {% include 'dbSNP.sql' %}, 5 | {% endif %} 6 | 7 | {% if 'clinvar' in annot_sources %} 8 | {% include 'clinvar.sql' %}, 9 | {% endif %} 10 | 11 | {% if 'thousandGenomes' in annot_sources %} 12 | {% include 'thousandGenomes.sql' %}, 13 | {% endif %} 14 | 15 | {% if 'ESP_AA' in annot_sources %} 16 | {% include 'ESP_AA.sql' %}, 17 | {% endif %} 18 | 19 | {% if 'ESP_EA' in annot_sources %} 20 | {% include 'ESP_EA.sql' %}, 21 | {% endif %} 22 | 23 | {% include 'all_possible_snps.sql' %} 24 | 25 | -- 26 | -- Then JOIN with the individual variant annotation DBs. 27 | -- 28 | SELECT 29 | * 30 | FROM 31 | all_possible_snps 32 | {% for source in annot_sources %} 33 | LEFT OUTER JOIN {{source}} 34 | USING(reference_name, start, `end`, reference_bases, alternate_bases) 35 | {% endfor %} 36 | 37 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/render_templated_sql.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Copyright 2017 Verily Life Sciences Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | """Assemble an SQL query. 18 | 19 | Using a basic pattern for JOINs with variant annotation databases, assemble 20 | templated SQL into a full query that can but run to create an annotated 21 | "all possible SNPs" table. 
22 | """ 23 | 24 | from __future__ import absolute_import 25 | 26 | import argparse 27 | import logging 28 | import sys 29 | 30 | from jinja2 import Environment 31 | from jinja2 import FileSystemLoader 32 | 33 | SEQUENCE_TABLE_KEY = "SEQUENCE_TABLE" 34 | 35 | B37_QUERY_REPLACEMENTS = { 36 | "SEQUENCE_FILTER": """WHERE chr IN ('chr17', '17') 37 | AND sequence_start BETWEEN 41196311 AND 41277499""", 38 | "DBSNP_TABLE": "bigquery-public-data.human_variant_annotation.ncbi_dbsnp_hg19_20170710", 39 | "CLINVAR_TABLE": 40 | "bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg19_20170705", 41 | "THOUSAND_GENOMES_TABLE": 42 | "bigquery-public-data.human_variant_annotation.ensembl_1000genomes_phase3_hg19_release89", 43 | "ESP_AA_TABLE": 44 | "bigquery-public-data.human_variant_annotation.ensembl_esp6500_aa_hg19_release89", 45 | "ESP_EA_TABLE": 46 | "bigquery-public-data.human_variant_annotation.ensembl_esp6500_ea_hg19_release89", 47 | } 48 | 49 | B38_QUERY_REPLACEMENTS = { 50 | "SEQUENCE_FILTER": """WHERE chr IN ('chr17', '17') 51 | AND sequence_start BETWEEN 43045628 AND 43125483""", 52 | "DBSNP_TABLE": "bigquery-public-data.human_variant_annotation.ncbi_dbsnp_hg38_20170710", 53 | "CLINVAR_TABLE": 54 | "bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20170705", 55 | "THOUSAND_GENOMES_TABLE": 56 | "bigquery-public-data.human_variant_annotation.ensembl_1000genomes_phase3_hg38_release89", 57 | "ESP_AA_TABLE": 58 | "bigquery-public-data.human_variant_annotation.ensembl_esp6500_aa_hg38_release89", 59 | "ESP_EA_TABLE": 60 | "bigquery-public-data.human_variant_annotation.ensembl_esp6500_ea_hg38_release89", 61 | } 62 | 63 | # The table alias and the query filename must be the same. 64 | B37_ANNOTATION_SOURCES = ["dbSNP", 65 | "clinvar", 66 | "thousandGenomes", 67 | "ESP_AA", 68 | "ESP_EA" 69 | # TODO: add gnomAD here. 
70 | ] 71 | B38_ANNOTATION_SOURCES = ["dbSNP", 72 | "clinvar", 73 | "thousandGenomes", 74 | "ESP_AA", 75 | "ESP_EA"] 76 | 77 | 78 | def run(argv=None): 79 | """Main entry point.""" 80 | parser = argparse.ArgumentParser() 81 | parser.add_argument( 82 | "--sequence_table", 83 | required=True, 84 | help="Fully qualified BigQuery table name for the reference " 85 | "genome sequences to be converted to all-possible SNPs.") 86 | parser.add_argument( 87 | "--b37", 88 | dest="is_b37", 89 | default=True, 90 | action="store_true", 91 | help="Use annotation tables aligned to build 37 of the " 92 | "human genome reference.") 93 | parser.add_argument( 94 | "--b38", 95 | dest="is_b37", 96 | action="store_false", 97 | help="Use annotation tables aligned to build 38 of the " 98 | "human genome reference.") 99 | parser.add_argument( 100 | "--output", 101 | dest="output", 102 | default="annotated_snps_RENDERED.sql", 103 | help="Output file to which to write rendered SQL.") 104 | parser.add_argument( 105 | "--debug", 106 | dest="debug", 107 | action="store_true", 108 | help="Generate SQL that will yield a small table for testing purposes.") 109 | args = parser.parse_args(argv) 110 | 111 | sources = B37_ANNOTATION_SOURCES if ( 112 | args.is_b37) else B38_ANNOTATION_SOURCES 113 | replacements = B37_QUERY_REPLACEMENTS.copy() if ( 114 | args.is_b37) else B38_QUERY_REPLACEMENTS.copy() 115 | 116 | replacements[SEQUENCE_TABLE_KEY] = args.sequence_table 117 | 118 | if not args.debug: 119 | replacements["SEQUENCE_FILTER"] = "" 120 | 121 | join_template = Environment(loader=FileSystemLoader("./")).from_string( 122 | open("join_annotations.sql", "r").read()) 123 | join_query = join_template.render(replacements, annot_sources=sources) 124 | with open(args.output, "w") as outfile: 125 | outfile.write(join_query) 126 | 127 | check_template = Environment(loader=FileSystemLoader("./")).from_string( 128 | open("check_joined_annotations.sql", "r").read()) 129 | check_query = check_template.render(replacements, annot_sources=sources) 130 | sys.stdout.write(""" 131 | Resulting JOIN query written to output file %s. Run that query using the 132 | BigQuery web UI or the bq command line tool. 133 | 134 | Be sure to test the result of the JOIN, for example: 135 | 136 | %s 137 | """ % (args.output, check_query)) 138 | 139 | if __name__ == "__main__": 140 | logging.getLogger().setLevel(logging.INFO) 141 | run() 142 | -------------------------------------------------------------------------------- /curation/allPossibleSNPs/thousandGenomes.sql: -------------------------------------------------------------------------------- 1 | -- 2 | -- Prepare 1000 Genomes for the JOIN. 3 | -- 4 | thousandGenomes AS ( 5 | SELECT 6 | reference_name, 7 | start, 8 | `end`, 9 | reference_bases, 10 | alternate_bases, 11 | AFR_AF[OFFSET(alt_offset)] AS AFR_AF_1000G, 12 | AMR_AF[OFFSET(alt_offset)] AS AMR_AF_1000G, 13 | EAS_AF[OFFSET(alt_offset)] AS EAS_AF_1000G, 14 | EUR_AF[OFFSET(alt_offset)] AS EUR_AF_1000G, 15 | SAS_AF[OFFSET(alt_offset)] AS SAS_AF_1000G, 16 | -- Used to check for correctness of the JOIN. 
17 | names[OFFSET(0)] AS thousandGenomes_rsid 18 | FROM 19 | `{{ THOUSAND_GENOMES_TABLE }}` v, 20 | v.alternate_bases alternate_bases WITH OFFSET alt_offset ) 21 | -------------------------------------------------------------------------------- /curation/tables/AddBigQueryDescriptions.md: -------------------------------------------------------------------------------- 1 | Add Column Descriptions to a BigQuery Table 2 | =========================================== 3 | 4 | The `variants` table generated by performing 5 | an 6 | [export from Google Genomics](https://cloud.google.com/genomics/reference/rest/v1/variantsets/export) does 7 | not include the field descriptions for the fields. 8 | 9 | Some fields in the variants table are fixed fields (such as the 10 | `reference_name`, `start`, and `end` fields), while others are variable - 11 | discovered in source VCF or masterVar files during the variant import process. 12 | 13 | An ideal tool would be able to pull together a set of descriptions for the 14 | fixed fields, along with descriptions from: 15 | 16 | * the source variant set 17 | * the source variant files (VCFs or masterVar) 18 | * descriptions already set in the `variants` table schema 19 | 20 | and then allow for easy user additions and edits prior to updating the variants 21 | table schema. 22 | 23 | Since most variant sets are built from homogenous VCF files, this script 24 | provides just a simple implementation - provide a VCF used to build your variant 25 | set - it will extract the relevant descriptions from the VCF header and update 26 | the table. The VCF can be local or in Google Cloud Storage if TensorFlow 27 | libraries are installed. The VCF may be compressed with gzip and the filename 28 | must end with ".gz" if so. 29 | 30 | ## Setup 31 | 32 | The tool here uses the BigQuery client libraries described at: 33 | 34 | https://cloud.google.com/bigquery/docs/reference/libraries 35 | 36 | The following steps install that library in 37 | a 38 | [Python virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/) 39 | 40 | 1. Create a virtualenv 41 | 42 | ``` 43 | virtualenv bq_lib 44 | ``` 45 | 46 | 2. Activate the virtualenv 47 | 48 | ``` 49 | source bq_lib/bin/activate 50 | ``` 51 | 52 | 3. Install the BigQuery client libraries and TensorFlow: 53 | 54 | ``` 55 | pip install --upgrade google-cloud-bigquery tensorflow 56 | ``` 57 | 58 | ## Run 59 | 60 | ```shell 61 | python update_variants_schema.py \ 62 | --source-vcf PATH_TO_VCF.vcf \ 63 | --destination-table PROJECT_ID.DATASET_NAME.TABLE_NAME 64 | ``` 65 | 66 | Note the fully-qualified table name follows 67 | BigQuery 68 | [Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/) 69 | conventions. 
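
For example, to read the header directly from a gzip-compressed VCF in Cloud
Storage (this variant assumes TensorFlow is installed; the paths below are
placeholders):

```shell
python update_variants_schema.py \
  --source-vcf gs://YOUR_BUCKET/path/to/annotations.vcf.gz \
  --destination-table PROJECT_ID.DATASET_NAME.TABLE_NAME
```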
70 | -------------------------------------------------------------------------------- /curation/tables/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM gcr.io/cloud-builders/gcloud 2 | 3 | RUN apt-get update \ 4 | && apt-get install -y python-setuptools \ 5 | && pip install --upgrade \ 6 | gcloud \ 7 | google-api-python-client \ 8 | google-cloud-bigquery \ 9 | retrying \ 10 | tensorflow 11 | 12 | 13 | COPY *.py /usr/local/bin/ 14 | COPY launch_import_vcf_to_bigquery.sh /usr/local/bin/ 15 | 16 | ENTRYPOINT ["bash"] 17 | -------------------------------------------------------------------------------- /curation/tables/README.md: -------------------------------------------------------------------------------- 1 | **WARNING: Not actively maintained!** 2 | 3 | Curate Individual Annotation Sources 4 | ==================================== 5 | 6 | This tutorial loads several annotation sources to individual BigQuery tables. 7 | These tables are already available in BigQuery 8 | dataset 9 | [bigquery-public-data:human_variant_annotation](https://bigquery.cloud.google.com/dataset/bigquery-public-data:human_variant_annotation), 10 | but the configuration in this tutorial can be updated to load new versions of 11 | these resources or load additional annotation resources. 12 | 13 | [Container Builder](https://cloud.google.com/container-builder/docs/overview), 14 | [dsub](https://github.com/googlegenomics/dsub) 15 | and [Google Genomics](https://cloud.google.com/genomics/) are used to run all of 16 | these steps in the cloud. 17 | 18 | ## (1) Configure project variables. 19 | 20 | Set a few environment variables to facilitate cutting and pasting the subsequent 21 | commands. 22 | 23 | ``` bash 24 | # The Google Cloud Platform project id in which the Docker containers 25 | # will be built and stored. 26 | PROJECT_ID=your-project-id 27 | # The bucket name (with the gs:// prefix) where dsub logs will 28 | # be written. 29 | BUCKET=gs://your-bucket-name 30 | # The BigQuery destination dataset for the imported annotations. 31 | DATASET=your_bigquery_dataset_name 32 | ``` 33 | 34 | ## (2) Build the importer Docker container. 35 | 36 | Build the VCF importer image using the Container Builder service: 37 | 38 | ``` bash 39 | gcloud container builds submit \ 40 | --project ${PROJECT_ID} \ 41 | --tag gcr.io/${PROJECT_ID}/vcf_to_bigquery \ 42 | . 43 | ``` 44 | 45 | ## (3) Test a small import. 46 | 47 | The target BigQuery dataset must already exist, and the service account used to 48 | run [dsub](https://cloud.google.com/genomics/v1alpha2/dsub) jobs must have 49 | "BigQuery Data Owner" role. (The Compute Engine default service account will 50 | not have this role by 51 | default. 52 | [It would need to be added.](https://cloud.google.com/iam/docs/granting-roles-to-service-accounts)) 53 | 54 | Submit a single VCF import task 55 | via [dsub](https://cloud.google.com/genomics/v1alpha2/dsub). Here we use the 56 | small file 57 | `gs://genomics-public-data/1000-genomes/vcf/ALL.chrY.genome_strip_hq.20101123.svs.low_coverage.genotypes.vcf` 58 | for a quick test. 
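
If the service account you plan to use does not yet have that role, it can be
granted with something along these lines (`PROJECT_NUMBER` is a placeholder
for your project's number):

``` bash
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member "serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role "roles/bigquery.dataOwner"
```

Then submit the test import: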
59 | 60 | ``` bash 61 | dsub \ 62 | --project ${PROJECT_ID} \ 63 | --zones "us-*" \ 64 | --logging ${BUCKET}/upload_logs \ 65 | --image gcr.io/${PROJECT_ID}/vcf_to_bigquery \ 66 | --scopes "https://www.googleapis.com/auth/bigquery" \ 67 | "https://www.googleapis.com/auth/devstorage.read_write" \ 68 | --env \ 69 | SOURCE_VCFS=gs://genomics-public-data/1000-genomes/vcf/ALL.chrY.genome_strip_hq.20101123.svs.low_coverage.genotypes.vcf \ 70 | PROJECT=${PROJECT_ID} \ 71 | DATASET=${DATASET} \ 72 | VARIANTSET=test \ 73 | TABLE=${PROJECT_ID}.${DATASET}.test \ 74 | --script launch_import_vcf_to_bigquery.sh 75 | ``` 76 | 77 | ## (4) Configure the annotation sources and destinations. 78 | 79 | Edit [vcf_manifest.tsv](vcf_manifest.tsv) to use your desired Cloud Storage, 80 | Google Genomics, and BigQuery destinations. It can also be edited to use newer 81 | versions of the annotation sources and/or add more annotation sources where the 82 | file format is VCF. 83 | 84 | ## (5) Run all the annotation imports in parallel. 85 | 86 | Submit multiple parallel imports via 87 | [dsub](https://cloud.google.com/genomics/v1alpha2/dsub): 88 | 89 | ``` bash 90 | dsub \ 91 | --project ${PROJECT_ID} \ 92 | --zones "us-*" \ 93 | --logging ${BUCKET}/upload_logs \ 94 | --image gcr.io/${PROJECT_ID}/vcf_to_bigquery \ 95 | --scopes "https://www.googleapis.com/auth/bigquery" \ 96 | "https://www.googleapis.com/auth/devstorage.read_write" \ 97 | --tasks vcf_manifest.tsv \ 98 | --script launch_import_vcf_to_bigquery.sh 99 | ``` 100 | -------------------------------------------------------------------------------- /curation/tables/import_vcf_to_bigquery.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | r"""Import variant data in a VCF file to a BigQuery variants table. 15 | 16 | Example usage: 17 | 18 | python import_vcf_to_bigquery.py \ 19 | --source-vcf "gs://BUCKET_NAME/PATH/TO/variants.vcf.gz" \ 20 | --project "PROJECT_ID" \ 21 | --dataset "DATASET_NAME" \ 22 | --variantset "VARIANTSET_NAME" \ 23 | --destination-table "PROJECT_ID.DATASET_NAME.TABLE_NAME" \ 24 | --expand-wildcards 25 | """ 26 | 27 | import argparse 28 | import logging 29 | 30 | import vcf_to_bigquery_utils 31 | 32 | 33 | def _parse_arguments(): 34 | """Parses command line arguments. 35 | 36 | Returns: 37 | A Namespace of parsed arguments. 
38 | """ 39 | parser = argparse.ArgumentParser( 40 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 41 | parser.add_argument( 42 | "--source-vcf", 43 | nargs="+", 44 | required=True, 45 | help=("Cloud Storage path[s] to [gzip-compressed] VCF file[s]," 46 | " wildcards accepted (* but not **).")) 47 | parser.add_argument( 48 | "--project", 49 | required=True, 50 | help="Cloud project for imported Google Genomics data.") 51 | parser.add_argument( 52 | "--dataset", 53 | required=True, 54 | help=("Google Genomics dataset name or id" 55 | " (existing datasets will be appended).")) 56 | parser.add_argument( 57 | "--variantset", 58 | required=True, 59 | help=("Google Genomics variant set name or id" 60 | " (existing targets will be appended).")) 61 | parser.add_argument( 62 | "--new-dataset", 63 | action="store_true", 64 | help="Create a new dataset, even if one with this name exists.") 65 | parser.add_argument( 66 | "--new-variantset", 67 | action="store_true", 68 | help="Create a new variant set, even if one with this name exists.") 69 | parser.add_argument( 70 | "--expand-wildcards", 71 | action="store_true", 72 | help="Expand wildcards in VCF paths and use parallel imports.") 73 | parser.add_argument( 74 | "--destination-table", 75 | required=True, 76 | help="Full path to destination BigQuery table " 77 | "(PROJECT_ID.DATASET_NAME.TABLE_NAME).") 78 | parser.add_argument( 79 | "--description", 80 | help="Description for destination BigQuery table.") 81 | 82 | return parser.parse_args() 83 | 84 | 85 | def main(): 86 | args = _parse_arguments() 87 | logging.basicConfig(level=logging.INFO) 88 | 89 | uploader = vcf_to_bigquery_utils.VcfUploader(args.project) 90 | uploader.upload_variants(dataset=args.dataset, 91 | variantset=args.variantset, 92 | source_vcfs=args.source_vcf, 93 | destination_table=args.destination_table, 94 | expand_wildcards=args.expand_wildcards, 95 | new_dataset=args.new_dataset, 96 | new_variantset=args.new_variantset, 97 | description=args.description) 98 | 99 | 100 | if __name__ == "__main__": 101 | main() 102 | -------------------------------------------------------------------------------- /curation/tables/launch_import_vcf_to_bigquery.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | # 17 | # Launch VCF importing code using parameter values set in environment variables. 18 | # ${SOURCE_VCFS} is a single environment variable that optionally refers to 19 | # multiple files, separated by whitespace and optionally quote-delimited. 20 | 21 | # TODO: Copy local ${SOURCE_VCFS} to Cloud Storage if they are remote (HTTP or 22 | # FTP) or local. Also uncompress input files for faster imports. 23 | 24 | # Handle quotes in VCF paths in original job array list with an eval. 
25 | eval source_vcfs_array=("${SOURCE_VCFS}") 26 | python /usr/local/bin/import_vcf_to_bigquery.py \ 27 | --source-vcf "${source_vcfs_array[@]}" \ 28 | --project "${PROJECT}" \ 29 | --dataset "${DATASET}" \ 30 | --variantset "${VARIANTSET}" \ 31 | --destination-table "${TABLE}" \ 32 | --description "${SOURCE_VCFS}" \ 33 | --expand-wildcards 34 | -------------------------------------------------------------------------------- /curation/tables/schema_update_utils.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Library to update a variants table schema with field descriptions. 15 | """ 16 | 17 | import glob 18 | import gzip 19 | import logging 20 | import re 21 | 22 | from gcloud import bigquery 23 | 24 | # If TensorFlow is installed, use its gfile library. 25 | try: 26 | from tensorflow import gfile 27 | except ImportError: 28 | logging.warning('TensorFlow not installed; VCF in Cloud Storage unsupported') 29 | 30 | 31 | # String length limit for BigQuery table and column descriptions. See: 32 | # https://cloud.google.com/bigquery/docs/reference/rest/v2/tables. 33 | _MAX_LENGTH = 1024 34 | _TRUNCATION_WARNING = 'Truncating %s to comply with BigQuery length limits' 35 | 36 | _FIXED_VARIANT_FIELDS = { 37 | 'reference_name': 38 | 'An identifier from the reference genome or an angle-bracketed ID ' 39 | 'string pointing to a contig in the assembly file.', 40 | 'start': 'The reference position, with the first base having position 0.', 41 | 'end': 'End position of the variant described in this record.', 42 | 'reference_bases': 43 | 'Each base must be one of A,C,G,T,N (case insensitive). Multiple ' 44 | 'bases are permitted. The value in the \'start\' field refers to the ' 45 | 'position of the first base in the string.', 46 | 'alternate_bases': 47 | 'List of alternate non-reference alleles.', 48 | 'variant_id': 'Google Genomics variant id.', 49 | 'quality': 'Phred-scaled quality score for the assertion made in ALT.', 50 | 'names': 'List of unique identifiers for the variant where available.', 51 | 'call': 'Per-sample measurements.', 52 | } 53 | 54 | _FIXED_CALL_FIELDS = { 55 | 'call_set_id': 56 | 'The id of the callset from which this data was exported from the ' 57 | 'Google Genomics Variants API.', 58 | 'call_set_name': 59 | 'Sample identifier from source data.', 60 | 'genotype': 61 | 'List of genotypes.', 62 | 'genotype_likelihood': 63 | 'List of genotype likelihoods.', 64 | 'phaseset': 65 | 'If this value is null, the data is unphased. 
Otherwise it is phased.', 66 | 'qual': 'Phred-scaled quality score for the assertion made in ALT.', 67 | } 68 | 69 | 70 | class Descriptions(object): 71 | """Encapsulate field descriptions as parsed from a VCF.""" 72 | 73 | def __init__(self): 74 | self.filter_description = None 75 | self.format_fields = {} 76 | self.info_fields = {} 77 | 78 | @staticmethod 79 | def _parse_filter_header(line_no, line): 80 | value = line.split('=', 1)[1] 81 | 82 | m = re.match(r'<ID=(.*),Description="(.*)">', value) 83 | if not m: 84 | raise ValueError('Failed to parse line %d: %s' % (line_no, line)) 85 | 86 | return {'id': m.group(1), 'description': m.group(2)} 87 | 88 | @staticmethod 89 | def _parse_format_or_info_header(line_no, line): 90 | value = line.split('=', 1)[1] 91 | 92 | m = re.match(r'<ID=(.*),Number=(.*),Type=(.*),Description="(.*)">', 93 | value) 94 | if not m: 95 | raise ValueError('Failed to parse line %d: %s' % (line_no, line)) 96 | 97 | return {'id': m.group(1), 'description': m.group(4)} 98 | 99 | def add_from_vcf(self, path): 100 | """Add descriptions from a VCF. 101 | 102 | Args: 103 | path: Path to local or remote (in Cloud Storage via a "gs://" path, if 104 | TensorFlow is installed) VCF file, optionally gzip-compressed 105 | (requires a ".gz" suffix). 106 | """ 107 | filter_desc = [] 108 | format_fields = {} 109 | info_fields = {} 110 | 111 | # Handle wildcards in the path by expanding and taking the first file. 112 | if path.startswith('gs://'): 113 | path = gfile.Glob(path)[0] 114 | f = gfile.Open(path) 115 | else: 116 | path = glob.glob(path)[0] 117 | f = open(path) 118 | 119 | # Handle gzipped VCF files. 120 | if path.endswith('.gz'): 121 | f = gzip.GzipFile(fileobj=f) 122 | 123 | line_no = 0 124 | for line in f: 125 | line_no += 1 126 | 127 | if line.startswith('##FORMAT='): 128 | header = self._parse_format_or_info_header(line_no, line) 129 | format_fields[header['id']] = header['description'] 130 | 131 | elif line.startswith('##INFO='): 132 | header = self._parse_format_or_info_header(line_no, line) 133 | info_fields[header['id']] = header['description'] 134 | 135 | elif line.startswith('##FILTER='): 136 | header = self._parse_filter_header(line_no, line) 137 | filter_desc.append(header) 138 | 139 | # Reached the end of the VCF header 140 | if line.startswith('#CHROM'): 141 | break 142 | 143 | # Update the member fields 144 | self.filter_description = '\n'.join( 145 | ['%s: %s' % (item['id'], item['description']) for item in filter_desc]) 146 | 147 | # If the filter description is too long, only include the field names. 148 | if len(self.filter_description) > _MAX_LENGTH: 149 | logging.warning(_TRUNCATION_WARNING, 'variant filter thresholds') 150 | self.filter_description = '\n'.join([item['id'] for item in filter_desc]) 151 | 152 | self.format_fields = format_fields 153 | self.info_fields = info_fields 154 | 155 | 156 | def tokenize_table_name(full_table_name): 157 | """Tokenize a BigQuery table_name. 158 | 159 | Splits a table name in the format of 'PROJECT_ID.DATASET_NAME.TABLE_NAME' to 160 | a tuple of three strings, in that order. PROJECT_ID may contain periods (for 161 | domain-scoped projects). 162 | 163 | Args: 164 | full_table_name: BigQuery table name, as PROJECT_ID.DATASET_NAME.TABLE_NAME. 165 | Returns: 166 | A tuple of project_id, dataset_name, and table_name. 167 | 168 | Raises: 169 | ValueError: If full_table_name cannot be parsed. 170 | """ 171 | delimiter = '.'
172 | tokenized_table = full_table_name.split(delimiter) 173 | if not tokenized_table or len(tokenized_table) < 3: 174 | raise ValueError('Table name must be of the form ' 175 | 'PROJECT_ID.DATASET_NAME.TABLE_NAME') 176 | # Handle project names with periods, e.g. domain.org:project_id. 177 | return (delimiter.join(tokenized_table[:-2]), 178 | tokenized_table[-2], 179 | tokenized_table[-1]) 180 | 181 | 182 | def update_table_schema(destination_table, source_vcf, description=None): 183 | """Updates a BigQuery table with the variants schema using a VCF header. 184 | 185 | Args: 186 | destination_table: BigQuery table name, PROJECT_ID.DATASET_NAME.TABLE_NAME. 187 | source_vcf: Path to local or remote (Cloud Storage) VCF or gzipped VCF file. 188 | description: Optional description for the BigQuery table. 189 | 190 | Raises: 191 | ValueError: If destination_table cannot be parsed. 192 | """ 193 | 194 | dest_table = tokenize_table_name(destination_table) 195 | dest_project_id, dest_dataset_name, dest_table_name = dest_table 196 | 197 | # Load the source VCF 198 | descriptions = Descriptions() 199 | descriptions.add_from_vcf(source_vcf) 200 | 201 | # Initialize the BQ client 202 | client = bigquery.Client(project=dest_project_id) 203 | 204 | # Load the destination table 205 | dest_dataset = client.dataset(dest_dataset_name) 206 | dest_dataset.reload() 207 | 208 | dest_table = dest_dataset.table(dest_table_name) 209 | dest_table.reload() 210 | 211 | if description is not None: 212 | dest_table.patch(description=description[:_MAX_LENGTH]) 213 | if len(description) > _MAX_LENGTH: 214 | logging.warning(_TRUNCATION_WARNING, 'table description') 215 | 216 | # Set the description on the variant fields and the call fields. 217 | # 218 | # The (non-fixed) variant field descriptions come from the ##INFO headers 219 | # The (non-fixed) call fields descriptions can come from the ##FORMAT headers 220 | # as well as the ##INFO headers. 
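  # For example (hypothetical header line, for illustration only), an input line
  #   ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
  # yields descriptions.info_fields['AF'] == 'Allele Frequency', which is then
  # attached below as the description of the BigQuery column named AF.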
221 | 222 | # Process variant fields 223 | call_field = None 224 | for field in dest_table.schema: 225 | if field.name.lower() in _FIXED_VARIANT_FIELDS: 226 | field.description = _FIXED_VARIANT_FIELDS[field.name.lower()] 227 | logging.debug('Variant(fixed): %s: %s', field.name, field.description) 228 | 229 | elif field.name in descriptions.info_fields: 230 | field.description = descriptions.info_fields[field.name] 231 | logging.debug('Variant(INFO) %s: %s', field.name, field.description) 232 | 233 | elif field.name.lower() == 'filter': 234 | field.description = descriptions.filter_description 235 | 236 | if field.name == 'call': 237 | call_field = field 238 | 239 | if field.description is not None and len(field.description) > _MAX_LENGTH: 240 | logging.warning(_TRUNCATION_WARNING, field.name) 241 | field.description = field.description[:_MAX_LENGTH] 242 | 243 | # Process call fields 244 | for field in call_field.fields: 245 | if field.name.lower() in _FIXED_CALL_FIELDS: 246 | field.description = _FIXED_CALL_FIELDS[field.name.lower()] 247 | logging.debug('Call(fixed): %s: %s', field.name, field.description) 248 | 249 | elif field.name in descriptions.format_fields: 250 | field.description = descriptions.format_fields[field.name] 251 | logging.debug('Call(FORMAT) %s: %s', field.name, field.description) 252 | 253 | elif field.name in descriptions.info_fields: 254 | field.description = descriptions.info_fields[field.name] 255 | logging.debug('Call(INFO) %s: %s', field.name, field.description) 256 | 257 | elif field.name.lower() == 'filter': 258 | field.description = descriptions.filter_description 259 | 260 | if field.description is not None and len(field.description) > _MAX_LENGTH: 261 | logging.warning(_TRUNCATION_WARNING, field.name) 262 | field.description = field.description[:_MAX_LENGTH] 263 | 264 | logging.info('Updating table %s', dest_table.path) 265 | dest_table.patch(schema=dest_table.schema) 266 | -------------------------------------------------------------------------------- /curation/tables/update_variants_schema.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Tool to update a variants table schema with field descriptions. 15 | """ 16 | 17 | import argparse 18 | 19 | import schema_update_utils 20 | 21 | 22 | def _parse_arguments(): 23 | """Parses command line arguments. 24 | 25 | Returns: 26 | A Namespace of parsed arguments. 
27 | """ 28 | parser = argparse.ArgumentParser( 29 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 30 | parser.add_argument( 31 | '--source-vcf', 32 | required=True, 33 | help='Path to local or remote (Cloud Storage) VCF or gzipped VCF file.') 34 | parser.add_argument( 35 | '--destination-table', 36 | required=True, 37 | help='Full path to destination table ' 38 | '(PROJECT_ID.DATASET_NAME.TABLE_NAME)') 39 | return parser.parse_args() 40 | 41 | 42 | def main(): 43 | args = _parse_arguments() 44 | 45 | schema_update_utils.update_table_schema(args.destination_table, 46 | args.source_vcf) 47 | 48 | 49 | if __name__ == '__main__': 50 | main() 51 | -------------------------------------------------------------------------------- /curation/tables/vcf_manifest.tsv: -------------------------------------------------------------------------------- 1 | PROJECT DATASET VARIANTSET TABLE SOURCE_VCFS ORIGINAL_SOURCE_VCFS 2 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME dbSNP_hg38_20170710 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.dbSNP_hg38_20170710 gs://YOUR_BUCKET/mirror/ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/All_20170710.vcf http://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/All_20170710.vcf.gz 3 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME dbSNP_hg19_20170710 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.dbSNP_hg19_20170710 gs://YOUR_BUCKET/mirror/ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf http://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz 4 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME clinvar_hg38_20170705 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.clinvar_hg38_20170705 gs://YOUR_BUCKET/mirror/ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh38/archive/2017/clinvar_20170705.vcf.gz http://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh38/archive/2017/clinvar_20170705.vcf.gz 5 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME clinvar_hg19_20170705 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.clinvar_hg19_20170705 gs://YOUR_BUCKET/mirror/ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh37/archive/2017/clinvar_20170705.vcf.gz http://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh37/archive/2017/clinvar_20170705.vcf.gz 6 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME 1000genomes_phase3_hg38_release89 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.1000genomes_phase3_hg38_release89 gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz http://ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz 7 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME 1000genomes_phase3_hg19_release89 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.1000genomes_phase3_hg19_release89 gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz http://ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz 8 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME ESP6500_AA_hg38_release89 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.ESP6500_AA_hg38_release89 gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/ESP6500-African_American.vcf.gz http://ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/ESP6500-African_American.vcf.gz 9 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME ESP6500_AA_hg19_release89 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.ESP6500_AA_hg19_release89 gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/ESP6500-African_American.vcf.gz 
http://ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/ESP6500-African_American.vcf.gz 10 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME ESP6500_EA_hg38_release89 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.ESP6500_EA_hg38_release89 gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/ESP6500-European_American.vcf.gz http://ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/ESP6500-European_American.vcf.gz 11 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME ESP6500_EA_hg19_release89 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.ESP6500_EA_hg19_release89 gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/ESP6500-European_American.vcf.gz http://ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/ESP6500-European_American.vcf.gz 12 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME ExAC_hg19_release1 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.ExAC_hg19_release1 gs://gnomad-public/legacy/exacv1_downloads/release1/ExAC.r1.sites.vep.vcf.gz gs://gnomad-public/legacy/exacv1_downloads/release1/ExAC.r1.sites.vep.vcf.gz 13 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME gnomAD_genomes_hg19_release170228 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.gnomAD_genomes_hg19_release170228 gs://gnomad-public/release-170228/vcf/genomes/*.sites.?.vcf.gz gs://gnomad-public/release-170228/vcf/genomes/*.sites.??.vcf.gz gs://gnomad-public/release-170228/vcf/genomes/*.sites.?.vcf.gz gs://gnomad-public/release-170228/vcf/genomes/*.sites.??.vcf.gz 14 | YOUR_PROJECT_ID YOUR_VARIANTSET_NAME gnomAD_exomes_hg19_release170228 YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.gnomAD_exomes_hg19_release170228 gs://gnomad-public/release-170228/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz gs://gnomad-public/release-170228/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz 15 | -------------------------------------------------------------------------------- /curation/tables/vcf_to_bigquery_utils.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Library to upload VCF files to Google Genomics and BigQuery. 15 | """ 16 | 17 | import logging 18 | import time 19 | 20 | from apiclient import discovery 21 | from oauth2client.client import GoogleCredentials 22 | from retrying import retry 23 | 24 | # Use tensorflow.gfile library, if available, to expand wildcards (optional). 25 | try: 26 | from tensorflow import gfile 27 | except ImportError: 28 | gfile = None 29 | 30 | import schema_update_utils 31 | 32 | class VcfUploader(object): 33 | """Class for managing a Google Genomics API connection and data transfers. 34 | 35 | Handles finding and creating variant sets and datasets and uploading and 36 | exporting variants stored in VCF. The main entry point is 37 | upload_variants(...), but other intermediate pipeline steps may also be used. 
38 | """ 39 | 40 | def __init__(self, project, credentials=None): 41 | """Create VcfUploader class. 42 | 43 | Args: 44 | project: Cloud project to use for Genomics objects. 45 | credentials: Credentials object to use, get_application_default() if None. 46 | """ 47 | if credentials is None: 48 | credentials = GoogleCredentials.get_application_default() 49 | self.project = project 50 | self.service = discovery.build("genomics", "v1", credentials=credentials) 51 | 52 | @staticmethod 53 | def find_id_or_name(name, candidates): 54 | """Find a value linked as "id" or "name" in a collection of dicts. 55 | 56 | Args: 57 | name: string to search for in "id" and "name" fields. 58 | candidates: collection of dicts that should have "id" and "name" keys. 59 | 60 | Returns: 61 | choice["id"] for the unique matching choice (matched by "name" or "id"). 62 | Returns None if no matching choice is found. 63 | 64 | Raises: 65 | LookupError: If multiple items match the targeted name. 66 | """ 67 | target_id = None 68 | 69 | for choice in candidates: 70 | if choice.get("id") == name or choice.get("name") == name: 71 | if target_id is not None: 72 | raise LookupError("Found multiple hits for requested name") 73 | target_id = choice["id"] 74 | 75 | return target_id 76 | 77 | def find_or_create_dataset(self, 78 | dataset_name, 79 | always_create=False): 80 | """Finds or creates a Google Genomics dataset by name or id. 81 | 82 | If an existing dataset in the project has a name or ID of dataset_name, it 83 | will be reused and its id will be returned, unless always_create is True. 84 | A new dataset will be created if an existing one is not found. 85 | 86 | Args: 87 | dataset_name: Name or id of existing dataset, or name for a new dataset. 88 | always_create: Always create a new dataset with the requested name. 89 | 90 | Returns: 91 | The id of the existing or newly-created Genomics dataset. 92 | """ 93 | request = self.service.datasets().list(projectId=self.project) 94 | response = request.execute() 95 | 96 | dataset_id = self.find_id_or_name(dataset_name, 97 | response["datasets"]) 98 | 99 | if dataset_id is None or always_create: 100 | request = self.service.datasets().create( 101 | body={"name": dataset_name, 102 | "projectId": self.project}) 103 | response = request.execute() 104 | dataset_id = response["id"] 105 | 106 | return dataset_id 107 | 108 | def find_or_create_variantset(self, 109 | variantset_name, 110 | dataset_id, 111 | description="", 112 | always_create=False): 113 | """Finds or creates a Google Genomics variant set by name or id. 114 | 115 | If an existing variant set in the project has a name or ID of 116 | variantset_name, it will be reused and its id will be returned, unless 117 | always_create is True. A new variant set will be created if an existing 118 | one is not found. 119 | 120 | Args: 121 | variantset_name: Name or id of existing variant set, or name for a new 122 | variant set. 123 | dataset_id: Id of the dataset to find or create the variant set. 124 | description: The description for the variant set. 125 | always_create: Always create a new variant set with the requested name. 126 | 127 | Returns: 128 | The id of the existing or newly-created Genomics variant set. 
129 | """ 130 | request = self.service.variantsets().search( 131 | body={"datasetIds": dataset_id}) 132 | response = request.execute() 133 | 134 | variantset_id = self.find_id_or_name(variantset_name, 135 | response["variantSets"]) 136 | 137 | if variantset_id is None or always_create: 138 | request = self.service.variantsets().create( 139 | body={"name": variantset_name, 140 | "datasetId": dataset_id, 141 | "description": description, 142 | }) 143 | response = request.execute() 144 | variantset_id = response["id"] 145 | return variantset_id 146 | 147 | def import_variants(self, source_uris, variantset_id): 148 | """Imports variants stored in a VCF file on Cloud Storage to a variant set. 149 | 150 | Args: 151 | source_uris: List of paths to VCF file[s] in Cloud Storage, wildcards 152 | accepted (*, not **). 153 | variantset_id: Id of the variant set to load the variants. 154 | 155 | Returns: 156 | The name of the loading operation. 157 | """ 158 | request = self.service.variants().import_( 159 | body={"variantSetId": variantset_id, 160 | "sourceUris": source_uris}) 161 | response = request.execute() 162 | return response["name"] 163 | 164 | # Handle transient HTTP errors by retrying several times before giving up. 165 | # Works around race conditions that arise when the operation ID is not 166 | # found, which yields a 404 error. 167 | @retry(stop_max_attempt_number=10, wait_exponential_multiplier=2000) 168 | def wait_for_operation(self, operation_id, wait_seconds=30): 169 | """Blocks until the Genomics operation completes. 170 | 171 | Args: 172 | operation_id: The name (id string) of the loading operation. 173 | wait_seconds: Number of seconds to wait between polling attempts. 174 | 175 | Returns: 176 | True if the operation succeeded, False otherwise. 177 | """ 178 | request = self.service.operations().get(name=operation_id) 179 | while not request.execute()["done"]: 180 | time.sleep(wait_seconds) 181 | 182 | # If the operation succeeded, there will be a "response" field and not an 183 | # "error" field, see: 184 | # https://cloud.google.com/genomics/reference/rest/Shared.Types/ListOperationsResponse#Operation 185 | response = request.execute() 186 | return "response" in response and "error" not in response 187 | 188 | def export_variants(self, variantset_id, destination_table): 189 | """Exports variants from Google Genomics to BigQuery. 190 | 191 | Per the Genomics API, this will overwrite any existing BigQuery table with 192 | this name. 193 | 194 | Args: 195 | variantset_id: Id of the variant set to export. 196 | destination_table: BigQuery output, as PROJECT_ID.DATASET_NAME.TABLE_NAME. 197 | 198 | Returns: 199 | The name of the export operation. 200 | """ 201 | tokenized_table = schema_update_utils.tokenize_table_name(destination_table) 202 | bigquery_project_id, dataset_name, table_name = tokenized_table 203 | 204 | request = self.service.variantsets().export( 205 | variantSetId=variantset_id, 206 | body={"projectId": bigquery_project_id, 207 | "bigqueryDataset": dataset_name, 208 | "bigqueryTable": table_name}) 209 | response = request.execute() 210 | return response["name"] 211 | 212 | def upload_variants(self, 213 | dataset, 214 | variantset, 215 | source_vcfs, 216 | destination_table, 217 | expand_wildcards=False, 218 | new_dataset=False, 219 | new_variantset=False, 220 | description=None): 221 | """Imports variants stored in a VCF in Cloud Storage to BigQuery. 222 | 223 | Handle all intermediate steps, including finding dataset and variant sets. 
224 | 225 | Args: 226 | dataset: Name or id of existing dataset, or name for a new dataset. 227 | variantset: Name or id of existing variant set, or name for a new one. 228 | source_vcfs: List of VCF file[s] in Cloud Storage, wildcards accepted 229 | (*, not **). 230 | destination_table: BigQuery output, as PROJECT_ID.DATASET_NAME.TABLE_NAME. 231 | expand_wildcards: Expand wildcards in VCF paths and use parallel imports. 232 | new_dataset: Always create a new dataset with the requested name. 233 | new_variantset: Always create a new variant set with the requested name. 234 | description: Optional description for the BigQuery table. 235 | 236 | Raises: 237 | RuntimeError: If an upload or export request does not succeed. 238 | """ 239 | 240 | dataset_id = self.find_or_create_dataset(dataset, 241 | always_create=new_dataset) 242 | 243 | variantset_id = self.find_or_create_variantset( 244 | variantset, 245 | dataset_id, 246 | description="\t".join(source_vcfs), 247 | always_create=new_variantset) 248 | 249 | # Spawn off parallel imports for each VCF. 250 | if expand_wildcards and gfile is not None: 251 | # Expand any wildcarded paths and concatenate all files together. 252 | source_vcfs = sum([gfile.Glob(source_vcf) for source_vcf in source_vcfs], 253 | []) 254 | 255 | operation_ids = [] 256 | for source_vcf in source_vcfs: 257 | operation_ids.append(self.import_variants(source_vcf, variantset_id)) 258 | logging.info("Importing %s (%s)", source_vcf, operation_ids[-1]) 259 | 260 | # Wait for all imports to complete successfully before exporting variantset. 261 | for operation_id in operation_ids: 262 | if not self.wait_for_operation(operation_id): 263 | raise RuntimeError("Failed to import variants to Genomics (%s)" 264 | % operation_id) 265 | 266 | operation_id = self.export_variants(variantset_id, destination_table) 267 | logging.info("Exporting %s (%s)", variantset, operation_id) 268 | 269 | if not self.wait_for_operation(operation_id): 270 | raise RuntimeError("Failed to export variants to BigQuery (%s)" 271 | % operation_id) 272 | 273 | # Assume the VCF header is the same for all files and so just use the first. 274 | logging.info("Updating schema for %s", variantset) 275 | schema_update_utils.update_table_schema(destination_table, 276 | source_vcfs[0], 277 | description=description) 278 | -------------------------------------------------------------------------------- /interactive/InteractiveVariantAnnotation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Interactive Variant Annotation\n", 8 | "\n", 9 | "The following query retrieves variants from [DeepVariant-called Platinum Genomes](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/platinum_genomes_deepvariant.html) and interactively JOINs them with [ClinVar](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/clinvar_annotations.html). 
\n", 10 | "\n", 11 | "To run this on your own table of variants, change the table name and call_set_name in the `sample_variants` sub query below.\n", 12 | "\n", 13 | "For an ongoing investigation, you may wish to repeat this query each time a new version of ClinVar is released and [loaded into BigQuery](https://github.com/verilylifesciences/variant-annotation/tree/master/curation/tables/README.md) by changing the table name in the `rare_pathenogenic_variants` sub query.\n", 14 | "\n", 15 | "See also similar examples for GRCh37 in https://github.com/googlegenomics/bigquery-examples/tree/master/platinumGenomes " 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": { 22 | "collapsed": false 23 | }, 24 | "outputs": [ 25 | { 26 | "data": { 27 | "text/html": [ 28 | "\n", 29 | "
[Rendered HTML preview of the query results omitted: 25 preview rows with columns chr, start, reference_bases, alt, call_set_name, CLNHGVS, CLNALLE, CLNSRC, CLNORIGIN, CLNSRCID, CLNSIG, CLNDSDB, CLNDSDBID, CLNDBN, CLNREVSTAT, CLNACC.]\n", 30 | "
(rows: 63, time: 5.0s, 10GB processed, job: job_P6NRU_M3B1MeX_TpuZdGC9QTWZwp)
\n", 31 | " \n", 75 | " " 76 | ], 77 | "text/plain": [ 78 | "QueryResultsTable job_P6NRU_M3B1MeX_TpuZdGC9QTWZwp" 79 | ] 80 | }, 81 | "execution_count": 1, 82 | "metadata": {}, 83 | "output_type": "execute_result" 84 | } 85 | ], 86 | "source": [ 87 | "%%bq query\n", 88 | "#standardSQL\n", 89 | " --\n", 90 | " -- Return variants for sample NA12878 that are:\n", 91 | " -- annotated as 'pathogenic' or 'other' in ClinVar\n", 92 | " -- with observed population frequency less than 5%\n", 93 | " --\n", 94 | " WITH sample_variants AS (\n", 95 | " SELECT\n", 96 | " -- Remove the 'chr' prefix from the reference name.\n", 97 | " REGEXP_EXTRACT(reference_name, r'chr(.+)') AS chr,\n", 98 | " start,\n", 99 | " reference_bases,\n", 100 | " alt,\n", 101 | " call.call_set_name\n", 102 | " FROM\n", 103 | " `genomics-public-data.platinum_genomes_deepvariant.single_sample_genome_calls` v,\n", 104 | " v.call call,\n", 105 | " v.alternate_bases alt WITH OFFSET alt_offset\n", 106 | " WHERE\n", 107 | " call_set_name = 'NA12878_ERR194147'\n", 108 | " -- Require that at least one genotype matches this alternate.\n", 109 | " AND EXISTS (SELECT gt FROM UNNEST(call.genotype) gt WHERE gt = alt_offset+1)\n", 110 | " ),\n", 111 | " --\n", 112 | " --\n", 113 | " rare_pathenogenic_variants AS (\n", 114 | " SELECT\n", 115 | " -- ClinVar does not use the 'chr' prefix for reference names.\n", 116 | " reference_name AS chr,\n", 117 | " start,\n", 118 | " reference_bases,\n", 119 | " alt,\n", 120 | " CLNHGVS,\n", 121 | " CLNALLE,\n", 122 | " CLNSRC,\n", 123 | " CLNORIGIN,\n", 124 | " CLNSRCID,\n", 125 | " CLNSIG,\n", 126 | " CLNDSDB,\n", 127 | " CLNDSDBID,\n", 128 | " CLNDBN,\n", 129 | " CLNREVSTAT,\n", 130 | " CLNACC\n", 131 | " FROM\n", 132 | " `bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20170705` v,\n", 133 | " v.alternate_bases alt\n", 134 | " WHERE\n", 135 | " -- Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided,\n", 136 | " -- 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - Pathogenic,\n", 137 | " -- 6 - drug response, 7 - histocompatibility, 255 - other\n", 138 | " EXISTS (SELECT sig FROM UNNEST(CLNSIG) sig WHERE REGEXP_CONTAINS(sig, '(4|5|255)'))\n", 139 | " -- TRUE if >5% minor allele frequency in 1+ populations\n", 140 | " AND G5 IS NULL\n", 141 | ")\n", 142 | " --\n", 143 | " --\n", 144 | "SELECT\n", 145 | " *\n", 146 | "FROM\n", 147 | " sample_variants\n", 148 | "JOIN\n", 149 | " rare_pathenogenic_variants USING(chr,\n", 150 | " start,\n", 151 | " reference_bases,\n", 152 | " alt)\n", 153 | "ORDER BY\n", 154 | " chr,\n", 155 | " start,\n", 156 | " reference_bases,\n", 157 | " alt" 158 | ] 159 | } 160 | ], 161 | "metadata": { 162 | "kernelspec": { 163 | "display_name": "Python 2", 164 | "language": "python", 165 | "name": "python2" 166 | }, 167 | "language_info": { 168 | "codemirror_mode": { 169 | "name": "ipython", 170 | "version": 2 171 | }, 172 | "file_extension": ".py", 173 | "mimetype": "text/x-python", 174 | "name": "python", 175 | "nbconvert_exporter": "python", 176 | "pygments_lexer": "ipython2", 177 | "version": "2.7.12" 178 | } 179 | }, 180 | "nbformat": 4, 181 | "nbformat_minor": 0 182 | } 183 | -------------------------------------------------------------------------------- /interactive/README.md: -------------------------------------------------------------------------------- 1 | **WARNING: Not actively maintained!** 2 | 3 | Interactive Variant Annotation 4 | ============================== 5 | 6 | Given a particular set of variants for an individual 
or a cohort, the code here 7 | will allow you to interactively annotate the sequence variants 8 | using 9 | [annotation resources available in BigQuery](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/annotations_toc.html). Note 10 | that if there is a newer version of the annotation resource that you wish to 11 | use, [you can load it into BigQuery](../curation/tables). 12 | 13 | ## Status of this sub-project 14 | 15 | There is only one example here at the moment but see also similar work: 16 | 17 | * http://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/COSMIC.html 18 | * https://github.com/googlegenomics/bigquery-examples/tree/master/platinumGenomes 19 | * http://googlegenomics.readthedocs.io/en/latest/use_cases/annotate_variants/interval_joins.html 20 | 21 | TODO: add more example queries, Datalab notebooks and RMarkdown. 22 | 23 | ## Examples 24 | 25 | ### [Datalab](https://cloud.google.com/datalab/) Notebook Examples 26 | 27 | 1. Notebook [InteractiveVariantAnnotation.ipynb](./InteractiveVariantAnnotation.ipynb) will return variants for sample NA12878 that are: 28 | * annotated as 'pathogenic' or 'other' in ClinVar 29 | * with observed population frequency less than 5% 30 | 31 | --------------------------------------------------------------------------------
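The notebook above runs its query through Datalab's `%%bq` magic; the same kind of annotation query can also be issued from a plain Python script. The sketch below is a minimal illustration, assuming the `google-cloud-bigquery` client library (not used elsewhere in this repository) and application-default credentials; `YOUR_PROJECT_ID` is a placeholder.

```python
# Minimal sketch, not part of the repository: query a public ClinVar annotation
# table with the google-cloud-bigquery client (assumed installed) and print a
# few sites flagged as pathogenic, likely pathogenic, or "other".
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

query = """
SELECT
  reference_name, start, reference_bases, alt
FROM
  `bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20170705` v,
  v.alternate_bases alt
WHERE
  -- Keep entries whose clinical significance includes likely pathogenic (4),
  -- pathogenic (5), or other (255), as in the notebook above.
  EXISTS (SELECT sig FROM UNNEST(CLNSIG) sig WHERE REGEXP_CONTAINS(sig, '(4|5|255)'))
LIMIT 10
"""

for row in client.query(query).result():
    print(row.reference_name, row.start, row.reference_bases, row.alt)
```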