├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── batch
    ├── README.md
    ├── build_annotator
    │   ├── Dockerfile.dbNSFP
    │   ├── Dockerfile.vep
    │   ├── README.md
    │   ├── build_databases.sh
    │   ├── dbNSFP_container.yaml
    │   ├── download_dbNSFP.sh
    │   └── vep_container.yaml
    ├── run_annotator
    │   ├── README.md
    │   ├── run_vep_remote.sh
    │   ├── vep_into_bigquery_for_docker.sh
    │   └── vep_schema.json
    └── vep
    │   ├── Dockerfile
    │   ├── README.md
    │   ├── build_vep_cache.sh
    │   ├── run_script_with_watchdog.sh
    │   ├── run_vep.sh
    │   └── sample_pipeline.yaml
├── curation
    ├── README.md
    ├── allPossibleSNPs
    │   ├── ESP_AA.sql
    │   ├── ESP_EA.sql
    │   ├── README.md
    │   ├── all_possible_snps.sql
    │   ├── check_joined_annotations.sql
    │   ├── clinvar.sql
    │   ├── dbSNP.sql
    │   ├── fasta_to_kv.py
    │   ├── join_annotations.sql
    │   ├── render_templated_sql.py
    │   └── thousandGenomes.sql
    └── tables
    │   ├── AddBigQueryDescriptions.md
    │   ├── Dockerfile
    │   ├── README.md
    │   ├── import_vcf_to_bigquery.py
    │   ├── launch_import_vcf_to_bigquery.sh
    │   ├── schema_update_utils.py
    │   ├── update_variants_schema.py
    │   ├── vcf_manifest.tsv
    │   └── vcf_to_bigquery_utils.py
└── interactive
    ├── InteractiveVariantAnnotation.ipynb
    └── README.md


/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | Want to contribute? Great! First, read this page (including the small print at the end).
 2 | 
 3 | ### Before you contribute
 4 | Before we can use your code, you must sign
 5 | the
 6 | [Google Individual Contributor License Agreement](https://cla.developers.google.com/about/google-individual) (CLA),
 7 | which you can do online. The CLA is necessary mainly because you own the
 8 | copyright to your changes, even after your contribution becomes part of our
 9 | codebase, so we need your permission to use and distribute your code. We also
10 | need to be sure of various other things—for instance that you'll tell us if you
11 | know that your code infringes on other people's patents. You don't have to sign
12 | the CLA until after you've submitted your code for review and a member has
13 | approved it, but you must do it before we can put your code into our codebase.
14 | Before you start working on a larger contribution, you should get in touch with
15 | us first through the issue tracker with your idea so that we can help out and
16 | possibly guide you. Coordinating up front makes it much easier to avoid
17 | frustration later on.
18 | 
19 | ### Code reviews
20 | All submissions, including submissions by project members, require review. We
21 | use GitHub pull requests for this purpose.
22 | 
23 | ### The small print
24 | Contributions made by corporations are covered by a different agreement than the
25 | one above,
26 | the
27 | [Software Grant and Corporate Contributor License Agreement](https://cla.developers.google.com/about/google-corporate).
28 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 | 
  2 |                                  Apache License
  3 |                            Version 2.0, January 2004
  4 |                         http://www.apache.org/licenses/
  5 | 
  6 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  7 | 
  8 |    1. Definitions.
  9 | 
 10 |       "License" shall mean the terms and conditions for use, reproduction,
 11 |       and distribution as defined by Sections 1 through 9 of this document.
 12 | 
 13 |       "Licensor" shall mean the copyright owner or entity authorized by
 14 |       the copyright owner that is granting the License.
 15 | 
 16 |       "Legal Entity" shall mean the union of the acting entity and all
 17 |       other entities that control, are controlled by, or are under common
 18 |       control with that entity. For the purposes of this definition,
 19 |       "control" means (i) the power, direct or indirect, to cause the
 20 |       direction or management of such entity, whether by contract or
 21 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 22 |       outstanding shares, or (iii) beneficial ownership of such entity.
 23 | 
 24 |       "You" (or "Your") shall mean an individual or Legal Entity
 25 |       exercising permissions granted by this License.
 26 | 
 27 |       "Source" form shall mean the preferred form for making modifications,
 28 |       including but not limited to software source code, documentation
 29 |       source, and configuration files.
 30 | 
 31 |       "Object" form shall mean any form resulting from mechanical
 32 |       transformation or translation of a Source form, including but
 33 |       not limited to compiled object code, generated documentation,
 34 |       and conversions to other media types.
 35 | 
 36 |       "Work" shall mean the work of authorship, whether in Source or
 37 |       Object form, made available under the License, as indicated by a
 38 |       copyright notice that is included in or attached to the work
 39 |       (an example is provided in the Appendix below).
 40 | 
 41 |       "Derivative Works" shall mean any work, whether in Source or Object
 42 |       form, that is based on (or derived from) the Work and for which the
 43 |       editorial revisions, annotations, elaborations, or other modifications
 44 |       represent, as a whole, an original work of authorship. For the purposes
 45 |       of this License, Derivative Works shall not include works that remain
 46 |       separable from, or merely link (or bind by name) to the interfaces of,
 47 |       the Work and Derivative Works thereof.
 48 | 
 49 |       "Contribution" shall mean any work of authorship, including
 50 |       the original version of the Work and any modifications or additions
 51 |       to that Work or Derivative Works thereof, that is intentionally
 52 |       submitted to Licensor for inclusion in the Work by the copyright owner
 53 |       or by an individual or Legal Entity authorized to submit on behalf of
 54 |       the copyright owner. For the purposes of this definition, "submitted"
 55 |       means any form of electronic, verbal, or written communication sent
 56 |       to the Licensor or its representatives, including but not limited to
 57 |       communication on electronic mailing lists, source code control systems,
 58 |       and issue tracking systems that are managed by, or on behalf of, the
 59 |       Licensor for the purpose of discussing and improving the Work, but
 60 |       excluding communication that is conspicuously marked or otherwise
 61 |       designated in writing by the copyright owner as "Not a Contribution."
 62 | 
 63 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 64 |       on behalf of whom a Contribution has been received by Licensor and
 65 |       subsequently incorporated within the Work.
 66 | 
 67 |    2. Grant of Copyright License. Subject to the terms and conditions of
 68 |       this License, each Contributor hereby grants to You a perpetual,
 69 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 70 |       copyright license to reproduce, prepare Derivative Works of,
 71 |       publicly display, publicly perform, sublicense, and distribute the
 72 |       Work and such Derivative Works in Source or Object form.
 73 | 
 74 |    3. Grant of Patent License. Subject to the terms and conditions of
 75 |       this License, each Contributor hereby grants to You a perpetual,
 76 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 77 |       (except as stated in this section) patent license to make, have made,
 78 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 79 |       where such license applies only to those patent claims licensable
 80 |       by such Contributor that are necessarily infringed by their
 81 |       Contribution(s) alone or by combination of their Contribution(s)
 82 |       with the Work to which such Contribution(s) was submitted. If You
 83 |       institute patent litigation against any entity (including a
 84 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 85 |       or a Contribution incorporated within the Work constitutes direct
 86 |       or contributory patent infringement, then any patent licenses
 87 |       granted to You under this License for that Work shall terminate
 88 |       as of the date such litigation is filed.
 89 | 
 90 |    4. Redistribution. You may reproduce and distribute copies of the
 91 |       Work or Derivative Works thereof in any medium, with or without
 92 |       modifications, and in Source or Object form, provided that You
 93 |       meet the following conditions:
 94 | 
 95 |       (a) You must give any other recipients of the Work or
 96 |           Derivative Works a copy of this License; and
 97 | 
 98 |       (b) You must cause any modified files to carry prominent notices
 99 |           stating that You changed the files; and
100 | 
101 |       (c) You must retain, in the Source form of any Derivative Works
102 |           that You distribute, all copyright, patent, trademark, and
103 |           attribution notices from the Source form of the Work,
104 |           excluding those notices that do not pertain to any part of
105 |           the Derivative Works; and
106 | 
107 |       (d) If the Work includes a "NOTICE" text file as part of its
108 |           distribution, then any Derivative Works that You distribute must
109 |           include a readable copy of the attribution notices contained
110 |           within such NOTICE file, excluding those notices that do not
111 |           pertain to any part of the Derivative Works, in at least one
112 |           of the following places: within a NOTICE text file distributed
113 |           as part of the Derivative Works; within the Source form or
114 |           documentation, if provided along with the Derivative Works; or,
115 |           within a display generated by the Derivative Works, if and
116 |           wherever such third-party notices normally appear. The contents
117 |           of the NOTICE file are for informational purposes only and
118 |           do not modify the License. You may add Your own attribution
119 |           notices within Derivative Works that You distribute, alongside
120 |           or as an addendum to the NOTICE text from the Work, provided
121 |           that such additional attribution notices cannot be construed
122 |           as modifying the License.
123 | 
124 |       You may add Your own copyright statement to Your modifications and
125 |       may provide additional or different license terms and conditions
126 |       for use, reproduction, or distribution of Your modifications, or
127 |       for any such Derivative Works as a whole, provided Your use,
128 |       reproduction, and distribution of the Work otherwise complies with
129 |       the conditions stated in this License.
130 | 
131 |    5. Submission of Contributions. Unless You explicitly state otherwise,
132 |       any Contribution intentionally submitted for inclusion in the Work
133 |       by You to the Licensor shall be under the terms and conditions of
134 |       this License, without any additional terms or conditions.
135 |       Notwithstanding the above, nothing herein shall supersede or modify
136 |       the terms of any separate license agreement you may have executed
137 |       with Licensor regarding such Contributions.
138 | 
139 |    6. Trademarks. This License does not grant permission to use the trade
140 |       names, trademarks, service marks, or product names of the Licensor,
141 |       except as required for reasonable and customary use in describing the
142 |       origin of the Work and reproducing the content of the NOTICE file.
143 | 
144 |    7. Disclaimer of Warranty. Unless required by applicable law or
145 |       agreed to in writing, Licensor provides the Work (and each
146 |       Contributor provides its Contributions) on an "AS IS" BASIS,
147 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148 |       implied, including, without limitation, any warranties or conditions
149 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150 |       PARTICULAR PURPOSE. You are solely responsible for determining the
151 |       appropriateness of using or redistributing the Work and assume any
152 |       risks associated with Your exercise of permissions under this License.
153 | 
154 |    8. Limitation of Liability. In no event and under no legal theory,
155 |       whether in tort (including negligence), contract, or otherwise,
156 |       unless required by applicable law (such as deliberate and grossly
157 |       negligent acts) or agreed to in writing, shall any Contributor be
158 |       liable to You for damages, including any direct, indirect, special,
159 |       incidental, or consequential damages of any character arising as a
160 |       result of this License or out of the use or inability to use the
161 |       Work (including but not limited to damages for loss of goodwill,
162 |       work stoppage, computer failure or malfunction, or any and all
163 |       other commercial damages or losses), even if such Contributor
164 |       has been advised of the possibility of such damages.
165 | 
166 |    9. Accepting Warranty or Additional Liability. While redistributing
167 |       the Work or Derivative Works thereof, You may choose to offer,
168 |       and charge a fee for, acceptance of support, warranty, indemnity,
169 |       or other liability obligations and/or rights consistent with this
170 |       License. However, in accepting such obligations, You may act only
171 |       on Your own behalf and on Your sole responsibility, not on behalf
172 |       of any other Contributor, and only if You agree to indemnify,
173 |       defend, and hold each Contributor harmless for any liability
174 |       incurred by, or claims asserted against, such Contributor by reason
175 |       of your accepting any such warranty or additional liability.
176 | 
177 |    END OF TERMS AND CONDITIONS
178 | 
179 |    APPENDIX: How to apply the Apache License to your work.
180 | 
181 |       To apply the Apache License to your work, attach the following
182 |       boilerplate notice, with the fields enclosed by brackets "[]"
183 |       replaced with your own identifying information. (Don't include
184 |       the brackets!)  The text should be enclosed in the appropriate
185 |       comment syntax for the file format. We also recommend that a
186 |       file or class name and description of purpose be included on the
187 |       same "printed page" as the copyright notice for easier
188 |       identification within third-party archives.
189 | 
190 |    Copyright [yyyy] [name of copyright owner]
191 | 
192 |    Licensed under the Apache License, Version 2.0 (the "License");
193 |    you may not use this file except in compliance with the License.
194 |    You may obtain a copy of the License at
195 | 
196 |        http://www.apache.org/licenses/LICENSE-2.0
197 | 
198 |    Unless required by applicable law or agreed to in writing, software
199 |    distributed under the License is distributed on an "AS IS" BASIS,
200 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201 |    See the License for the specific language governing permissions and
202 |    limitations under the License.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ### Disclaimer
 2 | 
 3 | This is a forked version of
 4 | [verilylifesciences/variant-annotation](
 5 | https://github.com/verilylifesciences/variant-annotation). The intention is to
 6 | gradually update various pieces of this repo for variant annotation needs we
 7 | have in [googlegenomics/gcp-variant-transforms](
 8 | https://github.com/googlegenomics/gcp-variant-transforms) but at the same time
 9 | provide annotation related tools/documentation that:
10 |   - can be used independently,
11 |   - are actively maintained with proper test harnesses.
12 | 
13 | Any README file that has the following warning message indicates a sub-directory
14 | that has not been updated yet:
15 | 
16 | **WARNING: Not actively maintained!**
17 | 
18 | This is the list of modules that *are* actively maintained:
19 | * `batch/vep`
20 | 
21 | 
22 | variant-annotation
23 | ==================
24 | 
25 | This repository contains code to annotate human sequence variants using
26 | cloud technology to perform analyses in parallel.
27 | 
28 | Sub-projects:
29 | 
30 | * [batch annotation](./batch) code for annotating a particular batch of variants
31 |   using annotation resources available at a particular point in time
32 | * [interactive annotation](./interactive) queries and code to annotate variants
33 |   interactively with new annotation resources as they become available
34 | * [annotation curation](./curation) code for ingesting and reformating raw
35 |   annotation resources for use in interactive annotation
36 | 
37 | The code in this repository is designed for use with genomic variants stored
38 | in [Google BigQuery](https://cloud.google.com/bigquery/) in a
39 | particular
40 | [variant table format](https://cloud.google.com/genomics/v1/bigquery-variants-schema).
41 | 
42 | Processing
43 | uses
44 | [Google Container Builder](https://cloud.google.com/container-builder/),
45 | [Docker](https://www.docker.com/),
46 | and [dsub](https://cloud.google.com/genomics/v1alpha2/dsub) for batch
47 | processing. We suggest working through the introductory materials for each tool
48 | before working with the code in this repository.
49 | 
50 | For interactive annotation, parallelism is accomplished due to the use of
51 | BigQuery.  For batch annotation, parallelism is accomplished due to the use of
52 | dsub to run annotation in parallel on small shards of the input file(s).
53 | 


--------------------------------------------------------------------------------
/batch/README.md:
--------------------------------------------------------------------------------
 1 | **WARNING: Not actively maintained!**
 2 | 
 3 | Batch Variant Annotation
 4 | ========================
 5 | 
 6 | Given a set of variants, the code here will allow you to annotate a batch of
 7 | variants using annotation resources available at a particular point in time.
 8 | (In comparison, using [interactive annotation](../interactive), variants can be
 9 | annotated on the fly with new annotation resources as they become available.)
10 | 
11 | This code uses
12 | Ensembl's
13 | [Variant Effect Prediction](
14 | http://www.ensembl.org/info/docs/tools/vep/index.html) (VEP)
15 | from McLaren et. al. 2016
16 | ([doi:10.1186/s13059-016-0974-4](
17 | https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0974-4))
18 | to annotate variants either in VCF files or in a BigQuery table.
19 | 
20 | To annotate VCF files, check the [documentation](./vep/README.md) in `vep`
21 | directory on how to build docker images containing VEP, how to create VEP cache
22 | for those images, and how to run VEP.
23 | 
24 | **WARNING: The build_annotator and run_annotator pieces below are not actively
25 | maintained in this repo yet!**
26 | 
27 | Annotating variants in BigQuery talbes is horizontally scalable due to the use
28 | of [dsub](https://cloud.google.com/genomics/v1alpha2/dsub). A separate instance
29 | of VEP is run by dsub for each shard of each of the files passed on the command
30 | line.  VEP is also configured to run with as many threads as the number of cores
31 | on the virtual machine instantiated by dsub.
32 | 
33 | ## Status of this sub-project
34 | 
35 | VEP can be configured in many ways and can use as input a large variety of
36 | annotation sources. This code illustrates one possible configuration and could
37 | be modified to accomodate other configurations.
38 | 
39 | All steps are run in the cloud, but each individual step is launched manually.
40 | 
41 | ## Overview
42 | 
43 | ### Build the annotator
44 | 
45 | The first step involves building the Docker container holding VEP and cached
46 | annotations for the desired build of the human genome reference.
47 | 
48 | A second container is built to curate and
49 | cache [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) in Cloud Storage.
50 | This is done because dbNSFP is quite a large annotation resource and therefore
51 | we choose not to add it to the same Docker container that includes VEP.
52 | 
53 | [Follow the tutorial](./build_annotator/README.md) to build the tools needed to
54 | annotate GRCh37 or GRCh38 of the human genome reference.
55 | 
56 | ### Run the annotator
57 | 
58 | After the annotator has been built for the desired build of the human reference
59 | genome, it can be used to annotate variants from a single genome or a cohort of
60 | genomes in a BigQuery
61 | [variant table](https://cloud.google.com/genomics/v1/bigquery-variants-schema).
62 | 
63 | [Follow the tutorial](./run_annotator/README.md) to
64 | annotate
65 | [Platinum Genomes variants called by DeepVariant](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/platinum_genomes_deepvariant.html) and
66 | aligned to build GRCh38.
67 | 


--------------------------------------------------------------------------------
/batch/build_annotator/Dockerfile.dbNSFP:
--------------------------------------------------------------------------------
 1 | # Start from this container so that gcloud and all its dependencies
 2 | # are already available.
 3 | FROM gcr.io/cloud-builders/gcloud
 4 | 
 5 | RUN apt-get -y update && apt-get install -y \
 6 |     apt-transport-https \
 7 |     build-essential \
 8 |     ca-certificates \
 9 |     curl \
10 |     tabix \
11 |     unzip \
12 |     wget \
13 |     zlib1g-dev
14 | 
15 | # Install newer version of bgzip so that the --threads option is available.
16 | RUN cd /opt && \
17 |     export VER=1.3.1 && \
18 |     export NAMEVER=htslib-${VER} && \
19 |     wget https://github.com/samtools/htslib/releases/download/${VER}/${NAMEVER}.tar.bz2 && \
20 |     tar xjf ${NAMEVER}.tar.bz2 && \
21 |     cd ${NAMEVER} && \
22 |     make -j$(nproc) && \
23 |     make install && \
24 |     ldconfig && \
25 |     rm -rf /opt/${NAMEVER}.tar.bz2 /opt/${NAMEVER}
26 | 
27 | COPY download_dbNSFP.sh /opt/download_dbNSFP.sh
28 | COPY build_databases.sh /opt/build_databases.sh
29 | 
30 | ENTRYPOINT ["bash"]
31 | 


--------------------------------------------------------------------------------
/batch/build_annotator/Dockerfile.vep:
--------------------------------------------------------------------------------
 1 | # Example:
 2 | #
 3 | # docker build \
 4 | #   --build-arg ENSEMBL_RELEASE=88 \
 5 | #   --build-arg GENOME_ASSEMBLY=GRCh38 \
 6 | #   --build-arg DBNSFP_BASE=gs://my-bucket/dbNSFP \
 7 | #   --tag vep_docker_test \
 8 | #   .
 9 | #
10 | # This requires the dbNSFP databases to be loaded to GCS via the
11 | # build_databases.sh script, for example:
12 | #
13 | # ./build_databases.sh gs://my-bucket/dbNSFP_GRCh38/ dbNSFPv3.4c.zip
14 | #
15 | # TODO: Determine an alternate strategy for the dbNSFP
16 | # dependency. This currently uses a hardcoded environment variable to
17 | # hold the cloud storage path to dbNSFP because it is too large to
18 | # include within the Docker image. The script that invokes VEP will
19 | # currently fail if dbNSFP is not available at the path indicated by
20 | # the environment variable.
21 | 
22 | # Start from this container so that gcloud and all its dependencies
23 | # are already available.
24 | FROM gcr.io/cloud-builders/gcloud
25 | 
26 | ARG ENSEMBL_RELEASE=88
27 | ARG GENOME_ASSEMBLY=GRCh38
28 | # No default value for this argument.  See build_databases.sh for code that
29 | # loads these tables.
30 | ARG DBNSFP_BASE
31 | 
32 | # Make this build argument available in the container as an environment
33 | # variable.
34 | ENV DBNSFP_BASE="${DBNSFP_BASE}"
35 | ENV GENOME_ASSEMBLY="${GENOME_ASSEMBLY}"
36 | ENV VEP_SPECIES="homo_sapiens"
37 | ENV VEP_BASE=/opt/variant_effect_predictor
38 | 
39 | RUN apt-get -y update && apt-get install -y \
40 |     build-essential \
41 |     curl \
42 |     gawk \
43 |     git \
44 |     libarchive-zip-perl \
45 |     libdbd-mysql-perl \
46 |     libdbi-perl \
47 |     libfile-copy-recursive-perl \
48 |     libhts0 \
49 |     libjson-perl \
50 |     libmodule-build-perl \
51 |     tabix \
52 |     unzip \
53 |     wget \
54 |     zlib1g-dev
55 | 
56 | # Install VEP per the instructions on
57 | # http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#installer
58 | RUN git clone https://github.com/Ensembl/ensembl-vep.git ${VEP_BASE}
59 | 
60 | WORKDIR ${VEP_BASE}
61 | 
62 | RUN git checkout release/${ENSEMBL_RELEASE}
63 | 
64 | # Download the cache database separately.  Downloading via vep installation
65 | # option -c results in timeout errors.
66 | RUN mkdir -p $HOME/.vep && \
67 |     cd $HOME/.vep && \
68 |     curl -O "ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/variation/VEP/${VEP_SPECIES}_vep_${ENSEMBL_RELEASE}_${GENOME_ASSEMBLY}.tar.gz" && \
69 |     tar xzf "${VEP_SPECIES}_vep_${ENSEMBL_RELEASE}_${GENOME_ASSEMBLY}.tar.gz"
70 | 
71 | RUN perl INSTALL.pl \
72 |     --AUTO afl \
73 |     --SPECIES "${VEP_SPECIES}" \
74 |     --ASSEMBLY "${GENOME_ASSEMBLY}"
75 | 
76 | # Configure Condel plugin.
77 | RUN curl -Lk \
78 |   "https://github.com/Ensembl/VEP_plugins/archive/release/${ENSEMBL_RELEASE}.tar.gz" | \
79 |   tar xz --strip-components=2 \
80 |     "VEP_plugins-release-${ENSEMBL_RELEASE}/config/Condel" && \
81 |   sed -i "s#path/to/config#${VEP_BASE}#" \
82 |     Condel/config/condel_SP.conf
83 | 
84 | # Install Condel plugin.
85 | RUN perl INSTALL.pl --AUTO p --PLUGINS Condel
86 | 
87 | # Install dbNSFP plugin.
88 | RUN perl INSTALL.pl --AUTO p --PLUGINS dbNSFP
89 | 
90 | ENTRYPOINT ["bash"]
91 | 


--------------------------------------------------------------------------------
/batch/build_annotator/README.md:
--------------------------------------------------------------------------------
  1 | **WARNING: Not actively maintained!**
  2 | 
  3 | Build the annotator
  4 | ===================
  5 | 
  6 | This tutorial builds the tools needed to annotate GRCh37 or GRCh38 of the human
  7 | genome reference.
  8 | 
  9 | [Container Builder](https://cloud.google.com/container-builder/docs/overview)
 10 | and [dsub](https://github.com/googlegenomics/dsub) are used to run all of these
 11 | steps in the cloud.
 12 | 
 13 | ## (1) Configure project variables.
 14 | 
 15 | Set a few environment variables to facilitate cutting and pasting the subsequent
 16 | commands.
 17 | 
 18 | ``` bash
 19 | # The Google Cloud Platform project id in which the Docker containers
 20 | # will be built and stored.
 21 | PROJECT_ID=your-project-id
 22 | # The bucket name (with the gs:// prefix) where the cached version of
 23 | # dbNSFP should be stored.
 24 | BUCKET=gs://your-bucket-name
 25 | ```
 26 | 
 27 | ## (2) Build the VEP Docker container.
 28 | 
 29 | Run one of the commands below to build a Docker container that is configured to
 30 | run VEP on human genetic variants in GRCh37 or GRCh38 coordinates with
 31 | annotations including dbNSFP, SIFT, and many others. Both of these commands can
 32 | be run in parallel if you wish to annotate using both reference genomes.
 33 | 
 34 | ### GRCh37
 35 | 
 36 | ``` bash
 37 | gcloud container builds submit \
 38 |     --substitutions=_GENOME_ASSEMBLY=GRCh37,_ENSEMBL_RELEASE=89,_DBNSFP_BASE=${BUCKET}/dbNSFPv2.9.3/dbNSFP,_CONTAINER_SUFFIX=_89_grch37:latest \
 39 |     --config=vep_container.yaml \
 40 |     .
 41 | ```
 42 | 
 43 | ### GRCh38
 44 | 
 45 | ``` bash
 46 | gcloud container builds submit \
 47 |     --substitutions=_GENOME_ASSEMBLY=GRCh38,_ENSEMBL_RELEASE=89,_DBNSFP_BASE=${BUCKET}/dbNSFPv3.4c/dbNSFP,_CONTAINER_SUFFIX=_89_grch38:latest \
 48 |     --config=vep_container.yaml \
 49 |     .
 50 | ```
 51 | ## (3) Cache dbNSFP.
 52 | 
 53 | First run the command below to create the Docker container with tools and
 54 | scripts needed to process dbNSFP. This Docker container can be used for any
 55 | version of dbNSFP.
 56 | 
 57 | ``` bash
 58 | gcloud --project ${PROJECT_ID} container builds submit \
 59 |     --substitutions=_CONTAINER_TAG=:latest \
 60 |     --config=dbNSFP_container.yaml \
 61 |     .
 62 | ```
 63 | 
 64 | Then run the container via dsub to download and cache dbNSFP annotations in
 65 | Cloud Storage. These commands can be run in parallel if you wish to annotate
 66 | using both reference genomes.
 67 | 
 68 | * Note that it can take several hours for the job to complete.
 69 | * The values for `FILEID` in the commands came
 70 |   from [dbNSFP documentation](https://sites.google.com/site/jpopgen/dbNSFP)
 71 |   where you can also get detail on other available versions of dbNSFP.
 72 | 
 73 | ### GRCh37
 74 | 
 75 | ``` bash
 76 | dsub \
 77 |   --project ${PROJECT_ID} \
 78 |   --image gcr.io/${PROJECT_ID}/dbnsfp_cache_builder:latest \
 79 |   --zones "us-central1-*" \
 80 |   --disk-size 200 \
 81 |   --min-cores 8 \
 82 |   --logging ${BUCKET}/dbNSFPv3.4c/dbNSFPv3.4c.log \
 83 |   --env FILEID=0B60wROKy6OqcaWJ4Y0xvR2k1aUU \
 84 |   --output-recursive OUTPUT_PATH=${BUCKET}/dbNSFPv3.4c/ \
 85 |   --command '/opt/download_dbNSFP.sh &&
 86 |              /opt/build_databases.sh'
 87 | ```
 88 | 
 89 | ### GRCh38
 90 | 
 91 | ``` bash
 92 | dsub \
 93 |   --project ${PROJECT_ID} \
 94 |   --image gcr.io/${PROJECT_ID}/dbnsfp_cache_builder:latest \
 95 |   --zones "us-central1-*" \
 96 |   --disk-size 200 \
 97 |   --min-cores 8 \
 98 |   --logging ${BUCKET}/dbNSFPv2.9.3/dbNSFPv2.9.3.log \
 99 |   --env FILEID=0B60wROKy6OqceTNZRkZnaERWREk \
100 |   --output-recursive OUTPUT_PATH=${BUCKET}/dbNSFPv2.9.3/ \
101 |   --command '/opt/download_dbNSFP.sh &&
102 |              /opt/build_databases.sh'
103 | ```
104 | 


--------------------------------------------------------------------------------
/batch/build_annotator/build_databases.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #     http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | 
17 | 
18 | # This script will preprocess the dbNSFP database according to the
19 | # instructions for the VEP plugin, given here:
20 | #   https://github.com/Ensembl/VEP_plugins/blob/release/87/dbNSFP.pm.
21 | #
22 | # Args:
23 | #   $1: Output path.
24 | #   $2: The dbNSFP input filename.
25 | 
26 | set -o nounset
27 | set -o errexit
28 | 
29 | readonly DEFAULT_OUTPUT_PATH=${OUTPUT_PATH:-}
30 | readonly DBNSFP_BASE=${1:-$DEFAULT_OUTPUT_PATH}
31 | readonly DBNSFP_ZIP_FILE=${2:-dbNSFP.zip}
32 | 
33 | # Process a list of dbNSFP database files by sorting and indexing them with
34 | # tabix.  We sort each chromosome separately, leaving the comment line at the
35 | # top.  Additionally, add "chr" to the start of the chromosome names (column 1).
36 | # Concatenate the resulting files and run tabix to index the final file.
37 | #
38 | # Args:
39 | #   $1: The combined output file name.
40 | #  ...: Input files, typically one per chromosome.
41 | function process_tables() {
42 |   local -r gzip_file="$1"
43 |   shift 1
44 |   local file
45 |   for file in $(printf '%s\n' "$@" | sort); do
46 |     # Write the comment lines for this file.
47 |     awk '/^#/' "${file}"
48 |     # Write the position-sorted non-comment lines for this file.
49 |     awk '!/^#/{print "chr"$0}' "${file}" | \
50 |       sort --key=2,2n --stable --parallel=8
51 |   done | \
52 |     bgzip --threads 8 -c > \
53 |       "${gzip_file}"
54 |   tabix -s 1 -b 2 -e 2 "${gzip_file}"
55 |   # TODO: Determine if "chr" needs to be prepended for GRCh37.
56 | }
57 | 
58 | main() {
59 |   local -r gzip_file="dbNSFP.gz"
60 |   local -r readme_file=("dbNSFP"*"readme.txt")
61 | 
62 |   unzip "${DBNSFP_ZIP_FILE}"
63 | 
64 |   process_tables "${gzip_file}" "dbNSFP"*"chr"*
65 | 
66 |   if [[ ! -z "${DBNSFP_BASE}" ]] ; then
67 |     # Move the processed files to the output directory.
68 |     mkdir -p "${DBNSFP_BASE}"
69 |     mv "${gzip_file}" "${DBNSFP_BASE}"
70 |     mv "${gzip_file}.tbi" "${DBNSFP_BASE}"
71 |     if [[ -f "${readme_file[0]}" ]] ; then
72 |       mv "${readme_file[0]}" "${DBNSFP_BASE}"
73 |     else
74 |       # TODO: Understand why the readme file is sometimes not found.
75 |       ls
76 |       ls "${readme_file[0]}"
77 |     fi
78 |   fi
79 | }
80 | 
81 | main "$@"
82 | 


--------------------------------------------------------------------------------
/batch/build_annotator/dbNSFP_container.yaml:
--------------------------------------------------------------------------------
 1 | steps:
 2 | - name: 'gcr.io/cloud-builders/docker'
 3 |   # Build a Docker container holding the tools and scripts needed
 4 |   # to create a cache of dbNSFP for use with VEP.
 5 |   args: ['build', '-f', 'Dockerfile.dbNSFP',
 6 |   '-t', 'gcr.io/$PROJECT_ID/dbnsfp_cache_builder${_CONTAINER_SUFFIX}',  '.']
 7 | 
 8 | # Push the container to the Google Container Registry.
 9 | images:
10 | - 'gcr.io/$PROJECT_ID/dbnsfp_cache_builder${_CONTAINER_SUFFIX}'
11 | 


--------------------------------------------------------------------------------
/batch/build_annotator/download_dbNSFP.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #     http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | 
17 | 
18 | # Download dbNSFP from Google Drive per
19 | #   https://sites.google.com/site/jpopgen/dbNSFP
20 | #
21 | # Args:
22 | #   $1: The GoogleDrive file id.
23 | #   $2: The destination filename.
24 | 
25 | set -o nounset
26 | set -o errexit
27 | 
28 | # Default value is for dbNSFPv3.4c.zip
29 | readonly DEFAULT_FILEID=${FILEID:-0B60wROKy6OqcaWJ4Y0xvR2k1aUU}
30 | readonly DRIVE_FILEID=${1:-$DEFAULT_FILEID}
31 | readonly DESTINATION=${2:-dbNSFP.zip}
32 | 
33 | # The following code will download a large world-readable file from Google
34 | # Drive. Implementation adapted from http://stackoverflow.com/a/43478623
35 | curl -c /tmp/cookie -L -o /tmp/probe.bin \
36 |   "https://drive.google.com/uc?export=download&id=${DRIVE_FILEID}"
37 | confirm=$(tr ';' '\n' </tmp/probe.bin | grep confirm)
38 | confirm=${confirm:8:4}
39 | curl -C - -b /tmp/cookie -L -o "${DESTINATION}" \
40 |   "https://drive.google.com/uc?export=download&id=${DRIVE_FILEID}&confirm=${confirm}"
41 | 


--------------------------------------------------------------------------------
/batch/build_annotator/vep_container.yaml:
--------------------------------------------------------------------------------
 1 | steps:
 2 | - name: 'gcr.io/cloud-builders/docker'
 3 |   # Build a Docker container holding VEP and some
 4 |   # cached annotations.  Larger annotations such as dbNSFP
 5 |   # should be cached separately on Cloud Storage to keep
 6 |   # the resulting container within a reasonable size limit.
 7 |   args: ['build', '-f', 'Dockerfile.vep',
 8 |   '-t', 'gcr.io/$PROJECT_ID/vep${_CONTAINER_SUFFIX}',
 9 |   '--build-arg', 'ENSEMBL_RELEASE=${_ENSEMBL_RELEASE}',
10 |   '--build-arg', 'GENOME_ASSEMBLY=${_GENOME_ASSEMBLY}',
11 |   '--build-arg', 'DBNSFP_BASE=${_DBNSFP_BASE}',
12 |   '.']
13 | 
14 | # Maximum time to wait for all build steps to complete.
15 | timeout: '7200s'
16 | 
17 | # Push the VEP container to the Google Container Registry.
18 | images:
19 | - 'gcr.io/$PROJECT_ID/vep${_CONTAINER_SUFFIX}'
20 | 


--------------------------------------------------------------------------------
/batch/run_annotator/README.md:
--------------------------------------------------------------------------------
  1 | **WARNING: Not actively maintained!**
  2 | 
  3 | Run the annotator
  4 | =================
  5 | 
  6 | This tutorial uses VEP to
  7 | annotate
  8 | [Platinum Genomes variants called by DeepVariant](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/platinum_genomes_deepvariant.html) and
  9 | aligned to build GRCh38.
 10 | 
 11 | The [BigQuery Web UI](https://cloud.google.com/bigquery/quickstart-web-ui)
 12 | and [dsub](https://cloud.google.com/genomics/v1alpha2/dsub) are used to run all
 13 | of these steps in the cloud.
 14 | 
 15 | ## (1) Identify the variants you wish to annotate.
 16 | 
 17 | A query can be used to export variants from
 18 | BigQuery
 19 | [variants tables](https://cloud.google.com/genomics/v1/bigquery-variants-schema)
 20 | in
 21 | [a format that VEP can take as input](http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#input).
 22 | 
 23 | * The specific query below will export variants for all members of the Platinum
 24 |   Genomes cohort in "ensembl" format.
 25 | * To use variants from your own table, just change the table name.
 26 | * To work with a subset of genomes or just one genome, modify the query to
 27 |   filter by call set name.
 28 | 
 29 | Run this query using
 30 | the [BigQuery Web UI](https://cloud.google.com/bigquery/quickstart-web-ui) and
 31 | materialize the result to a table in your project.
 32 | 
 33 | ``` sql
 34 | #standardSQL
 35 |   --
 36 |   -- Extract the information needed to annotate variants with VEP.
 37 |   -- This is small test on just the variants in chromosome 22.
 38 |   --
 39 | WITH
 40 |   variants AS (
 41 |   SELECT
 42 |     reference_name AS chrom,
 43 |     start + 1 AS pos, -- convert to 1-based coordinates
 44 |     `end`,
 45 |     reference_bases AS ref,
 46 |     alt
 47 |   FROM
 48 |     `genomics-public-data.platinum_genomes_deepvariant.single_sample_genome_calls`,
 49 |     UNNEST(alternate_bases) AS alt -- flatten multiallelic sites
 50 |   WHERE
 51 |     -- Include only sites of variation (exclude non-variant segments).
 52 |     alt IS NOT NULL AND alt NOT IN ("<NON_REF>", "<*>")
 53 |     -- Remove this line to include the entire genome.
 54 |     AND reference_name IN ('chr22', '22')
 55 |   )
 56 | SELECT
 57 |   -- http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#input
 58 |   CONCAT(
 59 |     chrom, '\t',
 60 |     CAST(pos AS STRING), '\t',
 61 |     CAST(`end` AS STRING), '\t',
 62 |     ref, '/', alt, '\t',
 63 |     '+') AS vep_input
 64 | FROM
 65 |   variants
 66 | ```
 67 | 
 68 | ## (2) Extract the VEP input to Cloud Storage.
 69 | 
 70 | [Export](https://cloud.google.com/bigquery/docs/exporting-data) the contents of
 71 | the table created in the prior step as a gzipped CSV file to Cloud Storage (for
 72 | example to path `gs://your-bucket-name/platinum-genomes-grch38-chr22.csv.gz`).
 73 | 
 74 | ## (3) Run VEP on the variants.
 75 | 
 76 | Run [run_vep_remote.sh](./run_vep_remote.sh) to
 77 | launch [dsub](https://github.com/googlegenomics/dsub) jobs that will use the VEP
 78 | Docker container to run VEP on the variants, writing the resulting annotations
 79 | as a BigQuery table.
 80 | 
 81 | * Run `./run_vep_remote.sh --help` for additional documentation on its command
 82 |   line parameters.
 83 | 
 84 | ``` bash
 85 | # The Google Cloud Platform project id in which the docker containers
 86 | # are stored.
 87 | PROJECT_ID=your-project-id
 88 | # The bucket name (with the gs:// prefix) to hold temp files and logs.
 89 | BUCKET=gs://your-bucket-name
 90 | # The full path to the VEP input file.
 91 | INPUT_FILE=${BUCKET}/platinum-genomes-grch38-chr22.csv.gz
 92 | 
 93 | # Kick off annotation.
 94 | ./run_vep_remote.sh \
 95 |     --project_id ${PROJECT_ID} \
 96 |     --bucket ${BUCKET}/temp \
 97 |     --docker_image gcr.io/${PROJECT_ID}/vep_grch38 \
 98 |     --table_name platinum_genomes_grch38_chr22_annotations \
 99 |     --shards_per_file 10 \
100 |     ${INPUT_FILE}
101 | ```
102 | 
103 | ## (3) Check the annotations.
104 | 
105 | Use the following query to do some basic checks on the annotations.  The query
106 | below assumes the annotations are in table
107 | `vep.platinum_genomes_grch38_chr22_annotations` in your project.
108 | 
109 | ``` sql
110 | #standardSQL
111 |   --
112 |   -- Count the number of variants per chromosome for both the variants and
113 |   -- VEP output table.  This basic QC metric ensures that every chromosome
114 |   -- we expected successfully completed.
115 |   --
116 | SELECT chrom, variants_count, vep_count
117 | FROM (
118 |   SELECT LTRIM(reference_name, "chr") AS chrom, COUNT(1) AS variants_count
119 |   FROM
120 |     `genomics-public-data.platinum_genomes_deepvariant.single_sample_genome_calls`,
121 |     UNNEST(alternate_bases) AS alt -- flatten multi allelic sites
122 |   WHERE
123 |     -- Include only sites of variantion (exclude non-variant segments).
124 |     alt IS NOT NULL AND alt NOT IN ("<NON_REF>", "<*>")
125 |   GROUP BY reference_name)
126 | FULL JOIN (
127 |   SELECT LTRIM(seq_region_name, "chr") AS chrom, COUNT(1) AS vep_count
128 |   FROM `vep.platinum_genomes_grch38_chr22_annotations`
129 |   GROUP BY seq_region_name)
130 | USING (chrom)
131 | WHERE vep_count IS NOT NULL
132 | ORDER BY chrom
133 | ```
134 | 
135 | We expect a result of 210,279 variants and VEP annotations for chromosome 22.
136 | 
137 | ## (4) Optional: Reshape the annotations table for easier JOINs.
138 | 
139 | The annotations table created by VEP is a little different than
140 | the
141 | [variant tables](https://cloud.google.com/genomics/v1/bigquery-variants-schema):
142 | 
143 | * VEP uses 1-based coordinates whereas
144 |   the
145 |   [variants tables](https://cloud.google.com/genomics/v1/bigquery-variants-schema) use
146 |   0-based coordinates per [GA4GH](http://ga4gh.org/).
147 | * VEP rewrites some of the fields. For example field `start`
148 |   excludes the first base for a deletion.
149 | 
150 | Run a query like the following to reshape the annotations data and materialize
151 | the result to a new table.
152 | 
153 | ``` sql
154 | #standardSQL
155 |   --
156 |   -- Add additional columns to the VEP table to facilitate easier JOINs
157 |   -- with variant tables.
158 |   --
159 | SELECT
160 |   SPLIT(input, "\t")[OFFSET(0)] AS reference_name,
161 |   CAST(SPLIT(input, "\t")[OFFSET(1)] AS INT64) - 1 AS start_0_based_coords,
162 |   SPLIT(SPLIT(input, "\t")[OFFSET(3)], '/')[OFFSET(0)] AS reference_bases,
163 |   SPLIT(SPLIT(input, "\t")[OFFSET(3)], '/')[OFFSET(1)] AS alternate_bases,
164 |   *
165 | FROM
166 |   `vep.platinum_genomes_grch38_chr22_annotations`
167 | ```
168 | 


--------------------------------------------------------------------------------
/batch/run_annotator/run_vep_remote.sh:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | 
  3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | FLAGS_HELP="
 18 | Runs VEP on one or more files using the pipelines api (via dsub.py) and
 19 | writes the output to BigQuery. Positional parameters must be files stored
 20 | on GCS (may be gzipped).  Files must be in ensembl format or VCF format.
 21 | 
 22 | USAGE: ./run_vep_remote.sh [flags] args
 23 | "
 24 | 
 25 | if [[ ! -e ./shflags ]]; then
 26 |   echo This script assumes https://github.com/kward/shflags is located \
 27 |     in the current working directory.  To obtain the file: \
 28 |     curl -O https://raw.githubusercontent.com/kward/shflags/master/src/shflags
 29 |   exit 1
 30 | fi
 31 | 
 32 | source ./shflags
 33 | 
 34 | DEFINE_string project_id ""  \
 35 |   "The Cloud Platform project id to use."
 36 | 
 37 | DEFINE_string bucket "" \
 38 |   "Bucket to use for temporary files and logging."
 39 | 
 40 | DEFINE_string dataset "vep" \
 41 |   "BigQuery destination dataset name. The dataset will be created, if needed."
 42 | 
 43 | DEFINE_string table_name "testing" \
 44 |   "BigQuery destination table name. This table will be created."
 45 | 
 46 | DEFINE_string vep_schema_file "./vep_schema.json" \
 47 |   "BigQuery schema for VEP annotations."
 48 | 
 49 | # The reference genome versions must match between the image and the database.
 50 | DEFINE_string docker_image "" \
 51 |    "VEP docker image corresponding to the reference genome of the input."
 52 | 
 53 | DEFINE_integer shards_per_file 1 \
 54 |   "The number of concurrent dsub jobs to run, each working on a separate shard of the input file(s)."
 55 | 
 56 | DEFINE_string zones "us-*" \
 57 |   "Compute engine zones in which to run dsub."
 58 | 
 59 | DEFINE_integer disk_size "200" \
 60 |   "Size of dsub data disk."
 61 | 
 62 | DEFINE_integer boot_disk_size "30" \
 63 |   "Size of dsub boot disk."
 64 | 
 65 | DEFINE_integer min_gb_ram 8 \
 66 |   "Minimum amount of RAM for dsub."
 67 | 
 68 | DEFINE_string docker_script "./vep_into_bigquery_for_docker.sh" \
 69 |    "Script that will be run by dsub."
 70 | 
 71 | function main() {
 72 |   if [[ -z "${FLAGS_project_id}" ]] ; then
 73 |     echo "--project_id is required."
 74 |     exit 1
 75 |   fi
 76 | 
 77 |   if [[ -z "${FLAGS_bucket}" ]] ; then
 78 |     echo "--bucket is required."
 79 |     exit 1
 80 |   fi
 81 | 
 82 |   if [[ -z "${FLAGS_docker_image}" ]] ; then
 83 |     echo "--docker_image is required."
 84 |     exit 1
 85 |   fi
 86 | 
 87 |   local -r description="VEP pipeline on $* using ${FLAGS_docker_image}"
 88 | 
 89 |   gsutil \
 90 |     cp \
 91 |     "${FLAGS_vep_schema_file}" \
 92 |     "${FLAGS_bucket}/schema.json"
 93 | 
 94 |   bq \
 95 |     --project_id "${FLAGS_project_id}" \
 96 |     mk -f "${FLAGS_dataset}"
 97 | 
 98 |   # Note: this will cause the script to fail if the table already exists.
 99 |   bq \
100 |     --project_id "${FLAGS_project_id}" \
101 |     mk --table \
102 |     "${FLAGS_dataset}.${FLAGS_table_name}"
103 | 
104 |   bq \
105 |     --project_id "${FLAGS_project_id}" \
106 |     update \
107 |     --table \
108 |     --description "${description}" \
109 |     "${FLAGS_dataset}.${FLAGS_table_name}"
110 | 
111 |   local -r temp_dir=$(mktemp -d)
112 | 
113 |   # Create TSV file to pass into dsub.
114 |   # Will run VEP in parallel for each of the INPUT_FILE and put the result in
115 |   # BQ_DATASET_NAME.BQ_TABLE_NAME.
116 |   # SHARD_INDEX is 1..NUM_SHARDS
117 |   (
118 |     # Pass input file flags using "=" since the later logic changes spaces to
119 |     # tabs. dsub wants spaces, so we convert the "=" characters after
120 |     # converting spaces.
121 |     echo --input=SCHEMA_FILE \
122 |          BQ_DATASET_NAME \
123 |          BQ_TABLE_NAME \
124 |          --input=INPUT_FILE \
125 |          NUM_SHARDS \
126 |          SHARD_INDEX \
127 |       | tr '= ' ' \t'
128 | 
129 |     local file
130 |     for file in "$@"; do
131 |       local -i shard_index
132 |       for shard_index in $(seq "${FLAGS_shards_per_file}"); do
133 |         echo "${FLAGS_bucket}/schema.json" \
134 |              "${FLAGS_dataset}" \
135 |              "${FLAGS_table_name}" \
136 |              "${file}" \
137 |              "${FLAGS_shards_per_file}" \
138 |              "${shard_index}"
139 |       done
140 |     done | tr ' ' '\t'
141 |   ) > "${temp_dir}/table.tsv"
142 | 
143 |   dsub \
144 |     --wait \
145 |     --project "${FLAGS_project_id}" \
146 |     --zones "${FLAGS_zones}" \
147 |     --logging "${FLAGS_bucket}/logging" \
148 |     --image "${FLAGS_docker_image}" \
149 |     --min-ram "${FLAGS_min_gb_ram}" \
150 |     --disk-size "${FLAGS_disk_size}" \
151 |     --boot-disk-size "${FLAGS_boot_disk_size}" \
152 |     --tasks "${temp_dir}/table.tsv" \
153 |     --script "${FLAGS_docker_script}"
154 | }
155 | 
156 | 
157 | # Parse the command-line.
158 | FLAGS "$@" || exit $?
159 | eval set -- "${FLAGS_ARGV}"
160 | 
161 | set -o xtrace
162 | set -o nounset
163 | set -o errexit
164 | 
165 | main "$@"
166 | 


--------------------------------------------------------------------------------
/batch/run_annotator/vep_into_bigquery_for_docker.sh:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | 
  3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | # Runs VEP on an ensembl format file or a VCF file using the pipelines api
 18 | # (via dsub.py) and writes the output to BigQuery. Positional parameters must
 19 | # be files stored on GCS (may be gzipped).
 20 | 
 21 | # This script is meant to be run within a docker container to run VEP on a
 22 | # single shard of an ensembl format or VCF file.
 23 | #
 24 | # Takes flags as the environment variables:
 25 | # SCHEMA_FILE BQ_DATASET_NAME BQ_TABLE_NAME INPUT_FILE NUM_SHARDS SHARD_INDEX
 26 | #
 27 | # The following environment variables should be specified in the Docker image,
 28 | # since they are properties of the downloaded databases which are specific to
 29 | # the image (there is a one-to-one relationship):
 30 | # GENOME_ASSEMBLY VEP_SPECIES DBNSFP_BASE
 31 | #
 32 | # We also allow for including as a module so that the functions (in
 33 | # particular apply_shard_file) can be tested.
 34 | 
 35 | # Applies shard to $1 in-place.
 36 | #
 37 | # Args:
 38 | #   file: file to be sharded (in-place)
 39 | #   shard_count: the number of shards
 40 | #   cur_shard: integer from 1...num_shards (inclusive)
 41 | #
 42 | # Comment lines (starting with #) are always included. Of the remaining lines,
 43 | # the index chunk of ceil(num_non_comment_lines / num_shards) lines
 44 | # is kept, with the final shard (cur_shard=shard_count) potentially having
 45 | # fewer lines.
 46 | #
 47 | # We use shard_count and cur_shard here because using num_shards and shard_index
 48 | # confuses the linter because they are too close to NUM_SHARDS and SHARD_INDEX,
 49 | # which aren't actually defined here.
 50 | function apply_shard_file() {
 51 |   local -r file=$1
 52 |   local -ri shard_count=$2
 53 |   local -ri cur_shard=$3
 54 | 
 55 |   if [[ "${shard_count}" -eq 1 ]]; then
 56 |     return
 57 |   fi
 58 | 
 59 |   local -r temp_sharded_file="${file}.sharded"
 60 | 
 61 |   # Pass through the file twice; once to get the number of non-comment lines
 62 |   # and then to use that to output the specific shard.
 63 |   gawk -vnum_shards="${shard_count}" -vshard_index="${cur_shard}" '
 64 |     # Note: line_count and non_comment_line_index refer only to non-comment
 65 |     # lines.
 66 | 
 67 |     ARGIND==1 {
 68 |       if (!/^#/)
 69 |         line_count++
 70 |       next
 71 |     }
 72 | 
 73 |     ARGIND==2 && FNR==1 {
 74 |       lines_per_shard = int((line_count + num_shards - 1) / num_shards)
 75 | 
 76 |       # If num_shards > line_count, then this could be greater than the number
 77 |       # of lines, which will lead to empty output files.
 78 |       first_line = (shard_index - 1) * lines_per_shard + 1
 79 | 
 80 |       # Note: this could be greater than line_count when num_shards > line_count
 81 |       # or for num_shards == shard_index. The later will lead to a file with
 82 |       # fewer than lines_per_shard non-comment lines.
 83 |       last_line = first_line + lines_per_shard - 1
 84 |     }
 85 | 
 86 |     !/^#/{line_index++}
 87 | 
 88 |     /^#/ || (line_index >= first_line && line_index <= last_line) {
 89 |       print
 90 |     }' "${file}"{,} > "${temp_sharded_file}"
 91 | 
 92 |   mv "${temp_sharded_file}" "${file}"
 93 | }
 94 | 
 95 | if [[ -z "${INPUT_FILE}" ]]; then
 96 |   echo 'Running script in bash library mode.'
 97 | else
 98 |   set -o xtrace
 99 |   set -o nounset
100 |   set -o errexit
101 | 
102 |   # Localize dbNSFP database files.  We can't use dsub to do this for us because
103 |   # the current version (specified by filename or bucket) is only known inside
104 |   # the container.
105 |   gsutil -q cp "${DBNSFP_BASE}.gz" "${TMPDIR}/dbNSFP.gz"
106 |   gsutil -q cp "${DBNSFP_BASE}.gz.tbi" "${TMPDIR}/dbNSFP.gz.tbi"
107 | 
108 |   if [[ $INPUT_FILE == *.vcf.gz || $INPUT_FILE == *.vcf ]]; then
109 |     # The cut operaton removes any genotype information from the input VCF files
110 |     # (which, in the case of 1k genomes, takes up ~75% of the output JSON file).
111 |     gunzip -cf "${INPUT_FILE}" | cut -f1-8 > /mnt/data/input_file
112 |     readonly FORMAT="vcf"
113 |   else
114 |     gunzip -cf "${INPUT_FILE}" > /mnt/data/input_file
115 |     readonly FORMAT="ensembl"
116 |   fi
117 | 
118 |   rm "${INPUT_FILE}"
119 | 
120 |   apply_shard_file /mnt/data/input_file "${NUM_SHARDS}" "${SHARD_INDEX}"
121 | 
122 |   readonly NUM_CORES=$(grep --count --word-regexp "^processor" /proc/cpuinfo)
123 | 
124 |   cd "${VEP_BASE}"
125 | 
126 |   # Depending on the version of dbNSFP used, not all the columns
127 |   # listed below may be available. VEP will issue a warning about
128 |   # those missing columns and run successfully.
129 |   "${VEP_BASE}/vep" \
130 |     --cache \
131 |     --offline \
132 |     --no_stats \
133 |     --allele_number \
134 |     --force_overwrite \
135 |     --fork "${NUM_CORES}" \
136 |     --json \
137 |     --species "${VEP_SPECIES}" \
138 |     --assembly "${GENOME_ASSEMBLY}" \
139 |     --sift b \
140 |     --polyphen b \
141 |     --hgvs \
142 |     --plugin Condel,Condel/config,b \
143 |     --plugin "dbNSFP,${TMPDIR}/dbNSFP.gz,ExAC_Adj_AC,ExAC_Adj_AF,ExAC_nonTCGA_Adj_AC,ExAC_nonTCGA_Adj_AF,ExAC_nonpsych_Adj_AC,ExAC_nonpsych_Adj_AF,GenoCanyon_score,phyloP100way_vertebrate,phyloP20way_mammalian,phastCons100way_vertebrate,phastCons20way_mammalian,SiPhy_29way_logOdds,TWINSUK_AC,TWINSUK_AF,clinvar_rs,Ensembl_geneid,Ensembl_transcriptid,Ensembl_proteinid,LRT_score,ALSPAC_AC,ALSPAC_AF,ESP6500_AA_AC,ESP6500_AA_AF,ESP6500_EA_AC,ESP6500_EA_AF,clinvar_trait,GTEx_V6_gene,GTEx_V6_tissue" \
144 |     --format "${FORMAT}" \
145 |     -i /mnt/data/input_file \
146 |     -o /mnt/data/output.json
147 | 
148 |   if [[ -s /mnt/data/output.json ]]; then
149 |     bq \
150 |       --quiet \
151 |       load \
152 |       --source_format NEWLINE_DELIMITED_JSON \
153 |       "${BQ_DATASET_NAME}.${BQ_TABLE_NAME}" \
154 |       /mnt/data/output.json \
155 |       "${SCHEMA_FILE}"
156 |   else
157 |     echo "VEP output file empty." >&2
158 |   fi
159 |   if [[ -s /mnt/data/output.json_warnings.txt ]]; then
160 |     # Record any VEP export errors in stdout.  These are typically complaints
161 |     # about unmatched "random" or alternate haplotype contigs in the database.
162 |     echo "JSON warnings reported:"
163 |     cat /mnt/data/output.json_warnings.txt
164 |   fi
165 | 
166 | fi
167 | 
168 | 


--------------------------------------------------------------------------------
/batch/run_annotator/vep_schema.json:
--------------------------------------------------------------------------------
  1 | [
  2 |   {
  3 |     "name": "input",
  4 |     "type": "string",
  5 |     "mode": "required",
  6 |     "description": "original vcf input line"
  7 |   },
  8 |   {
  9 |     "name": "id",
 10 |     "type": "string",
 11 |     "mode": "required"
 12 |   },
 13 |   {
 14 |     "name": "seq_region_name",
 15 |     "type": "string",
 16 |     "mode": "required"
 17 |   },
 18 |   {
 19 |     "name": "start",
 20 |     "type": "integer",
 21 |     "mode": "required"
 22 |   },
 23 |   {
 24 |     "name": "end",
 25 |     "type": "integer",
 26 |     "mode": "required"
 27 |   },
 28 |   {
 29 |     "name": "strand",
 30 |     "type": "integer",
 31 |     "mode": "required"
 32 |   },
 33 |   {
 34 |     "name": "assembly_name",
 35 |     "type": "string",
 36 |     "mode": "required"
 37 |   },
 38 |   {
 39 |     "name": "allele_string",
 40 |     "type": "string",
 41 |     "mode": "nullable"
 42 |   },
 43 |   {
 44 |     "name": "most_severe_consequence",
 45 |     "type": "string",
 46 |     "mode": "required"
 47 |   },
 48 |   {
 49 |     "name": "variant_class",
 50 |     "type": "string",
 51 |     "mode": "nullable"
 52 |   },
 53 |   {
 54 |     "name": "transcript_consequences",
 55 |     "type": "record",
 56 |     "mode": "repeated",
 57 |     "fields": [
 58 |       {
 59 |         "name": "transcript_id",
 60 |         "type": "string",
 61 |         "mode": "required"
 62 |       },
 63 |       {
 64 |         "name": "gene_id",
 65 |         "type": "string",
 66 |         "mode": "required"
 67 |       },
 68 |       {
 69 |         "name": "impact",
 70 |         "type": "string",
 71 |         "mode": "nullable"
 72 |       },
 73 |       {
 74 |         "name": "consequence_terms",
 75 |         "type": "string",
 76 |         "mode": "repeated"
 77 |       },
 78 |       {
 79 |         "name": "variant_allele",
 80 |         "type": "string",
 81 |         "mode": "nullable"
 82 |       },
 83 |       {
 84 |         "name": "allele_num",
 85 |         "type": "integer",
 86 |         "mode": "nullable"
 87 |       },
 88 |       {
 89 |         "name": "strand",
 90 |         "type": "integer",
 91 |         "mode": "required"
 92 |       },
 93 |       {
 94 |         "name": "codons",
 95 |         "type": "string",
 96 |         "mode": "nullable"
 97 |       },
 98 |       {
 99 |         "name": "amino_acids",
100 |         "type": "string",
101 |         "mode": "nullable"
102 |       },
103 |       {
104 |         "name": "cds_start",
105 |         "type": "integer",
106 |         "mode": "nullable"
107 |       },
108 |       {
109 |         "name": "cds_end",
110 |         "type": "integer",
111 |         "mode": "nullable"
112 |       },
113 |       {
114 |         "name": "flags",
115 |         "type": "string",
116 |         "mode": "repeated"
117 |       },
118 |       {
119 |         "name": "cdna_start",
120 |         "type": "integer",
121 |         "mode": "nullable"
122 |       },
123 |       {
124 |         "name": "cdna_end",
125 |         "type": "integer",
126 |         "mode": "nullable"
127 |       },
128 |       {
129 |         "name": "protein_start",
130 |         "type": "integer",
131 |         "mode": "nullable"
132 |       },
133 |       {
134 |         "name": "protein_end",
135 |         "type": "integer",
136 |         "mode": "nullable"
137 |       },
138 |       {
139 |         "name": "distance",
140 |         "type": "integer",
141 |         "mode": "nullable"
142 |       },
143 |       {
144 |         "name": "bp_overlap",
145 |         "type": "integer",
146 |         "mode": "nullable"
147 |       },
148 |       {
149 |         "name": "percentage_overlap",
150 |         "type": "float",
151 |         "mode": "nullable"
152 |       },
153 |       {
154 |         "name": "hgvsc",
155 |         "type": "string",
156 |         "mode": "nullable"
157 |       },
158 |       {
159 |         "name": "hgvsp",
160 |         "type": "string",
161 |         "mode": "nullable"
162 |       },
163 |       {
164 |         "name": "hgvs_offset",
165 |         "type": "integer",
166 |         "mode": "nullable"
167 |       },
168 |       {
169 |         "name": "polyphen_prediction",
170 |         "type": "string",
171 |         "mode": "nullable"
172 |       },
173 |       {
174 |         "name": "polyphen_score",
175 |         "type": "float",
176 |         "mode": "nullable"
177 |       },
178 |       {
179 |         "name": "sift_prediction",
180 |         "type": "string",
181 |         "mode": "nullable"
182 |       },
183 |       {
184 |         "name": "sift_score",
185 |         "type": "float",
186 |         "mode": "nullable"
187 |       },
188 |       {
189 |         "name": "condel",
190 |         "type": "string",
191 |         "mode": "nullable"
192 |       },
193 |       {
194 |         "name": "exac_adj_ac",
195 |         "type": "integer",
196 |         "mode": "nullable"
197 |       },
198 |       {
199 |         "name": "exac_adj_af",
200 |         "type": "float",
201 |         "mode": "nullable"
202 |       },
203 |       {
204 |         "name": "exac_nontcga_adj_ac",
205 |         "type": "integer",
206 |         "mode": "nullable"
207 |       },
208 |       {
209 |         "name": "exac_nontcga_adj_af",
210 |         "type": "float",
211 |         "mode": "nullable"
212 |       },
213 |       {
214 |         "name": "exac_nonpsych_adj_ac",
215 |         "type": "integer",
216 |         "mode": "nullable"
217 |       },
218 |       {
219 |         "name": "exac_nonpsych_adj_af",
220 |         "type": "float",
221 |         "mode": "nullable"
222 |       },
223 |       {
224 |         "name": "genocanyon_score",
225 |         "type": "float",
226 |         "mode": "nullable"
227 |       },
228 |       {
229 |         "name": "phylop100way_vertebrate",
230 |         "type": "float",
231 |         "mode": "nullable"
232 |       },
233 |       {
234 |         "name": "phylop20way_mammalian",
235 |         "type": "float",
236 |         "mode": "nullable"
237 |       },
238 |       {
239 |         "name": "phastcons100way_vertebrate",
240 |         "type": "float",
241 |         "mode": "nullable"
242 |       },
243 |       {
244 |         "name": "phastcons20way_mammalian",
245 |         "type": "float",
246 |         "mode": "nullable"
247 |       },
248 |       {
249 |         "name": "siphy_29way_logodds",
250 |         "type": "float",
251 |         "mode": "nullable"
252 |       },
253 |       {
254 |         "name": "twinsuk_ac",
255 |         "type": "integer",
256 |         "mode": "nullable"
257 |       },
258 |       {
259 |         "name": "twinsuk_af",
260 |         "type": "float",
261 |         "mode": "nullable"
262 |       },
263 |       {
264 |         "name": "clinvar_rs",
265 |         "type": "string",
266 |         "mode": "nullable"
267 |       },
268 |       {
269 |         "name": "clinvar_trait",
270 |         "type": "string",
271 |         "mode": "nullable"
272 |       },
273 |       {
274 |         "name": "ensembl_geneid",
275 |         "type": "string",
276 |         "mode": "nullable"
277 |       },
278 |       {
279 |         "name": "ensembl_transcriptid",
280 |         "type": "string",
281 |         "mode": "nullable"
282 |       },
283 |       {
284 |         "name": "ensembl_proteinid",
285 |         "type": "string",
286 |         "mode": "nullable"
287 |       },
288 |       {
289 |         "name": "lrt_score",
290 |         "type": "float",
291 |         "mode": "nullable"
292 |       },
293 |       {
294 |         "name": "rvis",
295 |         "type": "float",
296 |         "mode": "nullable"
297 |       },
298 |       {
299 |         "name": "gdi",
300 |         "type": "float",
301 |         "mode": "nullable"
302 |       },
303 |       {
304 |         "name": "gtex_v6_gene",
305 |         "type": "string",
306 |         "mode": "nullable"
307 |       },
308 |       {
309 |         "name": "gtex_v6_tissue",
310 |         "type": "string",
311 |         "mode": "nullable"
312 |       },
313 |       {
314 |         "name": "alspac_ac",
315 |         "type": "integer",
316 |         "mode": "nullable"
317 |       },
318 |       {
319 |         "name": "alspac_af",
320 |         "type": "float",
321 |         "mode": "nullable"
322 |       },
323 |       {
324 |         "name": "esp6500_aa_ac",
325 |         "type": "integer",
326 |         "mode": "nullable"
327 |       },
328 |       {
329 |         "name": "esp6500_aa_af",
330 |         "type": "float",
331 |         "mode": "nullable"
332 |       },
333 |       {
334 |         "name": "esp6500_ea_ac",
335 |         "type": "integer",
336 |         "mode": "nullable"
337 |       },
338 |       {
339 |         "name": "esp6500_ea_af",
340 |         "type": "float",
341 |         "mode": "nullable"
342 |       }
343 |     ]
344 |   },
345 |   {
346 |     "name": "intergenic_consequences",
347 |     "type": "record",
348 |     "mode": "repeated",
349 |     "fields": [
350 |       {
351 |         "name": "impact",
352 |         "type": "string",
353 |         "mode": "nullable"
354 |       },
355 |       {
356 |         "name": "consequence_terms",
357 |         "type": "string",
358 |         "mode": "repeated"
359 |       },
360 |       {
361 |         "name": "variant_allele",
362 |         "type": "string",
363 |         "mode": "nullable"
364 |       },
365 |       {
366 |         "name": "allele_num",
367 |         "type": "integer",
368 |         "mode": "nullable"
369 |       }
370 |     ]
371 |   }
372 | ]
373 | 


--------------------------------------------------------------------------------
/batch/vep/Dockerfile:
--------------------------------------------------------------------------------
 1 | # Copyright 2018 Google Inc.  All Rights Reserved.
 2 | #
 3 | # Licensed under the Apache License, Version 2.0 (the "License");
 4 | # you may not use this file except in compliance with the License.
 5 | # You may obtain a copy of the License at
 6 | #
 7 | #      http://www.apache.org/licenses/LICENSE-2.0
 8 | #
 9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | #
15 | # This is branched from batch/build_annotator/Dockerfile.vep file and is meant
16 | # to replace that eventually.
17 | #
18 | # Example:
19 | #
20 | # docker build . --build-arg ENSEMBL_RELEASE=104 --tag vep:104
21 | #
22 | # To run vep through containers created by this file, the VEP cache has to be
23 | # downloaded separately and made available through command line arguments.
24 | # The script for doing so is build_vep_cache.sh, see README.md for details.
25 | 
26 | # The pipelines-io container provides a wrapper around gsutil with additional
27 | # retry logic.
28 | FROM gcr.io/cloud-genomics-pipelines/io
29 | 
30 | ARG ENSEMBL_RELEASE=104
31 | ARG VEP_BASE=/opt/variant_effect_predictor
32 | 
33 | RUN apt-get -y update && apt-get install -y procps\
34 |     build-essential \
35 |     git \
36 |     libarchive-zip-perl \
37 |     libbz2-dev \
38 |     liblzma-dev \
39 |     libdbd-mysql-perl \
40 |     libdbi-perl \
41 |     libfile-copy-recursive-perl \
42 |     libhts1 \
43 |     libjson-perl \
44 |     libmodule-build-perl \
45 |     tabix \
46 |     unzip \
47 |     zlib1g-dev
48 | 
49 | # Install VEP per the instructions at:
50 | # http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#installer
51 | RUN git clone https://github.com/Ensembl/ensembl-vep.git ${VEP_BASE}
52 | 
53 | WORKDIR ${VEP_BASE}
54 | 
55 | RUN git checkout release/${ENSEMBL_RELEASE}
56 | 
57 | RUN perl INSTALL.pl \
58 |     --AUTO a \
59 |     --NO_UPDATE
60 | 
61 | ADD run_vep.sh ${VEP_BASE}/run_vep.sh
62 | 
63 | ADD run_script_with_watchdog.sh ${VEP_BASE}/run_script_with_watchdog.sh
64 | 
65 | ENTRYPOINT []
66 | 


--------------------------------------------------------------------------------
/batch/vep/README.md:
--------------------------------------------------------------------------------
  1 | # Annotating input files with VEP
  2 | 
  3 | This directory includes tools and utilities for running
  4 | [Ensembl's Variant Effect Predictor](
  5 | https://ensembl.org/info/docs/tools/vep/index.html) (VEP) on input VCF files
  6 | of [Variant Transforms](../README.md).
  7 | 
  8 | ## Overview
  9 | 
 10 | With tools provided in this directory, one can:
 11 | * Create a docker image of VEP.
 12 | * Download and package VEP's database (a.k.a.
 13 | [cache](https://ensembl.org/info/docs/tools/vep/script/vep_cache.html)) for
 14 | different species, reference sequences and versions of VEP.
 15 | * Run VEP on VCF input files and create output VCF files that are annotated.
 16 | 
 17 | Note that, this is a useful standalone tool for running VEP in the cloud but the
 18 | main goal is to be able to run VEP as a preprocessor through Variant Transforms
 19 | and then import the annotated variants into BigQuery with proper handling of
 20 | annotations.
 21 | 
 22 | ## How to create and push VEP docker images
 23 | 
 24 | Inside this directory, run:
 25 | 
 26 | `docker build . -t [IMAGE_TAG]`
 27 | 
 28 | This will download the source from
 29 | [VEP GitHub repo](https://github.com/Ensembl/ensembl-vep) and build VEP from
 30 | that source. By default, it uses version 104 of VEP. This can be changed by
 31 | `ENSEMBL_RELEASE` build argument, e.g.,
 32 | 
 33 | `docker build . -t [IMAGE_TAG] --build-arg ENSEMBL_RELEASE=104`
 34 | 
 35 | Let's say we want to push this image to the
 36 | [Container Registry](https://cloud.google.com/container-registry/) of
 37 | `my-project` on Google Cloud, so we can pick `[IMAGE_TAG]` as
 38 | `gcr.io/my-project/vep:104`. Then push this image by:
 39 | 
 40 | `gcloud docker -- push gcr.io/my-project/vep:104`
 41 | 
 42 | **TODO**: Add `cloudbuild.yaml` files for both easy push and integration test.
 43 | 
 44 | ## How to download and package VEP databases
 45 | 
 46 | Choose a local directory with enough space (e.g., ~20GB for homo_sapiens) to
 47 | download and integrate different pieces of the VEP database or cache files.
 48 | Then from within that directory run the
 49 | [`build_vep_cache.sh`](build_vep_cache.sh) script. By default this script
 50 | creates the database for human (homo_sapiens), referenec sequence `GRCh38`,
 51 | and release 104 of VEP. These values can be overwritten by the following
 52 | environment variables (note you should use the same VEP release
 53 | that you used for creating VEP docker image above):
 54 | 
 55 | * `VEP_SPECIES`
 56 | * `GENOME_ASSEMBLY`
 57 | * `ENSEMBL_RELEASE`
 58 | 
 59 | ## How to run VEP on GCP
 60 | 
 61 | There is the helper script [`run_vep.sh`](run_vep.sh) that is added to the VEP
 62 | docker image and can be used to run VEP. One way of running it on
 63 | Google Cloud Platform (GCP) is through the [Pipelines API](
 64 | https://cloud.google.com/genomics/v1alpha2/pipelines-api-command-line). For a
 65 | sample `yaml` job description check
 66 | [`sample_pipeline.yaml`](sample_pipeline.yaml).
 67 | Here is a sample `gcloud` command that uses that file:
 68 | 
 69 | ```
 70 | gcloud alpha genomics pipelines run \
 71 |   --project my-project \
 72 |   --pipeline-file sample_pipeline.yaml \
 73 |   --logging gs://my_bucket/logs \
 74 |   --inputs VCF_INFO_FILED=CSQ_RERUN
 75 | ```
 76 | 
 77 | Note the `vep_cache_homo_sapiens_GRCh38_104.tar.gz` file that is referenced in
 78 | the sample `yaml` file, is the output file that you get from the above database
 79 | creation step.
 80 | 
 81 | The [`run_vep.sh`](run_vep.sh) script relies on several environment variables
 82 | that can be set to change the default behaviour. In the above example
 83 | `VCF_INFO_FILED` is changed to `CSQ_RERUN` (the default is `CSQ_VT`).
 84 | 
 85 | This is the full list of supported environment variables:
 86 | 
 87 | * `SPECIES`: default is `homo_sapiens`
 88 | * `GENOME_ASSEMBLY`: default is `GRCh38`
 89 | * `NUM_FORKS`: The value to be set for
 90 | [`--fork` option of VEP](
 91 | http://ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_fork).
 92 | default is 1.
 93 | * `OTHER_VEP_OPTS`: Other options to be set for the VEP invocation, default is
 94 | [`--everything`](
 95 | http://ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_everything)
 96 | * `VCF_INFO_FILED`: The name of the info field to be used for annotations,
 97 | default is `CSQ_VT`. See
 98 | [`--vcf_info_field`](
 99 | http://ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_vcf_info_field)
100 | 
101 | The following environment variables have to be set and point to valid storage
102 | locations:
103 | 
104 | * `VEP_CACHE`: Where the tar.gz file, created in the above database creation
105 | step, is located.
106 | * `INPUT_FILE`: Note this can be either a VCF file or a compressed VCF file
107 | (`.gz` or `.bgz`). Treatment of compressed and uncompressed files is the same,
108 | i.e., the input file is directly fed into VEP.
109 | * `OUTPUT_VCF`: The name of the output file which is always a VCF file.
110 | 


--------------------------------------------------------------------------------
/batch/vep/build_vep_cache.sh:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | 
  3 | # Copyright 2018 Google Inc.  All Rights Reserved.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #      http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | #
 17 | # This is a script for downloading VEP cache files, decompressing and placing
 18 | # them in the appropriate directory structure that is expected by VEP script.
 19 | # At the end, the whole structure is compressed to generate a single tar.gz
 20 | # file that can be used in run_vep.sh invocations.
 21 | #
 22 | # This script creates a 'vep_cache' sub-directory and does every other file
 23 | # operations and downloads inside that directory. The final cache file will be
 24 | # stored in that directory as well.
 25 | #
 26 | # Capital letter variables refer to environment variables that can be set from
 27 | # outside. Internal variables have small letters. All environment variables
 28 | # have a default value as well to set up cache for homo_sapiens with reference
 29 | # GRCh38 and release 104 of VEP.
 30 | #
 31 | # More details on cache files can be found here:
 32 | # https://ensembl.org/info/docs/tools/vep/script/vep_cache.html
 33 | 
 34 | set -euo pipefail
 35 | 
 36 | readonly release="${ENSEMBL_RELEASE:-104}"
 37 | readonly species="${VEP_SPECIES:-homo_sapiens}" # or "${VEP_SPECIES:-mus_musculus}"
 38 | readonly assembly="${GENOME_ASSEMBLY:-GRCh38}" # or "${GENOME_ASSEMBLY:-GRCh37}" for homo_sapiens or "${GENOME_ASSEMBLY:-GRCm39}" for mus_musculus
 39 | readonly work_dir="vep_cache"
 40 | 
 41 | mkdir -p "${work_dir}"
 42 | pushd "${work_dir}"
 43 | readonly cache_file="${species}_vep_${release}_${assembly}.tar.gz"
 44 | readonly ftp_base="ftp://ftp.ensembl.org/pub/release-${release}"
 45 | 
 46 | # The fasta file name depends on the species and assembly but not the version.
 47 | # Also the first letter of the file is capital while it is small for the actual
 48 | # cache file (above). For example: "Homo_sapiens.GRCh38.dna.toplevel.fa.gz"
 49 | readonly fasta_file="${species^?}.${assembly}.dna.toplevel.fa.gz"
 50 | if [[ $species == "homo_sapiens" ]] && [[ $assembly == "GRCh37" ]]; then
 51 |   if [[ ! `command -v samtools` ]]; then
 52 |     echo "ERROR: samtools is needed to create the .fai index."
 53 |     echo "It can be installed by:"
 54 |     echo "sudo apt-get install samtools"
 55 |     echo "Or it can be downloaded from:"
 56 |     echo "http://www.htslib.org/download/"
 57 |     exit 1
 58 |   fi
 59 |   if [ ! `command -v bgzip` ]; then
 60 |     echo "ERROR: bgzip is needed to create the .gzi index."
 61 |     echo "It can be installed by:"
 62 |     echo "sudo apt-get install tabix"
 63 |     exit 1
 64 |   fi
 65 |   readonly ftp_GRCh37="ftp://ftp.ensembl.org/pub/grch37/release-${release}"
 66 |   readonly remote_fasta="${ftp_GRCh37}/fasta/${species}/dna/${fasta_file}"
 67 |   echo "Downloading ${remote_fasta}"
 68 |   curl -O "${remote_fasta}"
 69 |   echo "Decompressing fasta file..."
 70 |   gzip -d "${fasta_file}"
 71 |   echo "Block compressing fasta file and creating .gzi index..."
 72 |   readonly num_cores=`nproc --all`
 73 |   bgzip --index --threads "$num_cores" "${fasta_file%.*}"
 74 |   echo "Creating .fai index..."
 75 |   samtools faidx "${fasta_file}"
 76 | else
 77 |   readonly remote_fasta="${ftp_base}/fasta/${species}/dna_index/${fasta_file}"
 78 |   echo "Downloading ${remote_fasta} and its index files ..."
 79 |   curl -O "${remote_fasta}"
 80 |   curl -O "${remote_fasta}.fai"
 81 |   curl -O "${remote_fasta}.gzi"
 82 | fi
 83 | 
 84 | # The path naming convention changed from "VEP" to "vep" after build 95.
 85 | if (( release <= 95 )); then
 86 |   readonly remote_cache="${ftp_base}/variation/VEP/${cache_file}"
 87 | else
 88 |   readonly remote_cache="${ftp_base}/variation/vep/${cache_file}"
 89 | fi
 90 | echo "Downloading ${remote_cache} ..."
 91 | curl -O "${remote_cache}"
 92 | echo "Decompressing cache files ..."
 93 | tar xzf "${cache_file}"
 94 | 
 95 | echo "Moving fasta files to the cache structure ..."
 96 | mv ${fasta_file}* "${species}/${release}_${assembly}"
 97 | 
 98 | echo "Creating single tar.gz file for the whole cache ..."
 99 | readonly output_cache="vep_cache_${species}_${assembly}_${release}.tar.gz"
100 | tar czf "${output_cache}" "${species}"
101 | if [[ -r "${output_cache}" ]]; then
102 |   echo "Cleaning up ..."
103 |   rm -rf "${species}"
104 |   rm -f "${cache_file}"
105 | fi
106 | popd
107 | 
108 | if [[ -r "${work_dir}/${output_cache}" ]]; then
109 |   echo "Successfully created cache file at ${work_dir}/${output_cache}"
110 | else
111 |   echo "ERROR: Something went wrong when creating ${work_dir}/${output_cache} !"
112 | fi
113 | 
114 | # TODO(bashir2): Experiment with the convert_cache.pl script of VEP and measure
115 | # performance improvements. If the change is significant then this script has to
116 | # run convert_cache.pl too.
117 | 


--------------------------------------------------------------------------------
/batch/vep/run_script_with_watchdog.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Copyright 2019 Google Inc.  All Rights Reserved.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #      http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | 
17 | # This script runs `script_to_run` (first argument) in the background. Every
18 | # `watchdog_file_update_interval` seconds (second argument), it checks the last
19 | # update time of `watchdog_file` (third argument). Once the watchdog file is
20 | # found to be stale, the background process will be killed. The arguments needed
21 | # for `script_to_run` are passed after the third argument.
22 | 
23 | set -euo pipefail
24 | 
25 | #################################################
26 | # Returns the generation number of a GCS file.
27 | # Arguments:
28 | #   $1: The GCS file.
29 | #################################################
30 | function get_last_update_time {
31 |   gsutil stat $1 | awk '$1 == "Generation:" {print $2}'
32 | }
33 | 
34 | function main {
35 |   if [[ $# < 3 ]]; then
36 |     echo "Usage: $0 <script_to_run> <watchdog_file_update_interval> <watchdog_file> <script_to_run_args>"
37 |     exit 1
38 |   fi
39 |   script_to_run="$1"
40 |   watchdog_file_update_interval="$2"
41 |   watchdog_file="$3"
42 |   script_to_run_args="${@:4}"
43 |   watchdog_file_allowed_stale_time="$((4*watchdog_file_update_interval))"
44 | 
45 |   ${script_to_run} ${script_to_run_args} &
46 | 
47 |   background_pid="$!"
48 |   while ps -p "${background_pid}" > /dev/null
49 |   do
50 |     last_update_sec="$(($(get_last_update_time ${watchdog_file})/1000000))"
51 |     declare -i now_sec
52 |     now_sec="$(date +%s)"
53 |     last_update_age_sec="$((now_sec-last_update_sec))"
54 |     echo "The watchdog file is updated ${last_update_age_sec} seconds ago."
55 |     if (("${last_update_age_sec}">"${watchdog_file_allowed_stale_time}")); then
56 |       echo "ERROR: The watchdog file is stale, and running of ${script_to_run} has been killed."
57 |       kill "${background_pid}"
58 |       exit 1
59 |     else
60 |       sleep "${watchdog_file_update_interval}"
61 |     fi
62 |   done
63 |   wait "${background_pid}"
64 |   if [[ $? -ne 0 ]]; then
65 |     echo "Running of ${script_to_run} failed."
66 |     exit 1
67 |   else
68 |     echo "Running of ${script_to_run} succeed."
69 |     exit 0
70 |   fi
71 | }
72 | 
73 | main "$@"
74 | 


--------------------------------------------------------------------------------
/batch/vep/run_vep.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Copyright 2018 Google Inc.  All Rights Reserved.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #      http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 | # This script is intended to be used in the Docker image built for VEP.
18 | # Note that except the single input and output files, all other arguments are
19 | # passed through environment variables.
20 | #
21 | # The only environment variable that has to be set is (others are optional):
22 | #
23 | # VEP_CACHE: The path of the VEP cache which is a single .tar.gz file.
24 | #
25 | # The first argument is the input file (might be a VCF or a compressed VCF) and
26 | # the second is the output file which is always a VCF file.
27 | #
28 | # For the full list of supported environment variables and their documentation
29 | # check README.md.
30 | # Capital letter variables refer to environment variables that can be set from
31 | # outside. Internal variables have small letters.
32 | 
33 | set -euo pipefail
34 | 
35 | if [[ $# -ne 2 ]]; then
36 |   echo "Usage: $0 input_file output_file"
37 |   exit 1
38 | fi
39 | 
40 | readonly species="${SPECIES:-homo_sapiens}"
41 | readonly assembly="${GENOME_ASSEMBLY:-GRCh38}"
42 | readonly fork_opt="--fork ${NUM_FORKS:-1}"
43 | readonly other_vep_opts="${OTHER_VEP_OPTS:---everything \
44 |   --check_ref --allow_non_variant}"
45 | readonly annotation_field_name="${VCF_INFO_FILED:-CSQ}"
46 | 
47 | if [[ ! -r "${VEP_CACHE:?VEP_CACHE is not set!}" ]]; then
48 |   echo "ERRPR: Cannot read ${VEP_CACHE}"
49 |   exit 1
50 | fi
51 | 
52 | # Check that the input file is readable.
53 | readonly input_file=${1}
54 | if [[ ! -r "${input_file}" ]]; then
55 |   echo "ERRPR: Cannot read ${input_file}"
56 |   exit 1
57 | fi
58 | 
59 | echo "Checking the input file at $(date)"
60 | ls -l "${input_file}"
61 | 
62 | # Make sure output file does not exist and can be written.
63 | readonly output_file=${2}
64 | if [[ -e ${output_file} ]]; then
65 |   echo "ERROR: ${output_file} already exist!"
66 |   exit 1
67 | fi
68 | mkdir -p $(dirname ${output_file})
69 | touch ${output_file}
70 | rm ${output_file}
71 | 
72 | readonly vep_cache_dir="$(dirname ${VEP_CACHE})"
73 | readonly vep_cache_file="$(basename ${VEP_CACHE})"
74 | pushd ${vep_cache_dir}
75 | if [[ -d "${species}" ]]; then
76 |   echo "The cache is already decompressed; found ${species} at $(date)"
77 | else
78 |   echo "Decompressing the cache file ${vep_cache_file} started at $(date)"
79 |   tar xzvf "${vep_cache_file}"
80 |   if [[ ! -d "${species}" ]]; then
81 |     echo "Cannot find directory ${species} after decompressing ${vep_cache_file}!"
82 |     exit 1
83 |   fi
84 | fi
85 | popd
86 | 
87 | readonly vep_command="./vep -i ${input_file} -o ${output_file} \
88 |   --dir ${vep_cache_dir} --offline --species ${species} --assembly ${assembly} \
89 |   --vcf --allele_number --vcf_info_field ${annotation_field_name} ${fork_opt} \
90 |   ${other_vep_opts}"
91 | echo "VEP command is: ${vep_command}"
92 | 
93 | echo "Running vep started at $(date)"
94 | # The next line should not be quoted since we want word splitting to happen.
95 | ${vep_command}
96 | 


--------------------------------------------------------------------------------
/batch/vep/sample_pipeline.yaml:
--------------------------------------------------------------------------------
 1 | # TODO(bashir2): Update this example with v2alpha1 required changes.
 2 | name: run-vep
 3 | resources:
 4 |   disks:
 5 |     - name: datadisk
 6 |       mountPoint: /mnt/data
 7 |       type: PERSISTENT_HDD
 8 |       sizeGb: 100
 9 |   minimumCpuCores: 12
10 | inputParameters:
11 |   - name: VEP_CACHE
12 |     defaultValue: gs://my_bucket/vep_cache_homo_sapiens_GRCh38_104.tar.gz
13 |     localCopy:
14 |       disk: datadisk
15 |       path: vep_cache_104.tar.gz
16 |   - name: INPUT_FILE
17 |     defaultValue: gs://my_bucket/input.vcf
18 |     localCopy:
19 |       disk: datadisk
20 |       path: input.vcf
21 |   - name: VCF_INFO_FILED
22 |     defaultValue: CSQ_VT
23 |   - name: NUM_FORKS
24 |     defaultValue: "12"
25 | outputParameters:
26 |   - name: OUTPUT_FILE
27 |     defaultValue: gs://my_bucket/output.vcf
28 |     localCopy:
29 |       disk: datadisk
30 |       path: output.vcf
31 | docker:
32 |   imageName: gcr.io/my-project/vep:104
33 |   cmd: /opt/variant_effect_predictor/run_vep.sh ${INPUT_FILE} ${OUTPUT_FILE}
34 | 


--------------------------------------------------------------------------------
/curation/README.md:
--------------------------------------------------------------------------------
 1 | **WARNING: Not actively maintained!**
 2 | 
 3 | Curation Scripts
 4 | ================
 5 | 
 6 | The scripts in this portion of the repository were used to ingest, reshape, and
 7 | store variant annotations in a cloud analysis-ready format.  You can use them to
 8 | bring in a fresh copy of an annotation resource or as a starting point for
 9 | curation of a new annotation resource.
10 | 
11 | ## Status of this sub-project
12 | 
13 | This code currently works with annotation resources such
14 | as [dbSNP](https://www.ncbi.nlm.nih.gov/projects/SNP/)
15 | and [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) along with variant allele
16 | frequencies
17 | from
18 | [NHLBI GO Exome Sequencing Project (ESP)](http://evs.gs.washington.edu/EVS/),
19 | [1000 Genomes](http://www.internationalgenome.org/),
20 | [ExAC](http://exac.broadinstitute.org/),
21 | and [Genome Aggregation Database (gnomAD)](http://gnomad.broadinstitute.org/)
22 | but similar techniques could be applied to other annotation resources.
23 | 
24 | All steps are run in the cloud, but each individual step is launched manually.
25 | 
26 | ## Overview
27 | 
28 | ### Curate Individual Annotation Sources
29 | 
30 | Many variant annotation sources are encoded as VCF files.  Therefore we can
31 | use [Google Genomics](https://cloud.google.com/genomics/) to import the resource
32 | and export it to BigQuery.
33 | 
34 | [Follow the tutorial](./tables) to run
35 | a [dsub](https://github.com/googlegenomics/dsub) script to create individual
36 | tables holding dbSNP, ClinVar, ESP, etc.
37 | 
38 | ### Create an "All-Possible SNPs" Table
39 | 
40 | A table with annotations for all possible SNPs of a particular genome reference
41 | is useful for:
42 | 
43 | * Examining SNP variation across different regions of the genome.
44 | * Quickly annotating the SNPs for a cohort using a simple JOIN.
45 | * Generating synthetic sequence variant datasets using the SNP allele
46 |   frequencies from this table.
47 | 
48 | [Follow the tutorial](./allPossibleSNPs) to create an all-possible-SNPs tables
49 | for build 38 of the human genome reference.
50 | 
51 | ### Add Column Descriptions to a BigQuery Table
52 | 
53 | The `variants` table generated by performing
54 | an
55 | [export from Google Genomics](https://cloud.google.com/genomics/reference/rest/v1/variantsets/export) does
56 | not include the field descriptions for the fields.
57 | 
58 | See [add BigQuery descriptions](./tables/AddBigQueryDescriptions.md) for
59 | instructions on how to automatically populate the BigQuery schema
60 | description with the information from the VCF header.
61 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/ESP_AA.sql:
--------------------------------------------------------------------------------
 1 |   --
 2 |   -- Prepare ESP AA for the JOIN.
 3 |   --
 4 |   ESP_AA AS (
 5 |   SELECT
 6 |     reference_name,
 7 |     start,
 8 |     `end`,
 9 |     reference_bases,
10 |     alternate_bases,
11 |     AF AS ESP_AA_AF,
12 |     -- Used to check for correctness of the JOIN.
13 |     names[OFFSET(0)] AS ESP_AA_rsid
14 |   FROM
15 |     `{{ ESP_AA_TABLE }}` v,
16 |     v.alternate_bases alternate_bases )
17 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/ESP_EA.sql:
--------------------------------------------------------------------------------
 1 |   --
 2 |   -- Prepare ESP EA for the JOIN.
 3 |   --
 4 |   ESP_EA AS (
 5 |   SELECT
 6 |     reference_name,
 7 |     start,
 8 |     `end`,
 9 |     reference_bases,
10 |     alternate_bases,
11 |     AF AS ESP_EA_AF,
12 |     -- Used to check for correctness of the JOIN.
13 |     names[OFFSET(0)] AS ESP_EA_rsid
14 |   FROM
15 |     `{{ ESP_EA_TABLE }}` v,
16 |     v.alternate_bases alternate_bases )
17 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/README.md:
--------------------------------------------------------------------------------
  1 | **WARNING: Not actively maintained!**
  2 | 
  3 | Create an All-Possible SNPs Table
  4 | =================================
  5 | 
  6 | This tutorial combines a particular reference genome with individual annotation
  7 | resources to create an "all possible SNPs" table.
  8 | 
  9 | [dsub](https://cloud.google.com/genomics/v1alpha2/dsub)
 10 | and [BigQuery](https://cloud.google.com/bigquery/) are used to run all of these
 11 | steps in the cloud.
 12 | 
 13 | ## Status of this tutorial
 14 | 
 15 | This is a work-in-progress. Next steps are to:
 16 | 
 17 | 1. add more annotation resources to the JOIN
 18 | 2. add examples that make use of the all possible SNPs GRCh38 table to analyze
 19 |    SNPs from
 20 |    the
 21 |    [Platinum Genomes DeepVariant](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/platinum_genomes_deepvariant.html) cohort. These
 22 |    examples would be similar to
 23 |    https://github.com/googlegenomics/bigquery-examples/tree/master/platinumGenomes.
 24 | 
 25 | ## (1) Configure project variables.
 26 | 
 27 | Set a few environment variables to facilitate cutting and pasting the subsequent
 28 | commands.
 29 | 
 30 | ``` bash
 31 | # The Google Cloud Platform project id in which to process the annotations.
 32 | PROJECT_ID=your-project-id
 33 | # The bucket name (with the gs:// prefix) for logs and temp files.
 34 | BUCKET=gs://your-bucket-name
 35 | # The BigQuery dataset, which must already exist, in which to store annotations.
 36 | DATASET=your_bigquery_dataset_name
 37 | ```
 38 | ## (2) Identify the reference genome you wish to annotate.
 39 | 
 40 | In this tutorial we're specifically working
 41 | with
 42 | [Verily’s version of GRCh38](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/reference_genomes.html#verily-s-grch38).
 43 | 
 44 | Note that instead you could:
 45 | 
 46 | * Use one of the other reference genomes are already available in Cloud Storage.
 47 | See
 48 | [Reference Genomes](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/reference_genomes.html) for
 49 | the list and Cloud Storage paths.
 50 | * Copy the FASTA file for the desired reference genome to cloud storage. For
 51 |   more detail,
 52 |   see
 53 |   [Copying large files to a bucket](https://cloud.google.com/storage/docs/working-with-big-data#copy-large-file).
 54 | 
 55 | ## (3) Convert the FASTA file.
 56 | 
 57 | Run the following [dsub](https://github.com/googlegenomics/dsub) command to
 58 | convert the FASTA file for the reference genome into a format ammenable to
 59 | BigQuery.
 60 | 
 61 | ``` bash
 62 | # Copy the script dsub will run to Cloud Storage.
 63 | gsutil cp fasta_to_kv.py ${BUCKET}
 64 | 
 65 | # Run the conversion operation.
 66 | dsub \
 67 |   --project ${PROJECT_ID} \
 68 |   --zones "us-central1-*" \
 69 |   --logging ${BUCKET}/fasta_to_kv.log \
 70 |   --image python:2.7-slim \
 71 |   --input FASTA=gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa \
 72 |   --input CONVERTER=${BUCKET}/fasta_to_kv.py \
 73 |   --output KV=${BUCKET}/GRCh38_Verily_v1.genome.txt \
 74 |   --command 'cat "${FASTA}" | python "${CONVERTER}" > "${KV}"' \
 75 |   --wait
 76 | ```
 77 | 
 78 | ## (4) Load the sequences into BigQuery.
 79 | 
 80 | Use the bq command line tool to load the sequences into BigQuery.
 81 | 
 82 | ``` bash
 83 | bq --project ${PROJECT_ID} load \
 84 |   -F '>' \
 85 |   --schema unused:string,chr:string,sequence_start:integer,sequence:string \
 86 |   ${DATASET}.VerilyGRCh38_sequences \
 87 |   ${BUCKET}/GRCh38_Verily_v1.genome.txt
 88 | ```
 89 | 
 90 | ## (5) Reshape the sequences into SNPs and JOIN with annotations.
 91 | 
 92 | Run script [render_templated_sql.py](./render_templated_sql.py) to create the
 93 | SQL that will perform the JOIN.
 94 | 
 95 | ``` bash
 96 | python ./render_templated_sql.py \
 97 |   --sequence_table ${DATASET}.VerilyGRCh38_sequences \
 98 |   --b38
 99 | ```
100 | 
101 | Then run the generated SQL via the BigQuery web UI or the bq command line tool
102 | and materialize the result to a new table.
103 | 
104 | See `render_templated_sql.py --help` for more details.
105 | 
106 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/all_possible_snps.sql:
--------------------------------------------------------------------------------
 1 |   --
 2 |   -- Create a table containing all possible SNPs for a reference genome.
 3 |   --
 4 |   -- Split the sequences from the FASTA file.
 5 |   --
 6 |   base_pairs AS (
 7 |   SELECT
 8 |     chr,
 9 |     sequence_start,
10 |     SPLIT(sequence, '') AS bps
11 |   FROM
12 |     `{{ SEQUENCE_TABLE }}`
13 |   # Use this replacement to test on small amount of data.  Otherwise replace
14 |   # it with the empty string.
15 |   {{ SEQUENCE_FILTER }} ),
16 |   --
17 |   -- Expand the data to one row per base pair.  Also upper case the
18 |   -- base pair and compute the end position.
19 |   --
20 |   all_refs AS (
21 |   SELECT
22 |     chr AS original_reference_name,
23 |     SUBSTR(chr, 4) AS reference_name,
24 |     sequence_start + base_pair_offset AS start,
25 |     sequence_start + base_pair_offset + 1 AS `end`,
26 |     UPPER(base_pair) AS reference_bases,
27 |     base_pair AS original_reference_bases
28 |   FROM
29 |     base_pairs,
30 |     base_pairs.bps base_pair
31 |   WITH
32 |   OFFSET
33 |     base_pair_offset),
34 |   --
35 |   -- Create a table holding the four possible values for
36 |   -- alternate_bases.
37 |   --
38 |   all_alternate_bases AS (
39 |   SELECT
40 |     'A' AS alternate_bases
41 |   UNION ALL
42 |   SELECT
43 |     'C' AS alternate_bases
44 |   UNION ALL
45 |   SELECT
46 |     'G' AS alternate_bases
47 |   UNION ALL
48 |   SELECT
49 |     'T' AS alternate_bases ),
50 |   all_possible_snps AS (
51 |     --
52 |     -- CROSS JOIN with all possible mutations for the base pair.  Note
53 |     -- that 'N' will result in four possible mutations.
54 |     --
55 |   SELECT
56 |     reference_name,
57 |     original_reference_name,
58 |     start,
59 |     `end`,
60 |     reference_bases,
61 |     original_reference_bases,
62 |     alternate_bases
63 |   FROM
64 |     all_refs
65 |   CROSS JOIN
66 |     all_alternate_bases
67 |     )
68 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/check_joined_annotations.sql:
--------------------------------------------------------------------------------
 1 | #standardSQL
 2 | --
 3 | -- Compare the dbSNP ids retrieved from a variety of annotation sources
 4 | -- to ensure that the multiple sources were joined correctly.
 5 | --
 6 | -- Replace `YOUR_NEWLY_CREATED_ANNOTATIONS_TABLE` with the table to which the
 7 | -- JOINed annotations were materialized.
 8 | --
 9 | SELECT
10 | {% for source in annot_sources %}
11 |   COUNTIF({{source}}_rsid = dbSNP_rsid)/COUNTIF({{source}}_rsid IS NOT NULL) AS {{source}}_matched,
12 |   COUNTIF({{source}}_rsid IS NOT NULL) AS {{source}}_compared,
13 | {% endfor %}
14 |   COUNT(dbSNP_rsid) AS num_in_dbSNP
15 | FROM `YOUR_NEWLY_CREATED_ANNOTATIONS_TABLE`
16 | WHERE
17 |   dbSNP_rsid IS NOT NULL
18 | 
19 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/clinvar.sql:
--------------------------------------------------------------------------------
 1 |   --
 2 |   -- Prepare ClinVar for the JOIN.
 3 |   --
 4 |   clinvar AS (
 5 |   SELECT
 6 |     reference_name,
 7 |     start,
 8 |     `end`,
 9 |     reference_bases,
10 |     alternate_bases,
11 |     -- Used to check for correctness of the JOIN.
12 |     CONCAT('rs', CAST(RS AS STRING)) AS clinvar_rsid,
13 |     -- ClinVar uses field CLNALLE to indicate "variant alleles from REF
14 |     -- or ALT columns.  0 is REF, 1 is the first ALT allele, etc.  This
15 |     -- is used to match alleles with other corresponding clinical (CLN)
16 |     -- INFO tags.  A value of -1 indicates that no allele was found to
17 |     -- match a corresponding HGVS allele name."
18 |     CLNDBN[OFFSET(clnalle_offset)] AS CLNDBN,
19 |     CLNACC[OFFSET(clnalle_offset)] AS CLNACC,
20 |     CLNDSDB[OFFSET(clnalle_offset)] AS CLNDSDB,
21 |     CLNDSDBID[OFFSET(clnalle_offset)] AS CLNDSDBID,
22 |     CLNREVSTAT[OFFSET(clnalle_offset)] AS CLNREVSTAT,
23 |     CLNSIG[OFFSET(clnalle_offset)] AS CLNSIG
24 |   FROM
25 |     `{{ CLINVAR_TABLE }}` v,
26 |     UNNEST(ARRAY_CONCAT([reference_bases], v.alternate_bases)) AS alternate_bases WITH OFFSET alt_offset,
27 |     v.CLNALLE clnalle WITH OFFSET clnalle_offset
28 |   WHERE
29 |     clnalle = alt_offset)
30 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/dbSNP.sql:
--------------------------------------------------------------------------------
 1 |   --
 2 |   -- Prepare dbSNP for the JOIN.
 3 |   --
 4 |   -- http://varianttools.sourceforge.net/Annotation/DbSNP
 5 |   -- Multiple alternate alleles sometimes correspond to the same rsid.
 6 |   -- Some variants have multiple rsids.
 7 |   --
 8 |   dbSNP AS (
 9 |   SELECT
10 |     reference_name,
11 |     start,
12 |     `end`,
13 |     reference_bases, -- on the + strand
14 |     alternate_bases, -- on the + strand
15 |     names AS rs_names,
16 |     RS,
17 |     -- Used to check for correctness of the JOIN.
18 |     CONCAT('rs', CAST(RS AS STRING)) AS dbSNP_rsid
19 |   FROM
20 |     `{{ DBSNP_TABLE }}` v,
21 |     v.alternate_bases alternate_bases )
22 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/fasta_to_kv.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #      http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | r"""Convert FASTA files to a map-reduceable format.
17 | 
18 | Example Input:
19 | >chr22
20 | CAAGG
21 | TTAGC
22 | CCCCC
23 | 
24 | Example Output:
25 | >chr22>0>CAAGG
26 | >chr22>5>TTAGC
27 | >chr22>10>CCCCC
28 | 
29 | It is very fast (~2 minutes for a 3 GB FASTA) when run on Compute Engine
30 | utilizing streaming download and upload.
31 | https://cloud.google.com/storage/docs/gsutil/commands/cp#streaming-transfers
32 | 
33 | For uncompressed FASTA files:
34 | 
35 | gsutil cat \
36 |   gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa
37 |   \
38 |   | \
39 |   ./fasta_to_kv.py \
40 |   | \
41 |   gsutil cp - gs://MY-BUCKET/refs/GRCh38_Verily_v1.genome.txt
42 | 
43 | For compressed FASTA files, use the appropriate command to unzip the file
44 | before passing it to this script:
45 | 
46 | gsutil cat \
47 |   gs://genomics-public-data/references/hg19/*fa.gz \
48 |   | \
49 |   gunzip \
50 |   | \
51 |   ./fasta_to_kv.py \
52 |   | \
53 |   gsutil cp - gs://MY-BUCKET/refs/hg19.txt
54 | """
55 | 
56 | import sys
57 | 
58 | sequence = ""
59 | position = 0
60 | 
61 | for line in sys.stdin:
62 |   trimmed = line.strip()
63 |   if not trimmed:
64 |     break
65 | 
66 |   if trimmed.startswith(";"):
67 |     # Skip comment lines.
68 |     continue
69 | 
70 |   if trimmed.startswith(">"):
71 |     # We've started a new sequence.  Reset the state.
72 |     sequence = trimmed
73 |     position = 0
74 |     continue
75 | 
76 |   # Write out the sequence with a prefix indicating its context.  Use '>'
77 |   # as the delimiter since its a safe character to use in the file.
78 |   sys.stdout.write(sequence + ">" + str(position) + ">" + trimmed + "\n")
79 |   position += len(trimmed)
80 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/join_annotations.sql:
--------------------------------------------------------------------------------
 1 | #standardSQL
 2 | WITH
 3 | {% if 'dbSNP' in annot_sources %}
 4 | {% include 'dbSNP.sql' %},
 5 | {% endif %}
 6 | 
 7 | {% if 'clinvar' in annot_sources %}
 8 | {% include 'clinvar.sql' %},
 9 | {% endif %}
10 | 
11 | {% if 'thousandGenomes' in annot_sources %}
12 | {% include 'thousandGenomes.sql' %},
13 | {% endif %}
14 | 
15 | {% if 'ESP_AA' in annot_sources %}
16 | {% include 'ESP_AA.sql' %},
17 | {% endif %}
18 | 
19 | {% if 'ESP_EA' in annot_sources %}
20 | {% include 'ESP_EA.sql' %},
21 | {% endif %}
22 | 
23 | {% include 'all_possible_snps.sql' %}
24 | 
25 | --
26 | -- Then JOIN with the individual variant annotation DBs.
27 | --
28 | SELECT
29 |   *
30 | FROM
31 |   all_possible_snps
32 | {% for source in annot_sources %}
33 | LEFT OUTER JOIN {{source}}
34 |   USING(reference_name, start, `end`, reference_bases, alternate_bases)
35 | {% endfor %}
36 | 
37 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/render_templated_sql.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python
  2 | 
  3 | # Copyright 2017 Verily Life Sciences Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #    http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | """Assemble an SQL query.
 18 | 
 19 | Using a basic pattern for JOINs with variant annotation databases, assemble
 20 | templated SQL into a full query that can but run to create an annotated
 21 | "all possible SNPs" table.
 22 | """
 23 | 
 24 | from __future__ import absolute_import
 25 | 
 26 | import argparse
 27 | import logging
 28 | import sys
 29 | 
 30 | from jinja2 import Environment
 31 | from jinja2 import FileSystemLoader
 32 | 
 33 | SEQUENCE_TABLE_KEY = "SEQUENCE_TABLE"
 34 | 
 35 | B37_QUERY_REPLACEMENTS = {
 36 |     "SEQUENCE_FILTER": """WHERE chr IN ('chr17', '17')
 37 |     AND sequence_start BETWEEN 41196311 AND 41277499""",
 38 |     "DBSNP_TABLE": "bigquery-public-data.human_variant_annotation.ncbi_dbsnp_hg19_20170710",
 39 |     "CLINVAR_TABLE":
 40 |     "bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg19_20170705",
 41 |     "THOUSAND_GENOMES_TABLE":
 42 |     "bigquery-public-data.human_variant_annotation.ensembl_1000genomes_phase3_hg19_release89",
 43 |     "ESP_AA_TABLE":
 44 |     "bigquery-public-data.human_variant_annotation.ensembl_esp6500_aa_hg19_release89",
 45 |     "ESP_EA_TABLE":
 46 |     "bigquery-public-data.human_variant_annotation.ensembl_esp6500_ea_hg19_release89",
 47 | }
 48 | 
 49 | B38_QUERY_REPLACEMENTS = {
 50 |     "SEQUENCE_FILTER": """WHERE chr IN ('chr17', '17')
 51 |     AND sequence_start BETWEEN 43045628 AND 43125483""",
 52 |     "DBSNP_TABLE": "bigquery-public-data.human_variant_annotation.ncbi_dbsnp_hg38_20170710",
 53 |     "CLINVAR_TABLE":
 54 |     "bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20170705",
 55 |     "THOUSAND_GENOMES_TABLE":
 56 |     "bigquery-public-data.human_variant_annotation.ensembl_1000genomes_phase3_hg38_release89",
 57 |     "ESP_AA_TABLE":
 58 |     "bigquery-public-data.human_variant_annotation.ensembl_esp6500_aa_hg38_release89",
 59 |     "ESP_EA_TABLE":
 60 |     "bigquery-public-data.human_variant_annotation.ensembl_esp6500_ea_hg38_release89",
 61 | }
 62 | 
 63 | # The table alias and the query filename must be the same.
 64 | B37_ANNOTATION_SOURCES = ["dbSNP",
 65 |                           "clinvar",
 66 |                           "thousandGenomes",
 67 |                           "ESP_AA",
 68 |                           "ESP_EA"
 69 |                           # TODO: add gnomAD here.
 70 |                          ]
 71 | B38_ANNOTATION_SOURCES = ["dbSNP",
 72 |                           "clinvar",
 73 |                           "thousandGenomes",
 74 |                           "ESP_AA",
 75 |                           "ESP_EA"]
 76 | 
 77 | 
 78 | def run(argv=None):
 79 |   """Main entry point."""
 80 |   parser = argparse.ArgumentParser()
 81 |   parser.add_argument(
 82 |       "--sequence_table",
 83 |       required=True,
 84 |       help="Fully qualified BigQuery table name for the reference "
 85 |       "genome sequences to be converted to all-possible SNPs.")
 86 |   parser.add_argument(
 87 |       "--b37",
 88 |       dest="is_b37",
 89 |       default=True,
 90 |       action="store_true",
 91 |       help="Use annotation tables aligned to build 37 of the "
 92 |       "human genome reference.")
 93 |   parser.add_argument(
 94 |       "--b38",
 95 |       dest="is_b37",
 96 |       action="store_false",
 97 |       help="Use annotation tables aligned to build 38 of the "
 98 |       "human genome reference.")
 99 |   parser.add_argument(
100 |       "--output",
101 |       dest="output",
102 |       default="annotated_snps_RENDERED.sql",
103 |       help="Output file to which to write rendered SQL.")
104 |   parser.add_argument(
105 |       "--debug",
106 |       dest="debug",
107 |       action="store_true",
108 |       help="Generate SQL that will yield a small table for testing purposes.")
109 |   args = parser.parse_args(argv)
110 | 
111 |   sources = B37_ANNOTATION_SOURCES if (
112 |       args.is_b37) else B38_ANNOTATION_SOURCES
113 |   replacements = B37_QUERY_REPLACEMENTS.copy() if (
114 |       args.is_b37) else B38_QUERY_REPLACEMENTS.copy()
115 | 
116 |   replacements[SEQUENCE_TABLE_KEY] = args.sequence_table
117 | 
118 |   if not args.debug:
119 |     replacements["SEQUENCE_FILTER"] = ""
120 | 
121 |   join_template = Environment(loader=FileSystemLoader("./")).from_string(
122 |       open("join_annotations.sql", "r").read())
123 |   join_query = join_template.render(replacements, annot_sources=sources)
124 |   with open(args.output, "w") as outfile:
125 |     outfile.write(join_query)
126 | 
127 |   check_template = Environment(loader=FileSystemLoader("./")).from_string(
128 |       open("check_joined_annotations.sql", "r").read())
129 |   check_query = check_template.render(replacements, annot_sources=sources)
130 |   sys.stdout.write("""
131 | Resulting JOIN query written to output file %s.  Run that query using the
132 | BigQuery web UI or the bq command line tool.
133 | 
134 | Be sure to test the result of the JOIN, for example:
135 | 
136 | %s
137 | """ % (args.output, check_query))
138 | 
139 | if __name__ == "__main__":
140 |   logging.getLogger().setLevel(logging.INFO)
141 |   run()
142 | 


--------------------------------------------------------------------------------
/curation/allPossibleSNPs/thousandGenomes.sql:
--------------------------------------------------------------------------------
 1 |   --
 2 |   -- Prepare 1000 Genomes for the JOIN.
 3 |   --
 4 |   thousandGenomes AS (
 5 |   SELECT
 6 |     reference_name,
 7 |     start,
 8 |     `end`,
 9 |     reference_bases,
10 |     alternate_bases,
11 |     AFR_AF[OFFSET(alt_offset)] AS AFR_AF_1000G,
12 |     AMR_AF[OFFSET(alt_offset)] AS AMR_AF_1000G,
13 |     EAS_AF[OFFSET(alt_offset)] AS EAS_AF_1000G,
14 |     EUR_AF[OFFSET(alt_offset)] AS EUR_AF_1000G,
15 |     SAS_AF[OFFSET(alt_offset)] AS SAS_AF_1000G,
16 |     -- Used to check for correctness of the JOIN.
17 |     names[OFFSET(0)] AS thousandGenomes_rsid
18 |   FROM
19 |     `{{ THOUSAND_GENOMES_TABLE }}` v,
20 |     v.alternate_bases alternate_bases WITH OFFSET alt_offset )
21 | 


--------------------------------------------------------------------------------
/curation/tables/AddBigQueryDescriptions.md:
--------------------------------------------------------------------------------
 1 | Add Column Descriptions to a BigQuery Table
 2 | ===========================================
 3 | 
 4 | The `variants` table generated by performing
 5 | an
 6 | [export from Google Genomics](https://cloud.google.com/genomics/reference/rest/v1/variantsets/export) does
 7 | not include the field descriptions for the fields.
 8 | 
 9 | Some fields in the variants table are fixed fields (such as the
10 | `reference_name`, `start`, and `end` fields), while others are variable -
11 | discovered in source VCF or masterVar files during the variant import process.
12 | 
13 | An ideal tool would be able to pull together a set of descriptions for the
14 | fixed fields, along with descriptions from:
15 | 
16 |   * the source variant set
17 |   * the source variant files (VCFs or masterVar)
18 |   * descriptions already set in the `variants` table schema
19 | 
20 | and then allow for easy user additions and edits prior to updating the variants
21 | table schema.
22 | 
23 | Since most variant sets are built from homogenous VCF files, this script
24 | provides just a simple implementation - provide a VCF used to build your variant
25 | set - it will extract the relevant descriptions from the VCF header and update
26 | the table.  The VCF can be local or in Google Cloud Storage if TensorFlow
27 | libraries are installed.  The VCF may be compressed with gzip and the filename
28 | must end with ".gz" if so.
29 | 
30 | ## Setup
31 | 
32 | The tool here uses the BigQuery client libraries described at:
33 | 
34 |       https://cloud.google.com/bigquery/docs/reference/libraries
35 | 
36 | The following steps install that library in
37 | a
38 | [Python virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)
39 | 
40 | 1. Create a virtualenv
41 | 
42 |     ```
43 |     virtualenv bq_lib
44 |     ```
45 | 
46 | 2. Activate the virtualenv
47 | 
48 |     ```
49 |     source bq_lib/bin/activate
50 |     ```
51 | 
52 | 3. Install the BigQuery client libraries and TensorFlow:
53 | 
54 |     ```
55 |     pip install --upgrade google-cloud-bigquery tensorflow
56 |     ```
57 | 
58 | ## Run
59 | 
60 | ```shell
61 | python update_variants_schema.py \
62 |   --source-vcf PATH_TO_VCF.vcf \
63 |   --destination-table PROJECT_ID.DATASET_NAME.TABLE_NAME
64 | ```
65 | 
66 | Note the fully-qualified table name follows
67 | BigQuery
68 | [Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/)
69 | conventions.
70 | 


--------------------------------------------------------------------------------
/curation/tables/Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM gcr.io/cloud-builders/gcloud
 2 | 
 3 | RUN apt-get update \
 4 |   && apt-get install -y python-setuptools \
 5 |   && pip install --upgrade \
 6 |     gcloud \
 7 |     google-api-python-client \
 8 |     google-cloud-bigquery \
 9 |     retrying \
10 |     tensorflow
11 | 
12 | 
13 | COPY *.py /usr/local/bin/
14 | COPY launch_import_vcf_to_bigquery.sh /usr/local/bin/
15 | 
16 | ENTRYPOINT ["bash"]
17 | 


--------------------------------------------------------------------------------
/curation/tables/README.md:
--------------------------------------------------------------------------------
  1 | **WARNING: Not actively maintained!**
  2 | 
  3 | Curate Individual Annotation Sources
  4 | ====================================
  5 | 
  6 | This tutorial loads several annotation sources to individual BigQuery tables.
  7 | These tables are already available in BigQuery
  8 | dataset
  9 | [bigquery-public-data:human_variant_annotation](https://bigquery.cloud.google.com/dataset/bigquery-public-data:human_variant_annotation),
 10 | but the configuration in this tutorial can be updated to load new versions of
 11 | these resources or load additional annotation resources.
 12 | 
 13 | [Container Builder](https://cloud.google.com/container-builder/docs/overview),
 14 | [dsub](https://github.com/googlegenomics/dsub)
 15 | and [Google Genomics](https://cloud.google.com/genomics/) are used to run all of
 16 | these steps in the cloud.
 17 | 
 18 | ## (1) Configure project variables.
 19 | 
 20 | Set a few environment variables to facilitate cutting and pasting the subsequent
 21 | commands.
 22 | 
 23 | ``` bash
 24 | # The Google Cloud Platform project id in which the Docker containers
 25 | # will be built and stored.
 26 | PROJECT_ID=your-project-id
 27 | # The bucket name (with the gs:// prefix) where dsub logs will
 28 | # be written.
 29 | BUCKET=gs://your-bucket-name
 30 | # The BigQuery destination dataset for the imported annotations.
 31 | DATASET=your_bigquery_dataset_name
 32 | ```
 33 | 
 34 | ## (2) Build the importer Docker container.
 35 | 
 36 | Build the VCF importer image using the Container Builder service:
 37 | 
 38 | ``` bash
 39 | gcloud container builds submit \
 40 |     --project ${PROJECT_ID} \
 41 |     --tag gcr.io/${PROJECT_ID}/vcf_to_bigquery \
 42 |     .
 43 | ```
 44 | 
 45 | ## (3) Test a small import.
 46 | 
 47 | The target BigQuery dataset must already exist, and the service account used to
 48 | run [dsub](https://cloud.google.com/genomics/v1alpha2/dsub) jobs must have
 49 | "BigQuery Data Owner" role.  (The Compute Engine default service account will
 50 | not have this role by
 51 | default.
 52 | [It would need to be added.](https://cloud.google.com/iam/docs/granting-roles-to-service-accounts))
 53 | 
 54 | Submit a single VCF import task
 55 | via [dsub](https://cloud.google.com/genomics/v1alpha2/dsub).  Here we use the
 56 | small file
 57 | `gs://genomics-public-data/1000-genomes/vcf/ALL.chrY.genome_strip_hq.20101123.svs.low_coverage.genotypes.vcf`
 58 | for a quick test.
 59 | 
 60 | ``` bash
 61 | dsub \
 62 |   --project ${PROJECT_ID} \
 63 |   --zones "us-*" \
 64 |   --logging ${BUCKET}/upload_logs \
 65 |   --image gcr.io/${PROJECT_ID}/vcf_to_bigquery \
 66 |   --scopes "https://www.googleapis.com/auth/bigquery" \
 67 |     "https://www.googleapis.com/auth/devstorage.read_write" \
 68 |   --env \
 69 |     SOURCE_VCFS=gs://genomics-public-data/1000-genomes/vcf/ALL.chrY.genome_strip_hq.20101123.svs.low_coverage.genotypes.vcf \
 70 |     PROJECT=${PROJECT_ID} \
 71 |     DATASET=${DATASET} \
 72 |     VARIANTSET=test \
 73 |     TABLE=${PROJECT_ID}.${DATASET}.test \
 74 |   --script launch_import_vcf_to_bigquery.sh
 75 | ```
 76 | 
 77 | ## (4) Configure the annotation sources and destinations.
 78 | 
 79 | Edit [vcf_manifest.tsv](vcf_manifest.tsv) to use your desired Cloud Storage,
 80 | Google Genomics, and BigQuery destinations.  It can also be edited to use newer
 81 | versions of the annotation sources and/or add more annotation sources where the
 82 | file format is VCF.
 83 | 
 84 | ## (5) Run all the annotation imports in parallel.
 85 | 
 86 | Submit multiple parallel imports via
 87 | [dsub](https://cloud.google.com/genomics/v1alpha2/dsub):
 88 | 
 89 | ``` bash
 90 | dsub \
 91 |   --project ${PROJECT_ID} \
 92 |   --zones "us-*" \
 93 |   --logging ${BUCKET}/upload_logs \
 94 |   --image gcr.io/${PROJECT_ID}/vcf_to_bigquery \
 95 |   --scopes "https://www.googleapis.com/auth/bigquery" \
 96 |     "https://www.googleapis.com/auth/devstorage.read_write" \
 97 |   --tasks vcf_manifest.tsv \
 98 |   --script launch_import_vcf_to_bigquery.sh
 99 | ```
100 | 


--------------------------------------------------------------------------------
/curation/tables/import_vcf_to_bigquery.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | r"""Import variant data in a VCF file to a BigQuery variants table.
 15 | 
 16 | Example usage:
 17 | 
 18 | python import_vcf_to_bigquery.py \
 19 |     --source-vcf "gs://BUCKET_NAME/PATH/TO/variants.vcf.gz" \
 20 |     --project "PROJECT_ID" \
 21 |     --dataset "DATASET_NAME" \
 22 |     --variantset "VARIANTSET_NAME" \
 23 |     --destination-table "PROJECT_ID.DATASET_NAME.TABLE_NAME" \
 24 |     --expand-wildcards
 25 | """
 26 | 
 27 | import argparse
 28 | import logging
 29 | 
 30 | import vcf_to_bigquery_utils
 31 | 
 32 | 
 33 | def _parse_arguments():
 34 |   """Parses command line arguments.
 35 | 
 36 |   Returns:
 37 |     A Namespace of parsed arguments.
 38 |   """
 39 |   parser = argparse.ArgumentParser(
 40 |       formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 41 |   parser.add_argument(
 42 |       "--source-vcf",
 43 |       nargs="+",
 44 |       required=True,
 45 |       help=("Cloud Storage path[s] to [gzip-compressed] VCF file[s],"
 46 |             " wildcards accepted (* but not **)."))
 47 |   parser.add_argument(
 48 |       "--project",
 49 |       required=True,
 50 |       help="Cloud project for imported Google Genomics data.")
 51 |   parser.add_argument(
 52 |       "--dataset",
 53 |       required=True,
 54 |       help=("Google Genomics dataset name or id"
 55 |             " (existing datasets will be appended)."))
 56 |   parser.add_argument(
 57 |       "--variantset",
 58 |       required=True,
 59 |       help=("Google Genomics variant set name or id"
 60 |             " (existing targets will be appended)."))
 61 |   parser.add_argument(
 62 |       "--new-dataset",
 63 |       action="store_true",
 64 |       help="Create a new dataset, even if one with this name exists.")
 65 |   parser.add_argument(
 66 |       "--new-variantset",
 67 |       action="store_true",
 68 |       help="Create a new variant set, even if one with this name exists.")
 69 |   parser.add_argument(
 70 |       "--expand-wildcards",
 71 |       action="store_true",
 72 |       help="Expand wildcards in VCF paths and use parallel imports.")
 73 |   parser.add_argument(
 74 |       "--destination-table",
 75 |       required=True,
 76 |       help="Full path to destination BigQuery table "
 77 |            "(PROJECT_ID.DATASET_NAME.TABLE_NAME).")
 78 |   parser.add_argument(
 79 |       "--description",
 80 |       help="Description for destination BigQuery table.")
 81 | 
 82 |   return parser.parse_args()
 83 | 
 84 | 
 85 | def main():
 86 |   args = _parse_arguments()
 87 |   logging.basicConfig(level=logging.INFO)
 88 | 
 89 |   uploader = vcf_to_bigquery_utils.VcfUploader(args.project)
 90 |   uploader.upload_variants(dataset=args.dataset,
 91 |                            variantset=args.variantset,
 92 |                            source_vcfs=args.source_vcf,
 93 |                            destination_table=args.destination_table,
 94 |                            expand_wildcards=args.expand_wildcards,
 95 |                            new_dataset=args.new_dataset,
 96 |                            new_variantset=args.new_variantset,
 97 |                            description=args.description)
 98 | 
 99 | 
100 | if __name__ == "__main__":
101 |   main()
102 | 


--------------------------------------------------------------------------------
/curation/tables/launch_import_vcf_to_bigquery.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | #
 3 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #     http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 | # Launch VCF importing code using parameter values set in environment variables.
18 | # ${SOURCE_VCFS} is a single environment variable that optionally refers to
19 | # multiple files, separated by whitespace and optionally quote-delimited.
20 | 
21 | # TODO: Copy local ${SOURCE_VCFS} to Cloud Storage if they are remote (HTTP or
22 | #   FTP) or local. Also uncompress input files for faster imports.
23 | 
24 | # Handle quotes in VCF paths in original job array list with an eval.
25 | eval source_vcfs_array=("${SOURCE_VCFS}")
26 | python /usr/local/bin/import_vcf_to_bigquery.py \
27 |   --source-vcf "${source_vcfs_array[@]}" \
28 |   --project "${PROJECT}" \
29 |   --dataset "${DATASET}" \
30 |   --variantset "${VARIANTSET}" \
31 |   --destination-table "${TABLE}" \
32 |   --description "${SOURCE_VCFS}" \
33 |   --expand-wildcards
34 | 


--------------------------------------------------------------------------------
/curation/tables/schema_update_utils.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | """Library to update a variants table schema with field descriptions.
 15 | """
 16 | 
 17 | import glob
 18 | import gzip
 19 | import logging
 20 | import re
 21 | 
 22 | from gcloud import bigquery
 23 | 
 24 | # If TensorFlow is installed, use its gfile library.
 25 | try:
 26 |   from tensorflow import gfile
 27 | except ImportError:
 28 |   logging.warning('TensorFlow not installed; VCF in Cloud Storage unsupported')
 29 | 
 30 | 
 31 | # String length limit for BigQuery table and column descriptions.  See:
 32 | #   https://cloud.google.com/bigquery/docs/reference/rest/v2/tables.
 33 | _MAX_LENGTH = 1024
 34 | _TRUNCATION_WARNING = 'Truncating %s to comply with BigQuery length limits'
 35 | 
 36 | _FIXED_VARIANT_FIELDS = {
 37 |     'reference_name':
 38 |         'An identifier from the reference genome or an angle-bracketed ID '
 39 |         'string pointing to a contig in the assembly file.',
 40 |     'start': 'The reference position, with the first base having position 0.',
 41 |     'end': 'End position of the variant described in this record.',
 42 |     'reference_bases':
 43 |         'Each base must be one of A,C,G,T,N (case insensitive). Multiple '
 44 |         'bases are permitted. The value in the \'start\' field refers to the '
 45 |         'position of the first base in the string.',
 46 |     'alternate_bases':
 47 |         'List of alternate non-reference alleles.',
 48 |     'variant_id': 'Google Genomics variant id.',
 49 |     'quality': 'Phred-scaled quality score for the assertion made in ALT.',
 50 |     'names': 'List of unique identifiers for the variant where available.',
 51 |     'call': 'Per-sample measurements.',
 52 | }
 53 | 
 54 | _FIXED_CALL_FIELDS = {
 55 |     'call_set_id':
 56 |         'The id of the callset from which this data was exported from the '
 57 |         'Google Genomics Variants API.',
 58 |     'call_set_name':
 59 |         'Sample identifier from source data.',
 60 |     'genotype':
 61 |         'List of genotypes.',
 62 |     'genotype_likelihood':
 63 |         'List of genotype likelihoods.',
 64 |     'phaseset':
 65 |         'If this value is null, the data is unphased.  Otherwise it is phased.',
 66 |     'qual': 'Phred-scaled quality score for the assertion made in ALT.',
 67 | }
 68 | 
 69 | 
 70 | class Descriptions(object):
 71 |   """Encapsulate field descriptions as parsed from a VCF."""
 72 | 
 73 |   def __init__(self):
 74 |     self.filter_description = None
 75 |     self.format_fields = {}
 76 |     self.info_fields = {}
 77 | 
 78 |   @staticmethod
 79 |   def _parse_filter_header(line_no, line):
 80 |     value = line.split('=', 1)[1]
 81 | 
 82 |     m = re.match(r'<ID=([^,]+),Description="(.*)">', value)
 83 |     if not m:
 84 |       raise ValueError('Failed to parse line %d: %s' % (line_no, line))
 85 | 
 86 |     return {'id': m.group(1), 'description': m.group(2)}
 87 | 
 88 |   @staticmethod
 89 |   def _parse_format_or_info_header(line_no, line):
 90 |     value = line.split('=', 1)[1]
 91 | 
 92 |     m = re.match(r'<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)">',
 93 |                  value)
 94 |     if not m:
 95 |       raise ValueError('Failed to parse line %d: %s' % (line_no, line))
 96 | 
 97 |     return {'id': m.group(1), 'description': m.group(4)}
 98 | 
 99 |   def add_from_vcf(self, path):
100 |     """Add descriptions from a VCF.
101 | 
102 |     Args:
103 |       path: Path to local or remote (in Cloud Storage via a "gs://" path, if
104 |           TensorFlow is installed) VCF file, optionally gzip-compressed
105 |           (requires a ".gz" suffix).
106 |     """
107 |     filter_desc = []
108 |     format_fields = {}
109 |     info_fields = {}
110 | 
111 |     # Handle wildcards in the path by expanding and taking the first file.
112 |     if path.startswith('gs://'):
113 |       path = gfile.Glob(path)[0]
114 |       f = gfile.Open(path)
115 |     else:
116 |       path = glob.glob(path)[0]
117 |       f = open(path)
118 | 
119 |     # Handle gzipped VCF files.
120 |     if path.endswith('.gz'):
121 |       f = gzip.GzipFile(fileobj=f)
122 | 
123 |     line_no = 0
124 |     for line in f:
125 |       line_no += 1
126 | 
127 |       if line.startswith('##FORMAT='):
128 |         header = self._parse_format_or_info_header(line_no, line)
129 |         format_fields[header['id']] = header['description']
130 | 
131 |       elif line.startswith('##INFO='):
132 |         header = self._parse_format_or_info_header(line_no, line)
133 |         info_fields[header['id']] = header['description']
134 | 
135 |       elif line.startswith('##FILTER='):
136 |         header = self._parse_filter_header(line_no, line)
137 |         filter_desc.append(header)
138 | 
139 |       # Reached the end of the VCF header
140 |       if line.startswith('#CHROM'):
141 |         break
142 | 
143 |     # Update the member fields
144 |     self.filter_description = '\n'.join(
145 |         ['%s: %s' % (item['id'], item['description']) for item in filter_desc])
146 | 
147 |     # If the filter description is too long, only include the field names.
148 |     if len(self.filter_description) > _MAX_LENGTH:
149 |       logging.warning(_TRUNCATION_WARNING, 'variant filter thresholds')
150 |       self.filter_description = '\n'.join([item['id'] for item in filter_desc])
151 | 
152 |     self.format_fields = format_fields
153 |     self.info_fields = info_fields
154 | 
155 | 
156 | def tokenize_table_name(full_table_name):
157 |   """Tokenize a BigQuery table_name.
158 | 
159 |   Splits a table name in the format of 'PROJECT_ID.DATASET_NAME.TABLE_NAME' to
160 |   a tuple of three strings, in that order.  PROJECT_ID may contain periods (for
161 |   domain-scoped projects).
162 | 
163 |   Args:
164 |     full_table_name: BigQuery table name, as PROJECT_ID.DATASET_NAME.TABLE_NAME.
165 |   Returns:
166 |     A tuple of project_id, dataset_name, and table_name.
167 | 
168 |   Raises:
169 |     ValueError: If full_table_name cannot be parsed.
170 |   """
171 |   delimiter = '.'
172 |   tokenized_table = full_table_name.split(delimiter)
173 |   if not tokenized_table or len(tokenized_table) < 3:
174 |     raise ValueError('Table name must be of the form '
175 |                      'PROJECT_ID.DATASET_NAME.TABLE_NAME')
176 |   # Handle project names with periods, e.g. domain.org:project_id.
177 |   return (delimiter.join(tokenized_table[:-2]),
178 |           tokenized_table[-2],
179 |           tokenized_table[-1])
180 | 
181 | 
182 | def update_table_schema(destination_table, source_vcf, description=None):
183 |   """Updates a BigQuery table with the variants schema using a VCF header.
184 | 
185 |   Args:
186 |     destination_table: BigQuery table name, PROJECT_ID.DATASET_NAME.TABLE_NAME.
187 |     source_vcf: Path to local or remote (Cloud Storage) VCF or gzipped VCF file.
188 |     description: Optional description for the BigQuery table.
189 | 
190 |   Raises:
191 |     ValueError: If destination_table cannot be parsed.
192 |   """
193 | 
194 |   dest_table = tokenize_table_name(destination_table)
195 |   dest_project_id, dest_dataset_name, dest_table_name = dest_table
196 | 
197 |   # Load the source VCF
198 |   descriptions = Descriptions()
199 |   descriptions.add_from_vcf(source_vcf)
200 | 
201 |   # Initialize the BQ client
202 |   client = bigquery.Client(project=dest_project_id)
203 | 
204 |   # Load the destination table
205 |   dest_dataset = client.dataset(dest_dataset_name)
206 |   dest_dataset.reload()
207 | 
208 |   dest_table = dest_dataset.table(dest_table_name)
209 |   dest_table.reload()
210 | 
211 |   if description is not None:
212 |     dest_table.patch(description=description[:_MAX_LENGTH])
213 |     if len(description) > _MAX_LENGTH:
214 |       logging.warning(_TRUNCATION_WARNING, 'table description')
215 | 
216 |   # Set the description on the variant fields and the call fields.
217 |   #
218 |   # The (non-fixed) variant field descriptions come from the ##INFO headers
219 |   # The (non-fixed) call fields descriptions can come from the ##FORMAT headers
220 |   #   as well as the ##INFO headers.
221 | 
222 |   # Process variant fields
223 |   call_field = None
224 |   for field in dest_table.schema:
225 |     if field.name.lower() in _FIXED_VARIANT_FIELDS:
226 |       field.description = _FIXED_VARIANT_FIELDS[field.name.lower()]
227 |       logging.debug('Variant(fixed): %s: %s', field.name, field.description)
228 | 
229 |     elif field.name in descriptions.info_fields:
230 |       field.description = descriptions.info_fields[field.name]
231 |       logging.debug('Variant(INFO) %s: %s', field.name, field.description)
232 | 
233 |     elif field.name.lower() == 'filter':
234 |       field.description = descriptions.filter_description
235 | 
236 |     if field.name == 'call':
237 |       call_field = field
238 | 
239 |     if field.description is not None and len(field.description) > _MAX_LENGTH:
240 |       logging.warning(_TRUNCATION_WARNING, field.name)
241 |       field.description = field.description[:_MAX_LENGTH]
242 | 
243 |   # Process call fields
244 |   for field in call_field.fields:
245 |     if field.name.lower() in _FIXED_CALL_FIELDS:
246 |       field.description = _FIXED_CALL_FIELDS[field.name.lower()]
247 |       logging.debug('Call(fixed): %s: %s', field.name, field.description)
248 | 
249 |     elif field.name in descriptions.format_fields:
250 |       field.description = descriptions.format_fields[field.name]
251 |       logging.debug('Call(FORMAT) %s: %s', field.name, field.description)
252 | 
253 |     elif field.name in descriptions.info_fields:
254 |       field.description = descriptions.info_fields[field.name]
255 |       logging.debug('Call(INFO) %s: %s', field.name, field.description)
256 | 
257 |     elif field.name.lower() == 'filter':
258 |       field.description = descriptions.filter_description
259 | 
260 |     if field.description is not None and len(field.description) > _MAX_LENGTH:
261 |       logging.warning(_TRUNCATION_WARNING, field.name)
262 |       field.description = field.description[:_MAX_LENGTH]
263 | 
264 |   logging.info('Updating table %s', dest_table.path)
265 |   dest_table.patch(schema=dest_table.schema)
266 | 


--------------------------------------------------------------------------------
/curation/tables/update_variants_schema.py:
--------------------------------------------------------------------------------
 1 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved.
 2 | #
 3 | # Licensed under the Apache License, Version 2.0 (the "License");
 4 | # you may not use this file except in compliance with the License.
 5 | # You may obtain a copy of the License at
 6 | #
 7 | #     http://www.apache.org/licenses/LICENSE-2.0
 8 | #
 9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | """Tool to update a variants table schema with field descriptions.
15 | """
16 | 
17 | import argparse
18 | 
19 | import schema_update_utils
20 | 
21 | 
22 | def _parse_arguments():
23 |   """Parses command line arguments.
24 | 
25 |   Returns:
26 |     A Namespace of parsed arguments.
27 |   """
28 |   parser = argparse.ArgumentParser(
29 |       formatter_class=argparse.ArgumentDefaultsHelpFormatter)
30 |   parser.add_argument(
31 |       '--source-vcf',
32 |       required=True,
33 |       help='Path to local or remote (Cloud Storage) VCF or gzipped VCF file.')
34 |   parser.add_argument(
35 |       '--destination-table',
36 |       required=True,
37 |       help='Full path to destination table '
38 |            '(PROJECT_ID.DATASET_NAME.TABLE_NAME)')
39 |   return parser.parse_args()
40 | 
41 | 
42 | def main():
43 |   args = _parse_arguments()
44 | 
45 |   schema_update_utils.update_table_schema(args.destination_table,
46 |                                           args.source_vcf)
47 | 
48 | 
49 | if __name__ == '__main__':
50 |   main()
51 | 


--------------------------------------------------------------------------------
/curation/tables/vcf_manifest.tsv:
--------------------------------------------------------------------------------
 1 | PROJECT	DATASET	VARIANTSET	TABLE	SOURCE_VCFS	ORIGINAL_SOURCE_VCFS
 2 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	dbSNP_hg38_20170710	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.dbSNP_hg38_20170710	gs://YOUR_BUCKET/mirror/ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/All_20170710.vcf	http://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/All_20170710.vcf.gz
 3 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	dbSNP_hg19_20170710	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.dbSNP_hg19_20170710	gs://YOUR_BUCKET/mirror/ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf	http://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz
 4 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	clinvar_hg38_20170705	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.clinvar_hg38_20170705	gs://YOUR_BUCKET/mirror/ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh38/archive/2017/clinvar_20170705.vcf.gz	http://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh38/archive/2017/clinvar_20170705.vcf.gz
 5 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	clinvar_hg19_20170705	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.clinvar_hg19_20170705	gs://YOUR_BUCKET/mirror/ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh37/archive/2017/clinvar_20170705.vcf.gz	http://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh37/archive/2017/clinvar_20170705.vcf.gz
 6 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	1000genomes_phase3_hg38_release89	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.1000genomes_phase3_hg38_release89	gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz	http://ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
 7 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	1000genomes_phase3_hg19_release89	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.1000genomes_phase3_hg19_release89	gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz	http://ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
 8 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	ESP6500_AA_hg38_release89	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.ESP6500_AA_hg38_release89	gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/ESP6500-African_American.vcf.gz	http://ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/ESP6500-African_American.vcf.gz
 9 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	ESP6500_AA_hg19_release89	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.ESP6500_AA_hg19_release89	gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/ESP6500-African_American.vcf.gz	http://ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/ESP6500-African_American.vcf.gz
10 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	ESP6500_EA_hg38_release89	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.ESP6500_EA_hg38_release89	gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/ESP6500-European_American.vcf.gz	http://ftp.ensembl.org/pub/release-89/variation/vcf/homo_sapiens/ESP6500-European_American.vcf.gz
11 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	ESP6500_EA_hg19_release89	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.ESP6500_EA_hg19_release89	gs://YOUR_BUCKET/mirror/ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/ESP6500-European_American.vcf.gz	http://ftp.ensembl.org/pub/grch37/release-89/variation/vcf/homo_sapiens/ESP6500-European_American.vcf.gz
12 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	ExAC_hg19_release1	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.ExAC_hg19_release1	gs://gnomad-public/legacy/exacv1_downloads/release1/ExAC.r1.sites.vep.vcf.gz	gs://gnomad-public/legacy/exacv1_downloads/release1/ExAC.r1.sites.vep.vcf.gz
13 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	gnomAD_genomes_hg19_release170228	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.gnomAD_genomes_hg19_release170228	gs://gnomad-public/release-170228/vcf/genomes/*.sites.?.vcf.gz gs://gnomad-public/release-170228/vcf/genomes/*.sites.??.vcf.gz	gs://gnomad-public/release-170228/vcf/genomes/*.sites.?.vcf.gz gs://gnomad-public/release-170228/vcf/genomes/*.sites.??.vcf.gz
14 | YOUR_PROJECT_ID	YOUR_VARIANTSET_NAME	gnomAD_exomes_hg19_release170228	YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET_NAME.gnomAD_exomes_hg19_release170228	gs://gnomad-public/release-170228/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz	gs://gnomad-public/release-170228/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz
15 | 


--------------------------------------------------------------------------------
/curation/tables/vcf_to_bigquery_utils.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2017 Verily Life Sciences Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | """Library to upload VCF files to Google Genomics and BigQuery.
 15 | """
 16 | 
 17 | import logging
 18 | import time
 19 | 
 20 | from apiclient import discovery
 21 | from oauth2client.client import GoogleCredentials
 22 | from retrying import retry
 23 | 
 24 | # Use tensorflow.gfile library, if available, to expand wildcards (optional).
 25 | try:
 26 |   from tensorflow import gfile
 27 | except ImportError:
 28 |   gfile = None
 29 | 
 30 | import schema_update_utils
 31 | 
 32 | class VcfUploader(object):
 33 |   """Class for managing a Google Genomics API connection and data transfers.
 34 | 
 35 |   Handles finding and creating variant sets and datasets and uploading and
 36 |   exporting variants stored in VCF.  The main entry point is
 37 |   upload_variants(...), but other intermediate pipeline steps may also be used.
 38 |   """
 39 | 
 40 |   def __init__(self, project, credentials=None):
 41 |     """Create VcfUploader class.
 42 | 
 43 |     Args:
 44 |       project: Cloud project to use for Genomics objects.
 45 |       credentials: Credentials object to use, get_application_default() if None.
 46 |     """
 47 |     if credentials is None:
 48 |       credentials = GoogleCredentials.get_application_default()
 49 |     self.project = project
 50 |     self.service = discovery.build("genomics", "v1", credentials=credentials)
 51 | 
 52 |   @staticmethod
 53 |   def find_id_or_name(name, candidates):
 54 |     """Find a value linked as "id" or "name" in a collection of dicts.
 55 | 
 56 |     Args:
 57 |       name: string to search for in "id" and "name" fields.
 58 |       candidates: collection of dicts that should have "id" and "name" keys.
 59 | 
 60 |     Returns:
 61 |       choice["id"] for the unique matching choice (matched by "name" or "id").
 62 |       Returns None if no matching choice is found.
 63 | 
 64 |     Raises:
 65 |       LookupError: If multiple items match the targeted name.
 66 |     """
 67 |     target_id = None
 68 | 
 69 |     for choice in candidates:
 70 |       if choice.get("id") == name or choice.get("name") == name:
 71 |         if target_id is not None:
 72 |           raise LookupError("Found multiple hits for requested name")
 73 |         target_id = choice["id"]
 74 | 
 75 |     return target_id
 76 | 
 77 |   def find_or_create_dataset(self,
 78 |                              dataset_name,
 79 |                              always_create=False):
 80 |     """Finds or creates a Google Genomics dataset by name or id.
 81 | 
 82 |     If an existing dataset in the project has a name or ID of dataset_name, it
 83 |     will be reused and its id will be returned, unless always_create is True.
 84 |     A new dataset will be created if an existing one is not found.
 85 | 
 86 |     Args:
 87 |       dataset_name: Name or id of existing dataset, or name for a new dataset.
 88 |       always_create: Always create a new dataset with the requested name.
 89 | 
 90 |     Returns:
 91 |       The id of the existing or newly-created Genomics dataset.
 92 |     """
 93 |     request = self.service.datasets().list(projectId=self.project)
 94 |     response = request.execute()
 95 | 
 96 |     dataset_id = self.find_id_or_name(dataset_name,
 97 |                                       response["datasets"])
 98 | 
 99 |     if dataset_id is None or always_create:
100 |       request = self.service.datasets().create(
101 |           body={"name": dataset_name,
102 |                 "projectId": self.project})
103 |       response = request.execute()
104 |       dataset_id = response["id"]
105 | 
106 |     return dataset_id
107 | 
108 |   def find_or_create_variantset(self,
109 |                                 variantset_name,
110 |                                 dataset_id,
111 |                                 description="",
112 |                                 always_create=False):
113 |     """Finds or creates a Google Genomics variant set by name or id.
114 | 
115 |     If an existing variant set in the project has a name or ID of
116 |     variantset_name, it will be reused and its id will be returned, unless
117 |     always_create is True.  A new variant set will be created if an existing
118 |     one is not found.
119 | 
120 |     Args:
121 |       variantset_name: Name or id of existing variant set, or name for a new
122 |           variant set.
123 |       dataset_id: Id of the dataset to find or create the variant set.
124 |       description: The description for the variant set.
125 |       always_create: Always create a new variant set with the requested name.
126 | 
127 |     Returns:
128 |       The id of the existing or newly-created Genomics variant set.
129 |     """
130 |     request = self.service.variantsets().search(
131 |         body={"datasetIds": dataset_id})
132 |     response = request.execute()
133 | 
134 |     variantset_id = self.find_id_or_name(variantset_name,
135 |                                          response["variantSets"])
136 | 
137 |     if variantset_id is None or always_create:
138 |       request = self.service.variantsets().create(
139 |           body={"name": variantset_name,
140 |                 "datasetId": dataset_id,
141 |                 "description": description,
142 |           })
143 |       response = request.execute()
144 |       variantset_id = response["id"]
145 |     return variantset_id
146 | 
147 |   def import_variants(self, source_uris, variantset_id):
148 |     """Imports variants stored in a VCF file on Cloud Storage to a variant set.
149 | 
150 |     Args:
151 |       source_uris: List of paths to VCF file[s] in Cloud Storage, wildcards
152 |           accepted (*, not **).
153 |       variantset_id: Id of the variant set to load the variants.
154 | 
155 |     Returns:
156 |       The name of the loading operation.
157 |     """
158 |     request = self.service.variants().import_(
159 |         body={"variantSetId": variantset_id,
160 |               "sourceUris": source_uris})
161 |     response = request.execute()
162 |     return response["name"]
163 | 
164 |   # Handle transient HTTP errors by retrying several times before giving up.
165 |   # Works around race conditions that arise when the operation ID is not
166 |   # found, which yields a 404 error.
167 |   @retry(stop_max_attempt_number=10, wait_exponential_multiplier=2000)
168 |   def wait_for_operation(self, operation_id, wait_seconds=30):
169 |     """Blocks until the Genomics operation completes.
170 | 
171 |     Args:
172 |       operation_id: The name (id string) of the loading operation.
173 |       wait_seconds: Number of seconds to wait between polling attempts.
174 | 
175 |     Returns:
176 |       True if the operation succeeded, False otherwise.
177 |     """
178 |     request = self.service.operations().get(name=operation_id)
179 |     while not request.execute()["done"]:
180 |       time.sleep(wait_seconds)
181 | 
182 |     # If the operation succeeded, there will be a "response" field and not an
183 |     # "error" field, see:
184 |     # https://cloud.google.com/genomics/reference/rest/Shared.Types/ListOperationsResponse#Operation
185 |     response = request.execute()
186 |     return "response" in response and "error" not in response
187 | 
188 |   def export_variants(self, variantset_id, destination_table):
189 |     """Exports variants from Google Genomics to BigQuery.
190 | 
191 |     Per the Genomics API, this will overwrite any existing BigQuery table with
192 |     this name.
193 | 
194 |     Args:
195 |       variantset_id: Id of the variant set to export.
196 |       destination_table: BigQuery output, as PROJECT_ID.DATASET_NAME.TABLE_NAME.
197 | 
198 |     Returns:
199 |       The name of the export operation.
200 |     """
201 |     tokenized_table = schema_update_utils.tokenize_table_name(destination_table)
202 |     bigquery_project_id, dataset_name, table_name = tokenized_table
203 | 
204 |     request = self.service.variantsets().export(
205 |         variantSetId=variantset_id,
206 |         body={"projectId": bigquery_project_id,
207 |               "bigqueryDataset": dataset_name,
208 |               "bigqueryTable": table_name})
209 |     response = request.execute()
210 |     return response["name"]
211 | 
212 |   def upload_variants(self,
213 |                       dataset,
214 |                       variantset,
215 |                       source_vcfs,
216 |                       destination_table,
217 |                       expand_wildcards=False,
218 |                       new_dataset=False,
219 |                       new_variantset=False,
220 |                       description=None):
221 |     """Imports variants stored in a VCF in Cloud Storage to BigQuery.
222 | 
223 |     Handle all intermediate steps, including finding dataset and variant sets.
224 | 
225 |     Args:
226 |       dataset: Name or id of existing dataset, or name for a new dataset.
227 |       variantset: Name or id of existing variant set, or name for a new one.
228 |       source_vcfs: List of VCF file[s] in Cloud Storage, wildcards accepted
229 |           (*, not **).
230 |       destination_table: BigQuery output, as PROJECT_ID.DATASET_NAME.TABLE_NAME.
231 |       expand_wildcards: Expand wildcards in VCF paths and use parallel imports.
232 |       new_dataset: Always create a new dataset with the requested name.
233 |       new_variantset: Always create a new variant set with the requested name.
234 |       description: Optional description for the BigQuery table.
235 | 
236 |     Raises:
237 |       RuntimeError: If an upload or export request does not succeed.
238 |     """
239 | 
240 |     dataset_id = self.find_or_create_dataset(dataset,
241 |                                              always_create=new_dataset)
242 | 
243 |     variantset_id = self.find_or_create_variantset(
244 |         variantset,
245 |         dataset_id,
246 |         description="\t".join(source_vcfs),
247 |         always_create=new_variantset)
248 | 
249 |     # Spawn off parallel imports for each VCF.
250 |     if expand_wildcards and gfile is not None:
251 |       # Expand any wildcarded paths and concatenate all files together.
252 |       source_vcfs = sum([gfile.Glob(source_vcf) for source_vcf in source_vcfs],
253 |                         [])
254 | 
255 |     operation_ids = []
256 |     for source_vcf in source_vcfs:
257 |       operation_ids.append(self.import_variants(source_vcf, variantset_id))
258 |       logging.info("Importing %s (%s)", source_vcf, operation_ids[-1])
259 | 
260 |     # Wait for all imports to complete successfully before exporting variantset.
261 |     for operation_id in operation_ids:
262 |       if not self.wait_for_operation(operation_id):
263 |         raise RuntimeError("Failed to import variants to Genomics (%s)"
264 |                            % operation_id)
265 | 
266 |     operation_id = self.export_variants(variantset_id, destination_table)
267 |     logging.info("Exporting %s (%s)", variantset, operation_id)
268 | 
269 |     if not self.wait_for_operation(operation_id):
270 |       raise RuntimeError("Failed to export variants to BigQuery (%s)"
271 |                          % operation_id)
272 | 
273 |     # Assume the VCF header is the same for all files and so just use the first.
274 |     logging.info("Updating schema for %s", variantset)
275 |     schema_update_utils.update_table_schema(destination_table,
276 |                                             source_vcfs[0],
277 |                                             description=description)
278 | 


--------------------------------------------------------------------------------
/interactive/InteractiveVariantAnnotation.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Interactive Variant Annotation\n",
  8 |     "\n",
  9 |     "The following query retrieves variants from [DeepVariant-called Platinum Genomes](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/platinum_genomes_deepvariant.html) and interactively JOINs them with [ClinVar](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/clinvar_annotations.html).  \n",
 10 |     "\n",
 11 |     "To run this on your own table of variants, change the table name and call_set_name in the `sample_variants` sub query below.\n",
 12 |     "\n",
 13 |     "For an ongoing investigation, you may wish to repeat this query each time a new version of ClinVar is released and [loaded into BigQuery](https://github.com/verilylifesciences/variant-annotation/tree/master/curation/tables/README.md) by changing the table name in the `rare_pathenogenic_variants` sub query.\n",
 14 |     "\n",
 15 |     "See also similar examples for GRCh37 in https://github.com/googlegenomics/bigquery-examples/tree/master/platinumGenomes "
 16 |    ]
 17 |   },
 18 |   {
 19 |    "cell_type": "code",
 20 |    "execution_count": 1,
 21 |    "metadata": {
 22 |     "collapsed": false
 23 |    },
 24 |    "outputs": [
 25 |     {
 26 |      "data": {
 27 |       "text/html": [
 28 |        "\n",
 29 |        "    <div class=\"bqtv\" id=\"1_150360998944\"><table><tr><th>chr</th><th>start</th><th>reference_bases</th><th>alt</th><th>call_set_name</th><th>CLNHGVS</th><th>CLNALLE</th><th>CLNSRC</th><th>CLNORIGIN</th><th>CLNSRCID</th><th>CLNSIG</th><th>CLNDSDB</th><th>CLNDSDBID</th><th>CLNDBN</th><th>CLNREVSTAT</th><th>CLNACC</th></tr><tr><td>1</td><td>94047008</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000001.11:g.94047009C>T']</td><td>[2]</td><td>['HGMD|OMIM_Allelic_Variant|UniProtKB_(protein)']</td><td>[1]</td><td>['CM024629|601691.0035|P78363#VAR_008428']</td><td>['255|5|1|2|3|3|3|3']</td><td>['MedGen:OMIM|MedGen|MedGen|Human_Phenotype_Ontology:MedGen|MedGen|MedGen|MedGen']</td><td>['C1855465:248200|CN221809|CN169374|HP:0000608:C0024437|CN239309|CN239466|CN239312']</td><td>['MACULAR_DEGENERATION\\\\x2c_AGE-RELATED\\\\x2c_2\\\\x2c_SUSCEPTIBILITY_TO|Stargardt_disease_1|not_provided|not_specified|Macular_degeneration|Cone-Rod_Dystrophy\\\\x2c_Recessive|Retinitis_Pigmentosa\\\\x2c_Recessive|Stargardt_Disease\\\\x2c_Recessive']</td><td>['no_criteria|no_criteria|no_assertion|mult|single|single|single|single']</td><td>['RCV000008374.4|RCV000008375.4|RCV000085512.3|RCV000152706.4|RCV000294335.1|RCV000349295.1|RCV000392936.1|RCV000399411.1']</td></tr><tr><td>1</td><td>201361939</td><td>A</td><td>G</td><td>NA12878_ERR194147</td><td>['NC_000001.11:g.201361940A>G']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255|0|0|0|0|0|0|0']</td><td>['MedGen|MedGen:OMIM|MedGen:OMIM|MedGen:OMIM|Human_Phenotype_Ontology:MedGen|Human_Phenotype_Ontology:MedGen:Orphanet|MedGen|MedGen:Orphanet:SNOMED_CT']</td><td>['CN169374|C1861864:115195|C2676271:612422|C1832243:601494|HP:0011664:C4021133|HP:0001639:C0007194:ORPHA217569|CN239310|C0340429:ORPHA217635:233878008']</td><td>['not_specified|Familial_hypertrophic_cardiomyopathy_2|Familial_restrictive_cardiomyopathy_3|Left_ventricular_noncompaction_6|Left_ventricular_noncompaction_cardiomyopathy|Hypertrophic_cardiomyopathy|Dilated_Cardiomyopathy\\\\x2c_Dominant|Familial_restrictive_cardiomyopathy']</td><td>['conf|single|single|single|single|single|single|single']</td><td>['RCV000168973.2|RCV000230425.2|RCV000230425.2|RCV000230425.2|RCV000283636.1|RCV000323526.1|RCV000338870.1|RCV000378147.1']</td></tr><tr><td>1</td><td>212897348</td><td>T</td><td>TACAC</td><td>NA12878_ERR194147</td><td>['NC_000001.11:g.212897351_212897370dup20', 'NC_000001.11:g.212897365_212897370dupCACACA', 'NC_000001.11:g.212897367_212897370dupCACA', 'NC_000001.11:g.212897369_212897370dupCA']</td><td>[4, -1, -1, -1]</td><td>['.', '.', '.', '.']</td><td>[1, 1, 1, 1]</td><td>['.', '.', '.', '.']</td><td>['0', '3', '255', '0']</td><td>['MedGen:OMIM:Orphanet', 'MedGen:OMIM:Orphanet', 'MedGen:OMIM:Orphanet', 'MedGen:OMIM:Orphanet']</td><td>['C1836916:609033:ORPHA88628', 'C1836916:609033:ORPHA88628', 'C1836916:609033:ORPHA88628', 'C1836916:609033:ORPHA88628']</td><td>['Posterior_column_ataxia_with_retinitis_pigmentosa', 'Posterior_column_ataxia_with_retinitis_pigmentosa', 'Posterior_column_ataxia_with_retinitis_pigmentosa', 'Posterior_column_ataxia_with_retinitis_pigmentosa']</td><td>['single', 'single', 'conf', 'single']</td><td>['RCV000355025.1', 'RCV000297866.1', 'RCV000262602.1', 'RCV000351203.1']</td></tr><tr><td>1</td><td>215671030</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000001.11:g.215671031C>T']</td><td>[1]</td><td>['UniProtKB_(protein)']</td><td>[1]</td><td>['O75445#VAR_061351']</td><td>['255']</td><td>['MedGen']</td><td>['CN169374']</td><td>['not_specified']</td><td>['conf']</td><td>['RCV000041750.4']</td></tr><tr><td>1</td><td>237589773</td><td>AT</td><td>A</td><td>NA12878_ERR194147</td><td>['NC_000001.11:g.237589784delT']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['2|255']</td><td>['MedGen:Orphanet:SNOMED_CT|MedGen']</td><td>['C0878544:ORPHA167848:85898001|CN169374']</td><td>['Cardiomyopathy|not_specified']</td><td>['no_criteria|conf']</td><td>['RCV000030420.1|RCV000036734.8']</td></tr><tr><td>10</td><td>26088401</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000010.11:g.26088402C>T']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255|0']</td><td>['MedGen|MedGen']</td><td>['CN169374|CN239439']</td><td>['not_specified|Nonsyndromic_Hearing_Loss\\\\x2c_Recessive']</td><td>['conf|single']</td><td>['RCV000039026.3|RCV000381484.1']</td></tr><tr><td>11</td><td>6392135</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000011.10:g.6392136C>T']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255|0']</td><td>['MedGen|MedGen:SNOMED_CT']</td><td>['CN169374|C0028064:58459009']</td><td>['not_specified|Sphingomyelin/cholesterol_lipidosis']</td><td>['conf|single']</td><td>['RCV000079188.5|RCV000394529.1']</td></tr><tr><td>11</td><td>6617153</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000011.10:g.6617154C>A', 'NC_000011.10:g.6617154C>G', 'NC_000011.10:g.6617154C>T']</td><td>[1, 2, 3]</td><td>['.', 'OMIM_Allelic_Variant', '.']</td><td>[1, 1, 1]</td><td>['.', '607998.0004', '.']</td><td>['5', '5|5|5|5|5', '5']</td><td>['MedGen', 'MedGen:OMIM:Orphanet|MedGen:OMIM:Orphanet|MedGen|MeSH:MedGen|MedGen:OMIM:Orphanet:SNOMED_CT', 'MedGen']</td><td>['CN221809', 'C1876161:204500:ORPHA228349|C1836474:609270:ORPHA284324|CN221809|D030342:C0950123|C0027877:214200:ORPHA216:42012007', 'CN221809']</td><td>['not_provided', 'Ceroid_lipofuscinosis_neuronal_2|Childhood-onset_autosomal_recessive_slowly_progressive_spinocerebellar_ataxia|not_provided|Inborn_genetic_diseases|Neuronal_ceroid_lipofuscinosis', 'not_provided']</td><td>['single', 'mult|single|mult|single|single', 'single']</td><td>['RCV000391641.1', 'RCV000002763.11|RCV000074608.7|RCV000189765.4|RCV000210689.1|RCV000228119.2', 'RCV000189764.3']</td></tr><tr><td>11</td><td>47448802</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000011.10:g.47448803C>T']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255']</td><td>['MedGen']</td><td>['CN169374']</td><td>['not_specified']</td><td>['conf']</td><td>['RCV000246056.2']</td></tr><tr><td>11</td><td>66510682</td><td>T</td><td>C</td><td>NA12878_ERR194147</td><td>['NC_000011.10:g.66510683T>C']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255|2']</td><td>['MedGen|MedGen:OMIM:Orphanet:SNOMED_CT']</td><td>['CN169374|C0752166:209900:ORPHA110:5619004']</td><td>['not_specified|Bardet-Biedl_syndrome']</td><td>['conf|single']</td><td>['RCV000173529.2|RCV000226235.1']</td></tr><tr><td>12</td><td>102840473</td><td>T</td><td>C</td><td>NA12878_ERR194147</td><td>['NC_000012.12:g.102840474T>C']</td><td>[1]</td><td>['HGMD|OMIM_Allelic_Variant|UniProtKB_(protein)']</td><td>[1]</td><td>['CM910294|612349.0017|P00439#VAR_001038']</td><td>['5|5|5']</td><td>['MedGen|MedGen|MedGen:OMIM:Orphanet:SNOMED_CT']</td><td>['C0751435|CN221809|C0031485:261600:ORPHA716:154735006']</td><td>['Hyperphenylalaninemia\\\\x2c_non-pku|not_provided|Phenylketonuria']</td><td>['no_criteria|single|mult']</td><td>['RCV000000624.4|RCV000078508.6|RCV000150074.4']</td></tr><tr><td>14</td><td>23389061</td><td>AG</td><td>A</td><td>NA12878_ERR194147</td><td>['NC_000014.9:g.23389063delG']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['3|255|2|0|0|0']</td><td>['MedGen:Orphanet:SNOMED_CT|MedGen|MedGen:OMIM|MedGen:Orphanet|MedGen|Human_Phenotype_Ontology:MedGen:Orphanet']</td><td>['C0878544:ORPHA167848:85898001|CN169374|C2750467:613251|C0018817:ORPHA1478|CN239310|HP:0001639:C0007194:ORPHA217569']</td><td>['Cardiomyopathy|not_specified|Familial_hypertrophic_cardiomyopathy_14|Atrial_septal_defect|Dilated_Cardiomyopathy\\\\x2c_Dominant|Hypertrophic_cardiomyopathy']</td><td>['single|conf|single|single|single|single']</td><td>['RCV000030306.1|RCV000154759.3|RCV000205051.2|RCV000299696.1|RCV000354591.1|RCV000396023.1']</td></tr><tr><td>14</td><td>64210032</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000014.9:g.64210033C>T']</td><td>[2]</td><td>['OMIM_Allelic_Variant|UniProtKB_(protein)']</td><td>[1]</td><td>['608442.0001|Q8WXH0#VAR_062977']</td><td>['5|255|3']</td><td>['MedGen:OMIM|MedGen|MedGen:Orphanet:SNOMED_CT']</td><td>['C2751805:612999|CN169374|C0410189:ORPHA261:111508004']</td><td>['Emery-Dreifuss_muscular_dystrophy_5\\\\x2c_autosomal_dominant|not_specified|Emery-Dreifuss_muscular_dystrophy']</td><td>['no_criteria|conf|single']</td><td>['RCV000002414.4|RCV000173937.3|RCV000403391.1']</td></tr><tr><td>15</td><td>65078011</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000015.10:g.65078012C>T']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255|3']</td><td>['MedGen|MedGen']</td><td>['CN169374|CN239448']</td><td>['not_specified|Nemaline_Myopathy\\\\x2c_Dominant']</td><td>['conf|single']</td><td>['RCV000117307.4|RCV000304321.1']</td></tr><tr><td>15</td><td>89645160</td><td>A</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000015.10:g.89645161A>T']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255|3']</td><td>['MedGen|Gene:MedGen:OMIM:Orphanet']</td><td>['CN169374|46:C0796147:200990:ORPHA36']</td><td>['not_specified|Acrocallosal_syndrome\\\\x2c_Schinzel_type']</td><td>['conf|single']</td><td>['RCV000117416.4|RCV000261344.1']</td></tr><tr><td>16</td><td>2497032</td><td>C</td><td>G</td><td>NA12878_ERR194147</td><td>['NC_000016.10:g.2497033C>G']</td><td>[1]</td><td>['UniProtKB_(protein)']</td><td>[1]</td><td>['Q9ULP9#VAR_070890']</td><td>['255|3|3|3']</td><td>['MedGen|MedGen|MedGen:OMIM|MedGen:OMIM']</td><td>['CN169374|C3809181|C3892048:616044|C3463992:308350']</td><td>['not_specified|Caused_by_mutation_in_the_TBC1_domain_family\\\\x2c_member_24|Deafness\\\\x2c_autosomal_dominant_65|Epileptic_encephalopathy\\\\x2c_early_infantile\\\\x2c_1']</td><td>['conf|single|single|single']</td><td>['RCV000128367.6|RCV000477643.1|RCV000477643.1|RCV000477643.1']</td></tr><tr><td>16</td><td>56514588</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000016.10:g.56514589C\\\\x3d', 'NC_000016.10:g.56514589C>T']</td><td>[0, 1]</td><td>['OMIM_Allelic_Variant', '.']</td><td>[1, 1]</td><td>['606151.0013', '.']</td><td>['5', '2']</td><td>['MedGen', 'MedGen']</td><td>['C4016908', 'CN169374']</td><td>['Bardet-biedl_syndrome_2/6\\\\x2c_digenic', 'not_specified']</td><td>['no_criteria', 'single']</td><td>['RCV000004838.4', 'RCV000301991.1']</td></tr><tr><td>16</td><td>88805736</td><td>T</td><td>C</td><td>NA12878_ERR194147</td><td>['NC_000016.10:g.88805737T>C']</td><td>[1]</td><td>['UniProtKB_(protein)']</td><td>[1]</td><td>['Q9H211#VAR_054504']</td><td>['255']</td><td>['MedGen']</td><td>['CN169374']</td><td>['not_specified']</td><td>['conf']</td><td>['RCV000116652.3']</td></tr><tr><td>17</td><td>17796722</td><td>AAGG</td><td>A</td><td>NA12878_ERR194147</td><td>['NC_000017.11:g.17796729_17796731delGAG']</td><td>[1]</td><td>['HGMD']</td><td>[1]</td><td>['CD116392']</td><td>['255|0']</td><td>['MedGen|MedGen']</td><td>['CN169374|CN221809']</td><td>['not_specified|not_provided']</td><td>['conf|single']</td><td>['RCV000082265.6|RCV000118114.3']</td></tr><tr><td>18</td><td>2700878</td><td>A</td><td>G</td><td>NA12878_ERR194147</td><td>['NC_000018.10:g.2700879A>G']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255']</td><td>['MedGen']</td><td>['CN169374']</td><td>['not_specified']</td><td>['conf']</td><td>['RCV000247697.2']</td></tr><tr><td>19</td><td>11132530</td><td>C</td><td>CA</td><td>NA12878_ERR194147</td><td>['NC_000019.10:g.11125285_11132532dup7248', 'NC_000019.10:g.11132532dupA']</td><td>[-1, 1]</td><td>['LDLR_@_LOVD', '.']</td><td>[5, 1]</td><td>['LDLR_000294', '.']</td><td>['255', '3']</td><td>['MedGen:OMIM:SNOMED_CT:SNOMED_CT', 'MedGen:OMIM:SNOMED_CT:SNOMED_CT']</td><td>['C0020445:143890:397915002:398036000', 'C0020445:143890:397915002:398036000']</td><td>['Familial_hypercholesterolemia', 'Familial_hypercholesterolemia']</td><td>['conf', 'single']</td><td>['RCV000237281.1', 'RCV000326993.1']</td></tr><tr><td>19</td><td>41414124</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000019.10:g.41414125C>T']</td><td>[1]</td><td>['HGMD|UniProtKB_(protein)']</td><td>[1]</td><td>['CM021497|P12694#VAR_034361']</td><td>['255|3']</td><td>['MedGen|MedGen:OMIM:Orphanet:SNOMED_CT']</td><td>['CN169374|C0024776:248600:ORPHA268184:27718001']</td><td>['not_specified|Maple_syrup_urine_disease']</td><td>['conf|single']</td><td>['RCV000079243.6|RCV000295914.1']</td></tr><tr><td>19</td><td>45179661</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000019.10:g.45179662C>T']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255|0']</td><td>['MedGen|MedGen:SNOMED_CT']</td><td>['CN169374|C0079504:9311003']</td><td>['not_specified|Hermansky-Pudlak_syndrome']</td><td>['conf|single']</td><td>['RCV000150192.3|RCV000320201.1']</td></tr><tr><td>19</td><td>57231145</td><td>G</td><td>GC</td><td>NA12878_ERR194147</td><td>['NC_000019.10:g.57231150dupC']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255']</td><td>['MedGen']</td><td>['CN239485']</td><td>['Spermatogenic_Failure']</td><td>['conf']</td><td>['RCV000311416.1']</td></tr><tr><td>2</td><td>178532038</td><td>C</td><td>T</td><td>NA12878_ERR194147</td><td>['NC_000002.12:g.178532039C>T']</td><td>[1]</td><td>['.']</td><td>[1]</td><td>['.']</td><td>['255|2|2|2|3|3|3|3|3|3']</td><td>['MedGen|MedGen:OMIM|MedGen:OMIM:Orphanet|MedGen|MedGen:OMIM:Orphanet|MedGen|Human_Phenotype_Ontology:MedGen:Orphanet|MedGen|MedGen:OMIM:Orphanet|MedGen:OMIM:Orphanet']</td><td>['CN169374|C1858763:604145|C1837342:608807:ORPHA140922|CN230736|C2673677:611705:ORPHA289377|CN239352|HP:0001639:C0007194:ORPHA217569|CN239310|C1863599:603689:ORPHA178464|C1838244:600334:ORPHA609']</td><td>['not_specified|Dilated_cardiomyopathy_1G|Limb-girdle_muscular_dystrophy\\\\x2c_type_2J|Cardiovascular_phenotype|Myopathy\\\\x2c_early-onset\\\\x2c_with_fatal_cardiomyopathy|Limb-Girdle_Muscular_Dystrophy\\\\x2c_Recessive|Hypertrophic_cardiomyopathy|Dilated_Cardiomyopathy\\\\x2c_Dominant|Hereditary_myopathy_with_early_respiratory_failure|Distal_myopathy_Markesbery-Griggs_type']</td><td>['conf|single|single|single|single|single|single|single|single|single']</td><td>['RCV000040944.8|RCV000231098.2|RCV000231098.2|RCV000244925.1|RCV000296735.1|RCV000311858.1|RCV000336579.1|RCV000351463.1|RCV000402951.1|RCV000403378.1']</td></tr></table></div>\n",
 30 |        "    <br />(rows: 63, time: 5.0s,    10GB processed, job: job_P6NRU_M3B1MeX_TpuZdGC9QTWZwp)<br />\n",
 31 |        "    <script>\n",
 32 |        "\n",
 33 |        "      require.config({\n",
 34 |        "        paths: {\n",
 35 |        "          d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.13/d3',\n",
 36 |        "          plotly: 'https://cdn.plot.ly/plotly-1.5.1.min.js?noext',\n",
 37 |        "          jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min'\n",
 38 |        "        },\n",
 39 |        "        map: {\n",
 40 |        "          '*': {\n",
 41 |        "            datalab: 'nbextensions/gcpdatalab'\n",
 42 |        "          }\n",
 43 |        "        },\n",
 44 |        "        shim: {\n",
 45 |        "          plotly: {\n",
 46 |        "            deps: ['d3', 'jquery'],\n",
 47 |        "            exports: 'plotly'\n",
 48 |        "          }\n",
 49 |        "        }\n",
 50 |        "      });\n",
 51 |        "\n",
 52 |        "      require(['datalab/charting', 'datalab/element!1_150360998944', 'base/js/events',\n",
 53 |        "          'datalab/style!/nbextensions/gcpdatalab/charting.css'],\n",
 54 |        "        function(charts, dom, events) {\n",
 55 |        "          charts.render('gcharts', dom, events, 'paged_table', [], {\"rows\": [{\"c\": [{\"v\": \"1\"}, {\"v\": 94047008}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000001.11:g.94047009C>T\"]}, {\"v\": [2]}, {\"v\": [\"HGMD|OMIM_Allelic_Variant|UniProtKB_(protein)\"]}, {\"v\": [1]}, {\"v\": [\"CM024629|601691.0035|P78363#VAR_008428\"]}, {\"v\": [\"255|5|1|2|3|3|3|3\"]}, {\"v\": [\"MedGen:OMIM|MedGen|MedGen|Human_Phenotype_Ontology:MedGen|MedGen|MedGen|MedGen\"]}, {\"v\": [\"C1855465:248200|CN221809|CN169374|HP:0000608:C0024437|CN239309|CN239466|CN239312\"]}, {\"v\": [\"MACULAR_DEGENERATION\\\\x2c_AGE-RELATED\\\\x2c_2\\\\x2c_SUSCEPTIBILITY_TO|Stargardt_disease_1|not_provided|not_specified|Macular_degeneration|Cone-Rod_Dystrophy\\\\x2c_Recessive|Retinitis_Pigmentosa\\\\x2c_Recessive|Stargardt_Disease\\\\x2c_Recessive\"]}, {\"v\": [\"no_criteria|no_criteria|no_assertion|mult|single|single|single|single\"]}, {\"v\": [\"RCV000008374.4|RCV000008375.4|RCV000085512.3|RCV000152706.4|RCV000294335.1|RCV000349295.1|RCV000392936.1|RCV000399411.1\"]}]}, {\"c\": [{\"v\": \"1\"}, {\"v\": 201361939}, {\"v\": \"A\"}, {\"v\": \"G\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000001.11:g.201361940A>G\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255|0|0|0|0|0|0|0\"]}, {\"v\": [\"MedGen|MedGen:OMIM|MedGen:OMIM|MedGen:OMIM|Human_Phenotype_Ontology:MedGen|Human_Phenotype_Ontology:MedGen:Orphanet|MedGen|MedGen:Orphanet:SNOMED_CT\"]}, {\"v\": [\"CN169374|C1861864:115195|C2676271:612422|C1832243:601494|HP:0011664:C4021133|HP:0001639:C0007194:ORPHA217569|CN239310|C0340429:ORPHA217635:233878008\"]}, {\"v\": [\"not_specified|Familial_hypertrophic_cardiomyopathy_2|Familial_restrictive_cardiomyopathy_3|Left_ventricular_noncompaction_6|Left_ventricular_noncompaction_cardiomyopathy|Hypertrophic_cardiomyopathy|Dilated_Cardiomyopathy\\\\x2c_Dominant|Familial_restrictive_cardiomyopathy\"]}, {\"v\": [\"conf|single|single|single|single|single|single|single\"]}, {\"v\": [\"RCV000168973.2|RCV000230425.2|RCV000230425.2|RCV000230425.2|RCV000283636.1|RCV000323526.1|RCV000338870.1|RCV000378147.1\"]}]}, {\"c\": [{\"v\": \"1\"}, {\"v\": 212897348}, {\"v\": \"T\"}, {\"v\": \"TACAC\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000001.11:g.212897351_212897370dup20\", \"NC_000001.11:g.212897365_212897370dupCACACA\", \"NC_000001.11:g.212897367_212897370dupCACA\", \"NC_000001.11:g.212897369_212897370dupCA\"]}, {\"v\": [4, -1, -1, -1]}, {\"v\": [\".\", \".\", \".\", \".\"]}, {\"v\": [1, 1, 1, 1]}, {\"v\": [\".\", \".\", \".\", \".\"]}, {\"v\": [\"0\", \"3\", \"255\", \"0\"]}, {\"v\": [\"MedGen:OMIM:Orphanet\", \"MedGen:OMIM:Orphanet\", \"MedGen:OMIM:Orphanet\", \"MedGen:OMIM:Orphanet\"]}, {\"v\": [\"C1836916:609033:ORPHA88628\", \"C1836916:609033:ORPHA88628\", \"C1836916:609033:ORPHA88628\", \"C1836916:609033:ORPHA88628\"]}, {\"v\": [\"Posterior_column_ataxia_with_retinitis_pigmentosa\", \"Posterior_column_ataxia_with_retinitis_pigmentosa\", \"Posterior_column_ataxia_with_retinitis_pigmentosa\", \"Posterior_column_ataxia_with_retinitis_pigmentosa\"]}, {\"v\": [\"single\", \"single\", \"conf\", \"single\"]}, {\"v\": [\"RCV000355025.1\", \"RCV000297866.1\", \"RCV000262602.1\", \"RCV000351203.1\"]}]}, {\"c\": [{\"v\": \"1\"}, {\"v\": 215671030}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000001.11:g.215671031C>T\"]}, {\"v\": [1]}, {\"v\": [\"UniProtKB_(protein)\"]}, {\"v\": [1]}, {\"v\": [\"O75445#VAR_061351\"]}, {\"v\": [\"255\"]}, {\"v\": [\"MedGen\"]}, {\"v\": [\"CN169374\"]}, {\"v\": [\"not_specified\"]}, {\"v\": [\"conf\"]}, {\"v\": [\"RCV000041750.4\"]}]}, {\"c\": [{\"v\": \"1\"}, {\"v\": 237589773}, {\"v\": \"AT\"}, {\"v\": \"A\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000001.11:g.237589784delT\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"2|255\"]}, {\"v\": [\"MedGen:Orphanet:SNOMED_CT|MedGen\"]}, {\"v\": [\"C0878544:ORPHA167848:85898001|CN169374\"]}, {\"v\": [\"Cardiomyopathy|not_specified\"]}, {\"v\": [\"no_criteria|conf\"]}, {\"v\": [\"RCV000030420.1|RCV000036734.8\"]}]}, {\"c\": [{\"v\": \"10\"}, {\"v\": 26088401}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000010.11:g.26088402C>T\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255|0\"]}, {\"v\": [\"MedGen|MedGen\"]}, {\"v\": [\"CN169374|CN239439\"]}, {\"v\": [\"not_specified|Nonsyndromic_Hearing_Loss\\\\x2c_Recessive\"]}, {\"v\": [\"conf|single\"]}, {\"v\": [\"RCV000039026.3|RCV000381484.1\"]}]}, {\"c\": [{\"v\": \"11\"}, {\"v\": 6392135}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000011.10:g.6392136C>T\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255|0\"]}, {\"v\": [\"MedGen|MedGen:SNOMED_CT\"]}, {\"v\": [\"CN169374|C0028064:58459009\"]}, {\"v\": [\"not_specified|Sphingomyelin/cholesterol_lipidosis\"]}, {\"v\": [\"conf|single\"]}, {\"v\": [\"RCV000079188.5|RCV000394529.1\"]}]}, {\"c\": [{\"v\": \"11\"}, {\"v\": 6617153}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000011.10:g.6617154C>A\", \"NC_000011.10:g.6617154C>G\", \"NC_000011.10:g.6617154C>T\"]}, {\"v\": [1, 2, 3]}, {\"v\": [\".\", \"OMIM_Allelic_Variant\", \".\"]}, {\"v\": [1, 1, 1]}, {\"v\": [\".\", \"607998.0004\", \".\"]}, {\"v\": [\"5\", \"5|5|5|5|5\", \"5\"]}, {\"v\": [\"MedGen\", \"MedGen:OMIM:Orphanet|MedGen:OMIM:Orphanet|MedGen|MeSH:MedGen|MedGen:OMIM:Orphanet:SNOMED_CT\", \"MedGen\"]}, {\"v\": [\"CN221809\", \"C1876161:204500:ORPHA228349|C1836474:609270:ORPHA284324|CN221809|D030342:C0950123|C0027877:214200:ORPHA216:42012007\", \"CN221809\"]}, {\"v\": [\"not_provided\", \"Ceroid_lipofuscinosis_neuronal_2|Childhood-onset_autosomal_recessive_slowly_progressive_spinocerebellar_ataxia|not_provided|Inborn_genetic_diseases|Neuronal_ceroid_lipofuscinosis\", \"not_provided\"]}, {\"v\": [\"single\", \"mult|single|mult|single|single\", \"single\"]}, {\"v\": [\"RCV000391641.1\", \"RCV000002763.11|RCV000074608.7|RCV000189765.4|RCV000210689.1|RCV000228119.2\", \"RCV000189764.3\"]}]}, {\"c\": [{\"v\": \"11\"}, {\"v\": 47448802}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000011.10:g.47448803C>T\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255\"]}, {\"v\": [\"MedGen\"]}, {\"v\": [\"CN169374\"]}, {\"v\": [\"not_specified\"]}, {\"v\": [\"conf\"]}, {\"v\": [\"RCV000246056.2\"]}]}, {\"c\": [{\"v\": \"11\"}, {\"v\": 66510682}, {\"v\": \"T\"}, {\"v\": \"C\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000011.10:g.66510683T>C\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255|2\"]}, {\"v\": [\"MedGen|MedGen:OMIM:Orphanet:SNOMED_CT\"]}, {\"v\": [\"CN169374|C0752166:209900:ORPHA110:5619004\"]}, {\"v\": [\"not_specified|Bardet-Biedl_syndrome\"]}, {\"v\": [\"conf|single\"]}, {\"v\": [\"RCV000173529.2|RCV000226235.1\"]}]}, {\"c\": [{\"v\": \"12\"}, {\"v\": 102840473}, {\"v\": \"T\"}, {\"v\": \"C\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000012.12:g.102840474T>C\"]}, {\"v\": [1]}, {\"v\": [\"HGMD|OMIM_Allelic_Variant|UniProtKB_(protein)\"]}, {\"v\": [1]}, {\"v\": [\"CM910294|612349.0017|P00439#VAR_001038\"]}, {\"v\": [\"5|5|5\"]}, {\"v\": [\"MedGen|MedGen|MedGen:OMIM:Orphanet:SNOMED_CT\"]}, {\"v\": [\"C0751435|CN221809|C0031485:261600:ORPHA716:154735006\"]}, {\"v\": [\"Hyperphenylalaninemia\\\\x2c_non-pku|not_provided|Phenylketonuria\"]}, {\"v\": [\"no_criteria|single|mult\"]}, {\"v\": [\"RCV000000624.4|RCV000078508.6|RCV000150074.4\"]}]}, {\"c\": [{\"v\": \"14\"}, {\"v\": 23389061}, {\"v\": \"AG\"}, {\"v\": \"A\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000014.9:g.23389063delG\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"3|255|2|0|0|0\"]}, {\"v\": [\"MedGen:Orphanet:SNOMED_CT|MedGen|MedGen:OMIM|MedGen:Orphanet|MedGen|Human_Phenotype_Ontology:MedGen:Orphanet\"]}, {\"v\": [\"C0878544:ORPHA167848:85898001|CN169374|C2750467:613251|C0018817:ORPHA1478|CN239310|HP:0001639:C0007194:ORPHA217569\"]}, {\"v\": [\"Cardiomyopathy|not_specified|Familial_hypertrophic_cardiomyopathy_14|Atrial_septal_defect|Dilated_Cardiomyopathy\\\\x2c_Dominant|Hypertrophic_cardiomyopathy\"]}, {\"v\": [\"single|conf|single|single|single|single\"]}, {\"v\": [\"RCV000030306.1|RCV000154759.3|RCV000205051.2|RCV000299696.1|RCV000354591.1|RCV000396023.1\"]}]}, {\"c\": [{\"v\": \"14\"}, {\"v\": 64210032}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000014.9:g.64210033C>T\"]}, {\"v\": [2]}, {\"v\": [\"OMIM_Allelic_Variant|UniProtKB_(protein)\"]}, {\"v\": [1]}, {\"v\": [\"608442.0001|Q8WXH0#VAR_062977\"]}, {\"v\": [\"5|255|3\"]}, {\"v\": [\"MedGen:OMIM|MedGen|MedGen:Orphanet:SNOMED_CT\"]}, {\"v\": [\"C2751805:612999|CN169374|C0410189:ORPHA261:111508004\"]}, {\"v\": [\"Emery-Dreifuss_muscular_dystrophy_5\\\\x2c_autosomal_dominant|not_specified|Emery-Dreifuss_muscular_dystrophy\"]}, {\"v\": [\"no_criteria|conf|single\"]}, {\"v\": [\"RCV000002414.4|RCV000173937.3|RCV000403391.1\"]}]}, {\"c\": [{\"v\": \"15\"}, {\"v\": 65078011}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000015.10:g.65078012C>T\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255|3\"]}, {\"v\": [\"MedGen|MedGen\"]}, {\"v\": [\"CN169374|CN239448\"]}, {\"v\": [\"not_specified|Nemaline_Myopathy\\\\x2c_Dominant\"]}, {\"v\": [\"conf|single\"]}, {\"v\": [\"RCV000117307.4|RCV000304321.1\"]}]}, {\"c\": [{\"v\": \"15\"}, {\"v\": 89645160}, {\"v\": \"A\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000015.10:g.89645161A>T\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255|3\"]}, {\"v\": [\"MedGen|Gene:MedGen:OMIM:Orphanet\"]}, {\"v\": [\"CN169374|46:C0796147:200990:ORPHA36\"]}, {\"v\": [\"not_specified|Acrocallosal_syndrome\\\\x2c_Schinzel_type\"]}, {\"v\": [\"conf|single\"]}, {\"v\": [\"RCV000117416.4|RCV000261344.1\"]}]}, {\"c\": [{\"v\": \"16\"}, {\"v\": 2497032}, {\"v\": \"C\"}, {\"v\": \"G\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000016.10:g.2497033C>G\"]}, {\"v\": [1]}, {\"v\": [\"UniProtKB_(protein)\"]}, {\"v\": [1]}, {\"v\": [\"Q9ULP9#VAR_070890\"]}, {\"v\": [\"255|3|3|3\"]}, {\"v\": [\"MedGen|MedGen|MedGen:OMIM|MedGen:OMIM\"]}, {\"v\": [\"CN169374|C3809181|C3892048:616044|C3463992:308350\"]}, {\"v\": [\"not_specified|Caused_by_mutation_in_the_TBC1_domain_family\\\\x2c_member_24|Deafness\\\\x2c_autosomal_dominant_65|Epileptic_encephalopathy\\\\x2c_early_infantile\\\\x2c_1\"]}, {\"v\": [\"conf|single|single|single\"]}, {\"v\": [\"RCV000128367.6|RCV000477643.1|RCV000477643.1|RCV000477643.1\"]}]}, {\"c\": [{\"v\": \"16\"}, {\"v\": 56514588}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000016.10:g.56514589C\\\\x3d\", \"NC_000016.10:g.56514589C>T\"]}, {\"v\": [0, 1]}, {\"v\": [\"OMIM_Allelic_Variant\", \".\"]}, {\"v\": [1, 1]}, {\"v\": [\"606151.0013\", \".\"]}, {\"v\": [\"5\", \"2\"]}, {\"v\": [\"MedGen\", \"MedGen\"]}, {\"v\": [\"C4016908\", \"CN169374\"]}, {\"v\": [\"Bardet-biedl_syndrome_2/6\\\\x2c_digenic\", \"not_specified\"]}, {\"v\": [\"no_criteria\", \"single\"]}, {\"v\": [\"RCV000004838.4\", \"RCV000301991.1\"]}]}, {\"c\": [{\"v\": \"16\"}, {\"v\": 88805736}, {\"v\": \"T\"}, {\"v\": \"C\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000016.10:g.88805737T>C\"]}, {\"v\": [1]}, {\"v\": [\"UniProtKB_(protein)\"]}, {\"v\": [1]}, {\"v\": [\"Q9H211#VAR_054504\"]}, {\"v\": [\"255\"]}, {\"v\": [\"MedGen\"]}, {\"v\": [\"CN169374\"]}, {\"v\": [\"not_specified\"]}, {\"v\": [\"conf\"]}, {\"v\": [\"RCV000116652.3\"]}]}, {\"c\": [{\"v\": \"17\"}, {\"v\": 17796722}, {\"v\": \"AAGG\"}, {\"v\": \"A\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000017.11:g.17796729_17796731delGAG\"]}, {\"v\": [1]}, {\"v\": [\"HGMD\"]}, {\"v\": [1]}, {\"v\": [\"CD116392\"]}, {\"v\": [\"255|0\"]}, {\"v\": [\"MedGen|MedGen\"]}, {\"v\": [\"CN169374|CN221809\"]}, {\"v\": [\"not_specified|not_provided\"]}, {\"v\": [\"conf|single\"]}, {\"v\": [\"RCV000082265.6|RCV000118114.3\"]}]}, {\"c\": [{\"v\": \"18\"}, {\"v\": 2700878}, {\"v\": \"A\"}, {\"v\": \"G\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000018.10:g.2700879A>G\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255\"]}, {\"v\": [\"MedGen\"]}, {\"v\": [\"CN169374\"]}, {\"v\": [\"not_specified\"]}, {\"v\": [\"conf\"]}, {\"v\": [\"RCV000247697.2\"]}]}, {\"c\": [{\"v\": \"19\"}, {\"v\": 11132530}, {\"v\": \"C\"}, {\"v\": \"CA\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000019.10:g.11125285_11132532dup7248\", \"NC_000019.10:g.11132532dupA\"]}, {\"v\": [-1, 1]}, {\"v\": [\"LDLR_@_LOVD\", \".\"]}, {\"v\": [5, 1]}, {\"v\": [\"LDLR_000294\", \".\"]}, {\"v\": [\"255\", \"3\"]}, {\"v\": [\"MedGen:OMIM:SNOMED_CT:SNOMED_CT\", \"MedGen:OMIM:SNOMED_CT:SNOMED_CT\"]}, {\"v\": [\"C0020445:143890:397915002:398036000\", \"C0020445:143890:397915002:398036000\"]}, {\"v\": [\"Familial_hypercholesterolemia\", \"Familial_hypercholesterolemia\"]}, {\"v\": [\"conf\", \"single\"]}, {\"v\": [\"RCV000237281.1\", \"RCV000326993.1\"]}]}, {\"c\": [{\"v\": \"19\"}, {\"v\": 41414124}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000019.10:g.41414125C>T\"]}, {\"v\": [1]}, {\"v\": [\"HGMD|UniProtKB_(protein)\"]}, {\"v\": [1]}, {\"v\": [\"CM021497|P12694#VAR_034361\"]}, {\"v\": [\"255|3\"]}, {\"v\": [\"MedGen|MedGen:OMIM:Orphanet:SNOMED_CT\"]}, {\"v\": [\"CN169374|C0024776:248600:ORPHA268184:27718001\"]}, {\"v\": [\"not_specified|Maple_syrup_urine_disease\"]}, {\"v\": [\"conf|single\"]}, {\"v\": [\"RCV000079243.6|RCV000295914.1\"]}]}, {\"c\": [{\"v\": \"19\"}, {\"v\": 45179661}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000019.10:g.45179662C>T\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255|0\"]}, {\"v\": [\"MedGen|MedGen:SNOMED_CT\"]}, {\"v\": [\"CN169374|C0079504:9311003\"]}, {\"v\": [\"not_specified|Hermansky-Pudlak_syndrome\"]}, {\"v\": [\"conf|single\"]}, {\"v\": [\"RCV000150192.3|RCV000320201.1\"]}]}, {\"c\": [{\"v\": \"19\"}, {\"v\": 57231145}, {\"v\": \"G\"}, {\"v\": \"GC\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000019.10:g.57231150dupC\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255\"]}, {\"v\": [\"MedGen\"]}, {\"v\": [\"CN239485\"]}, {\"v\": [\"Spermatogenic_Failure\"]}, {\"v\": [\"conf\"]}, {\"v\": [\"RCV000311416.1\"]}]}, {\"c\": [{\"v\": \"2\"}, {\"v\": 178532038}, {\"v\": \"C\"}, {\"v\": \"T\"}, {\"v\": \"NA12878_ERR194147\"}, {\"v\": [\"NC_000002.12:g.178532039C>T\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [1]}, {\"v\": [\".\"]}, {\"v\": [\"255|2|2|2|3|3|3|3|3|3\"]}, {\"v\": [\"MedGen|MedGen:OMIM|MedGen:OMIM:Orphanet|MedGen|MedGen:OMIM:Orphanet|MedGen|Human_Phenotype_Ontology:MedGen:Orphanet|MedGen|MedGen:OMIM:Orphanet|MedGen:OMIM:Orphanet\"]}, {\"v\": [\"CN169374|C1858763:604145|C1837342:608807:ORPHA140922|CN230736|C2673677:611705:ORPHA289377|CN239352|HP:0001639:C0007194:ORPHA217569|CN239310|C1863599:603689:ORPHA178464|C1838244:600334:ORPHA609\"]}, {\"v\": [\"not_specified|Dilated_cardiomyopathy_1G|Limb-girdle_muscular_dystrophy\\\\x2c_type_2J|Cardiovascular_phenotype|Myopathy\\\\x2c_early-onset\\\\x2c_with_fatal_cardiomyopathy|Limb-Girdle_Muscular_Dystrophy\\\\x2c_Recessive|Hypertrophic_cardiomyopathy|Dilated_Cardiomyopathy\\\\x2c_Dominant|Hereditary_myopathy_with_early_respiratory_failure|Distal_myopathy_Markesbery-Griggs_type\"]}, {\"v\": [\"conf|single|single|single|single|single|single|single|single|single\"]}, {\"v\": [\"RCV000040944.8|RCV000231098.2|RCV000231098.2|RCV000244925.1|RCV000296735.1|RCV000311858.1|RCV000336579.1|RCV000351463.1|RCV000402951.1|RCV000403378.1\"]}]}], \"cols\": [{\"type\": \"string\", \"id\": \"chr\", \"label\": \"chr\"}, {\"type\": \"number\", \"id\": \"start\", \"label\": \"start\"}, {\"type\": \"string\", \"id\": \"reference_bases\", \"label\": \"reference_bases\"}, {\"type\": \"string\", \"id\": \"alt\", \"label\": \"alt\"}, {\"type\": \"string\", \"id\": \"call_set_name\", \"label\": \"call_set_name\"}, {\"type\": \"string\", \"id\": \"CLNHGVS\", \"label\": \"CLNHGVS\"}, {\"type\": \"string\", \"id\": \"CLNALLE\", \"label\": \"CLNALLE\"}, {\"type\": \"string\", \"id\": \"CLNSRC\", \"label\": \"CLNSRC\"}, {\"type\": \"string\", \"id\": \"CLNORIGIN\", \"label\": \"CLNORIGIN\"}, {\"type\": \"string\", \"id\": \"CLNSRCID\", \"label\": \"CLNSRCID\"}, {\"type\": \"string\", \"id\": \"CLNSIG\", \"label\": \"CLNSIG\"}, {\"type\": \"string\", \"id\": \"CLNDSDB\", \"label\": \"CLNDSDB\"}, {\"type\": \"string\", \"id\": \"CLNDSDBID\", \"label\": \"CLNDSDBID\"}, {\"type\": \"string\", \"id\": \"CLNDBN\", \"label\": \"CLNDBN\"}, {\"type\": \"string\", \"id\": \"CLNREVSTAT\", \"label\": \"CLNREVSTAT\"}, {\"type\": \"string\", \"id\": \"CLNACC\", \"label\": \"CLNACC\"}]},\n",
 56 |        "            {\n",
 57 |        "              pageSize: 25,\n",
 58 |        "              cssClassNames:  {\n",
 59 |        "                tableRow: 'gchart-table-row',\n",
 60 |        "                headerRow: 'gchart-table-headerrow',\n",
 61 |        "                oddTableRow: 'gchart-table-oddrow',\n",
 62 |        "                selectedTableRow: 'gchart-table-selectedrow',\n",
 63 |        "                hoverTableRow: 'gchart-table-hoverrow',\n",
 64 |        "                tableCell: 'gchart-table-cell',\n",
 65 |        "                headerCell: 'gchart-table-headercell',\n",
 66 |        "                rowNumberCell: 'gchart-table-rownumcell'\n",
 67 |        "              }\n",
 68 |        "            },\n",
 69 |        "            {source_index: 0, fields: 'chr,start,reference_bases,alt,call_set_name,CLNHGVS,CLNALLE,CLNSRC,CLNORIGIN,CLNSRCID,CLNSIG,CLNDSDB,CLNDSDBID,CLNDBN,CLNREVSTAT,CLNACC'},\n",
 70 |        "            0,\n",
 71 |        "            63);\n",
 72 |        "        }\n",
 73 |        "      );\n",
 74 |        "    </script>\n",
 75 |        "  "
 76 |       ],
 77 |       "text/plain": [
 78 |        "QueryResultsTable job_P6NRU_M3B1MeX_TpuZdGC9QTWZwp"
 79 |       ]
 80 |      },
 81 |      "execution_count": 1,
 82 |      "metadata": {},
 83 |      "output_type": "execute_result"
 84 |     }
 85 |    ],
 86 |    "source": [
 87 |     "%%bq query\n",
 88 |     "#standardSQL\n",
 89 |     "  --\n",
 90 |     "  -- Return variants for sample NA12878 that are:\n",
 91 |     "  --   annotated as 'pathogenic' or 'other' in ClinVar\n",
 92 |     "  --   with observed population frequency less than 5%\n",
 93 |     "  --\n",
 94 |     "  WITH sample_variants AS (\n",
 95 |     "  SELECT\n",
 96 |     "    -- Remove the 'chr' prefix from the reference name.\n",
 97 |     "    REGEXP_EXTRACT(reference_name, r'chr(.+)') AS chr,\n",
 98 |     "    start,\n",
 99 |     "    reference_bases,\n",
100 |     "    alt,\n",
101 |     "    call.call_set_name\n",
102 |     "  FROM\n",
103 |     "    `genomics-public-data.platinum_genomes_deepvariant.single_sample_genome_calls` v,\n",
104 |     "    v.call call,\n",
105 |     "    v.alternate_bases alt WITH OFFSET alt_offset\n",
106 |     "  WHERE\n",
107 |     "    call_set_name = 'NA12878_ERR194147'\n",
108 |     "    -- Require that at least one genotype matches this alternate.\n",
109 |     "    AND EXISTS (SELECT gt FROM UNNEST(call.genotype) gt WHERE gt = alt_offset+1)\n",
110 |     "    ),\n",
111 |     "  --\n",
112 |     "  --\n",
113 |     "  rare_pathenogenic_variants AS (\n",
114 |     "  SELECT\n",
115 |     "    -- ClinVar does not use the 'chr' prefix for reference names.\n",
116 |     "    reference_name AS chr,\n",
117 |     "    start,\n",
118 |     "    reference_bases,\n",
119 |     "    alt,\n",
120 |     "    CLNHGVS,\n",
121 |     "    CLNALLE,\n",
122 |     "    CLNSRC,\n",
123 |     "    CLNORIGIN,\n",
124 |     "    CLNSRCID,\n",
125 |     "    CLNSIG,\n",
126 |     "    CLNDSDB,\n",
127 |     "    CLNDSDBID,\n",
128 |     "    CLNDBN,\n",
129 |     "    CLNREVSTAT,\n",
130 |     "    CLNACC\n",
131 |     "  FROM\n",
132 |     "    `bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20170705` v,\n",
133 |     "    v.alternate_bases alt\n",
134 |     "  WHERE\n",
135 |     "    -- Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided,\n",
136 |     "    -- 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - Pathogenic,\n",
137 |     "    -- 6 - drug response, 7 - histocompatibility, 255 - other\n",
138 |     "    EXISTS (SELECT sig FROM UNNEST(CLNSIG) sig WHERE REGEXP_CONTAINS(sig, '(4|5|255)'))\n",
139 |     "    -- TRUE if >5% minor allele frequency in 1+ populations\n",
140 |     "    AND G5 IS NULL\n",
141 |     ")\n",
142 |     " --\n",
143 |     " --\n",
144 |     "SELECT\n",
145 |     "  *\n",
146 |     "FROM\n",
147 |     "  sample_variants\n",
148 |     "JOIN\n",
149 |     "  rare_pathenogenic_variants USING(chr,\n",
150 |     "    start,\n",
151 |     "    reference_bases,\n",
152 |     "    alt)\n",
153 |     "ORDER BY\n",
154 |     "  chr,\n",
155 |     "  start,\n",
156 |     "  reference_bases,\n",
157 |     "  alt"
158 |    ]
159 |   }
160 |  ],
161 |  "metadata": {
162 |   "kernelspec": {
163 |    "display_name": "Python 2",
164 |    "language": "python",
165 |    "name": "python2"
166 |   },
167 |   "language_info": {
168 |    "codemirror_mode": {
169 |     "name": "ipython",
170 |     "version": 2
171 |    },
172 |    "file_extension": ".py",
173 |    "mimetype": "text/x-python",
174 |    "name": "python",
175 |    "nbconvert_exporter": "python",
176 |    "pygments_lexer": "ipython2",
177 |    "version": "2.7.12"
178 |   }
179 |  },
180 |  "nbformat": 4,
181 |  "nbformat_minor": 0
182 | }
183 | 


--------------------------------------------------------------------------------
/interactive/README.md:
--------------------------------------------------------------------------------
 1 | **WARNING: Not actively maintained!**
 2 | 
 3 | Interactive Variant Annotation
 4 | ==============================
 5 | 
 6 | Given a particular set of variants for an individual or a cohort, the code here
 7 | will allow you to interactively annotate the sequence variants
 8 | using
 9 | [annotation resources available in BigQuery](http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/annotations_toc.html). Note
10 | that if there is a newer version of the annotation resource that you wish to
11 | use, [you can load it into BigQuery](../curation/tables).
12 | 
13 | ## Status of this sub-project
14 | 
15 | There is only one example here at the moment but see also similar work:
16 | 
17 | * http://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/COSMIC.html
18 | * https://github.com/googlegenomics/bigquery-examples/tree/master/platinumGenomes
19 | * http://googlegenomics.readthedocs.io/en/latest/use_cases/annotate_variants/interval_joins.html
20 | 
21 | TODO: add more example queries, Datalab notebooks and RMarkdown.
22 | 
23 | ## Examples
24 | 
25 | ### [Datalab](https://cloud.google.com/datalab/) Notebook Examples
26 | 
27 |  1. Notebook [InteractiveVariantAnnotation.ipynb](./InteractiveVariantAnnotation.ipynb) will return variants for sample NA12878 that are:
28 |    * annotated as 'pathogenic' or 'other' in ClinVar
29 |    * with observed population frequency less than 5%
30 | 
31 | 


--------------------------------------------------------------------------------