├── CHANGELOG.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── THIRD-PARTY-LICENSES ├── deploy └── upload_artifacts.sh ├── src ├── cfn_templates │ ├── apply-s3-lifecycle-stack.yml │ ├── code-build-stack.yml │ ├── e2e-sfn-stack.yml │ ├── omics-resources-stack.yml │ ├── s3-stack.yml │ ├── sfn-task-checker-stack.yml │ ├── sfn-trigger-stack.yml │ └── solution-cfn.yml ├── codebuild │ ├── buildspec_docker.yml │ └── buildspec_lambdas.yml ├── glue │ ├── PhenotypicGenomes.json │ └── etl.py ├── lambda │ ├── add_bucket_notification │ │ └── add_bucket_notification_lambda.py │ ├── apply_s3_lifecycle │ │ └── apply_s3_lifecycle_lambda.py │ ├── check_omcis_workflow_task │ │ └── lambda_check_omics_workflow_task.py │ ├── import_annotation │ │ └── import_annotation_lambda.py │ ├── import_reference │ │ └── import_reference_lambda.py │ ├── import_sequence │ │ └── import_sequence_lambda.py │ ├── import_variants │ │ └── import_variant_lambda.py │ ├── launch_genomics_sfn │ │ └── lambda_launch_genomics_sfn.py │ ├── start_workflow │ │ └── start_workflow_lambda.py │ └── trigger_code_build │ │ ├── trigger_docker_code_build.py │ │ └── trigger_lambdas_code_build.py ├── notebook │ └── Sample_queries_omics.ipynb └── workflow │ ├── main.wdl │ ├── parameter-template.json │ └── sub-workflows │ ├── fastq-to-bam.wdl │ ├── haplotypecaller-gvcf-gatk4.wdl │ └── processing-for-variant-discovery-gatk4.wdl └── static ├── arch_diagram.png ├── stepfunctions.png └── stepfunctions_graph_workflowstudio.png /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | All notable changes to this project will be documented in this file. 4 | 5 | ## [1.1.0] (2023-03-29) 6 | 7 | ### Features 8 | 9 | * Removed dependency on Omics API models as a Lambda Layer due to general availability 10 | * Use Omics CloudFormation resources instead of Custom Resources 11 | * Introduce checks in Step Functions State Machine to prevent duplicate workflows from being launched 12 | * Use existing Omics Reference store, if provided, else create a new one 13 | 14 | ## [1.0.0] (2022-11-30) 15 | 16 | ### Features 17 | 18 | * First release -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 
8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Amazon Omics - from raw sequence data to insights 2 | 3 | This GitHub repository contains the code and artifacts described in the blog post: [Part 2 – Automated End to End Genomics Data Processing and Analysis using Amazon Omics and AWS Step Functions](https://aws.amazon.com/blogs/industries/automated-end-to-end-genomics-data-storage-and-analysis-using-amazon-omics/). 4 | 5 | 6 | ## Reference architecture 7 | ![Alt text](static/arch_diagram.png?raw=true "Reference architecture using Step Functions and Lambda Functions") 8 | 9 | ## Prerequisites 10 | 11 | - Python 3.7 or above with the package installer - pip 12 | - Linux/UNIX environment to run the deployment shell scripts 13 | - Compression and file packaging utility - zip 14 | - AWS account with AdministratorAccess to deploy the various AWS resources using CloudFormation 15 | - An S3 bucket, for example `my-artifact-bucket`, within this account to upload all assets needed for deployment 16 | - AWS CLI v2 installed and configured for your AWS account to upload files to the artifact bucket (Installation instructions here: https://github.com/aws/aws-cli/tree/v2#installation) 17 | 18 | 19 | ``` 20 | Note that cross-region imports are not supported in Amazon Omics today. If you choose to deploy in another supported region outside of us-east-1, copy the example data used in the solution to a bucket in that region and update the permissions in the CloudFormation templates accordingly. 21 | ``` 22 | 23 | ## How to deploy 24 | 25 | 1. Once you clone the repository, navigate to the `deploy/` directory within the repository. 26 | 2. Run the deployment script to upload all required files to the artifact bucket: 27 | 28 | `sh upload_artifacts.sh my-artifact-bucket` 29 | ``` 30 | NOTE 31 | 32 | You can pass an optional second argument if you choose to use a specific AWS profile. 33 | ``` 34 | 3. Navigate to the AWS S3 Console. In the list of buckets, click on the artifact bucket and navigate to the `templates` prefix. Find the file named `solution-cfn.yml`. Copy the Object URL (begins with https://) for this object (not the S3 URI). 35 | 4. Navigate to the AWS CloudFormation Console. Click on `Create Stack`, select `Template is ready`, paste the above https:// Object URL into the `Amazon S3 URL` field and click `Next`. 36 | 5. Fill in the `Stack name` with a name of your choice, `ArtifactBucketName` with the artifact bucket name, and `WorkflowInputsBucketName` & `WorkflowOutputsBucketName` with new bucket names of your choice; these buckets will be created. 37 | 6. For the `CurrentReferenceStoreId` parameter, if the account that you plan to use has an existing reference store and you want to repurpose it, provide the Reference store ID as the value (only one reference store is allowed per account per region). If you don't have one and want to create a new one, provide the value `NONE`. 38 | 7.
Click Next on the subsequent two pages, then on the Review page, acknowledge the following 'Capabilities', and click `Submit`: 39 | - AWS CloudFormation might create IAM resources with custom names. 40 | - AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND 41 | 8. CloudFormation will now create multiple stacks with all the necessary resources, including Omics resources: 42 | - Omics Reference Store with the reference genome imported (Reference store creation is skipped if a store ID was provided) 43 | - Omics Sequence Store 44 | - Omics Workflow with the workflow definition and parameters defined 45 | - Omics Variant Store 46 | - Omics Annotation Store with ClinVar imported 47 | 48 | 9. It's recommended that users update the Omics permissions to least-privilege access when leveraging this sample code as a starting point for future production needs. 49 | 10. The CloudFormation stack should complete deployment in less than an hour. 50 | 51 | ## Usage 52 | 1. Once the template has been deployed successfully, you can use a pair of FASTQ files to launch the end-to-end secondary analysis pipeline. 53 | 54 | ``` 55 | Note that in this solution, the FASTQ files need to be named in the following manner: 56 | 57 | <sampleId>_R1.fastq.gz 58 | 59 | <sampleId>_R2.fastq.gz 60 | 61 | This can be adapted to your needs by updating the Python regex in start_workflow_lambda.py. 62 | 63 | You can also use the example FASTQs provided here to test: 64 | 65 | s3://aws-genomics-static-us-east-1/omics-e2e/test_fastqs/NA1287820K_R1.fastq.gz 66 | 67 | s3://aws-genomics-static-us-east-1/omics-e2e/test_fastqs/NA1287820K_R2.fastq.gz 68 | 69 | ``` 70 | 71 | 72 | 2. Upload these FASTQ files to the bucket used for `WorkflowInputsBucketName` under the `inputs` prefix. This bucket is configured such that FASTQ files uploaded to this prefix use S3 notifications to trigger a Lambda function that evaluates the inputs and launches the Step Functions workflow. You can monitor the workflow in the AWS Console for Step Functions by navigating to State Machines -> AmazonOmicsEndToEndStepFunction. You should see a running execution named `GENOMICS_<sampleId>_...`, where sampleId is extracted from the names of the FASTQ files used. 73 | 74 | ``` 75 | NOTE 76 | 77 | Currently, if both FASTQs are uploaded simultaneously, the Step Functions trigger Lambda has a best-effort mechanism to avoid race conditions by adding a random delay and checking for a running execution with the same sample name. It's recommended to check for a duplicate execution as a precaution. 78 | ``` 79 | 80 | 3. The Step Functions workflow has the following steps: 81 | - Import the FASTQ files to the pre-created Omics Sequence Store. 82 | - Start the pre-created Omics Workflow with the input FASTQs. 83 | - Import the workflow output BAM file to the pre-created Omics Sequence Store and the output VCF file to the pre-created Omics Variant Store in parallel. 84 | - Apply S3 object tags to the input FASTQ and output BAM and VCF files so that S3 lifecycle policies can be applied. 85 | 86 | ![Alt text](static/stepfunctions_graph_workflowstudio.png?raw=true "Step Function State Machine") 87 | 88 | 4. Since these steps are asynchronous API calls, the state machine uses Lambda tasks to poll for completion and moves on to the next step on success. The Step Functions workflow takes about 3 hours to complete with the test FASTQs provided above; the duration varies with the size of the inputs chosen.
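The polling is handled by a dedicated checker Lambda (`lambda_check_omics_workflow_task.py`): the state machine's Wait/Choice loops invoke it with a `task_type` and `task_params` payload and branch on the returned `task_status` (`COMPLETED`, `FAILED`, or anything else, which loops back to the Wait state). The sketch below only illustrates that polling pattern; it is not the repository's implementation, and the straight pass-through of the Omics job status (without normalization or error handling) is an assumption.

```
# Minimal sketch of the polling pattern used by the Step Functions checker task.
# Not the repository's lambda_check_omics_workflow_task.py; status normalization
# and error handling are omitted.
import boto3

omics = boto3.client("omics")

def lambda_handler(event, context):
    task_type = event["task_type"]
    params = event.get("task_params", {})

    if task_type == "GetReadSetImportJob":
        # Import of FASTQ/BAM read sets into the Omics Sequence Store
        response = omics.get_read_set_import_job(
            sequenceStoreId=params["sequence_store_id"], id=params["id"]
        )
    elif task_type == "GetRun":
        # The Omics (GATK) workflow run itself
        response = omics.get_run(id=params["id"])
    elif task_type == "GetVariantImportJob":
        # Import of the output VCF into the Omics Variant Store
        response = omics.get_variant_import_job(jobId=params["job_id"])
    else:
        raise ValueError(f"Unsupported task_type: {task_type}")

    # The Choice states compare this value against COMPLETED / FAILED and
    # otherwise return to the corresponding Wait state.
    return {"task_status": response["status"]}
```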
89 | 90 | Note that if there is a Step Functions workflow failure, users can refer to this blog post for instructions on how to resume a Step Functions workflow - https://aws.amazon.com/blogs/compute/resume-aws-step-functions-from-any-state/ 91 | 92 | 5. Now that the variants are available in the Omics Variant Store and the pre-loaded annotations in the Omics Annotation Store, you can create resource links for them in AWS Lake Formation, grant permissions to the desired users, and query the resulting tables in Amazon Athena to derive insights (see instructions on how to provide Lake Formation permissions in the blog post). Note that for the example notebook, we used genomic data from the example [Ovation Dx NAFLD Whole Genome dataset](https://aws.amazon.com/marketplace/pp/prodview-565xa6uzf77wu?sr=0-1&ref_=beagle&applicationId=AWS-Marketplace-Console#offers) from AWS Data Exchange. 93 | 94 | ## Cleanup 95 | 96 | The above solution deploys several AWS resources as part of the CloudFormation stack. If you choose to clean up the resources created by this solution, take the following steps: 97 | 98 | 1. Delete the CloudFormation stack with the name that was assigned at creation. This will start deleting all the resources created. 99 | 2. Due to certain actions taken during usage of the solution, resources such as the workflow input and output buckets and the ECR repositories will fail to delete because they are not empty. In order to delete them as well, empty the contents of the S3 buckets for workflow inputs and outputs and delete the images created under the Amazon ECR repositories (if you chose to clean up these resources). Once deleted, you can re-attempt to delete the CloudFormation stack. 100 | 3. If the Omics resources fail to be deleted by the delete stack action in CloudFormation, users will need to manually delete the Omics resources created by the stack, such as the workflow, variant store, annotation store, sequence store and reference store (or just the imported reference genome). Once done, you can re-attempt to delete the CloudFormation stack. 101 | 4. If certain custom CloudFormation resources, such as Lambda functions in the Omics and CodeBuild stacks, fail to delete again, simply retrying the deletion of the parent stack should delete them. 102 | 103 | 104 | ## License 105 | This library is licensed under the MIT-0 License. See the LICENSE file. 106 | 107 | ## Authors 108 | 109 | Nadeem Bulsara | Sr. Solutions Architect - Genomics, BDSI | AWS 110 | 111 | Sujaya Srinivasan | Genomics Solutions Architect, WWPS | AWS 112 | 113 | David Obenshain | Cloud Application Architect, WWPS Professional Services | AWS 114 | 115 | Gargi Singh Chhatwal | Sr. Solutions Architect - Federal Civilian, WWPS | AWS 116 | 117 | Joshua Broyde | Sr. AI/ML Solutions Architect, BDSI | AWS -------------------------------------------------------------------------------- /THIRD-PARTY-LICENSES: -------------------------------------------------------------------------------- 1 | ** CrHelper; version 2.0.11 -- https://github.com/aws-cloudformation/custom-resource-helper 2 | 3 | Apache License 4 | Version 2.0, January 2004 5 | http://www.apache.org/licenses/ 6 | 7 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 8 | 9 | 1. Definitions. 10 | 11 | "License" shall mean the terms and conditions for use, reproduction, and 12 | distribution as defined by Sections 1 through 9 of this document.
13 | 14 | "Licensor" shall mean the copyright owner or entity authorized by the copyright 15 | owner that is granting the License. 16 | 17 | "Legal Entity" shall mean the union of the acting entity and all other entities 18 | that control, are controlled by, or are under common control with that entity. 19 | For the purposes of this definition, "control" means (i) the power, direct or 20 | indirect, to cause the direction or management of such entity, whether by 21 | contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity exercising 25 | permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, including 28 | but not limited to software source code, documentation source, and configuration 29 | files. 30 | 31 | "Object" form shall mean any form resulting from mechanical transformation or 32 | translation of a Source form, including but not limited to compiled object code, 33 | generated documentation, and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or Object form, made 36 | available under the License, as indicated by a copyright notice that is included 37 | in or attached to the work (an example is provided in the Appendix below). 38 | 39 | "Derivative Works" shall mean any work, whether in Source or Object form, that 40 | is based on (or derived from) the Work and for which the editorial revisions, 41 | annotations, elaborations, or other modifications represent, as a whole, an 42 | original work of authorship. For the purposes of this License, Derivative Works 43 | shall not include works that remain separable from, or merely link (or bind by 44 | name) to the interfaces of, the Work and Derivative Works thereof. 45 | 46 | "Contribution" shall mean any work of authorship, including the original version 47 | of the Work and any modifications or additions to that Work or Derivative Works 48 | thereof, that is intentionally submitted to Licensor for inclusion in the Work 49 | by the copyright owner or by an individual or Legal Entity authorized to submit 50 | on behalf of the copyright owner. For the purposes of this definition, 51 | "submitted" means any form of electronic, verbal, or written communication sent 52 | to the Licensor or its representatives, including but not limited to 53 | communication on electronic mailing lists, source code control systems, and 54 | issue tracking systems that are managed by, or on behalf of, the Licensor for 55 | the purpose of discussing and improving the Work, but excluding communication 56 | that is conspicuously marked or otherwise designated in writing by the copyright 57 | owner as "Not a Contribution." 58 | 59 | "Contributor" shall mean Licensor and any individual or Legal Entity on behalf 60 | of whom a Contribution has been received by Licensor and subsequently 61 | incorporated within the Work. 62 | 63 | 2. Grant of Copyright License. Subject to the terms and conditions of this 64 | License, each Contributor hereby grants to You a perpetual, worldwide, non- 65 | exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, 66 | prepare Derivative Works of, publicly display, publicly perform, sublicense, and 67 | distribute the Work and such Derivative Works in Source or Object form. 68 | 69 | 3. Grant of Patent License. 
Subject to the terms and conditions of this License, 70 | each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no- 71 | charge, royalty-free, irrevocable (except as stated in this section) patent 72 | license to make, have made, use, offer to sell, sell, import, and otherwise 73 | transfer the Work, where such license applies only to those patent claims 74 | licensable by such Contributor that are necessarily infringed by their 75 | Contribution(s) alone or by combination of their Contribution(s) with the Work 76 | to which such Contribution(s) was submitted. If You institute patent litigation 77 | against any entity (including a cross-claim or counterclaim in a lawsuit) 78 | alleging that the Work or a Contribution incorporated within the Work 79 | constitutes direct or contributory patent infringement, then any patent licenses 80 | granted to You under this License for that Work shall terminate as of the date 81 | such litigation is filed. 82 | 83 | 4. Redistribution. You may reproduce and distribute copies of the Work or 84 | Derivative Works thereof in any medium, with or without modifications, and in 85 | Source or Object form, provided that You meet the following conditions: 86 | 87 | (a) You must give any other recipients of the Work or Derivative Works a 88 | copy of this License; and 89 | 90 | (b) You must cause any modified files to carry prominent notices stating 91 | that You changed the files; and 92 | 93 | (c) You must retain, in the Source form of any Derivative Works that You 94 | distribute, all copyright, patent, trademark, and attribution notices from the 95 | Source form of the Work, excluding those notices that do not pertain to any part 96 | of the Derivative Works; and 97 | 98 | (d) If the Work includes a "NOTICE" text file as part of its distribution, 99 | then any Derivative Works that You distribute must include a readable copy of 100 | the attribution notices contained within such NOTICE file, excluding those 101 | notices that do not pertain to any part of the Derivative Works, in at least one 102 | of the following places: within a NOTICE text file distributed as part of the 103 | Derivative Works; within the Source form or documentation, if provided along 104 | with the Derivative Works; or, within a display generated by the Derivative 105 | Works, if and wherever such third-party notices normally appear. The contents of 106 | the NOTICE file are for informational purposes only and do not modify the 107 | License. You may add Your own attribution notices within Derivative Works that 108 | You distribute, alongside or as an addendum to the NOTICE text from the Work, 109 | provided that such additional attribution notices cannot be construed as 110 | modifying the License. 111 | 112 | You may add Your own copyright statement to Your modifications and may 113 | provide additional or different license terms and conditions for use, 114 | reproduction, or distribution of Your modifications, or for any such Derivative 115 | Works as a whole, provided Your use, reproduction, and distribution of the Work 116 | otherwise complies with the conditions stated in this License. 117 | 118 | 5. Submission of Contributions. Unless You explicitly state otherwise, any 119 | Contribution intentionally submitted for inclusion in the Work by You to the 120 | Licensor shall be under the terms and conditions of this License, without any 121 | additional terms or conditions. 
Notwithstanding the above, nothing herein shall 122 | supersede or modify the terms of any separate license agreement you may have 123 | executed with Licensor regarding such Contributions. 124 | 125 | 6. Trademarks. This License does not grant permission to use the trade names, 126 | trademarks, service marks, or product names of the Licensor, except as required 127 | for reasonable and customary use in describing the origin of the Work and 128 | reproducing the content of the NOTICE file. 129 | 130 | 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in 131 | writing, Licensor provides the Work (and each Contributor provides its 132 | Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 133 | KIND, either express or implied, including, without limitation, any warranties 134 | or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 135 | PARTICULAR PURPOSE. You are solely responsible for determining the 136 | appropriateness of using or redistributing the Work and assume any risks 137 | associated with Your exercise of permissions under this License. 138 | 139 | 8. Limitation of Liability. In no event and under no legal theory, whether in 140 | tort (including negligence), contract, or otherwise, unless required by 141 | applicable law (such as deliberate and grossly negligent acts) or agreed to in 142 | writing, shall any Contributor be liable to You for damages, including any 143 | direct, indirect, special, incidental, or consequential damages of any character 144 | arising as a result of this License or out of the use or inability to use the 145 | Work (including but not limited to damages for loss of goodwill, work stoppage, 146 | computer failure or malfunction, or any and all other commercial damages or 147 | losses), even if such Contributor has been advised of the possibility of such 148 | damages. 149 | 150 | 9. Accepting Warranty or Additional Liability. While redistributing the Work or 151 | Derivative Works thereof, You may choose to offer, and charge a fee for, 152 | acceptance of support, warranty, indemnity, or other liability obligations 153 | and/or rights consistent with this License. However, in accepting such 154 | obligations, You may act only on Your own behalf and on Your sole 155 | responsibility, not on behalf of any other Contributor, and only if You agree to 156 | indemnify, defend, and hold each Contributor harmless for any liability incurred 157 | by, or claims asserted against, such Contributor by reason of your accepting any 158 | such warranty or additional liability. 159 | 160 | END OF TERMS AND CONDITIONS 161 | 162 | APPENDIX: How to apply the Apache License to your work. 163 | 164 | To apply the Apache License to your work, attach the following boilerplate 165 | notice, with the fields enclosed by brackets "[]" replaced with your own 166 | identifying information. (Don't include the brackets!) The text should be 167 | enclosed in the appropriate comment syntax for the file format. We also 168 | recommend that a file or class name and description of purpose be included on 169 | the same "printed page" as the copyright notice for easier identification within 170 | third-party archives. 171 | 172 | Copyright [yyyy] [name of copyright owner] 173 | 174 | Licensed under the Apache License, Version 2.0 (the "License"); 175 | you may not use this file except in compliance with the License. 
176 | You may obtain a copy of the License at 177 | 178 | http://www.apache.org/licenses/LICENSE-2.0 179 | 180 | Unless required by applicable law or agreed to in writing, software 181 | distributed under the License is distributed on an "AS IS" BASIS, 182 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 183 | See the License for the specific language governing permissions and 184 | limitations under the License. 185 | 186 | * For CrHelper see also this required NOTICE: 187 | Custom Resource Helper Copyright 2019 Amazon.com, Inc. or its affiliates. 188 | All Rights Reserved. 189 | This library is licensed under the Apache 2.0 License. 190 | Decorator implementation inspired by https://github.com/ryansb/cfn-wrapper- 191 | python 192 | Log implementation inspired by https://gitlab.com/hadrien/aws_lambda_logging 193 | 194 | ------ 195 | 196 | ** gatk4-germline-snps-indels; version 2.3.1 -- https://github.com/gatk-workflows/gatk4-germline-snps-indels 197 | Copyright Broad Institute, 2019 | BSD-3 This script is released under the WDL 198 | open source code license (BSD-3) (full license text at 199 | https://github.com/openwdl/wdl/blob/master/LICENSE). Note however that the 200 | programs it calls may be subject to different licenses. Users are responsible 201 | for checking that they are authorized to run all programs before running this 202 | script. 203 | 204 | BSD 3-Clause License 205 | 206 | Copyright (c) 2017, Broad Institute 207 | All rights reserved. 208 | 209 | Redistribution and use in source and binary forms, with or without 210 | modification, are permitted provided that the following conditions are met: 211 | 212 | * Redistributions of source code must retain the above copyright notice, this 213 | list of conditions and the following disclaimer. 214 | 215 | * Redistributions in binary form must reproduce the above copyright notice, 216 | this list of conditions and the following disclaimer in the documentation 217 | and/or other materials provided with the distribution. 218 | 219 | * Neither the name of the copyright holder nor the names of its 220 | contributors may be used to endorse or promote products derived from 221 | this software without specific prior written permission. 222 | 223 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 224 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 225 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 226 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 227 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 228 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 229 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 230 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 231 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 232 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
233 | 234 | ------ 235 | 236 | ** jsonschema; version 4.16.0 -- https://github.com/python-jsonschema/jsonschema 237 | N/A 238 | 239 | MIT License 240 | 241 | Copyright (c) 242 | 243 | Permission is hereby granted, free of charge, to any person obtaining a copy of 244 | this software and associated documentation files (the "Software"), to deal in 245 | the Software without restriction, including without limitation the rights to 246 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 247 | the Software, and to permit persons to whom the Software is furnished to do so, 248 | subject to the following conditions: 249 | 250 | The above copyright notice and this permission notice shall be included in all 251 | copies or substantial portions of the Software. 252 | 253 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 254 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 255 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 256 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 257 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 258 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 259 | -------------------------------------------------------------------------------- /deploy/upload_artifacts.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | ARTIFACT_S3_BUCKET=s3://${1} 6 | AWS_PROFILE=$2 7 | TEMPLATES='templates' 8 | LAMBDAS='lambdas' 9 | WORKFLOWS='workflows' 10 | BUILDSPECS='buildspecs' 11 | 12 | # use profile if 2nd argument provided 13 | if [ $# -eq 2 ] 14 | then 15 | AWS_PROFILE=" --profile ${2}" 16 | fi 17 | 18 | echo "==========================================" 19 | echo "TRIGGER CODEBUILD JOB LAMBDAS" 20 | echo "==========================================" 21 | cd ../src/lambda/trigger_code_build 22 | echo "Installing pip packages" 23 | pip3 install crhelper boto3==1.26.65 -t ./package 24 | cd ./package 25 | zip -r ../trigger_docker_code_build.zip . 26 | cd .. 27 | echo "Zip lambda to artifact" 28 | zip -g trigger_docker_code_build.zip trigger_docker_code_build.py 29 | aws s3 cp trigger_docker_code_build.zip $ARTIFACT_S3_BUCKET/$LAMBDAS/ $AWS_PROFILE 30 | rm trigger_docker_code_build.zip 31 | echo "done with trigger_docker_code_build.zip" 32 | 33 | cd ./package 34 | zip -r ../trigger_lambdas_code_build.zip . 35 | cd .. 
36 | echo "Zip lambda to artifact" 37 | zip -g trigger_lambdas_code_build.zip trigger_lambdas_code_build.py 38 | aws s3 cp trigger_lambdas_code_build.zip $ARTIFACT_S3_BUCKET/$LAMBDAS/ $AWS_PROFILE 39 | rm trigger_lambdas_code_build.zip 40 | rm -r ./package 41 | echo "done with trigger_lambdas_code_build.zip" 42 | echo "Uploaded all lambdas Zip files to trigger Code Build jobs to ${ARTIFACT_S3_BUCKET}/${LAMBDAS}/" 43 | 44 | echo "==========================================" 45 | echo "UPLOAD CLOUDFORMATION TEMPLATES" 46 | echo "==========================================" 47 | echo "iterate through cfn templates and upload to s3" 48 | cd ../../../deploy 49 | for f in $(find ../src/cfn_templates/ -name '*.yml' -or -name '*.yaml'); do echo "uploading `basename $f`" && aws s3 cp $f $ARTIFACT_S3_BUCKET/$TEMPLATES/ $AWS_PROFILE && echo "Done"; done 50 | echo "Uploaded all cfn template files to ${ARTIFACT_S3_BUCKET}/${TEMPLATES}/" 51 | 52 | echo "==========================================" 53 | echo "ZIP and UPLOAD WORKFLOW FILES" 54 | echo "==========================================" 55 | echo "zip and upload workflow files " 56 | cd ../src/workflow/ 57 | zip -r gatkbestpractices.wdl.zip main.wdl sub-workflows/ 58 | aws s3 cp gatkbestpractices.wdl.zip $ARTIFACT_S3_BUCKET/$WORKFLOWS/ $AWS_PROFILE 59 | aws s3 cp parameter-template.json $ARTIFACT_S3_BUCKET/$WORKFLOWS/ $AWS_PROFILE 60 | rm gatkbestpractices.wdl.zip 61 | echo "uploaded all workflow files to ${ARTIFACT_S3_BUCKET}/${WORKFLOWS}/" 62 | cd - 63 | 64 | echo "==========================================" 65 | echo "UPLOAD LAMBDA FILES" 66 | echo "==========================================" 67 | echo "iterate through lambdas and upload to s3" 68 | for f in $(find ../src/lambda/ -name '*.py'); do echo "uploading `basename $f`" && aws s3 cp $f $ARTIFACT_S3_BUCKET/$LAMBDAS/ $AWS_PROFILE && echo "Done"; done 69 | echo "Uploaded all lambda files to ${ARTIFACT_S3_BUCKET}/${LAMBDAS}/" 70 | 71 | echo "==========================================" 72 | echo "UPLOAD CODEBUILD BUILDSPEC FILES" 73 | echo "==========================================" 74 | echo "iterate through buildspecs and upload to s3" 75 | for f in $(find ../src/codebuild/ -name '*.yml' -or -name '*.yaml'); do echo "uploading `basename $f`" && aws s3 cp $f $ARTIFACT_S3_BUCKET/$BUILDSPECS/ $AWS_PROFILE && echo "Done"; done 76 | echo "Uploaded all buildspec files to ${ARTIFACT_S3_BUCKET}/${BUILDSPECS}/" 77 | 78 | echo "==========================================" 79 | echo "DONE" 80 | echo "==========================================" -------------------------------------------------------------------------------- /src/cfn_templates/apply-s3-lifecycle-stack.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Description: >- 3 | Lambda to tag relevant genomics inputs and output 4 | files in S3 so that S3 life cycle policies can be applied 5 | Parameters: 6 | InputsBucketName: 7 | Type: String 8 | Description: S3 bucket that's used for inputs (e.g. FASTQs) 9 | OutputsBucketName: 10 | Type: String 11 | Description: S3 bucket that's used for outputs (e.g.
BAM/CRAM, VCFs) 12 | LambdaBucketName: 13 | Type: String 14 | Description: S3 bucket where lambda code artifacts are stored 15 | LambdaArtifactPrefix: 16 | Type: String 17 | Description: Prefix in bucket where lambda artifacts are stored 18 | Resources: 19 | ApplyS3LifecycleLambdaFunction: 20 | Type: 'AWS::Lambda::Function' 21 | Properties: 22 | FunctionName: apply-s3-lifecycle 23 | Code: 24 | S3Bucket: !Ref LambdaBucketName 25 | S3Key: !Sub "${LambdaArtifactPrefix}apply_s3_lifecycle_lambda.zip" 26 | Handler: apply_s3_lifecycle_lambda.lambda_handler 27 | Role: !GetAtt LambdaCleanupIAMRole.Arn 28 | Runtime: python3.9 29 | Timeout: 30 30 | LambdaCleanupIAMRole: 31 | Type: 'AWS::IAM::Role' 32 | Properties: 33 | AssumeRolePolicyDocument: 34 | Version: 2012-10-17 35 | Statement: 36 | - Effect: Allow 37 | Principal: 38 | Service: 39 | - lambda.amazonaws.com 40 | Action: 41 | - 'sts:AssumeRole' 42 | Policies: 43 | - PolicyName: TaggingAndLogging 44 | PolicyDocument: 45 | Version: 2012-10-17 46 | Statement: 47 | - Effect: Allow 48 | Action: 49 | - 's3:PutObjectTagging' 50 | - 's3:GetObjectTagging' 51 | - 's3:ListBucket' 52 | Resource: 53 | - !Sub 'arn:aws:s3:::${InputsBucketName}' 54 | - !Sub 'arn:aws:s3:::${InputsBucketName}/*' 55 | - !Sub 'arn:aws:s3:::${OutputsBucketName}' 56 | - !Sub 'arn:aws:s3:::${OutputsBucketName}/*' 57 | - Effect: Allow 58 | Action: 59 | - 'logs:CreateLogGroup' 60 | - 'logs:CreateLogStream' 61 | - 'logs:PutLogEvents' 62 | Resource: 'arn:aws:logs:*:*:*' 63 | Outputs: 64 | ApplyS3LifecycleLambdaFunctionArn: 65 | Value: !GetAtt ApplyS3LifecycleLambdaFunction.Arn 66 | Export: 67 | Name: ApplyS3LifecycleLambdaFunctionArn 68 | -------------------------------------------------------------------------------- /src/cfn_templates/code-build-stack.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Description: >- 3 | Build Code Build projects and Lambdas to trigger the build of all required 4 | lambda functions and Docker images for Omics resources 5 | Parameters: 6 | ResourcesS3Bucket: 7 | Type: String 8 | LambdasS3Prefix: 9 | Type: String 10 | BuildSpecS3Prefix: 11 | Type: String 12 | DockerGenomesInTheCloud: 13 | Type: String 14 | DockerGatk: 15 | Type: String 16 | DockerCodeBuildBuildSpecS3File: 17 | Type: String 18 | Default: buildspec_docker.yml 19 | LambdasCodeBuildBuildSpecS3File: 20 | Type: String 21 | Default: buildspec_lambdas.yml 22 | Resources: 23 | ECRGenomesInTheCloud: 24 | Type: 'AWS::ECR::Repository' 25 | Properties: 26 | RepositoryName: genomes-in-the-cloud 27 | RepositoryPolicyText: 28 | Version: 2012-10-17 29 | Statement: 30 | - Sid: AllowOmicsToPull 31 | Effect: Allow 32 | Principal: 33 | Service: omics.amazonaws.com 34 | Action: 35 | - 'ecr:BatchGetImage' 36 | - 'ecr:GetDownloadUrlForLayer' 37 | - 'ecr:BatchCheckLayerAvailability' 38 | ECRGatk: 39 | Type: 'AWS::ECR::Repository' 40 | Properties: 41 | RepositoryName: gatk 42 | RepositoryPolicyText: 43 | Version: 2012-10-17 44 | Statement: 45 | - Sid: AllowOmicsToPull 46 | Effect: Allow 47 | Principal: 48 | Service: omics.amazonaws.com 49 | Action: 50 | - 'ecr:BatchGetImage' 51 | - 'ecr:GetDownloadUrlForLayer' 52 | - 'ecr:BatchCheckLayerAvailability' 53 | CodeBuildServiceRole: 54 | Type: 'AWS::IAM::Role' 55 | Properties: 56 | AssumeRolePolicyDocument: 57 | Version: 2012-10-17 58 | Statement: 59 | - Action: 60 | - 'sts:AssumeRole' 61 | Effect: Allow 62 | Principal: 63 | Service: 64 | - codebuild.amazonaws.com 65 | Path: / 66 | Policies: 
67 | - PolicyName: CodeBuildServiceRolePolicy 68 | PolicyDocument: 69 | Statement: 70 | - Effect: Allow 71 | Action: 72 | - 'ecr:BatchCheckLayerAvailability' 73 | - 'ecr:CompleteLayerUpload' 74 | - 'ecr:GetAuthorizationToken' 75 | - 'ecr:InitiateLayerUpload' 76 | - 'ecr:PutImage' 77 | - 'ecr:UploadLayerPart' 78 | Resource: '*' 79 | - Effect: Allow 80 | Action: 81 | - 's3:GetObject' 82 | - 's3:GetBucketLocation' 83 | - 's3:ListBucket' 84 | - 's3:PutObject' 85 | Resource: 86 | - !Sub 'arn:aws:s3:::${ResourcesS3Bucket}' 87 | - !Sub 'arn:aws:s3:::${ResourcesS3Bucket}/*' 88 | - Effect: Allow 89 | Action: 90 | - 'logs:CreateLogGroup' 91 | - 'logs:CreateLogStream' 92 | - 'logs:PutLogEvents' 93 | Resource: 94 | - !Sub >- 95 | arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/codebuild/* 96 | DockerCodeBuildProject: 97 | Type: 'AWS::CodeBuild::Project' 98 | DependsOn: 99 | - ECRGenomesInTheCloud 100 | - ECRGatk 101 | - CodeBuildServiceRole 102 | - LambdasCodeBuildProject 103 | Properties: 104 | Name: DockerCodeBuildProject 105 | ServiceRole: !Sub '${CodeBuildServiceRole.Arn}' 106 | Artifacts: 107 | Type: NO_ARTIFACTS 108 | Environment: 109 | Type: linuxContainer 110 | ComputeType: BUILD_GENERAL1_SMALL 111 | Image: 'aws/codebuild/standard:3.0' 112 | PrivilegedMode: true 113 | Source: 114 | Type: NO_SOURCE 115 | BuildSpec: !Sub >- 116 | arn:aws:s3:::${ResourcesS3Bucket}/${BuildSpecS3Prefix}/${DockerCodeBuildBuildSpecS3File} 117 | TimeoutInMinutes: 20 118 | LambdasCodeBuildProject: 119 | Type: 'AWS::CodeBuild::Project' 120 | DependsOn: 121 | - CodeBuildServiceRole 122 | Properties: 123 | Name: LambdasCodeBuildProject 124 | ServiceRole: !Sub '${CodeBuildServiceRole.Arn}' 125 | Artifacts: 126 | Type: NO_ARTIFACTS 127 | Environment: 128 | Type: linuxContainer 129 | ComputeType: BUILD_GENERAL1_SMALL 130 | Image: 'aws/codebuild/standard:3.0' 131 | PrivilegedMode: true 132 | EnvironmentVariables: 133 | - Name: RESOURCES_BUCKET 134 | Type: PLAINTEXT 135 | Value: !Ref ResourcesS3Bucket 136 | - Name: RESOURCES_PREFIX 137 | Type: PLAINTEXT 138 | Value: !Ref LambdasS3Prefix 139 | Source: 140 | Type: NO_SOURCE 141 | BuildSpec: !Sub >- 142 | arn:aws:s3:::${ResourcesS3Bucket}/${BuildSpecS3Prefix}/${LambdasCodeBuildBuildSpecS3File} 143 | TimeoutInMinutes: 10 144 | TriggerDockerCodeBuildGenomesInTheCloud: 145 | Type: 'Custom::TriggerDockerCodeBuildGitc' 146 | DependsOn: 147 | - DockerCodeBuildProject 148 | - TriggerDockerCodeBuildLambda 149 | Version: 1 150 | Properties: 151 | ServiceToken: !Sub '${TriggerDockerCodeBuildLambda.Arn}' 152 | ProjectName: DockerCodeBuildProject 153 | SourceRepo: !Ref DockerGenomesInTheCloud 154 | EcrRepo: !GetAtt ECRGenomesInTheCloud.RepositoryUri 155 | TriggerDockerCodeBuildGatk: 156 | Type: 'Custom::TriggerDockerCodeBuildGatk' 157 | DependsOn: 158 | - DockerCodeBuildProject 159 | - TriggerDockerCodeBuildLambda 160 | Version: 1 161 | Properties: 162 | ServiceToken: !Sub '${TriggerDockerCodeBuildLambda.Arn}' 163 | ProjectName: DockerCodeBuildProject 164 | SourceRepo: !Ref DockerGatk 165 | EcrRepo: !GetAtt ECRGatk.RepositoryUri 166 | TriggerLambdasCodeBuild: 167 | Type: 'Custom::TriggerLambdasCodeBuild' 168 | DependsOn: 169 | - LambdasCodeBuildProject 170 | - TriggerLambdasCodeBuildLambda 171 | Version: 1 172 | Properties: 173 | ServiceToken: !Sub '${TriggerLambdasCodeBuildLambda.Arn}' 174 | ProjectName: LambdasCodeBuildProject 175 | TriggerDockerCodeBuildLambda: 176 | Type: 'AWS::Lambda::Function' 177 | DependsOn: 178 | - TriggerCodeBuildLambdaRole 179 | Properties: 180 | 
Handler: trigger_docker_code_build.handler 181 | Runtime: python3.9 182 | FunctionName: trigger-docker-code-build 183 | Code: 184 | S3Bucket: !Sub '${ResourcesS3Bucket}' 185 | S3Key: !Sub '${LambdasS3Prefix}trigger_docker_code_build.zip' 186 | Role: !Sub '${TriggerCodeBuildLambdaRole.Arn}' 187 | Timeout: 60 188 | TriggerLambdasCodeBuildLambda: 189 | Type: 'AWS::Lambda::Function' 190 | DependsOn: 191 | - TriggerCodeBuildLambdaRole 192 | Properties: 193 | Handler: trigger_lambdas_code_build.handler 194 | Runtime: python3.9 195 | FunctionName: trigger-lambdas-code-build 196 | Code: 197 | S3Bucket: !Sub '${ResourcesS3Bucket}' 198 | S3Key: !Sub '${LambdasS3Prefix}trigger_lambdas_code_build.zip' 199 | Role: !Sub '${TriggerCodeBuildLambdaRole.Arn}' 200 | Timeout: 60 201 | TriggerCodeBuildLambdaRole: 202 | Type: 'AWS::IAM::Role' 203 | DependsOn: [] 204 | Properties: 205 | AssumeRolePolicyDocument: 206 | Version: 2012-10-17 207 | Statement: 208 | - Action: 209 | - 'sts:AssumeRole' 210 | Effect: Allow 211 | Principal: 212 | Service: 213 | - lambda.amazonaws.com 214 | Path: / 215 | Policies: 216 | - PolicyName: TriggerCodeBuildLambdaRolePolicy 217 | PolicyDocument: 218 | Statement: 219 | - Effect: Allow 220 | Action: 221 | - 'logs:CreateLogGroup' 222 | - 'logs:CreateLogStream' 223 | - 'logs:PutLogEvents' 224 | Resource: 225 | - !Sub >- 226 | arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/* 227 | - Effect: Allow 228 | Action: 229 | - 'lambda:AddPermission' 230 | - 'lambda:RemovePermission' 231 | - 'events:PutRule' 232 | - 'events:DeleteRule' 233 | - 'events:PutTargets' 234 | - 'events:RemoveTargets' 235 | Resource: '*' 236 | - Effect: Allow 237 | Action: 238 | - 'iam:GetRole' 239 | - 'iam:PassRole' 240 | Resource: !Sub '${CodeBuildServiceRole.Arn}' 241 | - Effect: Allow 242 | Action: 243 | - 'codebuild:StartBuild' 244 | - 'codebuild:BatchGetBuilds' 245 | Resource: 246 | - !Sub '${DockerCodeBuildProject.Arn}' 247 | - !Sub '${LambdasCodeBuildProject.Arn}' 248 | Outputs: 249 | EcrImageUriGotc: 250 | Value: !GetAtt TriggerDockerCodeBuildGenomesInTheCloud.EcrImageUri 251 | EcrImageUriGatk: 252 | Value: !GetAtt TriggerDockerCodeBuildGatk.EcrImageUri -------------------------------------------------------------------------------- /src/cfn_templates/e2e-sfn-stack.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: "2010-09-09" 2 | Parameters: 3 | OmicsImportSequenceLambdaArn: 4 | Type: String 5 | OmicsImportSequenceJobRoleArn: 6 | Type: String 7 | CheckOmicsTaskLambdaFunctionArn: 8 | Type: String 9 | OmicsWorkflowStartRunLambdaArn: 10 | Type: String 11 | OmicsWorkflowStartRunJobRoleArn: 12 | Type: String 13 | OmicsImportVariantLambdaArn: 14 | Type: String 15 | OmicsImportVariantJobRoleArn: 16 | Type: String 17 | ApplyS3LifecycleLambdaFunctionArn: 18 | Type: String 19 | ReferenceFastaFileS3Uri: 20 | Type: String 21 | Mills1000GIndelsVcf: 22 | Type: String 23 | DbSnpVcf: 24 | Type: String 25 | KnownIndelsVcf: 26 | Type: String 27 | RunDate: 28 | Type: String 29 | Description: Example run date 30 | Default: 2016-09-01T02:00:00+0200 31 | PlatformName: 32 | Type: String 33 | Description: Example platform name 34 | Default: Illumina 35 | SequencingCenter: 36 | Type: String 37 | Description: Example sequencing center 38 | Default: ABCD 39 | OmicsVariantStoreName: 40 | Type: String 41 | 42 | Description: >- 43 | State Machine for ingesting FASTQs into Omics Sequence Store, 44 | running GATK workflow with input FASTQs, 45 | ingesting 
post-workflow outputs into Omics Sequence and Variamt Stores, 46 | and S3 file tagging to enable activation of S3 lifecycle policies on inputs and outputs 47 | Resources: 48 | AmazonOmicsStepFunction: 49 | Type: AWS::StepFunctions::StateMachine 50 | Properties: 51 | RoleArn: !GetAtt AmazonOmicsStepFunctionRole.Arn 52 | StateMachineName: AmazonOmicsEndToEndStepFunction 53 | DefinitionSubstitutions: 54 | OmicsImportSequenceLambdaArn: !Ref OmicsImportSequenceLambdaArn 55 | OmicsImportSequenceJobRoleArn: !Ref OmicsImportSequenceJobRoleArn 56 | CheckOmicsTaskLambdaFunctionArn: !Ref CheckOmicsTaskLambdaFunctionArn 57 | OmicsWorkflowStartRunLambdaArn: !Ref OmicsWorkflowStartRunLambdaArn 58 | OmicsWorkflowStartRunJobRoleArn: !Ref OmicsWorkflowStartRunJobRoleArn 59 | OmicsImportVariantJobRoleArn: !Ref OmicsImportVariantJobRoleArn 60 | OmicsImportVariantLambdaArn: !Ref OmicsImportVariantLambdaArn 61 | ApplyS3LifecycleLambdaFunctionArn: !Ref ApplyS3LifecycleLambdaFunctionArn 62 | 63 | DefinitionString: !Sub | 64 | { 65 | "Comment": "StateMachine to orchestrate end-to-end Omics Workflow", 66 | "StartAt": "IngestFastqToReadSet", 67 | "States": { 68 | "IngestFastqToReadSet": { 69 | "InputPath": "$", 70 | "Next": "WaitForFastqIngest", 71 | "Parameters": { 72 | "FunctionName": "${OmicsImportSequenceLambdaArn}", 73 | "Payload": { 74 | "FileType": "FASTQ", 75 | "Read1.$": "$.Read1", 76 | "Read2.$": "$.Read2", 77 | "ReferenceArn.$": "$.ReferenceArn", 78 | "SampleId.$": "$.SampleId", 79 | "SequenceStoreId.$": "$.SequenceStoreId", 80 | "SubjectId.$": "$.SubjectId", 81 | "RoleArn": "${OmicsImportSequenceJobRoleArn}" 82 | } 83 | }, 84 | "Resource": "arn:aws:states:::lambda:invoke", 85 | "ResultSelector": { 86 | "import_fastq.$": "$.Payload" 87 | }, 88 | "ResultPath": "$.import_fastq", 89 | "Type": "Task" 90 | }, 91 | "WaitForFastqIngest": { 92 | "Next": "CheckFastqIngest", 93 | "Seconds": 10, 94 | "Type": "Wait" 95 | }, 96 | "SuccessState": { 97 | "Type": "Succeed" 98 | }, 99 | "CheckFastqIngest": { 100 | "InputPath": "$", 101 | "Next": "FastqIngestDone?", 102 | "Parameters": { 103 | "FunctionName": "${CheckOmicsTaskLambdaFunctionArn}", 104 | "Payload": { 105 | "task_type": "GetReadSetImportJob", 106 | "task_params": { 107 | "id.$": "$.import_fastq.import_fastq.importReadSetJobId", 108 | "sequence_store_id.$": "$.SequenceStoreId" 109 | } 110 | } 111 | }, 112 | "Resource": "arn:aws:states:::lambda:invoke", 113 | "ResultPath": "$.import_fastq.import_fastq.status.message", 114 | "Type": "Task" 115 | }, 116 | "FastqIngestFailed": { 117 | "Cause": "Fastq Ingest Failed", 118 | "Error": "$.import_fastq.import_fastq.status.error", 119 | "Type": "Fail" 120 | }, 121 | "FastqIngestDone?": { 122 | "Choices": [ 123 | { 124 | "Next": "RunOmicsWorkflowLambda", 125 | "StringEquals": "COMPLETED", 126 | "Variable": "$.import_fastq.import_fastq.status.message.Payload.task_status" 127 | }, 128 | { 129 | "Next": "FastqIngestFailed", 130 | "StringEquals": "FAILED", 131 | "Variable": "$.import_fastq.import_fastq.status.message.Payload.task_status" 132 | } 133 | ], 134 | "Default": "WaitForFastqIngest", 135 | "Type": "Choice" 136 | }, 137 | "RunOmicsWorkflowLambda": { 138 | "InputPath": "$", 139 | "Next": "WaitForOmicsWorkflow", 140 | "Parameters": { 141 | "FunctionName": "${OmicsWorkflowStartRunLambdaArn}", 142 | "Payload": { 143 | "sample_name.$": "$.SampleId", 144 | "fastq_1.$": "$.Read1", 145 | "fastq_2.$": "$.Read2", 146 | "ref_fasta": "${ReferenceFastaFileS3Uri}", 147 | "readgroup_name.$": "$.SampleId", 148 | "library_name.$": 
"$.SampleId", 149 | "platform_name": "${PlatformName}", 150 | "run_date": "${RunDate}", 151 | "sequencing_center": "${SequencingCenter}", 152 | "dbSNP_vcf": "${DbSnpVcf}", 153 | "Mills_1000G_indels_vcf": "${Mills1000GIndelsVcf}", 154 | "known_indels_vcf": "${KnownIndelsVcf}", 155 | "scattered_calling_intervals_archive.$": "$.IntervalsS3Path", 156 | "gatk_docker.$": "$.GatkDockerUri", 157 | "gotc_docker.$": "$.GotcDockerUri", 158 | "WorkflowId.$": "$.WorkflowId", 159 | "JobRoleArn": "${OmicsWorkflowStartRunJobRoleArn}", 160 | "OutputS3Path.$": "$.WorkflowOutputS3Path" 161 | } 162 | }, 163 | "Resource": "arn:aws:states:::lambda:invoke", 164 | "ResultSelector": { 165 | "workflow.$": "$.Payload" 166 | }, 167 | "ResultPath": "$.workflow", 168 | "Type": "Task" 169 | }, 170 | "WaitForOmicsWorkflow": { 171 | "Next": "CheckOmicsWorkflow", 172 | "Seconds": 60, 173 | "Type": "Wait" 174 | }, 175 | "CheckOmicsWorkflow": { 176 | "InputPath": "$", 177 | "Next": "OmicsWorkflowDone?", 178 | "Parameters": { 179 | "FunctionName": "${CheckOmicsTaskLambdaFunctionArn}", 180 | "Payload": { 181 | "task_type": "GetRun", 182 | "task_params": { 183 | "id.$": "$.workflow.workflow.WorkflowRunId" 184 | } 185 | } 186 | }, 187 | "Resource": "arn:aws:states:::lambda:invoke", 188 | "ResultPath": "$.workflow.workflow.status.message", 189 | "Type": "Task" 190 | }, 191 | "OmicsWorkflowDone?": { 192 | "Choices": [ 193 | { 194 | "Next": "PostWorkflowIngest", 195 | "StringEquals": "COMPLETED", 196 | "Variable": "$.workflow.workflow.status.message.Payload.task_status" 197 | }, 198 | { 199 | "Next": "OmicsWorkflowFailed", 200 | "StringEquals": "FAILED", 201 | "Variable": "$.workflow.workflow.status.message.Payload.task_status" 202 | } 203 | ], 204 | "Default": "WaitForOmicsWorkflow", 205 | "Type": "Choice" 206 | }, 207 | "OmicsWorkflowFailed": { 208 | "Cause": "Omics Workflow Failed", 209 | "Error": "$.workflow.workflow.status.error", 210 | "Type": "Fail" 211 | }, 212 | "PostWorkflowIngest": 213 | { 214 | "Branches": 215 | [ 216 | { 217 | "StartAt": "IngestBamToReadSet", 218 | "States": 219 | { 220 | "BamIngestDone?": 221 | { 222 | "Choices": 223 | [ 224 | { 225 | "Next": "PostWorkflowBamIngestCompleted", 226 | "StringEquals": "COMPLETED", 227 | "Variable": "$.import_bam.import_bam.status.message.Payload.task_status" 228 | }, 229 | { 230 | "Next": "BamIngestFailed", 231 | "StringEquals": "FAILED", 232 | "Variable": "$.import_bam.import_bam.status.message.Payload.task_status" 233 | } 234 | ], 235 | "Default": "WaitForBamIngest", 236 | "Type": "Choice" 237 | }, 238 | "BamIngestFailed": 239 | { 240 | "Cause": "Post Workflow BAM Ingest Failed", 241 | "Error": "$.import_bam.import_bam.status.error", 242 | "Type": "Fail" 243 | }, 244 | "CheckBamIngest": 245 | { 246 | "InputPath": "$", 247 | "Next": "BamIngestDone?", 248 | "Parameters": { 249 | "FunctionName": "${CheckOmicsTaskLambdaFunctionArn}", 250 | "Payload": { 251 | "task_type": "GetReadSetImportJob", 252 | "task_params": { 253 | "id.$": "$.import_bam.import_bam.importReadSetJobId", 254 | "sequence_store_id.$": "$.SequenceStoreId" 255 | } 256 | } 257 | }, 258 | "Resource": "arn:aws:states:::lambda:invoke", 259 | "ResultPath": "$.import_bam.import_bam.status.message", 260 | "Type": "Task" 261 | }, 262 | "IngestBamToReadSet": 263 | { 264 | "InputPath": "$", 265 | "Next": "WaitForBamIngest", 266 | "Parameters": 267 | { 268 | "FunctionName": "${OmicsImportSequenceLambdaArn}", 269 | "Payload": { 270 | "FileType": "BAM", 271 | "Read1.$": 
"States.Format('{}/{}/out/analysis_ready_bam/{}.hg38.bam', $.WorkflowOutputS3Path, $.workflow.workflow.WorkflowRunId, $.SampleId)", 272 | "ReferenceArn.$": "$.ReferenceArn", 273 | "SampleId.$": "$.SampleId", 274 | "SequenceStoreId.$": "$.SequenceStoreId", 275 | "SubjectId.$": "$.SubjectId", 276 | "RoleArn": "${OmicsImportSequenceJobRoleArn}" 277 | } 278 | }, 279 | "Resource": "arn:aws:states:::lambda:invoke", 280 | "ResultSelector": { 281 | "import_bam.$": "$.Payload" 282 | }, 283 | "ResultPath": "$.import_bam", 284 | "Type": "Task" 285 | }, 286 | "PostWorkflowBamIngestCompleted": 287 | { 288 | "End": true, 289 | "Type": "Pass" 290 | }, 291 | "WaitForBamIngest": 292 | { 293 | "Next": "CheckBamIngest", 294 | "Seconds": 10, 295 | "Type": "Wait" 296 | } 297 | } 298 | }, 299 | { 300 | "StartAt": "IngestVcfToVariantStore", 301 | "States": 302 | { 303 | "CheckVcfIngest": 304 | { 305 | "InputPath": "$", 306 | "Next": "VcfIngestDone?", 307 | "Parameters": 308 | { 309 | "FunctionName": "${CheckOmicsTaskLambdaFunctionArn}", 310 | "Payload": { 311 | "task_type": "GetVariantImportJob", 312 | "task_params": { 313 | "job_id.$": "$.import_vcf.import_vcf.VariantImportJobId" 314 | } 315 | } 316 | }, 317 | "Resource": "arn:aws:states:::lambda:invoke", 318 | "ResultPath": "$.import_vcf.import_vcf.status.message", 319 | "Type": "Task" 320 | }, 321 | "IngestVcfToVariantStore": 322 | { 323 | "InputPath": "$", 324 | "Next": "WaitForVcfIngest", 325 | "Parameters": 326 | { 327 | "FunctionName": "${OmicsImportVariantLambdaArn}", 328 | "Payload": { 329 | "VariantStoreName": "${OmicsVariantStoreName}", 330 | "OmicsImportVariantRoleArn": "${OmicsImportVariantJobRoleArn}", 331 | "VcfS3Uri.$": "States.Format('{}/{}/out/output_vcf/{}.hg38.vcf.gz', $.WorkflowOutputS3Path, $.workflow.workflow.WorkflowRunId, $.SampleId)" 332 | } 333 | }, 334 | "Resource": "arn:aws:states:::lambda:invoke", 335 | "ResultSelector": { 336 | "import_vcf.$": "$.Payload" 337 | }, 338 | "ResultPath": "$.import_vcf", 339 | "Type": "Task" 340 | }, 341 | "PostWorkflowVcfIngestCompleted": 342 | { 343 | "End": true, 344 | "Type": "Pass" 345 | }, 346 | "VcfIngestDone?": 347 | { 348 | "Choices": 349 | [ 350 | { 351 | "Next": "PostWorkflowVcfIngestCompleted", 352 | "StringEquals": "COMPLETED", 353 | "Variable": "$.import_vcf.import_vcf.status.message.Payload.task_status" 354 | }, 355 | { 356 | "Next": "VcfIngestFailed", 357 | "StringEquals": "FAILED", 358 | "Variable": "$.import_vcf.import_vcf.status.message.Payload.task_status" 359 | } 360 | ], 361 | "Default": "WaitForVcfIngest", 362 | "Type": "Choice" 363 | }, 364 | "VcfIngestFailed": 365 | { 366 | "Cause": "Post Workflow VCF Ingest to Variant Store Failed", 367 | "Error": "$.import_vcf.import_vcf.status.error", 368 | "Type": "Fail" 369 | }, 370 | "WaitForVcfIngest": 371 | { 372 | "Next": "CheckVcfIngest", 373 | "Seconds": 10, 374 | "Type": "Wait" 375 | } 376 | } 377 | } 378 | ], 379 | "Next": "AddLifeCycleTags", 380 | "Type": "Parallel" 381 | }, 382 | "AddLifeCycleTags": 383 | { 384 | "Next": "SuccessState", 385 | "InputPath": "$.[0]", 386 | "Parameters": 387 | { 388 | "FunctionName": "${ApplyS3LifecycleLambdaFunctionArn}", 389 | "Payload": { 390 | "inputs": { 391 | "fastq.$": "States.Array($.Read1,$.Read2)" 392 | }, 393 | "outputs": { 394 | "vcf.$": "States.Array(States.Format('{}/{}/out/output_vcf/{}.hg38.vcf.gz', $.WorkflowOutputS3Path, $.workflow.workflow.WorkflowRunId, $.SampleId))", 395 | "bam.$": "States.Array(States.Format('{}/{}/out/analysis_ready_bam/{}.hg38.bam', $.WorkflowOutputS3Path, 
$.workflow.workflow.WorkflowRunId, $.SampleId))" 396 | } 397 | } 398 | }, 399 | "Resource": "arn:aws:states:::lambda:invoke", 400 | "Type": "Task" 401 | } 402 | } 403 | } 404 | 405 | AmazonOmicsStepFunctionRole: 406 | Type: AWS::IAM::Role 407 | Properties: 408 | RoleName: AmazonOmicsStepFunctionRole 409 | AssumeRolePolicyDocument: 410 | Version: '2012-10-17' 411 | Statement: 412 | - Effect: Allow 413 | Principal: 414 | Service: !Sub 'states.${AWS::Region}.amazonaws.com' 415 | Action: 'sts:AssumeRole' 416 | Policies: 417 | - PolicyName: InvokeLambda 418 | PolicyDocument: 419 | Statement: 420 | - Effect: Allow 421 | Action: 'lambda:InvokeFunction' 422 | Resource: 423 | - !Ref OmicsImportSequenceLambdaArn 424 | - !Ref CheckOmicsTaskLambdaFunctionArn 425 | - !Ref OmicsWorkflowStartRunLambdaArn 426 | - !Ref OmicsImportVariantLambdaArn 427 | - !Ref ApplyS3LifecycleLambdaFunctionArn 428 | Outputs: 429 | AmazonOmicsStepFunctionArn: 430 | Value: !GetAtt AmazonOmicsStepFunction.Arn 431 | Export: 432 | Name: AmazonOmicsStepFunctionArn -------------------------------------------------------------------------------- /src/cfn_templates/omics-resources-stack.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Description: All necessary Amazon Omics Resources to store and process Genomics data 3 | Parameters: 4 | ExistingReferenceStoreId: 5 | Type: String 6 | Default: 'NONE' 7 | Description: 'Provide Reference Store ID if exists in current account-region, else leave it NONE' 8 | OmicsResourcePrefix: 9 | Type: String 10 | Default: omics-cfn 11 | OmicsResourcesS3Bucket: 12 | Type: String 13 | OmicsCustomResourceLambdaS3Prefix: 14 | Type: String 15 | OmicsWorkflowInputBucketName: 16 | Type: String 17 | OmicsWorkflowOutputBucketName: 18 | Type: String 19 | OmicsReferenceFastaUri: 20 | Type: String 21 | OmicsReferenceName: 22 | Type: String 23 | ClinvarS3Path: 24 | Type: String 25 | OmicsAnnotationStoreName: 26 | Type: String 27 | OmicsVariantStoreName: 28 | Type: String 29 | AnnotationStoreFormat: 30 | Type: String 31 | Default: VCF 32 | OmicsWorkflowDefinitionZipS3: 33 | Type: String 34 | Conditions: 35 | ToCreateReferenceStore: !Equals 36 | - !Sub '${ExistingReferenceStoreId}' 37 | - 'NONE' 38 | Resources: 39 | 40 | OmicsReferenceStore: 41 | Type: AWS::Omics::ReferenceStore 42 | Condition: ToCreateReferenceStore 43 | Properties: 44 | Name: !Sub '${OmicsResourcePrefix}-reference-store' 45 | 46 | OmicsImportReference: 47 | Type: 'Custom::OmicsImportReference' 48 | Version: 1 49 | Properties: 50 | ServiceToken: !Sub '${OmicsImportReferenceLambda.Arn}' 51 | ReferenceStoreId: 52 | !If [ 53 | ToCreateReferenceStore, 54 | !GetAtt OmicsReferenceStore.ReferenceStoreId, 55 | !Ref ExistingReferenceStoreId] 56 | ReferenceName: !Sub '${OmicsReferenceName}' 57 | OmicsImportReferenceRoleArn: !Sub '${OmicsImportReferenceJobRole.Arn}' 58 | ReferenceSourceS3Uri: !Ref OmicsReferenceFastaUri 59 | 60 | OmicsImportReferenceLambda: 61 | Type: 'AWS::Lambda::Function' 62 | Properties: 63 | Handler: import_reference_lambda.handler 64 | Runtime: python3.9 65 | FunctionName: !Sub '${OmicsResourcePrefix}-import-reference' 66 | Code: 67 | S3Bucket: !Sub '${OmicsResourcesS3Bucket}' 68 | S3Key: !Sub '${OmicsCustomResourceLambdaS3Prefix}import_reference_lambda.zip' 69 | Role: !Sub '${OmicsImportReferenceLambdaRole.Arn}' 70 | Timeout: 60 71 | OmicsImportReferenceLambdaRole: 72 | Type: 'AWS::IAM::Role' 73 | Properties: 74 | AssumeRolePolicyDocument: 75 | Version: 
2012-10-17 76 | Statement: 77 | - Action: 78 | - 'sts:AssumeRole' 79 | Effect: Allow 80 | Principal: 81 | Service: 82 | - lambda.amazonaws.com 83 | Path: / 84 | Policies: 85 | - PolicyName: ImportReferencePolicy 86 | PolicyDocument: 87 | Statement: 88 | - Effect: Allow 89 | Action: 90 | - 'logs:CreateLogGroup' 91 | - 'logs:CreateLogStream' 92 | - 'logs:PutLogEvents' 93 | Resource: 94 | - !Sub >- 95 | arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/* 96 | - Effect: Allow 97 | Action: 98 | - 'omics:*' 99 | Resource: '*' 100 | - Effect: Allow 101 | Action: 102 | - 'lambda:AddPermission' 103 | - 'lambda:RemovePermission' 104 | - 'events:PutRule' 105 | - 'events:DeleteRule' 106 | - 'events:PutTargets' 107 | - 'events:RemoveTargets' 108 | Resource: '*' 109 | - Effect: Allow 110 | Action: 111 | - 'iam:GetRole' 112 | - 'iam:PassRole' 113 | Resource: !Sub '${OmicsImportReferenceJobRole.Arn}' 114 | OmicsImportReferenceJobRole: 115 | Type: 'AWS::IAM::Role' 116 | Properties: 117 | AssumeRolePolicyDocument: 118 | Version: 2012-10-17 119 | Statement: 120 | - Action: 121 | - 'sts:AssumeRole' 122 | Effect: Allow 123 | Principal: 124 | Service: 125 | - omics.amazonaws.com 126 | Path: / 127 | Policies: 128 | - PolicyName: ImportReferenceJobRolePolicy 129 | PolicyDocument: 130 | Statement: 131 | - Effect: Allow 132 | Action: 133 | - 's3:GetObject' 134 | - 's3:GetBucketLocation' 135 | - 's3:ListBucket' 136 | Resource: 137 | - !Sub 'arn:aws:s3:::${OmicsResourcesS3Bucket}' 138 | - !Sub 'arn:aws:s3:::${OmicsResourcesS3Bucket}/*' 139 | - !Sub 'arn:aws:s3:::broad-references' 140 | - !Sub 'arn:aws:s3:::broad-references/*' 141 | OmicsVariantStore: 142 | Type: AWS::Omics::VariantStore 143 | DependsOn: 144 | - OmicsAnnotationStore 145 | Properties: 146 | Description: String 147 | Name: !Sub '${OmicsVariantStoreName}' 148 | Reference: 149 | ReferenceArn: !Sub '${OmicsImportReference.Arn}' 150 | 151 | OmicsImportVariantLambda: 152 | Type: 'AWS::Lambda::Function' 153 | Properties: 154 | Handler: import_variant_lambda.handler 155 | Runtime: python3.9 156 | FunctionName: !Sub '${OmicsResourcePrefix}-import-variant' 157 | Code: 158 | S3Bucket: !Sub '${OmicsResourcesS3Bucket}' 159 | S3Key: !Sub '${OmicsCustomResourceLambdaS3Prefix}import_variant_lambda.zip' 160 | Role: !Sub '${OmicsImportVariantLambdaRole.Arn}' 161 | Timeout: 60 162 | OmicsImportVariantLambdaRole: 163 | Type: 'AWS::IAM::Role' 164 | Properties: 165 | AssumeRolePolicyDocument: 166 | Version: 2012-10-17 167 | Statement: 168 | - Action: 169 | - 'sts:AssumeRole' 170 | Effect: Allow 171 | Principal: 172 | Service: 173 | - lambda.amazonaws.com 174 | Path: / 175 | Policies: 176 | - PolicyName: ImportVariantPolicy 177 | PolicyDocument: 178 | Statement: 179 | - Effect: Allow 180 | Action: 181 | - 'logs:CreateLogGroup' 182 | - 'logs:CreateLogStream' 183 | - 'logs:PutLogEvents' 184 | Resource: 185 | - !Sub >- 186 | arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/* 187 | - Effect: Allow 188 | Action: 189 | - 'omics:*' 190 | Resource: '*' 191 | - Effect: Allow 192 | Action: 193 | - 'iam:GetRole' 194 | - 'iam:PassRole' 195 | Resource: !Sub '${OmicsImportVariantJobRole.Arn}' 196 | OmicsImportVariantJobRole: 197 | Type: 'AWS::IAM::Role' 198 | Properties: 199 | AssumeRolePolicyDocument: 200 | Version: 2012-10-17 201 | Statement: 202 | - Action: 203 | - 'sts:AssumeRole' 204 | Effect: Allow 205 | Principal: 206 | Service: 207 | - omics.amazonaws.com 208 | Path: / 209 | Policies: 210 | - PolicyName: OmicsImportVariantJobRolePolicy 211 | 
PolicyDocument: 212 | Statement: 213 | - Effect: Allow 214 | Action: 215 | - 's3:GetObject' 216 | - 's3:GetBucketLocation' 217 | - 's3:ListBucket' 218 | Resource: 219 | - !Sub 'arn:aws:s3:::${OmicsWorkflowOutputBucketName}' 220 | - !Sub 'arn:aws:s3:::${OmicsWorkflowOutputBucketName}/*' 221 | - Effect: Allow 222 | Action: 223 | - 'omics:GetReference' 224 | - 'omics:GetReferenceMetadata' 225 | Resource: 226 | - !Sub 'arn:aws:omics:${AWS::Region}:${AWS::AccountId}:referenceStore/*' 227 | OmicsAnnotationStore: 228 | Type: AWS::Omics::AnnotationStore 229 | Properties: 230 | Name: !Sub '${OmicsAnnotationStoreName}' 231 | Reference: 232 | ReferenceArn: !Sub '${OmicsImportReference.Arn}' 233 | StoreFormat: !Sub '${AnnotationStoreFormat}' 234 | 235 | OmicsImportAnnotation: 236 | Type: 'Custom::OmicsImportAnnotation' 237 | DependsOn: 238 | - OmicsAnnotationStore 239 | Version: 1 240 | Properties: 241 | ServiceToken: !Sub '${OmicsImportAnnotationLambda.Arn}' 242 | AnnotationStoreName: !Sub '${OmicsAnnotationStoreName}' 243 | OmicsImportAnnotationRoleArn: !Sub '${OmicsImportAnnotationJobRole.Arn}' 244 | AnnotationSourceS3Uri: !Ref ClinvarS3Path 245 | 246 | OmicsImportAnnotationLambda: 247 | Type: 'AWS::Lambda::Function' 248 | Properties: 249 | Handler: import_annotation_lambda.handler 250 | Runtime: python3.9 251 | FunctionName: !Sub '${OmicsResourcePrefix}-import-annotation' 252 | Code: 253 | S3Bucket: !Sub '${OmicsResourcesS3Bucket}' 254 | S3Key: !Sub '${OmicsCustomResourceLambdaS3Prefix}import_annotation_lambda.zip' 255 | Role: !Sub '${OmicsImportAnnotationLambdaRole.Arn}' 256 | Timeout: 60 257 | 258 | OmicsImportAnnotationLambdaRole: 259 | Type: 'AWS::IAM::Role' 260 | Properties: 261 | AssumeRolePolicyDocument: 262 | Version: 2012-10-17 263 | Statement: 264 | - Action: 265 | - 'sts:AssumeRole' 266 | Effect: Allow 267 | Principal: 268 | Service: 269 | - lambda.amazonaws.com 270 | Path: / 271 | Policies: 272 | - PolicyName: ImportAnnotationPolicy 273 | PolicyDocument: 274 | Statement: 275 | - Effect: Allow 276 | Action: 277 | - 'logs:CreateLogGroup' 278 | - 'logs:CreateLogStream' 279 | - 'logs:PutLogEvents' 280 | Resource: 281 | - !Sub >- 282 | arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/* 283 | - Effect: Allow 284 | Action: 285 | - 'omics:*' 286 | Resource: '*' 287 | - Effect: Allow 288 | Action: 289 | - 'lambda:AddPermission' 290 | - 'lambda:RemovePermission' 291 | - 'events:PutRule' 292 | - 'events:DeleteRule' 293 | - 'events:PutTargets' 294 | - 'events:RemoveTargets' 295 | Resource: '*' 296 | - Effect: Allow 297 | Action: 298 | - 'iam:GetRole' 299 | - 'iam:PassRole' 300 | Resource: !Sub '${OmicsImportAnnotationJobRole.Arn}' 301 | OmicsImportAnnotationJobRole: 302 | Type: 'AWS::IAM::Role' 303 | Properties: 304 | AssumeRolePolicyDocument: 305 | Version: 2012-10-17 306 | Statement: 307 | - Action: 308 | - 'sts:AssumeRole' 309 | Effect: Allow 310 | Principal: 311 | Service: 312 | - omics.amazonaws.com 313 | Path: / 314 | Policies: 315 | - PolicyName: ImportAnnotationJobRolePolicy 316 | PolicyDocument: 317 | Statement: 318 | - Effect: Allow 319 | Action: 320 | - 's3:GetObject' 321 | - 's3:GetBucketLocation' 322 | - 's3:ListBucket' 323 | Resource: 324 | - !Sub 'arn:aws:s3:::${OmicsResourcesS3Bucket}' 325 | - !Sub 'arn:aws:s3:::${OmicsResourcesS3Bucket}/*' 326 | - arn:aws:s3:::aws-genomics-datasets 327 | - 'arn:aws:s3:::aws-genomics-datasets/*' 328 | - arn:aws:s3:::aws-genomics-static-us-east-1 329 | - 'arn:aws:s3:::aws-genomics-static-us-east-1/*' 330 | - Effect: Allow 331 | 
Action: 332 | - 'omics:GetReference' 333 | - 'omics:GetReferenceMetadata' 334 | Resource: 335 | - !Sub 'arn:aws:omics:${AWS::Region}:${AWS::AccountId}:referenceStore/*' 336 | 337 | OmicsSequenceStore: 338 | Type: AWS::Omics::SequenceStore 339 | Properties: 340 | Name: !Sub '${OmicsResourcePrefix}-sequence-store' 341 | 342 | OmicsCreateWorkflow: 343 | Type: AWS::Omics::Workflow 344 | Properties: 345 | DefinitionUri: !Sub ${OmicsWorkflowDefinitionZipS3} 346 | Description: !Sub '${OmicsResourcePrefix} test workflow' 347 | Name: !Sub '${OmicsResourcePrefix}-test-workflow' 348 | ParameterTemplate: 349 | sample_name: 350 | Description: "sample name" 351 | fastq_1: 352 | Description: "path to fastq1" 353 | fastq_2: 354 | Description: "path to fastq2" 355 | ref_fasta: 356 | Description: "path to reference fasta" 357 | readgroup_name: 358 | Description: "readgroup name" 359 | library_name: 360 | Description: "library name" 361 | platform_name: 362 | Description: "platform name e.g. Illumina" 363 | run_date: 364 | Description: "sequencing run date" 365 | sequencing_center: 366 | Description: "name of sequencing center" 367 | dbSNP_vcf: 368 | Description: "dbsnp vcf" 369 | Mills_1000G_indels_vcf: 370 | Description: "Mills 1000 genomes gold indels vcf" 371 | known_indels_vcf: 372 | Description: "known indels vcf" 373 | scattered_calling_intervals_archive: 374 | Description: "tar (not gzip) of scatter intervals" 375 | gatk_docker: 376 | Description: "docker uri in private ECR of GATK" 377 | gotc_docker: 378 | Description: "docker uri in private ECR of Genomes in the Cloud" 379 | 380 | OmicsWorkflowStartRunLambda: 381 | Type: 'AWS::Lambda::Function' 382 | Properties: 383 | Handler: start_workflow_lambda.handler 384 | Runtime: python3.9 385 | FunctionName: !Sub '${OmicsResourcePrefix}-start-workflow' 386 | Code: 387 | S3Bucket: !Sub '${OmicsResourcesS3Bucket}' 388 | S3Key: !Sub '${OmicsCustomResourceLambdaS3Prefix}start_workflow_lambda.zip' 389 | Role: !Sub '${OmicsWorkflowStartRunLambdaRole.Arn}' 390 | Timeout: 60 391 | 392 | OmicsWorkflowStartRunLambdaRole: 393 | Type: 'AWS::IAM::Role' 394 | Properties: 395 | AssumeRolePolicyDocument: 396 | Version: 2012-10-17 397 | Statement: 398 | - Action: 399 | - 'sts:AssumeRole' 400 | Effect: Allow 401 | Principal: 402 | Service: 403 | - lambda.amazonaws.com 404 | Path: / 405 | Policies: 406 | - PolicyName: ImportSequencePolicy 407 | PolicyDocument: 408 | Statement: 409 | - Effect: Allow 410 | Action: 411 | - 'logs:CreateLogGroup' 412 | - 'logs:CreateLogStream' 413 | - 'logs:PutLogEvents' 414 | Resource: 415 | - !Sub >- 416 | arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/* 417 | - Effect: Allow 418 | Action: 419 | - 'omics:*' 420 | Resource: '*' 421 | - Effect: Allow 422 | Action: 423 | - 'iam:GetRole' 424 | - 'iam:PassRole' 425 | Resource: !Sub '${OmicsWorkflowStartRunJobRole.Arn}' 426 | OmicsWorkflowStartRunJobRole: 427 | Type: 'AWS::IAM::Role' 428 | Properties: 429 | AssumeRolePolicyDocument: 430 | Version: 2012-10-17 431 | Statement: 432 | - Action: 433 | - 'sts:AssumeRole' 434 | Effect: Allow 435 | Principal: 436 | Service: 437 | - omics.amazonaws.com 438 | Path: / 439 | Policies: 440 | - PolicyName: WorkflowStartRunJobRolePolicy 441 | PolicyDocument: 442 | Statement: 443 | - Effect: Allow 444 | Action: 445 | - 's3:GetObject' 446 | - 's3:GetBucketLocation' 447 | - 's3:ListBucket' 448 | Resource: 449 | - !Sub 'arn:aws:s3:::${OmicsWorkflowInputBucketName}' 450 | - !Sub 'arn:aws:s3:::${OmicsWorkflowInputBucketName}/*' 451 | - 
arn:aws:s3:::broad-references 452 | - 'arn:aws:s3:::broad-references/*' 453 | - arn:aws:s3:::gatk-test-data 454 | - 'arn:aws:s3:::gatk-test-data/*' 455 | - arn:aws:s3:::aws-genomics-datasets 456 | - 'arn:aws:s3:::aws-genomics-datasets/*' 457 | - arn:aws:s3:::aws-genomics-static-us-east-1 458 | - 'arn:aws:s3:::aws-genomics-static-us-east-1/*' 459 | - !Sub 'arn:aws:s3:::${OmicsResourcesS3Bucket}' 460 | - !Sub 'arn:aws:s3:::${OmicsResourcesS3Bucket}/*' 461 | - Effect: Allow 462 | Action: 463 | - 's3:GetObject' 464 | - 's3:GetBucketLocation' 465 | - 's3:ListBucket' 466 | - 's3:PutObject' 467 | Resource: 468 | - !Sub 'arn:aws:s3:::${OmicsWorkflowOutputBucketName}' 469 | - !Sub 'arn:aws:s3:::${OmicsWorkflowOutputBucketName}/*' 470 | - Effect: Allow 471 | Action: 472 | - ecr:GetAuthorizationToken 473 | - ecr:BatchCheckLayerAvailability 474 | - ecr:GetDownloadUrlForLayer 475 | - ecr:GetRepositoryPolicy 476 | - ecr:ListImages 477 | - ecr:DescribeImages 478 | - ecr:BatchGetImage 479 | Resource: "*" 480 | - Effect: Allow 481 | Action: 482 | - 'omics:*' 483 | Resource: "*" 484 | - Effect: Allow 485 | Action: 486 | - 'logs:CreateLogGroup' 487 | Resource: 488 | - !Sub 'arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/omics/WorkflowLog:*' 489 | - Effect: Allow 490 | Action: 491 | - 'logs:DescribeLogStreams' 492 | - 'logs:CreateLogStream' 493 | - 'logs:PutLogEvents' 494 | Resource: 495 | - !Sub 'arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/omics/WorkflowLog:log-stream:*' 496 | 497 | OmicsImportSequenceLambda: 498 | Type: 'AWS::Lambda::Function' 499 | Properties: 500 | Handler: import_sequence_lambda.handler 501 | Runtime: python3.9 502 | FunctionName: !Sub '${OmicsResourcePrefix}-import-sequence' 503 | Code: 504 | S3Bucket: !Sub '${OmicsResourcesS3Bucket}' 505 | S3Key: !Sub '${OmicsCustomResourceLambdaS3Prefix}import_sequence_lambda.zip' 506 | Role: !Sub '${OmicsImportSequenceLambdaRole.Arn}' 507 | Timeout: 60 508 | 509 | OmicsImportSequenceLambdaRole: 510 | Type: 'AWS::IAM::Role' 511 | Properties: 512 | AssumeRolePolicyDocument: 513 | Version: 2012-10-17 514 | Statement: 515 | - Action: 516 | - 'sts:AssumeRole' 517 | Effect: Allow 518 | Principal: 519 | Service: 520 | - lambda.amazonaws.com 521 | Path: / 522 | Policies: 523 | - PolicyName: ImportSequencePolicy 524 | PolicyDocument: 525 | Statement: 526 | - Effect: Allow 527 | Action: 528 | - 'logs:CreateLogGroup' 529 | - 'logs:CreateLogStream' 530 | - 'logs:PutLogEvents' 531 | Resource: 532 | - !Sub >- 533 | arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/* 534 | - Effect: Allow 535 | Action: 536 | - 'omics:*' 537 | Resource: '*' 538 | - Effect: Allow 539 | Action: 540 | - 'iam:GetRole' 541 | - 'iam:PassRole' 542 | Resource: !Sub ${OmicsImportSequenceJobRole.Arn} 543 | OmicsImportSequenceJobRole: 544 | Type: 'AWS::IAM::Role' 545 | Properties: 546 | AssumeRolePolicyDocument: 547 | Version: 2012-10-17 548 | Statement: 549 | - Action: 550 | - 'sts:AssumeRole' 551 | Effect: Allow 552 | Principal: 553 | Service: 554 | - omics.amazonaws.com 555 | Path: / 556 | Policies: 557 | - PolicyName: ImportSequenceJobRolePolicy 558 | PolicyDocument: 559 | Statement: 560 | - Effect: Allow 561 | Action: 562 | - 's3:GetObject' 563 | - 's3:GetBucketLocation' 564 | - 's3:ListBucket' 565 | Resource: 566 | - !Sub 'arn:aws:s3:::${OmicsWorkflowInputBucketName}' 567 | - !Sub 'arn:aws:s3:::${OmicsWorkflowInputBucketName}/*' 568 | - !Sub 'arn:aws:s3:::${OmicsWorkflowOutputBucketName}' 569 | - !Sub 
'arn:aws:s3:::${OmicsWorkflowOutputBucketName}/*' 570 | Outputs: 571 | OmicsImportSequenceLambdaArn: 572 | Value: !Sub ${OmicsImportSequenceLambda.Arn} 573 | OmicsImportSequenceJobRoleArn: 574 | Value: !Sub ${OmicsImportSequenceJobRole.Arn} 575 | OmicsWorkflowStartRunLambdaArn: 576 | Value: !Sub ${OmicsWorkflowStartRunLambda.Arn} 577 | OmicsWorkflowStartRunJobRoleArn: 578 | Value: !Sub ${OmicsWorkflowStartRunJobRole.Arn} 579 | OmicsImportVariantLambdaArn: 580 | Value: !Sub ${OmicsImportVariantLambda.Arn} 581 | OmicsImportVariantJobRoleArn: 582 | Value: !Sub ${OmicsImportVariantJobRole.Arn} 583 | OmicsSequenceStoreId: 584 | Value: !GetAtt OmicsSequenceStore.SequenceStoreId 585 | OmicsReferenceArn: 586 | Value: !GetAtt OmicsImportReference.Arn 587 | OmicsWorkflowId: 588 | Value: !GetAtt OmicsCreateWorkflow.Id -------------------------------------------------------------------------------- /src/cfn_templates/s3-stack.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Description: >- 3 | S3 buckets for Genomics inputs and outputs 4 | with lifecycle configuration for cost savings 5 | Parameters: 6 | DataInputBucketName: 7 | Type: String 8 | DataOutputBucketName: 9 | Type: String 10 | Resources: 11 | InputBucket: 12 | Type: 'AWS::S3::Bucket' 13 | Properties: 14 | BucketName: !Ref DataInputBucketName 15 | LifecycleConfiguration: 16 | Rules: 17 | - Id: DeleteRule 18 | Status: Enabled 19 | ExpirationInDays: 30 20 | TagFilters: 21 | - Key: OmicsTiering 22 | Value: RemoveIn30 23 | BucketEncryption: 24 | ServerSideEncryptionConfiguration: 25 | - BucketKeyEnabled: true 26 | ServerSideEncryptionByDefault: 27 | SSEAlgorithm: AES256 28 | InputBucketPolicy: 29 | Type: AWS::S3::BucketPolicy 30 | Properties: 31 | Bucket: !Ref InputBucket 32 | PolicyDocument: 33 | Version: 2012-10-17 34 | Statement: 35 | - Action: "s3:*" 36 | Effect: Deny 37 | Resource: 38 | - !Sub arn:aws:s3:::${DataInputBucketName} 39 | - !Sub arn:aws:s3:::${DataInputBucketName}/* 40 | Principal: '*' 41 | Condition: 42 | Bool: 43 | "aws:SecureTransport": "false" 44 | 45 | OutputBucket: 46 | Type: 'AWS::S3::Bucket' 47 | Properties: 48 | BucketName: !Ref DataOutputBucketName 49 | LifecycleConfiguration: 50 | Rules: 51 | - Id: IntelligentTier 52 | Status: Enabled 53 | Transitions: 54 | - TransitionInDays: 1 55 | StorageClass: INTELLIGENT_TIERING 56 | TagFilters: 57 | - Key: OmicsTiering 58 | Value: IntelligentTierAfter30 59 | BucketEncryption: 60 | ServerSideEncryptionConfiguration: 61 | - BucketKeyEnabled: true 62 | ServerSideEncryptionByDefault: 63 | SSEAlgorithm: AES256 64 | 65 | OutputBucketPolicy: 66 | Type: AWS::S3::BucketPolicy 67 | Properties: 68 | Bucket: !Ref OutputBucket 69 | PolicyDocument: 70 | Version: 2012-10-17 71 | Statement: 72 | - Action: "s3:*" 73 | Effect: Deny 74 | Resource: 75 | - !Sub arn:aws:s3:::${DataOutputBucketName} 76 | - !Sub arn:aws:s3:::${DataOutputBucketName}/* 77 | Principal: '*' 78 | Condition: 79 | Bool: 80 | "aws:SecureTransport": "false" -------------------------------------------------------------------------------- /src/cfn_templates/sfn-task-checker-stack.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Description: >- 3 | Lambda to check completion of various Omics API calls 4 | such as Sequence data import and Omics workflow completion 5 | Parameters: 6 | OmicsOutputBucket: 7 | Type: String 8 | Description: S3 bucket that Amazon Omics 
workflow will write outputs to 9 | LambdaBucketName: 10 | Type: String 11 | Description: S3 bucket where lambda code artifacts are stored 12 | LambdaArtifactPrefix: 13 | Type: String 14 | Description: Prefix in bucket where lambda artifacts are stored 15 | 16 | Resources: 17 | CheckOmicsTaskLambdaFunction: 18 | Type: 'AWS::Lambda::Function' 19 | Properties: 20 | FunctionName: CheckOmicsTask 21 | Code: 22 | S3Bucket: !Ref LambdaBucketName 23 | S3Key: !Sub '${LambdaArtifactPrefix}lambda_check_omics_workflow_task.zip' 24 | Handler: lambda_check_omics_workflow_task.lambda_handler 25 | Role: !GetAtt LambdaIAMRole.Arn 26 | Runtime: python3.9 27 | Timeout: 20 28 | 29 | LambdaIAMRole: 30 | Type: 'AWS::IAM::Role' 31 | Properties: 32 | RoleName: CheckOmicsLambdaFnRole 33 | AssumeRolePolicyDocument: 34 | Version: 2012-10-17 35 | Statement: 36 | - Effect: Allow 37 | Principal: 38 | Service: 39 | - lambda.amazonaws.com 40 | Action: 41 | - 'sts:AssumeRole' 42 | Policies: 43 | - PolicyName: AmazonOmicsPolicy 44 | PolicyDocument: 45 | Version: 2012-10-17 46 | Statement: 47 | - Effect: Allow 48 | Action: 49 | - 'omics:GetReadSetImportJob' 50 | - 'omics:GetRun' 51 | - 'omics:GetVariantImportJob' 52 | Resource: "*" 53 | - PolicyName: LambdaLogs 54 | PolicyDocument: 55 | Version: 2012-10-17 56 | Statement: 57 | - Effect: Allow 58 | Action: 59 | - 'logs:CreateLogGroup' 60 | - 'logs:CreateLogStream' 61 | - 'logs:PutLogEvents' 62 | Resource: 'arn:aws:logs:*:*:*' 63 | Outputs: 64 | CheckOmicsTaskLambdaFunctionArn: 65 | Value: !GetAtt CheckOmicsTaskLambdaFunction.Arn -------------------------------------------------------------------------------- /src/cfn_templates/sfn-trigger-stack.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Description: >- 3 | Lambda to evaluate complete set of inputs and 4 | invoke Step Functions State machine to process Genomics Data. 5 | S3 event notification to lambda integration for FASTQ input bucket. 
6 | Parameters: 7 | FastqInputBucket: 8 | Type: String 9 | Description: S3 bucket that's used for the Lambda event notification (FASTQs) 10 | GenomicsStepFunctionArn: 11 | Type: String 12 | Description: ARN of the Step Function State machine that processes Genomics input data 13 | LambdaBucketName: 14 | Type: String 15 | Description: S3 bucket where lambda code artifacts are stored 16 | LambdaArtifactPrefix: 17 | Type: String 18 | Description: Prefix in bucket where lambda artifacts are stored 19 | SequenceStoreId: 20 | Type: String 21 | ReferenceArn: 22 | Type: String 23 | WorkflowId: 24 | Type: String 25 | WorkflowOutputS3Path: 26 | Type: String 27 | GatkDockerUri: 28 | Type: String 29 | GotcDockerUri: 30 | Type: String 31 | IntervalS3Path: 32 | Type: String 33 | NotificationAppliedToS3Prefix: 34 | Type: String 35 | Default: inputs/ 36 | Resources: 37 | InvokeGenomicsStepFunctionLambda: 38 | Type: 'AWS::Lambda::Function' 39 | Properties: 40 | FunctionName: lambda-invoke-genomics-sfn-wf 41 | Code: 42 | S3Bucket: !Ref LambdaBucketName 43 | S3Key: !Sub "${LambdaArtifactPrefix}lambda_launch_genomics_sfn.zip" 44 | Handler: lambda_launch_genomics_sfn.lambda_handler 45 | Role: !GetAtt LambdaIAMRole.Arn 46 | Runtime: python3.9 47 | Timeout: 20 48 | Environment: 49 | Variables: 50 | NUM_FASTQS_PER_SAMPLE: 2 51 | GENOMICS_STEP_FUNCTION_ARN: !Ref GenomicsStepFunctionArn 52 | SEQUENCE_STORE_ID: !Ref SequenceStoreId 53 | REFERENCE_ARN: !Ref ReferenceArn 54 | WORKFLOW_ID: !Ref WorkflowId 55 | WORKFLOW_OUTPUT_S3_PATH: !Ref WorkflowOutputS3Path 56 | GATK_DOCKER_URI: !Ref GatkDockerUri 57 | GOTC_DOCKER_URI: !Ref GotcDockerUri 58 | INTERVAL_S3_PATH: !Ref IntervalS3Path 59 | 60 | LambdaIAMRole: 61 | Type: 'AWS::IAM::Role' 62 | Properties: 63 | AssumeRolePolicyDocument: 64 | Version: 2012-10-17 65 | Statement: 66 | - Effect: Allow 67 | Principal: 68 | Service: 69 | - lambda.amazonaws.com 70 | Action: 71 | - 'sts:AssumeRole' 72 | Policies: 73 | - PolicyName: Policy1 74 | PolicyDocument: 75 | Version: 2012-10-17 76 | Statement: 77 | - Effect: Allow 78 | Action: 79 | - 's3:GetBucketNotification' 80 | - 's3:PutBucketNotification' 81 | - 's3:GetObject' 82 | - 's3:ListBucket' 83 | Resource: 84 | - !Sub 'arn:aws:s3:::${FastqInputBucket}' 85 | - !Sub 'arn:aws:s3:::${FastqInputBucket}/*' 86 | - Effect: Allow 87 | Action: 88 | - 'logs:CreateLogGroup' 89 | - 'logs:CreateLogStream' 90 | - 'logs:PutLogEvents' 91 | Resource: 'arn:aws:logs:*:*:*' 92 | - Effect: Allow 93 | Action: 94 | - 'states:StartExecution' 95 | - 'states:StartSyncExecution' 96 | - 'states:ListExecutions' 97 | - 'states:ListStateMachines' 98 | Resource: !Sub ${GenomicsStepFunctionArn} 99 | 100 | PutBucketNotificationTrigger: 101 | Type: 'Custom::PutBucketNotificationTrigger' 102 | DependsOn: 103 | - PutBucketNotificationTriggerLambda 104 | Version: 1 105 | Properties: 106 | ServiceToken: !Sub '${PutBucketNotificationTriggerLambda.Arn}' 107 | BucketName: !Ref FastqInputBucket 108 | Prefix: !Ref NotificationAppliedToS3Prefix 109 | LambdaFunctionArn: !GetAtt InvokeGenomicsStepFunctionLambda.Arn 110 | 111 | PutBucketNotificationTriggerLambda: 112 | Type: 'AWS::Lambda::Function' 113 | DependsOn: 114 | - PutBucketNotificationTriggerLambdaRole 115 | Properties: 116 | Handler: add_bucket_notification_lambda.handler 117 | Runtime: python3.9 118 | FunctionName: !Sub 'lambda-put-bucket-notification' 119 | Code: 120 | S3Bucket: !Ref LambdaBucketName 121 | S3Key: !Sub "${LambdaArtifactPrefix}add_bucket_notification_lambda.zip" 122 | Role: !Sub 
'${PutBucketNotificationTriggerLambdaRole.Arn}' 123 | Timeout: 60 124 | 125 | PutBucketNotificationTriggerLambdaRole: 126 | Type: 'AWS::IAM::Role' 127 | Properties: 128 | AssumeRolePolicyDocument: 129 | Version: 2012-10-17 130 | Statement: 131 | - Action: 132 | - 'sts:AssumeRole' 133 | Effect: Allow 134 | Principal: 135 | Service: 136 | - lambda.amazonaws.com 137 | Path: / 138 | Policies: 139 | - PolicyName: PutBucketNotificationPolicy 140 | PolicyDocument: 141 | Statement: 142 | - Effect: Allow 143 | Action: 144 | - 'logs:CreateLogGroup' 145 | - 'logs:CreateLogStream' 146 | - 'logs:PutLogEvents' 147 | Resource: 148 | - !Sub >- 149 | arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/* 150 | - Effect: Allow 151 | Action: 152 | - 's3:PutBucketNotification' 153 | Resource: !Sub 'arn:aws:s3:::${FastqInputBucket}' 154 | 155 | AllowInputBucketToInvokeLambda: 156 | Type: 'AWS::Lambda::Permission' 157 | DependsOn: 158 | - InvokeGenomicsStepFunctionLambda 159 | Properties: 160 | FunctionName: !GetAtt InvokeGenomicsStepFunctionLambda.Arn 161 | Action: lambda:InvokeFunction 162 | Principal: s3.amazonaws.com 163 | SourceAccount: !Ref 'AWS::AccountId' 164 | SourceArn: !Sub arn:aws:s3:::${FastqInputBucket} 165 | 166 | 167 | 168 | 169 | -------------------------------------------------------------------------------- /src/cfn_templates/solution-cfn.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Description: Main stack that nests other stacks required by the solution 3 | Parameters: 4 | ArtifactBucketName: 5 | Type: String 6 | Description: Choose an existing bucket in your account for deployment artifacts 7 | LambdaArtifactsS3Prefix: 8 | Type: String 9 | Description: 'trailing backslash required - Folder name used by the upload script - keep in sync' 10 | Default: lambdas/ 11 | CodeBuildArtifactsS3Prefix: 12 | Type: String 13 | Description: Folder name used by the upload script - keep in sync 14 | Default: buildspecs 15 | CfnTemplatesS3Prefix: 16 | Type: String 17 | Description: Folder name used by the upload script - keep in sync 18 | Default: templates 19 | WorkflowArtifactsS3Prefix: 20 | Type: String 21 | Description: Folder name used by the upload script - keep in sync 22 | Default: workflows 23 | WorkflowInputsBucketName: 24 | Type: String 25 | Description: New bucket created for users to upload inputs. Make it unique by adding accountId and region in the name 26 | WorkflowOutputsBucketName: 27 | Type: String 28 | Description: New bucket created for workflows to write outputs. 
Make it unique by adding accountId and region in the name 29 | ReferenceFastaName: 30 | Type: String 31 | Default: GRCh38 32 | ReferenceFastaS3Uri: 33 | Type: String 34 | Default: s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta 35 | WorkflowIntervalS3Path: 36 | Type: String 37 | Default: s3://aws-genomics-static-us-east-1/omics-e2e/intervals.tar 38 | ClinVarVcfS3Path: 39 | Type: String 40 | Default: s3://aws-genomics-static-us-east-1/omics-e2e/clinvar.vcf.gz 41 | DnSnpVcfS3Uri: 42 | Type: String 43 | Default: s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf 44 | Mills1000GIndelsVcfS3Uri: 45 | Type: String 46 | Default: s3://broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz 47 | KnownIndelsVcfS3Uri: 48 | Type: String 49 | Default: s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz 50 | WorkflowDefinitionFilename: 51 | Type: String 52 | Description: File name used by the upload script - keep in sync 53 | Default: gatkbestpractices.wdl.zip 54 | CurrentReferenceStoreId: 55 | Type: String 56 | Default: 'NONE' 57 | Description: 'Provide Reference Store ID if exists in current account-region, else leave it NONE' 58 | VariantStoreName: 59 | Type: String 60 | Default: omicsvariantstore 61 | Description: Name of the Omics Variant store 62 | AnnotationStoreName: 63 | Type: String 64 | Default: omicsannotationstore 65 | Description: Name of the Omics Annotation store 66 | 67 | Resources: 68 | CodeBuildStack: 69 | Type: AWS::CloudFormation::Stack 70 | Properties: 71 | TemplateURL: !Sub https://${ArtifactBucketName}.s3.amazonaws.com/${CfnTemplatesS3Prefix}/code-build-stack.yml 72 | TimeoutInMinutes: 20 73 | Parameters: 74 | ResourcesS3Bucket: !Ref ArtifactBucketName 75 | LambdasS3Prefix: !Ref LambdaArtifactsS3Prefix 76 | BuildSpecS3Prefix: !Ref CodeBuildArtifactsS3Prefix 77 | DockerGenomesInTheCloud: public.ecr.aws/aws-genomics/broadinstitute/genomes-in-the-cloud:2.4.7-1603303710 78 | DockerGatk: public.ecr.aws/aws-genomics/broadinstitute/gatk:4.1.9.0 79 | 80 | S3ResourcesStack: 81 | Type: AWS::CloudFormation::Stack 82 | Properties: 83 | TemplateURL: !Sub https://${ArtifactBucketName}.s3.amazonaws.com/${CfnTemplatesS3Prefix}/s3-stack.yml 84 | TimeoutInMinutes: 5 85 | Parameters: 86 | DataInputBucketName: !Ref WorkflowInputsBucketName 87 | DataOutputBucketName: !Ref WorkflowOutputsBucketName 88 | 89 | OmicsResourcesStack: 90 | Type: AWS::CloudFormation::Stack 91 | DependsOn: 92 | - S3ResourcesStack 93 | - CodeBuildStack 94 | Properties: 95 | TemplateURL: !Sub https://${ArtifactBucketName}.s3.amazonaws.com/${CfnTemplatesS3Prefix}/omics-resources-stack.yml 96 | TimeoutInMinutes: 60 97 | Parameters: 98 | OmicsResourcesS3Bucket: !Ref ArtifactBucketName 99 | OmicsCustomResourceLambdaS3Prefix: !Ref LambdaArtifactsS3Prefix 100 | OmicsWorkflowInputBucketName: !Ref WorkflowInputsBucketName 101 | OmicsWorkflowOutputBucketName: !Ref WorkflowOutputsBucketName 102 | ExistingReferenceStoreId: !Ref CurrentReferenceStoreId 103 | OmicsReferenceFastaUri: !Ref ReferenceFastaS3Uri 104 | OmicsReferenceName: !Ref ReferenceFastaName 105 | OmicsWorkflowDefinitionZipS3: !Sub "s3://${ArtifactBucketName}/${WorkflowArtifactsS3Prefix}/${WorkflowDefinitionFilename}" 106 | ClinvarS3Path: !Ref ClinVarVcfS3Path 107 | OmicsVariantStoreName: !Ref VariantStoreName 108 | OmicsAnnotationStoreName: !Ref AnnotationStoreName 109 | 110 | ApplyS3LifecycleStack: 111 | Type: AWS::CloudFormation::Stack 112 | DependsOn: 113 | - S3ResourcesStack 114 | - CodeBuildStack 115 | 
Properties: 116 | TemplateURL: !Sub https://${ArtifactBucketName}.s3.amazonaws.com/${CfnTemplatesS3Prefix}/apply-s3-lifecycle-stack.yml 117 | TimeoutInMinutes: 10 118 | Parameters: 119 | LambdaBucketName: !Ref ArtifactBucketName 120 | LambdaArtifactPrefix: !Ref LambdaArtifactsS3Prefix 121 | InputsBucketName: !Ref WorkflowInputsBucketName 122 | OutputsBucketName: !Ref WorkflowOutputsBucketName 123 | SfnTaskCheckerStack: 124 | Type: AWS::CloudFormation::Stack 125 | DependsOn: 126 | - OmicsResourcesStack 127 | - CodeBuildStack 128 | Properties: 129 | TemplateURL: !Sub https://${ArtifactBucketName}.s3.amazonaws.com/${CfnTemplatesS3Prefix}/sfn-task-checker-stack.yml 130 | TimeoutInMinutes: 10 131 | Parameters: 132 | OmicsOutputBucket: !Ref WorkflowOutputsBucketName 133 | LambdaBucketName: !Ref ArtifactBucketName 134 | LambdaArtifactPrefix: !Ref LambdaArtifactsS3Prefix 135 | 136 | StepFunctionStack: 137 | Type: AWS::CloudFormation::Stack 138 | DependsOn: 139 | - ApplyS3LifecycleStack 140 | - SfnTaskCheckerStack 141 | Properties: 142 | TemplateURL: !Sub https://${ArtifactBucketName}.s3.amazonaws.com/${CfnTemplatesS3Prefix}/e2e-sfn-stack.yml 143 | TimeoutInMinutes: 60 144 | Parameters: 145 | ReferenceFastaFileS3Uri: !Ref ReferenceFastaS3Uri 146 | OmicsVariantStoreName: !Ref VariantStoreName 147 | DbSnpVcf: !Ref DnSnpVcfS3Uri 148 | Mills1000GIndelsVcf: !Ref Mills1000GIndelsVcfS3Uri 149 | KnownIndelsVcf: !Ref KnownIndelsVcfS3Uri 150 | OmicsImportSequenceLambdaArn: 151 | Fn::GetAtt: 152 | - OmicsResourcesStack 153 | - Outputs.OmicsImportSequenceLambdaArn 154 | OmicsImportSequenceJobRoleArn: 155 | Fn::GetAtt: 156 | - OmicsResourcesStack 157 | - Outputs.OmicsImportSequenceJobRoleArn 158 | CheckOmicsTaskLambdaFunctionArn: 159 | Fn::GetAtt: 160 | - SfnTaskCheckerStack 161 | - Outputs.CheckOmicsTaskLambdaFunctionArn 162 | OmicsWorkflowStartRunLambdaArn: 163 | Fn::GetAtt: 164 | - OmicsResourcesStack 165 | - Outputs.OmicsWorkflowStartRunLambdaArn 166 | OmicsWorkflowStartRunJobRoleArn: 167 | Fn::GetAtt: 168 | - OmicsResourcesStack 169 | - Outputs.OmicsWorkflowStartRunJobRoleArn 170 | OmicsImportVariantLambdaArn: 171 | Fn::GetAtt: 172 | - OmicsResourcesStack 173 | - Outputs.OmicsImportVariantLambdaArn 174 | OmicsImportVariantJobRoleArn: 175 | Fn::GetAtt: 176 | - OmicsResourcesStack 177 | - Outputs.OmicsImportVariantJobRoleArn 178 | ApplyS3LifecycleLambdaFunctionArn: 179 | Fn::GetAtt: 180 | - ApplyS3LifecycleStack 181 | - Outputs.ApplyS3LifecycleLambdaFunctionArn 182 | SfnTriggerStack: 183 | Type: AWS::CloudFormation::Stack 184 | DependsOn: 185 | - StepFunctionStack 186 | Properties: 187 | TemplateURL: !Sub https://${ArtifactBucketName}.s3.amazonaws.com/${CfnTemplatesS3Prefix}/sfn-trigger-stack.yml 188 | TimeoutInMinutes: 5 189 | Parameters: 190 | FastqInputBucket: !Ref WorkflowInputsBucketName 191 | GenomicsStepFunctionArn: 192 | Fn::GetAtt: 193 | - StepFunctionStack 194 | - Outputs.AmazonOmicsStepFunctionArn 195 | LambdaBucketName: !Ref ArtifactBucketName 196 | LambdaArtifactPrefix: !Ref LambdaArtifactsS3Prefix 197 | SequenceStoreId: 198 | Fn::GetAtt: 199 | - OmicsResourcesStack 200 | - Outputs.OmicsSequenceStoreId 201 | ReferenceArn: 202 | Fn::GetAtt: 203 | - OmicsResourcesStack 204 | - Outputs.OmicsReferenceArn 205 | WorkflowId: 206 | Fn::GetAtt: 207 | - OmicsResourcesStack 208 | - Outputs.OmicsWorkflowId 209 | WorkflowOutputS3Path: !Sub "s3://${WorkflowOutputsBucketName}/outputs" 210 | GatkDockerUri: 211 | Fn::GetAtt: 212 | - CodeBuildStack 213 | - Outputs.EcrImageUriGatk 214 | GotcDockerUri: 215 | 
Fn::GetAtt: 216 | - CodeBuildStack 217 | - Outputs.EcrImageUriGotc 218 | IntervalS3Path: !Ref WorkflowIntervalS3Path 219 | -------------------------------------------------------------------------------- /src/codebuild/buildspec_docker.yml: -------------------------------------------------------------------------------- 1 | version: 0.2 2 | phases: 3 | install: 4 | runtime-versions: 5 | docker: 18 6 | pre_build: 7 | commands: 8 | - echo Logging in to Amazon ECR... 9 | - aws --version 10 | - $(aws ecr get-login --region $AWS_DEFAULT_REGION --no-include-email) 11 | - REPOSITORY_URI=$ECR_REPO 12 | - IMAGE_TAG=$ECR_REPO_VERSION 13 | build: 14 | commands: 15 | - echo Build started on `date` 16 | - echo Pull public docker image 17 | - docker pull $SOURCE_REPO 18 | - docker tag $SOURCE_REPO $REPOSITORY_URI:$IMAGE_TAG 19 | post_build: 20 | commands: 21 | - echo Pull completed on `date` 22 | - echo Pushing the Docker image... 23 | - docker push $REPOSITORY_URI:$IMAGE_TAG 24 | - echo Push complete on `date` -------------------------------------------------------------------------------- /src/codebuild/buildspec_lambdas.yml: -------------------------------------------------------------------------------- 1 | version: 0.2 2 | env: 3 | shell: bash 4 | phases: 5 | install: 6 | runtime-versions: 7 | python: 3.9 8 | build: 9 | commands: 10 | - | 11 | #!/bin/bash 12 | lambda_s3_dirname=${RESOURCES_PREFIX} 13 | artifact_s3_dirname=${RESOURCES_PREFIX} 14 | 15 | # Declare all lambda functions with package needs (crhelper needed since these lambdas help with resource creation) 16 | declare -a LambdaNamesWithCrHelper=("import_reference_lambda" "import_annotation_lambda" "add_bucket_notification_lambda") 17 | 18 | # iterate over each lambda 19 | for lambda in ${LambdaNamesWithCrHelper[@]}; do 20 | 21 | COUNT=$(aws s3 ls "s3://${RESOURCES_BUCKET}/${lambda_s3_dirname}${lambda}.py" | wc -l) 22 | if [ $COUNT = 0 ]; then 23 | echo "skipping Build, ${lambda}.py not found in s3://${RESOURCES_BUCKET}/${lambda_s3_dirname}" 24 | else 25 | echo "Building lambda zip for: ${lambda} " 26 | mkdir tmp_${lambda} 27 | cd tmp_${lambda} 28 | echo "Download lambda py for: ${lambda} " 29 | aws s3 cp s3://${RESOURCES_BUCKET}/${lambda_s3_dirname}${lambda}.py . 30 | echo "Installing pip packages" 31 | pip install crhelper boto3==1.26.65 -t ./package 32 | cd ./package 33 | zip -r ../${lambda}.zip * 34 | cd .. 35 | echo "Zip lambda to artifact" 36 | zip -g ${lambda}.zip ${lambda}.py 37 | echo "Upload zip to s3://${RESOURCES_BUCKET}/${artifact_s3_dirname}" 38 | aws s3 cp ${lambda}.zip s3://${RESOURCES_BUCKET}/${artifact_s3_dirname} 39 | cd .. 40 | rm -rf tmp_${lambda} 41 | echo "Done with ${lambda}" 42 | fi 43 | done 44 | 45 | # Declare all lambda functions with package needs 46 | declare -a LambdaNamesJsonSchema=("apply_s3_lifecycle_lambda" "lambda_check_omics_workflow_task" "import_sequence_lambda" "import_variant_lambda" "lambda_launch_genomics_sfn" "start_workflow_lambda") 47 | 48 | # iterate over each lambda 49 | for lambda in ${LambdaNamesJsonSchema[@]}; do 50 | 51 | COUNT=$(aws s3 ls "s3://${RESOURCES_BUCKET}/${lambda_s3_dirname}${lambda}.py" | wc -l) 52 | if [ $COUNT = 0 ]; then 53 | echo "skipping Build, ${lambda}.py not found in s3://${RESOURCES_BUCKET}/${lambda_s3_dirname}" 54 | else 55 | echo "Building lambda zip for: ${lambda} " 56 | mkdir tmp_${lambda} 57 | cd tmp_${lambda} 58 | echo "Download lambda py for: ${lambda} " 59 | aws s3 cp s3://${RESOURCES_BUCKET}/${lambda_s3_dirname}${lambda}.py . 
60 | echo "Installing pip packages" 61 | pip install jsonschema boto3==1.26.65 -t ./package 62 | cd ./package 63 | zip -r ../${lambda}.zip * 64 | cd .. 65 | echo "Zip lambda to artifact" 66 | zip -g ${lambda}.zip ${lambda}.py 67 | echo "Upload zip to s3://${RESOURCES_BUCKET}/${artifact_s3_dirname}" 68 | aws s3 cp ${lambda}.zip s3://${RESOURCES_BUCKET}/${artifact_s3_dirname} 69 | cd .. 70 | rm -rf tmp_${lambda} 71 | echo "Done with ${lambda}" 72 | fi 73 | done 74 | -------------------------------------------------------------------------------- /src/glue/etl.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from awsglue.transforms import * 3 | from awsglue.utils import getResolvedOptions 4 | from pyspark.context import SparkContext 5 | from awsglue.context import GlueContext 6 | from awsglue.job import Job 7 | from awsglue.dynamicframe import DynamicFrameCollection 8 | from awsglue.dynamicframe import DynamicFrame 9 | 10 | # Script generated for node Custom Transform 11 | def MyTransform(glueContext, dfc) -> DynamicFrameCollection: 12 | from pyspark.sql.functions import coalesce 13 | from awsglue.dynamicframe import DynamicFrame 14 | 15 | df0 = dfc.select(list(dfc.keys())[0]).toDF() 16 | 17 | df0 = df0.withColumn("patient_id", coalesce(df0.dg_patient_id, df0.patient_id))  # withColumn returns a new DataFrame, so assign it back for the coalesce to take effect 18 | 19 | df0 = df0.withColumn("patient_id", coalesce(df0.rx_patient_id, df0.patient_id)) 20 | 21 | df0 = df0.withColumn("patient_id", coalesce(df0.pr_patient_id, df0.patient_id)) 22 | 23 | dyf = DynamicFrame.fromDF(df0, glueContext, "results") 24 | return DynamicFrameCollection({"CustomTransform0": dyf}, glueContext) 25 | 26 | 27 | args = getResolvedOptions(sys.argv, ["JOB_NAME"]) 28 | sc = SparkContext() 29 | glueContext = GlueContext(sc) 30 | spark = glueContext.spark_session 31 | job = Job(glueContext) 32 | job.init(args["JOB_NAME"], args) 33 | 34 | # Script generated for node Novation_Rx 35 | Novation_Rx_node1665891226598 = glueContext.create_dynamic_frame.from_catalog( 36 | database="phenotypicdb", 37 | table_name="ovation_rx_csv", 38 | transformation_ctx="Novation_Rx_node1665891226598", 39 | ) 40 | 41 | # Script generated for node DiagnosisDF 42 | DiagnosisDF_node1665691724279 = glueContext.create_dynamic_frame.from_catalog( 43 | database="phenotypicdb", 44 | table_name="ovation_diagnosis_csv", 45 | transformation_ctx="DiagnosisDF_node1665691724279", 46 | ) 47 | 48 | # Script generated for node ClinicoGenomicsDF 49 | ClinicoGenomicsDF_node1665689379027 = glueContext.create_dynamic_frame.from_catalog( 50 | database="phenotypicdb", 51 | table_name="ovation_clinicogenomics_csv", 52 | transformation_ctx="ClinicoGenomicsDF_node1665689379027", 53 | ) 54 | 55 | # Script generated for node ProceduresDF 56 | ProceduresDF_node1665690724543 = glueContext.create_dynamic_frame.from_catalog( 57 | database="phenotypicdb", 58 | table_name="ovation_procedures_csv", 59 | transformation_ctx="ProceduresDF_node1665690724543", 60 | ) 61 | 62 | # Script generated for node Renamed keys for finalJoin 63 | RenamedkeysforfinalJoin_node1665892999370 = ApplyMapping.apply( 64 | frame=Novation_Rx_node1665891226598, 65 | mappings=[ 66 | ("patient_id", "string", "rx_patient_id", "string"), 67 | ("claim_id", "string", "rx_claim_id", "string"), 68 | ("ndc_product", "long", "ndc_product", "long"), 69 | ("quantity", "double", "quantity", "double"), 70 | ("uom", "string", "uom", "string"), 71 | ("prescriber_npi", "long", "prescriber_npi", "long"), 72 | ("brand_name", "string", "brand_name", "string"), 73 | 
("generic_name", "string", "generic_name", "string"), 74 | ("dosage_form", "string", "dosage_form", "string"), 75 | ], 76 | transformation_ctx="RenamedkeysforfinalJoin_node1665892999370", 77 | ) 78 | 79 | # Script generated for node Renamed keys for Join 80 | RenamedkeysforJoin_node1665694312611 = ApplyMapping.apply( 81 | frame=DiagnosisDF_node1665691724279, 82 | mappings=[ 83 | ("patient_id", "string", "dg_patient_id", "string"), 84 | ("claim_id", "string", "dg_claim_id", "string"), 85 | ("diagnosis_date", "string", "diagnosis_date", "string"), 86 | ("diagnosis_vocab", "string", "diagnosis_vocab", "string"), 87 | ("diagnosis_code", "string", "diagnosis_code", "string"), 88 | ("diagnosis_desc", "string", "diagnosis_desc", "string"), 89 | ("vocabulary_name", "string", "vocabulary_name", "string"), 90 | ], 91 | transformation_ctx="RenamedkeysforJoin_node1665694312611", 92 | ) 93 | 94 | # Script generated for node ApplyFilteronClinicalGenomicsData 95 | ApplyFilteronClinicalGenomicsData_node1665894278030 = ApplyMapping.apply( 96 | frame=ClinicoGenomicsDF_node1665689379027, 97 | mappings=[ 98 | ("patient_id", "string", "patient_id", "string"), 99 | ("lab_specimen_identifier", "string", "lab_specimen_identifier", "string"), 100 | ("sample_type", "string", "sample_type", "string"), 101 | ("afr_ancestry_percent", "double", "afr_ancestry_percent", "double"), 102 | ("amr_ancestry_percent", "double", "amr_ancestry_percent", "double"), 103 | ("eas_ancestry_percent", "double", "eas_ancestry_percent", "double"), 104 | ("eur_ancestry_percent", "double", "eur_ancestry_percent", "double"), 105 | ("oce_ancestry_percent", "double", "oce_ancestry_percent", "double"), 106 | ("sas_ancestry_percent", "double", "sas_ancestry_percent", "double"), 107 | ("was_ancestry_percent", "double", "was_ancestry_percent", "double"), 108 | ], 109 | transformation_ctx="ApplyFilteronClinicalGenomicsData_node1665894278030", 110 | ) 111 | 112 | # Script generated for node FilterProcedureData 113 | FilterProcedureData_node1665694157838 = ApplyMapping.apply( 114 | frame=ProceduresDF_node1665690724543, 115 | mappings=[ 116 | ("patient_id", "string", "pr_patient_id", "string"), 117 | ("claim_id", "string", "pr_claim_id", "string"), 118 | ("claim_type", "string", "pr_claim_type", "string"), 119 | ("procedure_date", "string", "pr_procedure_date", "string"), 120 | ("procedure_vocab", "string", "pr_procedure_vocab", "string"), 121 | ("procedure_code", "string", "pr_procedure_code", "string"), 122 | ("procedure_short_desc", "string", "procedure_short_desc", "string"), 123 | ("procedure_long_desc", "string", "procedure_long_desc", "string"), 124 | ("vocabulary_name", "string", "vocabulary_name", "string"), 125 | ], 126 | transformation_ctx="FilterProcedureData_node1665694157838", 127 | ) 128 | 129 | # Script generated for node JoinClinicalGenomicswithProcedures 130 | ApplyFilteronClinicalGenomicsData_node1665894278030DF = ( 131 | ApplyFilteronClinicalGenomicsData_node1665894278030.toDF() 132 | ) 133 | FilterProcedureData_node1665694157838DF = FilterProcedureData_node1665694157838.toDF() 134 | JoinClinicalGenomicswithProcedures_node1665694145271 = DynamicFrame.fromDF( 135 | ApplyFilteronClinicalGenomicsData_node1665894278030DF.join( 136 | FilterProcedureData_node1665694157838DF, 137 | ( 138 | ApplyFilteronClinicalGenomicsData_node1665894278030DF["patient_id"] 139 | == FilterProcedureData_node1665694157838DF["pr_patient_id"] 140 | ), 141 | "outer", 142 | ), 143 | glueContext, 144 | "JoinClinicalGenomicswithProcedures_node1665694145271", 145 | 
) 146 | 147 | # Script generated for node Join 148 | JoinClinicalGenomicswithProcedures_node1665694145271DF = ( 149 | JoinClinicalGenomicswithProcedures_node1665694145271.toDF() 150 | ) 151 | RenamedkeysforJoin_node1665694312611DF = RenamedkeysforJoin_node1665694312611.toDF() 152 | Join_node1665694288059 = DynamicFrame.fromDF( 153 | JoinClinicalGenomicswithProcedures_node1665694145271DF.join( 154 | RenamedkeysforJoin_node1665694312611DF, 155 | ( 156 | JoinClinicalGenomicswithProcedures_node1665694145271DF["patient_id"] 157 | == RenamedkeysforJoin_node1665694312611DF["dg_patient_id"] 158 | ), 159 | "outer", 160 | ), 161 | glueContext, 162 | "Join_node1665694288059", 163 | ) 164 | 165 | # Script generated for node finalJoin 166 | Join_node1665694288059DF = Join_node1665694288059.toDF() 167 | RenamedkeysforfinalJoin_node1665892999370DF = ( 168 | RenamedkeysforfinalJoin_node1665892999370.toDF() 169 | ) 170 | finalJoin_node1665891366375 = DynamicFrame.fromDF( 171 | Join_node1665694288059DF.join( 172 | RenamedkeysforfinalJoin_node1665892999370DF, 173 | ( 174 | Join_node1665694288059DF["patient_id"] 175 | == RenamedkeysforfinalJoin_node1665892999370DF["rx_patient_id"] 176 | ), 177 | "outer", 178 | ), 179 | glueContext, 180 | "finalJoin_node1665891366375", 181 | ) 182 | 183 | # Script generated for node Custom Transform 184 | CustomTransform_node1665962473016 = MyTransform( 185 | glueContext, 186 | DynamicFrameCollection( 187 | {"finalJoin_node1665891366375": finalJoin_node1665891366375}, glueContext 188 | ), 189 | ) 190 | 191 | # Script generated for node Select From Collection 192 | SelectFromCollection_node1665962600856 = SelectFromCollection.apply( 193 | dfc=CustomTransform_node1665962473016, 194 | key=list(CustomTransform_node1665962473016.keys())[0], 195 | transformation_ctx="SelectFromCollection_node1665962600856", 196 | ) 197 | 198 | # Script generated for node Amazon S3 199 | AmazonS3_node1665695621571 = glueContext.write_dynamic_frame.from_options( 200 | frame=SelectFromCollection_node1665962600856, 201 | connection_type="s3", 202 | format="glueparquet", 203 | connection_options={ 204 | "path": "s3://omics-datalake-genomics/phentotypic-datalake/", 205 | "partitionKeys": [], 206 | }, 207 | transformation_ctx="AmazonS3_node1665695621571", 208 | ) 209 | 210 | job.commit() 211 | -------------------------------------------------------------------------------- /src/lambda/add_bucket_notification/add_bucket_notification_lambda.py: -------------------------------------------------------------------------------- 1 | from crhelper import CfnResource 2 | import logging 3 | import boto3 4 | from botocore.exceptions import ClientError 5 | 6 | logger = logging.getLogger(__name__) 7 | # Initialise the helper, all inputs are optional, this example shows the defaults 8 | helper = CfnResource(json_logging=False, log_level='DEBUG', boto_level='CRITICAL', polling_interval=1) 9 | 10 | # Initiate client 11 | try: 12 | print("Attempt to initiate client") 13 | s3 = boto3.resource('s3') 14 | print("Attempt to initiate client complete") 15 | except Exception as e: 16 | raise e 17 | 18 | @helper.create 19 | def create(event, context): 20 | logger.info("Got Create") 21 | put_bucket_notification(event, context) 22 | 23 | 24 | @helper.update 25 | def update(event, context): 26 | logger.info("Got Update") 27 | put_bucket_notification(event, context) 28 | 29 | 30 | @helper.delete 31 | def delete(event, context): 32 | logger.info("Got Delete") 33 | pass 34 | # Delete never returns anything. 
Should not fail if the underlying resources are already deleted. Desired state. 35 | 36 | def handler(event, context): 37 | helper(event, context) 38 | 39 | def put_bucket_notification(event, context): 40 | bucket_name = event['ResourceProperties']['BucketName'] 41 | prefix = event['ResourceProperties']['Prefix'] 42 | lambda_function_arn = event['ResourceProperties']['LambdaFunctionArn'] 43 | try: 44 | print("Attempt to update bucket configuration") 45 | bucket_notification = s3.BucketNotification(bucket_name) 46 | response = bucket_notification.put( 47 | NotificationConfiguration={ 48 | 'LambdaFunctionConfigurations': [ 49 | { 50 | 'Id': 'ObjectCreatedStartsWithPrefix', 51 | 'LambdaFunctionArn': lambda_function_arn, 52 | 'Events': [ 53 | 's3:ObjectCreated:*' 54 | ], 55 | 'Filter': { 56 | 'Key': { 57 | 'FilterRules': [ 58 | { 59 | 'Name': 'prefix', 60 | 'Value': prefix 61 | }, 62 | ] 63 | } 64 | } 65 | }, 66 | ] 67 | }, 68 | ) 69 | except ClientError as e: 70 | raise Exception( "boto3 client error : " + e.__str__()) 71 | except Exception as e: 72 | raise Exception( "Unexpected error : " + e.__str__()) 73 | print(response) -------------------------------------------------------------------------------- /src/lambda/apply_s3_lifecycle/apply_s3_lifecycle_lambda.py: -------------------------------------------------------------------------------- 1 | import json 2 | import boto3 3 | from jsonschema import validate 4 | from botocore.exceptions import ClientError 5 | 6 | print('Loading function - Apply S3 Lifecycle') 7 | 8 | s3_client = boto3.client('s3') 9 | 10 | file_tag_rules = { 11 | "bam": 12 | [ 13 | { 14 | "Key": "processed", 15 | "Value": "true" 16 | }, 17 | { 18 | "Key": "OmicsTiering", 19 | "Value": "IntelligentTierAfter30" 20 | } 21 | ], 22 | "vcf": 23 | [ 24 | { 25 | "Key": "processed", 26 | "Value": "true" 27 | }, 28 | { 29 | "Key": "OmicsTiering", 30 | "Value": "Standard" 31 | } 32 | ], 33 | "gvcf": 34 | [ 35 | { 36 | "Key": "processed", 37 | "Value": "true" 38 | }, 39 | { 40 | "Key": "OmicsTiering", 41 | "Value": "IntelligentTierAfter30" 42 | } 43 | ], 44 | "fastq": 45 | [ 46 | { 47 | "Key": "processed", 48 | "Value": "true" 49 | }, 50 | { 51 | "Key": "OmicsTiering", 52 | "Value": "RemoveIn30" 53 | } 54 | ] 55 | } 56 | 57 | def validate_event(_event_json): 58 | schema = { 59 | "$schema": "http://json-schema.org/draft-04/schema#", 60 | "type": "object", 61 | "properties": { 62 | "inputs": { 63 | "type": "object", 64 | "properties": { 65 | "fastq": { 66 | "type": "array" 67 | } 68 | }, 69 | "required": [ 70 | "fastq" 71 | ] 72 | }, 73 | "outputs": { 74 | "type": "object", 75 | "properties": { 76 | "vcf": { 77 | "type": "array" 78 | }, 79 | "bam": { 80 | "type": "array" 81 | }, 82 | "gvcf": { 83 | "type": "array" 84 | } 85 | }, 86 | "required": [ 87 | "vcf", 88 | "bam" 89 | ] 90 | } 91 | }, 92 | "required": [ 93 | "inputs", 94 | "outputs" 95 | ] 96 | } 97 | 98 | try: 99 | validate(_event_json, schema=schema) 100 | return True 101 | except Exception as e: 102 | raise e 103 | 104 | def split_s3_path(s3_path): 105 | path_parts=s3_path.replace("s3://","").split("/") 106 | bucket=path_parts.pop(0) 107 | key="/".join(path_parts) 108 | return bucket, key 109 | 110 | def get_tagset_for_object(_bucket, _key): 111 | try: 112 | get_tags_response = s3_client.get_object_tagging( 113 | Bucket=_bucket, 114 | Key=_key, 115 | ) 116 | return get_tags_response['TagSet'] 117 | except Exception as e: 118 | raise e 119 | 120 | def lambda_handler(event, context): 121 | """ 122 | Example event 123 | { 124 | 
"inputs": { 125 | "fastq": [ 126 | "s3://path/tofastq_R1.fastq.gz", 127 | "s3://path/tofastq_R2.fastq.gz" 128 | ] 129 | }, 130 | "outputs": { 131 | "vcf": [ 132 | "s3://output.vcf" 133 | ], 134 | "bam": [ 135 | "s3://output.bam" 136 | ], 137 | "gvcf": [ 138 | "s3://example.genome.vcf.gz" 139 | ] 140 | } 141 | } 142 | """ 143 | # Inoked by Step Function 144 | print("Received event: " + json.dumps(event, indent=2)) 145 | 146 | objects_to_tag = {} 147 | # check valid event and add files to tag 148 | if validate_event(event): 149 | for _k, _v in event['inputs'].items(): 150 | objects_to_tag[_k] = _v 151 | for _k, _v in event['outputs'].items(): 152 | objects_to_tag[_k] = _v 153 | 154 | # check and apply tags based on config 155 | for file_type, s3_files in objects_to_tag.items(): 156 | for _s3_file in s3_files: 157 | bucket, _key = split_s3_path(_s3_file) 158 | print(f"File type: {file_type} Bucket: {bucket} Key: {_key}") 159 | # tag_set = get_tagset_for_object(bucket, _key) 160 | # Add logic here to check existing tags if needed 161 | 162 | # Apply new tag set based on file type and config 163 | try: 164 | put_tags_response = s3_client.put_object_tagging( 165 | Bucket=bucket, 166 | Key=_key, 167 | Tagging={ 168 | 'TagSet': file_tag_rules[file_type] 169 | } 170 | ) 171 | print(put_tags_response) 172 | except ClientError as e: 173 | raise Exception( "boto3 client error : " + e.__str__()) 174 | except Exception as e: 175 | raise Exception( "Unexpected error : " + e.__str__()) 176 | print("Cleanup complete for sample") 177 | -------------------------------------------------------------------------------- /src/lambda/check_omcis_workflow_task/lambda_check_omics_workflow_task.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | from copy import deepcopy 3 | from dataclasses import dataclass, asdict 4 | from datetime import datetime 5 | from typing import Literal 6 | 7 | import logging 8 | 9 | logger = logging.getLogger() 10 | logger.setLevel(logging.INFO) 11 | 12 | # Set up for omics model for boto3, only needed while in beta 13 | 14 | session = boto3.Session() 15 | 16 | omics_client = session.client("omics") 17 | 18 | # Define omics service response status types 19 | 20 | TASK_TYPE = Literal["GetReadSetImportJob", "GetVariantImportJob", "GetRun"] 21 | READ_SET_IMPORT_STATUS = Literal[ 22 | "CREATED", "SUBMITTED", "RUNNING", "CANCELLING", "FAILED", "DONE", "COMPLETED_WITH_FAILURES"] 23 | GET_RUN_JOB_STATUS = Literal[ 24 | "PENDING", "STARTING", "RUNNING", "STOPPING", "COMPLETED", "DELETED", "CANCELLED", "FAILED"] 25 | GET_VARIANT_IMPORT_JOB_STATUS = Literal[ 26 | "CREATING", "QUEUED", "IN_PROGRESS", "CANCELING", "CANCELED", "COMPLETE", "FAILED"] 27 | 28 | 29 | # Define data classes 30 | 31 | 32 | @dataclass 33 | class CheckOmicsWorkflowTaskRequest: 34 | task_type: TASK_TYPE 35 | task_params: dict 36 | 37 | 38 | @dataclass 39 | class CheckOmicsWorkflowTaskResponse(CheckOmicsWorkflowTaskRequest): 40 | task_status: str = None 41 | task_response: dict = None 42 | 43 | 44 | # Converts datetimes to isoformat string 45 | def dates_to_string(response_dict: dict): 46 | converted_response = deepcopy(response_dict) 47 | 48 | for k, v in response_dict.items(): 49 | if isinstance(v, dict): 50 | converted_response[k] = dates_to_string(response_dict[k]) 51 | elif isinstance(v, datetime): 52 | converted_response[k] = response_dict[k].isoformat() 53 | return converted_response 54 | 55 | 56 | # Checks if task_status is one of possible values for a terminal state 
57 | # Sets value to either COMPLETED, or FAILED if terminal state 58 | def get_terminal_status(task_status) -> str: 59 | if task_status == 'DONE': 60 | return "COMPLETED" 61 | elif task_status in ["CANCELLING", "DELETED", "CANCELLED", "COMPLETED_WITH_FAILURES"]: 62 | return "FAILED" 63 | else: 64 | return task_status 65 | 66 | 67 | # Make Lambda Function response 68 | def make_response( 69 | request: CheckOmicsWorkflowTaskRequest, 70 | task_status: str, 71 | task_response: dict 72 | ) -> CheckOmicsWorkflowTaskResponse: 73 | del task_response['ResponseMetadata'] 74 | 75 | workflow_task_response = CheckOmicsWorkflowTaskResponse( 76 | task_type=request.task_type, 77 | task_params=request.task_params, 78 | task_status=task_status, 79 | task_response=dates_to_string(task_response) 80 | ) 81 | return workflow_task_response 82 | 83 | 84 | # Amazon Omics service calls 85 | 86 | def get_read_set_import_job(request: CheckOmicsWorkflowTaskRequest) -> CheckOmicsWorkflowTaskResponse: 87 | boto3.client('omics') 88 | response = omics_client.get_read_set_import_job( 89 | id=request.task_params['id'], 90 | sequenceStoreId=request.task_params['sequence_store_id'] 91 | ) 92 | logger.info(response) 93 | 94 | workflow_task_response = make_response( 95 | request=request, 96 | task_status=get_terminal_status(response['status']), 97 | task_response=response 98 | ) 99 | return workflow_task_response 100 | 101 | 102 | def get_run(request: CheckOmicsWorkflowTaskRequest) -> CheckOmicsWorkflowTaskResponse: 103 | boto3.client('omics') 104 | response = omics_client.get_run( 105 | id=request.task_params['id'], 106 | ) 107 | 108 | logger.info(response) 109 | 110 | workflow_task_response = make_response( 111 | request=request, 112 | task_status=get_terminal_status(response['status']), 113 | task_response=response 114 | ) 115 | return workflow_task_response 116 | 117 | 118 | def get_variant_import_job(request: CheckOmicsWorkflowTaskRequest) -> CheckOmicsWorkflowTaskResponse: 119 | boto3.client('omics') 120 | response = omics_client.get_variant_import_job( 121 | jobId=request.task_params['job_id'], 122 | ) 123 | 124 | logger.info(response) 125 | 126 | workflow_task_response = make_response( 127 | request=request, 128 | task_status=get_terminal_status(response['status']), 129 | task_response=response 130 | ) 131 | 132 | return workflow_task_response 133 | 134 | 135 | # Main lambda handler 136 | 137 | def lambda_handler(event: CheckOmicsWorkflowTaskRequest, context): 138 | logger.info(f"Event Object: {event}") 139 | 140 | request = CheckOmicsWorkflowTaskRequest(**event) 141 | 142 | if request.task_type == "GetReadSetImportJob": 143 | task_response = get_read_set_import_job(request) 144 | elif request.task_type == "GetRun": 145 | task_response = get_run(request) 146 | elif request.task_type == "GetVariantImportJob": 147 | task_response = get_variant_import_job(request) 148 | else: 149 | task_response = make_response( 150 | request, 151 | task_status="FAILED", 152 | task_response={ 153 | 'failure_message': f'The requested task_type: {request.task_type} is not one of: [' 154 | f'GetReadSetImportJob, GetRun, GetVariantImportJob]'} 155 | ) 156 | 157 | return asdict(task_response) -------------------------------------------------------------------------------- /src/lambda/import_annotation/import_annotation_lambda.py: -------------------------------------------------------------------------------- 1 | from crhelper import CfnResource 2 | import logging 3 | import boto3 4 | from botocore.exceptions import ClientError 5 | 6 | logger 
= logging.getLogger(__name__) 7 | # Initialise the helper, all inputs are optional, this example shows the defaults 8 | helper = CfnResource(json_logging=False, log_level='DEBUG', boto_level='CRITICAL', polling_interval=1) 9 | 10 | # Initiate client 11 | try: 12 | print("Attempt to initiate client") 13 | omics_session = boto3.Session() 14 | omics_client = omics_session.client('omics') 15 | print("Attempt to initiate client complete") 16 | except Exception as e: 17 | helper.init_failure(e) 18 | 19 | 20 | @helper.create 21 | def create(event, context): 22 | logger.info("Got Create") 23 | import_annotation(event, context) 24 | 25 | 26 | @helper.update 27 | def update(event, context): 28 | logger.info("Got Update") 29 | import_annotation(event, context) 30 | 31 | 32 | @helper.delete 33 | def delete(event, context): 34 | logger.info("Got Delete") 35 | return "delete" 36 | # Delete never returns anything. Should not fail if the underlying resources are already deleted. Desired state. 37 | 38 | @helper.poll_create 39 | def poll_create(event, context): 40 | logger.info("Got Create poll") 41 | return check_annotation_import_status(event, context) 42 | 43 | 44 | @helper.poll_update 45 | def poll_update(event, context): 46 | logger.info("Got Update poll") 47 | return check_annotation_import_status(event, context) 48 | 49 | 50 | @helper.poll_delete 51 | def poll_delete(event, context): 52 | logger.info("Got Delete poll") 53 | return "delete poll" 54 | 55 | def handler(event, context): 56 | helper(event, context) 57 | 58 | def import_annotation(event, context): 59 | omics_import_role_arn = event['ResourceProperties']['OmicsImportAnnotationRoleArn'] 60 | annotation_source_s3_uri = event['ResourceProperties']['AnnotationSourceS3Uri'] 61 | annotation_store_name = event['ResourceProperties']['AnnotationStoreName'] 62 | try: 63 | print(f"Attempt to import annotation file: {annotation_source_s3_uri} to store: {annotation_store_name}") 64 | response = omics_client.start_annotation_import_job( 65 | destinationName=annotation_store_name, 66 | roleArn=omics_import_role_arn, 67 | items=[{'source': annotation_source_s3_uri}] 68 | ) 69 | except ClientError as e: 70 | raise Exception( "boto3 client error : " + e.__str__()) 71 | except Exception as e: 72 | raise Exception( "Unexpected error : " + e.__str__()) 73 | logger.info(response) 74 | helper.Data.update({"AnnotationImportJobId": response['jobId']}) 75 | return True 76 | 77 | def check_annotation_import_status(event, context): 78 | annotation_import_job_id = helper.Data.get("AnnotationImportJobId") 79 | 80 | try: 81 | response = omics_client.get_annotation_import_job( 82 | jobId=annotation_import_job_id 83 | ) 84 | except ClientError as e: 85 | raise Exception( "boto3 client error : " + e.__str__()) 86 | except Exception as e: 87 | raise Exception( "Unexpected error : " + e.__str__()) 88 | status = response['status'] 89 | 90 | if status in ['SUBMITTED', 'IN_PROGRESS', 'RUNNING', 'CREATING', 'QUEUED']: 91 | logger.info(status) 92 | return None 93 | else: 94 | if status in ['READY', 'ACTIVE', 'COMPLETED', 'COMPLETE']: 95 | logger.info(status) 96 | return True 97 | else: 98 | msg = f"Annotation Import Job ID : {annotation_import_job_id} has status {status}, exiting" 99 | logger.info(msg) 100 | raise ValueError(msg) 101 | 102 | -------------------------------------------------------------------------------- /src/lambda/import_reference/import_reference_lambda.py: -------------------------------------------------------------------------------- 1 | from crhelper 
import CfnResource 2 | import logging 3 | import boto3 4 | from botocore.exceptions import ClientError 5 | 6 | logger = logging.getLogger(__name__) 7 | # Initialise the helper, all inputs are optional, this example shows the defaults 8 | helper = CfnResource(json_logging=False, log_level='DEBUG', boto_level='CRITICAL', polling_interval=1) 9 | 10 | # Initiate client 11 | try: 12 | print("Attempt to initiate client") 13 | omics_session = boto3.Session() 14 | omics_client = omics_session.client('omics') 15 | print("Attempt to initiate client complete") 16 | except Exception as e: 17 | helper.init_failure(e) 18 | 19 | 20 | @helper.create 21 | def create(event, context): 22 | logger.info("Got Create") 23 | import_reference(event, context) 24 | 25 | 26 | @helper.update 27 | def update(event, context): 28 | logger.info("Got Update") 29 | import_reference(event, context) 30 | 31 | 32 | @helper.delete 33 | def delete(event, context): 34 | logger.info("Got Delete") 35 | return "delete" 36 | # Delete never returns anything. Should not fail if the underlying resources are already deleted. Desired state. 37 | 38 | @helper.poll_create 39 | def poll_create(event, context): 40 | logger.info("Got Create poll") 41 | return check_reference_import_status(event, context) 42 | 43 | 44 | @helper.poll_update 45 | def poll_update(event, context): 46 | logger.info("Got Update poll") 47 | return check_reference_import_status(event, context) 48 | 49 | 50 | @helper.poll_delete 51 | def poll_delete(event, context): 52 | logger.info("Got Delete poll") 53 | return "delete poll" 54 | 55 | def handler(event, context): 56 | helper(event, context) 57 | 58 | def import_reference(event, context): 59 | reference_store_id = event['ResourceProperties']['ReferenceStoreId'] 60 | omics_import_role_arn = event['ResourceProperties']['OmicsImportReferenceRoleArn'] 61 | reference_source_s3_uri = event['ResourceProperties']['ReferenceSourceS3Uri'] 62 | reference_name = event['ResourceProperties']['ReferenceName'] 63 | try: 64 | print(f"Attempt to import reference: {reference_source_s3_uri} to store: {reference_store_id}") 65 | response = omics_client.start_reference_import_job( 66 | referenceStoreId=reference_store_id, 67 | roleArn=omics_import_role_arn, 68 | sources=[{'sourceFile': reference_source_s3_uri, 'name': reference_name}] 69 | ) 70 | except ClientError as e: 71 | raise Exception( "boto3 client error : " + e.__str__()) 72 | except Exception as e: 73 | raise Exception( "Unexpected error : " + e.__str__()) 74 | logger.info(response) 75 | helper.Data.update({"ReferenceImportJobId": response['id']}) 76 | helper.Data.update({"ReferenceStoreId": response['referenceStoreId']}) 77 | return True 78 | 79 | def get_reference_arn_id(reference_store_id, reference_name): 80 | try: 81 | response = omics_client.list_references( 82 | referenceStoreId=reference_store_id, 83 | filter={'name': reference_name} 84 | ) 85 | except ClientError as e: 86 | raise Exception( "boto3 client error : " + e.__str__()) 87 | except Exception as e: 88 | raise Exception( "Unexpected error : " + e.__str__()) 89 | return response['references'][0]['arn'], response['references'][0]['id'] 90 | 91 | def check_reference_import_status(event, context): 92 | reference_store_id = helper.Data.get("ReferenceStoreId") 93 | reference_import_job_id = helper.Data.get("ReferenceImportJobId") 94 | 95 | try: 96 | response = omics_client.get_reference_import_job( 97 | id=reference_import_job_id, 98 | referenceStoreId=reference_store_id 99 | ) 100 | except ClientError as e: 101 | raise 
Exception( "boto3 client error : " + e.__str__()) 102 | except Exception as e: 103 | raise Exception( "Unexpected error : " + e.__str__()) 104 | status = response['status'] 105 | 106 | if status in ['SUBMITTED', 'IN_PROGRESS', 'RUNNING']: 107 | logger.info(status) 108 | return None 109 | else: 110 | if status in ['READY', 'ACTIVE', 'COMPLETED']: 111 | logger.info(status) 112 | _arn, _id = get_reference_arn_id( 113 | reference_store_id, 114 | event['ResourceProperties']['ReferenceName'] 115 | ) 116 | helper.Data.update({"Arn": _arn}) 117 | helper.Data.update({"Id": _id}) 118 | return True 119 | else: 120 | msg = f"Reference store: {reference_store_id} has status {status}, exiting" 121 | logger.info(msg) 122 | raise ValueError(msg) 123 | 124 | -------------------------------------------------------------------------------- /src/lambda/import_sequence/import_sequence_lambda.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import boto3 3 | from botocore.exceptions import ClientError 4 | 5 | logger = logging.getLogger(__name__) 6 | 7 | # Initiate client 8 | try: 9 | print("Attempt to initiate client") 10 | omics_session = boto3.Session() 11 | omics_client = omics_session.client('omics') 12 | print("Attempt to initiate client complete") 13 | except Exception as e: 14 | raise e 15 | 16 | def handler(event, context): 17 | sequence_store_id = event['SequenceStoreId'] 18 | sample_id = event['SampleId'] 19 | subject_id = event['SubjectId'] 20 | source_file_type = event['FileType'] 21 | source_files = {} 22 | source1 = event['Read1'] 23 | source_files["source1"] = source1 24 | if "Read2" in event: 25 | source2 = event['Read2'] 26 | source_files["source2"] = source2 27 | reference_arn = event['ReferenceArn'] 28 | role_arn = event['RoleArn'] 29 | source_list = [ 30 | { 31 | "sourceFiles": source_files, 32 | "sourceFileType": source_file_type, 33 | "subjectId": subject_id, 34 | "sampleId": sample_id, 35 | "referenceArn": reference_arn 36 | } 37 | ] 38 | 39 | try: 40 | print("Attempt to import read set") 41 | response = omics_client.start_read_set_import_job( 42 | sequenceStoreId=sequence_store_id, 43 | roleArn=role_arn, 44 | sources=source_list 45 | ) 46 | except ClientError as e: 47 | raise Exception( "boto3 client error : " + e.__str__()) 48 | except Exception as e: 49 | raise Exception( "Unexpected error : " + e.__str__()) 50 | logger.info(response) 51 | return {"importReadSetJobId": response['id']} -------------------------------------------------------------------------------- /src/lambda/import_variants/import_variant_lambda.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | from botocore.exceptions import ClientError 3 | 4 | # Initiate client 5 | try: 6 | print("Attempt to initiate client") 7 | omics_session = boto3.Session() 8 | omics_client = omics_session.client('omics') 9 | print("Attempt to initiate client complete") 10 | except Exception as e: 11 | raise e 12 | 13 | def handler(event, context): 14 | variant_store_name = event['VariantStoreName'] 15 | role_arn = event['OmicsImportVariantRoleArn'] 16 | variant_items = [{ 17 | "source": event['VcfS3Uri'] 18 | }] 19 | try: 20 | print("Attempt to start variant import job") 21 | response = omics_client.start_variant_import_job( 22 | destinationName=variant_store_name, 23 | roleArn=role_arn, 24 | items=variant_items 25 | ) 26 | except ClientError as e: 27 | raise Exception( "boto3 client error : " + e.__str__()) 28 | except Exception as 
e: 29 | raise Exception( "Unexpected error : " + e.__str__()) 30 | print(response) 31 | return {"VariantImportJobId": response['jobId']} -------------------------------------------------------------------------------- /src/lambda/launch_genomics_sfn/lambda_launch_genomics_sfn.py: -------------------------------------------------------------------------------- 1 | import json 2 | import boto3 3 | import re 4 | import os 5 | import sys 6 | import uuid 7 | import time 8 | import random 9 | 10 | print('Loading function - launch Genomics Step Function Workflow') 11 | 12 | s3 = boto3.client('s3') 13 | 14 | ## FASTQ files name should look like 15 | # mysample_R1.fastq.gz mysample_R2.fastq.gz 16 | # based on regex below 17 | FASTQ_REGEX = re.compile('^(\w{1,20})_R(\d{1,10})\.*') 18 | 19 | ## this will control the expected number of files 20 | # found with prefix, for example, inputs/mysamples 21 | EXPECTED_READS = int(os.environ['NUM_FASTQS_PER_SAMPLE']) 22 | SFN_ARN = os.environ['GENOMICS_STEP_FUNCTION_ARN'] 23 | MAX_DELAY = 20 24 | 25 | def get_files_with_prefix(_bucket, _key, _sample): 26 | file_list = [] 27 | if "/" in _key: 28 | _prefix = os.path.dirname(_key) + '/' + _sample 29 | else: 30 | _prefix = _sample 31 | 32 | s3_client = boto3.client("s3") 33 | response = s3_client.list_objects_v2(Bucket=_bucket, Prefix=_prefix) 34 | files = response.get("Contents") 35 | for file in files: 36 | s3_file_uri = 's3://' + _bucket + '/' + file['Key'] 37 | print(f"S3 file path: {s3_file_uri}") 38 | file_list.append(s3_file_uri) 39 | return file_list 40 | 41 | def verify_fastq(_filename): 42 | result = re.search(FASTQ_REGEX, _filename) 43 | if result: 44 | print(f'verified that file {_filename} is a FASTQ') 45 | return True 46 | else: 47 | return False 48 | 49 | def is_sfn_exec_running(exec_name_prefix): 50 | sfn_client = boto3.client('stepfunctions') 51 | try: 52 | response = sfn_client.list_executions( 53 | stateMachineArn=SFN_ARN, 54 | statusFilter='RUNNING' 55 | ) 56 | except Exception as e: 57 | raise e 58 | # check for response 59 | if 'executions' in response and len(response['executions']) > 0: 60 | for _exec in response['executions']: 61 | if _exec['name'].startswith(exec_name_prefix): 62 | return True 63 | return False 64 | else: 65 | return False 66 | 67 | def lambda_handler(event, context): 68 | # Sanity checks 69 | print("Received s3 event: " + json.dumps(event, indent=4)) 70 | if "Records" not in event: 71 | sys.exit("Event doesnt have records, exiting") 72 | 73 | if len(event["Records"]) == 0: 74 | sys.exit("Event has empty records, exiting") 75 | 76 | event_obj = event["Records"][0] 77 | if "eventSource" not in event_obj or \ 78 | event_obj["eventSource"] != "aws:s3" or \ 79 | event_obj["eventName"].split(':')[0] != "ObjectCreated": 80 | sys.exit("Not a valid PutObject S3 event, exiting") 81 | 82 | # Get the object from the event and show its content type 83 | bucket = event_obj['s3']['bucket']['name'] 84 | _key = event_obj['s3']['object']['key'] 85 | print(f"Bucket: {bucket} Key: {_key}") 86 | 87 | if not _key.endswith('.fastq') and not _key.endswith('.fastq.gz'): 88 | sys.exit("Not a valid FASTQ file, exiting") 89 | else: 90 | # check if reads present 91 | if not verify_fastq(_key.split('/')[-1]): 92 | sys.exit("Not a valid fastq") 93 | else: 94 | # Add a random sleep to reduce likelihood of a race condition when 95 | # multiple fastqs arrive at the same time 96 | time_delay_seconds = random.randrange(0, MAX_DELAY) 97 | print(f"Waiting for {time_delay_seconds} seconds") 98 | 
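            # Each object upload triggers its own invocation, so this jitter staggers
            # concurrent invocations for the same sample; the is_sfn_exec_running()
            # check below then skips launching a duplicate execution if another
            # invocation has already started one.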
time.sleep(time_delay_seconds) 99 | 100 | result = re.match(FASTQ_REGEX, os.path.basename(_key)) 101 | sample_name = result.group(1) 102 | files_for_sample = get_files_with_prefix(bucket, _key, sample_name) 103 | print(f"{len(files_for_sample)} reads found for sample {sample_name}") 104 | if len(files_for_sample) == EXPECTED_READS: 105 | print("All FASTQs for sample accounted for, start step functions") 106 | sfn_name_prefix = f'GENOMICS_{sample_name}' 107 | sfn_exec_name = sfn_name_prefix + '_' + str(uuid.uuid1()) 108 | 109 | # check if already running (to avoid race condition) 110 | print("Checking if SFN execution running") 111 | if is_sfn_exec_running(sfn_name_prefix): 112 | sys.exit(f"SFN execution for sample: {sample_name} is RUNNING, skip launching a duplicate") 113 | 114 | sfn_payload = { 115 | "SampleId": sample_name, 116 | "Read1": files_for_sample[0], 117 | "Read2": files_for_sample[1], 118 | "SubjectId": 'TEST_SUBJECT', 119 | "SequenceStoreId": os.environ["SEQUENCE_STORE_ID"], 120 | "ReferenceArn": os.environ["REFERENCE_ARN"], 121 | "WorkflowId": os.environ["WORKFLOW_ID"], 122 | "WorkflowOutputS3Path": os.environ["WORKFLOW_OUTPUT_S3_PATH"], 123 | "GatkDockerUri": os.environ["GATK_DOCKER_URI"], 124 | "GotcDockerUri": os.environ["GOTC_DOCKER_URI"], 125 | "IntervalsS3Path": os.environ["INTERVAL_S3_PATH"] 126 | } 127 | sfn_client = boto3.client('stepfunctions') 128 | try: 129 | response = sfn_client.start_execution( 130 | stateMachineArn=SFN_ARN, 131 | name=sfn_exec_name, 132 | input=json.dumps(sfn_payload) 133 | ) 134 | print(f"Launched SFN execution: {sfn_exec_name}") 135 | except Exception as e: 136 | raise e 137 | else: 138 | print("Not all FASTQs found for sample, exit") 139 | -------------------------------------------------------------------------------- /src/lambda/start_workflow/start_workflow_lambda.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import boto3 3 | from botocore.exceptions import ClientError 4 | 5 | logger = logging.getLogger(__name__) 6 | 7 | # Initiate client 8 | try: 9 | print("Attempt to initiate client") 10 | omics_session = boto3.Session() 11 | omics_client = omics_session.client('omics') 12 | print("Attempt to initiate client complete") 13 | except Exception as e: 14 | raise e 15 | 16 | def handler(event, context): 17 | workflow_id = event['WorkflowId'] 18 | role_arn = event['JobRoleArn'] 19 | output_s3_path = event['OutputS3Path'] 20 | params = { 21 | "sample_name": event['sample_name'], 22 | "ref_fasta": event['ref_fasta'], 23 | "fastq_1": event['fastq_1'], 24 | "fastq_2": event['fastq_2'], 25 | "readgroup_name": event['readgroup_name'], 26 | "library_name": event['fastq_2'], 27 | "platform_name": event['platform_name'], 28 | "run_date": event['run_date'], 29 | "sequencing_center":event['sequencing_center'], 30 | "dbSNP_vcf": event['dbSNP_vcf'], 31 | "Mills_1000G_indels_vcf": event['Mills_1000G_indels_vcf'], 32 | "known_indels_vcf": event['known_indels_vcf'], 33 | "scattered_calling_intervals_archive": event['scattered_calling_intervals_archive'], 34 | "gatk_docker": event['gatk_docker'], 35 | "gotc_docker": event['gotc_docker'] 36 | 37 | } 38 | 39 | try: 40 | print("Attempt to start workflow run") 41 | response = omics_client.start_run( 42 | workflowId=workflow_id, 43 | name=event['sample_name'] + '-workflow', 44 | roleArn=role_arn, 45 | parameters=params, 46 | outputUri=output_s3_path 47 | ) 48 | except ClientError as e: 49 | raise Exception( "boto3 client error : " + e.__str__()) 50 | 
except Exception as e: 51 | raise Exception( "Unexpected error : " + e.__str__()) 52 | logger.info(response) 53 | return {"WorkflowRunId": response['id']} -------------------------------------------------------------------------------- /src/lambda/trigger_code_build/trigger_docker_code_build.py: -------------------------------------------------------------------------------- 1 | from crhelper import CfnResource 2 | import logging 3 | import boto3 4 | from botocore.exceptions import ClientError 5 | 6 | logger = logging.getLogger(__name__) 7 | # Initialise the helper, all inputs are optional, this example shows the defaults 8 | helper = CfnResource(json_logging=False, log_level='DEBUG', boto_level='CRITICAL', polling_interval=1) 9 | 10 | 11 | # Initiate client 12 | try: 13 | print("Attempt to initiate client") 14 | _session = boto3.Session() 15 | code_build_client = _session.client('codebuild') 16 | print("Attempt to initiate codebuild client complete") 17 | except Exception as e: 18 | helper.init_failure(e) 19 | 20 | 21 | @helper.create 22 | def create(event, context): 23 | logger.info("Got Create") 24 | start_code_build(event, context) 25 | 26 | 27 | @helper.update 28 | def update(event, context): 29 | logger.info("Got Update") 30 | start_code_build(event, context) 31 | 32 | 33 | @helper.delete 34 | def delete(event, context): 35 | logger.info("Got Delete") 36 | return "delete" 37 | # Delete never returns anything. Should not fail if the underlying resources are already deleted. Desired state. 38 | 39 | @helper.poll_create 40 | def poll_create(event, context): 41 | logger.info("Got Create poll") 42 | return check_code_build_status(event, context) 43 | 44 | 45 | @helper.poll_update 46 | def poll_update(event, context): 47 | logger.info("Got Update poll") 48 | return check_code_build_status(event, context) 49 | 50 | 51 | @helper.poll_delete 52 | def poll_delete(event, context): 53 | logger.info("Got Delete poll") 54 | return "delete poll" 55 | 56 | def handler(event, context): 57 | helper(event, context) 58 | 59 | 60 | def start_code_build(event, context): 61 | project_name = event['ResourceProperties']['ProjectName'] 62 | source_repo = event['ResourceProperties']['SourceRepo'] 63 | ecr_repo = event['ResourceProperties']['EcrRepo'] 64 | image_tag = source_repo.split(':')[-1] 65 | try: 66 | print(f"Attempt to start code build project {project_name}") 67 | response = code_build_client.start_build( 68 | projectName=project_name, 69 | environmentVariablesOverride=[ 70 | { 71 | "name": 'SOURCE_REPO', 72 | "value": source_repo 73 | }, 74 | { 75 | "name": 'ECR_REPO_VERSION', 76 | "value": image_tag 77 | }, 78 | { 79 | "name": 'ECR_REPO', 80 | "value": ecr_repo 81 | } 82 | ] 83 | ) 84 | except ClientError as e: 85 | raise Exception( "boto3 client error : " + e.__str__()) 86 | except Exception as e: 87 | raise Exception( "Unexpected error : " + e.__str__()) 88 | logger.info(response) 89 | helper.Data.update({"BuildId": response['build']['id']}) 90 | helper.Data.update({"EcrImageUri": ecr_repo + ':' + image_tag}) 91 | 92 | def check_code_build_status(event, context): 93 | build_id = helper.Data.get('BuildId') 94 | 95 | try: 96 | response = code_build_client.batch_get_builds(ids=[build_id]) 97 | except ClientError as e: 98 | raise Exception( "boto3 client error : " + e.__str__()) 99 | except Exception as e: 100 | raise Exception( "Unexpected error : " + e.__str__()) 101 | status = response['builds'][0]['buildStatus'] 102 | 103 | if status in ['FAILED', 'FAULT', 'STOPPED', 'TIMED_OUT']: 104 | msg = 
f"Build ID {build_id} has status {status}, exiting" 105 | logger.info(msg) 106 | raise ValueError(msg) 107 | else: 108 | if status in ['SUCCEEDED']: 109 | logger.info(status) 110 | return True 111 | else: 112 | logger.info(f"Build ID status is: {build_id}") 113 | return None -------------------------------------------------------------------------------- /src/lambda/trigger_code_build/trigger_lambdas_code_build.py: -------------------------------------------------------------------------------- 1 | from crhelper import CfnResource 2 | import logging 3 | import boto3 4 | from botocore.exceptions import ClientError 5 | 6 | logger = logging.getLogger(__name__) 7 | # Initialise the helper, all inputs are optional, this example shows the defaults 8 | helper = CfnResource(json_logging=False, log_level='DEBUG', boto_level='CRITICAL', polling_interval=1) 9 | 10 | 11 | # Initiate client 12 | try: 13 | print("Attempt to initiate client") 14 | _session = boto3.Session() 15 | code_build_client = _session.client('codebuild') 16 | print("Attempt to initiate codebuild client complete") 17 | except Exception as e: 18 | helper.init_failure(e) 19 | 20 | 21 | @helper.create 22 | def create(event, context): 23 | logger.info("Got Create") 24 | start_code_build(event, context) 25 | 26 | 27 | @helper.update 28 | def update(event, context): 29 | logger.info("Got Update") 30 | start_code_build(event, context) 31 | 32 | 33 | @helper.delete 34 | def delete(event, context): 35 | logger.info("Got Delete") 36 | return "delete" 37 | # Delete never returns anything. Should not fail if the underlying resources are already deleted. Desired state. 38 | 39 | @helper.poll_create 40 | def poll_create(event, context): 41 | logger.info("Got Create poll") 42 | return check_code_build_status(event, context) 43 | 44 | 45 | @helper.poll_update 46 | def poll_update(event, context): 47 | logger.info("Got Update poll") 48 | return check_code_build_status(event, context) 49 | 50 | 51 | @helper.poll_delete 52 | def poll_delete(event, context): 53 | logger.info("Got Delete poll") 54 | return "delete poll" 55 | 56 | def handler(event, context): 57 | helper(event, context) 58 | 59 | 60 | def start_code_build(event, context): 61 | project_name = event['ResourceProperties']['ProjectName'] 62 | 63 | try: 64 | print(f"Attempt to start code build project {project_name}") 65 | response = code_build_client.start_build( 66 | projectName=project_name 67 | ) 68 | except ClientError as e: 69 | raise Exception( "boto3 client error : " + e.__str__()) 70 | except Exception as e: 71 | raise Exception( "Unexpected error : " + e.__str__()) 72 | logger.info(response) 73 | helper.Data.update({"BuildId": response['build']['id']}) 74 | 75 | def check_code_build_status(event, context): 76 | build_id = helper.Data.get('BuildId') 77 | 78 | try: 79 | response = code_build_client.batch_get_builds(ids=[build_id]) 80 | except ClientError as e: 81 | raise Exception( "boto3 client error : " + e.__str__()) 82 | except Exception as e: 83 | raise Exception( "Unexpected error : " + e.__str__()) 84 | status = response['builds'][0]['buildStatus'] 85 | 86 | if status in ['FAILED', 'FAULT', 'STOPPED', 'TIMED_OUT']: 87 | msg = f"Build ID {build_id} has status {status}, exiting" 88 | logger.info(msg) 89 | raise ValueError(msg) 90 | else: 91 | if status in ['SUCCEEDED']: 92 | logger.info(status) 93 | return True 94 | else: 95 | logger.info(f"Build ID status is: {build_id}") 96 | return None -------------------------------------------------------------------------------- 
/src/workflow/main.wdl: -------------------------------------------------------------------------------- 1 | version 1.0 2 | 3 | import "sub-workflows/processing-for-variant-discovery-gatk4.wdl" as preprocess 4 | import "sub-workflows/haplotypecaller-gvcf-gatk4.wdl" as haplotype 5 | import "sub-workflows/fastq-to-bam.wdl" as fastq2bam 6 | 7 | workflow fastqToVCF { 8 | input { 9 | String sample_name 10 | File fastq_1 11 | File fastq_2 12 | String readgroup_name 13 | String run_date 14 | String library_name 15 | String platform_name 16 | String sequencing_center 17 | File ref_fasta 18 | File dbSNP_vcf 19 | File known_indels_vcf 20 | 21 | File Mills_1000G_indels_vcf 22 | 23 | File scattered_calling_intervals_archive 24 | String gatk_docker 25 | String gotc_docker 26 | 27 | } 28 | String unmapped_bam_suffix="unmapped.bam" 29 | String ref_base_ext = basename(ref_fasta) 30 | String ref_base = basename(ref_fasta,".fasta") 31 | String ref_base_path = sub(ref_fasta,ref_base_ext,"") 32 | String dbSNP_vcf_index_path = sub(dbSNP_vcf,".vcf",".vcf.idx") 33 | String known_indels_vcf_path = sub(known_indels_vcf,".vcf.gz",".vcf.gz.tbi") 34 | String Mills_1000G_indels_vcf_path = sub(Mills_1000G_indels_vcf,".vcf.gz",".vcf.gz.tbi") 35 | 36 | File ref_fasta_index = ref_base_path + ref_base_ext + ".fai" 37 | File ref_dict = ref_base_path + ref_base + ".dict" 38 | File ref_alt = ref_base_path + ref_base_ext + ".64.alt" 39 | File ref_sa = ref_base_path + ref_base_ext + ".64.sa" 40 | File ref_ann = ref_base_path + ref_base_ext + ".64.ann" 41 | File ref_bwt = ref_base_path + ref_base_ext + ".64.bwt" 42 | File ref_pac = ref_base_path + ref_base_ext + ".64.pac" 43 | File ref_amb = ref_base_path + ref_base_ext + ".64.amb" 44 | 45 | File dbSNP_vcf_index = dbSNP_vcf_index_path 46 | File known_indels_vcf_index = known_indels_vcf_path 47 | File Mills_1000G_indels_vcf_index = Mills_1000G_indels_vcf_path 48 | 49 | call fastq2bam.ConvertPairedFastQsToUnmappedBamWf as Fastq2Bam { 50 | input: 51 | sample_name=sample_name, 52 | fastq_1=fastq_1, fastq_2=fastq_2, 53 | readgroup_name=readgroup_name, 54 | run_date = run_date, 55 | library_name = library_name, 56 | platform_name = platform_name, 57 | sequencing_center = sequencing_center, 58 | gatk_docker = gatk_docker 59 | } 60 | call preprocess.PreProcessingForVariantDiscovery_GATK4 as PreProcess { 61 | input: 62 | sample_name = sample_name, 63 | unmapped_bam = Fastq2Bam.output_unmapped_bam, 64 | unmapped_bam_suffix = unmapped_bam_suffix, 65 | ref_fasta = ref_fasta, 66 | ref_fasta_index = ref_fasta_index, 67 | ref_dict = ref_dict, 68 | ref_alt = ref_alt, 69 | ref_sa = ref_sa, 70 | ref_ann = ref_ann, 71 | ref_bwt = ref_bwt, 72 | ref_pac = ref_pac, 73 | ref_amb = ref_amb, 74 | dbSNP_vcf = dbSNP_vcf, 75 | dbSNP_vcf_index = dbSNP_vcf_index, 76 | known_indels_vcf = known_indels_vcf, 77 | known_indels_vcf_index = known_indels_vcf_index, 78 | Mills_1000G_indels_vcf = Mills_1000G_indels_vcf, 79 | Mills_1000G_indels_vcf_index = Mills_1000G_indels_vcf_index, 80 | gatk_docker = gatk_docker, 81 | gotc_docker = gotc_docker, 82 | } 83 | call haplotype.HaplotypeCallerGvcf_GATK4 as CallHaplotypes { 84 | input: 85 | input_bam = PreProcess.analysis_ready_bam, 86 | input_bam_index = PreProcess.analysis_ready_bam_index, 87 | ref_fasta = ref_fasta, 88 | ref_fasta_index = ref_fasta_index, 89 | ref_dict = ref_dict, 90 | scattered_calling_intervals_archive = scattered_calling_intervals_archive, 91 | gatk_docker = gatk_docker, 92 | gotc_docker = gotc_docker, 93 | 94 | } 95 | output { 96 | File 
duplication_metrics = PreProcess.duplication_metrics 97 | File bqsr_report = PreProcess.bqsr_report 98 | File analysis_ready_bam = PreProcess.analysis_ready_bam 99 | File analysis_ready_bam_index = PreProcess.analysis_ready_bam_index 100 | File analysis_ready_bam_md5 = PreProcess.analysis_ready_bam_md5 101 | File output_vcf = CallHaplotypes.output_vcf 102 | File output_vcf_index = CallHaplotypes.output_vcf_index 103 | } 104 | } 105 | -------------------------------------------------------------------------------- /src/workflow/parameter-template.json: -------------------------------------------------------------------------------- 1 | { 2 | "sample_name": { 3 | "description": "sample name" 4 | }, 5 | "fastq_1": { 6 | "description": "path to fastq1" 7 | }, 8 | "fastq_2": { 9 | "description": "path to fastq2" 10 | }, 11 | "ref_fasta": { 12 | "description": "path to reference fasta" 13 | }, 14 | "readgroup_name": { 15 | "description": "readgroup name" 16 | }, 17 | "library_name": { 18 | "description": "library name" 19 | }, 20 | "platform_name": { 21 | "description": "platform name}, e.g. Illumina" 22 | }, 23 | "run_date": { 24 | "description": "sequencing run date" 25 | }, 26 | "sequencing_center": { 27 | "description": "name of sequencing center" 28 | }, 29 | "dbSNP_vcf": { 30 | "description": "dbsnp vcf" 31 | }, 32 | "Mills_1000G_indels_vcf": { 33 | "description": "Mills 1000 genomes gold indels vcf" 34 | }, 35 | "known_indels_vcf": { 36 | "description": "known indels vcf" 37 | }, 38 | "scattered_calling_intervals_archive": { 39 | "description": "tar (not gzip) of scatter intervals" 40 | }, 41 | "gatk_docker": { 42 | "description": "docker uri in private ECR of GATK" 43 | }, 44 | "gotc_docker": { 45 | "description": "docker uri in private ECR of Genomes in the Cloud" 46 | } 47 | } -------------------------------------------------------------------------------- /src/workflow/sub-workflows/fastq-to-bam.wdl: -------------------------------------------------------------------------------- 1 | version 1.0 2 | ##Copyright Broad Institute, 2018 3 | ## 4 | ## This WDL converts paired FASTQ to uBAM and adds read group information 5 | ## 6 | ## Requirements/expectations : 7 | ## - Pair-end sequencing data in FASTQ format (one file per orientation) 8 | ## - The following metada descriptors per sample: 9 | ## - readgroup 10 | ## - sample_name 11 | ## - library_name 12 | ## - platform_unit 13 | ## - run_date 14 | ## - platform_name 15 | ## - sequecing_center 16 | ## 17 | ## Outputs : 18 | ## - Set of unmapped BAMs, one per read group 19 | ## - File of a list of the generated unmapped BAMs 20 | ## 21 | ## Cromwell version support 22 | ## - Successfully tested on v47 23 | ## - Does not work on versions < v23 due to output syntax 24 | ## 25 | ## Runtime parameters are optimized for Broad's Google Cloud Platform implementation. 26 | ## For program versions, see docker containers. 27 | ## 28 | ## LICENSING : 29 | ## This script is released under the WDL source code license (BSD-3) (see LICENSE in 30 | ## https://github.com/broadinstitute/wdl). Note however that the programs it calls may 31 | ## be subject to different licenses. Users are responsible for checking that they are 32 | ## authorized to run all programs before running this script. Please see the docker 33 | ## page at https://hub.docker.com/r/broadinstitute/genomes-in-the-cloud/ for detailed 34 | ## licensing information pertaining to the included programs. 
35 | 36 | # WORKFLOW DEFINITION 37 | workflow ConvertPairedFastQsToUnmappedBamWf { 38 | input { 39 | String sample_name 40 | File fastq_1 41 | File fastq_2 42 | String readgroup_name 43 | String run_date 44 | String library_name 45 | String platform_name 46 | String sequencing_center 47 | String gatk_docker 48 | } 49 | 50 | #String gatk_docker = "022521056385.dkr.ecr.us-east-1.amazonaws.com/gatk:4.1.9.0" 51 | 52 | String gatk_path = "/gatk/gatk" 53 | 54 | # Convert pair of FASTQs to uBAM 55 | call PairedFastQsToUnmappedBAM { 56 | input: 57 | sample_name = sample_name, 58 | fastq_1 = fastq_1, 59 | fastq_2 = fastq_2, 60 | readgroup_name = readgroup_name, 61 | run_date = run_date, 62 | library_name = library_name, 63 | platform_name = platform_name, 64 | sequencing_center = sequencing_center, 65 | gatk_path = gatk_path, 66 | docker = gatk_docker, 67 | } 68 | 69 | 70 | # Outputs that will be retained when execution is complete 71 | output { 72 | File output_unmapped_bam = PairedFastQsToUnmappedBAM.output_unmapped_bam 73 | } 74 | } 75 | 76 | # TASK DEFINITIONS 77 | 78 | # Convert a pair of FASTQs to uBAM 79 | task PairedFastQsToUnmappedBAM { 80 | input { 81 | # Command parameters 82 | String sample_name 83 | File fastq_1 84 | File fastq_2 85 | String readgroup_name 86 | String gatk_path 87 | String run_date 88 | String library_name 89 | String platform_name 90 | String sequencing_center 91 | 92 | # Runtime parameters 93 | Int machine_mem_gb = 7 94 | String docker 95 | } 96 | Int command_mem_gb = machine_mem_gb - 1 97 | command { 98 | echo "FASTQ to uBAM" >&2 99 | echo "fastq_1 ~{fastq_1}" >&2 100 | echo "fastq_2 ~{fastq_2}" >&2 101 | echo "sample_name ~{sample_name}" >&2 102 | echo "readgroup_name ~{readgroup_name}" >&2 103 | 104 | ~{gatk_path} --java-options "-Xmx~{command_mem_gb}g" \ 105 | FastqToSam \ 106 | --FASTQ ~{fastq_1} \ 107 | --FASTQ2 ~{fastq_2} \ 108 | --OUTPUT ~{readgroup_name}.unmapped.bam \ 109 | --READ_GROUP_NAME ~{readgroup_name} \ 110 | --SAMPLE_NAME ~{sample_name} \ 111 | --LIBRARY_NAME ~{library_name} \ 112 | --RUN_DATE ~{run_date} \ 113 | --PLATFORM ~{platform_name} \ 114 | --SEQUENCING_CENTER ~{sequencing_center} 115 | 116 | # Creates a file of file names of the uBAM, which is a text file with each row having the path to the file. 117 | # In this case there will only be one file path in the txt file but this format is used by 118 | # the pre-processing for variant discovery workflow. 119 | 120 | } 121 | runtime { 122 | docker: docker 123 | memory: machine_mem_gb + " GiB" 124 | cpu: 4 125 | } 126 | output { 127 | File output_unmapped_bam = "~{readgroup_name}.unmapped.bam" 128 | } 129 | } 130 | 131 | 132 | -------------------------------------------------------------------------------- /src/workflow/sub-workflows/haplotypecaller-gvcf-gatk4.wdl: -------------------------------------------------------------------------------- 1 | version 1.0 2 | 3 | ## Copyright Broad Institute, 2019 4 | ## 5 | ## The haplotypecaller-gvcf-gatk4 workflow runs the HaplotypeCaller tool 6 | ## from GATK4 in GVCF mode on a single sample according to GATK Best Practices. 7 | ## When executed the workflow scatters the HaplotypeCaller tool over a sample 8 | ## using an intervals list file. The output file produced will be a 9 | ## single gvcf file. 
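## (Note: the workflow body below hard-codes make_gvcf = false, so in this
## solution the merged output from MergeGVCFs is a standard .vcf.gz rather
## than a GVCF, despite the workflow name.)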
10 | ## 11 | ## Requirements/expectations : 12 | ## - One analysis-ready BAM file for a single sample (as identified in RG:SM) 13 | ## - Set of variant calling intervals lists for the scatter, provided in a file 14 | ## 15 | ## Outputs : 16 | ## - One GVCF file and its index 17 | ## 18 | ## 19 | ## LICENSING : 20 | ## This script is released under the WDL source code license (BSD-3) (see LICENSE in 21 | ## https://github.com/broadinstitute/wdl). Note however that the programs it calls may 22 | ## be subject to different licenses. Users are responsible for checking that they are 23 | ## authorized to run all programs before running this script. Please see the dockers 24 | ## for detailed licensing information pertaining to the included programs. 25 | 26 | # WORKFLOW DEFINITION 27 | workflow HaplotypeCallerGvcf_GATK4 { 28 | input { 29 | File input_bam 30 | File input_bam_index 31 | File ref_fasta 32 | File scattered_calling_intervals_archive 33 | File ref_fasta_index 34 | File ref_dict 35 | String gatk_docker 36 | String gotc_docker 37 | } 38 | 39 | 40 | #File ref_fasta="s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta" 41 | 42 | #String ref_base_ext = basename(ref_fasta) 43 | #String ref_base = basename(ref_fasta,".fasta") 44 | #String ref_base_path = sub(ref_fasta,ref_base_ext,"") 45 | 46 | #File ref_fasta_index= ref_base_path + ref_base_ext + ".fai" 47 | #File ref_dict = ref_base_path + ref_base + ".dict" 48 | 49 | #File ref_fasta_index="s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai" 50 | #File scattered_calling_intervals_archive="s3://omics-test-input-bucket/workflow/intervals.tar.gz" 51 | #File ref_dict="s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict" 52 | 53 | Boolean make_gvcf = false 54 | Boolean make_bamout = false 55 | #String gatk_docker = "022521056385.dkr.ecr.us-east-1.amazonaws.com/gatk:4.1.9.0" 56 | String gatk_path = "/gatk/gatk" 57 | 58 | String sample_basename = basename(input_bam, ".bam") 59 | String vcf_basename = sample_basename 60 | String output_suffix = if make_gvcf then ".g.vcf.gz" else ".vcf.gz" 61 | String output_filename = vcf_basename + output_suffix 62 | 63 | #Array[File] scattered_calling_intervals = read_lines(scattered_calling_intervals_list) 64 | 65 | # Call variants in parallel over grouped calling intervals 66 | 67 | call UnpackIntervals { 68 | input: archive = scattered_calling_intervals_archive, 69 | docker = gotc_docker 70 | } 71 | 72 | scatter (interval_file in UnpackIntervals.interval_files) { 73 | 74 | # Generate GVCF by interval 75 | call HaplotypeCaller { 76 | input: 77 | input_bam = input_bam, 78 | input_bam_index = input_bam_index, 79 | interval_list = interval_file, 80 | output_filename = output_filename, 81 | ref_dict = ref_dict, 82 | ref_fasta = ref_fasta, 83 | ref_fasta_index = ref_fasta_index, 84 | make_gvcf = make_gvcf, 85 | make_bamout = make_bamout, 86 | docker = gatk_docker, 87 | gatk_path = gatk_path 88 | } 89 | } 90 | 91 | # Merge per-interval GVCFs 92 | call MergeGVCFs { 93 | input: 94 | input_vcfs = HaplotypeCaller.output_vcf, 95 | input_vcfs_indexes = HaplotypeCaller.output_vcf_index, 96 | output_filename = output_filename, 97 | docker = gatk_docker, 98 | gatk_path = gatk_path 99 | } 100 | 101 | # Outputs that will be retained when execution is complete 102 | output { 103 | File output_vcf = MergeGVCFs.output_vcf 104 | File output_vcf_index = MergeGVCFs.output_vcf_index 105 | } 106 | } 107 | 108 | # TASK DEFINITIONS 109 | 110 | task UnpackIntervals { 111 | input { 112 | File archive 113 | 
String docker 114 | } 115 | String basestem_input = basename(archive, ".tar") 116 | command { 117 | echo "Unpack Intervals" >&2 118 | tar xvf ~{archive} --directory ./ 119 | } 120 | runtime { 121 | docker: docker 122 | cpu: 2 123 | memory: "2 GiB" 124 | } 125 | output { 126 | Array[File] interval_files = glob("${basestem_input}/*") 127 | } 128 | } 129 | 130 | # HaplotypeCaller per-sample in GVCF mode 131 | task HaplotypeCaller { 132 | input { 133 | # Command parameters 134 | File input_bam 135 | File input_bam_index 136 | File interval_list 137 | String output_filename 138 | File ref_dict 139 | File ref_fasta 140 | File ref_fasta_index 141 | Float? contamination 142 | Boolean make_gvcf 143 | Boolean make_bamout 144 | 145 | String gatk_path 146 | String? java_options 147 | 148 | # Runtime parameters 149 | String docker 150 | Int? mem_gb 151 | } 152 | 153 | String java_opt = select_first([java_options, ""]) 154 | 155 | Int machine_mem_gb = select_first([mem_gb, 16]) 156 | Int command_mem_gb = machine_mem_gb - 2 157 | 158 | String vcf_basename = if make_gvcf then basename(output_filename, ".gvcf") else basename(output_filename, ".vcf") 159 | String bamout_arg = if make_bamout then "-bamout ~{vcf_basename}.bamout.bam" else "" 160 | 161 | parameter_meta { 162 | input_bam: { 163 | description: "a bam file" 164 | } 165 | input_bam_index: { 166 | description: "an index file for the bam input" 167 | } 168 | } 169 | command { 170 | echo HaplotypeCaller >&2 171 | set -euxo pipefail 172 | 173 | ~{gatk_path} --java-options "-Xmx~{command_mem_gb}G ~{java_opt}" \ 174 | HaplotypeCaller \ 175 | -R ~{ref_fasta} \ 176 | -I ~{input_bam} \ 177 | -L ~{interval_list} \ 178 | -O ~{output_filename} \ 179 | -contamination ~{default="0" contamination} \ 180 | -G StandardAnnotation -G StandardHCAnnotation ~{true="-G AS_StandardAnnotation" false="" make_gvcf} \ 181 | -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \ 182 | ~{true="-ERC GVCF" false="" make_gvcf} \ 183 | ~{bamout_arg} 184 | 185 | touch ~{vcf_basename}.bamout.bam 186 | } 187 | runtime { 188 | docker: docker 189 | memory: machine_mem_gb + " GiB" 190 | cpu: 4 191 | } 192 | output { 193 | File output_vcf = "~{output_filename}" 194 | File output_vcf_index = "~{output_filename}.tbi" 195 | File bamout = "~{vcf_basename}.bamout.bam" 196 | } 197 | } 198 | # Merge GVCFs generated per-interval for the same sample 199 | task MergeGVCFs { 200 | input { 201 | # Command parameters 202 | Array[File] input_vcfs 203 | Array[File] input_vcfs_indexes 204 | String output_filename 205 | 206 | String gatk_path 207 | 208 | # Runtime parameters 209 | String docker 210 | Int? 
mem_gb 211 | } 212 | Int machine_mem_gb = select_first([mem_gb, 8]) 213 | Int command_mem_gb = machine_mem_gb - 2 214 | 215 | command { 216 | echo MergeGVCFs 217 | set -euxo pipefail 218 | 219 | ~{gatk_path} --java-options "-Xmx~{command_mem_gb}G" \ 220 | MergeVcfs \ 221 | --INPUT ~{sep=' --INPUT ' input_vcfs} \ 222 | --OUTPUT ~{output_filename} 223 | } 224 | runtime { 225 | docker: docker 226 | memory: machine_mem_gb + " GB" 227 | cpu: 2 228 | } 229 | output { 230 | File output_vcf = "~{output_filename}" 231 | File output_vcf_index = "~{output_filename}.tbi" 232 | } 233 | } 234 | -------------------------------------------------------------------------------- /src/workflow/sub-workflows/processing-for-variant-discovery-gatk4.wdl: -------------------------------------------------------------------------------- 1 | version 1.0 2 | 3 | ## Copyright Broad Institute, 2021 4 | ## 5 | ## This WDL pipeline implements data pre-processing according to the GATK Best Practices. 6 | ## 7 | ## Requirements/expectations : 8 | ## - Pair-end sequencing data in unmapped BAM (uBAM) format 9 | ## - One or more read groups, one per uBAM file, all belonging to a single sample (SM) 10 | ## - Input uBAM files must additionally comply with the following requirements: 11 | ## - - filenames all have the same suffix (we use ".unmapped.bam") 12 | ## - - files must pass validation by ValidateSamFile 13 | ## - - reads are provided in query-sorted order 14 | ## - - all reads must have an RG tag 15 | ## 16 | ## Output : 17 | ## - A clean BAM file and its index, suitable for variant discovery analyses. 18 | ## 19 | ## Cromwell version support 20 | ## - Successfully tested on v59 21 | ## 22 | ## Runtime parameters are optimized for Broad's Google Cloud Platform implementation. 23 | ## 24 | ## LICENSING : 25 | ## This script is released under the WDL source code license (BSD-3) (see LICENSE in 26 | ## https://github.com/broadinstitute/wdl). Note however that the programs it calls may 27 | ## be subject to different licenses. Users are responsible for checking that they are 28 | ## authorized to run all programs before running this script. Please see the dockers 29 | ## for detailed licensing information pertaining to the included programs. 
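##
## Note: this copy hard-codes ref_name = "hg38" and a bwa_commandline that
## requests 14 threads; adjust those values in the workflow body if the
## reference or instance sizing differs.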
30 | 31 | # WORKFLOW DEFINITION 32 | workflow PreProcessingForVariantDiscovery_GATK4 { 33 | input { 34 | String sample_name 35 | 36 | File unmapped_bam 37 | String unmapped_bam_suffix 38 | File ref_fasta 39 | File ref_fasta_index 40 | File ref_dict 41 | File ref_alt 42 | File ref_ann 43 | File ref_bwt 44 | File ref_pac 45 | File ref_amb 46 | File ref_sa 47 | File dbSNP_vcf 48 | File dbSNP_vcf_index 49 | File known_indels_vcf 50 | File known_indels_vcf_index 51 | File Mills_1000G_indels_vcf 52 | File Mills_1000G_indels_vcf_index 53 | String gatk_docker 54 | String gotc_docker 55 | 56 | } 57 | 58 | String ref_name = "hg38" 59 | #File ref_fasta = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta" 60 | 61 | #String ref_base_ext = basename(ref_fasta) 62 | #String ref_base = basename(ref_fasta,".fasta") 63 | #String ref_base_path = sub(ref_fasta,ref_base_ext,"") 64 | #String dbSNP_vcf_index_path = sub(dbSNP_vcf,".vcf",".vcf.idx") 65 | #String known_indels_vcf_path = sub(known_indels_vcf,".vcf.gz",".vcf.gz.tbi") 66 | #String Mills_1000G_indels_vcf_path = sub(Mills_1000G_indels_vcf,".vcf.gz",".vcf.gz.tbi") 67 | 68 | #File ref_fasta_index = ref_base_path + ref_base_ext + ".fai" 69 | #File ref_dict = ref_base_path + ref_base + ".dict" 70 | #File ref_alt = ref_base_path + ref_base_ext + ".64.alt" 71 | #File ref_sa = ref_base_path + ref_base_ext + ".64.sa" 72 | #File ref_ann = ref_base_path + ref_base_ext + ".64.ann" 73 | #File ref_bwt = ref_base_path + ref_base_ext + ".64.bwt" 74 | #File ref_pac = ref_base_path + ref_base_ext + ".64.pac" 75 | # File ref_amb = ref_base_path + ref_base_ext + ".64.amb" 76 | 77 | #File dbSNP_vcf_index = dbSNP_vcf_index_path 78 | #File known_indels_vcf_index = known_indels_vcf_path 79 | #File Mills_1000G_indels_vcf_index = Mills_1000G_indels_vcf_path 80 | 81 | #File ref_fasta_index = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai" 82 | #File ref_dict = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict" 83 | #File ref_alt = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt" 84 | #File ref_sa = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.sa" 85 | #File ref_ann = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.ann" 86 | #File ref_bwt = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.bwt" 87 | #File ref_pac = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.pac" 88 | #File ref_amb = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.amb" 89 | #File dbSNP_vcf = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf" 90 | #File dbSNP_vcf_index = "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx" 91 | #Array[File] known_indels_sites_VCFs = [ 92 | # "s3://broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz", 93 | # "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz" 94 | # ] 95 | #Array[File] known_indels_sites_indices = [ 96 | # "s3://broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi", 97 | # "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi" 98 | # ] 99 | 100 | 101 | String bwa_commandline = "bwa mem -K 100000000 -p -v 3 -t 14 -Y $bash_ref_fasta" 102 | Int compression_level = 6 103 | 104 | #String gatk_docker = "022521056385.dkr.ecr.us-east-1.amazonaws.com/gatk:4.1.9.0" 105 | String gatk_path = "/gatk/gatk" 106 | #String gotc_docker = 
"022521056385.dkr.ecr.us-east-1.amazonaws.com/genomes-in-the-cloud:2.4.7-1603303710" 107 | String gotc_path = "/usr/gitc/" 108 | # Amazon linux has python installed 109 | 110 | String base_file_name = sample_name + "." + ref_name 111 | 112 | #Array[File] flowcell_unmapped_bams = read_lines(flowcell_unmapped_bams_list) 113 | 114 | #File unmapped_bam = unmapped_bam 115 | 116 | # Get the version of BWA to include in the PG record in the header of the BAM produced 117 | # by MergeBamAlignment. 118 | call GetBwaVersion { 119 | input: 120 | docker_image = gotc_docker, 121 | bwa_path = gotc_path, 122 | } 123 | 124 | # Align flowcell-level unmapped input bams in parallel 125 | 126 | # Get the basename, i.e. strip the filepath and the extension 127 | String bam_basename = basename(unmapped_bam, unmapped_bam_suffix) 128 | 129 | # Map reads to reference 130 | call SamToFastqAndBwaMem { 131 | input: 132 | input_bam = unmapped_bam, 133 | bwa_commandline = bwa_commandline, 134 | output_bam_basename = bam_basename + ".unmerged", 135 | ref_fasta = ref_fasta, 136 | ref_fasta_index = ref_fasta_index, 137 | ref_dict = ref_dict, 138 | ref_alt = ref_alt, 139 | ref_sa = ref_sa, 140 | ref_ann = ref_ann, 141 | ref_bwt = ref_bwt, 142 | ref_pac = ref_pac, 143 | ref_amb = ref_amb, 144 | docker_image = gotc_docker, 145 | bwa_path = gotc_path, 146 | gotc_path = gotc_path, 147 | compression_level = compression_level 148 | } 149 | 150 | # Merge original uBAM and BWA-aligned BAM 151 | call MergeBamAlignment { 152 | input: 153 | unmapped_bam = unmapped_bam, 154 | bwa_commandline = bwa_commandline, 155 | bwa_version = GetBwaVersion.version, 156 | aligned_bam = SamToFastqAndBwaMem.output_bam, 157 | output_bam_basename = bam_basename + ".aligned.unsorted", 158 | ref_fasta = ref_fasta, 159 | ref_fasta_index = ref_fasta_index, 160 | ref_dict = ref_dict, 161 | docker_image = gatk_docker, 162 | gatk_path = gatk_path, 163 | compression_level = compression_level 164 | } 165 | 166 | # Aggregate aligned+merged flowcell BAM files and mark duplicates 167 | # We take advantage of the tool's ability to take multiple BAM inputs and write out a single output 168 | # to avoid having to spend time just merging BAM files. 
169 | call MarkDuplicates { 170 | input: 171 | input_bams = MergeBamAlignment.output_bam, 172 | output_bam_basename = base_file_name + ".aligned.unsorted.duplicates_marked", 173 | metrics_filename = base_file_name + ".duplicate_metrics", 174 | docker_image = gatk_docker, 175 | gatk_path = gatk_path, 176 | compression_level = compression_level, 177 | } 178 | 179 | # Sort aggregated+deduped BAM file and fix tags 180 | call SortAndFixTags { 181 | input: 182 | input_bam = MarkDuplicates.output_bam, 183 | output_bam_basename = base_file_name + ".aligned.duplicate_marked.sorted", 184 | ref_dict = ref_dict, 185 | ref_fasta = ref_fasta, 186 | ref_fasta_index = ref_fasta_index, 187 | docker_image = gatk_docker, 188 | gatk_path = gatk_path, 189 | compression_level = compression_level 190 | } 191 | 192 | # Create list of sequences for scatter-gather parallelization 193 | call CreateSequenceGroupingTSV { 194 | input: 195 | ref_dict = ref_dict, 196 | docker_image = gotc_docker, 197 | } 198 | 199 | # Perform Base Quality Score Recalibration (BQSR) on the sorted BAM in parallel 200 | scatter (subgroup in CreateSequenceGroupingTSV.sequence_grouping) { 201 | # Generate the recalibration model by interval 202 | call BaseRecalibrator { 203 | input: 204 | input_bam = SortAndFixTags.output_bam, 205 | input_bam_index = SortAndFixTags.output_bam_index, 206 | recalibration_report_filename = base_file_name + ".recal_data.csv", 207 | sequence_group_interval = subgroup, 208 | dbSNP_vcf = dbSNP_vcf, 209 | dbSNP_vcf_index = dbSNP_vcf_index, 210 | known_indels_vcf = known_indels_vcf, 211 | known_indels_vcf_index = known_indels_vcf_index, 212 | Mills_1000G_indels_vcf = Mills_1000G_indels_vcf, 213 | Mills_1000G_indels_vcf_index = Mills_1000G_indels_vcf_index, 214 | ref_dict = ref_dict, 215 | ref_fasta = ref_fasta, 216 | ref_fasta_index = ref_fasta_index, 217 | docker_image = gatk_docker, 218 | gatk_path = gatk_path, 219 | } 220 | } 221 | 222 | # Merge the recalibration reports resulting from by-interval recalibration 223 | call GatherBqsrReports { 224 | input: 225 | input_bqsr_reports = BaseRecalibrator.recalibration_report, 226 | output_report_filename = base_file_name + ".recal_data.csv", 227 | docker_image = gatk_docker, 228 | gatk_path = gatk_path, 229 | } 230 | 231 | scatter (subgroup in CreateSequenceGroupingTSV.sequence_grouping_with_unmapped) { 232 | 233 | # Apply the recalibration model by interval 234 | call ApplyBQSR { 235 | input: 236 | input_bam = SortAndFixTags.output_bam, 237 | input_bam_index = SortAndFixTags.output_bam_index, 238 | output_bam_basename = base_file_name + ".aligned.duplicates_marked.recalibrated", 239 | recalibration_report = GatherBqsrReports.output_bqsr_report, 240 | sequence_group_interval = subgroup, 241 | ref_dict = ref_dict, 242 | ref_fasta = ref_fasta, 243 | ref_fasta_index = ref_fasta_index, 244 | docker_image = gatk_docker, 245 | gatk_path = gatk_path, 246 | } 247 | } 248 | 249 | # Merge the recalibrated BAM files resulting from by-interval recalibration 250 | call GatherBamFiles { 251 | input: 252 | input_bams = ApplyBQSR.recalibrated_bam, 253 | output_bam_basename = base_file_name, 254 | docker_image = gatk_docker, 255 | gatk_path = gatk_path, 256 | compression_level = compression_level 257 | } 258 | 259 | # Outputs that will be retained when execution is complete 260 | output { 261 | File duplication_metrics = MarkDuplicates.duplicate_metrics 262 | File bqsr_report = GatherBqsrReports.output_bqsr_report 263 | File analysis_ready_bam = GatherBamFiles.output_bam 264 | File 
analysis_ready_bam_index = GatherBamFiles.output_bam_index 265 | File analysis_ready_bam_md5 = GatherBamFiles.output_bam_md5 266 | } 267 | } 268 | 269 | # TASK DEFINITIONS 270 | 271 | # Get version of BWA 272 | task GetBwaVersion { 273 | input { 274 | Float mem_size_gb = 1 275 | String docker_image 276 | String bwa_path 277 | } 278 | 279 | command { 280 | echo GetBwaVersion >&2 281 | 282 | # Not setting "set -o pipefail" here because /bwa has a rc=1 and we don't want to allow rc=1 to succeed 283 | # because the sed may also fail with that error and that is something we actually want to fail on. 284 | 285 | set -ux 286 | 287 | ~{bwa_path}bwa 2>&1 | \ 288 | grep -e '^Version' | \ 289 | sed 's/Version: //' 290 | } 291 | runtime { 292 | docker: docker_image 293 | memory: "~{mem_size_gb} GiB" 294 | } 295 | output { 296 | String version = read_string(stdout()) 297 | } 298 | } 299 | 300 | # Read unmapped BAM, convert on-the-fly to FASTQ and stream to BWA MEM for alignment 301 | task SamToFastqAndBwaMem { 302 | # This is the .alt file from bwa-kit (https://github.com/lh3/bwa/tree/master/bwakit), 303 | # listing the reference contigs that are "alternative". Leave blank in JSON for legacy 304 | # references such as b37 and hg19. 305 | input { 306 | File input_bam 307 | String bwa_commandline 308 | String output_bam_basename 309 | File ref_fasta 310 | File ref_fasta_index 311 | File ref_dict 312 | File? ref_alt 313 | File ref_amb 314 | File ref_ann 315 | File ref_bwt 316 | File ref_pac 317 | File ref_sa 318 | 319 | Float mem_size_gb = 56 320 | Int num_cpu = 32 321 | 322 | Int compression_level 323 | 324 | String docker_image 325 | String bwa_path 326 | String gotc_path 327 | } 328 | 329 | command { 330 | set -euo pipefail 331 | 332 | # set the bash variable needed for the command-line 333 | bash_ref_fasta=~{ref_fasta} 334 | 335 | java -Xmx8G -jar ~{gotc_path}picard.jar \ 336 | SamToFastq \ 337 | INPUT=~{input_bam} \ 338 | FASTQ=/dev/stdout \ 339 | INTERLEAVE=true \ 340 | NON_PF=true \ 341 | | \ 342 | ~{bwa_path}~{bwa_commandline} /dev/stdin - \ 343 | | \ 344 | samtools view -1 - > ~{output_bam_basename}.bam 345 | } 346 | runtime { 347 | docker: docker_image 348 | memory: "~{mem_size_gb} GiB" 349 | cpu: num_cpu 350 | } 351 | output { 352 | File output_bam = "~{output_bam_basename}.bam" 353 | } 354 | } 355 | 356 | # Merge original input uBAM file with BWA-aligned BAM file 357 | task MergeBamAlignment { 358 | input { 359 | File unmapped_bam 360 | String bwa_commandline 361 | String bwa_version 362 | File aligned_bam 363 | String output_bam_basename 364 | File ref_fasta 365 | File ref_fasta_index 366 | File ref_dict 367 | 368 | Int compression_level 369 | Int mem_size_gb = 32 370 | 371 | String docker_image 372 | String gatk_path 373 | } 374 | 375 | Int command_mem_gb = ceil(mem_size_gb) - 1 376 | 377 | command { 378 | echo MergeBamAlignment >&2 379 | 380 | set -euxo pipefail 381 | 382 | # set the bash variable needed for the command-line 383 | bash_ref_fasta=~{ref_fasta} 384 | ~{gatk_path} --java-options "-Dsamjdk.compression_level=~{compression_level} -Xmx~{command_mem_gb}G" \ 385 | MergeBamAlignment \ 386 | --VALIDATION_STRINGENCY SILENT \ 387 | --EXPECTED_ORIENTATIONS FR \ 388 | --ATTRIBUTES_TO_RETAIN X0 \ 389 | --ALIGNED_BAM ~{aligned_bam} \ 390 | --UNMAPPED_BAM ~{unmapped_bam} \ 391 | --OUTPUT ~{output_bam_basename}.bam \ 392 | --REFERENCE_SEQUENCE ~{ref_fasta} \ 393 | --PAIRED_RUN true \ 394 | --SORT_ORDER "unsorted" \ 395 | --IS_BISULFITE_SEQUENCE false \ 396 | --ALIGNED_READS_ONLY false \ 397 | 
398 |       --ADD_MATE_CIGAR true \
399 |       --MAX_INSERTIONS_OR_DELETIONS -1 \
400 |       --PRIMARY_ALIGNMENT_STRATEGY MostDistant \
401 |       --PROGRAM_RECORD_ID "bwamem" \
402 |       --PROGRAM_GROUP_VERSION "~{bwa_version}" \
403 |       --PROGRAM_GROUP_COMMAND_LINE "~{bwa_commandline}" \
404 |       --PROGRAM_GROUP_NAME "bwamem" \
405 |       --UNMAPPED_READ_STRATEGY COPY_TO_TAG \
406 |       --ALIGNER_PROPER_PAIR_FLAGS true \
407 |       --UNMAP_CONTAMINANT_READS true
408 |   }
409 |   runtime {
410 |     docker: docker_image
411 |     memory: "~{mem_size_gb} GiB"
412 |     cpu: 2
413 |   }
414 |   output {
415 |     File output_bam = "~{output_bam_basename}.bam"
416 |   }
417 | }
418 | 
419 | # Sort BAM file by coordinate order and fix tag values for NM and UQ
420 | task SortAndFixTags {
421 |   input {
422 |     File input_bam
423 |     String output_bam_basename
424 |     File ref_dict
425 |     File ref_fasta
426 |     File ref_fasta_index
427 | 
428 |     Int compression_level
429 |     Float mem_size_gb = 32
430 | 
431 |     String docker_image
432 |     String gatk_path
433 |   }
434 | 
435 |   command {
436 |     echo SortAndFixTags >&2
437 | 
438 |     set -euxo pipefail
439 | 
440 |     ~{gatk_path} --java-options "-Xmx32G -Xms32G" \
441 |       SortSam \
442 |       --INPUT ~{input_bam} \
443 |       --OUTPUT /dev/stdout \
444 |       --SORT_ORDER "coordinate" \
445 |       --CREATE_INDEX false \
446 |       --CREATE_MD5_FILE false \
447 |     | \
448 |     ~{gatk_path} --java-options "-Dsamjdk.compression_level=6 -Xmx18G -Xms18G" \
449 |       SetNmMdAndUqTags \
450 |       --INPUT /dev/stdin \
451 |       --OUTPUT ~{output_bam_basename}.bam \
452 |       --CREATE_INDEX true \
453 |       --CREATE_MD5_FILE false \
454 |       --REFERENCE_SEQUENCE ~{ref_fasta}
455 | 
456 |   }
457 |   runtime {
458 |     docker: docker_image
459 |     memory: "~{mem_size_gb} GiB"
460 |     cpu: 8
461 |   }
462 |   output {
463 |     File output_bam = "~{output_bam_basename}.bam"
464 |     File output_bam_index = "~{output_bam_basename}.bai"
465 |   }
466 | }
467 | 
468 | # Mark duplicate reads to avoid counting non-independent observations
469 | task MarkDuplicates {
470 |   input {
471 |     File input_bams
472 |     String output_bam_basename
473 |     String metrics_filename
474 | 
475 |     Int compression_level
476 |     Float mem_size_gb = 32
477 | 
478 |     String docker_image
479 |     String gatk_path
480 |   }
481 | 
482 |   Int xmx = ceil(mem_size_gb) - 8
483 |   # Task is assuming query-sorted input so that the Secondary and Supplementary reads get marked correctly.
484 |   # This works because the output of BWA is query-grouped and therefore, so is the output of MergeBamAlignment.
485 |   # While query-grouped isn't actually query-sorted, it's good enough for MarkDuplicates with ASSUME_SORT_ORDER="queryname"
486 |   command {
487 |     echo MarkDuplicates >&2
488 | 
489 |     set -euxo pipefail
490 | 
491 |     ~{gatk_path} --java-options "-Dsamjdk.compression_level=~{compression_level} -Xms~{xmx}G -Xmx~{xmx}G" \
492 |       MarkDuplicates \
493 |       --INPUT ~{input_bams} \
494 |       --OUTPUT ~{output_bam_basename}.bam \
495 |       --METRICS_FILE ~{metrics_filename} \
496 |       --VALIDATION_STRINGENCY SILENT \
497 |       --OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 \
498 |       --ASSUME_SORT_ORDER "queryname" \
499 |       --CREATE_MD5_FILE false
500 |   }
501 |   runtime {
502 |     docker: docker_image
503 |     memory: "~{mem_size_gb} GiB"
504 |     cpu: 4
505 |   }
506 |   output {
507 |     File output_bam = "~{output_bam_basename}.bam"
508 |     File duplicate_metrics = "~{metrics_filename}"
509 |   }
510 | }
511 | 
512 | # Generate sets of intervals for scatter-gathering over chromosomes
513 | task CreateSequenceGroupingTSV {
514 |   input {
515 |     File ref_dict
516 |     Float mem_size_gb = 2
517 |     String docker_image
518 |   }
519 |   # Use Python to create the sequence groupings used to scatter the BaseRecalibrator and ApplyBQSR steps.
520 |   # It writes sequence_grouping.txt and sequence_grouping_with_unmapped.txt, which are parsed into WDL Array[Array[String]] outputs,
521 |   # e.g. [["1"], ["2"], ["3", "4"], ["5"], ["6", "7", "8"]]
522 |   command <<<
523 | 
524 |     echo CreateSequenceGroupingTSV >&2
525 | 
526 |     python <
          [... Python heredoc body not captured in this listing; the line numbering resumes at 563 ...]
          >>>
563 |   runtime {
564 |     docker: docker_image
565 |     memory: "~{mem_size_gb} GiB"
566 |     cpu: 2
567 |   }
568 |   output {
569 |     Array[Array[String]] sequence_grouping = read_tsv("sequence_grouping.txt")
570 |     Array[Array[String]] sequence_grouping_with_unmapped = read_tsv("sequence_grouping_with_unmapped.txt")
571 |   }
572 | }
573 | 
574 | # Generate Base Quality Score Recalibration (BQSR) model
575 | task BaseRecalibrator {
576 |   input {
577 |     File input_bam
578 |     File input_bam_index
579 |     String recalibration_report_filename
580 |     Array[String] sequence_group_interval
581 |     File dbSNP_vcf
582 |     File dbSNP_vcf_index
583 |     File known_indels_vcf
584 |     File known_indels_vcf_index
585 |     File Mills_1000G_indels_vcf
586 |     File Mills_1000G_indels_vcf_index
587 |     #Array[File] known_indels_sites_indices
588 |     File ref_dict
589 |     File ref_fasta
590 |     File ref_fasta_index
591 | 
592 |     Float mem_size_gb = 30
593 | 
594 |     String docker_image
595 |     String gatk_path
596 |   }
597 | 
598 |   Int xmx = ceil(mem_size_gb) - 2
599 | 
600 |   command {
601 |     echo BaseRecalibrator >&2
602 |     set -euxo pipefail
603 | 
604 |     ~{gatk_path} --java-options "-Xmx~{xmx}G" \
605 |       BaseRecalibrator \
606 |       -R ~{ref_fasta} \
607 |       -I ~{input_bam} \
608 |       --use-original-qualities \
609 |       -O ~{recalibration_report_filename} \
610 |       --known-sites ~{dbSNP_vcf} \
611 |       --known-sites ~{known_indels_vcf} \
612 |       --known-sites ~{Mills_1000G_indels_vcf} \
613 |       -L ~{sep=" -L " sequence_group_interval}
614 |   }
615 |   runtime {
616 |     docker: docker_image
617 |     memory: "~{mem_size_gb} GiB"
618 |     cpu: 2
619 |   }
620 |   output {
621 |     File recalibration_report = "~{recalibration_report_filename}"
622 |   }
623 | }
624 | 
625 | # Combine multiple recalibration tables from scattered BaseRecalibrator runs
626 | task GatherBqsrReports {
627 |   input {
628 |     Array[File] input_bqsr_reports
629 |     String output_report_filename
630 | 
631 |     Float mem_size_gb = 8
632 | 
633 |     String docker_image
634 |     String gatk_path
635 |   }
636 | 
637 |   Int xmx = ceil(mem_size_gb) - 2
638 | 
639 |   command {
640 |     echo GatherBqsrReports
641 |     set -euxo pipefail
642 | 
643 |     ~{gatk_path} --java-options "-Xmx~{xmx}G" \
644 |       GatherBQSRReports \
645 |       -I ~{sep=' -I ' input_bqsr_reports} \
646 |       -O ~{output_report_filename}
647 |   }
648 |   runtime {
649 |     docker: docker_image
650 |     memory: "~{mem_size_gb} GiB"
651 |     cpu: 2
652 |   }
653 |   output {
654 |     File output_bqsr_report = "~{output_report_filename}"
655 |   }
656 | }
657 | 
658 | # Apply Base Quality Score Recalibration (BQSR) model
659 | task ApplyBQSR {
660 |   input {
661 |     File input_bam
662 |     File input_bam_index
663 |     String output_bam_basename
664 |     File recalibration_report
665 |     Array[String] sequence_group_interval
666 |     File ref_dict
667 |     File ref_fasta
668 |     File ref_fasta_index
669 | 
670 |     Float mem_size_gb = 8
671 | 
672 |     String docker_image
673 |     String gatk_path
674 |   }
675 | 
676 |   Int xmx = ceil(mem_size_gb) - 2
677 |   command {
678 |     echo ApplyBQSR
679 |     set -euxo pipefail
680 | 
681 |     ~{gatk_path} --java-options "-Dsamjdk.compression_level=6 -Xmx~{xmx}G" \
682 |       ApplyBQSR \
683 |       -R ~{ref_fasta} \
684 |       -I ~{input_bam} \
685 |       -O ~{output_bam_basename}.bam \
686 |       -L ~{sep=" -L " sequence_group_interval} \
687 |       -bqsr ~{recalibration_report} \
688 |       --static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30 \
689 |       --add-output-sam-program-record \
690 |       --create-output-bam-md5 \
691 |       --use-original-qualities
692 |   }
693 |   runtime {
694 |     docker: docker_image
695 |     memory: "~{mem_size_gb} GiB"
696 |     cpu: 2
697 |   }
698 |   output {
699 |     File recalibrated_bam = "~{output_bam_basename}.bam"
700 |   }
701 | }
702 | 
703 | # Combine multiple recalibrated BAM files from scattered ApplyRecalibration runs
704 | task GatherBamFiles {
705 |   input {
706 |     Array[File] input_bams
707 |     String output_bam_basename
708 | 
709 |     Int compression_level
710 |     Float mem_size_gb = 4
711 | 
712 |     String docker_image
713 |     String gatk_path
714 |   }
715 | 
716 |   Int xmx = ceil(mem_size_gb) - 2
717 | 
718 |   command {
719 |     echo GatherBamFiles
720 |     set -euxo pipefail
721 | 
722 |     ~{gatk_path} --java-options "-Dsamjdk.compression_level=~{compression_level} -Xmx~{xmx}G" \
723 |       GatherBamFiles \
724 |       --INPUT ~{sep=' --INPUT ' input_bams} \
725 |       --OUTPUT ~{output_bam_basename}.bam \
726 |       --CREATE_INDEX true \
727 |       --CREATE_MD5_FILE true
728 |   }
729 |   runtime {
730 |     docker: docker_image
731 |     memory: "~{mem_size_gb} GiB"
732 |     cpu: 2
733 |   }
734 |   output {
735 |     File output_bam = "~{output_bam_basename}.bam"
736 |     File output_bam_index = "~{output_bam_basename}.bai"
737 |     File output_bam_md5 = "~{output_bam_basename}.bam.md5"
738 |   }
739 | }
740 | 
--------------------------------------------------------------------------------
/static/arch_diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-omics-end-to-end-genomics/d5898660bc4a7d505f0556201208932df3fe947b/static/arch_diagram.png
--------------------------------------------------------------------------------
/static/stepfunctions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-omics-end-to-end-genomics/d5898660bc4a7d505f0556201208932df3fe947b/static/stepfunctions.png
--------------------------------------------------------------------------------
/static/stepfunctions_graph_workflowstudio.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-omics-end-to-end-genomics/d5898660bc4a7d505f0556201208932df3fe947b/static/stepfunctions_graph_workflowstudio.png
--------------------------------------------------------------------------------
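
Note on the CreateSequenceGroupingTSV task in processing-for-variant-discovery-gatk4.wdl above: the body of its Python heredoc is not captured in this listing (the line numbering jumps from 526 to 563). The sketch below is an illustration only, not the repository's actual script. It shows one way to implement what the task's comments describe: read contig names and lengths from the @SQ lines of the reference sequence dictionary, greedily bin them into groups whose combined length stays at or below the longest contig, and write the groupings as TSV, once as-is and once with a trailing "unmapped" group. The output file names match the task's declared outputs; the command-line invocation is an assumption.

# Illustrative sketch only; not the workflow's actual heredoc script.
import sys

def read_contigs(ref_dict_path):
    """Return (contig_name, contig_length) pairs from the @SQ lines of a .dict file."""
    contigs = []
    with open(ref_dict_path) as handle:
        for line in handle:
            if line.startswith("@SQ"):
                # @SQ lines look like: @SQ\tSN:chr1\tLN:248956422\t...
                fields = dict(
                    field.split(":", 1)
                    for field in line.rstrip("\n").split("\t")[1:]
                    if ":" in field
                )
                contigs.append((fields["SN"], int(fields["LN"])))
    return contigs

def group_contigs(contigs):
    """Greedily bin contigs so each group's total length stays at or below the longest contig."""
    longest = max(length for _, length in contigs)
    groups, current, current_size = [], [], 0
    for name, length in contigs:
        if current and current_size + length > longest:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += length
    if current:
        groups.append(current)
    return groups

if __name__ == "__main__":
    ref_dict = sys.argv[1]  # path to the reference .dict file (hypothetical invocation)
    groups = group_contigs(read_contigs(ref_dict))
    # One grouping file for the BaseRecalibrator scatter, and one with an extra
    # "unmapped" group for the ApplyBQSR scatter, matching the task's outputs.
    with open("sequence_grouping.txt", "w") as out:
        out.write("\n".join("\t".join(group) for group in groups) + "\n")
    with open("sequence_grouping_with_unmapped.txt", "w") as out:
        out.write("\n".join("\t".join(group) for group in groups) + "\nunmapped\n")

For example, given contigs of lengths 100, 60, 50 and 40, this grouping would produce [contig1], [contig2] and [contig3, contig4]: 50 + 40 fits under the longest contig length, while 60 + 50 does not.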