├── LICENSE ├── README.md └── docs ├── bioinfo-solutions.md ├── hiv-bioinfo-solutions.md ├── images ├── PHA4GE_SC2_QC_Workflow.png ├── hiv-bioinfo-solutions-figure-1.svg ├── hiv-bioinfo-solutions-table-2.svg ├── influenza-guidance-fig1.png ├── influenza-guidance-fig2.png ├── omicron_standford.svg ├── pha4ge_sc2_qc_workflow.png └── sc2-recombinants │ ├── covariants21k.png │ ├── covariants21l.png │ ├── ex1-usher-metrics.png │ ├── ex1-usher-tree.png │ ├── ex1.png │ ├── ex2-usher-metrics.png │ ├── ex2-usher-tree.png │ ├── ex2.png │ ├── ex3.png │ ├── mutations1.png │ ├── mutations2.png │ ├── mutations3.png │ ├── nextclade-output.png │ └── nextclade-output2.png ├── influenza-bioinfo-solutions.md ├── mpxv-bioinfo-solutions.md ├── omicron-resources.md ├── pipeline-best-practices.md ├── qc-solutions.md └── sc2-recombinants.md /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | ![PHA4GE logo](https://pha4ge.org/wp-content/uploads/2020/09/phage-logo-thin.png) 3 | *Bioinformatics Pipelines and Visualization Working Group Resources* 4 | 5 | Overview 6 | ======== 7 | This repository hosts [PHA4GE-developed](https://pha4ge.org/) guidance documents and resources that address common challenges regarding the integration of bioinformatics solutions for the global public health community. 
8 | 9 | ## Contents 10 | - [Rationale](#rationale) 11 | - [SARS-CoV-2 Resources](#sars-cov-2-resources) 12 | - [Omicron Variant Resources](/docs/omicron-resources.md) 13 | - [Identifying SARS-CoV-2 Recombinants](docs/sc2-recombinants.md) 14 | - [Bioinformatics Solutions](docs/bioinfo-solutions.md) 15 | - [Validation Data Sets](https://github.com/CDCgov/datasets-sars-cov-2) 16 | - [Quality Control Guidance](docs/qc-solutions.md) 17 | - [Informing Public Health Action](#sars-cov-2-resources) 18 | - [Mpox Resources](#mpox-resources) 19 | - [Bioinformatics Solutions](docs/mpxv-bioinfo-solutions.md) 20 | - [HIV Resources](#hiv-resources) 21 | - [Bioinformatics Solutions](docs/hiv-bioinfo-solutions.md) 22 | - [Bioinformatics Development](#bioinformatics-development) 23 | - [Best Practices for Public Health Bioinformatics Pipelines](https://github.com/pha4ge/public-health-pipeline-best-practices/blob/main/docs/pipeline-best-practices.md) 24 | - [Contributing](#contributing) 25 | 26 | 27 | Rationale 28 | ======== 29 | As public health bioinformatic workflows become increasingly complicated, efforts are needed to promote sensible standardization, portability and reproducibility of assays and workflows across a range of environments, contexts and resource conditions. 30 | 31 | SARS-CoV-2 Resources 32 | ================== 33 | 34 | ### [Omicron Variant Resources](/docs/omicron-resources.md) 35 | 36 | The PHA4GE Pipelines and Visualization Working Group has created this document to highlight critical open-source/access resources to aid in the understanding and further analysis of the Omicron variant. 37 | 38 | ### [Identifying SARS-CoV-2 Recombinants](docs/sc2-recombinants.md) 39 | 40 | SARS-CoV-2 recombinants have garnered the attention of the public health community largely due to the unknown clinical and epidemiological implications. 
This uncertainty emphasizes the need to detect and characterize recombinant SARS-CoV-2 genomes, but the ability to do so rapidly and systematically is not without challenges. Often, recombinant genomes receive an “Unassigned” pango lineage, a non-recombinant pango lineage, or the incorrect recombinant lineage assignment. Additionally, determining the site of recombination within the genome can be difficult for those without extensive SARS-CoV-2 bioinformatics experience. 41 | 42 | The PHA4GE Pipelines and Visualization Working Group has created this document as an attempt to highlight critical sources of information and open-source/access resources to aid in the analysis and surveillance of potential recombinant specimens. 43 | 44 | ### [Bioinformatics Solutions](docs/bioinfo-solutions.md) 45 | 46 | In an attempt to assist this integration process, the bioinformatics pipeline and visualization working group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the major bioinformatics challenges for SC2 genomic analysis and suggest various open-source and freely available bioinformatics resources to address them. 47 | 48 | ### [Validation Data Sets](https://github.com/CDCgov/datasets-sars-cov-2) 49 | 50 | The US Centers for Disease Control and Prevention's Technical Outreach and Assistance for States Team (TOAST) developed benchmark datasets for SARS-CoV-2 sequencing which are designed to help users at varying stages of building sequencing capacity. Rather than duplicating these efforts, the PHA4GE bioinformatics pipeline and visualization working group will be working alongside TOAST members to maintain and improve upon the currently-available validation datasets. 
51 | 52 | ### [Quality Control Guidance](docs/qc-solutions.md) 53 | 54 | In an attempt to assist with quality control (QC) measures, the bioinformatics pipeline and visualization working group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the QC challenges for SC2 genomic analysis and suggest QC system solutions to address them. 55 | 56 | ### Informing Public Health Action 57 | 58 | {In development} 59 | 60 | 61 | Mpox Resources 62 | ================== 63 | 64 | ### [Bioinformatics Solutions](docs/mpxv-bioinfo-solutions.md) 65 | 66 | In an attempt to assist this integration process, the bioinformatics pipeline and visualization working group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the major bioinformatics challenges for Mpox genomic analysis and suggest various open-source and freely available bioinformatics resources to address them. 67 | 68 | HIV Resources 69 | ================== 70 | 71 | ### [Bioinformatics Solutions](docs/hiv-bioinfo-solutions.md) 72 | 73 | Understanding the HIV genome, evolutionary dynamics, and subtypes is essential for designing bioinformatic processes. Here, we present a set of resources to help springboard researchers into the world of HIV bioinformatics. 74 | 75 | Bioinformatics Development 76 | ================== 77 | 78 | ### [Public Health Pipeline Best Practices](https://github.com/pha4ge/public-health-pipeline-best-practices/blob/main/docs/pipeline-best-practices.md) 79 | 80 | In an attempt to assist software developers, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has proposed a set of best practices, tailored specifically for public health bioinformatics pipelines. These best practices aim to provide a guidance framework for the development, testing, and maintenance of bioinformatics software. 
By adhering closely to these best practices, developers can enhance the quality, reliability and sustainability of their software, facilitating impact in public health research. 81 | 82 | Contributing 83 | ============ 84 | Contributions to the documents are more than welcome. To propose a change, edit the source files and open a pull-request with the proposed changes. 85 | 86 | If you're interested in participating in further discussions, please feel free to join the [Working Group](https://pha4ge.org/bioinformatics-pipelines-and-visualization/). 87 | 88 | -------------------------------------------------------------------------------- /docs/bioinfo-solutions.md: -------------------------------------------------------------------------------- 1 | # **Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis** 2 | 3 | PHA4GE Bioinformatics Pipelines & Visualization Working Group
4 | Libuit KG, Park D, van Heusden P, Neher R, Kapsak CJ, Southgate J, Bridges D, Mboowa G, Lunn S, Constantinides B, Varona S, Langhorst B 5 | 6 |
7 | Document Changelog 8 | 9 | - 2023-03-19: 10 | - Add CLI tool ViralConsensus 11 | - Add changelog 12 | - 2023-04-13: 13 | - Add CLI tool nf-core/viralrecon 14 |
15 | 16 | # Overview 17 | 18 | Genomic analysis of SARS-CoV-2 (SC2) samples is an increasingly critical function of public health laboratories around the world. Integration of the appropriate bioinformatics solutions to support this work, however, can be an overwhelming challenge. 19 | 20 | In an attempt to assist this integration process, the Bioinformatics Pipelines and Visualization Working Group of the [Public Health Alliance for Genomic Epidemiology (PHA4GE)](https://www.pha4ge.org) has drafted this living document to help define the major bioinformatics challenges for SC2 genomic analysis and suggest various open-source and freely available bioinformatics resources to address them. 21 | 22 | Please note that the bioinformatics resources listed in this document are simply an attempt to highlight the most accessible solutions **as per the opinions of our working group** and in no way represent a comprehensive list of all available SC2 bioinformatics resources. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull-requests and/or raised GitHub issues. 23 | 24 | # Bioinformatics Challenges for Public Health 25 | 26 | The PHA4GE Bioinformatics Pipeline and Visualization Working Group has defined four key public health bioinformatics challenges for genomic analysis of SC2 samples: 27 | 28 | 1. **Generating consensus assemblies from PCR tiling NGS data:** Tiled amplicon sequencing--through the ARTIC V3 protocol, for example--is the most commonly adopted method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample. 
As a result, one of the initial bioinformatics challenges laboratories face is the assembly of PCR tiling NGS data into a contiguous SC2 genome from which powerful public health insights can be derived, such as lineage typing and genomic epidemiology studies that help inform public-health decision making. 29 | 30 | 2. **Submitting raw sequence data (fastq), consensus assemblies (fasta), and relevant sample metadata to internationally-accessible databases:** Sharing of sample read and assembly data through internationally accessible databases allows insights to be drawn about how the virus is spreading and mutating across the globe; the more freely available these data are to international researchers and public health scientists, the stronger our decision making can be. 31 | 32 | 3. **Screening sequenced SC2 samples for variants of concern:** The detection of certain genetic variants of the SARS-CoV-2 virus may have a significant impact on the decisions of public health officials. Thus, an ability to accurately and reliably screen for variants of interest (VoI) and variants of concern (VoC), such as B.1.1.7 (Alpha) or B.1.617.2 (Delta), is a critical component of the bioinformatics analysis of SC2 genomes. 33 | 34 | 4. **Performing phylogenetic analysis of SC2 datasets:** Genetic relatedness as inferred through phylogenetic analysis of SC2 datasets can be a powerful proxy for epidemiological associations that help resolve transmission networks, enable real-time surveillance, provide insights into how SC2 samples vary over time, and support local outbreak investigations. 35 | 36 | # Open-Access/Source Bioinformatics Solutions & Resources 37 | 38 | ## 1. Generating consensus assemblies from PCR tiling NGS data 39 | 40 | The bioinformatics resources listed below are open-source pipelines that run on general-purpose, containerized workflow infrastructure to generate consensus SC2 assemblies from PCR tiling NGS data. 
While some parameters and modules may differ slightly, each pipeline will perform read mapping to the Wuhan-1 reference genome, remove primer regions from the mapped read data, and generate a consensus assembly based on conserved and variant positions identified in the resulting alignment. These resources have been organized into three categories: [Terra](https://app.terra.bio) and [Galaxy](https://galaxyproject.org/) Workflows, Web-Accessible Software as a Service (SaaS) Solutions, and Command-Line Interface (CLI) tools, and are listed in no particular order. 41 | 42 |
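As a rough illustration of the last of these steps, the Python sketch below calls a consensus sequence from per-position base counts. It assumes that read mapping and primer trimming have already been done upstream and that the alignment has been summarized as one base-count table per reference position. The function name `call_consensus` and the `min_depth` and `min_freq` thresholds are hypothetical choices for this example and are not taken from any of the pipelines listed in this document.

```python
from collections import Counter


def call_consensus(pileup, min_depth=10, min_freq=0.6):
    """Call one consensus base per reference position.

    pileup    -- list of Counter objects, one per position, mapping base -> read count
    min_depth -- hypothetical minimum read depth needed to call a base
    min_freq  -- hypothetical minimum fraction of reads supporting the called base
    """
    consensus = []
    for counts in pileup:
        depth = sum(counts.values())
        # Positions with no or too little coverage are masked as N.
        if depth == 0 or depth < min_depth:
            consensus.append("N")
            continue
        base, support = counts.most_common(1)[0]
        # Ambiguous positions (no base reaches min_freq) are also masked as N.
        consensus.append(base if support / depth >= min_freq else "N")
    return "".join(consensus)


# Toy pileup: a conserved site, a variant site, a low-depth site, a mixed site.
toy_pileup = [
    Counter({"A": 20}),
    Counter({"C": 12, "T": 3}),
    Counter({"G": 4}),
    Counter({"A": 6, "G": 6}),
]
print(call_consensus(toy_pileup))  # prints "ACNN"
```

Masking low-depth and mixed positions with `N` instead of falling back to the reference base is one possible design choice; the pipelines listed below each apply their own parameters and should be consulted directly.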
43 | Terra and Galaxy Workflows 44 | 45 | - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) 46 | - **Brief Description:** The viral-ngs workflow collection contains many tools for viral analysis. The consensus genome caller is called assemble\_refbased; it should work for any low-diversity microbial genome and is appropriate for viruses stemming from a single point-source outbreak, such as SARS-CoV-2. Accepts Illumina paired, single, or mixed reads, as well as ONT reads. Accepts metagenomic or amplicon-based reads with primer trimming. 47 | - **Developed/supported by:** Broad Institute Viral Genomics 48 | - **Documentation:** [Technical documentation (ReadTheDocs)](https://viral-ngs.readthedocs.io/en/latest/) 49 | - **User base:** [H3Africa](https://h3africa.org/index.php/consortium/genomic-characterization-and-surveillance-of-microbial-threats-in-west-africa/) West African sites ([RUN](http://acegid.org/), [KGH](https://vhfc.org/consortium/people/), [UCAD](https://www.ucad.sn/)) 50 | - **Workflow language:** WDL 51 | - **Web/Cloud GUI Platforms:** Terra, DNAnexus 52 | - **CLI Platforms:** Cromwell (local HPC, cloud), miniWDL 53 | - [Theiagen's Public Health Viral Genomics WDL Workflows](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) 54 | - **Brief Description:** Theiagen's Public Health Viral Genomics WDL Workflows include four separate WDL workflows (Titan\_Illumina\_PE, Titan\_Illumina\_SE, Titan\_ClearLabs, and Titan\_ONT) that process NGS read data from four different sequencing approaches (Illumina paired-end, Illumina single-end, Clear Labs, and Oxford Nanopore Technologies (ONT)) to generate consensus assemblies, produce relevant quality-control metrics for both the input read data and the generated assembly, and assign samples a lineage and clade designation using Pangolin and Nextclade, respectively. 
55 | - **Developed/supported by:** Theiagen Genomics 56 | - **Documentation:** [Technical documentation (ReadTheDocs)](https://public-health-viral-genomics-theiagen.readthedocs.io/en/latest/overview.html), [step-by-step protocols (Protocols.io)](https://www.protocols.io/file-manager/9EF18A27777511EBA1C60A58A9FEAC2A), and [video tutorials (YouTube Playlist)](https://www.youtube.com/watch?v=fy0Hm0lfIas&list=PLU47xRg_MKJrtyoFwqGiywl7lQj6vq8Uz) 57 | - **User base:** US PHLs 58 | - **Workflow language:** WDL 59 | - **Web/Cloud GUI Platforms:** Terra 60 | - **CLI Platforms:** Cromwell (local HPC, cloud), miniWDL 61 | - [COVID-19 Galaxy Workflows](https://covid19.galaxyproject.org/artic/) 62 | - **Brief Description:** Several Galaxy workflows for performing SC2 consensus genome assembly have been made available, including a Galaxy workflow for the analysis of SARS-CoV-2 data. 63 | - **Workflow language:** Galaxy 64 | - **Developed/supported by:** usegalaxy.eu ([https://covid19.galaxyproject.org/artic/](https://covid19.galaxyproject.org/artic/)) 65 | - **Web/Cloud GUI Platforms:** [usegalaxy.*](https://galaxyproject.org/use/) 66 | - **Documentation:** [SARS-CoV-2 Data Analysis and Monitoring with Galaxy](https://galaxyproject.eu/event/2021-06-21-sars-cov-2-data-analysis-monitoring-training/) 67 | - **Sequencing technologies supported:** Illumina metagenomic sequencing, Illumina and Oxford Nanopore ARTIC amplicon sequencing 68 | - **Developed/supported by:** ARIES/Istituto Superiore di Sanità 69 | - **Web/Cloud GUI Platforms:** [ARIES Galaxy](https://aries.iss.it/) ([https://aries.iss.it/u/arnold-knijn/w/sars-cov-2recovery31](https://aries.iss.it/u/arnold-knijn/w/sars-cov-2recovery31)) 70 | - **Documentation:** [bioRxiv](https://www.biorxiv.org/content/10.1101/2021.01.16.425365v1) 71 | - **Sequencing technologies supported:** Illumina, Ion Torrent and Oxford Nanopore ARTIC amplicon sequencing 72 |
73 | 74 |
75 | Web-Accessible SaaS Solutions 76 | 77 | - [IDSeq](https://idseq.net/) 78 | - **Brief Description:** User-friendly software platform originally developed for metagenomics studies that has since been repurposed to include SC2 consensus assembly from Oxford Nanopore or paired-end Illumina data 79 | - **Developed/supported by:** [Chan Zuckerberg Initiative (CZI)](https://chanzuckerberg.com/) 80 | - **User base:** CZ Biohub & partners; access available on request to other users 81 | - **User-interface:** Web application on CZI-funded AWS 82 | - [EDGE COVID-19](https://edge-covid19.edgebioinformatics.org/) 83 | - **Brief Description:** EDGE COVID-19 is a derivative of the original EDGE Bioinformatics software (Li _et al._ 2017) that was developed to perform reference-based SC2 assemblies and quality assessment of Illumina or Nanopore read data. 84 | - **Developed/supported by:** Los Alamos National Laboratory 85 | - **Documentation:** [EDGE COVID-19 User Guide](https://edge-covid19.edgebioinformatics.org/docs/EDGE_COVID-19_guide.pdf) 86 | - **User base:** LANL & partners 87 | - **User-interface:** Web application on LANL hardware, [local instance using Docker](https://hub.docker.com/r/bioedge/edge-covid19 88 | ) 89 |
90 | 91 |
92 | Command-line interface (CLI) Tools 93 | 94 | - [SIGNAL (SARS-CoV-2 Illumina GeNome Assembly Line; CanCOGeN/OnCOV)](https://github.com/jaleezyy/covid-19-signal) 95 | - **Brief Description:** Quality control, assembly, and analysis Snakemake workflow for Illumina-based viral amplicon sequencing. Includes de-hosting via competitive mapping, freebayes variant and consensus generation, lineage assignment, interactive HTML run summaries, and integration with the [ncov-tools](https://github.com/jts/ncov-tools/) QC workflow. 96 | - **Developed/supported by:** [CARD/McArthur Lab](https://mcarthurbioinformatics.ca), lead maintainers: Jalees Nasir & Finlay Maguire 97 | - **Documentation:** [Technical Documentation (GitHub README)](https://github.com/jaleezyy/covid-19-signal) 98 | - **User base:** CA PHLs & academic partners 99 | - **User-interface:** CLI (Snakemake) 100 | - [ARTIC nCOV19 (ARTIC Network; Connor-lab)](https://github.com/connor-lab/ncov2019-artic-nf) 101 | - **Brief Description:** Configured conda environment that enables access to Oxford Nanopore or Illumina consensus sequence assemblers: Medaka (ONT), NanoPolish (ONT) or BWA (Illumina) 102 | - **Developed/supported by:** COG UK / ARTIC 103 | - **Documentation:** [Technical Documentation (GitHub README)](https://github.com/connor-lab/ncov2019-artic-nf/blob/master/README.md) 104 | - **User base:** COG UK 105 | - **Workflow language:** Nextflow 106 | - **CLI Platforms:** Nextflow cli client, Nextflow Tower (local HPC, cloud, etc) 107 | - [StaPH-B ToolKit](https://github.com/StaPH-B/staphb_toolkit) 108 | - **Brief Description:** Two StaPH-B workflows for performing SC2 consensus genome assembly have been made available: Cecret, a pipeline developed for the analysis of single or paired-end Illumina reads, and Monroe, a workflow with various subcommands that perform consensus genome assembly from either Illumina or Nanopore read data. 
109 | - **Developed/supported by:** StaPH-B 110 | - **Documentation:** [https://staph-b.github.io/staphb\_toolkit/](https://staph-b.github.io/staphb_toolkit/install/), [Python Package Index (PyPI)](https://pypi.org/project/staphb-toolkit/) 111 | - **User base:** US PHLs 112 | - **User-interface:** CLI (Python package) 113 | - [ViralConsensus](https://github.com/niemasd/ViralConsensus) 114 | - **Brief Description:** A primer-aware consensus assembler developed for efficient assembly of SARS-CoV-2 reads from CRAM/BAM/SAM input. Written in C++. [Preprint](https://www.biorxiv.org/content/10.1101/2023.01.05.522928v1). 115 | - **Developed/supported by:** Niema Moshiri 116 | - **Documentation:** [GitHub README](https://github.com/niemasd/ViralConsensus), [DockerHub](https://hub.docker.com/r/niemasd/viral_consensus) 117 | - **User base:** Unknown 118 | - **User-interface:** CLI (C++ executable) 119 | - [nf-core/viralrecon](https://github.com/nf-core/viralrecon) 120 | - **Brief Description:** nf-core/viralrecon is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples. The pipeline supports both Illumina and Nanopore sequencing data. For Illumina short-reads the pipeline is able to analyse metagenomics data typically obtained from shotgun sequencing (e.g. directly from clinical samples) and enrichment-based library preparation methods (e.g. amplicon-based: ARTIC SARS-CoV-2 enrichment protocol; or probe-capture-based). 
For Nanopore data the pipeline only supports amplicon-based analysis obtained from primer sets created and maintained by the ARTIC Network. 121 | - **Developed/supported by:** [nf-core](https://nf-co.re/) 122 | - **Documentation:** [Github Readme](https://github.com/nf-core/viralrecon), [nf-core documentation](https://nf-co.re/viralrecon) 123 | - **User base:** Unknown 124 | - **Workflow language:** Nextflow 125 | - **CLI Platforms:** Nextflow CLI client, Nextflow Tower (local HPC, cloud, etc) 126 | 127 |
128 | 129 | ## 2. Submitting raw sequence data (fastq), consensus assemblies (fasta), and relevant sample metadata to internationally-accessible databases 130 | 131 | Below is a list of resources developed to assist in the preparation and submission of raw NGS read data (fastq files), SC2 consensus assemblies (fasta files), and contextual sample metadata to internationally-accessible databases such as [NCBI](https://www.ncbi.nlm.nih.gov/sars-cov-2/), [ENA](https://www.ebi.ac.uk/ena/browser/home), and [GISAID](https://www.gisaid.org/). We have also included a list of bioinformatics software designed to assess the quality of SC2 data; we recommend the use of such software prior to submission to avoid the inadvertent sharing of poor quality, contaminated, or otherwise misleading SC2 data. Additional information regarding the interpretation of read and assembly quality metrics for SC2 data will be made available as a separate document. 132 | 133 |
134 | Recommended SC2 Sample Metadata Specifications 135 | 136 | - [PHA4GE Contextual Data Specifications](https://www.preprints.org/manuscript/202008.0220/v1) 137 | - **Database Target(s):** GISAID, ENA, SRA, Genbank 138 | - **Brief Description:** A SARS-CoV-2 contextual data specification based on harmonizable, publicly available, community standards. The specification is implementable via a collection template, as well as an array of protocols and tools to support the harmonization and submission of sequence data and contextual information to public repositories. 139 | - **Developed/supported by:** PHA4GE 140 | - **Documentation:** [Technical documentation (GitHub README)](https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification) 141 | - **User base:** Global public health community 142 | - **Protocols:** [NCBI Submission](http://dx.doi.org/10.17504/protocols.io.bsypnfvn), [ENA Submission](http://dx.doi.org/10.17504/protocols.io.buqnnvve), & [GISAID Submission](http://dx.doi.org/10.17504/protocols.io.bumknu4w) 143 | 144 |
145 | 146 |
147 | Bioinformatics Solutions to Prepare and/or Submit SC2 Sample Data 148 | 149 | - [Galaxy ENA Submission Plugin](https://github.com/galaxyproject/tools-iuc/tree/master/tools/ena_upload) 150 | - **Database Target(s):** ENA 151 | - **Brief Description:** Galaxy plugin for direct submission to the European Nucleotide Archive database 152 | - **Developed/supported by:** [Galaxy IUC (Intergalactic Utilities Commission)](https://galaxyproject.org/iuc/) 153 | - **Documentation:** [https://github.com/ELIXIR-Belgium/ena-upload-container](https://github.com/ELIXIR-Belgium/ena-upload-container) 154 | - **User base:** European PHLs 155 | - **Workflow language:** Galaxy 156 | - **Web/Cloud GUI Platforms:** GalaxyProject 157 | - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) (Terra workflows described above) 158 | - **Database Target(s):** GISAID, GenBank, & SRA 159 | - [Theiagen's Public Health Viral Genomics WDL Workflows](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) (Terra workflows described above) 160 | - **Database Target(s):** GISAID & GenBank (SRA submission in development) 161 | - [EDGE COVID-19](https://edge-covid19.edgebioinformatics.org/) (SaaS solution described above) 162 | - **Database Target(s):** GISAID, GenBank, & SRA 163 | 164 | 165 |
166 | 167 |
168 | Bioinformatics Solutions to Assess Data Quality Prior to Submission 169 | 170 | - [VADR - Viral Annotation DefineR](https://github.com/ncbi/vadr) 171 | - **Brief Description:** VADR is a suite of CLI tools for classifying and analyzing sequences homologous to a set of reference models of viral genomes or gene families. With regards to SC2, laboratories have utilized VADR to identify samples with potentially mis-assembled genomes that are likely to be rejected from an internationally-accessible database. 172 | - **Developed/supported by:** NCBI 173 | - **Documentation:** [Technical Documentation (GitHub Wiki)](https://github.com/ncbi/vadr/wiki/Coronavirus-annotation) 174 | - **User base:** NCBI GenBank & US PHLs 175 | - **Accessibility:** [Local install](https://github.com/ncbi/vadr/blob/master/documentation/install.md#top) or the [StaPH-B Docker Image](https://hub.docker.com/r/staphb/vadr/) 176 | - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) (Terra workflows described above; includes VADR) 177 | - [Titan Workflows for Genomic Characterization](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) (Terra workflows described above; includes VADR) 178 | - [COVID-19 Galaxy Workflows](https://covid19.galaxyproject.org/artic/) (Galaxy resources described above) 179 | - [IDSeq (CZ BioHub)](https://idseq.net/) (SaaS solution described above) 180 | - [EDGE COVID-19](https://edge-covid19.edgebioinformatics.org/) (SaaS solution described above) 181 | - [SIGNAL (SARS-CoV-2 Illumina GeNome Assembly Line; CanCOGeN)](https://github.com/jaleezyy/covid-19-signal) (CLI tool described above) 182 | - [ARTIC nCOV19 (ARTIC Network; Connor-lab)](https://github.com/connor-lab/ncov2019-artic-nf) (CLI tool described above) 183 | - [StaPH-B ToolKit](https://github.com/StaPH-B/staphb_toolkit) (CLI tool described above; VADR included in the Cecret workflow) 184 | 185 |
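To illustrate the kind of basic check that can precede the dedicated QC tools listed above, here is a minimal Python sketch that reports consensus length and the fraction of ambiguous `N` bases. It does not replace VADR or any workflow above. The function names and the default thresholds (`min_length`, `max_n_fraction`) are illustrative assumptions, not the acceptance criteria of any database.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence_id: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line.upper())
    return {seq_id: "".join(chunks) for seq_id, chunks in records.items()}


def consensus_qc(sequence, min_length=29000, max_n_fraction=0.05):
    """Return simple QC metrics for one consensus sequence.

    The default thresholds are hypothetical placeholders; set them
    according to your laboratory's or target database's requirements.
    """
    length = len(sequence)
    n_count = sequence.count("N")
    n_fraction = n_count / length if length else 1.0
    return {
        "length": length,
        "n_count": n_count,
        "n_fraction": n_fraction,
        "passes": length >= min_length and n_fraction <= max_n_fraction,
    }
```

A sequence failing such a check would typically be reviewed, or re-sequenced, before running the fuller annotation-based checks that the tools above provide.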
186 | 187 | ## 3. Screening sequenced SC2 samples for variants of concern & general lineage typing 188 | 189 | These tools either assign a clade or lineage descriptor to consensus sequences or provide databases for lookup of information on variants in the SARS-CoV-2 genome. As variants of concern are listed by their lineage descriptor (typically PANGO lineage or sometimes Nextclade clades) these tools help identify variants of concern. 190 | 191 |
192 | Bioinformatics tools for SC2 lineage or clade assignment 193 | 194 | - [Pangolin (Phylogenetic Assignment of Named Global Outbreak LINeages)](https://cov-lineages.org/pangolin.html) 195 | - **Brief Description:** Tool developed to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. It assigns the most likely lineage (PANGO lineage) to each SARS-CoV-2 query sequence. 196 | - **Developed/supported by:** Pangolin Network 197 | - **Documentation:** [Technical Documentation (Pangolin Website)](https://cov-lineages.org/pangolin.html), [publication (Nature Microbiology)](https://www.nature.com/articles/s41564-020-0770-5) 198 | - **User base:** Global Public Health Community 199 | - **Accessibility:** [Web application](https://pangolin.cog-uk.io/) & [CLI tool](https://github.com/cov-lineages/pangolin) 200 | - **Bioinformatics workflows that incorporate Pango lineage assignments:** 201 | - [Datapipe](https://github.com/COG-UK/datapipe) 202 | - **Brief Description:** Performs alignment and variant calling, assigns lineages with pangolin and VOC/VUI with scorpio and cleans up geography metadata. 
203 | - **Developed/supported by:** Virus Group (University of Edinburgh) 204 | - **User-interface:** command-line tool, nextflow pipeline 205 | - **User base:** COG-UK 206 | - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) (Terra workflows described above) 207 | - [Theiagen's Public Health Viral Genomics WDL Workflows](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) (Terra workflows described above) 208 | - [COVID-19 Galaxy Workflows](https://covid19.galaxyproject.org/artic/) (Galaxy resources described above) 209 | - [IDSeq](https://idseq.net/) (SaaS solution described above) 210 | - [EDGE COVID-19](https://edge-covid19.edgebioinformatics.org/) (SaaS solution described above) 211 | - [SIGNAL (SARS-CoV-2 Illumina GeNome Assembly Line; CanCOGeN)](https://github.com/jaleezyy/covid-19-signal) (CLI tool described above) 212 | - [StaPH-B ToolKit](https://github.com/StaPH-B/staphb_toolkit) (CLI tool described above) 213 | - [NextClade](https://clades.nextstrain.org/) 214 | - **Brief Description:** Tool that identifies differences between your sequences and a reference sequence used by Nextstrain, uses these differences to assign your sequences to clades, and reports potential sequence quality issues in your data 215 | - **User-interface:** [Web application](https://clades.nextstrain.org/) & CLI tool 216 | - **Help/community/discussion:** [discussion.nextstrain.org](http://discussion.nextstrain.org/) 217 | - **Bioinformatics workflows that incorporate NextClade clade assignments:** 218 | - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) (Terra workflows described above) 219 | - [Theiagen's Public Health Viral Genomics WDL Workflows](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) (Terra workflows described above) 220 | - [COVID-19 Galaxy Workflows](https://covid19.galaxyproject.org/artic/) (Galaxy resources described above) 
221 | - [IDSeq](https://idseq.net/) (SaaS solution described above) 222 | - [StaPH-B ToolKit](https://github.com/StaPH-B/staphb_toolkit) (CLI tool described above) 223 | 224 |
225 | 226 | 227 |
228 | Public Health Resources that Track & Visualize SC2 Variants Over Time 229 | 230 | - [PANGO cov-lineages](https://cov-lineages.org/) 231 | - **Brief Description:** Track global prevalences of PANGO lineages 232 | - **Developed/supported by:** Pangolin Network 233 | - [Covariants](https://covariants.org/) 234 | - **Brief Description:** Track global prevalence of Nextclade-annotated lineages 235 | - **Developed/supported by:** NextStrain Team 236 | - [Outbreak.info](https://outbreak.info/) 237 | - **Brief Description:** Epidemiological info including PANGO lineage prevalence 238 | - **Developed/supported by:** [Su](http://sulab.org/), [Wu](http://wulab.io/), and [Andersen](https://andersen-lab.com/) labs at Scripps Research 239 | - [COV-GLUE](http://cov-glue.cvr.gla.ac.uk/) 240 | - **Brief Description:** CoV-GLUE contains a database of amino acid replacements, insertions and deletions which have been observed in GISAID hCoV-19 sequences sampled from the pandemic. 241 | - **Developed/supported by:** COG-UK 242 | - [2019nCoVR](https://bigd.big.ac.cn/ncov/) 243 | - **Brief Description:** 2019nCoVR features comprehensive integration of genomic and proteomic sequences as well as their metadata information from the GISAID, NCBI, NMDC and CNCB/NGDC. It also incorporates a wide range of relevant information including scientific literature, news, and popular articles for science dissemination, and provides visualization functionalities for genome variation analysis results based on all collected SARS-CoV-2 strains. 244 | - **Developed/supported by:** China National Center for Bioinformation (CNCB) 245 | - [CoVizu](https://filogeneti.ca/covizu/) 246 | - **Brief Description:** CoVizu is an [open source project](https://github.com/PoonLab/CoVizu) endeavouring to visualize the global diversity of SARS-CoV-2 genomes, which are provided by the [GISAID Initiative](https://gisaid.org/). 
247 | - **Developed/supported by:** [Poon Laboratory](https://www.schulich.uwo.ca/pathol/people/bios/faculty/poon_art.html) of Western University 248 | - [Annotation of SARS-2 Coronavirus Genome (Observable)](https://observablehq.com/@delphine-l/annotation-of-sars-2-coronavirus-genome) 249 | - **Brief Description:** Annotation of variation in the genome with some notes on what is known about the various amino acids 250 | - **Developed/supported by:** Delphine Lariviere (Penn State University) 251 | 252 |
253 | 254 |
255 | Bioinformatics Tools to Track & Visualize Your Own SC2 Variants Over Time 256 | 257 | - [KRISP R-scripts](https://github.com/krisp-kwazulu-natal/africa-covid19-genomics) 258 | - **Brief Description:** Open-source repository containing all the code, data and information needed to reproduce the analyses for the [African genomic epidemiology manuscript](https://www.nature.com/articles/s41591-021-01255-3). 259 | - **Developed/supported by:** Emmanuel James San (University of KwaZulu-Natal) 260 | - **Documentation:** [Technical Documentation (GitHub README)](https://github.com/krisp-kwazulu-natal/africa-covid19-genomics#readme), [publication (Nature Medicine)](https://www.nature.com/articles/s41591-021-01255-3) 261 | - **Accessibility:** [RCL-Scripts](https://github.com/krisp-kwazulu-natal/africa-covid19-genomics#readme) 262 | - [GISAID Processing](https://github.com/pvanheus/GISAID_processing) 263 | - **Brief Description:** Open-source repository containing Python scripts to process GISAID data into frequency graphs 264 | - **Developed/supported by:** Peter van Heusden (University of Western Cape) 265 | - **Documentation:** [Technical Documentation (GitHub README)](https://github.com/pvanheus/GISAID_processing/blob/main/README.md) 266 | - **Accessibility:** [Python-Scripts](https://github.com/krisp-kwazulu-natal/africa-covid19-genomics#readme) 267 | 268 |
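The idea behind tracking your own variants over time can be sketched as follows: given a collection date and an assigned lineage for each sample, compute each lineage's share of samples per ISO week. This is a minimal illustration only and is not taken from either repository above. The function name and the input shape (pairs of ISO date string and lineage) are assumptions.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_lineage_frequencies(records):
    """Compute per-week lineage proportions.

    `records` is an iterable of (collection_date, lineage) pairs, where
    collection_date is an ISO 8601 string such as "2021-01-04".
    Returns {week_label: {lineage: proportion}} with weeks in sorted order.
    """
    counts = defaultdict(Counter)
    for collection_date, lineage in records:
        iso_year, iso_week, _ = date.fromisoformat(collection_date).isocalendar()
        week_label = f"{iso_year}-W{iso_week:02d}"
        counts[week_label][lineage] += 1

    frequencies = {}
    for week_label in sorted(counts):
        total = sum(counts[week_label].values())
        frequencies[week_label] = {
            lineage: n / total
            for lineage, n in sorted(counts[week_label].items())
        }
    return frequencies
```

The resulting proportions can be passed to any plotting library to draw a stacked frequency graph. Weeks with few samples give noisy proportions, so it is worth reporting the weekly sample count alongside them.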
269 | 270 | ## 4. Performing phylogenetic analysis of SC2 datasets 271 | 272 | _The tools listed below perform phylogenetic analyses of different complexity, ranging from web-apps to command-line tools that need to run on HPC facilities. The selected tools are integrated with visualization features that facilitate the interrogation of the results, but beware that such inferences might be uncertain and often require careful interpretation._ 273 | 274 | 275 |
276 | Public Health Resources Performing Global SC2 Phylogenetic Analysis 277 | 278 | 279 | - [NextStrain](https://nextstrain.org/) 280 | - **Brief Description:** Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data. 281 | - **Developed/supported by:** Fred Hutch/Basel (Nextstrain team) 282 | - **User base:** USA based groups 283 | - **Documentation:** [docs](https://docs.nextstrain.org/en/latest/index.html) 284 | - **Help/community/discussion:** [discussion.nextstrain.org](http://discussion.nextstrain.org/) 285 | - Implementations for compute steps ("augur"): 286 | - [**nextstrain/ncov**](https://github.com/nextstrain/ncov) snakemake pipeline 287 | - **Description:** The authoritative implementation of the Nextstrain "augur" pipeline that takes genomes and metadata to trees and visualizations. 288 | - **Developed/supported by:** Fred Hutch/Basel (Nextstrain team) 289 | - **Workflow language:** Snakemake 290 | - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) (Terra workflows described above) 291 | - [Theiagen's Public Health Viral Genomics WDL Workflows](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) (Terra workflows described above) 292 | - [Microreact](https://microreact.org/) 293 | - **Brief Description:** Open data visualization and sharing for genomic epidemiology 294 | - **Developed/supported by:** Centre for Genomic Pathogen Surveillance (CGPS) 295 | - **User base:** COG-UK, New Zealand, etc 296 | - **User-interface:** Web application / centrally hosted service 297 | 298 |
299 | 300 |
301 | Offlineable Browser-Based Web Applications 302 | 303 | - [Auspice](https://auspice.us/) 304 | - **Brief Description:** Allows interactive exploration of phylogenomic datasets by simply dragging & dropping them onto this page. 305 | - **Developed/supported by:** Fred Hutch/Basel (Nextstrain team) 306 | - **Documentation:** [Technical documentation (GitHub README)](https://github.com/nextstrain/auspice#readme), [NextStrain discussion Forum](https://discussion.nextstrain.org/) 307 | - **User-interface:** offlineable browser-based web app 308 | - [MicrobeTrace](https://microbetrace.cdc.gov/MicrobeTrace/) 309 | - **Brief Description:** The Visualization Multitool for Molecular Epidemiology and Bioinformatics 310 | - **Developed/supported by:** US CDC 311 | - **Documentation:** https://github.com/CDCgov/MicrobeTrace 312 | - **User-interface:** offlineable browser-based web app 313 | - [UShER](https://genome.ucsc.edu/cgi-bin/hgPhyloPlace) 314 | - **Brief Description:** Places user provided sequences on very large reference trees, extracts the relevant subtree, and provides a visualization 315 | - **Developed/supported by:** UCSC 316 | - **User-interface:** offlineable browser-based web app 317 | 318 |
319 | 320 |
321 | Command-line interface (CLI) Tools 322 | 323 | - [Grinch](https://github.com/cov-lineages/grinch) 324 | - **Brief Description:** Generates reports for the international distribution of PANGO lineages that can be viewed in a web browser. 325 | - **Developed/supported by:** PANGO, cov-lineages 326 | - **User-interface:** command-line tool 327 | 328 | - [Phylopipe](https://github.com/cov-ert/phylopipe) 329 | - **Brief Description:** Generates a downsampled global tree using FastTree and updates it daily using UShER, cleans and annotates the tree; can be run on output from Datapipe. 330 | - **Developed/supported by:** Virus Group (University of Edinburgh) 331 | - **User-interface:** command-line tool, nextflow pipeline 332 | - **User base:** COG-UK 333 | 334 |
335 | -------------------------------------------------------------------------------- /docs/hiv-bioinfo-solutions.md: -------------------------------------------------------------------------------- 1 | # HIV Bioinformatics Solutions 2 | 3 | Authors 4 | ======= 5 | 6 | [Amy Gaskin](https://github.com/gaskinae), [Marc Niebel](https://github.com/MarcNiebel), [Frank Ambrosio](https://github.com/frankambrosio3), Abbas Abel Anzaku 7 | 8 | ## Contents 9 | 10 | [Introduction](#introduction) 11 | 12 | [Background Information for Bioinformaticians](#background-information-for-bioinformaticians) 13 | 14 | [Genomic Structure](#genomic-structure) 15 | 16 | [Evolution](#evolution) 17 | 18 | [Subtypes](#subtypes) 19 | 20 | [HIV Bioinformatics Guidance Pathways](#hiv-bioinformatics-guidance-pathways) 21 | 22 | [Genomic Characterisation/Subtyping](#genomic-characterisationsubtyping) 23 | 24 | [Drug Resistance Surveillance](#drug-resistance-surveillance) 25 | 26 | [Drug Development and Resistance Prediction](#drug-development-and-resistance-prediction) 27 | 28 | 29 | [Genomic Epidemiology](#genomic-epidemiology) 30 | 31 | [Sequencing Strategies](#sequencing-strategies) 32 | 33 | [HIV-1 Bioinformatics Tools](#hiv-1-bioinformatics-tools) 34 | 35 | [Assembly](#assembly) 36 | 37 | [Resistance detection](#resistance-detection) 38 | 39 | [Transmission Network Analysis](#transmission-network-analysis) 40 | 41 | [Sequence Databases](#sequence-databases) 42 | 43 | [Case Studies](#case-studies) 44 | 45 | ## Introduction 46 | 47 | 48 | Human Immunodeficiency Virus (HIV), a highly contagious retrovirus, presents a formidable global public health challenge. According to the World Health Organization (WHO), HIV infections lead to severe immunodeficiency and acquired immunodeficiency syndrome (AIDS), causing millions of deaths annually. The virus primarily targets CD4+ T cells, compromising the immune system and necessitating effective treatment strategies. 
Given its high mutation rate, understanding the genomics of HIV is crucial for designing treatments, including antiretrovirals and vaccines. 49 | 50 | In recent years, bioinformatics has played a pivotal role in HIV genomics research, offering insights into viral diversity, drug resistance, and transmission patterns. Applying bioinformatics to HIV research enhances public health initiatives by monitoring genetic variations, detecting mutations, supporting risk assessment, and refining vaccine development. Standardization holds the potential to make HIV genomics research more accessible across diverse global settings, as it enhances transparency, reproducibility, and reliability. 51 | 52 | This paper serves as a guidance document, aiming to address existing challenges within the bioinformatics community related to standardization of HIV genomic analyses. Here, we introduce guidance pathways, which can be likened to distinct themes of bioinformatic analysis specifically tailored to HIV research. These pathways include real-world public health case studies along with relevant HIV bioinformatics tools, making them an invaluable starting point for researchers familiar with microbial bioinformatics but seeking orientation in the world of HIV bioinformatics. 53 | 54 | Therefore, this paper aims to contribute to a global network of knowledge-sharing, promoting equitable access to genomics for HIV -- empowering researchers and clinicians worldwide in pursuit of reducing the burden of disease. 55 | 56 | ## Background Information for Bioinformaticians 57 | 58 | Understanding the HIV genome, its evolutionary dynamics, and its subtypes is essential for designing bioinformatic processes. Here, we present a set of resources to help springboard researchers into the world of HIV bioinformatics! 
59 | 60 | ## Genomic Structure 61 | 62 | The diploid genome of HIV-1 consists of approximately 9700 nucleotides and features nine genes, which encode fifteen proteins (Figure 1)[1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6439341/) that interact with human proteins as part of the HIV-1 viral life cycle. Structural proteins, enzymes, and envelope proteins are encoded by three main genes: gag, pol, and env respectively. The remaining genes encode regulatory (tat, rev) and accessory (vif, vpr, vpu/vpx, nef) proteins. 63 | 64 | Figure 1: HIV-1 DNA genome structure [1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6439341/) 65 | 66 |

67 | 68 |

69 | 70 | ## Evolution 71 | 72 | The HIV-1 population, despite having a relatively small genome (Figure 1), showcases extensive genomic diversity primarily due to its exceptionally high mutation rate. This rapid mutation occurs during replication, leading to an accumulation of genetic variations within the viral population. Furthermore, the virus recombines readily[2](https://pubmed.ncbi.nlm.nih.gov/30687518/), and recombination rates vary across segments of the HIV genome, which further contributes to the overall genomic diversity of the virus. 73 | 74 | In turn, this high genomic diversity gives rise to minor variants that assume a critical role in the development of drug resistance. When exposed to selective pressures, such as antiretroviral drugs or the host immune system, the virus adapts by favouring the proliferation of specific minor variants carrying resistance mutations. 75 | 76 | A comprehensive understanding of these evolutionary mechanisms is crucial, as they pose a significant challenge in the clinical and public health management of HIV-1 by significantly influencing treatment outcomes. 77 | 78 | ## Subtypes 79 | 80 | HIV is classified into types, groups and subtypes according to its genetic diversity. [3](https://pubmed.ncbi.nlm.nih.gov/30882484/) 81 | 82 | **HIV-1 / Group M, N, O, P** 83 | 84 | Group M is the most widespread group, responsible for the majority of infections globally (Table 1). Groups N, O, and P are less common. These variations impact transmission, virulence, and treatment responses. 
85 | 86 | Table 1: Subtypes and main locations for group M 87 | | Subtype | Predominant Region | 88 | | ---------------------- | --------------------------------------- | 89 | | A | Eastern Europe & former Soviet Union countries | 90 | | B | North America and Western Europe | 91 | | C | Sub-Saharan Africa | 92 | | D | East Africa | 93 | | F | Central Africa, Eastern Europe, and South America | 94 | | G | Western and Central Africa | 95 | | H | Central Africa | 96 | | J | Spain | 97 | 98 | ## HIV Bioinformatics Guidance Pathways 99 | 100 | Below, we outline key guidance pathways, which aim to capture distinct themes of common types of bioinformatics analysis specifically tailored to HIV research. These pathways include real-world public health case studies along with relevant sequencing strategies and HIV bioinformatics tools. 101 | 102 | ### Genomic Characterisation/Subtyping 103 | 104 | Subtyping of HIV-1 refers to the categorization of the genome into groups (M, N, O or P) and further into subtypes (A-J & CRFs) or clades. Once a consensus sequence has been produced, several approaches (similarity-based, statistical or phylogenetic) can be used to assign a probable subtype. 
105 | 106 | Sequencing strategy 107 | 108 | - Tiled amplicon WGS (gold standard) 109 | 110 | - Targeted amplicon sequencing (overlap with drug resistance prediction on pol region) 111 | 112 | Analysis 113 | 114 | - Reference-based mapping or de novo assembly methods 115 | 116 | - Consensus sequence generation 117 | 118 | - Assignment of subtype to queried sequence 119 | 120 | Tools 121 | 122 | - minimap2 [4](https://pubmed.ncbi.nlm.nih.gov/29750242/), iva [5](https://pubmed.ncbi.nlm.nih.gov/25725497/), shiver [6](https://pubmed.ncbi.nlm.nih.gov/29876136/) 123 | 124 | - Quasitools HyDRA [7](https://www.biorxiv.org/content/biorxiv/early/2019/08/13/733238.full.pdf), samtools [8](https://pubmed.ncbi.nlm.nih.gov/33590861/), bcftools [8](https://pubmed.ncbi.nlm.nih.gov/33590861/) 125 | 126 | - Stanford HIVdb [9,10,11](https://pubmed.ncbi.nlm.nih.gov/12520007/,https://pubmed.ncbi.nlm.nih.gov/16921473/,https://pubmed.ncbi.nlm.nih.gov/16652319/), REGA [12](https://pubmed.ncbi.nlm.nih.gov/23660484/) 127 | 128 | Case Study: Benchmarking study of HIV-1 subtyping tools for clinical and surveillance purposes [12](https://pubmed.ncbi.nlm.nih.gov/23660484/) 129 | 130 | Description: In this study, HIV-1 pol sequences obtained from Los Alamos were subtyped using various automated subtyping tools, and the results were compared to manual phylogenetic analysis. The study concluded that most automated subtyping tools work well with pure subtypes, especially A & C; however, because sensitivity and specificity varied when subtyping CRFs, multiple tools should be used to confirm an HIV-1 subtype. 
131 | 132 | Tools & databases used: Los Alamos HIV Sequence Database (https://www.hiv.lanl.gov/), REGA [12](https://pubmed.ncbi.nlm.nih.gov/23660484/), COMET [13](https://pubmed.ncbi.nlm.nih.gov/25120265/), jpHMM [14](https://pubmed.ncbi.nlm.nih.gov/16845050/), STAR [15](https://pubmed.ncbi.nlm.nih.gov/16046498/), NCBI [16](https://pubmed.ncbi.nlm.nih.gov/15215470/), Stanford HIVdb [17](https://hivdb.stanford.edu/page/hiv-subtyper/) and SCUEAL [18](https://pubmed.ncbi.nlm.nih.gov/19956739/). 133 | 134 | ### Drug Resistance Surveillance 135 | 136 | The goal of using bioinformatics in drug resistance surveillance is to comprehensively analyse and identify mutations that confer resistance to antiretroviral drugs, such as protease, reverse transcriptase and integrase inhibitors. Targeting the pol region of the genome is useful as this region encodes protease, reverse transcriptase, and integrase. 137 | 138 | This guidance pathway can involve several commonly-used steps, including: aligning sequencing reads to a reference genome (e.g. HXB2: ), followed by de novo assembly methods to reconstruct the HIV genome. 139 | 140 | Assembled contigs can be used to generate an optional consensus sequence, serving as a reference for variant identification. 141 | 142 | Variant calling algorithms can either detect high-frequency genetic variations, or perform more sensitive analysis for minor variant calling, which identifies low-frequency mutations. Both can be used to inform downstream clinical treatment. 143 | 144 | Annotated variants are then cross-referenced with gold-standard databases and interpreted through the lens of literature on HIV drug resistance to assess their potential impact on drug susceptibility in human-readable formats. 
These results can then be bundled into tailored reports so clinicians can interpret these findings and tailor antiretroviral therapy appropriately, minimising treatment failure and optimising patient care. 145 | 146 | Through this integrated approach, this guidance pathway facilitates proactive surveillance and management of HIV drug resistance, ultimately improving treatment efficacy and patient outcomes. 147 | 148 | Sequencing Strategy 149 | 150 | - Targeted amplicon sequencing of pol (Table 2). 151 | 152 | Analysis 153 | 154 | - Reference-based mapping or de novo assembly methods 155 | 156 | - Consensus sequence 157 | 158 | - Variant calling  159 | 160 | - Minor variant calling 161 | 162 | - Database querying 163 | 164 | Tools  165 | 166 | - minimap2 [4](https://pubmed.ncbi.nlm.nih.gov/29750242/), iva [5](https://pubmed.ncbi.nlm.nih.gov/25725497/), shiver [6](https://pubmed.ncbi.nlm.nih.gov/29876136/) 167 | 168 | - VarScan [19](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734323/) 169 | 170 | - Stanford Database (either for pipeline implementation - codon frequency file - or for API query from consensus sequence) [9,10,11](https://pubmed.ncbi.nlm.nih.gov/12520007/,https://pubmed.ncbi.nlm.nih.gov/16921473/,https://pubmed.ncbi.nlm.nih.gov/16652319/) 171 | 172 | Case Study: Bioinformatic data processing pipelines in support of next-generation sequencing-based HIV drug resistance testing: the Winnipeg Consensus [20](https://pubmed.ncbi.nlm.nih.gov/30350345/) 173 | 174 | Pipelines:  175 | 176 | Quasiflow: 177 | 178 | HIV-DRIVES: 179 | 180 | ### Drug Development and Resistance Prediction 181 | 182 | Drug Target Identification: Understanding the genome aids in identifying potential drug targets, such as the Protease, Reverse Transcriptase, and Integrase enzymes. Bioinformatics tools predict inhibitors for these targets. 
183 | 184 | Drug Resistance Prediction: Analysing genomic sequences helps predict drug resistance mutations, informing clinicians about the efficacy of personalised highly active antiretroviral therapies (HAART). The treatment success of HIV infection is affected by the development of viral drug resistance, which complicates clinicians' choice of the right drugs for patients' treatment. This challenge has led to the development of various bioinformatics software tools and databases for predicting drug resistance and responses to combination therapy from viral genotypes.[21](https://link.springer.com/chapter/10.1007/978-981-10-7483-7_16) 185 | 186 | Phylogenetics 187 | 188 | The study of HIV genetic variation and evolution using genomic data serves several important purposes in understanding and combating HIV/AIDS. It allows researchers to reconstruct the evolutionary history of HIV and track transmission dynamics within populations. By analyzing the genetic sequences of HIV strains obtained from infected individuals, researchers can infer relationships between viral lineages, identify transmission clusters, and trace the spread of the virus over time and geographical regions. This information is crucial for understanding patterns of HIV transmission, identifying high-risk populations, and implementing targeted prevention and intervention strategies. 189 | 190 | In addition to understanding transmission dynamics, HIV phylogenomics can shed light on the emergence and spread of drug resistance mutations in the HIV genome. Antiretroviral therapy (ART) is a cornerstone of HIV treatment, but the emergence of drug-resistant strains poses a significant challenge to effective treatment and control efforts around the world. By analyzing the genetic sequences of HIV strains, researchers can identify mutations associated with drug resistance and monitor their prevalence and transmission patterns within communities. 
This information is essential for guiding treatment decisions, designing effective drug regimens, and developing strategies to prevent the spread of drug-resistant HIV strains. 191 | 192 | Furthermore, HIV phylogenomics can provide valuable insights into the broader epidemiology of HIV/AIDS and inform public health responses to outbreaks and epidemics. By integrating genomic data with epidemiological information, researchers can identify sources of infection, map transmission networks, and assess the impact of prevention and control measures. This knowledge can help public health officials allocate resources more effectively, tailor interventions to specific populations, and ultimately reduce the burden of HIV/AIDS on affected communities. This type of analysis can be performed using whole genome sequencing (WGS) or by using one of the more stable regions of the HIV genome such as the pol, env or gag genes, as mutations in these genes will likely have occurred from the process of natural viral evolution, and not from recombinations or from insertions and deletions. 193 | 194 | Sequencing Strategy 195 | 196 | - Tiled amplicon WGS (Table 2). 197 | 198 | - Targeted amplicon sequencing of pol (Table 2). 
199 | 200 | Analysis 201 | 202 | - Reference-based mapping assembly methods  203 | 204 | - Consensus sequence 205 | 206 | - Variant calling  207 | 208 | - Pairwise genomic distance computation 209 | 210 | - Phylogenetic tree inference 211 | 212 | - Phylogenetic tree visualization 213 | 214 | Tools  215 | 216 | - minimap2 [4](https://pubmed.ncbi.nlm.nih.gov/29750242/), iva [5](https://pubmed.ncbi.nlm.nih.gov/25725497/), shiver [6](https://pubmed.ncbi.nlm.nih.gov/29876136/) 217 | 218 | - VarScan [19](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734323/) 219 | 220 | - HIV-TRACE [22](https://pubmed.ncbi.nlm.nih.gov/29401317/) 221 | 222 | Protocol: https://www.researchgate.net/publication/376330219_An_NGS_amplicon_tiling_protocol_for_HIV-1_drug_resistance_detection_using_IlluminaR_COVIDSeq_Assay_Kit_v2 223 | 224 | Pipelines:  225 | 226 | TheiaCoV: https://github.com/theiagen/public_health_bioinformatics/tree/main/workflows/theiacov 227 | 228 | iVar: https://github.com/andersen-lab/ivar 229 | 230 | ### Genomic Epidemiology  231 | 232 | Genomic Epidemiology: Utilising bioinformatics for large-scale analysis of HIV sequence data aids in tracking the spread of viral variants and understanding transmission dynamics. 233 | 234 | Network transmission analysis usually takes place just after phylogenetic tree construction, and these analyses are inherently related by the fact that each requires the investigator to determine the pairwise genomic distances between a set of samples. However, there is a distinction between the use of phylogenetic trees to visualize the genomic diversity and relationships between a set of samples, and the use of the pairwise genomic distances of a set of samples to generate a putative transmission network. 235 | 236 | Genomic distance information can be used to infer which samples may be linked by a transmission event. One can construct a putative transmission network from a set of inferred transmission events. 
By using genomic similarity as a proxy for the likelihood of a direct transmission event, one can map the propagation of a pathogen through a population using high-resolution genomic sequencing data. 237 | 238 | Connections between nodes in the network (edges) are drawn when the genomic distance between two samples falls below a threshold. This threshold is chosen by assessing the distribution of genomic distances among samples known to be associated with the outbreak through traditional epidemiological techniques such as contact tracing. The techniques and tools outlined in the Phylogenetics section above are useful for generating the assemblies, distance matrices and phylogenetic trees that are the inputs for most genomic network construction tools. 239 | 240 | Sequencing Strategy 241 | 242 | - Tiled amplicon WGS (Table 2). 243 | 244 | - Targeted amplicon sequencing of pol (Table 2). 245 | 246 | Analysis 247 | 248 | - Reference-based mapping assembly methods  249 | 250 | - Consensus sequence 251 | 252 | - Variant calling  253 | 254 | - Pairwise genomic distance computation 255 | 256 | - Putative transmission network construction 257 | 258 | - Force-directed network layout visualization 259 | 260 | Tools: 261 | 262 | - MicrobeTrace [23](https://pubmed.ncbi.nlm.nih.gov/34492010/) 263 | 264 | - GrapeTree [24](https://pubmed.ncbi.nlm.nih.gov/30049790/) 265 | 266 | ## Sequencing Strategies 267 | 268 | The sequencing strategy (Table 2) that you adopt depends on multiple factors but should be driven by the question that you are trying to answer. For example, targeted amplification of the pol region has historically been used to assess drug resistance to antiretroviral therapy.
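
Returning briefly to the Genomic Epidemiology section above, the distance-threshold rule for drawing network edges can be sketched as follows. This is a minimal illustration, not the method of any specific tool: it uses a simple p-distance (proportion of differing sites) on pre-aligned sequences, whereas HIV-TRACE uses a TN93 model-based distance. The function names, the toy sequences, and the 0.06 threshold are assumptions made for the example.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites with a gap ('-') or ambiguous base ('N') in either sequence are skipped.
    """
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else float("nan")


def build_network(seqs, threshold):
    """Return edges (id_a, id_b, distance) for pairs at or below the threshold."""
    edges = []
    for (id_a, a), (id_b, b) in combinations(sorted(seqs.items()), 2):
        d = p_distance(a, b)
        if d <= threshold:  # NaN compares False, so uncomparable pairs are skipped
            edges.append((id_a, id_b, d))
    return edges


def connected_components(nodes, edges):
    """Group nodes into putative clusters (connected components of the network)."""
    neighbours = {n: set() for n in nodes}
    for a, b, _ in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Toy aligned sequences (made up for illustration)
seqs = {
    "s1": "ACGTACGTACGTACGTACGT",
    "s2": "ACGTACGTACGTACGTACGA",
    "s3": "ACGTACGTACGAACGTACGA",
    "s4": "TTTTACGTTTTTACGTTTTT",
}
edges = build_network(seqs, threshold=0.06)
clusters = connected_components(seqs, edges)
print(edges)
print(clusters)
```

Note that s1 and s3 end up in the same cluster through s2 even though their direct distance exceeds the threshold, which is why a cluster should be read as a putative grouping rather than as proof of direct transmission.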
269 | 270 | 271 | Table 2: Potential sequencing  strategies for HIV-1 272 | 273 | | Strategy/Application | DR Detection | Subtyping | Phylogenomics| Phylogenetics | 274 | | -------------------- | ------------ | --------- | ------------ | ------------- | 275 | | Targeted Amplicon Sequencing** | ✓ | ✓ | X | ✓ | 276 | | Long-Range PCR | X | ✓ | X | X | 277 | | Tiled amplicon WGS | ✓ | ✓ | ✓ | ✓ | 278 | 279 | ** Will not be able to subtype some circulating recombinant forms due to missing breakpoints (CRF_AE & CRF_BG) [25,26](https://pubmed.ncbi.nlm.nih.gov/11981372/,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC190613/) 280 | 281 | A tiled amplicon sequencing primer scheme has been developed by the Association of Public Health Laboratories (APHL): (DOI:10.17504/protocols.io.n92ldmq4ol5b/v2) 282 | 283 | ## HIV-1 Bioinformatics Tools  284 | 285 | For researchers and clinicians alike, designing a bioinformatics analysis of HIV sequence data comes down to careful selection of bioinformatics tools. The effectiveness of analysis, accuracy of results, and subsequent genomic and epidemiological insights hinge on the appropriateness of the chosen tools. This decision encapsulates the essence of bioinformatics in HIV research, where precise tool selection aligns with the specific research objectives, ensuring that analytical methods harmonize with the intricacies of the virus. Below, we provide a categorised list of bioinformatics tools used in HIV research, distinguishing between general tools applicable to various contexts and those specifically designed for HIV-related analyses. 
 286 | 287 | ### Assembly 288 | 289 | - shiver: A tool for reconstructing HIV genomes from short-read data, with a particular focus on improving on de novo assembly while minimising biased loss of information [6](https://pubmed.ncbi.nlm.nih.gov/29876136/) 290 | - iva: A tool for de novo assembly of RNA virus genomes [5](https://pubmed.ncbi.nlm.nih.gov/25725497/) 291 | ### Subtyping 292 | 293 | Various HIV-1 subtyping tools are available (Table 3), and these have been benchmarked previously [12,27](https://pubmed.ncbi.nlm.nih.gov/23660484/,https://pubmed.ncbi.nlm.nih.gov/28701420/) 294 | 295 | Table 3: HIV-1 subtyping tools 296 | | Tool | Type | CLI | GUI | 297 | | ---- | ---- | --- | --- | 298 | | NCBI [16](https://pubmed.ncbi.nlm.nih.gov/15215470/) | similarity | X | ✓ | 299 | | Stanford [17](https://hivdb.stanford.edu/page/hiv-subtyper/) | similarity | ✓ | ✓ | 300 | | COMET [13](https://pubmed.ncbi.nlm.nih.gov/25120265/) | similarity | ✓ | ✓ | 301 | | jpHMM [14](https://pubmed.ncbi.nlm.nih.gov/16845050/) | statistical | X | ✓ | 302 | | REGA [12](https://pubmed.ncbi.nlm.nih.gov/23660484/) | phylogenetic | X | ✓ | 303 | | SCUEAL [18](https://pubmed.ncbi.nlm.nih.gov/19956739/) | phylogenetic | X | ✓ | 304 | 305 | Multiple considerations need to be taken into account when choosing a subtyping tool. 306 | 307 | The first is that, although the gold standard for HIV-1 subtyping is the full genome, often only the pol region is available. This region allows subtyping of most group M subtypes but cannot differentiate CRF_AE and CRF_BG from the pure parent subtype, because the recombination breakpoint is missing from this region [25,26](https://pubmed.ncbi.nlm.nih.gov/11981372/,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC190613/). The second is that an up-to-date alignment is desirable, especially when considering treatment failures, e.g.
cabotegravir (integrase inhibitor) not working on HIV-1 subtypes A1/A6 [28](https://pubmed.ncbi.nlm.nih.gov/33730748/) 308 | 309 | ### Resistance detection 310 | 311 | Resistance detection is mainly undertaken in reference to HXB2 (Accession Number:K03455) 312 | 313 | - Stanford University HIVdb () : An online database and tool for identifying drug-resistant mutations in HIV-1 using consensus and next-generation sequencing data [9,10,11](https://pubmed.ncbi.nlm.nih.gov/12520007/,https://pubmed.ncbi.nlm.nih.gov/16921473/,https://pubmed.ncbi.nlm.nih.gov/16652319/) 314 | 315 | - Quasitools HyDRA (no longer actively maintained): Command line tool to analyse next-generation sequencing data for cataloging drug resistance mutations using the Stanford University HIVdb [7](https://www.biorxiv.org/content/biorxiv/early/2019/08/13/733238.full.pdf) 316 | 317 | - SierraPy: Python client to query Stanford University HIVdb () 318 | 319 | Other Stanford HIVdb resources which are useful to investigate especially for pipeline implementation: 320 | 321 | - Release Notes ()  322 | 323 | - Web Service () 324 | 325 | - Github repository () 326 | 327 | ### Transmission Network Analysis  328 | 329 | - HIV-TRACE: A command line tool for identifying and visualizing HIV transmission clusters using molecular sequence data [22](https://pubmed.ncbi.nlm.nih.gov/29401317/) 330 | 331 | - Clusterpicker (no longer actively maintained): A command line tool for identifying clusters in a phylogenetic tree based on bootstrap support and pairwise genetic distance within clusters [29](https://pubmed.ncbi.nlm.nih.gov/24191891/) 332 | 333 | 334 | ### Sequence Databases 335 | 336 | - NCBI HIV-1 Human Interaction Database 337 | 338 | ( )[30](https://pubmed.ncbi.nlm.nih.gov/25378338/) 339 | 340 | An online database of HIV-1 sequence data and annotations, including drug resistance and subtype information 341 | 342 | - Los Alamos HIV Sequence Database : A web database which can be searched for HIV sequence data, 
including reference genomes, annotations, and geographic origins of subtypes 343 | 344 | - Stanford University HIVdb: In addition to being the most up-to-date resource for investigating HIV-1 drug resistance(see above) it also has a wealth of information on HIV-1 virus isolates,  published drug susceptibilities and archived treatment episodes (incorporates ARVs received, mutations detected and new regimen initiated with measured viral loads and CD4 counts longitudinally). 345 | 346 | ### Case Studies 347 | 348 | ## 349 | 350 | #### Case Study: *Tracking HIV Transmission Networks in a High-Incidence Area* 351 | 352 | **Description:** Using molecular epidemiology, this study identified transmission networks among individuals with acute HIV infection in a high-incidence region, providing insights into transmission dynamics and hotspots. 353 | 354 | **Citation:** Wertheim JO, *et al.* (2014). "Social and Genetic Networks of HIV-1 Transmission in New York City." PLOS Pathogens, 10(7), e1004280. 355 | 356 | **Tool(s) & databases used:** HIV-TRACE, Los Alamos 357 | 358 | ## 359 | 360 | #### Case Study: *Evolution of Drug Resistance Mutations in Long-Term ART Patients* 361 | 362 | **Description:** This study investigated the dynamics of drug resistance mutations in individuals on long-term therapy (ART), revealing the persistence of archived resistant variants and the importance of continuous monitoring. 363 | 364 | **Citation:** Rhee SY, *et al.* (2005). "HIV-1 Protease and Reverse-Transcriptase Mutations: Correlations with Antiretroviral Therapy in Subtype B Isolates and Implications for Drug-Resistance Surveillance." Journal of Infectious Diseases, 194(4), 454-465. 
365 | 366 | **Tool(s) & databases used:** PAUP, MESQUITE, Stanford HIVdb 367 | 368 | ## 369 | 370 | #### Case Study: *Impact of Drug Resistance Mutations on Treatment Outcomes* 371 | 372 | **Description:** This study assessed the impact of specific drug resistance mutations on treatment response and virological outcomes, contributing to the optimization of treatment regimens for individuals with drug-resistant HIV. 373 | 374 | **Citation:** Gupta RK, *et al.* (2009). "HIV-1 Drug Resistance before Initiation or Re-initiation of First-line Antiretroviral Therapy in Low-Income and Middle-Income Countries: A Systematic Review and Meta-Regression Analysis." The Lancet Infectious Diseases, 9(10), 711-718. 375 | 376 | **Tool(s) & databases used:** Stanford HIVdb 377 | 378 | ## 379 | 380 | #### Case Study: *HIV Phylogenetics to Investigate Cross-Border Transmission* 381 | 382 | **Description:** Using phylogenetic analysis, this study traced cross-border transmission of HIV strains between neighboring countries, highlighting the need for coordinated prevention efforts in the region. 383 | 384 | **Citation:** Novitsky V, *et al.* (2015). "Phylogenetic Relatedness of Circulating HIV-1C Strains in Mochudi, Botswana, and Implications for HIV Subtype Distribution in Botswana." AIDS Research and Human Retroviruses, 31(6), 631-638. 385 | 386 | **Tool(s) & databases used:** Los Alamos 387 | 388 | ## 389 | 390 | #### Case Study: *Cross-clade simultaneous HIV drug resistance genotyping for reverse transcriptase, protease, and integrase inhibitor mutations by Illumina MiSeq* 391 | 392 | **Description:** This study created a universal Illumina MiSeq-based HIV drug resistance genotyping assay, which works across all major group M HIV-1 subtypes and identifies DRMs in the pol gene known to confer resistance to protease, reverse transcriptase, and integrase inhibitors. 393 | 394 | **Citation:** Dudley, D. M., *et al.* (2014). 
"Cross-clade simultaneous HIV drug resistance genotyping for reverse transcriptase, protease, and integrase inhibitor mutations by Illumina MiSeq". Retrovirology, 11, 122.   395 | 396 | **Tool(s) & databases used:** Los Alamos 397 | 398 | References 399 | ========== 400 | 401 | 1. Xiao, Q., Guo, D. & Chen, S. Application of CRISPR/Cas9-Based Gene Editing in HIV-1/AIDS Therapy. Front. Cell. Infect. Microbiol. 9, (2019). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6439341/ 402 | 2. Olabode, A. S. et al. Evidence for a recombinant origin of HIV-1 Group M from genomic variation. Virus Evol. 5, vey039 (2019). https://pubmed.ncbi.nlm.nih.gov/30687518/ 403 | 3. Bbosa, N., Kaleebu, P. & Ssemwanga, D. HIV subtype diversity worldwide. Curr. Opin. HIV AIDS 14, 153--160 (2019). https://pubmed.ncbi.nlm.nih.gov/30882484/ 404 | 4. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18):3094-3100 (2018). https://pubmed.ncbi.nlm.nih.gov/29750242/ 405 | 5. Hunt, M. et al. IVA: accurate de novo assembly of RNA virus genomes. Bioinformatics 31, 2374--2376 (2015). https://pubmed.ncbi.nlm.nih.gov/25725497/ 406 | 6. Wymant, C. et al. Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver. Virus Evol. 4, vey007 (2018). https://pubmed.ncbi.nlm.nih.gov/29876136/ 407 | 7. Marinier, E. et al. quasitools: A Collection of Tools for Viral Quasispecies Analysis. 733238 Preprint at https://doi.org/10.1101/733238 (2019). https://www.biorxiv.org/content/biorxiv/early/2019/08/13/733238.full.pdf 408 | 8. Danecek P, et al. Twelve years of SAMtools and BCFtools. Gigascience 10(2):giab008 (2021).https://pubmed.ncbi.nlm.nih.gov/33590861/ 409 | 9. Rhee, S.-Y. et al. Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res. 31, 298--303 (2003). https://pubmed.ncbi.nlm.nih.gov/12520007/ 410 | 10. Shafer, R. W. Rationale and Uses of a Public HIV Drug-Resistance Database. J. Infect. Dis. 
194, S51--S58 (2006). https://pubmed.ncbi.nlm.nih.gov/16921473/ 411 | 11. Liu, T. F. & Shafer, R. W. Web Resources for HIV Type 1 Genotypic-Resistance Test Interpretation. Clin. Infect. Dis. Off. Publ. Infect. Dis. Soc. Am. 42, 1608--1618 (2006). https://pubmed.ncbi.nlm.nih.gov/16652319/ 412 | 12. Pineda-Peña, A.-C. et al. Automated subtyping of HIV-1 genetic sequences for clinical and surveillance purposes: Performance evaluation of the new REGA version 3 and seven other tools. Infect. Genet. Evol. 19, 337--348 (2013). https://pubmed.ncbi.nlm.nih.gov/23660484/ 413 | 13. Struck, D., Lawyer, G., Ternes, A.-M., Schmit, J.-C. & Bercoff, D. P. COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification. Nucleic Acids Res. 42, e144 (2014). https://pubmed.ncbi.nlm.nih.gov/25120265/ 414 | 14. Zhang, M. et al. jpHMM at GOBICS: a web server to detect genomic recombinations in HIV-1. Nucleic Acids Res. 34, W463--W465 (2006). https://pubmed.ncbi.nlm.nih.gov/16845050/ 415 | 15. Myers, R. et al. A statistical model for HIV-1 sequence classification using the subtype analyser (STAR). Bioinformatics 21(17):3535-40 (2005). https://pubmed.ncbi.nlm.nih.gov/16046498/ 416 | 16. Rozanov, M., Plikat, U., Chappey, C., Kochergin, A. & Tatusova, T. A web-based genotyping resource for viral sequences. Nucleic Acids Res. 32, W654-659 (2004). https://pubmed.ncbi.nlm.nih.gov/15215470/ 417 | 17. HIV Subtyping Program - HIV Drug Resistance Database. https://hivdb.stanford.edu/page/hiv-subtyper/ 418 | 18. Pond, S. L. K. et al. An Evolutionary Model-Based Algorithm for Accurate Phylogenetic Breakpoint Mapping and Subtype Prediction in HIV-1. PLOS Comput. Biol. 5, e1000581 (2009). https://pubmed.ncbi.nlm.nih.gov/19956739/ 419 | 19. Koboldt, D. C. et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25(17):2283-5 (2009).https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734323/ 420 | 20. Ji, H., Enns, E., Brumme, C. 
J., Parkin, N., Howison, M., Lee, E. R., Capina, R., Marinier, E., Avila-Rios, S., Sandstrom, P., Van Domselaar, G., Harrigan, R., Paredes, R., Kantor, R., & Noguera-Julian, M. (2018). Bioinformatic data processing pipelines in support of next-generation sequencing-based HIV drug resistance testing: the Winnipeg Consensus. Journal of the International AIDS Society, 21(10), e25193. https://pubmed.ncbi.nlm.nih.gov/30350345/ 421 | 21. Mannu, J. & Mathur, P. P. Role of Bioinformatics in Drug Resistance Prediction for HIV/AIDS. in Current trends in Bioinformatics: An Insight (eds. Wadhwa, G., Shanmughavel, P., Singh, A. K. & Bellare, J. R.) 277--286 (Springer, Singapore, 2018). doi:10.1007/978-981-10-7483-7_16. https://link.springer.com/chapter/10.1007/978-981-10-7483-7_16 422 | 22. Kosakovsky Pond, S. L., Weaver, S., Leigh Brown, A. J. & Wertheim, J. O. HIV-TRACE (TRAnsmission Cluster Engine): a Tool for Large Scale Molecular Epidemiology of HIV-1 and Other Rapidly Evolving Pathogens. Mol. Biol. Evol. 35, 1812--1819 (2018). https://pubmed.ncbi.nlm.nih.gov/29401317/ 423 | 23. Campbell, EM. et al. MicrobeTrace: Retooling molecular epidemiology for rapid public health response PLoS Comput Biol. 17(9):e1009300 (2021). https://pubmed.ncbi.nlm.nih.gov/34492010/ 424 | 24. Zhou Z. et al. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res. 28(9):1395-1404 (2018). https://pubmed.ncbi.nlm.nih.gov/30049790/ 425 | 25. Delgado, E. et al. Identification of a Newly Characterized HIV-1 BG Intersubtype Circulating Recombinant Form in Galicia, Spain, Which Exhibits a Pseudotype-Like Virion Structure. JAIDS J. Acquir. Immune Defic. Syndr. 29, 536 (2002). https://pubmed.ncbi.nlm.nih.gov/11981372/ 426 | 26. Carr, J. K. et al. Full-length sequence and mosaic structure of a human immunodeficiency virus type 1 isolate from Thailand. J. Virol. 70, 5935--5943 (1996). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC190613/ 427 | 27. Fabeni, L. et al. 
Comparative Evaluation of Subtyping Tools for Surveillance of Newly Emerging HIV-1 Strains. J. Clin. Microbiol. 55, 2827--2837 (2017). https://pubmed.ncbi.nlm.nih.gov/28701420/ 428 | 28. Cutrell, A. G. et al. Exploring predictors of HIV-1 virologic failure to long-acting cabotegravir and rilpivirine: a multivariable analysis. AIDS Lond. Engl. 35, 1333--1342 (2021). https://pubmed.ncbi.nlm.nih.gov/33730748/ 429 | 29. Ragonnet-Cronin, M. et al. Automated analysis of phylogenetic clusters. BMC Bioinformatics 14, 317 (2013). https://pubmed.ncbi.nlm.nih.gov/24191891/ 430 | 30. Ako-Adjei, D. et al. HIV-1, human interaction database: current status and new features. Nucleic Acids Res. 43, D566-570 (2015). https://pubmed.ncbi.nlm.nih.gov/25378338/ 431 | 432 | 433 | 434 | -------------------------------------------------------------------------------- /docs/images/PHA4GE_SC2_QC_Workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/PHA4GE_SC2_QC_Workflow.png -------------------------------------------------------------------------------- /docs/images/influenza-guidance-fig1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/influenza-guidance-fig1.png -------------------------------------------------------------------------------- /docs/images/influenza-guidance-fig2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/influenza-guidance-fig2.png -------------------------------------------------------------------------------- /docs/images/omicron_standford.svg: 
-------------------------------------------------------------------------------- 1 | 110018026034042050058066079596511351273ORF1abSpikeNTDRBDRBMSD1SD2S1/S2NA67VΔ69-70T95IG142DΔ143-145Δ211L212IR214InsertionG339DS371LS373PS375FK417NN440KG446SS477NT478KE484AQ493KG496SQ498RN501YY505HT547KD614GH655YN679KP681HN764KD796YN856KQ954HN969KL981F 2 | -------------------------------------------------------------------------------- /docs/images/pha4ge_sc2_qc_workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/pha4ge_sc2_qc_workflow.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/covariants21k.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/covariants21k.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/covariants21l.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/covariants21l.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/ex1-usher-metrics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex1-usher-metrics.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/ex1-usher-tree.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex1-usher-tree.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/ex1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex1.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/ex2-usher-metrics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex2-usher-metrics.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/ex2-usher-tree.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex2-usher-tree.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/ex2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex2.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/ex3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex3.png -------------------------------------------------------------------------------- 
/docs/images/sc2-recombinants/mutations1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/mutations1.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/mutations2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/mutations2.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/mutations3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/mutations3.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/nextclade-output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/nextclade-output.png -------------------------------------------------------------------------------- /docs/images/sc2-recombinants/nextclade-output2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/nextclade-output2.png -------------------------------------------------------------------------------- /docs/mpxv-bioinfo-solutions.md: -------------------------------------------------------------------------------- 1 | # **Bioinformatics Solutions for Mpox Genomic Analysis** 2 | 3 | PHA4GE Bioinformatics 
Pipelines & Visualization Working Group
4 | Libuit KG, Southgate J, Ünal G, Maguire F, Smith E, Kapsak S, van Heusden P, Wright S, Neher R, Diallo A 5 | 6 |
 7 | Document Changelog 8 | 9 | - 2022-10-10: 10 | - First draft published 11 | - 2022-11-28: 12 | - Nomenclature update: Monkeypox -> Mpox 13 | - 2023-03-09: 14 | - Add changelog 15 | - 2024-08-23: 16 | - Add PolkaPox and TOSTADAS details 17 |
 18 | 19 | 20 | # Overview 21 | 22 | Genomic analysis of Mpox virus (MPXV) samples by public health laboratories is a critical component in understanding the global outbreak. The integration and awareness of appropriate bioinformatics tools to support these endeavours are potential challenges. 23 | 24 | In an attempt to assist this integration process, the Bioinformatics Pipelines and Visualization Working Group of the [Public Health Alliance for Genomic Epidemiology (PHA4GE)](https://www.pha4ge.org) has drafted this living document to help define the major bioinformatics challenges for MPXV genomic analysis and suggest various open-source and freely available bioinformatics resources to address them. 25 | 26 | Please note that the bioinformatics resources listed in this document are simply an attempt to highlight the most accessible solutions **as per the opinions of our working group** and in no way represent a comprehensive list of all available MPXV bioinformatics resources. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull-requests and/or raised GitHub issues. 27 | 28 | 29 | # Background 30 | 31 | Mpox is a viral zoonosis caused by MPXV, which belongs to the genus Orthopoxvirus in the family Poxviridae. The virus can be transmitted to humans from animals. After the eradication of smallpox in 1980, mpox became the most important orthopoxvirus for public health. MPXV is an enveloped double-stranded DNA virus with two distinct genetic clades: the central African (Congo Basin) clade and the west African clade. The Congo Basin clade has historically caused more severe disease and is thought to be more transmissible [WHO](https://www.who.int/news-room/fact-sheets/detail/monkeypox). The clinical presentation of mpox is similar to that of smallpox, and prior vaccination against smallpox can provide some cross-immunity.
The case fatality rate varies from 1% to 10%, and transmission between humans mainly occurs through direct contact, body fluids, or droplets [Berthet, N. et al.](https://rdcu.be/cTOiG). 32 | 33 | MPXV has a linear DNA genome of ≈197 kb. As in other orthopoxviruses, the central coding region sequence (CRS) of MPXV, located between nucleotide positions ≈56000 and ≈120000, is highly conserved. The genes at the terminal ends of the MPXV genome are responsible for immunomodulation, host range and pathogenicity, and the inverted terminal repeat (ITR) region contains at least 4 ORFs [Kugelman, JR et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3901482/). 34 | 35 | 36 | # Public Mpox Case Databases 37 | 38 | This [repository](https://github.com/globaldothealth/monkeypox) contains dated records of curated Mpox cases from the 2022 outbreak (April - ), a data dictionary, and a script used to pull contents from a spreadsheet into JSON and CSV files. 39 | 40 | The downloadable [data file](https://www.ecdc.europa.eu/en/publications-data/data-monkeypox-cases-eueea) contains information on the number of mpox cases reported by EU/EEA countries or collected through epidemiologic intelligence at ECDC. Each row contains the corresponding data for a country, day of reporting, number of cases and source of information (data are in long format). The file is updated twice a week. You may use the data in line with ECDC’s copyright and data usage policy. 41 | 42 | This [report](https://monkeypoxreport.ecdc.europa.eu/) provides an overview of the total number of cases of mpox identified by ECDC and the WHO Regional Office for Europe through IHR mechanisms and official public resources, and of case-based data reported through The European Surveillance System (TESSy) up to 9 August 2022. The first summary table and maps (first two tabs) describe the number of cases identified through the different platforms.
The following figures and tables describe national case-based data for surveillance of mpox reported in TESSy from all the countries and areas of the WHO European Region, including the 27 countries of the European Union (EU) and the additional three countries of the European Economic Area (EEA). 43 | 44 | 45 | # Bioinformatics Challenges for Public Health 46 | 47 | The PHA4GE Bioinformatics Pipelines and Visualization Working Group has defined four key public health bioinformatics challenges for genomic analysis of MPXV samples: 48 | 49 | 1. **Generating consensus assemblies** 50 | 51 | 2. **Submission of sequence data to internationally accessible databases** 52 | 53 | 3. **Screening for Variants of Concern** 54 | 55 | 4. **Performing Phylogenetic analysis of MPXV datasets** 56 | 57 | # Open-Access/Source Bioinformatics Solutions & Resources 58 | 59 | ## Video resources 60 | 61 | - [BV-BRC Mpox and Orthopoxvirus Mini Symposium Playlist](https://youtube.com/playlist?list=PLWfOyhOW_OavOhvmuyUf19nsYASClMnXU) 62 | 63 | ## Sequencing resources 64 | 65 | - [PrimalSeq amplicon scheme and protocol](https://www.protocols.io/view/monkeypox-virus-multiplexed-pcr-amplicon-sequencin-cd8ds9s6) 66 | - [Yale tiled-amplicon protocol](https://www.protocols.io/view/monkeypox-virus-multiplexed-pcr-amplicon-sequencin-5qpvob1nbl4o/v2) 67 | 68 | 69 | ## Generating consensus assemblies 70 | - [TheiaCoV workflows (for Illumina SE/PE, ONT, and fasta files) with MPXV input variables](https://www.protocols.io/view/monkeypox-virus-multiplexed-pcr-amplicon-sequencin-cd8ds9s6) 71 | - Supports amplicon and metagenomic data 72 | - [GalaxyProject MPXV analysis effort](https://galaxyproject.org/projects/mpxv/) 73 | - Only supports Illumina PE metagenomic data 74 | - [Nextflow workflow from the Utah PHL](https://github.com/UPHL-BioNGS/Cecret#monkeypox) 75 | - Supports amplicon and metagenomic data 76 | - [Epi2Me](https://labs.epi2me.io/basic-monkeypox-workflow/) 77 | - Only supports metagenomic data 78
| - [Viral-Recon](https://github.com/nf-core/viralrecon): 79 | - Workflow for raw read quality control, de-hosting, assembly, variant calling, and consensus generation for illumina and nanopore monkeypox data. Currently does not include pre-built support for monkeypox (e.g., reference genome, reference annotations, nextclade dataset, and amplicon schemes) but these can be user-supplied on the command line and should be appropriate to the sequencing method (e.g., for amplicon sequencing using the reference used to create the amplicon scheme and for metagenomic sequencing, to be consistent with Nextstrain, you can use NC_063383.1.fasta, NC_063383.1.gff, with the nextclade dataset nextclade_hMPXV_B1_pseudo_ON563414_XXXXXXX). 80 | - [PolkaPox](https://github.com/CDCgov/polkapox): 81 | - Nextflow workflow for taxonomic filtering, trimming, quality control, reference-based analysis, and de novo assembly of Illumina metagenomic sequencing reads from orthopoxviruses, including multiple lineages of MPXV. 82 | 83 | ## Submission of sequence data to international accessible databases 84 | - [Sample Metadata Specifications](https://sprcdn-assets.sprinklr.com/1652/133486a8-9b49-4461-a0d7-211c140947cc-562840094.pdf) 85 | - Preparation and/or Submission of Samples 86 | - Terra_2_NCBI workflow (only SRA/BioSample at the moment) for programmatic submission of raw read data analysed on Terra to SRA and BioSample 87 | - [NCBI guide to submit consensus sequences using BankIt](https://www.ncbi.nlm.nih.gov/genbank/monkeypox_submission/) 88 | - [TOSTADAS](https://github.com/CDCgov/tostadas): Metadata validation, standardized gene annotation, and programmatic NCBI submission (Biosample, SRA, Genbank). 89 | - Assess Data Quality Prior to Submission 90 | 91 | 92 | ## Screening for Variants of Concern 93 | 94 | - [Nextclade](https://clades.nextstrain.org/) 95 | - assignment of consensus sequences to Nextstrain clades, quality control, and mutation effect annotation. 
    References pre-built for inferred ancestral monkeypox, the human monkeypox clade, and the specific B.1 human monkeypox clade.

## Performing Phylogenetic analysis of MPXV datasets

- [Augur](https://docs.nextstrain.org/projects/augur/en/stable/index.html)
  - A bioinformatics toolkit for phylogenetic analysis which constructs phylogenetic trees that can be visualised in Nextstrain

- [Nextstrain Mpox build workflow](https://github.com/nextstrain/monkeypox)
  - Workflow to perform contextualised phylogenetic analysis of monkeypox consensus sequences (by default using the human monkeypox reference genome NC_063383.1)

- [Taxonium](https://taxonium.org/?treeUrl=https%3A%2F%2Fns-proxy.vercel.app%2Fapi%2Fcharon%2FgetDataset%3Fprefix%3Dmonkeypox%2Fhmpxv1&ladderizeTree=true&treeType=nextstrain&color=%7B%22field%22%3A%22meta_country%22%7D)
  - Tool for exploring large phylogenetic trees - Mpox sequences from GenBank

## Publicly available data
To help with getting started on phylogenetic analysis, Nextstrain provides the MPXV data available on NCBI in aggregated form:
- [Sequences](https://data.nextstrain.org/files/workflows/monkeypox/sequences.fasta.xz)
- [Metadata](https://data.nextstrain.org/files/workflows/monkeypox/metadata.tsv.gz)

Pairwise alignments with [Nextclade](https://clades.nextstrain.org/) against the [reference sequence MPXV-M5312_HM12_Rivers](https://www.ncbi.nlm.nih.gov/nuccore/NC_063383), insertions relative to the reference, and translated ORFs are available:

- [Alignment](https://data.nextstrain.org/files/workflows/monkeypox/alignment.fasta.xz)
- [Insertions](https://data.nextstrain.org/files/workflows/monkeypox/insertions.csv.gz)
- [Translations](https://data.nextstrain.org/files/workflows/monkeypox/translations.zip)

--------------------------------------------------------------------------------
/docs/omicron-resources.md:
--------------------------------------------------------------------------------
# Omicron Variant Resources

**PHA4GE Bioinformatics Pipelines & Visualization Working Group**

Libuit KG, Spinler JK, Southgate J, Black A, Nekrutenko A, Neuhaus B, O’Cathail C, Lemmer D, Jones D, Smith E, Gnimpieba E, Guthrie J, Maturure P, Monsierurs P, Maier W, Langhorst B, Page A, & Niewiadomska AM

Document Changelog

- 2021-12-19:
  - Added section detailing Omicron lineage and clade nomenclature, COVID-19 scenario modeling resource, and additional reference sequences
  - Updated Pangolin and Nextclade software minimums and resource links for genomic information (e.g. defining mutations), visualizations, and global case counts over time to include B.1.1.529 sublineages
- 2022-10-10:
  - Updated variant designations
  - Historical Information / Archived Data added
- 2023-03-09:
  - Format changelog

# Overview

The [World Health Organization (WHO) has classified the SARS-CoV-2 B.1.1.529 variant as a Variant of Concern (VOC)](https://www.who.int/news/item/26-11-2021-classification-of-omicron-(b.1.1.529)-sars-cov-2-variant-of-concern) on the advice of the [Technical Advisory Group on SARS-CoV-2 Virus Evolution (TAG-VE)](https://www.who.int/groups/technical-advisory-group-on-sars-cov-2-virus-evolution), an independent group of experts that periodically monitors and evaluates the evolution of SARS-CoV-2 and assesses whether specific mutations and combinations of mutations alter the behavior of the virus. The WHO has assigned the B.1.1.529 VOC the label Omicron per [their Greek-letter key variant assignment system](https://www.who.int/news/item/31-05-2021-who-announces-simple-easy-to-say-labels-for-sars-cov-2-variants-of-interest-and-concern). The elevation of Omicron to a WHO-designated VOC was based on the TAG-VE's assessment of the variant’s large number of genomic mutations and plausible impact on COVID-19 epidemiology.

The PHA4GE Pipelines and Visualization Working Group has created this document to highlight critical open-source/open-access resources to aid in the understanding and further analysis of the Omicron variant.

In no way does this document represent a comprehensive list of all available SC2 bioinformatics resources. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull requests and/or raised GitHub issues.

## Contents
- [General Information on the Omicron Variant](#general-information-on-the-omicron-variant)
  - [Omicron Lineage and Clade Nomenclature](#omicron-lineage-and-clade-nomenclature)
  - [Educational Material](#educational-material)
  - [Public Health Announcements and Publications](#public-health-announcements-and-publications)
  - [Technical Details and Global Trackers](#technical-details-and-global-trackers)
  - [Phylogenetic Visualizations](#phylogenetic-visualizations)
  - [Data Reporting and Sharing](#data-reporting-and-sharing)
- [Potential impacts of Spike Protein Mutations](#potential-impacts-of-spike-protein-mutations)
  - [Diagnostic and Sequencing Assays](#diagnostic-and-sequencing-assays)
- [Bioinformatics Resources and Considerations](#bioinformatics-resources-and-considerations)
  - [Software Version Minimums](#software-version-minimums)
  - [Reference Sequences](#reference-sequences-and-assemblies)
  - [SARS-CoV-2 Multiple Sequence Alignments](#sars-cov-2-multiple-sequence-alignments)

# General Information on the Omicron Variant
Below is a list of educational material, public health announcements and publications, technical details and global trackers, phylogenetic visualizations, and resources to assist in data sharing and reporting of the Omicron variant.

## Omicron Lineage and Clade Nomenclature
- The Omicron Variant is the [WHO SARS-CoV-2 VOC label](https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/) for the Pango lineage B.1.1.529 (includes BA.1, BA.2, BA.3, BA.4, BA.5, their descendants, and the recombinant lineage XE). Nextstrain designations include 21K, 21L, 21M, 22A, 22B, and 22C.

## Educational Material
- [Nature News Article - Heavily mutated Omicron variant puts scientists on alert](https://www.nature.com/articles/d41586-021-03552-w): Overview of the identified variant and its potential public health impacts.
- [Theiagen Genomics Primer the Omicron Variant (Video)](https://www.youtube.com/watch?v=xhyWjPgdP9U): To assist public health scientists' understanding of the Omicron Variant, Frank Ambrosio recorded a short primer that includes an overview of the Nature news article by Ewen Callaway, visual depictions of key Omicron mutations, and the genetic diversity of Omicron relative to other SARS-CoV-2 variants using MicrobeTrace.

## Public Health Announcements and Publications
- [Classification of Omicron (B.1.1.529): SARS-CoV-2 Variant of Concern (World Health Organization)](https://www.who.int/news/item/26-11-2021-classification-of-omicron-(b.1.1.529)-sars-cov-2-variant-of-concern)
- [CDC Statement on B.1.1.529 (Omicron variant)](https://www.cdc.gov/media/releases/2021/s1126-B11-529-omicron.html)
- [CDC Science Brief: Omicron (B.1.1.529) Variant](https://www.cdc.gov/coronavirus/2019-ncov/science/science-briefs/scientific-brief-omicron-variant.html)
- [SARS-CoV-2 variants of concern as of 3 December 2021 (ECDC)](https://www.ecdc.europa.eu/en/covid-19/variants-concern)
- [Implications of the further emergence and spread of the SARS-CoV-2 B.1.1.529 variant of concern (Omicron) for the EU/EEA (ECDC, 2021-12-02)](https://www.ecdc.europa.eu/sites/default/files/documents/threat-assessment-covid-19-emergence-sars-cov-2-variant-omicron-december-2021.pdf)
- [SARS-CoV-2 variants of concern and variants under investigation in England (UK Health Security Agency)](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1036501/Technical_Briefing_29_published_26_November_2021.pdf)
- [Genomic surveillance of SARS-CoV-2 in Belgium (National Reference Laboratory (UZ Leuven & KU Leuven))](https://assets.uzleuven.be/files/2021-11/genomic_surveillance_update_211126.pdf)
- [SARS-CoV-2 variants of concern and variants under investigation in England Variant of concern: Omicron,
VOC21NOV-01 (B.1.1.529); Technical briefing 30 (2021-12-03)](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1038404/Technical_Briefing_30.pdf)

## Technical Details and Global Trackers

- Various resources for genomic information (e.g. defining mutations), visualizations, and global case counts over time:
  - COV-Lineage Variant Summary Pages: [B.1.1.529](https://cov-lineages.org/lineage.html?lineage=B.1.1.529), [BA.1](https://cov-lineages.org/lineage.html?lineage=BA.1), [BA.2](https://cov-lineages.org/lineage.html?lineage=BA.2), [BA.3](https://cov-lineages.org/lineage.html?lineage=BA.3), [BA.4](https://cov-lineages.org/lineage.html?lineage=BA.4), [BA.5](https://cov-lineages.org/lineage.html?lineage=BA.5) & [XE](https://cov-lineages.org/lineage.html?lineage=XE)
  - BV-BRC Lineage Profiles: [BA.1](https://bv-brc.org/view/VariantLineage/#view_tab=lineage&loc=BA.1), [BA.2](https://bv-brc.org/view/VariantLineage/#view_tab=lineage&loc=BA.2), [BA.3](https://bv-brc.org/view/VariantLineage/#view_tab=lineage&loc=BA.3), [BA.4](https://www.bv-brc.org/view/VariantLineage/#view_tab=lineage&loc=BA.4), [BA.5](https://www.bv-brc.org/view/VariantLineage/#view_tab=lineage&loc=BA.5)
  - [Outbreak.info Omicron Variant Report](https://outbreak.info/situation-reports/omicron)
  - CoVariants (Omicron) Profiles for Nextstrain: [21K](https://covariants.org/variants/21K.Omicron), [21L](https://covariants.org/variants/21L.Omicron), [22A](https://covariants.org/variants/22A.Omicron), [22B](https://covariants.org/variants/22B.Omicron), [22C](https://covariants.org/variants/22C.Omicron), [22D](https://covariants.org/variants/22D.Omicron)
  - [CNCB RCoV19 Lineage Browser](https://ngdc.cncb.ac.cn/ncov/lineage?lineage=B.1.1.529#goto)
  - [COVID-19 Scenario Modeling Hub](https://covid19scenariomodelinghub.org/viz.html): Synthesis of over 30 COVID-19 models for public health forecasting

## Phylogenetic Visualizations
- [NextStrain Build of B.1.1.529 (21K)](https://nextstrain.org/groups/neherlab/ncov/21K)
- [Outbreak.info VOC Lineage Comparisons](https://outbreak.info/compare-lineages?gene=ORF1a&gene=ORF1b&gene=S&gene=ORF8&gene=N&gene=ORF3a&gene=E&gene=M&gene=ORF6&gene=ORF7a&gene=ORF7b&gene=ORF10&threshold=75&nthresh=1&sub=false&dark=true)

## Data Reporting and Sharing
- [PHA4GE Resource on Data Sharing](https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification): Sharing of sample read and assembly data through internationally accessible databases allows insights to be drawn about how the virus is spreading and mutating across the globe; the more freely available these data are to international researchers and public health scientists, the stronger our decision making can be.
- [PHA4GE Resource on Data Submission](https://github.com/pha4ge/pipeline-resources/blob/main/docs/bioinfo-solutions.md#2-submitting-raw-sequence-data-fastq-consensus-assemblies-fasta-and-relevant-sample-metadata-to-internationally-accessible-databases): Resources developed to assist in the preparation and submission of raw NGS read data (fastq files), SC2 consensus assemblies (fasta files), and contextual sample metadata to internationally-accessible databases such as NCBI, ENA, and GISAID

# Potential Impacts of Spike Protein Mutations

The spike protein of the SARS-CoV-2 Omicron variant contains approximately 32 mutations, many of which have not been observed in previous VOCs. However, based on their location, several of these mutations have the potential to impact immune escape, transmissibility, and detection. Spike mutations found in the Omicron VOC can be analyzed in detail using the [Stanford University Coronavirus Antiviral & Resistance Database](https://covdb.stanford.edu/sierra/sars2/by-patterns/).

![Omicron S-gene mutations](./images/omicron_standford.svg)

- Up to 15 mutations have been observed within the receptor binding domain (RBD). The RBD region of the Spike protein interacts directly with the human receptor ACE2, and mutations in this region may have a direct impact on how well SARS-CoV-2 viral particles attach to a host cell.

- Approximately 8 mutations have been observed within the N-terminal domain (NTD). The NTD of the Spike protein aids in virus attachment, and mutations in this region could also impact virus infectivity.

- Both the RBD and NTD are surface-exposed areas of the Spike protein that are targeted by antibodies. Mutations in these regions have the potential to evade immunity by antibodies acquired through previous infection or vaccination.

- Three mutations occur near the furin cleavage site, the region of the Spike protein responsible for viral-host membrane fusion. Mutations in this region have the potential to affect viral entry into host cells.

## Diagnostic and Sequencing Assays

Mutations in the SARS-CoV-2 genome can affect PCR-based diagnostic assays and genomic sequencing. For example, the ThermoFisher TaqPath probe targeting the Spike gene is known to result in S-gene target failure (SGTF) when amplifying nucleic acid preparations from VOC Alpha. This occurs when the SARS-CoV-2 genome contains a deletion resulting in the loss of amino acids 69-70 of the NTD. When coupled with the positive amplification of other SARS-CoV-2 genetic regions, the SGTF has been used as a diagnostic indicator of VOC presence ([SGF Deletion Assay](https://www.biorxiv.org/content/10.1101/2021.10.25.465706v1.full)).
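
The SGTF rule described above (S target fails while the other targets amplify) can be sketched in a few lines of Python. This is only an illustration: the target names follow the three TaqPath targets (ORF1ab, N, S), while the Ct cutoff `CT_POSITIVE_MAX` and the function names are hypothetical and are not taken from any assay's instructions for use.

```python
from typing import Mapping, Optional

# Hypothetical cutoff: a target counts as positive when its Ct is at or below
# this value. Real assays define their own interpretation rules.
CT_POSITIVE_MAX = 37.0


def is_positive(ct: Optional[float], cutoff: float = CT_POSITIVE_MAX) -> bool:
    """A target is positive when a Ct value was recorded and is within the cutoff."""
    return ct is not None and ct <= cutoff


def has_sgtf(ct_by_target: Mapping[str, Optional[float]], s_target: str = "S") -> bool:
    """Return True when the S target failed while every other target amplified."""
    others = [ct for name, ct in ct_by_target.items() if name != s_target]
    if not others:
        return False  # no other target to confirm that SARS-CoV-2 is present
    s_failed = not is_positive(ct_by_target.get(s_target))
    return s_failed and all(is_positive(ct) for ct in others)
```

A sample with `{"ORF1ab": 22.1, "N": 23.4, "S": None}` would be flagged, whereas a sample with all three targets amplified, or with no targets amplified, would not. SGTF is a screening signal only; lineage confirmation still requires sequencing.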

- [Thermo Fisher Scientific Confirms Detection of SARS-CoV-2 in Samples Containing the Omicron Variant with its TaqPath COVID-19 Tests](https://thermofisher.mediaroom.com/2021-11-29-Thermo-Fisher-Scientific-Confirms-Detection-of-SARS-CoV-2-in-Samples-Containing-the-Omicron-Variant-with-its-TaqPath-COVID-19-Tests): The Omicron variant contains the NTD deletion at amino acids 69/70 and results in SGTF by the TaqPath PCR assay.

- [NEB's Primer Monitor Tool](https://primer-monitor.neb.com/lineages): Monitor registered primer sets for overlapping sequence variants in Omicron.

- [SARS-CoV-2 Artic V4.1 update for Omicron variant](https://community.artic.network/t/sars-cov-2-v4-1-update-for-omicron-variant/342): Ten mutations in the Omicron VOC affect the Artic V4 primer scheme for whole genome sequencing. The Artic Network has designed 11 new primers to account for these mutations.

# Bioinformatics Resources and Considerations
Genome assembly as well as clade and lineage assignment of Omicron variants should follow the same bioinformatics workflow recommendations outlined in this working group's [Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis](https://github.com/jkspinler/pipeline-resources/edit/main/docs/omicron-resources.md) guidance document. Briefly, raw amplicon read data should be mapped to the Wuhan-1 reference genome and primer trimming performed before a consensus genome is called. Clade and lineage assignment can then be made by analyzing the resulting consensus genome assemblies with the [NextClade](https://clades.nextstrain.org/) and [Pangolin](https://pangolin.cog-uk.io/) software, respectively.

## Software Version Minimums
For laboratories making clade and lineage assignments outside of the NextClade and Pangolin web applications, e.g.
through a custom workflow available on CLI, Terra.Bio, or Galaxy Project, please use updated NextClade and Pangolin software capable of making an accurate Omicron clade and lineage designation:
- [NextClade Software Version 1.7.0](https://github.com/nextstrain/nextclade/releases/tag/1.7.0) ([Dataset Tag >=2021-12-16T20:15:53Z](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html))
  - [NextStrain Docker Container Image](https://hub.docker.com/r/nextstrain/nextclade)
- [Pangolin Software Version 3.1.17](https://github.com/cov-lineages/pangolin/releases/tag/v3.1.17) ([Constellations >=0.1.0](https://github.com/cov-lineages/constellations/releases/tag/v0.1.0))
  - [StaPH-B Docker Container Image](https://hub.docker.com/r/staphb/pangolin/tags?page=1&ordering=last_updated)
  - [BioContainer Docker Container Image](https://quay.io/repository/biocontainers/pangolin?tab=tags)

## Reference Sequences and Assemblies
- [KRISP CERI NCBI BioProject of Omicron Data](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA784038): Sequencing of the Omicron variant in South Africa by the Kwazulu-Natal Research Innovation and Sequencing Platform (KRISP) and the Centre for Epidemic Response and Innovation (CERI).
- [NCBI SAMN23572360](https://www.ncbi.nlm.nih.gov/biosample/SAMN23572360): Raw read and assembly data for the first Omicron identified in Minnesota, USA
- [NCBI SAMN23637602](https://www.ncbi.nlm.nih.gov/biosample/SAMN23637602): Raw reads and assembly data for first Omicron in Massachusetts, USA
- ENA Assemblies: [ERZ4210179](https://www.ebi.ac.uk/ena/browser/view/ERZ4210179), [ERZ4209688](https://www.ebi.ac.uk/ena/browser/view/ERZ4209688), [ERZ4211168](https://www.ebi.ac.uk/ena/browser/view/ERZ4211168), [ERZ4210738](https://www.ebi.ac.uk/ena/browser/view/ERZ4210738)
- [NCBI SAMN23998005](https://www.ncbi.nlm.nih.gov/biosample/SAMN23998005/): Raw read data for an Omicron variant sequenced with the [ONT Midnight 1200 primers](https://store.nanoporetech.com/us/midnight-rt-pcr-expansion.html)

## SARS-CoV-2 Multiple Sequence Alignments
Primer dropouts in Omicron sequence data may lead to errant evolutionary inferences when performing phylogenetic analysis of SARS-CoV-2 genomes. A proposed workaround for these dropout regions is to mask the spike region and adjust the molecular clock rate accordingly, as [performed by Trevor Bedford in a recent phylodynamic analysis](https://twitter.com/trvrb/status/1466102128343093248?s=20).
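
As a rough illustration of the masking step only (the clock-rate adjustment is not covered), the sketch below replaces a coordinate range in an aligned sequence with `N`. It assumes the sequence is already aligned to the Wuhan-Hu-1 reference (NC_045512.2) with insertions removed, so that string positions match reference coordinates; the default range is the annotated S gene on that reference, and the function name is hypothetical.

```python
# Assumed coordinates: S gene on NC_045512.2, 1-based and inclusive.
SPIKE_START = 21563
SPIKE_END = 25384


def mask_region(
    aligned_seq: str,
    start: int = SPIKE_START,
    end: int = SPIKE_END,
    mask_char: str = "N",
) -> str:
    """Replace positions start..end (1-based, inclusive) with mask_char.

    The sequence length is preserved; positions beyond the end of the
    sequence are ignored.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    left = aligned_seq[: start - 1]
    middle = aligned_seq[start - 1 : end]
    right = aligned_seq[end:]
    return left + mask_char * len(middle) + right
```

For example, `mask_region("ACGTACGT", 3, 5)` yields `"ACNNNCGT"`. In a Nextstrain-style workflow the same effect is usually achieved by listing sites to ignore rather than by editing sequences, as in the file linked below.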
- [Nextstrain default masked sites for tree topology](https://github.com/nextstrain/ncov/blob/master/defaults/sites_ignored_for_tree_topology.txt)

## Historical Information / Archived Data

- [Pango-designation proposed new lineage](https://github.com/cov-lineages/pango-designation/issues/343) and the [associated twitter thread](https://twitter.com/PeacockFlu/status/1463176821416075279) (Tom Peacock)
- [Proposal for third sublineage in B.1.1.529 (BA.3)](https://github.com/cov-lineages/pango-designation/issues/367) (Andrew Rambaut)
  - Includes table of shared and unique mutations across B.1.1.529, BA.1, BA.2, and BA.3
- [Galaxy EU Omicron Public Analysis](https://galaxyproject.eu/posts/2021/11/29/omicron-and-galaxy/): View of the Omicron lineage’s mutational pattern derived transparently and fully reproducibly from raw sequencing reads using the Galaxy Project bioinformatics platform
- [Omicron Data Round Up](https://docs.google.com/presentation/d/1sOaHoXFZqIUnqmjdeuaUODCqaUSvtxQp4f2hF9pBdn8/edit#slide=id.g104e9fe3cf0_2_75): Summary of the Omicron variant and what can be inferred based on publicly-accessible data presented 2021-12-01 by Anna Niewiadomska

--------------------------------------------------------------------------------
/docs/pipeline-best-practices.md:
--------------------------------------------------------------------------------

**NOTE: AS OF 2024-03-24, THIS DOCUMENT HAS BEEN MOVED TO A SEPARATE REPOSITORY. FOR THE MOST UP-TO-DATE VERSION OF THIS DOCUMENT, PLEASE VISIT https://github.com/pha4ge/public-health-pipeline-best-practices.**

----

# **Best Practices for Public Health Bioinformatics Pipelines**

**PHA4GE Bioinformatics Pipelines & Visualization Working Group**

Libuit KG, Guthrie J, Ambrosio F, Kapsak C, Unal Gultekin, Holmes J, Wright S, Nguinkal J, Doughty E, Southgate J, O'Cathail C, Carleton H, Kingwara L, Khan W, Baker K, Diallo A, Connor T, Kanwar S, Maturure P, James S, Cuesta I, Dyster V, Gaskin A, Williams C, Smith E, Rokney A, Petkau A, Varona S, Gnimpieba E, Rey S, Macori G, & Mboowa G

Document Changelog

- 2023-07-05:
  - Commit to main
- 2023-12-03:
  - Adding max duration for commitments to maintain (2 years)
- 2024-02-11:
  - Adding descriptions of meeting and verifying proposed standards
- 2024-03-11:
  - Focus shift from proposed standards for bioinformatics software to best practices for bioinformatics pipelines

## Overview

The field of public health bioinformatics relies heavily on the development and sustainability of high-quality software to support efforts in disease surveillance, outbreak investigation, and genomic research. Bioinformatics pipelines, also known as bioinformatics workflows, play a critical role in facilitating the routine analysis of genomic data by orchestrating the flow from raw data through various processing stages to final analysis.

Despite their critical role, the absence of guidelines and best practices specific to public health pathogen genomics has hindered progress towards accessible, reproducible, interoperable, and standardized bioinformatics analysis in public health.

To support both pipeline developers and analysts who rely on these pipelines to inform critical public health decision making, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has proposed a set of best practices for bioinformatics pipelines, tailored for public health applications. These best practices aim to provide a framework for the development, testing, and maintenance of bioinformatics pipelines, enhancing the quality, reliability, and sustainability of these resources and facilitating their impact on public health.

## 10 Best Practices for Public Health Bioinformatics Pipelines

### 1. Publicly-Accessible Repository
_Is the source code for this pipeline available at a publicly-accessible repository URL?_

Publicly-accessible software bolsters collaboration and expedites innovation in public health bioinformatics, empowering worldwide public health communities to address critical challenges. By enhancing accessibility, publicly available software enables interoperability and reproducibility across public health investigations, crucial for well-informed decision-making and policy creation.
Popular code repositories such as GitHub, GitLab, Bitbucket, and SourceForge offer platforms for developers to share their work.

**To adhere to this best practice:** Host the bioinformatics pipeline on an open code repository platform.

**To verify adherence to this best practice:** The reviewer should confirm existence and functionality of the repository.

### 2. Open-Source License
_Does the repository contain a plain-text LICENSE file with the contents of an OSI-approved software license?_

Open-source licenses in public health bioinformatics encourage the widespread adoption and sharing of valuable tools and resources, fostering a collaborative environment for research and innovation. These licenses also support the unrestricted improvement and customization of software, enabling researchers to tailor solutions to specific public health challenges and enhance overall outcomes. Without a license, the code author retains all rights to the work, and others are not allowed to use, copy, distribute, or modify it without permission. This means that the code is effectively unusable by the research community as a whole. A license grants this permission and allows others to use, copy, distribute, or modify the work under certain conditions. Popular license types for open-source bioinformatics software include GNU General Public License (GPL), MIT License, and Apache License 2.0. For helpful information regarding open-source licenses, we recommend the Ars Technica article “[Open source licenses: What, which, and why](https://arstechnica.com/gadgets/2020/02/how-to-choose-an-open-source-license/)”.

**To adhere to this best practice:** Choose an open-source license (e.g., MIT) for the pipeline, providing legal permissions for users to modify and share the code.

**To verify adherence to this best practice:** The reviewer should confirm the presence of a clearly defined open-source license in the pipeline repository.

### 3. Version Controlled Software
_Are stable releases that follow semantic versioning of the pipeline available for implementation in public health laboratories?_

Utilizing stable software releases with semantic versioning in public health bioinformatics ensures consistent functionality and compatibility, minimizing disruptions in research workflows. This approach also simplifies version tracking and communication, facilitating seamless collaboration among researchers and reducing the likelihood of errors due to software discrepancies. Maintaining a detailed changelog is also highly recommended to track software updates, bug fixes, and feature additions, ensuring transparency and ease of understanding for users.

**To adhere to this best practice:** Implement semantic versioning (e.g., MAJOR.MINOR.PATCH).

**To verify adherence to this best practice:** The reviewer should check for version tags in the repository, ensuring that versioning follows semantic versioning principles and accurately reflects the pipeline's development history.

### 4. Workflow Management System
_Does the pipeline utilize a workflow management system for its development and execution?_

Implementing workflow management systems in public health bioinformatics pipelines ensures efficient, scalable, and reproducible analyses. These systems automate complex data processing tasks, facilitating seamless integration and execution of diverse bioinformatics tools. By standardizing workflow execution, they enhance data analysis consistency across different computational environments, contributing significantly to the reliability and reproducibility of public health research findings.

Additionally, the use of workflow management systems facilitates maintainability. The shared knowledge and use of these systems by a community of developers ensure that pipelines can be more readily supported and updated, significantly enhancing long-term usability and stability. Common workflow management systems adopted across public health pathogen genomics include [NextFlow](https://www.nextflow.io/), [WDL](https://openwdl.org/), [SnakeMake](https://snakemake.readthedocs.io/en/stable/), and [CWL](https://www.commonwl.org/).

**To adhere to this best practice:** Choose a workflow management system that supports scalability, is compatible with common bioinformatics tools, and integrates easily with existing infrastructure. Document the workflow configuration and dependencies clearly.

**To verify adherence to this best practice:** The reviewer should assess the pipeline source code to identify the use of a workflow management system.

### 5. Containerized/Packaged Software
_Does the pipeline utilize containerized (e.g. Docker) or packaged (e.g. conda) software to enhance interoperable pipeline distribution?_

Using software packages and/or containerization within public health bioinformatics pipelines enhances interoperability by enabling seamless integration and deployment across different platforms and environments. This approach simplifies pipeline distribution and installation, promotes reproducibility, and facilitates collaboration among researchers, contributing to the development of more accessible and interoperable tools and resources.

Containers are essential for modern bioinformatics development and pipeline distribution, as the ability to replicate results is a fundamental principle of the scientific method. This is true for public health, as any lab should be able to easily install and maintain a pipeline and reproduce/verify results from another lab.
Pipeline software should be packaged with one or a combination of the following methods:
- conda environments
- venv environments
- Singularity containers
- Docker containers.

There should be a clear summary in the Git README stating which containerisation method has been chosen, with instructions for how a lab can install the pipeline and where Docker images (e.g. Docker Hub and Quay) and conda packages (e.g. Anaconda and Bioconda) are available (cross-referenced with Installation Instructions). Documentation should indicate the specific version of each tool included in the pipeline. This is important as specific software versions may impact functionality.

**To adhere to this best practice:** Implement pipeline components within Docker containers or distribute them as Conda packages; use of containerized/packaged software should be clearly documented.

**To verify adherence to this best practice:** The reviewer should inspect the pipeline source code and review analytical steps (e.g. NextFlow processes or WDL tasks) to ensure use of containerized or packaged software and verify documentation of these resources.

### 6. Common File Formats
_Does the pipeline accept as input and generate as output common file formats utilized in public health pathogen genomics?_

Accepting and generating common file formats in public health bioinformatics pipelines enhances interoperability and data exchange between different tools and platforms. This approach facilitates data sharing, promotes consistency, and enables researchers to leverage a wide range of pipeline solutions, contributing to the development of more comprehensive and effective solutions for addressing critical public health challenges. These files include: .fasta, .fastq(.gz), .sam, .bam, .bai, .bed, .vcf, .gff, .gtf, .txt, .log, .tsv, .csv, .nwk, and .json.
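
A reviewer or developer could screen a pipeline's declared inputs and outputs against the list above with a small helper. The following is a minimal sketch, not part of the best practice itself: the function name is hypothetical, the suffix set simply mirrors the formats listed in the preceding paragraph, and it checks file names only, not file contents.

```python
from pathlib import PurePath

# Suffixes copied from the list of common formats above; ".fastq.gz" is
# handled separately because it is the only compressed form that is listed.
COMMON_SUFFIXES = {
    ".fasta", ".fastq", ".sam", ".bam", ".bai", ".bed", ".vcf", ".gff",
    ".gtf", ".txt", ".log", ".tsv", ".csv", ".nwk", ".json",
}


def uses_common_format(filename: str) -> bool:
    """Return True when the file name ends in one of the listed suffixes."""
    name = filename.lower()
    if name.endswith(".fastq.gz"):
        return True
    return PurePath(name).suffix in COMMON_SUFFIXES
```

For instance, `uses_common_format("sample_R1.fastq.gz")` and `uses_common_format("calls.vcf")` both return `True`, while a proprietary binary output such as `results.xyz` would be reported as `False`.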
**To adhere to this best practice:** The pipeline should accept as input and generate as output common file formats utilized in public health pathogen genomics, enhancing interoperability and data exchange between different tools and platforms.

**To verify adherence to this best practice:** The reviewer should check the documentation or repository for explicit information on the common file formats supported by the pipeline, ensuring compatibility with widely used formats in public health bioinformatics.

### 7. Software Testing
_Are there automated and/or manual tests described so that the functionality of the pipeline can be assessed?_

Including software tests for a public health bioinformatics pipeline helps ensure the reliability and accuracy of the tools, enhancing user confidence and promoting consistent research outcomes. These tests also facilitate early detection and resolution of potential issues, contributing to the overall stability and robustness of the pipeline in a rapidly evolving public health landscape. At a minimum, pipelines being implemented for public health pathogen genomics should include:
- Smoke tests to ensure that the basic functionality of the program is working correctly
- Unit tests to test individual code units
- System tests/end-to-end tests to assess the overall functionality of the program, with a focus on common and important paths
- Regression tests to ensure that changes to the code do not break existing functionality (note: system tests can be utilized to implement regression tests)

When appropriate, inclusion of additional testing strategies may also enhance the robustness of the pipeline, e.g.
- Acceptance tests to ensure that the program meets a project's fundamental requirements
- Runtime tests to evaluate the pipeline's behavior, performance, and stability during its operation to ensure that it meets the required standards and functions correctly in real-world scenarios
- Testing frameworks, e.g. use of GitHub Actions to automate defined software tests, which provide a consistent and organized structure for writing and running test cases, enabling developers to efficiently validate the correctness, performance, and reliability of their pipeline

The description of functionality and performance testing should be made accessible via the code repository where the pipeline is made available.

**To adhere to this best practice:** Provide a description of the automated and/or manual tests that assess the functionality of the pipeline, ensuring reliability and accuracy.

**To verify adherence to this best practice:** The reviewer should check the documentation for a description of tests, automated and/or manual, that evaluate the functionality of the pipeline, contributing to user confidence and consistent research outcomes.

### 8. Benchmark/Validation Datasets
_Is there a publicly available set of inputs with known outputs that can be used to test successful installation and benchmark against other tools?_

Including a benchmark or validation dataset for a public health bioinformatics pipeline provides researchers with a standard reference for evaluating and comparing the performance of different tools, promoting transparency and consistency in the evaluation of pipelines. By establishing a common reference point, benchmarking enables researchers to identify the strengths and weaknesses of various pipeline solutions and promotes the development of more accurate, reliable, and effective tools.
Authors should make benchmark and/or validation datasets publicly available and well-documented, allowing others to reproduce the experiments and validate the results.

A benchmark dataset is a standardized set of inputs with known outputs that is used to compare the performance of different bioinformatics tools on the same set of data. The benchmark dataset is typically designed to be representative of the types of data that the tool is likely to encounter in real-world scenarios and covers a range of use cases.

A validation dataset, on the other hand, is used to validate the accuracy and reliability of a specific bioinformatics tool. The validation dataset is designed to test the tool's performance on a range of input data types and sizes and evaluate its ability to correctly identify the target sequences and distinguish them from non-target sequences.

**To adhere to this best practice:** Include a benchmark or validation dataset for the public health bioinformatics pipeline, promoting transparency and consistency in the evaluation and comparison of different tools.

**To verify adherence to this best practice:** The reviewer should check the documentation or repository for information on the availability and accessibility of a benchmark and/or validation dataset for the public health bioinformatics pipeline.

### 9. Reference Data Requirements
_Are required reference data and/or databases clearly documented, publicly accessible, and maintained?_

Documenting any external reference data or database requirements for a public health bioinformatics pipeline enhances the usability and reproducibility of the pipeline by providing clear and comprehensive information on the necessary data sources and dependencies.
This documentation promotes transparency, facilitates replication, and enables researchers to more effectively integrate the pipeline into their workflows, contributing to the development of more reliable and impactful tools and resources.

If external reference data or a database is required, the following standards should be met: static versioning, open access, clear instructions to install/access the database, and a documented database version and date of most recent update (for version control and compatibility). In addition:
- Identify what aspects of the database need to be documented, such as the database schema, table structure, and stored procedures. Identify the format the documentation should take, such as technical documentation, user guides, and reference manuals.
- Clearly document the sources of data used to construct the database, including information on how the data was acquired, processed, and validated.
- Specify the format of the data, including any file formats, parameters used, and other relevant information.
- Describe the process of data curation, including any quality control measures, data cleaning, and data integration.
- Describe any taxonomy and annotation used in the database, including any reference standards or guidelines that were followed.
- Specify the terms of use and any restrictions on the use of data from the database, including any attribution or citation requirements.
- Mention any community resource or website, such as a help forum or feedback mechanism, that is in place for the database.
- Ensure that the database is compatible with the pipeline that it is being used with. Make sure that the documentation clearly states the versions of the pipeline and systems that the database is compatible with.

The format of the open-access download should be defined, ideally a compressed format that is well suited to downstream usage/analysis.
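
One way to make the versioning and access requirements above concrete is to ship a small manifest next to the reference data and check downloads against it. The Python sketch below assumes a JSON manifest with the hypothetical keys `name`, `version`, `updated`, and `sha256`; neither the manifest layout nor the function names come from an existing standard, so treat this as an illustration only.

```python
import hashlib
import json
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_reference(manifest_path: Path, data_path: Path) -> dict:
    """Check a downloaded reference file against its documented manifest.

    The manifest keys used here are hypothetical; a real database may
    document its version and release date differently.
    """
    manifest = json.loads(manifest_path.read_text())
    required = {"name", "version", "updated", "sha256"}
    missing = required - manifest.keys()
    if missing:
        raise ValueError(f"manifest is missing fields: {sorted(missing)}")
    if sha256sum(data_path) != manifest["sha256"]:
        raise ValueError("checksum mismatch: file differs from the documented release")
    return manifest
```

Recording the returned `version` and `updated` values in the pipeline's run log would let a reviewer confirm exactly which database release produced a given result.
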
**To adhere to this best practice:** The required reference data and/or databases for a public health bioinformatics pipeline should be clearly documented, publicly accessible, and maintained. If an external reference database is required, it should also adhere to standards such as static versioning, open access, and clear instructions for installation/access. Clearly document aspects of the database, such as the schema, table structure, and stored procedures, in a format suitable for users, such as technical documentation, user guides, and reference manuals.

**To verify adherence to this best practice:** The reviewer should check the documentation or repository for comprehensive information on external reference data or database requirements, ensuring transparency, usability, and reproducibility.

### 10. Pipeline Documentation
_Is the pipeline documentation clearly written and publicly accessible?_

Having clearly written and publicly accessible pipeline documentation enhances user understanding, facilitates adoption, and promotes efficient usage of the pipeline. It provides comprehensive instructions, usage examples, and explanations of key functionalities, enabling public health scientists to effectively utilize the pipeline for their specific bioinformatics needs. Best practices for pipeline documentation include:
- Defining the documentation scope: Identify what aspects of the tool's core functionality need to be documented and what format the documentation should take. This can include things like user guides, reference manuals, and API documentation.
- Establishing documentation guidelines: Develop a set of guidelines or standards for documenting the tool's core functionality.
- Creating a documentation template: Develop a template or set of templates that can be used to create consistent and accurate documentation.
- Reviewing and updating the documentation: Regularly review and update the documentation to ensure that it is accurate and up to date. This can be done by gathering feedback, monitoring usage data, and making adjustments as necessary.
- Keeping it accessible: Make the documentation easily accessible to users by providing it in formats such as HTML and PDF, as well as other user-friendly formats.

Effective pipeline documentation encompasses a broad range of practices, each targeting specific aspects of usability, transparency, and collaboration. These documentation practices ensure that users and contributors have a clear understanding of the pipeline's development, usage, and governance. By incorporating these elements, documentation becomes a comprehensive resource that supports the pipeline's integrity, facilitates community contribution, and enhances user engagement, making it an indispensable part of best practices in bioinformatics pipeline development.

**To adhere to and verify adherence to this best practice:** Refer to the documentation practices listed below; all pipeline documentation practices should be met.

#### 10a. Contribution, Authorship, and Verified Point of Contact
_Does the full list of authors seem appropriate and include a verified point of contact?_

Clearly listing authorship and credit in public health bioinformatics acknowledges the contributions of individual researchers, fostering a sense of ownership and responsibility for their work. This practice also promotes transparency, collaboration, and recognition within the scientific community, enhancing career development opportunities and encouraging the sharing of expertise. We recommend using the [CRediT system](https://credit.niso.org/), adopted in the natural sciences, to acknowledge contributions to bioinformatic tools.
Contributors to the pipeline must be acknowledged as co-authors if they have contributed through: programming or pipeline development; designing computer programs; implementation of the computer code and supporting algorithms; or testing of existing code components.

A verified point of contact must include a working email address of an individual or organization that is most likely to maintain the bioinformatic tool in the long term. Ideally, email addresses for multiple individuals should be provided and these should not be organizational email addresses (e.g. joe.bloggs@phag4e.org), as individuals could lose access to that email when they leave that organization.

**To adhere to this documentation practice:** Clearly list authorship and credit for the bioinformatics tool, acknowledging individual contributions and following the CRediT system for appropriate recognition based on specific contributions.

**To verify adherence to this documentation practice:** The reviewer should check the documentation and repository for a clear and comprehensive list of authors, ensuring adherence to the CRediT system and appropriate acknowledgment of contributors based on their roles in programming, pipeline development, design, implementation, and testing.

#### 10b. Conflict of Interest Statement
_Have all potential conflicts of interest been disclosed?_

Stating potential conflicts of interest regarding pipeline authors in public health bioinformatics promotes transparency and integrity in scientific research. This practice enables users to make informed decisions about the pipeline they utilize, ensuring unbiased results and fostering trust within the research community. Conflict of interest is defined as any factor which renders an author, co-author, or collaborative team unable to (potentially or otherwise) perform an independent peer review or evaluation pertaining to a study.
Examples of conflict of interest include but are not limited to commercial, personal, political, or religious interests. When developing public health bioinformatics pipelines, any conflict of interest should be disclosed by the responsible authors to ensure independent peer review and testing of code has been carried out prior to publication. Some conflict of interest statements may be waived if an author can demonstrate they are able to perform an impartial code review. For example, JOSS (the Journal of Open Source Software) suggests that if two co-authors did not ever truly collaborate, this might mean a co-author is a suitable selection for code review.

**To adhere to this documentation practice:** Authors of public health bioinformatics pipelines or methods should transparently disclose any potential conflicts of interest, such as commercial, personal, political, or religious affiliations, to promote transparency and integrity in scientific research.

**To verify adherence to this documentation practice:** The reviewer should assess the presence of a clear conflict of interest statement in the documentation or repository, ensuring that responsible authors have openly disclosed any factors that may impact their ability to perform an unbiased peer review or evaluation.

#### 10c. Pipeline Maintenance Statement
_Have the authors provided documentation regarding the intent to maintain the pipeline?_

Ensuring the long-term sustainability and maintenance of a public health bioinformatics pipeline is crucial for its continued relevance and reliability in the face of evolving public health challenges and technological advancements. Clear documentation regarding the authors' intent to maintain the pipeline not only signals to potential users that the tool will remain up-to-date and secure but also demonstrates the authors' dedication to supporting the public health community over time.
This documentation may detail the support mechanism for users, including how they can report issues, request features, or contribute to the project. If applicable, this documentation may also cover the funding model or community support strategy that will ensure the pipeline's ongoing development and maintenance. This could include details on how the project is funded, plans for seeking future funding, or how the project fosters a community of contributors.

**To adhere to this documentation practice:** Authors must include a detailed statement regarding the pipeline's future upkeep, covering aspects such as update schedules, user support mechanisms, and funding or community support strategies.

**To verify adherence to this documentation practice:** The reviewer will check that the documentation contains a comprehensive statement detailing the authors' commitment to maintaining the pipeline, including specific plans for updates, user support, and securing the pipeline's sustainability.

#### 10d. Community Guidelines for Contribution and Support
_Are there clear guidelines for third parties wishing to 1) contribute to the pipeline, 2) report issues or problems with the pipeline, and 3) seek support?_

Including community guidelines for contribution and support in a public health bioinformatics pipeline promotes open and transparent communication channels between developers, users, and the broader scientific community. These guidelines foster an environment of shared knowledge and expertise, enabling individuals to provide feedback, report issues, and contribute to the improvement and sustainability of essential tools and resources. They can include repository style guides, issue templates, and/or guidelines for providing support to users, including how to report issues, how to troubleshoot common problems, and how to escalate issues that cannot be resolved through standard support channels.
**To adhere to this documentation practice:** Include community guidelines for contribution and support on the code repository where the pipeline is made available. Ensure that community guidelines foster an environment of shared knowledge and expertise, enabling individuals to contribute feedback, report issues, and actively participate in the improvement and sustainability of essential tools and resources.

**To verify adherence to this documentation practice:** The reviewer should confirm the presence of well-documented community guidelines in the documentation or repository, encompassing aspects such as repository style guides, issue templates, and guidelines for providing user support.

#### 10e. Statement of Need with Respect to Public Health Pathogen Genomics
_Have the authors clearly stated the challenges in public health pathogen genomics that this pipeline aims to address?_

A clear statement of need for a public health bioinformatics pipeline highlights the significance and relevance of the tool within the public health landscape, facilitating its adoption by the target user base. This practice also helps to align development efforts with pressing public health challenges, ensuring that resources are directed towards addressing the most critical issues. For instance, the tool could address the challenge of integrating multiple types of genomic data analysis, such as variant calling, phylogenetic reconstruction, and outbreak investigation, into a single platform. The pipeline could also incorporate machine learning algorithms to provide automated classification and identification of pathogens, reducing the need for manual curation. The type of organization or researcher that the tool is intended for should be made clear, and it is helpful to provide information regarding the level of computational expertise needed.
Users are more likely to adopt a new tool if they can see how it addresses existing limitations or provides new and innovative features that are not available in current tools. The authors should explain how their tool is different from existing tools and how it improves upon established methods, if there are any. For example, the tool might provide more accurate and reliable results due to the incorporation of new algorithms or statistical models. It might also be more user-friendly and accessible to users with varying levels of computational expertise, allowing a wider range of end-users to take advantage of the tool's capabilities.

**To adhere to this documentation practice:** Clearly state the significance and relevance of the public health bioinformatics pipeline and explain how the tool differs from existing ones and/or improves established methods.

**To verify adherence to this documentation practice:** The reviewer will ensure that the documentation includes a statement highlighting the pipeline's purpose and its alignment with pressing public health challenges.

#### 10f. Pipeline Functionality
_Has the function of this software as it pertains to public health bioinformatics been clearly articulated?_

Including a clear indication of software function in public health bioinformatics enables researchers to easily identify the most suitable tools for their specific needs, enhancing productivity and the overall quality of their work. This practice also fosters informed decision-making, ensuring that the software is applied effectively and appropriately to address public health challenges.

The intended use of the software in the context of public health pathogen genomics should be clearly stated, accompanied by the means to confirm this functionality, e.g.
through the provision of a validation dataset with information detailing expected outputs and, if appropriate, how these outputs can be compared to a benchmark standard. Standardizing the documentation of the core functionality of a tool can help to ensure that it is clear, accurate, and easy to understand. Limitations of the software that may affect use of the results for clinical or epidemiological purposes and decisions should also be indicated, with clarification of which organisms, species, and subspecies have been validated for use, and where limitations and discrepancies may occur.

**To adhere to this documentation practice:** Clearly indicate the function, intended use, and limitations of the software.

**To verify adherence to this documentation practice:** The reviewer should check for explicit documentation that clearly communicates the software's functionality, intended use, limitations, validated organisms, and potential discrepancies.

#### 10g. Documentation for Local Installation and/or Remote Access (e.g. Web Server or Galaxy/Terra Workflow)
_Does installation and/or access to the pipeline proceed as outlined in the documentation?_

Providing clear local installation and/or remote access instructions for a public health bioinformatics pipeline streamlines the user experience, enabling researchers to efficiently deploy and utilize essential tools. This practice also minimizes potential technical barriers, fostering accessibility and promoting widespread adoption within the scientific community. The pipeline installation guide should be clear, concise, and easy to follow. The system requirements for the pipeline should be outlined with a clearly stated list of all prerequisites and dependencies that are necessary to install the pipeline correctly. Ideally, dependencies are handled with an automated package management solution.
The necessary software should be defined with the required minimum version and release.

Installation instructions should include a step-by-step list. Configuration settings should be detailed, with the expected outcome of each setting made clear. A method for verifying a successful installation should be described, along with any typical problems that might occur during installation and methods for troubleshooting them. Where there are manual post-installation or cleanup tasks, details of those tasks should be provided. If there are any software license terms, these should be listed.
Methods to update the pipeline should also be described within the installation instructions.

If the resources have been made available via a web application (e.g. Galaxy or Terra.Bio), instructions on how to access and utilize the pipeline through the web application should be clearly indicated.

**To adhere to this documentation practice:** Authors of a public health bioinformatics pipeline should offer clear local installation and/or remote access instructions, ensuring a streamlined user experience and facilitating efficient deployment of essential tools. Installation instructions should include detailed configuration settings with clear expected outcomes, step-by-step lists, and verification methods for successful installation, along with troubleshooting guidance for common issues. If resources are available via a web application (e.g., Galaxy or Terra.Bio), clear instructions on how to access and utilize the pipeline through the web application should be provided.

**To verify adherence to this documentation practice:** The reviewer should check for a well-documented installation guide with clear, concise, and easy-to-follow step-by-step instructions, including system requirements, dependencies, and automated package management solutions, as well as information on software versioning.

#### 10h. Example Usage
_Do the authors include examples of how to use this pipeline?_

Documenting an example usage for a public health bioinformatics pipeline provides researchers with practical guidance on how to effectively apply the tool in real-world scenarios, enhancing their understanding of its potential applications. This practice promotes successful integration of the pipeline into research workflows, ensuring that it is utilized to its full capacity and ultimately advancing public health outcomes. An example usage for a command-line interface (CLI) tool in public health bioinformatics might illustrate the required input data, command syntax, and expected output, providing a tangible demonstration of the tool's application. For instance, a tool analyzing genomic variants could have an example usage like:

```
Input files:
- sample.vcf (Variant Call Format file containing genomic variants)
- population_data.csv (population data file supplied via -p)

Command:
$ analyze_variants.py -i sample.vcf -o output.txt -p population_data.csv

Output:
- output.txt (file containing the filtered and annotated genomic variants relevant to public health)
```

This example usage showcases the necessary input files, command options, and the output generated, helping researchers to better understand the tool's functionality and how to incorporate it into their own analyses.

**To adhere to this documentation practice:** Provide practical guidance by documenting an example usage for the public health bioinformatics pipeline, offering researchers clear instructions on how to effectively apply the tool in real-world scenarios. Ensure that the example usage enhances researchers' understanding of the pipeline, facilitating successful integration into research workflows and maximizing its utility for advancing public health outcomes.
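
To show how a documented command line can map onto code, here is a minimal Python sketch of a command-line interface accepting the `-i`, `-o`, and `-p` options from the `analyze_variants.py` example in this section. That tool is itself only an illustration, so everything here is hypothetical: the long option names, treating `-p` as optional, and the placeholder rule of keeping VCF records whose FILTER column is `PASS` or `.` are assumptions of this sketch, and the population file is parsed as an argument but not otherwise used.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the options in the usage example."""
    parser = argparse.ArgumentParser(
        prog="analyze_variants.py",
        description="Hypothetical variant-analysis tool (illustration only).",
    )
    parser.add_argument("-i", "--input", required=True, help="input VCF file")
    parser.add_argument("-o", "--output", required=True, help="output text file")
    parser.add_argument("-p", "--population", help="population data CSV (assumed optional)")
    return parser


def passing_records(vcf_lines):
    """Yield VCF data lines whose FILTER column (7th field) is PASS or '.'."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 7 and fields[6] in {"PASS", "."}:
            yield line.rstrip("\n")


def main(argv=None) -> int:
    """Parse arguments, then copy passing records from input to output."""
    args = build_parser().parse_args(argv)
    with open(args.input) as src, open(args.output, "w") as dst:
        for record in passing_records(src):
            dst.write(record + "\n")
    return 0
```

A real script would end by calling `main()` under an `if __name__ == "__main__":` guard. Keeping the parser in its own function means the documented command line can be checked in a unit test without touching any files.
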
262 | 263 | **To verify adherence to this documentation practice:** The reviewer should check for comprehensive documentation that includes an example usage, illustrating the required input data, command syntax, and expected output, enhancing researchers' understanding of the tool's potential applications. 264 | 265 | -------------------------------------------------------------------------------- /docs/qc-solutions.md: -------------------------------------------------------------------------------- 1 | # **QC Solutions for SARS-CoV-2 Genomic Analysis** 2 | 3 | **PHA4GE Bioinformatics Pipelines & Visualization Working Group**
4 | Libuit KG, Lunn S, Carleton H, Khan W, Kanwar S, van Heusden P, Amrosio F, Lemmer D, Mboowa G, Macori G, Southgate J 5 | 6 |
7 | Document Changelog 8 | 9 | - 2022-06-23: 10 | - First draft published 11 | - 2023-03-09: 12 | - Added changelog 13 | 14 |
# Overview

Next-generation sequencing (NGS) has expanded the approach of genomic analysis for pathogen surveillance systems. The demand for NGS continues to grow, with the need for high throughput, lower costs, and better quality of data.

However, the quality of NGS sequencing data can be affected by library preparation and sequencing processes, systematic variation in quality scores across sequence reads, biases in sequencing due to base composition, and less-than-optimal library fragment sizes and indexes. Such factors can negatively impact the quality of raw sequencing data for downstream analyses.

In an attempt to assist with quality control (QC) measures, the Bioinformatics Pipelines & Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the QC challenges for SARS-CoV-2 (SC2) genomic analysis and suggest QC solutions to address them.

Please note that the QC guidelines in this document are simply an attempt to highlight the most accessible solutions **as per the opinions of our working group** and in no way represent a comprehensive system for QC guidance and bioinformatic solutions. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull-requests and/or raised GitHub issues.
## Contents
- [Process Control For Bioinformatics QC Checkpoints](#process-control-for-bioinformatics-qc-checkpoints)
- [QC Acceptance Criteria](#qc-acceptance-criteria)
  - [PHA4GE Suggested Thresholds](#pha4ge-suggested-thresholds)
  - [QC Metric Definitions](#qc-metric-definitions)
    - [Read QC Metrics](#read-qc-metrics)
    - [Alignment QC Metrics](#alignment-qc-metrics)
    - [Consensus Assembly QC Metrics](#consensus-assembly-qc-metrics)
- [Additional QC Resources and Materials](#additional-qc-resources-and-materials)

# Process Control For Bioinformatics QC Checkpoints
The focus of this document is on the quality control (QC) of tiled amplicon sequencing--through the ARTIC V3 protocol, for example--a common method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample and--as discussed in this working group's [Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis Guidance Document](../docs/bioinfo-solutions.md)--assembling a contiguous SC2 genome from these amplicon read data is a critical step in providing insight from sequenced samples.

Throughout this process, quality control checkpoints should be conducted at different stages of bioinformatics analysis, including QC of raw read data, pre-processing stages (trimming and filtering), and alignment/assembly.

43 | 44 |

In this context, raw read data refers to the fastq read files generated by the NGS platform, and processed reads are read files that have had adapter sequences removed, been trimmed based on size and quality, and been dehosted. Alignment QC refers to the examination of the BAM or VCF files generated during the consensus genome assembly process, and Consensus Assembly QC refers to an assessment of the fasta assembly file itself.

Future updates to this document will include QC guidance for SARS-CoV-2 genomic epidemiology analysis and wastewater sequencing data.

# QC Acceptance Criteria

When performing QC checks on SARS-CoV-2 genomic data, it can be helpful to establish acceptance thresholds to determine how and when data will be reported and utilized to inform public health decision-making. Below are this working group's suggested QC thresholds for SARS-CoV-2 genomic data, as well as various resources and metric definitions to assist public health laboratories implementing SARS-CoV-2 sequencing and analysis protocols.

## PHA4GE Suggested Thresholds

| Metric | Suggested threshold |
| ---------------------- | --------------------------------------- |
| **Read QC Metrics** | |
| Number of Reads | Protocol dependent (e.g. 100,000 reads from ARTIC amplicons sequenced on Illumina MiSeq) |
| Percent Human Reads | <20% |
| **Alignment QC Metrics** | |
| Average Read Depth | ≥100x |
| Percent mapped reads to Wuhan reference genome | ≥65% |
| Coverage at a Single Base to Make a Base Call | ≥50x |
| Percent Agreement | 80% |
| Average base quality of aligned reads | >15 |
| **Assembly QC Metrics** | |
| Percent reference coverage | >83% |
| Number of Ns | <5,000bp |
| Assembly length unambiguous | >24,000bp |
| NTC percent coverage | <10% |
| Lineage defining mutations | ≥60% |
| S-gene coverage | ≥99% |
| S-gene frameshifts | 0 |
| S-gene ambiguous bases | <10% |
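The suggested thresholds lend themselves to an automated pass/fail check. The following is a minimal Python sketch of that idea, not part of any PHA4GE tooling: the metric keys (`percent_human_reads`, `number_of_ns`, and so on), the `Threshold` class, and the `check_sample` function are hypothetical names chosen for illustration. Only thresholds that the table states with an explicit comparator are included, so Percent Agreement and the protocol-dependent Number of Reads are left out.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Threshold:
    """One acceptance rule: metric value <compare> limit must hold to pass."""
    label: str
    compare: Callable[[float, float], bool]
    limit: float


# Limits transcribed from the suggested thresholds table.
# The dictionary keys are hypothetical names, not a standard schema.
THRESHOLDS: Dict[str, Threshold] = {
    "percent_human_reads": Threshold("<20%", operator.lt, 20.0),
    "average_read_depth": Threshold(">=100x", operator.ge, 100.0),
    "percent_mapped_reads": Threshold(">=65%", operator.ge, 65.0),
    "average_base_quality": Threshold(">15", operator.gt, 15.0),
    "percent_reference_coverage": Threshold(">83%", operator.gt, 83.0),
    "number_of_ns": Threshold("<5,000bp", operator.lt, 5000.0),
    "assembly_length_unambiguous": Threshold(">24,000bp", operator.gt, 24000.0),
    "s_gene_coverage": Threshold(">=99%", operator.ge, 99.0),
}


def check_sample(metrics: Dict[str, float]) -> Dict[str, str]:
    """Return PASS, FAIL, or MISSING for every metric in THRESHOLDS."""
    results = {}
    for name, rule in THRESHOLDS.items():
        if name not in metrics:
            results[name] = "MISSING"
        elif rule.compare(metrics[name], rule.limit):
            results[name] = "PASS"
        else:
            results[name] = "FAIL"
    return results


if __name__ == "__main__":
    # Made-up values for a single sample, for demonstration only.
    example = {"percent_human_reads": 3.2, "average_read_depth": 850.0, "number_of_ns": 6200}
    for metric, status in check_sample(example).items():
        print(f"{metric}: {status}")
```

Reporting `MISSING` separately from `FAIL` keeps an incomplete metrics file from being mistaken for a failed sample. A laboratory adopting different limits would only need to edit the `THRESHOLDS` dictionary.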
129 | 130 | 131 | ## QC Metric Definitions 132 | 133 | ### Read QC Metrics 134 | Different sequencing platforms use different technologies to determine the nucleotide sequence of the genetic material that they are processing, but all of these technologies converge on the fastq file format. For example, Illumina uses a sequencing-by-synthesis approach which involves assembling copies of each read using fluorescently tagged nucleotides and taking high resolution pictures of each read as each nucleotide is added to the read. These images are then captured in binary base call (BCL) files, and BCL files are converted into fastq files using the bcl2fastq program. On the other hand, Oxford Nanopore Technologies sequencing platforms run single strands of nucleic acids through nano-scale protein pores. An electric current is run across the pore, and the changes in current are detected as each nucleotide passes through the pore. The raw electric signal is captured in the fast5 file format and converted into fastq file format using the basecalling program guppy. Due to the nature of these sequencing platforms there are different considerations when assessing the quality of the raw sequence data (the fastq files). 135 | 136 | | Term | Definition | 137 | | ---------------------- | --------------------------------------- | 138 | |

Reads

| Sequenced fragments of nucleic acid, reported as strings of base calls; also referred to as the raw data generated by a sequencing platform | 139 | |

Number of Reads

| Count of reads generated in an NGS run| 140 | |

BCL Files

| Binary base call files that Illumina instruments produce from flow-cell images, converted to fastq via the bcl2fastq program | 141 | |

FAST5 Files

| Raw electrical signal files produced by Oxford Nanopore Technologies sequencing equipment, converted to fastq via basecalling software (guppy is the current industry standard) | 142 | |

Basecalling

| The computational process of translating raw electrical signal files (FAST5) or flowcell images (BCL) to nucleotide sequence
[Performance of neural network basecalling tools for Oxford Nanopore sequencing](https://pubmed.ncbi.nlm.nih.gov/31234903/) | 143 | |

FASTQ Files

| The common “raw” sequence files containing nucleotide sequences and their associated quality scores
• The quality scores contained within a fastq file are encoded as single ASCII characters (one character, or one byte, per score), so the quality string is the same length as the nucleotide sequence it describes
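• The quality score (Q Score) encodes the estimated probability that the base call at the associated nucleotide position is incorrect; higher scores indicate more accurate calls
• Q scores typically range from 0 to about 40 and are defined as:
    $Q = -10 \log_{10}(P)$, where $P$ is the estimated probability that the base call is incorrect
• For example, Q20 corresponds to a 1 in 100 chance of an incorrect call and Q30 to a 1 in 1,000 chance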
• [Quality Scores for Next-Generation Sequencing - illumina](https://www.illumina.com/documents/products/technotes/technote_Q-Scores.pdf)
• [Measuring sequencing accuracy - illumina](https://emea.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html)
• Q Scores for Illumina and ONT sequencing will differ dramatically
     • An excellent Illumina run will have an average Q Score of 27-30
     • An excellent Nanopore run will have an average Q Score of 12-15
• Low Q Scores indicate poor sequencing quality which will impact all downstream analyses | 144 | |

Ambiguity / Mixed Sites

| The percentage of bases in each read that were called as ambiguous (N or another IUPAC ambiguity code)
[IUPAC Codes](https://www.bioinformatics.org/sms/iupac.html) | 145 | |

Sequence GC Content

| The GC content of reads should be normally distributed | 146 | |

Raw vs Processed Reads

| It is typical for some reads to be removed during quality filtering. Based on the known characteristics of the sample, one should be able to predict a reasonable proportion of the reads to be removed.| 147 | |

Percent Human Reads

| Percentage of human read data sequenced in an NGS run. | 148 | 149 | ### Alignment QC Metrics 150 | Consensus-genome assembly approaches have been widely adopted for SARS-CoV-2 genomic analysis. In this approach, read data are aligned to a reference genome, usually [Wuhan-1 (MN908947.3)](https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3), and each position in the alignment is assessed to determine the consensus base call supported by the read data. The alignments are captured in a BAM file that can be used to assess critical quality control metrics. VCF files can also be produced from an alignment to call variant positions relative to the reference genome, and these files can be inspected to assess the quality of the identified variant positions. 151 | 152 | | Term | Definition | 153 | | ---------------------- | --------------------------------------- | 154 | |

Sequence Alignment

| A method of arranging nucleic acid (DNA/RNA) or protein sequences to identify regions of similarity or conservation that may reflect functional, structural, or evolutionary relationships. Pairwise sequence alignment compares two sequences, whereas multiple sequence alignment compares three or more sequences | 155 | |

Sequencing Depth

| The number of reads that cover a particular nucleotide, section/amplicon of the genome, or average across the reference sequence
• Ideally a min depth of 10X for Illumina or 20X for Nanopore would be reached
• Uniform depth of coverage is better
• Nonuniform depth may be indicative of differential amplification of amplicons, or amplicon dropout
    • This can be assessed using bedtools | 156 | |

Percent Agreement

| Percentage of base call concordance in reads mapped at a designated position in the reference genome| 157 | |

Coverage

| What percent of the reference sequence is covered by the reads that have been produced
• This metric is typically used in conjunction with depth
| 158 | |

Percent Mapped Reads

| Percentage of read data mapped to a specified reference genome| 159 | |

Average Base Quality of Aligned Reads

| Mean Phred score of read data mapped to a reference genome| 160 | 161 | ### Consensus Assembly QC Metrics 162 | An examination of the resulting assembly quality is also critical, as these assemblies often inform downstream analyses such as lineage and clade assignments and genomic epidemiology investigations. 163 | 164 | | Term | Definition | 165 | | ---------------------- | --------------------------------------- | 166 | |

Length of the Assembly

| Should be similar to that of reference. If it is not, why? Have there been large insertions/deletions, gene duplications, etc.| 167 | |

Total Number of N’s

| The total number of ambiguous basecalls in the assembly | 168 | |

Length of Strings of N’s

| While the total number of N’s is important, the length of the strings of N’s can indicate issues with upstream laboratory workflows. If a string of N’s is consistently reported over a specific region of the genome, then one can cross reference the primer binding loci in the bed file to see if one amplicon is dropping out or amplifying at a lower rate than the other amplicons. This could be due to amplification bias, resulting from a large differential in the GC content between the amplicons. This may also indicate that you have a mixed population and there may be a subpopulation with a different sequence in the ambiguous region.| 169 | |

Percent Reference Coverage

| Percentage of the Wuhan-1 reference genome represented in the consensus assembly| 170 | |

Number of Ns

| Number of ambiguous base calls (Ns) incorporated into the consensus assembly| 171 | |

Assembly Length Unambiguous

| Number of unambiguous base calls (ATCGs) incorporated into the consensus assembly| 172 | |

NTC Percent Coverage

| Percentage of the Wuhan-1 reference genome represented in the consensus assembly of a non-template control (NTC; i.e. negative control)| 173 | |

Lineage Defining Mutations

| Percentage of lineage-specific mutations represented in the consensus assembly| 174 | |

S-gene Coverage

| Percentage of the SARS-CoV-2 S-gene represented in the consensus assembly| 176 | |

S-gene Frameshifts

| S-gene insertion or deletion events represented in the consensus assembly| 177 | |

S-gene Ambiguous Bases

| Number of ambiguous base calls (Ns) incorporated into the S-gene of the consensus assembly| 178 | 179 | 180 | ## Additional QC Resources and Materials 181 | 182 | - [ncov-tools](https://github.com/jts/ncov-tools) - Tools and plots for performing quality control on coronavirus sequencing results. 183 | - [Quality Management Systems Tools & Resources - Process Management](https://www.cdc.gov/labquality/qms-tools-and-resources.html#:~:text=Click%20to%20expand-,Process%20Management,-Provides%20guidance%20on) - US CDC Quality Management Systems for SARS-CoV-2 NGS Data 184 | - [TheiaCoV QC output Video](https://www.youtube.com/watch?v=Amb-8M71umw&list=PLU47xRg_MKJrtyoFwqGiywl7lQj6vq8Uz&index=3) - Video tutorial for assessing SARS-CoV-2 genomic characterization with Theiagen's TheiaCoV workflows 185 | - [StaPH-B Glossary](http://www.staphb.org/resources/glossary/) - US State Public Health Bioinformatics (StaPH-B) working group's bioinformatics glossary of terms 186 | - [PHA4GE Bioinformatics Solutions](https://github.com/pha4ge/pipeline-resources/blob/main/docs/bioinfo-solutions.md) - This working group's list of bioinformatics solutions for SARS-CoV-2 analysis 187 | - [ECDC: Guidance for representative and targeted genomic SARS-CoV-2 monitoring](https://www.ecdc.europa.eu/sites/default/files/documents/Guidance-for-representative-and-targeted-genomic-SARS-CoV-2-monitoring.pdf) - European CDC Guidance Document for SARS-CoV-2 genomic analysis 188 | -------------------------------------------------------------------------------- /docs/sc2-recombinants.md: -------------------------------------------------------------------------------- 1 | # **Identifying SARS-CoV-2 Recombinants** 2 | 3 | **PHA4GE Bioinformatics Pipelines & Visualization Working Group**
4 | Smith E, Wright S, Libuit K 5 | 6 |
7 | Document Change Log 8 | 9 | - 2022-06-28: 10 | - First draft published 11 | - 2023-02-16: 12 | - Update recombinant image (Example 3) 13 | - 2023-03-09: 14 | - Add changelog 15 |
16 | 17 | # Overview 18 | SARS-CoV-2 recombinants have garnered the attention of the public health community largely due to the unknown clinical and epidemiological implications. This uncertainty emphasizes the need to detect and characterize recombinant SARS-CoV-2 genomes, but the ability to do so rapidly and systematically is not without challenges. Often, recombinant genomes receive an “Unassigned” pango lineage, a non-recombinant pango lineage, or the incorrect recombinant lineage assignment. Additionally, determining the site of recombination within the genome can be difficult for those without extensive SARS-CoV-2 bioinformatics experience. 19 | 20 | The PHA4GE Pipelines and Visualization Working Group has created this document as an attempt to highlight critical sources of information and open-source/access resources to aid in the analysis and surveillance of potential recombinant specimens. 21 | 22 | In no way does this document represent a comprehensive list of all available SC2 bioinformatics resources for assessing recombination. 
If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull requests and/or GitHub issues. 23 | 24 | ## Contents 25 | - [General Information](#general-information) 26 | - [Tools used to detect recombinants/find breakpoint](#tools-used-to-detect-recombinantsfind-breakpoint) 27 | - [Investigating Putative Recombinant Specimen](#investigating-putative-recombinant-specimen) 28 | - [Common terminology](#common-terminology) 29 | - [Steps for investigating putative recombinant genomes](#steps-for-investigating-putative-recombinant-genomes) 30 | - [Assess whether recombination exists within the genome](#assess-whether-recombination-exists-within-the-genome) 31 | - [Determine whether the genome belongs to a designated recombinant lineage or represents a novel recombinant lineage](#determine-whether-the-genome-belongs-to-a-designated-recombinant-lineage-or-represents-a-novel-recombinant-lineage) 32 | - [Identify the breakpoint of the putative recombinant](#identify-the-breakpoint-of-the-putative-recombinant) 33 | - [Proposing a new recombinant lineage](#proposing-a-new-recombinant-lineage) 34 | - [Publications on SARS-CoV-2 recombinants](#publications-on-sars-cov-2-recombinants) 35 | 36 | 37 | # General Information 38 | 39 | More info and resources on recombinants 40 | 41 | # Tools used to detect recombinants/find breakpoint 42 | 43 | - [Sc2rf](https://github.com/lenaschimmel/sc2rf) 44 | - [Nextclade](https://clades.nextstrain.org/) 45 | - [UShER](https://genome.ucsc.edu/cgi-bin/hgPhyloPlace) 46 | - [Potential Recombinant List - Sakaguchi Hitoshi](https://docs.google.com/spreadsheets/d/1cQILRxXD756gJoRsaqMdJkxZm7sEjhV7ceY398Iz7gI/edit#gid=0) 47 | 48 | # Investigating Putative Recombinant Specimen 49 | 50 | ## Common terminology 51 | 52 | | Term | Definition | 53 | | ---------------------- | --------------------------------------- | 54 | | Breakpoint | The site within the 
genome where recombination occurred. This is also sometimes referred to as the recombinant site. Usually, the breakpoint is a range of nucleotide positions instead of a single nucleotide position. This is due to a lack of lineage-specific mutations in certain regions, or the same mutations being shared between different lineages. The beginning of the breakpoint range is the first possible site within the genome that recombination could have occurred, which generally follows a lineage-specific mutation site. The end of the breakpoint range is the last possible site where the recombination could have occurred, which generally precedes the site of a lineage-specific mutation. | 55 | | Donor | The lineage from which a portion of the recombinant genome originated. This is also called the “parental lineage”. For example, a BA.1 x BA.2 recombinant has BA.1 and BA.2 donor sequences. | 56 | | Designated lineage | A specific pango lineage. Designated recombinant lineages are currently referred to with the XX nomenclature per the pango network [guidelines](https://www.pango.network/the-pango-nomenclature-system/statement-of-nomenclature-rules/). If a recombinant sequence does not belong to a designated recombinant lineage, it may be a novel recombinant lineage that has not yet been designated | 57 | | Allele frequency | The proportion of reads that contain a specific nucleotide at a specific position within the genome. Ideally, the nucleotide sites for lineage-defining mutations within a recombinant genome will have near 100% allele frequency. This is one way to distinguish a recombinant genome from a contaminated genome. | 58 | | Sequencing depth | The number of reads covering a specific position within the genome. This is also often referred to as “coverage”. Ideally, a recombinant genome will have high sequencing depth at the sites of lineage-defining mutations. 
If a genome has low sequencing depth, it is difficult to determine what the donor, or parental lineage, is at that site. | 59 | 60 | ## Steps for investigating putative recombinant genomes 61 | 62 | There are three main steps in investigating a putative SARS-CoV-2 recombinant genome: 63 | 64 | 1. Assess whether recombination exists within the genome 65 | 2. Determine whether the genome belongs to a designated recombinant lineage or represents a novel recombinant lineage 66 | 3. Identify the breakpoint of the putative recombinant (if it represents a novel recombinant lineage) 67 | 68 | ### Assess whether recombination exists within the genome 69 | 70 | #### 1. Run genome assembly through Nextclade. This can be done either through the [Nextclade web portal](https://clades.nextstrain.org/) or [NextClade CLI](https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-cli.html) (command-line interface). 71 | 72 | #### 2. Assess Nextclade output for sequence quality and potential for recombination. 73 | 74 | 75 | 76 | ##### Example 1 77 | 78 | This genome (EPI_ISL_11758210) received a clade assignment of “recombinant” using v1.14.1. It has 67 mutations relative to the Wuhan-1 reference genome. It also has 0 Ns, meaning that there are no ambiguous bases in the genome and it is likely a high quality genome assembly. Therefore, we can be confident that this sequence is a true recombinant. However, additional investigation will be needed to determine whether this genome actually belongs to the XQ recombinant lineage. 79 | 80 |

81 | 82 |

83 | 84 | ##### Example 2 85 | 86 | This genome (EPI_ISL_12612634) received a clade assignment of “recombinant” using v1.14.1. It has 64 mutations relative to the Wuhan-1 reference genome. It has 646 Ns, which means ~2% of the genome assembly is composed of ambiguous bases. While ambiguous bases in a genome assembly are not ideal, a small number of them is not abnormal. This is still a relatively low number of Ns, and we can have confidence that this genome is a recombinant. However, additional investigation will be needed to determine whether this genome actually belongs to the XE lineage. 87 | 88 | 89 |

90 | 91 |

92 | 93 | ##### Example 3 94 | 95 | This genome received a clade assignment of “recombinant” using v1.14.1. It has 10 mutations relative to the Wuhan-1 reference genome. It has 21,498 Ns, meaning that the majority of the genome assembly is composed of ambiguous bases. Despite the “recombinant” assignment and “XF” pango lineage, the quality of this sequence is too poor to continue investigating for recombination. 96 | 97 |

98 | 99 |
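The three examples above differ mainly in how much of the assembly is ambiguous. A short Python sketch of that triage follows, under stated assumptions: the genome length is taken to be the 29,903 nucleotides of the Wuhan-1 reference (MN908947.3) rather than the length of each assembly, the cut-off reuses the `<5,000bp` Number of Ns threshold suggested in the QC guidance document, and the function names are hypothetical.

```python
REFERENCE_LENGTH = 29_903  # length of Wuhan-1 (MN908947.3) in nucleotides
MAX_NS = 5_000             # assumed cut-off, taken from the suggested QC thresholds


def count_ns(sequence: str) -> int:
    """Count ambiguous N calls in a consensus sequence, ignoring case."""
    return sequence.upper().count("N")


def percent_n(n_count: int, genome_length: int = REFERENCE_LENGTH) -> float:
    """Express the number of Ns as a percentage of the genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return 100.0 * n_count / genome_length


def worth_investigating(n_count: int, max_ns: int = MAX_NS) -> bool:
    """True when the assembly has few enough Ns to keep investigating."""
    return n_count < max_ns


if __name__ == "__main__":
    # N counts reported for Examples 1, 2 and 3 above.
    for label, n_count in [("Example 1", 0), ("Example 2", 646), ("Example 3", 21_498)]:
        print(label, f"{percent_n(n_count):.1f}% N", worth_investigating(n_count))
```

With these assumptions, Example 2 comes out at roughly 2% N and passes, while Example 3 is mostly N and is rejected, which matches the conclusions drawn in the text.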

100 | 101 | 102 | ### Determine whether the genome belongs to a designated recombinant lineage or represents a novel recombinant lineage 103 | 104 | #### 1. Upload sequences to the [UShER web portal](https://genome.ucsc.edu/cgi-bin/hgPhyloPlace). To follow along with Examples 1 and 2, you can either upload the fasta files or copy and paste the GISAID accessions. 105 | 106 | #### 2. Review UShER outputs and subtrees 107 | 108 | ##### Example 1 109 | 110 | This genome was given a lineage assignment of BA.2 by UShER. This differs from the Nextclade lineage assignment of XQ, which is a bit confusing. On the UShER subtree, this genome falls into a clade of similar sequences, but all of them are assigned the BA.2 lineage. This means that this genome will need further investigation. Since this sequence was high quality, determined to be a recombinant by Nextclade, but was not assigned a recombinant pango lineage in UShER, it may belong to a novel recombinant lineage that has not been designated. 111 | 112 |

113 | 114 |

115 | 116 |

117 | 118 |

119 | 120 | ##### Example 2 121 | 122 | This genome was given a lineage assignment of XE by UShER. When viewing the UShER subtree, this sequence falls amongst many other XE genomes. In fact, it is nearly identical to several other genomes in the public repositories. Therefore, this genome is likely a true XE recombinant. Since this genome belongs to a designated recombinant lineage that has been previously characterized, no further investigation is necessary. 123 | 124 |

125 | 126 |

127 | 128 |

129 | 130 |
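The comparison of Nextclade and UShER results in Examples 1 and 2 can be written down as a small decision rule. The Python sketch below is one possible reading of that logic, not an established tool: the function names and the returned messages are hypothetical, and the check relies on the Pango convention that recombinant lineage names begin with "X". It assumes un-aliased lineage names, so an aliased descendant of a recombinant would not be recognised by this simple prefix test.

```python
def is_recombinant_lineage(lineage: str) -> bool:
    """Pango recombinant lineages are named with an 'X' prefix (XE, XQ, ...)."""
    return lineage.strip().upper().startswith("X")


def next_step(nextclade_clade: str, usher_lineage: str) -> str:
    """Suggest a follow-up action from the two assignments described above."""
    flagged = nextclade_clade.strip().lower() == "recombinant"
    designated = is_recombinant_lineage(usher_lineage)
    if flagged and designated:
        return "designated recombinant lineage: no further investigation needed"
    if flagged:
        return "possible novel recombinant: identify the breakpoint"
    if designated:
        return "assignments disagree: review sequence quality and mutations"
    return "no evidence of recombination from these two tools"


if __name__ == "__main__":
    print(next_step("recombinant", "BA.2"))  # Example 1
    print(next_step("recombinant", "XE"))    # Example 2
```

The rule only encodes the two situations walked through above plus two fallbacks; sequence quality should already have been checked before it is applied.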

131 | 132 | 133 | ### Identify the breakpoint of the putative recombinant 134 | 135 | #### 1. Return to Nextclade output 136 | 137 | When in doubt, always go back to the list of mutations! While looking at individual mutations can be a slow and manual process, it is the best way to determine the breakpoint of the putative novel recombinant genome. You can hover over the number of mutations in the web portal to get a graphical display of the mutations, or you can download the list of mutations as a JSON, CSV, or TSV. 138 | 139 |

140 | 141 |

142 | 143 |

144 | 145 |

146 | 147 | #### 2. Find list of defining mutations for each variant 148 | 149 | Lists of mutations for each variant can be found on the [CoVariants web page](https://covariants.org/variants/21K.Omicron). When you click on a specific variant, a list of defining mutations will appear on the right side of the page. Be aware that several variants share defining mutations. 150 | 151 |

152 | 153 |

154 | 155 |

156 | 157 |

158 | 159 | #### 3. Determine which mutations within the putative novel recombinant genome belong to which donors 160 | 161 | Go mutation-by-mutation and determine whether each mutation is specific to 21K (BA.1) or 21L (BA.2). It is also important to check whether lineage-specific mutations are missing from the genome. Starting in ORF1a with Example 1, the genome contains ORF1a:K856R and ORF1a:L2084I, which are both specific to 21K (BA.1). However, with a BA.1 genome we would expect the next amino acid substitution to be ORF1a:A2710T based on the list from CoVariants. Instead, the next two amino acid substitutions are ORF1a:L3027F and ORF1a:T3090I, which are both specific to 21L (BA.2). This means that the breakpoint for this recombinant genome lies between ORF1a:L2084I and ORF1a:A2710T, which corresponds to nucleotide positions 6,516-8,392. 162 | 163 |

164 | 165 |

166 | 167 |

168 | 169 |
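The step from amino acid positions to the nucleotide interval 6,516-8,392 is codon arithmetic. The Python sketch below shows one way to reproduce it, assuming that ORF1a begins at nucleotide 266 of the Wuhan-1 reference (MN908947.3) and that each substitution sits at the first base of its codon. The second assumption is made here only because it reproduces the interval given above; the actual nucleotide positions should be read from the Nextclade output. The function names are hypothetical.

```python
ORF1A_START = 266  # first nucleotide of ORF1a in Wuhan-1 (MN908947.3)


def codon_start(aa_position: int, gene_start: int = ORF1A_START) -> int:
    """Genome coordinate (1-based) of the first base of a codon."""
    if aa_position < 1:
        raise ValueError("aa_position is 1-based and must be >= 1")
    return gene_start + 3 * (aa_position - 1)


def breakpoint_interval(last_donor_a_site: int, first_donor_b_site: int) -> tuple:
    """Range in which recombination could have occurred.

    The range starts one base after the last marker that matches the first
    donor and ends one base before the first marker that matches the second.
    """
    if first_donor_b_site <= last_donor_a_site:
        raise ValueError("markers must be given in genome order")
    return last_donor_a_site + 1, first_donor_b_site - 1


if __name__ == "__main__":
    last_ba1 = codon_start(2084)   # ORF1a:L2084I, present (BA.1-like)
    first_ba2 = codon_start(2710)  # ORF1a:A2710T, absent (BA.2-like)
    print(breakpoint_interval(last_ba1, first_ba2))
```

When synonymous changes are available between the two markers, passing their nucleotide positions to `breakpoint_interval` instead would narrow the range.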

170 | 171 | In this case, the breakpoint was bounded by two sites that confer amino acid substitutions. Some cases, however, will require looking at synonymous nucleotide changes as well. [CoVariants](https://covariants.org/) lists these underneath the list of amino acid substitutions. 172 | 173 |

174 | 175 |

176 | 177 | ## Proposing a new recombinant lineage: 178 | 179 | The pango team recently released a set of [guidelines](https://www.pango.network/pango-lineages-guidelines-for-suggesting-novel-and-recombinant-lineages/) for proposing new recombinant lineages. 180 | 181 | # Publications on SARS-CoV-2 recombinants 182 | 183 | - Bolze, A., White, S., Basler, T., Dei Rossi, A., Roychoudhury, P., Greninger, A. L., ... & Luo, S. (2022). Evidence for SARS-CoV-2 Delta and Omicron co-infections and recombination. medRxiv. doi: https://doi.org/10.1101/2022.03.09.22272113 184 | 185 | - Colson, P, Fournier, P-E, Delerce, J, et al. Culture and identification of a “Deltamicron” SARS-CoV-2 in a three cases cluster in southern France. J Med Virol. 2022; 1- 11. doi:10.1002/jmv.27789 186 | 187 | - Colson, P., Delerce, J., Marion-Paris, E., Lagier, J. C., Levasseur, A., Fournier, P. E., ... & Raoult, D. (2022). A 21L/BA. 2-21K/BA. 1 “MixOmicron” SARS-CoV-2 hybrid undetected by qPCR that screen for variant in routine diagnosis. medRxiv. doi: https://doi.org/10.1101/2022.03.28.22273010 188 | 189 | - Duerr, R., Dimartino, D., Marier, C., Zappile, P., Wang, G., Plitnick, J., ... & Heguy, A. (2022). Delta-Omicron recombinant SARS-CoV-2 in a transplant patient treated with Sotrovimab. bioRxiv. doi: https://doi.org/10.1101/2022.04.06.487325 190 | 191 | - Gu, H., Ng, D., Liu, G., Cheng, S., Krishnan, P., Chang, L....Poon, L. (2022). Recombinant BA.1/BA.2 SARS-CoV-2 Virus in Arriving Travelers, Hong Kong, February 2022. Emerging Infectious Diseases, 28(6), 1276-1278. https://doi.org/10.3201/eid2806.220523. 192 | 193 | - Lacek, Kristine A., Benjamin Rambo-Martin, Dhwani Batra, Xiao-yu Zheng, Matthew W. Keller, Malania Wilson, Mili Sheth et al. "Identification of a Novel SARS-CoV-2 Delta-Omicron Recombinant Virus in the United States." bioRxiv (2022). doi: https://doi.org/10.1101/2022.03.19.484981 194 | 195 | - Lacek, K. A., Rambo-Martin, B. L., Batra, D., Zheng, X. 
Y., Hassell, N., Sakaguchi, H., ... & Paden, C. R. (2022). SARS-CoV-2 Delta-Omicron Recombinant Viruses, United States. Emerging Infectious Diseases, 28(7). DOI: 10.3201/eid2807.220526 196 | 197 | - da Silva, L. S., de Oliveira, C. M., Cota, B. D. C. V., Romano, C. M., & Levi, J. E. (2022). Three SARS-CoV-2 recombinants identified in Brazilian children. DOI: https://doi.org/10.21203/rs.3.rs-1641864/v1 198 | 199 | - Moisan, A., Mastrovito, B., De Oliveira, F., Martel, M., Hedin, H., Leoz, M., ... & Plantier, J. C. (2022). Evidence of transmission and circulation of Deltacron XD recombinant SARS-CoV-2 in Northwest France. Clinical Infectious Diseases. doi: https://doi.org/10.1093/cid/ciac360 200 | 201 | - SIMON-LORIERE, E., Montagutelli, X., Lemoine, F., Donati, F., Touret, F., Bourret, J., ... & Danish COVID-19 Genome Consortium (DCGC). (2022). Rapid characterization of a Delta-Omicron SARS-CoV-2 recombinant detected in Europe. https://doi.org/10.21203/rs.3.rs-1502293/v1 202 | 203 | - VanInsberghe, D., Neish, A. S., Lowen, A. C., & Koelle, K. (2021). Recombinant SARS-CoV-2 genomes circulated at low levels over the first year of the pandemic. Virus Evolution, 7(2), veab059.doi: https://doi.org/10.1093/ve/veab059 204 | 205 | - Wertheim, J. O., Wang, J. C., Leelawong, M., Martin, D. P., Havens, J. L., Chowdhury, M. A., ... & Hughes, S. (2022). Capturing intrahost recombination of SARS-CoV-2 during superinfection with Alpha and Epsilon variants in New York City. medRxiv. doi: https://doi.org/10.1101/2022.01.18.22269300 206 | --------------------------------------------------------------------------------