├── LICENSE
├── README.md
└── docs
    ├── bioinfo-solutions.md
    ├── hiv-bioinfo-solutions.md
    ├── images
        ├── PHA4GE_SC2_QC_Workflow.png
        ├── hiv-bioinfo-solutions-figure-1.svg
        ├── hiv-bioinfo-solutions-table-2.svg
        ├── influenza-guidance-fig1.png
        ├── influenza-guidance-fig2.png
        ├── omicron_standford.svg
        ├── pha4ge_sc2_qc_workflow.png
        └── sc2-recombinants
        │   ├── covariants21k.png
        │   ├── covariants21l.png
        │   ├── ex1-usher-metrics.png
        │   ├── ex1-usher-tree.png
        │   ├── ex1.png
        │   ├── ex2-usher-metrics.png
        │   ├── ex2-usher-tree.png
        │   ├── ex2.png
        │   ├── ex3.png
        │   ├── mutations1.png
        │   ├── mutations2.png
        │   ├── mutations3.png
        │   ├── nextclade-output.png
        │   └── nextclade-output2.png
    ├── influenza-bioinfo-solutions.md
    ├── mpxv-bioinfo-solutions.md
    ├── omicron-resources.md
    ├── pipeline-best-practices.md
    ├── qc-solutions.md
    └── sc2-recombinants.md


/LICENSE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "[]"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright [yyyy] [name of copyright owner]
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | 
 2 | ![PHA4GE logo](https://pha4ge.org/wp-content/uploads/2020/09/phage-logo-thin.png)
 3 | *Bioinformatics Pipelines and Visualization Working Group Resources*
 4 | 
 5 | Overview
 6 | ========
 7 | This repository hosts [PHA4GE-developed](https://pha4ge.org/) guidance documents and resources that address common challenges regarding the integration of bioinformatics solutions for the global public health community.
 8 | 
 9 | ## Contents
10 | - [Rationale](#rationale)
11 | - [SARS-CoV-2 Resources](#sars-cov-2-resources)
12 | 	- [Omicron Variant Resources](/docs/omicron-resources.md)
13 | 	- [Identifying SARS-CoV-2 Recombinants](docs/sc2-recombinants.md)
14 | 	- [Bioinformatics Solutions](docs/bioinfo-solutions.md)
15 | 	- [Validation Data Sets](https://github.com/CDCgov/datasets-sars-cov-2)
16 | 	- [Quality Control Guidance](docs/qc-solutions.md)
17 | 	- [Informing Public Health Action](#sars-cov-2-resources)
18 | - [Mpox Resources](#mpox-resources)
19 | 	- [Bioinformatics Solutions](docs/mpxv-bioinfo-solutions.md)
20 | - [HIV Resources](#hiv-resources)
21 | 	- [Bioinformatics Solutions](docs/hiv-bioinfo-solutions.md)
22 | - [Bioinformatics Development](#bioinformatics-development)
23 | 	- [Best Practices for Public Health Bioinformatics Pipelines](https://github.com/pha4ge/public-health-pipeline-best-practices/blob/main/docs/pipeline-best-practices.md)
24 | - [Contributing](#contributing)
25 | 
26 | 
27 | Rationale
28 | ========
29 | As public health bioinformatic workflows become increasingly complicated, efforts are needed to promote sensible standardization, portability and reproducibility of assays and workflows across a range of environments, contexts and resource conditions. 
30 | 
31 | SARS-CoV-2 Resources
32 | ==================
33 | 
34 | ### [Omicron Variant Resources](/docs/omicron-resources.md)
35 | 
 36 | The PHA4GE Pipelines and Visualization Working Group has created this document to highlight critical open-source and open-access resources to aid in the understanding and further analysis of the Omicron variant.
37 | 
38 | ### [Identifying SARS-CoV-2 Recombinants](docs/sc2-recombinants.md)
39 | 
40 | SARS-CoV-2 recombinants have garnered the attention of the public health community largely due to the unknown clinical and epidemiological implications. This uncertainty emphasizes the need to detect and characterize recombinant SARS-CoV-2 genomes, but the ability to do so rapidly and systematically is not without challenges. Often, recombinant genomes receive an “Unassigned” pango lineage, a non-recombinant pango lineage, or the incorrect recombinant lineage assignment. Additionally, determining the site of recombination within the genome can be difficult for those without extensive SARS-CoV-2 bioinformatics experience.
41 | 
42 | The PHA4GE Pipelines and Visualization Working Group has created this document as an attempt to highlight critical sources of information and open-source/access resources to aid in the analysis and surveillance of potential recombinant specimens.
43 | 
44 | ### [Bioinformatics Solutions](docs/bioinfo-solutions.md)
45 | 
 46 | To help public health laboratories integrate bioinformatics solutions, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the major bioinformatics challenges for SARS-CoV-2 (SC2) genomic analysis and suggest various open-source and freely available bioinformatics resources to address them.
47 | 
48 | ### [Validation Data Sets](https://github.com/CDCgov/datasets-sars-cov-2)
49 | 
50 | The US Centers for Disease Control and Prevention's Technical Outreach and Assistance for States Team (TOAST) developed benchmark datasets for SARS-CoV-2 sequencing which are designed to help users at varying stages of building sequencing capacity. Rather than duplicating these efforts, the PHA4GE bioinformatics pipeline and visualization working group will be working alongside TOAST members to maintain and improve upon the currently-available validation datasets. 
51 | 
52 | ### [Quality Control Guidance](docs/qc-solutions.md)
53 | 
 54 | In an attempt to assist with quality control (QC) measures, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the QC challenges for SC2 genomic analysis and suggest QC solutions to address them.
55 | 
56 | ### Informing Public Health Action
57 | 
58 | {In development}
59 | 
60 | 
61 | Mpox Resources
62 | ==================
63 | 
64 | ### [Bioinformatics Solutions](docs/mpxv-bioinfo-solutions.md)
65 | 
 66 | To help public health laboratories integrate bioinformatics solutions, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the major bioinformatics challenges for Mpox genomic analysis and suggest various open-source and freely available bioinformatics resources to address them.
67 | 
68 | HIV Resources
69 | ==================
70 | 
71 | ### [Bioinformatics Solutions](docs/hiv-bioinfo-solutions.md)
72 | 
 73 | Understanding the HIV genome, its evolutionary dynamics, and its subtypes is essential for designing bioinformatic processes. Here, we present a set of resources to help springboard researchers into the world of HIV bioinformatics.
74 | 
75 | Bioinformatics Development
76 | ==================
77 | 
78 | ### [Public Health Pipeline Best Practices](https://github.com/pha4ge/public-health-pipeline-best-practices/blob/main/docs/pipeline-best-practices.md)
79 | 
 80 | In an attempt to assist software developers, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has proposed a set of best practices tailored specifically for public health bioinformatics pipelines. These best practices aim to provide a guidance framework for the development, testing, and maintenance of bioinformatics software. By adhering closely to these best practices, developers can enhance the quality, reliability, and sustainability of their software, facilitating impact in public health research.
81 | 
82 | Contributing
83 | ============
84 | Contributions to the documents are more than welcome. To propose a change, edit the source files and open a pull-request with the proposed changes.
85 | 
 86 | If you're interested in participating in further discussions, please feel free to join the [Working Group](https://pha4ge.org/bioinformatics-pipelines-and-visualization/).
87 | 
88 | 


--------------------------------------------------------------------------------
/docs/bioinfo-solutions.md:
--------------------------------------------------------------------------------
  1 | # **Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis**
  2 | 
  3 | PHA4GE Bioinformatics Pipelines & Visualization Working Group <br/>
  4 | Libuit KG, Park D, van Heusden P, Neher R, Kapsak CJ, Southgate J, Bridges D, Mboowa G, Lunn S, Constantinides B, Varona S, Langhorst B
  5 | 
  6 | <details>
  7 |  <summary> Document Changelog</summary>
  8 | 
  9 | - 2023-03-19:
 10 |   - Add CLI tool ViralConsensus
 11 |   - Add changelog
 12 | - 2023-04-13:
 13 |   - Add CLI tool nf-core/viralrecon
 14 | </details>
 15 | 
 16 | # Overview
 17 | 
 18 | Genomic analysis of SARS-CoV-2 (SC2) samples is an increasingly critical function of public health laboratories around the world. Integrating the appropriate bioinformatics solutions to support this work, however, can be an overwhelming challenge.
 19 | 
 20 |  In an attempt to assist this integration process, the Bioinformatics Pipelines and Visualization Working Group of the [Public Health Alliance for Genomic Epidemiology (PHA4GE)](https://www.pha4ge.org) has drafted this living document to help define the major bioinformatics challenges for SC2 genomic analysis and suggest various open-source and freely available bioinformatics resources to address them.
 21 | 
 22 | Please note that the bioinformatics resources listed in this document are simply an attempt to highlight the most accessible solutions **as per the opinions of our working group** and in no way represent a comprehensive list of all available SC2 bioinformatics resources. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull-requests and/or raised GitHub issues.
 23 | 
 24 | # Bioinformatics Challenges for Public Health
 25 | 
 26 | The PHA4GE Bioinformatics Pipelines and Visualization Working Group has defined four key public health bioinformatics challenges for genomic analysis of SC2 samples:
 27 | 
 28 | 1. **Generating consensus assemblies from PCR tiling NGS data:** Tiled amplicon sequencing (through the ARTIC V3 protocol, for example) is the most commonly adopted method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample. As a result, one of the initial bioinformatics challenges laboratories face is the assembly of PCR tiling NGS data into a contiguous SC2 genome from which powerful public health insights can be derived, such as lineage typing and genomic epidemiology studies that help inform public health decision making.
 29 | 
 30 | 2. **Submitting raw sequence data (fastq), consensus assemblies (fasta), and relevant sample metadata to internationally-accessible databases:** Sharing of sample read and assembly data through internationally accessible databases allows insights to be drawn about how the virus is spreading and mutating across the globe; the more freely available these data are to international researchers and public health scientists, the stronger our decision making can be.
 31 | 
 32 | 3. **Screening sequenced SC2 samples for variants of concern:** The detection of certain genetic variants of the SARS-CoV-2 virus may have a significant impact on the decisions of public health officials. Thus, the ability to accurately and reliably screen for variants of interest (VoI) and variants of concern (VoC), such as B.1.1.7 (Alpha) or B.1.617.2 (Delta), is a critical component of the bioinformatics analysis of SC2 genomes.
 33 | 
 34 | 4. **Performing phylogenetic analysis of SC2 datasets:** Genetic relatedness as inferred through phylogenetic analysis of SC2 datasets can be a powerful proxy for epidemiological associations that help resolve transmission networks, enable real-time surveillance, provide insights into how SC2 samples vary over time, and support local outbreak investigations.
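
As a rough illustration of the genetic relatedness mentioned in challenge 4, the following Python sketch counts pairwise SNP differences between already-aligned consensus sequences, skipping ambiguous (`N`) and gap positions. The function names and the toy sequences are made up for illustration; real phylogenetic analyses rely on dedicated tools and substitution models rather than raw difference counts.

```python
from itertools import combinations

# Characters that carry no base information and are skipped when comparing.
AMBIGUOUS = {"N", "-"}


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences differ, ignoring N and gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in AMBIGUOUS and b not in AMBIGUOUS
    )


def pairwise_distances(alignment: dict[str, str]) -> dict[tuple[str, str], int]:
    """Return the SNP distance for every pair of named sequences."""
    return {
        (name_a, name_b): snp_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(sorted(alignment), 2)
    }


# Toy, made-up aligned fragments (not real SC2 data).
toy_alignment = {
    "sample_A": "ACGTACGTAC",
    "sample_B": "ACGTACGTAT",
    "sample_C": "ACNTAGGTAT",
}
distances = pairwise_distances(toy_alignment)
for pair, dist in distances.items():
    print(pair, dist)
```

Small distances between samples hint at close relatedness, but they are only a starting point: sequence quality, missing data, and sampling context all affect how such counts should be interpreted.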
 35 | 
 36 | # Open-Access/Source Bioinformatics Solutions & Resources
 37 | 
 38 | ## 1. Generating consensus assemblies from PCR tiling NGS data
 39 | 
 40 | The bioinformatics resources listed below are open-source pipelines that run on general-purpose, containerized workflow infrastructure to generate consensus SC2 assemblies from PCR tiling NGS data. While some parameters and modules may differ slightly, each pipeline will perform read mapping to the Wuhan-1 reference genome, remove primer regions from the mapped read data, and generate a consensus assembly based on conserved and variant positions identified in the resulting alignment. These resources have been organized into three categories: [Terra](https://app.terra.bio) and [Galaxy](https://galaxyproject.org/) Workflows, Web-Accessible Software as a Service (SaaS) Solutions, and Command-Line Interface (CLI) tools. They are listed in no particular order.
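
The three steps described above (mapping, primer removal, consensus calling) can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not how any of the listed pipelines is implemented: reads are assumed to be already placed on the reference without gaps, primer sites are given as half-open intervals, and all names (`trim_primers`, `call_consensus`, `min_depth`) and sequences are hypothetical.

```python
from collections import Counter


def trim_primers(start: int, seq: str, primer_regions: list[tuple[int, int]]) -> tuple[int, str]:
    """Clip read ends that begin or end inside a primer interval (half-open)."""
    end = start + len(seq)
    for p_start, p_end in primer_regions:
        if p_start <= start < p_end:
            # Read begins inside a primer: drop the leading primer-derived bases.
            clip = min(p_end, end) - start
            seq, start = seq[clip:], start + clip
        if p_start < end <= p_end:
            # Read ends inside a primer: drop the trailing primer-derived bases.
            clip = end - max(p_start, start)
            seq = seq[: len(seq) - clip]
            end = start + len(seq)
    return start, seq


def call_consensus(reference: str, reads: list[tuple[int, str]],
                   primer_regions: list[tuple[int, int]], min_depth: int = 2) -> str:
    """Majority-vote consensus; positions below min_depth are reported as N."""
    counts = [Counter() for _ in reference]
    for start, seq in reads:
        start, seq = trim_primers(start, seq, primer_regions)
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                counts[pos][base.upper()] += 1
    consensus = []
    for site in counts:
        depth = sum(site.values())
        consensus.append(site.most_common(1)[0][0] if depth >= min_depth else "N")
    return "".join(consensus)


# Made-up 12-base "reference" and pre-placed reads (0-based start, sequence).
reference = "ACGTACGTACGT"
primer_regions = [(0, 3)]
reads = [
    (0, "ACGTAC"),
    (2, "GTATGT"),
    (3, "TATGTA"),
    (6, "GTACGT"),
    (6, "GTACGT"),
]
consensus = call_consensus(reference, reads, primer_regions)
print(consensus)
```

In this toy input the primer-covered positions end up as `N` because no other read spans them; in a tiled amplicon scheme, neighboring amplicons would normally cover those sites. Production pipelines also handle insertions, deletions, base qualities, and allele-frequency thresholds, which this sketch leaves out.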
 41 | 
 42 | <details>
 43 |  <summary>Terra and Galaxy Workflows</summary>
 44 | 
 45 | - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs)
 46 |   - **Brief Description:** The viral-ngs workflow collection contains many tools for viral analysis. The consensus genome caller is called assemble\_refbased and should work for any low-diversity microbial genome and is appropriate for viruses stemming from a single point-source outbreak, such as SARS-CoV-2. Accepts Illumina paired, single, or mixed reads, as well as ONT reads. Accepts metagenomic or amplicon-based reads with primer trimming.
 47 |   - **Developed/supported by:** Broad Institute Viral Genomics 
 48 |   - **Documentation:** [Technical documentation (ReadTheDocs)](https://viral-ngs.readthedocs.io/en/latest/)
 49 |   - **User base:** [H3Africa](https://h3africa.org/index.php/consortium/genomic-characterization-and-surveillance-of-microbial-threats-in-west-africa/) West African sites ([RUN](http://acegid.org/), [KGH](https://vhfc.org/consortium/people/), [UCAD](https://www.ucad.sn/))
 50 |   - **Workflow language:** WDL
 51 |     - **Web/Cloud GUI Platforms:** Terra, DNAnexus
 52 |     - **CLI Platforms:** Cromwell (local HPC, cloud), miniWDL
 53 | - [Theiagen's Public Health Viral Genomics WDL Workflows](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics)
 54 |   - **Brief Description:** Theiagen's Public Health Viral Genomics WDL Workflows include four separate WDL workflows (Titan\_Illumina\_PE, Titan\_Illumina\_SE, Titan\_ClearLabs, and Titan\_ONT) that process NGS read data from four different sequencing approaches (Illumina paired-end, Illumina single-end, Clear Labs, and Oxford Nanopore Technology (ONT)) to generate consensus assemblies, produce relevant quality-control metrics for both the input read data and the generated assembly, and assign each sample a lineage and clade designation using Pangolin and NextClade, respectively.
 55 |   - **Developed/supported by:** Theiagen Genomics
 56 |   - **Documentation:** [Technical documentation (ReadTheDocs)](https://public-health-viral-genomics-theiagen.readthedocs.io/en/latest/overview.html), [step-by-step protocols (Protocols.io)](https://www.protocols.io/file-manager/9EF18A27777511EBA1C60A58A9FEAC2A), and [video tutorials (YouTube Playlist)](https://www.youtube.com/watch?v=fy0Hm0lfIas&list=PLU47xRg_MKJrtyoFwqGiywl7lQj6vq8Uz)
 57 |   - **User base:** US PHLs
 58 |   - **Workflow language:** WDL
 59 |     - **Web/Cloud GUI Platforms:** Terra
 60 |     - **CLI Platforms:** Cromwell (local HPC, cloud), miniWDL
 61 | - [COVID-19 Galaxy Workflows](https://covid19.galaxyproject.org/artic/)
 62 |   - **Brief Description:** Several Galaxy workflows for SC2 consensus genome assembly and the analysis of SARS-CoV-2 data have been made available.
 63 |   - **Workflow language:** Galaxy
 64 |     - **Developed/supported by:** usegalaxy.eu ([https://covid19.galaxyproject.org/artic/](https://covid19.galaxyproject.org/artic/))
 65 |       - **Web/Cloud GUI Platforms:** [usegalaxy.*](https://galaxyproject.org/use/)
 66 |       - **Documentation:** [SARS-CoV-2 Data Analysis and Monitoring with Galaxy](https://galaxyproject.eu/event/2021-06-21-sars-cov-2-data-analysis-monitoring-training/)
 67 |       - **Sequencing technologies supported:** Illumina metagenomic sequencing, Illumina and Oxford Nanopore ARTIC amplicon sequencing
 68 |     - **Developed/supported by:** ARIES/Istituto Superiore di Sanità
 69 |       - **Web/Cloud GUI Platforms:** [ARIES Galaxy](https://aries.iss.it/) ([https://aries.iss.it/u/arnold-knijn/w/sars-cov-2recovery31](https://aries.iss.it/u/arnold-knijn/w/sars-cov-2recovery31))
 70 |       - **Documentation:** [bioRxiv](https://www.biorxiv.org/content/10.1101/2021.01.16.425365v1)
 71 |       - **Sequencing technologies supported:** Illumina, Ion Torrent and Oxford Nanopore ARTIC amplicon sequencing
 72 | </details>
 73 | 
 74 | <details>
 75 |  <summary>Web-Accessible SaaS Solutions</summary>
 76 |  
 77 | - [IDSeq](https://idseq.net/)
 78 |   - **Brief Description:** User-friendly software platform originally developed for metagenomics studies that has since been repurposed to include SC2 consensus assembly from Oxford Nanopore or paired-end Illumina data
 79 |   - **Developed/supported by:** [Chan Zuckerberg Initiative (CZI)](https://chanzuckerberg.com/) 
 80 |   - **User base:** CZ Biohub & partners; access available on request to other users
 81 |   - **User-interface:** Web application on CZI-funded AWS
 82 | - [EDGE COVID-19](https://edge-covid19.edgebioinformatics.org/)
 83 |   - **Brief Description:** EDGE COVID-19 is a derivative of the original EDGE Bioinformatics software (Li _et al._ 2017) that was developed to perform reference-based SC2 assemblies and quality assessment of Illumina or Nanopore read data.
 84 |   - **Developed/supported by:** Los Alamos National Laboratories
 85 |   - **Documentation:** [EDGE COVID-19 User Guide](https://edge-covid19.edgebioinformatics.org/docs/EDGE_COVID-19_guide.pdf)
 86 |   - **User base:** LANL & partners
 87 |   - **User-interface:** Web application on LANL hardware, [local instance using Docker](https://hub.docker.com/r/bioedge/edge-covid19)
 88 | 
 89 | </details>
 90 | 
 91 | <details>
 92 |  <summary>Command-line interface (CLI) Tools</summary>
 93 |  
 94 | - [SIGNAL (SARS-CoV-2 Illumina GeNome Assembly Line; CanCOGeN/OnCOV)](https://github.com/jaleezyy/covid-19-signal)
 95 |   - **Brief Description:** Quality control, assembly, and analysis Snakemake workflow for Illumina-based viral amplicon sequencing. Includes de-hosting via competitive mapping, freebayes variant and consensus generation, lineage assignment, interactive HTML run summaries, and integration with the [ncov-tools](https://github.com/jts/ncov-tools/) QC workflow.
 96 |   - **Developed/supported by:** [CARD/McArthur Lab](https://mcarthurbioinformatics.ca), lead maintainers: Jalees Nasir & Finlay Maguire 
 97 |   - **Documentation:** [Technical Documentation (GitHub README)](https://github.com/jaleezyy/covid-19-signal)
 98 |   - **User base:** CA PHLs & academic partners
 99 |   - **User-interface:** CLI (Snakemake)
100 | - [ARTIC nCOV19 (ARTIC Network; Connor-lab)](https://github.com/connor-lab/ncov2019-artic-nf)
101 |   - **Brief Description:** Configured conda environment that enables access to Oxford Nanopore or Illumina consensus sequence assemblers: Medaka (ONT), NanoPolish (ONT) or BWA (Illumina)
102 |   - **Developed/supported by:** COG UK / ARTIC
103 |   - **Documentation:** [Technical Documentation (GitHub README)](https://github.com/connor-lab/ncov2019-artic-nf/blob/master/README.md)
104 |   - **User base:** COG UK
105 |   - **Workflow language:** Nextflow
106 |     - **CLI Platforms:** Nextflow cli client, Nextflow Tower (local HPC, cloud, etc)
107 | - [StaPH-B ToolKit](https://github.com/StaPH-B/staphb_toolkit)
108 |   - **Brief Description:** Two StaPH-B workflows for performing SC2 consensus genome assembly have been available: Cecret, a pipeline developed for the analysis of single or paired-end Illumina reads, and Monroe, a workflow with various subcommands that perform consensus genome assembly from either Illumina or Nanopore read data.
109 |   - **Developed/supported by:** StaPH-B
110 |   - **Documentation:** [https://staph-b.github.io/staphb\_toolkit/](https://staph-b.github.io/staphb_toolkit/install/), [Python Package Index (PyPI)](https://pypi.org/project/staphb-toolkit/)
111 |   - **User base:** US PHLs
112 |   - **User-interface:** CLI (Python package)
113 | - [ViralConsensus](https://github.com/niemasd/ViralConsensus)
114 |   - **Brief Description:** A primer-aware consensus assembler developed for efficient assembly of SARS-CoV-2 reads from CRAM/BAM/SAM input. Written in C++. [Preprint](https://www.biorxiv.org/content/10.1101/2023.01.05.522928v1).
115 |   - **Developed/supported by:** Niema Moshiri
116 |   - **Documentation:** [Github Readme](https://github.com/niemasd/ViralConsensus), [DockerHub](https://hub.docker.com/r/niemasd/viral_consensus)
117 |   - **User base:** Unknown
118 |   - **User-interface:** CLI (C++ executable)
119 | - [nf-core/viralrecon](https://github.com/nf-core/viralrecon)
120 |   - **Brief Description:** nf-core/viralrecon is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples. The pipeline supports both Illumina and Nanopore sequencing data. For Illumina short reads, the pipeline can analyse metagenomics data typically obtained from shotgun sequencing (e.g. directly from clinical samples) and from enrichment-based library preparation methods (e.g. amplicon-based: ARTIC SARS-CoV-2 enrichment protocol; or probe-capture-based). For Nanopore data, the pipeline only supports amplicon-based analysis of data obtained with primer sets created and maintained by the ARTIC Network.
121 |   - **Developed/supported by:** [nf-core](https://nf-co.re/)
122 |   - **Documentation:** [Github Readme](https://github.com/nf-core/viralrecon), [nf-core documentation](https://nf-co.re/viralrecon)
123 |   - **User base:** Unknown
124 |   - **Workflow language:** Nextflow
125 |     - **CLI Platforms:** Nextflow cli client, Nextflow Tower (local HPC, cloud, etc)
126 |  
127 | </details>
128 | 
129 | ## 2. Submitting raw sequence data (fastq), consensus assemblies (fasta), and relevant sample metadata to internationally-accessible databases
130 | 
131 | Below is a list of resources developed to assist in the preparation and submission of raw NGS read data (fastq files), SC2 consensus assemblies (fasta files), and contextual sample metadata to internationally-accessible databases such as [NCBI](https://www.ncbi.nlm.nih.gov/sars-cov-2/), [ENA](https://www.ebi.ac.uk/ena/browser/home), and [GISAID](https://www.gisaid.org/). We have also included a list of bioinformatics software designed to assess the quality of SC2 data; we recommend the use of such software prior to submission to avoid the inadvertent sharing of poor quality, contaminated, or otherwise misleading SC2 data. Additional information regarding the interpretation of read and assembly quality metrics for SC2 data will be made available as a separate document.
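As a rough illustration of the kind of pre-submission check recommended above, the following minimal Python sketch computes genome length and the fraction of ambiguous `N` bases for each consensus sequence in a FASTA file. The function names and the thresholds (`min_length=29000`, `max_frac_n=0.05`) are hypothetical placeholders chosen for illustration, not criteria taken from any database or from the tools listed below; dedicated tools such as VADR perform far more thorough checks.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def qc_metrics(seq: str) -> Dict[str, float]:
    """Return length, N count, and fraction of N for one consensus sequence."""
    seq = seq.upper()
    length = len(seq)
    n_count = seq.count("N")
    frac_n = n_count / length if length else 1.0
    return {"length": length, "n_count": n_count, "frac_n": frac_n}


def passes_qc(metrics: Dict[str, float],
              min_length: int = 29000,       # hypothetical threshold
              max_frac_n: float = 0.05) -> bool:  # hypothetical threshold
    """Apply illustrative pass/fail thresholds to the metrics."""
    return metrics["length"] >= min_length and metrics["frac_n"] <= max_frac_n
```

A file can be screened with `for name, seq in read_fasta(open("consensus.fasta")): print(name, passes_qc(qc_metrics(seq)))`, where `consensus.fasta` is a placeholder path.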
132 | 
133 | <details>
134 |  <summary>Recommended SC2 Sample Metadata Specifications</summary>
135 |  
136 | - [PHA4GE Contextual Data Specifications](https://www.preprints.org/manuscript/202008.0220/v1)
137 |   - **Database Target(s):** GISAID, ENA, SRA, Genbank
138 |   - **Brief Description:** A SARS-CoV-2 contextual data specification based on harmonizable, publicly available, community standards. The specification is implementable via a collection template, as well as an array of protocols and tools to support the harmonization and submission of sequence data and contextual information to public repositories.
139 |   - **Developed/supported by:** PHA4GE
140 |   - **Documentation:** [Technical documentation (GitHub README)](https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification)
141 |   - **User base:** Global public health community
142 |   - **Protocols:** [NCBI Submission](http://dx.doi.org/10.17504/protocols.io.bsypnfvn), [ENA Submission](http://dx.doi.org/10.17504/protocols.io.buqnnvve), & [GISAID Submission](http://dx.doi.org/10.17504/protocols.io.bumknu4w)
143 |  
144 | </details>
145 | 
146 | <details>
147 |  <summary>Bioinformatics Solutions to Prepare and/or Submit SC2 Sample Data</summary>
148 | 
149 | - [Galaxy ENA Submission Plugin](https://github.com/galaxyproject/tools-iuc/tree/master/tools/ena_upload)
150 |   - **Database Target(s):** ENA
151 |   - **Brief Description:** Galaxy plugin for direct submission to the European Nucleotide Archive database
152 |   - **Developed/supported by:** [Galaxy IUC (Intergalactic Utilities Commission)](https://galaxyproject.org/iuc/)
153 |   - **Documentation:** [https://github.com/ELIXIR-Belgium/ena-upload-container](https://github.com/ELIXIR-Belgium/ena-upload-container)
154 |   - **User base:** European PHLs
155 |   - **Workflow language:** Galaxy
156 |     - **Web/Cloud GUI Platforms:** GalaxyProject  
157 | - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) (Terra workflows described above)
158 |    - **Database Target(s):** GISAID, GenBank, & SRA
159 | - [Theiagen&#39;s Public Health Viral Genomics WDL Workflows](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) (Terra workflows described above)
160 |    - **Database Target(s):** GISAID & GenBank (SRA submission in development)
161 | - [EDGE COVID-19](https://edge-covid19.edgebioinformatics.org/) (SaaS solution described above) 
162 |     - **Database Target(s):** GISAID, GenBank, & SRA
163 | 
164 | 
165 | </details>
166 | 
167 | <details>
168 |  <summary>Bioinformatics Solutions to Assess Data Quality Prior to Submission</summary>
169 |  
170 | - [VADR - Viral Annotation DefineR](https://github.com/ncbi/vadr)
171 |   - **Brief Description:** VADR is a suite of CLI tools for classifying and analyzing sequences homologous to a set of reference models of viral genomes or gene families. With regards to SC2, laboratories have utilized VADR to identify samples with potentially mis-assembled genomes that are likely to be rejected from an internationally-accessible database.
172 |   - **Developed/supported by:** NCBI
173 |   - **Documentation:** [Technical Documentation (GitHub Wiki)](https://github.com/ncbi/vadr/wiki/Coronavirus-annotation)
174 |   - **User base:** NCBI GenBank & US PHLs
175 |   - **Accessibility:** [Local install](https://github.com/ncbi/vadr/blob/master/documentation/install.md#top) or the [StaPH-B Docker Image](https://hub.docker.com/r/staphb/vadr/)
176 | - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) (Terra workflows described above; includes VADR)
177 | - [Titan Workflows for Genomic Characterization](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) (Terra workflows described above; includes VADR)
178 | - [COVID-19 Galaxy Workflows](https://covid19.galaxyproject.org/artic/) (Galaxy resources described above)
179 | - [IDSeq (CZ BioHub)](https://idseq.net/) (SaaS solution described above)
180 | - [EDGE COVID-19](https://edge-covid19.edgebioinformatics.org/) (SaaS solution described above)
181 | - [SIGNAL (SARS-CoV-2 Illumina GeNome Assembly Line; CanCOGeN)](https://github.com/jaleezyy/covid-19-signal) (CLI tool described above)
182 | - [ARTIC nCOV19 (ARTIC Network; Connor-lab)](https://github.com/connor-lab/ncov2019-artic-nf) (CLI tool described above)
183 | - [StaPH-B ToolKit](https://github.com/StaPH-B/staphb_toolkit) (CLI tool described above; VADR included in the Cecret workflow)
184 |  
185 | </details>
186 | 
187 | ## 3. Screening sequenced SC2 samples for variants of concern &amp; general lineage typing
188 | 
189 | These tools either assign a clade or lineage descriptor to consensus sequences or provide databases for looking up information on variants in the SARS-CoV-2 genome. Because variants of concern are listed by their lineage descriptor (typically a PANGO lineage, or sometimes a Nextclade clade), these tools help identify variants of concern.
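To make the link between lineage assignment and variant-of-concern screening concrete, here is a minimal Python sketch that flags samples whose assigned lineage appears on a user-supplied watchlist. It assumes a lineage report in CSV form with `taxon` and `lineage` columns (as in a pangolin-style report; treat the column names as an assumption to verify against your own output), and the watchlist contents are purely illustrative. Sublineages are matched by simple prefix, which does not resolve Pango aliases, so this is a sketch rather than a complete screen.

```python
import csv
from typing import Dict, Iterable, List, Set


def flag_lineages(rows: Iterable[Dict[str, str]],
                  watchlist: Set[str]) -> List[Dict[str, str]]:
    """Return rows whose lineage is on the watchlist or is a sublineage of it.

    Sublineages are detected by prefix ("B.1.1.7" matches "B.1.1.7.1");
    aliased names are NOT resolved.
    """
    flagged = []
    for row in rows:
        lineage = (row.get("lineage") or "").strip()
        for watched in watchlist:
            if lineage == watched or lineage.startswith(watched + "."):
                flagged.append(row)
                break
    return flagged


def flag_from_csv(path: str, watchlist: Set[str]) -> List[Dict[str, str]]:
    """Read a lineage report CSV (placeholder path) and flag watched lineages."""
    with open(path, newline="") as handle:
        return flag_lineages(csv.DictReader(handle), watchlist)
```

For example, `flag_from_csv("lineage_report.csv", {"B.1.1.7", "B.1.351"})` would return the rows for samples assigned to either lineage or their directly named sublineages; the file name is a placeholder.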
190 | 
191 | <details>
192 |  <summary>Bioinformatics tools for SC2 lineage or clade assignment</summary>
193 | 
194 | - [Pangolin (Phylogenetic Assignment of Named Global Outbreak LINeages)](https://cov-lineages.org/pangolin.html)
195 |   - **Brief Description:** Tool developed to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. It allows a user to assign the most likely lineage (PANGO lineage) to SARS-CoV-2 query sequences.
196 |   - **Developed/supported by:** Pangolin Network
197 |   - **Documentation:** [Technical Documentation (Pangolin Website)](https://cov-lineages.org/pangolin.html), [publication (Nature Microbiology)](https://www.nature.com/articles/s41564-020-0770-5)
198 |   - **User base:** Global Public Health Community
199 |   - **Accessibility:** [Web application](https://pangolin.cog-uk.io/) &amp; [CLI tool](https://github.com/cov-lineages/pangolin)
200 |   - **Bioinformatics workflows that incorporate Pango lineage assignments:**
201 |     - [Datapipe](https://github.com/COG-UK/datapipe)
202 |       - **Brief Description:** Performs alignment and variant calling, assigns lineages with pangolin and VOC/VUI with scorpio and cleans up geography metadata.
203 |       - **Developed/supported by:** Virus Group (University of Edinburgh)
204 |       - **User-interface:** command-line tool, nextflow pipeline
205 |       - **User base:** COG-UK
206 |     - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) (Terra workflows described above)
207 |     -  [Theiagen&#39;s Public Health Viral Genomics WDL Workflows](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) (Terra workflows described above)
208 |     - [COVID-19 Galaxy Workflows](https://covid19.galaxyproject.org/artic/) (Galaxy resources described above)
209 |     - [IDSeq](https://idseq.net/) (SaaS solution described above)
210 |     - [EDGE COVID-19](https://edge-covid19.edgebioinformatics.org/) (SaaS solution described above)
211 |     - [SIGNAL (SARS-CoV-2 Illumina GeNome Assembly Line; CanCOGeN)](https://github.com/jaleezyy/covid-19-signal) (CLI tool described above)
212 |     - [StaPH-B ToolKit](https://github.com/StaPH-B/staphb_toolkit) (CLI tool described above)
213 | - [NextClade](https://clades.nextstrain.org/)
214 |   - **Brief Description:** Tool that identifies differences between your sequences and a reference sequence used by Nextstrain, uses these differences to assign your sequences to clades, and reports potential sequence quality issues in your data
215 |   - **User-interface:** [Web application](https://clades.nextstrain.org/) &amp; CLI tool
216 |   - **Help/community/discussion:** [discussion.nextstrain.org](http://discussion.nextstrain.org/)
217 |   - **Bioinformatics workflows that incorporate NextClade clade assignments:**
218 |     - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) (Terra workflows described above)
219 |     -  [Theiagen&#39;s Public Health Viral Genomics WDL Workflows](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) (Terra workflows described above)
220 |     - [COVID-19 Galaxy Workflows](https://covid19.galaxyproject.org/artic/) (Galaxy resources described above)
221 |     - [IDSeq](https://idseq.net/) (SaaS solution described above)
222 |     - [StaPH-B ToolKit](https://github.com/StaPH-B/staphb_toolkit) (CLI tool described above)
223 | 
224 | </details>
225 | 
226 | 
227 | <details>
228 |  <summary>Public Health Resources that Track &amp; Visualize SC2 Variants Over Time</summary>
229 |  
230 |   - [PANGO cov-lineages](https://cov-lineages.org/)
231 |     - **Brief Description:** Track global prevalences of PANGO lineages
232 |     - **Developed/supported by:** Pangolin Network
233 |   - [Covariants](https://covariants.org/)
234 |     - **Brief Description:** Track global prevalence of Nextclade-annotated lineages
235 |     - **Developed/supported by:** NextStrain Team
236 |   - [Outbreak.info](https://outbreak.info/)
237 |     - **Brief Description:** Epidemiological info including PANGO lineage prevalence
238 |     - **Developed/supported by:** [Su](http://sulab.org/), [Wu](http://wulab.io/), and [Andersen](https://andersen-lab.com/) labs at Scripps Research
239 |   - [COV-GLUE](http://cov-glue.cvr.gla.ac.uk/)
240 |     - **Brief Description:** CoV-GLUE contains a database of amino acid replacements, insertions and deletions which have been observed in GISAID hCoV-19 sequences sampled from the pandemic
241 |     - **Developed/supported by:** COG-UK
242 |   - [2019nCoVR](https://bigd.big.ac.cn/ncov/)
243 |     - **Brief Description:** 2019nCoVR features comprehensive integration of genomic and proteomic sequences as well as their metadata information from GISAID, NCBI, NMDC and CNCB/NGDC. It also incorporates a wide range of relevant information, including scientific literature, news, and popular articles for science dissemination, and provides visualization functionalities for genome variation analysis results based on all collected SARS-CoV-2 strains.
244 |     - **Developed/supported by:** China National Center for Bioinformation (CNCB)
245 |   - [CoVizu](https://filogeneti.ca/covizu/)
246 |     - **Brief Description:** CoVizu is an [open source project](https://github.com/PoonLab/CoVizu) endeavouring to visualize the global diversity of SARS-CoV-2 genomes, which are provided by the [GISAID Initiative](https://gisaid.org/).
247 |     - **Developed/supported by:** [Poon Laboratory](https://www.schulich.uwo.ca/pathol/people/bios/faculty/poon_art.html) of Western University
248 |   - [Annotation of SARS-2 Coronavirus Genome (Observable)](https://observablehq.com/@delphine-l/annotation-of-sars-2-coronavirus-genome)
249 |     - **Brief Description:** Annotation of variation in the genome with some notes on what is known about the various amino acids
250 |     - **Developed/supported by:** Delphine Lariviere (Penn State University)
251 | 
252 | </details>
253 | 
254 | <details>
255 |  <summary>Bioinformatics Tools to Track &amp; Visualize Your Own SC2 Variants Over Time </summary>
256 |  
257 |  - [KRISP R-scripts](https://github.com/krisp-kwazulu-natal/africa-covid19-genomics)
258 |     - **Brief Description:** Open-source repository containing all the code, data and information needed to reproduce the analyses for the [African genomic epidemiology manuscript](https://www.nature.com/articles/s41591-021-01255-3).
259 |     - **Developed/supported by:** Emmanuel James San (University of KwaZulu-Natal)
260 |     - **Documentation:** [Technical Documentation (GitHub README)](https://github.com/krisp-kwazulu-natal/africa-covid19-genomics#readme), [publication (Nature Medicine)](https://www.nature.com/articles/s41591-021-01255-3)
261 |     - **Accessibility:** [RCL-Scripts](https://github.com/krisp-kwazulu-natal/africa-covid19-genomics#readme)
262 |   - [GISAID Processing](https://github.com/pvanheus/GISAID_processing)
263 |     - **Brief Description:** Open-source repository containing Python scripts to process GISAID data into frequency graphs
264 |     - **Developed/supported by:** Peter van Heusden (University of Western Cape)
265 |     - **Documentation:** [Technical Documentation (GitHub README)](https://github.com/pvanheus/GISAID_processing/blob/main/README.md)
266 |     - **Accessibility:** [Python-Scripts](https://github.com/pvanheus/GISAID_processing)
267 |  
268 |  </details>
269 |  
270 | ## 4. Performing phylogenetic analysis of SC2 datasets
271 | 
272 | _The tools listed below perform phylogenetic analyses of different complexity, ranging from web-apps to command-line tools that need to run on HPC facilities. The selected tools are integrated with visualization features that facilitate the interrogation of the results, but beware that such inferences might be uncertain and often require careful interpretation._
273 | 
274 | 
275 | <details>
276 |  <summary>Public Health Resources Performing Global SC2 Phylogenetic Analysis </summary>
277 | 
278 | 
279 | - [NextStrain](https://nextstrain.org/)
280 |   - **Brief Description:** Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data.
281 |   - **Developed/supported by:** Fred Hutch/Basel (Nextstrain team)
282 |   - **User base:** USA based groups
283 |   - **Documentation:** [docs](https://docs.nextstrain.org/en/latest/index.html)
284 |   - **Help/community/discussion:** [discussion.nextstrain.org](http://discussion.nextstrain.org/)
285 |   - Implementations for compute steps (&quot;augur&quot;):
286 |     - [**nextstrain/ncov**](https://github.com/nextstrain/ncov) snakemake pipeline
287 |       - **Description:** The authoritative implementation of the Nextstrain &quot;augur&quot; pipeline that takes genomes and metadata to trees and visualizations.
288 |       - **Developed/supported by:** Fred Hutch/Basel (Nextstrain team)
289 |       - **Workflow language:** Snakemake
290 |     - [Broad viral-ngs](https://dockstore.org/organizations/BroadInstitute/collections/pgs) (Terra workflows described above)
291 |     - [Theiagen&#39;s Public Health Viral Genomics WDL Workflows](https://dockstore.org/organizations/Theiagen/collections/PublicHealthViralGenomics) (Terra workflows described above)
292 | - [Microreact](https://microreact.org/)
293 |   - **Brief Description:** Open data visualization and sharing for genomic epidemiology
294 |   - **Developed/supported by:** Centre for Genomic Pathogen Surveillance (CGPS)
295 |   - **User base:** COG-UK, New Zealand, etc
296 |   - **User-interface:** Web application / centrally hosted service
297 | 
298 | </details>
299 | 
300 | <details>
301 |  <summary>Offlineable Browser-Based Web Applications</summary>
302 | 
303 | - [Auspice](https://auspice.us/)
304 |   - **Brief Description:** Allows interactive exploration of phylogenomic datasets by simply dragging & dropping them onto this page.
305 |   - **Developed/supported by:** Fred Hutch/Basel (Nextstrain team)
306 |   - **Documentation:** [Technical documentation (GitHub README)](https://github.com/nextstrain/auspice#readme), [NextStrain discussion Forum](https://discussion.nextstrain.org/)
307 |   - **User-interface:** offlineable browser-based web app
308 | - [MicrobeTrace](https://microbetrace.cdc.gov/MicrobeTrace/)
309 |   - **Brief Description:** The Visualization Multitool for Molecular Epidemiology and Bioinformatics
310 |   - **Developed/supported by:** US CDC
311 |   - **Documentation:** https://github.com/CDCgov/MicrobeTrace
312 |   - **User-interface:** offlineable browser-based web app
313 | - [UShER](https://genome.ucsc.edu/cgi-bin/hgPhyloPlace)
314 |   - **Brief Description:** Places user provided sequences on very large reference trees, extracts the relevant subtree, and provides a visualization
315 |   - **Developed/supported by:** UCSC
316 |   - **User-interface:** offlineable browser-based web app
317 |   
318 | </details>
319 | 
320 | <details>
321 |  <summary>Command-line interface (CLI) Tools</summary>
322 | 
323 | - [Grinch](https://github.com/cov-lineages/grinch)
324 |   - **Brief Description:** Generates reports for the international distribution of PANGO lineages that can be viewed in a web browser.
325 |   - **Developed/supported by:** PANGO, cov-lineages
326 |   - **User-interface:** command-line tool
327 | 
328 | - [Phylopipe](https://github.com/cov-ert/phylopipe)
329 |   - **Brief Description:** Generates a downsampled global tree using FastTree and updates it daily using UShER, cleans and annotates the tree; can be run on output from Datapipe.
330 |   - **Developed/supported by:** Virus Group (University of Edinburgh)
331 |   - **User-interface:** command-line tool, nextflow pipeline
332 |   - **User base:** COG-UK
333 |   
334 | </details>
335 | 


--------------------------------------------------------------------------------
/docs/hiv-bioinfo-solutions.md:
--------------------------------------------------------------------------------
  1 | # HIV Bioinformatics Solutions
  2 | 
  3 | Authors
  4 | =======
  5 | 
  6 | [Amy Gaskin](https://github.com/gaskinae), [Marc Niebel](https://github.com/MarcNiebel), [Frank Ambrosio](https://github.com/frankambrosio3), Abbas Abel Anzaku
  7 | 
  8 | ## Contents
  9 | 
 10 | [Introduction](#introduction)
 11 | 
 12 | [Background Information for Bioinformaticians](#background-information-for-bioinformaticians)
 13 | 
 14 | [Genomic Structure](#genomic-structure)
 15 | 
 16 | [Evolution](#evolution)
 17 | 
 18 | [Subtypes](#subtypes)
 19 | 
 20 | [HIV Bioinformatics Guidance Pathways](#hiv-bioinformatics-guidance-pathways)
 21 | 
 22 |   [Genomic Characterisation/Subtyping](#genomic-characterisationsubtyping)
 23 | 
 24 |   [Drug Resistance Surveillance](#drug-resistance-surveillance)
 25 | 
 26 |   [Drug Development and Resistance Prediction](#drug-development-and-resistance-prediction)
 27 |   
 28 |   
 29 |   [Genomic Epidemiology](#genomic-epidemiology)
 30 | 
 31 | [Sequencing Strategies](#sequencing-strategies)
 32 | 
 33 | [HIV-1 Bioinformatics Tools](#hiv-1-bioinformatics-tools)
 34 | 
 35 |   [Assembly](#assembly)
 36 | 
 37 |   [Resistance detection](#resistance-detection)
 38 | 
 39 |   [Transmission Network Analysis](#transmission-network-analysis)
 40 | 
 41 |   [Sequence Databases](#sequence-databases)
 42 | 
 43 |   [Case Studies](#case-studies)
 44 | 
 45 | ## Introduction
 46 | 
 47 | 
 48 | Human Immunodeficiency Virus (HIV), a highly contagious retrovirus, presents a formidable global public health challenge. According to the World Health Organization (WHO), HIV infections lead to severe immunodeficiency and acquired immunodeficiency syndrome (AIDS), causing hundreds of thousands of deaths annually. The virus primarily targets CD4+ T cells, compromising the immune system and necessitating effective treatment strategies. Given its high mutation rate, understanding the genomics of HIV is crucial for designing treatments, including antiretrovirals and vaccines.
 49 | 
 50 | In recent years, bioinformatics has played a pivotal role in HIV genomics research, offering insights into viral diversity, drug resistance, and transmission patterns. Applying bioinformatics to HIV research enhances public health initiatives by monitoring genetic variations, detecting mutations, supporting risk assessment, and refining vaccine development. Standardization holds the potential to make HIV genomics research more accessible across diverse global settings, as it enhances transparency, reproducibility, and reliability.
 51 | 
 52 | This paper serves as a guidance document, aiming to address existing challenges within the bioinformatics community related to standardization of HIV genomic analyses.  Here, we introduce guidance pathways, which can be likened to distinct themes of bioinformatic analysis specifically tailored to HIV research. These pathways include real-world public health case studies along with relevant HIV bioinformatics tools, making them an invaluable starting point for researchers familiar with microbial bioinformatics but seeking orientation in the world of HIV bioinformatics.
 53 | 
 54 | Therefore, this paper aims to contribute to a global network of knowledge-sharing, promoting equitable access to genomics for HIV -- empowering researchers and clinicians worldwide in pursuit of reducing the burden of disease. 
 55 | 
 56 | ## Background Information for Bioinformaticians
 57 | 
 58 | Understanding the HIV genome, its evolutionary dynamics, and its subtypes is essential for designing bioinformatic processes. Here, we present a set of resources to help springboard researchers into the world of HIV bioinformatics! 
 59 | 
 60 | ## Genomic Structure 
 61 | 
 62 | The diploid genome of HIV-1 consists of approximately 9700 nucleotides and features nine genes that encode fifteen proteins (Figure 1)[<sup>1</sup>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6439341/), which interact with human proteins as part of the HIV-1 viral life cycle. Structural proteins, enzymes, and envelope proteins are encoded by three main genes: gag, pol, and env, respectively. The remaining genes are responsible for coding regulatory (tat, rev) and accessory (vif, vpr, vpu/vpx, nef) proteins.
 63 | 
 64 | Figure 1: HIV-1 DNA genome structure [<sup>1</sup>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6439341/)
 65 | 
 66 | <p align="center">
 67 |   <img src="./images/hiv-bioinfo-solutions-figure-1.svg" class="center">
 68 | </p>
 69 | 
 70 | ## Evolution
 71 | 
 72 | The HIV-1 population, despite having a relatively small genome (Figure 1), showcases extensive genomic diversity primarily due to its exceptionally high mutation rate. This rapid mutation occurs during replication, leading to an accumulation of genetic variations within the viral population. Furthermore, the virus exhibits a high proficiency in recombination[<sup>2</sup>](https://pubmed.ncbi.nlm.nih.gov/30687518/), particularly notable due to the varying recombination event rates observed in different segments of the HIV genome, which contribute to the overall genomic diversity of the virus.
 73 | 
 74 | In turn, this high genomic diversity gives rise to minor variants that assume a critical role in the development of drug resistance. When exposed to selective pressures, such as antiretroviral drugs or the host immune system, the virus adapts by favouring the proliferation of specific minor variants carrying resistance mutations.
 75 | 
 76 | A comprehensive understanding of these evolutionary mechanisms is crucial, as they pose a significant challenge in the clinical and public health management of HIV-1 by significantly influencing treatment outcomes.
 77 | 
 78 | ## Subtypes
 79 | 
 80 | HIV is classified into types, groups and subtypes according to its genetic diversity. [<sup>3</sup>](https://pubmed.ncbi.nlm.nih.gov/30882484/)
 81 | 
 82 | **HIV-1 / Group M, N, O, P**
 83 | 
 84 | Group M is the most widespread group, responsible for the majority of infections globally (Table 1). Groups N, O, and P are less common. These variations impact transmission, virulence, and treatment responses.
 85 | 
 86 | Table 1: Subtypes and main locations for group M
 87 | | Subtype                | Predominant Region                      |
 88 | | ---------------------- | --------------------------------------- |
 89 | | A | Eastern Europe & former Soviet Union countries |
 90 | | B | North America and Western Europe |
 91 | | C | Sub-Saharan Africa |
 92 | | D | East Africa |
 93 | | F | Central Africa, Eastern Europe, and South America |
 94 | | G | Western and Central Africa |
 95 | | H | Central Africa |
 96 | | J | Spain |
 97 | 
 98 | ## HIV Bioinformatics Guidance Pathways 
 99 | 
100 | Below, we outline key guidance pathways, which aim to capture distinct themes of common bioinformatics analyses specifically tailored to HIV research. These pathways include real-world public health case studies along with relevant sequencing strategies and HIV bioinformatics tools.
101 | 
102 | ### Genomic Characterisation/Subtyping
103 | 
104 | Subtyping of HIV-1 refers to the categorization of the genome into groups (M, N, O or P) and further into subtypes (A-J & CRFs) or clades. Once a consensus sequence has been produced, several approaches (similarity-based, statistical, or phylogenetic) can be used to assign a probable subtype.
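To illustrate the similarity-based approach in its simplest form, the following Python sketch scores an aligned query against one aligned reference per subtype by percent identity and reports the best match. It is a toy under stated assumptions: the sequences must already be aligned to the same length, the reference panel is supplied by the user, and the function names are hypothetical. It does not reflect how any of the tools listed below work internally, and it cannot detect recombinant forms.

```python
from typing import Dict, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns, skipping gaps ('-') and 'N'."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue  # uninformative column
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def assign_subtype(query: str, references: Dict[str, str]) -> Tuple[str, float]:
    """Return (subtype label, percent identity) of the closest reference."""
    if not references:
        raise ValueError("at least one reference is required")
    scores = {label: percent_identity(query, ref)
              for label, ref in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

In practice a close second-best score, or a best score that is itself low, is a hint that a phylogenetic or recombination-aware method should be used instead.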
105 | 
106 | Sequencing strategy
107 | 
108 | -   Tiled amplicon WGS (gold standard)
109 | 
110 | -   Targeted amplicon sequencing (overlap with drug resistance prediction on pol region)
111 | 
112 | Analysis
113 | 
114 | -   Reference based mapping or de novo assembly methods
115 | 
116 | -   Consensus sequence generation
117 | 
118 | -   Assignment of subtype to queried sequence
119 | 
120 | Tools
121 | 
122 | -   minimap2 [<sup>4</sup>](https://pubmed.ncbi.nlm.nih.gov/29750242/), iva [<sup>5</sup>](https://pubmed.ncbi.nlm.nih.gov/25725497/), shiver [<sup>6</sup>](https://pubmed.ncbi.nlm.nih.gov/29876136/)
123 | 
124 | -   Quasitools HyDRA [<sup>7</sup>](https://www.biorxiv.org/content/biorxiv/early/2019/08/13/733238.full.pdf), samtools [<sup>8</sup>](https://pubmed.ncbi.nlm.nih.gov/33590861/), bcftools [<sup>8</sup>](https://pubmed.ncbi.nlm.nih.gov/33590861/)
125 | 
126 | -   Stanford HIVdb [<sup>9,10,11</sup>](https://pubmed.ncbi.nlm.nih.gov/12520007/,https://pubmed.ncbi.nlm.nih.gov/16921473/,https://pubmed.ncbi.nlm.nih.gov/16652319/), REGA [<sup>12</sup>](https://pubmed.ncbi.nlm.nih.gov/23660484/)
127 | 
128 | Case Study: Benchmarking study of HIV-1 subtyping tools for clinical and surveillance purposes [<sup>12</sup>](https://pubmed.ncbi.nlm.nih.gov/23660484/)
129 | 
130 | Description: In this study, HIV-1 pol sequences obtained from Los Alamos were subtyped using various automated subtyping tools, which were compared to manual phylogenetic analysis. The study concluded that most automated subtyping tools work well with pure subtypes, especially A & C; however, because sensitivity and specificity varied when subtyping CRFs, multiple tools should be used to confirm the HIV-1 subtype.
131 | 
132 | Tools & databases used: Los Alamos HIV Sequence Database (https://www.hiv.lanl.gov/), REGA [<sup>12</sup>](https://pubmed.ncbi.nlm.nih.gov/23660484/), COMET [<sup>13</sup>](https://pubmed.ncbi.nlm.nih.gov/25120265/), jpHMM [<sup>14</sup>](https://pubmed.ncbi.nlm.nih.gov/16845050/), STAR [<sup>15</sup>](https://pubmed.ncbi.nlm.nih.gov/16046498/), NCBI [<sup>16</sup>](https://pubmed.ncbi.nlm.nih.gov/15215470/), Stanford HIVdb [<sup>17</sup>](https://hivdb.stanford.edu/page/hiv-subtyper/) and SCUEAL [<sup>18</sup>](https://pubmed.ncbi.nlm.nih.gov/19956739/).
133 | 
134 | ### Drug Resistance Surveillance
135 | 
136 | The goal of using bioinformatics in drug resistance surveillance is to comprehensively analyse and identify mutations that confer resistance to antiretroviral drugs, such as protease, reverse transcriptase and integrase inhibitors. Targeting the pol region of the genome is useful as this region is associated with genes coding for protease, reverse transcriptase, and integrase.
137 | 
138 | This guidance pathway can involve several commonly-used steps, including: aligning sequencing reads to a reference genome (e.g. HXB2: <https://www.ncbi.nlm.nih.gov/nuccore/K03455.1>), followed by de novo assembly methods to reconstruct the HIV genome.
139 | 
140 | Assembled contigs can be used to generate an optional consensus sequence, serving as a reference for variant identification.
141 | 
142 | Variant calling algorithms can either detect high-frequency genetic variations or perform more sensitive minor variant calling, which identifies low-frequency mutations. Both can be used to inform downstream clinical treatment.
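The distinction between high-frequency and minor variants can be sketched in Python as a simple classification on read counts. The thresholds below (`minor_threshold`, `high_threshold`, `min_depth`) are hypothetical placeholders for illustration only; appropriate cut-offs depend on the assay, sequencing platform, and local validation, and are not taken from this document or from any specific variant caller.

```python
def classify_variant(alt_reads: int,
                     depth: int,
                     minor_threshold: float = 0.01,  # hypothetical
                     high_threshold: float = 0.20,   # hypothetical
                     min_depth: int = 100) -> str:   # hypothetical
    """Label a variant site by the fraction of reads supporting the alternate allele.

    Returns one of: "low_depth", "high_frequency", "minor", "below_threshold".
    """
    if depth < min_depth:
        # Too few reads to estimate an allele frequency reliably.
        return "low_depth"
    frequency = alt_reads / depth
    if frequency >= high_threshold:
        return "high_frequency"
    if frequency >= minor_threshold:
        return "minor"
    return "below_threshold"
```

For instance, with the placeholder defaults, a site with 50 supporting reads out of 1000 (5%) is labelled `"minor"`, while 500 out of 1000 (50%) is labelled `"high_frequency"`.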
143 | 
144 | Annotated variants are then cross-referenced with gold-standard databases and interpreted through the lens of literature on HIV drug resistance to assess their potential impact on drug susceptibility in human-readable formats. These results can then be bundled into tailored reports so clinicians can interpret these findings and tailor antiretroviral therapy appropriately, minimising treatment failure and optimising patient care.
145 | 
146 | Through this integrated approach, this guidance pathway facilitates proactive surveillance and management of HIV drug resistance, ultimately improving treatment efficacy and patient outcomes.
147 | 
148 | Sequencing Strategy
149 | 
150 | -   Targeted amplicon sequencing of pol (Table 2).
151 | 
152 | Analysis
153 | 
154 | -   Reference-based mapping or de novo assembly methods
155 | 
156 | -   Consensus sequence
157 | 
158 | -   Variant calling 
159 | 
160 | -   Minor variant calling
161 | 
162 | -   Database querying
163 | 
164 | Tools 
165 | 
166 | -   minimap2 [<sup>4</sup>](https://pubmed.ncbi.nlm.nih.gov/29750242/), iva [<sup>5</sup>](https://pubmed.ncbi.nlm.nih.gov/25725497/), shiver [<sup>6</sup>](https://pubmed.ncbi.nlm.nih.gov/29876136/)
167 | 
168 | -   VarScan [<sup>19</sup>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734323/)
169 | 
170 | -   Stanford Database (either for pipeline implementation - codon frequency file - or for API query from consensus sequence) [<sup>9,10,11</sup>](https://pubmed.ncbi.nlm.nih.gov/12520007/,https://pubmed.ncbi.nlm.nih.gov/16921473/,https://pubmed.ncbi.nlm.nih.gov/16652319/)
171 | 
172 | Case Study: Bioinformatic data processing pipelines in support of next-generation sequencing-based HIV drug resistance testing: the Winnipeg Consensus [<sup>20</sup>](https://pubmed.ncbi.nlm.nih.gov/30350345/)
173 | 
174 | Pipelines: 
175 | 
176 | Quasiflow: <https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac089/6849543>
177 | 
178 | HIV-DRIVES:  <https://www.medrxiv.org/content/10.1101/2023.09.30.23296350v1>
179 | 
180 | ### Drug Development and Resistance Prediction
181 | 
182 | Drug Target Identification: Understanding the genome aids in identifying potential drug targets, such as the Protease, Reverse Transcriptase, and Integrase enzymes. Bioinformatics tools predict inhibitors for these targets.
183 | 
184 | Drug Resistance Prediction: Analysing genomic sequences helps predict drug resistance mutations, informing clinicians about the efficacy of personalised highly active antiretroviral therapies (HAART). The success of HIV treatment is affected by the development of viral drug resistance, which complicates clinicians' choice of the right drugs for each patient. This challenge has led to the development of various bioinformatics software tools and databases for predicting drug resistance and responses to combination therapy from viral genotypes [<sup>21</sup>](https://link.springer.com/chapter/10.1007/978-981-10-7483-7_16).
185 | 
186 | ### Phylogenetics
187 | 
188 | The study of HIV genetic variation and evolution using genomic data serves several important purposes in understanding and combating HIV/AIDS. It allows researchers to reconstruct the evolutionary history of HIV and track transmission dynamics within populations. By analyzing the genetic sequences of HIV strains obtained from infected individuals, researchers can infer relationships between viral lineages, identify transmission clusters, and trace the spread of the virus over time and geographical regions. This information is crucial for understanding patterns of HIV transmission, identifying high-risk populations, and implementing targeted prevention and intervention strategies.
189 | 
190 | In addition to understanding transmission dynamics, HIV phylogenomics can shed light on the emergence and spread of drug resistance mutations in the HIV genome. Antiretroviral therapy (ART) is a cornerstone of HIV treatment, but the emergence of drug-resistant strains poses a significant challenge to effective treatment and control efforts around the world. By analyzing the genetic sequences of HIV strains, researchers can identify mutations associated with drug resistance and monitor their prevalence and transmission patterns within communities. This information is essential for guiding treatment decisions, designing effective drug regimens, and developing strategies to prevent the spread of drug-resistant HIV strains.
191 | 
192 | Furthermore, HIV phylogenomics can provide valuable insights into the broader epidemiology of HIV/AIDS and inform public health responses to outbreaks and epidemics. By integrating genomic data with epidemiological information, researchers can identify sources of infection, map transmission networks, and assess the impact of prevention and control measures. This knowledge can help public health officials allocate resources more effectively, tailor interventions to specific populations, and ultimately reduce the burden of HIV/AIDS on affected communities. This type of analysis can be performed using whole genome sequencing (WGS) or by using one of the more stable regions of the HIV genome such as the pol, env or gag genes, as mutations in these genes will likely have occurred from the process of natural viral evolution, and not from recombinations or from insertions and deletions.
193 | 
194 | Sequencing Strategy
195 | 
196 | -   Tiled amplicon WGS (Table 2).
197 | 
198 | -   Targeted amplicon sequencing of pol (Table 2).
199 | 
200 | Analysis
201 | 
202 | -   Reference-based mapping assembly methods 
203 | 
204 | -   Consensus sequence
205 | 
206 | -   Variant calling 
207 | 
208 | -   Pairwise genomic distance computation
209 | 
210 | -   Phylogenetic tree inference
211 | 
212 | -   Phylogenetic tree visualization
213 | 
214 | Tools 
215 | 
216 | -   minimap2 [<sup>4</sup>](https://pubmed.ncbi.nlm.nih.gov/29750242/), iva [<sup>5</sup>](https://pubmed.ncbi.nlm.nih.gov/25725497/), shiver [<sup>6</sup>](https://pubmed.ncbi.nlm.nih.gov/29876136/)
217 | 
218 | -   VarScan [<sup>19</sup>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734323/)
219 | 
220 | -   HIV-TRACE [<sup>22</sup>](https://pubmed.ncbi.nlm.nih.gov/29401317/)
221 | 
222 | Protocol: https://www.researchgate.net/publication/376330219_An_NGS_amplicon_tiling_protocol_for_HIV-1_drug_resistance_detection_using_IlluminaR_COVIDSeq_Assay_Kit_v2
223 | 
224 | Pipelines: 
225 | 
226 | TheiaCoV: https://github.com/theiagen/public_health_bioinformatics/tree/main/workflows/theiacov
227 | 
228 | iVar: https://github.com/andersen-lab/ivar
229 | 
230 | ### Genomic Epidemiology 
231 | 
232 | Genomic Epidemiology: Utilising bioinformatics for large-scale analysis of HIV sequence data aids in tracking the spread of viral variants and understanding transmission dynamics.
233 | 
234 | Network transmission analysis usually takes place just after phylogenetic tree construction, and these analyses are inherently related by the fact that each requires the investigator to determine the pairwise genomic distances between a set of samples. However, there is a distinction between the use of phylogenetic trees to visualize the genomic diversity and relationships between a set of samples, and the use of the pairwise genomic distances of a set of samples to generate a putative transmission network.
235 | 
236 | Genomic distance information can be used to infer which samples may be linked by a transmission event. One can construct a putative transmission network from a set of inferred transmission events. By using genomic similarity as a proxy for likelihood of a direct transmission event one can map the propagation of a pathogen through a population using high-resolution genomic sequencing data.
237 | 
238 | Connections between nodes in the network (edges) are drawn when the genomic distance between two samples falls below a threshold. This threshold is derived from the distribution of genomic distances observed among samples known to be associated with the outbreak through traditional epidemiological techniques such as contact tracing. The techniques and tools outlined in the Phylogenetics section above are useful for generating the assemblies, distance matrices and phylogenetic trees that are the inputs for most genomic network construction tools.
239 | 
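The idea of linking samples whose pairwise genomic distance falls within a threshold can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions: it uses a plain proportion-of-differences (p-distance) on pre-aligned toy sequences rather than the evolutionary distance models used by dedicated tools, and the function names, sequences and the 0.1 threshold are all hypothetical.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') or an 'N' are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def build_network(sequences: dict, threshold: float) -> list:
    """Return edges (id_a, id_b) for pairs whose distance is at or below the threshold."""
    edges = []
    for (id_a, seq_a), (id_b, seq_b) in combinations(sorted(sequences.items()), 2):
        if p_distance(seq_a, seq_b) <= threshold:
            edges.append((id_a, id_b))
    return edges


def clusters(sample_ids, edges) -> list:
    """Group samples into putative clusters (connected components of the network)."""
    neighbours = {sample: set() for sample in sample_ids}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for sample in sorted(sample_ids):
        if sample in seen:
            continue
        stack, component = [sample], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Toy aligned sequences (hypothetical); real data would come from consensus genomes.
TOY_SEQUENCES = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGTACGAAT",
    "D": "TTTTACGTAC",
}
toy_edges = build_network(TOY_SEQUENCES, threshold=0.1)
toy_clusters = clusters(TOY_SEQUENCES, toy_edges)
```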
240 | Sequencing Strategy
241 | 
242 | -   Tiled amplicon WGS (Table 2).
243 | 
244 | -   Targeted amplicon sequencing of pol (Table 2).
245 | 
246 | Analysis
247 | 
248 | -   Reference-based mapping assembly methods 
249 | 
250 | -   Consensus sequence
251 | 
252 | -   Variant calling 
253 | 
254 | -   Pairwise genomic distance computation
255 | 
256 | -   Putative transmission network construction
257 | 
258 | -   Force-directed network layout visualization
259 | 
260 | Tools:
261 | 
262 | -   MicrobeTrace [<sup>23</sup>](https://pubmed.ncbi.nlm.nih.gov/34492010/)
263 | 
264 | -   GrapeTree [<sup>24</sup>](https://pubmed.ncbi.nlm.nih.gov/30049790/)
265 | 
266 | ## Sequencing Strategies
267 | 
268 | The sequencing strategy (Table 2) that you adopt depends on multiple factors but should be driven by the question that you are trying to answer. For example, targeted amplification of the pol region has historically been used to assess drug resistance to antiretroviral therapy.
269 | 
270 | 
271 | Table 2: Potential sequencing strategies for HIV-1
272 | 
273 | | Strategy/Application | DR Detection | Subtyping | Phylogenomics| Phylogenetics |
274 | | -------------------- | ------------ | --------- | ------------ | ------------- |
275 | | Targeted Amplicon Sequencing** | ✓ | ✓ | X | ✓ |
276 | | Long-Range PCR | X | ✓ | X | X |
277 | | Tiled amplicon WGS | ✓ | ✓ | ✓ | ✓ |
278 | 
279 | ** Will not be able to subtype some circulating recombinant forms due to missing breakpoints (CRF_AE & CRF_BG) [<sup>25,26</sup>](https://pubmed.ncbi.nlm.nih.gov/11981372/,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC190613/)
280 | 
281 | A tiled amplicon sequencing primer scheme has been developed by the Association of Public Health Laboratories (APHL): <https://www.protocols.io/view/an-ngs-amplicon-tiling-protocol-for-hiv-1-drug-res-n92ldmq4ol5b/v2> (DOI:10.17504/protocols.io.n92ldmq4ol5b/v2)
282 | 
283 | ## HIV-1 Bioinformatics Tools 
284 | 
285 | For researchers and clinicians alike, designing a bioinformatics analysis of HIV sequence data comes down to careful selection of bioinformatics tools. The effectiveness of analysis, accuracy of results, and subsequent genomic and epidemiological insights hinge on the appropriateness of the chosen tools. This decision encapsulates the essence of bioinformatics in HIV research, where precise tool selection aligns with the specific research objectives, ensuring that analytical methods harmonize with the intricacies of the virus. Below, we provide a categorised list of bioinformatics tools used in HIV research, distinguishing between general tools applicable to various contexts and those specifically designed for HIV-related analyses.
286 | 
287 | ### Assembly
288 | 
289 | -   shiver: A tool for assembling HIV sequences, particularly focusing on improving de novo assembly by minimising biased information [<sup>6</sup>](https://pubmed.ncbi.nlm.nih.gov/29876136/)
290 | -   iva: Generating de novo assembly of RNA virus genomes [<sup>5</sup>](https://pubmed.ncbi.nlm.nih.gov/25725497/)
291 | ### Subtyping
292 | 
293 | Various HIV-1 subtyping tools are available (Table 3) which have been benchmarked previously [<sup>12,27</sup>](https://pubmed.ncbi.nlm.nih.gov/23660484/,https://pubmed.ncbi.nlm.nih.gov/28701420/)
294 | 
295 | Table 3: HIV-1 subtyping tools
296 | | Tool | Type | CLI | GUI |
297 | | ---- | ---- | --- | --- |
298 | | NCBI [<sup>16</sup>](https://pubmed.ncbi.nlm.nih.gov/15215470/) | similarity | X | ✓ |
299 | | Stanford [<sup>17</sup>](https://hivdb.stanford.edu/page/hiv-subtyper/) | similarity | ✓ | ✓ |
300 | | COMET [<sup>13</sup>](https://pubmed.ncbi.nlm.nih.gov/25120265/) | similarity | ✓ | ✓ |
301 | | jpHMM [<sup>14</sup>](https://pubmed.ncbi.nlm.nih.gov/16845050/) | statistical | X | ✓ |
302 | | REGA [<sup>12</sup>](https://pubmed.ncbi.nlm.nih.gov/23660484/) | phylogenetic | X | ✓ |
303 | | SCUEAL [<sup>18</sup>](https://pubmed.ncbi.nlm.nih.gov/19956739/) | phylogenetic | X | ✓ |
304 | 
305 | Multiple considerations need to be taken into account when choosing a subtyping tool.
306 | 
307 | The first is that, although the gold standard for HIV-1 subtyping is full-genome sequencing, often only the pol region is available. This region allows subtyping of most group M subtypes but cannot differentiate CRF_AE & CRF_BG from the pure parent subtype because the recombination breakpoint lies outside this region [<sup>25,26</sup>](https://pubmed.ncbi.nlm.nih.gov/11981372/,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC190613/). The second is that an up-to-date alignment is desirable, especially when considering treatment failures, e.g. cabotegravir (an integrase inhibitor) failing against HIV-1 subtypes A1/A6 [<sup>28</sup>](https://pubmed.ncbi.nlm.nih.gov/33730748/).
308 | 
309 | ### Resistance detection
310 | 
311 | Resistance detection is mainly undertaken in reference to HXB2 (accession number: K03455).
312 | 
313 | -   Stanford University HIVdb (<https://hivdb.stanford.edu/>) : An online database and tool for identifying drug-resistant mutations in HIV-1 using consensus and next-generation sequencing data [<sup>9,10,11</sup>](https://pubmed.ncbi.nlm.nih.gov/12520007/,https://pubmed.ncbi.nlm.nih.gov/16921473/,https://pubmed.ncbi.nlm.nih.gov/16652319/)
314 | 
315 | -   Quasitools HyDRA (no longer actively maintained): Command line tool to analyse next-generation sequencing data for cataloging drug resistance mutations using the Stanford University HIVdb [<sup>7</sup>](https://www.biorxiv.org/content/biorxiv/early/2019/08/13/733238.full.pdf)
316 | 
317 | -   SierraPy: Python client to query Stanford University HIVdb (<https://github.com/hivdb/sierra-client/blob/master/python/README.md>)
318 | 
319 | Other Stanford HIVdb resources which are useful to investigate especially for pipeline implementation:
320 | 
321 | -   Release Notes (<https://hivdb.stanford.edu/page/release-notes/>) 
322 | 
323 | -   Web Service (<https://hivdb.stanford.edu/page/webservice/>)
324 | 
325 | -   Github repository (<https://github.com/hivdb>)
326 | 
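Because resistance detection is reported relative to HXB2, mutations are conventionally written as the reference residue, the position, and the observed residue (for example, M184V). The Python sketch below shows how such labels could be derived from two already-aligned, gap-free protein fragments. The function name, the toy fragments and the starting position are hypothetical and for illustration only; clinical interpretation should rely on a curated resource such as Stanford HIVdb.

```python
def list_substitutions(reference_aa: str, query_aa: str, start_position: int = 1) -> list:
    """List amino-acid substitutions between two aligned, gap-free protein fragments.

    Positions are numbered from start_position, i.e. the reference coordinate
    of the first residue in the fragment. Ambiguous query residues ('X') are skipped.
    """
    if len(reference_aa) != len(query_aa):
        raise ValueError("fragments must be aligned to the same length")
    substitutions = []
    for offset, (ref, alt) in enumerate(zip(reference_aa.upper(), query_aa.upper())):
        if alt == "X" or ref == alt:
            continue
        substitutions.append(f"{ref}{start_position + offset}{alt}")
    return substitutions


# Toy fragments (hypothetical): four reference residues numbered from position 183.
toy_reference = "YMDD"
toy_query = "YVDD"
toy_mutations = list_substitutions(toy_reference, toy_query, start_position=183)
```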
327 | ### Transmission Network Analysis 
328 | 
329 | -   HIV-TRACE: A command line tool for identifying and visualizing HIV transmission clusters using molecular sequence data [<sup>22</sup>](https://pubmed.ncbi.nlm.nih.gov/29401317/)
330 |   
331 | -   Clusterpicker (no longer actively maintained): A command line tool for identifying clusters in a phylogenetic tree based on bootstrap support and pairwise genetic distance within clusters [<sup>29</sup>](https://pubmed.ncbi.nlm.nih.gov/24191891/)
332 |   
333 | 
334 | ### Sequence Databases
335 | 
336 | -   NCBI HIV-1 Human Interaction Database (<https://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses/hiv-1/interactions>) [<sup>30</sup>](https://pubmed.ncbi.nlm.nih.gov/25378338/): An online database of HIV-1 sequence data and annotations, including drug resistance and subtype information
341 | 
342 | -   Los Alamos HIV Sequence Database <https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html>  : A web database which can be searched for HIV sequence data, including reference genomes, annotations, and geographic origins of subtypes
343 | 
344 | -   Stanford University HIVdb: In addition to being the most up-to-date resource for investigating HIV-1 drug resistance (see above), it also has a wealth of information on HIV-1 virus isolates, published drug susceptibilities and archived treatment episodes (incorporating ARVs received, mutations detected and new regimens initiated, with viral loads and CD4 counts measured longitudinally).
345 | 
346 | ### Case Studies
347 | 
348 | ##
349 | 
350 | #### Case Study: *Tracking HIV Transmission Networks in a High-Incidence Area*
351 | 
352 | **Description:** Using molecular epidemiology, this study identified transmission networks among individuals with acute HIV infection in a high-incidence region, providing insights into transmission dynamics and hotspots.
353 | 
354 | **Citation:** Wertheim JO, *et al.* (2017). "Social and Genetic Networks of HIV-1 Transmission in New York City." PLOS Pathogens, 13(1), e1006000. <https://pubmed.ncbi.nlm.nih.gov/28068413>
355 | 
356 | **Tool(s) & databases used:** HIV-TRACE, Los Alamos
357 | 
358 | ##
359 | 
360 | #### Case Study: *Evolution of Drug Resistance Mutations in Long-Term ART Patients*
361 | 
362 | **Description:** This study investigated the dynamics of drug resistance mutations in individuals on long-term antiretroviral therapy (ART), revealing the persistence of archived resistant variants and the importance of continuous monitoring.
363 | 
364 | **Citation:** Rhee SY, *et al.* (2005). "HIV-1 Protease and Reverse-Transcriptase Mutations: Correlations with Antiretroviral Therapy in Subtype B Isolates and Implications for Drug-Resistance Surveillance." Journal of Infectious Diseases, 194(4), 454-465. <https://pubmed.ncbi.nlm.nih.gov/15995959>
365 | 
366 | **Tool(s) & databases used:** PAUP, MESQUITE, Stanford HIVdb
367 | 
368 | ##
369 | 
370 | #### Case Study: *Impact of Drug Resistance Mutations on Treatment Outcomes*
371 | 
372 | **Description:** This study assessed the impact of specific drug resistance mutations on treatment response and virological outcomes, contributing to the optimization of treatment regimens for individuals with drug-resistant HIV.
373 | 
374 | **Citation:** Gupta RK, *et al.* (2009). "HIV-1 Drug Resistance before Initiation or Re-initiation of First-line Antiretroviral Therapy in Low-Income and Middle-Income Countries: A Systematic Review and Meta-Regression Analysis." The Lancet Infectious Diseases, 9(10), 711-718. <https://pubmed.ncbi.nlm.nih.gov/29198909>
375 | 
376 | **Tool(s) & databases used:** Stanford HIVdb
377 | 
378 | ##
379 | 
380 | #### Case Study: *HIV Phylogenetics to Investigate Cross-Border Transmission*
381 | 
382 | **Description:** Using phylogenetic analysis, this study traced cross-border transmission of HIV strains between neighboring countries, highlighting the need for coordinated prevention efforts in the region.
383 | 
384 | **Citation:** Novitsky V, *et al.* (2015). "Phylogenetic Relatedness of Circulating HIV-1C Strains in Mochudi, Botswana, and Implications for HIV Subtype Distribution in Botswana." AIDS Research and Human Retroviruses, 31(6), 631-638. <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3859477>
385 | 
386 | **Tool(s) & databases used:** Los Alamos
387 | 
388 | ##
389 | 
390 | #### Case Study: *Cross-clade simultaneous HIV drug resistance genotyping for reverse transcriptase, protease, and integrase inhibitor mutations by Illumina MiSeq*
391 | 
392 | **Description:** This study created a universal Illumina MiSeq-based HIV drug resistance genotyping assay, which works across all major group M HIV-1 subtypes and identifies DRMs in the pol gene known to confer resistance to protease, reverse transcriptase, and integrase inhibitors.
393 | 
394 | **Citation:** Dudley, D. M., *et al.* (2014). "Cross-clade simultaneous HIV drug resistance genotyping for reverse transcriptase, protease, and integrase inhibitor mutations by Illumina MiSeq". Retrovirology, 11, 122. <https://doi.org/10.1186/s12977-014-0122-8> 
395 | 
396 | **Tool(s) & databases used:** Los Alamos
397 | 
398 | References
399 | ==========
400 | 
401 | 1. Xiao, Q., Guo, D. & Chen, S. Application of CRISPR/Cas9-Based Gene Editing in HIV-1/AIDS Therapy. Front. Cell. Infect. Microbiol. 9, (2019). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6439341/
402 | 2. Olabode, A. S. et al. Evidence for a recombinant origin of HIV-1 Group M from genomic variation. Virus Evol. 5, vey039 (2019). https://pubmed.ncbi.nlm.nih.gov/30687518/
403 | 3. Bbosa, N., Kaleebu, P. & Ssemwanga, D. HIV subtype diversity worldwide. Curr. Opin. HIV AIDS 14, 153--160 (2019). https://pubmed.ncbi.nlm.nih.gov/30882484/
404 | 4. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18):3094-3100 (2018). https://pubmed.ncbi.nlm.nih.gov/29750242/
405 | 5. Hunt, M. et al. IVA: accurate de novo assembly of RNA virus genomes. Bioinformatics 31, 2374--2376 (2015). https://pubmed.ncbi.nlm.nih.gov/25725497/
406 | 6. Wymant, C. et al. Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver. Virus Evol. 4, vey007 (2018). https://pubmed.ncbi.nlm.nih.gov/29876136/
407 | 7. Marinier, E. et al. quasitools: A Collection of Tools for Viral Quasispecies Analysis. 733238 Preprint at https://doi.org/10.1101/733238 (2019). https://www.biorxiv.org/content/biorxiv/early/2019/08/13/733238.full.pdf
408 | 8. Danecek P, et al. Twelve years of SAMtools and BCFtools. Gigascience 10(2):giab008 (2021). https://pubmed.ncbi.nlm.nih.gov/33590861/
409 | 9. Rhee, S.-Y. et al. Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res. 31, 298--303 (2003). https://pubmed.ncbi.nlm.nih.gov/12520007/
410 | 10. Shafer, R. W. Rationale and Uses of a Public HIV Drug-Resistance Database. J. Infect. Dis. 194, S51--S58 (2006). https://pubmed.ncbi.nlm.nih.gov/16921473/
411 | 11. Liu, T. F. & Shafer, R. W. Web Resources for HIV Type 1 Genotypic-Resistance Test Interpretation. Clin. Infect. Dis. Off. Publ. Infect. Dis. Soc. Am. 42, 1608--1618 (2006). https://pubmed.ncbi.nlm.nih.gov/16652319/   
412 | 12. Pineda-Peña, A.-C. et al. Automated subtyping of HIV-1 genetic sequences for clinical and surveillance purposes: Performance evaluation of the new REGA version 3 and seven other tools. Infect. Genet. Evol. 19, 337--348 (2013). https://pubmed.ncbi.nlm.nih.gov/23660484/
413 | 13. Struck, D., Lawyer, G., Ternes, A.-M., Schmit, J.-C. & Bercoff, D. P. COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification. Nucleic Acids Res. 42, e144 (2014). https://pubmed.ncbi.nlm.nih.gov/25120265/
414 | 14. Zhang, M. et al. jpHMM at GOBICS: a web server to detect genomic recombinations in HIV-1. Nucleic Acids Res. 34, W463--W465 (2006). https://pubmed.ncbi.nlm.nih.gov/16845050/
415 | 15. Myers, R. et al. A statistical model for HIV-1 sequence classification using the subtype analyser (STAR). Bioinformatics 21(17):3535-40 (2005). https://pubmed.ncbi.nlm.nih.gov/16046498/
416 | 16. Rozanov, M., Plikat, U., Chappey, C., Kochergin, A. & Tatusova, T. A web-based genotyping resource for viral sequences. Nucleic Acids Res. 32, W654-659 (2004). https://pubmed.ncbi.nlm.nih.gov/15215470/
417 | 17. HIV Subtyping Program - HIV Drug Resistance Database. https://hivdb.stanford.edu/page/hiv-subtyper/
418 | 18. Pond, S. L. K. et al. An Evolutionary Model-Based Algorithm for Accurate Phylogenetic Breakpoint Mapping and Subtype Prediction in HIV-1. PLOS Comput. Biol. 5, e1000581 (2009). https://pubmed.ncbi.nlm.nih.gov/19956739/
419 | 19. Koboldt, D. C. et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25(17):2283-5 (2009). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734323/
420 | 20. Ji, H., Enns, E., Brumme, C. J., Parkin, N., Howison, M., Lee, E. R., Capina, R., Marinier, E., Avila-Rios, S., Sandstrom, P., Van Domselaar, G., Harrigan, R., Paredes, R., Kantor, R., & Noguera-Julian, M. (2018). Bioinformatic data processing pipelines in support of next-generation sequencing-based HIV drug resistance testing: the Winnipeg Consensus. Journal of the International AIDS Society, 21(10), e25193. https://pubmed.ncbi.nlm.nih.gov/30350345/
421 | 21. Mannu, J. & Mathur, P. P. Role of Bioinformatics in Drug Resistance Prediction for HIV/AIDS. in Current trends in Bioinformatics: An Insight (eds. Wadhwa, G., Shanmughavel, P., Singh, A. K. & Bellare, J. R.) 277--286 (Springer, Singapore, 2018). doi:10.1007/978-981-10-7483-7_16. https://link.springer.com/chapter/10.1007/978-981-10-7483-7_16
422 | 22. Kosakovsky Pond, S. L., Weaver, S., Leigh Brown, A. J. & Wertheim, J. O. HIV-TRACE (TRAnsmission Cluster Engine): a Tool for Large Scale Molecular Epidemiology of HIV-1 and Other Rapidly Evolving Pathogens. Mol. Biol. Evol. 35, 1812--1819 (2018). https://pubmed.ncbi.nlm.nih.gov/29401317/
423 | 23. Campbell, EM. et al. MicrobeTrace: Retooling molecular epidemiology for rapid public health response. PLoS Comput Biol. 17(9):e1009300 (2021). https://pubmed.ncbi.nlm.nih.gov/34492010/
424 | 24. Zhou Z. et al. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res. 28(9):1395-1404 (2018). https://pubmed.ncbi.nlm.nih.gov/30049790/
425 | 25. Delgado, E. et al. Identification of a Newly Characterized HIV-1 BG Intersubtype Circulating Recombinant Form in Galicia, Spain, Which Exhibits a Pseudotype-Like Virion Structure. JAIDS J. Acquir. Immune Defic. Syndr. 29, 536 (2002). https://pubmed.ncbi.nlm.nih.gov/11981372/
426 | 26. Carr, J. K. et al. Full-length sequence and mosaic structure of a human immunodeficiency virus type 1 isolate from Thailand. J. Virol. 70, 5935--5943 (1996). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC190613/
427 | 27. Fabeni, L. et al. Comparative Evaluation of Subtyping Tools for Surveillance of Newly Emerging HIV-1 Strains. J. Clin. Microbiol. 55, 2827--2837 (2017). https://pubmed.ncbi.nlm.nih.gov/28701420/
428 | 28. Cutrell, A. G. et al. Exploring predictors of HIV-1 virologic failure to long-acting cabotegravir and rilpivirine: a multivariable analysis. AIDS Lond. Engl. 35, 1333--1342 (2021). https://pubmed.ncbi.nlm.nih.gov/33730748/
429 | 29. Ragonnet-Cronin, M. et al. Automated analysis of phylogenetic clusters. BMC Bioinformatics 14, 317 (2013). https://pubmed.ncbi.nlm.nih.gov/24191891/
430 | 30. Ako-Adjei, D. et al. HIV-1, human interaction database: current status and new features. Nucleic Acids Res. 43, D566-570 (2015). https://pubmed.ncbi.nlm.nih.gov/25378338/
431 | 
432 | 
433 | 
434 | 


--------------------------------------------------------------------------------
/docs/images/PHA4GE_SC2_QC_Workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/PHA4GE_SC2_QC_Workflow.png


--------------------------------------------------------------------------------
/docs/images/influenza-guidance-fig1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/influenza-guidance-fig1.png


--------------------------------------------------------------------------------
/docs/images/influenza-guidance-fig2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/influenza-guidance-fig2.png


--------------------------------------------------------------------------------
/docs/images/omicron_standford.svg:
--------------------------------------------------------------------------------
1 | <?xml version="1.0" standalone="no"?><!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"><svg font-family="&quot;Source Sans Pro&quot;, &quot;Helvetica Neue&quot;, Helvetica" viewBox="0 0 1960 315" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><g id="sd-region-group-1_29903"><svg id="sd-position-axis" y="20"><path d="m 378 18 h 855 l 5 -5 v 10 l 5 -5 h 338.4102857142857" fill="none" stroke="#000000" stroke-width="2"/><text x="378" y="12" font-size="12" fill="#000000" text-anchor="middle">1</text><line x1="378" x2="378" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="501.391304347826" y="12" font-size="12" fill="#000000" text-anchor="middle">100</text><line x1="501.391304347826" x2="501.391304347826" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="601.1014492753624" y="12" font-size="12" fill="#000000" text-anchor="middle">180</text><line x1="601.1014492753624" x2="601.1014492753624" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="700.8115942028985" y="12" font-size="12" fill="#000000" text-anchor="middle">260</text><line x1="700.8115942028985" x2="700.8115942028985" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="800.5217391304348" y="12" font-size="12" fill="#000000" text-anchor="middle">340</text><line x1="800.5217391304348" x2="800.5217391304348" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="900.231884057971" y="12" font-size="12" fill="#000000" text-anchor="middle">420</text><line x1="900.231884057971" x2="900.231884057971" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="999.9420289855072" y="12" font-size="12" fill="#000000" text-anchor="middle">500</text><line x1="999.9420289855072" x2="999.9420289855072" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="1099.6521739130435" y="12" font-size="12" fill="#000000" text-anchor="middle">580</text><line x1="1099.6521739130435" 
x2="1099.6521739130435" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="1199.3623188405795" y="12" font-size="12" fill="#000000" text-anchor="middle">660</text><line x1="1199.3623188405795" x2="1199.3623188405795" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="1299.1337142857142" y="12" font-size="12" fill="#000000" text-anchor="middle">795</text><line x1="1299.1337142857142" x2="1299.1337142857142" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="1399.385142857143" y="12" font-size="12" fill="#000000" text-anchor="middle">965</text><line x1="1399.385142857143" x2="1399.385142857143" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="1499.6365714285714" y="12" font-size="12" fill="#000000" text-anchor="middle">1135</text><line x1="1499.6365714285714" x2="1499.6365714285714" y1="17" y2="26" stroke="#000000" stroke-width="2"/><text x="1581.4102857142857" y="12" font-size="12" fill="#000000" text-anchor="middle">1273</text><line x1="1581.4102857142857" x2="1581.4102857142857" y1="17" y2="26" stroke="#000000" stroke-width="2"/></svg><svg id="sd-position-group-NA" y="45"><text x="100" y="44" fill="#000000" text-anchor="end" font-size="21" font-weight="bolder"/><g><line x1="120" x2="123.15903715041047" y1="39" y2="39" stroke-width="2" stroke="#333"/></g><g><rect x="123.17100320022263" y="24" width="254.74523445109224" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/><text x="131.17100320022263" y="39" font-size="16" dominant-baseline="central" fill="#505050">ORF1ab</text></g><g><rect x="378" y="24" width="1204" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#b2df8a"/><text x="980" y="12" font-size="16" dominant-baseline="central" text-anchor="middle" fill="#33a02c">Spike</text></g><g><rect x="393.3719806763286" y="36" width="364.7729468599033" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" 
fill="#a6cee3"/><text x="401.3719806763286" y="51" font-size="16" dominant-baseline="central" fill="#1f78b4">NTD</text></g><g><rect x="758.56038647343" y="36" width="285.0048309178743" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#fb9a99"/><text x="766.56038647343" y="51" font-size="16" dominant-baseline="central" fill="#e31a1c">RBD</text></g><g><rect x="921.8357487922705" y="48" width="89.32367149758466" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#cab2d6"/><text x="929.8357487922705" y="63" font-size="16" dominant-baseline="central" fill="#6a3d9a">RBM</text></g><g><rect x="1043.9806763285026" y="36" width="68.55072463768101" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/><text x="1051.9806763285026" y="51" font-size="16" dominant-baseline="central" fill="#505050">SD1</text></g><g><rect x="1112.9468599033817" y="36" width="112.58937198067633" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/><text x="1120.9468599033817" y="51" font-size="16" dominant-baseline="central" fill="#505050">SD2</text></g><g><rect x="1225.951690821256" y="36" width="12.04830917874392" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#fdbf6f"/><text x="1231.9758454106282" y="24" font-size="16" dominant-baseline="central" text-anchor="middle" fill="#ff7f00">S1/S2</text></g><g><rect x="1582.6277372262773" y="24" width="47.19442601194419" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/></g><g><rect x="1631.2488387524882" y="24" width="12.954213669542241" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/></g><g><rect x="1647.1134704711346" y="24" width="38.12076974120805" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/></g><g><rect 
x="1685.8619774386198" y="24" width="10.557398805574167" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/></g><g><rect x="1696.8188453881885" y="24" width="20.829462508294455" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/></g><g><rect x="1717.4771068347711" y="24" width="7.475779694757648" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/></g><g><rect x="1725.3523556735236" y="24" width="20.829462508294682" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/></g><g><rect x="1747.0378234903783" y="24" width="71.84737889847361" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/><text x="1755.0378234903783" y="39" font-size="16" dominant-baseline="central" fill="#505050">N</text></g><g><rect x="1820.3118779031188" y="24" width="6.619774386197605" height="30" rx="5" ry="5" stroke="#ffffff" stroke-opacity="0.8" stroke-width="2" fill="#d8d8d8"/></g><g><line x1="1826.988719309887" x2="1840" y1="39" y2="39" stroke-width="2" stroke="#333"/></g><g><path d="m 456.2608695652174 21 l 4 4 l 4 -4 m -4 4 v 55 c 0 5 0 5 -5 5 h -14 c -5 0 -5 0 -5 5 v 85" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(436.2608695652174, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">A67V</text></g><g><path d="m 458.7536231884058 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(462.7536231884058, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">Δ69-70</text></g><g><path d="m 491.159420289855 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(495.159420289855, 179) rotate(-60)" fill="#000000" dominant-baseline="central" 
text-anchor="end" font-family="Arial Narrow">T95I</text></g><g><path d="m 549.7391304347826 21 l 4 4 l 4 -4 m -4 4 v 55 c 0 5 0 5 -5 5 h -14 c -5 0 -5 0 -5 5 v 85" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(529.7391304347826, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">G142D</text></g><g><path d="m 550.9855072463768 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(554.9855072463768, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">Δ143-145</text></g><g><path d="m 635.7391304347826 21 l 4 4 l 4 -4 m -4 4 v 55 c 0 5 0 5 -5 5 h -36.75362318840587 c -5 0 -5 0 -5 5 v 85" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(592.9855072463768, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">Δ211</text></g><g><path d="m 636.9855072463768 21 l 4 4 l 4 -4 m -4 4 v 65 c 0 5 0 5 -5 5 h -14 c -5 0 -5 0 -5 5 v 75" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(616.9855072463768, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">L212I</text></g><g><path d="m 639.4782608695652 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(643.4782608695652, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">R214Insertion</text></g><g><path d="m 795.2753623188405 21 l 4 4 l 4 -4 m -4 4 v 55 c 0 5 0 5 -5 5 h -41.13043478260852 c -5 0 -5 0 -5 5 v 85" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(748.144927536232, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">G339D</text></g><g><path d="m 835.159420289855 21 l 4 4 l 4 -4 m -4 4 v 65 c 0 5 0 5 
-5 5 h -57.01449275362302 c -5 0 -5 0 -5 5 v 75" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(772.144927536232, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">S371L</text></g><g><path d="m 837.6521739130435 21 l 4 4 l 4 -4 m -4 4 v 75 c 0 5 0 5 -5 5 h -35.50724637681151 c -5 0 -5 0 -5 5 v 65" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(796.144927536232, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">S373P</text></g><g><path d="m 840.144927536232 21 l 4 4 l 4 -4 m -4 4 v 85 c 0 5 0 5 -5 5 h -14 c -5 0 -5 0 -5 5 v 55" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(820.144927536232, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">S375F</text></g><g><path d="m 892.4927536231885 21 l 4 4 l 4 -4 m -4 4 v 95 c 0 5 0 5 -5 5 h -33.97101449275374 c -5 0 -5 0 -5 5 v 45" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(852.5217391304348, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">K417N</text></g><g><path d="m 921.1594202898551 21 l 4 4 l 4 -4 m -4 4 v 105 c 0 5 0 5 -5 5 h -38.63768115942037 c -5 0 -5 0 -5 5 v 35" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(876.5217391304348, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">N440K</text></g><g><path d="m 928.6376811594203 21 l 4 4 l 4 -4 m -4 4 v 115 c 0 5 0 5 -5 5 h -22.1159420289855 c -5 0 -5 0 -5 5 v 25" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(900.5217391304348, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">G446S</text></g><g><path d="m 967.2753623188406 21 l 4 4 l 4 -4 m -4 4 v 125 c 0 5 0 5 -5 5 h 
-36.75362318840587 c -5 0 -5 0 -5 5 v 15" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(924.5217391304348, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">S477N</text></g><g><path d="m 968.5217391304348 21 l 4 4 l 4 -4 m -4 4 v 135 c 0 5 0 5 -5 5 h -14 c -5 0 -5 0 -5 5 v 5" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(948.5217391304348, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">T478K</text></g><g><path d="m 975.9999999999999 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(979.9999999999999, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">E484A</text></g><g><path d="m 987.2173913043478 21 l 4 4 l 4 -4 m -4 4 v 115 c 0 5 0 5 5 5 h 14 c 5 0 5 0 5 5 v 25" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1015.2173913043478, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">Q493K</text></g><g><path d="m 990.9565217391304 21 l 4 4 l 4 -4 m -4 4 v 105 c 0 5 0 5 5 5 h 34.26086956521738 c 5 0 5 0 5 5 v 35" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1039.2173913043478, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">G496S</text></g><g><path d="m 993.4492753623189 21 l 4 4 l 4 -4 m -4 4 v 95 c 0 5 0 5 5 5 h 55.768115942028885 c 5 0 5 0 5 5 v 45" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1063.2173913043478, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">Q498R</text></g><g><path d="m 997.1884057971015 21 l 4 4 l 4 -4 m -4 4 v 85 c 0 5 0 5 5 5 h 76.02898550724626 c 5 0 5 0 5 5 v 55" stroke="#000000" fill="none" stroke-width="1"/><text 
transform="translate(1087.2173913043478, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">N501Y</text></g><g><path d="m 1002.1739130434783 21 l 4 4 l 4 -4 m -4 4 v 75 c 0 5 0 5 5 5 h 95.0434782608695 c 5 0 5 0 5 5 v 65" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1111.2173913043478, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">Y505H</text></g><g><path d="m 1054.5217391304348 21 l 4 4 l 4 -4 m -4 4 v 65 c 0 5 0 5 5 5 h 66.695652173913 c 5 0 5 0 5 5 v 75" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1135.2173913043478, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">T547K</text></g><g><path d="m 1138.0289855072463 21 l 4 4 l 4 -4 m -4 4 v 55 c 0 5 0 5 5 5 h 14 c 5 0 5 0 5 5 v 85" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1166.0289855072463, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">D614G</text></g><g><path d="m 1189.1304347826087 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1193.1304347826087, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">H655Y</text></g><g><path d="m 1219.0434782608697 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1223.0434782608697, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">N679K</text></g><g><path d="m 1221.536231884058 21 l 4 4 l 4 -4 m -4 4 v 55 c 0 5 0 5 5 5 h 14 c 5 0 5 0 5 5 v 85" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1249.536231884058, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial 
Narrow">P681H</text></g><g><path d="m 1276.8525714285715 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1280.8525714285715, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">N764K</text></g><g><path d="m 1295.7234285714285 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1299.7234285714285, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">D796Y</text></g><g><path d="m 1331.1062857142858 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1335.1062857142858, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">N856K</text></g><g><path d="m 1388.8982857142855 21 l 4 4 l 4 -4 m -4 4 v 150" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1392.8982857142855, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">Q954H</text></g><g><path d="m 1397.744 21 l 4 4 l 4 -4 m -4 4 v 65 c 0 5 0 5 5 5 h 14 c 5 0 5 0 5 5 v 75" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1425.744, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">N969K</text></g><g><path d="m 1404.8205714285714 21 l 4 4 l 4 -4 m -4 4 v 55 c 0 5 0 5 5 5 h 30.92342857142853 c 5 0 5 0 5 5 v 85" stroke="#000000" fill="none" stroke-width="1"/><text transform="translate(1449.744, 179) rotate(-60)" fill="#000000" dominant-baseline="central" text-anchor="end" font-family="Arial Narrow">L981F</text></g></svg></g></svg>
2 | 


--------------------------------------------------------------------------------
/docs/images/pha4ge_sc2_qc_workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/pha4ge_sc2_qc_workflow.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/covariants21k.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/covariants21k.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/covariants21l.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/covariants21l.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/ex1-usher-metrics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex1-usher-metrics.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/ex1-usher-tree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex1-usher-tree.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/ex1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex1.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/ex2-usher-metrics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex2-usher-metrics.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/ex2-usher-tree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex2-usher-tree.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/ex2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex2.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/ex3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/ex3.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/mutations1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/mutations1.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/mutations2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/mutations2.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/mutations3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/mutations3.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/nextclade-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/nextclade-output.png


--------------------------------------------------------------------------------
/docs/images/sc2-recombinants/nextclade-output2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pha4ge/pipeline-resources/eab053ed4f9a46ce9a4092652320cdbd10c1a65f/docs/images/sc2-recombinants/nextclade-output2.png


--------------------------------------------------------------------------------
/docs/mpxv-bioinfo-solutions.md:
--------------------------------------------------------------------------------
  1 | # **Bioinformatics Solutions for Mpox Genomic Analysis**
  2 | 
  3 | PHA4GE Bioinformatics Pipelines &amp; Visualization Working Group <br/>
  4 | Libuit KG, Southgate J, Ünal G, Maguire F, Smith E, Kapsak S, van Heusden P, Wright S, Neher R, Diallo A
  5 | 
  6 | <details>
  7 |  <summary> Document Changelog</summary>
  8 |  
  9 | - 2022-10-10:
 10 |   - First draft published
 11 | - 2022-11-28:
 12 |   - Nomenclature update: Monkeypox -> Mpox
 13 | - 2023-03-09:
 14 |   - Add changelog
 15 | - 2024-08-23:
 16 |   - Add PolkaPox and TOSTADAS details
 17 | </details>
 18 | 
 19 | 
 20 | # Overview
 21 | 
 22 | Genomic analysis of Mpox virus (MPXV) samples by public health laboratories is a critical component in understanding the global outbreak. The integration and awareness of appropriate bioinformatics tools to support these endeavours are potential challenges.
 23 | 
 24 | In an attempt to assist this integration process, the Bioinformatics Pipelines and Visualization Working Group of the [Public Health Alliance for Genomic Epidemiology (PHA4GE)](https://www.pha4ge.org) has drafted this living document to help define the major bioinformatics challenges for MPXV genomic analysis and suggest various open-source and freely available bioinformatics resources to address them.
 25 | 
 26 | Please note that the bioinformatics resources listed in this document are simply an attempt to highlight the most accessible solutions **as per the opinions of our working group** and in no way represent a comprehensive list of all available MPXV bioinformatics resources. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull-requests and/or raised GitHub issues.
 27 | 
 28 | 
 29 | # Background
 30 | 
 31 | Mpox is a viral zoonosis caused by the mpox virus (MPXV), which belongs to the genus Orthopoxvirus in the family Poxviridae. The virus can be transmitted to humans from animals. After the eradication of smallpox in 1980, mpox emerged as the most important orthopoxvirus for public health. MPXV is an enveloped double-stranded DNA virus with two distinct genetic clades: the Central African (Congo Basin) clade and the West African clade. The Congo Basin clade has historically caused more severe disease and is considered more transmissible ([WHO](https://www.who.int/news-room/fact-sheets/detail/monkeypox)). The clinical presentation is similar to that of smallpox, and prior smallpox vaccination can provide some cross-immunity. The lethality rate varies between 1% and 10%, and transmission between humans occurs mainly through direct contact or body fluids and via droplets ([Berthet, N. et al.](https://rdcu.be/cTOiG)).
 32 | 
 33 | MPXV has a linear DNA genome of ≈197 kb. As in other orthopoxviruses, the central coding region sequence (CRS) of MPXV, located at positions ≈56000–120000, is highly conserved. The genes at the terminal ends of the MPXV genome are responsible for immunomodulation, host range and pathogenicity, and the ITR region contains at least 4 ORFs ([Kugelman, JR et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3901482/)).
 34 | 
 35 | 
 36 | # Public Mpox Case Databases
 37 | 
 38 | This [repository](https://github.com/globaldothealth/monkeypox) contains dated records of curated Mpox cases from the 2022 outbreak (April - ), a data dictionary, and a script used to pull contents from a spreadsheet into JSON and CSV files.
 39 | 
 40 | The downloadable [data file](https://www.ecdc.europa.eu/en/publications-data/data-monkeypox-cases-eueea) contains information on the number of mpox cases reported by EU/EEA countries or collected through epidemiologic intelligence at ECDC. Each row contains the corresponding data for a country, day of reporting, number of cases and source of information (data are in long format). The file is updated twice a week. You may use the data in line with ECDC’s copyright and data usage policy.
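
The long-format layout described above (one row per country and reporting day) can be summarised with a few lines of Python. This is a minimal sketch only: the column names `date`, `country`, `cases` and `source` are placeholders chosen for illustration and are not taken from the ECDC file, so check the header of the downloaded file and adjust them before use.

```python
import csv
import io
from collections import defaultdict


def total_cases_by_country(rows, country_col="country", cases_col="cases"):
    """Sum case counts per country from long-format rows (dicts).

    The default column names are hypothetical placeholders, not the
    real ECDC headers.
    """
    totals = defaultdict(int)
    for row in rows:
        # Treat an empty case field as zero rather than failing.
        totals[row[country_col]] += int(row[cases_col] or 0)
    return dict(totals)


# Small in-memory stand-in for the downloaded CSV (made-up values).
sample_csv = io.StringIO(
    "date,country,cases,source\n"
    "2022-06-01,Austria,2,TESSy\n"
    "2022-06-02,Austria,3,TESSy\n"
    "2022-06-01,Belgium,7,TESSy\n"
)

totals = total_cases_by_country(csv.DictReader(sample_csv))
```

To apply the same function to a real download, open the file with `open(path, newline="", encoding="utf-8")` and pass the resulting `csv.DictReader` together with the actual column names.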
 41 | 
 42 | This [report](https://monkeypoxreport.ecdc.europa.eu/) provides an overview of the total number of cases of mpox identified by ECDC and the WHO Regional Office for Europe through IHR mechanisms and official public resources and case-based data through The European Surveillance System (TESSy) up to 9 August 2022. The first summary table and maps (first two tabs) describe the number of cases identified through the different platforms. The following figures and tables describe national case-based data for surveillance of mpox reported in TESSy from all the countries and areas of the WHO European Region, including the 27 countries of the European Union (EU) and the additional three countries of the European Economic Area (EEA).
 43 | 
 44 | 
 45 | # Bioinformatics Challenges for Public Health
 46 | 
 47 | The PHA4GE Bioinformatics Pipelines and Visualization Working Group has defined four key public health bioinformatics challenges for genomic analysis of MPXV samples:
 48 | 
 49 | 1. **Generating consensus assemblies** 
 50 | 
 51 | 2. **Submission of sequence data to internationally accessible databases** 
 52 | 
 53 | 3. **Screening for Variants of Concern** 
 54 | 
 55 | 4. **Performing Phylogenetic analysis of MPXV datasets** 
 56 | 
 57 | # Open-Access/Source Bioinformatics Solutions & Resources
 58 | 
 59 | ## Video resources
 60 | 
 61 | - [BV-BRC Mpox and Orthopoxvirus Mini Symposium Playlist](https://youtube.com/playlist?list=PLWfOyhOW_OavOhvmuyUf19nsYASClMnXU)
 62 | 
 63 | ## Sequencing resources
 64 | 
 65 | - [PrimalSeq amplicon scheme and protocol](https://www.protocols.io/view/monkeypox-virus-multiplexed-pcr-amplicon-sequencin-cd8ds9s6)
 66 | - [Yale tiled-amplicon protocol](https://www.protocols.io/view/monkeypox-virus-multiplexed-pcr-amplicon-sequencin-5qpvob1nbl4o/v2)
 67 | 
 68 | 
 69 | ## Generating consensus assemblies
 70 | - [TheiaCoV workflows (for Illumina SE/PE, ONT, and fasta files) with MPXV input variables](https://www.protocols.io/view/monkeypox-virus-multiplexed-pcr-amplicon-sequencin-cd8ds9s6)
 71 |     - Supports amplicon and metagenomic data
 72 | - [GalaxyProject MPXV analysis effort](https://galaxyproject.org/projects/mpxv/)
 73 |     - Only supports Illumina PE metagenomic data
 74 | - [Nextflow workflow from the Utah PHL](https://github.com/UPHL-BioNGS/Cecret#monkeypox)
 75 |     - Supports amplicon and metagenomic data
 76 | - [Epi2Me](https://labs.epi2me.io/basic-monkeypox-workflow/)
 77 |     - Only supports metagenomic data
 78 | - [Viral-Recon](https://github.com/nf-core/viralrecon):
 79 |     - Workflow for raw read quality control, de-hosting, assembly, variant calling, and consensus generation for Illumina and Nanopore monkeypox data. It does not currently include pre-built support for monkeypox (e.g., reference genome, reference annotations, Nextclade dataset, and amplicon schemes), but these can be supplied by the user on the command line and should be appropriate to the sequencing method. For amplicon sequencing, use the reference that was used to create the amplicon scheme; for metagenomic sequencing, to be consistent with Nextstrain, you can use NC_063383.1.fasta and NC_063383.1.gff with the Nextclade dataset nextclade_hMPXV_B1_pseudo_ON563414_XXXXXXX.
 80 | - [PolkaPox](https://github.com/CDCgov/polkapox):
 81 |     - Nextflow workflow for taxonomic filtering, trimming, quality control, reference-based analysis, and de novo assembly of Illumina metagenomic sequencing reads from orthopoxviruses, including multiple lineages of MPXV.
 82 | 
 83 | ## Submission of sequence data to internationally accessible databases
 84 | - [Sample Metadata Specifications](https://sprcdn-assets.sprinklr.com/1652/133486a8-9b49-4461-a0d7-211c140947cc-562840094.pdf)
 85 | - Preparation and/or Submission of Samples
 86 |     - Terra_2_NCBI workflow (only SRA/BioSample at the moment) for programmatic submission of raw read data analysed on Terra to SRA and BioSample
 87 |     - [NCBI guide to submit consensus sequences using BankIt](https://www.ncbi.nlm.nih.gov/genbank/monkeypox_submission/)
 88 |     - [TOSTADAS](https://github.com/CDCgov/tostadas): Metadata validation, standardized gene annotation, and programmatic NCBI submission (BioSample, SRA, GenBank).
 89 | - Assess Data Quality Prior to Submission
 90 | 
 91 | 
 92 | ## Screening for Variants of Concern
 93 | 
 94 | - [Nextclade](https://clades.nextstrain.org/)
 95 |     - Assignment of consensus sequences to Nextstrain clades, quality control, and mutation effect annotation. Pre-built reference datasets are available for inferred ancestral monkeypox, the human monkeypox clade, and the specific B.1 human monkeypox clade.
 96 | 
 97 | 
 98 | ## Performing Phylogenetic analysis of MPXV datasets
 99 | 
100 | - [Augur](https://docs.nextstrain.org/projects/augur/en/stable/index.html)
101 |     - A bioinformatics toolkit for phylogenetic analysis which constructs phylogenetic trees that can be visualised in Nextstrain
102 | 
103 | - [Nextstrain Mpox build workflow](https://github.com/nextstrain/monkeypox)
104 |     - Workflow to perform contextualised phylogenetic analysis of monkeypox consensus sequences (by default using the human monkeypox reference genome NC_063383.1)
105 | 
106 | - [Taxonium](https://taxonium.org/?treeUrl=https%3A%2F%2Fns-proxy.vercel.app%2Fapi%2Fcharon%2FgetDataset%3Fprefix%3Dmonkeypox%2Fhmpxv1&ladderizeTree=true&treeType=nextstrain&color=%7B%22field%22%3A%22meta_country%22%7D)
107 |     - Tool for exploring large phylogenetic trees - Mpox sequences from GenBank
108 | 
109 | ## Publicly available data
110 | To help you get started with phylogenetic analysis, Nextstrain provides the MPXV data available on NCBI in aggregated form:
111 | - [Sequences](https://data.nextstrain.org/files/workflows/monkeypox/sequences.fasta.xz)
112 | - [Metadata](https://data.nextstrain.org/files/workflows/monkeypox/metadata.tsv.gz)
113 | 
114 | Pairwise alignments with [Nextclade](https://clades.nextstrain.org/) against the [reference sequence MPXV-M5312_HM12_Rivers](https://www.ncbi.nlm.nih.gov/nuccore/NC_063383), insertions relative to the reference, and translated ORFs are available:
115 | 
116 | - [Alignment](https://data.nextstrain.org/files/workflows/monkeypox/alignment.fasta.xz)
117 | - [Insertions](https://data.nextstrain.org/files/workflows/monkeypox/insertions.csv.gz)
118 | - [Translations](https://data.nextstrain.org/files/workflows/monkeypox/translations.zip)
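
The sequence and metadata files above are distributed as `.fasta.xz` and `.tsv.gz`, and both can be read with the Python standard library alone. The following is a minimal sketch rather than an official Nextstrain recipe: the function names are illustrative, and the demo column names (`accession`, `country`) and the record are made-up placeholders, so inspect the real metadata header for the actual fields.

```python
import csv
import gzip
import io
import lzma


def read_metadata(binary_stream):
    """Parse a gzip-compressed TSV into a list of dicts, one per row."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def read_fasta(binary_stream):
    """Parse an xz-compressed FASTA into {record_name: sequence}."""
    records = {}
    name = None
    with lzma.open(binary_stream, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token as the record name.
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Tiny in-memory stand-ins for the downloaded files (placeholder content).
demo_metadata = gzip.compress(b"accession\tcountry\nDEMO_1\tExampleland\n")
demo_fasta = lzma.compress(b">DEMO_1 demo record\nACGT\nTTGA\n")

metadata_rows = read_metadata(io.BytesIO(demo_metadata))
sequences = read_fasta(io.BytesIO(demo_fasta))
```

With the real downloads, pass file objects opened in binary mode instead, for example `read_metadata(open("metadata.tsv.gz", "rb"))`. Note that `read_fasta` holds every sequence in memory, which may be impractical for large datasets; a streaming parser would be the safer choice there.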
119 | 
120 | 


--------------------------------------------------------------------------------
/docs/omicron-resources.md:
--------------------------------------------------------------------------------
  1 | # Omicron Variant Resources
  2 | 
  3 | **PHA4GE Bioinformatics Pipelines &amp; Visualization Working Group** <br/>
  4 | Libuit KG, Spinler JK, Southgate J, Black A, Nekrutenko A, Neuhaus B, O’Cathail C, Lemmer D, Jones D, Smith E, Gnimpieba E, Guthrie J, Maturure P, Monsierurs P, Maier W, Langhorst B, Page A, & Niewiadomska AM 
  5 | 
  6 | <details>
  7 |  <summary> Document Changelog</summary>
  8 |  
  9 | - 2021-12-19:
 10 |   - Added section detailing Omicron lineage and clade nomenclature, COVID-19 scenario modeling resource, and additional reference sequences
 11 |   - Updated Pangolin and Nextclade software minimums and resource links for genomic information (e.g. defining mutations), visualizations, and global case counts over time to include B.1.1.529 sub lineages
 12 | - 2022-10-10:
 13 |   - Updated variant designations
 14 |   - Historical Information / Archived Data added
 15 | - 2023-03-09:
 16 |   - Format changelog 
 17 | </details>
 18 | 
 19 | # Overview
 20 | 
 21 | The [World Health Organization (WHO) has classified the SARS-CoV-2 B.1.1.529 variant as a Variant of Concern (VOC)](https://www.who.int/news/item/26-11-2021-classification-of-omicron-(b.1.1.529)-sars-cov-2-variant-of-concern) under the advice of the [Technical Advisory Group on SARS-CoV-2 Virus Evolution (TAG-VE)](https://www.who.int/groups/technical-advisory-group-on-sars-cov-2-virus-evolution), an independent group of experts that periodically monitors and evaluates the evolution of SARS-CoV-2 and assesses whether specific mutations and combinations of mutations alter the behavior of the virus. The WHO has assigned the B.1.1.529 VOC the label Omicron per [their Greek-letter key variant assignment system](https://www.who.int/news/item/31-05-2021-who-announces-simple-easy-to-say-labels-for-sars-cov-2-variants-of-interest-and-concern). The elevation of Omicron to a WHO-designated VOC was based on the TAG-VE's assessment of the variant’s large number of genomic mutations and plausible impact on COVID-19 epidemiology.
 22 | 
 23 | The PHA4GE Pipelines and Visualization Working Group has created this document to highlight critical open-source/open-access resources to aid in the understanding and further analysis of the Omicron variant.
 24 | 
 25 | In no way does this document represent a comprehensive list of all available SC2 bioinformatics resources. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull-requests and/or raised GitHub issues.
 26 | 
 27 | ## Contents
 28 | - [General Information on the Omicron Variant](#general-information-on-the-omicron-variant)
 29 | 	- [Omicron Lineage and Clade Nomenclature](#omicron-lineage-and-clade-nomenclature)
 30 | 	- [Educational Material](#educational-material)
 31 | 	- [Public Health Announcements and Publications](#public-health-announcements-and-publications)
 32 | 	- [Technical Details and Global Trackers](#technical-details-and-global-trackers)
 33 | 	- [Phylogenetic Visualizations](#phylogenetic-visualizations)
 34 | 	- [Data Reporting and Sharing](#data-reporting-and-sharing)
 35 | - [Potential impacts of Spike Protein Mutations](#potential-impacts-of-spike-protein-mutations)
 36 |     - [Diagnostic and Sequencing Assays](#diagnostic-and-sequencing-assays)
 37 | - [Bioinformatics Resources and Considerations](#bioinformatics-resources-and-considerations)
 38 |     - [Software Version Minimums](#software-version-minimums)
 39 |     - [Reference Sequences](#reference-sequences-and-assemblies)
 40 |     - [SARS-CoV-2 Multiple Sequence Alignments](#sars-cov-2-multiple-sequence-alignments)
 41 | 
 42 | # General Information on the Omicron Variant
 43 | Below is a list of educational material, public health announcements and publications, technical details and global trackers, phylogenetic visualizations, and resources to assist in data sharing and reporting of the Omicron variant.
 44 | 
 45 | ## Omicron Lineage and Clade Nomenclature
 46 | - The Omicron Variant is the [WHO SARS-CoV-2 VOC label](https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/) for the Pango lineage B.1.1.529 (including BA.1, BA.2, BA.3, BA.4, BA.5, descendant lineages, and the recombinant lineage XE). Nextstrain designations include 21K, 21L, 21M, 22A, 22B, and 22C.
 47 | 
 48 | ## Educational Material
 49 | - [Nature News Article - Heavily mutated Omicron variant puts scientists on alert](https://www.nature.com/articles/d41586-021-03552-w): Overview of the identified variant and its potential public health impacts.
 50 | - [Theiagen Genomics Primer on the Omicron Variant (Video)](https://www.youtube.com/watch?v=xhyWjPgdP9U): To assist public health scientists' understanding of the Omicron variant, Frank Ambrosio recorded a short primer that includes an overview of the Nature news article by Ewen Callaway, visual depictions of key Omicron mutations, and the genetic diversity of Omicron relative to other SARS-CoV-2 variants using MicrobeTrace.
 51 | 
 52 | ## Public Health Announcements and Publications
 53 | - [Classification of Omicron (B.1.1.529): SARS-CoV-2 Variant of Concern (World Health Organization)](https://www.who.int/news/item/26-11-2021-classification-of-omicron-(b.1.1.529)-sars-cov-2-variant-of-concern)
 54 | - [CDC Statement on B.1.1.529 (Omicron variant)](https://www.cdc.gov/media/releases/2021/s1126-B11-529-omicron.html)
 55 | - [CDC Science Brief: Omicron (B.1.1.529) Variant](https://www.cdc.gov/coronavirus/2019-ncov/science/science-briefs/scientific-brief-omicron-variant.html)
 56 | - [SARS-CoV-2 variants of concern as of 3 December 2021 (ECDC)](https://www.ecdc.europa.eu/en/covid-19/variants-concern)
 57 | - [Implications of the further emergence and spread of the SARS-CoV-2 B.1.1.529 variant of concern (Omicron) for the EU/EEA (ECDC, 2021-12-02)](https://www.ecdc.europa.eu/sites/default/files/documents/threat-assessment-covid-19-emergence-sars-cov-2-variant-omicron-december-2021.pdf)
 58 | - [SARS-CoV-2 variants of concern and variants under investigation in England (UK Health Security Agency)](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1036501/Technical_Briefing_29_published_26_November_2021.pdf)
 59 | - [Genomic surveillance of SARS-CoV-2 in Belgium (National Reference Laboratory (UZ Leuven & KU Leuven))](https://assets.uzleuven.be/files/2021-11/genomic_surveillance_update_211126.pdf)
 60 | - [SARS-CoV-2 variants of concern and variants under investigation in England Variant of concern: Omicron, VOC21NOV-01 (B.1.1.529); Technical briefing 30 (2021-12-03)](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1038404/Technical_Briefing_30.pdf)
 61 | 
 62 | ## Technical Details and Global Trackers
 63 | 
 64 | - Various resources for genomic information (e.g. defining mutations), visualizations, and global case counts over time:
 65 |   - COV-Lineage Variant Summary Pages: [B.1.1.529](https://cov-lineages.org/lineage.html?lineage=B.1.1.529), [BA.1](https://cov-lineages.org/lineage.html?lineage=BA.1), [BA.2](https://cov-lineages.org/lineage.html?lineage=BA.2), [BA.3](https://cov-lineages.org/lineage.html?lineage=BA.3), [BA.4](https://cov-lineages.org/lineage.html?lineage=BA.4), [BA.5](https://cov-lineages.org/lineage.html?lineage=BA.5) & [XE](https://cov-lineages.org/lineage.html?lineage=XE)
 66 |   - BV-BRC Lineage Profiles: [BA.1](https://bv-brc.org/view/VariantLineage/#view_tab=lineage&loc=BA.1), [BA.2](https://bv-brc.org/view/VariantLineage/#view_tab=lineage&loc=BA.2), [BA.3](https://bv-brc.org/view/VariantLineage/#view_tab=lineage&loc=BA.3), [BA.4](https://www.bv-brc.org/view/VariantLineage/#view_tab=lineage&loc=BA.4), [BA.5](https://www.bv-brc.org/view/VariantLineage/#view_tab=lineage&loc=BA.5)
 67 |   - [Outbreak.info Omicron Variant Report](https://outbreak.info/situation-reports/omicron) 
 68 |   - CoVariants (Omicron) Profiles for Nextstrain: [21K](https://covariants.org/variants/21K.Omicron), [21L](https://covariants.org/variants/21L.Omicron), [22A](https://covariants.org/variants/22A.Omicron), [22B](https://covariants.org/variants/22B.Omicron), [22C](https://covariants.org/variants/22C.Omicron), [22D](https://covariants.org/variants/22D.Omicron)
 69 |   - [CNCB RCoV19 Lineage Browser](https://ngdc.cncb.ac.cn/ncov/lineage?lineage=B.1.1.529#goto) 
 70 |   - [COVID-19 Scenario Modeling Hub](https://covid19scenariomodelinghub.org/viz.html): Synthesis of over 30 COVID-19 models for public health forecasting
 71 | 
 72 | ## Phylogenetic Visualizations
 73 | - [NextStrain Build of B.1.1.529 (21K)](https://nextstrain.org/groups/neherlab/ncov/21K)
 74 | - [Outbreak.info VOC Lineage Comparisons](https://outbreak.info/compare-lineages?gene=ORF1a&gene=ORF1b&gene=S&gene=ORF8&gene=N&gene=ORF3a&gene=E&gene=M&gene=ORF6&gene=ORF7a&gene=ORF7b&gene=ORF10&threshold=75&nthresh=1&sub=false&dark=true)
 75 | 
 76 | ## Data Reporting and Sharing
 77 | - [PHA4GE Resource on Data Sharing](https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification): Sharing of sample read and assembly data through internationally accessible databases allows insights to be drawn about how the virus is spreading and mutating across the globe; the more freely available these data are to international researchers and public health scientists, the stronger our decision making can be.
 78 | - [PHA4GE Resource on Data Submission](https://github.com/pha4ge/pipeline-resources/blob/main/docs/bioinfo-solutions.md#2-submitting-raw-sequence-data-fastq-consensus-assemblies-fasta-and-relevant-sample-metadata-to-internationally-accessible-databases): Resources developed to assist in the preparation and submission of raw NGS read data (fastq files), SC2 consensus assemblies (fasta files), and contextual sample metadata to internationally-accessible databases such as NCBI, ENA, and GISAID
 79 |     
 80 | # Potential Impacts of Spike Protein Mutations
 81 | 
 82 | The spike protein of the SARS-CoV-2 Omicron variant contains approximately 32 mutations, many of which have not been observed in previous VOCs. However, based on their location, several of these mutations have the potential to impact immune escape, transmissibility, and detection. Spike mutations found in the Omicron VOC can be analyzed in detail using the [Stanford University Coronavirus Antiviral & Resistance Database](https://covdb.stanford.edu/sierra/sars2/by-patterns/).
 83 | 
 84 | ![Omicron S-gene mutations](./images/omicron_standford.svg)
 85 | 
 86 | - Up to 15 mutations have been observed within the receptor binding domain (RBD). The RBD region of the Spike protein interacts directly with the human receptor ACE2 and mutations in this region may have a direct impact on how well SARS-CoV-2 viral particles attach to a host cell. 
 87 | 
 88 | - Approximately 8 mutations have been observed within the N-terminal domain (NTD). The NTD of the Spike protein aids in virus attachment and mutations in this region could also impact virus infectivity. 
 89 | 
 90 | - Both the RBD and NTD are surface exposed areas of the Spike protein that are targeted by antibodies. Mutations in these regions have the potential to evade immunity by antibodies acquired through previous infection or vaccination.
 91 | 
 92 | - Three mutations occur near the furin cleavage site, the region of the Spike protein responsible for viral-host membrane fusion. Mutations in this region have the potential to affect viral entry into host cells.
 93 | 
 94 | ## Diagnostic and Sequencing Assays
 95 | 
 96 | Mutations in the SARS-CoV-2 genome can affect PCR-based diagnostic assays and genomic sequencing. For example, the ThermoFisher TaqPath probe targeting the Spike gene is known to result in S-gene target failure (SGTF) when amplifying nucleic acid preparations from VOC Alpha. This occurs when the SARS-CoV-2 genome contains a deletion resulting in the loss of amino acids 69-70 of the NTD. When coupled with the positive amplification of other SARS-CoV-2 genetic regions, SGTF has been used as a diagnostic indicator of VOC presence ([SGF Deletion Assay](https://www.biorxiv.org/content/10.1101/2021.10.25.465706v1.full)).
 97 | 
 98 | - [Thermo Fisher Scientific Confirms Detection of SARS-CoV-2 in Samples Containing the Omicron Variant with its TaqPath COVID-19 Tests](https://thermofisher.mediaroom.com/2021-11-29-Thermo-Fisher-Scientific-Confirms-Detection-of-SARS-CoV-2-in-Samples-Containing-the-Omicron-Variant-with-its-TaqPath-COVID-19-Tests): The Omicron variant contains the NTD deletion at amino acids 69/70 and results in SGTF by the TaqPath PCR assay. 
 99 | 
100 | - [NEB's Primer Monitor Tool](https://primer-monitor.neb.com/lineages): Monitor registered primer sets for overlapping sequence variants in Omicron.
101 | 
102 | - [SARS-CoV-2 Artic V4.1 update for Omicron variant](https://community.artic.network/t/sars-cov-2-v4-1-update-for-omicron-variant/342): Ten mutations in the Omicron VOC affect the Artic V4 primer scheme for whole genome sequencing. The Artic Network has designed 11 new primers to account for these mutations. 
103 | 
104 |  
105 | # Bioinformatics Resources and Considerations
106 | Genome assembly as well as clade and lineage assignment of Omicron variants should follow the same bioinformatics workflow recommendations outlined in this working group's [Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis](https://github.com/jkspinler/pipeline-resources/edit/main/docs/omicron-resources.md) guidance document. Briefly, raw amplicon read data should be mapped to the Wuhan-1 reference genome and primer trimming performed before a consensus genome is called. Clade and lineage assignments can then be made by analyzing the resulting consensus genome assemblies with the [NextClade](https://clades.nextstrain.org/) and [Pangolin](https://pangolin.cog-uk.io/) software, respectively.
107 | 
108 | ## Software Version Minimums
109 | For laboratories making clade and lineage assignments outside of the NextClade and Pangolin web applications, e.g. through a custom workflow available on CLI, Terra.Bio, or Galaxy Project, please use updated NextClade and Pangolin software capable of making an accurate Omicron clade and lineage designation:
110 | - [NextClade Software Version 1.7.0](https://github.com/nextstrain/nextclade/releases/tag/1.7.0) ([Dataset Tag >=2021-12-16T20:15:53Z](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html))
111 |   - [NextStrain Docker Container Image](https://hub.docker.com/r/nextstrain/nextclade)
112 | - [Pangolin Software Version 3.1.17](https://github.com/cov-lineages/pangolin/releases/tag/v3.1.17) ([Constellations >=0.1.0](https://github.com/cov-lineages/constellations/releases/tag/v0.1.0))
113 |   - [StaPH-B Docker Container Image](https://hub.docker.com/r/staphb/pangolin/tags?page=1&ordering=last_updated)
114 |   - [BioContainer Docker Container Image](https://quay.io/repository/biocontainers/pangolin?tab=tags)
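For custom workflows, the version minimums above can be checked programmatically. The following is a minimal sketch, assuming the caller has already captured each tool's reported version string (for example from its `--version` output); the function and dictionary names are hypothetical, and the check does not cover the NextClade dataset tag or the Constellations version.

```python
import re

# Minimum software versions listed above; the keys are illustrative names.
MINIMUM_VERSIONS = {
    "nextclade": (1, 7, 0),
    "pangolin": (3, 1, 17),
}


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH triple found in a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(tool, reported_version):
    """Return True if the reported version is at or above the listed minimum."""
    return parse_version(reported_version) >= MINIMUM_VERSIONS[tool]


if __name__ == "__main__":
    # Example version strings; real strings would come from the installed tools.
    print(meets_minimum("pangolin", "pangolin 3.1.16"))
    print(meets_minimum("nextclade", "1.7.0"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating `1.10.0` as older than `1.7.0`.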
115 | 
116 | ## Reference Sequences and Assemblies
117 | - [KRISP CERI NCBI BioProject of Omicron Data](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA784038): Sequencing of the Omicron variant in South Africa by the Kwazulu-Natal Research Innovation and Sequencing Platform (KRISP) and the Centre for Epidemic Response and Innovation (CERI).
118 | - [NCBI SAMN23572360](https://www.ncbi.nlm.nih.gov/biosample/SAMN23572360): Raw read and assembly data for the first Omicron identified in Minnesota, USA
119 | - [NCBI SAMN23637602](https://www.ncbi.nlm.nih.gov/biosample/SAMN23637602): Raw read and assembly data for the first Omicron identified in Massachusetts, USA
120 | - ENA Assemblies: [ERZ4210179](https://www.ebi.ac.uk/ena/browser/view/ERZ4210179), [ERZ4209688](https://www.ebi.ac.uk/ena/browser/view/ERZ4209688), [ERZ4211168](https://www.ebi.ac.uk/ena/browser/view/ERZ4211168), [ERZ4210738](https://www.ebi.ac.uk/ena/browser/view/ERZ4210738)
121 | - [NCBI SAMN23998005](https://www.ncbi.nlm.nih.gov/biosample/SAMN23998005/): Raw read data for an Omicron variant sequenced with the [ONT Midnight 1200 primers](https://store.nanoporetech.com/us/midnight-rt-pcr-expansion.html)
123 | 
124 | ## SARS-CoV-2 Multiple Sequence Alignments
125 | Primer dropouts in Omicron sequence data may lead to errant evolutionary inferences when performing phylogenetic analysis of SARS-CoV-2 genomes. A proposed workaround for these dropout regions is to mask the spike region and adjust the molecular clock rate accordingly, as [performed by Trevor Bedford in a recent phylodynamic analysis](https://twitter.com/trvrb/status/1466102128343093248?s=20).
126 | - [Nextstrain default masked sites for tree topology](https://github.com/nextstrain/ncov/blob/master/defaults/sites_ignored_for_tree_topology.txt)
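The masking step described above can be sketched in Python. This is an illustrative sketch rather than the method used in the linked analysis: it assumes each sequence is already aligned to Wuhan-1 (MN908947.3) coordinates with no inserted columns, and the function names are hypothetical. The spike coordinates are the annotated S gene positions on that reference.

```python
# S gene coordinates on the Wuhan-1 reference (MN908947.3), 1-based inclusive.
SPIKE_START, SPIKE_END = 21563, 25384


def mask_region(sequence, start, end, mask_char="N"):
    """Replace positions start..end (1-based, inclusive) with mask_char."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region must lie within the sequence")
    return sequence[: start - 1] + mask_char * (end - start + 1) + sequence[end:]


def mask_sites(sequence, sites, mask_char="N"):
    """Replace individual 1-based sites with mask_char; out-of-range sites are skipped."""
    chars = list(sequence)
    for site in sites:
        if 1 <= site <= len(chars):
            chars[site - 1] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    # Toy sequence for illustration only; a real genome is ~29.9 kb.
    print(mask_region("ACGTACGT", 3, 5))
    print(mask_sites("ACGTACGT", [1, 8]))
```

For a full-length aligned genome, the spike region would be masked with `mask_region(genome, SPIKE_START, SPIKE_END)`; the accompanying clock-rate adjustment is a separate modelling decision not covered here.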
127 | 
128 | 
129 | 
130 | ## Historical Information / Archived Data
131 | 
132 | - [Pango-designation proposed new lineage](https://github.com/cov-lineages/pango-designation/issues/343) and the [associated twitter thread](https://twitter.com/PeacockFlu/status/1463176821416075279) (Tom Peacock)
133 | - [Proposal for third sublineage in B.1.1.529 (BA.3)](https://github.com/cov-lineages/pango-designation/issues/367) (Andrew Rambaut)
134 |   - Includes table of shared and unique mutations across B.1.1.529, BA.1, BA.2, and BA.3
135 | - [Galaxy EU Omicron Public Analysis](https://galaxyproject.eu/posts/2021/11/29/omicron-and-galaxy/): View of the Omicron lineage’s mutational pattern derived transparently and fully reproducibly from raw sequencing reads using the Galaxy Project bioinformatics platform
136 | - [Omicron Data Round Up](https://docs.google.com/presentation/d/1sOaHoXFZqIUnqmjdeuaUODCqaUSvtxQp4f2hF9pBdn8/edit#slide=id.g104e9fe3cf0_2_75): Summary of the Omicron variant and what can be inferred based on publicly-accessible data presented 2021-12-01 by Anna Niewiadomska 
137 | 


--------------------------------------------------------------------------------
/docs/pipeline-best-practices.md:
--------------------------------------------------------------------------------
  1 | 
  2 | **<p align="center">NOTE: AS OF 2024-03-24, THIS DOCUMENT HAS BEEN MOVED TO A SEPARATE REPOSITORY. FOR THE MOST UP-TO-DATE VERSION OF THIS DOCUMENT, PLEASE VISIT https://github.com/pha4ge/public-health-pipeline-best-practices. </p>**
  3 | 
  4 | ----
  5 | 
  6 | 
  7 | # **Best Practices for Public Health Bioinformatics Pipelines**
  8 | 
  9 | **PHA4GE Bioinformatics Pipelines &amp; Visualization Working Group <br/>**
 10 | Libuit KG, Guthrie J, Ambrosio F, Kapsak C, Unal Gultekin, Holmes J, Wright S, Nguinkal J, Doughty E, Southgate J, O'Cathail C, Carleton H, Kingwara L, Khan W, Baker K, Diallo A, Connor T, Kanwar S, Maturure P, James S, Cuesta I, Dyster V, Gaskin A, Williams C, Smith E, Rokney A, Petkau A, Varona S, Gnimpieba E, Rey S, Macori G, & Mboowa G
 11 | <details>
 12 |  <summary> Document Changelog</summary>
 13 | 
 14 | - 2023-07-05:
 15 |     - Commit to main
 16 | - 2023-12-03:
 17 |     - Adding max duration for commitments to maintain (2 years)
 18 | - 2024-02-11:
 19 |     - Adding descriptions of meeting and verifying proposed standards
 20 | - 2024-03-11:
 21 |     - Focus shift from proposed standards for bioinformatics software to best practices for bioinformatics pipelines
 22 | </details>
 23 | 
 24 | ## Overview
 25 | 
 26 | The field of public health bioinformatics relies heavily on the development and sustainability of high-quality software to support efforts in disease surveillance, outbreak investigation, and genomic research. Bioinformatics pipelines, also known as bioinformatics workflows, play a critical role in facilitating the routine analysis of genomic data by orchestrating the flow from raw data through various processing stages to final analysis. 
 27 | 
 28 | Despite their critical role, the absence of guidelines and best practices specific to public health pathogen genomics has hindered progress towards accessible, reproducible, interoperable, and standardized bioinformatics analysis in public health.
 29 | 
 30 | To support both pipeline developers and analysts who rely on these pipelines to inform critical public health decision making, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has proposed a set of best practices for bioinformatics pipelines, tailored for public health applications. These best practices aim to provide a framework for the development, testing, and maintenance of bioinformatics pipelines, enhancing the quality, reliability, and sustainability of these resources and facilitating their impact on public health.
 31 | 
 32 | ## 10 Best Practices for Public Health Bioinformatics Pipelines
 33 | 
 34 | ### 1. Publicly-Accessible Repository
 35 | _Is the source code for this pipeline available at a publicly-accessible repository URL?_
 36 | 
 37 | Publicly-accessible software bolsters collaboration and expedites innovation in public health bioinformatics, empowering worldwide public health communities to address critical challenges. By enhancing accessibility, publicly available software enables interoperability and reproducibility across public health investigations, crucial for well-informed decision-making and policy creation. Popular code repositories such as GitHub, GitLab, Bitbucket, and SourceForge offer platforms for developers to share their work.
 38 | 
 39 | **To adhere to this best practice:** Host the bioinformatics pipeline on an open code repository platform.
 40 | 
 41 | **To verify adherence to this best practice:** The reviewer should confirm existence and functionality of the repository.
 42 | 
 43 | ### 2. Open-Source License
 44 | _Does the repository contain a plain-text LICENSE file with the contents of an OSI-approved software license?_
 45 | 
 46 | Open-source licenses in public health bioinformatics encourage the widespread adoption and sharing of valuable tools and resources, fostering a collaborative environment for research and innovation. These licenses also support the unrestricted improvement and customization of software, enabling researchers to tailor solutions to specific public health challenges and enhance overall outcomes. Without a license, the code author retains all rights to the work, and others are not allowed to use, copy, distribute, or modify it without permission. This means that the code is effectively unusable by the research community as a whole. A license grants this permission and allows others to use, copy, distribute, or modify the work under certain conditions. Popular license types for open-source bioinformatics software include the GNU General Public License (GPL), the MIT License, and the Apache License 2.0. For helpful information regarding open-source licenses, we recommend the Ars Technica article “[Open source licenses: What, which, and why](https://arstechnica.com/gadgets/2020/02/how-to-choose-an-open-source-license/)”.
 47 | 
 48 | **To adhere to this best practice:** Choose an open-source license (e.g., MIT) for the pipeline, providing legal permissions for users to modify and share the code.
 49 | 
 50 | **To verify adherence to this best practice:** The reviewer should confirm the presence of a clearly defined open-source license in the pipeline repository.
 51 | 
 52 | ### 3. Version Controlled Software 
 53 | _Are stable releases that follow semantic versioning of the pipeline available for implementation in public health laboratories?_
 54 | 
 55 | Utilizing stable software releases with semantic versioning in public health bioinformatics ensures consistent functionality and compatibility, minimizing disruptions in research workflows. This approach also simplifies version tracking and communication, facilitating seamless collaboration among researchers and reducing the likelihood of errors due to software discrepancies. Maintaining a detailed changelog is also highly recommended to track software updates, bug fixes, and feature additions, ensuring transparency and ease of understanding for users.
 56 | 
 57 | **To adhere to this best practice:** Implement semantic versioning (e.g., MAJOR.MINOR.PATCH).
 58 | 
 59 | **To verify adherence to this best practice:** The reviewer should check for version tags in the repository, ensuring that versioning follows semantic versioning principles and accurately reflects the pipeline's development history.
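A reviewer could screen repository tags against the MAJOR.MINOR.PATCH pattern with a short script. The sketch below is a simplified check, not a full implementation of the Semantic Versioning specification: it accepts an optional leading `v` (a common Git tag convention that is an assumption here) and only loosely validates pre-release and build suffixes. The function names are hypothetical.

```python
import re

# MAJOR.MINOR.PATCH with no leading zeros, an optional leading "v",
# and loosely validated optional pre-release ("-rc.1") and build ("+build.5") parts.
SEMVER_TAG = re.compile(
    r"^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)


def is_semver_tag(tag):
    """Return True if the tag looks like a semantic version."""
    return SEMVER_TAG.match(tag) is not None


def non_conforming(tags):
    """Return the tags that do not follow the MAJOR.MINOR.PATCH pattern."""
    return [tag for tag in tags if not is_semver_tag(tag)]


if __name__ == "__main__":
    # Example tag names for illustration only.
    print(non_conforming(["v1.0.0", "1.2", "2.3.1-rc.1", "release-2023"]))
```

A pattern match only shows that tags are well formed; whether the increments accurately reflect the pipeline's development history still requires reading the changelog.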
 60 | 
 61 | ### 4. Workflow Management System
 62 | _Does the pipeline utilize a workflow management system for its development and execution?_
 63 | 
 64 | Implementing workflow management systems in public health bioinformatics pipelines ensures efficient, scalable, and reproducible analyses. These systems automate complex data processing tasks, facilitating seamless integration and execution of diverse bioinformatics tools. By standardizing workflow execution, they enhance data analysis consistency across different computational environments, contributing significantly to the reliability and reproducibility of public health research findings.
 65 | 
 66 | Additionally, the use of workflow management systems facilitates maintainability. The shared knowledge and use of these systems by a community of developers ensure that pipelines can be more readily supported and updated, significantly enhancing long-term usability and stability. Common workflow management systems adopted across public health pathogen genomics include [NextFlow](https://www.nextflow.io/), [WDL](https://openwdl.org/), [SnakeMake](https://snakemake.readthedocs.io/en/stable/), and [CWL](https://www.commonwl.org/).
 67 | 
 68 | **To adhere to this best practice:** Choose a workflow management system that supports scalability, is compatible with common bioinformatics tools, and integrates easily with existing infrastructure. Document the workflow configuration and dependencies clearly.
 69 | 
 70 | **To verify adherence to this best practice:** The reviewer should assess the pipeline source code to identify the use of a workflow management system.
 71 | 
 72 | ### 5. Containerized/Packaged Software
 73 | _Does the pipeline utilize containerized (e.g. Docker) or packaged (e.g. conda) software to enhance interoperable pipeline distribution?_
 74 | 
 75 | Using software packages and/or containerization within public health bioinformatics pipelines enhances interoperability by enabling seamless integration and deployment across different platforms and environments. This approach simplifies pipeline distribution and installation, promotes reproducibility, and facilitates collaboration among researchers, contributing to the development of more accessible and interoperable tools and resources.
 76 | 
 77 | Containers are essential for modern bioinformatics development and pipeline distribution, as the ability to replicate results is a fundamental principle of the scientific method. This is true for public health, as any lab should be able to easily install and maintain a pipeline and reproduce/verify results from another lab. Pipeline software should be packaged with one or a combination of the following methods:
 78 | - conda environments
 79 | - venv environments
 80 | - Singularity containers
 81 | - Docker containers
 82 | 
 83 | There should be a clear summary in the repository README stating which containerization or packaging method has been chosen, with instructions for how a lab can install the pipeline and where Docker images (e.g. Docker Hub or Quay) and/or conda packages (e.g. Anaconda or Bioconda) are available (cross-referenced with Installation Instructions). Documentation should indicate the specific version of each software component included in the pipeline. This is important as specific software versions may impact functionality.
 84 | 
 85 | **To adhere to this best practice:** Implement pipeline components within Docker containers or distribute them as Conda packages; use of containerized/packaged software should be clearly documented.
 86 | 
 87 | **To verify adherence to this best practice:** The reviewer should inspect the pipeline source code and review analytical steps (e.g. NextFlow processes or WDL tasks) to ensure use of containerized or packaged software and verify documentation of these resources.
 88 | 
 89 | ### 6. Common File Formats
 90 | _Does the pipeline accept as input and generate as output common file formats utilized in public health pathogen genomics?_
 91 | 
 92 | Accepting and generating common file formats in public health bioinformatics pipelines enhances interoperability and data exchange between different tools and platforms. This approach facilitates data sharing, promotes consistency, and enables researchers to leverage a wide range of pipeline solutions, contributing to the development of more comprehensive and effective solutions for addressing critical public health challenges. These formats include: .fasta, .fastq(.gz), .sam, .bam, .bai, .bed, .vcf, .gff, .gtf, .txt, .log, .tsv, .csv, .nwk, and .json.
 93 | 
 94 | **To adhere to this best practice:** The pipeline should accept as input and generate as output common file formats utilized in public health pathogen genomics, enhancing interoperability and data exchange between different tools and platforms.
 95 | 
 96 | **To verify adherence to this best practice:** The reviewer should check the documentation or repository for explicit information on the common file formats supported by the pipeline, ensuring compatibility with widely used formats in public health bioinformatics.
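As a rough aid when reviewing a pipeline's documented inputs and outputs, file names can be screened against the extensions listed above. This is a minimal sketch with hypothetical names; it checks extensions only, not file contents, and it assumes that a trailing `.gz` is acceptable for any listed format (the list above names it only for `.fastq`).

```python
from pathlib import Path

# Extensions taken from the list of common formats above.
COMMON_EXTENSIONS = {
    ".fasta", ".fastq", ".sam", ".bam", ".bai", ".bed", ".vcf", ".gff",
    ".gtf", ".txt", ".log", ".tsv", ".csv", ".nwk", ".json",
}


def has_common_format(filename):
    """Return True if the file extension is one of the common formats."""
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    # Assumption: ignore a trailing .gz so that e.g. reads.fastq.gz is accepted.
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    return bool(suffixes) and suffixes[-1] in COMMON_EXTENSIONS


if __name__ == "__main__":
    # Example file names for illustration only.
    for name in ["sample_R1.fastq.gz", "consensus.fasta", "reads.fq"]:
        print(name, has_common_format(name))
```

Aliases such as `.fq` or `.fa` are rejected by this sketch because they are not in the list above; a reviewer may reasonably choose to accept them.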
 97 | 
 98 | ### 7. Software Testing
 99 | _Are there automated and/or manual tests described so that the functionality of the pipeline can be assessed?_
100 | 
101 | Including software tests for public health bioinformatics pipelines ensures the reliability and accuracy of the tools, enhancing user confidence and promoting consistent research outcomes. These tests also facilitate early detection and resolution of potential issues, contributing to the overall stability and robustness of the pipeline in a rapidly evolving public health landscape. At a minimum, pipelines being implemented for public health pathogen genomics should include:
102 | - Smoke tests to ensure that the basic functionality of the program is working correctly
103 | - Unit tests to test individual code units
104 | - System tests/end-to-end tests to assess the overall functionality of the program, with a focus on common and important paths
105 | - Regression tests to ensure that changes to the code do not break existing functionality (note: system tests can be utilized to implement regression tests)
106 | 
107 | When appropriate, inclusion of additional testing strategies may also enhance the robustness of a pipeline, e.g.
108 | - Acceptance tests to ensure that the program meets a project’s fundamental requirements
109 | - Runtime tests to evaluate the pipeline's behavior, performance, and stability during its operation to ensure that it meets the required standards and functions correctly in real-world scenarios
110 | - Testing frameworks, e.g. use of GitHub Actions to automate defined software tests, provide a consistent and organized structure for writing and running test cases, enabling developers to efficiently validate the correctness, performance, and reliability of their pipeline
111 | 
112 | The description of functionality and performance testing should be made accessible via the code repository where the pipeline is made available.
113 | 
114 | **To adhere to this best practice:** Provide a description of the automated and/or manual tests that assess the functionality of the pipeline, ensuring reliability and accuracy.
115 | 
116 | **To verify adherence to this best practice:** The reviewer should check the documentation for a description of tests, automated and/or manual, that evaluate the functionality of the pipeline, contributing to user confidence and consistent research outcomes.
117 | 
118 | ### 8. Benchmark/Validation Datasets
119 | _Is there a publicly available set of inputs with known outputs that can be used to test successful installation and benchmark against other tools?_
120 | 
121 | Including a benchmark or validation dataset for a public health bioinformatics pipeline provides researchers with a standard reference for evaluating and comparing the performance of different tools, promoting transparency and consistency in the evaluation of pipelines. By establishing a common reference point, benchmarking enables researchers to identify the strengths and weaknesses of various pipeline solutions and promotes the development of more accurate, reliable, and effective tools. Authors should make benchmark and/or validation datasets publicly available and well-documented, allowing others to reproduce the experiments and validate the results.
122 | 
123 | A benchmark dataset is a standardized set of inputs with known outputs that is used to compare the performance of different bioinformatics tools on the same set of data. The benchmark dataset is typically designed to be representative of the types of data that the tool is likely to encounter in real-world scenarios and covers a range of use cases.
124 | 
125 | A validation dataset, on the other hand, is used to validate the accuracy and reliability of a specific bioinformatics tool. The validation dataset is designed to test the tool's performance on a range of input data types and sizes and evaluate its ability to correctly identify the target sequences and distinguish them from non-target sequences.
126 | 
127 | **To adhere to this best practice:** Include a benchmark or validation dataset for the public health bioinformatics pipeline, promoting transparency and consistency in the evaluation and comparison of different tools.
128 | 
129 | **To verify adherence to this best practice:** The reviewer should check the documentation or repository for information on the availability and accessibility of a benchmark and/or validation dataset for the public health bioinformatics pipeline.
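Once a dataset with known outputs is available, comparing a pipeline run against it can be as simple as the sketch below. It assumes results have already been parsed into dictionaries keyed by sample ID (for example, one lineage call per sample); the function name, the sample IDs, and the concordance summary are illustrative choices, not a prescribed metric.

```python
def compare_to_expected(observed, expected):
    """Compare observed results to known expected results, keyed by sample ID."""
    # Samples in the benchmark that the pipeline produced no result for.
    missing = sorted(set(expected) - set(observed))
    # Samples with a result that differs from the known output.
    mismatched = sorted(
        sample for sample in expected
        if sample in observed and observed[sample] != expected[sample]
    )
    matched = len(expected) - len(missing) - len(mismatched)
    concordance = matched / len(expected) if expected else 0.0
    return {"concordance": concordance, "missing": missing, "mismatched": mismatched}


if __name__ == "__main__":
    # Hypothetical benchmark: known lineage calls for four samples.
    expected = {"s1": "BA.1", "s2": "BA.2", "s3": "BA.1", "s4": "BA.5"}
    observed = {"s1": "BA.1", "s2": "BA.2", "s3": "BA.2"}
    print(compare_to_expected(observed, expected))
```

Reporting missing and mismatched samples separately helps distinguish installation or runtime failures from genuine differences in analytical results.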
130 | 
131 | ### 9. Reference Data Requirements
132 | _Are required reference data and/or databases clearly documented, publicly accessible, and maintained?_
133 | 
134 | Documenting any external reference data or database requirements for public health bioinformatics pipelines enhances the usability and reproducibility of the pipeline by providing clear and comprehensive information on the necessary data sources and dependencies. This documentation promotes transparency, facilitates replication, and enables researchers to more effectively integrate the pipeline into their workflows, contributing to the development of more reliable and impactful tools and resources.
135 | 
136 | If external reference data or a database is required, the following standards should be met: static versioning, open access, clear instructions to install/access the database, and documentation of the database version and date of most recent update to support version control and compatibility. In addition:
137 | - Identify what aspects of the database need to be documented, such as the database schema, table structure, and stored procedures. Identify the format the documentation should take, such as technical documentation, user guides, and reference manuals.
138 | - Clearly document the sources of data used to construct the database, including information on how the data was acquired, processed, and validated.
139 | - Specify the format of the data, including any file formats, parameters used, and other relevant information.
140 | - Describe the process of data curation, including any quality control measures, data cleaning, and data integration.
141 | - Describe any taxonomy and annotation used in the database, including any reference standards or guidelines that were followed.
142 | - Specify the terms of use and any restrictions on the use of data from the database, including any attribution or citation requirements.
143 | - Mention any community or website, such as a help forum or feedback mechanism, that is in place for the database.
144 | - Ensure that the database is compatible with the pipeline that it is being used with. Make sure that the documentation clearly states the versions of the pipeline and systems that the database is compatible with.
145 | 
146 | The format of the open-access download should be defined, ideally a compressed format that is well suited to downstream usage and analysis.
147 | 
148 | **To adhere to this best practice:** The required reference data and/or databases for a public health bioinformatics pipeline should be clearly documented, publicly accessible, and maintained. If an external reference database is required, it should also adhere to standards such as static versioning, open access, and clear instructions for installation/access. Clearly document aspects of the database, such as the schema, table structure, and stored procedures, in a format suitable for users, such as technical documentation, user guides, and reference manuals.
149 | 
150 | **To verify adherence to this best practice:** The reviewer should check the documentation or repository for comprehensive information on external reference data or database requirements, ensuring transparency, usability, and reproducibility.
151 | 
152 | ### 10. Pipeline Documentation
153 | _Is the pipeline documentation clearly written and publicly accessible?_
154 | 
155 | Having clearly written and publicly-accessible pipeline documentation enhances user understanding, facilitates adoption, and promotes efficient usage of the pipeline. It provides comprehensive instructions, usage examples, and explanations of key functionalities, enabling public health scientists to effectively utilize the pipeline for their specific bioinformatics needs. Best practices for pipeline documentation include:
156 | - Defining the documentation scope: Identify what aspects of the tool's core functionality need to be documented and what format the documentation should take. This can include things like user guides, reference manuals, and API documentation.
157 | - Establishing documentation guidelines: Develop a set of guidelines or standards for documenting the tool's core functionality. 
158 | - Creating a documentation template: Develop a template or set of templates that can be used to create consistent and accurate documentation.
159 | - Reviewing and updating the documentation: Regularly review and update the documentation to ensure that it is accurate and up to date. This can be done by gathering feedback, monitoring usage data, and making adjustments as necessary.
160 | - Keeping it accessible: Make the documentation easily accessible to users by providing it in user-friendly formats such as HTML and PDF.
161 | 
162 | Effective pipeline documentation encompasses a broad range of practices, each targeting specific aspects of usability, transparency, and collaboration. These documentation practices ensure that users and contributors have a clear understanding of the pipeline's development, usage, and governance. By incorporating these elements, documentation becomes a comprehensive resource that supports the pipeline's integrity, facilitates community contribution, and enhances user engagement, making it an indispensable part of best practices in bioinformatics pipeline development.
163 | 
164 | **To adhere to this best practice and verify adherence to this best practice:** Refer to the documentation practices listed below; all pipeline documentation practices should be met.
165 | 
166 | #### 10a. Contribution, Authorship, and Verified Point of Contact
167 | _Does the full list of authors seem appropriate and include a verified point of contact?_
168 | 
169 | Clearly listing authorship and credit in public health bioinformatics acknowledges the contributions of individual researchers, fostering a sense of ownership and responsibility for their work. This practice also promotes transparency, collaboration, and recognition within the scientific community, enhancing career development opportunities and encouraging the sharing of expertise. We recommend using the [CRediT system](https://credit.niso.org/) adopted by the Natural Sciences field to acknowledge contributions to bioinformatic tools. Contributors to the pipeline must be acknowledged as co-authors if they have contributed through programming or pipeline development, designing computer programs, implementing the computer code and supporting algorithms, or testing existing code components.
170 | 
171 | A verified point of contact must include a working email address of an individual or organization that is most likely to maintain the bioinformatic tool in the long term. Ideally, email addresses for multiple individuals should be provided, and these should not be organizational email addresses (e.g. joe.bloggs@pha4ge.org), as individuals could lose access to such an address when they leave that organization. 
172 | 
173 | **To adhere to this documentation practice:** Clearly list authorship and credit for the bioinformatics tool, acknowledging individual contributions and following the CRediT system for appropriate recognition based on specific contributions.
174 | 
175 | **To verify adherence to this documentation practice:** The reviewer should check the documentation and repository for a clear and comprehensive list of authors, ensuring adherence to the CRediT system and appropriate acknowledgment of contributors based on their roles in programming, pipeline development, design, implementation, and testing.
176 | 
177 | #### 10b. Conflict of Interest Statement
178 | _Have all potential conflicts of interest been disclosed?_
179 | 
180 | Stating potential conflicts of interest regarding pipeline authors in public health bioinformatics promotes transparency and integrity in scientific research. This practice enables users to make informed decisions about the pipeline they utilize, ensuring unbiased results and fostering trust within the research community. A conflict of interest is defined as any factor that renders an author, co-author, or collaborative team unable (potentially or otherwise) to perform an independent peer review or evaluation pertaining to a study. Examples of conflicts of interest include, but are not limited to, commercial, personal, political, or religious interests. When developing public health bioinformatics pipelines, any conflict of interest should be disclosed by the responsible authors to ensure that independent peer review and testing of code have been carried out prior to publication. Some conflict of interest statements may be waived if an author can demonstrate they are able to perform an impartial code review. For example, JOSS suggests that if two co-authors did not ever truly collaborate, this might mean a co-author is a suitable selection for code review. 
181 | 
182 | **To adhere to this documentation practice:** Authors of public health bioinformatics pipelines or methods should transparently disclose any potential conflicts of interest, such as commercial, personal, political, or religious affiliations, to promote transparency and integrity in scientific research.
183 | 
184 | **To verify adherence to this documentation practice:** The reviewer should assess the presence of a clear conflict of interest statement in the documentation or repository, ensuring that responsible authors have openly disclosed any factors that may impact their ability to perform an unbiased peer review or evaluation.
185 | 
186 | #### 10c. Pipeline Maintenance Statement
187 | _Have the authors provided documentation regarding the intent to maintain the pipeline?_
188 | 
189 | Ensuring the long-term sustainability and maintenance of a public health bioinformatics pipeline is crucial for its continued relevance and reliability in the face of evolving public health challenges and technological advancements. Clear documentation regarding the authors' intent to maintain the pipeline not only signals to potential users that the tool will remain up-to-date and secure but also demonstrates the authors' dedication to supporting the public health community over time. 
190 | 
191 | This documentation may detail the support mechanism for users, including how they can report issues, request features, or contribute to the project. If applicable, this documentation may also cover the funding model or community support strategy that will ensure the pipeline's ongoing development and maintenance. This could include details on how the project is funded, plans for seeking future funding, or how the project fosters a community of contributors. 
192 | 
193 | **To adhere to this documentation practice:** Authors must include a detailed statement regarding the pipeline's future upkeep, covering aspects such as update schedules, user support mechanisms, and funding or community support strategies. 
194 | 
195 | **To verify adherence to this documentation practice:** The reviewer will check that the documentation contains a comprehensive statement detailing the authors' commitment to maintaining the pipeline, including specific plans for updates, user support, and securing the pipeline's sustainability.
196 | 
197 | #### 10d. Community Guidelines for Contribution and Support
198 | _Are there clear guidelines for third parties wishing to 1) contribute to the pipeline 2) report issues or problems with the pipeline and 3) seek support?_
199 | 
200 | Including community guidelines for contribution and support in a public health bioinformatics pipeline promotes open and transparent communication between developers, users, and the broader scientific community. These guidelines foster an environment of shared knowledge and expertise, enabling individuals to provide feedback, report issues, and contribute to the improvement and sustainability of essential tools and resources. They can include repository style guides, issue templates, and guidelines for providing support to users, including how to report issues, how to troubleshoot common problems, and how to escalate issues that cannot be resolved through standard support channels.
201 | 
202 | **To adhere to this documentation practice:** Include community guidelines for contribution and support on the code repository where the pipeline is made available. Ensure that community guidelines foster an environment of shared knowledge and expertise, enabling individuals to contribute feedback, report issues, and actively participate in the improvement and sustainability of essential tools and resources.
203 | 
204 | **To verify adherence to this documentation practice:** The reviewer should confirm the presence of well-documented community guidelines in the documentation or repository, encompassing aspects such as repository style guides, issue templates, and guidelines for providing user support.
205 | 
206 | 
207 | #### 10e. Statement of Need with Respect to Public Health Pathogen Genomics
208 | _Have the authors clearly stated the challenges in public health pathogen genomics that this pipeline aims to address?_
209 | 
210 | A clear statement of need for a public health bioinformatics pipeline highlights the significance and relevance of the tool within the public health landscape, facilitating its adoption by the target user base. This practice also helps to align development efforts with pressing public health challenges, ensuring that resources are directed towards addressing the most critical issues. For instance, the tool could address the challenge of integrating multiple types of genomic data analysis, such as variant calling, phylogenetic reconstruction, and outbreak investigation, into a single platform. The pipeline could also incorporate machine learning algorithms to provide automated classification and identification of pathogens, reducing the need for manual curation. The type of organization or researcher that the tool is intended for should be made clear, and it is helpful to provide information regarding the level of computational expertise needed.
211 | 
212 | Users are more likely to adopt a new tool if they can see how it addresses existing limitations or provides new and innovative features that are not available in current tools. The authors should explain how their tool is different from existing tools and how it improves upon established methods, if there are any. For example, the tool might provide more accurate and reliable results due to the incorporation of new algorithms or statistical models. It might also be more user-friendly and accessible to users with varying levels of computational expertise, allowing a wider range of end-users to take advantage of the tool's capabilities.
213 | 
214 | **To adhere to this documentation practice:** Clearly state the significance and relevance of the public health bioinformatics pipeline and explain how the tool differs from existing ones and/or improves established methods.
215 | 
216 | **To verify adherence to this documentation practice:** The reviewer will ensure that the documentation includes a statement highlighting the pipeline's purpose and its alignment with pressing public health challenges.
217 | 
218 | #### 10f. Pipeline Functionality
219 | _Has the function of this software as it pertains to public health bioinformatics been clearly articulated?_
220 | 
221 | Including a clear indication of software function in public health bioinformatics enables researchers to easily identify the most suitable tools for their specific needs, enhancing productivity and the overall quality of their work. This practice also fosters informed decision-making, ensuring that the software is applied effectively and appropriately to address public health challenges.
222 | 
223 | The intended use of the software in the context of public health pathogen genomics should be clearly stated, accompanied by the means to confirm this functionality, e.g. through the provision of a validation dataset with information detailing expected outputs and, if appropriate, how these outputs can be compared to a benchmark standard. Standardizing the documentation of the core functionality of a tool can help to ensure that it is clear, accurate, and easy to understand. Limitations of the software that may affect use of the results for clinical or epidemiological purposes and decisions should also be indicated with clarification of which organisms, species and subspecies have been validated for use, and where limitations and discrepancies may occur.
224 | 
225 | **To adhere to this documentation practice:** Clearly indicate the function, intended use, and limitations of the software.
226 | 
227 | **To verify adherence to this documentation practice:** The reviewer should check for explicit documentation that clearly communicates the software's functionality, intended use, limitations, validated organisms, and potential discrepancies.
228 | 
229 | #### 10g. Documentation for Local Installation and/or Remote Access (e.g. Web Server or Galaxy/Terra Workflow)
230 | _Does installation and/or access to the pipeline proceed as outlined in the documentation?_
231 | 
232 | Providing clear local installation and/or remote access instructions for a public health bioinformatics pipeline streamlines the user experience, enabling researchers to efficiently deploy and utilize essential tools. This practice also minimizes potential technical barriers, fostering accessibility and promoting widespread adoption within the scientific community. The pipeline installation guide should be clear, concise, and easy to follow. The system requirements for the pipeline should be outlined with a clearly stated list of all prerequisites and dependencies that are necessary to install the pipeline correctly. Ideally, dependencies are handled with an automated package management solution. Each required dependency should be listed with its minimum required version and release.
233 | 
234 | Installation instructions should include a step-by-step list. Configuration settings should be detailed, with the expected outcome of each setting made clear. A method for verifying a successful installation should be described, along with typical problems that might occur during installation and methods for troubleshooting them. Where manual post-installation or cleanup tasks are required, details of those tasks should be provided. Any software license terms should be listed. 
235 | Methods to update the pipeline should also be described within the installation instructions.
236 | 
237 | If the resources have been made available via a web application (e.g. Galaxy or Terra.Bio), instructions on how to access and utilize the pipeline through the web application should be clearly indicated. 
238 | 
239 | **To adhere to this documentation practice:** Authors of public health bioinformatics pipelines should offer clear local installation and/or remote access instructions, ensuring a streamlined user experience and facilitating efficient deployment of essential tools. Installation instructions should include detailed configuration settings with clear expected outcomes, step-by-step lists, and verification methods for successful installation, along with troubleshooting guidance for common issues. If resources are available via a web application (e.g., Galaxy or Terra.Bio), clear instructions on how to access and utilize the pipeline through the web application should be provided.
240 | 
241 | **To verify adherence to this documentation practice:** The reviewer should check for a well-documented installation guide with clear, concise, and easy-to-follow step-by-step instructions, including system requirements, dependencies, and automated package management solutions, as well as information on software versioning.
242 | 
243 | #### 10h. Example Usage
244 | _Do the authors include examples of how to use this pipeline?_
245 | 
246 | Documenting an example usage for a public health bioinformatics pipeline provides researchers with practical guidance on how to effectively apply the tool in real-world scenarios, enhancing their understanding of its potential applications. This practice promotes successful integration of the pipeline into research workflows, ensuring that it is utilized to its full capacity and ultimately advancing public health outcomes. An example usage for a command-line interface (CLI) tool in public health bioinformatics might illustrate the required input data, command syntax, and expected output, providing a tangible demonstration of the tool's application. For instance, a tool analyzing genomic variants could have an example usage like:
247 | 
248 | ```
249 | Input files:
250 | - sample.vcf (Variant Call Format file containing genomic variants)
251 | 
252 | Command:
253 | $ analyze_variants.py -i sample.vcf -o output.txt -p population_data.csv
254 | 
255 | Output:
256 | - output.txt (file containing the filtered and annotated genomic variants relevant to public health)
257 | ```
258 | 
259 | This example usage showcases the necessary input files, command options, and the output generated, helping researchers to better understand the tool's functionality and how to incorporate it into their own analyses.
260 | 
261 | **To adhere to this documentation practice:** Provide practical guidance by documenting an example usage for the public health bioinformatics pipeline, offering researchers clear instructions on how to effectively apply the tool in real-world scenarios. Ensure that the example usage enhances researchers' understanding of the pipeline, facilitating successful integration into research workflows and maximizing its utility for advancing public health outcomes.
262 | 
263 | **To verify adherence to this documentation practice:** The reviewer should check for comprehensive documentation that includes an example usage, illustrating the required input data, command syntax, and expected output, enhancing researchers' understanding of the tool's potential applications.
264 | 
265 | 


--------------------------------------------------------------------------------
/docs/qc-solutions.md:
--------------------------------------------------------------------------------
  1 | # **QC Solutions for SARS-CoV-2 Genomic Analysis**
  2 | 
  3 | **PHA4GE Bioinformatics Pipelines &amp; Visualization Working Group** <br/>
  4 | Libuit KG, Lunn S, Carleton H, Khan W, Kanwar S, van Heusden P, Amrosio F, Lemmer D, Mboowa G, Macori G, Southgate J 
  5 | 
  6 | <details>
  7 |  <summary> Document Changelog</summary>
  8 |  
  9 | - 2022-06-23:
 10 |   - First draft published
 11 | - 2023-03-09:
 12 |   - Added changelog
 13 | 
 14 | </details>
 15 | 
 16 | 
 17 | # Overview
 18 | 
 19 | Next-generation sequencing (NGS) has expanded the approach of genomic analysis for pathogen surveillance systems. The demand for NGS continues to grow, with the need for high throughput, lower costs, and better quality of data. 
 20 | 
 21 | However, the quality of NGS sequencing data can be affected by library preparation and sequencing processes, systematic variation in quality scores across sequence reads, biases in sequencing due to base composition, and less-than-optimal library fragment sizes and indexes. Such factors can negatively impact the quality of raw sequencing data for downstream analyses. 
 22 | 
 23 | To assist with quality control (QC) measures, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the QC challenges for SARS-CoV-2 (SC2) genomic analysis and to suggest QC solutions that address them.
 24 | 
 25 | Please note that the QC guidelines in this document are simply an attempt to highlight the most accessible solutions **as per the opinions of our working group** and in no way represent a comprehensive system for QC guidance and bioinformatic solutions. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull-requests and/or raised GitHub issues. 
 26 | 
 27 | ## Contents
 28 | - [Process Control For Bioinformatics QC Checkpoints](#process-control-for-bioinformatics-qc-checkpoints)
 29 | - [QC Acceptance Criteria](#qc-acceptance-criteria)
 30 |   - [PHA4GE Suggested Thresholds](#pha4ge-suggested-thresholds)
 31 | - [QC Metric Definitions](#qc-metric-definitions)
 32 |   - [Read QC Metrics](#read-qc-metrics)
 33 |   - [Alignment QC Metrics](#alignment-qc-metrics)
 34 |   - [Consensus Assembly QC Metrics](#consensus-assembly-qc-metrics)
 35 | - [Additional QC Resources and Materials](#additional-qc-resources-and-materials)
 36 | 
 37 | # Process Control For Bioinformatics QC Checkpoints
 38 | The focus of this document is on the quality control (QC) of tiled amplicon sequencing--through the Artic V3 protocol, for example--a common method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample and--as discussed in this working group's [Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis Guidance Document](../docs/bioinfo-solutions.md)--assembling a contiguous SC2 genome from these amplicon read data is a critical step in providing insight from sequenced samples.
 39 | 
 40 | Throughout this process, quality control checkpoints should be conducted at different stages of bioinformatics analysis, including QC of raw read data, pre-processing stages (trimming and filtering), and alignment/assembly.
 41 | 
 42 | <p align="center">
 43 |   <img src="./images/pha4ge_sc2_qc_workflow.png" width="10800" class="center">
 44 | </p>
 45 | 
 46 | In this context, raw read data refers to the fastq read files generated by the NGS platform and processed reads are read files that have had adapter sequences removed, trimmed based on size and quality, and dehosted. Alignment QC refers to the examination of the BAM or VCF files generated during the consensus genome assembly process and Consensus Assembly QC refers to an assessment of the fasta assembly file itself.
 47 | 
 48 | Future updates to this document will include QC guidance for SARS-CoV-2 genomic epidemiology analysis and wastewater sequencing data.
 49 | 									    
 50 | # QC Acceptance Criteria
 51 | 
 52 | When performing QC checks on SARS-CoV-2 genomic data, it can be helpful to establish acceptance thresholds to determine how and when data will be reported and utilized to inform public health decision-making. Below are this working group's suggested QC thresholds for SARS-CoV-2 genomic data, as well as various resources and metric definitions to assist public health laboratories in implementing SARS-CoV-2 sequencing and analysis protocols. 
 53 | 									    
 54 | ## PHA4GE Suggested Thresholds
 55 | 
 56 | <table>
 57 |   <tbody>
 58 |     <tr>
 59 |       <th align="center" colspan=2 background-color="#F00000">Read QC Metrics</th>
 60 |     </tr>
 61 |     <tr>
 62 |       <td rowspan=1 height="5px"><a href="#number-of-reads">Number of Reads </a></td>
 63 |       <td rowspan=1>Protocol dependent, (e.g. 100,000 reads from Artic Amplicons sequenced on Illumina MiSeq)</td>
 64 |     </tr>
 65 |     <tr>
 66 |       <td width="510px;" rowspan=1><a href="#percent-human-reads">Percent Human Reads</a></td>
 67 |       <td width="510px;" rowspan=1><20%</td>
 68 |     </tr>
 69 |     <tr>
 70 |       <th align="center" colspan=2>Alignment QC Metrics</th>
 71 |     </tr>
 72 |     <tr>
 73 |       <td rowspan=1 height="5px"><a href="#sequencing-depth">Average Read Depth</a></td>
 74 |       <td rowspan=1>≥100x </td>
 75 |     </tr>
 76 |     <tr>
 77 |       <td rowspan=1 height="5px"><a href="#percent-mapped-reads">Percent mapped reads to Wuhan reference genome</a></td>
 78 |       <td rowspan=1>≥65% </td>
 79 |     </tr>
 80 |     <tr>
 81 |       <td rowspan=1 height="5px"><a href="#coverage">Coverage at a Single Base to Make a Base Call</a></td>
 82 |       <td rowspan=1>≥50x </td>
 83 |     </tr>
 84 |     <tr>
 85 |       <td rowspan=1 height="5px"><a href="#percent-agreement">Percent Agreement</a></td>
 86 |       <td rowspan=1>80%</td>
 87 |     </tr>
 88 |     <tr>
 89 |       <td rowspan=1 height="5px"><a href="#average-base-quality-of-aligned-reads">Average base quality of aligned reads</a></td>
 90 |       <td rowspan=1>>15</td>
 91 |     </tr>
 92 |     <tr>
 93 | 	    <th align="center" colspan=2><u>Assembly QC Metrics</u></th>
 94 |     </tr>
 95 |     <tr>
 96 |       <td rowspan=1 height="5px"><a href="#percent-reference-coverage">Percent reference coverage</a></td>
 97 |       <td rowspan=1>>83%</td>
 98 |     </tr>
 99 |     <tr>
100 |       <td rowspan=1 height="5px"><a href="#number-of-ns">Number of Ns</a></td>
101 |       <td rowspan=1><5,000bp</td>
102 |     </tr>
103 |     <tr>
104 |       <td rowspan=1 height="5px"><a href="#assembly-length-unambiguous">Assembly length unambiguous</a></td>
105 |       <td rowspan=1>>24,000bp</td>
106 |     </tr>
107 |     <tr>
108 |       <td rowspan=1 height="5px"><a href="#ntc-percent-coverage">NTC percent coverage</a></td>
109 |       <td rowspan=1><10%</td>
110 |     </tr>
111 |     <tr>
112 |       <td rowspan=1 height="5px"><a href="#lineage-defining-mutations">Lineage defining mutations</a></td>
113 |       <td rowspan=1>≥60%</td>
114 |     </tr>
115 |     <tr>
116 |       <td rowspan=1 height="5px"><a href="#s-gene-coverage">S-gene coverage</a></td>
117 |       <td rowspan=1>≥99%</td>
118 |     </tr>
119 |     <tr>
120 |       <td rowspan=1 height="5px"><a href="#s-gene-frameshifts">S-gene frameshifts sequence</a></td>
121 |       <td rowspan=1>0</td>
122 |     </tr>
123 |     <tr>
124 |       <td rowspan=1 height="5px"><a href="#s-gene-ambiguous-bases">S-gene ambiguous bases</a></td>
125 |       <td rowspan=1><10%</td>
126 |     </tr>
127 |   </tbody>
128 | </table>
129 | 								    
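As a rough illustration of how these thresholds could be applied programmatically, the sketch below encodes a subset of the table rows that state an explicit comparison and checks one sample's metrics against them. The metric key names, the `evaluate_sample` and `passes_qc` helpers, and the sample values are hypothetical and would need to be mapped onto whatever a given pipeline actually reports. Percent Agreement is omitted because the table lists it without a comparison operator.

```python
import operator

# Subset of the suggested thresholds above, keyed by hypothetical metric names.
# Each entry is (comparison, limit): the metric passes when comparison(value, limit) is True.
THRESHOLDS = {
    "percent_human_reads": (operator.lt, 20.0),
    "average_read_depth": (operator.ge, 100.0),
    "percent_mapped_reads": (operator.ge, 65.0),
    "percent_reference_coverage": (operator.gt, 83.0),
    "number_of_ns": (operator.lt, 5000),
    "assembly_length_unambiguous": (operator.gt, 24000),
    "s_gene_coverage": (operator.ge, 99.0),
    "s_gene_frameshifts": (operator.eq, 0),
}


def evaluate_sample(metrics: dict) -> dict:
    """Map each threshold name to True (pass), False (fail), or None (metric missing)."""
    results = {}
    for name, (compare, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        results[name] = None if value is None else compare(value, limit)
    return results


def passes_qc(metrics: dict) -> bool:
    """A sample passes only if every threshold is met; missing metrics count as failures."""
    return all(result is True for result in evaluate_sample(metrics).values())


if __name__ == "__main__":
    # Made-up values for demonstration only.
    sample = {
        "percent_human_reads": 3.2,
        "average_read_depth": 850.0,
        "percent_mapped_reads": 92.5,
        "percent_reference_coverage": 98.7,
        "number_of_ns": 390,
        "assembly_length_unambiguous": 29500,
        "s_gene_coverage": 99.6,
        "s_gene_frameshifts": 0,
    }
    for name, result in evaluate_sample(sample).items():
        print(f"{name}: {'PASS' if result else 'FAIL'}")
    print("overall:", "PASS" if passes_qc(sample) else "FAIL")
```

Treating a missing metric as a failure is a deliberate choice in this sketch: it prevents a sample from passing QC simply because a metric was never calculated.
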
130 | 									    
131 | ## QC Metric Definitions
132 | 
133 | ### Read QC Metrics
134 | Different sequencing platforms use different technologies to determine the nucleotide sequence of the genetic material that they are processing, but all of these technologies converge on the fastq file format. For example, Illumina uses a sequencing-by-synthesis approach which involves assembling copies of each read using fluorescently tagged nucleotides and taking high resolution pictures of each read as each nucleotide is added to the read. These images are then captured in binary base call (BCL) files, and BCL files are converted into fastq files using the bcl2fastq program. On the other hand, Oxford Nanopore Technologies sequencing platforms run single strands of nucleic acids through nano-scale protein pores. An electric current is run across the pore, and the changes in current are detected as each nucleotide passes through the pore. The raw electric signal is captured in the fast5 file format and converted into fastq file format using the basecalling program guppy. Due to the nature of these sequencing platforms there are different considerations when assessing the quality of the raw sequence data (the fastq files).
135 | 
136 | | Term                  | Definition                             |
137 | | ---------------------- | --------------------------------------- |
138 | | <h4>Reads</h4> | Fragments of nucleotide sequence generated during sequencing; also referred to as the raw data generated from a sequencing platform |
139 | | <h4>Number of Reads</h4> | Count of reads generated in an NGS run|
140 | | <h4>BCL Files</h4> | Raw image files produced by Illumina instruments, converted to fastq via bcl2fastq program |
141 | | <h4>FAST5 Files</h4> | Raw electrical signal files produced by Oxford Nanopore Technologies sequencing equipment, converted to fastq via basecalling software (guppy is the current industry standard) |
142 | | <h4>Basecalling</h4> | The computational process of translating raw electrical signal files (FAST5) or flowcell images (BCL) to nucleotide sequence <br/>[Performance of neural network basecalling tools for Oxford Nanopore sequencing](https://pubmed.ncbi.nlm.nih.gov/31234903/) |
143 | | <h4>FASTQ Files</h4> | The common “raw” sequence files containing nucleotide sequences and their associated quality scores  <br/> &bull; The quality scores contained within a fastq file are encoded as ASCII characters, one character (one byte) per score, so the string of nucleotide sequences and the string of quality scores are equal in length <br/> &bull; The quality score (Q Score) expresses the estimated probability (P) that the base call at the associated nucleotide position is incorrect; higher scores indicate more confident base calls <br/> &bull; Q scores typically range from 0 to 40 and are defined as: <br>&nbsp;&nbsp;&nbsp;&nbsp; <pre> Q = -10log<sub>10</sub>P</pre> &bull; [Quality Scores for Next-Generation Sequencing - illumina](https://www.illumina.com/documents/products/technotes/technote_Q-Scores.pdf) <br/>&bull; [Measuring sequencing accuracy - illumina](https://emea.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html) <br/> &bull; Q Scores for Illumina and ONT sequencing will differ dramatically <br>&nbsp;&nbsp;&nbsp;&nbsp; &bull; An excellent Illumina run will have an average Q Score of 27-30 <br>&nbsp;&nbsp;&nbsp;&nbsp; &bull; An excellent Nanopore run will have an average Q Score of 12-15 <br/> &bull; Low Q Scores indicate poor sequencing quality which will impact all downstream analyses | 
144 | | <h4>Ambiguity / Mixed Sites</h4> | The percent of each read where the base called is ambiguous <br/> [IUPAC Codes](https://www.bioinformatics.org/sms/iupac.html) |
145 | | <h4>Sequence GC Content</h4> | The GC content of reads should be normally distributed |
146 | | <h4>Raw vs Processed Reads</h4> | It is typical for some reads to be removed during quality filtering. Based on the known characteristics of the sample, one should be able to predict a reasonable proportion of the reads to be removed.|
147 | | <h4>Percent Human Reads</h4> | Percentage of human read data sequenced in an NGS run. |
148 | 
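The relationship between FASTQ quality characters, Q scores, and error probabilities can be illustrated with a minimal sketch. It assumes the Phred+33 ASCII offset, which is common but should be confirmed for your own data. The function names are illustrative and do not come from any specific tool.

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def decode_quality(quality_string):
    """Convert a FASTQ quality string into a list of integer Q scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def error_probability(q):
    """Invert Q = -10 * log10(P) to get the probability of a wrong base call."""
    return 10 ** (-q / 10)


def q_score(p):
    """Compute Q from the probability P of an incorrect base call."""
    return -10 * math.log10(p)


scores = decode_quality("II5+!")
mean_q = sum(scores) / len(scores)
print(scores, mean_q, error_probability(20))
```

Each quality character maps to exactly one base, which is why the sequence and quality lines of a FASTQ record have the same length.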
149 | ### Alignment QC Metrics
150 | Consensus-genome assembly approaches have been widely adopted for SARS-CoV-2 genomic analysis. In this approach, read data are aligned to a reference genome, usually [Wuhan-1 (MN908947.3)](https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3), and each position in the alignment is assessed to determine the consensus basecall supported by the read data. The alignments are captured in a BAM file that can be used to assess critical quality control metrics. Additionally, VCF files can be produced from an alignment to call variant positions relative to the reference genome; these VCF files can also be inspected to assess the quality of the identified variant positions.
151 | 
152 | | Term                  | Definition                             |
153 | | ---------------------- | --------------------------------------- |
154 | | <h4>Sequence Alignment</h4>  | A method of arranging nucleic acid (DNA/RNA) or protein sequences to identify regions of similarity or conservation that may reflect functional, structural, or evolutionary relationships. Pairwise sequence alignment involves two sequences, whereas multiple sequence alignment involves three or more sequences |
155 | |<h4>Sequencing Depth</h4> | The number of reads that cover a particular nucleotide, section/amplicon of the genome, or average across the reference sequence<br/>  &bull; Ideally a min depth of 10X for Illumina or 20X for Nanopore would be reached<br/>&bull; Uniform depth of coverage is better<br/> &bull; Nonuniform depth may be indicative of differential amplification of amplicons, or amplicon dropout<br/> &nbsp;&nbsp;&nbsp;&nbsp;&bull; This can be assessed using bedtools |
156 | | <h4>Percent Agreement</h4> | Percentage of base call concordance in reads mapped at a designated position in the reference genome|
157 | | <h4>Coverage</h4> | What percent of the reference sequence is covered by the reads that have been produced<br/> &bull; This metric is typically used in conjunction with depth<br/>|
158 | | <h4>Percent Mapped Reads</h4> | Percentage of read data mapped to a specified reference genome|
159 | | <h4>Average Base Quality of Aligned Reads</h4> | Mean phred score of read data mapped to a reference genome|
160 | 
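The depth, coverage, and percent-mapped definitions in the table above can be sketched as follows. The input is assumed to be a list of per-position depths that has already been extracted from an alignment (for example with a depth-reporting tool). The function names and toy values are illustrative only. The default threshold of 10 follows the Illumina minimum depth suggested above.

```python
def depth_summary(depths, min_depth=10):
    """Summarize per-position read depth across a reference.

    depths: one integer per reference position (hypothetical input).
    min_depth: minimum depth for a position to count as covered.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "mean_depth": sum(depths) / len(depths),
        "percent_coverage": 100 * covered / len(depths),
        # 1-based positions that fall below the threshold
        "low_depth_positions": [i + 1 for i, d in enumerate(depths) if d < min_depth],
    }


def percent_mapped(mapped_reads, total_reads):
    """Percentage of reads that mapped to the reference."""
    return 100 * mapped_reads / total_reads if total_reads else 0.0


toy_depths = [50, 48, 52, 3, 0, 0, 45, 47, 12, 10]
summary = depth_summary(toy_depths, min_depth=10)
print(summary)
print(percent_mapped(950, 1000))
```

A contiguous block of low-depth positions, as in the toy data, is the kind of pattern that may point to amplicon dropout.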
161 | ### Consensus Assembly QC Metrics
162 | An examination of the resulting assembly quality is also critical, as these assemblies often inform critical downstream analyses, such as lineage and clade assignments and genomic epidemiology investigations.
163 | 
164 | | Term                  | Definition                             |
165 | | ---------------------- | --------------------------------------- |
166 | | <h4>Length of the Assembly</h4> | Should be similar to that of the reference. If it is not, why? Have there been large insertions/deletions, gene duplications, etc.?| 
167 | | <h4>Total Number of N’s</h4> | The total number of ambiguous basecalls in the assembly |
168 | | <h4>Length of Strings of N’s</h4> | While the total number of N’s is important, the length of the strings of N’s can indicate issues with upstream laboratory workflows. If a string of N’s is consistently reported over a specific region of the genome, then one can cross reference the primer binding loci in the bed file to see if one amplicon is dropping out or amplifying at a lower rate than the other amplicons. This could be due to amplification bias, resulting from a large differential in the GC content between the amplicons. This may also indicate that you have a mixed population and there may be a subpopulation with a different sequence in the ambiguous region.|
169 | | <h4>Percent Reference Coverage</h4> | Percentage of the Wuhan-1 reference genome represented in the consensus assembly|
170 | | <h4>Number of Ns</h4> | Number of ambiguous base calls (Ns) incorporated into the consensus assembly|
171 | | <h4>Assembly Length Unambiguous</h4> | Number of unambiguous base calls (ATCGs) incorporated into the consensus assembly|
172 | | <h4>NTC Percent Coverage</h4> | Percentage of the Wuhan-1 reference genome represented in the consensus assembly of a non-template control (NTC; i.e. negative control)|
173 | | <h4>Lineage Defining Mutations</h4> | Percentage of lineage-specific mutations represented in the consensus assembly|
175 | | <h4>S-gene Coverage</h4> | Percentage of the SARS-CoV-2 S-gene represented in the consensus assembly|
176 | | <h4>S-gene Frameshifts</h4> | S-gene insertion or deletion events represented in the consensus assembly|
177 | | <h4>S-gene Ambiguous Bases</h4> | Number of ambiguous base calls (Ns) incorporated into the S-gene of the consensus assembly|
178 | 
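Several of the consensus assembly metrics above can be computed directly from the consensus sequence. The sketch below is one possible approach, not the method of any particular pipeline. It assumes that percent reference coverage is the number of unambiguous bases divided by the reference length (29,903 bases for MN908947.3), and the function name is hypothetical.

```python
import re

REFERENCE_LENGTH = 29903  # length of the Wuhan-1 reference (MN908947.3)


def assembly_metrics(consensus, reference_length=REFERENCE_LENGTH):
    """Compute simple QC metrics from a consensus sequence string."""
    seq = consensus.upper()
    # 1-based, inclusive (start, end) coordinates of each run of Ns
    n_runs = [(m.start() + 1, m.end()) for m in re.finditer(r"N+", seq)]
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return {
        "length": len(seq),
        "number_of_ns": seq.count("N"),
        "assembly_length_unambiguous": unambiguous,
        "n_runs": n_runs,
        "longest_n_run": max((end - start + 1 for start, end in n_runs), default=0),
        # assumption: coverage = unambiguous bases / reference length
        "percent_reference_coverage": 100 * unambiguous / reference_length,
    }


toy_consensus = "ACGTNNNNACGTNACGT"
metrics = assembly_metrics(toy_consensus, reference_length=len(toy_consensus))
print(metrics)
```

The coordinates in `n_runs` can then be compared against the primer scheme BED file to see whether a run of Ns lines up with a single amplicon.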
179 | 
180 | ## Additional QC Resources and Materials
181 | 
182 | - [ncov-tools](https://github.com/jts/ncov-tools) - Tools and plots for performing quality control on coronavirus sequencing results.
183 | - [Quality Management Systems Tools & Resources - Process Management](https://www.cdc.gov/labquality/qms-tools-and-resources.html#:~:text=Click%20to%20expand-,Process%20Management,-Provides%20guidance%20on) - US CDC Quality Management Systems for SARS-CoV-2 NGS Data
184 | - [TheiaCoV QC output Video](https://www.youtube.com/watch?v=Amb-8M71umw&list=PLU47xRg_MKJrtyoFwqGiywl7lQj6vq8Uz&index=3) - Video tutorial for assessing SARS-CoV-2 genomic characterization with Theiagen's TheiaCoV workflows
185 | - [StaPH-B Glossary](http://www.staphb.org/resources/glossary/) - US State Public Health Bioinformatics (StaPH-B) working group's bioinformatics glossary of terms
186 | - [PHA4GE Bioinformatics Solutions](https://github.com/pha4ge/pipeline-resources/blob/main/docs/bioinfo-solutions.md) - This working group's list of bioinformatics solutions for SARS-CoV-2 genomic analysis
187 | - [ECDC: Guidance for representative and targeted genomic SARS-CoV-2 monitoring](https://www.ecdc.europa.eu/sites/default/files/documents/Guidance-for-representative-and-targeted-genomic-SARS-CoV-2-monitoring.pdf) - European CDC Guidance Document for SARS-CoV-2 genomic analysis
188 | 


--------------------------------------------------------------------------------
/docs/sc2-recombinants.md:
--------------------------------------------------------------------------------
  1 | # **Identifying SARS-CoV-2 Recombinants**
  2 | 
  3 | **PHA4GE Bioinformatics Pipelines &amp; Visualization Working Group** <br/>
  4 | Smith E, Wright S, Libuit K
  5 | 
  6 | <details>
  7 |  <summary> Document Change Log</summary>
  8 |  
  9 | - 2022-06-28:
 10 |   - First draft published
 11 | - 2023-02-16:
 12 |   - Update recombinant image (Example 3)
 13 | - 2023-03-09:
 14 |   - Add changelog
 15 | </details>
 16 | 
 17 | # Overview
 18 | SARS-CoV-2 recombinants have garnered the attention of the public health community largely due to the unknown clinical and epidemiological implications. This uncertainty emphasizes the need to detect and characterize recombinant SARS-CoV-2 genomes, but the ability to do so rapidly and systematically is not without challenges. Often, recombinant genomes receive an “Unassigned” pango lineage, a non-recombinant pango lineage, or the incorrect recombinant lineage assignment. Additionally, determining the site of recombination within the genome can be difficult for those without extensive SARS-CoV-2 bioinformatics experience. 
 19 | 
 20 | The PHA4GE Pipelines and Visualization Working Group has created this document as an attempt to highlight critical sources of information and open-source/access resources to aid in the analysis and surveillance of potential recombinant specimens.
 21 | 
 22 | In no way does this document represent a comprehensive list of all available SC2 bioinformatics resources for assessing recombination. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull requests and/or raised GitHub issues.
 23 | 
 24 | ## Contents
 25 | - [General Information](#general-information)
 26 | - [Tools used to detect recombinants/find breakpoint](#tools-used-to-detect-recombinantsfind-breakpoint)
 27 | - [Investigating Putative Recombinant Specimen](#investigating-putative-recombinant-specimen)
 28 |     - [Common terminology](#common-terminology)
 29 |     - [Steps for investigating putative recombinant genomes](#steps-for-investigating-putative-recombinant-genomes)
 30 |         - [Assess whether recombination exists within the genome](#assess-whether-recombination-exists-within-the-genome)
 31 |         - [Determine whether the genome belongs to a designated recombinant lineage or represents a novel recombinant lineage](#determine-whether-the-genome-belongs-to-a-designated-recombinant-lineage-or-represents-a-novel-recombinant-lineage)
 32 |         - [Identify the breakpoint of the putative recombinant](#identify-the-breakpoint-of-the-putative-recombinant)
 33 |     - [Proposing a new recombinant lineage](#proposing-a-new-recombinant-lineage)
 34 | - [Publications on SARS-CoV-2 recombinants](#publications-on-sars-cov-2-recombinants)
 35 |         
 36 | 
 37 | # General Information
 38 | 
 39 | More info and resources on recombinants
 40 | 
 41 | # Tools used to detect recombinants/find breakpoint
 42 | 
 43 | - [Sc2rf](https://github.com/lenaschimmel/sc2rf)
 44 | - [Nextclade](https://clades.nextstrain.org/)
 45 | - [UShER](https://genome.ucsc.edu/cgi-bin/hgPhyloPlace)
 46 | - [Potential Recombinant List - Sakaguchi Hitoshi](https://docs.google.com/spreadsheets/d/1cQILRxXD756gJoRsaqMdJkxZm7sEjhV7ceY398Iz7gI/edit#gid=0)
 47 | 
 48 | # Investigating Putative Recombinant Specimen
 49 | 
 50 | ## Common terminology
 51 | 
 52 | | Term                  | Definition                             |
 53 | | ---------------------- | --------------------------------------- |
 54 | | Breakpoint | The site within the genome where recombination occurred. This is also sometimes referred to as the recombinant site. Usually, the breakpoint is a range of nucleotide positions instead of a single nucleotide position. This is due to a lack of lineage-specific mutations in certain regions, or the same mutations being shared between different lineages. The beginning of the breakpoint range is the first possible site within the genome that recombination could have occurred, which generally follows a lineage-specific mutation site. The end of the breakpoint range is the last possible site where the recombination could have occurred, which generally precedes the site of a lineage-specific mutation. |
 55 | | Donor | The lineage from which a portion of the recombinant genome originated. This is also called the “parental lineage”. For example, a BA.1 x BA.2 recombinant has BA.1 and BA.2 donor sequences.  |
 56 | | Designated lineage  | A specific pango lineage. Designated recombinant lineages are currently referred to with the XX nomenclature per the pango network [guidelines](https://www.pango.network/the-pango-nomenclature-system/statement-of-nomenclature-rules/). If a recombinant sequence does not belong to a designated recombinant lineage, it may be a novel recombinant lineage that has not yet been designated. |
 57 | | Allele frequency | The proportion of reads that contain a specific nucleotide at a specific position within the genome. Ideally, the nucleotide sites for lineage-defining mutations within a recombinant genome will have near 100% allele frequency. This is one way to distinguish a recombinant genome from a contaminated genome. |
 58 | | Sequencing depth | The number of reads covering a specific position within the genome. This is also often referred to as “coverage”. Ideally, a recombinant genome will have high sequencing depth at the sites of lineage-defining mutations. If a genome has low sequencing depth, it is difficult to determine what the donor, or parental lineage, is at that site. |
 59 | 
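The allele frequency definition above can be illustrated with a short sketch. The base counts are assumed to come from a pileup at a single position. The 0.9 cutoff for flagging a possibly mixed site is a hypothetical value chosen for illustration, not a recommended standard.

```python
def allele_frequencies(base_counts):
    """Return the fraction of reads supporting each base at one position."""
    depth = sum(base_counts.values())
    if depth == 0:
        return {}
    return {base: count / depth for base, count in base_counts.items()}


def looks_mixed(base_counts, min_major_freq=0.9):
    """Flag a site whose most common allele falls below a hypothetical cutoff."""
    freqs = allele_frequencies(base_counts)
    return bool(freqs) and max(freqs.values()) < min_major_freq


clean_site = {"A": 2, "C": 0, "G": 0, "T": 998}  # near 100% T
mixed_site = {"A": 0, "C": 520, "G": 0, "T": 480}  # roughly half C, half T

print(allele_frequencies(clean_site)["T"], looks_mixed(clean_site))
print(allele_frequencies(mixed_site)["C"], looks_mixed(mixed_site))
```

Many lineage-defining sites with intermediate frequencies, as in `mixed_site`, would be more suggestive of contamination or co-infection than of a true recombinant.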
 60 | ## Steps for investigating putative recombinant genomes
 61 | 
 62 | There are three main steps in investigating a putative SARS-CoV-2 recombinant genome:  
 63 | 
 64 | 1. Assess whether recombination exists within the genome 
 65 | 2. Determine whether the genome belongs to a designated recombinant lineage or represents a novel recombinant lineage
 66 | 3. Identify the breakpoint of the putative recombinant (if it represents a novel recombinant lineage)
 67 | 
 68 | ### Assess whether recombination exists within the genome
 69 | 
 70 | #### 1. Run genome assembly through Nextclade. This can be done either through the [Nextclade web portal](https://clades.nextstrain.org/) or [Nextclade CLI](https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-cli.html) (command-line interface).
 71 | 
 72 | #### 2. Assess Nextclade output for sequence quality and potential for recombination.
 73 | 
 74 | 
 75 | 
 76 | ##### Example 1
 77 | 
 78 | This genome (EPI_ISL_11758210) received a clade assignment of “recombinant” using Nextclade v1.14.1. It has 67 mutations relative to the Wuhan-1 reference genome. It also has 0 Ns, meaning that there are no ambiguous bases in the genome and it is likely a high quality genome assembly. Therefore, we can be confident that the recombinant signal is not an artifact of poor sequence quality. However, additional investigation will be needed to determine whether this genome actually belongs to the XQ recombinant lineage reported by Nextclade.
 79 | 
 80 | <p align="center">
 81 |   <img src="./images/sc2-recombinants/ex1.png" class="center">
 82 | </p>
 83 | 
 84 | ##### Example 2
 85 | 
 86 | This genome (EPI_ISL_12612634) received a clade assignment of “recombinant” using Nextclade v1.14.1. It has 64 mutations relative to the Wuhan-1 reference genome. It has 646 Ns, which means ~2% of the genome assembly is composed of ambiguous bases. While ambiguous bases in a genome assembly are not ideal, they are not abnormal. This is still a relatively low number of Ns, so sequence quality does not rule out a true recombinant. However, additional investigation will be needed to determine whether this genome actually belongs to the XE lineage.
 87 | 
 88 | 
 89 | <p align="center">
 90 |   <img src="./images/sc2-recombinants/ex2.png" class="center">
 91 | </p>
 92 | 
 93 | ##### Example 3
 94 | 
 95 | This genome received a clade assignment of “recombinant” using Nextclade v1.14.1. It has 10 mutations relative to the Wuhan-1 reference genome. It has 21,498 Ns, meaning that the majority of the genome assembly is composed of ambiguous bases. Despite the “recombinant” assignment and “XF” pango lineage, the quality of this sequence is too poor to continue investigating for recombination.
 96 | 
 97 | <p align="center">
 98 |   <img src="./images/sc2-recombinants/ex3.png" class="center">
 99 | </p>
100 | 
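The proportions quoted in the three examples can be checked with simple arithmetic. This sketch assumes each assembly is roughly the length of the Wuhan-1 reference (29,903 bases). Actual assembly lengths may differ slightly, so treat the percentages as approximate.

```python
REFERENCE_LENGTH = 29903  # length of MN908947.3; assemblies are assumed to be similar

# Number of Ns reported for each example above
n_counts = {"Example 1": 0, "Example 2": 646, "Example 3": 21498}


def percent_n(n_count, genome_length=REFERENCE_LENGTH):
    """Percentage of the genome made up of ambiguous bases."""
    return 100 * n_count / genome_length


for name, n_count in n_counts.items():
    print(f"{name}: {n_count} Ns is about {percent_n(n_count):.1f}% of the genome")
```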
101 | 
102 | ### Determine whether the genome belongs to a designated recombinant lineage or represents a novel recombinant lineage
103 | 
104 | #### 1. Upload sequences to the [UShER web portal](https://genome.ucsc.edu/cgi-bin/hgPhyloPlace). To follow along with Examples 1 and 2, you can either upload the fasta files or copy and paste the GISAID accessions.
105 | 
106 | #### 2. Review UShER outputs and subtrees
107 | 
108 | ##### Example 1
109 | 
110 | This genome was given a lineage assignment of BA.2 by UShER, which differs from the Nextclade lineage assignment of XQ. On the UShER subtree, this genome falls into a clade of similar sequences, but all of them are assigned the BA.2 lineage. This means that this genome will need further investigation. Since this sequence was high quality and was determined to be a recombinant by Nextclade, but was not assigned a recombinant pango lineage by UShER, it may belong to a novel recombinant lineage that has not yet been designated.
111 | 
112 | <p align="center">
113 |   <img src="./images/sc2-recombinants/ex1-usher-metrics.png" class="center">
114 | </p>
115 | 
116 | <p align="center">
117 |   <img src="./images/sc2-recombinants/ex1-usher-tree.png" class="center">
118 | </p>
119 | 
120 | ##### Example 2
121 | 
122 | This genome was given a lineage assignment of XE by UShER. When viewing the UShER subtree, this sequence falls amongst many other XE genomes. In fact, it is nearly identical to several other genomes in the public repositories. Therefore, this genome is likely a true XE recombinant. Since this genome belongs to a designated recombinant lineage that has been previously characterized, no further investigation is necessary. 
123 | 
124 | <p align="center">
125 |   <img src="./images/sc2-recombinants/ex2-usher-metrics.png" class="center">
126 | </p>
127 | 
128 | <p align="center">
129 |   <img src="./images/sc2-recombinants/ex2-usher-tree.png" class="center">
130 | </p>
131 | 
132 | 
133 | ### Identify the breakpoint of the putative recombinant
134 | 
135 | #### 1. Return to Nextclade output
136 | 
137 | When in doubt, always go back to the list of mutations! While looking at individual mutations can be a slow and manual process, it is the best way to determine the breakpoint of the putative novel recombinant genome. You can hover over the number of mutations in the web portal to get a graphical display of the mutations, or you can download the list of mutations as a JSON, CSV, or TSV.
138 | 
139 | <p align="center">
140 |   <img src="./images/sc2-recombinants/nextclade-output.png" class="center">
141 | </p>
142 | 
143 | <p align="center">
144 |   <img src="./images/sc2-recombinants/nextclade-output2.png" class="center">
145 | </p>
146 | 
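If you download the mutation list as a TSV, it can be grouped by gene before the manual review. The sketch below assumes the file has `seqName` and `aaSubstitutions` columns, with substitutions written as comma-separated `GENE:CHANGE` entries. Check the header of your own Nextclade output, since column names may vary by version. The toy input and the function name are illustrative only.

```python
import csv
import io

# Toy stand-in for a downloaded TSV; the column names are an assumption
toy_tsv = (
    "seqName\ttotalMissing\taaSubstitutions\n"
    "sample_1\t0\tORF1a:K856R,ORF1a:L2084I,ORF1a:L3027F,ORF1a:T3090I\n"
)


def aa_substitutions_by_gene(handle):
    """Group each sequence's amino acid substitutions by gene."""
    result = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        by_gene = {}
        # filter(None, ...) skips the empty entry produced by a blank cell
        for mutation in filter(None, row["aaSubstitutions"].split(",")):
            gene, change = mutation.split(":", 1)
            by_gene.setdefault(gene, []).append(change)
        result[row["seqName"]] = by_gene
    return result


parsed = aa_substitutions_by_gene(io.StringIO(toy_tsv))
print(parsed["sample_1"]["ORF1a"])
```

For a real file, replace `io.StringIO(toy_tsv)` with an open file handle.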
147 | #### 2. Find list of defining mutations for each variant
148 | 
149 | Lists of mutations for each variant can be found on the [CoVariants web page](https://covariants.org/variants/21K.Omicron). When you click on a specific variant, a list of defining mutations will appear on the right side of the page. Be aware that several variants share defining mutations.
150 | 
151 | <p align="center">
152 |   <img src="./images/sc2-recombinants/covariants21k.png" class="center">
153 | </p>
154 | 
155 | <p align="center">
156 |   <img src="./images/sc2-recombinants/covariants21l.png" class="center">
157 | </p>
158 | 
159 | #### 3. Determine which mutations within the putative novel recombinant genome belong to which donors
160 | 
161 | Go mutation-by-mutation and determine whether each mutation is specific to 21K (BA.1) or 21L (BA.2). It is also important to check whether lineage-specific mutations are missing from the genome. Starting in ORF1a with Example 1, the genome contains ORF1a:K856R and ORF1a:L2084I, which are both specific to 21K (BA.1). However, with a BA.1 genome we would expect the next amino acid substitution to be ORF1a:A2710T based on the list from CoVariants. Instead, the next two amino acid substitutions are ORF1a:L3027F and ORF1a:T3090I, which are both specific to 21L (BA.2). This means that the breakpoint for this recombinant genome is between ORF1a:L2084I and ORF1a:A2710T, which corresponds to nucleotide positions 6,516-8,392.
162 | 
163 | <p align="center">
164 |   <img src="./images/sc2-recombinants/mutations1.png" class="center">
165 | </p>
166 | 
167 | <p align="center">
168 |   <img src="./images/sc2-recombinants/mutations2.png" class="center">
169 | </p>
170 | 
171 | In this case, the breakpoint was bounded by two sites that confer amino acid substitutions; however, some cases will require looking at synonymous nucleotide changes as well. [CoVariants](https://covariants.org/) lists these underneath the list of amino acid substitutions.
172 | 
173 | <p align="center">
174 |   <img src="./images/sc2-recombinants/mutations3.png" class="center">
175 | </p>
176 | 
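The mutation-by-mutation walk described in this section can be sketched in code. The example below uses only the ORF1a sites named above and assumes two candidate donors, BA.1 and BA.2. It treats a missing BA.1-specific mutation as evidence for BA.2 at that site, which only holds when depth there is adequate. Codon positions are converted to nucleotide coordinates using the ORF1a start at position 266 of MN908947.3. All function and variable names are hypothetical.

```python
ORF1A_START = 266  # first nucleotide of ORF1a in MN908947.3


def codon_to_nt(aa_position, cds_start=ORF1A_START):
    """Return the 1-based (first, last) nucleotide positions of a codon."""
    first = cds_start + (aa_position - 1) * 3
    return first, first + 2


# (label, ORF1a codon, donor the mutation is specific to, observed in the genome?)
SITES = [
    ("ORF1a:K856R", 856, "BA.1", True),
    ("ORF1a:L2084I", 2084, "BA.1", True),
    ("ORF1a:A2710T", 2710, "BA.1", False),  # expected for BA.1 but missing
    ("ORF1a:L3027F", 3027, "BA.2", True),
    ("ORF1a:T3090I", 3090, "BA.2", True),
]


def donor_call(specific_to, observed):
    """With two donors, a missing donor-specific mutation implies the other donor."""
    if observed:
        return specific_to
    return "BA.2" if specific_to == "BA.1" else "BA.1"


def breakpoint_flanks(sites):
    """Return the first pair of adjacent sites whose inferred donors differ."""
    calls = [(label, codon, donor_call(donor, seen)) for label, codon, donor, seen in sites]
    for left, right in zip(calls, calls[1:]):
        if left[2] != right[2]:
            return left, right
    return None


flanks = breakpoint_flanks(SITES)
if flanks is not None:
    left, right = flanks
    print(f"Breakpoint lies between {left[0]} and {right[0]}")
    print(f"Flanking codons span {codon_to_nt(left[1])} and {codon_to_nt(right[1])}")
```

The codon coordinates only bracket the breakpoint at codon resolution. The exact endpoints depend on the underlying nucleotide changes, so they can differ by a few bases from the range reported in the text.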
177 | ## Proposing a new recombinant lineage
178 | 
179 | The pango team recently released a set of [guidelines](https://www.pango.network/pango-lineages-guidelines-for-suggesting-novel-and-recombinant-lineages/) for proposing new recombinant lineages. 
180 | 
181 | # Publications on SARS-CoV-2 recombinants
182 | 
183 | - Bolze, A., White, S., Basler, T., Dei Rossi, A., Roychoudhury, P., Greninger, A. L., ... & Luo, S. (2022). Evidence for SARS-CoV-2 Delta and Omicron co-infections and recombination. medRxiv. doi: https://doi.org/10.1101/2022.03.09.22272113
184 | 
185 | - Colson, P, Fournier, P-E, Delerce, J, et al. Culture and identification of a “Deltamicron” SARS-CoV-2 in a three cases cluster in southern France. J Med Virol. 2022; 1- 11. doi:10.1002/jmv.27789
186 | 
187 | - Colson, P., Delerce, J., Marion-Paris, E., Lagier, J. C., Levasseur, A., Fournier, P. E., ... & Raoult, D. (2022). A 21L/BA.2-21K/BA.1 “MixOmicron” SARS-CoV-2 hybrid undetected by qPCR that screen for variant in routine diagnosis. medRxiv. doi: https://doi.org/10.1101/2022.03.28.22273010
188 | 
189 | - Duerr, R., Dimartino, D., Marier, C., Zappile, P., Wang, G., Plitnick, J., ... & Heguy, A. (2022). Delta-Omicron recombinant SARS-CoV-2 in a transplant patient treated with Sotrovimab. bioRxiv. doi: https://doi.org/10.1101/2022.04.06.487325
190 | 
191 | - Gu, H., Ng, D., Liu, G., Cheng, S., Krishnan, P., Chang, L., ... & Poon, L. (2022). Recombinant BA.1/BA.2 SARS-CoV-2 Virus in Arriving Travelers, Hong Kong, February 2022. Emerging Infectious Diseases, 28(6), 1276-1278. doi: https://doi.org/10.3201/eid2806.220523
192 | 
193 | - Lacek, Kristine A., Benjamin Rambo-Martin, Dhwani Batra, Xiao-yu Zheng, Matthew W. Keller, Malania Wilson, Mili Sheth et al. "Identification of a Novel SARS-CoV-2 Delta-Omicron Recombinant Virus in the United States." bioRxiv (2022). doi: https://doi.org/10.1101/2022.03.19.484981
194 | 
195 | - Lacek, K. A., Rambo-Martin, B. L., Batra, D., Zheng, X. Y., Hassell, N., Sakaguchi, H., ... & Paden, C. R. (2022). SARS-CoV-2 Delta-Omicron Recombinant Viruses, United States. Emerging Infectious Diseases, 28(7). DOI: 10.3201/eid2807.220526 
196 | 
197 | - da Silva, L. S., de Oliveira, C. M., Cota, B. D. C. V., Romano, C. M., & Levi, J. E. (2022). Three SARS-CoV-2 recombinants identified in Brazilian children. DOI: https://doi.org/10.21203/rs.3.rs-1641864/v1
198 | 
199 | - Moisan, A., Mastrovito, B., De Oliveira, F., Martel, M., Hedin, H., Leoz, M., ... & Plantier, J. C. (2022). Evidence of transmission and circulation of Deltacron XD recombinant SARS-CoV-2 in Northwest France. Clinical Infectious Diseases. doi: https://doi.org/10.1093/cid/ciac360
200 | 
201 | - Simon-Loriere, E., Montagutelli, X., Lemoine, F., Donati, F., Touret, F., Bourret, J., ... & Danish COVID-19 Genome Consortium (DCGC). (2022). Rapid characterization of a Delta-Omicron SARS-CoV-2 recombinant detected in Europe. doi: https://doi.org/10.21203/rs.3.rs-1502293/v1
202 | 
203 | - VanInsberghe, D., Neish, A. S., Lowen, A. C., & Koelle, K. (2021). Recombinant SARS-CoV-2 genomes circulated at low levels over the first year of the pandemic. Virus Evolution, 7(2), veab059. doi: https://doi.org/10.1093/ve/veab059
204 | 
205 | - Wertheim, J. O., Wang, J. C., Leelawong, M., Martin, D. P., Havens, J. L., Chowdhury, M. A., ... & Hughes, S. (2022). Capturing intrahost recombination of SARS-CoV-2 during superinfection with Alpha and Epsilon variants in New York City. medRxiv. doi: https://doi.org/10.1101/2022.01.18.22269300
206 | 


--------------------------------------------------------------------------------