├── img
    ├── v1.0.0.png
    ├── pbaa_card.png
    ├── pbaa_logo.png
    ├── workflow.png
    ├── error_correction.png
    ├── pbaa_logo_transparent.png
    └── logo_pbaa.svg
├── guide_reference.md
├── README_BETA.md
└── README.md


/img/v1.0.0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacificBiosciences/pbAA/HEAD/img/v1.0.0.png


--------------------------------------------------------------------------------
/img/pbaa_card.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacificBiosciences/pbAA/HEAD/img/pbaa_card.png


--------------------------------------------------------------------------------
/img/pbaa_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacificBiosciences/pbAA/HEAD/img/pbaa_logo.png


--------------------------------------------------------------------------------
/img/workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacificBiosciences/pbAA/HEAD/img/workflow.png


--------------------------------------------------------------------------------
/img/error_correction.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacificBiosciences/pbAA/HEAD/img/error_correction.png


--------------------------------------------------------------------------------
/img/pbaa_logo_transparent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacificBiosciences/pbAA/HEAD/img/pbaa_logo_transparent.png


--------------------------------------------------------------------------------
/guide_reference.md:
--------------------------------------------------------------------------------
  1 | Introduction
  2 | ============
  3 | 
  4 | Amplicon Analysis with Guided Clustering (`pbaa`) requires a file of
  5 | reference sequences in FASTA format to use as the cluster guide. The
  6 | program can accept any valid FASTA-format sequence file, but using a
  7 | well-curated sequence file for the cluster guide will produce
  8 | faster, higher-quality results. This document offers some pointers on
  9 | how to curate a guide-sequence FASTA file for optimal performance, using
 10 | the construction of a reference for an 11-locus HLA panel as an example.
 11 | 
 12 | Protocol
 13 | --------
 14 | 
 15 | 1.  Acquire FASTA sequences for each expected or desired locus
 16 | 
 17 | The initial collection of reference sequences should ideally correspond
 18 | to the exact region being amplified for each locus, but small
 19 | differences are acceptable. The initial collection of sequences should
 20 | also span the widest possible range of template sequences, and should be
 21 | genomic rather than from cDNA, to ensure maximum sensitivity. In
 22 | particular, we recommend including any known closely related
 23 | pseudogenes, for example HLA-DPB2 in addition to HLA-DPB1. Nature often
 24 | contains fully functional sequences that are absent from every database
 25 | and that more closely resemble known alleles of the pseudogene; including
 26 | the pseudogene ensures that they too are captured.
 27 | 
 28 | For HLA, we recommend using the genomic FASTA files from the public IMGT
 29 | database (<ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/>),
 30 | which is the most comprehensive public database of HLA alleles.
 31 | For our purposes, we wish to cover the core MHC Class I and II genes,
 32 | so we use the following files: `A_gen.fasta, B_gen.fasta, C_gen.fasta,
 33 | DPA1_gen.fasta, DPB1_gen.fasta, DPB2_gen.fasta, DQA1_gen.fasta,
 34 | DQB1_gen.fasta, DRB1_gen.fasta, DRB345_gen.fasta`
 35 | 
 36 | 2.  Concatenate FASTA files from different targets of interest
 37 | 
 38 | The final selection of sequences must be uploaded to SMRT Link or passed
 39 | to `pbaa` as a single FASTA file. In addition, filtering all of the
 40 | sequences at once is usually more efficient and allows the most
 41 | aggressive possible filtering.
 42 | 
 43 | ```
 44 | $ cat *_gen.fasta > combined.fasta
 45 | $ grep '>' combined.fasta | head
 46 | >HLA:HLA00001 A*01:01:01:01 3503 bp
 47 | >HLA:HLA02169 A*01:01:01:02N 3291 bp
 48 | >HLA:HLA14798 A*01:01:01:03 3503 bp
 49 | >HLA:HLA15760 A*01:01:01:04 3087 bp
 50 | >HLA:HLA16415 A*01:01:01:05 3321 bp
 51 | >HLA:HLA16417 A*01:01:01:06 3097 bp
 52 | >HLA:HLA16436 A*01:01:01:07 3321 bp
 53 | >HLA:HLA16651 A*01:01:01:08 3121 bp
 54 | >HLA:HLA16652 A*01:01:01:09 3340 bp
 55 | >HLA:HLA16653 A*01:01:01:10 3121 bp
 56 | ```
 57 | 
 58 | 3.  Filter out highly similar guide sequences
 59 | 
 60 | Most comprehensive sequence databases, like IMGT, contain many highly
 61 | similar alleles that differ by only a few nucleotides. However, even
 62 | guide sequences that differ by \>5% are often unnecessary for correctly
 63 | clustering most data with `pbaa`, and keeping them can substantially slow
 64 | down the clustering process. We therefore recommend filtering out the most
 65 | similar sequences from the reference guide to speed up the clustering
 66 | process. We suggest using **cd-hit-est** from the CD-HIT suite of tools
 67 | (<https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide>) to filter
 68 | the combined FASTA file down to a subset of representative sequences at
 69 | approximately 90-95% similarity.
 70 | 
 71 | ```$ cd-hit-est -c 0.9 -i combined.fasta -o filtered.fasta```
 72 | 
 73 | 4.  Add cluster group designations to each remaining sequence
 74 | 
 75 | Once filtered, there will still often be multiple sequences from the same
 76 | locus in the cluster guide, particularly for highly diverse loci like MHC
 77 | Class I. We want some way to designate that certain sequences belong
 78 | together -- that they represent the same underlying gene -- so that `pbaa`
 79 | doesn't create duplicate results. The FASTA format specification
 80 | requires that each sequence id in a file be unique, so instead we place
 81 | the name of the locus for each sequence in the cluster guide FASTA at
 82 | the end of its header after a `|`, which `pbaa` will use if available.
 83 | 
 84 | ```
 85 | $ sed -E 's/>([^:]*):([^ ]*) ([^*]+)(.*)/>\2 \3\4|\1-\3/;s/[*: ]/_/g' filtered.fasta > guides.fasta
 86 | $ grep '>' guides.fasta | head
 87 | >HLA00001_A_01_01_01_01_3503_bp|HLA-A
 88 | >HLA02169_A_01_01_01_02N_3291_bp|HLA-A
 89 | >HLA14798_A_01_01_01_03_3503_bp|HLA-A
 90 | >HLA15760_A_01_01_01_04_3087_bp|HLA-A
 91 | >HLA16415_A_01_01_01_05_3321_bp|HLA-A
 92 | >HLA16417_A_01_01_01_06_3097_bp|HLA-A
 93 | >HLA16436_A_01_01_01_07_3321_bp|HLA-A
 94 | >HLA16651_A_01_01_01_08_3121_bp|HLA-A
 95 | >HLA16652_A_01_01_01_09_3340_bp|HLA-A
 96 | >HLA16653_A_01_01_01_10_3121_bp|HLA-A
 97 | ```
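
The header rewrite in this step can also be sketched in Python, which may be easier to adapt than a one-line `sed` expression. This is a minimal sketch, not part of `pbaa`: the function names `add_group` and `rewrite_fasta` are hypothetical, and it assumes every header follows the IMGT layout shown above (`>DB:ACCESSION LOCUS*FIELDS ...`). It takes everything before the `*` as the locus name, so a multi-character locus keeps its full name in the group.

```python
import re

# Matches IMGT-style headers such as ">HLA:HLA00001 A*01:01:01:01 3503 bp".
# Assumption: headers look like ">DB:ACCESSION LOCUS*FIELDS ...".
HEADER_RE = re.compile(r"^>([^:]+):(\S+) ([^*]+)\*(.*)$")


def add_group(header):
    """Return the header rewritten as '>sequence_name|group_name'."""
    header = header.rstrip("\n")
    match = HEADER_RE.match(header)
    if match is None:
        return header  # leave unrecognized headers untouched
    database, accession, locus, rest = match.groups()
    # Replace '*', ':' and spaces so the sequence name is a single token.
    name = re.sub(r"[*: ]", "_", f"{accession} {locus}*{rest}")
    return f">{name}|{database}-{locus}"


def rewrite_fasta(lines):
    """Yield FASTA lines with headers rewritten and sequence lines kept."""
    for line in lines:
        line = line.rstrip("\n")
        yield add_group(line) if line.startswith(">") else line
```

For example, `add_group(">HLA:HLA00001 A*01:01:01:01 3503 bp")` returns `>HLA00001_A_01_01_01_01_3503_bp|HLA-A`, matching the first header in the listing above.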
 98 | 
 99 | 
100 | 5.  Upload the final cluster-guide FASTA to SMRT Link
101 | 
102 | Please see the section on "Adding Reference Guides" in the documentation
103 | for Long Amplicon Analysis with Guided Clustering.
104 | 


--------------------------------------------------------------------------------
/img/logo_pbaa.svg:
--------------------------------------------------------------------------------
  1 | <?xml version="1.0" encoding="UTF-8"?>
  2 | <svg id="Layer_1" data-name="Layer 1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 64 64">
  3 |   <defs>
  4 |     <style>
  5 |       .cls-1 {
  6 |         fill: #fff;
  7 |       }
  8 | 
  9 |       .cls-1, .cls-2, .cls-3, .cls-4 {
 10 |         stroke-width: 0px;
 11 |       }
 12 | 
 13 |       .cls-1, .cls-5 {
 14 |         opacity: .2;
 15 |       }
 16 | 
 17 |       .cls-6 {
 18 |         clip-path: url(#clippath);
 19 |       }
 20 | 
 21 |       .cls-2, .cls-7, .cls-8, .cls-9, .cls-10, .cls-11, .cls-12, .cls-13, .cls-14, .cls-5 {
 22 |         fill: none;
 23 |       }
 24 | 
 25 |       .cls-3 {
 26 |         fill: url(#radial-gradient-2);
 27 |         fill-opacity: .8;
 28 |       }
 29 | 
 30 |       .cls-7 {
 31 |         stroke: #1383c6;
 32 |         stroke-width: 4.80725px;
 33 |       }
 34 | 
 35 |       .cls-7, .cls-9, .cls-10, .cls-12, .cls-14, .cls-5 {
 36 |         stroke-linecap: round;
 37 |       }
 38 | 
 39 |       .cls-7, .cls-9, .cls-14, .cls-5 {
 40 |         stroke-linejoin: round;
 41 |       }
 42 | 
 43 |       .cls-4 {
 44 |         fill: url(#radial-gradient);
 45 |       }
 46 | 
 47 |       .cls-8 {
 48 |         stroke-width: 5px;
 49 |       }
 50 | 
 51 |       .cls-8, .cls-10, .cls-11, .cls-12, .cls-13 {
 52 |         stroke-miterlimit: 10;
 53 |       }
 54 | 
 55 |       .cls-8, .cls-10, .cls-11, .cls-12, .cls-5 {
 56 |         stroke: #fff;
 57 |       }
 58 | 
 59 |       .cls-9 {
 60 |         stroke: #313541;
 61 |       }
 62 | 
 63 |       .cls-9, .cls-13 {
 64 |         stroke-width: 2px;
 65 |       }
 66 | 
 67 |       .cls-10, .cls-12, .cls-14 {
 68 |         stroke-width: 2px;
 69 |       }
 70 | 
 71 |       .cls-10, .cls-15 {
 72 |         opacity: .5;
 73 |       }
 74 | 
 75 |       .cls-16 {
 76 |         clip-path: url(#clippath-1);
 77 |       }
 78 | 
 79 |       .cls-11 {
 80 |         stroke-width: 5px;
 81 |       }
 82 | 
 83 |       .cls-13 {
 84 |         stroke: #005496;
 85 |       }
 86 | 
 87 |       .cls-14 {
 88 |         stroke: #ccc;
 89 |       }
 90 | 
 91 |       .cls-5 {
 92 |         stroke-width: 1.5px;
 93 |       }
 94 |     </style>
 95 |     <radialGradient id="radial-gradient" cx="32" cy="32" fx="32" fy="32" r="22" gradientUnits="userSpaceOnUse">
 96 |       <stop offset="0" stop-color="#00b8c4"/>
 97 |       <stop offset="1" stop-color="#1383c6"/>
 98 |     </radialGradient>
 99 |     <clipPath id="clippath">
100 |       <circle class="cls-2" cx="32" cy="32" r="22"/>
101 |     </clipPath>
102 |     <clipPath id="clippath-1">
103 |       <circle class="cls-2" cx="35.91897" cy="34.92894" r="9.61451"/>
104 |     </clipPath>
105 |     <radialGradient id="radial-gradient-2" cx="43.78945" cy="40.92359" fx="43.78945" fy="40.92359" r="13.48075" gradientUnits="userSpaceOnUse">
106 |       <stop offset="0" stop-color="#fff"/>
107 |       <stop offset=".28636" stop-color="#e6e9ed"/>
108 |       <stop offset=".89135" stop-color="#a7b1c2"/>
109 |       <stop offset="1" stop-color="#9ba7ba"/>
110 |     </radialGradient>
111 |   </defs>
112 |   <rect class="cls-2" width="64" height="64"/>
113 |   <g>
114 |     <circle class="cls-4" cx="32" cy="32" r="22"/>
115 |     <g class="cls-6">
116 |       <g>
117 |         <g>
118 |           <line class="cls-12" x1="10" y1="22.5" x2="17.89744" y2="22.5"/>
119 |           <line class="cls-12" x1="22.03419" y1="22.5" x2="29.93162" y2="22.5"/>
120 |           <line class="cls-12" x1="34.06838" y1="22.5" x2="41.96581" y2="22.5"/>
121 |           <line class="cls-12" x1="46.10256" y1="22.5" x2="54" y2="22.5"/>
122 |         </g>
123 |         <g>
124 |           <line class="cls-12" x1="10" y1="26.31444" x2="17.89744" y2="26.31444"/>
125 |           <line class="cls-12" x1="22.03419" y1="26.31444" x2="29.93162" y2="26.31444"/>
126 |           <line class="cls-12" x1="40.19352" y1="26.31444" x2="41.96581" y2="26.31444"/>
127 |           <line class="cls-12" x1="46.10256" y1="26.31444" x2="54" y2="26.31444"/>
128 |         </g>
129 |       </g>
130 |       <g>
131 |         <g>
132 |           <line class="cls-12" x1="10" y1="37.26182" x2="17.89744" y2="37.26182"/>
133 |           <line class="cls-12" x1="22.03419" y1="37.26182" x2="29.93162" y2="37.26182"/>
134 |           <line class="cls-12" x1="46.10256" y1="37.26182" x2="54" y2="37.26182"/>
135 |         </g>
136 |         <g>
137 |           <line class="cls-12" x1="10" y1="41.07625" x2="17.89744" y2="41.07625"/>
138 |           <line class="cls-12" x1="22.03419" y1="41.07625" x2="28.2248" y2="41.07625"/>
139 |           <line class="cls-12" x1="46.10256" y1="41.07625" x2="54" y2="41.07625"/>
140 |         </g>
141 |       </g>
142 |     </g>
143 |     <path class="cls-1" d="M16.44365,47.55635c-8.59153-8.59153-8.59153-22.52116,0-31.1127,8.59153-8.59153,22.52116-8.59153,31.1127,0l-31.1127,31.1127Z"/>
144 |     <g>
145 |       <g class="cls-16">
146 |         <g>
147 |           <line class="cls-8" x1="25.97093" y1="31.12165" x2="45" y2="31.12165"/>
148 |           <line class="cls-11" x1="27.30015" y1="38.12165" x2="45" y2="38.12165"/>
149 |           <g>
150 |             <line class="cls-13" x1="36.4422" y1="28.62165" x2="36.4422" y2="33.62165"/>
151 |             <line class="cls-13" x1="39.45751" y1="28.62165" x2="39.45751" y2="33.62165"/>
152 |             <line class="cls-13" x1="36.4422" y1="35.62165" x2="36.4422" y2="40.62165"/>
153 |             <line class="cls-13" x1="39.45751" y1="35.62165" x2="39.45751" y2="40.62165"/>
154 |           </g>
155 |         </g>
156 |       </g>
157 |       <g>
158 |         <g>
159 |           <line class="cls-7" x1="47.18813" y1="46.55101" x2="54" y2="53.36287"/>
160 |           <line class="cls-5" x1="47.37296" y1="48.0403" x2="53.33266" y2="54"/>
161 |           <line class="cls-14" x1="42.73283" y1="42.09123" x2="46.66995" y2="46.03283"/>
162 |         </g>
163 |         <g id="icn_data_analysis" class="cls-15">
164 |           <circle class="cls-3" cx="35.91897" cy="34.92894" r="9.61451"/>
165 |         </g>
166 |         <circle class="cls-9" cx="35.91897" cy="34.92894" r="9.61451"/>
167 |         <path class="cls-10" d="M30.53461,34.92767c.00073-2.96854,2.41646-5.38427,5.38573-5.38427"/>
168 |       </g>
169 |     </g>
170 |   </g>
171 | </svg>


--------------------------------------------------------------------------------
/README_BETA.md:
--------------------------------------------------------------------------------
  1 | <p align="center">
  2 |   <img src="img/pbaa_logo_transparent.png" alt="pbaa logo" width="250px"/>
  3 | </p>
  4 | <h1 align="center"><i>pbaa</i></h1>
  5 | <p align="center">PacBio Amplicon Analysis</p>
  6 | 
  7 | ***
  8 | 
  9 | PacBio Amplicon Analysis (_pbaa_) separates complex mixtures of amplicon targets from genomic samples. The _pbaa_ application is designed to cluster HiFi reads and generate high-quality consensus sequences from them. It works only on HiFi amplicon data: several assumptions in the code hold only for high-quality reads (>QV20), so it will not work on CLR data. _pbaa_ is a reference-aided (pseudo de novo) method.
 10 | 
 11 | Typical use cases involve multi-allelic samples where the sample-specific ploidy or copy number is unknown. _pbaa_ can effectively separate alleles with one to many variants, including SNVs and large indels contained within the target region. _pbaa_ has been optimized and tested for datasets with a moderate (<10) cluster count.  Feedback for higher cluster density is welcome and may be addressed in future releases.
 12 | 
 13 | ## Workflow
 14 | ![HiFi Amplicon Analysis Workflow](img/workflow.png)
 15 | 
 16 | ## Availability
 17 | The latest version can be installed via the bioconda package `pbaa`.
 18 | 
 19 | Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda)
 20 | for information on Installation, Support, License, Copyright, and Disclaimer.
 21 | 
 22 | ## Latest Version
 23 | Version **0.1.4**: [Full changelog here](#full-changelog)
 24 | 
 25 | ## Usage
 26 | _pbaa_ has two tools, `cluster` and `bampaint`.
 27 | 
 28 | ```
 29 | pbaa - PacBio HiFi Amplicon Analysis.
 30 | 
 31 | Usage:
 32 |   pbaa <tool>
 33 | 
 34 |   -h,--help    Show this help and exit.
 35 |   --version    Show application version and exit.
 36 | 
 37 | Tools:
 38 |   cluster    Run clustering tool.
 39 |   bampaint   Add color tags to BAM records, based on pbaa clusters.
 40 | 
 41 | ```
 42 | 
 43 | ### Main clustering tool
 44 | This tool runs the placement, clustering, and consensus algorithms.
 45 | 
 46 | ```
 47 | pbaa cluster - Run clustering tool.
 48 | 
 49 | Usage:
 50 |   pbaa cluster [options] <guide input> <read input> <prefix>
 51 | 
 52 |   guide input               FILE   Guide sequence(s) in fasta format. A FOFN can be provide for multiple files.
 53 |   read input                FILE   De-multiplexed HiFi reads in fastq format. A FOFN can be provide for multiple files.
 54 |   prefix                    STR    Output prefix for run.
 55 | 
 56 | Placement and Variant Options:
 57 |   --filter                  INT    Low coverage kmer count cutoff. Automatically adjusted by min-var-frequency. [3]
 58 |   --trim-ends               INT    Number of bases to trim from both sides of reads during graph construction. [20]
 59 |   --min-var-frequency       FLOAT  Minimum variant frequency. [0.05]
 60 | 
 61 | Clustering Options:
 62 |   --fixed-cluster-count     INT    Maximum number of clusters per locus/guide-group. [10]
 63 |   --em-iterations           INT    Number of iterations to run expectation maximization. [300]
 64 |   --no-cluster-by-length           Disable fallback length clustering if no variants were discovered.
 65 | 
 66 | Consensus Options:
 67 |   --min-cluster-frequency   FLOAT  Low frequency cluster cutoff. [0.1]
 68 |   --min-cluster-read-count  INT    Low read count cluster cutoff. [5]
 69 |   --max-consensus-reads     INT    Maximum number of reads to use per cluster consensus. [100]
 70 |   --off-target-groups       STR    Group names to exclude, i.e. these loci are off-target (not amplified).
 71 | 
 72 | General Options:
 73 |   --max-amplicon-size       INT    Upper read length cutoff, longer reads will be skipped. [15000]
 74 |   --min-read-qv             FLOAT  Low read QV cutoff. [20]
 75 | 
 76 |   -h,--help                        Show this help and exit.
 77 |   --version                        Show application version and exit.
 78 |   -j,--num-threads          INT    Number of threads to use, 0 means autodetection. [0]
 79 |   --log-level               STR    Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL). [WARN]
 80 |   --log-file                FILE   Log to a file, instead of stderr.
 81 |   ```
 82 | ### Coloring reads by clusters
 83 | If you have a BAM file (mapped amplicon reads), this tool will add IGV tags for grouping and coloring. It matches read names between the read information file produced by `pbaa cluster` and the BAM file.
 84 | 
 85 | ```
 86 | pbaa bampaint - Add color tags to BAM records, based on pbaa clusters.
 87 | 
 88 | Usage:
 89 |   pbaa bampaint [options] <read info file> <input bam> <output bam>
 90 | 
 91 |   read info file    FILE  Read information file produced by pbaa cluster.
 92 |   input bam         FILE  Bam file to add color tags.
 93 |   output bam        FILE  Output bam file name.
 94 | 
 95 | Options:
 96 |   -h,--help               Show this help and exit.
 97 |   --version               Show application version and exit.
 98 |   -j,--num-threads  INT   Number of threads to use, 0 means autodetection. [0]
 99 |   --log-level       STR   Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL). [WARN]
100 |   --log-file        FILE  Log to a file, instead of stderr.
101 | ```
102 | 
103 | ## Input
104 | 
105 | _pbaa_ requires two input files: guide/reference sequences in fasta format and de-multiplexed HiFi reads in fastq format. Guide/reference sequences should contain the amplified region, but not much more. Providing a whole chromosome, for example, will cause large problems for placement.
106 | 
107 | _pbaa_ supports batching of samples via the FOFN (file of file names) format. A FOFN is a line-separated file that contains the **full paths** to the input files. **All input sequence files need to be indexed before running _pbaa_.** Indexing can be achieved with _samtools_ version 1.9 or greater (`samtools faidx` for fasta files, `samtools fqidx` for fastq files).
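
Because a FOFN must list full paths, it can help to generate it programmatically rather than by hand. The following is a minimal sketch under that assumption; `write_fofn` is a hypothetical helper, not part of `pbaa`, and it does not check that the files have been indexed.

```python
from pathlib import Path


def write_fofn(input_files, fofn_path):
    """Write one absolute path per line and return the paths written."""
    # resolve() turns relative names into full paths, as a FOFN requires.
    full_paths = [str(Path(name).resolve()) for name in input_files]
    Path(fofn_path).write_text("\n".join(full_paths) + "\n")
    return full_paths
```

A call such as `write_fofn(["sample1.fastq", "sample2.fastq"], "reads.fofn")` (file names here are placeholders) writes a two-line file that can be passed as the read input.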
108 | 
109 | ## Customizing guide sequences
110 | 
111 | Guide/reference sequence choice affects read grouping/placement. It is important to choose guides that are sufficiently divergent. If too many similar alleles are used for the same locus, the fraction of un-placed reads will increase because the number of informative kmers within a locus decreases. Too few guides can also cause cluster dropout; it's a Goldilocks problem.
112 | 
113 | Guide sequences should be grouped into locus assignments. For example, if multiple HLA-A alleles are used in the guide sequences, they should be grouped so that clustering is performed at the locus level.
114 | 
115 | ```
116 | Allele_1|HLA-A (sequence name | group name)
117 | Allele_2|HLA-A (sequence name | group name)
118 | ```
119 | 
120 | In the example above, reads assigned to either allele (1 or 2) will be merged into a single dataset for clustering.
121 | For more details on setting up guides, see `guide_reference.md`.
122 | 
123 | 
124 | ## Output
125 | 
126 | _pbaa_ will generate three output files.
127 | 
128 | 1. {prefix}_passed_cluster_sequences.fasta
129 | 2. {prefix}_failed_cluster_sequences.fasta
130 | 3. {prefix}_read_info.txt
131 | 
132 | 
133 | ### Consensus sequence output
134 | 
135 | The header entries in the consensus sequence output contain statistics about the clusters. These statistics are used for filtering (pass/fail criteria).
136 | 
137 | Example of a passing sequence:
138 | 
139 | ```
140 | >sample-bc1001--bc1001_guide-HLA-C_cluster-0_ReadCount-617 uchime_score:-1 uchime_left_parent: N/A uchime_right_parent: N/A cluster_freq:0.454345 diversity:0.0113006 avg_quality:87.0029 filters: none
141 | ```
142 | 
143 | Example of a failing sequence:
144 | 
145 | ```
146 | >sample-bc1001--bc1001_guide-HLA-A_cluster-6_ReadCount-32 uchime_score:3.33333 uchime_left_parent: bc1001--bc1001_HLA-A_0 uchime_right_parent: bc1001--bc1001_HLA-A_1 cluster_freq:0.0152164 diversity:2.30176 avg_quality:87.9363 filters: fail-high-chimera-score fail-cluster-is-low-frequency fail-high-diversity
147 | ```
148 | 
149 | The fields in the header are:
150 | 
151 | 1. **uchime_score** The UCHIME score flags chimeric consensus sequences. The higher the score the more likely the sequence is chimeric. For more details see: Edgar, Robert C., et al. “UCHIME improves sensitivity and speed of chimera detection.” Bioinformatics 27.16 (2011): 2194-2200.
152 | 
153 | 2. **uchime_left_parent/uchime_right_parent** The parent sequences of a chimeric sequence.
154 | 
155 | 3. **cluster_freq** A measure of the cluster's frequency, calculated from read counts within its guide grouping.
156 | 
157 | 4. **diversity** A measure of the variability of variants within a cluster. Clusters with homogeneous reads will have low diversity. A negative value indicates this metric was not calculated.
158 | 
159 | 5. **avg_quality** The average PHRED quality of the reads within the cluster.
160 | 
161 | 6. **filters** A space-separated field listing the reasons a cluster was placed in the fail category, or `none` if it passed.
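
The header fields listed above can be pulled apart with a short parser when you want to filter or tabulate clusters yourself. This is a sketch only: `parse_consensus_header` is a hypothetical name, and it assumes the header layout matches the two examples shown (the field names in order, each followed by a colon and an optional space, with `filters` last).

```python
import re

FIELDS = ("uchime_score", "uchime_left_parent", "uchime_right_parent",
          "cluster_freq", "diversity", "avg_quality", "filters")
NUMERIC = {"uchime_score", "cluster_freq", "diversity", "avg_quality"}
# Split on "field_name:" so values may or may not be preceded by a space.
FIELD_RE = re.compile(r"\b(" + "|".join(FIELDS) + r"):\s*")


def parse_consensus_header(header):
    """Return a dict with the cluster name and the statistics fields."""
    parts = FIELD_RE.split(header.lstrip(">").strip())
    record = {"name": parts[0].strip()}
    # parts alternates: name, key, value, key, value, ...
    for key, value in zip(parts[1::2], parts[2::2]):
        value = value.strip()
        if key in NUMERIC:
            record[key] = float(value)
        elif key == "filters":
            record[key] = [] if value == "none" else value.split()
        else:
            record[key] = value
    return record
```

Applied to the passing example above, the returned dict has an empty `filters` list; applied to the failing example, `filters` holds the three `fail-...` reasons.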
162 | 
163 | ### Read Information Output File
164 | One row per read, columns as follows:
165 | 1. SeqName
166 | 2. GuideName
167 | 3. strand
168 | 4. SecondBestGuideName
169 | 5. Score
170 | 6. FirstHighest/SecondHighest/UniqueHitSum
171 | 7. Sample
172 | 8. VarString
173 | 9. ClusterId
174 | 10. ClusterProb
175 | 11. ClusterSize
176 | 12. ChimeraScore
177 | 
178 | **_Example:_**
179 | ```
180 | m54043_190914_194303/4195156/ccs HLA-B + HLA00158_B_14-02-01-01_4070_bp|HLA-B 0.663158 f:315/s:160/sum:619 reads.fasta bp-3300 0 1.000000 1 0.000000
181 | ```
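
To work with the read information file in a script, each row can be mapped onto the twelve columns listed above. The sketch below assumes the columns are whitespace-separated with no spaces inside a field, and that ClusterId and ClusterSize are integers, as in the example row; `ReadInfo` and `parse_read_info` are hypothetical names, not part of `pbaa`.

```python
from typing import NamedTuple


class ReadInfo(NamedTuple):
    seq_name: str
    guide_name: str
    strand: str
    second_best_guide_name: str
    score: float
    hit_counts: str  # FirstHighest/SecondHighest/UniqueHitSum, kept as text
    sample: str
    var_string: str
    cluster_id: int
    cluster_prob: float
    cluster_size: int
    chimera_score: float


def parse_read_info(line):
    """Parse one row of {prefix}_read_info.txt into a ReadInfo tuple."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    return ReadInfo(
        fields[0], fields[1], fields[2], fields[3], float(fields[4]),
        fields[5], fields[6], fields[7], int(fields[8]),
        float(fields[9]), int(fields[10]), float(fields[11]),
    )
```

Parsing the example row above yields `guide_name == "HLA-B"`, `strand == "+"`, and `cluster_id == 0`.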
182 | 
183 | ## Best practices
184 | 
185 | ### Sample preparation and sequencing
186 | 
187 | [Targeted Sequencing For Amplicons Document](https://www.pacb.com/wp-content/uploads/Application-Brief-Targeted-sequencing-Best-Practices.pdf)
188 | 
189 | 
190 | ### Use defaults
191 | We've optimized the default parameters to perform well on several datasets. In general, use the defaults unless you have a specific reason to change them. If you discover an edge case, please share this experience.
192 | 
193 | ### Provide an off-target-groups file
194 | Amplification can generate off-target reads. Similarly, _pbaa_ can accidentally place a few reads in the wrong grouping/locus. These reads may generate clusters that are off-target (not amplified). If you provide a list of guide names or group names (one per line), _pbaa_ will filter these clusters out.
195 | 
196 | 
197 | ## Advanced / Hidden Options
198 | 
199 | A number of heuristics and advanced options are hidden from the interface. They are documented here, but consider them experimental features. Changing these settings can have undesired effects.
200 | 
201 | **_--no-indel_** : Only consider variants where both alleles are the same length in the bubble.
202 | 
203 | **_--skip-consensus_** : Only run read placement and clustering. No consensus sequences will be generated.
204 | 
205 | **_--no-hpc_** : Disable homopolymer compression for variant detection, and cluster filtering. Some variants can be masked by homopolymer compression.
206 | 
207 | **_--kmer-size_** : Kmer size; must be odd and must not exceed 31.
208 | 
209 | **_max-cluster-count_** : This flag interacts with _--fixed-cluster-count_, which must be set to zero to turn on this option. The algorithm attempts to learn the number of haplotype clusters. In practice a fixed upper limit runs quicker and generates very similar results.
210 | 
211 | **_min-path-coverage_** : Low coverage cutoff for bubble paths. This option filters paths in the graph with fewer than N reads supporting the path at a fractional membership.
212 | 
213 | ## Demonstration Data
214 | Demonstration data for testing _pbaa_ can be found [here](https://downloads.pacbcloud.com/public/dataset/pbAmpliconAnalysis_HLA/). The dataset contains HiFi reads for 6 pooled HLA genes in FASTQ format. Outputs from running _pbaa_ as well as validated genotypes are included.
215 | 
216 | ## FAQ
217 | 
218 | **_I'm finding extra false clusters_** : There are a number of reasons _pbaa_ might generate false positive clusters, including chimeric reads, false variant calls, and clustering errors. Often the statistics in the fasta headers can provide clues. Changing filtering settings may reduce false positives.
219 | 
220 | **_I'm missing an expected cluster_** : _pbaa_ was tuned for sensitivity, favoring false positives over false negatives. However, false negatives do happen at a low rate. First, check whether the missing cluster is in the {prefix}_failed_cluster_sequences.fasta file. Then check that there is sufficient coverage over the guide sequences by aligning reads to the guide sequences. For this application we recommend 50x depth of coverage per allele. If there is sufficient coverage, check for missed variants. If you believe _pbaa_ failed to discover a variant, resulting in a false negative, please share a small example. Variant discovery is the most challenging step in _pbaa_.
221 | 
222 | **_I think I found a bug or I have feedback_** : Please share your experiences via GitHub bug reports. Do not expect a quick reply.
223 | 
224 | ## Disclaimer
225 | THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
226 | 
227 | 
228 | ## Full Changelog
229 | 
230 | * 0.1.4: March 25th, 2021
231 |  1. bugfix: Homopolymer compression can be turned off in duplicate marking using --no-hpc.
232 |  2. bugfix: Filter reads that are too short, before/after homopolymer compression.
233 | 
234 | * 0.1.3: Feb 16, 2021
235 |  1. Updated CLI, highlighting faidx/fqidx required.
236 | 
237 | * 0.1.2:
238 |  1. Added BAM coloring binary.
239 | 
240 | * 0.1.1:
241 |  1. Simplification of the CLI.
242 |  2. Chimera detection across loci.
243 |  3. CLI now supports both single files and FOFNs.
244 |  4. Updated documentation.
245 | 
246 | * 0.1.0:
247 |  1. Beta code release.
248 |  2. Documentation.
249 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | <p align="center">
  2 |   <img src="img/logo_pbaa.svg" alt="pbaa logo" width="250px"/>
  3 | </p>
  4 | <h1 align="center"><i>pbaa</i></h1>
  5 | <p align="center">PacBio Amplicon Analysis</p>
  6 | 
  7 | ***
  8 | 
  9 | PacBio Amplicon Analysis (_pbaa_) separates complex mixtures of amplicon targets from genomic samples. The _pbaa_ application is designed to cluster HiFi reads and generate high-quality consensus sequences from them. It works only on HiFi amplicon data: several assumptions in the code hold only for high-quality reads (>QV20), so it will not work on CLR data. _pbaa_ is a reference-aided (pseudo de novo) method.
 10 | 
 11 | Typical use cases involve multi-allelic samples where the sample-specific ploidy or copy number is unknown. _pbaa_ can effectively separate alleles with one to many variants, including SNVs and large indels contained within the target region. _pbaa_ has been optimized and tested for datasets with moderate to high (<50) cluster counts.
 12 | 
 13 | ## Workflow
 14 | ![HiFi Amplicon Analysis Workflow](img/v1.0.0.png)
 15 | 
 16 | ## Availability
 17 | The latest version can be installed via bioconda package `pbaa`.
 18 | 
 19 | Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda)
 20 | for information on Installation, Support, License, Copyright, and Disclaimer.
 21 | 
 22 | [Full changelog here](#full-changelog)
 23 | 
 24 | ## Usage
 25 | _pbaa_ has two tools, `cluster` and `bampaint`.
 26 | 
 27 | ```
 28 | pbaa - PacBio HiFi Amplicon Analysis.
 29 | 
 30 | Usage:
 31 |   pbaa <tool>
 32 | 
 33 |   -h,--help    Show this help and exit.
 34 |   --version    Show application version and exit.
 35 | 
 36 | Tools:
 37 |   cluster    Run clustering tool.
 38 |   bampaint   Add color tags to BAM records, based on pbaa clusters.
 39 | 
 40 | ```
 41 | 
 42 | ### Main clustering tool
 43 | This tool runs the placement, clustering, and consensus algorithms.
 44 | 
 45 | ```
 46 | pbaa cluster - Run clustering tool.
 47 | 
 48 | Usage:
 49 |   pbaa cluster [options] <guide input> <read input> <prefix>
 50 | 
 51 |   guide input                FILE   Guide sequence(s) in fasta format indexed with samtools faidx version 1.9 or
 52 |                                     greater. A FOFN can be provided for multiple files.
 53 |   read input                 FILE   De-multiplexed HiFi reads in fastq format indexed with samtools fqidx version 1.9
 54 |                                     or greater. A FOFN can be provided for multiple files.
 55 |   prefix                     STR    Output prefix for run.
 56 | 
 57 | Placement and Variant Options:
 58 |   --filter                   INT    Variants with coverage lower than filter will be ignored. [3]
 59 |   --trim-ends                INT    Number of bases to trim from both sides of reads during graph construction and
 60 |                                     variant detection. [5]
 61 |   --pile-size                INT    The number of best alignments to keep for each read during error correction. [30]
 62 |   --min-var-frequency        FLOAT  Minimum coverage frequency within a pile. [0.3]
 63 |   --max-alignments-per-read  INT    The number of random alignments, for each read, within a guide grouping [1000]
 64 | 
 65 | Clustering Options:
 66 |   --max-reads-per-guide      INT    The number randomly selected reads to use within a guide grouping. [500]
 67 |   --iterations               INT    Number of iterations to run k-means. [9]
 68 |   --seed                     INT    Randomization seed. [1984]
 69 | 
 70 | Consensus Options:
 71 |   --max-consensus-reads      INT    Maximum number of reads to use per cluster consensus. [100]
 72 | 
 73 | Filtering Options:
 74 |   --max-amplicon-size        INT    Upper read length cutoff, longer reads will be skipped. [15000]
 75 |   --min-read-qv              FLOAT  Low read QV cutoff. [20]
 76 |   --off-target-groups        STR    Group names to exclude, i.e. these loci are off-target (not amplified).
 77 |   --min-cluster-frequency    FLOAT  Low frequency cluster cutoff. [0.1]
 78 |   --min-cluster-read-count   INT    Low read count cluster cutoff. [5]
 79 |   --max-uchime-score         FLOAT  High UCHIME score cutoff. [1]
 80 | 
 81 | General Options:
 82 | 
 83 |   -h,--help                         Show this help and exit.
 84 |   --version                         Show application version and exit.
 85 |   -j,--num-threads           INT    Number of threads to use, 0 means autodetection. [0]
 86 |   --log-level                STR    Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL). [WARN]
 87 |   --log-file                 FILE   Log to a file, instead of stderr.
 88 |   ```
 89 | ### Coloring reads by clusters
 90 | If you have a BAM file of mapped amplicon reads, this tool adds IGV tags for grouping (HP) and coloring (YC). It matches read names between the read information file produced by `pbaa cluster` and the BAM file. The amplicon reads can be aligned against any reference.
 91 | 
 92 | ```
 93 | pbaa bampaint - Add color tags to BAM records, based on pbaa clusters.
 94 | 
 95 | Usage:
 96 |   pbaa bampaint [options] <read info file> <input bam> <output bam>
 97 | 
 98 |   read info file    FILE  Read information file produced by pbaa cluster.
 99 |   input bam         FILE  Bam file to add color tags.
100 |   output bam        FILE  Output bam file name.
101 | 
102 | Options:
103 |   -h,--help               Show this help and exit.
104 |   --version               Show application version and exit.
105 |   -j,--num-threads  INT   Number of threads to use, 0 means autodetection. [0]
106 |   --log-level       STR   Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL). [WARN]
107 |   --log-file        FILE  Log to a file, instead of stderr.
108 | ```
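
The two usage messages above can be turned into command lines programmatically. The sketch below only assembles argument lists from options documented in the help text; the file names (`guides.fasta`, `reads.fastq`, `aligned.bam`, and so on) and the helper function names are hypothetical, and nothing is executed.

```python
import shlex


def cluster_command(guide, reads, prefix, threads=0, max_reads_per_guide=None):
    """Assemble an argument list for `pbaa cluster` (options first, then positionals)."""
    cmd = ["pbaa", "cluster", "--num-threads", str(threads)]
    if max_reads_per_guide is not None:
        cmd += ["--max-reads-per-guide", str(max_reads_per_guide)]
    return cmd + [str(guide), str(reads), str(prefix)]


def bampaint_command(read_info, input_bam, output_bam):
    """Assemble an argument list for `pbaa bampaint`."""
    return ["pbaa", "bampaint", str(read_info), str(input_bam), str(output_bam)]


# Hypothetical file names, for illustration only.
cluster = cluster_command("guides.fasta", "reads.fastq", "sample_out",
                          threads=4, max_reads_per_guide=1000)
paint = bampaint_command("sample_out_read_info.txt", "aligned.bam", "painted.bam")

print(shlex.join(cluster))
print(shlex.join(paint))
```

Either list can be passed to `subprocess.run(cmd, check=True)` once _pbaa_ is installed and the inputs are indexed.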
109 | 
110 | ## Input
111 | 
112 | _pbaa_ requires two input files: guide/reference sequences in FASTA format (indexed with `samtools faidx`) and de-multiplexed HiFi reads in FASTQ format (indexed with `samtools fqidx`). Guide/reference sequences should contain the amplified region, but not much more. Providing a whole chromosome as a guide, for example, will lead to issues during read placement.
113 | 
114 | _pbaa_ supports batching of samples via the FOFN (file of file names) format. A FOFN is a line-separated file that contains the **full paths** to the input files. **All input sequence files need to be indexed before running _pbaa_.** Indexing can be done with _samtools_ version 1.9 or greater.
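
A FOFN as described above is simple to generate. The following is a minimal sketch that writes one absolute path per line; the function name `write_fofn` and the FASTQ file names are hypothetical, and the listed files still need to be indexed separately with _samtools_.

```python
import tempfile
from pathlib import Path


def write_fofn(input_files, fofn_path):
    """Write one absolute path per line and return the lines written."""
    lines = [str(Path(f).resolve()) for f in input_files]
    Path(fofn_path).write_text("\n".join(lines) + "\n")
    return lines


# Demonstration in a temporary directory; the FASTQ names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    fofn = Path(tmp) / "reads.fofn"
    written = write_fofn(
        ["demultiplex.bc1001.fastq", "demultiplex.bc1002.fastq"], fofn
    )
    fofn_lines = fofn.read_text().splitlines()

print(len(fofn_lines), "paths written")
```

Resolving to absolute paths up front avoids surprises when _pbaa_ is launched from a different working directory than the one in which the FOFN was created.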
115 | 
116 | ## Customizing guide sequences
117 | 
118 | Guide/reference sequence choice affects read grouping/placement. It is important to choose guides that are sufficiently divergent. If too many similar alleles are used for the same locus, the fraction of unplaced reads will increase because the number of informative guide k-mers decreases. Too few guides can also cause cluster dropout; it's a Goldilocks problem.
119 | 
120 | Guide sequences should be grouped into locus assignments. For example, if multiple HLA-A alleles are used as guide sequences, they should share a group name so that clustering is performed at the locus level.
121 | 
122 | ```
123 | Allele_1|HLA-A (sequence name | group name)
124 | Allele_2|HLA-A (sequence name | group name)
125 | ```
126 | 
127 | In the example above, reads assigned to either allele (1 or 2) will be merged into a single dataset for clustering.
128 | For more details on setting up guides, see `guide_reference.md`.
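
To check how a set of guide names will be grouped, the `sequence name|group name` convention can be parsed directly. This is an illustrative sketch, not _pbaa_'s own parser: the function name `group_guides` is hypothetical, and rejecting names without a `|` is an assumption of this sketch rather than documented behavior.

```python
from collections import defaultdict


def group_guides(names):
    """Map each group (locus) name to the guide sequence names assigned to it."""
    groups = defaultdict(list)
    for name in names:
        allele, sep, group = name.rpartition("|")
        if not sep:
            # Assumption: this sketch only handles the documented "name|group" form.
            raise ValueError(f"guide name has no '|group' suffix: {name!r}")
        groups[group.strip()].append(allele.strip())
    return dict(groups)


guides = ["Allele_1|HLA-A", "Allele_2|HLA-A", "Allele_3|HLA-B"]
for locus, alleles in group_guides(guides).items():
    print(locus, alleles)
```

Reads placed on any allele of a locus end up in the same clustering dataset, so a quick listing like this helps confirm that no allele has been assigned to the wrong group.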
129 | 
130 | 
131 | ## Output
132 | 
133 | _pbaa_ will generate three output files.
134 | 
135 | 1. {prefix}_passed_cluster_sequences.fasta
136 | 2. {prefix}_failed_cluster_sequences.fasta
137 | 3. {prefix}_read_info.txt
138 | 
139 | 
140 | ### Consensus sequence output
141 | 
142 | The header entries in the consensus sequence output contain statistics about the clusters. These statistics are used for filtering (pass/fail criteria).
143 | 
144 | Example of a passing sequence:
145 | 
146 | ```
147 | >sample-bc1001--bc1001_guide-HLA-A_cluster-1_ReadCount-22 uchime_score:-1 uchime_left_parent:N/A uchime_right_parent:N/A cluster_freq:0.44 diversity:0.188552 avg_quality:53.5878 duplicate_parent:N/A seq_length:3152 filters:none
148 | ```
149 | 
150 | Example of a failing sequence:
151 | 
152 | ```
153 | >sample-bc1011--bc1011_guide-HLA-DRB1_cluster-3_ReadCount-6 uchime_score:0.03125 uchime_left_parent:bc1011--bc1011_HLA-DRB1_0 uchime_right_parent:bc1011--bc1011_HLA-DRB1_1 cluster_freq:0.12 diversity:0 avg_quality:42.2187 duplicate_parent:N/A seq_length:3706 filters:fail-low-frequency
154 | ```
155 | 
156 | The fields in the header are:
157 | 
158 | 1. **uchime_score** The UCHIME score flags chimeric consensus sequences. The higher the score the more likely the sequence is chimeric. For more details see: Edgar, Robert C., et al. “UCHIME improves sensitivity and speed of chimera detection.” Bioinformatics 27.16 (2011): 2194-2200.
159 | 
160 | 2. **uchime_left_parent/uchime_right_parent** The parent sequences of a chimeric sequence.
161 | 
162 | 3. **cluster_freq** A measure of the cluster's frequency, calculated from read counts within a grouping.
163 | 
164 | 4. **diversity** A measure of the variability of variants within a cluster. Clusters with homogeneous reads will have low diversity. A negative value indicates this metric was not calculated.
165 | 
166 | 5. **avg_quality** The average PHRED quality of the reads within the cluster.
167 | 
168 | 6. **seq_length** The sequence length of the cluster.
169 | 
170 | 7. **filters** A space-separated field listing the reasons a cluster was placed in the fail category (`none` for passing clusters).
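
The header fields listed above can be pulled into a dictionary for downstream filtering. The sketch below assumes the layout shown in the two examples: a record identifier followed by whitespace-separated `key:value` tokens, with any additional colon-free tokens belonging to the preceding field (which is how a space-separated `filters` list would appear). The function name `parse_consensus_header` is hypothetical.

```python
def parse_consensus_header(header):
    """Parse a pbaa consensus FASTA header into a dict of string fields."""
    tokens = header.lstrip(">").split()
    record = {"id": tokens[0]}
    last_key = None
    for token in tokens[1:]:
        key, sep, value = token.partition(":")
        if sep:
            record[key] = value
            last_key = key
        elif last_key is not None:
            # Assumption: a colon-free token continues the previous field.
            record[last_key] += " " + token
    filters = record.get("filters", "none")
    record["filters"] = [] if filters == "none" else filters.split()
    return record


failing = (
    ">sample-bc1011--bc1011_guide-HLA-DRB1_cluster-3_ReadCount-6 "
    "uchime_score:0.03125 uchime_left_parent:bc1011--bc1011_HLA-DRB1_0 "
    "uchime_right_parent:bc1011--bc1011_HLA-DRB1_1 cluster_freq:0.12 "
    "diversity:0 avg_quality:42.2187 duplicate_parent:N/A seq_length:3706 "
    "filters:fail-low-frequency"
)
rec = parse_consensus_header(failing)
print(rec["id"], float(rec["cluster_freq"]), rec["filters"])
```

Values are kept as strings so that placeholders such as `N/A` survive; convert individual fields (for example `float(rec["cluster_freq"])`) where needed.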
171 | 
172 | ### Read Information Output File
173 | One row per read, columns as follows:
174 | 1. SeqName
175 | 2. GuideName
176 | 3. strand
177 | 4. SecondBestGuideName
178 | 5. Score
179 | 6. FirstHighest/SecondHighest/UniqueHitSum
180 | 7. Sample, input fastq
181 | 8. Sequence length
182 | 9. Average read quality
183 | 10. Cluster ID
184 | 11. Cluster Size
185 | 
186 | **_Example:_**
187 | ```
188 | m64012_200712_164638/72090819/ccs HLA-DRB5 - HLA00622_DQB1_02-01-01_7480_bp|HLA-DQB1 0.714286 f:5/s:2/sum:8 /pbi/dept/appslab/projects/old/2020/jh_hla/2020-07-13_HGgendx/fastq_sqII/demultiplex.bc1099--bc1099.fastq 3125 58.6149 1 1
189 | ```
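
A row of the read information file can be parsed into named fields as sketched below. This assumes the columns are whitespace-delimited and that paths contain no spaces, as in the example row; the `ReadInfo` field names are hypothetical labels for the eleven documented columns, and the sample path is shortened to a placeholder.

```python
from typing import NamedTuple


class ReadInfo(NamedTuple):
    seq_name: str
    guide_name: str
    strand: str
    second_best_guide_name: str
    score: float
    hits: dict          # FirstHighest/SecondHighest/UniqueHitSum, e.g. {"f": 5, "s": 2, "sum": 8}
    sample: str
    seq_length: int
    avg_quality: float
    cluster_id: int
    cluster_size: int


def parse_read_info(line):
    """Parse one whitespace-delimited row of {prefix}_read_info.txt."""
    f = line.split()
    if len(f) != 11:
        raise ValueError(f"expected 11 columns, found {len(f)}")
    hits = {key: int(value) for key, value in (item.split(":") for item in f[5].split("/"))}
    return ReadInfo(f[0], f[1], f[2], f[3], float(f[4]), hits, f[6],
                    int(f[7]), float(f[8]), int(f[9]), int(f[10]))


row = ("m64012_200712_164638/72090819/ccs HLA-DRB5 - "
       "HLA00622_DQB1_02-01-01_7480_bp|HLA-DQB1 0.714286 f:5/s:2/sum:8 "
       "demultiplex.bc1099--bc1099.fastq 3125 58.6149 1 1")
info = parse_read_info(row)
print(info.guide_name, info.cluster_id, info.hits)
```

Grouping the parsed rows by `guide_name` and `cluster_id` gives a quick per-cluster read count to compare against the `ReadCount` value in the consensus headers.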
190 | 
191 | ## Demonstration Data
192 | Demonstration data for testing _pbaa_ can be found [here](https://downloads.pacbcloud.com/public/dataset/pbAmpliconAnalysis_HLA/). The dataset contains HiFi reads for 6 pooled HLA genes in FASTQ format. Outputs from running _pbaa_, as well as validated genotypes, are included.
193 | 
194 | ## Best practices
195 | 
196 | ### Sample preparation and sequencing  
197 | 
198 | [Targeted Sequencing For Amplicons Document](https://www.pacb.com/wp-content/uploads/Application-Brief-Targeted-sequencing-Best-Practices.pdf)
199 | 
200 | 
201 | ### Use defaults
202 | We've optimized the default parameters to perform well on several datasets. In general, use the defaults unless you have a specific reason to change them. If you discover an edge case, please share it.
203 | 
204 | ### Provide an off-target-groups file
205 | Amplification can generate off-target reads. Similarly, _pbaa_ can accidentally place a few reads in the wrong grouping/locus. These reads may generate clusters for loci that are off-target (not amplified). If you provide a list of guide names / group names (one per line), _pbaa_ will filter these clusters out.
206 | 
207 | ### Understanding the error correction stage options
208 | ![HiFi Amplicon Analysis Workflow](img/error_correction.png)
209 | 
210 | There are four heuristics to consider adjusting depending on the experiment (_max-reads-per-guide_, _max-alignments-per-read_, _pile-size_, _min-var-frequency_). In the image above, up to _max-reads-per-guide_ reads are considered for each HiFi read (the A-read). Each A-read is aligned to _max-alignments-per-read_ randomly selected reads. These alignments are sorted by decreasing sequence identity, and the top _pile-size_ reads are retained to correct the A-read. The _min-var-frequency_ is the minimum fraction of reads within a _pile_ that must support the A-read at a given position for the A-read to be treated as correct there.
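
The interplay of these options can be illustrated with a toy model. The sketch below is not _pbaa_'s implementation: it assumes each candidate read already has a precomputed identity to the A-read, represents a pile column as a plain list of bases, and treats the _min-var-frequency_ comparison as inclusive. All function names are hypothetical.

```python
import random


def build_pile(candidates, max_alignments_per_read=1000, pile_size=30, seed=1984):
    """Pick a random subset of candidates, then keep the most similar ones.

    candidates: list of (read_name, identity_to_a_read) pairs (hypothetical input).
    """
    rng = random.Random(seed)
    n = min(max_alignments_per_read, len(candidates))
    sampled = rng.sample(candidates, n)
    # Sort by decreasing sequence identity and retain the top pile_size reads.
    sampled.sort(key=lambda c: c[1], reverse=True)
    return sampled[:pile_size]


def is_supported(pile_bases, a_read_base, min_var_frequency=0.3):
    """True if enough reads in the pile agree with the A-read at one position."""
    if not pile_bases:
        return False
    return pile_bases.count(a_read_base) / len(pile_bases) >= min_var_frequency


candidates = [(f"read_{i}", i / 100) for i in range(100)]
pile = build_pile(candidates, max_alignments_per_read=50, pile_size=5)
print([name for name, _ in pile])
print(is_supported(["A"] * 4 + ["C"] * 6, "A"))
```

In this model, raising `max_alignments_per_read` widens the pool the pile is drawn from, while lowering `min_var_frequency` lets a smaller fraction of agreeing reads confirm the A-read, which matches the sensitivity trade-off described in the FAQ.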
211 | 
212 | 
213 | ## Advanced / Hidden Options
214 | 
215 | A number of heuristics and advanced options are hidden from the interface. They are documented here, but consider them experimental features. Changing these settings can have undesired effects.
216 | 
217 | **_--skip-consensus_** : Only run read placement and clustering. No consensus sequences will be generated.
218 | 
219 | **_--skip-chimera-detection_** : Skip chimera detection (UCHIME algorithm) step.
220 | 
221 | **_--kmer-size_** : K-mer size. Must be odd and must not exceed 31. This only affects read placement.
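
The constraints on `--kmer-size` can be checked before launching a run. The sketch below also shows a common rationale for such limits, which is an assumption here and not something this document states about _pbaa_: with 2 bits per base, a 31-mer fits in a 64-bit integer, and an odd k means a k-mer can never equal its own reverse complement. Both function names are hypothetical.

```python
def valid_kmer_size(k):
    """Check the documented constraints: odd and no greater than 31."""
    return isinstance(k, int) and 0 < k <= 31 and k % 2 == 1


def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer using 2 bits per base (illustrative only)."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in kmer:
        value = (value << 2) | codes[base]
    return value


print(valid_kmer_size(31), valid_kmer_size(32), valid_kmer_size(20))
print(encode_kmer("T" * 31) < 2 ** 64)
```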
222 | 
223 | ## FAQ
224 | 
225 | **_Why are reads missing?_** : By default _pbaa_ only uses 500 reads per locus/guide/grouping. Increasing `--max-reads-per-guide` will use more reads, at the expense of runtime/memory. If runtime is not a consideration, use all the reads.
226 | 
227 | **_Considerations for pooled experiments_**: Pooled experiments are bounded by runtime/memory. The settings below increase the accuracy of the results at the expense of runtime/memory.
228 | 
229 |  `--max-reads-per-guide` : Increase the number of reads used per locus/guide/grouping.
230 | 
231 |  `--max-alignments-per-read` : Increase the number of alignments per read. This setting can drastically increase computational expense.
232 | 
233 |  `--pile-size` : The number of reads to use for error masking per read.
234 | 
235 |  `--min-var-frequency` : The minimum variant frequency within a pile of reads used for error correction. Decreasing this value will increase sensitivity to low-frequency variants. This can lead to over clustering.
236 | 
237 | **_Extra false clusters_**: There are a number of reasons _pbaa_ might generate false positive clusters, including chimeric reads, false variant calls, and clustering errors. Often the statistics in the FASTA headers provide clues. Changing filtering settings may reduce false positives.
238 | 
239 | **_Missing clusters_**: _pbaa_ was tuned for sensitivity, favoring false positives over false negatives. However, false negatives do happen at a low rate. First, check whether the missing cluster is in the {prefix}_failed_cluster_sequences.fasta file. Then check that there is sufficient coverage over the guide sequences by aligning reads to them. For this application we recommend a minimum of 25x depth of coverage per allele. If there is sufficient coverage, check for missed variants. If you believe _pbaa_ failed to discover a variant, resulting in a false negative, please share a small example. Variant discovery is the most challenging step in _pbaa_.
240 | 
241 | **_Feedback or bug report_**: Please share your experiences via a [GitHub issue](https://github.com/PacificBiosciences/pbbioconda/issues).
242 | 
243 | ## Disclaimer
244 | THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
245 | 
246 | 
247 | ## Full Changelog
248 | 
249 | * 1.0.3: September 1st, 2021
250 |  1. Changes to error masking for indels.  
251 | 
252 | * 1.0.2: June 22nd, 2021
253 |  1. Minor bug fixes.
254 | 
255 | * 1.0.1: May the 4th be with you, 2021
256 |  1. Decrease memory and runtime by refactoring how the reads are loaded. 
257 | 
258 | * 1.0.0: April 23rd, 2021
259 |  1. Major algorithm refactor, and many changes to CLI.
260 | 
261 | * For previous releases please see [beta release documentation](README_BETA.md)
262 | 


--------------------------------------------------------------------------------