├── LICENSE ├── README.md ├── bam ├── best.md ├── modkit.md ├── perbase.md └── rustybam.md ├── csv ├── csview.md ├── madato.md ├── xsv.md └── xtab.md ├── dna ├── fakit.md ├── fq.md ├── ngs.md ├── rust-bio-tools.md └── skc.md ├── fastq ├── fasten.md ├── faster.md ├── fqgrep.md ├── fqkit.md ├── fqtk.md └── rasusa.md ├── longreads ├── NextPolish2.md ├── chopper.md └── longshot.md ├── metagenomics ├── coverm.md ├── skani.md └── sylph.md ├── pangenomics └── impg.md ├── phylogenomics └── segul.md ├── proteomics └── sage.md ├── rna └── rnapkin.md ├── singlecell ├── alevin-fry.md └── cellranger.md ├── slurm └── ssubmit.md └── vcf └── echtvar.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 size_t 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### rust in bioinformatics 2 | 3 | A collection of genomics software tools written in Rust 4 | 5 | 6 | 7 | ### index section 8 | 9 | ##### bam 10 | - [alignoth](https://github.com/alignoth/alignoth) : Creating alignment plots from bam files 11 | - [bamrescue](https://github.com/arkanosis/bamrescue) : Utility to check Binary Sequence Alignment / Map (BAM) files for corruption and repair them 12 | - [best](https://github.com/google/best) : Bam Error Stats Tool (best): analysis of error types in aligned reads 13 | - [modkit](https://github.com/nanoporetech/modkit) : A bioinformatics tool for working with modified bases 14 | - [mapAD](https://github.com/mpieva/mapAD) : An aDNA aware short-read mapper 15 | - [perbase](https://github.com/sstadick/perbase) : Per-base per-nucleotide depth analysis 16 | - [rustybam](https://github.com/mrvollger/rustybam) : bioinformatics toolkit in rust 17 | 18 | ##### csv 19 | 20 | - [csview](https://github.com/wfxr/csview) : 📠 Pretty and fast csv viewer for cli with cjk/emoji support 21 | - [csvlens](https://github.com/YS-L/csvlens) : csvlens is a command line CSV file viewer. It is like less but made for CSV. 
22 | - [madato](https://github.com/inosion/madato) : Markdown Cmd Line, Rust and JS library for Excel to Markdown Tables 23 | - [qsv](https://github.com/dathere/qsv) : Blazing-fast Data-Wrangling toolkit 24 | - [rsv](https://github.com/ribbondz/rsv) : A command-line tool written in Rust for analyzing CSV, TXT, and Excel files. 25 | - [tabiew](https://github.com/shshemi/tabiew) : A lightweight TUI app to view and query CSV files 26 | - [tv](https://github.com/alexhallam/tv) : 📺(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment. 27 | - [xan](https://github.com/medialab/xan) : The CSV magician 28 | - [xsv](https://github.com/BurntSushi/xsv) : A fast CSV command line toolkit written in Rust.   29 | - [xtab](https://github.com/sharkLoc/xtab) : CSV command line utilities 30 | 31 | ##### dna 32 | 33 | - [biotools](https://github.com/jimrybarski/biotools) : Command line bioinformatics functions 34 | - [darwin](https://github.com/Ebedthan/darwin) : Create (rapid) neighbor-joining tree from sequences using mash distance 35 | - [fakit](https://github.com/sharkLoc/fakit) : fakit: a simple program for fasta file manipulation 36 | - [filterx](https://github.com/dwpeng/filterx) : process any file in tabular format. Fasta/fastq/GTF/GFF/VCF/SAM/BED 37 | - [fq](https://github.com/stjude-rust-labs/fq) : Command line utility for manipulating Illumina-generated FASTQ files. 38 | - [gsearch](https://github.com/jean-pierreboth/gsearch) : Approximate nearest neighbour search for microbial genomes based on hash metric 39 | - [Hyper-Gen](https://github.com/wh-xu/Hyper-Gen) : HyGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors 40 | - [kanpig](https://github.com/ACEnglish/kanpig) : Kmer Analysis of Pileups for Genotyping 41 | - [kfc](https://github.com/lrobidou/KFC) : KFC (K-mer Fast Counter) is a fast and space-efficient k-mer counter based on hyper-k-mers. 42 | - [ngs](https://github.com/stjude-rust-labs/ngs) : Command line utility for working with next-generation sequencing files. 43 | - [nail](https://github.com/TravisWheelerLab/nail) : Nail is an Alignment Inference tooL 44 | - [palindrome-finder](https://github.com/brianli314/palindrome-finder) : A bioinformatics tool written in Rust to find palindromic sequences in DNA 45 | - [poasta](https://github.com/broadinstitute/poasta) : Fast and exact gap-affine partial order alignment 46 | - [psdm](https://github.com/mbhall88/psdm) : Compute a pairwise SNP distance matrix from one or two alignment(s) 47 | - [rust-bio-tools](https://github.com/rust-bio/rust-bio-tools) : A set of command line utilities based on Rust-Bio. 48 | - [ska](https://github.com/bacpop/ska.rust) : Split k-mer analysis – version 2 49 | - [skc](https://github.com/mbhall88/skc) : Shared k-mer content between two genomes 50 | - [sketchy](https://github.com/esteinig/sketchy) : Genomic neighbor typing of bacterial pathogens using MinHash 🐀 51 | - [tidk](https://github.com/tolkit/telomeric-identifier) : Identify and find telomeres, or telomeric repeats in a genome. 
52 | - [transanno](https://github.com/informationsea/transanno) : accurate LiftOver tool for new genome assemblies 
53 | - [xgt](https://github.com/Ebedthan/xgt) : Efficient and fast querying and parsing of GTDB's data 
54 | 
55 | ##### fastq 
56 | 
57 | - [deacon](https://github.com/bede/deacon) : Fast (host) DNA sequence filtering 
58 | - [fasten](https://github.com/lskatz/fasten) : 👷 Fasten toolkit, for streaming operations on fastq files 
59 | - [faster](https://github.com/angelovangel/faster) : A (very) fast program for getting statistics about a fastq file, the way I need them, written in Rust 
60 | - [fqgrep](https://github.com/fulcrumgenomics/fqgrep) : Grep for FASTQ files 
61 | - [fqkit](https://github.com/sharkLoc/fqkit) : 🦀 Fqkit: A simple and cross-platform program for fastq file manipulation 
62 | - [fqtk](https://github.com/fulcrumgenomics/fqtk) : Fast FASTQ sample demultiplexing in Rust. 
63 | - [grepq](https://github.com/rbfinch/grepq): quickly filter fastq files by matching sequences to a set of regex patterns 
64 | - [guide-counter](https://github.com/fulcrumgenomics/guide-counter) : A better, faster way to count guides in CRISPR screens. 
65 | - [K2Rmini](https://github.com/Malfoy/K2Rmini) : K2Rmini (or K-mer to Reads mini) is a tool to filter the reads contained in a FASTA/Q file based on a set of k-mers of interest. 
66 | - [kractor](https://github.com/sam-sims/kractor) : Rapidly extract reads from a FASTQ file based on taxonomic classification via Kraken2. 
67 | - [rasusa](https://github.com/mbhall88/rasusa) : Randomly subsample sequencing reads 
68 | - [SeqSizzle](https://github.com/ChangqingW/SeqSizzle) : SeqSizzle is a pager for viewing FASTQ files with fuzzy matching, allowing different adaptors to be colored differently. 
69 | - [sabreur](https://github.com/ebedthan/sabreur) : fast, reliable and handy demultiplexing tool for fastx files 
70 | 
71 | ##### format 
72 | - [atlas](https://github.com/stjude-rust-labs/atlas) : Enables storing, querying, transforming, and visualizing of multidimensional count data. 
73 | - [bigtools](https://github.com/jackh726/bigtools) : A high-performance BigWig and BigBed library in Rust 
74 | - [biotest](https://github.com/natir/biotest) : Generate random test data for bioinformatics 
75 | - [bqtools](https://github.com/arcinstitute/bqtools) : A command line utility for working with BINSEQ files 
76 | - [cigzip](https://github.com/AndreaGuarracino/cigzip) : A tool for compression and decompression of alignment CIGAR strings using tracepoints. 
77 | - [d4tools](https://github.com/38/d4-format) : The D4 Quantitative Data Format 
78 | - [gfa2bin](https://github.com/MoinSebi/gfa2bin) : Convert various graph-related data to PLINK files. In addition, we offer multiple commands for filtering or modifying the generated PLINK files. 
79 | - [gia](https://github.com/noamteyssier/gia) : gia: Genomic Interval Arithmetic 
80 | - [granges](https://github.com/vsbuffalo/granges) : A Rust library and command line tool for working with genomic ranges and their data. 
81 | - [intspan](https://github.com/wang-q/intspan) : Command line tools for IntSpan related bioinformatics operations 
82 | - [nuc2bit](https://github.com/natir/nuc2bit) : A rust crate that provides methods for rapidly encoding and decoding nucleotides in 2-bit representation. 
83 | - [recmap](https://github.com/vsbuffalo/recmap) : A command line tool and Rust library for working with recombination maps.
84 | - [transanno](https://github.com/informationsea/transanno) : accurate LiftOver tool for new genome assemblies 
85 | - [thirdkind](https://github.com/simonpenel/thirdkind) : Drawing reconciled phylogenetic trees allowing 1, 2 or 3 reconciliation levels 
86 | - [xsra](https://github.com/ArcInstitute/xsra) : An efficient CLI to extract sequences from the SRA 
87 | 
88 | 
89 | ##### gff3 
90 | 
91 | - [atg](https://github.com/anergictcell/atg) : A Rust library and CLI tool to handle genomic transcripts 
92 | - [gffkit](https://github.com/sharkloc/gffkit) : a simple program for gff3 file manipulation 
93 | 
94 | ##### longreads 
95 | 
96 | - [Autocycler](https://github.com/rrwick/Autocycler) : A tool for generating consensus long-read assemblies for bacterial genomes 
97 | - [chopper](https://github.com/wdecoster/chopper) : Rust implementation of [NanoFilt](https://github.com/wdecoster/nanofilt)+[NanoLyse](https://github.com/wdecoster/nanolyse), both originally written in Python. This tool, intended for long read sequencing such as PacBio or ONT, filters and trims a fastq file. 
98 | - [DeepChopper](https://github.com/ylab-hi/DeepChopper) : Language models identify chimeric artificial reads in NanoPore direct-RNA sequencing data. 
99 | - [fpa](https://github.com/natir/fpa) : Filter of Pairwise Alignment 
100 | - [herro](https://github.com/lbcb-sci/herro) : HERRO is a highly-accurate, haplotype-aware, deep-learning tool for error correction of Nanopore R10.4.1 or R9.4.1 reads (read length of >= 10 kbp is recommended). 
101 | - [HiPhase](https://github.com/PacificBiosciences/HiPhase) : Small variant, structural variant, and short tandem repeat phasing tool for PacBio HiFi reads 
102 | - [longshot](https://github.com/pjedge/longshot) : diploid SNV caller for error-prone reads 
103 | - [lrge](https://github.com/mbhall88/lrge) : Genome size estimation from long read overlaps 
104 | - [myloasm](https://github.com/bluenote-1577/myloasm) : A new high-resolution long-read metagenome assembler for even noisy reads 
105 | - [Polypolish](https://github.com/rrwick/Polypolish) : a short-read polishing tool for long-read assemblies 
106 | - [nextpolish2](https://github.com/Nextomics/NextPolish2) : Repeat-aware polishing of genomes assembled using HiFi long reads 
107 | - [nanoq](https://github.com/esteinig/nanoq) : Minimal but speedy quality control for nanopore reads in Rust 🐻 
108 | - [smrest](https://github.com/jts/smrest) : Tumour-only somatic mutation calling using long reads 
109 | - [trgt](https://github.com/PacificBiosciences/trgt) : Tandem repeat genotyping and visualization from PacBio HiFi data 
110 | - [yacrd](https://github.com/natir/yacrd) : Yet Another Chimeric Read Detector 
111 | 
112 | 
113 | 
114 | ##### metagenomics 
115 | 
116 | - [coverm](https://github.com/wwood/CoverM) : Read coverage calculator for metagenomics 
117 | - [galah](https://github.com/wwood/galah) : More scalable dereplication for metagenome assembled genomes 
118 | - [hyperex](https://github.com/Ebedthan/hyperex) : Hypervariable region primer-based extractor for 16S rRNA and other SSU/LSU sequences.
119 | - [kun_peng](https://github.com/eric9n/Kun-peng) : Kun-peng: an ultra-fast, low-memory footprint and accurate taxonomy classifier for all 
120 | - [kmertools](https://github.com/anuradhawick/kmertools) : kmer based feature extraction tool for bioinformatics, metagenomics, AI/ML and more 
121 | - [kmerutils](https://github.com/jean-pierreBoth/kmerutils) : Kmer generating, counting, hashing and related 
122 | - [Lorikeet](https://github.com/rhysnewell/Lorikeet) : Strain resolver for metagenomics 
123 | - [nohuman](https://github.com/mbhall88/nohuman) : Remove human reads from a sequencing run 
124 | - [rosella](https://github.com/rhysnewell/rosella) : Metagenomic Binning Algorithm 
125 | - [skani](https://github.com/bluenote-1577/skani) : Fast, robust ANI and aligned fraction for (metagenomic) genomes and contigs. 
126 | - [sourmash](https://github.com/sourmash-bio/sourmash) : Quickly search, compare, and analyze genomic and metagenomic data sets. 
127 | - [sylph](https://github.com/bluenote-1577/sylph) : ultrafast genome querying and taxonomic profiling for metagenomic samples by abundance-corrected minhash. 
128 | - [vircov](https://github.com/esteinig/vircov) : Viral genome coverage evaluation for metagenomic diagnostics 🩸 
129 | 
130 | ##### pangenomics 
131 | 
132 | - [impg](https://github.com/pangenome/impg) : implicit pangenome graph 
133 | - [panacus](https://github.com/marschall-lab/panacus) : Panacus is a tool for computing statistics for GFA-formatted pangenome graphs 
134 | 
135 | ##### phylogenomics 
136 | 
137 | - [nextclade](https://github.com/nextstrain/nextclade) : Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement 
138 | - [nwr](https://github.com/wang-q/nwr) : nwr is a command line tool for working with NCBI taxonomy, Newick files and assembly reports 
139 | - [unicore](https://github.com/steineggerlab/unicore) : Universal and efficient core gene phylogeny with Foldseek and ProstT5 
140 | - [segul](https://github.com/hhandika/segul) : An ultrafast and memory efficient tool for phylogenomics 
141 | 
142 | ##### proteomics 
143 | 
144 | - [align-cli](https://github.com/snijderlab/align-cli) : A CLI for pairwise alignment of sequences, using both normal and mass based alignment. 
145 | - [daedalus](https://github.com/David-OConnor/daedalus) : Protein and molecule viewer 
146 | - [folddisco](https://github.com/steineggerlab/folddisco) : Fast indexing and search of discontinuous motifs in protein structures 
147 | - [foldmason](https://github.com/steineggerlab/foldmason) : Foldmason builds multiple alignments of large structure sets. 
148 | - [sage](https://github.com/lazear/sage) : Proteomics search & quantification so fast that it feels like magic 
149 | 
150 | ##### rna 
151 | - [oarfish](https://github.com/COMBINE-lab/oarfish) : long read RNA-seq quantification 
152 | - [rnapkin](https://github.com/ukmrs/rnapkin) : drawing RNA secondary structure with style; instantly 
153 | - [R2Dtool](https://github.com/comprna/R2Dtool) : R2Dtool is a set of genomics utilities for handling, integrating, and visualising isoform-mapped RNA feature data.
154 | - [squab](https://github.com/zaeleus/squab) : Alignment-based gene expression quantification 155 | 156 | ##### singlecell 157 | 158 | - [adview](https://github.com/JianYang-Lab/adview) : Adata Viewer: Head/Less/Shape h5ad file in terminal 159 | - [alevin-fry](https://github.com/COMBINE-lab/alevin-fry) : 🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding. 160 | - [cellranger](https://github.com/10XGenomics/cellranger) : 10x Genomics Single Cell Analysis 161 | - [precellar](https://github.com/regulatory-genomics/precellar) : Single-cell genomics preprocessing package 162 | - [proseg](https://github.com/dcjones/proseg) : Probabilistic cell segmentation for in situ spatial transcriptomics 163 | - [SnapATAC2](https://github.com/kaizhang/SnapATAC2) : Single-cell epigenomics analysis tools 164 | 165 | ##### slurm 166 | 167 | - [ssubmit](https://github.com/mbhall88/ssubmit) : Submit slurm sbatch jobs without the need to create a script 168 | 169 | ##### vcf 170 | 171 | - [echtvar](https://github.com/brentp/echtvar) : using all the bits for echt rapid variant annotation and filtering 172 | - [gvcf_norm](https://github.com/mlin/gvcf_norm) : gVCF allele normalizer 173 | - [mehari](https://github.com/varfish-org/mehari): VEP-like tool for sequence ontology and HGVS annotation of VCF files 174 | - [vcf2parquet](https://github.com/natir/vcf2parquet) : Convert vcf in parquet 175 | - [vcfexpress](https://github.com/brentp/vcfexpress) : expressions on VCFs 176 | 177 | 178 | ##### Gui 179 | 180 | - [plascad](https://github.com/David-OConnor/plascad) : Design software for plasmid (vector) and primer creation and validation. Edit plasmids, perform PCR-based cloning, digest and ligate DNA fragments, and display details about expressed proteins. Integrates with online resources like NCBI and PDB. 181 | 182 | ##### other 183 | - [biobear](https://github.com/wheretrue/biobear) : Work with bioinformatic files using Arrow, Polars, and/or DuckDB 184 | - [binseq](https://github.com/ArcInstitute/binseq) : A high efficiency binary format for sequencing data 185 | - [ggetrs](https://github.com/noamteyssier/ggetrs) : Efficient querying of biological databases 186 | - [htsget-rs](https://github.com/umccr/htsget-rs) : A server implementation of the htsget protocol for bioinformatics in Rust 187 | - [ibu](https://github.com/noamteyssier/ibu) : a rust library for high throughput binary encoding of genomic sequences 188 | - [scidataflow](https://github.com/vsbuffalo/scidataflow): Command line scientific data management tool 189 | - [sufr](https://github.com/TravisWheelerLab/sufr) : Parallel Construction of Suffix Arrays in Rust 190 | 191 | 192 | 193 | 194 | ## Starchart 195 | Stargazers over time 196 | -------------------------------------------------------------------------------- /bam/best.md: -------------------------------------------------------------------------------- 1 | # best 2 | 3 | Bam Error Stats Tool (best): analysis of error types in aligned reads. 4 | 5 | `best` is used to assess the quality of reads after aligning them to a 6 | reference assembly. 
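A typical invocation (the same command documented under Installing below) takes an aligned BAM, the reference it was aligned to, and an output prefix; the file names here are placeholders:

```
# writes stats files using the given prefix
best aligned_reads.bam reference.fasta prefix/path
```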
7 | 8 | ## Features 9 | 10 | * Collect overall and per alignment stats 11 | * Distribution of indel lengths 12 | * Yield at different empirical Q-value thresholds 13 | * Bin per read stats to easily examine the distribution of errors for certain 14 | types of reads 15 | * Stats for regions specified by intervals (BED file, homopolymer regions, 16 | windows etc.) 17 | * Stats for quality scores vs empirical Q-values 18 | * Multithreading for speed 19 | 20 | ## Usage 21 | 22 | The [`best` Usage Guide](Usage.md) gives an overview of how to use `best`. 23 | 24 | ## Installing 25 | 26 | 1. Install [Rust](https://www.rust-lang.org/tools/install). 27 | 2. Clone this repository and navigate into the directory of this repository. 28 | 3. Run `cargo install --locked --path .` 29 | 4. Run `best input.bam reference.fasta prefix/path` 30 | 31 | This will generate stats files with the `prefix/path` prefix. 32 | 33 | ## Development 34 | 35 | ### Running 36 | 37 | 1. Install [Rust](https://www.rust-lang.org/tools/install). 38 | 2. Clone this repository and navigate into the directory of this repository. 39 | 3. Run `cargo build --release` 40 | 4. Run `cargo run --release -- input.bam reference.fasta prefix/path` or 41 | `target/release/best input.bam reference.fasta prefix/path` 42 | 43 | This will generate stats files with the `prefix/path` prefix. 44 | 45 | The built binary is located at `target/release/best`. 46 | 47 | ### Formatting 48 | 49 | ``` 50 | cargo fmt 51 | ``` 52 | 53 | ### Comparing 54 | 55 | Remember to pass the `-t 1` option to ensure that only one thread is used for 56 | testing. Best generally tries to ensure the order of outputs is deterministic 57 | with multiple threads, but the order of per-alignment stats is arbitrary unless 58 | only one thread is used. 59 | 60 | ### Disclaimer 61 | 62 | This is not an official Google product. 63 | 64 | The code is not intended for use in any clinical settings. It is not intended to be a medical device and is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis. 65 | 66 | No representations or warranties are made with regards to the accuracy of results generated. User or licensee is responsible for verifying and validating accuracy when using this tool. 67 | -------------------------------------------------------------------------------- /bam/modkit.md: -------------------------------------------------------------------------------- 1 | ![Oxford Nanopore Technologies logo](https://github.com/nanoporetech/modkit/blob/master/ONT_logo_590x106.png?raw=true) 2 | 3 | # Modkit 4 | 5 | A bioinformatics tool for working with modified bases from Oxford Nanopore. Specifically for converting modBAM 6 | to bedMethyl files using best practices, but also manipulating modBAM files and generating summary statistics. 7 | Detailed documentation and quick-start can be found in the [online documentation](https://nanoporetech.github.io/modkit/). 8 | 9 | ## Installation 10 | 11 | Pre-compiled binaries are provided for Linux from the [release page](https://github.com/nanoporetech/modkit/releases). We recommend the use of these in most circumstances. 12 | 13 | ### Building from source 14 | 15 | The provided packages should be used where possible. We understand that some users may wish to compile the software from its source code. To build `modkit` from source [cargo](https://www.rust-lang.org/learn/get-started) should be used. 
16 | 
17 | ```bash 
18 | git clone https://github.com/nanoporetech/modkit.git 
19 | cd modkit 
20 | cargo install --path . 
21 | # or 
22 | cargo install --git https://github.com/nanoporetech/modkit.git 
23 | ``` 
24 | 
25 | ## Usage 
26 | 
27 | Modkit comprises a suite of tools for manipulating modified-base data stored in [BAM](http://www.htslib.org/) files. Modified base information is stored in the `MM` and `ML` tags (see section 1.7 of the [SAM tags](https://samtools.github.io/hts-specs/SAMtags.pdf) specification). These tags are produced by contemporary basecallers of data from Oxford Nanopore Technologies sequencing platforms. 
28 | 
29 | ### Constructing bedMethyl tables 
30 | 
31 | A primary use of `modkit` is to create summary counts of modified and unmodified bases in an extended [bedMethyl](https://www.encodeproject.org/data-standards/wgbs/) format. bedMethyl files tabulate the counts of base modifications from every sequencing read over each reference genomic position. 
32 | 
33 | In its simplest form `modkit` creates a bedMethyl file using the following: 
34 | 
35 | ```bash 
36 | modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log 
37 | ``` 
38 | 
39 | No reference sequence is required. A single file (described [below](#description-of-bedmethyl-output)) with base count summaries will be created. The final argument here specifies an optional log file output. 
40 | 
41 | The program performs best-practices filtering and manipulation of the raw data stored in the input file. For further details see [filtering modified-base calls](./book/src/filtering.md). 
42 | 
43 | For user convenience the counting process can be modulated using several additional transforms and filters. The most basic of these is to report only counts from reference CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs in the reference: 
44 | 
45 | ```bash 
46 | modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta 
47 | ``` 
48 | 
49 | The program also contains a range of presets which combine several options for ease of use. The `traditional` preset, 
50 | 
51 | ```bash 
52 | modkit pileup path/to/reads.bam output/path/pileup.bed \ 
53 | --ref path/to/reference.fasta \ 
54 | --preset traditional 
55 | ``` 
56 | 
57 | performs three transforms: 
58 | 
59 | * restricts output to locations where there is a CG dinucleotide in 
60 | the reference, 
61 | * reports only C and 5mC counts, using procedures to take into account counts of other forms of cytosine modification (notably 5hmC), and 
62 | * aggregates data across strands. The strand field of the output will be marked as '.' indicating that the strand information has been lost. 
63 | 
64 | Using this preset is equivalent to running with the options: 
65 | 
66 | ```bash 
67 | modkit pileup --cpg --ref <reference.fasta> --ignore h --combine-strands 
68 | ``` 
69 | 
70 | For more information on the individual options see the [Advanced Usage](./book/src/advanced_usage.md) help document. 
71 | 
72 | ## Description of bedMethyl output 
73 | 
74 | Below is a description of the bedMethyl columns generated by `modkit pileup`. A brief description of the 
75 | bedMethyl specification can be found on [Encode](https://www.encodeproject.org/data-standards/wgbs/). 
76 | 
77 | ### Definitions: 
78 | 
79 | * N``mod`` - Number of calls passing filters that were classified as a residue with a specified base modification. 
80 | * N``canonical`` - Number of calls passing filters that were classified as the canonical base rather than modified.
The 
81 | exact base must be inferred by the modification code. For example, if the modification code is `m` (5mC) then 
82 | the canonical base is cytosine. If the modification code is `a`, the canonical base is adenine. 
83 | * N``other_mod`` - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with 
84 | `h` calls, 1 with a canonical call, and 2 with `m` calls. In the bedMethyl row for `h` N``other_mod`` would be 2. In the 
85 | `m` row N``other_mod`` would be 3. 
86 | * N``valid_cov`` - the valid coverage. N``valid_cov`` = N``mod`` + N``other_mod`` + N``canonical``; also used as the `score` in the bedMethyl 
87 | * N``diff`` - Number of reads with a base other than the canonical base for this modification. For example, in a row 
88 | for `h` the canonical base is cytosine; if there are 2 reads with C->A substitutions, N``diff`` will be 2. 
89 | * N``delete`` - Number of reads with a deletion at this reference position 
90 | * N``fail`` - Number of calls where the probability of the call was below the threshold. The threshold can be 
91 | set on the command line or computed from the data (usually failing the lowest 10th percentile of calls). 
92 | * N``nocall`` - Number of reads aligned to this reference position, with the correct canonical base, but without a base 
93 | modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a 
94 | CG->CH substitution such that no modification call was produced by the basecaller. 
95 | 
96 | ### bedMethyl column descriptions 
97 | 
98 | | column | name | description | type | 
99 | | ------ | ----------------------------- | ------------------------------------------------------------------------------ | ----- | 
100 | | 1 | chrom | name of reference sequence from BAM header | str | 
101 | | 2 | start position | 0-based start position | int | 
102 | | 3 | end position | 0-based exclusive end position | int | 
103 | | 4 | modified base code | single letter code for modified base | str | 
104 | | 5 | score | Equal to N``valid_cov``. | int | 
105 | | 6 | strand | '+' for positive strand '-' for negative strand, '.' when strands are combined | str | 
106 | | 7 | start position | included for compatibility | int | 
107 | | 8 | end position | included for compatibility | int | 
108 | | 9 | color | included for compatibility, always 255,0,0 | str | 
109 | | 10 | N``valid_cov`` | See definitions above. | int | 
110 | | 11 | fraction modified | N``mod`` / N``valid_cov`` | float | 
111 | | 12 | N``mod`` | See definitions above. | int | 
112 | | 13 | N``canonical`` | See definitions above. | int | 
113 | | 14 | N``other_mod`` | See definitions above. | int | 
114 | | 15 | N``delete`` | See definitions above. | int | 
115 | | 16 | N``fail`` | See definitions above. | int | 
116 | | 17 | N``diff`` | See definitions above. | int | 
117 | | 18 | N``nocall`` | See definitions above. | int | 
118 | 
119 | ## Description of columns in `modkit summary`: 
120 | 
121 | ### Totals table 
122 | 
123 | The lines of the totals table are prefixed with a `#` character. 
124 | 
125 | | row | name | description | type | 
126 | | --- | ----------------------- | ----------------------------------------------------------------------- | ----- | 
127 | | 1 | bases | comma-separated list of canonical bases with modification calls.
| str | 
128 | | 2 | total_reads_used | total number of reads from which base modification calls were extracted | int | 
129 | | 3+ | count_reads_{base} | total number of reads that contained base modifications for {base} | int | 
130 | | 4+ | filter_threshold_{base} | filter threshold used for {base} | float | 
131 | 
132 | ### Modification calls table 
133 | 
134 | The modification calls table follows immediately after the totals table. 
135 | 
136 | | column | name | description | type | 
137 | | ------ | ---------- | ---------------------------------------------------------------------------------------- | ----- | 
138 | | 1 | base | canonical base with modification call | char | 
139 | | 2 | code | base modification code, or `-` for canonical | char | 
140 | | 3 | pass_count | total number of passing (confidence >= threshold) calls for the modification in column 2 | int | 
141 | | 4 | pass_frac | fraction of passing (>= threshold) calls for the modification in column 2 | float | 
142 | | 5 | all_count | total number of calls for the modification code in column 2 | int | 
143 | | 6 | all_frac | fraction of all calls for the modification in column 2 | float | 
144 | 
145 | ## Advanced usage examples 
146 | 
147 | For complete usage instructions please see the command-line help of the program or the [Advanced usage](./book/src/advanced_usage.md) help documentation. Some more commonly required examples are provided below. 
148 | 
149 | To combine multiple base modification calls into one, for example to combine basecalls for both 5hmC and 5mC into a count for "all cytosine modifications" (with code `C`), the `--combine-mods` option can be used: 
150 | 
151 | ```bash 
152 | modkit pileup path/to/reads.bam output/path/pileup.bed --combine-mods 
153 | ``` 
154 | 
155 | In standard usage the `--preset traditional` option can be used as outlined in the [Usage](#usage) section. By more directly specifying individual options we can perform something similar without loss of information for 5hmC data stored in the input file: 
156 | 
157 | ```bash 
158 | modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta \ 
159 | --combine-strands 
160 | ``` 
161 | 
162 | To produce a bedGraph file for each modification in the BAM file the `--bedgraph` option can be given. Counts for the positive and negative strands will be put in separate files. 
163 | 
164 | ```bash 
165 | modkit pileup path/to/reads.bam output/directory/path --bedgraph <--prefix string> 
166 | ``` 
167 | 
168 | The `--prefix [str]` option allows a prefix to be specified for the output file names. 
169 | 
170 | **Licence and Copyright** 
171 | 
172 | (c) 2023 Oxford Nanopore Technologies Plc. 
173 | 
174 | Modkit is distributed under the terms of the Oxford Nanopore Technologies, Ltd. Public License, v. 1.0.
175 | If a copy of the License was not distributed with this file, You can obtain one at http://nanoporetech.com 
-------------------------------------------------------------------------------- /bam/rustybam.md: -------------------------------------------------------------------------------- 
1 | # rustybam 
2 | 
3 | [![Actions Status](https://github.com/mrvollger/rustybam/workflows/Test%20and%20Build/badge.svg)](https://github.com/mrvollger/rustybam/actions) 
4 | [![Actions Status](https://github.com/mrvollger/rustybam/workflows/Formatting/badge.svg)](https://github.com/mrvollger/rustybam/actions) 
5 | [![Actions Status](https://github.com/mrvollger/rustybam/workflows/Clippy/badge.svg)](https://github.com/mrvollger/rustybam/actions) 
6 | 
7 | [![Conda (channel only)](https://img.shields.io/conda/vn/bioconda/rustybam?color=green)](https://anaconda.org/bioconda/rustybam) 
8 | [![Downloads](https://img.shields.io/conda/dn/bioconda/rustybam?color=green)](https://anaconda.org/bioconda/rustybam) 
9 | 
10 | [![crates.io version](https://img.shields.io/crates/v/rustybam)](https://crates.io/crates/rustybam) 
11 | [![crates.io downloads](https://img.shields.io/crates/d/rustybam?color=orange&label=downloads)](https://crates.io/crates/rustybam) 
12 | 
13 | [![DOI](https://zenodo.org/badge/351639424.svg)](https://zenodo.org/badge/latestdoi/351639424) 
14 | 
15 | `rustybam` is a bioinformatics toolkit written in the `rust` programming language focused on manipulation of alignment (`bam` and `PAF`), annotation (`bed`), and sequence (`fasta` and `fastq`) files. If your alignment is in a different format, check out whether [wgatools](https://github.com/wjwei-handsome/wgatools) can convert it for you! 
16 | 
17 | ## What can rustybam do? 
18 | 
19 | Here is a commented example that highlights some of the better features of `rustybam`, and demonstrates how each result can be read directly into another subcommand. 
20 | 
21 | ```bash 
22 | rb trim-paf .test/asm_small.paf `#trims back alignments that align the same query sequence more than once` \ 
23 | | rb break-paf --max-size 100 `#breaks the alignment into smaller pieces on indels of 100 bases or more` \ 
24 | | rb orient `#orients each contig so that the majority of bases are forward aligned` \ 
25 | | rb liftover --bed <(printf "chr22\t12000000\t13000000\n") `#subsets and trims the alignment to 1 Mbp of chr22.` \ 
26 | | rb filter --paired-len 10000 `#filters for query sequences that have at least 10,000 bases aligned to a target across all alignments.` \ 
27 | | rb stats --paf `#calculates statistics from the trimmed paf file` \ 
28 | | less -S 
29 | ``` 
30 | 
31 | ## Usage 
32 | 
33 | ```shell 
34 | rustybam [OPTIONS] <SUBCOMMAND> 
35 | ``` 
36 | 
37 | or 
38 | 
39 | ```shell 
40 | rb [OPTIONS] <SUBCOMMAND> 
41 | ``` 
42 | 
43 | ### Subcommands 
44 | 
45 | The full manual of subcommands can be found on the [docs](https://docs.rs/rustybam/latest/rustybam/cli/enum.Commands.html).
46 | 
47 | ```shell 
48 | SUBCOMMANDS: 
49 | stats Get percent identity stats from a sam/bam/cram or PAF 
50 | bed-length Count the number of bases in a bed file [aliases: bedlen, bl, bedlength] 
51 | filter Filter PAF records in various ways 
52 | invert Invert the target and query sequences in a PAF along with the CIGAR string 
53 | liftover Liftover target sequence coordinates onto query sequence using a PAF 
54 | trim-paf Trim paf records that overlap in query sequence [aliases: trim, tp] 
55 | orient Orient paf records so that most of the bases are in the forward direction 
56 | break-paf Break PAF records with large indels into multiple records (useful for 
57 | SafFire) [aliases: breakpaf, bp] 
58 | paf-to-sam Convert a PAF file into a SAM file. Warning, all alignments will be marked as 
59 | primary! [aliases: paftosam, p2s, paf2sam] 
60 | fasta-split Reads in a fasta from stdin and divides into files (can compress by adding 
61 | .gz) [aliases: fastasplit, fasplit] 
62 | fastq-split Reads in a fastq from stdin and divides into files (can compress by adding 
63 | .gz) [aliases: fastqsplit, fqsplit] 
64 | get-fasta Mimic bedtools getfasta but allow for bgzip in both bed and fasta inputs 
65 | [aliases: getfasta, gf] 
66 | nucfreq Get the frequencies of each bp at each position 
67 | repeat Report the longest exact repeat length at every position in a fasta 
68 | suns Extract the intervals in a genome (fasta) that are made up of SUNs 
69 | help Print this message or the help of the given subcommand(s) 
70 | ``` 
71 | 
72 | ## Install 
73 | 
74 | ### conda 
75 | 
76 | ```shell 
77 | mamba install -c bioconda rustybam 
78 | ``` 
79 | 
80 | ### cargo 
81 | 
82 | ```shell 
83 | cargo install rustybam 
84 | ``` 
85 | 
86 | ### Pre-compiled binaries 
87 | 
88 | Download from [releases](https://github.com/mrvollger/rustybam/releases) (may be slower than locally compiled versions). 
89 | 
90 | ### Source 
91 | 
92 | ```shell 
93 | git clone https://github.com/mrvollger/rustybam.git 
94 | cd rustybam 
95 | cargo build --release 
96 | ``` 
97 | 
98 | and the executables will be built here: 
99 | 
100 | ```shell 
101 | target/release/{rustybam,rb} 
102 | ``` 
103 | 
104 | ## Examples 
105 | 
106 | ### PAF or BAM statistics 
107 | 
108 | For BAM files with extended cigar operations we can calculate statistics about the alignment and report them in BED format. 
109 | 
110 | ```shell 
111 | rustybam stats {input.bam} > {stats.bed} 
112 | ``` 
113 | 
114 | The same can be done with PAF files as long as they are generated with `-c --eqx`. 
115 | 
116 | ```shell 
117 | rustybam stats --paf {input.paf} > {stats.bed} 
118 | ``` 
119 | 
120 | ### PAF liftovers 
121 | 
122 | > I have a `PAF` and I want to subset it for just a particular region in the reference. 
123 | 
124 | With `rustybam` it's easy: 
125 | 
126 | ```shell 
127 | rustybam liftover \ 
128 | --bed <(printf "chr1\t0\t250000000\n") \ 
129 | input.paf > trimmed.paf 
130 | ``` 
131 | 
132 | > But I also want the alignment statistics for the region. 
133 | 
134 | No problem, `rustybam liftover` does not just trim the coordinates but also the CIGAR 
135 | so it is ready for `rustybam stats`: 
136 | 
137 | ```shell 
138 | rustybam liftover \ 
139 | --bed <(printf "chr1\t0\t250000000\n") \ 
140 | input.paf \ 
141 | | rustybam stats --paf \ 
142 | > trimmed.stats.bed 
143 | ``` 
144 | 
145 | > Okay, but Evan asked for an "align slider" so I need to realign in chunks. 
146 | 
147 | No need, just make your `bed` query to `rustybam liftover` a set of sliding windows 
148 | and it will do the rest.
149 | 
150 | ```shell 
151 | rustybam liftover \ 
152 | --bed <(bedtools makewindows -w 100000 \ 
153 | <(printf "chr1\t0\t250000000\n") \ 
154 | ) \ 
155 | input.paf \ 
156 | | rustybam stats --paf \ 
157 | > trimmed.stats.bed 
158 | ``` 
159 | 
160 | You can also use `rustybam breakpaf` to break up paf records at indels above a certain size to 
161 | get more "miropeats"-like intervals. 
162 | 
163 | ```shell 
164 | rustybam breakpaf --max-size 1000 input.paf \ 
165 | | rustybam liftover \ 
166 | --bed <(printf "chr1\t0\t250000000\n") \ 
167 | | rustybam stats --paf \ 
168 | > trimmed.stats.bed 
169 | ``` 
170 | 
171 | > Yeah but how do I visualize the data? 
172 | 
173 | Try out 
174 | [SafFire](https://mrvollger.github.io/SafFire/)! 
175 | 
176 | ### Align once 
177 | 
178 | At the boundaries of CNVs and inversions minimap2 may align the same section of query sequence to multiple stretches of the target sequence. This utility uses the CIGAR strings of PAF alignments (which must be generated with `--eqx`) to determine an optimal split of the alignments such that no query base is aligned more than once. To do this the whole PAF file is loaded in memory and then overlaps are removed starting with the largest overlapping interval and iterating. 
179 | 
180 | ```bash 
181 | rb trim-paf {input.paf} > {trimmed.paf} 
182 | ``` 
183 | 
184 | Here is an example from the NOTCH2NL region comparing CHM1 against CHM13 before trimming: 
185 | ![](images/no-trim.svg) 
186 | 
187 | and after trimming 
188 | ![](images/trim.svg) 
189 | 
190 | ### Split fastx files 
191 | 
192 | Split a fasta file between `stdout` and two other files, one compressed and one uncompressed. 
193 | 
194 | ```shell 
195 | cat {input.fasta} | rustybam fasta-split two.fa.gz three.fa 
196 | ``` 
197 | 
198 | Split a fastq file between `stdout` and two other files, one compressed and one uncompressed. 
199 | 
200 | ```shell 
201 | cat {input.fastq} | rustybam fastq-split two.fq.gz three.fq 
202 | ``` 
203 | 
204 | ### Extract from a fasta 
205 | 
206 | This tool is designed to mimic `bedtools getfasta`, but it allows the fasta to be `bgzipped`. 
207 | 
208 | ```shell 
209 | samtools faidx {seq.fa(.gz)} 
210 | rb get-fasta --name --strand --bed {regions.of.interest.bed} --fasta {seq.fa(.gz)} 
211 | ``` 
212 | 
213 | ## TODO 
214 | 
215 | - [x] Add a `bedtools getfasta` like operation that actually works with bgzipped input. 
216 | - [ ] implement bed12/split 
217 | - [ ] Allow sam or paf for operations: 
218 | - [x] make a sam header from a PAF file 
219 | - [x] convert sam record to paf record 
220 | - [x] convert paf record to sam record 
221 | - [ ] make tools seamlessly work with sam and paf 
222 | - [ ] Add `D4` for Nucfreq. 
223 | - [ ] Finish implementing `suns`. 
224 | - [ ] Allow multiple input files in `bed-length` 
225 | - [ ] Start keeping a changelog 
-------------------------------------------------------------------------------- /csv/csview.md: --------------------------------------------------------------------------------
1 | # 📠 csview 
2 | 
3 | A high performance csv viewer with cjk/emoji support. 
4 | 
5 | Badges: CICD | License | Version | Platform 
6 | 
21 | ### Features 
22 | 
23 | * Small and *fast* (see [benchmarks](#benchmark) below). 
24 | * Memory efficient. 
25 | * Correctly align [CJK](https://en.wikipedia.org/wiki/CJK_characters) and emoji characters. 
26 | * Support `tsv` and custom delimiters. 
27 | * Support different styles, including markdown table. 
28 | 
29 | ### Usage 
30 | ``` 
31 | $ cat example.csv 
32 | Year,Make,Model,Description,Price 
33 | 1997,Ford,E350,"ac, abs, moon",3000.00 
34 | 1999,Chevy,"Venture ""Extended Edition""","",4900.00 
35 | 1999,Chevy,"Venture ""Extended Edition, Large""",,5000.00 
36 | 1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof",4799.00 
37 | 
38 | $ csview example.csv 
39 | ┌──────┬───────┬───────────────────────────────────┬───────────────────────────┬─────────┐ 
40 | │ Year │ Make │ Model │ Description │ Price │ 
41 | ├──────┼───────┼───────────────────────────────────┼───────────────────────────┼─────────┤ 
42 | │ 1997 │ Ford │ E350 │ ac, abs, moon │ 3000.00 │ 
43 | │ 1999 │ Chevy │ Venture "Extended Edition" │ │ 4900.00 │ 
44 | │ 1999 │ Chevy │ Venture "Extended Edition, Large" │ │ 5000.00 │ 
45 | │ 1996 │ Jeep │ Grand Cherokee │ MUST SELL! air, moon roof │ 4799.00 │ 
46 | └──────┴───────┴───────────────────────────────────┴───────────────────────────┴─────────┘ 
47 | 
48 | $ head /etc/passwd | csview -H -d: 
49 | ┌────────────────────────┬───┬───────┬───────┬────────────────────────────┬─────────────────┐ 
50 | │ root │ x │ 0 │ 0 │ │ /root │ 
51 | │ bin │ x │ 1 │ 1 │ │ / │ 
52 | │ daemon │ x │ 2 │ 2 │ │ / │ 
53 | │ mail │ x │ 8 │ 12 │ │ /var/spool/mail │ 
54 | │ ftp │ x │ 14 │ 11 │ │ /srv/ftp │ 
55 | │ http │ x │ 33 │ 33 │ │ /srv/http │ 
56 | │ nobody │ x │ 65534 │ 65534 │ Nobody │ / │ 
57 | │ dbus │ x │ 81 │ 81 │ System Message Bus │ / │ 
58 | │ systemd-journal-remote │ x │ 981 │ 981 │ systemd Journal Remote │ / │ 
59 | │ systemd-network │ x │ 980 │ 980 │ systemd Network Management │ / │ 
60 | └────────────────────────┴───┴───────┴───────┴────────────────────────────┴─────────────────┘ 
61 | ``` 
62 | 
63 | Run `csview --help` to view detailed usage. 
64 | 
65 | ### Installation 
66 | 
67 | #### On Arch Linux 
68 | 
69 | `csview` is available in the Arch User Repository. To install it from [AUR](https://aur.archlinux.org/packages/csview): 
70 | 
71 | ``` 
72 | yay -S csview 
73 | ``` 
74 | 
75 | #### On macOS 
76 | 
77 | You can install `csview` with Homebrew: 
78 | 
79 | ``` 
80 | brew install csview 
81 | ``` 
82 | 
83 | #### On NetBSD 
84 | 
85 | `csview` is available from the main pkgsrc Repositories. To install, simply run 
86 | 
87 | ``` 
88 | pkgin install csview 
89 | ``` 
90 | 
91 | or, if you prefer to build from source using [pkgsrc](https://pkgsrc.se/textproc/csview) on any of the supported platforms: 
92 | 
93 | ``` 
94 | cd /usr/pkgsrc/textproc/csview 
95 | make install 
96 | ``` 
97 | 
98 | #### On Windows 
99 | 
100 | You can install `csview` with [Scoop](https://scoop.sh/): 
101 | ``` 
102 | scoop install csview 
103 | ``` 
104 | 
105 | #### From binaries 
106 | 
107 | Pre-built versions of `csview` for various architectures are available at the [GitHub releases page](https://github.com/wfxr/csview/releases). 
108 | 
109 | *Note that you can try the `musl` version (which is statically linked) if you run into dependency-related errors.* 
110 | 
111 | #### From source 
112 | 
113 | `csview` is also published on [crates.io](https://crates.io).
If you have the latest Rust toolchain installed, you can use `cargo` to install it from source: 
114 | 
115 | ``` 
116 | cargo install --locked csview 
117 | ``` 
118 | 
119 | If you want the latest version, clone this repository and run `cargo build --release`. 
120 | 
121 | ### Benchmark 
122 | 
123 | - [small.csv](https://gist.github.com/wfxr/567e890d4db508b3c7630a96b703a57e#file-action-csv) (10 rows, 4 cols, 695 bytes): 
124 | 
125 | | Tool | Command | Mean Time | Min Time | Memory | 
126 | |:----------------------------------------------------------------------------------------:|---------------------------|----------:|----------:|----------:| 
127 | | [xsv](https://github.com/BurntSushi/xsv/tree/0.13.0) | `xsv table small.csv` | 2.0ms | 1.8ms | 3.9mb | 
128 | | [csview](https://github.com/wfxr/csview/tree/90ff90e26c3e4c4c37818d717555b3e8f90d27e3) | `csview small.csv` | **0.3ms** | **0.1ms** | **2.4mb** | 
129 | | [column](https://github.com/util-linux/util-linux/blob/stable/v2.37/text-utils/column.c) | `column -s, -t small.csv` | 1.3ms | 1.1ms | **2.4mb** | 
130 | | [csvlook](https://github.com/wireservice/csvkit/tree/1.0.6) | `csvlook small.csv` | 148.1ms | 142.4ms | 27.3mb | 
131 | 
132 | - [medium.csv](https://gist.github.com/wfxr/567e890d4db508b3c7630a96b703a57e#file-sample-csv) (10,000 rows, 10 cols, 624K bytes): 
133 | 
134 | | Tool | Command | Mean Time | Min Time | Memory | 
135 | |:----------------------------------------------------------------------------------------:|----------------------------|-----------:|-----------:|----------:| 
136 | | [xsv](https://github.com/BurntSushi/xsv/tree/0.13.0) | `xsv table medium.csv` | 0.031s | 0.029s | 4.4mb | 
137 | | [csview](https://github.com/wfxr/csview/tree/90ff90e26c3e4c4c37818d717555b3e8f90d27e3) | `csview medium.csv` | **0.017s** | **0.016s** | **2.8mb** | 
138 | | [column](https://github.com/util-linux/util-linux/blob/stable/v2.37/text-utils/column.c) | `column -s, -t medium.csv` | 0.052s | 0.050s | 9.9mb | 
139 | | [csvlook](https://github.com/wireservice/csvkit/tree/1.0.6) | `csvlook medium.csv` | 2.664s | 2.617s | 46.8mb | 
140 | 
141 | - `large.csv` (1,000,000 rows, 10 cols, 61M bytes, generated by concatenating [medium.csv](https://gist.github.com/wfxr/567e890d4db508b3c7630a96b703a57e#file-sample-csv) 100 times): 
142 | 
143 | | Tool | Command | Mean Time | Min Time | Memory | 
144 | |:----------------------------------------------------------------------------------------:|---------------------------|-----------:|-----------:|----------:| 
145 | | [xsv](https://github.com/BurntSushi/xsv/tree/0.13.0) | `xsv table large.csv` | 2.912s | 2.820s | 4.4mb | 
146 | | [csview](https://github.com/wfxr/csview/tree/90ff90e26c3e4c4c37818d717555b3e8f90d27e3) | `csview large.csv` | **1.686s** | **1.665s** | **2.8mb** | 
147 | | [column](https://github.com/util-linux/util-linux/blob/stable/v2.37/text-utils/column.c) | `column -s, -t large.csv` | 5.777s | 5.759s | 767.6mb | 
148 | | [csvlook](https://github.com/wireservice/csvkit/tree/1.0.6) | `csvlook large.csv` | 20.665s | 20.549s | 1105.7mb | 
149 | 
150 | ### F.A.Q. 
151 | 
152 | --- 
153 | #### We already have [xsv](https://github.com/BurntSushi/xsv), why build a new tool instead of contributing to it? 
154 | 
155 | `xsv` is great, but it's aimed at analyzing and manipulating csv data. 
156 | `csview` is designed for formatting and viewing. See also: [xsv/issues/156](https://github.com/BurntSushi/xsv/issues/156) 
157 | 
158 | --- 
159 | #### I encountered UTF-8 related errors, how do I solve them?
160 | 
161 | The file may use a non-UTF8 encoding. You can check the file encoding using the `file` command: 
162 | 
163 | ``` 
164 | $ file -i a.csv 
165 | a.csv: application/csv; charset=iso-8859-1 
166 | ``` 
167 | And then convert it to `utf8`: 
168 | 
169 | ``` 
170 | $ iconv -f iso-8859-1 -t UTF8//TRANSLIT a.csv -o b.csv 
171 | $ csview b.csv 
172 | ``` 
173 | 
174 | Or convert and view in one step: 
175 | 
176 | ``` 
177 | $ iconv -f iso-8859-1 -t UTF8//TRANSLIT a.csv | csview 
178 | ``` 
179 | 
180 | ### Credits 
181 | 
182 | * [csv-rust](https://github.com/BurntSushi/rust-csv) 
183 | * [prettytable-rs](https://github.com/phsym/prettytable-rs) 
184 | * [structopt](https://github.com/TeXitoi/structopt) 
185 | 
186 | ### License 
187 | 
188 | `csview` is distributed under the terms of both the MIT License and the Apache License 2.0. 
189 | 
190 | See the [LICENSE-APACHE](LICENSE-APACHE) and [LICENSE-MIT](LICENSE-MIT) files for license details. 
-------------------------------------------------------------------------------- /csv/madato.md: -------------------------------------------------------------------------------- 
1 | 
2 | # madato   [![Build Status]][travis] [![Latest Version]][crates.io] 
3 | 
4 | [Build Status]: https://travis-ci.org/inosion/madato.svg?branch=master 
5 | [travis]: https://travis-ci.org/inosion/madato 
6 | [Latest Version]: https://img.shields.io/crates/v/madato.svg 
7 | [crates.io]: https://crates.io/crates/madato 
8 | 
9 | ***madato is a library and command line tool for working with tabular data and Markdown*** 
10 | 
11 | Windows, Mac and Linux 
12 | 
13 | Converts XLSX and ODS Spreadsheets to 
14 | - JSON 
15 | - YAML 
16 | - Markdown 
17 | 
18 | ### TL;DR 
19 | 
20 | ``` 
21 | madato table -t XLSX -o JSON --sheetname Sheet2 path/to/workbook.xlsx 
22 | madato table -t XLSX -o MD --sheetname Sheet2 path/to/workbook.xlsx 
23 | madato table -t XLSX -o YAML --sheetname 'Annual Sales' path/to/workbook.xlsx 
24 | madato table -t XLSX -o YAML path/to/workbook.ods 
25 | madato table -t YAML -o MD path/to/workbook.yaml 
26 | ``` 
27 | 
28 | -------------------------------------------------------------------------------- 
29 | 
30 | The tool is primarily centered around getting tabular data (spreadsheets, CSVs) 
31 | into Markdown. 
32 | 
33 | It currently supports: 
34 | - Reading a XLS*, ODS Spreadsheet or YAML file `-- to -->` Markdown 
35 | - Reading a XLS*, ODS Spreadsheet `-- to -->` Markdown 
36 | 
37 | When generating the output: 
38 | - Filter the Rows using basic Regex over Key/Value pairs 
39 | - Limit the columns to named headings 
40 | - Re-order the columns, or repeat them using the same column feature 
41 | - Only generate a table for a named "sheet" (applicable for the XLS/ODS formats) 
42 | 
43 | Madato is: 
44 | - Command Line Tool (Windows, Mac, Linux) - good for CI/CD preprocessing 
45 | - Rust Library - Good for integration into Rust Markdown tooling 
46 | - Node JS WASM API - To be used later for Atom and VSCode Extensions 
47 | 
48 | Madato expects that every column has a heading row. That is, the first row contains the headings/column names. If a cell in that first row is blank, it will create `NULL0..NULLn` entries as required.
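For instance, a sheet whose third heading cell is blank gains a generated `NULL2` column, since columns are numbered from 0 (the `NULL5` column in the Examples below shows the same behaviour). A small sketch; the workbook, sheet name and cell values here are hypothetical:

```
madato table --type xlsx path/to/book.xlsx --sheetname data
|col1|col2|NULL2|
|----|----|-----|
| a  | b  | c   |
```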
49 | 
50 | ## Examples 
51 | 
52 | * Extract the `3rd Sheet` sheet from an MS Excel Document 
53 | ``` 
54 | 08:39 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname "3rd Sheet" 
55 | |col1|col2| col3 |col4 | col5 |NULL5| 
56 | |----|----|------|-----|-------------------------------------------------------|-----| 
57 | | 1 |that| are |wider| value ‘aaa’ is in the next cell, but has no heading | aaa | 
58 | |than|the |header| row | (open the spreadsheet to see what I mean) | | 
59 | ``` 
60 | 
61 | * Extract and reorder just 3 Columns 
62 | ``` 
63 | 08:42 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname "3rd Sheet" -c col2 -c col3 -c NULL5 
64 | |col2| col3 |NULL5| 
65 | |----|------|-----| 
66 | |that| are | aaa | 
67 | |the |header| | 
68 | ``` 
69 | * Pull from the `second_sheet` sheet 
70 | * Only extract the `Heading 4` column 
71 | * Use a Filter, where `Heading 4` values must contain a letter or number. 
72 | 
73 | ``` 
74 | 08:48 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname second_sheet -c "Heading 4" -f 'Heading 4=[a-zA-Z0-9]' 
75 | | Heading 4 | 
76 | |--------------------------| 
77 | | << empty | 
78 | |*Some Bolding in Markdown*| 
79 | | `escaped value` foo | 
80 | | 0.22 | 
81 | | #DIV/0! | 
82 | | “This cell has quotes” | 
83 | | 😕 ← Emoticon | 
84 | ``` 
85 | 
86 | * Filtering on a Column, ensuring that a "+" is present in the `Trend` column 
87 | 
88 | ``` 
89 | 09:00 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname Sheet1 -c Rank -c Language -c Trend -f "Trend=\+" 
90 | | Rank | Language |Trend | 
91 | |------------------------------------------------------|------------|------| 
92 | | 1 | Python |+5.5 %| 
93 | | 3 | Javascript |+0.2 %| 
94 | | 7 | R |+0.0 %| 
95 | | 12 | TypeScript |+0.3 %| 
96 | | 16 | Kotlin |+0.5 %| 
97 | | 17 | Go |+0.3 %| 
98 | | 20 | Rust |+0.0 %| 
99 | ``` 
100 | 
101 | ## Internals 
102 | madato uses: 
103 | - [calamine](https://github.com/tafia/calamine) for reading XLS and ODS sheets 
104 | - [wasm bindings](https://github.com/rustwasm/wasm-bindgen) to create JS API versions of the Rust API 
105 | - [regex](https://crates.io/crates/regex) for filtering, and [serde](https://crates.io/crates/serde) for serialisation. 
106 | 
107 | ## Tips 
108 | 
109 | * I have found that copying the table I want from a website (HTML) into a spreadsheet, then running it through `madato`, gives an excellent Markdown table of the original. 
110 | 
111 | ## Rust API 
112 | 
113 | ## JS API 
114 | 
115 | ## More Commandline 
116 | 
117 | ### Sheet List 
118 | 
119 | You can list the "sheets" of an XLS*, ODS file with 
120 | 
121 | ``` 
122 | $ madato sheetlist test/sample_multi_sheet.xlsx 
123 | Sheet1 
124 | second_sheet 
125 | 3rd Sheet 
126 | ``` 
127 | 
128 | ### YAML to Markdown 
129 | 
130 | Madato reads a "YAML" file in the same way it reads a spreadsheet. 
131 | This is useful for keeping tabular data in your source repository, rather than in an XLS file. 
132 | 
133 | `madato table -t yaml test/www-sample/test.yml` 
134 | 
135 | ``` 
136 | |col3| col4 | data1 | data2 | 
137 | |----|-------|---------|--------------------| 
138 | |100 |gar gar|somevalue|someother value here| 
139 | |190x| | that | nice | 
140 | |100 | ta da | this |someother value here| 
141 | ``` 
142 | 
143 | *Please see the [test/test.yml](test/test.yml) file for the expected layout of this file* 
144 | 
145 | ### Excel/ODS to YAML 
146 | 
147 | Changing the output from the default "Markdown (MD)" to "YAML", you get a YAML file of the Spreadsheet.
149 | 
150 | ``` 
151 | madato table -t xlsx test/sample_multi_sheet.xslx.xlsx -s Sheet1 -o yaml 
152 | --- 
153 | - Rank: "1" 
154 | Change: "" 
155 | Language: Python 
156 | Share: "23.59 %" 
157 | Trend: "+5.5 %" 
158 | - Rank: "2" 
159 | Change: "" 
160 | Language: Java 
161 | Share: "22.4 %" 
162 | Trend: "-0.5 %" 
163 | - Rank: "3" 
164 | Change: "" 
165 | Language: Javascript 
166 | Share: "8.49 %" 
167 | ... 
168 | ``` 
169 | 
170 | If you omit the sheet name, it will dump all sheets into an ordered map of arrays of maps. 
171 | 
172 | 
173 | ### Features 
174 | 
175 | * `[x]` Reads a formatted YAML string and renders a Markdown Table 
176 | * `[x]` Can take an optional list of column headings, and only display those from the table (filtering out other columns present) 
177 | * `[X]` Native Binary Command Line (windows, linux, osx) 
178 | * `[X]` Read an XLSX file and produce a Markdown Table 
179 | * `[X]` Read an ODS file and produce a Markdown Table 
180 | * `[ ]` Read a CSV, TSV, PSV (etc) file and produce a Markdown Table 
181 | * `[ ]` Support Nested Structures in the YAML input 
182 | * `[ ]` Read a Markdown File, and select the "table" and turn it back into YAML 
183 | 
184 | ### Future Goals 
185 | * Finish the testing and publishing of the JS WASM Bindings. (PS - it works, 
186 | see [test/www-sample](test/www-sample) and the [Makefile](Makefile)) 
187 | * Embed the "importing" of YAML, CSV and XLS* files into the `mume` Markdown Preview Enhanced Plugin. [https://shd101wyy.github.io/markdown-preview-enhanced/](https://shd101wyy.github.io/markdown-preview-enhanced/) So we can have Awesome Markdown Documents. 
188 | * Provide a `PreRenderer` for [rust-lang-nursery/mdBook](https://github.com/rust-lang-nursery/mdBook) to "import" MD tables from files. 
189 | 
190 | ### Known Issues 
191 | * A Spreadsheet Cell with a Date will come out as the "magic" Excel date number :-( - https://github.com/tafia/calamine/issues/116 
192 | 
193 | ## License 
194 | 
195 | madato is licensed under either of 
196 | 
197 | * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or 
198 | http://www.apache.org/licenses/LICENSE-2.0) 
199 | * MIT license ([LICENSE-MIT](LICENSE-MIT) or 
200 | http://opensource.org/licenses/MIT) 
201 | 
202 | at your option. 
203 | 
204 | ### Contribution 
205 | 
206 | Unless you explicitly state otherwise, any contribution intentionally submitted 
207 | for inclusion in madato by you, as defined in the Apache-2.0 license, shall be 
208 | dual licensed as above, without any additional terms or conditions. 
-------------------------------------------------------------------------------- /csv/xtab.md: -------------------------------------------------------------------------------- 
1 | # xtab 
2 | 
3 | 🦀 CSV command line utilities 
4 | 
5 | ## install 
6 | 
7 | ##### step 1: install cargo first 
8 | 
9 | ```bash 
10 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh 
11 | ``` 
12 | 
13 | ##### step 2: 
14 | 
15 | ```bash 
16 | cargo install xtab 
17 | # or 
18 | 
19 | git clone https://github.com/sharkLoc/xtab.git 
20 | cd xtab 
21 | cargo b --release 
22 | # mv target/release/xtab to anywhere you want 
23 | ``` 
24 | 
25 | ## usage 
26 | 
27 | ```bash 
28 | xtab -- CSV command line utilities 
29 | Version: 0.0.8 
30 | 
31 | Authors: sharkLoc 
32 | Source code: https://github.com/sharkLoc/xtab.git 
33 | 
34 | xtab supports reading and writing gzip/bzip2/xz format file.
35 | Compression level: 36 | format range default crate 37 | gzip 1-9 6 https://crates.io/crates/flate2 38 | bzip2 1-9 6 https://crates.io/crates/bzip2 39 | xz 1-9 6 https://crates.io/crates/xz2 40 | 41 | 42 | Usage: xtab [OPTIONS] [CSV] 43 | 44 | Commands: 45 | addheader Set new header for CSV file [aliases: ah] 46 | csv2xlsx Convert CSV/TSV files to XLSX file [aliases: c2x] 47 | dim Dimensions of CSV file 48 | drop Drop or Select CSV fields by columns index 49 | flatten flattened view of CSV records [aliases: flat] 50 | freq Build frequency table of selected column in CSV data 51 | head Print first N records from CSV file 52 | pretty Convert CSV to a readable aligned table [aliases: prt] 53 | replace Replace data of matched fields 54 | reverse Reverses rows of CSV data [aliases: rev] 55 | sample Randomly select rows from CSV file using reservoir sampling 56 | search Applies the regex to each field individually and shows only matching rows 57 | slice Slice rows from a part of a CSV file 58 | tail Print last N records from CSV file 59 | transpose Transpose CSV data [aliases: trans] 60 | uniq Unique data with keys 61 | xlsx2csv Convert XLSX to CSV format [aliases: x2c] 62 | view Show CSV file content 63 | help Print this message or the help of the given subcommand(s) 64 | 65 | Global Arguments: 66 | -d, --delimiter Set delimiter for input csv file, e.g., in linux -d $'\t' for tab, in powershell -d `t for tab [default: ,] 67 | -D, --out-delimiter Set delimiter for output CSV file, e.g., in linux -D $'\t' for tab, in powershell -D `t for tab [default: ,] 68 | --log If file name specified, write log message to this file, or write to stderr 69 | --compress-level Set compression level 1 (compress faster) - 9 (compress better) for gzip/bzip2/xz output file, only works with option -o/--out [default: 6] 70 | -v, --verbosity... control verbosity of logging, [-v: Error, -vv: Warn, -vvv: Info, -vvvv: Debug, -vvvvv: Trace, default: Debug] 71 | [CSV] Input CSV file name, if file not specified read data from stdin 72 | 73 | Global FLAGS: 74 | -H, --no-header If set, the first row is treated as a special header row, and the original header row excluded from output 75 | -q, --quiet Be quiet and do not show any extra information 76 | -h, --help prints help information 77 | -V, --version prints version information 78 | 79 | Use "xtab help [command]" for more information about a command 80 | ``` 81 | -------------------------------------------------------------------------------- /dna/fakit.md: -------------------------------------------------------------------------------- 1 | # fakit 2 | 3 | 🦀 a simple program for fasta file manipulation 4 | 5 | ## install latest version 6 | 7 | ```bash 8 | cargo install --git https://github.com/sharkLoc/fakit.git 9 | ``` 10 | 11 | ## install 12 | 13 | ```bash 14 | cargo install fakit 15 | ``` 16 | 17 | ## usage 18 | 19 | ```bash 20 | Fakit: A simple program for fasta file manipulation 21 | 22 | Version: 0.3.6 23 | 24 | Authors: sharkLoc 25 | Source code: https://github.com/sharkLoc/fakit.git 26 | 27 | Fakit supports reading and writing gzip (.gz) format. 28 | Bzip2 (.bz2) and xz (.xz) formats are supported since v0.3.0. 29 | Under the same compression level, xz has the highest compression ratio but consumes more time. 
30 | 31 | Compression level: 32 | format range default crate 33 | gzip 1-9 6 https://crates.io/crates/flate2 34 | bzip2 1-9 6 https://crates.io/crates/bzip2 35 | xz 1-9 6 https://crates.io/crates/xz2 36 | 37 | 38 | Usage: fakit [OPTIONS] 39 | 40 | Commands: 41 | topn get first N records from fasta file [aliases: head] 42 | tail get last N records from fasta file 43 | fa2fq convert fasta to fastq file 44 | faidx create index and random access to fasta files [aliases: fai] 45 | flatten flatten fasta sequences [aliases: flat] 46 | range print fasta records in a range 47 | rename rename sequence id in fasta file 48 | reverse get a reverse-complement of fasta file [aliases: rev] 49 | window stat dna fasta gc content by sliding windows [aliases: slide] 50 | grep grep fasta sequences by name/seq 51 | seq convert all bases to lower/upper case, filter by length 52 | sort sort fasta file by name/seq/gc/length 53 | search search subsequences/motifs from fasta file 54 | shuffle shuffle fasta sequences 55 | size report fasta sequence base count 56 | subfa subsample sequences from big fasta file 57 | split split fasta file by sequence id 58 | split2 split fasta file by sequence number 59 | summ simple summary for dna fasta files [aliases: stat] 60 | codon show codon table and amino acid name 61 | help Print this message or the help of the given subcommand(s) 62 | 63 | Global Arguments: 64 | -w, --line-width line width when outputting fasta sequences, 0 for no wrap [default: 70] 65 | --compress-level set gzip/bzip2/xz compression level 1 (compress faster) - 9 (compress better) for gzip/bzip2/xz output file, only works with 66 | option -o/--out [default: 6] 67 | --log if file name specified, write log message to this file, or write to stderr 68 | -v, --verbosity... control verbosity of logging, [-v: Error, -vv: Warn, -vvv: Info, -vvvv: Debug, -vvvvv: Trace, default: Debug] 69 | 70 | Global FLAGS: 71 | -q, --quiet be quiet and do not show extra information 72 | -h, --help prints help information 73 | -V, --version prints version information 74 | 75 | Use "fakit help [command]" for more information about a command 76 | 77 | ``` 78 | 79 | 
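A few quick, illustrative invocations. The subcommand names come from the help text above, but the file names are placeholders and the exact options of each subcommand may differ between versions, so treat this as a sketch and check `fakit help [command]` before relying on it:

```bash
# report per-sequence base counts from a (gzipped) fasta file
fakit size reads.fa.gz

# simple summary statistics for a dna fasta file (alias: stat)
fakit summ reads.fa.gz

# take the first records of a file, writing uncompressed fasta to stdout
fakit topn reads.fa.gz > head.fa
```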
80 | ** any bugs please report issues **💖 81 | -------------------------------------------------------------------------------- /dna/fq.md: -------------------------------------------------------------------------------- 1 | # fq 2 | 3 | [![CI status](https://github.com/stjude-rust-labs/fq/workflows/CI/badge.svg)](https://github.com/stjude-rust-labs/fq/actions) 4 | 5 | **fq** filters, generates, subsamples, and validates [FASTQ] files. 6 | 7 | [FASTQ]: https://en.wikipedia.org/wiki/FASTQ_format 8 | 9 | ## Install 10 | 11 | There are different methods to install fq. 12 | 13 | ### Releases 14 | 15 | [Precompiled binaries are built][releases] for modern Linux distributions 16 | (`x86_64-unknown-linux-gnu`), macOS (`x86_64-apple-darwin`), and Windows 17 | (`x86_64-pc-windows-msvc`). The Linux binaries require glibc 2.18+ (CentOS/RHEL 18 | 8+, Debian 8+, Ubuntu 14.04+, etc.). 19 | 20 | [releases]: https://github.com/stjude-rust-labs/fq/releases 21 | 22 | ### Conda 23 | 24 | fq is available via [Bioconda]. 25 | 26 | ``` 27 | $ conda install fq=0.11.0 28 | ``` 29 | 30 | [Bioconda]: https://bioconda.github.io/recipes/fq/README.html 31 | 32 | ### Manual 33 | 34 | Clone the repository and use [Cargo] to install fq. 35 | 36 | ``` 37 | $ git clone --depth 1 --branch v0.11.0 https://github.com/stjude-rust-labs/fq.git 38 | $ cd fq 39 | $ cargo install --locked --path . 40 | ``` 41 | 42 | [Cargo]: https://doc.rust-lang.org/cargo/getting-started/installation.html 43 | 44 | ### Container image 45 | 46 | Container images are managed by Bioconda and available through [Quay.io], e.g., 47 | using [Docker]: 48 | 49 | ``` 50 | $ docker image pull quay.io/biocontainers/fq: 51 | ``` 52 | 53 | See [the repository tags] for the available tags. 54 | 55 | Alternatively, build the development container image: 56 | 57 | ``` 58 | $ git clone --depth 1 --branch v0.11.0 https://github.com/stjude-rust-labs/fq.git 59 | $ cd fq 60 | $ docker image build --tag fq:0.11.0 . 61 | ``` 62 | 63 | [Quay.io]: https://quay.io/repository/biocontainers/fq 64 | [the repository tags]: https://quay.io/repository/biocontainers/fq?tab=tags 65 | [Docker]: https://www.docker.com/ 66 | 67 | ## Usage 68 | 69 | fq provides subcommands for filtering, generating, subsampling, and 70 | validating FASTQ files. 71 | 72 | ### filter 73 | 74 | **fq filter** filters a given FASTQ file by a set of names or a sequence 75 | pattern. The result includes only the records that match the given options. 76 | 77 | #### Usage 78 | 79 | ``` 80 | Filters a FASTQ file 81 | 82 | Usage: fq filter [OPTIONS] --dsts [SRCS]... 83 | 84 | Arguments: 85 | [SRCS]... FASTQ sources 86 | 87 | Options: 88 | --names 89 | Allowlist of record names 90 | --sequence-pattern 91 | Keep records that have sequences that match the given regular expression 92 | --dsts 93 | Filtered FASTQ destinations 94 | -h, --help 95 | Print help 96 | -V, --version 97 | Print version 98 | ``` 99 | 100 | #### Examples 101 | 102 | ```sh 103 | # Filters an input FASTQ using the given allowlist. 104 | $ fq filter --names allowlist.txt --dsts /dev/stdout in.fastq 105 | 106 | # Filters FASTQ files by matching a sequence pattern in the first input's 107 | # records and applying the match to all inputs. 108 | $ fq filter --sequence-pattern ^TC --dsts out.1.fq --dsts out.2.fq in.1.fq in.2.fq 109 | ``` 110 | 111 | ### generate 112 | 113 | **fq generate** is a FASTQ file pair generator. It creates two reads, formatting 114 | names as [described by Illumina][1]. 
115 | 116 | While _generate_ creates "valid" FASTQ reads, the content of the files is 117 | completely random. The sequences do not align to any genome. 118 | 119 | [1]: https://help.basespace.illumina.com/articles/descriptive/fastq-files/ 120 | 121 | #### Usage 122 | 123 | ``` 124 | Generates a random FASTQ file pair 125 | 126 | Usage: fq generate [OPTIONS] <R1_DST> <R2_DST> 127 | 128 | Arguments: 129 | <R1_DST> Read 1 destination. Output will be gzipped if ends in `.gz` 130 | <R2_DST> Read 2 destination. Output will be gzipped if ends in `.gz` 131 | 132 | Options: 133 | -s, --seed <SEED> Seed to use for the random number generator 134 | -n, --record-count <RECORD_COUNT> Number of records to generate [default: 10000] 135 | --read-length <READ_LENGTH> Number of bases in the sequence [default: 101] 136 | -h, --help Print help 137 | -V, --version Print version 138 | ``` 139 | 140 | #### Examples 141 | 142 | ```sh 143 | # Generates the default number of records, written to uncompressed files. 144 | $ fq generate /tmp/r1.fastq /tmp/r2.fastq 145 | 146 | # Generates FASTQ paired reads with 32 records, written to gzipped outputs. 147 | $ fq generate --record-count 32 /tmp/r1.fastq.gz /tmp/r2.fastq.gz 148 | ``` 149 | 150 | ### lint 151 | 152 | **fq lint** is a FASTQ file pair validator. 153 | 154 | #### Usage 155 | 156 | ``` 157 | Validates a FASTQ file pair 158 | 159 | Usage: fq lint [OPTIONS] <R1_SRC> [R2_SRC] 160 | 161 | Arguments: 162 | <R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs 163 | [R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs 164 | 165 | Options: 166 | --lint-mode <LINT_MODE> 167 | Panic on first error or log all errors [default: panic] [possible values: panic, log] 168 | --single-read-validation-level <SINGLE_READ_VALIDATION_LEVEL> 169 | Only use single read validators up to a given level [default: high] [possible values: low, medium, high] 170 | --paired-read-validation-level <PAIRED_READ_VALIDATION_LEVEL> 171 | Only use paired read validators up to a given level [default: high] [possible values: low, medium, high] 172 | --disable-validator <DISABLE_VALIDATOR> 173 | Disable validators by code. Use multiple times to disable more than one 174 | -h, --help 175 | Print help 176 | -V, --version 177 | Print version 178 | ``` 179 | 180 | #### Validators 181 | 182 | _validate_ includes a set of validators that run on single or paired records. 183 | By default, records are validated with all rules, but validators can be 184 | disabled using `--disable-validator CODE`, where `CODE` is one of the validators 185 | listed below. 186 | 187 | ##### Single 188 | 189 | | Code | Level | Name | Validation 190 | |------|--------|-------------------|------------ 191 | | S001 | low | PlusLine | Plus line starts with a "+". 192 | | S002 | medium | Alphabet | All characters in sequence line are one of "ACGTN", case-insensitive. 193 | | S003 | high | Name | Name line starts with an "@". 194 | | S004 | low | Complete | All four record lines (name, sequence, plus line, and quality) are present. 195 | | S005 | high | ConsistentSeqQual | Sequence and quality lengths are the same. 196 | | S006 | medium | QualityString | All characters in quality line are between "!" and "~" (ordinal values). 197 | | S007 | high | DuplicateName | All record names are unique. 198 | 199 | ##### Paired 200 | 201 | | Code | Level | Name | Validation 202 | |------|---------|-------------------|------------ 203 | | P001 | medium | Names | Each paired read name is the same, excluding interleave. 204 | 205 | #### Examples 206 | 207 | ```sh 208 | # Validate both reads using all validators. Exits cleanly (0) if no validation 209 | # errors occur. 
210 | $ fq lint r1.fastq r2.fastq 211 | 212 | # Log errors instead of quitting on first error. 213 | $ fq lint --lint-mode log r1.fastq r2.fastq 214 | 215 | # Disable validators S004 and S007. 216 | $ fq lint --disable-validator S004 --disable-validator S007 r1.fastq r2.fastq 217 | ``` 218 | 219 | ### subsample 220 | 221 | **fq subsample** outputs a subset of records from single or paired FASTQ files. 222 | 223 | When using a probability (`-p, --probability`), each file is read through once, 224 | and a subset of records is selected based on that chance. Given the randomness 225 | used when sampling a uniform distribution, the output record count will not be 226 | exact but (statistically) close. 227 | 228 | When using a record count (`-n, --record-count`), the first input is read 229 | twice, but it provides an exact number of records to be selected. 230 | 231 | A seed (`-s, --seed`) can be provided to influence the results, e.g., 232 | for a deterministic subset of records. 233 | 234 | For paired input, the sampling is applied to each pair. 235 | 236 | #### Usage 237 | 238 | ``` 239 | Outputs a subset of records 240 | 241 | Usage: fq subsample [OPTIONS] --r1-dst <R1_DST> <--probability <PROBABILITY>|--record-count <RECORD_COUNT>> <R1_SRC> [R2_SRC] 242 | 243 | Arguments: 244 | <R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs 245 | [R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs 246 | 247 | Options: 248 | -p, --probability <PROBABILITY> The probability a record is kept, as a fraction in (0.0, 1.0). Cannot be used with `record-count` 249 | -n, --record-count <RECORD_COUNT> The exact number of records to keep. Cannot be used with `probability` 250 | -s, --seed <SEED> Seed to use for the random number generator 251 | --r1-dst <R1_DST> Read 1 destination. Output will be gzipped if ends in `.gz` 252 | --r2-dst <R2_DST> Read 2 destination. Output will be gzipped if ends in `.gz` 253 | -h, --help Print help 254 | -V, --version Print version 255 | ``` 256 | 257 | #### Examples 258 | 259 | ```sh 260 | # Sample ~50% of records from a single FASTQ file 261 | $ fq subsample --probability 0.5 --r1-dst r1.50pct.fastq r1.fastq 262 | 263 | # Sample ~50% of records from a single FASTQ file and seed the RNG 264 | $ fq subsample --probability 0.5 --seed 13 --r1-dst r1.50pct.fastq r1.fastq 265 | 266 | # Sample ~25% of records from paired FASTQ files 267 | $ fq subsample --probability 0.25 --r1-dst r1.25pct.fastq --r2-dst r2.25pct.fastq r1.fastq r2.fastq 268 | 269 | # Sample ~10% of records from a gzipped FASTQ file and compress output 270 | $ fq subsample --probability 0.1 --r1-dst r1.10pct.fastq.gz r1.fastq.gz 271 | 272 | # Sample exactly 10000 records from a single FASTQ file 273 | $ fq subsample --record-count 10000 --r1-dst r1.10k.fastq r1.fastq 274 | ``` -------------------------------------------------------------------------------- /dna/ngs.md: -------------------------------------------------------------------------------- 1 | 
2 | # ngs 3 | 4 | Command line utility for working with next-generation sequencing files. 5 | 6 | Badges from the original README header: CI status · crates.io version · crates.io downloads · License: Apache 2.0 · License: MIT 7 | 8 | Explore the docs » · Request Feature · Report Bug · ⭐ Consider starring the repo! ⭐ 
41 | 42 | 43 | ## 🎨 Features 44 | 45 | * **[`ngs convert`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-convert).** Convert between next-generation sequencing formats. 46 | * **[`ngs derive`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-derive).** Forensic analysis tool for next-generation sequencing data. 47 | * **[`ngs generate`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-generate).** Generates a BAM file from a given reference genome. 48 | * **[`ngs index`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-index).** Generates the index file for various next-generation sequencing files. 49 | * **[`ngs list`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-list).** Utility to list various supported items in this command line tool. 50 | * **[`ngs plot`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-plot).** Produces plots for data generated by `ngs qc`. 51 | * **[`ngs qc`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-qc).** Generates quality control metrics for BAM files. 52 | * **[`ngs view`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-view).** Views various next-generation sequencing files, sometimes with a query region. 53 | 54 | 55 | ## Guiding Principles 56 | 57 | * **Modern, reliable foundation for everyday bioinformatics analysis—written in Rust.** `ngs` aims to package together a fairly comprehensive set of analysis tools and utilities for everyday work in bioinformatics. It is built with modern, multi-core systems in mind and written in Rust. Though we are not there today, we plan to work towards this goal in the future. 58 | * **Runs on readily available hardware/software.** We aim for every subcommand within `ngs` to run within most computing environments without the need for special hardware or software. Practically, this means we've designed `ngs` to run in any UNIX-like environment that has at least four (4) cores and sixteen (16) GB of RAM. Often, tools will run with fewer resources. This design decision is important and sometimes means that `ngs` runs slower than it otherwise could. 59 | 60 | ## 📚 Getting Started 61 | 62 | ### Installation 63 | 64 | To install the latest released version, you can simply use `cargo`. 65 | 66 | ```bash 67 | cargo install ngs 68 | ``` 69 | 70 | To install the latest version on `main`, you can use the following command. 71 | 72 | ```bash 73 | cargo install --locked --git https://github.com/stjude-rust-labs/ngs.git 74 | ``` 75 | 76 | ### Using Docker 77 | 78 | ```bash 79 | docker pull ghcr.io/stjude-rust-labs/ngs 80 | docker run -it --rm --volume "$(pwd)":/data ghcr.io/stjude-rust-labs/ngs 81 | ``` 82 | 83 | `/data` is the working directory of the docker image. Running this command from the directory with your data will allow 84 | the container to act on those files. 85 | 86 | Note: Currently the `latest` tag refers to the latest release of `ngs` and not the most recent code changes in this 87 | repository. 88 | 89 | ## 🖥️ Development 90 | 91 | To bootstrap a development environment, please use the following commands. 92 | 93 | ```bash 94 | # Clone the repository 95 | git clone git@github.com:stjude-rust-labs/ngs.git 96 | cd ngs 97 | 98 | # Run the command line tool using cargo. 99 | cargo run -- -h 100 | ``` 101 | 102 | ## 🚧️ Tests 103 | 104 | ```bash 105 | # Run the project's tests. 106 | cargo test 107 | 108 | # Ensure the project doesn't have any linting warnings. 109 | cargo clippy 110 | 111 | # Ensure the project passes `cargo fmt`. 
112 | cargo fmt --check 113 | ``` 114 | 115 | ## Minimum Supported Rust Version (MSRV) 116 | 117 | The minimum supported Rust version for this project is 1.64.0. 118 | 119 | ## 🤝 Contributing 120 | 121 | Contributions, issues and feature requests are welcome! Feel free to check 122 | [issues page](https://github.com/stjude-rust-labs/ngs/issues). 123 | 124 | ## 📝 License 125 | 126 | * All code related to the `ngs derive instrument` subcommand is licensed under the [AGPL v2.0][agpl-v2]. This is not due to any strict requirement, but out of deference to some [code][10x-inspiration] that inspired our strategy (and from which patterns were copied), the decision was made to license this code consistently. 127 | * The rest of this project is licensed as either [Apache 2.0][license-apache] or 128 | [MIT][license-mit] at your discretion. 129 | 130 | Copyright © 2021-Present [St. Jude Children's Research 131 | Hospital](https://github.com/stjude). 132 | 133 | [10x-inspiration]: https://github.com/10XGenomics/supernova/blob/master/tenkit/lib/python/tenkit/illumina_instrument.py 134 | [agpl-v2]: http://www.affero.org/agpl2.html 135 | [contributing-md]: https://github.com/stjude-rust-labs/ngs/blob/master/CONTRIBUTING.md 136 | [license-apache]: https://github.com/stjude-rust-labs/ngs/blob/master/LICENSE-APACHE 137 | [license-mit]: https://github.com/stjude-rust-labs/ngs/blob/master/LICENSE-MIT -------------------------------------------------------------------------------- /dna/rust-bio-tools.md: -------------------------------------------------------------------------------- 1 | [![Gitpod Ready-to-Code](https://img.shields.io/badge/Gitpod-ready--to--code-blue?logo=gitpod)](https://gitpod.io/#https://github.com/rust-bio/rust-bio-tools) 2 | [![Bioconda downloads](https://img.shields.io/conda/dn/bioconda/rust-bio-tools.svg?style=flat)](http://bioconda.github.io/recipes/rust-bio-tools/README.html) 3 | [![Bioconda version](https://img.shields.io/conda/vn/bioconda/rust-bio-tools.svg?style=flat)](http://bioconda.github.io/recipes/rust-bio-tools/README.html) 4 | [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/rust-bio-tools/README.html) 5 | [![Licence](https://img.shields.io/conda/l/bioconda/rust-bio-tools.svg?style=flat)](http://bioconda.github.io/recipes/rust-bio-tools/README.html) 6 | [![GitHub Workflow Status](https://img.shields.io/github/workflow/status/rust-bio/rust-bio-tools/CI)](https://github.com/rust-bio/rust-bio-tools/actions) 7 | 8 | # Rust-Bio-Tools 9 | 10 | A set of ultra fast and robust command line utilities for bioinformatics tasks based on Rust-Bio. 
11 | Rust-Bio-Tools provides a command `rbt`, which currently supports the following operations: 12 | 13 | * a linear time implementation for fuzzy matching of two vcf/bcf files (`rbt vcf-match`) 14 | * a vcf/bcf to txt converter that flexibly allows selecting tags and properly handles multiallelic sites (`rbt vcf-to-txt`) 15 | * a linear time round-robin FASTQ splitter that splits a given FASTQ file into a given number of chunks (`rbt fastq-split`) 16 | * a linear time extraction of depth information from BAMs at given loci (`rbt bam-depth`) 17 | * a utility to quickly filter records from a FASTQ file (`rbt fastq-filter`) 18 | * a tool to merge BAM or FASTQ reads using marked duplicates or unique molecular identifiers (UMIs) (`rbt collapse-reads-to-fragments bam|fastq`) 19 | * a tool to generate interactive HTML based reports that offer multiple plots visualizing the provided genomics data in VCF and BAM format (`rbt vcf-report`) 20 | * a tool to generate an interactive HTML based report from a csv file including visualizations (`rbt csv-report`) 21 | * a tool for splitting VCF/BCF files into N equal chunks, including BND support (`rbt vcf-split`) 22 | * a tool to generate visualizations for a specific region of one or multiple BAM files with a given reference contained in a single HTML file (`rbt plot-bam`) 23 | 24 | Further functionality is added as it is needed by the authors. Check out the [Contributing](#Contributing) section if you want to contribute anything yourself. 25 | For a list of changes, take a look at the [CHANGELOG](CHANGELOG.md). 26 | 27 | 28 | ## Installation 29 | 30 | ### Requirements 31 | 32 | Rust-Bio-Tools depends on [rgsl](https://docs.rs/GSL/*/rgsl/), which needs [GSL](https://www.gnu.org/software/gsl/) to be installed: 33 | 34 | - Ubuntu: `sudo apt-get install libgsl-dev` 35 | - Arch: `sudo pacman -S gsl` 36 | - OSX: `brew install gsl` 37 | 38 | ### Bioconda 39 | 40 | Rust-Bio-Tools is available via [Bioconda](https://bioconda.github.io). 41 | With Bioconda set up, installation is as easy as 42 | 43 | conda install rust-bio-tools 44 | 45 | ### Cargo 46 | 47 | If the [Rust](https://www.rust-lang.org/tools/install) compiler and associated [Cargo](https://github.com/rust-lang/cargo/) are installed, Rust-Bio-Tools may be installed via 48 | 49 | cargo install rust-bio-tools 50 | 51 | ### Source 52 | 53 | Download the source code and within the root directory of the source run 54 | 55 | cargo install 56 | 57 | ## Usage and Documentation 58 | 59 | Rust-Bio-Tools installs a command line utility `rbt`. Issue 60 | 61 | rbt --help 62 | 63 | for a summary of all options and tools. 64 | 65 | ## Contributing 66 | 67 | Any contributions are highly welcome. If you plan to contribute we suggest installing pre-commit hooks. To do so: 68 | 1. Install `pre-commit` as explained [here](https://pre-commit.com/#installation) 69 | 2. Run `pre-commit install` in the rust-bio-tools base directory 70 | 71 | This should format, check and lint your code when committing. 
72 | 73 | ## Authors 74 | 75 | * [Johannes Köster](https://github.com/johanneskoester) (https://koesterlab.github.io) 76 | * [Felix Mölder](https://github.com/FelixMoelder) 77 | * [Henning Timm](https://github.com/HenningTimm) 78 | * [Felix Wiegand](https://github.com/fxwiegand) -------------------------------------------------------------------------------- /dna/skc.md: -------------------------------------------------------------------------------- 1 | # skc 2 | 3 | `skc` is a simple tool for finding shared k-mer content between two genomes. 4 | 5 | ## Installation 6 | 7 | ### Prebuilt binary 8 | 9 | ``` 10 | curl -sSL skc.mbh.sh | sh 11 | # or with wget 12 | wget -nv -O - skc.mbh.sh | sh 13 | ``` 14 | 15 | You can also pass options to the script like so 16 | 17 | ```text 18 | $ curl -sSL skc.mbh.sh | sh -s -- --help 19 | install.sh [option] 20 | 21 | Fetch and install the latest version of skc, if skc is already 22 | installed it will be updated to the latest version. 23 | 24 | Options 25 | -V, --verbose 26 | Enable verbose output for the installer 27 | 28 | -f, -y, --force, --yes 29 | Skip the confirmation prompt during installation 30 | 31 | -p, --platform 32 | Override the platform identified by the installer 33 | 34 | -b, --bin-dir 35 | Override the bin installation directory [default: /usr/local/bin] 36 | 37 | -a, --arch 38 | Override the architecture identified by the installer [default: x86_64] 39 | 40 | -B, --base-url 41 | Override the base URL used for downloading releases [default: https://github.com/mbhall88/skc/releases] 42 | 43 | -h, --help 44 | Display this help message 45 | 46 | ``` 47 | 48 | ### Cargo 49 | 50 | ```text 51 | cargo install skc 52 | ``` 53 | 54 | ### Conda 55 | 56 | ```text 57 | conda install skc 58 | ``` 59 | 60 | ### Local 61 | 62 | ```text 63 | cargo build --release 64 | ./target/release/skc --help 65 | ``` 66 | 67 | ## Usage 68 | 69 | Check for shared 16-mers between the [HIV-1 genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1) and the [ 70 | *Mycobacterium tuberculosis* genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_000962.3). 71 | 72 | ```text 73 | $ skc -k 16 NC_001802.1.fa NC_000962.3.fa 74 | [2023-06-20T01:46:36Z INFO ] 9079 unique k-mers in target 75 | [2023-06-20T01:46:38Z INFO ] 2 shared k-mers between target and query 76 | >4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106 77 | TGCAGAACATCCAGGG 78 | >4237062597 tcount=1 qcount=1 tpos=NC_001802.1:8415 qpos=NC_000962.3:629482 79 | CCAGCAGCAGATAGGG 80 | ``` 81 | 82 | So we can see there are two shared 16-mers between the genomes. By default, the shared k-mers are written to stdout - 83 | use the `-o` option to write them to file. 84 | 85 | ### Fasta description 86 | 87 | Example: `>4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106` 88 | 89 | The ID (`4233642782`) is the 64-bit integer representation of the k-mer's value in bit-space 90 | (see [Daniel Liu's brilliant cute-nucleotides][cute] for more information). `tcount` and `qcount` are the 91 | number of times the k-mer is present in the target and query genomes, respectively. `tpos` and `qpos` are the (1-based) 92 | k-mer starting position(s) within the target and query contigs - these will be comma-separated if the k-mer occurs 93 | multiple times. 
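To make the bit-space ID concrete, here is a minimal Rust sketch of packing a k-mer (k ≤ 32) into a `u64` at two bits per base. This is illustrative only: skc's actual encoding comes from the cute-nucleotides routines linked above, and the A/C/G/T → 0/1/2/3 mapping used here is an assumption, so the resulting integer need not match skc's IDs exactly.

```rust
/// Pack a k-mer (k <= 32) into a u64, two bits per base.
/// The base-to-bits mapping below is illustrative, not skc's.
fn kmer_to_u64(kmer: &[u8]) -> Option<u64> {
    assert!(kmer.len() <= 32, "k must be at most 32 to fit in a u64");
    let mut packed: u64 = 0;
    for &base in kmer {
        let bits: u64 = match base.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // N or other ambiguity codes have no 2-bit encoding
        };
        packed = (packed << 2) | bits;
    }
    Some(packed)
}

fn main() {
    // one of the shared 16-mers from the example output above
    let id = kmer_to_u64(b"TGCAGAACATCCAGGG").unwrap();
    println!("{id}");
}
```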
94 | 95 | ### Usage help 96 | 97 | ```text 98 | $ skc --help 99 | Shared k-mer content between two genomes 100 | 101 | Usage: skc [OPTIONS] <TARGET> <QUERY> 102 | 103 | Arguments: 104 | <TARGET> 105 | Target sequence 106 | 107 | Can be compressed with gzip, bzip2, xz, or zstd 108 | 109 | <QUERY> 110 | Query sequence 111 | 112 | Can be compressed with gzip, bzip2, xz, or zstd 113 | 114 | Options: 115 | -k, --kmer <KMER> 116 | Size of k-mers (max. 32) 117 | 118 | [default: 21] 119 | 120 | -o, --output <OUTPUT> 121 | Output filepath(s); stdout if not present 122 | 123 | -O, --output-type <OUTPUT_TYPE> 124 | u: uncompressed; b: Bzip2; g: Gzip; l: Lzma; z: Zstd 125 | 126 | Output compression format is automatically guessed from the filename extension. This option is used to override that 127 | 128 | [default: u] 129 | 130 | -l, --compress-level <COMPRESS_LEVEL> 131 | Compression level to use if compressing output 132 | 133 | [default: 6] 134 | 135 | -h, --help 136 | Print help (see a summary with '-h') 137 | 138 | -V, --version 139 | Print version 140 | ``` 141 | 142 | ### Caveats 143 | 144 | - Make the first genome passed (`<TARGET>`) the smallest genome. This is to reduce memory usage, as all unique k-mers 145 | (well, their `u64` values) for this genome will be held in memory. 146 | - We do not use canonical k-mers 147 | - 32 is the largest k-mer size that can be used. This is basically a (lazy) implementation decision, but also helps to 148 | keep the memory footprint as low as possible. If you want larger k-mer values, I would suggest checking out some of 149 | the [similar tools](#alternate-tools). 150 | 151 | ## Alternate tools 152 | 153 | `skc` does not claim to be the fastest or most memory-efficient tool to find shared k-mer content. I basically wrote it 154 | because I either struggled to install some alternative tools, they were clunky/verbose, or it was laborious to get shared 155 | k-mers out of the results (e.g. can only search one k-mer at a time or have to run many different subcommands). Here is 156 | a (non-exhaustive) list of other tools that can be used to get shared k-mer content 157 | 158 | - [unikmer](https://github.com/shenwei356/unikmer) - this was brought to my attention *after* I wrote `skc`. Had I known 159 | about it beforehand, I probably wouldn't have written `skc`. So I would recommend unikmer for almost all use 160 | cases - [Wei Shen](https://github.com/shenwei356) writes awesome tools 161 | - [Jellyfish](https://github.com/gmarcais/Jellyfish) 162 | - [REINDEER](https://github.com/kamimrcht/REINDEER) 163 | - [kmer-db](https://github.com/refresh-bio/kmer-db) 164 | - [GGCAT](https://github.com/algbio/ggcat) 165 | - [KAT](https://github.com/TGAC/KAT) 166 | 167 | ## Acknowledgements 168 | 169 | [Daniel Liu's brilliant cute-nucleotides][cute] is used to (rapidly) convert k-mers into 64-bit integers. 170 | 171 | 172 | [cute]: https://github.com/Daniel-Liu-c0deb0t/cute-nucleotides 173 | -------------------------------------------------------------------------------- /fastq/fasten.md: -------------------------------------------------------------------------------- 1 | # Fasten 2 | 3 | [![Crates.io](https://img.shields.io/crates/v/fasten)](https://crates.io/crates/fasten) 4 | [![CI](https://github.com/lskatz/fasten/actions/workflows/basic.yml/badge.svg)](https://github.com/lskatz/fasten/actions/workflows/basic.yml) 5 | [![DOI](https://joss.theoj.org/papers/10.21105/joss.06030/status.svg)](https://doi.org/10.21105/joss.06030) 6 | 7 | A powerful manipulation suite for interleaved fastq files. 
8 | Executables can read/write to `stdin` and `stdout`, and they are compatible with the interleaved fastq format. 9 | This makes it much easier to perform streaming operations using unix pipes. 10 | 11 | ## Synopsis 12 | 13 | ### read metrics 14 | 15 | $ cat testdata/R1.fastq testdata/R2.fastq | \ 16 | fasten_shuffle | fasten_metrics | column -t 17 | totalLength numReads avgReadLength avgQual 18 | 800 8 100 19.53875 19 | 20 | ### read cleaning 21 | 22 | $ cat testdata/R1.fastq testdata/R2.fastq | \ 23 | fasten_shuffle | \ 24 | fasten_clean --paired-end --min-length 2 | \ 25 | gzip -c > cleaned.shuffled.fastq.gz 26 | 27 | $ zcat cleaned.shuffled.fastq.gz | fasten_metrics | column -t 28 | totalLength numReads avgReadLength avgQual 29 | 800 8 100 19.53875 30 | # No reads were actually filtered with cleaning, with --min-length=2 31 | 32 | ## Installation 33 | 34 | ### Installation from source 35 | 36 | Fasten is programmed in the Rust programming language. More information about Rust, including installation and the executable `cargo`, can be found at [rust-lang.org](https://www.rust-lang.org). 37 | 38 | After downloading, use the Rust executable `cargo` like so: 39 | 40 | cd fasten 41 | cargo build --release 42 | export PATH=$PATH:$(pwd)/target/release 43 | 44 | All executables will be in the directory `fasten/target/release`. 45 | 46 | _note_: there are some `Makefile` targets to help, including: 47 | 48 | * `make all` to make the following 49 | * `make release` install fast executables 50 | * `make debug` install executables quickly (although the executables will not be optimized) 51 | * `make fasten/doc` compile the latest documentation 52 | * `make clean` uninstall local binaries 53 | 54 | ### Installation without `git` 55 | 56 | You can also install Fasten straight from crates.io using the following command: 57 | 58 | cargo install fasten 59 | 60 | Detailed information on how this works can be found in the cargo handbook. 61 | 62 | ## General usage 63 | 64 | All scripts accept the following parameters, read uncompressed fastq format from stdin, and print uncompressed fastq format to stdout. All paired end fastq files must be in interleaved format, and they are written in [interleaved format](./docs/file-formats.md), except when deshuffling with `fasten_shuffle`. 65 | 66 | * `--help` 67 | * `--numcpus` Not all scripts will take advantage of numcpus. (not currently implemented) 68 | * `--paired-end` Input reads are interleaved paired end 69 | * `--verbose` Print more status messages 70 | 71 | ## Documentation 72 | 73 | Please see the inline documentation at https://lskatz.github.io/fasten/ 74 | 75 | This documentation was built with `cargo doc --no-deps` 76 | 77 | ### Other documentation 78 | 79 | * Some wrapper scripts are noted in the [scripts](./scripts.md) page. 80 | 81 | ### Contributing 82 | 83 | Instructions for how to contribute can be found in [CONTRIBUTING.md](CONTRIBUTING.md). 84 | 85 | ## Fasten script descriptions 86 | 87 | All executables read and write in the fastq format 88 | except `fasten_convert`. 
89 | 90 | |executable |Description| 91 | |-------------------|-----------| 92 | |[`fasten_clean`](https://lskatz.github.io/fasten/fasten_clean) | Trims and cleans a fastq file.| 93 | |[`fasten_convert`](https://lskatz.github.io/fasten/fasten_convert) | Converts between different sequence formats like fastq, sam, fasta.| 94 | |[`fasten_straighten`](https://lskatz.github.io/fasten/fasten_straighten)| Convert any fastq file to a standard four-line-per-entry format.| 95 | |[`fasten_metrics`](https://lskatz.github.io/fasten/fasten_metrics) | Prints basic read metrics.| 96 | |[`fasten_pe`](https://lskatz.github.io/fasten/fasten_pe) | Determines paired-endedness based on read IDs.| 97 | |[`fasten_randomize`](https://lskatz.github.io/fasten/fasten_randomize) | Randomizes reads from input.| 98 | |[`fasten_combine`](https://lskatz.github.io/fasten/fasten_combine) | Combines identical reads and updates quality scores.| 99 | |[`fasten_kmer`](https://lskatz.github.io/fasten/fasten_kmer) | Kmer counting.| 100 | |[`fasten_normalize`](https://lskatz.github.io/fasten/fasten_normalize) | Normalize read depth by using kmer counting.| 101 | |[`fasten_sample`](https://lskatz.github.io/fasten/fasten_sample) | Downsamples reads.| 102 | |[`fasten_shuffle`](https://lskatz.github.io/fasten/fasten_shuffle) | Shuffles or deshuffles paired end reads.| 103 | |[`fasten_validate`](https://lskatz.github.io/fasten/fasten_validate) | Validates your reads (deprecated in favor of `fasten_inspect` and `fasten_repair`).| 104 | |[`fasten_inspect`](https://lskatz.github.io/fasten/fasten_inspect) | Adds information to read IDs such as seqlength.| 105 | |[`fasten_repair`](https://lskatz.github.io/fasten/fasten_repair) | Repairs corrupted reads.| 106 | |[`fasten_quality_filter`](https://lskatz.github.io/fasten/fasten_quality_filter) | Transforms nucleotides to "N" if the quality is low.| 107 | |[`fasten_trim`](https://lskatz.github.io/fasten/fasten_trim) | Blunt-end trims reads.| 108 | |[`fasten_replace`](https://lskatz.github.io/fasten/fasten_replace) | Find and replace using regex.| 109 | |[`fasten_mutate`](https://lskatz.github.io/fasten/fasten_mutate) | Introduce random mutations.| 110 | |[`fasten_regex`](https://lskatz.github.io/fasten/fasten_regex) | Filter for reads using regex.| 111 | |[`fasten_progress`](https://lskatz.github.io/fasten/fasten_progress) | Add progress to any place in the pipeline.| 112 | |[`fasten_sort`](https://lskatz.github.io/fasten/fasten_sort) | Sort fastq entries.| 113 | 114 | ## Etymology 115 | 116 | Many of these scripts have inspiration from the fastx toolkit, and I wanted to make a `fasty`, but that was already the name of a bioinformatics program. 117 | Therefore I cycled through other letters of the alphabet and came across "N." So it is possible to pronounce this project like "Fast-N" or in a way 118 | that indicates that you are securing your analysis by "fasten"ing it (with a silent T). 119 | 120 | ## Citation 121 | 122 | [![DOI](https://joss.theoj.org/papers/10.21105/joss.06030/status.svg)](https://doi.org/10.21105/joss.06030) 123 | 124 | To cite, please refer to Katz et al., (2024). Fasten: a toolkit for streaming operations on fastq files. 
Journal of Open Source Software, 9(94), 6030, https://doi.org/10.21105/joss.06030 -------------------------------------------------------------------------------- /fastq/faster.md: -------------------------------------------------------------------------------- 1 | ![Rust](https://github.com/angelovangel/faster/workflows/Rust/badge.svg) 2 | # faster 3 | 4 | A (very) fast program for getting statistics and features from a fastq file, in a usable form, written in Rust. 5 | 6 | ## Description 7 | 8 | I wrote this program to get *fast* and *accurate* statistics about a fastq file, formatted as a tab-delimited table. In addition, it can do the following with a fastq file: 9 | 10 | - get the read lengths 11 | - get gc content per read 12 | - get geometric mean of phred scores per read 13 | - get NX values for all the reads, e.g. N50 14 | - filter reads based on length (both greater than and smaller than a desired length) 15 | - subsample reads (by proportion of all reads in the file) 16 | - trim front and trim tail - trim x number of bases from the beginning/end of each read 17 | - regex search for reads containing a pattern in their description field 18 | 19 | The motivation behind it: 20 | 21 | - many of the tools out there are just wrong when it comes to calculating 'mean' phred scores (yes, just taking the arithmetic mean phred score is wrong; a short sketch in the Usage section below shows why) 22 | - one simple executable doing one thing well, no dependencies 23 | - it is straightforward to parse the output in other programs and the output is easy to tweak as desired 24 | - reasonably fast 25 | - can be easily run in parallel 26 | 27 | ## Install 28 | 29 | Compiled binaries are provided for x86_64 Linux, macOS and Windows - download from the releases section and run. You will have to make the file executable (`chmod a+x faster`) and for macOS, allow running external apps in your security settings. If you need to run it on something else (your phone?!), you will have to compile it yourself (which is pretty easy though). Below is an example of how to set up a Rust toolchain and compile `faster`: 30 | 31 | ```bash 32 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh 33 | git clone https://github.com/angelovangel/faster.git 34 | 35 | cd faster 36 | cargo build --release 37 | 38 | # the binary is now under ./target/release/, run it like this: 39 | ./target/release/faster -t /path/to/fastq/file.fastq.gz 40 | 41 | ``` 42 | 43 | ## Usage and tweaking the output 44 | 45 | The program takes one fastq/fastq.gz file as an argument and, when used with the `--table` flag, outputs a tab-separated table with statistics to stdout. There are options to obtain the length, GC-content, and 'mean' phred scores per read, or to filter reads by length, see `--help` for details. 
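To make the mean-quality point from the Description concrete: one commonly recommended way (and the reasoning behind the "arithmetic mean is wrong" remark) is to convert each Phred score to an error probability, average the probabilities, and convert the mean back to a Phred score. A minimal Rust sketch, illustrative only and not `faster`'s actual implementation:

```rust
/// Mean Phred quality of a read, computed via error probabilities
/// (Phred Q = -10 * log10(p), so p = 10^(-Q/10)).
fn mean_phred(quals: &[u8]) -> f64 {
    let mean_err: f64 = quals
        .iter()
        .map(|&q| 10f64.powf(-(q as f64) / 10.0))
        .sum::<f64>()
        / quals.len() as f64;
    -10.0 * mean_err.log10()
}

fn main() {
    // Two bases at Phred 10 (10% error) and Phred 40 (0.01% error):
    // the arithmetic mean of the scores is 25, but the mean error
    // probability is ~5%, which corresponds to a Phred score of ~13.
    println!("{:.1}", mean_phred(&[10, 40])); // prints 13.0
}
```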
46 | 47 | ```bash 48 | # for help 49 | faster --help # or -h 50 | 51 | # get some N10, N50 and N90 values 52 | for i in 0.1 0.5 0.9; do faster --nx $i /path/to/fastq/file.fastq; done 53 | 54 | # get a table with statistics 55 | faster -t /path/to/fastq/file.fastq 56 | 57 | # for many files, with parallel 58 | parallel faster -t ::: /path/to/fastq/*.fastq.gz 59 | 60 | # again with parallel, but get rid of the table header 61 | parallel faster -ts ::: /path/to/fastq/*.fastq.gz 62 | ``` 63 | 64 | The statistics output is a tab-separated table with the following columns: 65 | `file reads bases n_bases min_len max_len mean_len Q1 Q2 Q3 N50 Q20_percent Q30_percent` 66 | 67 | ## Performance 68 | 69 | To get an idea how `faster` compares to other tools, I have benchmarked it with two other popular programs and 3 different datasets. **I am aware that these tools have different and often much richer functionality (especially seqkit, I use it all the time), so these comparisons are for orientation only**. 70 | The benchmarks were performed with [hyperfine](https://github.com/sharkdp/hyperfine) (`-r 15 --warmup 2`) on a MacBook Pro with an 8-core 2.3 GHz Quad-Core Intel Core i5 and 8 GB RAM. For Illumina reads, `faster` is slightly slower than `seqstats` (written in C using the `klib` [library by Heng Li](https://github.com/attractivechaos/klib) - the fastest thing possible out there), and for Nanopore it is even a bit faster than `seqstats`. `seqkit stats` performs worse of the three tools tested, but bear in mind the extraordinarily rich functionality it has. 71 | 72 | *** 73 | ### dataset A - a small Nanopore fastq file with 37k reads and 350M bases 74 | 75 | | Command | Mean [ms] | Min [ms] | Max [ms] | Relative | 76 | |:---|---:|---:|---:|---:| 77 | | `faster -t datasetA.fastq` | 398.1 ± 21.2 | 380.4 | 469.6 | 1.00 | 78 | | `seqstats datasetA.fastq` | 633.6 ± 54.1 | 593.3 | 773.6 | 1.59 ± 0.16 | 79 | | `seqkit stats -a datasetA.fastq` | 1864.5 ± 70.3 | 1828.7 | 2117.3 | 4.68 ± 0.31 | 80 | 81 | *** 82 | 83 | ### dataset B - a small Illumina fastq.gz file with ~100k reads 84 | 85 | | Command | Mean [ms] | Min [ms] | Max [ms] | Relative | 86 | |:---|---:|---:|---:|---:| 87 | | `faster -t datasetB.fastq.gz` | 181.7 ± 2.3 | 177.7 | 184.6 | 1.36 ± 0.09 | 88 | | `seqstats datasetB.fastq.gz` | 133.4 ± 8.4 | 125.7 | 154.2 | 1.00 | 89 | | `seqkit stats -a datasetB.fastq.gz` | 932.6 ± 37.0 | 873.8 | 1028.9 | 6.99 ± 0.52 | 90 | 91 | *** 92 | 93 | ### dataset C - a small Illumina iSeq run, 11.5M reads and 1.7G bases, using `gnu parallel` 94 | 95 | | Command | Mean [s] | Min [s] | Max [s] | Relative | 96 | |:---|---:|---:|---:|---:| 97 | | `parallel faster -t ::: *.fastq.gz` | 6.438 ± 0.384 | 6.009 | 7.062 | 1.43 ± 0.15 | 98 | | `parallel seqstats ::: *.fastq.gz` | 4.488 ± 0.394 | 4.120 | 5.312 | 1.00 | 99 | | `parallel seqkit stats -a ::: *.fastq.gz` | 40.156 ± 1.747 | 38.762 | 44.132 | 8.95 ± 0.88 | 100 | 101 | *** 102 | ## Reference 103 | 104 | `faster` uses the excellent Rust-Bio library: 105 | 106 | [Köster, J. (2016). Rust-Bio: a fast and safe bioinformatics library. Bioinformatics, 32(3), 444-446.](https://academic.oup.com/bioinformatics/article/32/3/444/1743419) -------------------------------------------------------------------------------- /fastq/fqgrep.md: -------------------------------------------------------------------------------- 1 | # fqgrep 2 | 3 |

4 | Badges from the original README header: Build Status · license · Version info · Install with bioconda 9 | Grep for FASTQ files. 10 | 
11 | 12 | Search a pair of fastq files for reads that match a given ref or alt sequence. 13 | 14 | ## Install 15 | 16 | ### From bioconda 17 | 18 | ```console 19 | conda install -c bioconda fqgrep 20 | ``` 21 | 22 | ### From Source 23 | 24 | ```console 25 | git clone ... && cd fqgrep 26 | cargo install --path . 27 | ``` 28 | 29 | ## Usage 30 | 31 | ```console 32 | fqgrep -r 'GACGAGATTA' -a 'GACGTGATTA' --r1-fastq /data/testR1.fastq.gz --r2-fastq /data/testR2.fastq.gz -o ./test_out -t 28 33 | ``` 34 | 35 | ## Help 36 | 37 | See the following for usage: 38 | 39 | ```console 40 | fqgrep -h 41 | ``` 42 | -------------------------------------------------------------------------------- /fastq/fqkit.md: -------------------------------------------------------------------------------- 1 | ![icon](https://github.com/sharkLoc/fqkit/blob/main/doc/fqkit_icon.PNG) 2 | 3 | 4 | 5 | # fqkit 6 | 7 | ![Static Badge](https://img.shields.io/badge/Author-sharkLoc-blue) 8 | ![Static Badge](https://img.shields.io/badge/Tool-fqkit-red) 9 | ![Crates.io (latest)](https://img.shields.io/crates/dv/fqkit?labelColor=rgb&color=hex&link=https%3A%2F%2Fcrates.io%2Fcrates%2Ffqkit) 10 | ![Crates.io](https://img.shields.io/crates/d/fqkit?label=Total%20download%20in%20crate.io) 11 | ![GitHub Gist last commit](https://img.shields.io/github/gist/last-commit/a4910923a230b8975218a188528463d7?logo=github) 12 | 13 | 🦀 a simple program for fastq file manipulation 14 | 15 | ## install 16 | 17 | ##### step 1: install cargo first 18 | 19 | ```bash 20 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh 21 | ``` 22 | 23 | ##### step 2: on linux or windows 24 | 25 | ```bash 26 | cargo install fqkit 27 | # or 28 | 29 | git clone https://github.com/sharkLoc/fqkit.git 30 | cd fqkit 31 | cargo b --release 32 | # mv target/release/fqkit to anywhere you want 33 | ``` 34 | 35 | ##### install latest version 36 | 37 | ```bash 38 | cargo install --git https://github.com/sharkLoc/fqkit.git 39 | ``` 40 | 41 | ## usage 42 | 43 | ```bash 44 | FqKit -- A simple and cross-platform program for fastq file manipulation 45 | 46 | Version: 0.4.8 47 | 48 | Authors: sharkLoc 49 | Source code: https://github.com/sharkLoc/fqkit.git 50 | 51 | Fqkit supports reading and writing gzip (.gz) format. 52 | Bzip2 (.bz2) format is supported since v0.3.8. 53 | Xz (.xz) format is supported since v0.3.9. 54 | Under the same compression level, xz has the highest compression ratio but consumes more time. 
55 | 56 | Compression level: 57 | format range default crate 58 | gzip 1-9 6 https://crates.io/crates/flate2 59 | bzip2 1-9 6 https://crates.io/crates/bzip2 60 | xz 1-9 6 https://crates.io/crates/xz2 61 | 62 | 63 | Usage: fqkit [OPTIONS] 64 | 65 | Commands: 66 | topn get first N records from fastq file [aliases: head] 67 | tail get last N records from fastq file 68 | concat concat fastq files from different lanes 69 | subfq subsample sequences from big fastq file [aliases: sample] 70 | select select pair-end reads by read id 71 | trim trim fastq reads by position 72 | adapter cut the adapter sequence on the reads 73 | filter a simple filter for paired-end fastq sequences 74 | range print fastq records in a range 75 | search search reads/motifs from fastq file 76 | grep grep fastq sequence by read id or full name 77 | stats summary for fastq format file [aliases: stat] 78 | shuffle shuffle fastq sequences 79 | size report the number of sequences and bases 80 | slide extract subsequences in sliding windows 81 | sort sort fastq file by name/seq/gc/length 82 | plot line plot for A T G C N percentage in read position 83 | fq2fa translate fastq to fasta 84 | fq2sam converts a fastq file to an unaligned SAM file 85 | fqscore converts the fastq file quality scores 86 | flatten flatten fastq sequences [aliases: flat] 87 | barcode perform demultiplexing for paired-end fastq reads [aliases: demux] 88 | check check the validity of a fastq record 89 | remove remove reads by read name 90 | rename rename sequence id in fastq file 91 | reverse get a reverse-complement of fastq file [aliases: rev] 92 | split split interleaved fastq file 93 | merge merge PE reads as interleaved fastq file 94 | mask convert any low quality base to 'N' or other chars 95 | split2 split fastq file by records number 96 | gcplot get GC content result and plot 97 | length get reads length count [aliases: len] 98 | view view fastq file page by page 99 | help Print this message or the help of the given subcommand(s) 100 | 101 | Global Arguments: 102 | --compress-level set gzip/bzip2/xz compression level 1 (compress faster) - 9 (compress better) for gzip/bzip2/xz output file, only works with option -o/--out [default: 6] 103 | --log if file name specified, write log message to this file, or write to stderr 104 | -v, --verbosity... control verbosity of logging, [-v: Error, -vv: Warn, -vvv: Info, -vvvv: Debug, -vvvvv: Trace, default: Debug] 105 | 106 | Global FLAGS: 107 | -q, --quiet be quiet and do not show any extra information 108 | -h, --help prints help information 109 | -V, --version prints version information 110 | 111 | Use "fqkit help [command]" for more information about a command 112 | ``` 113 | 114 | #### ** please report any bugs via issues **💖 115 | -------------------------------------------------------------------------------- /fastq/fqtk.md: -------------------------------------------------------------------------------- 1 | # fqtk 2 | 3 | 

4 | Badges from the original README header: Build Status · license · Version info · Install with bioconda 9 | 
10 | 11 | A toolkit for working with FASTQ files, written in Rust. 12 | 13 | Currently `fqtk` contains a single tool, `demux`, for demultiplexing FASTQ files based on sample barcodes. 14 | `fqtk demux` can be used to demultiplex one or more FASTQ files (e.g. a set of R1, R2 and I1 FASTQ files) with any number of sample barcodes at fixed locations within the reads. 15 | It is highly efficient and multi-threaded for high performance. 16 | 17 | Usage for `fqtk demux` follows: 18 | 19 | ```console 20 | Performs sample demultiplexing on FASTQs. 21 | 22 | The sample barcode for each sample in the metadata TSV will be compared against 23 | the sample barcode bases extracted from the FASTQs, to assign each read to a 24 | sample. Reads that do not match any sample within the given error tolerance 25 | will be placed in the ``unmatched_prefix`` file. 26 | 27 | FASTQs and associated read structures for each sub-read should be given: 28 | 29 | - a single fragment read (with inline index) should have one FASTQ and one read 30 | structure 31 | - paired end reads should have two FASTQs and two read structures 32 | - a dual-index sample with paired end reads should have four FASTQs and four read 33 | structures given: two for the two index reads, and two for the template reads. 34 | 35 | If multiple FASTQs are present for each sub-read, then the FASTQs for each 36 | sub-read should be concatenated together prior to running this tool (e.g. 37 | `zcat s_R1_L001.fq.gz s_R1_L002.fq.gz | bgzip -c > s_R1.fq.gz`). 38 | 39 | Read structures are made up of `<number><operator>` pairs much like the `CIGAR` 40 | string in BAM files. Four kinds of operators are recognized: 41 | 42 | 1. `T` identifies a template read 43 | 2. `B` identifies a sample barcode read 44 | 3. `M` identifies a unique molecular index read 45 | 4. `S` identifies a set of bases that should be skipped or ignored 46 | 47 | The last `<number><operator>` pair may be specified using a `+` sign instead of a 48 | number to denote "all remaining bases". This is useful if, e.g., fastqs have 49 | been trimmed and contain reads of varying length. Both reads must have template 50 | bases. Any molecular identifiers will be concatenated using the `-` delimiter 51 | and placed in the given SAM record tag (`RX` by default). Similarly, the sample 52 | barcode bases from the given read will be placed in the `BC` tag. 53 | 54 | Metadata about the samples should be given as a headered metadata TSV file with 55 | two columns: 1. `sample_id` - the id of the sample or library. 2. `barcode` - the 56 | expected barcode sequence associated with the `sample_id`. 57 | 58 | The read structures will be used to extract the observed sample barcode, template 59 | bases, and molecular identifiers from each read. The observed sample barcode 60 | will be matched to the sample barcodes extracted from the bases in the sample 61 | metadata and associated read structures. 62 | 63 | An observed barcode matches an expected barcode if all the following are true: 64 | 65 | 1. The number of mismatches (edits/substitutions) is less than or equal to the 66 | maximum mismatches (see --max-mismatches). 67 | 2. The difference between the number of mismatches in the best and second best 68 | barcodes is greater than or equal to the minimum mismatch delta 69 | (`--min-mismatch-delta`). The expected barcode sequence may contain Ns, 70 | which are not counted as mismatches regardless of the observed base (e.g. 71 | the expected barcode `AAN` will have zero mismatches relative to both the 72 | observed barcodes `AAA` and `AAN`). 
73 | 74 | ## Outputs 75 | 76 | All outputs are generated in the provided `--output` directory. For each sample 77 | plus the unmatched reads, FASTQ files are written for each read segment 78 | (specified in the read structures) of one of the types supplied to 79 | `--output-types`. 80 | 81 | FASTQ files have names of the format: 82 | 83 | {sample_id}.{segment_type}{read_num}.fq.gz 84 | 85 | where `segment_type` is one of `R`, `I`, and `U` (for template, barcode/index 86 | and molecular barcode/UMI reads respectively) and `read_num` is a number starting 87 | at 1 for each segment type. 88 | 89 | In addition a `demux-metrics.txt` file is written that is a tab-delimited file 90 | with counts of how many reads were assigned to each sample and derived metrics. 91 | 92 | ## Example Command Line 93 | 94 | As an example, if the sequencing run was 2x100bp (paired end) with two 8bp index 95 | reads both reading a sample barcode, as well as an in-line 8bp sample barcode in 96 | read one, the command line would be: 97 | 98 | fqtk demux \ 99 | --inputs r1.fq.gz i1.fq.gz i2.fq.gz r2.fq.gz \ 100 | --read-structures 8B92T 8B 8B 100T \ 101 | --sample-metadata metadata.tsv \ 102 | --output output_folder 103 | 104 | Usage: fqtk demux [OPTIONS] --inputs <INPUTS>... --read-structures <READ_STRUCTURES>... --sample-metadata <SAMPLE_METADATA> --output <OUTPUT> 105 | 106 | Options: 107 | -i, --inputs <INPUTS>... 108 | One or more input fastq files each corresponding to a sequencing read (e.g. R1, I1) 109 | 110 | -r, --read-structures <READ_STRUCTURES>... 111 | The read structures, one per input FASTQ in the same order 112 | 113 | -b, --output-types <OUTPUT_TYPES>... 114 | The read structure types to write to their own files (Must be one of T, B, 115 | or M for template reads, sample barcode reads, and molecular barcode reads) 116 | 117 | Multiple output types may be specified as a space-delimited list. 118 | 119 | [default: T] 120 | 121 | -s, --sample-metadata <SAMPLE_METADATA> 122 | A file containing the metadata about the samples 123 | 124 | -o, --output <OUTPUT> 125 | The output directory into which to write per-sample FASTQs 126 | 127 | -u, --unmatched-prefix <UNMATCHED_PREFIX> 128 | Output prefix for FASTQ file(s) for reads that cannot be matched to a sample 129 | 130 | [default: unmatched] 131 | 132 | --max-mismatches <MAX_MISMATCHES> 133 | Maximum mismatches for a barcode to be considered a match 134 | 135 | [default: 1] 136 | 137 | -d, --min-mismatch-delta <MIN_MISMATCH_DELTA> 138 | Minimum difference between number of mismatches in the best and second best barcodes 139 | for a barcode to be considered a match 140 | 141 | [default: 2] 142 | 143 | -t, --threads <THREADS> 144 | The number of threads to use. Cannot be less than 3 145 | 146 | [default: 8] 147 | 148 | -c, --compression-level <COMPRESSION_LEVEL> 149 | The level of compression to use to compress outputs 150 | 151 | [default: 5] 152 | 153 | -S, --skip-reasons <SKIP_REASONS> 154 | Skip demultiplexing reads for any of the following reasons, otherwise panic. 155 | 156 | 1. `too-few-bases`: there are too few bases or qualities to extract given the 157 | read structures. For example, if a read is 8bp long but the read structure 158 | is `10B`, or if a read is empty and the read structure is `+T`. 159 | 160 | -h, --help 161 | Print help information (use `-h` for a summary) 162 | 163 | -V, --version 164 | Print version information 165 | ``` 166 | 167 | ## Installing 168 | 169 | ### Installing with `conda` 170 | 171 | To install with conda you must first [install conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html#installation). 
172 | Then, in your command line (and with the environment you wish to install fqtk into active) run: 173 | 174 | ```console 175 | conda install -c bioconda fqtk 176 | ``` 177 | 178 | ### Installing with `cargo` 179 | 180 | To install with cargo you must first [install rust](https://doc.rust-lang.org/cargo/getting-started/installation.html). 181 | On Mac OS and Linux, this can be done with the command: 182 | 183 | ```console 184 | curl https://sh.rustup.rs -sSf | sh 185 | ``` 186 | 187 | Then, to install `fqtk` run: 188 | 189 | ```console 190 | cargo install fqtk 191 | ``` 192 | 193 | ### Building From Source 194 | 195 | First, clone the git repo: 196 | 197 | ```console 198 | git clone https://github.com/fulcrumgenomics/fqtk.git 199 | ``` 200 | 201 | Secondly, if you do not already have rust development tools installed, install via [rustup](https://rustup.rs/): 202 | 203 | ```console 204 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh 205 | ``` 206 | 207 | Then build the toolkit in release mode: 208 | 209 | ```console 210 | cd fqtk 211 | cargo build --release 212 | ./target/release/fqtk --help 213 | ``` 214 | 215 | ## Developing 216 | 217 | fqtk is developed in Rust and follows the conventions of using `rustfmt` and `clippy` to ensure both code quality and standardized formatting. 218 | When working on fqtk, before pushing any commits, please first run `./ci/check.sh` and resolve any issues that are reported. 219 | 220 | ## Releasing a New Version 221 | 222 | ### Pre-requisites 223 | 224 | Install [`cargo-release`][cargo-release-link] 225 | 226 | ```console 227 | cargo install cargo-release 228 | ``` 229 | 230 | ### Prior to Any Release 231 | 232 | Create a release that will not try to push to `crates.io` and verify the command: 233 | 234 | ```console 235 | cargo release [major,minor,patch,release,rc...] --no-publish 236 | ``` 237 | 238 | Note: "dry-run" is the default for cargo release. 239 | 240 | See the [`cargo-release` reference documentation][cargo-release-docs-link] for more information 241 | 242 | ### Semantic Versioning 243 | 244 | This tool follows [Semantic Versioning](https://semver.org/). In brief: 245 | 246 | * MAJOR version when you make incompatible API changes, 247 | * MINOR version when you add functionality in a backwards compatible manner, and 248 | * PATCH version when you make backwards compatible bug fixes. 249 | 250 | ### Major Release 251 | 252 | To create a major release: 253 | 254 | ```console 255 | cargo release major --execute 256 | ``` 257 | 258 | This will remove any pre-release extension, create a new tag and push it to github, and push the release to crates.io. 259 | 260 | Upon success, move the version to the [next candidate release](#release-candidate). 261 | 262 | Finally, make sure to [create a new release][new-release-link] on GitHub. 263 | 264 | ### Minor and Patch Release 265 | 266 | To create a _minor_ (_patch_) release, follow the [Major Release](#major-release) instructions substituting `major` with `minor` (`patch`): 267 | 268 | ```console 269 | cargo release minor --execute 270 | ``` 271 | 272 | ### Release Candidate 273 | 274 | To move to the next release candidate: 275 | 276 | ```console 277 | cargo release rc --no-tag --no-publish --execute 278 | ``` 279 | 280 | This will create or bump the pre-release version and push the changes to the main branch on github. 281 | This will not tag and publish the release candidate. 282 | If you would like to tag the release candidate on github, remove `--no-tag` to create a new tag and push it to github. 
283 |
284 | [cargo-release-link]: https://github.com/crate-ci/cargo-release
285 | [cargo-release-docs-link]: https://github.com/crate-ci/cargo-release/blob/master/docs/reference.md
286 |
287 | [new-release-link]: https://github.com/fulcrumgenomics/fqtk/releases/new
288 |
--------------------------------------------------------------------------------
/longreads/NextPolish2.md:
--------------------------------------------------------------------------------
1 | [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/nextpolish2/README.html)
2 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/nextpolish2/badges/version.svg)](https://anaconda.org/bioconda/nextpolish2)
3 | # NextPolish2
4 |
5 | Telomere-to-telomere (T2T) genomes have been emerging as a new hotspot in the field of genomics. Typically, we obtain a T2T genome from datasets that include both high-accuracy PacBio HiFi long reads and Oxford Nanopore Technologies (ONT) ultra-long reads. Although genomes assembled from HiFi long reads have considerably higher quality, they still contain a handful of assembly errors in regions where HiFi long reads stumble as well, such as homopolymer or low-complexity microsatellite regions. Additionally, a typical gap-filling step is accomplished using ONT ultra-long reads, which contain a certain amount of errors. Hence, current T2T genome assemblies still require further improvement in terms of consensus accuracy. NextPolish2 can be used to fix these errors (SNVs/indels) in a high-quality assembly. Through its built-in phasing module, it corrects only the erroneous bases while maintaining the original haplotype consistency. Therefore, even in regions with complex repeat elements, NextPolish2 will not produce overcorrections. In fact, in some cases it can reduce switch errors in heterozygous regions. NextPolish2 is not an upgraded version of NextPolish, but an additional supplement for the pursuit of extremely high-quality genome assemblies.
6 |
7 | ## Table of Contents
8 |
9 | - [Installation](#install)
10 | - [General usage](#usage)
11 | - [Getting help](#help)
12 | - [Citation](#cite)
13 | - [License](#license)
14 | - [Limitations](#limit)
15 | - [Benchmarking](#benchmark)
16 | - [FAQ](./doc/faq.md)
17 |
18 | ### Installation
19 |
20 | #### Installing from bioconda
21 | ```sh
22 | conda install nextpolish2
23 | ```
24 | #### Installing from source
25 | ##### Dependencies
26 |
27 | `NextPolish2` is written in Rust. Try the commands below (no root required) or refer [here](https://www.rust-lang.org/tools/install) to install `Rust` first.
28 | ```sh
29 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
30 | ```
31 |
32 | ##### Download and install
33 |
34 | ```sh
35 | git clone --recursive git@github.com:Nextomics/NextPolish2.git
36 | cd NextPolish2 && cargo build --release
37 | ```
38 |
39 | ##### Test
40 |
41 | ```sh
42 | cd test && bash hh.sh
43 | ```
44 |
45 | ### General usage
46 |
47 | NextPolish2 takes a genome assembly file, a HiFi mapping file and one or more k-mer dataset files from short reads as input, and generates the polished genome.
48 |
49 | 1. Prepare the HiFi mapping file ([winnowmap](https://github.com/marbl/Winnowmap) or [minimap2](https://github.com/lh3/minimap2/)).
50 |
51 | ```sh
52 | meryl count k=15 output merylDB asm.fa.gz
53 | meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt
54 | winnowmap -t 5 -W repetitive_k15.txt -ax map-pb asm.fa.gz hifi.fasta.gz|samtools sort -o hifi.map.sort.bam -
55 |
56 | # or mapping using minimap2
57 | # minimap2 -ax map-hifi -t 5 asm.fa.gz hifi.fasta.gz|samtools sort -o hifi.map.sort.bam -
58 |
59 | # indexing
60 | samtools index hifi.map.sort.bam
61 | ```
62 |
63 | 2. Prepare k-mer dataset files ([yak](https://github.com/lh3/yak)). Here we only produce 21-mer and 31-mer datasets; you can produce more k-mer datasets with different k-mer sizes.
64 |
65 | ```sh
66 | # produce a 21-mer dataset, remove -b 37 if you want to count singletons
67 | ./yak/yak count -o k21.yak -k 21 -b 37 <(zcat sr.R*.fastq.gz) <(zcat sr.R*.fastq.gz)
68 |
69 | # produce a 31-mer dataset, remove -b 37 if you want to count singletons
70 | ./yak/yak count -o k31.yak -k 31 -b 37 <(zcat sr.R*.fastq.gz) <(zcat sr.R*.fastq.gz)
71 | ```
72 |
73 | 3. Run NextPolish2.
74 |
75 | ```sh
76 | ./target/release/nextPolish2 -t 5 hifi.map.sort.bam asm.fa.gz k21.yak k31.yak > asm.np2.fa
77 |
78 | # or try with -r
79 | # ./target/release/nextPolish2 -r -t 5 hifi.map.sort.bam asm.fa.gz k21.yak k31.yak > asm.np2.fa
80 | ```
81 |
82 | ***Optional:*** If your genome is assembled via **trio binning**, you can discard reads that have a different haplotype from the reference before the mapping procedure; see [here](./doc/benchmark3.md) for an example.
83 |
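For convenience, steps 1-3 can be strung together into a single session. The following is a condensed sketch of exactly the commands shown above, using the minimap2 route for mapping:

```sh
# 1. map HiFi reads and index the sorted BAM
minimap2 -ax map-hifi -t 5 asm.fa.gz hifi.fasta.gz | samtools sort -o hifi.map.sort.bam -
samtools index hifi.map.sort.bam

# 2. build the short-read k-mer datasets
./yak/yak count -o k21.yak -k 21 -b 37 <(zcat sr.R*.fastq.gz) <(zcat sr.R*.fastq.gz)
./yak/yak count -o k31.yak -k 31 -b 37 <(zcat sr.R*.fastq.gz) <(zcat sr.R*.fastq.gz)

# 3. polish
./target/release/nextPolish2 -t 5 hifi.map.sort.bam asm.fa.gz k21.yak k31.yak > asm.np2.fa
```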
84 | #### More options
85 |
86 | Use `./target/release/nextPolish2 -h` to see options.
87 |
88 | ### Getting help
89 |
90 | #### Help
91 |
92 | Feel free to raise an issue at the [issue page](https://github.com/Nextomics/NextPolish2/issues/new).
93 |
94 | ***Note:*** Please ask questions on the issue page first. They are also helpful to other users.
95 | #### Contact
96 |
97 | For additional help, please send an email to huj\_at\_grandomics\_dot\_com.
98 |
99 | ### Citation
100 |
101 | Jiang Hu, Zhuo Wang, Fan Liang, Shan-Lin Liu, Kai Ye, De-Peng Wang, NextPolish2: A Repeat-aware Polishing Tool for Genomes Assembled Using HiFi Long Reads, Genomics, Proteomics & Bioinformatics, 2024, qzad009, https://doi.org/10.1093/gpbjnl/qzad009
102 |
103 | ### License
104 |
105 | NextPolish2 is freely available only for academic and other non-commercial use.
106 |
107 | ### Limitations
108 |
109 | 1. NextPolish2 can only correct regions that are mapped by HiFi reads. For regions without HiFi read mappings (usually caused by a high error rate), you can try to adjust the mapping parameters.
110 | 2. The performance of NextPolish2 relies heavily on the quality of the short reads.
111 | 3. NextPolish2 can only fix some structural misassemblies.
112 |
113 | ### Benchmarking
114 |
115 | | Source | Software | QV | Switch error rate (‱) |
116 | | :----: | :------- | :---: | :-------------------: |
117 | | [*A. thaliana*](./doc/benchmark1.md) | Hifiasm (primary) | 47.67 | 1.99 |
118 | | ^(simulated data, primary contigs)^ | NextPolish2 | **65.42** | **0.35** |
119 | | [*A. thaliana*](./doc/benchmark2.md) | Hifiasm (primary) | 58.03 | |
120 | | ^(Col-XJTU, primary contigs)^ | NextPolish2 | **64.26** | |
121 | | [*H. sapiens*](./doc/benchmark3.md) | Hifiasm (primary) | 60.25 | 0.15 |
122 | | ^(HG002, primary contigs)^ | NextPolish2 | **62.87** | **0.14** |
123 | | [*H. sapiens*](./doc/benchmark3.md) | Hifiasm (trio) | 59.77 | 0.21 |
124 | | ^(HG002, paternal contigs)^ | NextPolish2 | **63.49** | **0.20** |
125 | | [*H. sapiens*](./doc/benchmark3.md) | Hifiasm (trio) | 59.78 | 0.33 |
126 | | ^(HG002, maternal contigs)^ | NextPolish2 | **63.29** | **0.30** |
127 |
128 | ### Star
129 | You can track updates by tapping the **Star** button in the upper-right corner of the [GitHub page](https://github.com/Nextomics/NextPolish2).
--------------------------------------------------------------------------------
/longreads/chopper.md:
--------------------------------------------------------------------------------
1 | # chopper
2 |
3 | Rust implementation of [NanoFilt](https://github.com/wdecoster/nanofilt)+[NanoLyse](https://github.com/wdecoster/nanolyse), both originally written in Python. This tool, intended for long-read sequencing such as PacBio or ONT, filters and trims a fastq file.
4 | Filtering is done on average read quality and minimal or maximal read length; a headcrop (start of read) and tailcrop (end of read) can also be applied while printing the reads that pass the filter.
5 |
6 | Compared to the Python implementations, the aim is to deliver the same results and almost the same functionality at much faster execution times. At the moment this tool does not support filtering using a sequencing_summary file. If those features are of interest then please reach out.
7 |
8 | ## Installation
9 |
10 | Preferably, for most users, download a ready-to-use binary for your system from the [releases](https://github.com/wdecoster/chopper/releases) and add it to a directory on your $PATH.
11 | You may have to change the file permissions to execute it with `chmod +x chopper`.
12 |
13 | Alternatively, use conda to install:
14 | `conda install -c bioconda chopper`
15 |
16 | ## Usage
17 |
18 | chopper reads from stdin and writes to stdout.
19 |
20 | ```text
21 | FLAGS:
22 |     -h, --help       Prints help information
23 |     -V, --version    Prints version information
24 |
25 | OPTIONS:
26 |     --headcrop       Trim N nucleotides from the start of a read [default: 0]
27 |     --maxlength      Sets a maximum read length [default: 2147483647]
28 |     -l, --minlength  Sets a minimum read length [default: 1]
29 |     -q, --quality    Sets a minimum Phred average quality score [default: 0]
30 |     --tailcrop       Trim N nucleotides from the end of a read [default: 0]
31 |     --threads        Number of parallel threads to use [default: 4]
32 |     --contam         Fasta file with reference to check potential contaminants against [default None]
33 |     -i, --input      Input filename [default: read from stdin]
34 |     --maxgc          Sets a maximum GC content [default: 1.0]
35 |     --mingc          Sets a minimum GC content [default: 0.0]
36 | ```
37 |
38 | EXAMPLES:
39 |
40 | ```bash
41 | gunzip -c reads.fastq.gz | chopper -q 10 -l 500 | gzip > filtered_reads.fastq.gz
42 | chopper -q 10 -l 500 -i reads.fastq > filtered_reads.fastq
43 | chopper -q 10 -l 500 -i reads.fastq.gz | gzip > filtered_reads.fastq.gz
44 | ```
45 |
46 | Note that the tool may be substantially slower in the third example above; piping while decompressing is recommended (as in the first example).
47 |
48 | ## CITATION
49 |
50 | If you use this tool, please consider citing our [publication](https://academic.oup.com/bioinformatics/article/39/5/btad311/7160911).
51 | -------------------------------------------------------------------------------- /longreads/longshot.md: -------------------------------------------------------------------------------- 1 | # longshot 2 | 3 | Longshot is a variant calling tool for diploid genomes using long error prone reads such as Pacific Biosciences (PacBio) SMRT and Oxford Nanopore Technologies (ONT). It takes as input an aligned BAM/CRAM file and outputs a phased VCF file with variants and haplotype information. It can also genotype and phase input VCF files. It can output haplotype-separated BAM files that can be used for downstream analysis. Currently, it only calls single nucleotide variants (SNVs), but it can genotype indels if they are given in an input VCF. 4 | 5 | ## citation 6 | If you use Longshot, please cite the publication: 7 | 8 | [Edge, P. and Bansal, V., 2019. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nature communications, 10(1), pp.1-10.](https://www.nature.com/articles/s41467-019-12493-y) 9 | 10 | ## supported operating systems 11 | Longshot has been tested using Ubuntu 16.04 and 18.04, CentOS 6.6, Manjaro Linux 17.1.11, and Mac OS 10.14.2 Mojave. 12 | It should work on any linux-based system that has Rust and Cargo installed. 13 | 14 | ## dependencies 15 | 16 | * rust >= 1.40.0 17 | * zlib >= 1.2.11 18 | * xz >= 5.2.3 19 | * clangdev >= 7.0.1 20 | * gcc >= 7.3.0 21 | * libc-dev 22 | * make 23 | * various rust dependencies (automatically managed by cargo) 24 | 25 | (older versions may work but have not been tested) 26 | ## installation 27 | 28 | ### installation using Bioconda 29 | 30 | It is recommended to install Longshot using [Bioconda](https://bioconda.github.io/): 31 | ``` 32 | conda install longshot 33 | ``` 34 | This method supports Linux and Mac. 35 | If you do not have Bioconda, you can install it with these steps: 36 | First, install Miniconda (or Anaconda). Miniconda can be installed using the 37 | scripts [here](https://docs.conda.io/en/latest/miniconda.html). 38 | 39 | The Bioconda channel can then be added using these commands: 40 | ``` 41 | conda config --add channels defaults 42 | conda config --add channels bioconda 43 | conda config --add channels conda-forge 44 | ``` 45 | ### manual installation using apt for dependencies (Ubuntu 18.04) 46 | If you are using Ubuntu 18.04, you can install the dependencies using apt. Then, the Rust cargo package manager is used to compile Longshot. 47 | ``` 48 | sudo apt-get install cargo zlib1g-dev xz-utils \ 49 | libclang-dev clang cmake build-essential curl git # install dependencies 50 | git clone https://github.com/pjedge/longshot # clone the Longshot repository 51 | cd longshot # change directory 52 | cargo install --path . # install Longshot 53 | export PATH=$PATH:/home/$USER/.cargo/bin # add cargo binaries to path 54 | ``` 55 | Installation should take around 4 minutes on a typical desktop machine and will use between 400 MB (counting cargo) and 1.2 GB (counting all dependencies) of disk space. 56 | It is recommended to add the line ```export PATH=$PATH:/home/$USER/.cargo/bin``` to the end of your ```~/.bashrc``` file so that the longshot binary is in the PATH for future shell sessions. 
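For example, a one-liner that appends that line (assuming the default cargo install location used above):
```
echo 'export PATH=$PATH:/home/$USER/.cargo/bin' >> ~/.bashrc
```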
57 |
58 | ## usage:
59 | After installation, execute the longshot binary as follows:
60 | ```
61 | $ longshot [FLAGS] [OPTIONS] --bam <BAM> --ref <FASTA> --out <VCF>
62 | ```
63 |
64 | ## execution on an example dataset
65 | The directory ```example_data``` contains a simulated toy dataset that can be used to test out Longshot:
66 | - Reference genome containing 3 contigs each with length 200 kb (```example_data/genome.fa```)
67 | - 30x coverage simulated PacBio reads generated using [SimLoRD](https://bitbucket.org/genomeinformatics/simlord/) (```example_data/pacbio_reads_30x.bam```)
68 | - The 714 "true" variants for validation (```example_data/ground_truth_variants.vcf```)
69 |
70 | Run Longshot on the example data as follows:
71 | ```
72 | longshot --bam example_data/pacbio_reads_30x.bam --ref example_data/genome.fa --out example_data/longshot_output.vcf
73 | ```
74 |
75 | Execution should take around 5 to 10 seconds on a typical desktop machine. The output can be compared to ```ground_truth_variants.vcf``` for accuracy.
76 |
77 | ## command line options
78 | ```
79 | $ longshot --help
80 |
81 | Longshot: variant caller (SNVs) for long-read sequencing data
82 |
83 | USAGE:
84 |     longshot [FLAGS] [OPTIONS] --bam <BAM> --ref <FASTA> --out <VCF>
85 |
86 | FLAGS:
87 |     -A, --auto_max_cov         Automatically calculate mean coverage for region and set max coverage to mean_coverage +
88 |                                5*sqrt(mean_coverage). (SLOWER)
89 |     -S, --stable_alignment     Use numerically-stable (logspace) pair HMM forward algorithm. Is significantly slower but
90 |                                may be more accurate. Tests have shown this not to be necessary for highly error prone
91 |                                reads (PacBio CLR).
92 |     -F, --force_overwrite      If output files (VCF or variant debug directory) exist, delete and overwrite them.
93 |     -x, --max_alignment        Use max scoring alignment algorithm rather than pair HMM forward algorithm.
94 |     -n, --no_haps              Don't call HapCUT2 to phase variants.
95 |         --output-ref           print reference genotypes (non-variant), use this option only in combination with -v
96 |                                option.
97 |     -h, --help                 Prints help information
98 |     -V, --version              Prints version information
99 |
100 | OPTIONS:
101 |     -b, --bam <BAM>                    sorted, indexed BAM file with error-prone reads (CRAM files also supported)
102 |     -f, --ref <FASTA>                  indexed FASTA reference that BAM file is aligned to
103 |     -o, --out <VCF>                    output VCF file with called variants.
104 |     -r, --region <REGION>              Region in format <chrom> or <chrom:start-stop> in which to call variants
105 |                                        (1-based, inclusive).
106 |     -v, --potential_variants <VCF>     Genotype and phase the variants in this VCF instead of using pileup
107 |                                        method to find variants. NOTES: VCF must be gzipped and tabix indexed or
108 |                                        contain contig information. Use with caution because excessive false
109 |                                        potential variants can lead to inaccurate results. Every variant is used
110 |                                        and only the allele fields are considered -- Genotypes, filters,
111 |                                        qualities etc are ignored. Indel variants will be genotyped but not
112 |                                        phased. Structural variants (length > 50 bp) are currently not supported.
113 |     -O, --out_bam <BAM>                Write new bam file with haplotype tags (HP:i:1 and HP:i:2) for reads
114 |                                        assigned to each haplotype, any existing HP and PS tags are removed
115 |     -c, --min_cov <int>                Minimum coverage (of reads passing filters) to consider position as a
116 |                                        potential SNV. [default: 6]
117 |     -C, --max_cov <int>                Maximum coverage (of reads passing filters) to consider position as a
118 |                                        potential SNV. [default: 8000]
119 |     -q, --min_mapq <int>               Minimum mapping quality to use a read. [default: 20]
120 |     -a, --min_allele_qual <float>      Minimum estimated quality (Phred-scaled) of allele observation on read to
121 |                                        use for genotyping/haplotyping. [default: 7.0]
122 |     -y, --hap_assignment_qual <float>  Minimum quality (Phred-scaled) of read->haplotype assignment (for read
123 |                                        separation). [default: 20.0]
124 |     -Q, --potential_snv_cutoff <float> Consider a site as a potential SNV if the original PHRED-scaled QUAL
125 |                                        score for 0/0 genotype is below this amount (a larger value considers
126 |                                        more potential SNV sites). [default: 20.0]
127 |     -e, --min_alt_count <int>          Require a potential SNV to have at least this many alternate allele
128 |                                        observations. [default: 3]
129 |     -E, --min_alt_frac <float>         Require a potential SNV to have at least this fraction of alternate
130 |                                        allele observations. [default: 0.125]
131 |     -L, --hap_converge_delta <float>   Terminate the haplotype/genotype iteration when the relative change in
132 |                                        log-likelihood falls below this amount. Setting a larger value results in
133 |                                        faster termination but potentially less accurate results. [default:
134 |                                        0.0001]
135 |     -l, --anchor_length <int>          Length of indel-free anchor sequence on the left and right side of read
136 |                                        realignment window. [default: 6]
137 |     -m, --max_snvs <int>               Cut off variant clusters after this many variants. 2^m haplotypes must be
138 |                                        aligned against per read for a variant cluster of size m. [default: 3]
139 |     -W, --max_window <int>             Maximum "padding" bases on either side of variant realignment window
140 |                                        [default: 50]
141 |     -I, --max_cigar_indel <int>        Throw away a read-variant during allelotyping if there is a CIGAR indel
142 |                                        (I/D/N) longer than this amount in its window. [default: 20]
143 |     -B, --band_width <int>             Minimum width of alignment band. Band will increase in size if sequences
144 |                                        are different lengths. [default: 20]
145 |     -D, --density_params <string>      Parameters to flag a variant as part of a "dense cluster". Format
146 |                                        <n>:<l>:<gq>. If there are at least n variants within l base pairs with
147 |                                        genotype quality >=gq, then these variants are flagged as "dn" [default:
148 |                                        10:500:50]
149 |     -s, --sample_id <string>           Specify a sample ID to write to the output VCF [default: SAMPLE]
150 |         --hom_snv_rate <float>         Specify the homozygous SNV rate for genotype prior estimation [default:
151 |                                        0.0005]
152 |         --het_snv_rate <float>         Specify the heterozygous SNV rate for genotype prior estimation [default:
153 |                                        0.001]
154 |         --ts_tv_ratio <float>          Specify the transition/transversion rate for genotype prior estimation
155 |                                        [default: 0.5]
156 |     -P, --strand_bias_pvalue_cutoff <float>  Remove a variant if the allele observations are biased toward one strand
157 |                                        (forward or reverse) according to Fisher's exact test. Use this cutoff
158 |                                        for the two-tailed P-value.
[default: 0.01] 159 | -d, --variant_debug_dir write out current information about variants at each step of algorithm to 160 | files in this directory 161 | ``` 162 | 163 | ## usage examples 164 | Call variants with default parameters: 165 | ``` 166 | longshot --bam pacbio.bam --ref ref.fa --out output.vcf 167 | ``` 168 | Call variants for chromosome 1 only using the automatic max coverage cutoff: 169 | ``` 170 | longshot -A -r chr1 --bam pacbio.bam --ref ref.fa --out output.vcf 171 | ``` 172 | Call variants in a 500 kb region and then output the reads into ```reads.bam``` using a haplotype assignment threshold of 30: 173 | ``` 174 | longshot -r chr1:1000000-1500000 -y 30 -O reads.bam --bam pacbio.bam --ref ref.fa --out output.vcf 175 | ``` 176 | If a read has an assigned haplotype, it will get a tag `HP:i:1` or `HP:i:2` and tag `PS:i:x` where `x` is a phase set number of the variants it covers. 177 | 178 | ## important considerations 179 | - It is highly recommended to use reads with at least 30x coverage. 180 | - It is recommended to process chromosomes separately using the ```--region``` option. 181 | - Longshot has only been tested using data from humans. Results may vary with organisms with significantly higher or lower SNV rate. 182 | - It is important to set a reasonable max read coverage cutoff (```-C``` option) to filter out sites coinciding with genomic features such as CNVs which can be problematic for variant calling. If the ```-A``` option is used, Longshot will estimate the mean read coverage and set the max coverage to ```mean_cov+5*sqrt(mean_cov)```, which we have found to be a reasonable filter in practice for humans. 183 | - CNVs and mapping issues can result in dense clusters of false positive SNVs. Longshot will attempt to find clusters like this and mark them as "dn" in the FILTER field. The ```--density_params``` option is used to control which variants are flagged as "dn". The default parameters have been found to be effective for human sequencing data, but this option may need to be tweaked for other organisms with SNV rates significantly different from human. 184 | - Oxford Nanopore Technology (ONT) SMS reads are now officially supported. It is recommended to use the default ```--strand_bias_pvalue_cutoff``` of 0.01 for ONT reads, since this option filters out false SNV sites prior to variant calling. 185 | 186 | ## installation troubleshooting 187 | 188 | ### older version of Rust 189 | Check that the Rust version is 1.30.0 or higher: 190 | ``` 191 | rustc --version 192 | ``` 193 | If not, update Rust using this command: 194 | ``` 195 | rustup update 196 | ``` 197 | 198 | 199 | ### linker errors 200 | For example: 201 | ``` 202 | error: linking with `cc` failed: exit code: 1 203 | ... 204 | ... 205 | ... 
206 | = note: Non-UTF-8 output: /usr/bin/ld: /home/pedge/temp/longshot/target/release/build/longshot-347f3774e75b380c/out/libhapcut2.a(common.o)(.text.fprintf_time+0x81): unresolvable H\x89\\$\xe8H\x89l$\xf0H\x89\xf3L\x89d$\xf8H\x83\xec\x18H\x8bG\x10H\x89\xfdI\x89\xd4H\x89\xd6H\x8b;\xffPxH\x8bE\x10I\x8dt$\x08H\x8b{\x08\xffPxH\x8bE\x10H\x8b{\x10I\x8dt$\x10H\x8b\x1c$H\x8bl$\x08L\x8bd$\x10H\x8b@xH\x83\xc4\x18\xff\xe0f\x90H\x89\\$\xe8H\x89l$\xf0H\x89\xfbL\x89d$\xf8H\x83\xec\x18H\x8bG\x10I\x89\xd4H\x89\xf5H\x89\xf7\xffPhI\x89\x04$H\x8bC\x10H\x8d}\x08\xffPhH\x8b\x1c$I\x89D$\x08H\x8bl$\x08L\x8bd$\x10H\x83\xc4\x18\xc3\x0f\x1f relocation against symbol `time@@GLIBC_2.2.5\'\n/usr/bin/ld: BFD version 2.20.51.0.2-5.42.el6 20100205 internal error, aborting at reloc.c line 443 in bfd_get_reloc_size\n\n/usr/bin/ld: Please report this bug.\n\ncollect2: ld returned 1 exit status\n 207 | ... 208 | ... 209 | ... 210 | ``` 211 | Your system may have multiple versions of your linker that are causing a conflict. Rustc may be calling to a different or old version of the linker. In this case, specify the linker (in linux, gcc) as follows: 212 | ``` 213 | rustc -vV 214 | ``` 215 | Note the build target after "host: ", i.e. "x86_64-unknown-linux-gnu". 216 | ``` 217 | mkdir .cargo 218 | nano .cargo/config 219 | ``` 220 | edit the config file to have these contents: 221 | ``` 222 | [target.] 223 | linker = "" 224 | ``` 225 | for example, 226 | ``` 227 | [target.x86_64-unknown-linux-gnu] 228 | linker = "/opt/gnu/gcc/bin/gcc" 229 | ``` 230 | then, 231 | ``` 232 | cargo clean 233 | cargo build --release 234 | ``` -------------------------------------------------------------------------------- /metagenomics/coverm.md: -------------------------------------------------------------------------------- 1 | ![CoverM logo](https://github.com/wwood/CoverM/blob/main/images/coverm.png?raw=true) 2 | 3 | - [CoverM](#coverm) 4 | - [Installation](#installation) 5 | - [Install through the bioconda package](#install-through-the-bioconda-package) 6 | - [Pre-compiled binary](#pre-compiled-binary) 7 | - [Compiling from source](#compiling-from-source) 8 | - [Development version](#development-version) 9 | - [Dependencies](#dependencies) 10 | - [Shell completion](#shell-completion) 11 | - [Usage](#usage) 12 | - [Calculation methods](#calculation-methods) 13 | - [License](#license) 14 | 15 | # CoverM 16 | 17 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/coverm/badges/version.svg)](https://anaconda.org/bioconda/coverm) 18 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/coverm/badges/downloads.svg)](https://anaconda.org/bioconda/coverm) 19 | 20 | CoverM aims to be a configurable, easy to use and fast DNA read coverage and 21 | relative abundance calculator focused on metagenomics applications. 22 | 23 | CoverM calculates coverage of genomes/MAGs `coverm genome` ([help](https://wwood.github.io/CoverM/coverm-genome.html)) or individual 24 | contigs `coverm contig` ([help](https://wwood.github.io/CoverM/coverm-contig.html)). Calculating coverage by read mapping, its input can 25 | either be BAM files sorted by reference, or raw reads and reference genomes in various formats. 26 | 27 | ## Installation 28 | 29 | ### Install through the bioconda package 30 | 31 | CoverM and its dependencies can be installed through the [bioconda](https://bioconda.github.io/user/install.html) conda channel. 
After initial setup of conda and the bioconda channel, it can be installed with 32 | 33 | ``` 34 | conda install coverm 35 | ``` 36 | 37 | ### Pre-compiled binary 38 | 39 | Statically compiled CoverM binaries available on the [releases page](https://github.com/wwood/CoverM/releases). 40 | This installation method requires non-Rust dependencies to be installed separately - see the [dependencies section](#Dependencies). 41 | 42 | ### Compiling from source 43 | 44 | CoverM can also be installed from source, using the cargo build system after 45 | installing [Rust](https://www.rust-lang.org/). 46 | 47 | ``` 48 | cargo install coverm 49 | ``` 50 | 51 | ### Development version 52 | 53 | To run an unreleased version of CoverM, after installing 54 | [Rust](https://www.rust-lang.org/) and any additional dependencies listed below: 55 | 56 | ``` 57 | git clone https://github.com/wwood/CoverM 58 | cd CoverM 59 | cargo run -- genome ...etc... 60 | ``` 61 | 62 | To run tests: 63 | 64 | ``` 65 | cargo build 66 | cargo test 67 | ``` 68 | 69 | ### Dependencies 70 | 71 | For the full suite of options, additional programs must also be installed, when 72 | installing from source or for development. 73 | 74 | These can be installed using the conda YAML environment definition: 75 | 76 | ``` 77 | conda env create -n coverm -f coverm.yml 78 | ``` 79 | 80 | Or, these can be installed manually: 81 | 82 | * [samtools](https://github.com/samtools/samtools) v1.9 83 | * [tee](https://www.gnu.org/software/coreutils/), which is installed by default 84 | on most Linux operating systems. 85 | * [man](http://man-db.nongnu.org/), which is installed by default on most Linux 86 | operating systems. 87 | 88 | and some mapping software: 89 | 90 | * [minimap2](https://github.com/lh3/minimap2) v2.21 91 | * [bwa-mem2](https://github.com/bwa-mem2/bwa-mem2) v2.0 92 | 93 | For dereplication: 94 | 95 | * [Dashing](https://github.com/dnbaker/dashing) v0.4.0 96 | * [FastANI](https://github.com/ParBLiSS/FastANI) v1.3 97 | 98 | ### Shell completion 99 | 100 | Completion scripts for various shells e.g. BASH can be generated. For example, to install the bash completion script system-wide (this requires root privileges): 101 | 102 | ``` 103 | coverm shell-completion --output-file coverm --shell bash 104 | mv coverm /etc/bash_completion.d/ 105 | ``` 106 | 107 | It can also be installed into a user's home directory (root privileges not required): 108 | 109 | ``` 110 | coverm shell-completion --shell bash --output-file /dev/stdout >>~/.bash_completion 111 | ``` 112 | 113 | In both cases, to take effect, the terminal will likely need to be restarted. To test, type `coverm gen` and it should complete after pressing the TAB key. 114 | 115 | ## Usage 116 | 117 | CoverM operates in several modes. 
Detailed usage information including examples is given at the links below, or alternatively by using the `-h` or `--full-help` flags for each mode: 118 | 119 | * [genome](https://wwood.github.io/CoverM/coverm-genome.html) - Calculate coverage of genomes 120 | * [contig](https://wwood.github.io/CoverM/coverm-contig.html) - Calculate coverage of contigs 121 | 122 | There are several utility modes as well: 123 | 124 | * [make](https://wwood.github.io/CoverM/coverm-make.html) - Generate BAM files through alignment 125 | * [filter](https://wwood.github.io/CoverM/coverm-filter.html) - Remove (or only keep) alignments with insufficient identity 126 | * [cluster](https://wwood.github.io/CoverM/coverm-cluster.html) - Dereplicate and cluster genomes 127 | * shell-completion - Generate shell completion scripts 128 | 129 | ## Calculation methods 130 | 131 | The `-m/--methods` flag specifies the specific kind(s) of coverage that are 132 | to be calculated. 133 | 134 | To illustrate, imagine a set of 3 pairs of reads, where only 1 aligns to a 135 | single reference contig of length 1000bp: 136 | 137 | ``` 138 | read1_forward ========> 139 | read1_reverse <====+==== 140 | contig ...-----------------------------------------------------.... 141 | | | | | | 142 | position 200 210 220 230 240 143 | ``` 144 | 145 | The difference coverage measures would be: 146 | 147 | | Method | Value | Formula | Explanation | 148 | | ------------------ | ----------------------------------------------------------------------------------- | ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 149 | | mean | 0.02235294 | (10+9)/(1000-2*75) | The two reads have 10 and 9 bases aligned exactly, averaged over 1000-2*75 bp (length of contig minus 75bp from each end). | 150 | | relative_abundance | 33.3% | 0.02235294/0.02235294*(2/6) | If the contig is considered a genome, then its mean coverage is 0.02235294. There is a total of 0.02235294 mean coverage across all genomes, and 2 out of 6 reads (1 out of 3 pairs) map. This coverage calculation is only available in 'genome' mode. | 151 | | trimmed_mean | 0 | mean_coverage(mid-ranked-positions) | After removing the 5% of bases with highest coverage and 5% of bases with lowest coverage, all remaining positions have coverage 0. | 152 | | covered_fraction | 0.02 | (10+10)/1000 | 20 bases are covered by any read, out of 1000bp. | 153 | | covered_bases | 20 | 10+10 | 20 bases are covered. | 154 | | variance | 0.01961962 | var({1;20},{0;980}) | Variance is calculated as the sample variance. | 155 | | length | 1000 | | The contig's length is 1000bp. | 156 | | count | 2 | | 2 reads are mapped. | 157 | | reads_per_base | 0.002 | 2/1000 | 2 reads are mapped over 1000bp. | 158 | | metabat | contigLen 1000, totalAvgDepth 0.02235294, bam depth 0.02235294, variance 0.01961962 | | Reproduction of the[MetaBAT](https://bitbucket.org/berkeleylab/metabat) 'jgi_summarize_bam_contig_depths' tool output, producing [identical output](https://bitbucket.org/berkeleylab/metabat/issues/48/jgi_summarize_bam_contig_depths-coverage). | 159 | | coverage_histogram | 20 bases with coverage 1, 980 bases with coverage 0 | | The number of positions with each different coverage are tallied. 
| 160 | | rpkm | 1000000 | 2 * 10^9 / 1000 / 2 | Calculation here assumes no other reads map to other contigs. See https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ for an explanation of RPKM and TPM | 161 | | tpm | 1000000 | rpkm/total_of_rpkm * 10^6 | Calculation here assumes no other reads map to other contigs. See RPKM above. | 162 | 163 | Calculation of genome-wise coverage (`genome` mode) is similar to calculating 164 | contig-wise (`contig` mode) coverage, except that the unit of reporting is 165 | per-genome rather than per-contig. For calculation methods which exclude base 166 | positions based on their coverage, all positions from all contigs are considered 167 | together. For instance, if a 2000bp contig with all positions having 1X coverage 168 | is in a genome with 2,000,000bp contig with no reads mapped, then the 169 | trimmed_mean will be 0 as all positions in the 2000bp are in the top 5% of 170 | positions sorted by coverage. 171 | 172 | ## License 173 | 174 | CoverM is made available under GPL3+. See LICENSE.txt for details. Copyright Ben 175 | Woodcroft. 176 | 177 | Developed by Ben Woodcroft at the Queensland University of Technology [Centre for Microbiome Research](https://research.qut.edu.au/cmr/). 178 | -------------------------------------------------------------------------------- /metagenomics/skani.md: -------------------------------------------------------------------------------- 1 | # skani - accurate, fast nucleotide identity calculation for MAGs, genomes, and databases 2 | 3 | ## Introduction 4 | 5 | **skani** is a program for calculating average nucleotide identity (ANI) from DNA sequences (contigs/MAGs/genomes) for ANI > ~80%. 6 | 7 | skani uses an approximate mapping method without base-level alignment to get ANI. It is magnitudes faster than BLAST based methods and almost as accurate. skani offers: 8 | 9 | 1. **Accurate ANI calculations for MAGs**. skani is accurate for incomplete and medium-quality metagenome-assembled genomes (MAGs). Pure sketching methods (e.g. Mash) may underestimate ANI for incomplete MAGs. 10 | 11 | 2. **Aligned fraction results**. skani outputs the fraction of genome aligned, whereas pure k-mer based methods do not. 12 | 13 | 3. **Fast computations**. Indexing/sketching is ~ 3x faster than Mash, and querying is about 25x faster than FastANI (but slower than Mash). 14 | 15 | 4. **Efficient database search**. Querying a genome against a preprocessed database of >65000 prokaryotic genomes takes a few seconds with a single processor and ~6 GB of RAM. Constructing a database from genome sequences takes a few minutes to an hour. 16 | 17 | ## Updates 18 | 19 | ### v0.2.1 released - 2023-10-11 20 | 21 | More consistent support for small contigs and sequences. 22 | 23 | #### Major 24 | 25 | * --faster-small option included in dist and triangle. 26 | 27 | Genomes (and contigs with the --i, --ri, --qi options) with less than 20 marker k-mers are not screened according to the -s option. This was always the case but never documented. This makes skani more sensitive for small sequences, but can hamper performance on very large datasets with lots of small genomes/contigs. 28 | 29 | This heuristic can now be disabled with the `--faster-small` option. 30 | 31 | See the [CHANGELOG](https://github.com/bluenote-1577/skani/blob/main/CHANGELOG.md) for the skani's full versioning history. 32 | 33 | ## Install 34 | 35 | #### Option 1: Build from source 36 | 37 | Requirements: 38 | 1. 
[rust](https://www.rust-lang.org/tools/install) programming language and associated tools such as cargo are required and assumed to be in PATH. 39 | 2. A c compiler (e.g. GCC) 40 | 3. make 41 | 42 | Building takes a few minutes (depending on # of cores). 43 | 44 | ```sh 45 | git clone https://github.com/bluenote-1577/skani 46 | cd skani 47 | 48 | # If default rust install directory is ~/.cargo 49 | cargo install --path . --root ~/.cargo 50 | skani dist refs/e.coli-EC590.fasta refs/e.coli-K12.fasta 51 | 52 | # If ~/.cargo doesn't exist use below commands instead 53 | #cargo build --release 54 | #./target/release/skani dist refs/e.coli-EC590.fasta refs/e.coli-K12.fasta 55 | ``` 56 | 57 | See the [Releases](https://github.com/bluenote-1577/skani/releases) page for obtaining specific versions of skani. 58 | 59 | #### Option 2: Conda (source version: 0.2.1) 60 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/skani/badges/version.svg)](https://anaconda.org/bioconda/skani) 61 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/skani/badges/latest_release_date.svg)](https://anaconda.org/bioconda/skani) 62 | ```sh 63 | conda install -c bioconda skani 64 | ``` 65 | 66 | #### Option 3: Pre-built x86-64 linux statically compiled executable 67 | 68 | We offer a pre-built statically compiled executable for x86-64 Linux systems. That is, if you're on an x86-64 Linux system, you can just download the binary and run it without installing anything. 69 | 70 | For using the latest version of skani: 71 | 72 | ```sh 73 | wget https://github.com/bluenote-1577/skani/releases/download/latest/skani 74 | chmod +x skani 75 | ./skani -h 76 | ``` 77 | 78 | **Important**: the binary runs slightly slower (3-10%) most of the time, but it can be drastically slower on some tasks. 79 | 80 | ## Quick start 81 | 82 | ```sh 83 | # compare two genomes for ANI. skani is symmetric, so order does not affect ANI 84 | skani dist genome1.fa genome2.fa 85 | skani dist genome2.fa genome1.fa 86 | 87 | # compare multiple genomes; all options take -t for multi-threading. 88 | skani dist -t 3 -q query1.fa query2.fa -r reference1.fa reference2.fa -o all-to-all_results.txt 89 | 90 | # compare individual fasta records (e.g. contigs) 91 | skani dist --qi -q assembly1.fa --ri -r assembly2.fa 92 | 93 | # construct database and do memory-efficient search 94 | skani sketch genomes_to_search/* -o database 95 | skani search query1.fa query2.fa ... -d database 96 | 97 | # use sketch from "skani sketch" output as drop-in replacement 98 | skani dist database/query.fa.sketch database/ref.fa.sketch 99 | 100 | # construct similarity matrix/edge list for all genomes in folder 101 | skani triangle genome_folder/* > skani_ani_matrix.txt 102 | skani triangle genome_folder/* -E > skani_ani_edge_list.txt 103 | 104 | # we provide a script in this repository for clustering/visualizing distance matrices. 105 | # requires python3, seaborn, scipy/numpy, and matplotlib. 106 | python scripts/clustermap_triangle.py skani_ani_matrix.txt 107 | 108 | ``` 109 | 110 | ## Tutorials and manuals 111 | 112 | ### [skani basic usage information](https://github.com/bluenote-1577/skani/wiki/skani-basic-usage-guide) 113 | 114 | For more information about using the specific skani subcommands, see the [guide linked above](https://github.com/bluenote-1577/skani/wiki/skani-basic-usage-guide). 115 | 116 | ### skani tutorials 117 | 118 | 1. 
#### [Tutorial: setting up the GTDB prokaryotic genome database to search against](https://github.com/bluenote-1577/skani/wiki/Tutorial:-setting-up-the-GTDB-genome-database-to-search-against) 119 | 2. #### [Tutorial: classifying entire assemblies against > 85,000 genomes in under 2 minutes](https://github.com/bluenote-1577/skani/wiki/Tutorial:-classifying-entire-assemblies-(MAGs-or-contigs)-against-85,000-genomes-in-under-2-minutes) 120 | 3. #### [Tutorial: strain-level clustering of MAGs using skani, and why Mash/FastANI have issues](https://github.com/bluenote-1577/skani/wiki/Tutorial:-strain-and-species-level-clustering-of-MAGs-with-skani-triangle) 121 | 122 | ### [skani cookbook](https://github.com/bluenote-1577/skani/wiki/skani-cookbook) 123 | 124 | Some common use cases and parameter settings are outlined in the cookbook. 125 | 126 | ### [Pre-sketched databases for searching](https://github.com/bluenote-1577/skani/wiki/Pre%E2%80%90sketched-databases) 127 | 128 | Pre-sketched databases can be downloaded and quickly searched against. GTDB-R214 is currently supported. 129 | 130 | ### [skani advanced usage information](https://github.com/bluenote-1577/skani/wiki/skani-advanced-usage-guide) 131 | 132 | See the advanced usage guide linked above for more information about topics such as: 133 | 134 | * optimizing sensitivity/speed of skani 135 | * optimizing skani for long-reads or contigs 136 | * making skani for memory efficient for huge data sets 137 | 138 | ## Output 139 | 140 | If the resulting aligned fraction for the two genomes is < 15%, no output is given. 141 | 142 | **In practice, this means that only results with > ~82% ANI are reliably output** (with default parameters). See the [skani advanced usage guide](https://github.com/bluenote-1577/skani/wiki/skani-advanced-usage-guide) for information on how to compare lower ANI genomes. 143 | 144 | The default output for `search` and `dist` looks like 145 | ``` 146 | Ref_file Query_file ANI Align_fraction_ref Align_fraction_query Ref_name Query_name 147 | refs/e.coli-EC590.fasta refs/e.coli-K12.fasta 99.39 93.95 93.37 NZ_CP016182.2 Escherichia coli strain EC590 chromosome, complete genome NC_007779.1 Escherichia coli str. K-12 substr. W3110, complete sequence 148 | ``` 149 | - Ref_file: the filename of the reference. 150 | - Query_file: the filename of the query. 151 | - ANI: the ANI. 152 | - Aligned_fraction_query/reference: fraction of query/reference covered by alignments. 153 | - Ref/Query_name: the id of the first record in the reference/query file. 154 | 155 | The order of results is dependent on the command and not guaranteed to be deterministic when > 5000 query genomes are present. `dist` and `search` try to place the highest ANI results first. 156 | 157 | ## Citation 158 | 159 | Jim Shaw and Yun William Yu. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods (2023). https://doi.org/10.1038/s41592-023-02018-3 160 | 161 | ## Feature requests, issues 162 | 163 | skani is actively being developed by me ([Jim Shaw](https://jim-shaw-bluenote.github.io/)). I'm more than happy to accommodate simple feature requests (different types of outputs, etc). Feel free to open an issue with your feature request on the GitHub repository. If you catch any bugs, please open an issue or e-mail me (e-mail on my website). 
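A practical aside on the TSV output described above: because results are plain tab-separated text with one header line, standard shell tools are enough for quick post-processing. For example, using the `all-to-all_results.txt` file from the quick start (ANI is the third column, per the output table above):

```sh
# keep only comparisons with ANI >= 95, skipping the header line
tail -n +2 all-to-all_results.txt | awk -F'\t' '$3 >= 95'
```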
164 | 165 | ## Calling skani from rust or python 166 | 167 | ### Rust API 168 | 169 | If you're interested in using skani as a rust library, check out the minimal example here: https://github.com/bluenote-1577/skani-lib-example. The documentation is currently minimal (https://docs.rs/skani/0.1.0/skani/) and I guarantee no API stability. 170 | 171 | ### Python bindings 172 | 173 | If you're interested in calling skani from python, see the [pyskani](https://github.com/althonos/pyskani) python interface and bindings to skani written by [Martin Larralde](https://github.com/althonos). Note: I am not personally involved in the pyskani project and do not offer guarantees on the correctness of the outputs. -------------------------------------------------------------------------------- /metagenomics/sylph.md: -------------------------------------------------------------------------------- 1 | # sylph - fast and precise species-level metagenomic profiling with ANIs 2 | 3 | ## Introduction 4 | 5 | **sylph** is a program that performs ultrafast (1) **ANI querying** or (2) **metagenomic profiling** for metagenomic shotgun samples. 6 | 7 | **Containment ANI querying**: sylph can search a genome, e.g. E. coli, against your sample. If sylph outputs an estimate of 97% ANI, your sample contains an E. coli with 97% ANI to the queried genome. 8 | 9 | **Metagenomic profiling**: sylph can determine the species/taxa in your sample and their abundances, just like [Kraken](https://ccb.jhu.edu/software/kraken/) or [MetaPhlAn](https://github.com/biobakery/MetaPhlAn). 10 | 11 |
12 |
13 |
14 | Profiling 1 Gbp of mouse gut reads against 85,205 genomes in a few seconds
15 |
16 |
17 | 18 | 19 | ### Why sylph? 20 | 21 | 1. **Precise species-level profiling**: Our tests show that sylph is more precise than Kraken and about as precise and sensitive as marker gene methods (MetaPhlAn, mOTUs). 22 | 23 | 2. **Ultrafast, multithreaded, multi-sample**: sylph can be > 50x faster than MetaPhlAn for multi-sample processing. sylph only takes ~15GB of RAM for profiling against the entire GTDB-R220 database (110k genomes). 24 | 25 | 3. **Accurate (containment) ANIs down to 0.1x effective coverage**: for bacterial ANI queries of > 90% ANI, sylph can often give accurate ANI estimates down to 0.1x coverage. 26 | 27 | 4. **Easily customized databases**: sylph can profile against [metagenome-assembled genomes (MAGs), viruses, eukaryotes](https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases), and more. Taxonomic information can be incorporated downstream for traditional profiling reports. 28 | 29 | ### How does sylph work? 30 | 31 | sylph uses a k-mer containment method, similar to sourmash or Mash. sylph's novelty lies in **using a statistical technique to correct ANI for low coverage genomes** within the sample, giving accurate results for low abundance genomes. See [here for more information on what sylph can and can not do](https://github.com/bluenote-1577/sylph/wiki/Introduction:-what-is-sylph-and-how-does-it-work%3F). 32 | 33 | ## Very quick start 34 | 35 | #### Profile metagenome sample against [GTDB-R220](https://gtdb.ecogenomic.org/) (113,104 bacterial/archaeal species representative genomes) 36 | 37 | ```sh 38 | # download GTDB-R220 pre-built database (~13 GB) 39 | wget https://storage.googleapis.com/sylph-stuff/gtdb-r220-c200-dbv1.syldb 40 | 41 | # multi-sample paired-end profiling (sylph version >= 0.6) 42 | sylph profile gtdb-r220-c200-dbv1.syldb -1 *_1.fastq.gz -2 *_2.fastq.gz -t (threads) > profiling.tsv 43 | 44 | # multi-sample single-end profiling 45 | sylph profile gtdb-r220-c200-dbv1.syldb *.fastq -t (threads) > profiling.tsv 46 | ``` 47 | 48 | See below for install and more comprehensive usage information/tutorials/manuals. 49 | 50 | ## Install (current version v0.6.1) 51 | 52 | #### Option 1: conda install 53 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/sylph/badges/version.svg)](https://anaconda.org/bioconda/sylph) 54 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/sylph/badges/latest_release_date.svg)](https://anaconda.org/bioconda/sylph) 55 | 56 | ```sh 57 | conda install -c bioconda sylph 58 | ``` 59 | 60 | **WARNING**: conda may break if you don't have AVX2 instructions or for v0.6.0. See the [issue here](https://github.com/bluenote-1577/sylph/issues/2). The binary and source install still work. 61 | 62 | #### Option 2: Build from source 63 | 64 | Requirements: 65 | 1. [rust](https://www.rust-lang.org/tools/install) (version > 1.63) programming language and associated tools such as cargo are required and assumed to be in PATH. 66 | 2. A c compiler (e.g. GCC) 67 | 3. make 68 | 4. cmake 69 | 70 | Building takes a few minutes (depending on # of cores). 71 | 72 | ```sh 73 | git clone https://github.com/bluenote-1577/sylph 74 | cd sylph 75 | 76 | # If default rust install directory is ~/.cargo 77 | cargo install --path . --root ~/.cargo 78 | sylph query test_files/* 79 | ``` 80 | #### Option 3: Pre-built x86-64 linux statically compiled executable 81 | 82 | If you're on an x86-64 system, you can download the binary and use it without any installation. 
83 | 84 | ```sh 85 | wget https://github.com/bluenote-1577/sylph/releases/download/latest/sylph 86 | chmod +x sylph 87 | ./sylph -h 88 | ``` 89 | 90 | Note: the binary is compiled with a different set of libraries (musl instead of glibc), probably impacting performance. 91 | 92 | ## Standard usage 93 | 94 | #### Sketching reads/genomes (indexing) 95 | 96 | ```sh 97 | # all fasta -> one *.syldb; fasta are assumed to be genomes 98 | sylph sketch genome1.fa genome2.fa -o database 99 | #EQUIVALENT: sylph sketch -g genome1.fa genome2.fa -o database 100 | 101 | # multi-sample sketching of paired reads 102 | sylph sketch -1 A_1.fq B_1.fq -2 A_2.fq B_2.fq -d read_sketch_folder 103 | 104 | # multi-sample sketching for single end reads, fastq are assumed to be reads 105 | sylph sketch reads.fq 106 | #EQUIVALENT: sylph sketch -r reads.fq 107 | ``` 108 | 109 | #### Profiling or querying with sketch files 110 | ```sh 111 | # ANI querying 112 | sylph query database.syldb read_sketch_folder/*.sylsp -t (threads) > ani_queries.tsv 113 | 114 | # taxonomic profiling 115 | sylph profile database.syldb read_sketch_folder/*.sylsp -t (threads) > profiling.tsv 116 | ``` 117 | 118 | ## Tutorials, manuals, and pre-built databases 119 | 120 | ### [Pre-built databases](https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases) 121 | 122 | The pre-built databases [available here](https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases) can be downloaded and used with sylph for profiling and containment querying. 123 | 124 | ### [Cookbook](https://github.com/bluenote-1577/sylph/wiki/sylph-cookbook) 125 | 126 | For common use cases and fast explanations, see the above [cookbook](https://github.com/bluenote-1577/sylph/wiki/sylph-cookbook). 127 | 128 | ### Tutorials 129 | 1. #### [Introduction: 5-minute sylph tutorial outlining basic usage](https://github.com/bluenote-1577/sylph/wiki/5%E2%80%90minute-sylph-tutorial) 130 | 2. #### [Taxonomic profiling against GTDB database with MetaPhlAn output format](https://github.com/bluenote-1577/sylph/wiki/Taxonomic-profiling-with-the-GTDB%E2%80%90R214-database) 131 | 132 | ### Manuals 133 | 1. #### [Output format (TSV) and containment ANI explanation](https://github.com/bluenote-1577/sylph/wiki/Output-format) 134 | 2. #### [Incoporating custom taxonomies to get CAMI-like or MetaPhlAn-like outputs](https://github.com/bluenote-1577/sylph/wiki/Integrating-taxonomic-information-with-sylph) 135 | 136 | ### [sylph-utils](https://github.com/bluenote-1577/sylph-utils) 137 | 138 | For incorporating taxonomy and manipulating output formats, see the [sylph-utils repository](https://github.com/bluenote-1577/sylph-utils). 139 | 140 | ## Changelog 141 | 142 | #### Version v0.6.1 - 2024-04-29. 143 | 144 | * Made unknown estimation (-u) more robust for low-depth short-read sequencing. 145 | 146 | See the [CHANGELOG](https://github.com/bluenote-1577/sylph/blob/main/CHANGELOG.md) for complete details. 147 | 148 | ## Citing sylph 149 | 150 | Jim Shaw and Yun William Yu. Metagenome profiling and containment estimation through abundance-corrected k-mer sketching with sylph (2023). bioRxiv. 
-------------------------------------------------------------------------------- /pangenomics/impg.md: -------------------------------------------------------------------------------- 1 | # impg: implicit pangenome graph 2 | 3 | [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/impg/README.html) 4 | 5 | Pangenome graphs and whole genome multiple alignments are powerful tools, but they are expensive to build and manipulate. 6 | Often, we would like to be able to break a small piece out of a pangenome without constructing the whole thing. 7 | `impg` lets us do this by projecting sequence ranges through many-way (e.g. all-vs-all) pairwise alignments built by tools like `wfmash` and `minimap2`. 8 | 9 | ## What does `impg` do? 10 | 11 | At its core, `impg` lifts over ranges from a target sequence (used as reference) into the queries (the other sequences aligned to the sequence used as reference) described in alignments. 12 | In effect, it lets us pick up homologous loci from all genomes mapped onto our specific target region. 13 | This is particularly useful when you're interested in comparing a specific genomic region across different individuals, strains, or species in a pangenomic or comparative genomic setting. 14 | The output is provided in BED, BEDPE and PAF formats, making it straightforward to use to extract FASTA sequences for downstream use in multiple sequence alignment (like `mafft`) or pangenome graph building (e.g., `pggb` or `minigraph-cactus`). 15 | 16 | ## How does it work? 17 | 18 | `impg` uses coitrees (implicit interval trees) to provide efficient range lookup over the input alignments. 19 | CIGAR strings are converted to a compact delta encoding. 20 | This approach allows for fast and memory-efficient projection of sequence ranges through alignments. 21 | 22 | ## Using `impg` 23 | 24 | Getting started with `impg` is straightforward. Here's a basic example of how to use the command-line utility: 25 | 26 | ```bash 27 | impg -p cerevisiae.pan.paf.gz -r S288C#1#chrI:50000-100000 -x 28 | ``` 29 | 30 | Your alignments must use `wfmash` default or `minimap2 --eqx` type CIGAR strings which have `=` for matches and `X` for mismatches. The `M` positional match character is not allowed. 31 | 32 | Depending on your alignments, this might result in the following BED file: 33 | 34 | ```txt 35 | S288C#1#chrI 50000 100000 36 | DBVPG6044#1#chrI 35335 85288 37 | Y12#1#chrI 36263 86288 38 | DBVPG6765#1#chrI 36166 86150 39 | YPS128#1#chrI 47080 97062 40 | UWOPS034614#1#chrI 36826 86817 41 | SK1#1#chrI 52740 102721 42 | ``` 43 | 44 | In this example, `-p` specifies the path to the PAF file, `-r` defines the target range in the format of `seq_name:start-end`, and `-x` requests a *transitive closure* of the matches. 45 | That is, for each collected range, we then find what sequence ranges are aligned onto it. 46 | This is done progressively until we've closed the set of alignments connected to the initial target range. 47 | 48 | ### Installation 49 | 50 | To compile and install `impg` from source, you'll need a recent rust build toolchain and cargo. 51 | 52 | 1. Clone the repository: 53 | ```bash 54 | git clone https://github.com/ekg/impg.git 55 | ``` 56 | 2. Navigate to the `impg` directory: 57 | ```bash 58 | cd impg 59 | ``` 60 | 3. Compile the tool (requires rust build tools): 61 | ```bash 62 | cargo install --force --path . 
63 | ```
64 |
65 | ## Authors
66 |
67 | Erik Garrison
68 | Andrea Guarracino
69 | Bryce Kille
70 |
71 | ## License
72 |
73 | MIT
74 |
--------------------------------------------------------------------------------
/phylogenomics/segul.md:
--------------------------------------------------------------------------------
1 | # SEGUL
2 |
3 | ![Segul-Tests](https://github.com/hhandika/segul/workflows/Segul-Tests/badge.svg)
4 | ![Crate-IO](https://img.shields.io/crates/v/segul)
5 | ![Crates-Download](https://img.shields.io/crates/d/segul?color=orange&label=crates.io-downloads)
6 | ![GH-Release](https://img.shields.io/github/v/tag/hhandika/segul?label=gh-releases)
7 | ![GH-Downloads](https://img.shields.io/github/downloads/hhandika/segul/total?color=blue&label=gh-release-downloads)
8 | [![LoC](https://tokei.rs/b1/github/hhandika/segul?category=code)](https://github.com/XAMPPRocky/tokei)
9 | ![last-commit](https://img.shields.io/github/last-commit/hhandika/segul)
10 | ![License](https://img.shields.io/github/license/hhandika/segul)
11 |
12 | SEGUL is an ultra-fast, memory-efficient application for working with phylogenomic datasets. It is available as a standalone, zero-dependency command-line application, a GUI application (called SEGUI), and a library/package for Rust and other programming languages. It runs on everything from your smartphone to High Performance Computers (see platform support below). It is safe, multi-threaded, and easy to use.
13 |
14 | It is designed to handle operations on large genomic datasets while using minimal computational resources. However, it also provides convenient features for working on smaller datasets (e.g., Sanger datasets). In our tests, it consistently offers a faster and more efficient (low memory footprint) alternative to existing applications for a variety of sequence alignment manipulations ([see benchmark](https://www.segul.app/docs/cli_gui#performance)).
15 |
16 | ## Citation
17 |
18 | > Handika, H., and J. A. Esselstyn. 2024. SEGUL: Ultrafast, memory-efficient and mobile-friendly software for manipulating and summarizing phylogenomic datasets. _Molecular Ecology Resources_. [https://doi.org/10.1111/1755-0998.13964](https://doi.org/10.1111/1755-0998.13964).
19 |
20 | ## Links
21 |
22 | - App Documentation: [[EN]](https://segul.app/)
23 | - API Documentation: [[Rust]](https://docs.rs/segul/0.18.1/segul/)
24 | - GUI source code: [[Repository]](https://github.com/hhandika/segui)
25 |
26 | ## Installation
27 |
28 | ### GUI Version
29 |
30 | ### Desktop
31 |
32 | [`Microsoft Store download`](https://apps.microsoft.com/detail/SEGUI/9NP1BQ6FW9PW?mode=direct)
33 |
34 | [`Download on the Mac App Store`](https://apps.apple.com/us/app/segui/id6447999874?mt=12&itsct=apps_box_badge&itscg=30200)
40 |
41 | ### Mobile
42 |
43 | [`Download on the App Store`](https://apps.apple.com/us/app/segui/id6447999874?itsct=apps_box_badge&itscg=30200)
44 |
45 | [`Get it on Google Play`](https://play.google.com/store/apps/details?id=com.hhandika.segui&pcampaignid=pcampaignidMKT-Other-global-all-co-prtnr-py-PartBadge-Mar2515-1)
50 |
51 | Learn more about device requirements and GUI app installation in the [documentation](https://www.segul.app/docs/installation/install_gui).
52 |
53 | ### CLI Version
54 |
55 | The CLI app may work on any Rust-supported [platform](https://doc.rust-lang.org/nightly/rustc/platform-support.html).
However, we have only tested and officially support the following platforms:

- Linux
- macOS
- Windows
- Windows Subsystem for Linux (WSL)

#### CLI Installation Methods

- Pre-compiled binaries: [[Releases]](https://github.com/hhandika/segul/releases) [[Docs]](https://www.segul.app/docs/installation/install_binary)
- Package manager: [[Docs]](https://www.segul.app/docs/installation/install_cargo)
- From source: [[Docs]](https://www.segul.app/docs/installation/install_source)
- Beta version: [[Docs]](https://www.segul.app/docs/installation/install_dev)

## Features

| Feature                        | Quick Link |
| ------------------------------ | ---------- |
| Alignment concatenation        | [CLI](https://www.segul.app/docs/cli-usage/concat) / [GUI](https://www.segul.app/docs/gui-usage/align-concat) |
| Alignment conversion           | [CLI](https://www.segul.app/docs/cli-usage/convert) / [GUI](https://www.segul.app/docs/gui-usage/align-convert) |
| Alignment filtering            | [CLI](https://www.segul.app/docs/cli-usage/filter) / [GUI](https://www.segul.app/docs/gui-usage/align-filter) |
| Alignment splitting            | [CLI](https://www.segul.app/docs/cli-usage/split) / [GUI](https://www.segul.app/docs/gui-usage/align-split) |
| Alignment partition conversion | [CLI](https://www.segul.app/docs/cli-usage/part) / [GUI](https://www.segul.app/docs/gui-usage/align-partition) |
| Alignment summary statistics   | [CLI](https://www.segul.app/docs/cli-usage/summary) / [GUI](https://www.segul.app/docs/gui-usage/align-summary) |
| Genomic summary statistics     | [CLI](https://www.segul.app/docs/cli-usage/genomic) / [GUI](https://www.segul.app/docs/gui-usage/genomic) |
| Sequence extraction            | [CLI](https://www.segul.app/docs/cli-usage/extract) / [GUI](https://www.segul.app/docs/gui-usage/sequence-extract) |
| Sequence ID extraction         | [CLI](https://www.segul.app/docs/cli-usage/id) / [GUI](https://www.segul.app/docs/gui-usage/sequence-id) |
| Sequence ID mapping            | [CLI](https://www.segul.app/docs/cli-usage/map) / [GUI](https://www.segul.app/docs/gui-usage/sequence-id-map) |
| Sequence ID renaming           | [CLI](https://www.segul.app/docs/cli-usage/rename) / [GUI](https://www.segul.app/docs/gui-usage/sequence-rename) |
| Sequence removal               | [CLI](https://www.segul.app/docs/cli-usage/remove) / [GUI](https://www.segul.app/docs/gui-usage/sequence-remove) |
| Sequence translation           | [CLI](https://www.segul.app/docs/cli-usage/translate) / [GUI](https://www.segul.app/docs/gui-usage/sequence-translate) |

Supported sequence formats:

1. NEXUS
2. Relaxed PHYLIP
3. FASTA
4. FASTQ (gzipped and uncompressed)

All of these formats are supported in both interleaved and sequential variants. Except for FASTQ (DNA only), the app supports both DNA and amino acid sequences.

Supported partition formats:

1. RAxML
2. NEXUS

The NEXUS partition can be written as a charset block embedded in NEXUS-formatted sequences or as a separate file.
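As a taste of the CLI, the alignment concatenation feature linked in the table above might be invoked roughly like this. This is only an illustrative sketch: the flag names below are assumptions, not the documented interface, so consult the linked CLI docs for the authoritative usage.

```bash
# Hypothetical invocation: concatenate all NEXUS alignments in loci/ and
# write the result in PHYLIP format (flag names are assumptions).
segul align concat --dir loci/ --input-format nexus --output concat_loci --output-format phylip
```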
## Contribution

We welcome all kinds of contributions, from issue reports and ideas to improve the app, to code contributions. For ideas and issue reports, please post on [the GitHub issues page](https://github.com/hhandika/segul/issues). For code contributions, please fork the repository and send a pull request to this repository.

--------------------------------------------------------------------------------
/proteomics/sage.md:
--------------------------------------------------------------------------------

# Sage: proteomics searching so fast it seems like magic

[![Rust](https://github.com/lazear/sage/actions/workflows/rust.yml/badge.svg)](https://github.com/lazear/sage/actions/workflows/rust.yml) [![Anaconda-Server Badge](https://anaconda.org/bioconda/sage-proteomics/badges/version.svg)](https://anaconda.org/bioconda/sage-proteomics)

For more information, please read [the online documentation!](https://sage-docs.vercel.app/docs)

# Introduction

Sage is, at its core, a proteomics database search engine - a tool that transforms raw mass spectra from proteomics experiments into peptide identifications via database searching and spectral matching.

However, Sage includes a variety of advanced features that make it a one-stop shop: retention time prediction, quantification (both isobaric and LFQ), peptide-spectrum match rescoring, and FDR control. You can use results from Sage directly, without needing other tools for these tasks.

Additionally, Sage was designed with cloud computing in mind - massively parallel processing and the ability to directly stream compressed mass spectrometry data to/from AWS S3 enable unprecedented search speeds at minimal cost.

Sage also runs just as well reading local files from your Mac/PC/Linux device!

## Why use Sage instead of other tools?

Sage is **simple to configure**, **powerful**, and **flexible**.
It also happens to be well-tested, **mind-bogglingly fast**, open-source (MIT-licensed), and free.

## Citation

If you use Sage in a scientific publication, please cite the following paper:

[Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale](https://doi.org/10.1021/acs.jproteome.3c00486)

## Features

- Incredible performance out of the box
- [Effortlessly cross-platform](https://sage-docs.vercel.app/docs/started#download-the-latest-binary-release) (Linux/macOS/Windows), effortlessly parallel (uses all of your CPU cores)
- [Fragment indexing strategy](https://sage-docs.vercel.app/docs/how_it_works) allows for blazing-fast narrow and open searches (> 500 Da precursor tolerance)
- [Isobaric quantification](https://sage-docs.vercel.app/docs/how_it_works#tmt-based) (MS2/MS3-TMT, or custom reporter ions)
- [Label-free quantification](https://sage-docs.vercel.app/docs/how_it_works#label-free): considers all charge states & isotopologues *a la* FlashLFQ
- Capable of searching for [chimeric/co-fragmenting spectra](https://sage-docs.vercel.app/docs/configuration/additional)
- Wide-window (dynamic precursor tolerance) search mode - [enables WWA/PRM/DIA searches](https://sage-docs.vercel.app/docs/configuration/tolerance#wide-window-mode)
- Retention time prediction models fit to each LC/MS run
- [PSM rescoring](https://sage-docs.vercel.app/docs/how_it_works#machine-learning-for-psm-rescoring) using built-in linear discriminant analysis (LDA)
- PEP calculation using a non-parametric model (KDE)
- FDR calculation using target-decoy competition and picked-peptide & picked-protein approaches
- Percolator/Mokapot [compatible output](https://sage-docs.vercel.app/docs/configuration#env)
- Configuration by [JSON file](https://sage-docs.vercel.app/docs/configuration#file)
- Built-in support for reading gzipped mzML files
- Support for reading/writing directly from [AWS S3](https://sage-docs.vercel.app/docs/configuration/aws)
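In practice, a run takes the JSON settings file as its first argument, and mzML files can either be listed in the config or passed on the command line. The sketch below is hedged: the file names are placeholders, and the exact invocation should be checked against the configuration docs linked above.

```bash
# Search two gzipped mzML runs using the settings in config.json
# (paths are placeholders for this sketch).
sage config.json sample_01.mzML.gz sample_02.mzML.gz
```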
## Interoperability

Sage is well-integrated into the open-source proteomics ecosystem. The following projects support analyzing results from Sage (typically in addition to other tools), or redistribute Sage binaries for use in their pipelines.

- [SearchGUI](http://compomics.github.io/projects/searchgui): a graphical user interface for running searches
- [PeptideShaker](http://compomics.github.io/projects/peptide-shaker): visualize peptide-spectrum matches
- [MS2Rescore](http://compomics.github.io/projects/ms2rescore): AI-assisted rescoring of results
- [Picked group FDR](https://github.com/kusterlab/picked_group_fdr): scalable protein group FDR for large-scale experiments
- [sagepy](https://github.com/theGreatHerrLebert/sagepy): Python bindings to the sage-core library
- [quantms](https://github.com/bigbio/quantms): Nextflow pipeline for running searches with Sage
- [OpenMS](https://github.com/OpenMS/OpenMS): Sage is included as a "TOPP" tool in OpenMS
- [sager](https://github.com/UCLouvain-CBIO/sager): R package for analyzing results from Sage searches
- If your project supports Sage and it's not listed, please open a pull request! If you need help integrating or interfacing with Sage in some way, please reach out.

Check out the (now outdated) [blog post introducing the first version of Sage](https://lazear.github.io/sage/) for more information and full benchmarks!

--------------------------------------------------------------------------------
/rna/rnapkin.md:
--------------------------------------------------------------------------------

# rnapkin: drawing RNA secondary structure with style

[![Crates.io](https://img.shields.io/crates/v/rnapkin?color=F55353)](https://crates.io/crates/rnapkin)
[![Downloads](https://img.shields.io/crates/d/rnapkin?color=FEB139)](https://crates.io/crates/rnapkin)

## Usage

rnapkin accepts a file containing a secondary structure and, optionally, a sequence and a name.
For example, you could have this marvelous RNA molecule sitting peacefully
in a file called "guaniners":

```text
>fantastic guanine riboswitch
AAUAUAAUAGGAACACUCAUAUAAUCGCGUGGAUAUGGCACGCAAGUUUCUACCGGGCAC
..........(..(.((((.((((..(((((.......)))))..........((((((.
CGUAAAUGUCCGACUAUGGGUGAGCAAUGGAACCGCACGUGUACGGUUUUUUGUGAUAUC
......)))))).....((((((((((((((((((........)))))))...........
AGCAUUGCUUGCUCUUUAUUUGAGCGGGCAAUGCUUUUUUUA
..)))))))))))).)))).)))).)..).............
```

Then, if you wish to visualize it, you could invoke rnapkin thus:

```
rnapkin guaniners
```

Surely rnapkin would respond with the name of the file it has just drawn to:

```
fantastic_guanine_riboswitch.svg
```

And this scalable vector graphic would be produced:

*(figure: the rendered secondary structure of the fantastic guanine riboswitch)*
*I* happen to quite enjoy the outcome, so *I* would say:

```
that's pretty neat
```

Your mileage may vary though.

## Rotating and flipping

If you'd like to see this or any other RNA molecule upside-down, tilted, or what have you, there are
some options listed below that you can use and combine:

```text
-a / --angle | starting Angle / boils down to clockwise rotation
--mx         | Mirror along X axis / aka vertical flip
--my         | Mirror along Y axis / aka horizontal flip
```
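For example, combining these with flags shown elsewhere in this README (the `-o` naming behavior and the `.png` trick are described in the Input section below):

```bash
# rotate the drawing 90 degrees clockwise, then also flip it vertically
rnapkin guaniners -a 90 --mx
# horizontal flip, with an explicit output name (.png requests a PNG render)
rnapkin guaniners --my -o flipped.png
```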

*(figures: the riboswitch rotated and mirrored with the options above, plus renders in different color themes)*
49 | 50 | color themes can be changed by -t option as demonstrated; a config file allowing to define custom color themes 51 | is planned though unimplemented!() 52 | 53 | ## Installing 54 | I plan to offer precompiled binaries but for now 55 | you'll need **rust**. Easiest way to acquire **rust** is via [rustup](https://rustup.rs) :crab: 56 | 57 | ### Anywhere 58 | ```bash 59 | cargo install rnapkin 60 | ``` 61 | ### WSL 62 | Fontconfig is the default Fontmanagement utility on Linux and Unix but WSL may not have them installed; 63 | ```bash 64 | sudo apt-get install libfontconfig libfontconfig1-dev 65 | cargo install rnapkin 66 | ``` 67 | 68 | ## Input 69 | input can be served to rnapkin as a file or be piped in: 70 | 71 | ```bash 72 | rnapkin cmolecule.fa -a 20 -o crab 73 | echo ".......(((((......))))).....(((((......)))))......." | rnapkin -a 20 -o crab 74 | ``` 75 | 76 | input is quite **flexible**; it should contain secondary_structure and optionally 77 | name and sequence. Name has to start with ">" and can be overwritten with -o flag 78 | which has priority. Here are some variations of valid input files: 79 | 80 | ### simple one 81 | 82 | ```text 83 | # you can add .png to the name to request png instead of svg 84 | @ the same of course can be achieved with -o flag. 85 | * this is a comment btw: any symbol other than ">.()" works but prefer "#" 86 | >simple molecule.png 87 | ((((((((((..((((((.........))))))......).((((((.......))))))..))))))))) 88 | CGCUUCAUAUAAUCCUAAUGAUAUGGUUUGGGAGUUUCUACCAAGAGCCUUAAACUCUUGAUUAUGAAGUG 89 | ``` 90 | 91 | ### Highlighting! 92 | There are [9 available colors](https://docs.rs/rnapkin/0.3.2/rnapkin/draw/colors/default_pallette/constant.HIGHLIGHTS.html) 93 | denoted by 1-9, while 0 means None. 94 | For example consider input below representing SAM riboswitch in the OFF conformation 95 | according to smFRET study by [Manz et al. 2017](https://doi.org/10.1038/nchembio.2476). 96 | By using numbers in the input we can mark aptamer forming helices P1, P2, P3, P4 #2-#5 and 97 | the **TERMINATOR** #1. 98 | 99 | ```text 100 | > offsam 101 | 102 | 0000022222222223333333333333333333333333333333333444444444444444444444444444444 103 | AUAUCCGUUCUUAUCAAGAGAAGCAGAGGGACUGGCCCGACGAUGCUUCAGCAACCAGUGUAAUGGCGAUCAGCCAUGA 104 | .......((((((((....(((((...(((.....)))......)))))(((..(((((...(((((.....))))).) 105 | 106 | 4444444444555555555555555555555555555522222222222211111111111111111111111111111 107 | CUAAGGUGCUAAAUCCAGCAAGCUCGAACAGCUUGGAAGAUAAGAAGAGACAAAAUCACUGACAAAGUCUUCUUCUUAA 108 | ))..)).)))........((((((.....))))))...)))))))).................((((((.((((...)) 109 | 110 | 111111111111 111 | GAGGACUUUUUU 112 | )).))))))... 113 | ``` 114 | 115 |

*(figure: the "offsam" structure rendered with the terminator and helices P1-P4 highlighted)*
### only secondary structure

```text
.........(((..((((((...((((((((.....((((((((((...)))))).....
(((((((...))))))).))))(((.....)))...)))).)))).))))))..)))..(
(((.(((((..(((......))).)))))..))))(((((((((((((....))))))))
))))).....
```

### multiline

Sequence and secondary structure can be separate,
mixed, and aligned; everything should work.

## DIY

Using the -p / --points flag, you can make rnapkin print the calculated coordinates
of the nucleotide bubbles (each with a 0.5-unit radius). You can then plot them
yourself if you need to do something specific.

If you happen to clone the repository, there is an example Python
script using **matplotlib** that you can pipe the output to:

```bash
cargo run -- atelier/example_inputs/guaniners -p | atelier/plot.py
```

You can also combine the -p flag with --mx, --my, and -a.

## rnapkin name

The wordsmithing process was arduous. It involved
googling "words starting with na" and looking for anything drawing-related.
Once the word was found, unparalleled strength was employed to slap it on top of "rna",
ultimately creating this glorious amalgamation.

### why it kinda makes sense:

You ever heard of all those physicists, mathematicians, and the like scribbling formulas on the
back of a napkin ~~or a book margin~~? There is even a [wikipedia page](https://en.wikipedia.org/wiki/Back-of-the-envelope_calculation) about it.

It doesn't take much mental gymnastics to imagine a biologist frantically scrambling together
an rna structure on a napkin. I am currently working on baiting my biologist
friend into a heated rna debate while in close proximity to an abundant napkin source
in order to produce a proof of concept.

--------------------------------------------------------------------------------
/singlecell/alevin-fry.md:
--------------------------------------------------------------------------------

# alevin-fry ![Rust](https://github.com/COMBINE-lab/alevin-fry/workflows/Rust/badge.svg) [![Anaconda-Server Badge](https://anaconda.org/bioconda/alevin-fry/badges/platforms.svg)](https://anaconda.org/bioconda/alevin-fry) [![Anaconda-Server Badge](https://anaconda.org/bioconda/alevin-fry/badges/license.svg)](https://anaconda.org/bioconda/alevin-fry) ![GitHub tag (latest SemVer)](https://img.shields.io/github/v/tag/combine-lab/alevin-fry?style=flat-square)

`alevin-fry` is a suite of tools for the rapid, accurate, and memory-frugal processing of single-cell and single-nucleus sequencing data. It consumes RAD files generated by [`piscem`](https://github.com/COMBINE-lab/piscem) or `salmon alevin`, and performs common operations like generating permit lists and estimating the number of distinct molecules from each gene within each cell. The focus in `alevin-fry` is on safety, accuracy, and efficiency (in terms of both time and memory usage).

You can read the paper describing `alevin-fry`, "Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data", [here](https://www.nature.com/articles/s41592-022-01408-3), and the preprint [on bioRxiv](https://www.biorxiv.org/content/10.1101/2021.06.29.450377v1).
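The permit-list generation and per-cell molecule estimation mentioned above are exposed as subcommands that are typically chained. The sketch below is a rough outline from memory: directory names are placeholders and the flags are assumptions, so treat the dedicated documentation linked below as the authoritative interface.

```bash
# 1) decide which barcodes correspond to real cells (knee-distance method),
#    with expected read orientation "fw" (flags assumed; check the docs)
alevin-fry generate-permit-list -i map_dir -d fw --knee-distance -o quant_dir

# 2) group the mapped records by corrected barcode
alevin-fry collate -i quant_dir -r map_dir -t 16

# 3) resolve UMIs and produce the gene-by-cell count matrix
alevin-fry quant -i quant_dir -m t2g.tsv -r cr-like -o quant_res -t 16
```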
**Note**: We recommend using [`piscem`](https://github.com/COMBINE-lab/piscem) as the back-end mapper rather than salmon, as it is substantially more resource-frugal, faster, and a larger focus of current and future development.

### Getting started with `alevin-fry` and dedicated documentation

While this `README` contains some useful information and pointers to get started, `alevin-fry` has its own [dedicated documentation site](https://alevin-fry.readthedocs.io/en/latest/), hosted on `ReadTheDocs`.

### More information

* [**Quickstart guide using the `simpleaf` wrapper**](https://combine-lab.github.io/alevin-fry-tutorials/2023/simpleaf-piscem/)

* **Relationship to alevin**: Alevin-fry has been designed as the successor to alevin. It subsumes the core features of alevin, while also providing important new capabilities and a considerably improved performance profile. We anticipate that new method development and feature additions will take place primarily within the alevin-fry codebase. Thus, we encourage users of alevin to migrate to alevin-fry when feasible. That being said, alevin is still actively maintained and supported, so if you are using it and not ready to migrate, you can continue to ask questions and post issues in [the salmon repository](https://github.com/COMBINE-lab/salmon).

## FAQs

Are you curious about processing details like [whether to use a sparse or dense index](https://github.com/COMBINE-lab/alevin-fry/discussions/38)? Do you have a question that isn't necessarily a bug report or feature request, and that isn't readily answered by the documentation or tutorials? Then please feel free to ask over in the [Q&A](https://github.com/COMBINE-lab/alevin-fry/discussions/categories/q-a).

## Sister repositories

* The generation of the reduced alignment data (RAD) files processed by alevin-fry is done by either [piscem](https://github.com/COMBINE-lab/piscem) or [salmon](https://github.com/COMBINE-lab/salmon). The latest versions of both are available on GitHub and via bioconda.

* The [`simpleaf`](https://github.com/COMBINE-lab/simpleaf) repository contains a dedicated wrapper / workflow runner for processing data with `alevin-fry` that vastly simplifies both the creation of extended references and the subsequent quantification of samples. If you find that `simpleaf` is missing a feature that you'd like to have, please consider submitting a feature request in the [`simpleaf` repository](https://github.com/COMBINE-lab/simpleaf/issues).

* The [`pyroe`](https://github.com/COMBINE-lab/pyroe) repository provides tools to help easily construct an enhanced (_spliced + intronic_ or _spliced + unspliced_) transcriptome from a reference genome and GTF file.

* The [`fishpond`](https://github.com/mikelove/fishpond) package — maintained by @mikelove and his lab — contains the recommended functions for reading `alevin-fry` output (particularly USA-mode output) into the R ecosystem, in the form of a [`SingleCellExperiment`](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) object.

* The [`alevinqc`](https://github.com/csoneson/alevinQC) package — maintained by @csoneson — provides tools and functions for performing quality control and assessment downstream of `alevin-fry`.
## Installing from bioconda

Alevin-fry is available for both x86 Linux and macOS platforms [via bioconda](https://anaconda.org/bioconda/alevin-fry). On Apple silicon, you can either build (easily) from source (see below) or run `alevin-fry` under the Rosetta 2 emulation layer.

With `bioconda` in the appropriate place in your channel list, you should simply be able to install via:

```{bash}
$ conda install -c bioconda alevin-fry
```

## Installing from crates.io

Alevin-fry can also be installed from [`crates.io`](https://crates.io/crates/alevin-fry) using `cargo`. This can be done with the following command:

```{bash}
$ cargo install alevin-fry
```

## Building from source

If you want to use features or fixes that may only be available in the latest develop branch (or want to build for a different architecture), then you have to build from source. Luckily, `cargo` makes that easy; see below.

Alevin-fry is built and tested with the latest (major & minor) stable version of [Rust](https://www.rust-lang.org/). While it will likely compile fine with slightly older versions of Rust, this is not a guarantee and is not a support priority. Unlike with C++, Rust has a frequent and stable release cadence, is designed to be installed and updated from user space, and is easy to keep up to date with [rustup](https://rustup.rs/). Thanks to cargo, building should be as easy as:

```{bash}
$ cargo build --release
```

Subsequent commands below will assume that the executable is in your path. Temporarily, this can
be done (in bash-like shells) using:

```{bash}
$ export PATH=`pwd`/target/release/:$PATH
```

## Citing alevin-fry

If you use `alevin-fry` in your work, please cite:

> He, D., Zakeri, M., Sarkar, H., Soneson, C., Srivastava, A., and Patro, R. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19, 316–322 (2022). https://doi.org/10.1038/s41592-022-01408-3

**BibTeX:**

```
@Article{He2022,
  author={He, Dongze and Zakeri, Mohsen and Sarkar, Hirak and Soneson, Charlotte and Srivastava, Avi and Patro, Rob},
  title={Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data},
  journal={Nature Methods},
  year={2022},
  month={Mar},
  day={01},
  volume={19},
  number={3},
  pages={316-322},
  issn={1548-7105},
  doi={10.1038/s41592-022-01408-3},
  url={https://doi.org/10.1038/s41592-022-01408-3}
}
```

--------------------------------------------------------------------------------
/singlecell/cellranger.md:
--------------------------------------------------------------------------------

Cell Ranger is a set of analysis pipelines that perform sample demultiplexing, barcode processing, single-cell 3' and 5' gene counting, V(D)J transcript sequence assembly and annotation, and Feature Barcode analysis from 10x Genomics Chromium Single Cell data.

Please note that this source code is made available only for informational purposes. 10x does not provide support for interpreting, modifying, building, or running this code.

The officially supported release binaries, along with the documentation, are available from 10x Genomics.
--------------------------------------------------------------------------------
/slurm/ssubmit.md:
--------------------------------------------------------------------------------

# ssubmit

[![Rust CI](https://github.com/mbhall88/ssubmit/actions/workflows/ci.yaml/badge.svg)](https://github.com/mbhall88/ssubmit/actions/workflows/ci.yaml)
[![codecov](https://codecov.io/gh/mbhall88/ssubmit/branch/main/graph/badge.svg?token=4O7HTGKD6Q)](https://codecov.io/gh/mbhall88/ssubmit)
[![Crates.io](https://img.shields.io/crates/v/ssubmit.svg)](https://crates.io/crates/ssubmit)

Submit sbatch jobs without having to create a submission script

- [Motivation](#motivation)
- [Install](#install)
- [Usage](#usage)

## Motivation

This project is motivated by the fact that I just want to be able to submit commands as
jobs, and I don't want to fluff around with making a submission script.

`ssubmit` wraps that whole process and lets you live your best lyf #blessed.

## Install

**tl;dr**

```shell
curl -sSL install.ssubmit.mbh.sh | sh
# or with wget
wget -nv -O - install.ssubmit.mbh.sh | sh
```

You can pass options to the script like so:

```
$ curl -sSL install.ssubmit.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of ssubmit, if ssubmit is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer [default: apple-darwin]

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/ssubmit/releases]

        -h, --help
                Display this help message
```

### Cargo

```shell
$ cargo install ssubmit
```

### Build from source

```shell
$ git clone https://github.com/mbhall88/ssubmit.git
$ cd ssubmit
$ cargo build --release
$ target/release/ssubmit -h
```

## Usage

Submit an rsync job named "foo", requesting 350MB of memory and a one-week time limit:

```shell
$ ssubmit -m 350m -t 1w foo "rsync -az src/ dest/"
```

Submit a job that needs 8 CPUs:

```shell
$ ssubmit -m 16g -t 1d align "minimap2 -t 8 ref.fa query.fq > out.paf" -- -c 8
```

The basic anatomy of a `ssubmit` call is:

```
ssubmit [OPTIONS] <NAME> <COMMAND> [-- <REMAINDER>...]
```

`NAME` is the name of the job (the `--job-name` parameter in `sbatch`).

`COMMAND` is what you want to be executed by the job. It **must** be quoted (single or
double).

`REMAINDER` is any (optional) [`sbatch`-specific options](https://slurm.schedmd.com/sbatch.html#lbAG) you want to pass on. These
must follow a `--` after `COMMAND`.

### Memory

Memory (`-m,--mem`) is intended to be a little more user-friendly than the `sbatch
--mem` option. For example, you can pass `-m 0.5g` and `ssubmit` will interpret and
convert this as 500M. However, `-m 1.7G` will be rounded up to 2G. One place where this
option differs from `sbatch` is that if you don't give units, the value will be interpreted as
bytes - i.e., `-m 1000` will be converted to 1K. Units are case-insensitive.
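Combined with the dry-run flag described below, you can check these conversions without submitting anything. The comments show the conversions implied by the rules above; the exact rendering of the resulting `sbatch` flags is an assumption.

```shell
$ ssubmit -n -m 0.5g -t 0 demo "echo hi"   # --mem rendered as 500M
$ ssubmit -n -m 1.7g -t 0 demo "echo hi"   # rounded up to 2G
$ ssubmit -n -m 1000 -t 0 demo "echo hi"   # no units: bytes, converted to 1K
```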
### Time

As with memory, time (`-t,--time`) is intended to be simple. If you want a time limit of
three days, just pass `-t 3d`. Want two and a half hours? Then `-t 2h30m` works. If
you want to use your cluster's default time limit, just pass `-t 0`. You can
also pass the [time format `sbatch` uses](https://slurm.schedmd.com/sbatch.html#OPT_time) and it will be seamlessly passed on. For
a full list of supported time units, check out the
[`duration-str`](https://github.com/baoyachi/duration-str) repo.

### Dry run

You can see what `ssubmit` would do without actually submitting a job using dry run
(`-n,--dry-run`). This will print the `sbatch` command along with the submission script
that would have been provided.

```shell
$ ssubmit -n -m 4g -t 1d dry "rsync -az src/ dest/" -- -c 8
[2022-01-19T08:58:58Z INFO ssubmit] Dry run requested. Nothing submitted
sbatch -c 8