├── LICENSE ├── README.md ├── bam ├── best.md ├── modkit.md ├── perbase.md └── rustybam.md ├── csv ├── csview.md ├── madato.md ├── xsv.md └── xtab.md ├── dna ├── fakit.md ├── fq.md ├── ngs.md ├── rust-bio-tools.md └── skc.md ├── fastq ├── fasten.md ├── faster.md ├── fqgrep.md ├── fqkit.md ├── fqtk.md └── rasusa.md ├── longreads ├── NextPolish2.md ├── chopper.md └── longshot.md ├── metagenomics ├── coverm.md ├── skani.md └── sylph.md ├── pangenomics └── impg.md ├── phylogenomics └── segul.md ├── proteomics └── sage.md ├── rna └── rnapkin.md ├── singlecell ├── alevin-fry.md └── cellranger.md ├── slurm └── ssubmit.md └── vcf └── echtvar.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 size_t 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### rust in bioinformatics 2 | 3 | A collection of genomics software tools written in Rust 4 | 5 | 6 | 7 | ### index section 8 | 9 | ##### bam 10 | - [alignoth](https://github.com/alignoth/alignoth) : Creating alignment plots from bam files 11 | - [bamrescue](https://github.com/arkanosis/bamrescue) : Utility to check Binary Sequence Alignment / Map (BAM) files for corruption and repair them 12 | - [best](https://github.com/google/best) : Bam Error Stats Tool (best): analysis of error types in aligned reads 13 | - [modkit](https://github.com/nanoporetech/modkit) : A bioinformatics tool for working with modified bases 14 | - [mapAD](https://github.com/mpieva/mapAD) : An aDNA aware short-read mapper 15 | - [perbase](https://github.com/sstadick/perbase) : Per-base per-nucleotide depth analysis 16 | - [rustybam](https://github.com/mrvollger/rustybam) : bioinformatics toolkit in rust 17 | 18 | ##### csv 19 | 20 | - [csview](https://github.com/wfxr/csview) : 📠 Pretty and fast csv viewer for cli with cjk/emoji support 21 | - [csvlens](https://github.com/YS-L/csvlens) : csvlens is a command line CSV file viewer. It is like less but made for CSV. 
22 | - [madato](https://github.com/inosion/madato) : Markdown Cmd Line, Rust and JS library for Excel to Markdown Tables 23 | - [qsv](https://github.com/dathere/qsv) : Blazing-fast Data-Wrangling toolkit 24 | - [rsv](https://github.com/ribbondz/rsv) : A command-line tool written in Rust for analyzing CSV, TXT, and Excel files. 25 | - [tabiew](https://github.com/shshemi/tabiew) : A lightweight TUI app to view and query CSV files 26 | - [tv](https://github.com/alexhallam/tv) : 📺(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment. 27 | - [xan](https://github.com/medialab/xan) : The CSV magician 28 | - [xsv](https://github.com/BurntSushi/xsv) : A fast CSV command line toolkit written in Rust.   29 | - [xtab](https://github.com/sharkLoc/xtab) : CSV command line utilities 30 | 31 | ##### dna 32 | 33 | - [biotools](https://github.com/jimrybarski/biotools) : Command line bioinformatics functions 34 | - [darwin](https://github.com/Ebedthan/darwin) : Create (rapid) neighbor-joining tree from sequences using mash distance 35 | - [fakit](https://github.com/sharkLoc/fakit) : fakit: a simple program for fasta file manipulation 36 | - [filterx](https://github.com/dwpeng/filterx) : process any file in tabular format. Fasta/fastq/GTF/GFF/VCF/SAM/BED 37 | - [fq](https://github.com/stjude-rust-labs/fq) : Command line utility for manipulating Illumina-generated FASTQ files. 38 | - [gsearch](https://github.com/jean-pierreboth/gsearch) : Approximate nearest neighbour search for microbial genomes based on hash metric 39 | - [Hyper-Gen](https://github.com/wh-xu/Hyper-Gen) : HyGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors 40 | - [kanpig](https://github.com/ACEnglish/kanpig) : Kmer Analysis of Pileups for Genotyping 41 | - [kfc](https://github.com/lrobidou/KFC) : KFC (K-mer Fast Counter) is a fast and space-efficient k-mer counter based on hyper-k-mers. 42 | - [ngs](https://github.com/stjude-rust-labs/ngs) : Command line utility for working with next-generation sequencing files. 43 | - [nail](https://github.com/TravisWheelerLab/nail) : Nail is an Alignment Inference tooL 44 | - [palindrome-finder](https://github.com/brianli314/palindrome-finder) : A bioinformatics tool written in Rust to find palindromic sequences in DNA 45 | - [poasta](https://github.com/broadinstitute/poasta) : Fast and exact gap-affine partial order alignment 46 | - [psdm](https://github.com/mbhall88/psdm) : Compute a pairwise SNP distance matrix from one or two alignment(s) 47 | - [rust-bio-tools](https://github.com/rust-bio/rust-bio-tools) : A set of command line utilities based on Rust-Bio. 48 | - [ska](https://github.com/bacpop/ska.rust) : Split k-mer analysis – version 2 49 | - [skc](https://github.com/mbhall88/skc) : Shared k-mer content between two genomes 50 | - [sketchy](https://github.com/esteinig/sketchy) : Genomic neighbor typing of bacterial pathogens using MinHash 🐀 51 | - [tidk](https://github.com/tolkit/telomeric-identifier) : Identify and find telomeres, or telomeric repeats in a genome. 
52 | - [transanno](https://github.com/informationsea/transanno) : accurate LiftOver tool for new genome assemblies 
53 | - [xgt](https://github.com/Ebedthan/xgt) : Efficient and fast querying and parsing of GTDB's data 
54 | 
55 | ##### fastq 
56 | 
57 | - [deacon](https://github.com/bede/deacon) : Fast (host) DNA sequence filtering 
58 | - [fasten](https://github.com/lskatz/fasten) : 👷 Fasten toolkit, for streaming operations on fastq files 
59 | - [faster](https://github.com/angelovangel/faster) : A (very) fast program for getting statistics about a fastq file, the way I need them, written in Rust 
60 | - [fqgrep](https://github.com/fulcrumgenomics/fqgrep) : Grep for FASTQ files 
61 | - [fqkit](https://github.com/sharkLoc/fqkit) : 🦀 Fqkit: A simple and cross-platform program for fastq file manipulation 
62 | - [fqtk](https://github.com/fulcrumgenomics/fqtk) : Fast FASTQ sample demultiplexing in Rust. 
63 | - [grepq](https://github.com/rbfinch/grepq): quickly filter fastq files by matching sequences to a set of regex patterns 
64 | - [guide-counter](https://github.com/fulcrumgenomics/guide-counter) : A better, faster way to count guides in CRISPR screens. 
65 | - [K2Rmini](https://github.com/Malfoy/K2Rmini) : K2Rmini (or K-mer to Reads mini) is a tool to filter the reads contained in a FASTA/Q file based on a set of k-mers of interest. 
66 | - [kractor](https://github.com/sam-sims/kractor) : Rapidly extract reads from a FASTQ file based on taxonomic classification via Kraken2. 
67 | - [rasusa](https://github.com/mbhall88/rasusa) : Randomly subsample sequencing reads 
68 | - [SeqSizzle](https://github.com/ChangqingW/SeqSizzle) : SeqSizzle is a pager for viewing FASTQ files with fuzzy matching, allowing different adaptors to be colored differently. 
69 | - [sabreur](https://github.com/ebedthan/sabreur) : fast, reliable and handy demultiplexing tool for fastx files 
70 | 
71 | ##### format 
72 | - [atlas](https://github.com/stjude-rust-labs/atlas) : Enables storing, querying, transforming, and visualizing of multidimensional count data. 
73 | - [bigtools](https://github.com/jackh726/bigtools) : A high-performance BigWig and BigBed library in Rust 
74 | - [biotest](https://github.com/natir/biotest) : Generate random test data for bioinformatics 
75 | - [bqtools](https://github.com/arcinstitute/bqtools) : A command line utility for working with BINSEQ files 
76 | - [cigzip](https://github.com/AndreaGuarracino/cigzip) : A tool for compression and decompression of alignment CIGAR strings using tracepoints. 
77 | - [d4tools](https://github.com/38/d4-format) : The D4 Quantitative Data Format 
78 | - [gfa2bin](https://github.com/MoinSebi/gfa2bin) : Convert various graph-related data to PLINK files. In addition, we offer multiple commands for filtering or modifying the generated PLINK files. 
79 | - [gia](https://github.com/noamteyssier/gia) : gia: Genomic Interval Arithmetic 
80 | - [granges](https://github.com/vsbuffalo/granges) : A Rust library and command line tool for working with genomic ranges and their data. 
81 | - [intspan](https://github.com/wang-q/intspan) : Command line tools for IntSpan related bioinformatics operations 
82 | - [nuc2bit](https://github.com/natir/nuc2bit) : A rust crate that provides methods for rapidly encoding and decoding nucleotides in 2-bit representation. 
83 | - [recmap](https://github.com/vsbuffalo/recmap) : A command line tool and Rust library for working with recombination maps.
84 | - [transanno](https://github.com/informationsea/transanno) : accurate LiftOver tool for new genome assemblies 
85 | - [thirdkind](https://github.com/simonpenel/thirdkind) : Drawing reconciled phylogenetic trees allowing 1, 2 or 3 reconciliation levels 
86 | - [xsra](https://github.com/ArcInstitute/xsra) : An efficient CLI to extract sequences from the SRA 
87 | 
88 | 
89 | ##### gff3 
90 | 
91 | - [atg](https://github.com/anergictcell/atg) : A Rust library and CLI tool to handle genomic transcripts 
92 | - [gffkit](https://github.com/sharkloc/gffkit) : a simple program for gff3 file manipulation 
93 | 
94 | ##### longreads 
95 | 
96 | - [Autocycler](https://github.com/rrwick/Autocycler) : A tool for generating consensus long-read assemblies for bacterial genomes 
97 | - [chopper](https://github.com/wdecoster/chopper) : Rust implementation of [NanoFilt](https://github.com/wdecoster/nanofilt)+[NanoLyse](https://github.com/wdecoster/nanolyse), both originally written in Python. This tool, intended for long read sequencing such as PacBio or ONT, filters and trims a fastq file. 
98 | - [DeepChopper](https://github.com/ylab-hi/DeepChopper) : Language models identify chimeric artificial reads in NanoPore direct-RNA sequencing data. 
99 | - [fpa](https://github.com/natir/fpa) : Filter of Pairwise Alignment 
100 | - [herro](https://github.com/lbcb-sci/herro) : HERRO is a highly-accurate, haplotype-aware, deep-learning tool for error correction of Nanopore R10.4.1 or R9.4.1 reads (read length of >= 10 kbp is recommended). 
101 | - [HiPhase](https://github.com/PacificBiosciences/HiPhase) : Small variant, structural variant, and short tandem repeat phasing tool for PacBio HiFi reads 
102 | - [longshot](https://github.com/pjedge/longshot) : diploid SNV caller for error-prone reads 
103 | - [lrge](https://github.com/mbhall88/lrge) : Genome size estimation from long read overlaps 
104 | - [myloasm](https://github.com/bluenote-1577/myloasm) : A new high-resolution long-read metagenome assembler for even noisy reads 
105 | - [Polypolish](https://github.com/rrwick/Polypolish) : a short-read polishing tool for long-read assemblies 
106 | - [nextpolish2](https://github.com/Nextomics/NextPolish2) : Repeat-aware polishing of genomes assembled using HiFi long reads 
107 | - [nanoq](https://github.com/esteinig/nanoq) : Minimal but speedy quality control for nanopore reads in Rust 🐻 
108 | - [smrest](https://github.com/jts/smrest) : Tumour-only somatic mutation calling using long reads 
109 | - [trgt](https://github.com/PacificBiosciences/trgt) : Tandem repeat genotyping and visualization from PacBio HiFi data 
110 | - [yacrd](https://github.com/natir/yacrd) : Yet Another Chimeric Read Detector 
111 | 
112 | 
113 | 
114 | ##### metagenomics 
115 | 
116 | - [coverm](https://github.com/wwood/CoverM) : Read coverage calculator for metagenomics 
117 | - [galah](https://github.com/wwood/galah) : More scalable dereplication for metagenome assembled genomes 
118 | - [hyperex](https://github.com/Ebedthan/hyperex) : Hypervariable region primer-based extractor for 16S rRNA and other SSU/LSU sequences.
119 | - [kun_peng](https://github.com/eric9n/Kun-peng) : Kun-peng: an ultra-fast, low-memory footprint and accurate taxonomy classifier for all 
120 | - [kmertools](https://github.com/anuradhawick/kmertools) : kmer based feature extraction tool for bioinformatics, metagenomics, AI/ML and more 
121 | - [kmerutils](https://github.com/jean-pierreBoth/kmerutils) : Kmer generating, counting, hashing and related 
122 | - [Lorikeet](https://github.com/rhysnewell/Lorikeet) : Strain resolver for metagenomics 
123 | - [nohuman](https://github.com/mbhall88/nohuman) : Remove human reads from a sequencing run 
124 | - [rosella](https://github.com/rhysnewell/rosella) : Metagenomic Binning Algorithm 
125 | - [skani](https://github.com/bluenote-1577/skani) : Fast, robust ANI and aligned fraction for (metagenomic) genomes and contigs. 
126 | - [sourmash](https://github.com/sourmash-bio/sourmash) : Quickly search, compare, and analyze genomic and metagenomic data sets. 
127 | - [sylph](https://github.com/bluenote-1577/sylph) : ultrafast genome querying and taxonomic profiling for metagenomic samples by abundance-corrected minhash. 
128 | - [vircov](https://github.com/esteinig/vircov) : Viral genome coverage evaluation for metagenomic diagnostics 🩸 
129 | 
130 | ##### pangenomics 
131 | 
132 | - [impg](https://github.com/pangenome/impg) : implicit pangenome graph 
133 | - [panacus](https://github.com/marschall-lab/panacus) : Panacus is a tool for computing statistics for GFA-formatted pangenome graphs 
134 | 
135 | ##### phylogenomics 
136 | 
137 | - [nextclade](https://github.com/nextstrain/nextclade) : Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement 
138 | - [nwr](https://github.com/wang-q/nwr) : nwr is a command line tool for working with NCBI taxonomy, Newick files and assembly reports 
139 | - [unicore](https://github.com/steineggerlab/unicore) : Universal and efficient core gene phylogeny with Foldseek and ProstT5 
140 | - [segul](https://github.com/hhandika/segul) : An ultrafast and memory efficient tool for phylogenomics 
141 | 
142 | ##### proteomics 
143 | 
144 | - [align-cli](https://github.com/snijderlab/align-cli) : A CLI for pairwise alignment of sequences, using both normal and mass based alignment. 
145 | - [daedalus](https://github.com/David-OConnor/daedalus) : Protein and molecule viewer 
146 | - [folddisco](https://github.com/steineggerlab/folddisco) : Fast indexing and search of discontinuous motifs in protein structures 
147 | - [foldmason](https://github.com/steineggerlab/foldmason) : Foldmason builds multiple alignments of large structure sets. 
148 | - [sage](https://github.com/lazear/sage) : Proteomics search & quantification so fast that it feels like magic 
149 | 
150 | ##### rna 
151 | - [oarfish](https://github.com/COMBINE-lab/oarfish) : long read RNA-seq quantification 
152 | - [rnapkin](https://github.com/ukmrs/rnapkin) : drawing RNA secondary structure with style; instantly 
153 | - [R2Dtool](https://github.com/comprna/R2Dtool) : R2Dtool is a set of genomics utilities for handling, integrating, and visualising isoform-mapped RNA feature data.
154 | - [squab](https://github.com/zaeleus/squab) : Alignment-based gene expression quantification 155 | 156 | ##### singlecell 157 | 158 | - [adview](https://github.com/JianYang-Lab/adview) : Adata Viewer: Head/Less/Shape h5ad file in terminal 159 | - [alevin-fry](https://github.com/COMBINE-lab/alevin-fry) : 🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding. 160 | - [cellranger](https://github.com/10XGenomics/cellranger) : 10x Genomics Single Cell Analysis 161 | - [precellar](https://github.com/regulatory-genomics/precellar) : Single-cell genomics preprocessing package 162 | - [proseg](https://github.com/dcjones/proseg) : Probabilistic cell segmentation for in situ spatial transcriptomics 163 | - [SnapATAC2](https://github.com/kaizhang/SnapATAC2) : Single-cell epigenomics analysis tools 164 | 165 | ##### slurm 166 | 167 | - [ssubmit](https://github.com/mbhall88/ssubmit) : Submit slurm sbatch jobs without the need to create a script 168 | 169 | ##### vcf 170 | 171 | - [echtvar](https://github.com/brentp/echtvar) : using all the bits for echt rapid variant annotation and filtering 172 | - [gvcf_norm](https://github.com/mlin/gvcf_norm) : gVCF allele normalizer 173 | - [mehari](https://github.com/varfish-org/mehari): VEP-like tool for sequence ontology and HGVS annotation of VCF files 174 | - [vcf2parquet](https://github.com/natir/vcf2parquet) : Convert vcf in parquet 175 | - [vcfexpress](https://github.com/brentp/vcfexpress) : expressions on VCFs 176 | 177 | 178 | ##### Gui 179 | 180 | - [plascad](https://github.com/David-OConnor/plascad) : Design software for plasmid (vector) and primer creation and validation. Edit plasmids, perform PCR-based cloning, digest and ligate DNA fragments, and display details about expressed proteins. Integrates with online resources like NCBI and PDB. 181 | 182 | ##### other 183 | - [biobear](https://github.com/wheretrue/biobear) : Work with bioinformatic files using Arrow, Polars, and/or DuckDB 184 | - [binseq](https://github.com/ArcInstitute/binseq) : A high efficiency binary format for sequencing data 185 | - [ggetrs](https://github.com/noamteyssier/ggetrs) : Efficient querying of biological databases 186 | - [htsget-rs](https://github.com/umccr/htsget-rs) : A server implementation of the htsget protocol for bioinformatics in Rust 187 | - [ibu](https://github.com/noamteyssier/ibu) : a rust library for high throughput binary encoding of genomic sequences 188 | - [scidataflow](https://github.com/vsbuffalo/scidataflow): Command line scientific data management tool 189 | - [sufr](https://github.com/TravisWheelerLab/sufr) : Parallel Construction of Suffix Arrays in Rust 190 | 191 | 192 | 193 | 194 | ## Starchart 195 | Stargazers over time 196 | -------------------------------------------------------------------------------- /bam/best.md: -------------------------------------------------------------------------------- 1 | # best 2 | 3 | Bam Error Stats Tool (best): analysis of error types in aligned reads. 4 | 5 | `best` is used to assess the quality of reads after aligning them to a 6 | reference assembly. 
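A typical invocation (the same command documented under Installing below) takes an aligned BAM, the reference it was aligned to, and an output prefix; the file names here are placeholders:

```
# writes stats files using the given prefix
best aligned_reads.bam reference.fasta prefix/path
```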
7 | 8 | ## Features 9 | 10 | * Collect overall and per alignment stats 11 | * Distribution of indel lengths 12 | * Yield at different empirical Q-value thresholds 13 | * Bin per read stats to easily examine the distribution of errors for certain 14 | types of reads 15 | * Stats for regions specified by intervals (BED file, homopolymer regions, 16 | windows etc.) 17 | * Stats for quality scores vs empirical Q-values 18 | * Multithreading for speed 19 | 20 | ## Usage 21 | 22 | The [`best` Usage Guide](Usage.md) gives an overview of how to use `best`. 23 | 24 | ## Installing 25 | 26 | 1. Install [Rust](https://www.rust-lang.org/tools/install). 27 | 2. Clone this repository and navigate into the directory of this repository. 28 | 3. Run `cargo install --locked --path .` 29 | 4. Run `best input.bam reference.fasta prefix/path` 30 | 31 | This will generate stats files with the `prefix/path` prefix. 32 | 33 | ## Development 34 | 35 | ### Running 36 | 37 | 1. Install [Rust](https://www.rust-lang.org/tools/install). 38 | 2. Clone this repository and navigate into the directory of this repository. 39 | 3. Run `cargo build --release` 40 | 4. Run `cargo run --release -- input.bam reference.fasta prefix/path` or 41 | `target/release/best input.bam reference.fasta prefix/path` 42 | 43 | This will generate stats files with the `prefix/path` prefix. 44 | 45 | The built binary is located at `target/release/best`. 46 | 47 | ### Formatting 48 | 49 | ``` 50 | cargo fmt 51 | ``` 52 | 53 | ### Comparing 54 | 55 | Remember to pass the `-t 1` option to ensure that only one thread is used for 56 | testing. Best generally tries to ensure the order of outputs is deterministic 57 | with multiple threads, but the order of per-alignment stats is arbitrary unless 58 | only one thread is used. 59 | 60 | ### Disclaimer 61 | 62 | This is not an official Google product. 63 | 64 | The code is not intended for use in any clinical settings. It is not intended to be a medical device and is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis. 65 | 66 | No representations or warranties are made with regards to the accuracy of results generated. User or licensee is responsible for verifying and validating accuracy when using this tool. 67 | -------------------------------------------------------------------------------- /bam/modkit.md: -------------------------------------------------------------------------------- 1 | ![Oxford Nanopore Technologies logo](https://github.com/nanoporetech/modkit/blob/master/ONT_logo_590x106.png?raw=true) 2 | 3 | # Modkit 4 | 5 | A bioinformatics tool for working with modified bases from Oxford Nanopore. Specifically for converting modBAM 6 | to bedMethyl files using best practices, but also manipulating modBAM files and generating summary statistics. 7 | Detailed documentation and quick-start can be found in the [online documentation](https://nanoporetech.github.io/modkit/). 8 | 9 | ## Installation 10 | 11 | Pre-compiled binaries are provided for Linux from the [release page](https://github.com/nanoporetech/modkit/releases). We recommend the use of these in most circumstances. 12 | 13 | ### Building from source 14 | 15 | The provided packages should be used where possible. We understand that some users may wish to compile the software from its source code. To build `modkit` from source [cargo](https://www.rust-lang.org/learn/get-started) should be used. 
16 | 
17 | ```bash 
18 | git clone https://github.com/nanoporetech/modkit.git 
19 | cd modkit 
20 | cargo install --path . 
21 | # or 
22 | cargo install --git https://github.com/nanoporetech/modkit.git 
23 | ``` 
24 | 
25 | ## Usage 
26 | 
27 | Modkit comprises a suite of tools for manipulating modified-base data stored in [BAM](http://www.htslib.org/) files. Modified base information is stored in the `MM` and `ML` tags (see section 1.7 of the [SAM tags](https://samtools.github.io/hts-specs/SAMtags.pdf) specification). These tags are produced by contemporary basecallers of data from Oxford Nanopore Technologies sequencing platforms. 
28 | 
29 | ### Constructing bedMethyl tables 
30 | 
31 | A primary use of `modkit` is to create summary counts of modified and unmodified bases in an extended [bedMethyl](https://www.encodeproject.org/data-standards/wgbs/) format. bedMethyl files tabulate the counts of base modifications from every sequencing read over each reference genomic position. 
32 | 
33 | In its simplest form `modkit` creates a bedMethyl file using the following: 
34 | 
35 | ```bash 
36 | modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log 
37 | ``` 
38 | 
39 | No reference sequence is required. A single file (described [below](#description-of-bedmethyl-output)) with base count summaries will be created. The final argument here specifies an optional log file output. 
40 | 
41 | The program performs best-practices filtering and manipulation of the raw data stored in the input file. For further details see [filtering modified-base calls](./book/src/filtering.md). 
42 | 
43 | For user convenience the counting process can be modulated using several additional transforms and filters. The most basic of these is to report only counts from reference CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs in the reference: 
44 | 
45 | ```bash 
46 | modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta 
47 | ``` 
48 | 
49 | The program also contains a range of presets which combine several options for ease of use. The `traditional` preset, 
50 | 
51 | ```bash 
52 | modkit pileup path/to/reads.bam output/path/pileup.bed \ 
53 | --ref path/to/reference.fasta \ 
54 | --preset traditional 
55 | ``` 
56 | 
57 | performs three transforms: 
58 | 
59 | * restricts output to locations where there is a CG dinucleotide in 
60 | the reference, 
61 | * reports only C and 5mC counts, using procedures to take into account counts of other forms of cytosine modification (notably 5hmC), and 
62 | * aggregates data across strands. The strand field of the output will be marked as '.' indicating that the strand information has been lost. 
63 | 
64 | Using this preset is equivalent to running with the options: 
65 | 
66 | ```bash 
67 | modkit pileup --cpg --ref <reference.fasta> --ignore h --combine-strands 
68 | ``` 
69 | 
70 | For more information on the individual options see the [Advanced Usage](./book/src/advanced_usage.md) help document. 
71 | 
72 | ## Description of bedMethyl output 
73 | 
74 | Below is a description of the bedMethyl columns generated by `modkit pileup`. A brief description of the 
75 | bedMethyl specification can be found on [Encode](https://www.encodeproject.org/data-standards/wgbs/). 
76 | 
77 | ### Definitions: 
78 | 
79 | * N``mod`` - Number of calls passing filters that were classified as a residue with a specified base modification. 
80 | * N``canonical`` - Number of calls passing filters that were classified as the canonical base rather than modified.
The 
81 | exact base must be inferred by the modification code. For example, if the modification code is `m` (5mC) then 
82 | the canonical base is cytosine. If the modification code is `a`, the canonical base is adenine. 
83 | * N``other_mod`` - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with 
84 | `h` calls, 1 with a canonical call, and 2 with `m` calls. In the bedMethyl row for `h` N``other_mod`` would be 2. In the 
85 | `m` row N``other_mod`` would be 3. 
86 | * N``valid_cov`` - the valid coverage. N``valid_cov`` = N``mod`` + N``other_mod`` + N``canonical``; also used as the `score` in the bedMethyl 
87 | * N``diff`` - Number of reads with a base other than the canonical base for this modification. For example, in a row 
88 | for `h` the canonical base is cytosine; if there are 2 reads with C->A substitutions, N``diff`` will be 2. 
89 | * N``delete`` - Number of reads with a deletion at this reference position 
90 | * N``fail`` - Number of calls where the probability of the call was below the threshold. The threshold can be 
91 | set on the command line or computed from the data (usually failing the lowest 10th percentile of calls). 
92 | * N``nocall`` - Number of reads aligned to this reference position, with the correct canonical base, but without a base 
93 | modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a 
94 | CG->CH substitution such that no modification call was produced by the basecaller. 
95 | 
96 | ### bedMethyl column descriptions 
97 | 
98 | | column | name | description | type | 
99 | | ------ | ----------------------------- | ------------------------------------------------------------------------------ | ----- | 
100 | | 1 | chrom | name of reference sequence from BAM header | str | 
101 | | 2 | start position | 0-based start position | int | 
102 | | 3 | end position | 0-based exclusive end position | int | 
103 | | 4 | modified base code | single letter code for modified base | str | 
104 | | 5 | score | Equal to N``valid_cov``. | int | 
105 | | 6 | strand | '+' for positive strand '-' for negative strand, '.' when strands are combined | str | 
106 | | 7 | start position | included for compatibility | int | 
107 | | 8 | end position | included for compatibility | int | 
108 | | 9 | color | included for compatibility, always 255,0,0 | str | 
109 | | 10 | N``valid_cov`` | See definitions above. | int | 
110 | | 11 | fraction modified | N``mod`` / N``valid_cov`` | float | 
111 | | 12 | N``mod`` | See definitions above. | int | 
112 | | 13 | N``canonical`` | See definitions above. | int | 
113 | | 14 | N``other_mod`` | See definitions above. | int | 
114 | | 15 | N``delete`` | See definitions above. | int | 
115 | | 16 | N``fail`` | See definitions above. | int | 
116 | | 17 | N``diff`` | See definitions above. | int | 
117 | | 18 | N``nocall`` | See definitions above. | int | 
118 | 
119 | ## Description of columns in `modkit summary`: 
120 | 
121 | ### Totals table 
122 | 
123 | The lines of the totals table are prefixed with a `#` character. 
124 | 
125 | | row | name | description | type | 
126 | | --- | ----------------------- | ----------------------------------------------------------------------- | ----- | 
127 | | 1 | bases | comma-separated list of canonical bases with modification calls.
| str | 
128 | | 2 | total_reads_used | total number of reads from which base modification calls were extracted | int | 
129 | | 3+ | count_reads_{base} | total number of reads that contained base modifications for {base} | int | 
130 | | 4+ | filter_threshold_{base} | filter threshold used for {base} | float | 
131 | 
132 | ### Modification calls table 
133 | 
134 | The modification calls table follows immediately after the totals table. 
135 | 
136 | | column | name | description | type | 
137 | | ------ | ---------- | ---------------------------------------------------------------------------------------- | ----- | 
138 | | 1 | base | canonical base with modification call | char | 
139 | | 2 | code | base modification code, or `-` for canonical | char | 
140 | | 3 | pass_count | total number of passing (confidence >= threshold) calls for the modification in column 2 | int | 
141 | | 4 | pass_frac | fraction of passing (>= threshold) calls for the modification in column 2 | float | 
142 | | 5 | all_count | total number of calls for the modification code in column 2 | int | 
143 | | 6 | all_frac | fraction of all calls for the modification in column 2 | float | 
144 | 
145 | ## Advanced usage examples 
146 | 
147 | For complete usage instructions please see the command-line help of the program or the [Advanced usage](./book/src/advanced_usage.md) help documentation. Some more commonly required examples are provided below. 
148 | 
149 | To combine multiple base modification calls into one, for example to combine basecalls for both 5hmC and 5mC into a count for "all cytosine modifications" (with code `C`), the `--combine-mods` option can be used: 
150 | 
151 | ```bash 
152 | modkit pileup path/to/reads.bam output/path/pileup.bed --combine-mods 
153 | ``` 
154 | 
155 | In standard usage the `--preset traditional` option can be used as outlined in the [Usage](#usage) section. By more directly specifying individual options we can perform something similar without loss of information for 5hmC data stored in the input file: 
156 | 
157 | ```bash 
158 | modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta \ 
159 | --combine-strands 
160 | ``` 
161 | 
162 | To produce a bedGraph file for each modification in the BAM file the `--bedgraph` option can be given. Counts for the positive and negative strands will be put in separate files. 
163 | 
164 | ```bash 
165 | modkit pileup path/to/reads.bam output/directory/path --bedgraph <--prefix string> 
166 | ``` 
167 | 
168 | The `--prefix [str]` option allows a prefix to be specified for the output file names. 
169 | 
170 | **Licence and Copyright** 
171 | 
172 | (c) 2023 Oxford Nanopore Technologies Plc. 
173 | 
174 | Modkit is distributed under the terms of the Oxford Nanopore Technologies, Ltd. Public License, v. 1.0.
175 | If a copy of the License was not distributed with this file, You can obtain one at http://nanoporetech.com 
-------------------------------------------------------------------------------- /bam/rustybam.md: -------------------------------------------------------------------------------- 
1 | # rustybam 
2 | 
3 | [![Actions Status](https://github.com/mrvollger/rustybam/workflows/Test%20and%20Build/badge.svg)](https://github.com/mrvollger/rustybam/actions) 
4 | [![Actions Status](https://github.com/mrvollger/rustybam/workflows/Formatting/badge.svg)](https://github.com/mrvollger/rustybam/actions) 
5 | [![Actions Status](https://github.com/mrvollger/rustybam/workflows/Clippy/badge.svg)](https://github.com/mrvollger/rustybam/actions) 
6 | 
7 | [![Conda (channel only)](https://img.shields.io/conda/vn/bioconda/rustybam?color=green)](https://anaconda.org/bioconda/rustybam) 
8 | [![Downloads](https://img.shields.io/conda/dn/bioconda/rustybam?color=green)](https://anaconda.org/bioconda/rustybam) 
9 | 
10 | [![crates.io version](https://img.shields.io/crates/v/rustybam)](https://crates.io/crates/rustybam) 
11 | [![crates.io downloads](https://img.shields.io/crates/d/rustybam?color=orange&label=downloads)](https://crates.io/crates/rustybam) 
12 | 
13 | [![DOI](https://zenodo.org/badge/351639424.svg)](https://zenodo.org/badge/latestdoi/351639424) 
14 | 
15 | `rustybam` is a bioinformatics toolkit written in the `rust` programming language focused on manipulation of alignment (`bam` and `PAF`), annotation (`bed`), and sequence (`fasta` and `fastq`) files. If your alignment is in a different format, check out whether [wgatools](https://github.com/wjwei-handsome/wgatools) can convert it for you! 
16 | 
17 | ## What can rustybam do? 
18 | 
19 | Here is a commented example that highlights some of the better features of `rustybam`, and demonstrates how each result can be read directly into another subcommand. 
20 | 
21 | ```bash 
22 | rb trim-paf .test/asm_small.paf `#trims back alignments that align the same query sequence more than once` \ 
23 | | rb break-paf --max-size 100 `#breaks the alignment into smaller pieces on indels of 100 bases or more` \ 
24 | | rb orient `#orients each contig so that the majority of bases are forward aligned` \ 
25 | | rb liftover --bed <(printf "chr22\t12000000\t13000000\n") `#subsets and trims the alignment to 1 Mbp of chr22.` \ 
26 | | rb filter --paired-len 10000 `#filters for query sequences that have at least 10,000 bases aligned to a target across all alignments.` \ 
27 | | rb stats --paf `#calculates statistics from the trimmed paf file` \ 
28 | | less -S 
29 | ``` 
30 | 
31 | ## Usage 
32 | 
33 | ```shell 
34 | rustybam [OPTIONS] <SUBCOMMAND> 
35 | ``` 
36 | 
37 | or 
38 | 
39 | ```shell 
40 | rb [OPTIONS] <SUBCOMMAND> 
41 | ``` 
42 | 
43 | ### Subcommands 
44 | 
45 | The full manual of subcommands can be found on the [docs](https://docs.rs/rustybam/latest/rustybam/cli/enum.Commands.html).
46 | 
47 | ```shell 
48 | SUBCOMMANDS: 
49 | stats Get percent identity stats from a sam/bam/cram or PAF 
50 | bed-length Count the number of bases in a bed file [aliases: bedlen, bl, bedlength] 
51 | filter Filter PAF records in various ways 
52 | invert Invert the target and query sequences in a PAF along with the CIGAR string 
53 | liftover Liftover target sequence coordinates onto query sequence using a PAF 
54 | trim-paf Trim paf records that overlap in query sequence [aliases: trim, tp] 
55 | orient Orient paf records so that most of the bases are in the forward direction 
56 | break-paf Break PAF records with large indels into multiple records (useful for 
57 | SafFire) [aliases: breakpaf, bp] 
58 | paf-to-sam Convert a PAF file into a SAM file. Warning, all alignments will be marked as 
59 | primary! [aliases: paftosam, p2s, paf2sam] 
60 | fasta-split Reads in a fasta from stdin and divides into files (can compress by adding 
61 | .gz) [aliases: fastasplit, fasplit] 
62 | fastq-split Reads in a fastq from stdin and divides into files (can compress by adding 
63 | .gz) [aliases: fastqsplit, fqsplit] 
64 | get-fasta Mimic bedtools getfasta but allow for bgzip in both bed and fasta inputs 
65 | [aliases: getfasta, gf] 
66 | nucfreq Get the frequencies of each bp at each position 
67 | repeat Report the longest exact repeat length at every position in a fasta 
68 | suns Extract the intervals in a genome (fasta) that are made up of SUNs 
69 | help Print this message or the help of the given subcommand(s) 
70 | ``` 
71 | 
72 | ## Install 
73 | 
74 | ### conda 
75 | 
76 | ```shell 
77 | mamba install -c bioconda rustybam 
78 | ``` 
79 | 
80 | ### cargo 
81 | 
82 | ```shell 
83 | cargo install rustybam 
84 | ``` 
85 | 
86 | ### Pre-compiled binaries 
87 | 
88 | Download from [releases](https://github.com/mrvollger/rustybam/releases) (may be slower than locally compiled versions). 
89 | 
90 | ### Source 
91 | 
92 | ```shell 
93 | git clone https://github.com/mrvollger/rustybam.git 
94 | cd rustybam 
95 | cargo build --release 
96 | ``` 
97 | 
98 | and the executables will be built here: 
99 | 
100 | ```shell 
101 | target/release/{rustybam,rb} 
102 | ``` 
103 | 
104 | ## Examples 
105 | 
106 | ### PAF or BAM statistics 
107 | 
108 | For BAM files with extended cigar operations we can calculate statistics about the alignment and report them in BED format. 
109 | 
110 | ```shell 
111 | rustybam stats {input.bam} > {stats.bed} 
112 | ``` 
113 | 
114 | The same can be done with PAF files as long as they are generated with `-c --eqx`. 
115 | 
116 | ```shell 
117 | rustybam stats --paf {input.paf} > {stats.bed} 
118 | ``` 
119 | 
120 | ### PAF liftovers 
121 | 
122 | > I have a `PAF` and I want to subset it for just a particular region in the reference. 
123 | 
124 | With `rustybam` it's easy: 
125 | 
126 | ```shell 
127 | rustybam liftover \ 
128 | --bed <(printf "chr1\t0\t250000000\n") \ 
129 | input.paf > trimmed.paf 
130 | ``` 
131 | 
132 | > But I also want the alignment statistics for the region. 
133 | 
134 | No problem, `rustybam liftover` does not just trim the coordinates but also the CIGAR 
135 | so it is ready for `rustybam stats`: 
136 | 
137 | ```shell 
138 | rustybam liftover \ 
139 | --bed <(printf "chr1\t0\t250000000\n") \ 
140 | input.paf \ 
141 | | rustybam stats --paf \ 
142 | > trimmed.stats.bed 
143 | ``` 
144 | 
145 | > Okay, but Evan asked for an "align slider" so I need to realign in chunks. 
146 | 
147 | No need, just make your `bed` query to `rustybam liftover` a set of sliding windows 
148 | and it will do the rest.
149 | 
150 | ```shell 
151 | rustybam liftover \ 
152 | --bed <(bedtools makewindows -w 100000 \ 
153 | <(printf "chr1\t0\t250000000\n") \ 
154 | ) \ 
155 | input.paf \ 
156 | | rustybam stats --paf \ 
157 | > trimmed.stats.bed 
158 | ``` 
159 | 
160 | You can also use `rustybam breakpaf` to break up paf records at indels above a certain size to 
161 | get more "miropeats"-like intervals. 
162 | 
163 | ```shell 
164 | rustybam breakpaf --max-size 1000 input.paf \ 
165 | | rustybam liftover \ 
166 | --bed <(printf "chr1\t0\t250000000\n") \ 
167 | | rustybam stats --paf \ 
168 | > trimmed.stats.bed 
169 | ``` 
170 | 
171 | > Yeah but how do I visualize the data? 
172 | 
173 | Try out 
174 | [SafFire](https://mrvollger.github.io/SafFire/)! 
175 | 
176 | ### Align once 
177 | 
178 | At the boundaries of CNVs and inversions minimap2 may align the same section of query sequence to multiple stretches of the target sequence. This utility uses the CIGAR strings of PAF alignments (which must be generated with `--eqx`) to determine an optimal split of the alignments such that no query base is aligned more than once. To do this the whole PAF file is loaded in memory and then overlaps are removed starting with the largest overlapping interval and iterating. 
179 | 
180 | ```bash 
181 | rb trim-paf {input.paf} > {trimmed.paf} 
182 | ``` 
183 | 
184 | Here is an example from the NOTCH2NL region comparing CHM1 against CHM13 before trimming: 
185 | ![](images/no-trim.svg) 
186 | 
187 | and after trimming 
188 | ![](images/trim.svg) 
189 | 
190 | ### Split fastx files 
191 | 
192 | Split a fasta file between `stdout` and two other files, one compressed and one uncompressed. 
193 | 
194 | ```shell 
195 | cat {input.fasta} | rustybam fasta-split two.fa.gz three.fa 
196 | ``` 
197 | 
198 | Split a fastq file between `stdout` and two other files, one compressed and one uncompressed. 
199 | 
200 | ```shell 
201 | cat {input.fastq} | rustybam fastq-split two.fq.gz three.fq 
202 | ``` 
203 | 
204 | ### Extract from a fasta 
205 | 
206 | This tool is designed to mimic `bedtools getfasta`, but it allows the fasta to be `bgzipped`. 
207 | 
208 | ```shell 
209 | samtools faidx {seq.fa(.gz)} 
210 | rb get-fasta --name --strand --bed {regions.of.interest.bed} --fasta {seq.fa(.gz)} 
211 | ``` 
212 | 
213 | ## TODO 
214 | 
215 | - [x] Add a `bedtools getfasta` like operation that actually works with bgzipped input. 
216 | - [ ] implement bed12/split 
217 | - [ ] Allow sam or paf for operations: 
218 | - [x] make a sam header from a PAF file 
219 | - [x] convert sam record to paf record 
220 | - [x] convert paf record to sam record 
221 | - [ ] make tools seamlessly work with sam and paf 
222 | - [ ] Add `D4` for Nucfreq. 
223 | - [ ] Finish implementing `suns`. 
224 | - [ ] Allow multiple input files in `bed-length` 
225 | - [ ] Start keeping a changelog 
-------------------------------------------------------------------------------- /csv/csview.md: --------------------------------------------------------------------------------
1 | # 📠 csview 
2 | 
3 | A high performance csv viewer with cjk/emoji support. 
4 | 
5 | Badges: CICD | License | Version | Platform 
6 | 
21 | ### Features 
22 | 
23 | * Small and *fast* (see [benchmarks](#benchmark) below). 
24 | * Memory efficient. 
25 | * Correctly align [CJK](https://en.wikipedia.org/wiki/CJK_characters) and emoji characters. 
26 | * Support `tsv` and custom delimiters. 
27 | * Support different styles, including markdown table. 
28 | 
29 | ### Usage 
30 | ``` 
31 | $ cat example.csv 
32 | Year,Make,Model,Description,Price 
33 | 1997,Ford,E350,"ac, abs, moon",3000.00 
34 | 1999,Chevy,"Venture ""Extended Edition""","",4900.00 
35 | 1999,Chevy,"Venture ""Extended Edition, Large""",,5000.00 
36 | 1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof",4799.00 
37 | 
38 | $ csview example.csv 
39 | ┌──────┬───────┬───────────────────────────────────┬───────────────────────────┬─────────┐ 
40 | │ Year │ Make │ Model │ Description │ Price │ 
41 | ├──────┼───────┼───────────────────────────────────┼───────────────────────────┼─────────┤ 
42 | │ 1997 │ Ford │ E350 │ ac, abs, moon │ 3000.00 │ 
43 | │ 1999 │ Chevy │ Venture "Extended Edition" │ │ 4900.00 │ 
44 | │ 1999 │ Chevy │ Venture "Extended Edition, Large" │ │ 5000.00 │ 
45 | │ 1996 │ Jeep │ Grand Cherokee │ MUST SELL! air, moon roof │ 4799.00 │ 
46 | └──────┴───────┴───────────────────────────────────┴───────────────────────────┴─────────┘ 
47 | 
48 | $ head /etc/passwd | csview -H -d: 
49 | ┌────────────────────────┬───┬───────┬───────┬────────────────────────────┬─────────────────┐ 
50 | │ root │ x │ 0 │ 0 │ │ /root │ 
51 | │ bin │ x │ 1 │ 1 │ │ / │ 
52 | │ daemon │ x │ 2 │ 2 │ │ / │ 
53 | │ mail │ x │ 8 │ 12 │ │ /var/spool/mail │ 
54 | │ ftp │ x │ 14 │ 11 │ │ /srv/ftp │ 
55 | │ http │ x │ 33 │ 33 │ │ /srv/http │ 
56 | │ nobody │ x │ 65534 │ 65534 │ Nobody │ / │ 
57 | │ dbus │ x │ 81 │ 81 │ System Message Bus │ / │ 
58 | │ systemd-journal-remote │ x │ 981 │ 981 │ systemd Journal Remote │ / │ 
59 | │ systemd-network │ x │ 980 │ 980 │ systemd Network Management │ / │ 
60 | └────────────────────────┴───┴───────┴───────┴────────────────────────────┴─────────────────┘ 
61 | ``` 
62 | 
63 | Run `csview --help` to view detailed usage. 
64 | 
65 | ### Installation 
66 | 
67 | #### On Arch Linux 
68 | 
69 | `csview` is available in the Arch User Repository. To install it from [AUR](https://aur.archlinux.org/packages/csview): 
70 | 
71 | ``` 
72 | yay -S csview 
73 | ``` 
74 | 
75 | #### On macOS 
76 | 
77 | You can install `csview` with Homebrew: 
78 | 
79 | ``` 
80 | brew install csview 
81 | ``` 
82 | 
83 | #### On NetBSD 
84 | 
85 | `csview` is available from the main pkgsrc Repositories. To install, simply run 
86 | 
87 | ``` 
88 | pkgin install csview 
89 | ``` 
90 | 
91 | or, if you prefer to build from source using [pkgsrc](https://pkgsrc.se/textproc/csview) on any of the supported platforms: 
92 | 
93 | ``` 
94 | cd /usr/pkgsrc/textproc/csview 
95 | make install 
96 | ``` 
97 | 
98 | #### On Windows 
99 | 
100 | You can install `csview` with [Scoop](https://scoop.sh/): 
101 | ``` 
102 | scoop install csview 
103 | ``` 
104 | 
105 | #### From binaries 
106 | 
107 | Pre-built versions of `csview` for various architectures are available at the [GitHub releases page](https://github.com/wfxr/csview/releases). 
108 | 
109 | *Note that you can try the `musl` version (which is statically linked) if you run into dependency-related errors.* 
110 | 
111 | #### From source 
112 | 
113 | `csview` is also published on [crates.io](https://crates.io).
If you have the latest Rust toolchain installed, you can use `cargo` to install it from source: 
114 | 
115 | ``` 
116 | cargo install --locked csview 
117 | ``` 
118 | 
119 | If you want the latest version, clone this repository and run `cargo build --release`. 
120 | 
121 | ### Benchmark 
122 | 
123 | - [small.csv](https://gist.github.com/wfxr/567e890d4db508b3c7630a96b703a57e#file-action-csv) (10 rows, 4 cols, 695 bytes): 
124 | 
125 | | Tool | Command | Mean Time | Min Time | Memory | 
126 | |:----------------------------------------------------------------------------------------:|---------------------------|----------:|----------:|----------:| 
127 | | [xsv](https://github.com/BurntSushi/xsv/tree/0.13.0) | `xsv table small.csv` | 2.0ms | 1.8ms | 3.9mb | 
128 | | [csview](https://github.com/wfxr/csview/tree/90ff90e26c3e4c4c37818d717555b3e8f90d27e3) | `csview small.csv` | **0.3ms** | **0.1ms** | **2.4mb** | 
129 | | [column](https://github.com/util-linux/util-linux/blob/stable/v2.37/text-utils/column.c) | `column -s, -t small.csv` | 1.3ms | 1.1ms | **2.4mb** | 
130 | | [csvlook](https://github.com/wireservice/csvkit/tree/1.0.6) | `csvlook small.csv` | 148.1ms | 142.4ms | 27.3mb | 
131 | 
132 | - [medium.csv](https://gist.github.com/wfxr/567e890d4db508b3c7630a96b703a57e#file-sample-csv) (10,000 rows, 10 cols, 624K bytes): 
133 | 
134 | | Tool | Command | Mean Time | Min Time | Memory | 
135 | |:----------------------------------------------------------------------------------------:|----------------------------|-----------:|-----------:|----------:| 
136 | | [xsv](https://github.com/BurntSushi/xsv/tree/0.13.0) | `xsv table medium.csv` | 0.031s | 0.029s | 4.4mb | 
137 | | [csview](https://github.com/wfxr/csview/tree/90ff90e26c3e4c4c37818d717555b3e8f90d27e3) | `csview medium.csv` | **0.017s** | **0.016s** | **2.8mb** | 
138 | | [column](https://github.com/util-linux/util-linux/blob/stable/v2.37/text-utils/column.c) | `column -s, -t medium.csv` | 0.052s | 0.050s | 9.9mb | 
139 | | [csvlook](https://github.com/wireservice/csvkit/tree/1.0.6) | `csvlook medium.csv` | 2.664s | 2.617s | 46.8mb | 
140 | 
141 | - `large.csv` (1,000,000 rows, 10 cols, 61M bytes, generated by concatenating [medium.csv](https://gist.github.com/wfxr/567e890d4db508b3c7630a96b703a57e#file-sample-csv) 100 times): 
142 | 
143 | | Tool | Command | Mean Time | Min Time | Memory | 
144 | |:----------------------------------------------------------------------------------------:|---------------------------|-----------:|-----------:|----------:| 
145 | | [xsv](https://github.com/BurntSushi/xsv/tree/0.13.0) | `xsv table large.csv` | 2.912s | 2.820s | 4.4mb | 
146 | | [csview](https://github.com/wfxr/csview/tree/90ff90e26c3e4c4c37818d717555b3e8f90d27e3) | `csview large.csv` | **1.686s** | **1.665s** | **2.8mb** | 
147 | | [column](https://github.com/util-linux/util-linux/blob/stable/v2.37/text-utils/column.c) | `column -s, -t large.csv` | 5.777s | 5.759s | 767.6mb | 
148 | | [csvlook](https://github.com/wireservice/csvkit/tree/1.0.6) | `csvlook large.csv` | 20.665s | 20.549s | 1105.7mb | 
149 | 
150 | ### F.A.Q. 
151 | 
152 | --- 
153 | #### We already have [xsv](https://github.com/BurntSushi/xsv), why build a new tool instead of contributing to it? 
154 | 
155 | `xsv` is great, but it's aimed at analyzing and manipulating csv data. 
156 | `csview` is designed for formatting and viewing. See also: [xsv/issues/156](https://github.com/BurntSushi/xsv/issues/156) 
157 | 
158 | --- 
159 | #### I encountered UTF-8 related errors, how do I solve them?
160 | 
161 | The file may use a non-UTF8 encoding. You can check the file encoding using the `file` command: 
162 | 
163 | ``` 
164 | $ file -i a.csv 
165 | a.csv: application/csv; charset=iso-8859-1 
166 | ``` 
167 | And then convert it to `utf8`: 
168 | 
169 | ``` 
170 | $ iconv -f iso-8859-1 -t UTF8//TRANSLIT a.csv -o b.csv 
171 | $ csview b.csv 
172 | ``` 
173 | 
174 | Or convert and view in one step: 
175 | 
176 | ``` 
177 | $ iconv -f iso-8859-1 -t UTF8//TRANSLIT a.csv | csview 
178 | ``` 
179 | 
180 | ### Credits 
181 | 
182 | * [csv-rust](https://github.com/BurntSushi/rust-csv) 
183 | * [prettytable-rs](https://github.com/phsym/prettytable-rs) 
184 | * [structopt](https://github.com/TeXitoi/structopt) 
185 | 
186 | ### License 
187 | 
188 | `csview` is distributed under the terms of both the MIT License and the Apache License 2.0. 
189 | 
190 | See the [LICENSE-APACHE](LICENSE-APACHE) and [LICENSE-MIT](LICENSE-MIT) files for license details. 
-------------------------------------------------------------------------------- /csv/madato.md: -------------------------------------------------------------------------------- 
1 | 
2 | # madato   [![Build Status]][travis] [![Latest Version]][crates.io] 
3 | 
4 | [Build Status]: https://travis-ci.org/inosion/madato.svg?branch=master 
5 | [travis]: https://travis-ci.org/inosion/madato 
6 | [Latest Version]: https://img.shields.io/crates/v/madato.svg 
7 | [crates.io]: https://crates.io/crates/madato 
8 | 
9 | ***madato is a library and command line tool for working with tabular data and Markdown*** 
10 | 
11 | Windows, Mac and Linux 
12 | 
13 | Converts XLSX and ODS Spreadsheets to 
14 | - JSON 
15 | - YAML 
16 | - Markdown 
17 | 
18 | ### TL;DR 
19 | 
20 | ``` 
21 | madato table -t XLSX -o JSON --sheetname Sheet2 path/to/workbook.xlsx 
22 | madato table -t XLSX -o MD --sheetname Sheet2 path/to/workbook.xlsx 
23 | madato table -t XLSX -o YAML --sheetname 'Annual Sales' path/to/workbook.xlsx 
24 | madato table -t XLSX -o YAML path/to/workbook.ods 
25 | madato table -t YAML -o MD path/to/workbook.yaml 
26 | ``` 
27 | 
28 | -------------------------------------------------------------------------------- 
29 | 
30 | The tool is primarily centered around getting tabular data (spreadsheets, CSVs) 
31 | into Markdown. 
32 | 
33 | It currently supports: 
34 | - Reading a XLS*, ODS Spreadsheet or YAML file `-- to -->` Markdown 
35 | - Reading a XLS*, ODS Spreadsheet `-- to -->` Markdown 
36 | 
37 | When generating the output: 
38 | - Filter the Rows using basic Regex over Key/Value pairs 
39 | - Limit the columns to named headings 
40 | - Re-order the columns, or repeat them using the same column feature 
41 | - Only generate a table for a named "sheet" (applicable for the XLS/ODS formats) 
42 | 
43 | Madato is: 
44 | - Command Line Tool (Windows, Mac, Linux) - good for CI/CD preprocessing 
45 | - Rust Library - Good for integration into Rust Markdown tooling 
46 | - Node JS WASM API - To be used later for Atom and VSCode Extensions 
47 | 
48 | Madato expects that every column has a heading row. That is, the first row contains the headings/column names. If a cell in that first row is blank, it will create `NULL0..NULLn` entries as required.
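For instance, a sheet whose third heading cell is blank gains a generated `NULL2` column, since columns are numbered from 0 (the `NULL5` column in the Examples below shows the same behaviour). A small sketch; the workbook, sheet name and cell values here are hypothetical:

```
madato table --type xlsx path/to/book.xlsx --sheetname data
|col1|col2|NULL2|
|----|----|-----|
| a  | b  | c   |
```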
49 | 
50 | ## Examples 
51 | 
52 | * Extract the `3rd Sheet` sheet from an MS Excel Document 
53 | ``` 
54 | 08:39 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname "3rd Sheet" 
55 | |col1|col2| col3 |col4 | col5 |NULL5| 
56 | |----|----|------|-----|-------------------------------------------------------|-----| 
57 | | 1 |that| are |wider| value ‘aaa’ is in the next cell, but has no heading | aaa | 
58 | |than|the |header| row | (open the spreadsheet to see what I mean) | | 
59 | ``` 
60 | 
61 | * Extract and reorder just 3 Columns 
62 | ``` 
63 | 08:42 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname "3rd Sheet" -c col2 -c col3 -c NULL5 
64 | |col2| col3 |NULL5| 
65 | |----|------|-----| 
66 | |that| are | aaa | 
67 | |the |header| | 
68 | ``` 
69 | * Pull from the `second_sheet` sheet 
70 | * Only extract the `Heading 4` column 
71 | * Use a Filter, where `Heading 4` values must contain a letter or number. 
72 | 
73 | ``` 
74 | 08:48 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname second_sheet -c "Heading 4" -f 'Heading 4=[a-zA-Z0-9]' 
75 | | Heading 4 | 
76 | |--------------------------| 
77 | | << empty | 
78 | |*Some Bolding in Markdown*| 
79 | | `escaped value` foo | 
80 | | 0.22 | 
81 | | #DIV/0! | 
82 | | “This cell has quotes” | 
83 | | 😕 ← Emoticon | 
84 | ``` 
85 | 
86 | * Filtering on a Column, ensuring that a "+" is present in the `Trend` column 
87 | 
88 | ``` 
89 | 09:00 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname Sheet1 -c Rank -c Language -c Trend -f "Trend=\+" 
90 | | Rank | Language |Trend | 
91 | |------------------------------------------------------|------------|------| 
92 | | 1 | Python |+5.5 %| 
93 | | 3 | Javascript |+0.2 %| 
94 | | 7 | R |+0.0 %| 
95 | | 12 | TypeScript |+0.3 %| 
96 | | 16 | Kotlin |+0.5 %| 
97 | | 17 | Go |+0.3 %| 
98 | | 20 | Rust |+0.0 %| 
99 | ``` 
100 | 
101 | ## Internals 
102 | madato uses: 
103 | - [calamine](https://github.com/tafia/calamine) for reading XLS and ODS sheets 
104 | - [wasm bindings](https://github.com/rustwasm/wasm-bindgen) to create JS API versions of the Rust API 
105 | - [regex](https://crates.io/crates/regex) for filtering, and [serde](https://crates.io/crates/serde) for serialisation. 
106 | 
107 | ## Tips 
108 | 
109 | * I have found that copying the table I want from a website (HTML) into a spreadsheet, then running it through `madato`, gives an excellent Markdown table of the original. 
110 | 
111 | ## Rust API 
112 | 
113 | ## JS API 
114 | 
115 | ## More Commandline 
116 | 
117 | ### Sheet List 
118 | 
119 | You can list the "sheets" of an XLS*, ODS file with 
120 | 
121 | ``` 
122 | $ madato sheetlist test/sample_multi_sheet.xlsx 
123 | Sheet1 
124 | second_sheet 
125 | 3rd Sheet 
126 | ``` 
127 | 
128 | ### YAML to Markdown 
129 | 
130 | Madato reads a "YAML" file in the same way it reads a spreadsheet. 
131 | This is useful for keeping tabular data in your source repository, rather than in an XLS file. 
132 | 
133 | `madato table -t yaml test/www-sample/test.yml` 
134 | 
135 | ``` 
136 | |col3| col4 | data1 | data2 | 
137 | |----|-------|---------|--------------------| 
138 | |100 |gar gar|somevalue|someother value here| 
139 | |190x| | that | nice | 
140 | |100 | ta da | this |someother value here| 
141 | ``` 
142 | 
143 | *Please see the [test/test.yml](test/test.yml) file for the expected layout of this file* 
144 | 
145 | ### Excel/ODS to YAML 
146 | 
147 | Changing the output from the default "Markdown (MD)" to "YAML", you get a YAML file of the Spreadsheet.
149 | 
150 | ``` 
151 | madato table -t xlsx test/sample_multi_sheet.xslx.xlsx -s Sheet1 -o yaml 
152 | --- 
153 | - Rank: "1" 
154 | Change: "" 
155 | Language: Python 
156 | Share: "23.59 %" 
157 | Trend: "+5.5 %" 
158 | - Rank: "2" 
159 | Change: "" 
160 | Language: Java 
161 | Share: "22.4 %" 
162 | Trend: "-0.5 %" 
163 | - Rank: "3" 
164 | Change: "" 
165 | Language: Javascript 
166 | Share: "8.49 %" 
167 | ... 
168 | ``` 
169 | 
170 | If you omit the sheet name, it will dump all sheets into an ordered map of arrays of maps. 
171 | 
172 | 
173 | ### Features 
174 | 
175 | * `[x]` Reads a formatted YAML string and renders a Markdown Table 
176 | * `[x]` Can take an optional list of column headings, and only display those from the table (filtering out other columns present) 
177 | * `[X]` Native Binary Command Line (windows, linux, osx) 
178 | * `[X]` Read an XLSX file and produce a Markdown Table 
179 | * `[X]` Read an ODS file and produce a Markdown Table 
180 | * `[ ]` Read a CSV, TSV, PSV (etc) file and produce a Markdown Table 
181 | * `[ ]` Support Nested Structures in the YAML input 
182 | * `[ ]` Read a Markdown File, and select the "table" and turn it back into YAML 
183 | 
184 | ### Future Goals 
185 | * Finish the testing and publishing of the JS WASM Bindings. (PS - it works, 
186 | see [test/www-sample](test/www-sample) and the [Makefile](Makefile)) 
187 | * Embed the "importing" of YAML, CSV and XLS* files into the `mume` Markdown Preview Enhanced Plugin. [https://shd101wyy.github.io/markdown-preview-enhanced/](https://shd101wyy.github.io/markdown-preview-enhanced/) So we can have Awesome Markdown Documents. 
188 | * Provide a `PreRenderer` for [rust-lang-nursery/mdBook](https://github.com/rust-lang-nursery/mdBook) to "import" MD tables from files. 
189 | 
190 | ### Known Issues 
191 | * A Spreadsheet Cell with a Date will come out as the "magic" Excel date number :-( - https://github.com/tafia/calamine/issues/116 
192 | 
193 | ## License 
194 | 
195 | madato is licensed under either of 
196 | 
197 | * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or 
198 | http://www.apache.org/licenses/LICENSE-2.0) 
199 | * MIT license ([LICENSE-MIT](LICENSE-MIT) or 
200 | http://opensource.org/licenses/MIT) 
201 | 
202 | at your option. 
203 | 
204 | ### Contribution 
205 | 
206 | Unless you explicitly state otherwise, any contribution intentionally submitted 
207 | for inclusion in madato by you, as defined in the Apache-2.0 license, shall be 
208 | dual licensed as above, without any additional terms or conditions. 
-------------------------------------------------------------------------------- /csv/xtab.md: -------------------------------------------------------------------------------- 
1 | # xtab 
2 | 
3 | 🦀 CSV command line utilities 
4 | 
5 | ## install 
6 | 
7 | ##### step 1: install cargo first 
8 | 
9 | ```bash 
10 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh 
11 | ``` 
12 | 
13 | ##### step 2: 
14 | 
15 | ```bash 
16 | cargo install xtab 
17 | # or 
18 | 
19 | git clone https://github.com/sharkLoc/xtab.git 
20 | cd xtab 
21 | cargo b --release 
22 | # mv target/release/xtab to anywhere you want 
23 | ``` 
24 | 
25 | ## usage 
26 | 
27 | ```bash 
28 | xtab -- CSV command line utilities 
29 | Version: 0.0.8 
30 | 
31 | Authors: sharkLoc 
32 | Source code: https://github.com/sharkLoc/xtab.git 
33 | 
34 | xtab supports reading and writing gzip/bzip2/xz format file.
35 | Compression level: 36 | format range default crate 37 | gzip 1-9 6 https://crates.io/crates/flate2 38 | bzip2 1-9 6 https://crates.io/crates/bzip2 39 | xz 1-9 6 https://crates.io/crates/xz2 40 | 41 | 42 | Usage: xtab [OPTIONS] [CSV] 43 | 44 | Commands: 45 | addheader Set new header for CSV file [aliases: ah] 46 | csv2xlsx Convert CSV/TSV files to XLSX file [aliases: c2x] 47 | dim Dimensions of CSV file 48 | drop Drop or Select CSV fields by columns index 49 | flatten flattened view of CSV records [aliases: flat] 50 | freq Build frequency table of selected column in CSV data 51 | head Print first N records from CSV file 52 | pretty Convert CSV to a readable aligned table [aliases: prt] 53 | replace Replace data of matched fields 54 | reverse Reverses rows of CSV data [aliases: rev] 55 | sample Randomly select rows from CSV file using reservoir sampling 56 | search Applies the regex to each field individually and shows only matching rows 57 | slice Slice rows from a part of a CSV file 58 | tail Print last N records from CSV file 59 | transpose Transpose CSV data [aliases: trans] 60 | uniq Unique data with keys 61 | xlsx2csv Convert XLSX to CSV format [aliases: x2c] 62 | view Show CSV file content 63 | help Print this message or the help of the given subcommand(s) 64 | 65 | Global Arguments: 66 | -d, --delimiter Set delimiter for input csv file, e.g., in linux -d $'\t' for tab, in powershell -d `t for tab [default: ,] 67 | -D, --out-delimiter Set delimiter for output CSV file, e.g., in linux -D $'\t' for tab, in powershell -D `t for tab [default: ,] 68 | --log If file name specified, write log message to this file, or write to stderr 69 | --compress-level Set compression level 1 (compress faster) - 9 (compress better) for gzip/bzip2/xz output file, only works with option -o/--out [default: 6] 70 | -v, --verbosity... control verbosity of logging, [-v: Error, -vv: Warn, -vvv: Info, -vvvv: Debug, -vvvvv: Trace, default: Debug] 71 | [CSV] Input CSV file name, if file not specified read data from stdin 72 | 73 | Global FLAGS: 74 | -H, --no-header If set, the first row is treated as a special header row, and the original header row excluded from output 75 | -q, --quiet Be quiet and do not show any extra information 76 | -h, --help prints help information 77 | -V, --version prints version information 78 | 79 | Use "xtab help [command]" for more information about a command 80 | ``` 81 | -------------------------------------------------------------------------------- /dna/fakit.md: -------------------------------------------------------------------------------- 1 | # fakit 2 | 3 | 🦀 a simple program for fasta file manipulation 4 | 5 | ## install latest version 6 | 7 | ```bash 8 | cargo install --git https://github.com/sharkLoc/fakit.git 9 | ``` 10 | 11 | ## install 12 | 13 | ```bash 14 | cargo install fakit 15 | ``` 16 | 17 | ## usage 18 | 19 | ```bash 20 | Fakit: A simple program for fasta file manipulation 21 | 22 | Version: 0.3.6 23 | 24 | Authors: sharkLoc 25 | Source code: https://github.com/sharkLoc/fakit.git 26 | 27 | Fakit supports reading and writing gzip (.gz) format. 28 | Bzip2 (.bz2) and xz (.xz) formats are supported since v0.3.0. 29 | Under the same compression level, xz has the highest compression ratio but consumes more time. 
30 | 31 | Compression level: 32 | format range default crate 33 | gzip 1-9 6 https://crates.io/crates/flate2 34 | bzip2 1-9 6 https://crates.io/crates/bzip2 35 | xz 1-9 6 https://crates.io/crates/xz2 36 | 37 | 38 | Usage: fakit [OPTIONS] 39 | 40 | Commands: 41 | topn get first N records from fasta file [aliases: head] 42 | tail get last N records from fasta file 43 | fa2fq convert fasta to fastq file 44 | faidx create index and random access to fasta files [aliases: fai] 45 | flatten flatten fasta sequences [aliases: flat] 46 | range print fasta records in a range 47 | rename rename sequence id in fasta file 48 | reverse get a reverse-complement of fasta file [aliases: rev] 49 | window stat dna fasta gc content by sliding windows [aliases: slide] 50 | grep grep fasta sequences by name/seq 51 | seq convert all bases to lower/upper case, filter by length 52 | sort sort fasta file by name/seq/gc/length 53 | search search subsequences/motifs from fasta file 54 | shuffle shuffle fasta sequences 55 | size report fasta sequence base count 56 | subfa subsample sequences from big fasta file 57 | split split fasta file by sequence id 58 | split2 split fasta file by sequence number 59 | summ simple summary for dna fasta files [aliases: stat] 60 | codon show codon table and amino acid name 61 | help Print this message or the help of the given subcommand(s) 62 | 63 | Global Arguments: 64 | -w, --line-width line width when outputting fasta sequences, 0 for no wrap [default: 70] 65 | --compress-level set gzip/bzip2/xz compression level 1 (compress faster) - 9 (compress better) for gzip/bzip2/xz output file, only works with 66 | option -o/--out [default: 6] 67 | --log if file name specified, write log message to this file, or write to stderr 68 | -v, --verbosity... control verbosity of logging, [-v: Error, -vv: Warn, -vvv: Info, -vvvv: Debug, -vvvvv: Trace, default: Debug] 69 | 70 | Global FLAGS: 71 | -q, --quiet be quiet and do not show extra information 72 | -h, --help prints help information 73 | -V, --version prints version information 74 | 75 | Use "fakit help [command]" for more information about a command 76 | 77 | ``` 78 | 79 | 
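A few quick, illustrative invocations. The subcommand names come from the help text above, but the file names are placeholders and the exact options of each subcommand may differ between versions, so treat this as a sketch and check `fakit help [command]` before relying on it:

```bash
# report per-sequence base counts from a (gzipped) fasta file
fakit size reads.fa.gz

# simple summary statistics for a dna fasta file (alias: stat)
fakit summ reads.fa.gz

# take the first records of a file, writing uncompressed fasta to stdout
fakit topn reads.fa.gz > head.fa
```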
80 | ** any bugs please report issues **💖 81 | -------------------------------------------------------------------------------- /dna/fq.md: -------------------------------------------------------------------------------- 1 | # fq 2 | 3 | [![CI status](https://github.com/stjude-rust-labs/fq/workflows/CI/badge.svg)](https://github.com/stjude-rust-labs/fq/actions) 4 | 5 | **fq** filters, generates, subsamples, and validates [FASTQ] files. 6 | 7 | [FASTQ]: https://en.wikipedia.org/wiki/FASTQ_format 8 | 9 | ## Install 10 | 11 | There are different methods to install fq. 12 | 13 | ### Releases 14 | 15 | [Precompiled binaries are built][releases] for modern Linux distributions 16 | (`x86_64-unknown-linux-gnu`), macOS (`x86_64-apple-darwin`), and Windows 17 | (`x86_64-pc-windows-msvc`). The Linux binaries require glibc 2.18+ (CentOS/RHEL 18 | 8+, Debian 8+, Ubuntu 14.04+, etc.). 19 | 20 | [releases]: https://github.com/stjude-rust-labs/fq/releases 21 | 22 | ### Conda 23 | 24 | fq is available via [Bioconda]. 25 | 26 | ``` 27 | $ conda install fq=0.11.0 28 | ``` 29 | 30 | [Bioconda]: https://bioconda.github.io/recipes/fq/README.html 31 | 32 | ### Manual 33 | 34 | Clone the repository and use [Cargo] to install fq. 35 | 36 | ``` 37 | $ git clone --depth 1 --branch v0.11.0 https://github.com/stjude-rust-labs/fq.git 38 | $ cd fq 39 | $ cargo install --locked --path . 40 | ``` 41 | 42 | [Cargo]: https://doc.rust-lang.org/cargo/getting-started/installation.html 43 | 44 | ### Container image 45 | 46 | Container images are managed by Bioconda and available through [Quay.io], e.g., 47 | using [Docker]: 48 | 49 | ``` 50 | $ docker image pull quay.io/biocontainers/fq: 51 | ``` 52 | 53 | See [the repository tags] for the available tags. 54 | 55 | Alternatively, build the development container image: 56 | 57 | ``` 58 | $ git clone --depth 1 --branch v0.11.0 https://github.com/stjude-rust-labs/fq.git 59 | $ cd fq 60 | $ docker image build --tag fq:0.11.0 . 61 | ``` 62 | 63 | [Quay.io]: https://quay.io/repository/biocontainers/fq 64 | [the repository tags]: https://quay.io/repository/biocontainers/fq?tab=tags 65 | [Docker]: https://www.docker.com/ 66 | 67 | ## Usage 68 | 69 | fq provides subcommands for filtering, generating, subsampling, and 70 | validating FASTQ files. 71 | 72 | ### filter 73 | 74 | **fq filter** filters a given FASTQ file by a set of names or a sequence 75 | pattern. The result includes only the records that match the given options. 76 | 77 | #### Usage 78 | 79 | ``` 80 | Filters a FASTQ file 81 | 82 | Usage: fq filter [OPTIONS] --dsts [SRCS]... 83 | 84 | Arguments: 85 | [SRCS]... FASTQ sources 86 | 87 | Options: 88 | --names 89 | Allowlist of record names 90 | --sequence-pattern 91 | Keep records that have sequences that match the given regular expression 92 | --dsts 93 | Filtered FASTQ destinations 94 | -h, --help 95 | Print help 96 | -V, --version 97 | Print version 98 | ``` 99 | 100 | #### Examples 101 | 102 | ```sh 103 | # Filters an input FASTQ using the given allowlist. 104 | $ fq filter --names allowlist.txt --dsts /dev/stdout in.fastq 105 | 106 | # Filters FASTQ files by matching a sequence pattern in the first input's 107 | # records and applying the match to all inputs. 108 | $ fq filter --sequence-pattern ^TC --dsts out.1.fq --dsts out.2.fq in.1.fq in.2.fq 109 | ``` 110 | 111 | ### generate 112 | 113 | **fq generate** is a FASTQ file pair generator. It creates two reads, formatting 114 | names as [described by Illumina][1]. 
115 | 116 | While _generate_ creates "valid" FASTQ reads, the content of the files is 117 | completely random. The sequences do not align to any genome. 118 | 119 | [1]: https://help.basespace.illumina.com/articles/descriptive/fastq-files/ 120 | 121 | #### Usage 122 | 123 | ``` 124 | Generates a random FASTQ file pair 125 | 126 | Usage: fq generate [OPTIONS] <R1_DST> <R2_DST> 127 | 128 | Arguments: 129 | <R1_DST> Read 1 destination. Output will be gzipped if ends in `.gz` 130 | <R2_DST> Read 2 destination. Output will be gzipped if ends in `.gz` 131 | 132 | Options: 133 | -s, --seed <SEED> Seed to use for the random number generator 134 | -n, --record-count <RECORD_COUNT> Number of records to generate [default: 10000] 135 | --read-length <READ_LENGTH> Number of bases in the sequence [default: 101] 136 | -h, --help Print help 137 | -V, --version Print version 138 | ``` 139 | 140 | #### Examples 141 | 142 | ```sh 143 | # Generates the default number of records, written to uncompressed files. 144 | $ fq generate /tmp/r1.fastq /tmp/r2.fastq 145 | 146 | # Generates FASTQ paired reads with 32 records, written to gzipped outputs. 147 | $ fq generate --record-count 32 /tmp/r1.fastq.gz /tmp/r2.fastq.gz 148 | ``` 149 | 150 | ### lint 151 | 152 | **fq lint** is a FASTQ file pair validator. 153 | 154 | #### Usage 155 | 156 | ``` 157 | Validates a FASTQ file pair 158 | 159 | Usage: fq lint [OPTIONS] <R1_SRC> [R2_SRC] 160 | 161 | Arguments: 162 | <R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs 163 | [R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs 164 | 165 | Options: 166 | --lint-mode <LINT_MODE> 167 | Panic on first error or log all errors [default: panic] [possible values: panic, log] 168 | --single-read-validation-level <SINGLE_READ_VALIDATION_LEVEL> 169 | Only use single read validators up to a given level [default: high] [possible values: low, medium, high] 170 | --paired-read-validation-level <PAIRED_READ_VALIDATION_LEVEL> 171 | Only use paired read validators up to a given level [default: high] [possible values: low, medium, high] 172 | --disable-validator <DISABLE_VALIDATOR> 173 | Disable validators by code. Use multiple times to disable more than one 174 | -h, --help 175 | Print help 176 | -V, --version 177 | Print version 178 | ``` 179 | 180 | #### Validators 181 | 182 | _validate_ includes a set of validators that run on single or paired records. 183 | By default, records are validated with all rules, but validators can be 184 | disabled using `--disable-validator CODE`, where `CODE` is one of the validators 185 | listed below. 186 | 187 | ##### Single 188 | 189 | | Code | Level | Name | Validation 190 | |------|--------|-------------------|------------ 191 | | S001 | low | PlusLine | Plus line starts with a "+". 192 | | S002 | medium | Alphabet | All characters in sequence line are one of "ACGTN", case-insensitive. 193 | | S003 | high | Name | Name line starts with an "@". 194 | | S004 | low | Complete | All four record lines (name, sequence, plus line, and quality) are present. 195 | | S005 | high | ConsistentSeqQual | Sequence and quality lengths are the same. 196 | | S006 | medium | QualityString | All characters in quality line are between "!" and "~" (ordinal values). 197 | | S007 | high | DuplicateName | All record names are unique. 198 | 199 | ##### Paired 200 | 201 | | Code | Level | Name | Validation 202 | |------|---------|-------------------|------------ 203 | | P001 | medium | Names | Each paired read name is the same, excluding interleave. 204 | 205 | #### Examples 206 | 207 | ```sh 208 | # Validate both reads using all validators. Exits cleanly (0) if no validation 209 | # errors occur. 
210 | $ fq lint r1.fastq r2.fastq 211 | 212 | # Log errors instead of quitting on first error. 213 | $ fq lint --lint-mode log r1.fastq r2.fastq 214 | 215 | # Disable validators S004 and S007. 216 | $ fq lint --disable-validator S004 --disable-validator S007 r1.fastq r2.fastq 217 | ``` 218 | 219 | ### subsample 220 | 221 | **fq subsample** outputs a subset of records from single or paired FASTQ files. 222 | 223 | When using a probability (`-p, --probability`), each file is read through once, 224 | and a subset of records is selected based on that chance. Given the randomness 225 | used when sampling a uniform distribution, the output record count will not be 226 | exact but (statistically) close. 227 | 228 | When using a record count (`-n, --record-count`), the first input is read 229 | twice, but it provides an exact number of records to be selected. 230 | 231 | A seed (`-s, --seed`) can be provided to influence the results, e.g., 232 | for a deterministic subset of records. 233 | 234 | For paired input, the sampling is applied to each pair. 235 | 236 | #### Usage 237 | 238 | ``` 239 | Outputs a subset of records 240 | 241 | Usage: fq subsample [OPTIONS] --r1-dst <R1_DST> <--probability <PROBABILITY>|--record-count <RECORD_COUNT>> <R1_SRC> [R2_SRC] 242 | 243 | Arguments: 244 | <R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs 245 | [R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs 246 | 247 | Options: 248 | -p, --probability <PROBABILITY> The probability a record is kept, as a fraction in (0.0, 1.0). Cannot be used with `record-count` 249 | -n, --record-count <RECORD_COUNT> The exact number of records to keep. Cannot be used with `probability` 250 | -s, --seed <SEED> Seed to use for the random number generator 251 | --r1-dst <R1_DST> Read 1 destination. Output will be gzipped if ends in `.gz` 252 | --r2-dst <R2_DST> Read 2 destination. Output will be gzipped if ends in `.gz` 253 | -h, --help Print help 254 | -V, --version Print version 255 | ``` 256 | 257 | #### Examples 258 | 259 | ```sh 260 | # Sample ~50% of records from a single FASTQ file 261 | $ fq subsample --probability 0.5 --r1-dst r1.50pct.fastq r1.fastq 262 | 263 | # Sample ~50% of records from a single FASTQ file and seed the RNG 264 | $ fq subsample --probability 0.5 --seed 13 --r1-dst r1.50pct.fastq r1.fastq 265 | 266 | # Sample ~25% of records from paired FASTQ files 267 | $ fq subsample --probability 0.25 --r1-dst r1.25pct.fastq --r2-dst r2.25pct.fastq r1.fastq r2.fastq 268 | 269 | # Sample ~10% of records from a gzipped FASTQ file and compress output 270 | $ fq subsample --probability 0.1 --r1-dst r1.10pct.fastq.gz r1.fastq.gz 271 | 272 | # Sample exactly 10000 records from a single FASTQ file 273 | $ fq subsample --record-count 10000 --r1-dst r1.10k.fastq r1.fastq 274 | ``` -------------------------------------------------------------------------------- /dna/ngs.md: -------------------------------------------------------------------------------- 1 | 
2 | # ngs 3 | 4 | Command line utility for working with next-generation sequencing files. 5 | 6 | Badges from the original README header: CI status · crates.io version · crates.io downloads · License: Apache 2.0 · License: MIT 7 | 8 | Explore the docs » · Request Feature · Report Bug · ⭐ Consider starring the repo! ⭐ 
41 | 42 | 43 | ## 🎨 Features 44 | 45 | * **[`ngs convert`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-convert).** Convert between next-generation sequencing formats. 46 | * **[`ngs derive`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-derive).** Forensic analysis tool for next-generation sequencing data. 47 | * **[`ngs generate`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-generate).** Generates a BAM file from a given reference genome. 48 | * **[`ngs index`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-index).** Generates the index file for various next-generation sequencing files. 49 | * **[`ngs list`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-list).** Utility to list various supported items in this command line tool. 50 | * **[`ngs plot`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-plot).** Produces plots for data generated by `ngs qc`. 51 | * **[`ngs qc`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-qc).** Generates quality control metrics for BAM files. 52 | * **[`ngs view`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-view).** Views various next-generation sequencing files, sometimes with a query region. 53 | 54 | 55 | ## Guiding Principles 56 | 57 | * **Modern, reliable foundation for everyday bioinformatics analysis—written in Rust.** `ngs` aims to package together a fairly comprehensive set of analysis tools and utilities for everyday work in bioinformatics. It is built with modern, multi-core systems in mind and written in Rust. Though we are not there today, we plan to work towards this goal in the future. 58 | * **Runs on readily available hardware/software.** We aim for every subcommand within `ngs` to run within most computing environments without the need for special hardware or software. Practically, this means we've designed `ngs` to run in any UNIX-like environment that has at least four (4) cores and sixteen (16) GB of RAM. Often, tools will run with fewer resources. This design decision is important and sometimes means that `ngs` runs slower than it otherwise could. 59 | 60 | ## 📚 Getting Started 61 | 62 | ### Installation 63 | 64 | To install the latest released version, you can simply use `cargo`. 65 | 66 | ```bash 67 | cargo install ngs 68 | ``` 69 | 70 | To install the latest version on `main`, you can use the following command. 71 | 72 | ```bash 73 | cargo install --locked --git https://github.com/stjude-rust-labs/ngs.git 74 | ``` 75 | 76 | ### Using Docker 77 | 78 | ```bash 79 | docker pull ghcr.io/stjude-rust-labs/ngs 80 | docker run -it --rm --volume "$(pwd)":/data ghcr.io/stjude-rust-labs/ngs 81 | ``` 82 | 83 | `/data` is the working directory of the docker image. Running this command from the directory with your data will allow 84 | the container to act on those files. 85 | 86 | Note: Currently the `latest` tag refers to the latest release of `ngs` and not the most recent code changes in this 87 | repository. 88 | 89 | ## 🖥️ Development 90 | 91 | To bootstrap a development environment, please use the following commands. 92 | 93 | ```bash 94 | # Clone the repository 95 | git clone git@github.com:stjude-rust-labs/ngs.git 96 | cd ngs 97 | 98 | # Run the command line tool using cargo. 99 | cargo run -- -h 100 | ``` 101 | 102 | ## 🚧️ Tests 103 | 104 | ```bash 105 | # Run the project's tests. 106 | cargo test 107 | 108 | # Ensure the project doesn't have any linting warnings. 109 | cargo clippy 110 | 111 | # Ensure the project passes `cargo fmt`. 
112 | cargo fmt --check 113 | ``` 114 | 115 | ## Minimum Supported Rust Version (MSRV) 116 | 117 | The minimum supported Rust version for this project is 1.64.0. 118 | 119 | ## 🤝 Contributing 120 | 121 | Contributions, issues and feature requests are welcome! Feel free to check 122 | [issues page](https://github.com/stjude-rust-labs/ngs/issues). 123 | 124 | ## 📝 License 125 | 126 | * All code related to the `ngs derive instrument` subcommand is licensed under the [AGPL v2.0][agpl-v2]. This is not due to any strict requirement, but out of deference to some [code][10x-inspiration] that inspired our strategy (and from which patterns were copied), the decision was made to license this code consistently. 127 | * The rest of this project is licensed as either [Apache 2.0][license-apache] or 128 | [MIT][license-mit] at your discretion. 129 | 130 | Copyright © 2021-Present [St. Jude Children's Research 131 | Hospital](https://github.com/stjude). 132 | 133 | [10x-inspiration]: https://github.com/10XGenomics/supernova/blob/master/tenkit/lib/python/tenkit/illumina_instrument.py 134 | [agpl-v2]: http://www.affero.org/agpl2.html 135 | [contributing-md]: https://github.com/stjude-rust-labs/ngs/blob/master/CONTRIBUTING.md 136 | [license-apache]: https://github.com/stjude-rust-labs/ngs/blob/master/LICENSE-APACHE 137 | [license-mit]: https://github.com/stjude-rust-labs/ngs/blob/master/LICENSE-MIT -------------------------------------------------------------------------------- /dna/rust-bio-tools.md: -------------------------------------------------------------------------------- 1 | [![Gitpod Ready-to-Code](https://img.shields.io/badge/Gitpod-ready--to--code-blue?logo=gitpod)](https://gitpod.io/#https://github.com/rust-bio/rust-bio-tools) 2 | [![Bioconda downloads](https://img.shields.io/conda/dn/bioconda/rust-bio-tools.svg?style=flat)](http://bioconda.github.io/recipes/rust-bio-tools/README.html) 3 | [![Bioconda version](https://img.shields.io/conda/vn/bioconda/rust-bio-tools.svg?style=flat)](http://bioconda.github.io/recipes/rust-bio-tools/README.html) 4 | [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/rust-bio-tools/README.html) 5 | [![Licence](https://img.shields.io/conda/l/bioconda/rust-bio-tools.svg?style=flat)](http://bioconda.github.io/recipes/rust-bio-tools/README.html) 6 | [![GitHub Workflow Status](https://img.shields.io/github/workflow/status/rust-bio/rust-bio-tools/CI)](https://github.com/rust-bio/rust-bio-tools/actions) 7 | 8 | # Rust-Bio-Tools 9 | 10 | A set of ultra fast and robust command line utilities for bioinformatics tasks based on Rust-Bio. 
11 | Rust-Bio-Tools provides a command `rbt`, which currently supports the following operations: 12 | 13 | * a linear time implementation for fuzzy matching of two vcf/bcf files (`rbt vcf-match`) 14 | * a vcf/bcf to txt converter that flexibly allows selecting tags and properly handles multiallelic sites (`rbt vcf-to-txt`) 15 | * a linear time round-robin FASTQ splitter that splits a given FASTQ file into a given number of chunks (`rbt fastq-split`) 16 | * a linear time extraction of depth information from BAMs at given loci (`rbt bam-depth`) 17 | * a utility to quickly filter records from a FASTQ file (`rbt fastq-filter`) 18 | * a tool to merge BAM or FASTQ reads using marked duplicates or unique molecular identifiers (UMIs) (`rbt collapse-reads-to-fragments bam|fastq`) 19 | * a tool to generate interactive HTML based reports that offer multiple plots visualizing the provided genomics data in VCF and BAM format (`rbt vcf-report`) 20 | * a tool to generate an interactive HTML based report from a csv file including visualizations (`rbt csv-report`) 21 | * a tool for splitting VCF/BCF files into N equal chunks, including BND support (`rbt vcf-split`) 22 | * a tool to generate visualizations for a specific region of one or multiple BAM files with a given reference contained in a single HTML file (`rbt plot-bam`) 23 | 24 | Further functionality is added as it is needed by the authors. Check out the [Contributing](#Contributing) section if you want to contribute anything yourself. 25 | For a list of changes, take a look at the [CHANGELOG](CHANGELOG.md). 26 | 27 | 28 | ## Installation 29 | 30 | ### Requirements 31 | 32 | Rust-Bio-Tools depends on [rgsl](https://docs.rs/GSL/*/rgsl/), which needs [GSL](https://www.gnu.org/software/gsl/) to be installed: 33 | 34 | - Ubuntu: `sudo apt-get install libgsl-dev` 35 | - Arch: `sudo pacman -S gsl` 36 | - OSX: `brew install gsl` 37 | 38 | ### Bioconda 39 | 40 | Rust-Bio-Tools is available via [Bioconda](https://bioconda.github.io). 41 | With Bioconda set up, installation is as easy as 42 | 43 | conda install rust-bio-tools 44 | 45 | ### Cargo 46 | 47 | If the [Rust](https://www.rust-lang.org/tools/install) compiler and associated [Cargo](https://github.com/rust-lang/cargo/) are installed, Rust-Bio-Tools may be installed via 48 | 49 | cargo install rust-bio-tools 50 | 51 | ### Source 52 | 53 | Download the source code and within the root directory of the source run 54 | 55 | cargo install 56 | 57 | ## Usage and Documentation 58 | 59 | Rust-Bio-Tools installs a command line utility `rbt`. Issue 60 | 61 | rbt --help 62 | 63 | for a summary of all options and tools. 64 | 65 | ## Contributing 66 | 67 | Any contributions are highly welcome. If you plan to contribute we suggest installing pre-commit hooks. To do so: 68 | 1. Install `pre-commit` as explained [here](https://pre-commit.com/#installation) 69 | 2. Run `pre-commit install` in the rust-bio-tools base directory 70 | 71 | This should format, check and lint your code when committing. 
72 | 73 | ## Authors 74 | 75 | * [Johannes Köster](https://github.com/johanneskoester) (https://koesterlab.github.io) 76 | * [Felix Mölder](https://github.com/FelixMoelder) 77 | * [Henning Timm](https://github.com/HenningTimm) 78 | * [Felix Wiegand](https://github.com/fxwiegand) -------------------------------------------------------------------------------- /dna/skc.md: -------------------------------------------------------------------------------- 1 | # skc 2 | 3 | `skc` is a simple tool for finding shared k-mer content between two genomes. 4 | 5 | ## Installation 6 | 7 | ### Prebuilt binary 8 | 9 | ``` 10 | curl -sSL skc.mbh.sh | sh 11 | # or with wget 12 | wget -nv -O - skc.mbh.sh | sh 13 | ``` 14 | 15 | You can also pass options to the script like so 16 | 17 | ```text 18 | $ curl -sSL skc.mbh.sh | sh -s -- --help 19 | install.sh [option] 20 | 21 | Fetch and install the latest version of skc, if skc is already 22 | installed it will be updated to the latest version. 23 | 24 | Options 25 | -V, --verbose 26 | Enable verbose output for the installer 27 | 28 | -f, -y, --force, --yes 29 | Skip the confirmation prompt during installation 30 | 31 | -p, --platform 32 | Override the platform identified by the installer 33 | 34 | -b, --bin-dir 35 | Override the bin installation directory [default: /usr/local/bin] 36 | 37 | -a, --arch 38 | Override the architecture identified by the installer [default: x86_64] 39 | 40 | -B, --base-url 41 | Override the base URL used for downloading releases [default: https://github.com/mbhall88/skc/releases] 42 | 43 | -h, --help 44 | Display this help message 45 | 46 | ``` 47 | 48 | ### Cargo 49 | 50 | ```text 51 | cargo install skc 52 | ``` 53 | 54 | ### Conda 55 | 56 | ```text 57 | conda install skc 58 | ``` 59 | 60 | ### Local 61 | 62 | ```text 63 | cargo build --release 64 | ./target/release/skc --help 65 | ``` 66 | 67 | ## Usage 68 | 69 | Check for shared 16-mers between the [HIV-1 genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1) and the [ 70 | *Mycobacterium tuberculosis* genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_000962.3). 71 | 72 | ```text 73 | $ skc -k 16 NC_001802.1.fa NC_000962.3.fa 74 | [2023-06-20T01:46:36Z INFO ] 9079 unique k-mers in target 75 | [2023-06-20T01:46:38Z INFO ] 2 shared k-mers between target and query 76 | >4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106 77 | TGCAGAACATCCAGGG 78 | >4237062597 tcount=1 qcount=1 tpos=NC_001802.1:8415 qpos=NC_000962.3:629482 79 | CCAGCAGCAGATAGGG 80 | ``` 81 | 82 | So we can see there are two shared 16-mers between the genomes. By default, the shared k-mers are written to stdout - 83 | use the `-o` option to write them to file. 84 | 85 | ### Fasta description 86 | 87 | Example: `>4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106` 88 | 89 | The ID (`4233642782`) is the 64-bit integer representation of the k-mer's value in bit-space 90 | (see [Daniel Liu's brilliant cute-nucleotides][cute] for more information). `tcount` and `qcount` are the 91 | number of times the k-mer is present in the target and query genomes, respectively. `tpos` and `qpos` are the (1-based) 92 | k-mer starting position(s) within the target and query contigs - these will be comma-separated if the k-mer occurs 93 | multiple times. 
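To make the bit-space ID concrete, here is a minimal Rust sketch of packing a k-mer (k ≤ 32) into a `u64` at two bits per base. This is illustrative only: skc's actual encoding comes from the cute-nucleotides routines linked above, and the A/C/G/T → 0/1/2/3 mapping used here is an assumption, so the resulting integer need not match skc's IDs exactly.

```rust
/// Pack a k-mer (k <= 32) into a u64, two bits per base.
/// The base-to-bits mapping below is illustrative, not skc's.
fn kmer_to_u64(kmer: &[u8]) -> Option<u64> {
    assert!(kmer.len() <= 32, "k must be at most 32 to fit in a u64");
    let mut packed: u64 = 0;
    for &base in kmer {
        let bits: u64 = match base.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // N or other ambiguity codes have no 2-bit encoding
        };
        packed = (packed << 2) | bits;
    }
    Some(packed)
}

fn main() {
    // one of the shared 16-mers from the example output above
    let id = kmer_to_u64(b"TGCAGAACATCCAGGG").unwrap();
    println!("{id}");
}
```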
94 | 95 | ### Usage help 96 | 97 | ```text 98 | $ skc --help 99 | Shared k-mer content between two genomes 100 | 101 | Usage: skc [OPTIONS] <TARGET> <QUERY> 102 | 103 | Arguments: 104 | <TARGET> 105 | Target sequence 106 | 107 | Can be compressed with gzip, bzip2, xz, or zstd 108 | 109 | <QUERY> 110 | Query sequence 111 | 112 | Can be compressed with gzip, bzip2, xz, or zstd 113 | 114 | Options: 115 | -k, --kmer <KMER> 116 | Size of k-mers (max. 32) 117 | 118 | [default: 21] 119 | 120 | -o, --output <OUTPUT> 121 | Output filepath(s); stdout if not present 122 | 123 | -O, --output-type <OUTPUT_TYPE> 124 | u: uncompressed; b: Bzip2; g: Gzip; l: Lzma; z: Zstd 125 | 126 | Output compression format is automatically guessed from the filename extension. This option is used to override that 127 | 128 | [default: u] 129 | 130 | -l, --compress-level <COMPRESS_LEVEL> 131 | Compression level to use if compressing output 132 | 133 | [default: 6] 134 | 135 | -h, --help 136 | Print help (see a summary with '-h') 137 | 138 | -V, --version 139 | Print version 140 | ``` 141 | 142 | ### Caveats 143 | 144 | - Make the first genome passed (`<TARGET>`) the smallest genome. This is to reduce memory usage, as all unique k-mers 145 | (well, their `u64` values) for this genome will be held in memory. 146 | - We do not use canonical k-mers 147 | - 32 is the largest k-mer size that can be used. This is basically a (lazy) implementation decision, but also helps to 148 | keep the memory footprint as low as possible. If you want larger k-mer values, I would suggest checking out some of 149 | the [similar tools](#alternate-tools). 150 | 151 | ## Alternate tools 152 | 153 | `skc` does not claim to be the fastest or most memory-efficient tool to find shared k-mer content. I basically wrote it 154 | because I either struggled to install some alternative tools, they were clunky/verbose, or it was laborious to get shared 155 | k-mers out of the results (e.g. can only search one k-mer at a time or have to run many different subcommands). Here is 156 | a (non-exhaustive) list of other tools that can be used to get shared k-mer content 157 | 158 | - [unikmer](https://github.com/shenwei356/unikmer) - this was brought to my attention *after* I wrote `skc`. Had I known 159 | about it beforehand, I probably wouldn't have written `skc`. So I would recommend unikmer for almost all use 160 | cases - [Wei Shen](https://github.com/shenwei356) writes awesome tools 161 | - [Jellyfish](https://github.com/gmarcais/Jellyfish) 162 | - [REINDEER](https://github.com/kamimrcht/REINDEER) 163 | - [kmer-db](https://github.com/refresh-bio/kmer-db) 164 | - [GGCAT](https://github.com/algbio/ggcat) 165 | - [KAT](https://github.com/TGAC/KAT) 166 | 167 | ## Acknowledgements 168 | 169 | [Daniel Liu's brilliant cute-nucleotides][cute] is used to (rapidly) convert k-mers into 64-bit integers. 170 | 171 | 172 | [cute]: https://github.com/Daniel-Liu-c0deb0t/cute-nucleotides 173 | -------------------------------------------------------------------------------- /fastq/fasten.md: -------------------------------------------------------------------------------- 1 | # Fasten 2 | 3 | [![Crates.io](https://img.shields.io/crates/v/fasten)](https://crates.io/crates/fasten) 4 | [![CI](https://github.com/lskatz/fasten/actions/workflows/basic.yml/badge.svg)](https://github.com/lskatz/fasten/actions/workflows/basic.yml) 5 | [![DOI](https://joss.theoj.org/papers/10.21105/joss.06030/status.svg)](https://doi.org/10.21105/joss.06030) 6 | 7 | A powerful manipulation suite for interleaved fastq files. 
8 | Executables can read/write to `stdin` and `stdout`, and they are compatible with the interleaved fastq format. 9 | This makes it much easier to perform streaming operations using unix pipes. 10 | 11 | ## Synopsis 12 | 13 | ### read metrics 14 | 15 | $ cat testdata/R1.fastq testdata/R2.fastq | \ 16 | fasten_shuffle | fasten_metrics | column -t 17 | totalLength numReads avgReadLength avgQual 18 | 800 8 100 19.53875 19 | 20 | ### read cleaning 21 | 22 | $ cat testdata/R1.fastq testdata/R2.fastq | \ 23 | fasten_shuffle | \ 24 | fasten_clean --paired-end --min-length 2 | \ 25 | gzip -c > cleaned.shuffled.fastq.gz 26 | 27 | $ zcat cleaned.shuffled.fastq.gz | fasten_metrics | column -t 28 | totalLength numReads avgReadLength avgQual 29 | 800 8 100 19.53875 30 | # No reads were actually filtered with cleaning, with --min-length=2 31 | 32 | ## Installation 33 | 34 | ### Installation from source 35 | 36 | Fasten is programmed in the Rust programming language. More information about Rust, including installation and the executable `cargo`, can be found at [rust-lang.org](https://www.rust-lang.org). 37 | 38 | After downloading, use the Rust executable `cargo` like so: 39 | 40 | cd fasten 41 | cargo build --release 42 | export PATH=$PATH:$(pwd)/target/release 43 | 44 | All executables will be in the directory `fasten/target/release`. 45 | 46 | _note_: there are some `Makefile` targets to help, including: 47 | 48 | * `make all` to make the following 49 | * `make release` install fast executables 50 | * `make debug` install executables quickly (although the executables will not be optimized) 51 | * `make fasten/doc` compile the latest documentation 52 | * `make clean` uninstall local binaries 53 | 54 | ### Installation without `git` 55 | 56 | You can also install Fasten straight from crates.io using the following command: 57 | 58 | cargo install fasten 59 | 60 | Detailed information on how this works can be found in the cargo handbook. 61 | 62 | ## General usage 63 | 64 | All scripts accept the following parameters, read uncompressed fastq format from stdin, and print uncompressed fastq format to stdout. All paired end fastq files must be in interleaved format, and they are written in [interleaved format](./docs/file-formats.md), except when deshuffling with `fasten_shuffle`. 65 | 66 | * `--help` 67 | * `--numcpus` Not all scripts will take advantage of numcpus. (not currently implemented) 68 | * `--paired-end` Input reads are interleaved paired end 69 | * `--verbose` Print more status messages 70 | 71 | ## Documentation 72 | 73 | Please see the inline documentation at https://lskatz.github.io/fasten/ 74 | 75 | This documentation was built with `cargo doc --no-deps` 76 | 77 | ### Other documentation 78 | 79 | * Some wrapper scripts are noted in the [scripts](./scripts.md) page. 80 | 81 | ### Contributing 82 | 83 | Instructions for how to contribute can be found in [CONTRIBUTING.md](CONTRIBUTING.md). 84 | 85 | ## Fasten script descriptions 86 | 87 | All executables read and write in the fastq format 88 | except `fasten_convert`. 
89 | 90 | |executable |Description| 91 | |-------------------|-----------| 92 | |[`fasten_clean`](https://lskatz.github.io/fasten/fasten_clean) | Trims and cleans a fastq file.| 93 | |[`fasten_convert`](https://lskatz.github.io/fasten/fasten_convert) | Converts between different sequence formats like fastq, sam, fasta.| 94 | |[`fasten_straighten`](https://lskatz.github.io/fasten/fasten_straighten)| Convert any fastq file to a standard four-line-per-entry format.| 95 | |[`fasten_metrics`](https://lskatz.github.io/fasten/fasten_metrics) | Prints basic read metrics.| 96 | |[`fasten_pe`](https://lskatz.github.io/fasten/fasten_pe) | Determines paired-endedness based on read IDs.| 97 | |[`fasten_randomize`](https://lskatz.github.io/fasten/fasten_randomize) | Randomizes reads from input.| 98 | |[`fasten_combine`](https://lskatz.github.io/fasten/fasten_combine) | Combines identical reads and updates quality scores.| 99 | |[`fasten_kmer`](https://lskatz.github.io/fasten/fasten_kmer) | Kmer counting.| 100 | |[`fasten_normalize`](https://lskatz.github.io/fasten/fasten_normalize) | Normalize read depth by using kmer counting.| 101 | |[`fasten_sample`](https://lskatz.github.io/fasten/fasten_sample) | Downsamples reads.| 102 | |[`fasten_shuffle`](https://lskatz.github.io/fasten/fasten_shuffle) | Shuffles or deshuffles paired end reads.| 103 | |[`fasten_validate`](https://lskatz.github.io/fasten/fasten_validate) | Validates your reads (deprecated in favor of `fasten_inspect` and `fasten_repair`).| 104 | |[`fasten_inspect`](https://lskatz.github.io/fasten/fasten_inspect) | Adds information to read IDs such as seqlength.| 105 | |[`fasten_repair`](https://lskatz.github.io/fasten/fasten_repair) | Repairs corrupted reads.| 106 | |[`fasten_quality_filter`](https://lskatz.github.io/fasten/fasten_quality_filter) | Transforms nucleotides to "N" if the quality is low.| 107 | |[`fasten_trim`](https://lskatz.github.io/fasten/fasten_trim) | Blunt-end trims reads.| 108 | |[`fasten_replace`](https://lskatz.github.io/fasten/fasten_replace) | Find and replace using regex.| 109 | |[`fasten_mutate`](https://lskatz.github.io/fasten/fasten_mutate) | Introduce random mutations.| 110 | |[`fasten_regex`](https://lskatz.github.io/fasten/fasten_regex) | Filter for reads using regex.| 111 | |[`fasten_progress`](https://lskatz.github.io/fasten/fasten_progress) | Add progress to any place in the pipeline.| 112 | |[`fasten_sort`](https://lskatz.github.io/fasten/fasten_sort) | Sort fastq entries.| 113 | 114 | ## Etymology 115 | 116 | Many of these scripts have inspiration from the fastx toolkit, and I wanted to make a `fasty`, but that was already the name of a bioinformatics program. 117 | Therefore I cycled through other letters of the alphabet and came across "N." So it is possible to pronounce this project like "Fast-N" or in a way 118 | that indicates that you are securing your analysis by "fasten"ing it (with a silent T). 119 | 120 | ## Citation 121 | 122 | [![DOI](https://joss.theoj.org/papers/10.21105/joss.06030/status.svg)](https://doi.org/10.21105/joss.06030) 123 | 124 | To cite, please refer to Katz et al., (2024). Fasten: a toolkit for streaming operations on fastq files. 
Journal of Open Source Software, 9(94), 6030, https://doi.org/10.21105/joss.06030 -------------------------------------------------------------------------------- /fastq/faster.md: -------------------------------------------------------------------------------- 1 | ![Rust](https://github.com/angelovangel/faster/workflows/Rust/badge.svg) 2 | # faster 3 | 4 | A (very) fast program for getting statistics and features from a fastq file, in a usable form, written in Rust. 5 | 6 | ## Description 7 | 8 | I wrote this program to get *fast* and *accurate* statistics about a fastq file, formatted as a tab-delimited table. In addition, it can do the following with a fastq file: 9 | 10 | - get the read lengths 11 | - get gc content per read 12 | - get geometric mean of phred scores per read 13 | - get NX values for all the reads, e.g. N50 14 | - filter reads based on length (both greater than and smaller than a desired length) 15 | - subsample reads (by proportion of all reads in the file) 16 | - trim front and trim tail - trim x number of bases from the beginning/end of each read 17 | - regex search for reads containing a pattern in their description field 18 | 19 | The motivation behind it: 20 | 21 | - many of the tools out there are just wrong when it comes to calculating 'mean' phred scores (yes, just taking the arithmetic mean phred score is wrong; a short sketch in the Usage section below shows why) 22 | - one simple executable doing one thing well, no dependencies 23 | - it is straightforward to parse the output in other programs and the output is easy to tweak as desired 24 | - reasonably fast 25 | - can be easily run in parallel 26 | 27 | ## Install 28 | 29 | Compiled binaries are provided for x86_64 Linux, macOS and Windows - download from the releases section and run. You will have to make the file executable (`chmod a+x faster`) and for macOS, allow running external apps in your security settings. If you need to run it on something else (your phone?!), you will have to compile it yourself (which is pretty easy though). Below is an example of how to set up a Rust toolchain and compile `faster`: 30 | 31 | ```bash 32 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh 33 | git clone https://github.com/angelovangel/faster.git 34 | 35 | cd faster 36 | cargo build --release 37 | 38 | # the binary is now under ./target/release/, run it like this: 39 | ./target/release/faster -t /path/to/fastq/file.fastq.gz 40 | 41 | ``` 42 | 43 | ## Usage and tweaking the output 44 | 45 | The program takes one fastq/fastq.gz file as an argument and, when used with the `--table` flag, outputs a tab-separated table with statistics to stdout. There are options to obtain the length, GC-content, and 'mean' phred scores per read, or to filter reads by length, see `--help` for details. 
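To make the mean-quality point from the Description concrete: one commonly recommended way (and the reasoning behind the "arithmetic mean is wrong" remark) is to convert each Phred score to an error probability, average the probabilities, and convert the mean back to a Phred score. A minimal Rust sketch, illustrative only and not `faster`'s actual implementation:

```rust
/// Mean Phred quality of a read, computed via error probabilities
/// (Phred Q = -10 * log10(p), so p = 10^(-Q/10)).
fn mean_phred(quals: &[u8]) -> f64 {
    let mean_err: f64 = quals
        .iter()
        .map(|&q| 10f64.powf(-(q as f64) / 10.0))
        .sum::<f64>()
        / quals.len() as f64;
    -10.0 * mean_err.log10()
}

fn main() {
    // Two bases at Phred 10 (10% error) and Phred 40 (0.01% error):
    // the arithmetic mean of the scores is 25, but the mean error
    // probability is ~5%, which corresponds to a Phred score of ~13.
    println!("{:.1}", mean_phred(&[10, 40])); // prints 13.0
}
```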
46 | 47 | ```bash 48 | # for help 49 | faster --help # or -h 50 | 51 | # get some N10, N50 and N90 values 52 | for i in 0.1 0.5 0.9; do faster --nx $i /path/to/fastq/file.fastq; done 53 | 54 | # get a table with statistics 55 | faster -t /path/to/fastq/file.fastq 56 | 57 | # for many files, with parallel 58 | parallel faster -t ::: /path/to/fastq/*.fastq.gz 59 | 60 | # again with parallel, but get rid of the table header 61 | parallel faster -ts ::: /path/to/fastq/*.fastq.gz 62 | ``` 63 | 64 | The statistics output is a tab-separated table with the following columns: 65 | `file reads bases n_bases min_len max_len mean_len Q1 Q2 Q3 N50 Q20_percent Q30_percent` 66 | 67 | ## Performance 68 | 69 | To get an idea how `faster` compares to other tools, I have benchmarked it with two other popular programs and 3 different datasets. **I am aware that these tools have different and often much richer functionality (especially seqkit, I use it all the time), so these comparisons are for orientation only**. 70 | The benchmarks were performed with [hyperfine](https://github.com/sharkdp/hyperfine) (`-r 15 --warmup 2`) on a MacBook Pro with an 8-core 2.3 GHz Quad-Core Intel Core i5 and 8 GB RAM. For Illumina reads, `faster` is slightly slower than `seqstats` (written in C using the `klib` [library by Heng Li](https://github.com/attractivechaos/klib) - the fastest thing possible out there), and for Nanopore it is even a bit faster than `seqstats`. `seqkit stats` performs worse of the three tools tested, but bear in mind the extraordinarily rich functionality it has. 71 | 72 | *** 73 | ### dataset A - a small Nanopore fastq file with 37k reads and 350M bases 74 | 75 | | Command | Mean [ms] | Min [ms] | Max [ms] | Relative | 76 | |:---|---:|---:|---:|---:| 77 | | `faster -t datasetA.fastq` | 398.1 ± 21.2 | 380.4 | 469.6 | 1.00 | 78 | | `seqstats datasetA.fastq` | 633.6 ± 54.1 | 593.3 | 773.6 | 1.59 ± 0.16 | 79 | | `seqkit stats -a datasetA.fastq` | 1864.5 ± 70.3 | 1828.7 | 2117.3 | 4.68 ± 0.31 | 80 | 81 | *** 82 | 83 | ### dataset B - a small Illumina fastq.gz file with ~100k reads 84 | 85 | | Command | Mean [ms] | Min [ms] | Max [ms] | Relative | 86 | |:---|---:|---:|---:|---:| 87 | | `faster -t datasetB.fastq.gz` | 181.7 ± 2.3 | 177.7 | 184.6 | 1.36 ± 0.09 | 88 | | `seqstats datasetB.fastq.gz` | 133.4 ± 8.4 | 125.7 | 154.2 | 1.00 | 89 | | `seqkit stats -a datasetB.fastq.gz` | 932.6 ± 37.0 | 873.8 | 1028.9 | 6.99 ± 0.52 | 90 | 91 | *** 92 | 93 | ### dataset C - a small Illumina iSeq run, 11.5M reads and 1.7G bases, using `gnu parallel` 94 | 95 | | Command | Mean [s] | Min [s] | Max [s] | Relative | 96 | |:---|---:|---:|---:|---:| 97 | | `parallel faster -t ::: *.fastq.gz` | 6.438 ± 0.384 | 6.009 | 7.062 | 1.43 ± 0.15 | 98 | | `parallel seqstats ::: *.fastq.gz` | 4.488 ± 0.394 | 4.120 | 5.312 | 1.00 | 99 | | `parallel seqkit stats -a ::: *.fastq.gz` | 40.156 ± 1.747 | 38.762 | 44.132 | 8.95 ± 0.88 | 100 | 101 | *** 102 | ## Reference 103 | 104 | `faster` uses the excellent Rust-Bio library: 105 | 106 | [Köster, J. (2016). Rust-Bio: a fast and safe bioinformatics library. Bioinformatics, 32(3), 444-446.](https://academic.oup.com/bioinformatics/article/32/3/444/1743419) -------------------------------------------------------------------------------- /fastq/fqgrep.md: -------------------------------------------------------------------------------- 1 | # fqgrep 2 | 3 |

4 | Badges from the original README header: Build Status · license · Version info · Install with bioconda 9 | Grep for FASTQ files. 10 | 
11 | 12 | Search a pair of fastq files for reads that match a given ref or alt sequence. 13 | 14 | ## Install 15 | 16 | ### From bioconda 17 | 18 | ```console 19 | conda install -c bioconda fqgrep 20 | ``` 21 | 22 | ### From Source 23 | 24 | ```console 25 | git clone ... && cd fqgrep 26 | cargo install --path . 27 | ``` 28 | 29 | ## Usage 30 | 31 | ```console 32 | fqgrep -r 'GACGAGATTA' -a 'GACGTGATTA' --r1-fastq /data/testR1.fastq.gz --r2-fastq /data/testR2.fastq.gz -o ./test_out -t 28 33 | ``` 34 | 35 | ## Help 36 | 37 | See the following for usage: 38 | 39 | ```console 40 | fqgrep -h 41 | ``` 42 | -------------------------------------------------------------------------------- /fastq/fqkit.md: -------------------------------------------------------------------------------- 1 | ![icon](https://github.com/sharkLoc/fqkit/blob/main/doc/fqkit_icon.PNG) 2 | 3 | 4 | 5 | # fqkit 6 | 7 | ![Static Badge](https://img.shields.io/badge/Author-sharkLoc-blue) 8 | ![Static Badge](https://img.shields.io/badge/Tool-fqkit-red) 9 | ![Crates.io (latest)](https://img.shields.io/crates/dv/fqkit?labelColor=rgb&color=hex&link=https%3A%2F%2Fcrates.io%2Fcrates%2Ffqkit) 10 | ![Crates.io](https://img.shields.io/crates/d/fqkit?label=Total%20download%20in%20crate.io) 11 | ![GitHub Gist last commit](https://img.shields.io/github/gist/last-commit/a4910923a230b8975218a188528463d7?logo=github) 12 | 13 | 🦀 a simple program for fastq file manipulation 14 | 15 | ## install 16 | 17 | ##### step 1: install cargo first 18 | 19 | ```bash 20 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh 21 | ``` 22 | 23 | ##### step 2: on linux or windows 24 | 25 | ```bash 26 | cargo install fqkit 27 | # or 28 | 29 | git clone https://github.com/sharkLoc/fqkit.git 30 | cd fqkit 31 | cargo b --release 32 | # mv target/release/fqkit to anywhere you want 33 | ``` 34 | 35 | ##### install latest version 36 | 37 | ```bash 38 | cargo install --git https://github.com/sharkLoc/fqkit.git 39 | ``` 40 | 41 | ## usage 42 | 43 | ```bash 44 | FqKit -- A simple and cross-platform program for fastq file manipulation 45 | 46 | Version: 0.4.8 47 | 48 | Authors: sharkLoc 49 | Source code: https://github.com/sharkLoc/fqkit.git 50 | 51 | Fqkit supports reading and writing gzip (.gz) format. 52 | Bzip2 (.bz2) format is supported since v0.3.8. 53 | Xz (.xz) format is supported since v0.3.9. 54 | Under the same compression level, xz has the highest compression ratio but consumes more time. 
55 | 56 | Compression level: 57 | format range default crate 58 | gzip 1-9 6 https://crates.io/crates/flate2 59 | bzip2 1-9 6 https://crates.io/crates/bzip2 60 | xz 1-9 6 https://crates.io/crates/xz2 61 | 62 | 63 | Usage: fqkit [OPTIONS] 64 | 65 | Commands: 66 | topn get first N records from fastq file [aliases: head] 67 | tail get last N records from fastq file 68 | concat concat fastq files from different lanes 69 | subfq subsample sequences from big fastq file [aliases: sample] 70 | select select pair-end reads by read id 71 | trim trim fastq reads by position 72 | adapter cut the adapter sequence on the reads 73 | filter a simple filter for paired-end fastq sequences 74 | range print fastq records in a range 75 | search search reads/motifs from fastq file 76 | grep grep fastq sequence by read id or full name 77 | stats summary for fastq format file [aliases: stat] 78 | shuffle shuffle fastq sequences 79 | size report the number of sequences and bases 80 | slide extract subsequences in sliding windows 81 | sort sort fastq file by name/seq/gc/length 82 | plot line plot for A T G C N percentage in read position 83 | fq2fa translate fastq to fasta 84 | fq2sam converts a fastq file to an unaligned SAM file 85 | fqscore converts the fastq file quality scores 86 | flatten flatten fastq sequences [aliases: flat] 87 | barcode perform demultiplexing for paired-end fastq reads [aliases: demux] 88 | check check the validity of a fastq record 89 | remove remove reads by read name 90 | rename rename sequence id in fastq file 91 | reverse get a reverse-complement of fastq file [aliases: rev] 92 | split split interleaved fastq file 93 | merge merge PE reads as interleaved fastq file 94 | mask convert any low quality base to 'N' or other chars 95 | split2 split fastq file by records number 96 | gcplot get GC content result and plot 97 | length get reads length count [aliases: len] 98 | view view fastq file page by page 99 | help Print this message or the help of the given subcommand(s) 100 | 101 | Global Arguments: 102 | --compress-level set gzip/bzip2/xz compression level 1 (compress faster) - 9 (compress better) for gzip/bzip2/xz output file, only works with option -o/--out [default: 6] 103 | --log if file name specified, write log message to this file, or write to stderr 104 | -v, --verbosity... control verbosity of logging, [-v: Error, -vv: Warn, -vvv: Info, -vvvv: Debug, -vvvvv: Trace, default: Debug] 105 | 106 | Global FLAGS: 107 | -q, --quiet be quiet and do not show any extra information 108 | -h, --help prints help information 109 | -V, --version prints version information 110 | 111 | Use "fqkit help [command]" for more information about a command 112 | ``` 113 | 114 | #### ** please report any bugs via issues **💖 115 | -------------------------------------------------------------------------------- /fastq/fqtk.md: -------------------------------------------------------------------------------- 1 | # fqtk 2 | 3 | 

4 | Badges from the original README header: Build Status · license · Version info · Install with bioconda 9 | 
10 | 11 | A toolkit for working with FASTQ files, written in Rust. 12 | 13 | Currently `fqtk` contains a single tool, `demux`, for demultiplexing FASTQ files based on sample barcodes. 14 | `fqtk demux` can be used to demultiplex one or more FASTQ files (e.g. a set of R1, R2 and I1 FASTQ files) with any number of sample barcodes at fixed locations within the reads. 15 | It is highly efficient and multi-threaded for high performance. 16 | 17 | Usage for `fqtk demux` follows: 18 | 19 | ```console 20 | Performs sample demultiplexing on FASTQs. 21 | 22 | The sample barcode for each sample in the metadata TSV will be compared against 23 | the sample barcode bases extracted from the FASTQs, to assign each read to a 24 | sample. Reads that do not match any sample within the given error tolerance 25 | will be placed in the ``unmatched_prefix`` file. 26 | 27 | FASTQs and associated read structures for each sub-read should be given: 28 | 29 | - a single fragment read (with inline index) should have one FASTQ and one read 30 | structure 31 | - paired end reads should have two FASTQs and two read structures 32 | - a dual-index sample with paired end reads should have four FASTQs and four read 33 | structures given: two for the two index reads, and two for the template reads. 34 | 35 | If multiple FASTQs are present for each sub-read, then the FASTQs for each 36 | sub-read should be concatenated together prior to running this tool (e.g. 37 | `zcat s_R1_L001.fq.gz s_R1_L002.fq.gz | bgzip -c > s_R1.fq.gz`). 38 | 39 | Read structures are made up of `<number><operator>` pairs much like the `CIGAR` 40 | string in BAM files. Four kinds of operators are recognized: 41 | 42 | 1. `T` identifies a template read 43 | 2. `B` identifies a sample barcode read 44 | 3. `M` identifies a unique molecular index read 45 | 4. `S` identifies a set of bases that should be skipped or ignored 46 | 47 | The last `<number><operator>` pair may be specified using a `+` sign instead of a 48 | number to denote "all remaining bases". This is useful if, e.g., fastqs have 49 | been trimmed and contain reads of varying length. Both reads must have template 50 | bases. Any molecular identifiers will be concatenated using the `-` delimiter 51 | and placed in the given SAM record tag (`RX` by default). Similarly, the sample 52 | barcode bases from the given read will be placed in the `BC` tag. 53 | 54 | Metadata about the samples should be given as a headered metadata TSV file with 55 | two columns: 1. `sample_id` - the id of the sample or library. 2. `barcode` - the 56 | expected barcode sequence associated with the `sample_id`. 57 | 58 | The read structures will be used to extract the observed sample barcode, template 59 | bases, and molecular identifiers from each read. The observed sample barcode 60 | will be matched to the sample barcodes extracted from the bases in the sample 61 | metadata and associated read structures. 62 | 63 | An observed barcode matches an expected barcode if all the following are true: 64 | 65 | 1. The number of mismatches (edits/substitutions) is less than or equal to the 66 | maximum mismatches (see --max-mismatches). 67 | 2. The difference between the number of mismatches in the best and second best 68 | barcodes is greater than or equal to the minimum mismatch delta 69 | (`--min-mismatch-delta`). The expected barcode sequence may contain Ns, 70 | which are not counted as mismatches regardless of the observed base (e.g. 71 | the expected barcode `AAN` will have zero mismatches relative to both the 72 | observed barcodes `AAA` and `AAN`). 
73 | 74 | ## Outputs 75 | 76 | All outputs are generated in the provided `--output` directory. For each sample 77 | plus the unmatched reads, FASTQ files are written for each read segment 78 | (specified in the read structures) of one of the types supplied to 79 | `--output-types`. 80 | 81 | FASTQ files have names of the format: 82 | 83 | {sample_id}.{segment_type}{read_num}.fq.gz 84 | 85 | where `segment_type` is one of `R`, `I`, and `U` (for template, barcode/index 86 | and molecular barcode/UMI reads respectively) and `read_num` is a number starting 87 | at 1 for each segment type. 88 | 89 | In addition a `demux-metrics.txt` file is written that is a tab-delimited file 90 | with counts of how many reads were assigned to each sample and derived metrics. 91 | 92 | ## Example Command Line 93 | 94 | As an example, if the sequencing run was 2x100bp (paired end) with two 8bp index 95 | reads both reading a sample barcode, as well as an in-line 8bp sample barcode in 96 | read one, the command line would be: 97 | 98 | fqtk demux \ 99 | --inputs r1.fq.gz i1.fq.gz i2.fq.gz r2.fq.gz \ 100 | --read-structures 8B92T 8B 8B 100T \ 101 | --sample-metadata metadata.tsv \ 102 | --output output_folder 103 | 104 | Usage: fqtk demux [OPTIONS] --inputs <INPUTS>... --read-structures <READ_STRUCTURES>... --sample-metadata <SAMPLE_METADATA> --output <OUTPUT> 105 | 106 | Options: 107 | -i, --inputs <INPUTS>... 108 | One or more input fastq files each corresponding to a sequencing read (e.g. R1, I1) 109 | 110 | -r, --read-structures <READ_STRUCTURES>... 111 | The read structures, one per input FASTQ in the same order 112 | 113 | -b, --output-types <OUTPUT_TYPES>... 114 | The read structure types to write to their own files (Must be one of T, B, 115 | or M for template reads, sample barcode reads, and molecular barcode reads) 116 | 117 | Multiple output types may be specified as a space-delimited list. 118 | 119 | [default: T] 120 | 121 | -s, --sample-metadata <SAMPLE_METADATA> 122 | A file containing the metadata about the samples 123 | 124 | -o, --output <OUTPUT> 125 | The output directory into which to write per-sample FASTQs 126 | 127 | -u, --unmatched-prefix <UNMATCHED_PREFIX> 128 | Output prefix for FASTQ file(s) for reads that cannot be matched to a sample 129 | 130 | [default: unmatched] 131 | 132 | --max-mismatches <MAX_MISMATCHES> 133 | Maximum mismatches for a barcode to be considered a match 134 | 135 | [default: 1] 136 | 137 | -d, --min-mismatch-delta <MIN_MISMATCH_DELTA> 138 | Minimum difference between number of mismatches in the best and second best barcodes 139 | for a barcode to be considered a match 140 | 141 | [default: 2] 142 | 143 | -t, --threads <THREADS> 144 | The number of threads to use. Cannot be less than 3 145 | 146 | [default: 8] 147 | 148 | -c, --compression-level <COMPRESSION_LEVEL> 149 | The level of compression to use to compress outputs 150 | 151 | [default: 5] 152 | 153 | -S, --skip-reasons <SKIP_REASONS> 154 | Skip demultiplexing reads for any of the following reasons, otherwise panic. 155 | 156 | 1. `too-few-bases`: there are too few bases or qualities to extract given the 157 | read structures. For example, if a read is 8bp long but the read structure 158 | is `10B`, or if a read is empty and the read structure is `+T`. 159 | 160 | -h, --help 161 | Print help information (use `-h` for a summary) 162 | 163 | -V, --version 164 | Print version information 165 | ``` 166 | 167 | ## Installing 168 | 169 | ### Installing with `conda` 170 | 171 | To install with conda you must first [install conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html#installation). 
172 | Then, in your command line (and with the environment you wish to install fqtk into active) run: 173 | 174 | ```console 175 | conda install -c bioconda fqtk 176 | ``` 177 | 178 | ### Installing with `cargo` 179 | 180 | To install with cargo you must first [install rust](https://doc.rust-lang.org/cargo/getting-started/installation.html). 181 | On Mac OS and Linux, this can be done with the command: 182 | 183 | ```console 184 | curl https://sh.rustup.rs -sSf | sh 185 | ``` 186 | 187 | Then, to install `fqtk` run: 188 | 189 | ```console 190 | cargo install fqtk 191 | ``` 192 | 193 | ### Building From Source 194 | 195 | First, clone the git repo: 196 | 197 | ```console 198 | git clone https://github.com/fulcrumgenomics/fqtk.git 199 | ``` 200 | 201 | Secondly, if you do not already have rust development tools installed, install via [rustup](https://rustup.rs/): 202 | 203 | ```console 204 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh 205 | ``` 206 | 207 | Then build the toolkit in release mode: 208 | 209 | ```console 210 | cd fqtk 211 | cargo build --release 212 | ./target/release/fqtk --help 213 | ``` 214 | 215 | ## Developing 216 | 217 | fqtk is developed in Rust and follows the conventions of using `rustfmt` and `clippy` to ensure both code quality and standardized formatting. 218 | When working on fqtk, before pushing any commits, please first run `./ci/check.sh` and resolve any issues that are reported. 219 | 220 | ## Releasing a New Version 221 | 222 | ### Pre-requisites 223 | 224 | Install [`cargo-release`][cargo-release-link] 225 | 226 | ```console 227 | cargo install cargo-release 228 | ``` 229 | 230 | ### Prior to Any Release 231 | 232 | Create a release that will not try to push to `crates.io` and verify the command: 233 | 234 | ```console 235 | cargo release [major,minor,patch,release,rc...] --no-publish 236 | ``` 237 | 238 | Note: "dry-run" is the default for cargo release. 239 | 240 | See the [`cargo-release` reference documentation][cargo-release-docs-link] for more information 241 | 242 | ### Semantic Versioning 243 | 244 | This tool follows [Semantic Versioning](https://semver.org/). In brief: 245 | 246 | * MAJOR version when you make incompatible API changes, 247 | * MINOR version when you add functionality in a backwards compatible manner, and 248 | * PATCH version when you make backwards compatible bug fixes. 249 | 250 | ### Major Release 251 | 252 | To create a major release: 253 | 254 | ```console 255 | cargo release major --execute 256 | ``` 257 | 258 | This will remove any pre-release extension, create a new tag and push it to github, and push the release to crates.io. 259 | 260 | Upon success, move the version to the [next candidate release](#release-candidate). 261 | 262 | Finally, make sure to [create a new release][new-release-link] on GitHub. 263 | 264 | ### Minor and Patch Release 265 | 266 | To create a _minor_ (_patch_) release, follow the [Major Release](#major-release) instructions substituting `major` with `minor` (`patch`): 267 | 268 | ```console 269 | cargo release minor --execute 270 | ``` 271 | 272 | ### Release Candidate 273 | 274 | To move to the next release candidate: 275 | 276 | ```console 277 | cargo release rc --no-tag --no-publish --execute 278 | ``` 279 | 280 | This will create or bump the pre-release version and push the changes to the main branch on github. 281 | This will not tag and publish the release candidate. 282 | If you would like to tag the release candidate on github, remove `--no-tag` to create a new tag and push it to github. 
283 |
284 | [cargo-release-link]: https://github.com/crate-ci/cargo-release
285 | [cargo-release-docs-link]: https://github.com/crate-ci/cargo-release/blob/master/docs/reference.md
286 |
287 | [new-release-link]: https://github.com/fulcrumgenomics/fqtk/releases/new
288 |
--------------------------------------------------------------------------------
/longreads/NextPolish2.md:
--------------------------------------------------------------------------------
1 | [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/nextpolish2/README.html)
2 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/nextpolish2/badges/version.svg)](https://anaconda.org/bioconda/nextpolish2)
3 | # NextPolish2
4 |
5 | Telomere-to-telomere (T2T) genomes have been emerging as a new hotspot in the field of genomics. Typically, we obtain a T2T genome from datasets that include both high-accuracy PacBio HiFi long reads and Oxford Nanopore Technologies (ONT) ultra-long reads. Although genomes assembled from HiFi long reads have considerably higher quality, they still contain a handful of assembly errors in regions where HiFi long reads stumble as well, such as homopolymer or low-complexity microsatellite regions. Additionally, a typical gap-filling step is accomplished using ONT ultra-long reads, which contain a certain amount of errors. Hence, current T2T genome assemblies still require further improvement in terms of consensus accuracy. NextPolish2 can be used to fix these errors (SNVs/indels) in a high-quality assembly. Through its built-in phasing module, it corrects only the erroneous bases while maintaining the original haplotype consistency. Therefore, even in regions with complex repeat elements, NextPolish2 will not produce overcorrections. In fact, in some cases it can reduce switch errors in heterozygous regions. NextPolish2 is not an upgraded version of NextPolish, but an additional supplement for the pursuit of extremely high-quality genome assemblies.
6 |
7 | ## Table of Contents
8 |
9 | - [Installation](#install)
10 | - [General usage](#usage)
11 | - [Getting help](#help)
12 | - [Citation](#cite)
13 | - [License](#license)
14 | - [Limitations](#limit)
15 | - [Benchmarking](#benchmark)
16 | - [FAQ](./doc/faq.md)
17 |
18 | ### Installation
19 |
20 | #### Installing from bioconda
21 | ```sh
22 | conda install nextpolish2
23 | ```
24 | #### Installing from source
25 | ##### Dependencies
26 |
27 | `NextPolish2` is written in Rust. Try the commands below (no root required) or refer [here](https://www.rust-lang.org/tools/install) to install `Rust` first.
28 | ```sh
29 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
30 | ```
31 |
32 | ##### Download and install
33 |
34 | ```sh
35 | git clone --recursive git@github.com:Nextomics/NextPolish2.git
36 | cd NextPolish2 && cargo build --release
37 | ```
38 |
39 | ##### Test
40 |
41 | ```sh
42 | cd test && bash hh.sh
43 | ```
44 |
45 | ### General usage
46 |
47 | NextPolish2 takes a genome assembly file, a HiFi mapping file and one or more k-mer dataset files from short reads as input, and generates the polished genome.
48 |
49 | 1. Prepare the HiFi mapping file ([winnowmap](https://github.com/marbl/Winnowmap) or [minimap2](https://github.com/lh3/minimap2/)).
50 |
51 | ```sh
52 | meryl count k=15 output merylDB asm.fa.gz
53 | meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt
54 | winnowmap -t 5 -W repetitive_k15.txt -ax map-pb asm.fa.gz hifi.fasta.gz|samtools sort -o hifi.map.sort.bam -
55 |
56 | # or mapping using minimap2
57 | # minimap2 -ax map-hifi -t 5 asm.fa.gz hifi.fasta.gz|samtools sort -o hifi.map.sort.bam -
58 |
59 | # indexing
60 | samtools index hifi.map.sort.bam
61 | ```
62 |
63 | 2. Prepare k-mer dataset files ([yak](https://github.com/lh3/yak)). Here we only produce 21-mer and 31-mer datasets; you can produce more k-mer datasets with different k-mer sizes.
64 |
65 | ```sh
66 | # produce a 21-mer dataset, remove -b 37 if you want to count singletons
67 | ./yak/yak count -o k21.yak -k 21 -b 37 <(zcat sr.R*.fastq.gz) <(zcat sr.R*.fastq.gz)
68 |
69 | # produce a 31-mer dataset, remove -b 37 if you want to count singletons
70 | ./yak/yak count -o k31.yak -k 31 -b 37 <(zcat sr.R*.fastq.gz) <(zcat sr.R*.fastq.gz)
71 | ```
72 |
73 | 3. Run NextPolish2.
74 |
75 | ```sh
76 | ./target/release/nextPolish2 -t 5 hifi.map.sort.bam asm.fa.gz k21.yak k31.yak > asm.np2.fa
77 |
78 | # or try with -r
79 | # ./target/release/nextPolish2 -r -t 5 hifi.map.sort.bam asm.fa.gz k21.yak k31.yak > asm.np2.fa
80 | ```
81 |
82 | ***Optional:*** If your genome is assembled via **trio binning**, you can discard reads that have a different haplotype from the reference before the mapping procedure; see [here](./doc/benchmark3.md) for an example.
83 |
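For convenience, steps 1-3 can be strung together into a single session. The following is a condensed sketch of exactly the commands shown above, using the minimap2 route for mapping:

```sh
# 1. map HiFi reads and index the sorted BAM
minimap2 -ax map-hifi -t 5 asm.fa.gz hifi.fasta.gz | samtools sort -o hifi.map.sort.bam -
samtools index hifi.map.sort.bam

# 2. build the short-read k-mer datasets
./yak/yak count -o k21.yak -k 21 -b 37 <(zcat sr.R*.fastq.gz) <(zcat sr.R*.fastq.gz)
./yak/yak count -o k31.yak -k 31 -b 37 <(zcat sr.R*.fastq.gz) <(zcat sr.R*.fastq.gz)

# 3. polish
./target/release/nextPolish2 -t 5 hifi.map.sort.bam asm.fa.gz k21.yak k31.yak > asm.np2.fa
```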
84 | #### More options
85 |
86 | Use `./target/release/nextPolish2 -h` to see options.
87 |
88 | ### Getting help
89 |
90 | #### Help
91 |
92 | Feel free to raise an issue at the [issue page](https://github.com/Nextomics/NextPolish2/issues/new).
93 |
94 | ***Note:*** Please ask questions on the issue page first. They are also helpful to other users.
95 | #### Contact
96 |
97 | For additional help, please send an email to huj\_at\_grandomics\_dot\_com.
98 |
99 | ### Citation
100 |
101 | Jiang Hu, Zhuo Wang, Fan Liang, Shan-Lin Liu, Kai Ye, De-Peng Wang, NextPolish2: A Repeat-aware Polishing Tool for Genomes Assembled Using HiFi Long Reads, Genomics, Proteomics & Bioinformatics, 2024, qzad009, https://doi.org/10.1093/gpbjnl/qzad009
102 |
103 | ### License
104 |
105 | NextPolish2 is freely available only for academic and other non-commercial use.
106 |
107 | ### Limitations
108 |
109 | 1. NextPolish2 can only correct regions that are mapped by HiFi reads. For regions without HiFi read mappings (usually caused by a high error rate), you can try to adjust the mapping parameters.
110 | 2. The performance of NextPolish2 relies heavily on the quality of the short reads.
111 | 3. NextPolish2 can only fix some structural misassemblies.
112 |
113 | ### Benchmarking
114 |
115 | | Source | Software | QV | Switch error rate (‱) |
116 | | :----: | :------- | :---: | :-------------------: |
117 | | [*A. thaliana*](./doc/benchmark1.md) | Hifiasm (primary) | 47.67 | 1.99 |
118 | | ^(simulated data, primary contigs)^ | NextPolish2 | **65.42** | **0.35** |
119 | | [*A. thaliana*](./doc/benchmark2.md) | Hifiasm (primary) | 58.03 | |
120 | | ^(Col-XJTU, primary contigs)^ | NextPolish2 | **64.26** | |
121 | | [*H. sapiens*](./doc/benchmark3.md) | Hifiasm (primary) | 60.25 | 0.15 |
122 | | ^(HG002, primary contigs)^ | NextPolish2 | **62.87** | **0.14** |
123 | | [*H. sapiens*](./doc/benchmark3.md) | Hifiasm (trio) | 59.77 | 0.21 |
124 | | ^(HG002, paternal contigs)^ | NextPolish2 | **63.49** | **0.20** |
125 | | [*H. sapiens*](./doc/benchmark3.md) | Hifiasm (trio) | 59.78 | 0.33 |
126 | | ^(HG002, maternal contigs)^ | NextPolish2 | **63.29** | **0.30** |
127 |
128 | ### Star
129 | You can track updates by tapping the **Star** button in the upper-right corner of the [GitHub page](https://github.com/Nextomics/NextPolish2).
--------------------------------------------------------------------------------
/longreads/chopper.md:
--------------------------------------------------------------------------------
1 | # chopper
2 |
3 | Rust implementation of [NanoFilt](https://github.com/wdecoster/nanofilt)+[NanoLyse](https://github.com/wdecoster/nanolyse), both originally written in Python. This tool, intended for long-read sequencing such as PacBio or ONT, filters and trims a fastq file.
4 | Filtering is done on average read quality and minimal or maximal read length; a headcrop (start of read) and tailcrop (end of read) can also be applied while printing the reads that pass the filter.
5 |
6 | Compared to the Python implementations, the aim is to deliver the same results and almost the same functionality at much faster execution times. At the moment this tool does not support filtering using a sequencing_summary file. If those features are of interest then please reach out.
7 |
8 | ## Installation
9 |
10 | Preferably, for most users, download a ready-to-use binary for your system from the [releases](https://github.com/wdecoster/chopper/releases) and add it to a directory on your $PATH.
11 | You may have to change the file permissions to execute it with `chmod +x chopper`.
12 |
13 | Alternatively, use conda to install:
14 | `conda install -c bioconda chopper`
15 |
16 | ## Usage
17 |
18 | chopper reads from stdin and writes to stdout.
19 |
20 | ```text
21 | FLAGS:
22 |     -h, --help       Prints help information
23 |     -V, --version    Prints version information
24 |
25 | OPTIONS:
26 |     --headcrop       Trim N nucleotides from the start of a read [default: 0]
27 |     --maxlength      Sets a maximum read length [default: 2147483647]
28 |     -l, --minlength  Sets a minimum read length [default: 1]
29 |     -q, --quality    Sets a minimum Phred average quality score [default: 0]
30 |     --tailcrop       Trim N nucleotides from the end of a read [default: 0]
31 |     --threads        Number of parallel threads to use [default: 4]
32 |     --contam         Fasta file with reference to check potential contaminants against [default None]
33 |     -i, --input      Input filename [default: read from stdin]
34 |     --maxgc          Sets a maximum GC content [default: 1.0]
35 |     --mingc          Sets a minimum GC content [default: 0.0]
36 | ```
37 |
38 | EXAMPLES:
39 |
40 | ```bash
41 | gunzip -c reads.fastq.gz | chopper -q 10 -l 500 | gzip > filtered_reads.fastq.gz
42 | chopper -q 10 -l 500 -i reads.fastq > filtered_reads.fastq
43 | chopper -q 10 -l 500 -i reads.fastq.gz | gzip > filtered_reads.fastq.gz
44 | ```
45 |
46 | Note that the tool may be substantially slower in the third example above; piping while decompressing is recommended (as in the first example).
47 |
48 | ## CITATION
49 |
50 | If you use this tool, please consider citing our [publication](https://academic.oup.com/bioinformatics/article/39/5/btad311/7160911).
51 | -------------------------------------------------------------------------------- /longreads/longshot.md: -------------------------------------------------------------------------------- 1 | # longshot 2 | 3 | Longshot is a variant calling tool for diploid genomes using long error prone reads such as Pacific Biosciences (PacBio) SMRT and Oxford Nanopore Technologies (ONT). It takes as input an aligned BAM/CRAM file and outputs a phased VCF file with variants and haplotype information. It can also genotype and phase input VCF files. It can output haplotype-separated BAM files that can be used for downstream analysis. Currently, it only calls single nucleotide variants (SNVs), but it can genotype indels if they are given in an input VCF. 4 | 5 | ## citation 6 | If you use Longshot, please cite the publication: 7 | 8 | [Edge, P. and Bansal, V., 2019. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nature communications, 10(1), pp.1-10.](https://www.nature.com/articles/s41467-019-12493-y) 9 | 10 | ## supported operating systems 11 | Longshot has been tested using Ubuntu 16.04 and 18.04, CentOS 6.6, Manjaro Linux 17.1.11, and Mac OS 10.14.2 Mojave. 12 | It should work on any linux-based system that has Rust and Cargo installed. 13 | 14 | ## dependencies 15 | 16 | * rust >= 1.40.0 17 | * zlib >= 1.2.11 18 | * xz >= 5.2.3 19 | * clangdev >= 7.0.1 20 | * gcc >= 7.3.0 21 | * libc-dev 22 | * make 23 | * various rust dependencies (automatically managed by cargo) 24 | 25 | (older versions may work but have not been tested) 26 | ## installation 27 | 28 | ### installation using Bioconda 29 | 30 | It is recommended to install Longshot using [Bioconda](https://bioconda.github.io/): 31 | ``` 32 | conda install longshot 33 | ``` 34 | This method supports Linux and Mac. 35 | If you do not have Bioconda, you can install it with these steps: 36 | First, install Miniconda (or Anaconda). Miniconda can be installed using the 37 | scripts [here](https://docs.conda.io/en/latest/miniconda.html). 38 | 39 | The Bioconda channel can then be added using these commands: 40 | ``` 41 | conda config --add channels defaults 42 | conda config --add channels bioconda 43 | conda config --add channels conda-forge 44 | ``` 45 | ### manual installation using apt for dependencies (Ubuntu 18.04) 46 | If you are using Ubuntu 18.04, you can install the dependencies using apt. Then, the Rust cargo package manager is used to compile Longshot. 47 | ``` 48 | sudo apt-get install cargo zlib1g-dev xz-utils \ 49 | libclang-dev clang cmake build-essential curl git # install dependencies 50 | git clone https://github.com/pjedge/longshot # clone the Longshot repository 51 | cd longshot # change directory 52 | cargo install --path . # install Longshot 53 | export PATH=$PATH:/home/$USER/.cargo/bin # add cargo binaries to path 54 | ``` 55 | Installation should take around 4 minutes on a typical desktop machine and will use between 400 MB (counting cargo) and 1.2 GB (counting all dependencies) of disk space. 56 | It is recommended to add the line ```export PATH=$PATH:/home/$USER/.cargo/bin``` to the end of your ```~/.bashrc``` file so that the longshot binary is in the PATH for future shell sessions. 
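For example, a one-liner that appends that line (assuming the default cargo install location used above):
```
echo 'export PATH=$PATH:/home/$USER/.cargo/bin' >> ~/.bashrc
```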
57 |
58 | ## usage:
59 | After installation, execute the longshot binary as follows:
60 | ```
61 | $ longshot [FLAGS] [OPTIONS] --bam <BAM> --ref <FASTA> --out <VCF>
62 | ```
63 |
64 | ## execution on an example dataset
65 | The directory ```example_data``` contains a simulated toy dataset that can be used to test out Longshot:
66 | - Reference genome containing 3 contigs each with length 200 kb (```example_data/genome.fa```)
67 | - 30x coverage simulated PacBio reads generated using [SimLoRD](https://bitbucket.org/genomeinformatics/simlord/) (```example_data/pacbio_reads_30x.bam```)
68 | - The 714 "true" variants for validation (```example_data/ground_truth_variants.vcf```)
69 |
70 | Run Longshot on the example data as follows:
71 | ```
72 | longshot --bam example_data/pacbio_reads_30x.bam --ref example_data/genome.fa --out example_data/longshot_output.vcf
73 | ```
74 |
75 | Execution should take around 5 to 10 seconds on a typical desktop machine. The output can be compared to ```ground_truth_variants.vcf``` for accuracy.
76 |
77 | ## command line options
78 | ```
79 | $ longshot --help
80 |
81 | Longshot: variant caller (SNVs) for long-read sequencing data
82 |
83 | USAGE:
84 |     longshot [FLAGS] [OPTIONS] --bam <BAM> --ref <FASTA> --out <VCF>
85 |
86 | FLAGS:
87 |     -A, --auto_max_cov         Automatically calculate mean coverage for region and set max coverage to mean_coverage +
88 |                                5*sqrt(mean_coverage). (SLOWER)
89 |     -S, --stable_alignment     Use numerically-stable (logspace) pair HMM forward algorithm. Is significantly slower but
90 |                                may be more accurate. Tests have shown this not to be necessary for highly error prone
91 |                                reads (PacBio CLR).
92 |     -F, --force_overwrite      If output files (VCF or variant debug directory) exist, delete and overwrite them.
93 |     -x, --max_alignment        Use max scoring alignment algorithm rather than pair HMM forward algorithm.
94 |     -n, --no_haps              Don't call HapCUT2 to phase variants.
95 |         --output-ref           print reference genotypes (non-variant), use this option only in combination with -v
96 |                                option.
97 |     -h, --help                 Prints help information
98 |     -V, --version              Prints version information
99 |
100 | OPTIONS:
101 |     -b, --bam <BAM>                    sorted, indexed BAM file with error-prone reads (CRAM files also supported)
102 |     -f, --ref <FASTA>                  indexed FASTA reference that BAM file is aligned to
103 |     -o, --out <VCF>                    output VCF file with called variants.
104 |     -r, --region <REGION>              Region in format <chrom> or <chrom:start-stop> in which to call variants
105 |                                        (1-based, inclusive).
106 |     -v, --potential_variants <VCF>     Genotype and phase the variants in this VCF instead of using pileup
107 |                                        method to find variants. NOTES: VCF must be gzipped and tabix indexed or
108 |                                        contain contig information. Use with caution because excessive false
109 |                                        potential variants can lead to inaccurate results. Every variant is used
110 |                                        and only the allele fields are considered -- Genotypes, filters,
111 |                                        qualities etc are ignored. Indel variants will be genotyped but not
112 |                                        phased. Structural variants (length > 50 bp) are currently not supported.
113 |     -O, --out_bam <BAM>                Write new bam file with haplotype tags (HP:i:1 and HP:i:2) for reads
114 |                                        assigned to each haplotype, any existing HP and PS tags are removed
115 |     -c, --min_cov <int>                Minimum coverage (of reads passing filters) to consider position as a
116 |                                        potential SNV. [default: 6]
117 |     -C, --max_cov <int>                Maximum coverage (of reads passing filters) to consider position as a
118 |                                        potential SNV. [default: 8000]
119 |     -q, --min_mapq <int>               Minimum mapping quality to use a read. [default: 20]
120 |     -a, --min_allele_qual <float>      Minimum estimated quality (Phred-scaled) of allele observation on read to
121 |                                        use for genotyping/haplotyping. [default: 7.0]
122 |     -y, --hap_assignment_qual <float>  Minimum quality (Phred-scaled) of read->haplotype assignment (for read
123 |                                        separation). [default: 20.0]
124 |     -Q, --potential_snv_cutoff <float> Consider a site as a potential SNV if the original PHRED-scaled QUAL
125 |                                        score for 0/0 genotype is below this amount (a larger value considers
126 |                                        more potential SNV sites). [default: 20.0]
127 |     -e, --min_alt_count <int>          Require a potential SNV to have at least this many alternate allele
128 |                                        observations. [default: 3]
129 |     -E, --min_alt_frac <float>         Require a potential SNV to have at least this fraction of alternate
130 |                                        allele observations. [default: 0.125]
131 |     -L, --hap_converge_delta <float>   Terminate the haplotype/genotype iteration when the relative change in
132 |                                        log-likelihood falls below this amount. Setting a larger value results in
133 |                                        faster termination but potentially less accurate results. [default:
134 |                                        0.0001]
135 |     -l, --anchor_length <int>          Length of indel-free anchor sequence on the left and right side of read
136 |                                        realignment window. [default: 6]
137 |     -m, --max_snvs <int>               Cut off variant clusters after this many variants. 2^m haplotypes must be
138 |                                        aligned against per read for a variant cluster of size m. [default: 3]
139 |     -W, --max_window <int>             Maximum "padding" bases on either side of variant realignment window
140 |                                        [default: 50]
141 |     -I, --max_cigar_indel <int>        Throw away a read-variant during allelotyping if there is a CIGAR indel
142 |                                        (I/D/N) longer than this amount in its window. [default: 20]
143 |     -B, --band_width <int>             Minimum width of alignment band. Band will increase in size if sequences
144 |                                        are different lengths. [default: 20]
145 |     -D, --density_params <string>      Parameters to flag a variant as part of a "dense cluster". Format
146 |                                        <n>:<l>:<gq>. If there are at least n variants within l base pairs with
147 |                                        genotype quality >=gq, then these variants are flagged as "dn" [default:
148 |                                        10:500:50]
149 |     -s, --sample_id <string>           Specify a sample ID to write to the output VCF [default: SAMPLE]
150 |         --hom_snv_rate <float>         Specify the homozygous SNV rate for genotype prior estimation [default:
151 |                                        0.0005]
152 |         --het_snv_rate <float>         Specify the heterozygous SNV rate for genotype prior estimation [default:
153 |                                        0.001]
154 |         --ts_tv_ratio <float>          Specify the transition/transversion rate for genotype prior estimation
155 |                                        [default: 0.5]
156 |     -P, --strand_bias_pvalue_cutoff <float>  Remove a variant if the allele observations are biased toward one strand
157 |                                        (forward or reverse) according to Fisher's exact test. Use this cutoff
158 |                                        for the two-tailed P-value.
[default: 0.01] 159 | -d, --variant_debug_dir write out current information about variants at each step of algorithm to 160 | files in this directory 161 | ``` 162 | 163 | ## usage examples 164 | Call variants with default parameters: 165 | ``` 166 | longshot --bam pacbio.bam --ref ref.fa --out output.vcf 167 | ``` 168 | Call variants for chromosome 1 only using the automatic max coverage cutoff: 169 | ``` 170 | longshot -A -r chr1 --bam pacbio.bam --ref ref.fa --out output.vcf 171 | ``` 172 | Call variants in a 500 kb region and then output the reads into ```reads.bam``` using a haplotype assignment threshold of 30: 173 | ``` 174 | longshot -r chr1:1000000-1500000 -y 30 -O reads.bam --bam pacbio.bam --ref ref.fa --out output.vcf 175 | ``` 176 | If a read has an assigned haplotype, it will get a tag `HP:i:1` or `HP:i:2` and tag `PS:i:x` where `x` is a phase set number of the variants it covers. 177 | 178 | ## important considerations 179 | - It is highly recommended to use reads with at least 30x coverage. 180 | - It is recommended to process chromosomes separately using the ```--region``` option. 181 | - Longshot has only been tested using data from humans. Results may vary with organisms with significantly higher or lower SNV rate. 182 | - It is important to set a reasonable max read coverage cutoff (```-C``` option) to filter out sites coinciding with genomic features such as CNVs which can be problematic for variant calling. If the ```-A``` option is used, Longshot will estimate the mean read coverage and set the max coverage to ```mean_cov+5*sqrt(mean_cov)```, which we have found to be a reasonable filter in practice for humans. 183 | - CNVs and mapping issues can result in dense clusters of false positive SNVs. Longshot will attempt to find clusters like this and mark them as "dn" in the FILTER field. The ```--density_params``` option is used to control which variants are flagged as "dn". The default parameters have been found to be effective for human sequencing data, but this option may need to be tweaked for other organisms with SNV rates significantly different from human. 184 | - Oxford Nanopore Technology (ONT) SMS reads are now officially supported. It is recommended to use the default ```--strand_bias_pvalue_cutoff``` of 0.01 for ONT reads, since this option filters out false SNV sites prior to variant calling. 185 | 186 | ## installation troubleshooting 187 | 188 | ### older version of Rust 189 | Check that the Rust version is 1.30.0 or higher: 190 | ``` 191 | rustc --version 192 | ``` 193 | If not, update Rust using this command: 194 | ``` 195 | rustup update 196 | ``` 197 | 198 | 199 | ### linker errors 200 | For example: 201 | ``` 202 | error: linking with `cc` failed: exit code: 1 203 | ... 204 | ... 205 | ... 
206 | = note: Non-UTF-8 output: /usr/bin/ld: /home/pedge/temp/longshot/target/release/build/longshot-347f3774e75b380c/out/libhapcut2.a(common.o)(.text.fprintf_time+0x81): unresolvable H\x89\\$\xe8H\x89l$\xf0H\x89\xf3L\x89d$\xf8H\x83\xec\x18H\x8bG\x10H\x89\xfdI\x89\xd4H\x89\xd6H\x8b;\xffPxH\x8bE\x10I\x8dt$\x08H\x8b{\x08\xffPxH\x8bE\x10H\x8b{\x10I\x8dt$\x10H\x8b\x1c$H\x8bl$\x08L\x8bd$\x10H\x8b@xH\x83\xc4\x18\xff\xe0f\x90H\x89\\$\xe8H\x89l$\xf0H\x89\xfbL\x89d$\xf8H\x83\xec\x18H\x8bG\x10I\x89\xd4H\x89\xf5H\x89\xf7\xffPhI\x89\x04$H\x8bC\x10H\x8d}\x08\xffPhH\x8b\x1c$I\x89D$\x08H\x8bl$\x08L\x8bd$\x10H\x83\xc4\x18\xc3\x0f\x1f relocation against symbol `time@@GLIBC_2.2.5\'\n/usr/bin/ld: BFD version 2.20.51.0.2-5.42.el6 20100205 internal error, aborting at reloc.c line 443 in bfd_get_reloc_size\n\n/usr/bin/ld: Please report this bug.\n\ncollect2: ld returned 1 exit status\n 207 | ... 208 | ... 209 | ... 210 | ``` 211 | Your system may have multiple versions of your linker that are causing a conflict. Rustc may be calling to a different or old version of the linker. In this case, specify the linker (in linux, gcc) as follows: 212 | ``` 213 | rustc -vV 214 | ``` 215 | Note the build target after "host: ", i.e. "x86_64-unknown-linux-gnu". 216 | ``` 217 | mkdir .cargo 218 | nano .cargo/config 219 | ``` 220 | edit the config file to have these contents: 221 | ``` 222 | [target.] 223 | linker = "" 224 | ``` 225 | for example, 226 | ``` 227 | [target.x86_64-unknown-linux-gnu] 228 | linker = "/opt/gnu/gcc/bin/gcc" 229 | ``` 230 | then, 231 | ``` 232 | cargo clean 233 | cargo build --release 234 | ``` -------------------------------------------------------------------------------- /metagenomics/coverm.md: -------------------------------------------------------------------------------- 1 | ![CoverM logo](https://github.com/wwood/CoverM/blob/main/images/coverm.png?raw=true) 2 | 3 | - [CoverM](#coverm) 4 | - [Installation](#installation) 5 | - [Install through the bioconda package](#install-through-the-bioconda-package) 6 | - [Pre-compiled binary](#pre-compiled-binary) 7 | - [Compiling from source](#compiling-from-source) 8 | - [Development version](#development-version) 9 | - [Dependencies](#dependencies) 10 | - [Shell completion](#shell-completion) 11 | - [Usage](#usage) 12 | - [Calculation methods](#calculation-methods) 13 | - [License](#license) 14 | 15 | # CoverM 16 | 17 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/coverm/badges/version.svg)](https://anaconda.org/bioconda/coverm) 18 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/coverm/badges/downloads.svg)](https://anaconda.org/bioconda/coverm) 19 | 20 | CoverM aims to be a configurable, easy to use and fast DNA read coverage and 21 | relative abundance calculator focused on metagenomics applications. 22 | 23 | CoverM calculates coverage of genomes/MAGs `coverm genome` ([help](https://wwood.github.io/CoverM/coverm-genome.html)) or individual 24 | contigs `coverm contig` ([help](https://wwood.github.io/CoverM/coverm-contig.html)). Calculating coverage by read mapping, its input can 25 | either be BAM files sorted by reference, or raw reads and reference genomes in various formats. 26 | 27 | ## Installation 28 | 29 | ### Install through the bioconda package 30 | 31 | CoverM and its dependencies can be installed through the [bioconda](https://bioconda.github.io/user/install.html) conda channel. 
After initial setup of conda and the bioconda channel, it can be installed with 32 | 33 | ``` 34 | conda install coverm 35 | ``` 36 | 37 | ### Pre-compiled binary 38 | 39 | Statically compiled CoverM binaries available on the [releases page](https://github.com/wwood/CoverM/releases). 40 | This installation method requires non-Rust dependencies to be installed separately - see the [dependencies section](#Dependencies). 41 | 42 | ### Compiling from source 43 | 44 | CoverM can also be installed from source, using the cargo build system after 45 | installing [Rust](https://www.rust-lang.org/). 46 | 47 | ``` 48 | cargo install coverm 49 | ``` 50 | 51 | ### Development version 52 | 53 | To run an unreleased version of CoverM, after installing 54 | [Rust](https://www.rust-lang.org/) and any additional dependencies listed below: 55 | 56 | ``` 57 | git clone https://github.com/wwood/CoverM 58 | cd CoverM 59 | cargo run -- genome ...etc... 60 | ``` 61 | 62 | To run tests: 63 | 64 | ``` 65 | cargo build 66 | cargo test 67 | ``` 68 | 69 | ### Dependencies 70 | 71 | For the full suite of options, additional programs must also be installed, when 72 | installing from source or for development. 73 | 74 | These can be installed using the conda YAML environment definition: 75 | 76 | ``` 77 | conda env create -n coverm -f coverm.yml 78 | ``` 79 | 80 | Or, these can be installed manually: 81 | 82 | * [samtools](https://github.com/samtools/samtools) v1.9 83 | * [tee](https://www.gnu.org/software/coreutils/), which is installed by default 84 | on most Linux operating systems. 85 | * [man](http://man-db.nongnu.org/), which is installed by default on most Linux 86 | operating systems. 87 | 88 | and some mapping software: 89 | 90 | * [minimap2](https://github.com/lh3/minimap2) v2.21 91 | * [bwa-mem2](https://github.com/bwa-mem2/bwa-mem2) v2.0 92 | 93 | For dereplication: 94 | 95 | * [Dashing](https://github.com/dnbaker/dashing) v0.4.0 96 | * [FastANI](https://github.com/ParBLiSS/FastANI) v1.3 97 | 98 | ### Shell completion 99 | 100 | Completion scripts for various shells e.g. BASH can be generated. For example, to install the bash completion script system-wide (this requires root privileges): 101 | 102 | ``` 103 | coverm shell-completion --output-file coverm --shell bash 104 | mv coverm /etc/bash_completion.d/ 105 | ``` 106 | 107 | It can also be installed into a user's home directory (root privileges not required): 108 | 109 | ``` 110 | coverm shell-completion --shell bash --output-file /dev/stdout >>~/.bash_completion 111 | ``` 112 | 113 | In both cases, to take effect, the terminal will likely need to be restarted. To test, type `coverm gen` and it should complete after pressing the TAB key. 114 | 115 | ## Usage 116 | 117 | CoverM operates in several modes. 
Detailed usage information including examples is given at the links below, or alternatively by using the `-h` or `--full-help` flags for each mode: 118 | 119 | * [genome](https://wwood.github.io/CoverM/coverm-genome.html) - Calculate coverage of genomes 120 | * [contig](https://wwood.github.io/CoverM/coverm-contig.html) - Calculate coverage of contigs 121 | 122 | There are several utility modes as well: 123 | 124 | * [make](https://wwood.github.io/CoverM/coverm-make.html) - Generate BAM files through alignment 125 | * [filter](https://wwood.github.io/CoverM/coverm-filter.html) - Remove (or only keep) alignments with insufficient identity 126 | * [cluster](https://wwood.github.io/CoverM/coverm-cluster.html) - Dereplicate and cluster genomes 127 | * shell-completion - Generate shell completion scripts 128 | 129 | ## Calculation methods 130 | 131 | The `-m/--methods` flag specifies the specific kind(s) of coverage that are 132 | to be calculated. 133 | 134 | To illustrate, imagine a set of 3 pairs of reads, where only 1 aligns to a 135 | single reference contig of length 1000bp: 136 | 137 | ``` 138 | read1_forward ========> 139 | read1_reverse <====+==== 140 | contig ...-----------------------------------------------------.... 141 | | | | | | 142 | position 200 210 220 230 240 143 | ``` 144 | 145 | The difference coverage measures would be: 146 | 147 | | Method | Value | Formula | Explanation | 148 | | ------------------ | ----------------------------------------------------------------------------------- | ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 149 | | mean | 0.02235294 | (10+9)/(1000-2*75) | The two reads have 10 and 9 bases aligned exactly, averaged over 1000-2*75 bp (length of contig minus 75bp from each end). | 150 | | relative_abundance | 33.3% | 0.02235294/0.02235294*(2/6) | If the contig is considered a genome, then its mean coverage is 0.02235294. There is a total of 0.02235294 mean coverage across all genomes, and 2 out of 6 reads (1 out of 3 pairs) map. This coverage calculation is only available in 'genome' mode. | 151 | | trimmed_mean | 0 | mean_coverage(mid-ranked-positions) | After removing the 5% of bases with highest coverage and 5% of bases with lowest coverage, all remaining positions have coverage 0. | 152 | | covered_fraction | 0.02 | (10+10)/1000 | 20 bases are covered by any read, out of 1000bp. | 153 | | covered_bases | 20 | 10+10 | 20 bases are covered. | 154 | | variance | 0.01961962 | var({1;20},{0;980}) | Variance is calculated as the sample variance. | 155 | | length | 1000 | | The contig's length is 1000bp. | 156 | | count | 2 | | 2 reads are mapped. | 157 | | reads_per_base | 0.002 | 2/1000 | 2 reads are mapped over 1000bp. | 158 | | metabat | contigLen 1000, totalAvgDepth 0.02235294, bam depth 0.02235294, variance 0.01961962 | | Reproduction of the[MetaBAT](https://bitbucket.org/berkeleylab/metabat) 'jgi_summarize_bam_contig_depths' tool output, producing [identical output](https://bitbucket.org/berkeleylab/metabat/issues/48/jgi_summarize_bam_contig_depths-coverage). | 159 | | coverage_histogram | 20 bases with coverage 1, 980 bases with coverage 0 | | The number of positions with each different coverage are tallied. 
| 160 | | rpkm | 1000000 | 2 * 10^9 / 1000 / 2 | Calculation here assumes no other reads map to other contigs. See https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ for an explanation of RPKM and TPM | 161 | | tpm | 1000000 | rpkm/total_of_rpkm * 10^6 | Calculation here assumes no other reads map to other contigs. See RPKM above. | 162 | 163 | Calculation of genome-wise coverage (`genome` mode) is similar to calculating 164 | contig-wise (`contig` mode) coverage, except that the unit of reporting is 165 | per-genome rather than per-contig. For calculation methods which exclude base 166 | positions based on their coverage, all positions from all contigs are considered 167 | together. For instance, if a 2000bp contig with all positions having 1X coverage 168 | is in a genome with 2,000,000bp contig with no reads mapped, then the 169 | trimmed_mean will be 0 as all positions in the 2000bp are in the top 5% of 170 | positions sorted by coverage. 171 | 172 | ## License 173 | 174 | CoverM is made available under GPL3+. See LICENSE.txt for details. Copyright Ben 175 | Woodcroft. 176 | 177 | Developed by Ben Woodcroft at the Queensland University of Technology [Centre for Microbiome Research](https://research.qut.edu.au/cmr/). 178 | -------------------------------------------------------------------------------- /metagenomics/skani.md: -------------------------------------------------------------------------------- 1 | # skani - accurate, fast nucleotide identity calculation for MAGs, genomes, and databases 2 | 3 | ## Introduction 4 | 5 | **skani** is a program for calculating average nucleotide identity (ANI) from DNA sequences (contigs/MAGs/genomes) for ANI > ~80%. 6 | 7 | skani uses an approximate mapping method without base-level alignment to get ANI. It is magnitudes faster than BLAST based methods and almost as accurate. skani offers: 8 | 9 | 1. **Accurate ANI calculations for MAGs**. skani is accurate for incomplete and medium-quality metagenome-assembled genomes (MAGs). Pure sketching methods (e.g. Mash) may underestimate ANI for incomplete MAGs. 10 | 11 | 2. **Aligned fraction results**. skani outputs the fraction of genome aligned, whereas pure k-mer based methods do not. 12 | 13 | 3. **Fast computations**. Indexing/sketching is ~ 3x faster than Mash, and querying is about 25x faster than FastANI (but slower than Mash). 14 | 15 | 4. **Efficient database search**. Querying a genome against a preprocessed database of >65000 prokaryotic genomes takes a few seconds with a single processor and ~6 GB of RAM. Constructing a database from genome sequences takes a few minutes to an hour. 16 | 17 | ## Updates 18 | 19 | ### v0.2.1 released - 2023-10-11 20 | 21 | More consistent support for small contigs and sequences. 22 | 23 | #### Major 24 | 25 | * --faster-small option included in dist and triangle. 26 | 27 | Genomes (and contigs with the --i, --ri, --qi options) with less than 20 marker k-mers are not screened according to the -s option. This was always the case but never documented. This makes skani more sensitive for small sequences, but can hamper performance on very large datasets with lots of small genomes/contigs. 28 | 29 | This heuristic can now be disabled with the `--faster-small` option. 30 | 31 | See the [CHANGELOG](https://github.com/bluenote-1577/skani/blob/main/CHANGELOG.md) for the skani's full versioning history. 32 | 33 | ## Install 34 | 35 | #### Option 1: Build from source 36 | 37 | Requirements: 38 | 1. 
[rust](https://www.rust-lang.org/tools/install) programming language and associated tools such as cargo are required and assumed to be in PATH. 39 | 2. A c compiler (e.g. GCC) 40 | 3. make 41 | 42 | Building takes a few minutes (depending on # of cores). 43 | 44 | ```sh 45 | git clone https://github.com/bluenote-1577/skani 46 | cd skani 47 | 48 | # If default rust install directory is ~/.cargo 49 | cargo install --path . --root ~/.cargo 50 | skani dist refs/e.coli-EC590.fasta refs/e.coli-K12.fasta 51 | 52 | # If ~/.cargo doesn't exist use below commands instead 53 | #cargo build --release 54 | #./target/release/skani dist refs/e.coli-EC590.fasta refs/e.coli-K12.fasta 55 | ``` 56 | 57 | See the [Releases](https://github.com/bluenote-1577/skani/releases) page for obtaining specific versions of skani. 58 | 59 | #### Option 2: Conda (source version: 0.2.1) 60 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/skani/badges/version.svg)](https://anaconda.org/bioconda/skani) 61 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/skani/badges/latest_release_date.svg)](https://anaconda.org/bioconda/skani) 62 | ```sh 63 | conda install -c bioconda skani 64 | ``` 65 | 66 | #### Option 3: Pre-built x86-64 linux statically compiled executable 67 | 68 | We offer a pre-built statically compiled executable for x86-64 Linux systems. That is, if you're on an x86-64 Linux system, you can just download the binary and run it without installing anything. 69 | 70 | For using the latest version of skani: 71 | 72 | ```sh 73 | wget https://github.com/bluenote-1577/skani/releases/download/latest/skani 74 | chmod +x skani 75 | ./skani -h 76 | ``` 77 | 78 | **Important**: the binary runs slightly slower (3-10%) most of the time, but it can be drastically slower on some tasks. 79 | 80 | ## Quick start 81 | 82 | ```sh 83 | # compare two genomes for ANI. skani is symmetric, so order does not affect ANI 84 | skani dist genome1.fa genome2.fa 85 | skani dist genome2.fa genome1.fa 86 | 87 | # compare multiple genomes; all options take -t for multi-threading. 88 | skani dist -t 3 -q query1.fa query2.fa -r reference1.fa reference2.fa -o all-to-all_results.txt 89 | 90 | # compare individual fasta records (e.g. contigs) 91 | skani dist --qi -q assembly1.fa --ri -r assembly2.fa 92 | 93 | # construct database and do memory-efficient search 94 | skani sketch genomes_to_search/* -o database 95 | skani search query1.fa query2.fa ... -d database 96 | 97 | # use sketch from "skani sketch" output as drop-in replacement 98 | skani dist database/query.fa.sketch database/ref.fa.sketch 99 | 100 | # construct similarity matrix/edge list for all genomes in folder 101 | skani triangle genome_folder/* > skani_ani_matrix.txt 102 | skani triangle genome_folder/* -E > skani_ani_edge_list.txt 103 | 104 | # we provide a script in this repository for clustering/visualizing distance matrices. 105 | # requires python3, seaborn, scipy/numpy, and matplotlib. 106 | python scripts/clustermap_triangle.py skani_ani_matrix.txt 107 | 108 | ``` 109 | 110 | ## Tutorials and manuals 111 | 112 | ### [skani basic usage information](https://github.com/bluenote-1577/skani/wiki/skani-basic-usage-guide) 113 | 114 | For more information about using the specific skani subcommands, see the [guide linked above](https://github.com/bluenote-1577/skani/wiki/skani-basic-usage-guide). 115 | 116 | ### skani tutorials 117 | 118 | 1. 
#### [Tutorial: setting up the GTDB prokaryotic genome database to search against](https://github.com/bluenote-1577/skani/wiki/Tutorial:-setting-up-the-GTDB-genome-database-to-search-against) 119 | 2. #### [Tutorial: classifying entire assemblies against > 85,000 genomes in under 2 minutes](https://github.com/bluenote-1577/skani/wiki/Tutorial:-classifying-entire-assemblies-(MAGs-or-contigs)-against-85,000-genomes-in-under-2-minutes) 120 | 3. #### [Tutorial: strain-level clustering of MAGs using skani, and why Mash/FastANI have issues](https://github.com/bluenote-1577/skani/wiki/Tutorial:-strain-and-species-level-clustering-of-MAGs-with-skani-triangle) 121 | 122 | ### [skani cookbook](https://github.com/bluenote-1577/skani/wiki/skani-cookbook) 123 | 124 | Some common use cases and parameter settings are outlined in the cookbook. 125 | 126 | ### [Pre-sketched databases for searching](https://github.com/bluenote-1577/skani/wiki/Pre%E2%80%90sketched-databases) 127 | 128 | Pre-sketched databases can be downloaded and quickly searched against. GTDB-R214 is currently supported. 129 | 130 | ### [skani advanced usage information](https://github.com/bluenote-1577/skani/wiki/skani-advanced-usage-guide) 131 | 132 | See the advanced usage guide linked above for more information about topics such as: 133 | 134 | * optimizing sensitivity/speed of skani 135 | * optimizing skani for long-reads or contigs 136 | * making skani for memory efficient for huge data sets 137 | 138 | ## Output 139 | 140 | If the resulting aligned fraction for the two genomes is < 15%, no output is given. 141 | 142 | **In practice, this means that only results with > ~82% ANI are reliably output** (with default parameters). See the [skani advanced usage guide](https://github.com/bluenote-1577/skani/wiki/skani-advanced-usage-guide) for information on how to compare lower ANI genomes. 143 | 144 | The default output for `search` and `dist` looks like 145 | ``` 146 | Ref_file Query_file ANI Align_fraction_ref Align_fraction_query Ref_name Query_name 147 | refs/e.coli-EC590.fasta refs/e.coli-K12.fasta 99.39 93.95 93.37 NZ_CP016182.2 Escherichia coli strain EC590 chromosome, complete genome NC_007779.1 Escherichia coli str. K-12 substr. W3110, complete sequence 148 | ``` 149 | - Ref_file: the filename of the reference. 150 | - Query_file: the filename of the query. 151 | - ANI: the ANI. 152 | - Aligned_fraction_query/reference: fraction of query/reference covered by alignments. 153 | - Ref/Query_name: the id of the first record in the reference/query file. 154 | 155 | The order of results is dependent on the command and not guaranteed to be deterministic when > 5000 query genomes are present. `dist` and `search` try to place the highest ANI results first. 156 | 157 | ## Citation 158 | 159 | Jim Shaw and Yun William Yu. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods (2023). https://doi.org/10.1038/s41592-023-02018-3 160 | 161 | ## Feature requests, issues 162 | 163 | skani is actively being developed by me ([Jim Shaw](https://jim-shaw-bluenote.github.io/)). I'm more than happy to accommodate simple feature requests (different types of outputs, etc). Feel free to open an issue with your feature request on the GitHub repository. If you catch any bugs, please open an issue or e-mail me (e-mail on my website). 
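A practical aside on the TSV output described above: because results are plain tab-separated text with one header line, standard shell tools are enough for quick post-processing. For example, using the `all-to-all_results.txt` file from the quick start (ANI is the third column, per the output table above):

```sh
# keep only comparisons with ANI >= 95, skipping the header line
tail -n +2 all-to-all_results.txt | awk -F'\t' '$3 >= 95'
```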
164 | 165 | ## Calling skani from rust or python 166 | 167 | ### Rust API 168 | 169 | If you're interested in using skani as a rust library, check out the minimal example here: https://github.com/bluenote-1577/skani-lib-example. The documentation is currently minimal (https://docs.rs/skani/0.1.0/skani/) and I guarantee no API stability. 170 | 171 | ### Python bindings 172 | 173 | If you're interested in calling skani from python, see the [pyskani](https://github.com/althonos/pyskani) python interface and bindings to skani written by [Martin Larralde](https://github.com/althonos). Note: I am not personally involved in the pyskani project and do not offer guarantees on the correctness of the outputs. -------------------------------------------------------------------------------- /metagenomics/sylph.md: -------------------------------------------------------------------------------- 1 | # sylph - fast and precise species-level metagenomic profiling with ANIs 2 | 3 | ## Introduction 4 | 5 | **sylph** is a program that performs ultrafast (1) **ANI querying** or (2) **metagenomic profiling** for metagenomic shotgun samples. 6 | 7 | **Containment ANI querying**: sylph can search a genome, e.g. E. coli, against your sample. If sylph outputs an estimate of 97% ANI, your sample contains an E. coli with 97% ANI to the queried genome. 8 | 9 | **Metagenomic profiling**: sylph can determine the species/taxa in your sample and their abundances, just like [Kraken](https://ccb.jhu.edu/software/kraken/) or [MetaPhlAn](https://github.com/biobakery/MetaPhlAn). 10 | 11 |
12 |
13 |
14 | Profiling 1 Gbp of mouse gut reads against 85,205 genomes in a few seconds
15 |
16 |
17 | 18 | 19 | ### Why sylph? 20 | 21 | 1. **Precise species-level profiling**: Our tests show that sylph is more precise than Kraken and about as precise and sensitive as marker gene methods (MetaPhlAn, mOTUs). 22 | 23 | 2. **Ultrafast, multithreaded, multi-sample**: sylph can be > 50x faster than MetaPhlAn for multi-sample processing. sylph only takes ~15GB of RAM for profiling against the entire GTDB-R220 database (110k genomes). 24 | 25 | 3. **Accurate (containment) ANIs down to 0.1x effective coverage**: for bacterial ANI queries of > 90% ANI, sylph can often give accurate ANI estimates down to 0.1x coverage. 26 | 27 | 4. **Easily customized databases**: sylph can profile against [metagenome-assembled genomes (MAGs), viruses, eukaryotes](https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases), and more. Taxonomic information can be incorporated downstream for traditional profiling reports. 28 | 29 | ### How does sylph work? 30 | 31 | sylph uses a k-mer containment method, similar to sourmash or Mash. sylph's novelty lies in **using a statistical technique to correct ANI for low coverage genomes** within the sample, giving accurate results for low abundance genomes. See [here for more information on what sylph can and can not do](https://github.com/bluenote-1577/sylph/wiki/Introduction:-what-is-sylph-and-how-does-it-work%3F). 32 | 33 | ## Very quick start 34 | 35 | #### Profile metagenome sample against [GTDB-R220](https://gtdb.ecogenomic.org/) (113,104 bacterial/archaeal species representative genomes) 36 | 37 | ```sh 38 | # download GTDB-R220 pre-built database (~13 GB) 39 | wget https://storage.googleapis.com/sylph-stuff/gtdb-r220-c200-dbv1.syldb 40 | 41 | # multi-sample paired-end profiling (sylph version >= 0.6) 42 | sylph profile gtdb-r220-c200-dbv1.syldb -1 *_1.fastq.gz -2 *_2.fastq.gz -t (threads) > profiling.tsv 43 | 44 | # multi-sample single-end profiling 45 | sylph profile gtdb-r220-c200-dbv1.syldb *.fastq -t (threads) > profiling.tsv 46 | ``` 47 | 48 | See below for install and more comprehensive usage information/tutorials/manuals. 49 | 50 | ## Install (current version v0.6.1) 51 | 52 | #### Option 1: conda install 53 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/sylph/badges/version.svg)](https://anaconda.org/bioconda/sylph) 54 | [![Anaconda-Server Badge](https://anaconda.org/bioconda/sylph/badges/latest_release_date.svg)](https://anaconda.org/bioconda/sylph) 55 | 56 | ```sh 57 | conda install -c bioconda sylph 58 | ``` 59 | 60 | **WARNING**: conda may break if you don't have AVX2 instructions or for v0.6.0. See the [issue here](https://github.com/bluenote-1577/sylph/issues/2). The binary and source install still work. 61 | 62 | #### Option 2: Build from source 63 | 64 | Requirements: 65 | 1. [rust](https://www.rust-lang.org/tools/install) (version > 1.63) programming language and associated tools such as cargo are required and assumed to be in PATH. 66 | 2. A c compiler (e.g. GCC) 67 | 3. make 68 | 4. cmake 69 | 70 | Building takes a few minutes (depending on # of cores). 71 | 72 | ```sh 73 | git clone https://github.com/bluenote-1577/sylph 74 | cd sylph 75 | 76 | # If default rust install directory is ~/.cargo 77 | cargo install --path . --root ~/.cargo 78 | sylph query test_files/* 79 | ``` 80 | #### Option 3: Pre-built x86-64 linux statically compiled executable 81 | 82 | If you're on an x86-64 system, you can download the binary and use it without any installation. 
83 | 84 | ```sh 85 | wget https://github.com/bluenote-1577/sylph/releases/download/latest/sylph 86 | chmod +x sylph 87 | ./sylph -h 88 | ``` 89 | 90 | Note: the binary is compiled with a different set of libraries (musl instead of glibc), probably impacting performance. 91 | 92 | ## Standard usage 93 | 94 | #### Sketching reads/genomes (indexing) 95 | 96 | ```sh 97 | # all fasta -> one *.syldb; fasta are assumed to be genomes 98 | sylph sketch genome1.fa genome2.fa -o database 99 | #EQUIVALENT: sylph sketch -g genome1.fa genome2.fa -o database 100 | 101 | # multi-sample sketching of paired reads 102 | sylph sketch -1 A_1.fq B_1.fq -2 A_2.fq B_2.fq -d read_sketch_folder 103 | 104 | # multi-sample sketching for single end reads, fastq are assumed to be reads 105 | sylph sketch reads.fq 106 | #EQUIVALENT: sylph sketch -r reads.fq 107 | ``` 108 | 109 | #### Profiling or querying with sketch files 110 | ```sh 111 | # ANI querying 112 | sylph query database.syldb read_sketch_folder/*.sylsp -t (threads) > ani_queries.tsv 113 | 114 | # taxonomic profiling 115 | sylph profile database.syldb read_sketch_folder/*.sylsp -t (threads) > profiling.tsv 116 | ``` 117 | 118 | ## Tutorials, manuals, and pre-built databases 119 | 120 | ### [Pre-built databases](https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases) 121 | 122 | The pre-built databases [available here](https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases) can be downloaded and used with sylph for profiling and containment querying. 123 | 124 | ### [Cookbook](https://github.com/bluenote-1577/sylph/wiki/sylph-cookbook) 125 | 126 | For common use cases and fast explanations, see the above [cookbook](https://github.com/bluenote-1577/sylph/wiki/sylph-cookbook). 127 | 128 | ### Tutorials 129 | 1. #### [Introduction: 5-minute sylph tutorial outlining basic usage](https://github.com/bluenote-1577/sylph/wiki/5%E2%80%90minute-sylph-tutorial) 130 | 2. #### [Taxonomic profiling against GTDB database with MetaPhlAn output format](https://github.com/bluenote-1577/sylph/wiki/Taxonomic-profiling-with-the-GTDB%E2%80%90R214-database) 131 | 132 | ### Manuals 133 | 1. #### [Output format (TSV) and containment ANI explanation](https://github.com/bluenote-1577/sylph/wiki/Output-format) 134 | 2. #### [Incoporating custom taxonomies to get CAMI-like or MetaPhlAn-like outputs](https://github.com/bluenote-1577/sylph/wiki/Integrating-taxonomic-information-with-sylph) 135 | 136 | ### [sylph-utils](https://github.com/bluenote-1577/sylph-utils) 137 | 138 | For incorporating taxonomy and manipulating output formats, see the [sylph-utils repository](https://github.com/bluenote-1577/sylph-utils). 139 | 140 | ## Changelog 141 | 142 | #### Version v0.6.1 - 2024-04-29. 143 | 144 | * Made unknown estimation (-u) more robust for low-depth short-read sequencing. 145 | 146 | See the [CHANGELOG](https://github.com/bluenote-1577/sylph/blob/main/CHANGELOG.md) for complete details. 147 | 148 | ## Citing sylph 149 | 150 | Jim Shaw and Yun William Yu. Metagenome profiling and containment estimation through abundance-corrected k-mer sketching with sylph (2023). bioRxiv. 
-------------------------------------------------------------------------------- /pangenomics/impg.md: -------------------------------------------------------------------------------- 1 | # impg: implicit pangenome graph 2 | 3 | [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/impg/README.html) 4 | 5 | Pangenome graphs and whole genome multiple alignments are powerful tools, but they are expensive to build and manipulate. 6 | Often, we would like to be able to break a small piece out of a pangenome without constructing the whole thing. 7 | `impg` lets us do this by projecting sequence ranges through many-way (e.g. all-vs-all) pairwise alignments built by tools like `wfmash` and `minimap2`. 8 | 9 | ## What does `impg` do? 10 | 11 | At its core, `impg` lifts over ranges from a target sequence (used as reference) into the queries (the other sequences aligned to the sequence used as reference) described in alignments. 12 | In effect, it lets us pick up homologous loci from all genomes mapped onto our specific target region. 13 | This is particularly useful when you're interested in comparing a specific genomic region across different individuals, strains, or species in a pangenomic or comparative genomic setting. 14 | The output is provided in BED, BEDPE and PAF formats, making it straightforward to use to extract FASTA sequences for downstream use in multiple sequence alignment (like `mafft`) or pangenome graph building (e.g., `pggb` or `minigraph-cactus`). 15 | 16 | ## How does it work? 17 | 18 | `impg` uses coitrees (implicit interval trees) to provide efficient range lookup over the input alignments. 19 | CIGAR strings are converted to a compact delta encoding. 20 | This approach allows for fast and memory-efficient projection of sequence ranges through alignments. 21 | 22 | ## Using `impg` 23 | 24 | Getting started with `impg` is straightforward. Here's a basic example of how to use the command-line utility: 25 | 26 | ```bash 27 | impg -p cerevisiae.pan.paf.gz -r S288C#1#chrI:50000-100000 -x 28 | ``` 29 | 30 | Your alignments must use `wfmash` default or `minimap2 --eqx` type CIGAR strings which have `=` for matches and `X` for mismatches. The `M` positional match character is not allowed. 31 | 32 | Depending on your alignments, this might result in the following BED file: 33 | 34 | ```txt 35 | S288C#1#chrI 50000 100000 36 | DBVPG6044#1#chrI 35335 85288 37 | Y12#1#chrI 36263 86288 38 | DBVPG6765#1#chrI 36166 86150 39 | YPS128#1#chrI 47080 97062 40 | UWOPS034614#1#chrI 36826 86817 41 | SK1#1#chrI 52740 102721 42 | ``` 43 | 44 | In this example, `-p` specifies the path to the PAF file, `-r` defines the target range in the format of `seq_name:start-end`, and `-x` requests a *transitive closure* of the matches. 45 | That is, for each collected range, we then find what sequence ranges are aligned onto it. 46 | This is done progressively until we've closed the set of alignments connected to the initial target range. 47 | 48 | ### Installation 49 | 50 | To compile and install `impg` from source, you'll need a recent rust build toolchain and cargo. 51 | 52 | 1. Clone the repository: 53 | ```bash 54 | git clone https://github.com/ekg/impg.git 55 | ``` 56 | 2. Navigate to the `impg` directory: 57 | ```bash 58 | cd impg 59 | ``` 60 | 3. Compile the tool (requires rust build tools): 61 | ```bash 62 | cargo install --force --path . 
63 | ```
64 |
65 | ## Authors
66 |
67 | Erik Garrison
68 | Andrea Guarracino
69 | Bryce Kille
70 |
71 | ## License
72 |
73 | MIT
74 |
--------------------------------------------------------------------------------
/phylogenomics/segul.md:
--------------------------------------------------------------------------------
1 | # SEGUL
2 |
3 | ![Segul-Tests](https://github.com/hhandika/segul/workflows/Segul-Tests/badge.svg)
4 | ![Crate-IO](https://img.shields.io/crates/v/segul)
5 | ![Crates-Download](https://img.shields.io/crates/d/segul?color=orange&label=crates.io-downloads)
6 | ![GH-Release](https://img.shields.io/github/v/tag/hhandika/segul?label=gh-releases)
7 | ![GH-Downloads](https://img.shields.io/github/downloads/hhandika/segul/total?color=blue&label=gh-release-downloads)
8 | [![LoC](https://tokei.rs/b1/github/hhandika/segul?category=code)](https://github.com/XAMPPRocky/tokei)
9 | ![last-commit](https://img.shields.io/github/last-commit/hhandika/segul)
10 | ![License](https://img.shields.io/github/license/hhandika/segul)
11 |
12 | SEGUL is an ultra-fast, memory-efficient application for working with phylogenomic datasets. It is available as a standalone, zero-dependency command-line application, a GUI application (called SEGUI), and a library/package for Rust and other programming languages. It runs on everything from your smartphone to High Performance Computers (see platform support below). It is safe, multi-threaded, and easy to use.
13 |
14 | It is designed to handle operations on large genomic datasets while using minimal computational resources. However, it also provides convenient features for working on smaller datasets (e.g., Sanger datasets). In our tests, it consistently offers a faster and more efficient (low memory footprint) alternative to existing applications for a variety of sequence alignment manipulations ([see benchmark](https://www.segul.app/docs/cli_gui#performance)).
15 |
16 | ## Citation
17 |
18 | > Handika, H., and J. A. Esselstyn. 2024. SEGUL: Ultrafast, memory-efficient and mobile-friendly software for manipulating and summarizing phylogenomic datasets. _Molecular Ecology Resources_. [https://doi.org/10.1111/1755-0998.13964](https://doi.org/10.1111/1755-0998.13964).
19 |
20 | ## Links
21 |
22 | - App Documentation: [[EN]](https://segul.app/)
23 | - API Documentation: [[Rust]](https://docs.rs/segul/0.18.1/segul/)
24 | - GUI source code: [[Repository]](https://github.com/hhandika/segui)
25 |
26 | ## Installation
27 |
28 | ### GUI Version
29 |
30 | ### Desktop
31 |
32 | [`Microsoft Store download`](https://apps.microsoft.com/detail/SEGUI/9NP1BQ6FW9PW?mode=direct)
33 |
34 | [`Download on the Mac App Store`](https://apps.apple.com/us/app/segui/id6447999874?mt=12&itsct=apps_box_badge&itscg=30200)
40 |
41 | ### Mobile
42 |
43 | [`Download on the App Store`](https://apps.apple.com/us/app/segui/id6447999874?itsct=apps_box_badge&itscg=30200)
44 |
45 | [`Get it on Google Play`](https://play.google.com/store/apps/details?id=com.hhandika.segui&pcampaignid=pcampaignidMKT-Other-global-all-co-prtnr-py-PartBadge-Mar2515-1)
50 |
51 | Learn more about device requirements and GUI app installation in the [documentation](https://www.segul.app/docs/installation/install_gui).
52 |
53 | ### CLI Version
54 |
55 | The CLI app may work on any Rust-supported [platform](https://doc.rust-lang.org/nightly/rustc/platform-support.html).
However, we have only tested and officially support the following platforms:

- Linux
- macOS
- Windows
- Windows Subsystem for Linux (WSL)

#### CLI Installation Methods

- Pre-compiled binaries: [[Releases]](https://github.com/hhandika/segul/releases) [[Docs]](https://www.segul.app/docs/installation/install_binary)
- Package manager: [[Docs]](https://www.segul.app/docs/installation/install_cargo)
- From source: [[Docs]](https://www.segul.app/docs/installation/install_source)
- Beta version: [[Docs]](https://www.segul.app/docs/installation/install_dev)

## Features

| Feature                        | Quick Link |
| ------------------------------ | ---------- |
| Alignment concatenation        | [CLI](https://www.segul.app/docs/cli-usage/concat) / [GUI](https://www.segul.app/docs/gui-usage/align-concat) |
| Alignment conversion           | [CLI](https://www.segul.app/docs/cli-usage/convert) / [GUI](https://www.segul.app/docs/gui-usage/align-convert) |
| Alignment filtering            | [CLI](https://www.segul.app/docs/cli-usage/filter) / [GUI](https://www.segul.app/docs/gui-usage/align-filter) |
| Alignment splitting            | [CLI](https://www.segul.app/docs/cli-usage/split) / [GUI](https://www.segul.app/docs/gui-usage/align-split) |
| Alignment partition conversion | [CLI](https://www.segul.app/docs/cli-usage/part) / [GUI](https://www.segul.app/docs/gui-usage/align-partition) |
| Alignment summary statistics   | [CLI](https://www.segul.app/docs/cli-usage/summary) / [GUI](https://www.segul.app/docs/gui-usage/align-summary) |
| Genomic summary statistics     | [CLI](https://www.segul.app/docs/cli-usage/genomic) / [GUI](https://www.segul.app/docs/gui-usage/genomic) |
| Sequence extraction            | [CLI](https://www.segul.app/docs/cli-usage/extract) / [GUI](https://www.segul.app/docs/gui-usage/sequence-extract) |
| Sequence ID extraction         | [CLI](https://www.segul.app/docs/cli-usage/id) / [GUI](https://www.segul.app/docs/gui-usage/sequence-id) |
| Sequence ID mapping            | [CLI](https://www.segul.app/docs/cli-usage/map) / [GUI](https://www.segul.app/docs/gui-usage/sequence-id-map) |
| Sequence ID renaming           | [CLI](https://www.segul.app/docs/cli-usage/rename) / [GUI](https://www.segul.app/docs/gui-usage/sequence-rename) |
| Sequence removal               | [CLI](https://www.segul.app/docs/cli-usage/remove) / [GUI](https://www.segul.app/docs/gui-usage/sequence-remove) |
| Sequence translation           | [CLI](https://www.segul.app/docs/cli-usage/translate) / [GUI](https://www.segul.app/docs/gui-usage/sequence-translate) |

Supported sequence formats:

1. NEXUS
2. Relaxed PHYLIP
3. FASTA
4. FASTQ (gzipped and uncompressed)

All of these formats are supported in both interleaved and sequential variants. Except for FASTQ (DNA only), the app supports both DNA and amino acid sequences.

Supported partition formats:

1. RAxML
2. NEXUS

The NEXUS partition can be written as a charset block embedded in NEXUS-formatted sequences or as a separate file.
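As a taste of the CLI, the alignment concatenation feature linked in the table above might be invoked roughly like this. This is only an illustrative sketch: the flag names below are assumptions, not the documented interface, so consult the linked CLI docs for the authoritative usage.

```bash
# Hypothetical invocation: concatenate all NEXUS alignments in loci/ and
# write the result in PHYLIP format (flag names are assumptions).
segul align concat --dir loci/ --input-format nexus --output concat_loci --output-format phylip
```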
## Contribution

We welcome all kinds of contributions, from issue reports and ideas to improve the app, to code contributions. For ideas and issue reports, please post on [the GitHub issues page](https://github.com/hhandika/segul/issues). For code contributions, please fork the repository and send a pull request to this repository.

--------------------------------------------------------------------------------
/proteomics/sage.md:
--------------------------------------------------------------------------------

# Sage: proteomics searching so fast it seems like magic

[![Rust](https://github.com/lazear/sage/actions/workflows/rust.yml/badge.svg)](https://github.com/lazear/sage/actions/workflows/rust.yml) [![Anaconda-Server Badge](https://anaconda.org/bioconda/sage-proteomics/badges/version.svg)](https://anaconda.org/bioconda/sage-proteomics)

For more information, please read [the online documentation!](https://sage-docs.vercel.app/docs)

# Introduction

Sage is, at its core, a proteomics database search engine - a tool that transforms raw mass spectra from proteomics experiments into peptide identifications via database searching and spectral matching.

However, Sage includes a variety of advanced features that make it a one-stop shop: retention time prediction, quantification (both isobaric and LFQ), peptide-spectrum match rescoring, and FDR control. You can use results from Sage directly, without needing other tools for these tasks.

Additionally, Sage was designed with cloud computing in mind - massively parallel processing and the ability to directly stream compressed mass spectrometry data to/from AWS S3 enable unprecedented search speeds at minimal cost.

Sage also runs just as well reading local files from your Mac/PC/Linux device!

## Why use Sage instead of other tools?

Sage is **simple to configure**, **powerful**, and **flexible**.
It also happens to be well-tested, **mind-bogglingly fast**, open-source (MIT-licensed), and free.

## Citation

If you use Sage in a scientific publication, please cite the following paper:

[Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale](https://doi.org/10.1021/acs.jproteome.3c00486)

## Features

- Incredible performance out of the box
- [Effortlessly cross-platform](https://sage-docs.vercel.app/docs/started#download-the-latest-binary-release) (Linux/macOS/Windows), effortlessly parallel (uses all of your CPU cores)
- [Fragment indexing strategy](https://sage-docs.vercel.app/docs/how_it_works) allows for blazing-fast narrow and open searches (> 500 Da precursor tolerance)
- [Isobaric quantification](https://sage-docs.vercel.app/docs/how_it_works#tmt-based) (MS2/MS3-TMT, or custom reporter ions)
- [Label-free quantification](https://sage-docs.vercel.app/docs/how_it_works#label-free): considers all charge states & isotopologues *a la* FlashLFQ
- Capable of searching for [chimeric/co-fragmenting spectra](https://sage-docs.vercel.app/docs/configuration/additional)
- Wide-window (dynamic precursor tolerance) search mode - [enables WWA/PRM/DIA searches](https://sage-docs.vercel.app/docs/configuration/tolerance#wide-window-mode)
- Retention time prediction models fit to each LC/MS run
- [PSM rescoring](https://sage-docs.vercel.app/docs/how_it_works#machine-learning-for-psm-rescoring) using built-in linear discriminant analysis (LDA)
- PEP calculation using a non-parametric model (KDE)
- FDR calculation using target-decoy competition and picked-peptide & picked-protein approaches
- Percolator/Mokapot [compatible output](https://sage-docs.vercel.app/docs/configuration#env)
- Configuration by [JSON file](https://sage-docs.vercel.app/docs/configuration#file)
- Built-in support for reading gzipped mzML files
- Support for reading/writing directly from [AWS S3](https://sage-docs.vercel.app/docs/configuration/aws)
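In practice, a run takes the JSON settings file as its first argument, and mzML files can either be listed in the config or passed on the command line. The sketch below is hedged: the file names are placeholders, and the exact invocation should be checked against the configuration docs linked above.

```bash
# Search two gzipped mzML runs using the settings in config.json
# (paths are placeholders for this sketch).
sage config.json sample_01.mzML.gz sample_02.mzML.gz
```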
## Interoperability

Sage is well-integrated into the open-source proteomics ecosystem. The following projects support analyzing results from Sage (typically in addition to other tools), or redistribute Sage binaries for use in their pipelines.

- [SearchGUI](http://compomics.github.io/projects/searchgui): a graphical user interface for running searches
- [PeptideShaker](http://compomics.github.io/projects/peptide-shaker): visualize peptide-spectrum matches
- [MS2Rescore](http://compomics.github.io/projects/ms2rescore): AI-assisted rescoring of results
- [Picked group FDR](https://github.com/kusterlab/picked_group_fdr): scalable protein group FDR for large-scale experiments
- [sagepy](https://github.com/theGreatHerrLebert/sagepy): Python bindings to the sage-core library
- [quantms](https://github.com/bigbio/quantms): Nextflow pipeline for running searches with Sage
- [OpenMS](https://github.com/OpenMS/OpenMS): Sage is included as a "TOPP" tool in OpenMS
- [sager](https://github.com/UCLouvain-CBIO/sager): R package for analyzing results from Sage searches
- If your project supports Sage and it's not listed, please open a pull request! If you need help integrating or interfacing with Sage in some way, please reach out.

Check out the (now outdated) [blog post introducing the first version of Sage](https://lazear.github.io/sage/) for more information and full benchmarks!

--------------------------------------------------------------------------------
/rna/rnapkin.md:
--------------------------------------------------------------------------------

# rnapkin: drawing RNA secondary structure with style

[![Crates.io](https://img.shields.io/crates/v/rnapkin?color=F55353)](https://crates.io/crates/rnapkin)
[![Downloads](https://img.shields.io/crates/d/rnapkin?color=FEB139)](https://crates.io/crates/rnapkin)

## Usage

rnapkin accepts a file containing a secondary structure and, optionally, a sequence and a name.
For example, you could have this marvelous RNA molecule sitting peacefully
in a file called "guaniners":

```text
>fantastic guanine riboswitch
AAUAUAAUAGGAACACUCAUAUAAUCGCGUGGAUAUGGCACGCAAGUUUCUACCGGGCAC
..........(..(.((((.((((..(((((.......)))))..........((((((.
CGUAAAUGUCCGACUAUGGGUGAGCAAUGGAACCGCACGUGUACGGUUUUUUGUGAUAUC
......)))))).....((((((((((((((((((........)))))))...........
AGCAUUGCUUGCUCUUUAUUUGAGCGGGCAAUGCUUUUUUUA
..)))))))))))).)))).)))).)..).............
```

Then, if you wish to visualize it, you could invoke rnapkin thus:

```
rnapkin guaniners
```

Surely rnapkin would respond with the name of the file it has just drawn to:

```
fantastic_guanine_riboswitch.svg
```

And this scalable vector graphic would be produced:

*(figure: the rendered secondary structure of the fantastic guanine riboswitch)*
*I* happen to quite enjoy the outcome, so *I* would say:

```
that's pretty neat
```

Your mileage may vary though.

## Rotating and flipping

If you'd like to see this or any other RNA molecule upside-down, tilted, or what have you, there are
some options listed below that you can use and combine:

```text
-a / --angle | starting Angle / boils down to clockwise rotation
--mx         | Mirror along X axis / aka vertical flip
--my         | Mirror along Y axis / aka horizontal flip
```
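For example, combining these with flags shown elsewhere in this README (the `-o` naming behavior and the `.png` trick are described in the Input section below):

```bash
# rotate the drawing 90 degrees clockwise, then also flip it vertically
rnapkin guaniners -a 90 --mx
# horizontal flip, with an explicit output name (.png requests a PNG render)
rnapkin guaniners --my -o flipped.png
```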

*(figures: the riboswitch rotated and mirrored with the options above, plus renders in different color themes)*
49 | 50 | color themes can be changed by -t option as demonstrated; a config file allowing to define custom color themes 51 | is planned though unimplemented!() 52 | 53 | ## Installing 54 | I plan to offer precompiled binaries but for now 55 | you'll need **rust**. Easiest way to acquire **rust** is via [rustup](https://rustup.rs) :crab: 56 | 57 | ### Anywhere 58 | ```bash 59 | cargo install rnapkin 60 | ``` 61 | ### WSL 62 | Fontconfig is the default Fontmanagement utility on Linux and Unix but WSL may not have them installed; 63 | ```bash 64 | sudo apt-get install libfontconfig libfontconfig1-dev 65 | cargo install rnapkin 66 | ``` 67 | 68 | ## Input 69 | input can be served to rnapkin as a file or be piped in: 70 | 71 | ```bash 72 | rnapkin cmolecule.fa -a 20 -o crab 73 | echo ".......(((((......))))).....(((((......)))))......." | rnapkin -a 20 -o crab 74 | ``` 75 | 76 | input is quite **flexible**; it should contain secondary_structure and optionally 77 | name and sequence. Name has to start with ">" and can be overwritten with -o flag 78 | which has priority. Here are some variations of valid input files: 79 | 80 | ### simple one 81 | 82 | ```text 83 | # you can add .png to the name to request png instead of svg 84 | @ the same of course can be achieved with -o flag. 85 | * this is a comment btw: any symbol other than ">.()" works but prefer "#" 86 | >simple molecule.png 87 | ((((((((((..((((((.........))))))......).((((((.......))))))..))))))))) 88 | CGCUUCAUAUAAUCCUAAUGAUAUGGUUUGGGAGUUUCUACCAAGAGCCUUAAACUCUUGAUUAUGAAGUG 89 | ``` 90 | 91 | ### Highlighting! 92 | There are [9 available colors](https://docs.rs/rnapkin/0.3.2/rnapkin/draw/colors/default_pallette/constant.HIGHLIGHTS.html) 93 | denoted by 1-9, while 0 means None. 94 | For example consider input below representing SAM riboswitch in the OFF conformation 95 | according to smFRET study by [Manz et al. 2017](https://doi.org/10.1038/nchembio.2476). 96 | By using numbers in the input we can mark aptamer forming helices P1, P2, P3, P4 #2-#5 and 97 | the **TERMINATOR** #1. 98 | 99 | ```text 100 | > offsam 101 | 102 | 0000022222222223333333333333333333333333333333333444444444444444444444444444444 103 | AUAUCCGUUCUUAUCAAGAGAAGCAGAGGGACUGGCCCGACGAUGCUUCAGCAACCAGUGUAAUGGCGAUCAGCCAUGA 104 | .......((((((((....(((((...(((.....)))......)))))(((..(((((...(((((.....))))).) 105 | 106 | 4444444444555555555555555555555555555522222222222211111111111111111111111111111 107 | CUAAGGUGCUAAAUCCAGCAAGCUCGAACAGCUUGGAAGAUAAGAAGAGACAAAAUCACUGACAAAGUCUUCUUCUUAA 108 | ))..)).)))........((((((.....))))))...)))))))).................((((((.((((...)) 109 | 110 | 111111111111 111 | GAGGACUUUUUU 112 | )).))))))... 113 | ``` 114 | 115 |

*(figure: the "offsam" structure rendered with the terminator and helices P1-P4 highlighted)*
### only secondary structure

```text
.........(((..((((((...((((((((.....((((((((((...)))))).....
(((((((...))))))).))))(((.....)))...)))).)))).))))))..)))..(
(((.(((((..(((......))).)))))..))))(((((((((((((....))))))))
))))).....
```

### multiline

Sequence and secondary structure can be separate,
mixed, and aligned; everything should work.

## DIY

Using the -p / --points flag, you can make rnapkin print the calculated coordinates
of the nucleotide bubbles (each with a 0.5-unit radius). You can then plot them
yourself if you need to do something specific.

If you happen to clone the repository, there is an example Python
script using **matplotlib** that you can pipe the output to:

```bash
cargo run -- atelier/example_inputs/guaniners -p | atelier/plot.py
```

You can also combine the -p flag with --mx, --my, and -a.

## rnapkin name

The wordsmithing process was arduous. It involved
googling "words starting with na" and looking for anything drawing-related.
Once the word was found, unparalleled strength was employed to slap it on top of "rna",
ultimately creating this glorious amalgamation.

### why it kinda makes sense:

You ever heard of all those physicists, mathematicians, and the like scribbling formulas on the
back of a napkin ~~or a book margin~~? There is even a [wikipedia page](https://en.wikipedia.org/wiki/Back-of-the-envelope_calculation) about it.

It doesn't take much mental gymnastics to imagine a biologist frantically scrambling together
an rna structure on a napkin. I am currently working on baiting my biologist
friend into a heated rna debate while in close proximity to an abundant napkin source
in order to produce a proof of concept.

--------------------------------------------------------------------------------
/singlecell/alevin-fry.md:
--------------------------------------------------------------------------------

# alevin-fry ![Rust](https://github.com/COMBINE-lab/alevin-fry/workflows/Rust/badge.svg) [![Anaconda-Server Badge](https://anaconda.org/bioconda/alevin-fry/badges/platforms.svg)](https://anaconda.org/bioconda/alevin-fry) [![Anaconda-Server Badge](https://anaconda.org/bioconda/alevin-fry/badges/license.svg)](https://anaconda.org/bioconda/alevin-fry) ![GitHub tag (latest SemVer)](https://img.shields.io/github/v/tag/combine-lab/alevin-fry?style=flat-square)

`alevin-fry` is a suite of tools for the rapid, accurate, and memory-frugal processing of single-cell and single-nucleus sequencing data. It consumes RAD files generated by [`piscem`](https://github.com/COMBINE-lab/piscem) or `salmon alevin`, and performs common operations like generating permit lists and estimating the number of distinct molecules from each gene within each cell. The focus in `alevin-fry` is on safety, accuracy, and efficiency (in terms of both time and memory usage).

You can read the paper describing `alevin-fry`, "Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data", [here](https://www.nature.com/articles/s41592-022-01408-3), and the preprint [on bioRxiv](https://www.biorxiv.org/content/10.1101/2021.06.29.450377v1).
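The permit-list generation and per-cell molecule estimation mentioned above are exposed as subcommands that are typically chained. The sketch below is a rough outline from memory: directory names are placeholders and the flags are assumptions, so treat the dedicated documentation linked below as the authoritative interface.

```bash
# 1) decide which barcodes correspond to real cells (knee-distance method),
#    with expected read orientation "fw" (flags assumed; check the docs)
alevin-fry generate-permit-list -i map_dir -d fw --knee-distance -o quant_dir

# 2) group the mapped records by corrected barcode
alevin-fry collate -i quant_dir -r map_dir -t 16

# 3) resolve UMIs and produce the gene-by-cell count matrix
alevin-fry quant -i quant_dir -m t2g.tsv -r cr-like -o quant_res -t 16
```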
**Note**: We recommend using [`piscem`](https://github.com/COMBINE-lab/piscem) as the back-end mapper rather than salmon, as it is substantially more resource-frugal, faster, and a larger focus of current and future development.

### Getting started with `alevin-fry` and dedicated documentation

While this `README` contains some useful information and pointers to get started, `alevin-fry` has its own [dedicated documentation site](https://alevin-fry.readthedocs.io/en/latest/), hosted on `ReadTheDocs`.

### More information

* [**Quickstart guide using the `simpleaf` wrapper**](https://combine-lab.github.io/alevin-fry-tutorials/2023/simpleaf-piscem/)

* **Relationship to alevin**: Alevin-fry has been designed as the successor to alevin. It subsumes the core features of alevin, while also providing important new capabilities and a considerably improved performance profile. We anticipate that new method development and feature additions will take place primarily within the alevin-fry codebase. Thus, we encourage users of alevin to migrate to alevin-fry when feasible. That being said, alevin is still actively maintained and supported, so if you are using it and not ready to migrate, you can continue to ask questions and post issues in [the salmon repository](https://github.com/COMBINE-lab/salmon).

## FAQs

Are you curious about processing details like [whether to use a sparse or dense index](https://github.com/COMBINE-lab/alevin-fry/discussions/38)? Do you have a question that isn't necessarily a bug report or feature request, and that isn't readily answered by the documentation or tutorials? Then please feel free to ask over in the [Q&A](https://github.com/COMBINE-lab/alevin-fry/discussions/categories/q-a).

## Sister repositories

* The generation of the reduced alignment data (RAD) files processed by alevin-fry is done by either [piscem](https://github.com/COMBINE-lab/piscem) or [salmon](https://github.com/COMBINE-lab/salmon). The latest versions of both are available on GitHub and via bioconda.

* The [`simpleaf`](https://github.com/COMBINE-lab/simpleaf) repository contains a dedicated wrapper / workflow runner for processing data with `alevin-fry` that vastly simplifies both the creation of extended references and the subsequent quantification of samples. If you find that `simpleaf` is missing a feature that you'd like to have, please consider submitting a feature request in the [`simpleaf` repository](https://github.com/COMBINE-lab/simpleaf/issues).

* The [`pyroe`](https://github.com/COMBINE-lab/pyroe) repository provides tools to help easily construct an enhanced (_spliced + intronic_ or _spliced + unspliced_) transcriptome from a reference genome and GTF file.

* The [`fishpond`](https://github.com/mikelove/fishpond) package — maintained by @mikelove and his lab — contains the recommended functions for reading `alevin-fry` output (particularly USA-mode output) into the R ecosystem, in the form of a [`SingleCellExperiment`](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) object.

* The [`alevinqc`](https://github.com/csoneson/alevinQC) package — maintained by @csoneson — provides tools and functions for performing quality control and assessment downstream of `alevin-fry`.
## Installing from bioconda

Alevin-fry is available for both x86 Linux and macOS platforms [via bioconda](https://anaconda.org/bioconda/alevin-fry). On Apple silicon, you can either build (easily) from source (see below) or run `alevin-fry` under the Rosetta 2 emulation layer.

With `bioconda` in the appropriate place in your channel list, you should simply be able to install via:

```{bash}
$ conda install -c bioconda alevin-fry
```

## Installing from crates.io

Alevin-fry can also be installed from [`crates.io`](https://crates.io/crates/alevin-fry) using `cargo`. This can be done with the following command:

```{bash}
$ cargo install alevin-fry
```

## Building from source

If you want to use features or fixes that may only be available in the latest develop branch (or want to build for a different architecture), then you have to build from source. Luckily, `cargo` makes that easy; see below.

Alevin-fry is built and tested with the latest (major & minor) stable version of [Rust](https://www.rust-lang.org/). While it will likely compile fine with slightly older versions of Rust, this is not a guarantee and is not a support priority. Unlike with C++, Rust has a frequent and stable release cadence, is designed to be installed and updated from user space, and is easy to keep up to date with [rustup](https://rustup.rs/). Thanks to cargo, building should be as easy as:

```{bash}
$ cargo build --release
```

Subsequent commands below will assume that the executable is in your path. Temporarily, this can
be done (in bash-like shells) using:

```{bash}
$ export PATH=`pwd`/target/release/:$PATH
```

## Citing alevin-fry

If you use `alevin-fry` in your work, please cite:

> He, D., Zakeri, M., Sarkar, H., Soneson, C., Srivastava, A., and Patro, R. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19, 316–322 (2022). https://doi.org/10.1038/s41592-022-01408-3

**BibTeX:**

```
@Article{He2022,
  author={He, Dongze and Zakeri, Mohsen and Sarkar, Hirak and Soneson, Charlotte and Srivastava, Avi and Patro, Rob},
  title={Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data},
  journal={Nature Methods},
  year={2022},
  month={Mar},
  day={01},
  volume={19},
  number={3},
  pages={316-322},
  issn={1548-7105},
  doi={10.1038/s41592-022-01408-3},
  url={https://doi.org/10.1038/s41592-022-01408-3}
}
```

--------------------------------------------------------------------------------
/singlecell/cellranger.md:
--------------------------------------------------------------------------------

Cell Ranger is a set of analysis pipelines that perform sample demultiplexing, barcode processing, single-cell 3' and 5' gene counting, V(D)J transcript sequence assembly and annotation, and Feature Barcode analysis from 10x Genomics Chromium Single Cell data.

Please note that this source code is made available only for informational purposes. 10x does not provide support for interpreting, modifying, building, or running this code.

The officially supported release binaries, along with the documentation, are available from 10x Genomics.
--------------------------------------------------------------------------------
/slurm/ssubmit.md:
--------------------------------------------------------------------------------

# ssubmit

[![Rust CI](https://github.com/mbhall88/ssubmit/actions/workflows/ci.yaml/badge.svg)](https://github.com/mbhall88/ssubmit/actions/workflows/ci.yaml)
[![codecov](https://codecov.io/gh/mbhall88/ssubmit/branch/main/graph/badge.svg?token=4O7HTGKD6Q)](https://codecov.io/gh/mbhall88/ssubmit)
[![Crates.io](https://img.shields.io/crates/v/ssubmit.svg)](https://crates.io/crates/ssubmit)

Submit sbatch jobs without having to create a submission script

- [Motivation](#motivation)
- [Install](#install)
- [Usage](#usage)

## Motivation

This project is motivated by the fact that I just want to be able to submit commands as
jobs, and I don't want to fluff around with making a submission script.

`ssubmit` wraps that whole process and lets you live your best lyf #blessed.

## Install

**tl;dr**

```shell
curl -sSL install.ssubmit.mbh.sh | sh
# or with wget
wget -nv -O - install.ssubmit.mbh.sh | sh
```

You can pass options to the script like so:

```
$ curl -sSL install.ssubmit.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of ssubmit, if ssubmit is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer [default: apple-darwin]

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/ssubmit/releases]

        -h, --help
                Display this help message
```

### Cargo

```shell
$ cargo install ssubmit
```

### Build from source

```shell
$ git clone https://github.com/mbhall88/ssubmit.git
$ cd ssubmit
$ cargo build --release
$ target/release/ssubmit -h
```

## Usage

Submit an rsync job named "foo", requesting 350MB of memory and a one-week time limit:

```shell
$ ssubmit -m 350m -t 1w foo "rsync -az src/ dest/"
```

Submit a job that needs 8 CPUs:

```shell
$ ssubmit -m 16g -t 1d align "minimap2 -t 8 ref.fa query.fq > out.paf" -- -c 8
```

The basic anatomy of a `ssubmit` call is:

```
ssubmit [OPTIONS] <NAME> <COMMAND> [-- <REMAINDER>...]
```

`NAME` is the name of the job (the `--job-name` parameter in `sbatch`).

`COMMAND` is what you want to be executed by the job. It **must** be quoted (single or
double).

`REMAINDER` is any (optional) [`sbatch`-specific options](https://slurm.schedmd.com/sbatch.html#lbAG) you want to pass on. These
must follow a `--` after `COMMAND`.

### Memory

Memory (`-m,--mem`) is intended to be a little more user-friendly than the `sbatch
--mem` option. For example, you can pass `-m 0.5g` and `ssubmit` will interpret and
convert this as 500M. However, `-m 1.7G` will be rounded up to 2G. One place where this
option differs from `sbatch` is that if you don't give units, the value will be interpreted as
bytes - i.e., `-m 1000` will be converted to 1K. Units are case-insensitive.
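Combined with the dry-run flag described below, you can check these conversions without submitting anything. The comments show the conversions implied by the rules above; the exact rendering of the resulting `sbatch` flags is an assumption.

```shell
$ ssubmit -n -m 0.5g -t 0 demo "echo hi"   # --mem rendered as 500M
$ ssubmit -n -m 1.7g -t 0 demo "echo hi"   # rounded up to 2G
$ ssubmit -n -m 1000 -t 0 demo "echo hi"   # no units: bytes, converted to 1K
```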
### Time

As with memory, time (`-t,--time`) is intended to be simple. If you want a time limit of
three days, just pass `-t 3d`. Want two and a half hours? Then `-t 2h30m` works. If
you want to use your cluster's default time limit, just pass `-t 0`. You can
also pass the [time format `sbatch` uses](https://slurm.schedmd.com/sbatch.html#OPT_time) and it will be seamlessly passed on. For
a full list of supported time units, check out the
[`duration-str`](https://github.com/baoyachi/duration-str) repo.

### Dry run

You can see what `ssubmit` would do without actually submitting a job using dry run
(`-n,--dry-run`). This will print the `sbatch` command along with the submission script
that would have been provided.

```shell
$ ssubmit -n -m 4g -t 1d dry "rsync -az src/ dest/" -- -c 8
[2022-01-19T08:58:58Z INFO ssubmit] Dry run requested. Nothing submitted
sbatch -c 8