├── LICENSE
├── README.md
├── bam
│   ├── best.md
│   ├── modkit.md
│   ├── perbase.md
│   └── rustybam.md
├── csv
│   ├── csview.md
│   ├── madato.md
│   ├── xsv.md
│   └── xtab.md
├── dna
│   ├── fakit.md
│   ├── fq.md
│   ├── ngs.md
│   ├── rust-bio-tools.md
│   └── skc.md
├── fastq
│   ├── fasten.md
│   ├── faster.md
│   ├── fqgrep.md
│   ├── fqkit.md
│   ├── fqtk.md
│   └── rasusa.md
├── longreads
│   ├── NextPolish2.md
│   ├── chopper.md
│   └── longshot.md
├── metagenomics
│   ├── coverm.md
│   ├── skani.md
│   └── sylph.md
├── pangenomics
│   └── impg.md
├── phylogenomics
│   └── segul.md
├── proteomics
│   └── sage.md
├── rna
│   └── rnapkin.md
├── singlecell
│   ├── alevin-fry.md
│   └── cellranger.md
├── slurm
│   └── ssubmit.md
└── vcf
    └── echtvar.md
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 size_t
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ### rust in bioinformatics
2 |
3 | A collection of genomics software tools written in Rust
4 |
5 |
6 |
7 | ### index section
8 |
9 | ##### bam
10 | - [alignoth](https://github.com/alignoth/alignoth) : Creating alignment plots from bam files
11 | - [bamrescue](https://github.com/arkanosis/bamrescue) : Utility to check Binary Sequence Alignment / Map (BAM) files for corruption and repair them
12 | - [best](https://github.com/google/best) : Bam Error Stats Tool (best): analysis of error types in aligned reads
13 | - [modkit](https://github.com/nanoporetech/modkit) : A bioinformatics tool for working with modified bases
14 | - [mapAD](https://github.com/mpieva/mapAD) : An aDNA aware short-read mapper
15 | - [perbase](https://github.com/sstadick/perbase) : Per-base per-nucleotide depth analysis
16 | - [rustybam](https://github.com/mrvollger/rustybam) : bioinformatics toolkit in rust
17 |
18 | ##### csv
19 |
20 | - [csview](https://github.com/wfxr/csview) : 📠 Pretty and fast csv viewer for cli with cjk/emoji support
21 | - [csvlens](https://github.com/YS-L/csvlens) : csvlens is a command line CSV file viewer. It is like less but made for CSV.
22 | - [madato](https://github.com/inosion/madato) : Markdown Cmd Line, Rust and JS library for Excel to Markdown Tables
23 | - [qsv](https://github.com/dathere/qsv) : Blazing-fast Data-Wrangling toolkit
24 | - [rsv](https://github.com/ribbondz/rsv) : A command-line tool written in Rust for analyzing CSV, TXT, and Excel files.
25 | - [tabiew](https://github.com/shshemi/tabiew) : A lightweight TUI app to view and query CSV files
26 | - [tv](https://github.com/alexhallam/tv) : 📺(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment.
27 | - [xan](https://github.com/medialab/xan) : The CSV magician
28 | - [xsv](https://github.com/BurntSushi/xsv) : A fast CSV command line toolkit written in Rust.
29 | - [xtab](https://github.com/sharkLoc/xtab) : CSV command line utilities
30 |
31 | ##### dna
32 |
33 | - [biotools](https://github.com/jimrybarski/biotools) : Command line bioinformatics functions
34 | - [darwin](https://github.com/Ebedthan/darwin) : Create (rapid) neighbor-joining tree from sequences using mash distance
35 | - [fakit](https://github.com/sharkLoc/fakit) : fakit: a simple program for fasta file manipulation
36 | - [filterx](https://github.com/dwpeng/filterx) : process any file in tabular format. Fasta/fastq/GTF/GFF/VCF/SAM/BED
37 | - [fq](https://github.com/stjude-rust-labs/fq) : Command line utility for manipulating Illumina-generated FASTQ files.
38 | - [gsearch](https://github.com/jean-pierreboth/gsearch) : Approximate nearest neighbour search for microbial genomes based on hash metric
39 | - [Hyper-Gen](https://github.com/wh-xu/Hyper-Gen) : HyGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors
40 | - [kanpig](https://github.com/ACEnglish/kanpig) : Kmer Analysis of Pileups for Genotyping
41 | - [kfc](https://github.com/lrobidou/KFC) : KFC (K-mer Fast Counter) is a fast and space-efficient k-mer counter based on hyper-k-mers.
42 | - [ngs](https://github.com/stjude-rust-labs/ngs) : Command line utility for working with next-generation sequencing files.
43 | - [nail](https://github.com/TravisWheelerLab/nail) : Nail is an Alignment Inference tooL
44 | - [palindrome-finder](https://github.com/brianli314/palindrome-finder) : A bioinformatics tool written in Rust to find palindromic sequences in DNA
45 | - [poasta](https://github.com/broadinstitute/poasta) : Fast and exact gap-affine partial order alignment
46 | - [psdm](https://github.com/mbhall88/psdm) : Compute a pairwise SNP distance matrix from one or two alignment(s)
47 | - [rust-bio-tools](https://github.com/rust-bio/rust-bio-tools) : A set of command line utilities based on Rust-Bio.
48 | - [ska](https://github.com/bacpop/ska.rust) : Split k-mer analysis – version 2
49 | - [skc](https://github.com/mbhall88/skc) : Shared k-mer content between two genomes
50 | - [sketchy](https://github.com/esteinig/sketchy) : Genomic neighbor typing of bacterial pathogens using MinHash 🐀
51 | - [tidk](https://github.com/tolkit/telomeric-identifier) : Identify and find telomeres, or telomeric repeats in a genome.
52 | - [transanno](https://github.com/informationsea/transanno) : accurate LiftOver tool for new genome assemblies
53 | - [xgt](https://github.com/Ebedthan/xgt) : Efficient and fast querying and parsing of GTDB's data
54 |
55 | ##### fastq
56 |
57 | - [deacon](https://github.com/bede/deacon) : Fast (host) DNA sequence filtering
58 | - [fasten](https://github.com/lskatz/fasten) : 👷 Fasten toolkit, for streaming operations on fastq files
59 | - [faster](https://github.com/angelovangel/faster) : A (very) fast program for getting statistics about a fastq file, the way I need them, written in Rust
60 | - [fqgrep](https://github.com/fulcrumgenomics/fqgrep) : Grep for FASTQ files
61 | - [fqkit](https://github.com/sharkLoc/fqkit) : 🦀 Fqkit: A simple and cross-platform program for fastq file manipulation
62 | - [fqtk](https://github.com/fulcrumgenomics/fqtk) : Fast FASTQ sample demultiplexing in Rust.
63 | - [grepq](https://github.com/rbfinch/grepq) : quickly filter fastq files by matching sequences to a set of regex patterns
64 | - [guide-counter](https://github.com/fulcrumgenomics/guide-counter) : A better, faster way to count guides in CRISPR screens.
65 | - [K2Rmini](https://github.com/Malfoy/K2Rmini) : K2Rmini (or K-mer to Reads mini) is a tool to filter the reads contained in a FASTA/Q file based on a set of k-mers of interest.
66 | - [kractor](https://github.com/sam-sims/kractor) : Rapidly extract reads from a FASTQ file based on taxonomic classification via Kraken2.
67 | - [rasusa](https://github.com/mbhall88/rasusa) : Randomly subsample sequencing reads
68 | - [SeqSizzle](https://github.com/ChangqingW/SeqSizzle) : SeqSizzle is a pager for viewing FASTQ files with fuzzy matching, allowing different adaptors to be colored differently.
69 | - [sabreur](https://github.com/ebedthan/sabreur) : fast, reliable and handy demultiplexing tool for fastx files
70 |
71 | ##### format
72 | - [atlas](https://github.com/stjude-rust-labs/atlas) : Enables storing, querying, transforming, and visualizing of multidimensional count data.
73 | - [bigtools](https://github.com/jackh726/bigtools) : A high-performance BigWig and BigBed library in Rust
74 | - [biotest](https://github.com/natir/biotest) : Generate random test data for bioinformatics
75 | - [bqtools](https://github.com/arcinstitute/bqtools) : A command line utility for working with BINSEQ files
76 | - [cigzip](https://github.com/AndreaGuarracino/cigzip) : A tool for compression and decompression of alignment CIGAR strings using tracepoints.
77 | - [d4tools](https://github.com/38/d4-format) : The D4 Quantitative Data Format
78 | - [gfa2bin](https://github.com/MoinSebi/gfa2bin) : Convert various graph-related data to PLINK file. In addition, we offer multiple commands for filtering or modifying the generated PLINK files.
79 | - [gia](https://github.com/noamteyssier/gia) : gia: Genomic Interval Arithmetic
80 | - [granges](https://github.com/vsbuffalo/granges) : A Rust library and command line tool for working with genomic ranges and their data.
81 | - [intspan](https://github.com/wang-q/intspan) : Command line tools for IntSpan related bioinformatics operations
82 | - [nuc2bit](https://github.com/natir/nuc2bit) : A rust crate that provides methods for rapidly encoding and decoding nucleotides in 2-bit representation.
83 | - [recmap](https://github.com/vsbuffalo/recmap) : A command line tool and Rust library for working with recombination maps.
84 | - [transanno](https://github.com/informationsea/transanno) : accurate LiftOver tool for new genome assemblies
85 | - [thirdkind](https://github.com/simonpenel/thirdkind) : Drawing reconciled phylogenetic trees allowing 1, 2 or 3 reconciliation levels
86 | - [xsra](https://github.com/ArcInstitute/xsra) : An efficient CLI to extract sequences from the SRA
87 |
88 |
89 | ##### gff3
90 |
91 | - [atg](https://github.com/anergictcell/atg) : A Rust library and CLI tool to handle genomic transcripts
92 | - [gffkit](https://github.com/sharkloc/gffkit) : a simple program for gff3 file manipulation
93 |
94 | ##### longreads
95 |
96 | - [Autocycler](https://github.com/rrwick/Autocycler) : A tool for generating consensus long-read assemblies for bacterial genomes
97 | - [chopper](https://github.com/wdecoster/chopper) : Rust implementation of [NanoFilt](https://github.com/wdecoster/nanofilt)+[NanoLyse](https://github.com/wdecoster/nanolyse), both originally written in Python. This tool, intended for long read sequencing such as PacBio or ONT, filters and trims a fastq file.
98 | - [DeepChopper](https://github.com/ylab-hi/DeepChopper) : Language models identify chimeric artificial reads in NanoPore direct-RNA sequencing data.
99 | - [fpa](https://github.com/natir/fpa) : Filter of Pairwise Alignment
100 | - [herro](https://github.com/lbcb-sci/herro) : HERRO is a highly-accurate, haplotype-aware, deep-learning tool for error correction of Nanopore R10.4.1 or R9.4.1 reads (read length of >= 10 kbps is recommended).
101 | - [HiPhase](https://github.com/PacificBiosciences/HiPhase) : Small variant, structural variant, and short tandem repeat phasing tool for PacBio HiFi reads
102 | - [longshot](https://github.com/pjedge/longshot) : diploid SNV caller for error-prone reads
103 | - [lrge](https://github.com/mbhall88/lrge) : Genome size estimation from long read overlaps
104 | - [myloasm](https://github.com/bluenote-1577/myloasm) : A new high-resolution long-read metagenome assembler for even noisy reads
105 | - [Polypolish](https://github.com/rrwick/Polypolish) : a short-read polishing tool for long-read assemblies
106 | - [nextpolish2](https://github.com/Nextomics/NextPolish2) : Repeat-aware polishing genomes assembled using HiFi long reads
107 | - [nanoq](https://github.com/esteinig/nanoq) : Minimal but speedy quality control for nanopore reads in Rust 🐻
108 | - [smrest](https://github.com/jts/smrest) : Tumour-only somatic mutation calling using long reads
109 | - [trgt](https://github.com/PacificBiosciences/trgt) : Tandem repeat genotyping and visualization from PacBio HiFi data
110 | - [yacrd](https://github.com/natir/yacrd) : Yet Another Chimeric Read Detector
111 |
112 |
113 |
114 | ##### metagenomics
115 |
116 | - [coverm](https://github.com/wwood/CoverM) : Read coverage calculator for metagenomics
117 | - [galah](https://github.com/wwood/galah) : More scalable dereplication for metagenome assembled genomes
118 | - [hyperex](https://github.com/Ebedthan/hyperex) : Hypervariable region primer-based extractor for 16S rRNA and other SSU/LSU sequences.
119 | - [kun_peng](https://github.com/eric9n/Kun-peng) : Kun-peng: an ultra-fast, low-memory footprint and accurate taxonomy classifier for all
120 | - [kmertools](https://github.com/anuradhawick/kmertools) : kmer based feature extraction tool for bioinformatics, metagenomics, AI/ML and more
121 | - [kmerutils](https://github.com/jean-pierreBoth/kmerutils) : Kmer generating, counting hashing and related
122 | - [Lorikeet](https://github.com/rhysnewell/Lorikeet) : Strain resolver for metagenomics
123 | - [nohuman](https://github.com/mbhall88/nohuman) : Remove human reads from a sequencing run
124 | - [rosella](https://github.com/rhysnewell/rosella) : Metagenomic Binning Algorithm
125 | - [skani](https://github.com/bluenote-1577/skani) : Fast, robust ANI and aligned fraction for (metagenomic) genomes and contigs.
126 | - [sourmash](https://github.com/sourmash-bio/sourmash) : Quickly search, compare, and analyze genomic and metagenomic data sets.
127 | - [sylph](https://github.com/bluenote-1577/sylph) : ultrafast genome querying and taxonomic profiling for metagenomic samples by abundance-corrected minhash.
128 | - [vircov](https://github.com/esteinig/vircov) : Viral genome coverage evaluation for metagenomic diagnostics 🩸
129 |
130 | ##### pangenomics
131 |
132 | - [impg](https://github.com/pangenome/impg) : implicit pangenome graph
133 | - [panacus](https://github.com/marschall-lab/panacus) : Panacus is a tool for computing statistics for GFA-formatted pangenome graphs
134 |
135 | ##### phylogenomics
136 |
137 | - [nextclade](https://github.com/nextstrain/nextclade) : Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
138 | - [nwr](https://github.com/wang-q/nwr) : nwr is a command line tool for working with NCBI taxonomy, Newick files and assembly reports
139 | - [unicore](https://github.com/steineggerlab/unicore) : Universal and efficient core gene phylogeny with Foldseek and ProstT5
140 | - [segul](https://github.com/hhandika/segul) : An ultrafast and memory efficient tool for phylogenomics
141 |
142 | ##### proteomics
143 |
144 | - [align-cli](https://github.com/snijderlab/align-cli) : A CLI for pairwise alignment of sequences, using both normal and mass based alignment.
145 | - [daedalus](https://github.com/David-OConnor/daedalus) : Protein and molecule viewer
146 | - [folddisco](https://github.com/steineggerlab/folddisco) : Fast indexing and search of discontinuous motifs in protein structures
147 | - [foldmason](https://github.com/steineggerlab/foldmason) : Foldmason builds multiple alignments of large structure sets.
148 | - [sage](https://github.com/lazear/sage) : Proteomics search & quantification so fast that it feels like magic
149 |
150 | ##### rna
151 | - [oarfish](https://github.com/COMBINE-lab/oarfish) : long read RNA-seq quantification
152 | - [rnapkin](https://github.com/ukmrs/rnapkin) : drawing RNA secondary structure with style; instantly
153 | - [R2Dtool](https://github.com/comprna/R2Dtool) : R2Dtool is a set of genomics utilities for handling, integrating, and visualising isoform-mapped RNA feature data.
154 | - [squab](https://github.com/zaeleus/squab) : Alignment-based gene expression quantification
155 |
156 | ##### singlecell
157 |
158 | - [adview](https://github.com/JianYang-Lab/adview) : Adata Viewer: Head/Less/Shape h5ad file in terminal
159 | - [alevin-fry](https://github.com/COMBINE-lab/alevin-fry) : 🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
160 | - [cellranger](https://github.com/10XGenomics/cellranger) : 10x Genomics Single Cell Analysis
161 | - [precellar](https://github.com/regulatory-genomics/precellar) : Single-cell genomics preprocessing package
162 | - [proseg](https://github.com/dcjones/proseg) : Probabilistic cell segmentation for in situ spatial transcriptomics
163 | - [SnapATAC2](https://github.com/kaizhang/SnapATAC2) : Single-cell epigenomics analysis tools
164 |
165 | ##### slurm
166 |
167 | - [ssubmit](https://github.com/mbhall88/ssubmit) : Submit slurm sbatch jobs without the need to create a script
168 |
169 | ##### vcf
170 |
171 | - [echtvar](https://github.com/brentp/echtvar) : using all the bits for echt rapid variant annotation and filtering
172 | - [gvcf_norm](https://github.com/mlin/gvcf_norm) : gVCF allele normalizer
173 | - [mehari](https://github.com/varfish-org/mehari) : VEP-like tool for sequence ontology and HGVS annotation of VCF files
174 | - [vcf2parquet](https://github.com/natir/vcf2parquet) : Convert vcf to parquet
175 | - [vcfexpress](https://github.com/brentp/vcfexpress) : expressions on VCFs
176 |
177 |
178 | ##### gui
179 |
180 | - [plascad](https://github.com/David-OConnor/plascad) : Design software for plasmid (vector) and primer creation and validation. Edit plasmids, perform PCR-based cloning, digest and ligate DNA fragments, and display details about expressed proteins. Integrates with online resources like NCBI and PDB.
181 |
182 | ##### other
183 | - [biobear](https://github.com/wheretrue/biobear) : Work with bioinformatic files using Arrow, Polars, and/or DuckDB
184 | - [binseq](https://github.com/ArcInstitute/binseq) : A high efficiency binary format for sequencing data
185 | - [ggetrs](https://github.com/noamteyssier/ggetrs) : Efficient querying of biological databases
186 | - [htsget-rs](https://github.com/umccr/htsget-rs) : A server implementation of the htsget protocol for bioinformatics in Rust
187 | - [ibu](https://github.com/noamteyssier/ibu) : a rust library for high throughput binary encoding of genomic sequences
188 | - [scidataflow](https://github.com/vsbuffalo/scidataflow) : Command line scientific data management tool
189 | - [sufr](https://github.com/TravisWheelerLab/sufr) : Parallel Construction of Suffix Arrays in Rust
190 |
191 |
192 |
193 |
194 | ## Starchart
195 |
196 |
--------------------------------------------------------------------------------
/bam/best.md:
--------------------------------------------------------------------------------
1 | # best
2 |
3 | Bam Error Stats Tool (best): analysis of error types in aligned reads.
4 |
5 | `best` is used to assess the quality of reads after aligning them to a
6 | reference assembly.
7 |
8 | ## Features
9 |
10 | * Collect overall and per alignment stats
11 | * Distribution of indel lengths
12 | * Yield at different empirical Q-value thresholds
13 | * Bin per read stats to easily examine the distribution of errors for certain
14 | types of reads
15 | * Stats for regions specified by intervals (BED file, homopolymer regions,
16 | windows etc.)
17 | * Stats for quality scores vs empirical Q-values
18 | * Multithreading for speed
19 |
20 | ## Usage
21 |
22 | The [`best` Usage Guide](Usage.md) gives an overview of how to use `best`.
23 |
24 | ## Installing
25 |
26 | 1. Install [Rust](https://www.rust-lang.org/tools/install).
27 | 2. Clone this repository and navigate into the directory of this repository.
28 | 3. Run `cargo install --locked --path .`
29 | 4. Run `best input.bam reference.fasta prefix/path`
30 |
31 | This will generate stats files with the `prefix/path` prefix.
32 |
33 | ## Development
34 |
35 | ### Running
36 |
37 | 1. Install [Rust](https://www.rust-lang.org/tools/install).
38 | 2. Clone this repository and navigate into the directory of this repository.
39 | 3. Run `cargo build --release`
40 | 4. Run `cargo run --release -- input.bam reference.fasta prefix/path` or
41 | `target/release/best input.bam reference.fasta prefix/path`
42 |
43 | This will generate stats files with the `prefix/path` prefix.
44 |
45 | The built binary is located at `target/release/best`.
46 |
47 | ### Formatting
48 |
49 | ```
50 | cargo fmt
51 | ```
52 |
53 | ### Comparing
54 |
55 | Remember to pass the `-t 1` option to ensure that only one thread is used for
56 | testing. Best generally tries to ensure the order of outputs is deterministic
57 | with multiple threads, but the order of per-alignment stats is arbitrary unless
58 | only one thread is used.
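
For example, a deterministic test run might look like this (a minimal sketch; paths are placeholders):

```
# single thread keeps the order of per-alignment stats stable across runs
best -t 1 input.bam reference.fasta test_out/run_a
```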
59 |
60 | ### Disclaimer
61 |
62 | This is not an official Google product.
63 |
64 | The code is not intended for use in any clinical settings. It is not intended to be a medical device and is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.
65 |
66 | No representations or warranties are made with regards to the accuracy of results generated. User or licensee is responsible for verifying and validating accuracy when using this tool.
67 |
--------------------------------------------------------------------------------
/bam/modkit.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # Modkit
4 |
5 | A bioinformatics tool for working with modified bases from Oxford Nanopore. Specifically for converting modBAM
6 | to bedMethyl files using best practices, but also manipulating modBAM files and generating summary statistics.
7 | Detailed documentation and quick-start can be found in the [online documentation](https://nanoporetech.github.io/modkit/).
8 |
9 | ## Installation
10 |
11 | Pre-compiled binaries are provided for Linux from the [release page](https://github.com/nanoporetech/modkit/releases). We recommend the use of these in most circumstances.
12 |
13 | ### Building from source
14 |
15 | The provided packages should be used where possible. We understand that some users may wish to compile the software from its source code. To build `modkit` from source [cargo](https://www.rust-lang.org/learn/get-started) should be used.
16 |
17 | ```bash
18 | git clone https://github.com/nanoporetech/modkit.git
19 | cd modkit
20 | cargo install --path .
21 | # or
22 | cargo install --git https://github.com/nanoporetech/modkit.git
23 | ```
24 |
25 | ## Usage
26 |
27 | Modkit comprises a suite of tools for manipulating modified-base data stored in [BAM](http://www.htslib.org/) files. Modified base information is stored in the `MM` and `ML` tags (see section 1.7 of the [SAM tags](https://samtools.github.io/hts-specs/SAMtags.pdf) specification). These tags are produced by contemporary basecallers of data from Oxford Nanopore Technologies sequencing platforms.
28 |
29 | ### Constructing bedMethyl tables
30 |
31 | A primary use of `modkit` is to create summary counts of modified and unmodified bases in an extended [bedMethyl](https://www.encodeproject.org/data-standards/wgbs/) format. bedMethyl files tabulate the counts of base modifications from every sequencing read over each reference genomic position.
32 |
33 | In its simplest form `modkit` creates a bedMethyl file using the following:
34 |
35 | ```bash
36 | modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log
37 | ```
38 |
39 | No reference sequence is required. A single file (described [below](#description-of-bedmethyl-output)) with base count summaries will be created. The final argument here specifies an optional log file output.
40 |
41 | The program performs best-practices filtering and manipulation of the raw data stored in the input file. For further details see [filtering modified-base calls](./book/src/filtering.md).
42 |
43 | For user convenience the counting process can be modulated using several additional transforms and filters. The most basic of these is to report only counts from reference CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs in the reference:
44 |
45 | ```bash
46 | modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta
47 | ```
48 |
49 | The program also contains a range of presets which combine several options for ease of use. The `traditional` preset,
50 |
51 | ```bash
52 | modkit pileup path/to/reads.bam output/path/pileup.bed \
53 | --ref path/to/reference.fasta \
54 | --preset traditional
55 | ```
56 |
57 | performs three transforms:
58 |
59 | * restricts output to locations where there is a CG dinucleotide in
60 | the reference,
61 | * reports only C and 5mC counts, using procedures that take into account counts of other forms of cytosine modification (notably 5hmC), and
62 | * aggregates data across strands. The strand field of the output will be marked as '.', indicating that the strand information has been lost.
63 |
64 | Using this option is equivalent to running with the options:
65 |
66 | ```bash
67 | modkit pileup --cpg --ref <reference.fasta> --ignore h --combine-strands
68 | ```
69 |
70 | For more information on the individual options see the [Advanced Usage](./book/src/advanced_usage.md) help document.
71 |
72 | ## Description of bedMethyl output
73 |
74 | Below is a description of the bedMethyl columns generated by `modkit pileup`. A brief description of the
75 | bedMethyl specification can be found on [Encode](https://www.encodeproject.org/data-standards/wgbs/).
76 |
77 | ### Definitions:
78 |
79 | * N_mod - Number of calls passing filters that were classified as a residue with a specified base modification.
80 | * N_canonical - Number of calls passing filters that were classified as the canonical base rather than modified. The
81 | exact base must be inferred from the modification code. For example, if the modification code is `m` (5mC) then
82 | the canonical base is cytosine. If the modification code is `a`, the canonical base is adenine.
83 | * N_other_mod - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with
84 | `h` calls, 1 with a canonical call, and 2 with `m` calls. In the bedMethyl row for `h`, N_other_mod would be 2. In the
85 | `m` row, N_other_mod would be 3.
86 | * N_valid_cov - the valid coverage. N_valid_cov = N_mod + N_other_mod + N_canonical; also used as the `score` in the bedMethyl.
87 | * N_diff - Number of reads with a base other than the canonical base for this modification. For example, in a row
88 | for `h` the canonical base is cytosine; if there are 2 reads with C->A substitutions, N_diff will be 2.
89 | * N_delete - Number of reads with a deletion at this reference position.
90 | * N_fail - Number of calls where the probability of the call was below the threshold. The threshold can be
91 | set on the command line or computed from the data (usually failing the lowest 10th percentile of calls).
92 | * N_nocall - Number of reads aligned to this reference position, with the correct canonical base, but without a base
93 | modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a
94 | CG->CH substitution such that no modification call was produced by the basecaller.
95 |
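As a worked example using the counts from the definitions above: with 3 `h` calls, 2 `m` calls, and 1 canonical call at a single cytosine, N_valid_cov = 3 + 2 + 1 = 6 for both rows. The `h` row reports N_mod = 3, N_other_mod = 2, and a fraction modified (column 11 below) of 3/6 = 0.5; the `m` row reports N_mod = 2, N_other_mod = 3, and a fraction modified of 2/6 ≈ 0.33.
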
96 | ### bedMethyl column descriptions
97 |
98 | | column | name | description | type |
99 | | ------ | ----------------------------- | ------------------------------------------------------------------------------ | ----- |
100 | | 1 | chrom | name of reference sequence from BAM header | str |
101 | | 2 | start position | 0-based start position | int |
102 | | 3 | end position | 0-based exclusive end position | int |
103 | | 4 | modified base code | single letter code for modified base | str |
104 | | 5      | score                         | Equal to N_valid_cov.                                                           | int   |
105 | | 6 | strand | '+' for positive strand '-' for negative strand, '.' when strands are combined | str |
106 | | 7 | start position | included for compatibility | int |
107 | | 8 | end position | included for compatibility | int |
108 | | 9 | color | included for compatibility, always 255,0,0 | str |
109 | | 10     | N_valid_cov                   | See definitions above.                                                          | int   |
110 | | 11     | fraction modified             | N_mod / N_valid_cov                                                             | float |
111 | | 12     | N_mod                         | See definitions above.                                                          | int   |
112 | | 13     | N_canonical                   | See definitions above.                                                          | int   |
113 | | 14     | N_other_mod                   | See definitions above.                                                          | int   |
114 | | 15     | N_delete                      | See definitions above.                                                          | int   |
115 | | 16     | N_fail                        | See definitions above.                                                          | int   |
116 | | 17     | N_diff                        | See definitions above.                                                          | int   |
117 | | 18     | N_nocall                      | See definitions above.                                                          | int   |
118 |
119 | ## Description of columns in `modkit summary`:
120 |
121 | ### Totals table
122 |
123 | The lines of the totals table are prefixed with a `#` character.
124 |
125 | | row | name | description | type |
126 | | --- | ----------------------- | ----------------------------------------------------------------------- | ----- |
127 | | 1 | bases | comma-separated list of canonical bases with modification calls. | str |
128 | | 2 | total_reads_used | total number of reads from which base modification calls were extracted | int |
129 | | 3+ | count_reads_{base} | total number of reads that contained base modifications for {base} | int |
130 | | 4+ | filter_threshold_{base} | filter threshold used for {base} | float |
131 |
132 | ### Modification calls table
133 |
134 | The modification calls table follows immediately after the totals table.
135 |
136 | | column | name | description | type |
137 | | ------ | ---------- | ---------------------------------------------------------------------------------------- | ----- |
138 | | 1 | base | canonical base with modification call | char |
139 | | 2 | code | base modification code, or `-` for canonical | char |
140 | | 3 | pass_count | total number of passing (confidence >= threshold) calls for the modification in column 2 | int |
141 | | 4 | pass_frac | fraction of passing (>= threshold) calls for the modification in column 2 | float |
142 | | 5 | all_count | total number of calls for the modification code in column 2 | int |
143 | | 6 | all_frac | fraction of all calls for the modification in column 2 | float |
144 |
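The tables above are produced by the `modkit summary` subcommand; a minimal invocation might look like this (path is a placeholder):

```bash
modkit summary path/to/reads.bam
```
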
145 | ## Advanced usage examples
146 |
147 | For complete usage instructions please see the command-line help of the program or the [Advanced usage](./book/src/advanced_usage.md) help documentation. Some more commonly required examples are provided below.
148 |
149 | To combine multiple base modification calls into one, for example to combine basecalls for both 5hmC and 5mC into a count for "all cytosine modifications" (with code `C`) the `--combine-mods` option can be used:
150 |
151 | ```bash
152 | modkit pileup path/to/reads.bam output/path/pileup.bed --combine-mods
153 | ```
154 |
155 | In standard usage the `--preset traditional` option can be used as outlined in the [Usage](#usage) section. By more directly specifying individual options we can perform something similar without loss of information for 5hmC data stored in the input file:
156 |
157 | ```bash
158 | modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta \
159 | --combine-strands
160 | ```
161 |
162 | To produce a bedGraph file for each modification in the BAM file the `--bedgraph` option can be given. Counts for the positive and negative strands will be put in separate files.
163 |
164 | ```bash
165 | modkit pileup path/to/reads.bam output/directory/path --bedgraph [--prefix <string>]
166 | ```
167 |
168 | The `--prefix [str]` option allows specification of a prefix for the output file names.
169 |
170 | **Licence and Copyright**
171 |
172 | (c) 2023 Oxford Nanopore Technologies Plc.
173 |
174 | Modkit is distributed under the terms of the Oxford Nanopore Technologies, Ltd. Public License, v. 1.0.
175 | If a copy of the License was not distributed with this file, You can obtain one at http://nanoporetech.com
176 |
--------------------------------------------------------------------------------
/bam/rustybam.md:
--------------------------------------------------------------------------------
1 | # rustybam
2 |
3 | [](https://github.com/mrvollger/rustybam/actions)
4 | [](https://github.com/mrvollger/rustybam/actions)
5 | [](https://github.com/mrvollger/rustybam/actions)
6 |
7 | [](https://anaconda.org/bioconda/rustybam)
8 | [](https://anaconda.org/bioconda/rustybam)
9 |
10 | [](https://crates.io/crates/rustybam)
11 | [](https://crates.io/crates/rustybam)
12 |
13 | [](https://zenodo.org/badge/latestdoi/351639424)
14 |
15 | `rustybam` is a bioinformatics toolkit written in the `rust` programming language, focused on manipulation of alignment (`bam` and `PAF`), annotation (`bed`), and sequence (`fasta` and `fastq`) files. If your alignment is in a different format, check whether [wgatools](https://github.com/wjwei-handsome/wgatools) can convert it for you!
16 |
17 | ## What can rustybam do?
18 |
19 | Here is a commented example that highlights some of the better features of `rustybam`, and demonstrates how each result can be read directly into another subcommand.
20 |
21 | ```bash
22 | rb trim-paf .test/asm_small.paf `#trims back alignments that align the same query sequence more than once` \
23 | | rb break-paf --max-size 100 `#breaks the alignment into smaller pieces on indels of 100 bases or more` \
24 | | rb orient `#orients each contig so that the majority of bases are forward aligned` \
25 | | rb liftover --bed <(printf "chr22\t12000000\t13000000\n") `#subsets and trims the alignment to 1 Mbp of chr22.` \
26 | | rb filter --paired-len 10000 `#filters for query sequences that have at least 10,000 bases aligned to a target across all alignments.` \
27 | | rb stats --paf `#calculates statistics from the trimmed paf file` \
28 | | less -S
29 | ```
30 |
31 | ## Usage
32 |
33 | ```shell
34 | rustybam [OPTIONS]
35 | ```
36 |
37 | or
38 |
39 | ```shell
40 | rb [OPTIONS]
41 | ```
42 |
43 | ### Subcommands
44 |
45 | The full manual of subcommands can be found on the [docs](https://docs.rs/rustybam/latest/rustybam/cli/enum.Commands.html).
46 |
47 | ```shell
48 | SUBCOMMANDS:
49 | stats Get percent identity stats from a sam/bam/cram or PAF
50 | bed-length Count the number of bases in a bed file [aliases: bedlen, bl, bedlength]
51 | filter Filter PAF records in various ways
52 | invert Invert the target and query sequences in a PAF along with the CIGAR string
53 | liftover Liftover target sequence coordinates onto query sequence using a PAF
54 | trim-paf Trim paf records that overlap in query sequence [aliases: trim, tp]
55 | orient Orient paf records so that most of the bases are in the forward direction
56 | break-paf Break PAF records with large indels into multiple records (useful for
57 | SafFire) [aliases: breakpaf, bp]
58 | paf-to-sam Convert a PAF file into a SAM file. Warning, all alignments will be marked as
59 | primary! [aliases: paftosam, p2s, paf2sam]
60 | fasta-split Reads in a fasta from stdin and divides into files (can compress by adding
61 | .gz) [aliases: fastasplit, fasplit]
62 | fastq-split Reads in a fastq from stdin and divides into files (can compress by adding
63 | .gz) [aliases: fastqsplit, fqsplit]
64 | get-fasta Mimic bedtools getfasta but allow for bgzip in both bed and fasta inputs
65 | [aliases: getfasta, gf]
66 | nucfreq Get the frequencies of each bp at each position
67 | repeat Report the longest exact repeat length at every position in a fasta
68 | suns Extract the intervals in a genome (fasta) that are made up of SUNs
69 | help Print this message or the help of the given subcommand(s)
70 | ```
71 |
72 | ## Install
73 |
74 | ### conda
75 |
76 | ```shell
77 | mamba install -c bioconda rustybam
78 | ```
79 |
80 | ### cargo
81 |
82 | ```shell
83 | cargo install rustybam
84 | ```
85 |
### Pre-compiled binaries

Download from [releases](https://github.com/mrvollger/rustybam/releases) (may be slower than locally compiled versions).
89 |
90 | ### Source
91 |
92 | ```shell
93 | git clone https://github.com/mrvollger/rustybam.git
94 | cd rustybam
95 | cargo build --release
96 | ```
97 |
98 | and the executables will be built here:
99 |
100 | ```shell
101 | target/release/{rustybam,rb}
102 | ```
103 |
104 | ## Examples
105 |
106 | ### PAF or BAM statistics
107 |
108 | For BAM files with extended cigar operations we can calculate statistics about the alignment and report them in BED format.
109 |
110 | ```shell
111 | rustybam stats {input.bam} > {stats.bed}
112 | ```
113 |
114 | The same can be done with PAF files as long as they are generated with `-c --eqx`.
115 |
116 | ```shell
117 | rustybam stats --paf {input.paf} > {stats.bed}
118 | ```
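
As a sketch, input with the required extended CIGAR operations can be generated with minimap2 (paths are placeholders):

```shell
# BAM input for `rustybam stats`: --eqx emits =/X CIGAR operators
minimap2 -a --eqx ref.fa asm.fa | samtools sort -o input.bam
# PAF input for `rustybam stats --paf`: -c writes the CIGAR into the PAF
minimap2 -c --eqx ref.fa asm.fa > input.paf
```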
119 |
120 | ### PAF liftovers
121 |
122 | > I have a `PAF` and I want to subset it for just a particular region in the reference.
123 |
124 | With `rustybam` it's easy:
125 |
126 | ```shell
127 | rustybam liftover \
128 | --bed <(printf "chr1\t0\t250000000\n") \
129 | input.paf > trimmed.paf
130 | ```
131 |
132 | > But I also want the alignment statistics for the region.
133 |
134 | No problem, `rustybam liftover` does not just trim the coordinates but also the CIGAR
135 | so it is ready for `rustybam stats`:
136 |
137 | ```shell
138 | rustybam liftover \
139 | --bed <(printf "chr1\t0\t250000000\n") \
140 | input.paf \
141 | | rustybam stats --paf \
142 | > trimmed.stats.bed
143 | ```
144 |
145 | > Okay, but Evan asked for an "align slider" so I need to realign in chunks.
146 |
147 | No need, just make your `bed` query to `rustybam liftover` a set of sliding windows
148 | and it will do the rest.
149 |
150 | ```shell
151 | rustybam liftover \
152 | --bed <(bedtools makewindows -w 100000 \
153 | <(printf "chr1\t0\t250000000\n") \
154 | ) \
155 | input.paf \
156 | | rustybam stats --paf \
157 | > trimmed.stats.bed
158 | ```
159 |
160 | You can also use `rustybam breakpaf` to break up PAF records at indels above a certain size to
161 | get more "miropeats"-like intervals.
162 |
163 | ```shell
164 | rustybam breakpaf --max-size 1000 input.paf \
165 | | rustybam liftover \
166 | --bed <(printf "chr1\t0\t250000000\n") \
167 | | ./rustybam stats --paf \
168 | > trimmed.stats.bed
169 | ```
170 |
171 | > Yeah but how do I visualize the data?
172 |
173 | Try out
174 | [SafFire](https://mrvollger.github.io/SafFire/)!
175 |
176 | ### Align once
177 |
178 | At the boundaries of CNVs and inversions minimap2 may align the same section of query sequence to multiple stretches of the target sequence. This utility uses the CIGAR strings of PAF alignments (which must be generated with `--eqx`) to determine an optimal split of the alignments such that no query base is aligned more than once. To do this the whole PAF file is loaded into memory and then overlaps are removed, starting with the largest overlapping interval and iterating.
179 |
180 | ```bash
181 | rb trim-paf {input.paf} > {trimmed.paf}
182 | ```
183 |
184 | Here is an example from the NOTCH2NL region comparing CHM1 against CHM13 before trimming:
185 | 
186 |
187 | and after trimming
188 | 
189 |
190 | ### Split fastx files
191 |
192 | Split a fasta file between `stdout` and two other files, one compressed and one uncompressed.
193 |
194 | ```shell
195 | cat {input.fasta} | rustybam fasta-split two.fa.gz three.fa
196 | ```
197 |
198 | Split a fastq file between `stdout` and two other files, one compressed and one uncompressed.
199 |
200 | ```shell
201 | cat {input.fastq} | rustybam fastq-split two.fq.gz three.fq
202 | ```
203 |
204 | ### Extract from a fasta
205 |
206 | This tool is designed to mimic `bedtools getfasta`, but it allows the fasta to be `bgzipped`.
207 |
208 | ```shell
209 | samtools faidx {seq.fa(.gz)}
210 | rb get-fasta --name --strand --bed {regions.of.interest.bed} --fasta {seq.fa(.gz)}
211 | ```
212 |
213 | ## TODO
214 |
215 | - [x] Add a `bedtools getfasta` like operation that actually works with bgzipped input.
216 | - [ ] implement bed12/split
217 | - [ ] Allow sam or paf for operations:
218 |   - [x] make a sam header from a PAF file
219 |   - [x] convert sam record to paf record
220 |   - [x] convert paf record to sam record
221 |   - [ ] make tools seamlessly work with sam and paf
222 | - [ ] Add `D4` for Nucfreq.
223 | - [ ] Finish implementing `suns`.
224 | - [ ] Allow multiple input files in `bed-length`
225 | - [ ] Start keeping a changelog
--------------------------------------------------------------------------------
/csv/csview.md:
--------------------------------------------------------------------------------
1 | # 📠 csview
2 |
3 | A high-performance csv viewer with cjk/emoji support.
4 |
21 | ### Features
22 |
23 | * Small and *fast* (see [benchmarks](#benchmark) below).
24 | * Memory efficient.
25 | * Correctly align [CJK](https://en.wikipedia.org/wiki/CJK_characters) and emoji characters.
26 | * Support `tsv` and custom delimiters.
27 | * Support different styles, including markdown table.
28 |
29 | ### Usage
30 | ```
31 | $ cat example.csv
32 | Year,Make,Model,Description,Price
33 | 1997,Ford,E350,"ac, abs, moon",3000.00
34 | 1999,Chevy,"Venture ""Extended Edition""","",4900.00
35 | 1999,Chevy,"Venture ""Extended Edition, Large""",,5000.00
36 | 1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof",4799.00
37 |
38 | $ csview example.csv
39 | ┌──────┬───────┬───────────────────────────────────┬───────────────────────────┬─────────┐
40 | │ Year │ Make │ Model │ Description │ Price │
41 | ├──────┼───────┼───────────────────────────────────┼───────────────────────────┼─────────┤
42 | │ 1997 │ Ford │ E350 │ ac, abs, moon │ 3000.00 │
43 | │ 1999 │ Chevy │ Venture "Extended Edition" │ │ 4900.00 │
44 | │ 1999 │ Chevy │ Venture "Extended Edition, Large" │ │ 5000.00 │
45 | │ 1996 │ Jeep │ Grand Cherokee │ MUST SELL! air, moon roof │ 4799.00 │
46 | └──────┴───────┴───────────────────────────────────┴───────────────────────────┴─────────┘
47 |
48 | $ head /etc/passwd | csview -H -d:
49 | ┌────────────────────────┬───┬───────┬───────┬────────────────────────────┬─────────────────┐
50 | │ root │ x │ 0 │ 0 │ │ /root │
51 | │ bin │ x │ 1 │ 1 │ │ / │
52 | │ daemon │ x │ 2 │ 2 │ │ / │
53 | │ mail │ x │ 8 │ 12 │ │ /var/spool/mail │
54 | │ ftp │ x │ 14 │ 11 │ │ /srv/ftp │
55 | │ http │ x │ 33 │ 33 │ │ /srv/http │
56 | │ nobody │ x │ 65534 │ 65534 │ Nobody │ / │
57 | │ dbus │ x │ 81 │ 81 │ System Message Bus │ / │
58 | │ systemd-journal-remote │ x │ 981 │ 981 │ systemd Journal Remote │ / │
59 | │ systemd-network │ x │ 980 │ 980 │ systemd Network Management │ / │
60 | └────────────────────────┴───┴───────┴───────┴────────────────────────────┴─────────────────┘
61 | ```
62 |
63 | Run `csview --help` to view detailed usage.
64 |
65 | ### Installation
66 |
67 | #### On Arch Linux
68 |
69 | `csview` is available in the Arch User Repository. To install it from [AUR](https://aur.archlinux.org/packages/csview):
70 |
71 | ```
72 | yay -S csview
73 | ```
74 |
75 | #### On macOS
76 |
77 | You can install `csview` with Homebrew:
78 |
79 | ```
80 | brew install csview
81 | ```
82 |
83 | #### On NetBSD
84 |
85 | `csview` is available from the main pkgsrc repositories. To install, simply run
86 |
87 | ```
88 | pkgin install csview
89 | ```
90 |
91 | or, if you prefer to build from source using [pkgsrc](https://pkgsrc.se/textproc/csview) on any of the supported platforms:
92 |
93 | ```
94 | cd /usr/pkgsrc/textproc/csview
95 | make install
96 | ```
97 |
98 | #### On Windows
99 |
100 | You can install `csview` with [Scoop](https://scoop.sh/):
101 | ```
102 | scoop install csview
103 | ```
104 |
105 | #### From binaries
106 |
107 | Pre-built versions of `csview` for various architectures are available at [Github release page](https://github.com/wfxr/csview/releases).
108 |
109 | *Note that you can try the `musl` version (which is statically linked) if you run into dependency-related errors.*
110 |
111 | #### From source
112 |
113 | `csview` is also published on [crates.io](https://crates.io). If you have the latest Rust toolchain installed, you can use `cargo` to install it from source:
114 |
115 | ```
116 | cargo install --locked csview
117 | ```
118 |
119 | If you want the latest version, clone this repository and run `cargo build --release`.
120 |
121 | ### Benchmark
122 |
123 | - [small.csv](https://gist.github.com/wfxr/567e890d4db508b3c7630a96b703a57e#file-action-csv) (10 rows, 4 cols, 695 bytes):
124 |
125 | | Tool | Command | Mean Time | Min Time | Memory |
126 | |:----------------------------------------------------------------------------------------:|---------------------------|----------:|----------:|----------:|
127 | | [xsv](https://github.com/BurntSushi/xsv/tree/0.13.0) | `xsv table small.csv` | 2.0ms | 1.8ms | 3.9mb |
128 | | [csview](https://github.com/wfxr/csview/tree/90ff90e26c3e4c4c37818d717555b3e8f90d27e3) | `csview small.csv` | **0.3ms** | **0.1ms** | **2.4mb** |
129 | | [column](https://github.com/util-linux/util-linux/blob/stable/v2.37/text-utils/column.c) | `column -s, -t small.csv` | 1.3ms | 1.1ms | **2.4mb** |
130 | | [csvlook](https://github.com/wireservice/csvkit/tree/1.0.6) | `csvlook small.csv` | 148.1ms | 142.4ms | 27.3mb |
131 |
132 | - [medium.csv](https://gist.github.com/wfxr/567e890d4db508b3c7630a96b703a57e#file-sample-csv) (10,000 rows, 10 cols, 624K bytes):
133 |
134 | | Tool | Command | Mean Time | Min Time | Memory |
135 | |:----------------------------------------------------------------------------------------:|---------------------------|-----------:|-----------:|----------:|
136 | | [xsv](https://github.com/BurntSushi/xsv/tree/0.13.0) | `xsv table medium.csv` | 0.031s | 0.029s | 4.4mb |
137 | | [csview](https://github.com/wfxr/csview/tree/90ff90e26c3e4c4c37818d717555b3e8f90d27e3) | `csview medium.csv` | **0.017s** | **0.016s** | **2.8mb** |
138 | | [column](https://github.com/util-linux/util-linux/blob/stable/v2.37/text-utils/column.c)  | `column -s, -t medium.csv` | 0.052s     | 0.050s     | 9.9mb     |
139 | | [csvlook](https://github.com/wireservice/csvkit/tree/1.0.6) | `csvlook medium.csv` | 2.664s | 2.617s | 46.8mb |
140 |
141 | - `large.csv` (1,000,000 rows, 10 cols, 61M bytes, generated by concatenating [medium.csv](https://gist.github.com/wfxr/567e890d4db508b3c7630a96b703a57e#file-sample-csv) 100 times):
142 |
143 | | Tool | Command | Mean Time | Min Time | Memory |
144 | |:----------------------------------------------------------------------------------------:|---------------------------|-----------:|-----------:|----------:|
145 | | [xsv](https://github.com/BurntSushi/xsv/tree/0.13.0) | `xsv table large.csv` | 2.912s | 2.820s | 4.4mb |
146 | | [csview](https://github.com/wfxr/csview/tree/90ff90e26c3e4c4c37818d717555b3e8f90d27e3) | `csview large.csv` | **1.686s** | **1.665s** | **2.8mb** |
147 | | [column](https://github.com/util-linux/util-linux/blob/stable/v2.37/text-utils/column.c)  | `column -s, -t large.csv` | 5.777s     | 5.759s     | 767.6mb   |
148 | | [csvlook](https://github.com/wireservice/csvkit/tree/1.0.6) | `csvlook large.csv` | 20.665s | 20.549s | 1105.7mb |
149 |
150 | ### F.A.Q.
151 |
152 | ---
153 | #### We already have [xsv](https://github.com/BurntSushi/xsv), why build a new tool instead of contributing to it?
154 |
155 | `xsv` is great. But it's aimed at analyzing and manipulating csv data.
156 | `csview` is designed for formatting and viewing. See also: [xsv/issues/156](https://github.com/BurntSushi/xsv/issues/156)
157 |
158 | ---
159 | #### I encountered UTF-8 related errors. How do I solve them?
160 |
161 | The file may use a non-UTF8 encoding. You can check the file encoding using the `file` command:
162 |
163 | ```
164 | $ file -i a.csv
165 | a.csv: application/csv; charset=iso-8859-1
166 | ```
167 | And then convert it to `utf8`:
168 |
169 | ```
170 | $ iconv -f iso-8859-1 -t UTF8//TRANSLIT a.csv -o b.csv
171 | $ csview b.csv
172 | ```
173 |
174 | Or do it in place:
175 |
176 | ```
177 | $ iconv -f iso-8859-1 -t UTF8//TRANSLIT a.csv | csview
178 | ```
179 |
180 | ### Credits
181 |
182 | * [csv-rust](https://github.com/BurntSushi/rust-csv)
183 | * [prettytable-rs](https://github.com/phsym/prettytable-rs)
184 | * [structopt](https://github.com/TeXitoi/structopt)
185 |
186 | ### License
187 |
188 | `csview` is distributed under the terms of both the MIT License and the Apache License 2.0.
189 |
190 | See the [LICENSE-APACHE](LICENSE-APACHE) and [LICENSE-MIT](LICENSE-MIT) files for license details.
--------------------------------------------------------------------------------
/csv/madato.md:
--------------------------------------------------------------------------------
1 |
2 | # madato [![Build Status]][travis] [![Latest Version]][crates.io]
3 |
4 | [Build Status]: https://travis-ci.org/inosion/madato.svg?branch=master
5 | [travis]: https://travis-ci.org/inosion/madato
6 | [Latest Version]: https://img.shields.io/crates/v/madato.svg
7 | [crates.io]: https://crates.io/crates/madato
8 |
9 | ***madato is a library and command line tool for working with tabular data and Markdown***
10 |
11 | Windows, Mac and Linux
12 |
13 | Converts XLSX and ODS Spreadsheets to
14 | - JSON
15 | - YAML
16 | - Markdown
17 |
18 | ### TL;DR
19 |
20 | ```
21 | madato table -t XLSX -o JSON --sheetname Sheet2 path/to/workbook.xlsx
22 | madato table -t XLSX -o MD --sheetname Sheet2 path/to/workbook.xlsx
23 | madato table -t XLSX -o YAML --sheetname 'Annual Sales' path/to/workbook.xlsx
24 | madato table -t XLSX -o YAML path/to/workbook.ods
25 | madato table -t YAML -o MD path/to/workbook.yaml
26 | ```
27 |
28 | --------------------------------------------------------------------------------
29 |
30 | The tool is primarily centered on getting tabular data (spreadsheets, CSVs)
31 | into Markdown.
32 |
33 | It currently supports:
34 | - Reading an XLS*, ODS Spreadsheet or YAML file `-- to -->` Markdown
35 | - Reading an XLS*, ODS Spreadsheet `-- to -->` Markdown
36 |
37 | When generating the output:
38 | - Filter the Rows using basic Regex over Key/Value pairs
39 | - Limit the columns to named headings
40 | - Re-order the columns, or repeat them using the same column feature
41 | - Only generate a table for a named "sheet" (applicable for the XLS/ODS formats)
42 |
43 | Madato is:
44 | - Command Line Tool (Windows, Mac, Linux) - good for CI/CD preprocessing
45 | - Rust Library - Good for integration into Rust Markdown tooling
46 | - Node JS WASM API - To be used later for Atom and VSCode Extensions
47 |
48 | Madato expects that every column has a heading. That is, the first row contains headings/column names. If a cell in that first row is blank, it will create `NULL0..NULLn` entries as required.
49 |
50 | ## Examples
51 |
52 | * Extract the `3rd Sheet` sheet from an MS Excel Document
53 | ```
54 | 08:39 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname "3rd Sheet"
55 | |col1|col2| col3 |col4 | col5 |NULL5|
56 | |----|----|------|-----|-------------------------------------------------------|-----|
57 | | 1 |that| are |wider| value ‘aaa’ is in the next cell, but has no heading | aaa |
58 | |than|the |header| row | (open the spreadsheet to see what I mean) | |
59 | ```
60 |
61 | * Extract and reorder just 3 Columns
62 | ```
63 | 08:42 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname "3rd Sheet" -c col2 -c col3 -c NULL5
64 | |col2| col3 |NULL5|
65 | |----|------|-----|
66 | |that| are | aaa |
67 | |the |header| |
68 | ```
69 | * Pull from the `second_sheet` sheet
70 | * Only extract `Heading 4` column
71 | * Use a Filter, where `Heading 4` values must only have a letter or number.
72 |
73 | ```
74 | 08:48 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname second_sheet -c "Heading 4" -f 'Heading 4=[a-zA-Z0-9]'
75 | | Heading 4 |
76 | |--------------------------|
77 | | << empty |
78 | |*Some Bolding in Markdown*|
79 | | `escaped value` foo |
80 | | 0.22 |
81 | | #DIV/0! |
82 | | “This cell has quotes” |
83 | | 😕 ← Emoticon |
84 | ```
85 |
86 | * Filtering on a Column, ensuring that a "+" is there in `Trend` Column
87 |
88 | ```
89 | 09:00 $ target/debug/madato table --type xlsx test/sample_multi_sheet.xlsx --sheetname Sheet1 -c Rank -c Language -c Trend -f "Trend=\+"
90 | | Rank | Language |Trend |
91 | |------------------------------------------------------|------------|------|
92 | | 1 | Python |+5.5 %|
93 | | 3 | Javascript |+0.2 %|
94 | | 7 | R |+0.0 %|
95 | | 12 | TypeScript |+0.3 %|
96 | | 16 | Kotlin |+0.5 %|
97 | | 17 | Go |+0.3 %|
98 | | 20 | Rust |+0.0 %|
99 | ```
100 |
101 | ## Internals
102 | madato uses:
103 | - [calamine](https://github.com/tafia/calamine) for reading XLS and ODS sheets
104 | - [wasm bindings](https://github.com/rustwasm/wasm-bindgen) to create JS API versions of the Rust API
105 | - [regex](https://crates.io/crates/regex) for filtering, and [serde](https://serde.rs) for serialisation.
106 |
107 | ## Tips
108 |
109 | * I have found that copying the table I want from a website (HTML) into a spreadsheet, then running it through `madato`, gives an excellent Markdown version of the original.
110 |
111 | ## Rust API
112 |
113 | ## JS API
114 |
115 | ## More Commandline
116 |
117 | ### Sheet List
118 |
119 | You can list the "sheets" of an XLS*, ODS file with
120 |
121 | ```
122 | $ madato sheetlist test/sample_multi_sheet.xlsx
123 | Sheet1
124 | second_sheet
125 | 3rd Sheet
126 | ```
127 |
128 | ### YAML to Markdown
129 |
130 | Madato reads a "YAML" file in the same way it reads a Spreadsheet.
131 | This is useful for keeping tabular data in your source repository, rather than
132 | as an XLS binary.
133 |
134 | `madato table -t yaml test/www-sample/test.yml`
135 |
136 | ```
137 | |col3| col4 | data1 | data2 |
138 | |----|-------|---------|--------------------|
139 | |100 |gar gar|somevalue|someother value here|
140 | |190x| | that | nice |
141 | |100 | ta da | this |someother value here|
142 | ```
143 |
144 | *Please see the [test/test.yml](test/test.yml) file for the expected layout of this file*
145 |
146 | ### Excel/ODS to YAML
147 |
148 | Changing the output from the default "Markdown (MD)" to "YAML", you get a YAML file of the Spreadsheet.
149 |
150 | ```
151 | madato table -t xlsx test/sample_multi_sheet.xlsx -s Sheet1 -o yaml
152 | ---
153 | - Rank: "1"
154 | Change: ""
155 | Language: Python
156 | Share: "23.59 %"
157 | Trend: "+5.5 %"
158 | - Rank: "2"
159 | Change: ""
160 | Language: Java
161 | Share: "22.4 %"
162 | Trend: "-0.5 %"
163 | - Rank: "3"
164 | Change: ""
165 | Language: Javascript
166 | Share: "8.49 %"
167 | ...
168 | ```
169 |
170 | If you omit the sheet name, it will dump all sheets into an ordered map of arrays of maps.
171 |
172 |
173 | ### Features
174 |
175 | * `[x]` Reads a formatted YAML string and renders a Markdown Table
176 | * `[x]` Can take an optional list of column headings, and only display those from the table (filtering out other columns present)
177 | * `[x]` Native Binary Command Line (windows, linux, osx)
178 | * `[x]` Read an XLSX file and produce a Markdown Table
179 | * `[x]` Read an ODS file and produce a Markdown Table
180 | * `[ ]` Read a CSV, TSV, PSV (etc) file and produce a Markdown Table
181 | * `[ ]` Support Nested Structures in the YAML input
182 | * `[ ]` Read a Markdown File, and select the "table" and turn it back into YAML
183 |
184 | ### Future Goals
185 | * Finish the testing and publishing of the JS WASM Bindings (PS: it works;
186 |   see [test/www-sample](test/www-sample) and the [Makefile](Makefile)).
187 | * Embed the "importing" of YAML, CSV and XLS* files into the `mume` Markdown Preview Enhanced Plugin. [https://shd101wyy.github.io/markdown-preview-enhanced/](https://shd101wyy.github.io/markdown-preview-enhanced/) So we can have Awesome Markdown Documents.
188 | * Provide a `PreRenderer` for [rust-lang-nursery/mdBook](https://github.com/rust-lang-nursery/mdBook) to "import" MD tables from files.
189 |
190 | ### Known Issues
191 | * A Spreadsheet Cell with a Date will come out as the "magic" Excel date number :-( - https://github.com/tafia/calamine/issues/116
192 |
193 | ## License
194 |
195 | Madato is licensed under either of
196 |
197 | * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
198 | http://www.apache.org/licenses/LICENSE-2.0)
199 | * MIT license ([LICENSE-MIT](LICENSE-MIT) or
200 | http://opensource.org/licenses/MIT)
201 |
202 | at your option.
203 |
204 | ### Contribution
205 |
206 | Unless you explicitly state otherwise, any contribution intentionally submitted
207 | for inclusion in Madato by you, as defined in the Apache-2.0 license, shall be
208 | dual licensed as above, without any additional terms or conditions.
--------------------------------------------------------------------------------
/csv/xtab.md:
--------------------------------------------------------------------------------
1 | # xtab
2 |
3 | 🦀 CSV command line utilities
4 |
5 | ## install
6 |
7 | ##### step 1: install cargo first
8 |
9 | ```bash
10 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
11 | ```
12 |
13 | ##### step 2: install xtab
14 |
15 | ```bash
16 | cargo install xtab
17 | # or
18 |
19 | git clone https://github.com/sharkLoc/xtab.git
20 | cd xtab
21 | cargo b --release
22 | # mv target/release/xtab to anywhere you want
23 | ```
24 |
25 | ## usage
26 |
27 | ```bash
28 | xtab -- CSV command line utilities
29 | Version: 0.0.8
30 |
31 | Authors: sharkLoc
32 | Source code: https://github.com/sharkLoc/xtab.git
33 |
34 | xtab supports reading and writing gzip/bzip2/xz format file.
35 | Compression level:
36 | format range default crate
37 | gzip 1-9 6 https://crates.io/crates/flate2
38 | bzip2 1-9 6 https://crates.io/crates/bzip2
39 | xz 1-9 6 https://crates.io/crates/xz2
40 |
41 |
42 | Usage: xtab [OPTIONS] [CSV]
43 |
44 | Commands:
45 | addheader Set new header for CSV file [aliases: ah]
46 | csv2xlsx Convert CSV/TSV files to XLSX file [aliases: c2x]
47 | dim Dimensions of CSV file
48 | drop Drop or Select CSV fields by columns index
49 | flatten flattened view of CSV records [aliases: flat]
50 | freq Build frequency table of selected column in CSV data
51 | head Print first N records from CSV file
52 | pretty Convert CSV to a readable aligned table [aliases: prt]
53 | replace Replace data of matched fields
54 | reverse Reverses rows of CSV data [aliases: rev]
55 | sample Randomly select rows from CSV file using reservoir sampling
56 | search Applies the regex to each field individually and shows only matching rows
57 | slice Slice rows from a part of a CSV file
58 | tail Print last N records from CSV file
59 | transpose Transpose CSV data [aliases: trans]
60 | uniq Unique data with keys
61 | xlsx2csv Convert XLSX to CSV format [aliases: x2c]
62 | view Show CSV file content
63 | help Print this message or the help of the given subcommand(s)
64 |
65 | Global Arguments:
66 | -d, --delimiter Set delimiter for input csv file, e.g., in linux -d $'\t' for tab, in powershell -d `t for tab [default: ,]
67 | -D, --out-delimiter Set delimiter for output CSV file, e.g., in linux -D $'\t' for tab, in powershell -D `t for tab [default: ,]
68 | --log If file name specified, write log message to this file, or write to stderr
69 | --compress-level Set compression level 1 (compress faster) - 9 (compress better) for gzip/bzip2/xz output file, just work with option -o/--out [default: 6]
70 | -v, --verbosity... control verbosity of logging, [-v: Error, -vv: Warn, -vvv: Info, -vvvv: Debug, -vvvvv: Trace, default: Debug]
71 | [CSV] Input CSV file name, if file not specified read data from stdin
72 |
73 | Global FLAGS:
74 | -H, --no-header If set, the first row is treated as a special header row, and the original header row excluded from output
75 | -q, --quiet Be quiet and do not show any extra information
76 | -h, --help prints help information
77 | -V, --version prints version information
78 |
79 | Use "xtab help [command]" for more information about a command
80 | ```
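Two quick sketches based on the help text above; the placement of the global `-d` option relative to the subcommand is an assumption, so check `xtab help [command]` for the exact syntax:

```bash
# Report the dimensions of a gzipped CSV (compressed input is read transparently)
xtab dim data.csv.gz

# Render a tab-delimited file as an aligned, readable table
xtab -d $'\t' pretty data.tsv
```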
81 |
--------------------------------------------------------------------------------
/dna/fakit.md:
--------------------------------------------------------------------------------
1 | # fakit
2 |
3 | 🦀 a simple program for fasta file manipulation
4 |
5 | ## install latest version
6 |
7 | ```bash
8 | cargo install --git https://github.com/sharkLoc/fakit.git
9 | ```
10 |
11 | ## install
12 |
13 | ```bash
14 | cargo install fakit
15 | ```
16 |
17 | ## usage
18 |
19 | ```bash
20 | Fakit: A simple program for fasta file manipulation
21 |
22 | Version: 0.3.6
23 |
24 | Authors: sharkLoc
25 | Source code: https://github.com/sharkLoc/fakit.git
26 |
27 | Fakit supports reading and writing gzip (.gz) format.
28 | Bzip2 (.bz2) and xz (.xz) format is supported since v0.3.0.
29 | Under the same compression level, xz has the highest compression ratio but consumes more time.
30 |
31 | Compression level:
32 | format range default crate
33 | gzip 1-9 6 https://crates.io/crates/flate2
34 | bzip2 1-9 6 https://crates.io/crates/bzip2
35 | xz 1-9 6 https://crates.io/crates/xz2
36 |
37 |
38 | Usage: fakit [OPTIONS]
39 |
40 | Commands:
41 | topn get first N records from fasta file [aliases: head]
42 | tail get last N records from fasta file
43 | fa2fq convert fasta to fastq file
44 | faidx create index and random access to fasta files [aliases: fai]
45 | flatten flatten fasta sequences [aliases: flat]
46 | range print fasta records in a range
47 | rename rename sequence id in fasta file
48 | reverse get a reverse-complement of fasta file [aliases: rev]
49 | window stat dna fasta gc content by sliding windows [aliases: slide]
50 | grep grep fasta sequences by name/seq
51 | seq convert all bases to lower/upper case, filter by length
52 | sort sort fasta file by name/seq/gc/length
53 | search search subsequences/motifs from fasta file
54 | shuffle shuffle fasta sequences
55 | size report fasta sequence base count
56 | subfa subsample sequences from big fasta file
57 | split split fasta file by sequence id
58 | split2 split fasta file by sequence number
59 | summ simple summary for dna fasta files [aliases: stat]
60 | codon show codon table and amino acid name
61 | help Print this message or the help of the given subcommand(s)
62 |
63 | Global Arguments:
64 | -w, --line-width line width when outputting fasta sequences, 0 for no wrap [default: 70]
65 | --compress-level set gzip/bzip2/xz compression level 1 (compress faster) - 9 (compress better) for gzip/bzip2/xz output file, just work with
66 | option -o/--out [default: 6]
67 | --log if file name specified, write log message to this file, or write to stderr
68 | -v, --verbosity... control verbosity of logging, [-v: Error, -vv: Warn, -vvv: Info, -vvvv: Debug, -vvvvv: Trace, default: Debug]
69 |
70 | Global FLAGS:
71 | -q, --quiet be quiet and do not show extra information
72 | -h, --help prints help information
73 | -V, --version prints version information
74 |
75 | Use "fakit help [command]" for more information about a command
76 |
77 | ```
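Two quick sketches using subcommands listed above (file names are placeholders, and the placement of the global `-w` option is an assumption):

```bash
# Simple summary statistics for a gzipped FASTA file
fakit summ genome.fa.gz

# Reverse-complement all records, written without line wrapping
fakit -w 0 reverse genome.fa > genome.rc.fa
```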
78 |
79 |
80 | **Any bugs? Please report issues!** 💖
81 |
--------------------------------------------------------------------------------
/dna/fq.md:
--------------------------------------------------------------------------------
1 | # fq
2 |
4 |
5 | **fq** filters, generates, subsamples, and validates [FASTQ] files.
6 |
7 | [FASTQ]: https://en.wikipedia.org/wiki/FASTQ_format
8 |
9 | ## Install
10 |
11 | There are different methods to install fq.
12 |
13 | ### Releases
14 |
15 | [Precompiled binaries are built][releases] for modern Linux distributions
16 | (`x86_64-unknown-linux-gnu`), macOS (`x86_64-apple-darwin`), and Windows
17 | (`x86_64-pc-windows-msvc`). The Linux binaries require glibc 2.18+ (CentOS/RHEL
18 | 8+, Debian 8+, Ubuntu 14.04+, etc.).
19 |
20 | [releases]: https://github.com/stjude-rust-labs/fq/releases
21 |
22 | ### Conda
23 |
24 | fq is available via [Bioconda].
25 |
26 | ```
27 | $ conda install fq=0.11.0
28 | ```
29 |
30 | [Bioconda]: https://bioconda.github.io/recipes/fq/README.html
31 |
32 | ### Manual
33 |
34 | Clone the repository and use [Cargo] to install fq.
35 |
36 | ```
37 | $ git clone --depth 1 --branch v0.11.0 https://github.com/stjude-rust-labs/fq.git
38 | $ cd fq
39 | $ cargo install --locked --path .
40 | ```
41 |
42 | [Cargo]: https://doc.rust-lang.org/cargo/getting-started/installation.html
43 |
44 | ### Container image
45 |
46 | Container images are managed by Bioconda and available through [Quay.io], e.g.,
47 | using [Docker]:
48 |
49 | ```
50 | $ docker image pull quay.io/biocontainers/fq:<tag>
51 | ```
52 |
53 | See [the repository tags] for the available tags.
54 |
55 | Alternatively, build the development container image:
56 |
57 | ```
58 | $ git clone --depth 1 --branch v0.11.0 https://github.com/stjude-rust-labs/fq.git
59 | $ cd fq
60 | $ docker image build --tag fq:0.11.0 .
61 | ```
62 |
63 | [Quay.io]: https://quay.io/repository/biocontainers/fq
64 | [the repository tags]: https://quay.io/repository/biocontainers/fq?tab=tags
65 | [Docker]: https://www.docker.com/
66 |
67 | ## Usage
68 |
69 | fq provides subcommands for filtering, generating, subsampling, and
70 | validating FASTQ files.
71 |
72 | ### filter
73 |
74 | **fq filter** filters a given FASTQ file by a set of names or a sequence
75 | pattern. The result includes only the records that match the given options.
76 |
77 | #### Usage
78 |
79 | ```
80 | Filters a FASTQ file
81 |
82 | Usage: fq filter [OPTIONS] --dsts <DSTS> [SRCS]...
83 |
84 | Arguments:
85 | [SRCS]... FASTQ sources
86 |
87 | Options:
88 | --names
89 | Allowlist of record names
90 | --sequence-pattern
91 | Keep records that have sequences that match the given regular expression
92 | --dsts
93 | Filtered FASTQ destinations
94 | -h, --help
95 | Print help
96 | -V, --version
97 | Print version
98 | ```
99 |
100 | #### Examples
101 |
102 | ```sh
103 | # Filters an input FASTQ using the given allowlist.
104 | $ fq filter --names allowlist.txt --dsts /dev/stdout in.fastq
105 |
106 | # Filters FASTQ files by matching a sequence pattern in the first input's
107 | # records and applying the match to all inputs.
108 | $ fq filter --sequence-pattern ^TC --dsts out.1.fq --dsts out.2.fq in.1.fq in.2.fq
109 | ```
110 |
111 | ### generate
112 |
113 | **fq generate** is a FASTQ file pair generator. It creates two reads, formatting
114 | names as [described by Illumina][1].
115 |
116 | While _generate_ creates "valid" FASTQ reads, the content of the files is
117 | completely random. The sequences do not align to any genome.
118 |
119 | [1]: https://help.basespace.illumina.com/articles/descriptive/fastq-files/
120 |
121 | #### Usage
122 |
123 | ```
124 | Generates a random FASTQ file pair
125 |
126 | Usage: fq generate [OPTIONS] <R1_DST> <R2_DST>
127 |
128 | Arguments:
129 | <R1_DST> Read 1 destination. Output will be gzipped if ends in `.gz`
130 | <R2_DST> Read 2 destination. Output will be gzipped if ends in `.gz`
131 |
132 | Options:
133 | -s, --seed Seed to use for the random number generator
134 | -n, --record-count Number of records to generate [default: 10000]
135 | --read-length Number of bases in the sequence [default: 101]
136 | -h, --help Print help
137 | -V, --version Print version
138 | ```
139 |
140 | #### Examples
141 |
142 | ```sh
143 | # Generates the default number of records, written to uncompressed files.
144 | $ fq generate /tmp/r1.fastq /tmp/r2.fastq
145 |
146 | # Generates FASTQ paired reads with 32 records, written to gzipped outputs.
147 | $ fq generate --record-count 32 /tmp/r1.fastq.gz /tmp/r2.fastq.gz
148 | ```
149 |
150 | ### lint
151 |
152 | **fq lint** is a FASTQ file pair validator.
153 |
154 | #### Usage
155 |
156 | ```
157 | Validates a FASTQ file pair
158 |
159 | Usage: fq lint [OPTIONS] <R1_SRC> [R2_SRC]
160 |
161 | Arguments:
162 | <R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs
163 | [R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs
164 |
165 | Options:
166 | --lint-mode
167 | Panic on first error or log all errors [default: panic] [possible values: panic, log]
168 | --single-read-validation-level
169 | Only use single read validators up to a given level [default: high] [possible values: low, medium, high]
170 | --paired-read-validation-level
171 | Only use paired read validators up to a given level [default: high] [possible values: low, medium, high]
172 | --disable-validator
173 | Disable validators by code. Use multiple times to disable more than one
174 | -h, --help
175 | Print help
176 | -V, --version
177 | Print version
178 | ```
179 |
180 | #### Validators
181 |
182 | _lint_ includes a set of validators that run on single or paired records.
183 | By default, records are validated with all rules, but validators can be
184 | disabled using `--disable-validator CODE`, where `CODE` is one of the validator
185 | codes listed below.
186 |
187 | ##### Single
188 |
189 | | Code | Level | Name | Validation
190 | |------|--------|-------------------|------------
191 | | S001 | low | PlusLine | Plus line starts with a "+".
192 | | S002 | medium | Alphabet | All characters in sequence line are one of "ACGTN", case-insensitive.
193 | | S003 | high | Name | Name line starts with an "@".
194 | | S004 | low | Complete | All four record lines (name, sequence, plus line, and quality) are present.
195 | | S005 | high | ConsistentSeqQual | Sequence and quality lengths are the same.
196 | | S006 | medium | QualityString | All characters in quality line are between "!" and "~" (ordinal values).
197 | | S007 | high | DuplicateName | All record names are unique.
198 |
199 | ##### Paired
200 |
201 | | Code | Level | Name | Validation
202 | |------|---------|-------------------|------------
203 | | P001 | medium | Names | Each paired read name is the same, excluding interleave.
204 |
205 | #### Examples
206 |
207 | ```sh
208 | # Validate both reads using all validators. Exits cleanly (0) if no validation
209 | # errors occur.
210 | $ fq lint r1.fastq r2.fastq
211 |
212 | # Log errors instead of quitting on first error.
213 | $ fq lint --lint-mode log r1.fastq r2.fastq
214 |
215 | # Disable validators S004 and S007.
216 | $ fq lint --disable-validator S004 --disable-validator S007 r1.fastq r2.fastq
217 | ```
218 |
219 | ### subsample
220 |
221 | **fq subsample** outputs a subset of records from single or paired FASTQ files.
222 |
223 | When using a probability (`-p, --probability`), each file is read through once,
224 | and a subset of records is selected based on that chance. Given the randomness
225 | used when sampling a uniform distribution, the output record count will not be
226 | exact but (statistically) close.
227 |
228 | When using a record count (`-n, --record-count`), the first input is read
229 | twice, but it provides an exact number of records to be selected.
230 |
231 | A seed (`-s, --seed`) can be provided to influence the results, e.g.,
232 | for a deterministic subset of records.
233 |
234 | For paired input, the sampling is applied to each pair.
235 |
236 | #### Usage
237 |
238 | ```
239 | Outputs a subset of records
240 |
241 | Usage: fq subsample [OPTIONS] --r1-dst <R1_DST> <--probability <PROBABILITY>|--record-count <RECORD_COUNT>> <R1_SRC> [R2_SRC]
242 |
243 | Arguments:
244 | <R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs
245 | [R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs
246 |
247 | Options:
248 | -p, --probability The probability a record is kept, as a percentage (0.0, 1.0). Cannot be used with `record-count`
249 | -n, --record-count The exact number of records to keep. Cannot be used with `probability`
250 | -s, --seed Seed to use for the random number generator
251 | --r1-dst Read 1 destination. Output will be gzipped if ends in `.gz`
252 | --r2-dst Read 2 destination. Output will be gzipped if ends in `.gz`
253 | -h, --help Print help
254 | -V, --version Print version
255 | ```
256 |
257 | #### Examples
258 |
259 | ```sh
260 | # Sample ~50% of records from a single FASTQ file
261 | $ fq subsample --probability 0.5 --r1-dst r1.50pct.fastq r1.fastq
262 |
263 | # Sample ~50% of records from a single FASTQ file and seed the RNG
264 | $ fq subsample --probability 0.5 --seed 13 --r1-dst r1.50pct.fastq r1.fastq
265 |
266 | # Sample ~25% of records from paired FASTQ files
267 | $ fq subsample --probability 0.25 --r1-dst r1.25pct.fastq --r2-dst r2.25pct.fastq r1.fastq r2.fastq
268 |
269 | # Sample ~10% of records from a gzipped FASTQ file and compress output
270 | $ fq subsample --probability 0.1 --r1-dst r1.10pct.fastq.gz r1.fastq.gz
271 |
272 | # Sample exactly 10000 records from a single FASTQ file
273 | $ fq subsample --record-count 10000 --r1-dst r1.10k.fastq r1.fastq
274 | ```
--------------------------------------------------------------------------------
/dna/ngs.md:
--------------------------------------------------------------------------------
3 | # ngs
4 |
24 | Command line utility for working with next-generation sequencing files.
25 |
43 | ## 🎨 Features
44 |
45 | * **[`ngs convert`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-convert).** Convert between next-generation sequencing formats.
46 | * **[`ngs derive`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-derive).** Forensic analysis tool for next-generation sequencing data.
47 | * **[`ngs generate`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-generate).** Generates a BAM file from a given reference genome.
48 | * **[`ngs index`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-index).** Generates the index file to various next-generation sequencing files.
49 | * **[`ngs list`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-list).** Utility to list various supported items in this command line tool.
50 | * **[`ngs plot`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-plot).** Produces plots for data generated by `ngs qc`.
51 | * **[`ngs qc`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-qc).** Generates quality control metrics for BAM files.
52 | * **[`ngs view`](https://github.com/stjude-rust-labs/ngs/wiki/ngs-view).** Views various next-generation sequencing files, sometimes with a query region.
53 |
54 |
55 | ## Guiding Principles
56 |
57 | * **Modern, reliable foundation for everyday bioinformatics analysis—written in Rust.** `ngs` aims to package together a fairly comprehensive set of analysis tools and utilities for everyday work in bioinformatics. It is built with modern, multi-core systems in mind and written in Rust. Though we are not there today, we plan to work towards this goal in the future.
58 | * **Runs on readily available hardware/software.** We aim for every subcommand within `ngs` to run within most computing environments without the need for special hardware or software. Practically, this means we've designed `ngs` to run in any UNIX-like environment that has at least four (4) cores and sixteen (16) GB of RAM. Often, tools will run with fewer resources. This design decision is important and sometimes means that `ngs` runs slower than it otherwise could.
59 |
60 | ## 📚 Getting Started
61 |
62 | ### Installation
63 |
64 | To install the latest released version, you can simply use `cargo`.
65 |
66 | ```bash
67 | cargo install ngs
68 | ```
69 |
70 | To install the latest version on `main`, you can use the following command.
71 |
72 | ```bash
73 | cargo install --locked --git https://github.com/stjude-rust-labs/ngs.git
74 | ```
75 |
76 | ### Using Docker
77 |
78 | ```bash
79 | docker pull ghcr.io/stjude-rust-labs/ngs
80 | docker run -it --rm --volume "$(pwd)":/data ghcr.io/stjude-rust-labs/ngs
81 | ```
82 |
83 | `/data` is the working directory of the docker image. Running this command from the directory containing your data will allow
84 | the container to act on those files.
85 |
86 | Note: Currently the `latest` tag refers to the latest release of `ngs` and not the most recent code changes in this
87 | repository.
88 |
89 | ## 🖥️ Development
90 |
91 | To bootstrap a development environment, please use the following commands.
92 |
93 | ```bash
94 | # Clone the repository
95 | git clone git@github.com:stjude-rust-labs/ngs.git
96 | cd ngs
97 |
98 | # Run the command line tool using cargo.
99 | cargo run -- -h
100 | ```
101 |
102 | ## 🚧️ Tests
103 |
104 | ```bash
105 | # Run the project's tests.
106 | cargo test
107 |
108 | # Ensure the project doesn't have any linting warnings.
109 | cargo clippy
110 |
111 | # Ensure the project passes `cargo fmt`.
112 | cargo fmt --check
113 | ```
114 |
115 | ## Minimum Supported Rust Version (MSRV)
116 |
117 | The minimum supported Rust version for this project is 1.64.0.
118 |
119 | ## 🤝 Contributing
120 |
121 | Contributions, issues and feature requests are welcome! Feel free to check
122 | [issues page](https://github.com/stjude-rust-labs/ngs/issues).
123 |
124 | ## 📝 License
125 |
126 | * All code related to the `ngs derive instrument` subcommand is licensed under the [AGPL v2.0][agpl-v2]. This is not due to any strict requirement, but out of deference to some [code][10x-inspiration] that inspired our strategy (and from which patterns were copied), the decision was made to license this code consistently.
127 | * The rest of this project is licensed as either [Apache 2.0][license-apache] or
128 | [MIT][license-mit] at your discretion.
129 |
130 | Copyright © 2021-Present [St. Jude Children's Research
131 | Hospital](https://github.com/stjude).
132 |
133 | [10x-inspiration]: https://github.com/10XGenomics/supernova/blob/master/tenkit/lib/python/tenkit/illumina_instrument.py
134 | [agpl-v2]: http://www.affero.org/agpl2.html
135 | [contributing-md]: https://github.com/stjude-rust-labs/ngs/blob/master/CONTRIBUTING.md
136 | [license-apache]: https://github.com/stjude-rust-labs/ngs/blob/master/LICENSE-APACHE
137 | [license-mit]: https://github.com/stjude-rust-labs/ngs/blob/master/LICENSE-MIT
--------------------------------------------------------------------------------
/dna/rust-bio-tools.md:
--------------------------------------------------------------------------------
7 |
8 | # Rust-Bio-Tools
9 |
10 | A set of ultra fast and robust command line utilities for bioinformatics tasks based on Rust-Bio.
11 | Rust-Bio-Tools provides a command `rbt`, which currently supports the following operations:
12 |
13 | * a linear time implementation for fuzzy matching of two vcf/bcf files (`rbt vcf-match`)
14 | * a vcf/bcf to txt converter that allows flexible selection of tags and properly handles multiallelic sites (`rbt vcf-to-txt`)
15 | * a linear time round-robin FASTQ splitter that splits a given FASTQ file into a given number of chunks (`rbt fastq-split`)
16 | * a linear time extraction of depth information from BAMs at given loci (`rbt bam-depth`)
17 | * a utility to quickly filter records from a FASTQ file (`rbt fastq-filter`)
18 | * a tool to merge BAM or FASTQ reads using marked duplicates or unique molecular identifiers (UMIs), respectively (`rbt collapse-reads-to-fragments bam|fastq`)
19 | * a tool to generate interactive HTML based reports that offer multiple plots visualizing the provided genomics data in VCF and BAM format (`rbt vcf-report`)
20 | * a tool to generate an interactive HTML based report from a csv file including visualizations (`rbt csv-report`)
21 | * a tool for splitting VCF/BCF files into N equal chunks, including BND support (`rbt vcf-split`)
22 | * a tool to generate visualizations for a specific region of one or multiple BAM files with a given reference contained in a single HTML file (`rbt plot-bam`)
23 |
24 | Further functionality is added as it is needed by the authors. Check out the [Contributing](#Contributing) section if you want to contribute anything yourself.
25 | For a list of changes, take a look at the [CHANGELOG](CHANGELOG.md).
26 |
27 |
28 | ## Installation
29 |
30 | ### Requirements
31 |
32 | Rust-Bio-Tools depends on [rgsl](https://docs.rs/GSL/*/rgsl/), which needs [GSL](https://www.gnu.org/software/gsl/) to be installed:
33 |
34 | - Ubuntu: `sudo apt-get install libgsl-dev`
35 | - Arch: `sudo pacman -S gsl`
36 | - OSX: `brew install gsl`
37 |
38 | ### Bioconda
39 |
40 | Rust-Bio-Tools is available via [Bioconda](https://bioconda.github.io).
41 | With Bioconda set up, installation is as easy as
42 |
43 | conda install rust-bio-tools
44 |
45 | ### Cargo
46 |
47 | If the [Rust](https://www.rust-lang.org/tools/install) compiler and associated [Cargo](https://github.com/rust-lang/cargo/) are installed, Rust-Bio-Tools may be installed via
48 |
49 | cargo install rust-bio-tools
50 |
51 | ### Source
52 |
53 | Download the source code and, within the root directory of the source, run
54 |
55 |     cargo install --path .
56 |
57 | ## Usage and Documentation
58 |
59 | Rust-Bio-Tools installs a command line utility `rbt`. Issue
60 |
61 | rbt --help
62 |
63 | for a summary of all options and tools.
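As a quick sketch of one subcommand, the round-robin FASTQ splitter reads from stdin and writes one chunk per destination file listed on the command line (flags and argument order may differ between versions; consult `rbt fastq-split --help`):

    rbt fastq-split chunk_a.fastq chunk_b.fastq chunk_c.fastq < reads.fastq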
64 |
65 | ## Contributing
66 |
67 | Any contributions are highly welcome. If you plan to contribute we suggest installing pre-commit hooks. To do so:
68 | 1. Install `pre-commit` as explained [here](https://pre-commit.com/#installation)
69 | 2. Run `pre-commit install` in the rust-bio-tools base directory
70 |
71 | This should format, check and lint your code when committing.
72 |
73 | ## Authors
74 |
75 | * [Johannes Köster](https://github.com/johanneskoester) (https://koesterlab.github.io)
76 | * [Felix Mölder](https://github.com/FelixMoelder)
77 | * [Henning Timm](https://github.com/HenningTimm)
78 | * [Felix Wiegand](https://github.com/fxwiegand)
--------------------------------------------------------------------------------
/dna/skc.md:
--------------------------------------------------------------------------------
1 | # skc
2 |
3 | `skc` is a simple tool for finding shared k-mer content between two genomes.
4 |
5 | ## Installation
6 |
7 | ### Prebuilt binary
8 |
9 | ```
10 | curl -sSL skc.mbh.sh | sh
11 | # or with wget
12 | wget -nv -O - skc.mbh.sh | sh
13 | ```
14 |
15 | You can also pass options to the script like so
16 |
17 | ```text
18 | $ curl -sSL skc.mbh.sh | sh -s -- --help
19 | install.sh [option]
20 |
21 | Fetch and install the latest version of skc, if skc is already
22 | installed it will be updated to the latest version.
23 |
24 | Options
25 | -V, --verbose
26 | Enable verbose output for the installer
27 |
28 | -f, -y, --force, --yes
29 | Skip the confirmation prompt during installation
30 |
31 | -p, --platform
32 | Override the platform identified by the installer
33 |
34 | -b, --bin-dir
35 | Override the bin installation directory [default: /usr/local/bin]
36 |
37 | -a, --arch
38 | Override the architecture identified by the installer [default: x86_64]
39 |
40 | -B, --base-url
41 | Override the base URL used for downloading releases [default: https://github.com/mbhall88/skc/releases]
42 |
43 | -h, --help
44 | Display this help message
45 |
46 | ```
47 |
48 | ### Cargo
49 |
50 | ```text
51 | cargo install skc
52 | ```
53 |
54 | ### Conda
55 |
56 | ```text
57 | conda install skc
58 | ```
59 |
60 | ### Local
61 |
62 | ```text
63 | cargo build --release
64 | ./target/release/skc --help
65 | ```
66 |
67 | ## Usage
68 |
69 | Check for shared 16-mers between the [HIV-1 genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1) and the
70 | [*Mycobacterium tuberculosis* genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_000962.3).
71 |
72 | ```text
73 | $ skc -k 16 NC_001802.1.fa NC_000962.3.fa
74 | [2023-06-20T01:46:36Z INFO ] 9079 unique k-mers in target
75 | [2023-06-20T01:46:38Z INFO ] 2 shared k-mers between target and query
76 | >4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106
77 | TGCAGAACATCCAGGG
78 | >4237062597 tcount=1 qcount=1 tpos=NC_001802.1:8415 qpos=NC_000962.3:629482
79 | CCAGCAGCAGATAGGG
80 | ```
81 |
82 | So we can see there are two shared 16-mers between the genomes. By default, the shared k-mers are written to stdout;
83 | use the `-o` option to write them to a file.
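For example, to write the shared k-mers to a FASTA file instead of stdout (the output file name is a placeholder):

```text
skc -k 16 -o shared.fa NC_001802.1.fa NC_000962.3.fa
```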
84 |
85 | ### Fasta description
86 |
87 | Example: `>4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106`
88 |
89 | The ID (`4233642782`) is the 64-bit integer representation of the k-mer's value in bit-space (see
90 | [Daniel Liu's brilliant cute-nucleotides][cute] for more information). `tcount` and `qcount` are the
91 | number of times the k-mer is present in the target and query genomes, respectively. `tpos` and `qpos` are the (1-based)
92 | k-mer starting position(s) within the target and query contigs - these will be comma-separated if the k-mer occurs
93 | multiple times.
94 |
95 | ### Usage help
96 |
97 | ```text
98 | $ skc --help
99 | Shared k-mer content between two genomes
100 |
101 | Usage: skc [OPTIONS] <TARGET> <QUERY>
102 |
103 | Arguments:
104 |   <TARGET>
105 |           Target sequence
106 |
107 |           Can be compressed with gzip, bzip2, xz, or zstd
108 |
109 |   <QUERY>
110 |           Query sequence
111 |
112 |           Can be compressed with gzip, bzip2, xz, or zstd
113 |
114 | Options:
115 | -k, --kmer
116 | Size of k-mers (max. 32)
117 |
118 | [default: 21]
119 |
120 | -o, --output