├── .gitignore
├── LICENSE
├── README.md
└── docker
    ├── Dockerfile
    └── compress_fastq
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.deb
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Brett T. Hannigan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Comparison of FASTQ compression algorithms

## Background
Next generation sequencing (NGS) experiments produce a tremendous amount of raw data that will
be used in further downstream analysis. Typically, the raw data from an instrument is stored in
[FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) format, a raw text format where each base read
is represented by 2 bytes - a byte to provide the nucleotide at a given position and a second byte
providing a quality score, or how confident the instrument was in calling that nucleotide.

In raw text format, these files are quite hefty. For instance, when sequencing the whole human genome,
experiments frequently aim to have each nucleotide represented on average 30 times in their
sequencing run. This would lead to raw FASTQ files of around:
```
3 billion nucleotides * 30x coverage * 2 bytes/nucleotide = 180 GB
```

Storing the data as raw text is incredibly inefficient though. Nucleotides are typically chosen from an
alphabet of 4 characters (or potentially a few more if one wishes to allow for ambiguous or no-calls),
so one would expect a bit over 2 bits needed per nucleotide rather than the full 8 bits provided in
the text file. The quality score range varies by instrument, with some instruments using as few as 2 bits
to represent quality scores. Historically, the most popular instruments would output around 64 different
quality scores, which would require around 6 bits to encode them directly.
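To make the text overhead concrete, here is a representative FASTQ record (the standard four-line example
from the Wikipedia article linked above): an identifier line, the called bases, a separator, and one
ASCII-encoded quality character for each base:
```
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```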
Thus, a simple binary encoding of
the information should take only about half of the naive text version:
```
(2 bits + 6 bits)/(8 bits + 8 bits) = 50%
```

The data itself is not uniformly distributed though, and thus even better compression ratios should be
achievable by using entropic methods of compression or by leveraging information about the
nature of genomic data. To that end, a number of researchers have created tools to compress genomic data.

I am interested in surveying some of these tools to get a sense of the compression ratios they afford
and to determine whether they would fit in with typical NGS data workflows.

## Tools evaluated
* **gzip** - By far the most popular method to compress FASTQ files is to simply gzip them. Most bioinformatic
tools will accept gzipped files as input. The gzip standard offers various levels of compression, allowing
users to trade off compression time against compression efficiency. Gzip level 1 is the fastest to compress a
given file, but at the cost of some compression efficiency. Gzip level 9, on the other hand, should provide
a smaller overall file at the cost of increased compression time. For this simple study,
I ran gzip using the default compression level (level 6) as well as the highest compression level (level 9);
a minimal sketch of the level trade-off appears after this list.
* **Unmapped BAM** - The BAM file format is a binary format that traditionally was used to store reads mapped
to a reference genome. Recently, the bioinformatics community has begun to use the BAM file to store the
raw, unmapped reads as well, with the Broad Institute using the uBAM format as the starting point for their
[best practices pipeline](https://software.broadinstitute.org/gatk/documentation/article?id=11008). The format
encodes the data in a binary format, and then that binary information is compressed using a block gzip compression
algorithm. Thus, our naive expectation would be that the total size of a uBAM should be on par with the
gzipped FASTQ files.
* **Unmapped CRAM** - Like the BAM format, the CRAM format is a binary format typically used to store reads
mapped to a reference genome. When a reference genome is provided, the CRAM format is able to compress
the data more thoroughly. The CRAM format also compresses data on a per-column basis, allowing the compression
algorithm to be more efficient as it compresses data of one type together. It will be interesting to see how much better
our unmapped CRAM is than unmapped BAM, since our data has not been mapped and no reference genome will be
provided. CRAM files are also capable of compressing the quality scores in a lossy manner. For this initial
evaluation, we will use CRAM in a lossless encoding manner.
* [**FaStore**](https://academic.oup.com/bioinformatics/article/34/16/2748/4956350) - The FaStore compressor
offers both lossless and lossy compression. For lossy compression, it can both alter read identifiers as
well as alter quality scores. For our tests, we will only evaluate the lossless mode for now. While
I suspect that in many if not most cases, compressing the quality scores will have little to no impact
on downstream analysis, to simplify the current evaluation I will stick to the uncontroversial lossless
approaches.
That said, I would have liked to evaluate discarding the read identifiers, but it is a bit challenging
to both keep the full quality scores and discard the read identifiers using the scripts provided by FaStore.
* [**Spring**](https://academic.oup.com/bioinformatics/article-abstract/35/15/2674/5232998?redirectedFrom=fulltext) -
The Spring compressor offers lossless and lossy modes similar to FaStore. Like FaStore, I will only evaluate
lossless quality compression here. I will, however, also evaluate the removal of read identifiers, as it
is a straightforward option in the tool.
* [**fqzcomp**](https://github.com/jkbonfield/fqzcomp) - This compressor also offers lossless and lossy
compression of quality scores. Once again, for our evaluation purposes we will stick to lossless for now.
* [**repaq**](https://github.com/OpenGene/repaq) - Repaq offers lossless compression of FASTQ files. The
authors highlight repaq's speed and the ability to obtain further compression through the use of xz. While
repaq can operate on any pair of FASTQ files, the authors note that compression ratios on Illumina data are
better than those for BGI data.
* [**Petagene**](https://www.petagene.com) - Petagene offers a suite of compression tools that can losslessly
compress both `fastq.gz` files and `.bam` files. The software will even capture gzip settings so that
md5sums of compressed->decompressed files will match those of the originals.
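As referenced in the gzip entry above, a minimal sketch of the compression-level trade-off. This uses
Python's built-in `gzip` module purely for illustration (the benchmarks below invoked the command-line
`gzip`, so absolute times and ratios will differ; `reads.fastq` is a hypothetical input file):
```
import gzip

# Hypothetical input; any FASTQ file will do.
with open("reads.fastq", "rb") as fh:
    data = fh.read()

# compresslevel=1 favors speed, compresslevel=9 favors size; 6 is gzip's default.
for level in (1, 6, 9):
    compressed = gzip.compress(data, compresslevel=level)
    print(f"level {level}: {len(compressed) / len(data):.1%} of original size")
```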
## Data used during evaluation
As a first-pass evaluation, I selected two NGS samples from the
[Sequence Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra/) to evaluate a set of compression
techniques.

[**SRR2962693**](https://www.ncbi.nlm.nih.gov/sra/?term=SRR2962693): A human, whole exome sample from an Illumina HiSeq 2500 run.

[**SRR8861483**](https://www.ncbi.nlm.nih.gov/sra/?term=SRR8861483): A human, whole genome sample from an Illumina NovaSeq 6000 run.

Starting with their NovaSeq instruments, Illumina began using a simplified quality binning approach,
with each base pair using around 2 bits to store quality information rather than the ~6 bits used in
previous machines like the HiSeq line. Thus, our samples should allow us to monitor performance
differences due to changes in quality representation across the older and newer instruments.

## Results
First, unfortunately I was unable to get FaStore to complete the compression of the test samples. For
both the WES and WGS samples, the tool would begin the multi-stage process of compressing the data,
but it would then freeze. I did not see an error message. While monitoring the initial runs, I did observe
a few peaks in memory use, so I tried the samples on an instance with significant memory resources - roughly
128 GB. I also provided up to 1 TB of storage space in case significant temporary files were generated.
Alas, these efforts did not help. I may return at some point to investigate these failures in more depth,
but for now I will just report results for the other tools.

### Timing
For this initial evaluation, I chose not to concentrate on optimizing the timing of each approach. Some of the tools
here make use of multiple cores better than others, and there may be ways to leverage the tools more efficiently (in
a scatter-gather approach, for example), but I thought a simple accounting of the time it took each tool to compress
the data would still be informative.

Note that for two of the compression schemes (gzip and fqzcomp), for paired-end data the tools compress each FASTQ
separately. I will record both the total serial time and the per-file times for these samples. In most instances,
I assume the max of the per-file times would be most pertinent here, since in most cases one would choose to
run each of these in parallel.

#### Instance details
##### Whole exome data **SRR2962693**
**AWS c5d.2xlarge**
8 vCPUs
16 GB RAM
Data read from and written to NVMe SSD drive

##### Whole genome data **SRR8861483**
**AWS r5d.4xlarge**
16 vCPUs
128 GB RAM
Data read from and written to NVMe SSD drive

For tools which allowed for the specification of the number of cores to use, 8 cores were requested
for both WES and WGS samples. The larger instance for WGS was selected for its larger memory and
storage capacity.

#### Compression
| Sample | gzip | gzip -9 | uBAM | uCRAM | FaStore | Spring | Spring --no-ids | fqzcomp | repaq | repaq-xz | Petagene |
| ---------- | --------------------------- | --------------------------- | ------ | ------ | ------- | ------ | --------------- | ------- | ----- | -------- | ---------------------- |
| SRR2962693 | 26m and 26m (52m total) | 1h 35m and 1h 39m (3h 14m) | 39m | 35m | DNF | 26m | 26m | 32m | 10m | 52m | 1.75m and 1.75m (3.5m) |
| SRR8861483 | 1h 36m and 1h 23m (2h 59m) | 9h 35m and 12h (21h 35m) | 2h 34m | 2h 35m | DNF | 3h 3m | 3h 2m | 2h 15m | 47m | 2h 27m | 8.25m and 8.75m (17m) |

| SRR2962693 WES | SRR8861483 WGS |
| -------------- | -------------- |
| ![WES Compression times](https://user-images.githubusercontent.com/3038393/70670885-42530880-1c2f-11ea-96cd-aba10e2f6caf.png)| ![WGS Compression times](https://user-images.githubusercontent.com/3038393/70670876-3bc49100-1c2f-11ea-8510-fe446d19c2b8.png) |

#### Decompression
| Sample | gzip | gzip -9 | uBAM | uCRAM | FaStore | Spring | Spring --no-ids | fqzcomp | repaq | repaq-xz | Petagene |
| ---------- | ----------------------- | ----------------------- | ----- | ------ | ------- | ------ | --------------- | ----------------------- | ------ | -------- | ---------------------------- |
| SRR2962693 | 2m and 2m (4m total) | 2m and 2m (4m total) | 10m | 10m | DNF | 16m | 16m | 14m and 14m (28m total) | 18m | 25m | 1.25m and 1.25m (2.5m total) |
| SRR8861483 | 11m and 11m (22m total) | 11m and 11m (22m total) | 58m | 1h 25m | DNF | 53m | 51m | 1h 5m and ??? (2h 10m) | 1h 32m | 2h 2m | 5.75m and 6m (11.75m total) |

(Note: I forgot to record the time for decompressing the reverse reads for fqzcomp. My assumption is
that it would take a similar time to decompress the reverse reads as it did to decompress the forward reads.)
| SRR2962693 WES | SRR8861483 WGS |
| -------------- | -------------- |
| ![WES Decompression times](https://user-images.githubusercontent.com/3038393/70670872-36674680-1c2f-11ea-9a7d-c9f969dcce58.png)| ![WGS Decompression times](https://user-images.githubusercontent.com/3038393/70670863-3109fc00-1c2f-11ea-93c2-de7c2142040e.png) |

### Storage size
As discussed in the tools section, I am concentrating on lossless compression for this current study. In many
cases, it may make sense to leverage lossy compression, as the quality scores are fairly noisy and a number of
studies show that it's possible to bin quality scores quite aggressively without impacting typical
variant calling pipelines. However, to limit the scope to the most uncontroversial mode of compression, for
now I'll look at methods that preserve the full, original quality information.

| Sample | SRA | FASTQ | FASTQ.gz | FASTQ.gz -9 | uBAM | uCRAM | FaStore | Spring | Spring --no-ids | fqzcomp | repaq | repaq-xz | Petagene |
| ---------- | ----- | ------ | -------- | ----------- | ------ | ------ | ------- | ------ | --------------- | ------- | ----- | -------- | -------- |
| SRR2962693 | 7 GB | 40 GB | 10.3 GB | 9.4 GB | 9.3 GB | 6.6 GB | DNF | 3.5 GB | 3.5 GB | 4.7 GB | 12 GB | 5.4 GB | 3.6 GB |
| SRR8861483 | 23 GB | 284 GB | 33 GB | 32 GB | 33 GB | 22 GB | DNF | 15 GB | 15 GB | 37 GB | 77 GB | 21 GB | 15.2 GB |

| SRR2962693 WES | SRR8861483 WGS |
| -------------- | -------------- |
| ![WES Compressed file sizes](https://user-images.githubusercontent.com/3038393/70670858-2cddde80-1c2f-11ea-8deb-ac9ac2259fb7.png)| ![WGS Compressed file sizes](https://user-images.githubusercontent.com/3038393/70670853-294a5780-1c2f-11ea-93c5-b34923557a19.png) |

## Discussion
As is widely accepted, researchers at a minimum should gzip compress their raw FASTQ files. For the
HiSeq data, this provided close to a 75% reduction in storage space compared to the raw FASTQ. Add in
the fact that gzipped files are cheap to decompress and the fact that most bioinformatics tools already
accept gzipped FASTQ files as input, and I think you'll find there are few reasons to keep raw FASTQ files around very long.

For users who are interested in even greater storage savings though, Spring appears to be a compelling
option. I saw file sizes that saved an additional 55%-66% of storage space compared to gzipped FASTQ files.
While the decompression time was significantly longer than for a simple gzip file, it still seemed reasonable
at roughly an hour for decompressing a WGS sample on a 16-core node (where only 8 cores were used).
It was also nice that the original read ordering is completely preserved (unlike unmapped BAM and CRAM files,
where the reads are sorted by read name) and that the raw base pairs and quality scores matched the
initial data exactly, unlike fqzcomp, which has the occasional quality score change. (Again, according
to the fqzcomp authors some changes may be expected, as fqzcomp will set the quality score to 0 for any base
called as an N. These changes seem eminently reasonable, but it's nice to not have to worry about
the changes and wonder whether some subset of them was due to some other issue.)
At a compression time of about 3 hours and a decompression time of about an hour for a WGS sample on an
r5d.4xlarge, we can calculate the break-even time for storing the Spring-compressed file vs.
the gzipped file as follows:

**Hot storage**
```
4 hours of r5d.4xlarge * $1.152/hour + 15 GB * $0.026 /GB/month * storage_duration = 33 GB * $0.026 /GB/month * storage_duration

($0.858 - $0.39) * storage_duration = $4.608

storage_duration = 9.8 months
```
I believe we should be able to leverage the smaller r5d.2xlarge and see similar run times for compression and
decompression since we are only using 8 cores. If this is true, our break-even point would occur in roughly half
the time, or after about 5 months.

For cold storage (i.e. Glacier) the break-even would be pushed out a bit.
```
4 hours of r5d.4xlarge * $1.152/hour + 15 GB * $0.005 /GB/month * storage_duration = 33 GB * $0.005 /GB/month * storage_duration

($0.165 - $0.075) * storage_duration = $4.608

storage_duration = 51 months
```
Or roughly 26 months if we are able to use an r5d.2xlarge.

Leveraging the spot market would pull our break-even dates in further, as would the ability
to leverage underutilized compute cycles within our processing pipelines to perform the compression.
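To make this arithmetic easy to rerun as prices drift, here is a minimal sketch of the same calculation
(the rates are the instance and storage prices quoted above and will go out of date):
```
# A minimal sketch of the break-even arithmetic above.
def break_even_months(compute_cost_usd, gzip_gb, spring_gb, usd_per_gb_month):
    """Months of storage at which Spring's one-time compute cost is repaid."""
    monthly_savings_usd = (gzip_gb - spring_gb) * usd_per_gb_month
    return compute_cost_usd / monthly_savings_usd

compute_cost = 4 * 1.152  # ~4 hours on an r5d.4xlarge at $1.152/hour
print(break_even_months(compute_cost, 33, 15, 0.026))  # hot storage:  ~9.8 months
print(break_even_months(compute_cost, 33, 15, 0.005))  # cold storage: ~51.2 months
```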
## Final caveats
I did not exhaustively verify that the data compression was lossless under all methods. At a minimum,
I did verify that each tool could produce a pair of FASTQ files that contained the same number of entries
as the input data.

For **fqzcomp**, I verified that the nucleotide information of the decompressed data
exactly matches the input. However, I did notice that some of the quality scores had been altered. The
authors note that all bases called as an N will have their quality score set to 0 though. I did not
verify that all quality changes could be explained by this behavior.

For **Spring**, I verified that the nucleotide information and the quality information from the decompressed
data exactly matched the input data.

For **repaq**, I verified that the nucleotide information and the quality information from the decompressed
data exactly matched the input data.

During **uBAM** and **uCRAM** creation, the input reads are sorted by read name, making comparison of
the decompressed and initial data more challenging. For now, I did verify that for at least one read
the nucleotides and quality scores match exactly. Although a more thorough validation is possible, I feel
fairly confident that for these two formats one can expect exact data fidelity for the nucleotides and
quality scores.
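For the record-count check mentioned at the top of this section, a short sketch along these lines suffices
(assuming strict four-line FASTQ records; the file names and the `count_fastq_records` helper are
hypothetical, not part of this repository):
```
import gzip

def count_fastq_records(path):
    """Count reads in a (possibly gzipped) FASTQ file, assuming 4-line records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(1 for _ in fh) // 4

# Compare the original input against a compressed->decompressed round trip.
assert count_fastq_records("original_R1.fastq.gz") == count_fastq_records("roundtrip_R1.fastq")
```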
In many pipelines, it may make sense to store the mapped data and throw away the original raw data. Assuming
your mapped reads contain at least one full-length entry for every raw read, you should be able to
revert to unmapped FASTQ input without too much struggle, and having the mapped data would allow you to
restart the pipeline at the variant calling stage rather than re-mapping the samples in any re-runs. In this case,
a mapped CRAM, potentially with the quality scores compressed using CRAM's native quality score compression or
external tools like [Crumble](https://academic.oup.com/bioinformatics/article/35/2/337/5051198), probably
makes the most sense.

However, there are a number of cases where keeping the raw, unmapped data makes the most sense. For example,
there are many analyses that may be performed that don't have a mapping stage. Also, some projects anticipate
users leveraging multiple reference genomes, in which case storing the data mapped to one specific reference
might not make sense. For these projects, leveraging one of the tools above might be the most appropriate.
--------------------------------------------------------------------------------
/docker/Dockerfile:
--------------------------------------------------------------------------------
FROM ubuntu:18.04 AS build
ENV SAMTOOLS_VERSION 1.9
ENV PICARD_VERSION 2.21.1
ENV CRUMBLE_VERSION 0.8.3

RUN apt-get update && \
    apt-get install -y gcc g++ wget ncurses-dev zlib1g-dev libbz2-dev liblzma-dev build-essential git cmake
# Install samtools and Picard
RUN wget https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VERSION}/samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
    tar jxvf samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
    cd samtools-${SAMTOOLS_VERSION} && ./configure --prefix=/samtools && make -j `nproc` && make install
RUN wget https://github.com/broadinstitute/picard/releases/download/${PICARD_VERSION}/picard.jar && \
    mkdir -p /picard && cp picard.jar /picard/
# Install fqzcomp
RUN git clone https://github.com/jkbonfield/fqzcomp.git && \
    cd fqzcomp && make -j `nproc`
# Install FaStore
RUN git clone https://github.com/refresh-bio/FaStore.git && \
    cd FaStore && make -j `nproc` && chmod 755 scripts/*.sh
# Install Spring
RUN git clone https://github.com/shubhamchandak94/Spring.git && \
    cd Spring && mkdir build && cd build && cmake ../ && make -j `nproc`
# Install Crumble (built against the htslib bundled with the samtools source tree)
RUN wget https://github.com/jkbonfield/crumble/releases/download/v${CRUMBLE_VERSION}/crumble-${CRUMBLE_VERSION}.tar.gz && \
    cd /samtools-${SAMTOOLS_VERSION}/htslib-${SAMTOOLS_VERSION} && ./configure && make -j `nproc` && make install && \
    cd / && tar zxvf crumble-${CRUMBLE_VERSION}.tar.gz && mv crumble-${CRUMBLE_VERSION} /crumble && cd /crumble && \
    ./configure --with-htslib=/samtools-${SAMTOOLS_VERSION}/htslib-${SAMTOOLS_VERSION} && make -j `nproc`
# Install repaq
RUN git clone https://github.com/OpenGene/repaq.git && \
    cd repaq && make -j `nproc` && make install


#########################################################################################################
# Final image
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y python3 openjdk-11-jre-headless libgomp1 pxz xz-utils libcurl4-openssl-dev aria2 && \
    apt-get autoclean && apt-get autoremove && rm -rf /var/lib/apt/lists/*

# Install Petagene.
# Note: Petagene will provide a free trial of their software, see:
#   https://www.petagene.com/
# If you would like to skip including the Petagene portion, just
# comment out the three statements below.
ADD petasuite-cloud-edition_1.2.6p6_amd64.deb /petasuite-cloud-edition_1.2.6p6_amd64.deb
RUN dpkg -i /petasuite-cloud-edition_1.2.6p6_amd64.deb && \
    rm /petasuite-cloud-edition_1.2.6p6_amd64.deb
RUN petasuite_install_corpus human
# END Petagene
COPY --from=build /samtools/bin/ /samtools/bin/
COPY --from=build /picard/ /picard/
COPY --from=build /fqzcomp/fqzcomp /usr/local/bin/fqzcomp
# The FaStore wrapper scripts expect the binaries alongside them, so the
# contents of bin/ are copied into the scripts/ directory.
COPY --from=build /FaStore/bin /FaStore/scripts
COPY --from=build /FaStore/scripts /FaStore/scripts
COPY --from=build /repaq/repaq /usr/local/bin/repaq
RUN sed -i 's/FASTORE_BIN=.\//FASTORE_BIN=\/FaStore\/scripts\//g' /FaStore/scripts/*.sh
RUN sed -i 's/FASTORE_REBIN=.\//FASTORE_REBIN=\/FaStore\/scripts\//g' /FaStore/scripts/*.sh
RUN sed -i 's/FASTORE_PACK=.\//FASTORE_PACK=\/FaStore\/scripts\//g' /FaStore/scripts/*.sh
COPY --from=build /Spring/build/spring /Spring/build/spring
COPY --from=build /crumble/crumble /usr/local/bin/crumble
COPY --from=build /usr/local/lib /usr/local/lib
ENV PATH="/samtools/bin:/FaStore/scripts:/Spring/build/:/usr/local/bin:${PATH}"
RUN /sbin/ldconfig -v
ADD compress_fastq /usr/local/bin

ENTRYPOINT ["compress_fastq"]
--------------------------------------------------------------------------------
/docker/compress_fastq:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
import argparse
import subprocess
import os
import re
import sys
import logging
import multiprocessing

FASTQ_PATTERN = r"(\.fq|\.fastq)(\.gz)?$"


def _parse_args():
    ap = argparse.ArgumentParser(description="Compress a pair of FASTQ files.")

    input_file_group = ap.add_argument_group("Input files")
    input_file_group.add_argument(
        "-1",
        "--fwd-reads",
        help="The forward reads in FASTQ or FASTQ.GZ format",
        required=True,
    )
    input_file_group.add_argument(
        "-2",
        "--rev-reads",
        help="(optional) The reverse reads in FASTQ or FASTQ.GZ format",
        required=False,
    )
    input_file_group.add_argument(
        "-r",
        "--ref-fasta",
        help="(optional) The reference FASTA - only needed if a mapped BAM file is "
        "provided as input",
        required=False,
    )

    # For whatever reason, mutually_exclusive_group does not permit the addition of a
    # title. So to actually group these together on the help line, it is useful to
    # first create a group, which has a title, and then add a mutually_exclusive_group
    # to that group.
    algorithm_group = ap.add_argument_group("Compression algorithm")
    compression_type_group = algorithm_group.add_mutually_exclusive_group(required=True)
    compression_type_group.add_argument(
        "--ucram", help="Compress to an unmapped CRAM file", action="store_true"
    )
    compression_type_group.add_argument(
        "--repaq", help="Compress using the Repaq algorithm", action="store_true"
    )
    compression_type_group.add_argument(
        "--repaq-xz",
        help="Compress using the Repaq algorithm, then further compress with xz",
        action="store_true",
    )
    compression_type_group.add_argument(
        "--ubam", help="Compress to an unmapped BAM file", action="store_true"
    )
    compression_type_group.add_argument(
        "--mapped-cram",
        help="Compress to a mapped CRAM file. For this option, provide the mapped bam "
        "as input to the forward reads and provide the reference FASTA",
        action="store_true",
    )
    compression_type_group.add_argument(
        "--mapped-cram-crumble",
        help="Compress to a mapped CRAM file and compress quality scores using Crumble."
        " For this option, provide the mapped bam as input to the forward reads "
        "and provide the reference FASTA",
        action="store_true",
    )
    compression_type_group.add_argument(
        "--fastore", help="Compress using the FaStore algorithm", action="store_true"
    )
    compression_type_group.add_argument(
        "--spring", help="Compress using the Spring algorithm", action="store_true"
    )
    compression_type_group.add_argument(
        "--fqzcomp", help="Compress using the fqzcomp algorithm", action="store_true"
    )
    compression_type_group.add_argument(
        "--petagene", help="Compress using the Petagene algorithm", action="store_true"
    )

    misc_group = ap.add_argument_group("Misc arguments")
    misc_group.add_argument(
        "--remove-ids",
        help="Remove the read identifiers and replace them with a unique identifier.",
        action="store_true",
    )
    misc_group.add_argument(
        "-o",
        "--output-prefix",
        help="(optional) The prefix to use for output. If not provided, we will use the"
        " prefix from the forward reads file",
        required=False,
    )
    misc_group.add_argument(
        "-j",
        "--num-threads",
        help="(optional) The number of cores to use. Not all tools support this "
        "currently.",
        type=int,
        default=multiprocessing.cpu_count(),
        required=False,
    )

    args = ap.parse_args()

    return args

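# Example invocations (hypothetical file names; the Dockerfile installs this
# script on the PATH as compress_fastq):
#   compress_fastq --spring -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o sample -j 8
#   compress_fastq --repaq-xz -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz
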
def _get_fastq_to_sam_cmd(fwd_reads, sample_name, read_group, rev_reads=None):
    """Get a command list suitable to provide as input to subprocess for
    converting a set of FASTQ files to a SAM file

    Parameters
    ----------
    fwd_reads : str
        The path to the forward reads in FASTQ format
    sample_name : str
        The name to use for the sample name
    read_group : str
        The read group to use for this sample
    rev_reads : str, optional
        The path to the reverse reads in FASTQ format, by default None

    Returns
    -------
    list[str]
        A list of strings that can be passed to a subprocess function for
        execution
    """

    cmd = [
        "java",
        "-jar",
        "/picard/picard.jar",
        "FastqToSam",
        f"F1={fwd_reads}",
        "O=/dev/stdout",
        "QUIET=true",
        f"SM={sample_name}",
        f"RG={read_group}",
    ]
    if rev_reads is not None:
        cmd.append(f"F2={rev_reads}")

    return cmd


def convert_fastq_to_spring(
    fwd_reads,
    rev_reads=None,
    output_prefix=None,
    remove_ids=False,
    num_threads=multiprocessing.cpu_count(),
):
    """Compress a pair of FASTQ files using Spring

    Parameters
    ----------
    fwd_reads : str
        The path to the forward reads in FASTQ format
    rev_reads : str, optional
        The path to the reverse reads in FASTQ format, by default None
    output_prefix : str, optional
        A prefix to use for the sample name and output file, by default None
    remove_ids : bool, optional
        If set to true, Spring will remove the read ids (names will be lossy, but the
        underlying data is preserved), by default False
    num_threads : int, optional
        Number of threads to use with Spring, by default multiprocessing.cpu_count()

    Returns
    -------
    str
        The output filename
    """
    if output_prefix is None:
        output_prefix = os.path.basename(fwd_reads)
        output_prefix = re.sub(FASTQ_PATTERN, "", output_prefix)

    output_filename = f"{output_prefix}.spring"
    cmd = ["spring", "-c", "-o", output_filename, "-t", str(num_threads)]
    if fwd_reads.endswith(".gz"):
        cmd.append("-g")
    if remove_ids:
        cmd.append("--no-ids")
    cmd.extend(["-i", fwd_reads])
    if rev_reads is not None:
        cmd.append(rev_reads)

    subprocess.check_call(cmd)

    return output_filename

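# For example, with fwd_reads="sample_R1.fastq.gz", rev_reads="sample_R2.fastq.gz"
# and num_threads=8, convert_fastq_to_spring above runs roughly:
#   spring -c -o sample_R1.spring -t 8 -g -i sample_R1.fastq.gz sample_R2.fastq.gz
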
def convert_fastq_to_ubam_or_ucram(
    fwd_reads,
    rev_reads=None,
    output_prefix=None,
    output_type="bam",
    num_threads=multiprocessing.cpu_count(),
):
    """Convert a pair of FASTQ files to an unmapped BAM or CRAM file

    Parameters
    ----------
    fwd_reads : str
        The path to the forward reads in FASTQ format
    rev_reads : str, optional
        The path to the reverse reads in FASTQ format, by default None
    output_prefix : str, optional
        A prefix to use for the sample name and output file, by default None
    output_type : str, optional
        Should we output as a bam (default) or cram?, by default 'bam'
    num_threads : int, optional
        Number of threads to use with samtools, by default multiprocessing.cpu_count()

    Returns
    -------
    str
        The output filename
    """
    if output_prefix is None:
        output_prefix = os.path.basename(fwd_reads)
        output_prefix = re.sub(FASTQ_PATTERN, "", output_prefix)

    cmd = _get_fastq_to_sam_cmd(
        fwd_reads=fwd_reads,
        rev_reads=rev_reads,
        sample_name=output_prefix,
        read_group=f"{output_prefix}_1",
    )
    fastq_to_sam_proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)

    if output_type == "bam":
        output_filename = f"{output_prefix}.bam"
        cmd = ["samtools", "view", "-b", "-@", str(num_threads), "-o", output_filename]
    else:
        output_filename = f"{output_prefix}.cram"
        cmd = ["samtools", "view", "-C", "-@", str(num_threads), "-o", output_filename]

    subprocess.check_call(cmd, stdin=fastq_to_sam_proc.stdout)

    return output_filename

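# The Popen/check_call pair in convert_fastq_to_ubam_or_ucram amounts to the
# shell pipeline (CRAM case shown):
#   java -jar /picard/picard.jar FastqToSam F1=R1.fq F2=R2.fq O=/dev/stdout ... \
#       | samtools view -C -@ <num_threads> -o <prefix>.cram
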
(f"{output_prefix}.cmeta", f"{output_prefix}.cdata") 286 | 287 | 288 | def convert_fastq_to_fqzcomp(reads, output_prefix): 289 | """Compress a FASTQ file using fqzcomp 290 | 291 | Parameters 292 | ---------- 293 | reads : str 294 | The path to the FASTQ file to be compressed 295 | output_prefix : str 296 | A prefix to use for the sample name and output file 297 | 298 | Returns 299 | ------- 300 | str 301 | The output filename 302 | """ 303 | if output_prefix is None: 304 | output_prefix = os.path.basename(reads) 305 | output_prefix = re.sub(FASTQ_PATTERN, "", output_prefix) 306 | # Recommended parameters for Illumina data as provided by: 307 | # https://github.com/jkbonfield/fqzcomp 308 | output_filename = f"{output_prefix}.fqz" 309 | 310 | if reads.endswith(".gz"): 311 | input_proc = subprocess.Popen(["zcat", reads], stdout=subprocess.PIPE) 312 | else: 313 | input_proc = subprocess.Popen(["cat", reads], stdout=subprocess.PIPE) 314 | cmd = ["fqzcomp", "-n2", "-s7+", "-b", "-q3", "/dev/stdin", output_filename] 315 | subprocess.check_call(cmd, stdin=input_proc.stdout) 316 | 317 | return output_filename 318 | 319 | 320 | def convert_fastq_to_repaq( 321 | fwd_reads, 322 | rev_reads=None, 323 | output_prefix=None, 324 | xz_compress=False, 325 | num_threads=multiprocessing.cpu_count(), 326 | ): 327 | """Compress a pair of FASTQ files using Repaq 328 | 329 | Parameters 330 | ---------- 331 | fwd_reads : str 332 | The path to the forward reads in FASTQ format 333 | rev_reads : str, optional 334 | The path to the reverse reads in FASTQ format, by default None 335 | output_prefix : str, optional 336 | A prefix to use for the sample name and output file, by default None 337 | xz_compress : bool, optional 338 | Should we use xz to further compress the output? 339 | num_threads : int, optional 340 | Number of threads to use with Spring, by default multiprocessing.cpu_count() 341 | 342 | Returns 343 | ------- 344 | str 345 | The output filename 346 | """ 347 | if output_prefix is None: 348 | output_prefix = os.path.basename(fwd_reads) 349 | output_prefix = re.sub(FASTQ_PATTERN, "", output_prefix) 350 | 351 | # Create the output filename 352 | output_filename = f"{output_prefix}.rfq" 353 | if xz_compress: 354 | output_filename += ".xz" 355 | 356 | # Generate the repaq command 357 | cmd = ["repaq", "-c", "--stdout", "-i", fwd_reads] 358 | if rev_reads is not None: 359 | cmd.extend(["-I", rev_reads]) 360 | 361 | # Now run repaq and optionally run xz to further compress 362 | with open(output_filename, "w") as output_fh: 363 | if xz_compress: 364 | repaq_proc = subprocess.Popen(cmd, stdout=subprocess.PIPE) 365 | # cmd = ["pxz", "-T", str(num_threads), '--lzma=dict=1000000000', "-z", "-c"] 366 | cmd = ["xz", "-T", str(num_threads), "--lzma2=dict=1000000000", "-z", "-c"] 367 | subprocess.check_call(cmd, stdin=repaq_proc.stdout, stdout=output_fh) 368 | else: 369 | repaq_proc = subprocess.check_call(cmd, stdout=output_fh) 370 | 371 | return output_filename 372 | 373 | 374 | def convert_fastq_to_petagene( 375 | fwd_reads, 376 | rev_reads=None, 377 | num_threads=multiprocessing.cpu_count(), 378 | ): 379 | """Compress a pair of FASTQ files using Spring 380 | 381 | Parameters 382 | ---------- 383 | fwd_reads : str 384 | The path to the forward reads in FASTQ format 385 | rev_reads : str, optional 386 | The path to the reverse reads in FASTQ format, by default None 387 | num_threads : int, optional 388 | Number of threads to use with Spring, by default multiprocessing.cpu_count() 389 | 390 | Returns 391 | ------- 
def convert_fastq_to_petagene(
    fwd_reads,
    rev_reads=None,
    num_threads=multiprocessing.cpu_count(),
):
    """Compress a pair of FASTQ files using Petagene

    Parameters
    ----------
    fwd_reads : str
        The path to the forward reads in FASTQ format
    rev_reads : str, optional
        The path to the reverse reads in FASTQ format, by default None
    num_threads : int, optional
        Number of threads to use with petasuite, by default multiprocessing.cpu_count()

    Returns
    -------
    list[str]
        The output filenames
    """
    cmd = ["petasuite", "-c", "-t", str(num_threads)]
    output_filenames = [re.sub(FASTQ_PATTERN, ".fasterq", fwd_reads)]
    subprocess.check_call(cmd + [fwd_reads])

    if rev_reads is not None:
        output_filenames.append(re.sub(FASTQ_PATTERN, ".fasterq", rev_reads))
        subprocess.check_call(cmd + [rev_reads])

    return output_filenames


if __name__ == "__main__":
    args = _parse_args()
    if args.ucram:
        if args.remove_ids:
            logging.warning("Removing ids not supported for ucram yet.")
        convert_fastq_to_ubam_or_ucram(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            output_type="cram",
            num_threads=args.num_threads,
        )
    elif args.ubam:
        if args.remove_ids:
            logging.warning("Removing ids not supported for ubam yet.")
        convert_fastq_to_ubam_or_ucram(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            output_type="bam",
            num_threads=args.num_threads,
        )
    elif args.mapped_cram:
        logging.error("Support for mapped CRAM not available yet.")
        sys.exit(1)
    elif args.mapped_cram_crumble:
        logging.error("Support for mapped CRAM not available yet.")
        sys.exit(1)
    elif args.fastore:
        if args.remove_ids:
            logging.warning("Removing ids not supported for fastore yet.")
        convert_fastq_to_fastore(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            num_threads=args.num_threads,
        )
    elif args.spring:
        convert_fastq_to_spring(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            remove_ids=args.remove_ids,
            num_threads=args.num_threads,
        )
    elif args.fqzcomp:
        if args.remove_ids:
            logging.warning("Removing ids not supported for fqzcomp yet.")
        # fqzcomp compresses each FASTQ separately; when -o is omitted, each
        # output prefix is derived from its own input filename.
        convert_fastq_to_fqzcomp(reads=args.fwd_reads, output_prefix=args.output_prefix)
        if args.rev_reads is not None:
            convert_fastq_to_fqzcomp(
                reads=args.rev_reads, output_prefix=args.output_prefix
            )
    elif args.repaq:
        if args.remove_ids:
            logging.warning("Removing ids not supported for repaq yet.")
        convert_fastq_to_repaq(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            num_threads=args.num_threads,
        )
    elif args.repaq_xz:
        if args.remove_ids:
            logging.warning("Removing ids not supported for repaq yet.")
        convert_fastq_to_repaq(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            xz_compress=True,
            num_threads=args.num_threads,
        )
    elif args.petagene:
        if args.remove_ids:
            logging.warning("Removing ids not supported for Petagene.")
        convert_fastq_to_petagene(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            num_threads=args.num_threads,
        )
--------------------------------------------------------------------------------