├── .gitignore
├── LICENSE
├── README.md
└── docker
    ├── Dockerfile
    └── compress_fastq
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.deb
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Brett T. Hannigan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Comparison of FASTQ compression algorithms

## Background
Next generation sequencing (NGS) experiments produce a tremendous amount of raw data that will
be used in further downstream analysis. Typically, the raw data from an instrument is stored in
[FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) format, a raw text format where each base read
is represented by 2 bytes - a byte to provide the nucleotide at a given position and a second byte
providing a quality score, or how confident the instrument was in calling that nucleotide.

In raw text format, these files are quite hefty. For instance, when sequencing the whole human genome,
experiments frequently aim to have each nucleotide represented on average 30 times in their
sequencing run. This would lead to raw FASTQ files of around:
```
3 billion nucleotides * 30x coverage * 2 bytes/nucleotide = 180 GB
```

Storing the data as raw text is incredibly inefficient though. Nucleotides are typically chosen from an
alphabet of 4 characters (or potentially a few more if one wishes to allow for ambiguous or no-calls),
so one would expect a bit over 2 bits needed per nucleotide rather than the full 8 bits provided in
the text file. The quality score range varies by instrument, with some instruments using as few as 2 bits
to represent quality scores. Historically, the most popular instruments would output around 64 different
quality scores, which would require around 6 bits to encode them directly.
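To make the text overhead concrete, here is a representative FASTQ record (the standard four-line example
from the Wikipedia article linked above): an identifier line, the called bases, a separator, and one
ASCII-encoded quality character for each base:
```
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```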
Thus, a simple binary encoding of
the information should take only about half of the naive text version:
```
(2 bits + 6 bits)/(8 bits + 8 bits) = 50%
```

The data itself is not uniformly distributed though, and thus even better compression ratios should be
achievable by using entropic methods of compression or by leveraging information about the
nature of genomic data. To that end, a number of researchers have created tools to compress genomic data.

I am interested in surveying some of these tools to get a sense of the compression ratios they afford
and to determine whether they would fit in with typical NGS data workflows.

## Tools evaluated
* **gzip** - By far the most popular method to compress FASTQ files is to simply gzip them. Most bioinformatic
tools will accept gzipped files as input. The gzip standard offers various levels of compression, allowing
users to trade off compression time against compression efficiency. Gzip level 1 is the fastest to compress a
given file, but at the cost of some compression efficiency. Gzip level 9, on the other hand, should provide
a smaller overall file at the cost of increased compression time. For this simple study,
I ran gzip using the default compression level (level 6) as well as the highest compression level (level 9);
a minimal sketch of the level trade-off appears after this list.
* **Unmapped BAM** - The BAM file format is a binary format that traditionally was used to store reads mapped
to a reference genome. Recently, the bioinformatics community has begun to use the BAM file to store the
raw, unmapped reads as well, with the Broad Institute using the uBAM format as the starting point for their
[best practices pipeline](https://software.broadinstitute.org/gatk/documentation/article?id=11008). The format
encodes the data in a binary format, and then that binary information is compressed using a block gzip compression
algorithm. Thus, our naive expectation would be that the total size of a uBAM should be on par with the
gzipped FASTQ files.
* **Unmapped CRAM** - Like the BAM format, the CRAM format is a binary format typically used to store reads
mapped to a reference genome. When a reference genome is provided, the CRAM format is able to compress
the data more thoroughly. The CRAM format also compresses data on a per-column basis, allowing the compression
algorithm to be more efficient as it compresses data of one type together. It will be interesting to see how much better
our unmapped CRAM is than unmapped BAM, since our data has not been mapped and no reference genome will be
provided. CRAM files are also capable of compressing the quality scores in a lossy manner. For this initial
evaluation, we will use CRAM in a lossless encoding manner.
* [**FaStore**](https://academic.oup.com/bioinformatics/article/34/16/2748/4956350) - The FaStore compressor
offers both lossless and lossy compression. For lossy compression, it can both alter read identifiers as
well as alter quality scores. For our tests, we will only evaluate the lossless mode for now. While
I suspect that in many if not most cases, compressing the quality scores will have little to no impact
on downstream analysis, to simplify the current evaluation I will stick to the uncontroversial lossless
approaches.
That said, I would have liked to evaluate discarding the read identifiers, but it is a bit challenging
to both keep the full quality scores and discard the read identifiers using the scripts provided by FaStore.
* [**Spring**](https://academic.oup.com/bioinformatics/article-abstract/35/15/2674/5232998?redirectedFrom=fulltext) -
The Spring compressor offers lossless and lossy modes similar to FaStore. Like FaStore, I will only evaluate
lossless quality compression here. I will, however, also evaluate the removal of read identifiers, as it
is a straightforward option in the tool.
* [**fqzcomp**](https://github.com/jkbonfield/fqzcomp) - This compressor also offers lossless and lossy
compression of quality scores. Once again, for our evaluation purposes we will stick to lossless for now.
* [**repaq**](https://github.com/OpenGene/repaq) - Repaq offers lossless compression of FASTQ files. The
authors highlight repaq's speed and the ability to obtain further compression through the use of xz. While
repaq can operate on any pair of FASTQ files, the authors note that compression ratios on Illumina data are
better than those for BGI data.
* [**Petagene**](https://www.petagene.com) - Petagene offers a suite of compression tools that can losslessly
compress both `fastq.gz` files and `.bam` files. The software will even capture gzip settings so that
md5sums of compressed->decompressed files will match those of the originals.
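As referenced in the gzip entry above, a minimal sketch of the compression-level trade-off. This uses
Python's built-in `gzip` module purely for illustration (the benchmarks below invoked the command-line
`gzip`, so absolute times and ratios will differ; `reads.fastq` is a hypothetical input file):
```
import gzip

# Hypothetical input; any FASTQ file will do.
with open("reads.fastq", "rb") as fh:
    data = fh.read()

# compresslevel=1 favors speed, compresslevel=9 favors size; 6 is gzip's default.
for level in (1, 6, 9):
    compressed = gzip.compress(data, compresslevel=level)
    print(f"level {level}: {len(compressed) / len(data):.1%} of original size")
```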
## Data used during evaluation
As a first-pass evaluation, I selected two NGS samples from the
[Sequence Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra/) to evaluate a set of compression
techniques.

[**SRR2962693**](https://www.ncbi.nlm.nih.gov/sra/?term=SRR2962693): A human, whole exome sample from an Illumina HiSeq 2500 run.

[**SRR8861483**](https://www.ncbi.nlm.nih.gov/sra/?term=SRR8861483): A human, whole genome sample from an Illumina NovaSeq 6000 run.

Starting with their NovaSeq instruments, Illumina began using a simplified quality binning approach,
with each base pair using around 2 bits to store quality information rather than the ~6 bits used in
previous machines like the HiSeq line. Thus, our samples should allow us to monitor performance
differences due to changes in quality representation across the older and newer instruments.

## Results
First, unfortunately I was unable to get FaStore to complete the compression of the test samples. For
both the WES and WGS samples, the tool would begin the multi-stage process of compressing the data,
but it would then freeze. I did not see an error message. While monitoring the initial runs, I did observe
a few peaks in memory use, so I tried the samples on an instance with significant memory resources - roughly
128 GB. I also provided up to 1 TB of storage space in case significant temporary files were generated.
Alas, these efforts did not help. I may return at some point to investigate these failures in more depth,
but for now I will just report results for the other tools.

### Timing
For this initial evaluation, I chose not to concentrate on optimizing the timing of each approach. Some of the tools
here make use of multiple cores better than others, and there may be ways to leverage the tools more efficiently (in
a scatter-gather approach, for example), but I thought a simple accounting of the time it took each tool to compress
the data would still be informative.

Note that for two of the compression schemes (gzip and fqzcomp), for paired-end data the tools compress each FASTQ
separately. I will record both the total serial time and the per-file times for these samples. In most instances,
I assume the max of the per-file times would be most pertinent here, since in most cases one would choose to
run each of these in parallel.

#### Instance details
##### Whole exome data **SRR2962693**
**AWS c5d.2xlarge**
8 vCPUs
16 GB RAM
Data read from and written to NVMe SSD drive

##### Whole genome data **SRR8861483**
**AWS r5d.4xlarge**
16 vCPUs
128 GB RAM
Data read from and written to NVMe SSD drive

For tools which allowed for the specification of the number of cores to use, 8 cores were requested
for both WES and WGS samples. The larger instance for WGS was selected for its larger memory and
storage capacity.

#### Compression
| Sample | gzip | gzip -9 | uBAM | uCRAM | FaStore | Spring | Spring --no-ids | fqzcomp | repaq | repaq-xz | Petagene |
| ---------- | --------------------------- | --------------------------- | ------ | ------ | ------- | ------ | --------------- | ------- | ----- | -------- | ---------------------- |
| SRR2962693 | 26m and 26m (52m total) | 1h 35m and 1h 39m (3h 14m) | 39m | 35m | DNF | 26m | 26m | 32m | 10m | 52m | 1.75m and 1.75m (3.5m) |
| SRR8861483 | 1h 36m and 1h 23m (2h 59m) | 9h 35m and 12h (21h 35m) | 2h 34m | 2h 35m | DNF | 3h 3m | 3h 2m | 2h 15m | 47m | 2h 27m | 8.25m and 8.75m (17m) |

| SRR2962693 WES | SRR8861483 WGS |
| -------------- | -------------- |
| ![WES Compression times](https://user-images.githubusercontent.com/3038393/70670885-42530880-1c2f-11ea-96cd-aba10e2f6caf.png)| ![WGS Compression times](https://user-images.githubusercontent.com/3038393/70670876-3bc49100-1c2f-11ea-8510-fe446d19c2b8.png) |

#### Decompression
| Sample | gzip | gzip -9 | uBAM | uCRAM | FaStore | Spring | Spring --no-ids | fqzcomp | repaq | repaq-xz | Petagene |
| ---------- | ----------------------- | ----------------------- | ----- | ------ | ------- | ------ | --------------- | ----------------------- | ------ | -------- | ---------------------------- |
| SRR2962693 | 2m and 2m (4m total) | 2m and 2m (4m total) | 10m | 10m | DNF | 16m | 16m | 14m and 14m (28m total) | 18m | 25m | 1.25m and 1.25m (2.5m total) |
| SRR8861483 | 11m and 11m (22m total) | 11m and 11m (22m total) | 58m | 1h 25m | DNF | 53m | 51m | 1h 5m and ??? (2h 10m) | 1h 32m | 2h 2m | 5.75m and 6m (11.75m total) |

(Note: I forgot to record the time for decompressing the reverse reads for fqzcomp. My assumption is
that it would take a similar time to decompress the reverse reads as it did to decompress the forward reads.)
| SRR2962693 WES | SRR8861483 WGS |
| -------------- | -------------- |
| ![WES Decompression times](https://user-images.githubusercontent.com/3038393/70670872-36674680-1c2f-11ea-9a7d-c9f969dcce58.png)| ![WGS Decompression times](https://user-images.githubusercontent.com/3038393/70670863-3109fc00-1c2f-11ea-93c2-de7c2142040e.png) |

### Storage size
As discussed in the tools section, I am concentrating on lossless compression for this current study. In many
cases, it may make sense to leverage lossy compression, as the quality scores are fairly noisy and a number of
studies show that it's possible to bin quality scores quite aggressively without impacting typical
variant calling pipelines. However, to limit the scope to the most uncontroversial mode of compression, for
now I'll look at methods that preserve the full, original quality information.

| Sample | SRA | FASTQ | FASTQ.gz | FASTQ.gz -9 | uBAM | uCRAM | FaStore | Spring | Spring --no-ids | fqzcomp | repaq | repaq-xz | Petagene |
| ---------- | ----- | ------ | -------- | ----------- | ------ | ------ | ------- | ------ | --------------- | ------- | ----- | -------- | -------- |
| SRR2962693 | 7 GB | 40 GB | 10.3 GB | 9.4 GB | 9.3 GB | 6.6 GB | DNF | 3.5 GB | 3.5 GB | 4.7 GB | 12 GB | 5.4 GB | 3.6 GB |
| SRR8861483 | 23 GB | 284 GB | 33 GB | 32 GB | 33 GB | 22 GB | DNF | 15 GB | 15 GB | 37 GB | 77 GB | 21 GB | 15.2 GB |

| SRR2962693 WES | SRR8861483 WGS |
| -------------- | -------------- |
| ![WES Compressed file sizes](https://user-images.githubusercontent.com/3038393/70670858-2cddde80-1c2f-11ea-8deb-ac9ac2259fb7.png)| ![WGS Compressed file sizes](https://user-images.githubusercontent.com/3038393/70670853-294a5780-1c2f-11ea-93c5-b34923557a19.png) |

## Discussion
As is widely accepted, researchers at a minimum should gzip compress their raw FASTQ files. For the
HiSeq data, this provided close to a 75% reduction in storage space compared to the raw FASTQ. Add in
the fact that gzipped files are cheap to decompress and the fact that most bioinformatics tools already
accept gzipped FASTQ files as input, and I think you'll find there are few reasons to keep raw FASTQ files around very long.

For users who are interested in even greater storage savings though, Spring appears to be a compelling
option. I saw file sizes that saved an additional 55%-66% of storage space compared to gzipped FASTQ files.
While the decompression time was significantly longer than for a simple gzip file, it still seemed reasonable
at roughly an hour for decompressing a WGS sample on a 16-core node (where only 8 cores were used).
It was also nice that the original read ordering is completely preserved (unlike unmapped BAM and CRAM files,
where the reads are sorted by read name) and that the raw base pairs and quality scores matched the
initial data exactly, unlike fqzcomp, which has the occasional quality score change. (Again, according
to the fqzcomp authors some changes may be expected, as fqzcomp will set the quality score to 0 for any base
called as an N. These changes seem eminently reasonable, but it's nice to not have to worry about
the changes and wonder whether some subset of them was due to some other issue.)
At a compression time of about 3 hours and a decompression time of about an hour for a WGS sample on an
r5d.4xlarge, we can calculate the break-even time for storing the Spring-compressed file vs.
the gzipped file as follows:

**Hot storage**
```
4 hours of r5d.4xlarge * $1.152/hour + 15 GB * $0.026 /GB/month * storage_duration = 33 GB * $0.026 /GB/month * storage_duration

($0.858 - $0.39) * storage_duration = $4.608

storage_duration = 9.8 months
```
I believe we should be able to leverage the smaller r5d.2xlarge and see similar run times for compression and
decompression since we are only using 8 cores. If this is true, our break-even point would occur in roughly half
the time, or after about 5 months.

For cold storage (i.e. Glacier) the break-even would be pushed out a bit.
```
4 hours of r5d.4xlarge * $1.152/hour + 15 GB * $0.005 /GB/month * storage_duration = 33 GB * $0.005 /GB/month * storage_duration

($0.165 - $0.075) * storage_duration = $4.608

storage_duration = 51 months
```
Or roughly 26 months if we are able to use an r5d.2xlarge.

Leveraging the spot market would pull our break-even dates in further, as would the ability
to leverage underutilized compute cycles within our processing pipelines to perform the compression.
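To make this arithmetic easy to rerun as prices drift, here is a minimal sketch of the same calculation
(the rates are the instance and storage prices quoted above and will go out of date):
```
# A minimal sketch of the break-even arithmetic above.
def break_even_months(compute_cost_usd, gzip_gb, spring_gb, usd_per_gb_month):
    """Months of storage at which Spring's one-time compute cost is repaid."""
    monthly_savings_usd = (gzip_gb - spring_gb) * usd_per_gb_month
    return compute_cost_usd / monthly_savings_usd

compute_cost = 4 * 1.152  # ~4 hours on an r5d.4xlarge at $1.152/hour
print(break_even_months(compute_cost, 33, 15, 0.026))  # hot storage:  ~9.8 months
print(break_even_months(compute_cost, 33, 15, 0.005))  # cold storage: ~51.2 months
```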
## Final caveats
I did not exhaustively verify that the data compression was lossless under all methods. At a minimum,
I did verify that each tool could produce a pair of FASTQ files that contained the same number of entries
as the input data.

For **fqzcomp**, I verified that the nucleotide information of the decompressed data
exactly matches the input. However, I did notice that some of the quality scores had been altered. The
authors note that all bases called as an N will have their quality score set to 0 though. I did not
verify that all quality changes could be explained by this behavior.

For **Spring**, I verified that the nucleotide information and the quality information from the decompressed
data exactly matched the input data.

For **repaq**, I verified that the nucleotide information and the quality information from the decompressed
data exactly matched the input data.

During **uBAM** and **uCRAM** creation, the input reads are sorted by read name, making comparison of
the decompressed and initial data more challenging. For now, I did verify that for at least one read
the nucleotides and quality scores match exactly. Although a more thorough validation is possible, I feel
fairly confident that for these two formats one can expect exact data fidelity for the nucleotides and
quality scores.
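For the record-count check mentioned at the top of this section, a short sketch along these lines suffices
(assuming strict four-line FASTQ records; the file names and the `count_fastq_records` helper are
hypothetical, not part of this repository):
```
import gzip

def count_fastq_records(path):
    """Count reads in a (possibly gzipped) FASTQ file, assuming 4-line records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(1 for _ in fh) // 4

# Compare the original input against a compressed->decompressed round trip.
assert count_fastq_records("original_R1.fastq.gz") == count_fastq_records("roundtrip_R1.fastq")
```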
In many pipelines, it may make sense to store the mapped data and throw away the original raw data. Assuming
your mapped reads contain at least one full-length entry for every raw read, you should be able to
revert to unmapped FASTQ input without too much struggle, and having the mapped data would allow you to
restart the pipeline at the variant calling stage rather than re-mapping the samples in any re-runs. In this case,
a mapped CRAM, potentially with the quality scores compressed using CRAM's native quality score compression or
external tools like [Crumble](https://academic.oup.com/bioinformatics/article/35/2/337/5051198), probably
makes the most sense.

However, there are a number of cases where keeping the raw, unmapped data makes the most sense. For example,
there are many analyses that may be performed that don't have a mapping stage. Also, some projects anticipate
users leveraging multiple reference genomes, in which case storing the data mapped to one specific reference
might not make sense. For these projects, leveraging one of the tools above might be the most appropriate.
--------------------------------------------------------------------------------
/docker/Dockerfile:
--------------------------------------------------------------------------------
FROM ubuntu:18.04 AS build
ENV SAMTOOLS_VERSION 1.9
ENV PICARD_VERSION 2.21.1
ENV CRUMBLE_VERSION 0.8.3

RUN apt-get update && \
    apt-get install -y gcc g++ wget ncurses-dev zlib1g-dev libbz2-dev liblzma-dev build-essential git cmake
# Install samtools and Picard
RUN wget https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VERSION}/samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
    tar jxvf samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
    cd samtools-${SAMTOOLS_VERSION} && ./configure --prefix=/samtools && make -j `nproc` && make install
RUN wget https://github.com/broadinstitute/picard/releases/download/${PICARD_VERSION}/picard.jar && \
    mkdir -p /picard && cp picard.jar /picard/
# Install fqzcomp
RUN git clone https://github.com/jkbonfield/fqzcomp.git && \
    cd fqzcomp && make -j `nproc`
# Install FaStore
RUN git clone https://github.com/refresh-bio/FaStore.git && \
    cd FaStore && make -j `nproc` && chmod 755 scripts/*.sh
# Install Spring
RUN git clone https://github.com/shubhamchandak94/Spring.git && \
    cd Spring && mkdir build && cd build && cmake ../ && make -j `nproc`
# Install Crumble (built against the htslib bundled with the samtools source tree)
RUN wget https://github.com/jkbonfield/crumble/releases/download/v${CRUMBLE_VERSION}/crumble-${CRUMBLE_VERSION}.tar.gz && \
    cd /samtools-${SAMTOOLS_VERSION}/htslib-${SAMTOOLS_VERSION} && ./configure && make -j `nproc` && make install && \
    cd / && tar zxvf crumble-${CRUMBLE_VERSION}.tar.gz && mv crumble-${CRUMBLE_VERSION} /crumble && cd /crumble && \
    ./configure --with-htslib=/samtools-${SAMTOOLS_VERSION}/htslib-${SAMTOOLS_VERSION} && make -j `nproc`
# Install repaq
RUN git clone https://github.com/OpenGene/repaq.git && \
    cd repaq && make -j `nproc` && make install


#########################################################################################################
# Final image
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y python3 openjdk-11-jre-headless libgomp1 pxz xz-utils libcurl4-openssl-dev aria2 && \
    apt-get autoclean && apt-get autoremove && rm -rf /var/lib/apt/lists/*

# Install Petagene.
# Note: Petagene will provide a free trial of their software, see:
#   https://www.petagene.com/
# If you would like to skip including the Petagene portion, just
# comment out the three statements below.
ADD petasuite-cloud-edition_1.2.6p6_amd64.deb /petasuite-cloud-edition_1.2.6p6_amd64.deb
RUN dpkg -i /petasuite-cloud-edition_1.2.6p6_amd64.deb && \
    rm /petasuite-cloud-edition_1.2.6p6_amd64.deb
RUN petasuite_install_corpus human
# END Petagene
COPY --from=build /samtools/bin/ /samtools/bin/
COPY --from=build /picard/ /picard/
COPY --from=build /fqzcomp/fqzcomp /usr/local/bin/fqzcomp
# The FaStore wrapper scripts expect the binaries alongside them, so the
# contents of bin/ are copied into the scripts/ directory.
COPY --from=build /FaStore/bin /FaStore/scripts
COPY --from=build /FaStore/scripts /FaStore/scripts
COPY --from=build /repaq/repaq /usr/local/bin/repaq
RUN sed -i 's/FASTORE_BIN=.\//FASTORE_BIN=\/FaStore\/scripts\//g' /FaStore/scripts/*.sh
RUN sed -i 's/FASTORE_REBIN=.\//FASTORE_REBIN=\/FaStore\/scripts\//g' /FaStore/scripts/*.sh
RUN sed -i 's/FASTORE_PACK=.\//FASTORE_PACK=\/FaStore\/scripts\//g' /FaStore/scripts/*.sh
COPY --from=build /Spring/build/spring /Spring/build/spring
COPY --from=build /crumble/crumble /usr/local/bin/crumble
COPY --from=build /usr/local/lib /usr/local/lib
ENV PATH="/samtools/bin:/FaStore/scripts:/Spring/build/:/usr/local/bin:${PATH}"
RUN /sbin/ldconfig -v
ADD compress_fastq /usr/local/bin

ENTRYPOINT ["compress_fastq"]
--------------------------------------------------------------------------------
/docker/compress_fastq:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
import argparse
import subprocess
import os
import re
import sys
import logging
import multiprocessing

FASTQ_PATTERN = r"(\.fq|\.fastq)(\.gz)?$"


def _parse_args():
    ap = argparse.ArgumentParser(description="Compress a pair of FASTQ files.")

    input_file_group = ap.add_argument_group("Input files")
    input_file_group.add_argument(
        "-1",
        "--fwd-reads",
        help="The forward reads in FASTQ or FASTQ.GZ format",
        required=True,
    )
    input_file_group.add_argument(
        "-2",
        "--rev-reads",
        help="(optional) The reverse reads in FASTQ or FASTQ.GZ format",
        required=False,
    )
    input_file_group.add_argument(
        "-r",
        "--ref-fasta",
        help="(optional) The reference FASTA - only needed if a mapped BAM file is "
        "provided as input",
        required=False,
    )

    # For whatever reason, mutually_exclusive_group does not permit the addition of a
    # title. So to actually group these together on the help line, it is useful to
    # first create a group, which has a title, and then add a mutually_exclusive_group
    # to that group.
    algorithm_group = ap.add_argument_group("Compression algorithm")
    compression_type_group = algorithm_group.add_mutually_exclusive_group(required=True)
    compression_type_group.add_argument(
        "--ucram", help="Compress to an unmapped CRAM file", action="store_true"
    )
    compression_type_group.add_argument(
        "--repaq", help="Compress using the Repaq algorithm", action="store_true"
    )
    compression_type_group.add_argument(
        "--repaq-xz",
        help="Compress using the Repaq algorithm, then further compress with xz",
        action="store_true",
    )
    compression_type_group.add_argument(
        "--ubam", help="Compress to an unmapped BAM file", action="store_true"
    )
    compression_type_group.add_argument(
        "--mapped-cram",
        help="Compress to a mapped CRAM file. For this option, provide the mapped bam "
        "as input to the forward reads and provide the reference FASTA",
        action="store_true",
    )
    compression_type_group.add_argument(
        "--mapped-cram-crumble",
        help="Compress to a mapped CRAM file and compress quality scores using Crumble."
        " For this option, provide the mapped bam as input to the forward reads "
        "and provide the reference FASTA",
        action="store_true",
    )
    compression_type_group.add_argument(
        "--fastore", help="Compress using the FaStore algorithm", action="store_true"
    )
    compression_type_group.add_argument(
        "--spring", help="Compress using the Spring algorithm", action="store_true"
    )
    compression_type_group.add_argument(
        "--fqzcomp", help="Compress using the fqzcomp algorithm", action="store_true"
    )
    compression_type_group.add_argument(
        "--petagene", help="Compress using the Petagene algorithm", action="store_true"
    )

    misc_group = ap.add_argument_group("Misc arguments")
    misc_group.add_argument(
        "--remove-ids",
        help="Remove the read identifiers and replace them with a unique identifier.",
        action="store_true",
    )
    misc_group.add_argument(
        "-o",
        "--output-prefix",
        help="(optional) The prefix to use for output. If not provided, we will use the"
        " prefix from the forward reads file",
        required=False,
    )
    misc_group.add_argument(
        "-j",
        "--num-threads",
        help="(optional) The number of cores to use. Not all tools support this "
        "currently.",
        type=int,
        default=multiprocessing.cpu_count(),
        required=False,
    )

    args = ap.parse_args()

    return args

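# Example invocations (hypothetical file names; the Dockerfile installs this
# script on the PATH as compress_fastq):
#   compress_fastq --spring -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o sample -j 8
#   compress_fastq --repaq-xz -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz
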
def _get_fastq_to_sam_cmd(fwd_reads, sample_name, read_group, rev_reads=None):
    """Get a command list suitable to provide as input to subprocess for
    converting a set of FASTQ files to a SAM file

    Parameters
    ----------
    fwd_reads : str
        The path to the forward reads in FASTQ format
    sample_name : str
        The name to use for the sample name
    read_group : str
        The read group to use for this sample
    rev_reads : str, optional
        The path to the reverse reads in FASTQ format, by default None

    Returns
    -------
    list[str]
        A list of strings that can be passed to a subprocess function for
        execution
    """

    cmd = [
        "java",
        "-jar",
        "/picard/picard.jar",
        "FastqToSam",
        f"F1={fwd_reads}",
        "O=/dev/stdout",
        "QUIET=true",
        f"SM={sample_name}",
        f"RG={read_group}",
    ]
    if rev_reads is not None:
        cmd.append(f"F2={rev_reads}")

    return cmd


def convert_fastq_to_spring(
    fwd_reads,
    rev_reads=None,
    output_prefix=None,
    remove_ids=False,
    num_threads=multiprocessing.cpu_count(),
):
    """Compress a pair of FASTQ files using Spring

    Parameters
    ----------
    fwd_reads : str
        The path to the forward reads in FASTQ format
    rev_reads : str, optional
        The path to the reverse reads in FASTQ format, by default None
    output_prefix : str, optional
        A prefix to use for the sample name and output file, by default None
    remove_ids : bool, optional
        If set to true, Spring will remove the read ids (names will be lossy, but the
        underlying data is preserved), by default False
    num_threads : int, optional
        Number of threads to use with Spring, by default multiprocessing.cpu_count()

    Returns
    -------
    str
        The output filename
    """
    if output_prefix is None:
        output_prefix = os.path.basename(fwd_reads)
        output_prefix = re.sub(FASTQ_PATTERN, "", output_prefix)

    output_filename = f"{output_prefix}.spring"
    cmd = ["spring", "-c", "-o", output_filename, "-t", str(num_threads)]
    if fwd_reads.endswith(".gz"):
        cmd.append("-g")
    if remove_ids:
        cmd.append("--no-ids")
    cmd.extend(["-i", fwd_reads])
    if rev_reads is not None:
        cmd.append(rev_reads)

    subprocess.check_call(cmd)

    return output_filename

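# For example, with fwd_reads="sample_R1.fastq.gz", rev_reads="sample_R2.fastq.gz"
# and num_threads=8, convert_fastq_to_spring above runs roughly:
#   spring -c -o sample_R1.spring -t 8 -g -i sample_R1.fastq.gz sample_R2.fastq.gz
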
def convert_fastq_to_ubam_or_ucram(
    fwd_reads,
    rev_reads=None,
    output_prefix=None,
    output_type="bam",
    num_threads=multiprocessing.cpu_count(),
):
    """Convert a pair of FASTQ files to an unmapped BAM or CRAM file

    Parameters
    ----------
    fwd_reads : str
        The path to the forward reads in FASTQ format
    rev_reads : str, optional
        The path to the reverse reads in FASTQ format, by default None
    output_prefix : str, optional
        A prefix to use for the sample name and output file, by default None
    output_type : str, optional
        Should we output as a bam (default) or cram?, by default 'bam'
    num_threads : int, optional
        Number of threads to use with samtools, by default multiprocessing.cpu_count()

    Returns
    -------
    str
        The output filename
    """
    if output_prefix is None:
        output_prefix = os.path.basename(fwd_reads)
        output_prefix = re.sub(FASTQ_PATTERN, "", output_prefix)

    cmd = _get_fastq_to_sam_cmd(
        fwd_reads=fwd_reads,
        rev_reads=rev_reads,
        sample_name=output_prefix,
        read_group=f"{output_prefix}_1",
    )
    fastq_to_sam_proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)

    if output_type == "bam":
        output_filename = f"{output_prefix}.bam"
        cmd = ["samtools", "view", "-b", "-@", str(num_threads), "-o", output_filename]
    else:
        output_filename = f"{output_prefix}.cram"
        cmd = ["samtools", "view", "-C", "-@", str(num_threads), "-o", output_filename]

    subprocess.check_call(cmd, stdin=fastq_to_sam_proc.stdout)

    return output_filename

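# The Popen/check_call pair in convert_fastq_to_ubam_or_ucram amounts to the
# shell pipeline (CRAM case shown):
#   java -jar /picard/picard.jar FastqToSam F1=R1.fq F2=R2.fq O=/dev/stdout ... \
#       | samtools view -C -@ <num_threads> -o <prefix>.cram
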
(f"{output_prefix}.cmeta", f"{output_prefix}.cdata") 286 | 287 | 288 | def convert_fastq_to_fqzcomp(reads, output_prefix): 289 | """Compress a FASTQ file using fqzcomp 290 | 291 | Parameters 292 | ---------- 293 | reads : str 294 | The path to the FASTQ file to be compressed 295 | output_prefix : str 296 | A prefix to use for the sample name and output file 297 | 298 | Returns 299 | ------- 300 | str 301 | The output filename 302 | """ 303 | if output_prefix is None: 304 | output_prefix = os.path.basename(reads) 305 | output_prefix = re.sub(FASTQ_PATTERN, "", output_prefix) 306 | # Recommended parameters for Illumina data as provided by: 307 | # https://github.com/jkbonfield/fqzcomp 308 | output_filename = f"{output_prefix}.fqz" 309 | 310 | if reads.endswith(".gz"): 311 | input_proc = subprocess.Popen(["zcat", reads], stdout=subprocess.PIPE) 312 | else: 313 | input_proc = subprocess.Popen(["cat", reads], stdout=subprocess.PIPE) 314 | cmd = ["fqzcomp", "-n2", "-s7+", "-b", "-q3", "/dev/stdin", output_filename] 315 | subprocess.check_call(cmd, stdin=input_proc.stdout) 316 | 317 | return output_filename 318 | 319 | 320 | def convert_fastq_to_repaq( 321 | fwd_reads, 322 | rev_reads=None, 323 | output_prefix=None, 324 | xz_compress=False, 325 | num_threads=multiprocessing.cpu_count(), 326 | ): 327 | """Compress a pair of FASTQ files using Repaq 328 | 329 | Parameters 330 | ---------- 331 | fwd_reads : str 332 | The path to the forward reads in FASTQ format 333 | rev_reads : str, optional 334 | The path to the reverse reads in FASTQ format, by default None 335 | output_prefix : str, optional 336 | A prefix to use for the sample name and output file, by default None 337 | xz_compress : bool, optional 338 | Should we use xz to further compress the output? 339 | num_threads : int, optional 340 | Number of threads to use with Spring, by default multiprocessing.cpu_count() 341 | 342 | Returns 343 | ------- 344 | str 345 | The output filename 346 | """ 347 | if output_prefix is None: 348 | output_prefix = os.path.basename(fwd_reads) 349 | output_prefix = re.sub(FASTQ_PATTERN, "", output_prefix) 350 | 351 | # Create the output filename 352 | output_filename = f"{output_prefix}.rfq" 353 | if xz_compress: 354 | output_filename += ".xz" 355 | 356 | # Generate the repaq command 357 | cmd = ["repaq", "-c", "--stdout", "-i", fwd_reads] 358 | if rev_reads is not None: 359 | cmd.extend(["-I", rev_reads]) 360 | 361 | # Now run repaq and optionally run xz to further compress 362 | with open(output_filename, "w") as output_fh: 363 | if xz_compress: 364 | repaq_proc = subprocess.Popen(cmd, stdout=subprocess.PIPE) 365 | # cmd = ["pxz", "-T", str(num_threads), '--lzma=dict=1000000000', "-z", "-c"] 366 | cmd = ["xz", "-T", str(num_threads), "--lzma2=dict=1000000000", "-z", "-c"] 367 | subprocess.check_call(cmd, stdin=repaq_proc.stdout, stdout=output_fh) 368 | else: 369 | repaq_proc = subprocess.check_call(cmd, stdout=output_fh) 370 | 371 | return output_filename 372 | 373 | 374 | def convert_fastq_to_petagene( 375 | fwd_reads, 376 | rev_reads=None, 377 | num_threads=multiprocessing.cpu_count(), 378 | ): 379 | """Compress a pair of FASTQ files using Spring 380 | 381 | Parameters 382 | ---------- 383 | fwd_reads : str 384 | The path to the forward reads in FASTQ format 385 | rev_reads : str, optional 386 | The path to the reverse reads in FASTQ format, by default None 387 | num_threads : int, optional 388 | Number of threads to use with Spring, by default multiprocessing.cpu_count() 389 | 390 | Returns 391 | ------- 
def convert_fastq_to_petagene(
    fwd_reads,
    rev_reads=None,
    num_threads=multiprocessing.cpu_count(),
):
    """Compress a pair of FASTQ files using Petagene

    Parameters
    ----------
    fwd_reads : str
        The path to the forward reads in FASTQ format
    rev_reads : str, optional
        The path to the reverse reads in FASTQ format, by default None
    num_threads : int, optional
        Number of threads to use with petasuite, by default multiprocessing.cpu_count()

    Returns
    -------
    list[str]
        The output filenames
    """
    cmd = ["petasuite", "-c", "-t", str(num_threads)]
    output_filenames = [re.sub(FASTQ_PATTERN, ".fasterq", fwd_reads)]
    subprocess.check_call(cmd + [fwd_reads])

    if rev_reads is not None:
        output_filenames.append(re.sub(FASTQ_PATTERN, ".fasterq", rev_reads))
        subprocess.check_call(cmd + [rev_reads])

    return output_filenames


if __name__ == "__main__":
    args = _parse_args()
    if args.ucram:
        if args.remove_ids:
            logging.warning("Removing ids not supported for ucram yet.")
        convert_fastq_to_ubam_or_ucram(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            output_type="cram",
            num_threads=args.num_threads,
        )
    elif args.ubam:
        if args.remove_ids:
            logging.warning("Removing ids not supported for ubam yet.")
        convert_fastq_to_ubam_or_ucram(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            output_type="bam",
            num_threads=args.num_threads,
        )
    elif args.mapped_cram:
        logging.error("Support for mapped CRAM not available yet.")
        sys.exit(1)
    elif args.mapped_cram_crumble:
        logging.error("Support for mapped CRAM not available yet.")
        sys.exit(1)
    elif args.fastore:
        if args.remove_ids:
            logging.warning("Removing ids not supported for fastore yet.")
        convert_fastq_to_fastore(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            num_threads=args.num_threads,
        )
    elif args.spring:
        convert_fastq_to_spring(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            remove_ids=args.remove_ids,
            num_threads=args.num_threads,
        )
    elif args.fqzcomp:
        if args.remove_ids:
            logging.warning("Removing ids not supported for fqzcomp yet.")
        # fqzcomp compresses each FASTQ separately; when -o is omitted, each
        # output prefix is derived from its own input filename.
        convert_fastq_to_fqzcomp(reads=args.fwd_reads, output_prefix=args.output_prefix)
        if args.rev_reads is not None:
            convert_fastq_to_fqzcomp(
                reads=args.rev_reads, output_prefix=args.output_prefix
            )
    elif args.repaq:
        if args.remove_ids:
            logging.warning("Removing ids not supported for repaq yet.")
        convert_fastq_to_repaq(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            num_threads=args.num_threads,
        )
    elif args.repaq_xz:
        if args.remove_ids:
            logging.warning("Removing ids not supported for repaq yet.")
        convert_fastq_to_repaq(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            output_prefix=args.output_prefix,
            xz_compress=True,
            num_threads=args.num_threads,
        )
    elif args.petagene:
        if args.remove_ids:
            logging.warning("Removing ids not supported for Petagene.")
        convert_fastq_to_petagene(
            fwd_reads=args.fwd_reads,
            rev_reads=args.rev_reads,
            num_threads=args.num_threads,
        )
--------------------------------------------------------------------------------