├── README.md
├── bin
│   └── ktrim
├── change.log
├── license
│   ├── GPLv3.txt
│   └── Ktrim.license
├── makefile
├── src
│   ├── common.h
│   ├── ktrim.cpp
│   ├── param_handler.h
│   ├── pe_handler.h
│   ├── se_handler.h
│   └── util.h
└── testing_dataset
    ├── adapter.fa
    ├── check.accuracy.pl
    └── simu.reads.pl
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data
4 | Version 1.6.0, Oct 2024
5 | Author: Kun Sun \(sunkun@szbl.ac.cn\)
6 |
7 | Distributed under the
8 | [GNU General Public License v3.0 \(GPLv3\)](https://www.gnu.org/licenses/gpl-3.0.en.html "GPLv3")
9 | for personal and academic usage only.
10 | For detailed information please refer to the license files under `license` directory.
11 |
12 | ---
13 | ## Release of version 1.6
14 | The author is pleased to release version 1.6 of Ktrim with usability improvements.
15 | This version adds a "-R" option to output only the reads that contain adapters. This can be useful for short libraries
16 | in which all reads are expected to contain adapters (e.g., CLIP-seq, whose adapters are now also built in).
17 |
18 | ## Major features of Ktrim
19 | 1. Fast, sensitive, and accurate
20 | 2. Supports both paired- and single-end data
21 | 3. Supports both Gzipped and plain text
22 | 4. Supports output to stdout to pipe with downstream software, e.g., aligners
23 | 5. Supports multi-threading for speed-up
24 | 6. Built-in support for common adapters; customized adapters are also supported
25 |
26 | ## Installation
27 | `Ktrim` is written in `C++` for GNU Linux/Unix platforms. After uncompressing the source package, you can
28 | find an executable file at `bin/ktrim`, compiled using `g++ v4.8.5` and linked with `libz v1.2.7` for Linux
29 | x64 systems. If you cannot run it (usually because of outdated `libc++` or `libz` libraries),
30 | or if you want to build a version optimized for your system, you can re-compile the program:
31 | ```
32 | user@linux$ make clean && make
33 | ```
34 |
35 | ## Run Ktrim
36 | The main program is `ktrim` under `bin/` directory. You can add its path to your `~/.bashrc` file under
37 | the `PATH` variable to call it from anywhere; or you can run the following command to add it to your
38 | current session:
39 | ```
40 | user@linux$ PATH=$PATH:$PWD/bin/
41 | ```
42 |
43 | You can also add `ktrim` to your system to call it from anywhere and share with other users (requires
44 | root privilege):
45 | ```
46 | user@linux# make install
47 | ```
48 |
49 | Call `ktrim` without any parameters to see the usage (or use '-h' option):
50 | ```
51 | Usage: Ktrim [options] -f fq.list {-1/-U Read1.fq [-2 Read2.fq ]} -o out.prefix
52 |
53 | Author : Kun Sun (sunkun@szbl.ac.cn)
54 | Version: 1.6.0 (Oct 2024)
55 |
56 | Ktrim is designed to perform adapter- and quality-trimming of FASTQ files.
57 |
58 | Compulsory parameters:
59 |
60 | -f fq.list Specify the path to a file containing path to read 1/2 fastq files
61 |
62 | OR you can specify the fastq files directly:
63 |
64 | -1/-U Read1.fq Specify the path to the files containing read 1
65 | If your data is Paired-end, use '-1' and specify read 2 files using '-2' option
66 | Note that if '-U' is used, specification of '-2' is invalid
67 | If you have multiple files for your sample, use ',' to separate them
68 | Note that Gzip-compressed files (with .gz suffix) are also supported
69 |
70 | -o out.prefix Specify the prefix of the output files
71 | Note that output files include trimmed reads in FASTQ format and statistics
72 |
73 | Optional parameters:
74 |
75 | -2 Read2.fq Specify the path to the file containing read 2
76 | Use this parameter if your data is generated in paired-end mode
77 | If you have multiple files for your sample, use ',' to separate them
78 | and make sure that all the files are well paired in '-1' and '-2' options
79 |
80 | -c Write the trimming results to stdout (default: not set)
81 | Note that the interleaved fastq format will be used for paired-end data.
82 | -R Only output reads with adapter (default: not set)
83 | -t threads Specify how many threads should be used (default: 6)
84 | You can set '-t' to 0 to use all threads (automatically detected)
85 | 2-8 threads are recommended, as more threads would not benefit the performance
86 |
87 | -p phred-base Specify the baseline of the phred score (default: 33)
88 | -q score The minimum quality score to keep the cycle (default: 20)
89 | Note that 20 means 1% error rate, 30 means 0.1% error rate in Phred
90 | -w window Set the window size for quality check (default: 5)
91 | Ktrim will stop when all the bases in a consecutive window pass the quality threshold
92 |
93 | Phred 33 ('!') and Phred 64 ('@') are the most widely used scoring system
94 | while quality scores start from 35 ('#') in the FASTQ files is also common
95 |
96 | -s size Minimum read size to be kept after trimming (default: 36)
97 |
98 | -k kit Specify the sequencing kit to use built-in adapters
99 | Currently supports 'Illumina' (default), 'Nextera', 'CLIP' and 'BGI'
100 | -a sequence Specify the adapter sequence in read 1
101 | -b sequence Specify the adapter sequence in read 2
102 | If '-a' is set while '-b' is not, I will assume that read 1 and 2 use same adapter
103 | Note that '-k' option has a higher priority (when set, '-a'/'-b' will be ignored)
104 |
105 | -m proportion Set the proportion of mismatches allowed during index and sequence comparison
106 | Default: 0.125 (i.e., 1/8 of compared base pairs)
107 |
108 | -h Show this help information and quit
109 | -v Show the software version and quit
110 |
111 | Please refer to README.md file for more information (e.g., setting adapters).
112 | Citation: Sun K. Bioinformatics 2020 Jun 1; 36(11):3561-3562. (PMID: 32159761)
113 |
114 | Ktrim: extra-fast and accurate adapter- and quality-trimmer.
115 | ```
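
The quality options in the usage text (`-p`, `-q`, `-w`) can be illustrated with a minimal sketch. It assumes that trimming proceeds from the 3' end and stops at the first window in which every base reaches the threshold, as the help text describes. The function name `quality_trimmed_length` is hypothetical and the code is not taken from the Ktrim sources, so the real implementation may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only, not part of Ktrim).
// Returns the read length to keep after removing low-quality 3' bases.
//   qual       : FASTQ quality string of one read
//   phred_base : 33 or 64, as given to '-p'
//   min_qual   : minimum score, as given to '-q'
//   window     : window size, as given to '-w'
std::size_t quality_trimmed_length(const std::string &qual,
                                   unsigned phred_base,
                                   unsigned min_qual,
                                   std::size_t window) {
    assert(window > 0);
    // A base passes when its encoded character is >= phred_base + min_qual,
    // e.g. 33 + 20 = 53 ('5') with the default settings.
    const char threshold = static_cast<char>(phred_base + min_qual);

    std::size_t end = qual.size();  // candidate length, shrinks from the 3' end
    while (end >= window) {
        bool all_pass = true;
        for (std::size_t i = end - window; i < end; ++i) {
            if (qual[i] < threshold) { all_pass = false; break; }
        }
        if (all_pass) return end;   // window [end - window, end) is clean: stop
        --end;                      // otherwise drop one more 3' base
    }
    return 0;  // no clean window (in this sketch, also reads shorter than the window)
}
```

With the defaults (`-p 33 -q 20 -w 5`), calling `quality_trimmed_length("IIIIIIIIII#####", 33, 20, 5)` keeps the first 10 bases, because `#` encodes Phred 2 and `I` encodes Phred 40. A read would then still have to satisfy the `-s` minimum size to be kept.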
116 |
117 | Please note that from version 1.2.0, Ktrim supports Gzip-compressed files as input. If you have multiple
118 | lanes of FASTQ files, Ktrim even supports that some lanes are compressed while others are in plain text, as
119 | long as READ1 and READ2 are the same (either both compressed or plain text) for each lane of paired-end data.
120 |
121 | `Ktrim` contains built-in adapter sequences used by Illumina TruSeq kits, Nextera kits, Nextera transposase
122 | adapters, BGI sequencing kits and CLIP-seq libraries within the package. However, customized adapter sequences are also allowed
123 | by setting '-a' (for read 1) and '-b' (for read 2; if it is the same as read 1, you can leave it blank)
124 | options. You may need to refer to the manual of your library preparation kit for the adapter sequences.
125 | Note that in the current version of `Ktrim`, only 1 pair of adapters is allowed.
126 |
127 | Here are the built-in adapter sequences (the copyright should belong to the corresponding companies):
128 |
129 | ```
130 | Illumina TruSeq kits:
131 | AGATCGGAAGAGC (for both read 1 and read 2)
132 |
133 | Nextera kits (suitable for ATAC-seq, Cut & tag data):
134 | CTGTCTCTTATACACATCT (for both read 1 and read 2)
135 |
136 | BGI adapters:
137 | Read 1: AAGTCGGAGGCCAAGCGGTC
138 | Read 2: AAGTCGGATCGTAGCCATGT
139 |
140 | CLIP-seq adapters:
141 | Read 1: TGGAATTCTCGGGTGCCAAGG
142 | Read 2: GATCGTCGGACTGTAGAACTCTGAAC
143 | ```
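
As a rough illustration of how the `-m` proportion interacts with the adapter lengths listed above, the sketch below compares an adapter against a read at a given position and tolerates a limited number of mismatches. It is a sketch under stated assumptions rather than Ktrim's actual algorithm: the helper name `adapter_matches_at` is hypothetical, and truncating the mismatch budget to an integer is an assumption, since the help text only states the proportion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only, not part of Ktrim).
// Returns true if `adapter` matches `read` starting at `pos`, allowing at most
// (compared bases * max_mismatch_rate) mismatches. If the adapter runs past
// the end of the read, only the overlapping bases are compared.
bool adapter_matches_at(const std::string &read, std::size_t pos,
                        const std::string &adapter, double max_mismatch_rate) {
    assert(max_mismatch_rate >= 0.0 && max_mismatch_rate <= 1.0);
    if (pos >= read.size() || adapter.empty()) return false;

    const std::size_t compared = std::min(adapter.size(), read.size() - pos);
    // Assumption: the budget is truncated, e.g. 13 * 0.125 = 1.625 -> 1.
    const std::size_t budget =
        static_cast<std::size_t>(compared * max_mismatch_rate);

    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < compared; ++i) {
        // Count a mismatch and give up as soon as the budget is exceeded.
        if (read[pos + i] != adapter[i] && ++mismatches > budget) return false;
    }
    return true;
}
```

Under this assumption, the 13-bp TruSeq adapter tolerates 1 mismatch at the default proportion of 0.125, and the 19-bp Nextera adapter tolerates 2 (19 × 0.125 = 2.375, truncated).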
144 |
145 | ### Example 1
146 | If your data were generated using an Illumina TruSeq kit in single-end mode, you can run:
147 | ```
148 | user@linux$ ktrim -U /path/to/read1.fq -o /path/to/output/dir
149 | ```
150 |
151 | ### Example 2
152 | Your data is generated using a kit with customized adapters; your data is composed of 3 lanes in Paired-end
153 | mode and you have prepared a `file.list` to record the paths as follows:
154 | ```
155 | /path/to/lane1.read1.fq.gz /path/to/lane1.read2.fq.gz
156 | /path/to/lane2.read1.fq.gz /path/to/lane2.read2.fq.gz
157 | /path/to/lane3.read1.fq /path/to/lane3.read2.fq
158 | ```
159 | In addition, your Phred scoring system starts from 64; you want to keep only high-quality bases (Phred score >=30)
160 | and reads longer than 50 bp after trimming; and you want to use 8 threads to speed up the analysis.
161 | In this case, you can run:
162 | ```
163 | user@linux$ ktrim -f file.list -t 8 -p 64 -q 30 -s 50 -o /path/to/output/dir \
164 | -a READ1_ADAPTER_SEQUENCE -b READ2_ADAPTER_SEQUENCE
165 | ```
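
To make the `file.list` layout in Example 2 concrete, here is a minimal sketch that reads such a list and checks the rule that read 1 and read 2 of each lane must either both be Gzip-compressed or both be plain text. It assumes one lane per line with the two paths separated by whitespace, as in the example; the names `read_pair_list`, `has_suffix` and `same_compression` are hypothetical and are not Ktrim functions.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using PathPair = std::pair<std::string, std::string>;

// Hypothetical helper: parse "read1 read2" lines, skipping blank lines.
// For a single-column line the second path is left empty.
std::vector<PathPair> read_pair_list(std::istream &in) {
    std::vector<PathPair> pairs;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string r1, r2;
        if (!(fields >> r1)) continue;  // blank line
        fields >> r2;                   // stays empty if there is no second column
        pairs.emplace_back(r1, r2);
    }
    return pairs;
}

// True if `s` ends with `suffix`.
bool has_suffix(const std::string &s, const std::string &suffix) {
    assert(!suffix.empty());
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// True if both files of a lane are compressed, or both are plain text,
// judged by the ".gz" suffix mentioned in the usage text.
bool same_compression(const PathPair &lane) {
    return has_suffix(lane.first, ".gz") == has_suffix(lane.second, ".gz");
}
```

Running `read_pair_list` on the three-lane list above would yield three pairs, and `same_compression` would accept each of them, since lanes 1 and 2 are compressed on both sides and lane 3 is plain text on both sides.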
166 |
167 | ## Outputs explanation
168 | `Ktrim` outputs the trimmed reads in FASTQ format and key statistics (e.g., the number of reads that
169 | contain adapters and the number of reads in the trimmed files).
170 |
171 | ## Testing dataset and benchmark evaluation
172 | Under the `testing_dataset/` directory, a script named `simu.reads.pl` is provided to generate *in silico*
173 | reads for testing purposes only. **Note that the results in the paper are based on the data generated by this
174 | script.** Another script, `check.accuracy.pl`, is designed to evaluate the accuracy of the trimming tools.
175 |
176 | Please refer to Supplementary Method in the paper for reproducing the results (using Ktrim v1.1.0).
177 |
178 | ## Citation
179 | When referencing, please cite "Sun K: **Ktrim: an extra-fast and accurate adapter- and quality-trimmer
180 | for sequencing data.** *Bioinformatics* 2020 Jun 1; 36(11):3561-3562."
181 | [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/32159761 "Ktrim@PubMed")
182 | [Full Text](https://doi.org/10.1093/bioinformatics/btaa171 "Full text on Bioinformatics")
183 |
184 | ---
185 | Please send bug reports to Kun Sun \(sunkun@szbl.ac.cn\).
186 | Ktrim is freely available at
187 | [https://github.com/hellosunking/Ktrim/](https://github.com/hellosunking/Ktrim/ "Ktrim @ Github").
188 |
189 |
--------------------------------------------------------------------------------
/bin/ktrim:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellosunking/Ktrim/a2d98f0c802a30214310802ad73dfb51a1baf426/bin/ktrim
--------------------------------------------------------------------------------
/change.log:
--------------------------------------------------------------------------------
1 | Ktrim change log
2 |
3 | v1.6.0 Oct 2024
4 | * Add '-R' option to output reads with adapters only
5 | * Add built-in adapters for CLIP-seq
6 | * Fix a bug in single-thread file-writing
7 | * Change the default thread to 6
8 |
9 | v1.5.1 Nov 2023
10 | * Fix a bug that causes Ktrim to exit when the input FASTQ file is too small
11 |
12 | v1.5.0 May 2023
13 | * Add support for providing input using a file (-f option)
14 | * Add support for outputting to stdout for piping with other tools (-c option; use interleaved-fastq format for paired-end data)
15 | * Optimize file writing, PE data has ~25% speed-up (under 4-threads)
16 | * Default thread is set to 4 (-t option)
17 |
18 | v1.4.1 Dec 2021
19 | * Optimize file writing, both SE/PE data has ~10% speed-up (under 4-threads)
20 | * Allows 1 mismatch in tail-hits of PE data
21 |
22 | v1.4.0 Oct 2021
23 | * Optimize file loading, PE data has a ~50% speed-up using 4-threads
24 |
25 | v1.3.1 Apr 2021
26 | * Fix bug in processing SE data using single-thread mode
27 |
28 | v1.3.0 Mar 2021
29 | * Report the number of adapter dimers in the sequencing data
30 |
31 | v1.2.2 Jan 2021
32 | * Fix bug when "-o" is NOT present but the program does not quit
33 |
34 | v1.2.1 Nov 2020
35 | * Fix bug in multi-file handling
36 |
37 | v1.2.0 Jun 2020
38 | * Support Gzip-compressed files as input
39 | * Support '-w' option to check quality scores within a window rather than 1 bp
40 | * Optimize adaptor handling in paired-end data
41 |
42 | v1.1.1 Mar 2020
43 | * Optimize cout and cerr output streams
44 | * Suppress the progress report
45 |
46 | v1.1.0 Feb 2020
47 | * Support users to set the allowed mismatches during adapter-sequence comparison (-m option)
48 | * Fix a bug when read1 and read2 sequences are not of the same length
49 | * Optimize the mismatch calculation during adapter-sequence comparison for paired-end data
50 |
--------------------------------------------------------------------------------
/license/GPLv3.txt:
--------------------------------------------------------------------------------
1 | GNU LESSER GENERAL PUBLIC LICENSE
2 | Version 3, 29 June 2007
3 |
4 | Copyright (C) 2007 Free Software Foundation, Inc.
5 | Everyone is permitted to copy and distribute verbatim copies
6 | of this license document, but changing it is not allowed.
7 |
8 |
9 | This version of the GNU Lesser General Public License incorporates
10 | the terms and conditions of version 3 of the GNU General Public
11 | License, supplemented by the additional permissions listed below.
12 |
13 | 0. Additional Definitions.
14 |
15 | As used herein, "this License" refers to version 3 of the GNU Lesser
16 | General Public License, and the "GNU GPL" refers to version 3 of the GNU
17 | General Public License.
18 |
19 | "The Library" refers to a covered work governed by this License,
20 | other than an Application or a Combined Work as defined below.
21 |
22 | An "Application" is any work that makes use of an interface provided
23 | by the Library, but which is not otherwise based on the Library.
24 | Defining a subclass of a class defined by the Library is deemed a mode
25 | of using an interface provided by the Library.
26 |
27 | A "Combined Work" is a work produced by combining or linking an
28 | Application with the Library. The particular version of the Library
29 | with which the Combined Work was made is also called the "Linked
30 | Version".
31 |
32 | The "Minimal Corresponding Source" for a Combined Work means the
33 | Corresponding Source for the Combined Work, excluding any source code
34 | for portions of the Combined Work that, considered in isolation, are
35 | based on the Application, and not on the Linked Version.
36 |
37 | The "Corresponding Application Code" for a Combined Work means the
38 | object code and/or source code for the Application, including any data
39 | and utility programs needed for reproducing the Combined Work from the
40 | Application, but excluding the System Libraries of the Combined Work.
41 |
42 | 1. Exception to Section 3 of the GNU GPL.
43 |
44 | You may convey a covered work under sections 3 and 4 of this License
45 | without being bound by section 3 of the GNU GPL.
46 |
47 | 2. Conveying Modified Versions.
48 |
49 | If you modify a copy of the Library, and, in your modifications, a
50 | facility refers to a function or data to be supplied by an Application
51 | that uses the facility (other than as an argument passed when the
52 | facility is invoked), then you may convey a copy of the modified
53 | version:
54 |
55 | a) under this License, provided that you make a good faith effort to
56 | ensure that, in the event an Application does not supply the
57 | function or data, the facility still operates, and performs
58 | whatever part of its purpose remains meaningful, or
59 |
60 | b) under the GNU GPL, with none of the additional permissions of
61 | this License applicable to that copy.
62 |
63 | 3. Object Code Incorporating Material from Library Header Files.
64 |
65 | The object code form of an Application may incorporate material from
66 | a header file that is part of the Library. You may convey such object
67 | code under terms of your choice, provided that, if the incorporated
68 | material is not limited to numerical parameters, data structure
69 | layouts and accessors, or small macros, inline functions and templates
70 | (ten or fewer lines in length), you do both of the following:
71 |
72 | a) Give prominent notice with each copy of the object code that the
73 | Library is used in it and that the Library and its use are
74 | covered by this License.
75 |
76 | b) Accompany the object code with a copy of the GNU GPL and this license
77 | document.
78 |
79 | 4. Combined Works.
80 |
81 | You may convey a Combined Work under terms of your choice that,
82 | taken together, effectively do not restrict modification of the
83 | portions of the Library contained in the Combined Work and reverse
84 | engineering for debugging such modifications, if you also do each of
85 | the following:
86 |
87 | a) Give prominent notice with each copy of the Combined Work that
88 | the Library is used in it and that the Library and its use are
89 | covered by this License.
90 |
91 | b) Accompany the Combined Work with a copy of the GNU GPL and this license
92 | document.
93 |
94 | c) For a Combined Work that displays copyright notices during
95 | execution, include the copyright notice for the Library among
96 | these notices, as well as a reference directing the user to the
97 | copies of the GNU GPL and this license document.
98 |
99 | d) Do one of the following:
100 |
101 | 0) Convey the Minimal Corresponding Source under the terms of this
102 | License, and the Corresponding Application Code in a form
103 | suitable for, and under terms that permit, the user to
104 | recombine or relink the Application with a modified version of
105 | the Linked Version to produce a modified Combined Work, in the
106 | manner specified by section 6 of the GNU GPL for conveying
107 | Corresponding Source.
108 |
109 | 1) Use a suitable shared library mechanism for linking with the
110 | Library. A suitable mechanism is one that (a) uses at run time
111 | a copy of the Library already present on the user's computer
112 | system, and (b) will operate properly with a modified version
113 | of the Library that is interface-compatible with the Linked
114 | Version.
115 |
116 | e) Provide Installation Information, but only if you would otherwise
117 | be required to provide such information under section 6 of the
118 | GNU GPL, and only to the extent that such information is
119 | necessary to install and execute a modified version of the
120 | Combined Work produced by recombining or relinking the
121 | Application with a modified version of the Linked Version. (If
122 | you use option 4d0, the Installation Information must accompany
123 | the Minimal Corresponding Source and Corresponding Application
124 | Code. If you use option 4d1, you must provide the Installation
125 | Information in the manner specified by section 6 of the GNU GPL
126 | for conveying Corresponding Source.)
127 |
128 | 5. Combined Libraries.
129 |
130 | You may place library facilities that are a work based on the
131 | Library side by side in a single library together with other library
132 | facilities that are not Applications and are not covered by this
133 | License, and convey such a combined library under terms of your
134 | choice, if you do both of the following:
135 |
136 | a) Accompany the combined library with a copy of the same work based
137 | on the Library, uncombined with any other library facilities,
138 | conveyed under the terms of this License.
139 |
140 | b) Give prominent notice with the combined library that part of it
141 | is a work based on the Library, and explaining where to find the
142 | accompanying uncombined form of the same work.
143 |
144 | 6. Revised Versions of the GNU Lesser General Public License.
145 |
146 | The Free Software Foundation may publish revised and/or new versions
147 | of the GNU Lesser General Public License from time to time. Such new
148 | versions will be similar in spirit to the present version, but may
149 | differ in detail to address new problems or concerns.
150 |
151 | Each version is given a distinguishing version number. If the
152 | Library as you received it specifies that a certain numbered version
153 | of the GNU Lesser General Public License "or any later version"
154 | applies to it, you have the option of following the terms and
155 | conditions either of that published version or of any later version
156 | published by the Free Software Foundation. If the Library as you
157 | received it does not specify a version number of the GNU Lesser
158 | General Public License, you may choose any version of the GNU Lesser
159 | General Public License ever published by the Free Software Foundation.
160 |
161 | If the Library as you received it specifies that a proxy can decide
162 | whether future versions of the GNU Lesser General Public License shall
163 | apply, that proxy's public statement of acceptance of any version is
164 | permanent authorization for you to choose that version for the
165 | Library.
--------------------------------------------------------------------------------
/license/Ktrim.license:
--------------------------------------------------------------------------------
1 | GPLv3.txt
--------------------------------------------------------------------------------
/makefile:
--------------------------------------------------------------------------------
1 | bin/ktrim: src/ktrim.cpp src/common.h src/util.h src/param_handler.h src/pe_handler.h src/se_handler.h
2 | @echo Build Ktrim
3 | @cd src; g++ ktrim.cpp -march=native -std=c++11 -fopenmp -O3 -o ../bin/ktrim -lz; cd ..
4 |
5 | install: bin/ktrim # requires root
6 | @echo Install Ktrim for all users
7 | @cp bin/ktrim /usr/bin/
8 |
9 | clean:
10 | rm -f bin/ktrim
11 |
12 |
--------------------------------------------------------------------------------
/src/common.h:
--------------------------------------------------------------------------------
1 | /*
2 | * common.h
3 | *
4 | * This header file records the constants used in Ktrim
5 | *
6 | * Author: Kun Sun (sunkun@szbl.ac.cn)
7 | * Date: Mar, 2020
8 | * This program is part of the Ktrim package
9 | *
10 | **/
11 |
12 | /*
13 | * Sequencing model:
14 | *
15 | * *read1 -->
16 | * 5' adapter - sequence sequence sequence - 3' adapter
17 | * <-- read2*
18 | *
19 | * so read1 may contain 3' adapter, name it adapter_r1,
20 | * read2 may contain reversed and ACGT-paired 5' adapter, name it adapter_r2
21 | *
22 | **/
23 |
24 | #ifndef _KTRIM_COMMON_
25 | #define _KTRIM_COMMON_
26 |
27 | #include <string>
28 | #include <vector>
29 | #include <chrono>
30 | using namespace std;
31 |
32 | const char * VERSION = "1.6.0 (Oct 2024)";
33 |
34 | // 1.6.0, add '-R' option to output reads with adapters only;
35 | // add built-in adapters for CLIP-seq;
36 | // fix a bug in single-thread file-writing
37 | // change the waiting_for_writing check to 0.1 ms
38 | // change the default thread to 6
39 | // 1.5.1, fix a bug that causes exit if the FASTQ files are too small
40 | // 1.5.0, support loading file names from a file, and output to stdout for pipelines
41 | // 1.4.1, now we use 2 threads for reading and 2 threads for writing, ~10% faster
42 | // 1.4.0, now we use 2 threads for reading files in PE mode, ~33% faster
43 | // 1.3.1 fixed the bug in dimers when working on SE data processing using single-thread
44 | // 1.2.2 fixed the bug when "-o" is NOT present but the program does not quit
45 | // 1.2.1 fixed the bug in multi-file handling
46 |
47 | // structure of a READ
48 | const unsigned int MAX_READ_ID = 128;
49 | const unsigned int MAX_READ_CYCLE = 512;
50 |
51 | // maximum insert size to call a dimer
52 | const unsigned int DIMER_INSERT = 1;
53 |
54 | // time interval for querying for writing
55 | static int write_thread;
56 | //const static chrono::milliseconds waiting_time_for_writing(1);
57 | const static chrono::microseconds waiting_time_for_writing(100);
58 |
59 | typedef struct {
60 | char *id;
61 | char *seq;
62 | char *qual;
63 | unsigned int size;
64 | } CSEREAD;
65 |
66 | typedef struct {
67 | char *id1;
68 | char *seq1;
69 | char *qual1;
70 | unsigned int size;
71 |
72 | char *id2;
73 | char *seq2;
74 | char *qual2;
75 | unsigned int size2;
76 | } CPEREAD;
77 |
78 | typedef struct {
79 | unsigned int *dropped;
80 | unsigned int *real_adapter;
81 | unsigned int *tail_adapter;
82 | unsigned int *dimer;
83 | unsigned int *pass;
84 | } ktrim_stat;
85 |
86 | typedef struct {
87 | char ** buffer1;
88 | char ** buffer2;
89 | unsigned int *b1stored;
90 | unsigned int *b2stored;
91 | } writeBuffer;
92 |
93 | // built-in adapters
94 | const unsigned int MIN_ADAPTER_SIZE = 8;
95 | const unsigned int MAX_ADAPTER_SIZE = 64;
96 | const unsigned int ADAPTER_BUFFER_SIZE = 128;
97 | const unsigned int ADAPTER_INDEX_SIZE = 3;
98 | const unsigned int OFFSET_INDEX3 = 3;
99 | // illumina TruSeq kits adapters
100 | //const char * illumina_adapter_r1 = "AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG";
101 | //const char * illumina_adapter_r2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
102 | //const unsigned int illumina_adapter_len = 33;
103 | const char * illumina_adapter_r1 = "AGATCGGAAGAGC";
104 | const char * illumina_adapter_r2 = illumina_adapter_r1;
105 | const unsigned int illumina_adapter_len = 13;
106 | const char * illumina_index1 = "AGA"; // use first 3 as index
107 | const char * illumina_index2 = illumina_index1; // use first 3 as index
108 | const char * illumina_index3 = "TCG"; // for single-end data
109 | // Nextera kits and AmpliSeq for Illumina panels
110 | const char * nextera_adapter_r1 = "CTGTCTCTTATACACATCT";
111 | const char * nextera_adapter_r2 = nextera_adapter_r1;
112 | const unsigned int nextera_adapter_len = 19;
113 | const char * nextera_index1 = "CTG"; // use first 3 as index
114 | const char * nextera_index2 = nextera_index1; // use first 3 as index
115 | const char * nextera_index3 = "TCT"; // for single-end data
116 | // Nextera transposase adapters
117 | const char * transposase_adapter_r1 = "TCGTCGGCAGCGTC";
118 | const char * transposase_adapter_r2 = "GTCTCGTGGGCTCG";
119 | const unsigned int transposase_adapter_len = 14;
120 | const char * transposase_index1 = "TCG"; // use first 3 as index
121 | const char * transposase_index2 = "GTC"; // use first 3 as index
122 | const char * transposase_index3 = "TCG"; // for single-end data
123 |
124 | //PRO-seq/CLIP-seq following Mahat et al. Nat Protoc 2016; 11:1455–1476. DOI: 10.1038/nprot.2016.086
125 | //NOTE: we recommend using only read2 (template strand) and setting '-R' to output only the reads with adapters
126 | const char * clip_adapter_r1 = "TGGAATTCTCGGGTGCCAAGG";
127 | const char * clip_adapter_r2 = "GATCGTCGGACTGTAGAACTCTGAAC";
128 | const unsigned int clip_adapter_len = 21;
129 | const char * clip_index1 = "TGG"; // use first 3 as index
130 | const char * clip_index2 = "GAT"; // use first 3 as index
131 | const char * clip_index3 = "TGG"; // for single-end data
132 |
133 | // BGI adapters
134 | //const char * bgi_adapter_r1 = "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA";
135 | //const char * bgi_adapter_r2 = "AAGTCGGATCGTAGCCATGTCGTTCTGTGAGC";
136 | //const unsigned int bgi_adapter_len = 32;
137 | const char * bgi_adapter_r1 = "AAGTCGGAGGCCAAGCGGTC";
138 | const char * bgi_adapter_r2 = "AAGTCGGATCGTAGCCATGT";
139 | const unsigned int bgi_adapter_len = 20;
140 | const char * bgi_index1 = "AAG"; // use first 3 as index
141 | const char * bgi_index2 = "AAG"; // use first 3 as index
142 | const char * bgi_index3 = "TCG"; // for single-end data
143 |
144 | // seed and error configurations
145 | const unsigned int impossible_seed = 1000000;
146 | const unsigned int MAX_READ_LENGTH = 1024;
147 | const unsigned int MAX_SEED_NUM = 128;
148 |
149 | //configurations for parallelization, which is highly related to memory usage
150 | //but seems to have very minor effect on running time
151 | const int READS_PER_BATCH = 1 << 15; // process 32 K (32,768) reads per batch (for parallelization)
152 | const int BUFFER_SIZE_PER_BATCH_READ = 1 << 25; // 32 MB buffer for each thread to store FASTQ
153 | const int MEM_SE_READSET = READS_PER_BATCH * (MAX_READ_ID+MAX_READ_CYCLE+MAX_READ_CYCLE);
154 | const int MEM_PE_READSET = READS_PER_BATCH * (MAX_READ_ID+MAX_READ_CYCLE+MAX_READ_CYCLE) * 2;
155 |
156 | // enlarge the buffer for single-thread run
157 | //const int READS_PER_BATCH_ST = READS_PER_BATCH << 1; // process 256 K reads per batch (for parallelization)
158 | //const int BUFFER_SIZE_PER_BATCH_READ_ST = BUFFER_SIZE_PER_BATCH_READ << 1; // 256 MB buffer for each thread to store FASTQ
159 | const int READS_PER_BATCH_ST = 1 << 15; // No. of reads per-batch for single-thread
160 | const int BUFFER_SIZE_PER_BATCH_READ_ST = 1 << 25; // buffer for single-thread to store FASTQ
161 | const int MEM_SE_READSET_ST = READS_PER_BATCH_ST * (MAX_READ_ID+MAX_READ_CYCLE+MAX_READ_CYCLE);
162 | const int MEM_PE_READSET_ST = READS_PER_BATCH_ST * (MAX_READ_ID+MAX_READ_CYCLE+MAX_READ_CYCLE) * 2;
163 |
164 | const char FILE_SEPARATOR = ',';
165 |
166 | // parameters related
167 | typedef struct {
168 | char *filelist;
169 | char *FASTQ1, *FASTQ2, *FASTQU;
170 | char *outpre;
171 | vector<string> R1s, R2s;
172 |
173 | unsigned int thread;
174 | unsigned int min_length;
175 | unsigned int phred;
176 | unsigned int minqual;
177 | unsigned int quality;
178 | unsigned int window;
179 |
180 | const char *seqKit, *seqA, *seqB;
181 | const char *adapter_r1, *adapter_r2;
182 | unsigned int adapter_len;
183 | const char *adapter_index1, *adapter_index2, *adapter_index3;
184 |
185 | bool use_default_mismatch;
186 | float mismatch_rate;
187 |
188 | bool paired_end_data;
189 | bool write2stdout;
190 | bool outputReadWithAdaptorOnly;
191 | FILE *fout1;
192 | FILE *fout2;
193 | FILE *flog;
194 | } ktrim_param;
195 |
196 | const char * param_list = "1:2:U:o:t:k:s:p:q:w:a:b:m:f:chRv";
197 |
198 | // definition of functions
199 | void usage();
200 | void init_param( ktrim_param &kp );
201 | int process_cmd_param( int argc, char * argv[], ktrim_param &kp );
202 | void print_param( const ktrim_param &kp );
203 | void extractFileNames( const char *str, vector<string> & Rs );
204 | void loadFQFileNames( ktrim_param &kp );
205 |
206 | // C-style
207 | unsigned int load_batch_data_PE_C( FILE * fq1, FILE * fq2, CPEREAD *loadingReads, unsigned int num );
208 | bool check_mismatch_dynamic_PE_C( const CPEREAD *read, unsigned int pos, const ktrim_param &kp );
209 | int process_single_thread_PE_C( const ktrim_param &kp );
210 | int process_multi_thread_PE_C( const ktrim_param &kp );
211 |
212 | unsigned int load_batch_data_SE_C( FILE * fp, CSEREAD *loadingReads, unsigned int num );
213 | bool check_mismatch_dynamic_SE_C( const char * p, unsigned int pos, const ktrim_param &kp );
214 | int process_single_thread_SE_C( const ktrim_param &kp );
215 | int process_multi_thread_SE_C( const ktrim_param &kp );
216 |
217 | #endif
218 |
219 |
--------------------------------------------------------------------------------
/src/ktrim.cpp:
--------------------------------------------------------------------------------
1 | /**
2 | * Author: Kun Sun (sunkun@szbl.ac.cn)
3 | * Date: May 2023
4 | * Main program of Ktrim v1.5.0
5 | **/
6 |
7 | #include
8 | #include
9 | #include
10 | #include
11 | #include
12 | #include
13 | #include
14 | #include
15 | #include
16 | #include
17 | #include "common.h"
18 | #include "util.h"
19 | #include "param_handler.h"
20 | #include "pe_handler.h"
21 | #include "se_handler.h"
22 |
23 | using namespace std;
24 |
25 | int main( int argc, char *argv[] ) {
26 | // process the command line parameters
27 | static ktrim_param kp;
28 | init_param( kp );
29 | int retValue = process_cmd_param( argc, argv, kp );
30 | if( retValue == 100 ) // help or version
31 | return 0;
32 | else if( retValue != 0 )
33 | return retValue;
34 |
35 | if( kp.paired_end_data ) { // paired-end data
36 | if( kp.thread == 1 )
37 | retValue = process_single_thread_PE_C( kp );
38 | else if( kp.thread == 2 )
39 | retValue = process_two_thread_PE_C( kp );
40 | else
41 | retValue = process_multi_thread_PE_C( kp );
42 | } else {
43 | if( kp.thread == 1 )
44 | retValue = process_single_thread_SE_C( kp );
45 | else
46 | retValue = process_multi_thread_SE_C( kp );
47 | }
48 |
49 | close_files( kp );
50 | return retValue;
51 | }
52 |
53 |
--------------------------------------------------------------------------------
/src/param_handler.h:
--------------------------------------------------------------------------------
1 | /**
2 | * Author: Kun Sun (sunkun@szbl.ac.cn)
3 | * Date: May 2023
4 | * This program is part of the Ktrim package
5 | **/
6 |
7 | #include
8 | #include
9 | #include
10 | #include
11 | #include
12 | #include
13 | #include
14 | #include
15 | #include
16 | #include
17 | #include "common.h"
18 |
19 | using namespace std;
20 |
21 | //static char tmp_r1[ADAPTER_BUFFER_SIZE];
22 | //static char tmp_r2[ADAPTER_BUFFER_SIZE];
23 | static char tmp_index1[ADAPTER_INDEX_SIZE+1];
24 | static char tmp_index2[ADAPTER_INDEX_SIZE+1];
25 | static char tmp_index3[ADAPTER_INDEX_SIZE+1];
26 |
27 | // default parameters
28 | void init_param( ktrim_param &kp ) {
29 | kp.FASTQ1 = NULL;
30 | kp.FASTQ2 = NULL;
31 | kp.FASTQU = NULL;
32 | kp.outpre = NULL;
33 | kp.filelist = NULL; // tested against NULL in process_cmd_param()
34 | kp.thread = 6;
35 | kp.min_length = 36;
36 | kp.phred = 33;
37 | kp.minqual = 20;
38 | kp.quality = 53;
39 | kp.window = 5;
40 |
41 | kp.seqKit = NULL;
42 | kp.seqA = NULL;
43 | kp.seqB = NULL;
44 |
45 | kp.adapter_r1 = NULL;
46 | kp.adapter_r2 = NULL;
47 | kp.adapter_len= 0;
48 | kp.adapter_index1 = NULL;
49 | kp.adapter_index2 = NULL;
50 | kp.adapter_index3 = NULL;
51 |
52 | kp.write2stdout = false;
53 | kp.outputReadWithAdaptorOnly = false;
54 | kp.use_default_mismatch = true;
55 | kp.mismatch_rate = 0.125;
56 | }
57 |
58 | // process user-supplied parameters
59 | int process_cmd_param( int argc, char * argv[], ktrim_param &kp ) {
60 | const char * prg = argv[0];
61 | int index;
62 | int ch;
63 | while( (ch = getopt(argc, argv, param_list) ) != -1 ) {
64 | switch( ch ) {
65 | case 'f': kp.filelist = optarg; break;
66 | case '1': kp.FASTQ1 = optarg; break;
67 | case '2': kp.FASTQ2 = optarg; break;
68 | case 'U': kp.FASTQU = optarg; break;
69 | case 'o': kp.outpre = optarg; break;
70 | case 't': kp.thread = atoi(optarg); break;
71 | case 'k': kp.seqKit = optarg; break;
72 | case 's': kp.min_length = atoi(optarg); break;
73 | case 'p': kp.phred = atoi(optarg); break;
74 | case 'q': kp.minqual = atoi(optarg); break;
75 | case 'w': kp.window = atoi(optarg); break;
76 | case 'a': kp.seqA = optarg; break;
77 | case 'b': kp.seqB = optarg; break;
78 |
79 | case 'm': kp.use_default_mismatch = false; kp.mismatch_rate = atof(optarg); break;
80 | case 'c': kp.write2stdout = true; break;
81 | case 'R': kp.outputReadWithAdaptorOnly = true; break;
82 |
83 | case 'h': usage(); return 100;
84 | case 'v': cout << VERSION << '\n'; return 100;
85 |
86 | default:
87 | //cerr << "\033[1;31mError: invalid argument ('" << (char)optopt << "')!\033[0m\n";
88 | usage();
89 | return 2;
90 | }
91 | }
92 |
93 | argc -= optind;
94 | argv += optind;
95 | if( argc > 0 ) {
96 | cerr << "\033[1;31mError: unrecognized parameter ('" << argv[0] << "')!\033[0m\n";
97 | usage();
98 | return 2;
99 | }
100 |
101 | // check compulsory parameters
102 | if( kp.filelist == NULL ) {
103 | if( kp.FASTQU != NULL ) { // single-end
104 | if( kp.FASTQ1!=NULL || kp.FASTQ2!=NULL ) {
105 | cerr << "\033[1;31mError: '-U' is specified but you also set '-1'/'-2'!\033[0m\n";
106 | usage();
107 | return 2;
108 | }
109 | extractFileNames( kp.FASTQU, kp.R1s );
110 | kp.paired_end_data = false;
111 | } else {
112 | if( kp.FASTQ1 == NULL ) {
113 | cerr << "\033[1;31mError: No input file specified (both '-1' and '-U' are not set)!\033[0m\n";
114 | usage();
115 | return 2;
116 | }
117 | kp.FASTQU = kp.FASTQ1;
118 | extractFileNames( kp.FASTQU, kp.R1s );
119 |
120 | if( kp.FASTQ2 != NULL ) {
121 | extractFileNames( kp.FASTQ2, kp.R2s );
122 | kp.paired_end_data = true;
123 | } else {
124 | kp.paired_end_data = false;
125 | }
126 | }
127 | } else { // the -f option has a higher priority
128 | if( kp.FASTQU != NULL || kp.FASTQ1!=NULL || kp.FASTQ2!=NULL ) {
129 | cerr << "\033[1;31mError: '-f' is specified but you also set '-U'/'-1'/'-2'!\033[0m\n";
130 | usage();
131 | return 2;
132 | }
133 | loadFQFileNames( kp );
134 | }
135 |
136 | if( kp.outpre == NULL ) {
137 | cerr << "\033[1;31mError: No output file specified ('-o' not set)!\033[0m\n";
138 | usage();
139 | return 3;
140 | }
141 |
142 | // check optional parameters
143 | if( kp.thread == 0 ) {
144 | // cerr << "Warning: thread is set to 0! I will use all threads (at most 8) instead.\n";
145 | kp.thread = omp_get_max_threads();
146 | if( kp.thread > 8 )
147 | kp.thread = 8;
148 | }
149 | if( kp.thread > 8 ) {
150 | cerr << "Warning: we strongly discourage the usage of more than 8 threads.\n";
151 | }
152 | if( kp.min_length < 10 ) {
153 | cerr << "\033[1;31mError: invalid min_length! Must be at least 10!\033[0m\n";
154 | usage();
155 | return 11;
156 | }
157 | if( kp.phred==0 || kp.minqual==0 || kp.window==0 ) {
158 | cerr << "\033[1;31mError: invalid phred/quality/window parameters!\033[0m\n";
159 | usage();
160 | return 12;
161 | }
162 | kp.quality = kp.phred + kp.minqual;
163 |
164 | // settings of adapters
165 | if( kp.seqKit != NULL ) { // use built-in adaptors
166 | if( strcmp(kp.seqKit, "illumina")==0 || strcmp(kp.seqKit, "ILLUMINA")==0 || strcmp(kp.seqKit, "Illumina")==0 ) {
167 | kp.adapter_r1 = illumina_adapter_r1;
168 | kp.adapter_r2 = illumina_adapter_r2;
169 | kp.adapter_len = illumina_adapter_len;
170 | kp.adapter_index1 = illumina_index1;
171 | kp.adapter_index2 = illumina_index2;
172 | kp.adapter_index3 = illumina_index3;
173 | } else if( strcmp(kp.seqKit, "Nextera") == 0 || strcmp(kp.seqKit, "nextera") == 0 ) {
174 | kp.adapter_r1 = nextera_adapter_r1;
175 | kp.adapter_r2 = nextera_adapter_r2;
176 | kp.adapter_len = nextera_adapter_len;
177 | kp.adapter_index1 = nextera_index1;
178 | kp.adapter_index2 = nextera_index2;
179 | kp.adapter_index3 = nextera_index3;
180 | } else if( strcmp(kp.seqKit, "Transposase") == 0 || strcmp(kp.seqKit, "transposase") == 0 ) {
181 | kp.adapter_r1 = transposase_adapter_r1;
182 | kp.adapter_r2 = transposase_adapter_r2;
183 | kp.adapter_len = transposase_adapter_len;
184 | kp.adapter_index1 = transposase_index1;
185 | kp.adapter_index2 = transposase_index2;
186 | kp.adapter_index3 = transposase_index3;
187 | } else if( strcmp(kp.seqKit, "CLIP") == 0 || strcmp(kp.seqKit, "clip") == 0 ) {
188 | kp.adapter_r1 = clip_adapter_r1;
189 | kp.adapter_r2 = clip_adapter_r2;
190 | kp.adapter_len = clip_adapter_len;
191 | kp.adapter_index1 = clip_index1;
192 | kp.adapter_index2 = clip_index2;
193 | kp.adapter_index3 = clip_index3;
194 | } else if( strcmp(kp.seqKit, "BGI") == 0 || strcmp(kp.seqKit, "bgi") == 0 ) {
195 | kp.adapter_r1 = bgi_adapter_r1;
196 | kp.adapter_r2 = bgi_adapter_r2;
197 | kp.adapter_len = bgi_adapter_len;
198 | kp.adapter_index1 = bgi_index1;
199 | kp.adapter_index2 = bgi_index2;
200 | kp.adapter_index3 = bgi_index3;
201 | } else {
202 | cerr << "\033[1;31mError: unacceptable sequencing kit types!\033[0m\n";
203 | usage();
204 | return 13;
205 | }
206 | if( kp.seqA!=NULL || kp.seqB!=NULL )
207 | cerr << "\033[1;32mWarning: '-k' is set! '-a'/'-b' will be ignored!\033[0m\n";
208 | } else {
209 | if( kp.seqA==NULL ) { // no adapters set, use default; note that if -b is set BUT -a is not, report an error
210 | if( kp.seqB != NULL ) {
211 | cerr << "\033[1;31mError: '-b' is set while '-a' is not set!\033[0m\n";
212 | usage();
213 | return 14;
214 | }
215 | kp.adapter_r1 = illumina_adapter_r1;
216 | kp.adapter_r2 = illumina_adapter_r2;
217 | kp.adapter_len = illumina_adapter_len;
218 | kp.adapter_index1 = illumina_index1;
219 | kp.adapter_index2 = illumina_index2;
220 | kp.adapter_index3 = illumina_index3;
221 | } else {
222 | if( kp.seqB == NULL ) { // seqB not set, then set it to seqA
223 | kp.seqB = kp.seqA;
224 | if( kp.FASTQ2 != NULL ) // paired end
225 | cerr << "\033[1;32mWarning: '-b' is not set for Paired-End data! I will use '-a' as adapter for read 2!\033[0m\n";
226 | }
227 | register unsigned int i = 0;
228 | while( true ) {
229 | if( kp.seqA[i]=='\0' || kp.seqB[i]=='\0' )
230 | break;
231 |
232 | ++ i;
233 | }
234 | if( i < MIN_ADAPTER_SIZE || i > MAX_ADAPTER_SIZE ) {
235 | cerr << "\033[1;31mError: adapter size must be between " << MIN_ADAPTER_SIZE << " to "
236 | << MAX_ADAPTER_SIZE << " bp!\033[0m\n";
237 | return 20;
238 | }
239 | kp.adapter_len = i;
240 |
241 | for( i=0; i!=ADAPTER_INDEX_SIZE; ++i ) {
242 | tmp_index1[i] = kp.seqA[i];
243 | tmp_index2[i] = kp.seqB[i];
244 | tmp_index3[i] = kp.seqA[i+ADAPTER_INDEX_SIZE];
245 | }
246 | tmp_index1[ADAPTER_INDEX_SIZE] = '\0';
247 | tmp_index2[ADAPTER_INDEX_SIZE] = '\0';
248 | tmp_index3[ADAPTER_INDEX_SIZE] = '\0';
249 |
250 | kp.adapter_r1 = kp.seqA;
251 | kp.adapter_r2 = kp.seqB;
252 | kp.adapter_index1 = tmp_index1;
253 | kp.adapter_index2 = tmp_index2;
254 | kp.adapter_index3 = tmp_index3;
255 | }
256 | }
257 |
258 | // prepare output files
259 | string fileName = kp.outpre;
260 | if( kp.write2stdout ) {
261 | kp.fout1 = stdout;
262 | kp.fout2 = NULL;
263 | } else {
264 | if( kp.paired_end_data ) {
265 | fileName += ".read1.fq";
266 | kp.fout1 = fopen( fileName.c_str(), "wt" );
267 | fileName[ fileName.size()-4 ] = '2'; // read1 -> read2
268 | kp.fout2 = fopen( fileName.c_str(), "wt" );
269 | if( kp.fout1 == NULL || kp.fout2 == NULL ) {
270 | cerr << "\033[1;31mError: write file failed!\033[0m\n";
271 | if( kp.fout1 != NULL )fclose( kp.fout1 );
272 | if( kp.fout2 != NULL )fclose( kp.fout2 );
273 | exit(103);
274 | }
275 | } else { // single-end
276 | fileName += ".read1.fq";
277 | kp.fout1 = fopen( fileName.c_str(), "wt" );
278 | if( kp.fout1 == NULL ) {
279 | cerr << "\033[1;31mError: write file failed!\033[0m\n";
280 | exit(103);
281 | }
282 | }
283 | }
284 |
285 | fileName = kp.outpre;
286 | fileName += ".trim.log";
287 | kp.flog = fopen( fileName.c_str(), "wt" );
288 | if( kp.flog == NULL ) {
289 | cerr << "\033[1;31mError: cannot write log file!\033[0m\n";
290 | if( kp.fout1 != NULL )fclose( kp.fout1 );
291 | if( kp.fout2 != NULL )fclose( kp.fout2 );
292 | exit(105);
293 | }
294 |
295 | return 0;
296 | }
297 |
298 | void close_files( const ktrim_param &kp ) {
299 | if( ! kp.write2stdout ) {
300 | fclose( kp.fout1 );
301 |
302 | if( kp.paired_end_data )
303 | fclose( kp.fout2 );
304 | }
305 | fclose( kp.flog );
306 | }
307 |
308 | void usage() {
309 | cerr << "\n\033[1;34mUsage: Ktrim [options] -f fq.list {-1/-U Read1.fq [-2 Read2.fq ]} -o out.prefix\033[0m\n\n"
310 | << "Author : Kun Sun (sunkun@szbl.ac.cn)\n"
311 | << "Version: " << VERSION << "\n\n"
312 |
313 | << "Ktrim is designed to perform adapter- and quality-trimming of FASTQ files.\n\n"
314 |
315 | << "Compulsory parameters:\n\n"
316 |
317 | << " -f fq.list Specify the path to a file containing the paths to read 1/2 fastq files\n\n"
318 | << "OR specify the fastq files directly:\n\n"
319 | << " -1/-U R1.fq[.gz] Specify the path to the files containing read 1\n"
320 | << " If your data is Paired-end, use '-1' and specify read 2 files using '-2' option\n"
321 | << " Note that if '-U' is used, specification of '-2' is invalid\n"
322 | << " If you have multiple files for your sample, use '"
323 | << FILE_SEPARATOR << "' to separate them\n"
324 | << " Gzip-compressed files (with .gz suffix) are supported.\n\n"
325 |
326 | << " -o out.prefix Specify the prefix of the output files\n"
327 | << " Outputs include trimmed reads in FASTQ format and statistics\n\n"
328 |
329 | << "Optional parameters:\n\n"
330 |
331 | << " -2 R2.fq[.gz] Specify the path to the file containing read 2\n"
332 | << " Use this parameter if your data is generated in paired-end mode\n"
333 | << " If you have multiple files for your sample, use '"
334 | << FILE_SEPARATOR << "' to separate them\n"
335 | << " and make sure that all the files are well paired in '-1' and '-2' options\n\n"
336 |
337 | << " -c Write the trimming results to stdout (default: not set)\n"
338 | << " Note that the interleaved fastq format will be used for paired-end data.\n"
339 | << " -R Only output reads with adapter (default: not set)\n"
340 | << " -s size Minimum read size to be kept after trimming (default: 36; must be at least 10)\n\n"
341 |
342 | << " -t threads Specify how many threads should be used (default: 6)\n"
343 | << " You can set '-t' to 0 to use all threads (automatically detected)\n\n"
344 |
345 | << " -k kit Specify the sequencing kit to use built-in adapters\n"
346 | << " Currently supports 'Illumina' (default), 'Nextera', 'Transposase', 'CLIP', and 'BGI'\n"
347 | << " -a sequence Specify the adapter sequence in read 1\n"
348 | << " -b sequence Specify the adapter sequence in read 2\n"
349 | << " If '-a' is set while '-b' is not, I will assume that read 1 and 2 use same adapter\n"
350 | << " Note that '-k' option has a higher priority (when set, '-a'/'-b' will be ignored)\n\n"
351 |
352 | << " -p phred-base Specify the baseline of the phred score (default: 33)\n"
353 | << " -q score The minimum quality score to keep the cycle (default: 20)\n"
354 | << " Note that 20 means 1% error rate, 30 means 0.1% error rate in Phred\n\n"
355 | << " -w window Set the window size for quality check (default: 5)\n"
356 | << " Ktrim stops when all bases in a consecutive window pass the quality threshold\n\n"
357 | << " -m proportion Set the proportion of mismatches allowed during index and sequence comparison\n"
358 | << " Default: 0.125 (i.e., 1/8 of compared base pairs)\n"
359 | << " Please use this option with caution as it affects the accuracy a lot\n\n"
360 |
361 | << " -h Show this help information and quit (exit code=0)\n"
362 | << " -v Show the software version and quit (exit code=0)\n\n"
363 |
364 | << "Please refer to README.md file for more information (e.g., setting adapters).\n"
365 | << "Citation: Sun K. Bioinformatics 2020 Jun 1; 36(11):3561-3562. (PMID: 32159761)\n\n"
366 | << "\033[1;34mKtrim: extra-fast and accurate adapter- and quality-trimmer.\033[0m\n\n";
367 | }
368 |
369 |
--------------------------------------------------------------------------------
/src/pe_handler.h:
--------------------------------------------------------------------------------
1 | /**
2 | * Author: Kun Sun (sunkun@szbl.ac.cn)
3 | * Date: May 2023
4 | * This program is part of the Ktrim package
5 | **/
6 |
7 | #include
8 | #include
9 | #include
10 | #include
11 | #include
12 | #include
13 | #include
14 | #include
15 | #include "common.h"
16 | using namespace std;
17 |
18 | void inline CPEREAD_resize( CPEREAD * read, int n ) {
19 | read->size = n;
20 | read->seq1[ n ] = 0;
21 | read->qual1[ n ] = 0;
22 | read->seq2[ n ] = 0;
23 | read->qual2[ n ] = 0;
24 | }
25 |
26 | bool inline is_revcomp( const char a, const char b ) {
27 | //TODO: consider how to deal with N, call it positive or negative???
28 | switch( a ) {
29 | case 'A': return b=='T';
30 | case 'C': return b=='G';
31 | case 'G': return b=='C';
32 | case 'T': return b=='A';
33 | default : return false;
34 | }
35 | return true;
36 | }
37 |
38 | void init_kstat_wrbuffer( ktrim_stat &kstat, writeBuffer &writebuffer, unsigned int nthread, bool write2stdout ) {
39 | kstat.dropped = new unsigned int [ nthread ];
40 | kstat.real_adapter = new unsigned int [ nthread ];
41 | kstat.tail_adapter = new unsigned int [ nthread ];
42 | kstat.dimer = new unsigned int [ nthread ];
43 | kstat.pass = new unsigned int [ nthread ];
44 |
45 | // buffer for storing the modified reads per thread
46 | writebuffer.buffer1 = new char * [ nthread ];
47 | writebuffer.buffer2 = new char * [ nthread ];
48 | writebuffer.b1stored = new unsigned int [ nthread ];
49 | writebuffer.b2stored = new unsigned int [ nthread ];
50 |
51 | for(unsigned int i=0; i!=nthread; ++i) {
52 | if( write2stdout ) { // only use buffer 1
53 | writebuffer.buffer1[i] = new char[ BUFFER_SIZE_PER_BATCH_READ << 1 ];
54 | writebuffer.b1stored[i] = 0;
55 | } else {
56 | writebuffer.buffer1[i] = new char[ BUFFER_SIZE_PER_BATCH_READ ];
57 | writebuffer.buffer2[i] = new char[ BUFFER_SIZE_PER_BATCH_READ ];
58 | writebuffer.b1stored[i] = 0;
59 | writebuffer.b2stored[i] = 0;
60 | }
61 |
62 | kstat.dropped[i] = 0;
63 | kstat.real_adapter[i] = 0;
64 | kstat.tail_adapter[i] = 0;
65 | kstat.dimer[i] = 0;
66 | kstat.pass[i] = 0;
67 | }
68 | }
69 |
70 | void find_seed_pe( vector<unsigned int> &seed, const CPEREAD *read, const ktrim_param & kp ) {
71 | seed.clear();
72 | register const char *poffset = read->seq1;
73 | register const char *indexloc = poffset;
74 | while( true ) {
75 | indexloc = strstr( indexloc, kp.adapter_index1 );
76 | if( indexloc == NULL )
77 | break;
78 | seed.push_back( indexloc - poffset );
79 | ++ indexloc;
80 | }
81 | poffset = read->seq2;
82 | indexloc = poffset;
83 | while( true ) {
84 | indexloc = strstr( indexloc, kp.adapter_index2 );
85 | if( indexloc == NULL )
86 | break;
87 | seed.push_back( indexloc - poffset );
88 | ++ indexloc;
89 | }
90 | sort( seed.begin(), seed.end() );
91 | }
92 |
93 | // this function is slower than C++ version
94 | void workingThread_PE_C( unsigned int tn, unsigned int start, unsigned int end, CPEREAD *workingReads,
95 | ktrim_stat *kstat, writeBuffer *writebuffer, const ktrim_param &kp ) {
96 |
97 | writebuffer->b1stored[tn] = 0;
98 | writebuffer->b2stored[tn] = 0;
99 |
100 | // vector seed;
101 | // vector :: iterator it, end_of_seed;
102 | register int *seed = new int[ MAX_SEED_NUM ];
103 | register int hit_seed;
104 | register int *it, *end_of_seed;
105 |
106 | register CPEREAD *wkr = workingReads + start;
107 | for( unsigned int iii=end-start; iii; --iii, ++wkr ) {
108 | // read size handling
109 | if( wkr->size > wkr->size2 )
110 | wkr->size = wkr->size2;
111 |
112 | // remove the tail '\n'
113 | // in fact, it is not essential to do this step, because '\n' has a very low ascii value (10)
114 | // therefore it will be quality-trimmed
115 | CPEREAD_resize( wkr, wkr->size - 1 );
116 |
117 | // quality control
118 | register int i = get_quality_trim_cycle_pe( wkr, kp );
119 | if( i == 0 ) { // not long enough
120 | ++ kstat->dropped[ tn ];
121 | continue;
122 | }
123 | if( i != wkr->size ) { // quality-trim occurs
124 | CPEREAD_resize( wkr, i );
125 | }
126 |
127 | // looking for seed target, 1 mismatch is allowed for these 2 seeds
128 | // which means seq1 and seq2 at least should take 1 perfect seed match
129 | //find_seed_pe( seed, wkr, kp );
130 | //TODO: I do not need to find all the seeds, I can find-check, then next
131 | // seed.clear();
132 | hit_seed = 0;
133 | register const char *poffset = wkr->seq1;
134 | register const char *indexloc = poffset;
135 | while( true ) {
136 | indexloc = strstr( indexloc, kp.adapter_index1 );
137 | if( indexloc == NULL )
138 | break;
139 | //seed.push_back( indexloc - poffset );
140 | seed[ hit_seed++ ] = indexloc - poffset;
141 | ++ indexloc;
142 | }
143 | poffset = wkr->seq2;
144 | indexloc = poffset;
145 | while( true ) {
146 | indexloc = strstr( indexloc, kp.adapter_index2 );
147 | if( indexloc == NULL )
148 | break;
149 | //seed.push_back( indexloc - poffset );
150 | seed[ hit_seed++ ] = indexloc - poffset;
151 | ++ indexloc;
152 | }
153 | //sort( seed.begin(), seed.end() );
154 | end_of_seed = seed + hit_seed;
155 | if( hit_seed != 0 )
156 | sort( seed, seed + hit_seed );
157 |
158 | register bool no_valid_adapter = true;
159 | register unsigned int last_seed = impossible_seed; // a position which cannot be a seed
160 | //end_of_seed = seed.end();
161 | //for( it=seed.begin(); it!=end_of_seed; ++it ) {
162 | for( it=seed; it!=end_of_seed; ++it ) {
163 | if( *it != last_seed ) {
164 | // as there maybe the same value in seq1_seed and seq2_seed,
165 | // use this to avoid re-calculate that pos
166 | if( check_mismatch_dynamic_PE_C( wkr, *it, kp ) )
167 | break;
168 |
169 | last_seed = *it;
170 | }
171 | }
172 | if( it != end_of_seed ) { // adapter found
173 | no_valid_adapter = false;
174 | ++ kstat->real_adapter[tn];
175 | if( *it >= kp.min_length ) {
176 | CPEREAD_resize( wkr, *it );
177 | } else { // drop this read as its length is not enough
178 | ++ kstat->dropped[tn];
179 |
180 | if( *it <= DIMER_INSERT )
181 | ++ kstat->dimer[tn];
182 | continue;
183 | }
184 | } else { // seed not found, now check the tail 2 or 1, if perfect match, drop these 2
185 | // note: I will NOT consider tail hits as adapter_found, as '-w' should be used when insertDNA is very short
186 | i = wkr->size - 2;
187 | register const char *p = wkr->seq1;
188 | register const char *q = wkr->seq2;
189 | //Note: 1 mismatch is allowed in tail-checking
190 | register unsigned int mismatches = 0;
191 | if( p[i] != kp.adapter_r1[0] ) mismatches ++;
192 | if( p[i+1] != kp.adapter_r1[1] ) mismatches ++;
193 | if( q[i] != kp.adapter_r2[0] ) mismatches ++;
194 | if( q[i+1] != kp.adapter_r2[1] ) mismatches ++;
195 | if( ! is_revcomp(p[5], q[i-6]) ) mismatches ++;
196 | if( ! is_revcomp(q[5], p[i-6]) ) mismatches ++;
197 | if( mismatches <= 1 ) { // tail is good
198 | ++ kstat->tail_adapter[tn];
199 | if( i < kp.min_length ) {
200 | ++ kstat->dropped[tn];
201 | continue;
202 | }
203 | CPEREAD_resize( wkr, i );
204 | } else { // tail 2 is not good, check tail 1
205 | ++ i;
206 | mismatches = 0;
207 | if( p[i] != kp.adapter_r1[0] ) mismatches ++;
208 | if( q[i] != kp.adapter_r2[0] ) mismatches ++;
209 | if( ! is_revcomp(p[5], q[i-6]) ) mismatches ++;
210 | if( ! is_revcomp(q[5], p[i-6]) ) mismatches ++;
211 | if( ! is_revcomp(p[6], q[i-7]) ) mismatches ++;
212 | if( ! is_revcomp(q[6], p[i-7]) ) mismatches ++;
213 |
214 | if( mismatches <= 1 ) {
215 | ++ kstat->tail_adapter[tn];
216 | if( i < kp.min_length ) {
217 | ++ kstat->dropped[tn];
218 | continue;
219 | }
220 | CPEREAD_resize( wkr, i );
221 | }
222 | }
223 | }
224 |
225 | if( kp.outputReadWithAdaptorOnly && no_valid_adapter )
226 | continue;
227 |
228 | ++ kstat->pass[tn];
229 | if( kp.write2stdout ) {
230 | writebuffer->b1stored[tn] += sprintf( writebuffer->buffer1[tn]+writebuffer->b1stored[tn],
231 | "%s%s\n+\n%s\n%s%s\n+\n%s\n",
232 | wkr->id1, wkr->seq1, wkr->qual1,
233 | wkr->id2, wkr->seq2, wkr->qual2);
234 | } else {
235 | writebuffer->b1stored[tn] += sprintf( writebuffer->buffer1[tn]+writebuffer->b1stored[tn],
236 | "%s%s\n+\n%s\n", wkr->id1, wkr->seq1, wkr->qual1 );
237 | writebuffer->b2stored[tn] += sprintf( writebuffer->buffer2[tn]+writebuffer->b2stored[tn],
238 | "%s%s\n+\n%s\n", wkr->id2, wkr->seq2, wkr->qual2 );
239 | }
240 | }
241 |
242 | //wait for my turn to output
243 | while( true ) {
244 | if( tn == write_thread ) {
245 | if( kp.write2stdout ) {
246 | fwrite( writebuffer->buffer1[tn], sizeof(char), writebuffer->b1stored[tn], stdout );
247 | } else {
248 | fwrite( writebuffer->buffer1[tn], sizeof(char), writebuffer->b1stored[tn], kp.fout1 );
249 | fwrite( writebuffer->buffer2[tn], sizeof(char), writebuffer->b2stored[tn], kp.fout2 );
250 | }
251 | ++ write_thread;
252 | break;
253 | } else {
254 | this_thread::sleep_for( waiting_time_for_writing );
255 | }
256 | }
257 |
258 | delete [] seed;
259 | }
260 |
261 | int process_multi_thread_PE_C( const ktrim_param &kp ) {
262 | // IO speed-up
263 | ios::sync_with_stdio( false );
264 | // cin.tie( NULL );
265 |
266 | // in this version, two data containers are used and auto-swapped for working and loading data
267 | CPEREAD *readA, *readB;
268 | register char *readA_data, *readB_data;
269 | CPEREAD *workingReads, *loadingReads, *swapReads;
270 | ktrim_stat kstat;
271 | writeBuffer writebuffer;
272 | // vector R1s, kp.R2s;
273 | unsigned int totalFiles;
274 |
275 | // now I use 2 threads for init
276 | // prepare memory and file
277 | omp_set_num_threads( 2 );
278 | #pragma omp parallel
279 | {
280 | unsigned int tn = omp_get_thread_num();
281 |
282 | if( tn == 0 ) {
283 | readA = new CPEREAD[ READS_PER_BATCH ];
284 | readB = new CPEREAD[ READS_PER_BATCH ];
285 | readA_data = new char[ MEM_PE_READSET ];
286 | readB_data = new char[ MEM_PE_READSET ];
287 |
288 | for( register int i=0, j=0; i!=READS_PER_BATCH; ++i ) {
289 | readA[i].id1 = readA_data + j;
290 | readB[i].id1 = readB_data + j;
291 | j += MAX_READ_ID;
292 | readA[i].seq1 = readA_data + j;
293 | readB[i].seq1 = readB_data + j;
294 | j += MAX_READ_CYCLE;
295 | readA[i].qual1 = readA_data + j;
296 | readB[i].qual1 = readB_data + j;
297 | j += MAX_READ_CYCLE;
298 |
299 | readA[i].id2 = readA_data + j;
300 | readB[i].id2 = readB_data + j;
301 | j += MAX_READ_ID;
302 | readA[i].seq2 = readA_data + j;
303 | readB[i].seq2 = readB_data + j;
304 | j += MAX_READ_CYCLE;
305 | readA[i].qual2 = readA_data + j;
306 | readB[i].qual2 = readB_data + j;
307 | j += MAX_READ_CYCLE;
308 | }
309 | } else {
310 | init_kstat_wrbuffer( kstat, writebuffer, kp.thread, kp.write2stdout );
311 |
312 | // deal with multiple input files
313 | totalFiles = kp.R1s.size();
314 | //cout << "\033[1;34mINFO: " << totalFiles << " paired fastq files will be loaded.\033[0m\n";
315 | }
316 | }
317 |
318 | // start analysis
319 | register unsigned int line = 0;
320 | for( unsigned int fileCnt=0; fileCnt!=totalFiles; ++ fileCnt ) {
321 | bool file_is_gz = false;
322 | FILE *fq1, *fq2;
323 | gzFile gfp1, gfp2;
324 | register unsigned int i = kp.R1s[fileCnt].size() - 3;
325 | register const char * p = kp.R1s[fileCnt].c_str();
326 | register const char * q = kp.R2s[fileCnt].c_str();
327 | if( p[i]=='.' && p[i+1]=='g' && p[i+2]=='z' ) {
328 | file_is_gz = true;
329 | gfp1 = gzopen( p, "r" );
330 | gfp2 = gzopen( q, "r" );
331 | if( gfp1==NULL || gfp2==NULL ) {
332 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n";
333 | return 104;
334 | }
335 | } else {
336 | fq1 = fopen( p, "rt" );
337 | fq2 = fopen( q, "rt" );
338 | if( fq1==NULL || fq2==NULL ) {
339 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n";
340 | return 104;
341 | }
342 | }
343 | // fprintf( stderr, "Loading files:\nRead1: %s\nRead2: %s\n", p, q );
344 |
345 | // initialization
346 | // get first batch of fastq reads
347 | unsigned int loaded;
348 | bool metEOF;
349 | omp_set_num_threads( 2 );
350 | #pragma omp parallel
351 | {
352 | unsigned int tn = omp_get_thread_num();
353 | if( tn == 0 ) {
354 | if( file_is_gz ) {
355 | loaded = load_batch_data_PE_GZ( gfp1, readA, READS_PER_BATCH, true );
356 | metEOF = gzeof( gfp1 );
357 | } else {
358 | loaded = load_batch_data_PE_C( fq1, readA, READS_PER_BATCH, true );
359 | metEOF = feof( fq1 );
360 | }
361 | } else {
362 | if( file_is_gz ) {
363 | loaded = load_batch_data_PE_GZ( gfp2, readA, READS_PER_BATCH, false );
364 | } else {
365 | loaded = load_batch_data_PE_C( fq2, readA, READS_PER_BATCH, false );
366 | }
367 | }
368 | }
369 | if( loaded == 0 ) break;
370 |
371 | loadingReads = readB;
372 | workingReads = readA;
373 | bool nextBatch = true;
374 | unsigned int threadLoaded=0, threadLoaded2=0;
375 | unsigned int NumWkThreads=0;
376 | while( nextBatch ) {
377 | // cerr << "Working on " << loaded << " reads\n";
378 | // start parallelization
379 | write_thread = 0;
380 | omp_set_num_threads( kp.thread );
381 | #pragma omp parallel
382 | {
383 | unsigned int tn = omp_get_thread_num();
384 | // if EOF is met, then all threads are used for analysis
385 | // otherwise 1 thread will do data loading
386 | if( metEOF ) {
387 | NumWkThreads = kp.thread;
388 | unsigned int start = loaded * tn / kp.thread;
389 | unsigned int end = loaded * (tn+1) / kp.thread;
390 | workingThread_PE_C( tn, start, end, workingReads, &kstat, &writebuffer, kp );
391 | nextBatch = false;
392 | } else { // use 2 threads to load files, others for trimming
393 | NumWkThreads = kp.thread - 2;
394 | if( tn == kp.thread - 1 ) {
395 | if( file_is_gz ) {
396 | threadLoaded = load_batch_data_PE_GZ( gfp1, loadingReads, READS_PER_BATCH, true );
397 | metEOF = gzeof( gfp1 );
398 | } else {
399 | threadLoaded = load_batch_data_PE_C( fq1, loadingReads, READS_PER_BATCH, true );
400 | metEOF = feof( fq1 );
401 | }
402 | // cerr << "R1 loaded " << threadLoaded << ", pos=" << gztell(gfp2) << ", EOF=" << gzeof( gfp1 ) << "\n";
403 | nextBatch = (threadLoaded!=0);
404 | //cerr << "Loading thread: " << threadLoaded << ", " << metEOF << ", " << nextBatch << '\n';
405 | } else if ( tn == kp.thread - 2 ) {
406 | if( file_is_gz ) {
407 | threadLoaded2 = load_batch_data_PE_GZ( gfp2, loadingReads, READS_PER_BATCH, false );
408 | } else {
409 | threadLoaded2 = load_batch_data_PE_C( fq2, loadingReads, READS_PER_BATCH, false );
410 | }
411 | // cerr << "R2 loaded " << threadLoaded2 << ", pos=" << gztell(gfp2) << ", EOF=" << gzeof( gfp2 )<< "\n";
412 | } else {
413 | unsigned int start = loaded * tn / NumWkThreads;
414 | unsigned int end = loaded * (tn+1) / NumWkThreads;
415 | workingThread_PE_C( tn, start, end, workingReads, &kstat, &writebuffer, kp );
416 | }
417 | }
418 | } // parallel body
419 | // write output and update fastq statistics
420 | // I cannot write fastq in each thread because it may cause unpaired reads
421 | /*if( ! kp.write2stdout ) {
422 | omp_set_num_threads( 2 );
423 | #pragma omp parallel
424 | {
425 | unsigned int tn = omp_get_thread_num();
426 | if( tn == 0 ) {
427 | for( unsigned int ii=0; ii!=NumWkThreads; ++ii ) {
428 | fwrite( writebuffer.buffer1[ii], sizeof(char), writebuffer.b1stored[ii], kp.fout1 );
429 | }
430 | } else {
431 | for( unsigned int ii=0; ii!=NumWkThreads; ++ii ) {
432 | fwrite( writebuffer.buffer2[ii], sizeof(char), writebuffer.b2stored[ii], kp.fout2 );
433 | }
434 | }
435 | }
436 | }*/
437 | // check whether the read-loading is correct
438 | if( threadLoaded != threadLoaded2 ) {
439 | cerr << "ERROR: unequal read number for read1 (" << threadLoaded << ") and read2 (" << threadLoaded2 << ")!\n";
440 | return 1;
441 | }
442 |
443 | line += loaded;
444 | loaded = threadLoaded;
445 | // swap workingReads and loadingReads for next loop
446 | swapReads = loadingReads;
447 | loadingReads = workingReads;
448 | workingReads = swapReads;
449 |
450 | // cerr << '\r' << line << " reads loaded\n";
451 | // cerr << line << " reads loaded, metEOF=" << metEOF << ", next=" << nextBatch << "\n";
452 | }//process 1 file
453 |
454 | if( file_is_gz ) {
455 | gzclose( gfp1 );
456 | gzclose( gfp2 );
457 | } else {
458 | fclose( fq1 );
459 | fclose( fq2 );
460 | }
461 | // cerr << '\n';
462 | } // all input files are loaded
463 | //cerr << "\rDone: " << line << " lines processed.\n";
464 |
465 | // write trim.log
466 | unsigned int dropped_all=0, real_all=0, tail_all=0, dimer_all=0, pass_all=0;
467 | for( unsigned int i=0; i!=kp.thread; ++i ) {
468 | dropped_all += kstat.dropped[i];
469 | real_all += kstat.real_adapter[i];
470 | tail_all += kstat.tail_adapter[i];
471 | dimer_all += kstat.dimer[i];
472 | pass_all += kstat.pass[i];
473 | }
474 | fprintf( kp.flog, "Total\t%u\nDropped\t%u\nAdaptor\t%u\nTailHit\t%u\nDimer\t%u\nPass\t%u\n",
475 | line, dropped_all, real_all, tail_all, dimer_all, pass_all );
476 |
477 | //free memory
478 | for(unsigned int i=0; i!=kp.thread; ++i) {
479 | delete [] writebuffer.buffer1[i];
480 | if( ! kp.write2stdout )
481 | delete [] writebuffer.buffer2[i];
482 | }
483 | delete [] writebuffer.buffer1;
484 | delete [] writebuffer.buffer2;
485 |
486 | delete [] kstat.dropped;
487 | delete [] kstat.real_adapter;
488 | delete [] kstat.tail_adapter;
489 | delete [] kstat.dimer;
490 | delete [] kstat.pass;
491 |
492 | delete [] readA;
493 | delete [] readB;
494 | delete [] readA_data;
495 | delete [] readB_data;
496 |
497 | return 0;
498 | }
499 |
500 | int process_single_thread_PE_C( const ktrim_param &kp ) {
501 | // fprintf( stderr, "process_single_thread_PE_C\n" );
502 | // IO speed-up
503 | ios::sync_with_stdio( false );
504 | // cin.tie( NULL );
505 |
506 | CPEREAD *read = new CPEREAD[ READS_PER_BATCH_ST ];
507 | register char *read_data = new char[ MEM_PE_READSET_ST ];
508 |
509 | for( register int i=0, j=0; i!=READS_PER_BATCH_ST; ++i ) {
510 | read[i].id1 = read_data + j;
511 | j += MAX_READ_ID;
512 | read[i].seq1 = read_data + j;
513 | j += MAX_READ_CYCLE;
514 | read[i].qual1 = read_data + j;
515 | j += MAX_READ_CYCLE;
516 |
517 | read[i].id2 = read_data + j;
518 | j += MAX_READ_ID;
519 | read[i].seq2 = read_data + j;
520 | j += MAX_READ_CYCLE;
521 | read[i].qual2 = read_data + j;
522 | j += MAX_READ_CYCLE;
523 | }
524 |
525 | ktrim_stat kstat;
526 | kstat.dropped = new unsigned int [ 1 ];
527 | kstat.real_adapter = new unsigned int [ 1 ];
528 | kstat.tail_adapter = new unsigned int [ 1 ];
529 | kstat.dimer = new unsigned int [ 1 ];
530 | kstat.pass = new unsigned int [ 1 ];
531 | kstat.dropped[0] = 0;
532 | kstat.real_adapter[0] = 0;
533 | kstat.tail_adapter[0] = 0;
534 | kstat.dimer[0] = 0;
535 | kstat.pass[0] = 0;
536 |
537 | // buffer for storing the modified reads per thread
538 | writeBuffer writebuffer;
539 | writebuffer.buffer1 = new char * [ 1 ];
540 | if( kp.write2stdout ) {
541 | writebuffer.buffer1[0] = new char[ BUFFER_SIZE_PER_BATCH_READ_ST << 1 ];
542 | writebuffer.buffer2 = NULL; // unused when writing to stdout; b1stored is allocated once after this if/else
543 | } else {
544 | writebuffer.buffer1[0] = new char[ BUFFER_SIZE_PER_BATCH_READ_ST ];
545 |
546 | writebuffer.buffer2 = new char * [ 1 ];
547 | writebuffer.buffer2[0] = new char[ BUFFER_SIZE_PER_BATCH_READ_ST ];
548 | }
549 | writebuffer.b1stored = new unsigned int [ 1 ];
550 | writebuffer.b2stored = new unsigned int [ 1 ];
551 |
552 | // deal with multiple input files
553 | if( kp.R1s.size() != kp.R2s.size() ) {
554 | cerr << "\033[1;31mError: incorrect files!\033[0m\n";
555 | return 110;
556 | }
557 | unsigned int totalFiles = kp.R1s.size();
558 | //cout << "\033[1;34mINFO: " << totalFiles << " paired fastq files will be loaded.\033[0m\n";
559 |
560 | register unsigned int line = 0;
561 | for( unsigned int fileCnt=0; fileCnt!=totalFiles; ++ fileCnt ) {
562 | bool file_is_gz = false;
563 | FILE *fq1, *fq2;
564 | gzFile gfp1, gfp2;
565 | register unsigned int i = kp.R1s[fileCnt].size() - 3;
566 | register const char * p = kp.R1s[fileCnt].c_str();
567 | register const char * q = kp.R2s[fileCnt].c_str();
568 | if( p[i]=='.' && p[i+1]=='g' && p[i+2]=='z' ) {
569 | file_is_gz = true;
570 | gfp1 = gzopen( p, "r" );
571 | gfp2 = gzopen( q, "r" );
572 | if( gfp1==NULL || gfp2==NULL ) {
573 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n";
574 | return 104;
575 | }
576 | } else {
577 | fq1 = fopen( p, "rt" );
578 | fq2 = fopen( q, "rt" );
579 | if( fq1==NULL || fq2==NULL ) {
580 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n";
581 | return 104;
582 | }
583 | }
584 |
585 | while( true ) {
586 | // get fastq reads
587 | unsigned int loaded;
588 | if( file_is_gz ) {
589 | loaded = load_batch_data_PE_both_GZ( gfp1, gfp2, read, READS_PER_BATCH_ST );
590 | } else {
591 | loaded = load_batch_data_PE_both_C( fq1, fq2, read, READS_PER_BATCH_ST );
592 | }
593 | if( loaded == 0 ) break;
594 |
595 | write_thread = 0;
596 | workingThread_PE_C( 0, 0, loaded, read, &kstat, &writebuffer, kp );
597 | // write output and update fastq statistics
598 | /* if( ! kp.write2stdout ) {
599 | fwrite( writebuffer.buffer1[0], sizeof(char), writebuffer.b1stored[0], kp.fout1 );
600 | fwrite( writebuffer.buffer2[0], sizeof(char), writebuffer.b2stored[0], kp.fout2 );
601 | }*/
602 |
603 | line += loaded;
604 | //cerr << '\r' << line << " reads loaded";
605 |
606 | if( file_is_gz ) {
607 | if( gzeof( gfp1 ) ) break;
608 | } else {
609 | if( feof( fq2 ) ) break;
610 | }
611 | }
612 |
613 | if( file_is_gz ) {
614 | gzclose( gfp1 );
615 | gzclose( gfp2 );
616 | } else {
617 | fclose( fq1 );
618 | fclose( fq2 );
619 | }
620 | }
621 | //cerr << "\rDone: " << line << " lines processed.\n";
622 |
623 | // write trim.log
624 | fprintf( kp.flog, "Total\t%u\nDropped\t%u\nAadaptor\t%u\nTailHit\t%u\nDimer\t%u\nPass\t%u\n",
625 | line, kstat.dropped[0], kstat.real_adapter[0], kstat.tail_adapter[0], kstat.dimer[0], kstat.pass[0] );
626 |
627 | delete [] writebuffer.buffer1[0]; // allocated with new char[]
628 | if( ! kp.write2stdout ) delete [] writebuffer.buffer2[0];
629 | delete [] read;
630 | delete [] read_data;
631 |
632 | return 0;
633 | }
634 |
635 | int process_two_thread_PE_C( const ktrim_param &kp ) {
636 | // IO speed-up
637 | ios::sync_with_stdio( false );
638 | // cin.tie( NULL );
639 |
640 | // in this version, a single data container (readA) is used: each round loads one batch, then trims it
641 | CPEREAD *readA;
642 | register char *readA_data;
643 | ktrim_stat kstat;
644 | writeBuffer writebuffer;
645 | // vector R1s, R2s;
646 | unsigned int totalFiles;
647 |
648 | // 2 threads are used for initialization
649 | // prepare memory and file
650 | omp_set_num_threads( 2 );
651 | #pragma omp parallel
652 | {
653 | unsigned int tn = omp_get_thread_num();
654 |
655 | if( tn == 0 ) {
656 | readA = new CPEREAD[ READS_PER_BATCH ];
657 | readA_data = new char[ MEM_PE_READSET ];
658 |
659 | for( register int i=0, j=0; i!=READS_PER_BATCH; ++i ) {
660 | readA[i].id1 = readA_data + j;
661 | j += MAX_READ_ID;
662 | readA[i].seq1 = readA_data + j;
663 | j += MAX_READ_CYCLE;
664 | readA[i].qual1 = readA_data + j;
665 | j += MAX_READ_CYCLE;
666 |
667 | readA[i].id2 = readA_data + j;
668 | j += MAX_READ_ID;
669 | readA[i].seq2 = readA_data + j;
670 | j += MAX_READ_CYCLE;
671 | readA[i].qual2 = readA_data + j;
672 | j += MAX_READ_CYCLE;
673 | }
674 | } else {
675 | init_kstat_wrbuffer( kstat, writebuffer, kp.thread, kp.write2stdout );
676 |
677 | // deal with multiple input files
678 | totalFiles = kp.R1s.size();
679 | //cout << "\033[1;34mINFO: " << totalFiles << " paired fastq files will be loaded.\033[0m\n";
680 | }
681 | }
682 |
683 | register unsigned int line = 0;
684 | for( unsigned int fileCnt=0; fileCnt!=totalFiles; ++ fileCnt ) {
685 | bool file_is_gz = false;
686 | FILE *fq1, *fq2;
687 | gzFile gfp1, gfp2;
688 | register unsigned int i = kp.R1s[fileCnt].size() - 3;
689 | register const char * p = kp.R1s[fileCnt].c_str();
690 | register const char * q = kp.R2s[fileCnt].c_str();
691 | if( p[i]=='.' && p[i+1]=='g' && p[i+2]=='z' ) {
692 | file_is_gz = true;
693 | gfp1 = gzopen( p, "r" );
694 | gfp2 = gzopen( q, "r" );
695 | if( gfp1==NULL || gfp2==NULL ) {
696 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n";
697 | return 104;
698 | }
699 | } else {
700 | fq1 = fopen( p, "rt" );
701 | fq2 = fopen( q, "rt" );
702 | if( fq1==NULL || fq2==NULL ) {
703 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n";
704 | return 104;
705 | }
706 | }
707 | // fprintf( stderr, "Loading files:\nRead1: %s\nRead2: %s\n", p, q );
708 |
709 | // initialization
710 | // get first batch of fastq reads
711 |
712 | // there is NO need to do rotation in 2-threads
713 | unsigned int loaded1, loaded2;
714 | bool metEOF;
715 | while( true ) {
716 | // load data
717 | omp_set_num_threads( 2 );
718 | #pragma omp parallel
719 | {
720 | unsigned int tn = omp_get_thread_num();
721 | if( tn == 0 ) {
722 | if( file_is_gz ) {
723 | loaded1 = load_batch_data_PE_GZ( gfp1, readA, READS_PER_BATCH, true );
724 | metEOF = gzeof( gfp1 );
725 | } else {
726 | loaded1 = load_batch_data_PE_C( fq1, readA, READS_PER_BATCH, true );
727 | metEOF = feof( fq1 );
728 | }
729 | } else {
730 | if( file_is_gz ) {
731 | loaded2 = load_batch_data_PE_GZ( gfp2, readA, READS_PER_BATCH, false );
732 | } else {
733 | loaded2 = load_batch_data_PE_C( fq2, readA, READS_PER_BATCH, false );
734 | }
735 | }
736 | }
737 | if( loaded1 != loaded2 ) {
738 | cerr << "ERROR: unequal read number for read1 (" << loaded1 << ") and read2 (" << loaded2 << ")!\n";
739 | return 1;
740 | }
741 | if( loaded1 == 0 ) break;
742 |
743 | // start analysis
744 | write_thread = 0;
745 | omp_set_num_threads( 2 );
746 | #pragma omp parallel
747 | {
748 | unsigned int tn = omp_get_thread_num();
749 | register int middle = loaded1 >> 1;
750 | if( tn == 0 ) {
751 | workingThread_PE_C( 0, 0, middle, readA, &kstat, &writebuffer, kp );
752 | } else {
753 | workingThread_PE_C( 1, middle, loaded1, readA, &kstat, &writebuffer, kp );
754 | }
755 | } // parallel body
756 | // write output and update fastq statistics
757 | /*if( ! kp.write2stdout ) {
758 | fwrite( writebuffer.buffer1[0], sizeof(char), writebuffer.b1stored[0], kp.fout1 );
759 | fwrite( writebuffer.buffer1[1], sizeof(char), writebuffer.b1stored[1], kp.fout1 );
760 | fwrite( writebuffer.buffer2[0], sizeof(char), writebuffer.b2stored[0], kp.fout2 );
761 | fwrite( writebuffer.buffer2[1], sizeof(char), writebuffer.b2stored[1], kp.fout2 );
762 | }*/
763 | line += loaded1;
764 | if( metEOF )break;
765 | }
766 |
767 | if( file_is_gz ) {
768 | gzclose( gfp1 );
769 | gzclose( gfp2 );
770 | } else {
771 | fclose( fq1 );
772 | fclose( fq2 );
773 | }
774 | } // all input files are loaded
775 | //cerr << "\rDone: " << line << " lines processed.\n";
776 |
777 | // write trim.log
778 | fprintf( kp.flog, "Total\t%u\nDropped\t%u\nAadaptor\t%u\nTailHit\t%u\nDimer\t%u\nPass\t%u\n",
779 | line, kstat.dropped[0]+kstat.dropped[1], kstat.real_adapter[0]+kstat.real_adapter[1],
780 | kstat.tail_adapter[0]+kstat.tail_adapter[1], kstat.dimer[0]+kstat.dimer[1], kstat.pass[0]+kstat.pass[1] );
781 |
782 | //free memory
783 | for(unsigned int i=0; i!=kp.thread; ++i) {
784 | delete [] writebuffer.buffer1[i]; // per-thread buffers are char arrays
785 | delete [] writebuffer.buffer2[i];
786 | }
787 | delete [] writebuffer.buffer1;
788 | delete [] writebuffer.buffer2;
789 | delete [] kstat.dropped;
790 | delete [] kstat.real_adapter;
791 | delete [] kstat.tail_adapter;
792 | delete [] kstat.dimer;
793 | delete [] kstat.pass;
794 |
795 | delete [] readA;
796 | delete [] readA_data;
797 |
798 | return 0;
799 | }
800 |
801 |
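
The handlers above choose between `gzopen` and `fopen` by looking at the last three characters of the file name, starting at index `size() - 3`. Because that index is unsigned, it would wrap around for a name shorter than three characters. The following is a minimal sketch of a length-safe suffix test. `has_gz_suffix` and `has_gz_suffix_examples` are hypothetical names used only for illustration; they are not part of the Ktrim sources.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not in Ktrim): true when `name` ends with ".gz".
// The length is checked first, so short names are never indexed out of range.
static bool has_gz_suffix( const std::string &name ) {
    const std::string suffix = ".gz";
    return name.size() >= suffix.size() &&
           name.compare( name.size() - suffix.size(), suffix.size(), suffix ) == 0;
}

// A few illustrative cases; the file names are made up.
static void has_gz_suffix_examples() {
    assert(  has_gz_suffix( "reads_1.fq.gz" ) );
    assert( !has_gz_suffix( "reads_1.fq" ) );
    assert( !has_gz_suffix( "gz" ) );   // shorter than the suffix
}
```

A check of this kind could replace the `p[i]=='.' && p[i+1]=='g' && p[i+2]=='z'` test without changing which files are treated as Gzipped.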
--------------------------------------------------------------------------------
/src/se_handler.h:
--------------------------------------------------------------------------------
1 | /**
2 | * Author: Kun Sun (sunkun@szbl.ac.cn)
3 | * Date: May 2023
4 | * This program is part of the Ktrim package
5 | **/
6 |
7 | #include <cstdio>
8 | #include <cstring>
9 | #include <iostream>
10 | #include <vector>
11 | #include <algorithm>
12 | #include <thread>
13 | #include <chrono>
14 | #include <omp.h>
15 | #include <zlib.h>
16 | #include "common.h"
17 | #include "util.h"
18 | using namespace std;
19 |
20 | void inline CSEREAD_resize( CSEREAD * cr, int n ) {
21 | cr->seq[ n] = 0;
22 | cr->qual[n] = 0;
23 | cr->size = n;
24 | }
25 |
26 | void find_seed( vector<unsigned int> &seed, CSEREAD *read, const ktrim_param & kp ) {
27 | seed.clear();
28 | register char *poffset = read->seq;
29 | register char *indexloc = poffset;
30 | while( true ) {
31 | indexloc = strstr( indexloc, kp.adapter_index1 );
32 | if( indexloc == NULL )
33 | break;
34 | seed.push_back( indexloc - poffset );
35 | indexloc ++;
36 | }
37 | poffset = read->seq + OFFSET_INDEX3;
38 | indexloc = poffset;
39 | while( true ) {
40 | indexloc = strstr( indexloc, kp.adapter_index3 );
41 | if( indexloc == NULL )
42 | break;
43 | seed.push_back( indexloc - poffset );
44 | indexloc ++;
45 | }
46 | sort( seed.begin(), seed.end() );
47 | }
48 |
49 | void workingThread_SE_C( unsigned int tn, unsigned int start, unsigned int end, CSEREAD *workingReads,
50 | ktrim_stat *kstat, writeBuffer *writebuffer, const ktrim_param &kp ) {
51 |
52 | // fprintf( stderr, "=== working thread %d: %d - %d\n", tn, start, end ), "\n";
53 | writebuffer->b1stored[tn] = 0;
54 |
55 | register int i, j;
56 | register unsigned int last_seed;
57 | vector<unsigned int> seed;
58 | vector<unsigned int> :: iterator it;
59 | const char *p, *q;
60 |
61 | register CSEREAD *wkr = workingReads + start;
62 | for( unsigned int ii=start; ii!=end; ++ii, ++wkr ) {
63 | // fprintf( stderr, "working: %d, %s\n", ii, wkr->id );
64 | // quality control
65 | p = wkr->qual;
66 | j = wkr->size;
67 |
68 | // update in v1.2: support window check
69 | i = get_quality_trim_cycle_se( p, j, kp );
70 |
71 | if( i == 0 ) { // not long enough
72 | ++ kstat->dropped[ tn ];
73 | continue;
74 | }
75 | if( i != j ) { // quality-trim occurs
76 | CSEREAD_resize( wkr, i);
77 | }
78 |
79 | // look for seed targets: the adapter provides 2 seeds (index1 and index3) and 1 mismatch is allowed,
80 | // so a read carrying the adapter must match at least 1 of the 2 seeds perfectly
81 | find_seed( seed, wkr, kp );
82 |
83 | last_seed = impossible_seed; // a position which cannot be in seed
84 | for( it=seed.begin(); it!=seed.end(); ++it ) {
85 | if( *it != last_seed ) {
86 | // fprintf( stderr, " check seed: %d\n", *it );
87 | // as there maybe the same value in seq1_seed and seq2_seed,
88 | // use this to avoid re-calculate that pos
89 | if( check_mismatch_dynamic_SE_C( wkr, *it, kp ) )
90 | break;
91 |
92 | last_seed = *it;
93 | }
94 | }
95 | register bool no_valid_adapter = true;
96 | if( it != seed.end() ) { // adapter found
97 | no_valid_adapter = false;
98 | ++ kstat->real_adapter[tn];
99 | if( *it >= kp.min_length ) {
100 | CSEREAD_resize( wkr, *it );
101 | } else { // drop this read as its length is not enough
102 | ++ kstat->dropped[tn];
103 |
104 | if( *it <= DIMER_INSERT )
105 | ++ kstat->dimer[tn];
106 |
107 | continue;
108 | }
109 | } else { // no seed hit: if the last 2 bases perfectly match the first 2 adapter bases, trim them; single-end reads skip the 1-base tail check
110 | i = wkr->size - 2;
111 | p = wkr->seq;
112 | if( p[i]==kp.adapter_r1[0] && p[i+1]==kp.adapter_r1[1] ) {
113 | ++ kstat->tail_adapter[tn];
114 | if( i < kp.min_length ) {
115 | ++ kstat->dropped[tn];
116 | continue;
117 | }
118 | CSEREAD_resize( wkr, i );
119 | }
120 | }
121 |
122 | if( kp.outputReadWithAdaptorOnly && no_valid_adapter )
123 | continue;
124 |
125 | ++ kstat->pass[tn];
126 | writebuffer->b1stored[tn] += sprintf( writebuffer->buffer1[tn]+writebuffer->b1stored[tn],
127 | "%s%s\n+\n%s\n", wkr->id, wkr->seq, wkr->qual);
128 | }
129 |
130 | // wait for my turn to write
131 | while( true ) {
132 | if( tn == write_thread ) {
133 | // cerr << "Thread " << tn << " is writing.\n";
134 | fwrite( writebuffer->buffer1[tn], sizeof(char), writebuffer->b1stored[tn], kp.fout1 );
135 | ++ write_thread;
136 | break;
137 | } else {
138 | this_thread::sleep_for( waiting_time_for_writing );
139 | }
140 | }
141 | }
142 |
143 | int process_multi_thread_SE_C( const ktrim_param &kp ) {
144 | // IO speed-up
145 | // ios::sync_with_stdio( false );
146 | // cin.tie( NULL );
147 |
148 | // in this version, two data containers are used and auto-swapped for working and loading data
149 | CSEREAD *readA = new CSEREAD[ READS_PER_BATCH ];
150 | CSEREAD *readB = new CSEREAD[ READS_PER_BATCH ];
151 |
152 | register char *readA_data = new char[ MEM_SE_READSET ];
153 | register char *readB_data = new char[ MEM_SE_READSET ];
154 |
155 | for( register int i=0, j=0; i!=READS_PER_BATCH; ++i ) {
156 | readA[i].id = readA_data + j;
157 | readB[i].id = readB_data + j;
158 | j += MAX_READ_ID;
159 | readA[i].seq = readA_data + j;
160 | readB[i].seq = readB_data + j;
161 | j += MAX_READ_CYCLE;
162 | readA[i].qual = readA_data + j;
163 | readB[i].qual = readB_data + j;
164 | j += MAX_READ_CYCLE;
165 | }
166 |
167 | CSEREAD *workingReads, *loadingReads, *swapReads;
168 |
169 | ktrim_stat kstat;
170 | kstat.dropped = new unsigned int [ kp.thread ];
171 | kstat.real_adapter = new unsigned int [ kp.thread ];
172 | kstat.tail_adapter = new unsigned int [ kp.thread ];
173 | kstat.dimer = new unsigned int [ kp.thread ];
174 | kstat.pass = new unsigned int [ kp.thread ];
175 |
176 | // buffer for storing the modified reads per thread
177 | writeBuffer writebuffer;
178 | writebuffer.buffer1 = new char * [ kp.thread ];
179 | writebuffer.b1stored = new unsigned int [ kp.thread ];
180 |
181 | for(unsigned int i=0; i!=kp.thread; ++i) {
182 | writebuffer.buffer1[i] = new char[ BUFFER_SIZE_PER_BATCH_READ ];
183 | writebuffer.b1stored[i] = 0;
184 |
185 | kstat.dropped[i] = 0;
186 | kstat.real_adapter[i] = 0;
187 | kstat.tail_adapter[i] = 0;
188 | kstat.dimer[i] = 0;
189 | kstat.pass[i] = 0;
190 | }
191 |
192 | // deal with multiple input files
193 | // vector R1s;
194 | // extractFileNames( kp.FASTQU, R1s ); //this is moved to param_handler
195 | unsigned int totalFiles = kp.R1s.size();
196 | //cout << "\033[1;34mINFO: " << totalFiles << " single-end fastq files will be loaded.\033[0m\n";
197 |
198 | register unsigned int line = 0;
199 | unsigned int threadCNT = kp.thread - 1;
200 | for( unsigned int fileCnt=0; fileCnt!=totalFiles; ++ fileCnt ) {
201 | bool file_is_gz = false;
202 | FILE *fq;
203 | gzFile gfp;
204 | register unsigned int i = kp.R1s[fileCnt].size() - 3;
205 | register const char * p = kp.R1s[fileCnt].c_str();
206 | if( p[i]=='.' && p[i+1]=='g' && p[i+2]=='z' ) {
207 | file_is_gz = true;
208 | gfp = gzopen( p, "r" );
209 | if( gfp == NULL ) {
210 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n";
211 | return 104;
212 | }
213 | } else {
214 | fq = fopen( p, "rt" );
215 | if( fq == NULL ) {
216 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n";
217 | return 104;
218 | }
219 | }
220 | // initialization
221 | // get first batch of fastq reads
222 | unsigned int loaded;
223 | bool metEOF;
224 | if( file_is_gz ) {
225 | loaded = load_batch_data_SE_GZ( gfp, readA, READS_PER_BATCH );
226 | metEOF = gzeof( gfp );
227 | } else {
228 | loaded = load_batch_data_SE_C( fq, readA, READS_PER_BATCH );
229 | metEOF = feof( fq );
230 | }
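231 | if( loaded == 0 ) { if( file_is_gz ) gzclose( gfp ); else fclose( fq ); continue; } // empty file: close it and go on to the next one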
232 | // fprintf( stderr, "Loaded %d, metEOF=%d\n", loaded, metEOF );
233 |
234 | loadingReads = readB;
235 | workingReads = readA;
236 | bool nextBatch = true;
237 | unsigned int threadLoaded;
238 | while( nextBatch ) {
239 | // start parallalization
240 | write_thread = 0;
241 | omp_set_num_threads( kp.thread );
242 | #pragma omp parallel
243 | {
244 | unsigned int tn = omp_get_thread_num();
245 | // if EOF is met, then all threads are used for analysis
246 | // otherwise 1 thread will do data loading
247 | if( metEOF ) {
248 | unsigned int start = loaded * tn / kp.thread;
249 | unsigned int end = loaded * (tn+1) / kp.thread;
250 | workingThread_SE_C( tn, start, end, workingReads, &kstat, &writebuffer, kp );
251 | nextBatch = false;
252 | } else { // use 1 thread to load file, others for trimming
253 | if( tn == threadCNT ) {
254 | if( file_is_gz ) {
255 | threadLoaded = load_batch_data_SE_GZ( gfp, loadingReads, READS_PER_BATCH );
256 | metEOF = gzeof( gfp );
257 | } else {
258 | threadLoaded = load_batch_data_SE_C( fq, loadingReads, READS_PER_BATCH );
259 | metEOF = feof( fq );
260 | }
261 | nextBatch = (threadLoaded!=0);
262 | // cerr << "Thread " << tn << " has loaded " << threadLoaded << " reads.\n";
263 | // fprintf( stderr, "Loaded %d, metEOF=%d\n", threadLoaded, metEOF );
264 | //cerr << "Loading thread: " << threadLoaded << ", " << metEOF << ", " << nextBatch << '\n';
265 | } else {
266 | unsigned int start = loaded * tn / threadCNT;
267 | unsigned int end = loaded * (tn+1) / threadCNT;
268 | workingThread_SE_C( tn, start, end, workingReads, &kstat, &writebuffer, kp );
269 | // write output; fwrite is thread-safe
270 | }
271 | }
272 | } // parallel body
273 | // swap workingReads and loadingReads for next loop
274 | swapReads = loadingReads;
275 | loadingReads = workingReads;
276 | workingReads = swapReads;
277 | line += loaded;
278 | loaded = threadLoaded;
279 | //cerr << '\r' << line << " reads loaded";
280 | }
281 |
282 | if( file_is_gz ) {
283 | gzclose( gfp );
284 | } else {
285 | fclose( fq );
286 | }
287 | }
288 | //cerr << "\rDone: " << line << " lines processed.\n";
289 |
290 | // write trim.log
291 | unsigned int dropped_all=0, real_all=0, tail_all=0, dimer_all=0, pass_all=0; // unsigned, matching the counters and the %u format
292 | for( unsigned int i=0; i!=kp.thread; ++i ) {
293 | dropped_all += kstat.dropped[i];
294 | real_all += kstat.real_adapter[i];
295 | tail_all += kstat.tail_adapter[i];
296 | dimer_all += kstat.dimer[i];
297 | pass_all += kstat.pass[i];
298 | }
299 | fprintf( kp.flog, "Total\t%u\nDropped\t%u\nAadaptor\t%u\nTailHit\t%u\nDimer\t%u\nPass\t%u\n",
300 | line, dropped_all, real_all, tail_all, dimer_all, pass_all );
301 |
302 | //free memory
303 | for(unsigned int i=0; i!=kp.thread; ++i) {
304 | delete [] writebuffer.buffer1[i]; // allocated with new char[]
305 | }
306 | delete [] writebuffer.buffer1;
307 |
308 | delete [] kstat.dropped;
309 | delete [] kstat.real_adapter;
310 | delete [] kstat.tail_adapter;
311 | delete [] kstat.dimer;
312 | delete [] kstat.pass;
313 |
314 | delete [] readA;
315 | delete [] readB;
316 | delete [] readA_data;
317 | delete [] readB_data;
318 |
319 | return 0;
320 | }
321 |
322 | int process_single_thread_SE_C( const ktrim_param &kp ) {
323 | // IO speed-up
324 | // ios::sync_with_stdio( false );
325 | // cin.tie( NULL );
326 |
327 | CSEREAD *read = new CSEREAD[ READS_PER_BATCH_ST ];
328 | register char *read_data = new char[ MEM_SE_READSET_ST ];
329 |
330 | for( register int i=0, j=0; i!=READS_PER_BATCH_ST; ++i ) { // must match the single-thread allocation size above
331 | read[i].id = read_data + j;
332 | j += MAX_READ_ID;
333 | read[i].seq = read_data + j;
334 | j += MAX_READ_CYCLE;
335 | read[i].qual = read_data + j;
336 | j += MAX_READ_CYCLE;
337 | }
338 |
339 | ktrim_stat kstat;
340 | kstat.dropped = new unsigned int [ 1 ];
341 | kstat.real_adapter = new unsigned int [ 1 ];
342 | kstat.tail_adapter = new unsigned int [ 1 ];
343 | kstat.dimer = new unsigned int [ 1 ];
344 | kstat.pass = new unsigned int [ 1 ];
345 | kstat.dropped[0] = 0;
346 | kstat.real_adapter[0] = 0;
347 | kstat.tail_adapter[0] = 0;
348 | kstat.dimer[0] = 0;
349 | kstat.pass[0] = 0;
350 |
351 | // buffer for storing the modified reads per thread
352 | writeBuffer writebuffer;
353 | writebuffer.buffer1 = new char * [ 1 ];
354 | writebuffer.b1stored = new unsigned int [ 1 ];
355 | writebuffer.buffer1[0] = new char[ BUFFER_SIZE_PER_BATCH_READ_ST ];
356 |
357 | // deal with multiple input files
358 | unsigned int totalFiles = kp.R1s.size();
359 | //cout << "\033[1;34mINFO: " << totalFiles << " single-end fastq files will be loaded.\033[0m\n";
360 |
361 | register unsigned int line = 0;
362 | for( unsigned int fileCnt=0; fileCnt!=totalFiles; ++ fileCnt ) {
363 | //fq1.open( R1s[fileCnt].c_str() );
364 | bool file_is_gz = false;
365 | FILE *fq;
366 | gzFile gfp;
367 | register unsigned int i = kp.R1s[fileCnt].size() - 3;
368 | register const char * p = kp.R1s[fileCnt].c_str();
369 | if( p[i]=='.' && p[i+1]=='g' && p[i+2]=='z' ) {
370 | // fprintf( stderr, "GZ file!\n" );
371 | file_is_gz = true;
372 | gfp = gzopen( p, "r" );
373 | if( gfp == NULL ) {
374 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n";
375 | return 104;
376 | }
377 | } else {
378 | fq = fopen( p, "rt" );
379 | if( fq == NULL ) {
380 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n";
381 | return 104;
382 | }
383 | }
384 |
385 | while( true ) {
386 | // get fastq reads
387 | //unsigned int loaded = load_batch_data_SE( fq1, read, READS_PER_BATCH_ST );
388 | unsigned int loaded;
389 | if( file_is_gz ) {
390 | loaded = load_batch_data_SE_GZ( gfp, read, READS_PER_BATCH_ST );
391 | // fprintf( stderr, "Loaded=%d\n", loaded );
392 | } else {
393 | loaded = load_batch_data_SE_C( fq, read, READS_PER_BATCH_ST );
394 | }
395 | if( loaded == 0 ) break;
396 |
397 | write_thread = 0;
398 | workingThread_SE_C( 0, 0, loaded, read, &kstat, &writebuffer, kp );
399 | // write output and update fastq statistics
400 | line += loaded;
401 | //cerr << '\r' << line << " reads loaded";
402 |
403 | //if( fq1.eof() ) break;
404 | // fprintf( stderr, "Check\n" );
405 | if( file_is_gz ) {
406 | if( gzeof( gfp ) ) break;
407 | } else {
408 | if( feof( fq ) ) break;
409 | }
410 | }
411 | //fq1.close();
412 | if( file_is_gz ) {
413 | gzclose( gfp );
414 | } else {
415 | fclose( fq );
416 | }
417 | }
418 |
419 | // write trim.log
420 | fprintf( kp.flog, "Total\t%u\nDropped\t%u\nAadaptor\t%u\nTailHit\t%u\nDimer\t%u\nPass\t%u\n",
421 | line, kstat.dropped[0], kstat.real_adapter[0], kstat.tail_adapter[0], kstat.dimer[0], kstat.pass[0] );
422 |
423 | //free memory
424 | // delete buffer1;
425 | // delete fq1buffer;
426 | delete [] writebuffer.buffer1[0]; // allocated with new char[]
427 | delete [] read;
428 | delete [] read_data;
429 |
430 | return 0;
431 | }
432 |
433 |
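
`find_seed` above collects every position at which an adapter index occurs in the read: it calls `strstr` repeatedly and advances one character past each hit, so overlapping occurrences are reported too. Below is a minimal standalone sketch of that scan. `find_all` and `find_all_examples` are hypothetical names, and the sequences are made up for illustration.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical standalone version of the scan used in find_seed:
// returns every offset in `seq` at which `index` starts, overlaps included.
static std::vector<unsigned int> find_all( const char *seq, const char *index ) {
    std::vector<unsigned int> hits;
    if( index[0] == '\0' ) return hits;   // an empty pattern would match at every offset
    const char *loc = seq;
    while( (loc = std::strstr( loc, index )) != NULL ) {
        hits.push_back( static_cast<unsigned int>( loc - seq ) );
        ++ loc;   // step a single character so overlapping hits are not skipped
    }
    return hits;
}

// Illustrative cases with made-up sequences.
static void find_all_examples() {
    std::vector<unsigned int> hits = find_all( "AGAGAG", "AGAG" );
    assert( hits.size() == 2 && hits[0] == 0 && hits[1] == 2 );
    assert( find_all( "ACGT", "TT" ).empty() );
}
```

Stepping by one character rather than by the pattern length is what lets the second, overlapping hit at offset 2 be found in the first example.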
--------------------------------------------------------------------------------
/src/util.h:
--------------------------------------------------------------------------------
1 | /**
2 | * Author: Kun Sun (sunkun@szbl.ac.cn)
3 | * Date: Feb, 2020
4 | * This program is part of the Ktrim package
5 | **/
6 |
7 | #ifndef _KTRIM_UTIL_
8 | #define _KTRIM_UTIL_
9 |
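10 | #include <cstdio>
11 | #include <cstdlib>
12 | #include <cstring>
13 | #include <cmath>
14 | #include <string>
15 | #include <vector>
16 | #include <iostream>
17 | #include <fstream>
18 | #include <sstream>
19 | #include <algorithm>
20 | #include <omp.h>
21 | #include <zlib.h>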
22 | #include "common.h"
23 |
24 | using namespace std;
25 |
26 | // extract file names
27 | void extractFileNames( const char *str, vector<string> & Rs ) {
28 | string fileName = "";
29 | for(unsigned int i=0; str[i]!='\0'; ++i) {
30 | if( str[i] == FILE_SEPARATOR ) {
31 | Rs.push_back( fileName );
32 | fileName.clear();
33 | } else {
34 | fileName += str[i];
35 | }
36 | }
37 | if( fileName.size() ) // in case there is a FILE_SEPARATOR at the end
38 | Rs.push_back( fileName );
39 | }
40 |
41 | void loadFQFileNames( ktrim_param & kp ) {
42 | ifstream fin;
43 | fin.open( kp.filelist );
44 | if( fin.fail() ) {
45 | cerr << "Error: load file " << kp.filelist << " failed!\n";
46 | exit(10);
47 | }
48 |
49 | string line;
50 | stringstream ss;
51 | string fq1, fq2;
52 |
53 | // read the first valid line, determine SE/PE data
54 | int lineCnt = 0;
55 | bool pe_data = false; // stays false if the list contains no valid line
56 | while( true ) {
57 | getline(fin, line);
58 | if( fin.eof() )break;
59 |
60 | if( line[0] != '#' ) { // lines starting with "#" are ignored as comments
61 | ss.str( line );
62 | ss.clear();
63 | ss >> fq1;
64 |
65 | if( ss.rdbuf()->in_avail() ) {
66 | ss >> fq2;
67 | pe_data = true;
68 |
69 | kp.R1s.push_back( fq1 );
70 | kp.R2s.push_back( fq2 );
71 | } else {
72 | pe_data = false;
73 | kp.R1s.push_back( fq1 );
74 | }
75 |
76 | ++ lineCnt;
77 |
78 | break;
79 | }
80 | }
81 |
82 | bool inconsistent = false;
83 | while( true ) {
84 | getline(fin, line);
85 | if( fin.eof() )break;
86 |
87 | if( line[0] == '#' ) {
88 | continue;
89 | }
90 |
91 | ss.str( line );
92 | ss.clear();
93 | ss >> fq1;
94 |
95 | if( pe_data ) {
96 | if( ss.rdbuf()->in_avail() ) {
97 | ss >> fq2;
98 | kp.R1s.push_back( fq1 );
99 | kp.R2s.push_back( fq2 );
100 | } else {
101 | inconsistent = true;
102 | break;
103 | }
104 | } else {
105 | if( ss.rdbuf()->in_avail() ) {
106 | inconsistent = true;
107 | break;
108 | }
109 | kp.R1s.push_back( fq1 );
110 | }
111 |
112 | ++ lineCnt;
113 | }
114 |
115 | fin.close();
116 | if( inconsistent ) {
117 | cerr << "\033[1;31mERROR: inconsistent PE/SE in '" << kp.filelist << "'!\033[0m\n";
118 | exit(3);
119 | }
120 | if( pe_data ) {
121 | if( kp.R1s.size() != kp.R2s.size() ) {
122 | cerr << "\033[1;31mError: incorrect pairs in file list!\033[0m\n";
123 | exit(110);
124 | }
125 | }
126 | // cerr << "Done: " << lineCnt << " lines of " << (pe_data) ? 'P' : 'S' << "E data loaded.\n";
127 | kp.paired_end_data = pe_data;
128 | }
129 |
130 | //load 1 batch of data, using purely C-style
131 | unsigned int load_batch_data_SE_C( FILE *fp, CSEREAD *loadingReads, unsigned int num ) {
132 | register unsigned int loaded = 0;
133 | register CSEREAD *p = loadingReads;
134 | register CSEREAD *q = p + num;
135 | while( p != q ) {
136 | if( fgets( p->id, MAX_READ_ID, fp ) == NULL ) break;
137 | fgets( p->seq, MAX_READ_CYCLE, fp );
138 | fgets( p->qual, MAX_READ_CYCLE, fp ); // the '+' separator line; overwritten by the quality line read next
139 | fgets( p->qual, MAX_READ_CYCLE, fp );
140 |
141 | // remove the trailing '\n'
142 | register unsigned int i = strlen( p->seq ) - 1;
143 | p->size = i;
144 | p->seq[i] = 0;
145 | p->qual[i] = 0;
146 |
147 | ++ p;
148 | }
149 | return p-loadingReads;
150 | }
151 |
152 | // update in v1.2: support Gzipped input file
153 | unsigned int load_batch_data_SE_GZ( gzFile gfp, CSEREAD *loadingReads, unsigned int num ) {
154 | register unsigned int loaded = 0;
155 | // string unk;
156 | register CSEREAD *p = loadingReads;
157 | register CSEREAD *q = p + num;
158 | register char c;
159 | register unsigned int i;
160 |
161 | while( p != q ) {
162 | if( gzgets( gfp, p->id, MAX_READ_ID ) == NULL ) break;
163 | gzgets( gfp, p->seq, MAX_READ_CYCLE );
164 | gzgets( gfp, p->qual, MAX_READ_CYCLE ); // the '+' separator line; overwritten by the quality line read next
165 | gzgets( gfp, p->qual, MAX_READ_CYCLE );
166 |
167 | // remove the trailing '\n'
168 | i = strlen( p->seq ) - 1;
169 | p->size = i;
170 | p->seq[i] = 0;
171 | p->qual[i] = 0;
172 |
173 | ++ p;
174 | }
175 |
176 | return p-loadingReads;
177 | }
178 |
179 | unsigned int load_batch_data_PE_C( FILE *fq, CPEREAD *loadingReads, const unsigned int num, const bool isRead1 ) {
180 | //register unsigned int loaded = 0;
181 | register CPEREAD *p = loadingReads;
182 | register CPEREAD *q = p + num;
183 | // load read1
184 | if( isRead1 ) {
185 | while( p != q ) {
186 | if( fgets( p->id1, MAX_READ_ID, fq ) == NULL ) break;
187 | fgets( p->seq1, MAX_READ_CYCLE, fq );
188 | fgets( p->qual1, MAX_READ_CYCLE, fq ); // the '+' separator line; overwritten by the quality line read next
189 | fgets( p->qual1, MAX_READ_CYCLE, fq );
190 | p->size = strlen( p->seq1 );
191 |
192 | ++ p;
193 | }
194 | } else {
195 | while( p != q ) {
196 | if( fgets( p->id2, MAX_READ_ID, fq ) == NULL ) break;
197 | fgets( p->seq2, MAX_READ_CYCLE, fq );
198 | fgets( p->qual2, MAX_READ_CYCLE, fq ); // the '+' separator line; overwritten by the quality line read next
199 | fgets( p->qual2, MAX_READ_CYCLE, fq );
200 | p->size2 = strlen( p->seq2 );
201 |
202 | ++ p;
203 | }
204 | }
205 | return p - loadingReads;
206 | }
207 |
208 | // update in v1.2: support Gzipped input file
209 | unsigned int load_batch_data_PE_GZ( gzFile gfp, CPEREAD *loadingReads, const unsigned int num, const bool isRead1 ) {
210 | //register unsigned int loaded = 0;
211 | register CPEREAD *p = loadingReads;
212 | register CPEREAD *q = p + num;
213 | if( isRead1 ) {
214 | while( p != q ) {
215 | if( gzgets( gfp, p->id1, MAX_READ_ID ) == NULL ) break;
216 | gzgets( gfp, p->seq1, MAX_READ_CYCLE );
217 | gzgets( gfp, p->qual1, MAX_READ_CYCLE ); // the '+' separator line; overwritten by the quality line read next
218 | gzgets( gfp, p->qual1, MAX_READ_CYCLE );
219 | p->size = strlen( p->seq1 );
220 |
221 | ++ p;
222 | }
223 | } else {
224 | while( p != q ) {
225 | if( gzgets( gfp, p->id2, MAX_READ_ID ) == NULL ) break;
226 | gzgets( gfp, p->seq2, MAX_READ_CYCLE );
227 | gzgets( gfp, p->qual2, MAX_READ_CYCLE ); // the '+' separator line; overwritten by the quality line read next
228 | gzgets( gfp, p->qual2, MAX_READ_CYCLE );
229 | p->size2 = strlen( p->seq2 );
230 |
231 | ++ p;
232 | }
233 | }
234 | return p - loadingReads;
235 | }
236 |
237 | unsigned int load_batch_data_PE_both_C( FILE *fq1, FILE *fq2, CPEREAD *loadingReads, unsigned int num ) {
238 | //register unsigned int loaded = 0;
239 | register CPEREAD *p = loadingReads;
240 | register CPEREAD *q = p + num;
241 | register CPEREAD *s = p;
242 | // load read1
243 | while( p != q ) {
244 | if( fgets( p->id1, MAX_READ_ID, fq1 ) == NULL ) break;
245 | fgets( p->seq1, MAX_READ_CYCLE, fq1 );
246 | fgets( p->qual1, MAX_READ_CYCLE, fq1 ); // the '+' separator line; overwritten by the quality line read next
247 | fgets( p->qual1, MAX_READ_CYCLE, fq1 );
248 | p->size = strlen( p->seq1 );
249 |
250 | ++ p;
251 | }
252 |
253 | // load read2
254 | while( s != p ) {
255 | fgets( s->id2, MAX_READ_ID, fq2 );
256 | fgets( s->seq2, MAX_READ_CYCLE, fq2 );
257 | fgets( s->qual2, MAX_READ_CYCLE, fq2 ); // the '+' separator line; overwritten by the quality line read next
258 | fgets( s->qual2, MAX_READ_CYCLE, fq2 );
259 | s->size2 = strlen( s->seq2 );
260 |
261 | ++ s;
262 | }
263 |
264 | return p - loadingReads;
265 | }
266 |
267 | // update in v1.2: support Gzipped input file
268 | unsigned int load_batch_data_PE_both_GZ( gzFile gfp1, gzFile gfp2, CPEREAD *loadingReads, unsigned int num ) {
269 | //register unsigned int loaded = 0;
270 | register CPEREAD *p = loadingReads;
271 | register CPEREAD *q = p + num;
272 | register CPEREAD *s = p;
273 | while( p != q ) {
274 | if( gzgets( gfp1, p->id1, MAX_READ_ID ) == NULL ) break;
275 | gzgets( gfp1, p->seq1, MAX_READ_CYCLE );
276 | gzgets( gfp1, p->qual1, MAX_READ_CYCLE ); // the '+' separator line; overwritten by the quality line read next
277 | gzgets( gfp1, p->qual1, MAX_READ_CYCLE );
278 | p->size = strlen( p->seq1 );
279 |
280 | ++ p;
281 | }
282 | while( s != p ) {
283 | gzgets( gfp2, s->id2, MAX_READ_ID );
284 | gzgets( gfp2, s->seq2, MAX_READ_CYCLE );
285 | gzgets( gfp2, s->qual2, MAX_READ_CYCLE ); // the '+' separator line; overwritten by the quality line read next
286 | gzgets( gfp2, s->qual2, MAX_READ_CYCLE );
287 | s->size2 = strlen( s->seq2 );
288 |
289 | ++ s;
290 | }
291 |
292 | return p - loadingReads;
293 | }
294 |
295 | /*
296 | * use dynamic max_mismatch as the covered size can range from 3 to a large number such as 50,
297 | * by default the maximum number of mismatches allowed is ceil(LEN/8)
298 | */
299 | bool check_mismatch_dynamic_SE_C( CSEREAD *read, unsigned int pos, const ktrim_param & kp ) {
300 | register unsigned int mis=0;
301 | register unsigned int i, len;
302 | len = read->size - pos;
303 | if( len > kp.adapter_len )
304 | len = kp.adapter_len;
305 |
306 | register unsigned int max_mismatch_dynamic;
307 | // update in v1.1.0: allows the users to set the proportion of mismatches
308 | if( kp.use_default_mismatch ) {
309 | max_mismatch_dynamic = len >> 3;
310 | if( (max_mismatch_dynamic<<3) != len )
311 | ++ max_mismatch_dynamic;
312 | } else {
313 | max_mismatch_dynamic = ceil( len * kp.mismatch_rate );
314 | }
315 |
316 | register const char *p = read->seq;
317 | for( i=0; i!=len; ++i ) {
318 | if( p[pos+i] != kp.adapter_r1[i] ) {
319 | if( mis == max_mismatch_dynamic )
320 | return false;
321 |
322 | ++ mis;
323 | }
324 | }
325 |
326 | return true;
327 | }
328 |
329 | /*
330 | * use dynamic max_mismatch as the covered size can range from 3 to a large number such as 50;
331 | * by default each read is allowed ceil(LEN/8) mismatches, i.e., about LEN/4 for read1 + read2 combined
332 | */
333 | bool check_mismatch_dynamic_PE_C( const CPEREAD *read, unsigned int pos, const ktrim_param &kp ) {
334 | register unsigned int mis1=0, mis2=0;
335 | register unsigned int i, len;
336 | len = read->size - pos;
337 | if( len > kp.adapter_len )
338 | len = kp.adapter_len;
339 |
340 | register unsigned int max_mismatch_dynamic;
341 | // update in v1.1.0: allows the users to set the proportion of mismatches
342 | // BUT it is highly discouraged
343 | if( kp.use_default_mismatch ) {
344 | // each read allows 1/8 mismatches of the total comparable length
345 | max_mismatch_dynamic = len >> 3;
346 | if( (max_mismatch_dynamic<<3) != len )
347 | ++ max_mismatch_dynamic;
348 | } else {
349 | max_mismatch_dynamic = ceil( len * kp.mismatch_rate );
350 | }
351 |
352 | // check mismatch for each read
353 | register const char * p = read->seq1 + pos;
354 | register const char * q = kp.adapter_r1;
355 | for( i=0; i!=len; ++i, ++p, ++q ) {
356 | if( *p != *q ) {
357 | if( mis1 == max_mismatch_dynamic )
358 | return false;
359 |
360 | ++ mis1;
361 | }
362 | }
363 | p = read->seq2 + pos;
364 | q = kp.adapter_r2;
365 | for( i=0; i!=len; ++i, ++p, ++q ) {
366 | if( *p != *q ) {
367 | if( mis2 == max_mismatch_dynamic )
368 | return false;
369 |
370 | ++ mis2;
371 | }
372 | }
373 |
374 | return true;
375 | }
376 |
377 | // update in v1.2: support window check
378 | int get_quality_trim_cycle_se( const char *p, const int total_size, const ktrim_param &kp ) {
379 | register int i, j, k;
380 | register int stop = kp.min_length-1;
381 | for( i=total_size-1; i>=stop; ) {
382 | if( p[i] >= kp.quality ) {
383 | k = i - kp.window;
384 | for( j=i-1; j!=k; --j ) {
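385 | if( j<0 || p[j]<kp.quality ) { // lines 385-399 reconstructed from context
386 | break;
387 | }
388 | }
389 | if( j == k ) { // the whole window passed: trimming position found
390 | break;
391 | } else { // a low-quality base lies inside the window: resume below it
392 | i = j - 1;
393 | }
394 | } else {
395 | -- i;
396 | }
397 | }
398 |
399 | if( i >= stop )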
400 | return i + 1;
401 | else
402 | return 0;
403 | }
404 |
405 | int get_quality_trim_cycle_pe( const CPEREAD *read, const ktrim_param &kp ) {
406 | register int i, j, k;
407 | register int stop = kp.min_length - 1;
408 | const char *p = read->qual1;
409 | const char *q = read->qual2;
410 | for( i=read->size-1; i>=stop; ) {
411 | if( p[i]>=kp.quality && q[i]>=kp.quality ) {
412 | k = i - kp.window;
413 | for( j=i-1; j!=k; --j ) {
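414 | if( j<0 || p[j]<kp.quality || q[j]<kp.quality ) { // lines 414-428 reconstructed from context
415 | break;
416 | }
417 | }
418 | if( j == k ) { // the whole window passed in both reads: trimming position found
419 | break;
420 | } else { // a low-quality base lies inside the window: resume below it
421 | i = j - 1;
422 | }
423 | } else {
424 | -- i;
425 | }
426 | }
427 |
428 | if( i >= stop )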
429 | return i + 1;
430 | else
431 | return 0;
432 | }
433 |
434 | #endif
435 |
436 |
--------------------------------------------------------------------------------
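The default mismatch budget and the windowed quality scan in `util.h` above can be sketched in isolation as follows. This is a minimal, illustrative sketch rather than Ktrim's own code: `default_max_mismatch` and `quality_trim_cycle` are hypothetical names, and the plain `min_qual`, `window`, and `min_length` arguments are assumed stand-ins for the `kp.quality`, `kp.window`, and `kp.min_length` fields.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: the default mismatch budget, ceil(len / 8),
// computed with integer arithmetic only.
unsigned int default_max_mismatch( unsigned int len ) {
	return ( len + 7 ) / 8;
}

// Hypothetical helper mirroring the windowed quality scan for one read.
//   qual       : raw quality characters of the read
//   min_qual   : lowest acceptable quality character (assumed to already include the Phred offset)
//   window     : number of consecutive passing bases required at the 3' end
//   min_length : shortest read that is still kept
// Returns the number of cycles to keep, or 0 if the read should be discarded.
int quality_trim_cycle( const std::string &qual, char min_qual, int window, int min_length ) {
	assert( window >= 1 && min_length >= 1 );
	const int stop = min_length - 1;
	int i = static_cast<int>( qual.size() ) - 1;
	while( i >= stop ) {
		if( qual[i] < min_qual ) {	// low-quality tail base: step left
			--i;
			continue;
		}
		const int k = i - window;	// one position left of the window
		int j = i - 1;
		while( j != k && j >= 0 && qual[j] >= min_qual )
			--j;
		if( j == k )			// the whole window passed
			break;
		i = j - 1;			// restart below the failing position
	}
	return ( i >= stop ) ? i + 1 : 0;
}
```

With a window of 5 and a minimum length of 3, a ten-cycle read whose last two qualities fall below the threshold is kept for 8 cycles, while a read that contains no run of 5 passing bases is discarded (return value 0).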
/testing_dataset/adapter.fa:
--------------------------------------------------------------------------------
1 | >testingPE/1
2 | CTGTCTCTTATACACATCT
3 | >testingPE/2
4 | AGATGTGTATAAGAGACAG
5 |
--------------------------------------------------------------------------------
/testing_dataset/check.accuracy.pl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/perl
2 |
3 | #
4 | # Author: Kun Sun @ SZBL (sunkun@szbl.ac.cn)
5 | #
6 |
7 | use strict;
8 | use warnings;
9 |
10 | if( $#ARGV < 1 ) {
11 | print STDERR "\nUsage: $0 <raw.read1.fq> <trimmed.read1.fq> [min.size=36] [seq.cycle=100]\n\n";
12 | exit 2;
13 | }
14 |
15 | my $cutSize = $ARGV[2] || 36;
16 | my $seqCycle = $ARGV[3] || 100;
17 |
18 | my %raw;
19 | open IN, "$ARGV[0]" or die( "$!" );
20 | while( <IN> ) {
21 | /^\@(\d+):(\d+)\//;
22 | $raw{$1} = $2;
23 | <IN>;
24 | <IN>;
25 | <IN>;
26 | }
27 | close IN;
28 |
29 | my %trim;
30 | open IN, "$ARGV[1]" or die( "$!" );
31 | while( <IN> ) {
32 | /^\@(\d+):/;
33 | my $id = $1;
34 | my $seq = <IN>;
35 | $trim{$id} = length($seq)-1;
36 | <IN>;
37 | <IN>;
38 | }
39 | close IN;
40 |
41 | my $perfect = 0;
42 | my $mistrim = 0; ## reads that should be trimmed but kept
43 | my $overtrim = 0; ## usually tail-hits
44 | my $overkill = 0; ## the reads are overtrimmed and get discarded
45 | my $error = 0; ## the read is trimmed at a wrong position
46 |
47 | foreach my $id ( keys %raw ) {
48 | if( $raw{$id} < $cutSize ) { ## too short reads, should be discarded
49 | if( exists $trim{$id} ) {
50 | ++ $mistrim;
51 | } else {
52 | ++ $perfect;
53 | }
54 | } else { ## long reads, should be kept
55 | unless( exists $trim{$id} ) {
56 | ++ $overkill;
57 | next;
58 | }
59 | if( $raw{$id} >= $seqCycle ) { ## very long read
60 | if( $trim{$id} == $seqCycle ) {
61 | ++ $perfect;
62 | } else {
63 | ++ $overtrim;
64 | }
65 | } else {
66 | if( $trim{$id} == $raw{$id} ) {
67 | ++ $perfect;
68 | } elsif( $trim{$id} < $raw{$id} ) {
69 | ++ $overtrim;
70 | } elsif( $trim{$id} >= $seqCycle-2) { ## seems a tail-trim
71 | ++ $mistrim;
72 | } else {
73 | ++ $error;
74 | #print STDERR "$id\n";
75 | }
76 | }
77 | }
78 | }
79 |
80 | print "Correct\t$perfect\n",
81 | "Over-trim\t", $overtrim + $overkill, "\n",
82 | "Missed-trim\t", $mistrim + $error, "\n";
83 |
84 |
--------------------------------------------------------------------------------
/testing_dataset/simu.reads.pl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/perl
2 |
3 | #
4 | # Author: Kun Sun @ SZBL (sunkun@szbl.ac.cn)
5 | #
6 |
7 | use strict;
8 | use warnings;
9 |
10 | if( $#ARGV < 0 ) {
11 | print STDERR "\nUsage: $0 <output.prefix> [num=1e7] [min.size=10] [max.size=200] [read.cycle=100] [error.rate=0.01]\n\n";
12 | exit 2;
13 | }
14 |
15 | ## set a fixed seed so that every run gives exactly the same result
16 | srand( 7 );
17 |
18 | our @nuc = ( 'A', 'C', 'G', 'T' );
19 | ## for testing purpose, using the adapters from the Nextera kits
20 | our $adapter_1 = "CTGTCTCTTATACACATCT";
21 | our $adapter_2 = "AGATGTGTATAAGAGACAG";
22 |
23 | my $num = $ARGV[1] || 1e7; ## number of fragments
24 | my $min = $ARGV[2] || 10; ## minimum fragment size
25 | my $max = $ARGV[3] || 200; ## maximum fragment size
26 | my $cycle = $ARGV[4] || 100; ## read cycle
27 | my $eRate = $ARGV[5] || 0.01; ## sequencing error rate
28 |
29 | ## quality is NOT considered for comparison purposes
30 | my $qual = 'h' x $cycle;
31 |
32 | open FQ1, ">$ARGV[0].read1.fq" or die( "$!" );
33 | open FQ2, ">$ARGV[0].read2.fq" or die( "$!" );
34 |
35 | foreach my $i ( 1..$num ) {
36 | my $size = int( rand()*($max-$min) + $min );
37 |
38 | my $read1 = join('', map{ $nuc[int rand @nuc] } (1..$size));
39 | my $read2 = reverse $read1;
40 | $read2 =~ tr/ACGT/TGCA/;
41 | my $m1 = mut($adapter_1, $eRate);
42 | my $m2 = mut($adapter_2, $eRate);
43 | #print STDERR "$read1+$m1\n$read2+$m2\n";
44 | $read1 .= $m1;
45 | $read2 .= $m2;
46 | if( length($read1) < $cycle ) {
47 | my $tail = $cycle - length($read1);
48 | $read1 .= 'A' x $tail;
49 | $read2 .= 'T' x $tail;
50 | }
51 |
52 | print FQ1 "\@$i:$size/1\n", substr($read1, 0, $cycle), "\n+\n$qual\n";
53 | print FQ2 "\@$i:$size/2\n", substr($read2, 0, $cycle), "\n+\n$qual\n";
54 | }
55 |
56 | close FQ1;
57 | close FQ2;
58 |
59 |
60 | sub mut {
61 | my $raw = shift;
62 | my $eRate = shift;
63 |
64 | my $len = length($raw);
65 |
66 | my $mut = '';
67 | for(my $s=0; $s<$len; ++$s) {
68 | my $r = substr( $raw, $s, 1 );
69 | if( rand() < $eRate ) { ## add a mutation
70 | while( 1 ) {
71 | my $m = $nuc[ int rand @nuc ];
72 | if( $m ne $r ) {
73 | $mut .= $m;
74 | last;
75 | }
76 | }
77 | } else {
78 | $mut .= $r;
79 | }
80 | }
81 |
82 | return $mut;
83 | }
84 |
85 |
--------------------------------------------------------------------------------