├── README.md
├── bin
    └── ktrim
├── change.log
├── license
    ├── GPLv3.txt
    └── Ktrim.license
├── makefile
├── src
    ├── common.h
    ├── ktrim.cpp
    ├── param_handler.h
    ├── pe_handler.h
    ├── se_handler.h
    └── util.h
└── testing_dataset
    ├── adapter.fa
    ├── check.accuracy.pl
    └── simu.reads.pl

/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data
4 | Version 1.6.0, Oct 2024
5 | Author: Kun Sun \(sunkun@szbl.ac.cn\)
6 |
7 | Distributed under the 8 | [GNU General Public License v3.0 \(GPLv3\)](https://www.gnu.org/licenses/gpl-3.0.en.html "GPLv3") 9 | for personal and academic usage only.
10 | For detailed information please refer to the license files under the `license` directory. 11 | 12 | --- 13 | ## Release of version 1.6 14 | The author is pleased to release version 1.6 of Ktrim with usability improvements. 15 | This version adds a "-R" option to output reads with adapters only. This could be useful for short libraries 16 | in which all reads should contain adapters (e.g., CLIP-seq, whose adapters are also included as built-in support). 17 | 18 | ## Major features of Ktrim 19 | 1. Fast, sensitive, and accurate 20 | 2. Supports both paired- and single-end data 21 | 3. Supports both Gzip-compressed and plain-text input 22 | 4. Supports output to stdout to pipe with downstream software, e.g., aligners 23 | 5. Supports multi-threading for speed-up 24 | 6. Built-in support for common adapters; customized adapters are also supported 25 | 26 | ## Installation 27 | `Ktrim` is written in `C++` for GNU Linux/Unix platforms. After uncompressing the source package, you can 28 | find an executable file at `bin/ktrim` compiled using `g++ v4.8.5` and linked with `libz v1.2.7` for Linux 29 | x64 systems. If you cannot run it (usually because of outdated `libc++` or `libz` libraries) 30 | or you want to build a version optimized for your system, you can re-compile the program: 31 | ``` 32 | user@linux$ make clean && make 33 | ``` 34 | 35 | ## Run Ktrim 36 | The main program is `ktrim` under the `bin/` directory. 
You can add its path to your `~/.bashrc` file under 37 | the `PATH` variable to call it from anywhere; or you can run the following command to add it to your 38 | current session: 39 | ``` 40 | user@linux$ PATH=$PATH:$PWD/bin/ 41 | ``` 42 | 43 | You can also add `ktrim` to your system to call it from anywhere and share with other users (requires 44 | root privilege): 45 | ``` 46 | user@linux# make install 47 | ``` 48 | 49 | Call `ktrim` without any parameters to see the usage (or use '-h' option): 50 | ``` 51 | Usage: Ktrim [options] -f fq.list {-1/-U Read1.fq [-2 Read2.fq ]} -o out.prefix 52 | 53 | Author : Kun Sun (sunkun@szbl.ac.cn) 54 | Version: 1.6.0 (Oct 2024) 55 | 56 | Ktrim is designed to perform adapter- and quality-trimming of FASTQ files. 57 | 58 | Compulsory parameters: 59 | 60 | -f fq.list Specify the path to a file containing path to read 1/2 fastq files 61 | 62 | OR you can specify the fastq files directly: 63 | 64 | -1/-U Read1.fq Specify the path to the files containing read 1 65 | If your data is Paired-end, use '-1' and specify read 2 files using '-2' option 66 | Note that if '-U' is used, specification of '-2' is invalid 67 | If you have multiple files for your sample, use ',' to separate them 68 | Note that Gzip-compressed files (with .gz suffix) are also supported 69 | 70 | -o out.prefix Specify the prefix of the output files 71 | Note that output files include trimmed reads in FASTQ format and statistics 72 | 73 | Optional parameters: 74 | 75 | -2 Read2.fq Specify the path to the file containing read 2 76 | Use this parameter if your data is generated in paired-end mode 77 | If you have multiple files for your sample, use ',' to separate them 78 | and make sure that all the files are well paired in '-1' and '-2' options 79 | 80 | -c Write the trimming results to stdout (default: not set) 81 | Note that the interleaved fastq format will be used for paired-end data. 
82 | -R Only output reads with adapter (default: not set) 83 | -t threads Specify how many threads should be used (default: 6) 84 | You can set '-t' to 0 to use all threads (automatically detected) 85 | 2-8 threads are recommended, as more threads would not benefit the performance 86 | 87 | -p phred-base Specify the baseline of the phred score (default: 33) 88 | -q score The minimum quality score to keep the cycle (default: 20) 89 | Note that 20 means 1% error rate, 30 means 0.1% error rate in Phred 90 | -w window Set the window size for quality check (default: 5) 91 | Ktrim will stop when all the bases in a consecutive window pass the quality threshold 92 | 93 | Phred 33 ('!') and Phred 64 ('@') are the most widely used scoring system 94 | while quality scores start from 35 ('#') in the FASTQ files is also common 95 | 96 | -s size Minimum read size to be kept after trimming (default: 36) 97 | 98 | -k kit Specify the sequencing kit to use built-in adapters 99 | Currently supports 'Illumina' (default), 'Nextera', 'CLIP' and 'BGI' 100 | -a sequence Specify the adapter sequence in read 1 101 | -b sequence Specify the adapter sequence in read 2 102 | If '-a' is set while '-b' is not, I will assume that read 1 and 2 use same adapter 103 | Note that '-k' option has a higher priority (when set, '-a'/'-b' will be ignored) 104 | 105 | -m proportion Set the proportion of mismatches allowed during index and sequence comparison 106 | Default: 0.125 (i.e., 1/8 of compared base pairs) 107 | 108 | -h Show this help information and quit 109 | -v Show the software version and quit 110 | 111 | Please refer to README.md file for more information (e.g., setting adapters). 112 | Citation: Sun K. Bioinformatics 2020 Jun 1; 36(11):3561-3562. (PMID: 32159761) 113 | 114 | Ktrim: extra-fast and accurate adapter- and quality-trimmer. 115 | ``` 116 | 117 | Please note that from version 1.2.0, Ktrim supports Gzip-compressed files as input. 
If you have multiple 118 | lanes of FASTQ files, Ktrim also allows some lanes to be compressed while others are in plain text, as 119 | long as READ1 and READ2 are in the same format (either both compressed or both plain text) for each lane of paired-end data. 120 | 121 | `Ktrim` contains built-in adapter sequences used by Illumina TruSeq kits, Nextera kits, Nextera transposase 122 | adapters and BGI sequencing kits within the package. However, customized adapter sequences are also allowed 123 | by setting '-a' (for read 1) and '-b' (for read 2; if it is the same as read 1, you can leave it blank) 124 | options. You may need to refer to the manual of your library preparation kit for the adapter sequences. 125 | Note that in the current version of `Ktrim`, only 1 pair of adapters is allowed. 126 | 127 | Here are the built-in adapter sequences (the copyright should belong to the corresponding companies): 128 | 129 | ``` 130 | Illumina TruSeq kits: 131 | AGATCGGAAGAGC (for both read 1 and read 2) 132 | 133 | Nextera kits (suitable for ATAC-seq, Cut & tag data): 134 | CTGTCTCTTATACACATCT (for both read 1 and read 2) 135 | 136 | BGI adapters: 137 | Read 1: AAGTCGGAGGCCAAGCGGTC 138 | Read 2: AAGTCGGATCGTAGCCATGT 139 | 140 | CLIP-seq adapters: 141 | Read 1: TGGAATTCTCGGGTGCCAAGG 142 | Read 2: GATCGTCGGACTGTAGAACTCTGAAC 143 | ``` 144 | 145 | ### Example 1 146 | If your data is generated using an Illumina TruSeq kit in single-end mode, you can run: 147 | ``` 148 | user@linux$ ktrim -U /path/to/read1.fq -o /path/to/output/dir 149 | ``` 150 | 151 | ### Example 2 152 | Your data is generated using a kit with customized adapters; your data is composed of 3 lanes in Paired-end 153 | mode and you have prepared a `file.list` to record the paths as follows: 154 | ``` 155 | /path/to/lane1.read1.fq.gz /path/to/lane1.read2.fq.gz 156 | /path/to/lane2.read1.fq.gz /path/to/lane2.read2.fq.gz 157 | /path/to/lane3.read1.fq /path/to/lane3.read2.fq 158 | ``` 159 | in addition, your Phred scoring system 
starts from 64; you want to keep the high-quality (Phred score >=30) 160 | bases and reads longer than 50 bp after trimming; and you want to use 8 threads to speed up the analysis, 161 | then you can run: 162 | ``` 163 | user@linux$ ktrim -f file.list -t 8 -p 64 -q 30 -s 50 -o /path/to/output/dir \ 164 | -a READ1_ADAPTER_SEQUENCE -b READ2_ADAPTER_SEQUENCE 165 | ``` 166 | 167 | ## Outputs explanation 168 | `Ktrim` outputs the trimmed reads in FASTQ format and key statistics (e.g., the number of reads that 169 | contain adapters and the number of reads in the trimmed files). 170 | 171 | ## Testing dataset and benchmark evaluation 172 | Under the `testing_dataset/` directory, a script named `simu.reads.pl` is provided to generate *in silico* 173 | reads for testing purposes only. **Note that the results in the paper are based on the data generated by this 174 | script.** Another script `check.accuracy.pl` is designed to evaluate the accuracies of the trimming tools. 175 | 176 | Please refer to Supplementary Method in the paper for reproducing the results (using Ktrim v1.1.0). 177 | 178 | ## Citation 179 | When referencing, please cite "Sun K: **Ktrim: an extra-fast and accurate adapter- and quality-trimmer 180 | for sequencing data.** *Bioinformatics* 2020 Jun 1; 36(11):3561-3562." 181 | [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/32159761 "Ktrim@PubMed") 182 | [Full Text](https://doi.org/10.1093/bioinformatics/btaa171 "Full text on Bioinformatics") 183 | 184 | --- 185 | Please send bug reports to Kun Sun \(sunkun@szbl.ac.cn\).
186 | Ktrim is freely available at 187 | [https://github.com/hellosunking/Ktrim/](https://github.com/hellosunking/Ktrim/ "Ktrim @ Github"). 188 | 189 | -------------------------------------------------------------------------------- /bin/ktrim: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hellosunking/Ktrim/a2d98f0c802a30214310802ad73dfb51a1baf426/bin/ktrim -------------------------------------------------------------------------------- /change.log: -------------------------------------------------------------------------------- 1 | Ktrim change log 2 | 3 | v1.6.0 Oct 2024 4 | * Add '-R' option to output reads with adapters only 5 | * Add built-in adapters for CLIP-seq 6 | * Fix a bug in single-thread file-writing 7 | * Change the default thread to 6 8 | 9 | v1.5.1 Nov 2023 10 | * Fix a bug that causes Ktrim to exit when the input FASTQ file is too small 11 | 12 | v1.5.0 May 2023 13 | * Add support for providing input using a file (-f option) 14 | * Add support for outputting to stdout for piping with other tools (-c option; use interleaved-fastq format for paired-end data) 15 | * Optimize file writing, PE data has ~25% speed-up (under 4-threads) 16 | * Default thread is set to 4 (-t option) 17 | 18 | v1.4.1 Dec 2021 19 | * Optimize file writing, both SE/PE data has ~10% speed-up (under 4-threads) 20 | * Allows 1 mismatch in tail-hits of PE data 21 | 22 | v1.4.0 Oct 2021 23 | * Optimize file loading, PE data has a ~50% speed-up using 4-threads 24 | 25 | v1.3.1 Apr 2021 26 | * Fix bug in processing SE data using single-thread mode 27 | 28 | v1.3.0 Mar 2021 29 | * Report the number of adapter dimers in the sequencing data 30 | 31 | v1.2.2 Jan 2021 32 | * Fix bug when "-o" is NOT present but the program does not quit 33 | 34 | v1.2.1 Nov 2020 35 | * Fix bug in multi-file handling 36 | 37 | v1.2.0 Jun 2020 38 | * Support Gzip-compressed files as input 39 | * Support '-w' option to check quality scores within 
a window rather than 1 bp 40 | * Optimize adaptor handling in paired-end data 41 | 42 | v1.1.1 Mar 2020 43 | * Optimize cout and cerr output streams 44 | * Suppress the progress report 45 | 46 | v1.1.0 Feb 2020 47 | * Support users to set the allowed mismatches during adapter-sequence comparison (-m option) 48 | * Fix a bug when read1 and read2 sequences are not of the same length 49 | * Optimize the mismatch calculation during adapter-sequence comparison for paired-end data 50 | -------------------------------------------------------------------------------- /license/GPLv3.txt: -------------------------------------------------------------------------------- 1 | GNU LESSER GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | 9 | This version of the GNU Lesser General Public License incorporates 10 | the terms and conditions of version 3 of the GNU General Public 11 | License, supplemented by the additional permissions listed below. 12 | 13 | 0. Additional Definitions. 14 | 15 | As used herein, "this License" refers to version 3 of the GNU Lesser 16 | General Public License, and the "GNU GPL" refers to version 3 of the GNU 17 | General Public License. 18 | 19 | "The Library" refers to a covered work governed by this License, 20 | other than an Application or a Combined Work as defined below. 21 | 22 | An "Application" is any work that makes use of an interface provided 23 | by the Library, but which is not otherwise based on the Library. 24 | Defining a subclass of a class defined by the Library is deemed a mode 25 | of using an interface provided by the Library. 26 | 27 | A "Combined Work" is a work produced by combining or linking an 28 | Application with the Library. 
The particular version of the Library 29 | with which the Combined Work was made is also called the "Linked 30 | Version". 31 | 32 | The "Minimal Corresponding Source" for a Combined Work means the 33 | Corresponding Source for the Combined Work, excluding any source code 34 | for portions of the Combined Work that, considered in isolation, are 35 | based on the Application, and not on the Linked Version. 36 | 37 | The "Corresponding Application Code" for a Combined Work means the 38 | object code and/or source code for the Application, including any data 39 | and utility programs needed for reproducing the Combined Work from the 40 | Application, but excluding the System Libraries of the Combined Work. 41 | 42 | 1. Exception to Section 3 of the GNU GPL. 43 | 44 | You may convey a covered work under sections 3 and 4 of this License 45 | without being bound by section 3 of the GNU GPL. 46 | 47 | 2. Conveying Modified Versions. 48 | 49 | If you modify a copy of the Library, and, in your modifications, a 50 | facility refers to a function or data to be supplied by an Application 51 | that uses the facility (other than as an argument passed when the 52 | facility is invoked), then you may convey a copy of the modified 53 | version: 54 | 55 | a) under this License, provided that you make a good faith effort to 56 | ensure that, in the event an Application does not supply the 57 | function or data, the facility still operates, and performs 58 | whatever part of its purpose remains meaningful, or 59 | 60 | b) under the GNU GPL, with none of the additional permissions of 61 | this License applicable to that copy. 62 | 63 | 3. Object Code Incorporating Material from Library Header Files. 64 | 65 | The object code form of an Application may incorporate material from 66 | a header file that is part of the Library. 
You may convey such object 67 | code under terms of your choice, provided that, if the incorporated 68 | material is not limited to numerical parameters, data structure 69 | layouts and accessors, or small macros, inline functions and templates 70 | (ten or fewer lines in length), you do both of the following: 71 | 72 | a) Give prominent notice with each copy of the object code that the 73 | Library is used in it and that the Library and its use are 74 | covered by this License. 75 | 76 | b) Accompany the object code with a copy of the GNU GPL and this license 77 | document. 78 | 79 | 4. Combined Works. 80 | 81 | You may convey a Combined Work under terms of your choice that, 82 | taken together, effectively do not restrict modification of the 83 | portions of the Library contained in the Combined Work and reverse 84 | engineering for debugging such modifications, if you also do each of 85 | the following: 86 | 87 | a) Give prominent notice with each copy of the Combined Work that 88 | the Library is used in it and that the Library and its use are 89 | covered by this License. 90 | 91 | b) Accompany the Combined Work with a copy of the GNU GPL and this license 92 | document. 93 | 94 | c) For a Combined Work that displays copyright notices during 95 | execution, include the copyright notice for the Library among 96 | these notices, as well as a reference directing the user to the 97 | copies of the GNU GPL and this license document. 98 | 99 | d) Do one of the following: 100 | 101 | 0) Convey the Minimal Corresponding Source under the terms of this 102 | License, and the Corresponding Application Code in a form 103 | suitable for, and under terms that permit, the user to 104 | recombine or relink the Application with a modified version of 105 | the Linked Version to produce a modified Combined Work, in the 106 | manner specified by section 6 of the GNU GPL for conveying 107 | Corresponding Source. 
108 | 109 | 1) Use a suitable shared library mechanism for linking with the 110 | Library. A suitable mechanism is one that (a) uses at run time 111 | a copy of the Library already present on the user's computer 112 | system, and (b) will operate properly with a modified version 113 | of the Library that is interface-compatible with the Linked 114 | Version. 115 | 116 | e) Provide Installation Information, but only if you would otherwise 117 | be required to provide such information under section 6 of the 118 | GNU GPL, and only to the extent that such information is 119 | necessary to install and execute a modified version of the 120 | Combined Work produced by recombining or relinking the 121 | Application with a modified version of the Linked Version. (If 122 | you use option 4d0, the Installation Information must accompany 123 | the Minimal Corresponding Source and Corresponding Application 124 | Code. If you use option 4d1, you must provide the Installation 125 | Information in the manner specified by section 6 of the GNU GPL 126 | for conveying Corresponding Source.) 127 | 128 | 5. Combined Libraries. 129 | 130 | You may place library facilities that are a work based on the 131 | Library side by side in a single library together with other library 132 | facilities that are not Applications and are not covered by this 133 | License, and convey such a combined library under terms of your 134 | choice, if you do both of the following: 135 | 136 | a) Accompany the combined library with a copy of the same work based 137 | on the Library, uncombined with any other library facilities, 138 | conveyed under the terms of this License. 139 | 140 | b) Give prominent notice with the combined library that part of it 141 | is a work based on the Library, and explaining where to find the 142 | accompanying uncombined form of the same work. 143 | 144 | 6. Revised Versions of the GNU Lesser General Public License. 
145 | 146 | The Free Software Foundation may publish revised and/or new versions 147 | of the GNU Lesser General Public License from time to time. Such new 148 | versions will be similar in spirit to the present version, but may 149 | differ in detail to address new problems or concerns. 150 | 151 | Each version is given a distinguishing version number. If the 152 | Library as you received it specifies that a certain numbered version 153 | of the GNU Lesser General Public License "or any later version" 154 | applies to it, you have the option of following the terms and 155 | conditions either of that published version or of any later version 156 | published by the Free Software Foundation. If the Library as you 157 | received it does not specify a version number of the GNU Lesser 158 | General Public License, you may choose any version of the GNU Lesser 159 | General Public License ever published by the Free Software Foundation. 160 | 161 | If the Library as you received it specifies that a proxy can decide 162 | whether future versions of the GNU Lesser General Public License shall 163 | apply, that proxy's public statement of acceptance of any version is 164 | permanent authorization for you to choose that version for the 165 | Library. -------------------------------------------------------------------------------- /license/Ktrim.license: -------------------------------------------------------------------------------- 1 | GPLv3.txt -------------------------------------------------------------------------------- /makefile: -------------------------------------------------------------------------------- 1 | bin/ktrim: src/ktrim.cpp src/common.h src/util.h src/param_handler.h src/pe_handler.h src/se_handler.h 2 | @echo Build Ktrim 3 | @cd src; g++ ktrim.cpp -march=native -std=c++11 -fopenmp -O3 -o ../bin/ktrim -lz; cd .. 
4 | 5 | install: bin/ktrim # requires root 6 | @echo Install Ktrim for all users 7 | @cp bin/ktrim /usr/bin/ 8 | 9 | clean: 10 | rm -f bin/ktrim 11 | 12 | -------------------------------------------------------------------------------- /src/common.h: -------------------------------------------------------------------------------- 1 | /* 2 | * common.h 3 | * 4 | * This header file records the constants used in Ktrim 5 | * 6 | * Author: Kun Sun (sunkun@szbl.ac.cn) 7 | * Date: Mar, 2020 8 | * This program is part of the Ktrim package 9 | * 10 | **/ 11 | 12 | /* 13 | * Sequencing model: 14 | * 15 | * *read1 --> 16 | * 5' adapter - sequence sequence sequence - 3' adapter 17 | * <-- read2* 18 | * 19 | * so read1 may contain 3' adapter, name it adapter_r1, 20 | * read2 may contain reversed and ACGT-paired 5' adapter, name it adapter_r2 21 | * 22 | **/ 23 | 24 | #ifndef _KTRIM_COMMON_ 25 | #define _KTRIM_COMMON_ 26 | 27 | #include 28 | #include 29 | #include 30 | using namespace std; 31 | 32 | const char * VERSION = "1.6.0 (Oct 2024)"; 33 | 34 | // 1.6.0, add '-R' option to output reads with adapters only; 35 | // add built-in adapters for CLIP-seq; 36 | // fix a bug in single-thread file-writing 37 | // change the waiting_for_writing check to 0.1 ms 38 | // change the default thread to 6 39 | // 1.5.1, fix a bug that causes exit if the FASTQ files are too small 40 | // 1.5.0, support loading file names from a file, and output to stdout for pipelines 41 | // 1.4.1, now we use 2 threads for reading and 2 threads for writing, ~10% faster 42 | // 1.4.0, now we use 2 threads for reading files in PE mode, ~33% faster 43 | // 1.3.1 fixed the bug in dimers when working on SE data processing using single-thread 44 | // 1.2.2 fixed the bug when "-o" is NOT present but the program does not quit 45 | // 1.2.1 fixed the bug in multi-file handling 46 | 47 | // structure of a READ 48 | const unsigned int MAX_READ_ID = 128; 49 | const unsigned int MAX_READ_CYCLE = 512; 50 | 51 | // 
maximum insert size to call a dimer 52 | const unsigned int DIMER_INSERT = 1; 53 | 54 | // time interval for querying for writing 55 | static int write_thread; 56 | //const static chrono::milliseconds waiting_time_for_writing(1); 57 | const static chrono::microseconds waiting_time_for_writing(100); 58 | 59 | typedef struct { 60 | char *id; 61 | char *seq; 62 | char *qual; 63 | unsigned int size; 64 | } CSEREAD; 65 | 66 | typedef struct { 67 | char *id1; 68 | char *seq1; 69 | char *qual1; 70 | unsigned int size; 71 | 72 | char *id2; 73 | char *seq2; 74 | char *qual2; 75 | unsigned int size2; 76 | } CPEREAD; 77 | 78 | typedef struct { 79 | unsigned int *dropped; 80 | unsigned int *real_adapter; 81 | unsigned int *tail_adapter; 82 | unsigned int *dimer; 83 | unsigned int *pass; 84 | } ktrim_stat; 85 | 86 | typedef struct { 87 | char ** buffer1; 88 | char ** buffer2; 89 | unsigned int *b1stored; 90 | unsigned int *b2stored; 91 | } writeBuffer; 92 | 93 | // built-in adapters 94 | const unsigned int MIN_ADAPTER_SIZE = 8; 95 | const unsigned int MAX_ADAPTER_SIZE = 64; 96 | const unsigned int ADAPTER_BUFFER_SIZE = 128; 97 | const unsigned int ADAPTER_INDEX_SIZE = 3; 98 | const unsigned int OFFSET_INDEX3 = 3; 99 | // illumina TruSeq kits adapters 100 | //const char * illumina_adapter_r1 = "AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"; 101 | //const char * illumina_adapter_r2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"; 102 | //const unsigned int illumina_adapter_len = 33; 103 | const char * illumina_adapter_r1 = "AGATCGGAAGAGC"; 104 | const char * illumina_adapter_r2 = illumina_adapter_r1; 105 | const unsigned int illumina_adapter_len = 13; 106 | const char * illumina_index1 = "AGA"; // use first 3 as index 107 | const char * illumina_index2 = illumina_index1; // use first 3 as index 108 | const char * illumina_index3 = "TCG"; // for single-end data 109 | // Nextera kits and AmpliSeq for Illumina panels 110 | const char * nextera_adapter_r1 = "CTGTCTCTTATACACATCT"; 111 | const char * 
nextera_adapter_r2 = nextera_adapter_r1; 112 | const unsigned int nextera_adapter_len = 19; 113 | const char * nextera_index1 = "CTG"; // use first 3 as index 114 | const char * nextera_index2 = nextera_index1; // use first 3 as index 115 | const char * nextera_index3 = "TCT"; // for single-end data 116 | // Nextera transposase adapters 117 | const char * transposase_adapter_r1 = "TCGTCGGCAGCGTC"; 118 | const char * transposase_adapter_r2 = "GTCTCGTGGGCTCG"; 119 | const unsigned int transposase_adapter_len = 14; 120 | const char * transposase_index1 = "TCG"; // use first 3 as index 121 | const char * transposase_index2 = "GTC"; // use first 3 as index 122 | const char * transposase_index3 = "TCG"; // for single-end data 123 | 124 | //PRO-seq/CLIP-seq following Mahat et al. Nat Protoc 2016; 11:1455–1476. DOI: 10.1038/nprot.2016.086 125 | //NOTE: we recommend using only read2 (template strand) and setting '-R' to output only the reads with adapters 126 | const char * clip_adapter_r1 = "TGGAATTCTCGGGTGCCAAGG"; 127 | const char * clip_adapter_r2 = "GATCGTCGGACTGTAGAACTCTGAAC"; 128 | const unsigned int clip_adapter_len = 21; 129 | const char * clip_index1 = "TGG"; // use first 3 as index 130 | const char * clip_index2 = "GAT"; // use first 3 as index 131 | const char * clip_index3 = "TGG"; // for single-end data 132 | 133 | // BGI adapters 134 | //const char * bgi_adapter_r1 = "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"; 135 | //const char * bgi_adapter_r2 = "AAGTCGGATCGTAGCCATGTCGTTCTGTGAGC"; 136 | //const unsigned int bgi_adapter_len = 32; 137 | const char * bgi_adapter_r1 = "AAGTCGGAGGCCAAGCGGTC"; 138 | const char * bgi_adapter_r2 = "AAGTCGGATCGTAGCCATGT"; 139 | const unsigned int bgi_adapter_len = 20; 140 | const char * bgi_index1 = "AAG"; // use first 3 as index 141 | const char * bgi_index2 = "AAG"; // use first 3 as index 142 | const char * bgi_index3 = "TCG"; // for single-end data 143 | 144 | // seed and error configurations 145 | const unsigned int impossible_seed = 
1000000; 146 | const unsigned int MAX_READ_LENGTH = 1024; 147 | const unsigned int MAX_SEED_NUM = 128; 148 | 149 | //configurations for parallelization, which is highly related to memory usage 150 | //but seems to have very minor effect on running time 151 | const int READS_PER_BATCH = 1 << 15; // process 32 K reads per batch (for parallelization) 152 | const int BUFFER_SIZE_PER_BATCH_READ = 1 << 25; // 32 MB buffer for each thread to store FASTQ 153 | const int MEM_SE_READSET = READS_PER_BATCH * (MAX_READ_ID+MAX_READ_CYCLE+MAX_READ_CYCLE); 154 | const int MEM_PE_READSET = READS_PER_BATCH * (MAX_READ_ID+MAX_READ_CYCLE+MAX_READ_CYCLE) * 2; 155 | 156 | // enlarge the buffer for single-thread run 157 | //const int READS_PER_BATCH_ST = READS_PER_BATCH << 1; // process 64 K reads per batch (for parallelization) 158 | //const int BUFFER_SIZE_PER_BATCH_READ_ST = BUFFER_SIZE_PER_BATCH_READ << 1; // 64 MB buffer for each thread to store FASTQ 159 | const int READS_PER_BATCH_ST = 1 << 15; // No. 
of reads per-batch for single-thread 160 | const int BUFFER_SIZE_PER_BATCH_READ_ST = 1 << 25; // buffer for single-thread to store FASTQ 161 | const int MEM_SE_READSET_ST = READS_PER_BATCH_ST * (MAX_READ_ID+MAX_READ_CYCLE+MAX_READ_CYCLE); 162 | const int MEM_PE_READSET_ST = READS_PER_BATCH_ST * (MAX_READ_ID+MAX_READ_CYCLE+MAX_READ_CYCLE) * 2; 163 | 164 | const char FILE_SEPARATOR = ','; 165 | 166 | // parameters related 167 | typedef struct { 168 | char *filelist; 169 | char *FASTQ1, *FASTQ2, *FASTQU; 170 | char *outpre; 171 | vector R1s, R2s; 172 | 173 | unsigned int thread; 174 | unsigned int min_length; 175 | unsigned int phred; 176 | unsigned int minqual; 177 | unsigned int quality; 178 | unsigned int window; 179 | 180 | const char *seqKit, *seqA, *seqB; 181 | const char *adapter_r1, *adapter_r2; 182 | unsigned int adapter_len; 183 | const char *adapter_index1, *adapter_index2, *adapter_index3; 184 | 185 | bool use_default_mismatch; 186 | float mismatch_rate; 187 | 188 | bool paired_end_data; 189 | bool write2stdout; 190 | bool outputReadWithAdaptorOnly; 191 | FILE *fout1; 192 | FILE *fout2; 193 | FILE *flog; 194 | } ktrim_param; 195 | 196 | const char * param_list = "1:2:U:o:t:k:s:p:q:w:a:b:m:f:chRv"; 197 | 198 | // definition of functions 199 | void usage(); 200 | void init_param( ktrim_param &kp ); 201 | int process_cmd_param( int argc, char * argv[], ktrim_param &kp ); 202 | void print_param( const ktrim_param &kp ); 203 | void extractFileNames( const char *str, vector & Rs ); 204 | void loadFQFileNames( ktrim_param &kp ); 205 | 206 | // C-style 207 | unsigned int load_batch_data_PE_C( FILE * fq1, FILE * fq2, CPEREAD *loadingReads, unsigned int num ); 208 | bool check_mismatch_dynamic_PE_C( const CPEREAD *read, unsigned int pos, const ktrim_param &kp ); 209 | int process_single_thread_PE_C( const ktrim_param &kp ); 210 | int process_multi_thread_PE_C( const ktrim_param &kp ); 211 | 212 | unsigned int load_batch_data_SE_C( FILE * fp, CSEREAD *loadingReads, 
unsigned int num ); 213 | bool check_mismatch_dynamic_SE_C( const char * p, unsigned int pos, const ktrim_param &kp ); 214 | int process_single_thread_SE_C( const ktrim_param &kp ); 215 | int process_multi_thread_SE_C( const ktrim_param &kp ); 216 | 217 | #endif 218 | 219 | -------------------------------------------------------------------------------- /src/ktrim.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Author: Kun Sun (sunkun@szbl.ac.cn) 3 | * Date: May 2023 4 | * Main program of Ktrim v1.5.0 5 | **/ 6 | 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include "common.h" 18 | #include "util.h" 19 | #include "param_handler.h" 20 | #include "pe_handler.h" 21 | #include "se_handler.h" 22 | 23 | using namespace std; 24 | 25 | int main( int argc, char *argv[] ) { 26 | // process the command line parameters 27 | static ktrim_param kp; 28 | init_param( kp ); 29 | int retValue = process_cmd_param( argc, argv, kp ); 30 | if( retValue == 100 ) // help or version 31 | return 0; 32 | else if( retValue != 0 ) 33 | return retValue; 34 | 35 | if( kp.paired_end_data ) { // paired-end data 36 | if( kp.thread == 1 ) 37 | retValue = process_single_thread_PE_C( kp ); 38 | else if( kp.thread == 2 ) 39 | retValue = process_two_thread_PE_C( kp ); 40 | else 41 | retValue = process_multi_thread_PE_C( kp ); 42 | } else { // single-end data 43 | if( kp.thread == 1 ) 44 | retValue = process_single_thread_SE_C( kp ); 45 | else 46 | retValue = process_multi_thread_SE_C( kp ); 47 | } 48 | 49 | close_files( kp ); 50 | return retValue; 51 | } 52 | 53 | -------------------------------------------------------------------------------- /src/param_handler.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Author: Kun Sun (sunkun@szbl.ac.cn) 3 | * Date: May 2023 4 | * This program is part of the Ktrim package 5 | **/ 6 | 7 
| #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include "common.h" 18 | 19 | using namespace std; 20 | 21 | //static char tmp_r1[ADAPTER_BUFFER_SIZE]; 22 | //static char tmp_r2[ADAPTER_BUFFER_SIZE]; 23 | static char tmp_index1[ADAPTER_INDEX_SIZE+1]; 24 | static char tmp_index2[ADAPTER_INDEX_SIZE+1]; 25 | static char tmp_index3[ADAPTER_INDEX_SIZE+1]; 26 | 27 | // default parameters 28 | void init_param( ktrim_param &kp ) { 29 | kp.FASTQ1 = NULL; 30 | kp.FASTQ2 = NULL; 31 | kp.FASTQU = NULL; 32 | kp.outpre = NULL; 33 | 34 | kp.thread = 6; 35 | kp.min_length = 36; 36 | kp.phred = 33; 37 | kp.minqual = 20; 38 | kp.quality = 53; 39 | kp.window = 5; 40 | 41 | kp.seqKit = NULL; 42 | kp.seqA = NULL; 43 | kp.seqB = NULL; 44 | 45 | kp.adapter_r1 = NULL; 46 | kp.adapter_r2 = NULL; 47 | kp.adapter_len= 0; 48 | kp.adapter_index1 = NULL; 49 | kp.adapter_index2 = NULL; 50 | kp.adapter_index3 = NULL; 51 | 52 | kp.write2stdout = false; 53 | kp.outputReadWithAdaptorOnly = false; 54 | kp.use_default_mismatch = true; 55 | kp.mismatch_rate = 0.125; 56 | } 57 | 58 | // process user-supplied parameters 59 | int process_cmd_param( int argc, char * argv[], ktrim_param &kp ) { 60 | const char * prg = argv[0]; 61 | int index; 62 | int ch; 63 | while( (ch = getopt(argc, argv, param_list) ) != -1 ) { 64 | switch( ch ) { 65 | case 'f': kp.filelist = optarg; break; 66 | case '1': kp.FASTQ1 = optarg; break; 67 | case '2': kp.FASTQ2 = optarg; break; 68 | case 'U': kp.FASTQU = optarg; break; 69 | case 'o': kp.outpre = optarg; break; 70 | case 't': kp.thread = atoi(optarg); break; 71 | case 'k': kp.seqKit = optarg; break; 72 | case 's': kp.min_length = atoi(optarg); break; 73 | case 'p': kp.phred = atoi(optarg); break; 74 | case 'q': kp.minqual = atoi(optarg); break; 75 | case 'w': kp.window = atoi(optarg); break; 76 | case 'a': kp.seqA = optarg; break; 77 | case 'b': kp.seqB = optarg; break; 78 | 79 | case 
'm': kp.use_default_mismatch = false; kp.mismatch_rate = atof(optarg); break; 80 | case 'c': kp.write2stdout = true; break; 81 | case 'R': kp.outputReadWithAdaptorOnly = true; break; 82 | 83 | case 'h': usage(); return 100; 84 | case 'v': cout << VERSION << '\n'; return 100; 85 | 86 | default: 87 | //cerr << "\033[1;31mError: invalid argument ('" << (char)optopt << "')!\033[0m\n"; 88 | usage(); 89 | return 2; 90 | } 91 | } 92 | 93 | argc -= optind; 94 | argv += optind; 95 | if( argc > 0 ) { 96 | cerr << "\033[1;31mError: unrecognized parameter ('" << argv[0] << "')!\033[0m\n"; 97 | usage(); 98 | return 2; 99 | } 100 | 101 | // check compulsory parameters 102 | if( kp.filelist == NULL ) { 103 | if( kp.FASTQU != NULL ) { // single-end 104 | if( kp.FASTQ1!=NULL || kp.FASTQ2!=NULL ) { 105 | cerr << "\033[1;31mError: '-U' is specified but you also set '-1'/'-2'!\033[0m\n"; 106 | usage(); 107 | return 2; 108 | } 109 | extractFileNames( kp.FASTQU, kp.R1s ); 110 | kp.paired_end_data = false; 111 | } else { 112 | if( kp.FASTQ1 == NULL ) { 113 | cerr << "\033[1;31mError: No input file specified (neither '-1' nor '-U' is set)!\033[0m\n"; 114 | usage(); 115 | return 2; 116 | } 117 | kp.FASTQU = kp.FASTQ1; 118 | extractFileNames( kp.FASTQU, kp.R1s ); 119 | 120 | if( kp.FASTQ2 != NULL ) { 121 | extractFileNames( kp.FASTQ2, kp.R2s ); 122 | kp.paired_end_data = true; 123 | } else { 124 | kp.paired_end_data = false; 125 | } 126 | } 127 | } else { // the -f option has a higher priority 128 | if( kp.FASTQU != NULL || kp.FASTQ1!=NULL || kp.FASTQ2!=NULL ) { 129 | cerr << "\033[1;31mError: '-f' is specified but you also set '-U'/'-1'/'-2'!\033[0m\n"; 130 | usage(); 131 | return 2; 132 | } 133 | loadFQFileNames( kp ); 134 | } 135 | 136 | if( kp.outpre == NULL ) { 137 | cerr << "\033[1;31mError: No output file specified ('-o' not set)!\033[0m\n"; 138 | usage(); 139 | return 3; 140 | } 141 | 142 | // check optional parameters 143 | if( kp.thread == 0 ) { 144 | // cerr << "Warning: thread 
is set to 0! I will use all threads (at most 8) instead.\n"; 145 | kp.thread = omp_get_max_threads(); 146 | if( kp.thread > 8 ) 147 | kp.thread = 8; 148 | } 149 | if( kp.thread > 8 ) { 150 | cerr << "Warning: we strongly discourage the usage of more than 8 threads.\n"; 151 | } 152 | if( kp.min_length < 10 ) { 153 | cerr << "\033[1;31mError: invalid min_length! Must be at least 10!\033[0m\n"; 154 | usage(); 155 | return 11; 156 | } 157 | if( kp.phred==0 || kp.minqual==0 || kp.window==0 ) { 158 | cerr << "\033[1;31mError: invalid phred/quality/window parameters!\033[0m\n"; 159 | usage(); 160 | return 12; 161 | } 162 | kp.quality = kp.phred + kp.minqual; 163 | 164 | // settings of adapters 165 | if( kp.seqKit != NULL ) { // use built-in adaptors 166 | if( strcmp(kp.seqKit, "illumina")==0 || strcmp(kp.seqKit, "ILLUMINA")==0 || strcmp(kp.seqKit, "Illumina")==0 ) { 167 | kp.adapter_r1 = illumina_adapter_r1; 168 | kp.adapter_r2 = illumina_adapter_r2; 169 | kp.adapter_len = illumina_adapter_len; 170 | kp.adapter_index1 = illumina_index1; 171 | kp.adapter_index2 = illumina_index2; 172 | kp.adapter_index3 = illumina_index3; 173 | } else if( strcmp(kp.seqKit, "Nextera") == 0 || strcmp(kp.seqKit, "nextera") == 0 ) { 174 | kp.adapter_r1 = nextera_adapter_r1; 175 | kp.adapter_r2 = nextera_adapter_r2; 176 | kp.adapter_len = nextera_adapter_len; 177 | kp.adapter_index1 = nextera_index1; 178 | kp.adapter_index2 = nextera_index2; 179 | kp.adapter_index3 = nextera_index3; 180 | } else if( strcmp(kp.seqKit, "Transposase") == 0 || strcmp(kp.seqKit, "transposase") == 0 ) { 181 | kp.adapter_r1 = transposase_adapter_r1; 182 | kp.adapter_r2 = transposase_adapter_r2; 183 | kp.adapter_len = transposase_adapter_len; 184 | kp.adapter_index1 = transposase_index1; 185 | kp.adapter_index2 = transposase_index2; 186 | kp.adapter_index3 = transposase_index3; 187 | } else if( strcmp(kp.seqKit, "CLIP") == 0 || strcmp(kp.seqKit, "clip") == 0 ) { 188 | kp.adapter_r1 = clip_adapter_r1; 189 | 
kp.adapter_r2 = clip_adapter_r2; 190 | kp.adapter_len = clip_adapter_len; 191 | kp.adapter_index1 = clip_index1; 192 | kp.adapter_index2 = clip_index2; 193 | kp.adapter_index3 = clip_index3; 194 | } else if( strcmp(kp.seqKit, "BGI") == 0 || strcmp(kp.seqKit, "bgi") == 0 ) { 195 | kp.adapter_r1 = bgi_adapter_r1; 196 | kp.adapter_r2 = bgi_adapter_r2; 197 | kp.adapter_len = bgi_adapter_len; 198 | kp.adapter_index1 = bgi_index1; 199 | kp.adapter_index2 = bgi_index2; 200 | kp.adapter_index3 = bgi_index3; 201 | } else { 202 | cerr << "\033[1;31mError: unacceptable sequencing kit types!\033[0m\n"; 203 | usage(); 204 | return 13; 205 | } 206 | if( kp.seqA!=NULL || kp.seqB!=NULL ) 207 | cerr << "\033[1;32mWarning: '-k' is set! '-a'/'-b' will be ignored!\033[0m\n"; 208 | } else { 209 | if( kp.seqA==NULL ) { // no adapters set, use default; note that if -b is set BUT -a is not, report an error 210 | if( kp.seqB != NULL ) { 211 | cerr << "\033[1;31mError: '-b' is set while '-a' is not set!\033[0m\n"; 212 | usage(); 213 | return 14; 214 | } 215 | kp.adapter_r1 = illumina_adapter_r1; 216 | kp.adapter_r2 = illumina_adapter_r2; 217 | kp.adapter_len = illumina_adapter_len; 218 | kp.adapter_index1 = illumina_index1; 219 | kp.adapter_index2 = illumina_index2; 220 | kp.adapter_index3 = illumina_index3; 221 | } else { 222 | if( kp.seqB == NULL ) { // seqB not set, then set it to seqA 223 | kp.seqB = kp.seqA; 224 | if( kp.FASTQ2 != NULL ) // paired end 225 | cerr << "\033[1;32mWarning: '-b' is not set for Paired-End data! 
I will use '-a' as adapter for read 2!\033[0m\n"; 226 | } 227 | register unsigned int i = 0; 228 | while( true ) { 229 | if( kp.seqA[i]=='\0' || kp.seqB[i]=='\0' ) 230 | break; 231 | 232 | ++ i; 233 | } 234 | if( i < MIN_ADAPTER_SIZE || i > MAX_ADAPTER_SIZE ) { 235 | cerr << "\033[1;31mError: adapter size must be between " << MIN_ADAPTER_SIZE << " to " 236 | << MAX_ADAPTER_SIZE << " bp!\033[0m\n"; 237 | return 20; 238 | } 239 | kp.adapter_len = i; 240 | 241 | for( i=0; i!=ADAPTER_INDEX_SIZE; ++i ) { 242 | tmp_index1[i] = kp.seqA[i]; 243 | tmp_index2[i] = kp.seqB[i]; 244 | tmp_index3[i] = kp.seqA[i+ADAPTER_INDEX_SIZE]; 245 | } 246 | tmp_index1[ADAPTER_INDEX_SIZE] = '\0'; 247 | tmp_index2[ADAPTER_INDEX_SIZE] = '\0'; 248 | tmp_index3[ADAPTER_INDEX_SIZE] = '\0'; 249 | 250 | kp.adapter_r1 = kp.seqA; 251 | kp.adapter_r2 = kp.seqB; 252 | kp.adapter_index1 = tmp_index1; 253 | kp.adapter_index2 = tmp_index2; 254 | kp.adapter_index3 = tmp_index3; 255 | } 256 | } 257 | 258 | // prepare output files 259 | string fileName = kp.outpre; 260 | if( kp.write2stdout ) { 261 | kp.fout1 = stdout; 262 | kp.fout2 = NULL; 263 | } else { 264 | if( kp.paired_end_data ) { 265 | fileName += ".read1.fq"; 266 | kp.fout1 = fopen( fileName.c_str(), "wt" ); 267 | fileName[ fileName.size()-4 ] = '2'; // read1 -> read2 268 | kp.fout2 = fopen( fileName.c_str(), "wt" ); 269 | if( kp.fout1 == NULL || kp.fout2 == NULL ) { 270 | cerr << "\033[1;31mError: write file failed!\033[0m\n"; 271 | if( kp.fout1 != NULL )fclose( kp.fout1 ); 272 | if( kp.fout2 != NULL )fclose( kp.fout2 ); 273 | exit(103); 274 | } 275 | } else { // single-end 276 | fileName += ".read1.fq"; 277 | kp.fout1 = fopen( fileName.c_str(), "wt" ); 278 | if( kp.fout1 == NULL ) { 279 | cerr << "\033[1;31mError: write file failed!\033[0m\n"; 280 | exit(103); 281 | } 282 | } 283 | } 284 | 285 | fileName = kp.outpre; 286 | fileName += ".trim.log"; 287 | kp.flog = fopen( fileName.c_str(), "wt" ); 288 | if( kp.flog == NULL ) { 289 | cerr << 
"\033[1;34mError: cannot write log file!\033[0m\n"; 290 | if( kp.fout1 != NULL )fclose( kp.fout1 ); 291 | if( kp.fout2 != NULL )fclose( kp.fout2 ); 292 | exit(105); 293 | } 294 | 295 | return 0; 296 | } 297 | 298 | void close_files( const ktrim_param &kp ) { 299 | if( ! kp.write2stdout ) { 300 | fclose( kp.fout1 ); 301 | 302 | if( kp.paired_end_data ) 303 | fclose( kp.fout2 ); 304 | } 305 | fclose( kp.flog ); 306 | } 307 | 308 | void usage() { 309 | cerr << "\n\033[1;34mUsage: Ktrim [options] -f {-1/-U [-2 Read2.fq ]} -o \033[0m\n\n" 310 | << "Author : Kun Sun (sunkun@szbl.ac.cn)\n" 311 | << "Version: " << VERSION << "\n\n" 312 | 313 | << "Ktrim is designed to perform adapter- and quality-trimming of FASTQ files.\n\n" 314 | 315 | << "Compulsory parameters:\n\n" 316 | 317 | << " -f fq.list Specify the path to a file containing path to read 1/2 fastq files\n\n" 318 | << "OR specify the fastq files directly:\n\n" 319 | << " -1/-U R1.fq[.gz] Specify the path to the files containing read 1\n" 320 | << " If your data is Paired-end, use '-1' and specify read 2 files using '-2' option\n" 321 | << " Note that if '-U' is used, specification of '-2' is invalid\n" 322 | << " If you have multiple files for your sample, use '" 323 | << FILE_SEPARATOR << "' to separate them\n" 324 | << " Gzip-compressed files (with .gz suffix) are supported.\n\n" 325 | 326 | << " -o out.prefix Specify the prefix of the output files\n" 327 | << " Outputs include trimmed reads in FASTQ format and statistics\n\n" 328 | 329 | << "Optional parameters:\n\n" 330 | 331 | << " -2 R2.fq[.gz] Specify the path to the file containing read 2\n" 332 | << " Use this parameter if your data is generated in paired-end mode\n" 333 | << " If you have multiple files for your sample, use '" 334 | << FILE_SEPARATOR << "' to separate them\n" 335 | << " and make sure that all the files are well paired in '-1' and '-2' options\n\n" 336 | 337 | << " -c Write the trimming results to stdout (default: not set)\n" 338 | << " 
Note that the interleaved fastq format will be used for paired-end data.\n" 339 | << " -R Only output reads with adapter (default: not set)\n" 340 | << " -s size Minimum read size to be kept after trimming (default: 36; must be larger than 10)\n\n" 341 | 342 | << " -t threads Specify how many threads should be used (default: 6)\n" 343 | << " You can set '-t' to 0 to use all threads (automatically detected)\n\n" 344 | 345 | << " -k kit Specify the sequencing kit to use built-in adapters\n" 346 | << " Currently supports 'Illumina' (default), 'Nextera', 'CLIP', and 'BGI'\n" 347 | << " -a sequence Specify the adapter sequence in read 1\n" 348 | << " -b sequence Specify the adapter sequence in read 2\n" 349 | << " If '-a' is set while '-b' is not, I will assume that read 1 and 2 use same adapter\n" 350 | << " Note that '-k' option has a higher priority (when set, '-a'/'-b' will be ignored)\n\n" 351 | 352 | << " -p phred-base Specify the baseline of the phred score (default: 33)\n" 353 | << " -q score The minimum quality score to keep the cycle (default: 20)\n" 354 | << " Note that 20 means 1% error rate, 30 means 0.1% error rate in Phred\n\n" 355 | << " -w window Set the window size for quality check (default: 5)\n" 356 | << " Ktrim stops when all bases in a consecutive window pass the quality threshold\n\n" 357 | << " -m proportion Set the proportion of mismatches allowed during index and sequence comparison\n" 358 | << " Default: 0.125 (i.e., 1/8 of compared base pairs)\n" 359 | << " Please use this option with caution as it affects the accuracy a lot\n\n" 360 | 361 | << " -h Show this help information and quit (exit code=0)\n" 362 | << " -v Show the software version and quit (exit code=0)\n\n" 363 | 364 | << "Please refer to README.md file for more information (e.g., setting adapters).\n" 365 | << "Citation: Sun K. Bioinformatics 2020 Jun 1; 36(11):3561-3562. 
(PMID: 32159761)\n\n" 366 | << "\033[1;34mKtrim: extra-fast and accurate adapter- and quality-trimmer.\033[0m\n\n"; 367 | } 368 | 369 | -------------------------------------------------------------------------------- /src/pe_handler.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Author: Kun Sun (sunkun@szbl.ac.cn) 3 | * Date: May 2023 4 | * This program is part of the Ktrim package 5 | **/ 6 | 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include "common.h" 16 | using namespace std; 17 | 18 | void inline CPEREAD_resize( CPEREAD * read, int n ) { 19 | read->size = n; 20 | read->seq1[ n ] = 0; 21 | read->qual1[ n ] = 0; 22 | read->seq2[ n ] = 0; 23 | read->qual2[ n ] = 0; 24 | } 25 | 26 | bool inline is_revcomp( const char a, const char b ) { 27 | //TODO: consider how to deal with N, call it positive or negative??? 28 | switch( a ) { 29 | case 'A': return b=='T'; 30 | case 'C': return b=='G'; 31 | case 'G': return b=='C'; 32 | case 'T': return b=='A'; 33 | default : return false; 34 | } 35 | return true; 36 | } 37 | 38 | void init_kstat_wrbuffer( ktrim_stat &kstat, writeBuffer &writebuffer, unsigned int nthread, bool write2stdout ) { 39 | kstat.dropped = new unsigned int [ nthread ]; 40 | kstat.real_adapter = new unsigned int [ nthread ]; 41 | kstat.tail_adapter = new unsigned int [ nthread ]; 42 | kstat.dimer = new unsigned int [ nthread ]; 43 | kstat.pass = new unsigned int [ nthread ]; 44 | 45 | // buffer for storing the modified reads per thread 46 | writebuffer.buffer1 = new char * [ nthread ]; 47 | writebuffer.buffer2 = new char * [ nthread ]; 48 | writebuffer.b1stored = new unsigned int [ nthread ]; 49 | writebuffer.b2stored = new unsigned int [ nthread ]; 50 | 51 | for(unsigned int i=0; i!=nthread; ++i) { 52 | if( write2stdout ) { // only use buffer 1 53 | writebuffer.buffer1[i] = new char[ BUFFER_SIZE_PER_BATCH_READ << 1 ]; 54 | 
writebuffer.b1stored[i] = 0; 55 | } else { 56 | writebuffer.buffer1[i] = new char[ BUFFER_SIZE_PER_BATCH_READ ]; 57 | writebuffer.buffer2[i] = new char[ BUFFER_SIZE_PER_BATCH_READ ]; 58 | writebuffer.b1stored[i] = 0; 59 | writebuffer.b2stored[i] = 0; 60 | } 61 | 62 | kstat.dropped[i] = 0; 63 | kstat.real_adapter[i] = 0; 64 | kstat.tail_adapter[i] = 0; 65 | kstat.dimer[i] = 0; 66 | kstat.pass[i] = 0; 67 | } 68 | } 69 | 70 | void find_seed_pe( vector &seed, const CPEREAD *read, const ktrim_param & kp ) { 71 | seed.clear(); 72 | register const char *poffset = read->seq1; 73 | register const char *indexloc = poffset; 74 | while( true ) { 75 | indexloc = strstr( indexloc, kp.adapter_index1 ); 76 | if( indexloc == NULL ) 77 | break; 78 | seed.push_back( indexloc - poffset ); 79 | ++ indexloc; 80 | } 81 | poffset = read->seq2; 82 | indexloc = poffset; 83 | while( true ) { 84 | indexloc = strstr( indexloc, kp.adapter_index2 ); 85 | if( indexloc == NULL ) 86 | break; 87 | seed.push_back( indexloc - poffset ); 88 | ++ indexloc; 89 | } 90 | sort( seed.begin(), seed.end() ); 91 | } 92 | 93 | // this function is slower than C++ version 94 | void workingThread_PE_C( unsigned int tn, unsigned int start, unsigned int end, CPEREAD *workingReads, 95 | ktrim_stat *kstat, writeBuffer *writebuffer, const ktrim_param &kp ) { 96 | 97 | writebuffer->b1stored[tn] = 0; 98 | writebuffer->b2stored[tn] = 0; 99 | 100 | // vector seed; 101 | // vector :: iterator it, end_of_seed; 102 | register int *seed = new int[ MAX_SEED_NUM ]; 103 | register int hit_seed; 104 | register int *it, *end_of_seed; 105 | 106 | register CPEREAD *wkr = workingReads + start; 107 | for( unsigned int iii=end-start; iii; --iii, ++wkr ) { 108 | // read size handling 109 | if( wkr->size > wkr->size2 ) 110 | wkr->size = wkr->size2; 111 | 112 | // remove the tail '\n' 113 | // in fact, it is not essential to do this step, because '\n' has a very low ascii value (10) 114 | // therefore it will be quality-trimmed 115 | 
CPEREAD_resize( wkr, wkr->size - 1 ); 116 | 117 | // quality control 118 | register int i = get_quality_trim_cycle_pe( wkr, kp ); 119 | if( i == 0 ) { // not long enough 120 | ++ kstat->dropped[ tn ]; 121 | continue; 122 | } 123 | if( i != wkr->size ) { // quality-trim occurs 124 | CPEREAD_resize( wkr, i ); 125 | } 126 | 127 | // looking for seed target, 1 mismatch is allowed for these 2 seeds 128 | // which means seq1 and seq2 at least should take 1 perfect seed match 129 | //find_seed_pe( seed, wkr, kp ); 130 | //TODO: I donot need to find all the seeds, I can find-check, then next 131 | // seed.clear(); 132 | hit_seed = 0; 133 | register const char *poffset = wkr->seq1; 134 | register const char *indexloc = poffset; 135 | while( true ) { 136 | indexloc = strstr( indexloc, kp.adapter_index1 ); 137 | if( indexloc == NULL ) 138 | break; 139 | //seed.push_back( indexloc - poffset ); 140 | seed[ hit_seed++ ] = indexloc - poffset; 141 | ++ indexloc; 142 | } 143 | poffset = wkr->seq2; 144 | indexloc = poffset; 145 | while( true ) { 146 | indexloc = strstr( indexloc, kp.adapter_index2 ); 147 | if( indexloc == NULL ) 148 | break; 149 | //seed.push_back( indexloc - poffset ); 150 | seed[ hit_seed++ ] = indexloc - poffset; 151 | ++ indexloc; 152 | } 153 | //sort( seed.begin(), seed.end() ); 154 | end_of_seed = seed + hit_seed; 155 | if( hit_seed != 0 ) 156 | sort( seed, seed + hit_seed ); 157 | 158 | register bool no_valid_adapter = true; 159 | register unsigned int last_seed = impossible_seed; // a position which cannot be a seed 160 | //end_of_seed = seed.end(); 161 | //for( it=seed.begin(); it!=end_of_seed; ++it ) { 162 | for( it=seed; it!=end_of_seed; ++it ) { 163 | if( *it != last_seed ) { 164 | // as there maybe the same value in seq1_seed and seq2_seed, 165 | // use this to avoid re-calculate that pos 166 | if( check_mismatch_dynamic_PE_C( wkr, *it, kp ) ) 167 | break; 168 | 169 | last_seed = *it; 170 | } 171 | } 172 | if( it != end_of_seed ) { // adapter found 173 
| no_valid_adapter = false; 174 | ++ kstat->real_adapter[tn]; 175 | if( *it >= kp.min_length ) { 176 | CPEREAD_resize( wkr, *it ); 177 | } else { // drop this read as its length is not enough 178 | ++ kstat->dropped[tn]; 179 | 180 | if( *it <= DIMER_INSERT ) 181 | ++ kstat->dimer[tn]; 182 | continue; 183 | } 184 | } else { // seed not found, now check the tail 2 or 1, if perfect match, drop these 2 185 | // note: I will NOT consider tail hits as adapter_found, as '-w' should be used when insertDNA is very short 186 | i = wkr->size - 2; 187 | register const char *p = wkr->seq1; 188 | register const char *q = wkr->seq2; 189 | //Note: 1 mismatch is allowed in tail-checking 190 | register unsigned int mismatches = 0; 191 | if( p[i] != kp.adapter_r1[0] ) mismatches ++; 192 | if( p[i+1] != kp.adapter_r1[1] ) mismatches ++; 193 | if( q[i] != kp.adapter_r2[0] ) mismatches ++; 194 | if( q[i+1] != kp.adapter_r2[1] ) mismatches ++; 195 | if( ! is_revcomp(p[5], q[i-6]) ) mismatches ++; 196 | if( ! is_revcomp(q[5], p[i-6]) ) mismatches ++; 197 | if( mismatches <= 1 ) { // tail is good 198 | ++ kstat->tail_adapter[tn]; 199 | if( i < kp.min_length ) { 200 | ++ kstat->dropped[tn]; 201 | continue; 202 | } 203 | CPEREAD_resize( wkr, i ); 204 | } else { // tail 2 is not good, check tail 1 205 | ++ i; 206 | mismatches = 0; 207 | if( p[i] != kp.adapter_r1[0] ) mismatches ++; 208 | if( q[i] != kp.adapter_r2[0] ) mismatches ++; 209 | if( ! is_revcomp(p[5], q[i-6]) ) mismatches ++; 210 | if( ! is_revcomp(q[5], p[i-6]) ) mismatches ++; 211 | if( ! is_revcomp(p[6], q[i-7]) ) mismatches ++; 212 | if( ! 
is_revcomp(q[6], p[i-7]) ) mismatches ++; 213 | 214 | if( mismatches <= 1 ) { 215 | ++ kstat->tail_adapter[tn]; 216 | if( i < kp.min_length ) { 217 | ++ kstat->dropped[tn]; 218 | continue; 219 | } 220 | CPEREAD_resize( wkr, i ); 221 | } 222 | } 223 | } 224 | 225 | if( kp.outputReadWithAdaptorOnly && no_valid_adapter ) 226 | continue; 227 | 228 | ++ kstat->pass[tn]; 229 | if( kp.write2stdout ) { 230 | writebuffer->b1stored[tn] += sprintf( writebuffer->buffer1[tn]+writebuffer->b1stored[tn], 231 | "%s%s\n+\n%s\n%s%s\n+\n%s\n", 232 | wkr->id1, wkr->seq1, wkr->qual1, 233 | wkr->id2, wkr->seq2, wkr->qual2); 234 | } else { 235 | writebuffer->b1stored[tn] += sprintf( writebuffer->buffer1[tn]+writebuffer->b1stored[tn], 236 | "%s%s\n+\n%s\n", wkr->id1, wkr->seq1, wkr->qual1 ); 237 | writebuffer->b2stored[tn] += sprintf( writebuffer->buffer2[tn]+writebuffer->b2stored[tn], 238 | "%s%s\n+\n%s\n", wkr->id2, wkr->seq2, wkr->qual2 ); 239 | } 240 | } 241 | 242 | //wait for my turn to output 243 | while( true ) { 244 | if( tn == write_thread ) { 245 | if( kp.write2stdout ) { 246 | fwrite( writebuffer->buffer1[tn], sizeof(char), writebuffer->b1stored[tn], stdout ); 247 | } else { 248 | fwrite( writebuffer->buffer1[tn], sizeof(char), writebuffer->b1stored[tn], kp.fout1 ); 249 | fwrite( writebuffer->buffer2[tn], sizeof(char), writebuffer->b2stored[tn], kp.fout2 ); 250 | } 251 | ++ write_thread; 252 | break; 253 | } else { 254 | this_thread::sleep_for( waiting_time_for_writing ); 255 | } 256 | } 257 | 258 | delete [] seed; 259 | } 260 | 261 | int process_multi_thread_PE_C( const ktrim_param &kp ) { 262 | // IO speed-up 263 | ios::sync_with_stdio( false ); 264 | // cin.tie( NULL ); 265 | 266 | // in this version, two data containers are used and auto-swapped for working and loading data 267 | CPEREAD *readA, *readB; 268 | register char *readA_data, *readB_data; 269 | CPEREAD *workingReads, *loadingReads, *swapReads; 270 | ktrim_stat kstat; 271 | writeBuffer writebuffer; 272 | // vector 
R1s, kp.R2s; 273 | unsigned int totalFiles; 274 | 275 | // now I use 2 threads for init 276 | // prepare memory and file 277 | omp_set_num_threads( 2 ); 278 | #pragma omp parallel 279 | { 280 | unsigned int tn = omp_get_thread_num(); 281 | 282 | if( tn == 0 ) { 283 | readA = new CPEREAD[ READS_PER_BATCH ]; 284 | readB = new CPEREAD[ READS_PER_BATCH ]; 285 | readA_data = new char[ MEM_PE_READSET ]; 286 | readB_data = new char[ MEM_PE_READSET ]; 287 | 288 | for( register int i=0, j=0; i!=READS_PER_BATCH; ++i ) { 289 | readA[i].id1 = readA_data + j; 290 | readB[i].id1 = readB_data + j; 291 | j += MAX_READ_ID; 292 | readA[i].seq1 = readA_data + j; 293 | readB[i].seq1 = readB_data + j; 294 | j += MAX_READ_CYCLE; 295 | readA[i].qual1 = readA_data + j; 296 | readB[i].qual1 = readB_data + j; 297 | j += MAX_READ_CYCLE; 298 | 299 | readA[i].id2 = readA_data + j; 300 | readB[i].id2 = readB_data + j; 301 | j += MAX_READ_ID; 302 | readA[i].seq2 = readA_data + j; 303 | readB[i].seq2 = readB_data + j; 304 | j += MAX_READ_CYCLE; 305 | readA[i].qual2 = readA_data + j; 306 | readB[i].qual2 = readB_data + j; 307 | j += MAX_READ_CYCLE; 308 | } 309 | } else { 310 | init_kstat_wrbuffer( kstat, writebuffer, kp.thread, kp.write2stdout ); 311 | 312 | // deal with multiple input files 313 | totalFiles = kp.R1s.size(); 314 | //cout << "\033[1;34mINFO: " << totalFiles << " paired fastq files will be loaded.\033[0m\n"; 315 | } 316 | } 317 | 318 | // start analysis 319 | register unsigned int line = 0; 320 | for( unsigned int fileCnt=0; fileCnt!=totalFiles; ++ fileCnt ) { 321 | bool file_is_gz = false; 322 | FILE *fq1, *fq2; 323 | gzFile gfp1, gfp2; 324 | register unsigned int i = kp.R1s[fileCnt].size() - 3; 325 | register const char * p = kp.R1s[fileCnt].c_str(); 326 | register const char * q = kp.R2s[fileCnt].c_str(); 327 | if( p[i]=='.' 
&& p[i+1]=='g' && p[i+2]=='z' ) { 328 | file_is_gz = true; 329 | gfp1 = gzopen( p, "r" ); 330 | gfp2 = gzopen( q, "r" ); 331 | if( gfp1==NULL || gfp2==NULL ) { 332 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n"; 333 | return 104; 334 | } 335 | } else { 336 | fq1 = fopen( p, "rt" ); 337 | fq2 = fopen( q, "rt" ); 338 | if( fq1==NULL || fq2==NULL ) { 339 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n"; 340 | return 104; 341 | } 342 | } 343 | // fprintf( stderr, "Loading files:\nRead1: %s\nRead2: %s\n", p, q ); 344 | 345 | // initialization 346 | // get first batch of fastq reads 347 | unsigned int loaded; 348 | bool metEOF; 349 | omp_set_num_threads( 2 ); 350 | #pragma omp parallel 351 | { 352 | unsigned int tn = omp_get_thread_num(); 353 | if( tn == 0 ) { 354 | if( file_is_gz ) { 355 | loaded = load_batch_data_PE_GZ( gfp1, readA, READS_PER_BATCH, true ); 356 | metEOF = gzeof( gfp1 ); 357 | } else { 358 | loaded = load_batch_data_PE_C( fq1, readA, READS_PER_BATCH, true ); 359 | metEOF = feof( fq1 ); 360 | } 361 | } else { 362 | if( file_is_gz ) { 363 | loaded = load_batch_data_PE_GZ( gfp2, readA, READS_PER_BATCH, false ); 364 | } else { 365 | loaded = load_batch_data_PE_C( fq2, readA, READS_PER_BATCH, false ); 366 | } 367 | } 368 | } 369 | if( loaded == 0 ) break; 370 | 371 | loadingReads = readB; 372 | workingReads = readA; 373 | bool nextBatch = true; 374 | unsigned int threadLoaded=0, threadLoaded2=0; 375 | unsigned int NumWkThreads=0; 376 | while( nextBatch ) { 377 | // cerr << "Working on " << loaded << " reads\n"; 378 | // start parallalization 379 | write_thread = 0; 380 | omp_set_num_threads( kp.thread ); 381 | #pragma omp parallel 382 | { 383 | unsigned int tn = omp_get_thread_num(); 384 | // if EOF is met, then all threads are used for analysis 385 | // otherwise 1 thread will do data loading 386 | if( metEOF ) { 387 | NumWkThreads = kp.thread; 388 | unsigned int start = loaded * tn / kp.thread; 389 | unsigned int end = loaded * 
(tn+1) / kp.thread; 390 | workingThread_PE_C( tn, start, end, workingReads, &kstat, &writebuffer, kp ); 391 | nextBatch = false; 392 | } else { // use 2 thread to load files, others for trimming 393 | NumWkThreads = kp.thread - 2; 394 | if( tn == kp.thread - 1 ) { 395 | if( file_is_gz ) { 396 | threadLoaded = load_batch_data_PE_GZ( gfp1, loadingReads, READS_PER_BATCH, true ); 397 | metEOF = gzeof( gfp1 ); 398 | } else { 399 | threadLoaded = load_batch_data_PE_C( fq1, loadingReads, READS_PER_BATCH, true ); 400 | metEOF = feof( fq1 ); 401 | } 402 | // cerr << "R1 loaded " << threadLoaded << ", pos=" << gztell(gfp2) << ", EOF=" << gzeof( gfp1 ) << "\n"; 403 | nextBatch = (threadLoaded!=0); 404 | //cerr << "Loading thread: " << threadLoaded << ", " << metEOF << ", " << nextBatch << '\n'; 405 | } else if ( tn == kp.thread - 2 ) { 406 | if( file_is_gz ) { 407 | threadLoaded2 = load_batch_data_PE_GZ( gfp2, loadingReads, READS_PER_BATCH, false ); 408 | } else { 409 | threadLoaded2 = load_batch_data_PE_C( fq2, loadingReads, READS_PER_BATCH, false ); 410 | } 411 | // cerr << "R2 loaded " << threadLoaded2 << ", pos=" << gztell(gfp2) << ", EOF=" << gzeof( gfp2 )<< "\n"; 412 | } else { 413 | unsigned int start = loaded * tn / NumWkThreads; 414 | unsigned int end = loaded * (tn+1) / NumWkThreads; 415 | workingThread_PE_C( tn, start, end, workingReads, &kstat, &writebuffer, kp ); 416 | } 417 | } 418 | } // parallel body 419 | // write output and update fastq statistics 420 | // I cannot write fastq in each thread because it may cause unpaired reads 421 | /*if( ! 
kp.write2stdout ) { 422 | omp_set_num_threads( 2 ); 423 | #pragma omp parallel 424 | { 425 | unsigned int tn = omp_get_thread_num(); 426 | if( tn == 0 ) { 427 | for( unsigned int ii=0; ii!=NumWkThreads; ++ii ) { 428 | fwrite( writebuffer.buffer1[ii], sizeof(char), writebuffer.b1stored[ii], kp.fout1 ); 429 | } 430 | } else { 431 | for( unsigned int ii=0; ii!=NumWkThreads; ++ii ) { 432 | fwrite( writebuffer.buffer2[ii], sizeof(char), writebuffer.b2stored[ii], kp.fout2 ); 433 | } 434 | } 435 | } 436 | }*/ 437 | // check whether the read-loading is correct 438 | if( threadLoaded != threadLoaded2 ) { 439 | cerr << "ERROR: unequal read number for read1 (" << threadLoaded << ") and read2 (" << threadLoaded2 << ")!\n"; 440 | return 1; 441 | } 442 | 443 | line += loaded; 444 | loaded = threadLoaded; 445 | // swap workingReads and loadingReads for next loop 446 | swapReads = loadingReads; 447 | loadingReads = workingReads; 448 | workingReads = swapReads; 449 | 450 | // cerr << '\r' << line << " reads loaded\n"; 451 | // cerr << line << " reads loaded, metEOF=" << metEOF << ", next=" << nextBatch << "\n"; 452 | }//process 1 file 453 | 454 | if( file_is_gz ) { 455 | gzclose( gfp1 ); 456 | gzclose( gfp2 ); 457 | } else { 458 | fclose( fq1 ); 459 | fclose( fq2 ); 460 | } 461 | // cerr << '\n'; 462 | } // all input files are loaded 463 | //cerr << "\rDone: " << line << " lines processed.\n"; 464 | 465 | // write trim.log 466 | int dropped_all=0, real_all=0, tail_all=0, dimer_all=0, pass_all=0; 467 | for( unsigned int i=0; i!=kp.thread; ++i ) { 468 | dropped_all += kstat.dropped[i]; 469 | real_all += kstat.real_adapter[i]; 470 | tail_all += kstat.tail_adapter[i]; 471 | dimer_all += kstat.dimer[i]; 472 | pass_all += kstat.pass[i]; 473 | } 474 | fprintf( kp.flog, "Total\t%u\nDropped\t%u\nAadaptor\t%u\nTailHit\t%u\nDimer\t%u\nPass\t%u\n", 475 | line, dropped_all, real_all, tail_all, dimer_all, pass_all ); 476 | 477 | //free memory 478 | for(unsigned int i=0; i!=kp.thread; ++i) { 479 
| delete [] writebuffer.buffer1[i]; 480 | if( ! kp.write2stdout ) 481 | delete [] writebuffer.buffer2[i]; 482 | } 483 | delete [] writebuffer.buffer1; 484 | delete [] writebuffer.buffer2; 485 | 486 | delete [] kstat.dropped; 487 | delete [] kstat.real_adapter; 488 | delete [] kstat.tail_adapter; 489 | delete [] kstat.dimer; 490 | delete [] kstat.pass; 491 | 492 | delete [] readA; 493 | delete [] readB; 494 | delete [] readA_data; 495 | delete [] readB_data; 496 | 497 | return 0; 498 | } 499 | 500 | int process_single_thread_PE_C( const ktrim_param &kp ) { 501 | // fprintf( stderr, "process_single_thread_PE_C\n" ); 502 | // IO speed-up 503 | ios::sync_with_stdio( false ); 504 | // cin.tie( NULL ); 505 | 506 | CPEREAD *read = new CPEREAD[ READS_PER_BATCH_ST ]; 507 | register char *read_data = new char[ MEM_PE_READSET_ST ]; 508 | 509 | for( register int i=0, j=0; i!=READS_PER_BATCH_ST; ++i ) { 510 | read[i].id1 = read_data + j; 511 | j += MAX_READ_ID; 512 | read[i].seq1 = read_data + j; 513 | j += MAX_READ_CYCLE; 514 | read[i].qual1 = read_data + j; 515 | j += MAX_READ_CYCLE; 516 | 517 | read[i].id2 = read_data + j; 518 | j += MAX_READ_ID; 519 | read[i].seq2 = read_data + j; 520 | j += MAX_READ_CYCLE; 521 | read[i].qual2 = read_data + j; 522 | j += MAX_READ_CYCLE; 523 | } 524 | 525 | ktrim_stat kstat; 526 | kstat.dropped = new unsigned int [ 1 ]; 527 | kstat.real_adapter = new unsigned int [ 1 ]; 528 | kstat.tail_adapter = new unsigned int [ 1 ]; 529 | kstat.dimer = new unsigned int [ 1 ]; 530 | kstat.pass = new unsigned int [ 1 ]; 531 | kstat.dropped[0] = 0; 532 | kstat.real_adapter[0] = 0; 533 | kstat.tail_adapter[0] = 0; 534 | kstat.dimer[0] = 0; 535 | kstat.pass[0] = 0; 536 | 537 | // buffer for storing the modified reads per thread 538 | writeBuffer writebuffer; 539 | writebuffer.buffer1 = new char * [ 1 ]; 540 | if( kp.write2stdout ) { 541 | writebuffer.buffer1[0] = new char[ BUFFER_SIZE_PER_BATCH_READ_ST << 1 ]; 542 | // b1stored is allocated below for both output modes 543 | 
} else { 544 | writebuffer.buffer1[0] = new char[ BUFFER_SIZE_PER_BATCH_READ_ST ]; 545 | 546 | writebuffer.buffer2 = new char * [ 1 ]; 547 | writebuffer.buffer2[0] = new char[ BUFFER_SIZE_PER_BATCH_READ_ST ]; 548 | } 549 | writebuffer.b1stored = new unsigned int [ 1 ]; 550 | writebuffer.b2stored = new unsigned int [ 1 ]; 551 | 552 | // deal with multiple input files 553 | if( kp.R1s.size() != kp.R2s.size() ) { 554 | cerr << "\033[1;31mError: incorrect files!\033[0m\n"; 555 | return 110; 556 | } 557 | unsigned int totalFiles = kp.R1s.size(); 558 | //cout << "\033[1;34mINFO: " << totalFiles << " paired fastq files will be loaded.\033[0m\n"; 559 | 560 | register unsigned int line = 0; 561 | for( unsigned int fileCnt=0; fileCnt!=totalFiles; ++ fileCnt ) { 562 | bool file_is_gz = false; 563 | FILE *fq1, *fq2; 564 | gzFile gfp1, gfp2; 565 | register unsigned int i = kp.R1s[fileCnt].size() - 3; 566 | register const char * p = kp.R1s[fileCnt].c_str(); 567 | register const char * q = kp.R2s[fileCnt].c_str(); 568 | if( p[i]=='.' && p[i+1]=='g' && p[i+2]=='z' ) { 569 | file_is_gz = true; 570 | gfp1 = gzopen( p, "r" ); 571 | gfp2 = gzopen( q, "r" ); 572 | if( gfp1==NULL || gfp2==NULL ) { 573 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n"; 574 | return 104; 575 | } 576 | } else { 577 | fq1 = fopen( p, "rt" ); 578 | fq2 = fopen( q, "rt" ); 579 | if( fq1==NULL || fq2==NULL ) { 580 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n"; 581 | return 104; 582 | } 583 | } 584 | 585 | while( true ) { 586 | // get fastq reads 587 | unsigned int loaded; 588 | if( file_is_gz ) { 589 | loaded = load_batch_data_PE_both_GZ( gfp1, gfp2, read, READS_PER_BATCH_ST ); 590 | } else { 591 | loaded = load_batch_data_PE_both_C( fq1, fq2, read, READS_PER_BATCH_ST ); 592 | } 593 | if( loaded == 0 ) break; 594 | 595 | write_thread = 0; 596 | workingThread_PE_C( 0, 0, loaded, read, &kstat, &writebuffer, kp ); 597 | // write output and update fastq statistics 598 | /* if( ! 
kp.write2stdout ) { 599 | fwrite( writebuffer.buffer1[0], sizeof(char), writebuffer.b1stored[0], kp.fout1 ); 600 | fwrite( writebuffer.buffer2[0], sizeof(char), writebuffer.b2stored[0], kp.fout2 ); 601 | }*/ 602 | 603 | line += loaded; 604 | //cerr << '\r' << line << " reads loaded"; 605 | 606 | if( file_is_gz ) { 607 | if( gzeof( gfp1 ) ) break; 608 | } else { 609 | if( feof( fq2 ) ) break; 610 | } 611 | } 612 | 613 | if( file_is_gz ) { 614 | gzclose( gfp1 ); 615 | gzclose( gfp2 ); 616 | } else { 617 | fclose( fq1 ); 618 | fclose( fq2 ); 619 | } 620 | } 621 | //cerr << "\rDone: " << line << " lines processed.\n"; 622 | 623 | // write trim.log 624 | fprintf( kp.flog, "Total\t%u\nDropped\t%u\nAadaptor\t%u\nTailHit\t%u\nDimer\t%u\nPass\t%u\n", 625 | line, kstat.dropped[0], kstat.real_adapter[0], kstat.tail_adapter[0], kstat.dimer[0], kstat.pass[0] ); 626 | 627 | delete [] writebuffer.buffer1[0]; 628 | if( ! kp.write2stdout ) delete [] writebuffer.buffer2[0]; 629 | delete [] read; 630 | delete [] read_data; 631 | 632 | return 0; 633 | } 634 | 635 | int process_two_thread_PE_C( const ktrim_param &kp ) { 636 | // IO speed-up 637 | ios::sync_with_stdio( false ); 638 | // cin.tie( NULL ); 639 | 640 | // in this version, a single data container is used: both threads load it, then both process it 641 | CPEREAD *readA; 642 | register char *readA_data; 643 | ktrim_stat kstat; 644 | writeBuffer writebuffer; 645 | // vector<string> R1s, R2s; 646 | unsigned int totalFiles; 647 | 648 | // use 2 threads for initialization 649 | // prepare memory and file 650 | omp_set_num_threads( 2 ); 651 | #pragma omp parallel 652 | { 653 | unsigned int tn = omp_get_thread_num(); 654 | 655 | if( tn == 0 ) { 656 | readA = new CPEREAD[ READS_PER_BATCH ]; 657 | readA_data = new char[ MEM_PE_READSET ]; 658 | 659 | for( register int i=0, j=0; i!=READS_PER_BATCH; ++i ) { 660 | readA[i].id1 = readA_data + j; 661 | j += MAX_READ_ID; 662 | readA[i].seq1 = readA_data + j; 663 | j += MAX_READ_CYCLE; 664 | readA[i].qual1 = 
readA_data + j; 665 | j += MAX_READ_CYCLE; 666 | 667 | readA[i].id2 = readA_data + j; 668 | j += MAX_READ_ID; 669 | readA[i].seq2 = readA_data + j; 670 | j += MAX_READ_CYCLE; 671 | readA[i].qual2 = readA_data + j; 672 | j += MAX_READ_CYCLE; 673 | } 674 | } else { 675 | init_kstat_wrbuffer( kstat, writebuffer, kp.thread, kp.write2stdout ); 676 | 677 | // deal with multiple input files 678 | totalFiles = kp.R1s.size(); 679 | //cout << "\033[1;34mINFO: " << totalFiles << " paired fastq files will be loaded.\033[0m\n"; 680 | } 681 | } 682 | 683 | register unsigned int line = 0; 684 | for( unsigned int fileCnt=0; fileCnt!=totalFiles; ++ fileCnt ) { 685 | bool file_is_gz = false; 686 | FILE *fq1, *fq2; 687 | gzFile gfp1, gfp2; 688 | register unsigned int i = kp.R1s[fileCnt].size() - 3; 689 | register const char * p = kp.R1s[fileCnt].c_str(); 690 | register const char * q = kp.R2s[fileCnt].c_str(); 691 | if( p[i]=='.' && p[i+1]=='g' && p[i+2]=='z' ) { 692 | file_is_gz = true; 693 | gfp1 = gzopen( p, "r" ); 694 | gfp2 = gzopen( q, "r" ); 695 | if( gfp1==NULL || gfp2==NULL ) { 696 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n"; 697 | return 104; 698 | } 699 | } else { 700 | fq1 = fopen( p, "rt" ); 701 | fq2 = fopen( q, "rt" ); 702 | if( fq1==NULL || fq2==NULL ) { 703 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n"; 704 | return 104; 705 | } 706 | } 707 | // fprintf( stderr, "Loading files:\nRead1: %s\nRead2: %s\n", p, q ); 708 | 709 | // initialization 710 | // get first batch of fastq reads 711 | 712 | // there is NO need to do rotation in 2-threads 713 | unsigned int loaded1, loaded2; 714 | bool metEOF; 715 | while( true ) { 716 | // load data 717 | omp_set_num_threads( 2 ); 718 | #pragma omp parallel 719 | { 720 | unsigned int tn = omp_get_thread_num(); 721 | if( tn == 0 ) { 722 | if( file_is_gz ) { 723 | loaded1 = load_batch_data_PE_GZ( gfp1, readA, READS_PER_BATCH, true ); 724 | metEOF = gzeof( gfp1 ); 725 | } else { 726 | loaded1 = 
load_batch_data_PE_C( fq1, readA, READS_PER_BATCH, true ); 727 | metEOF = feof( fq1 ); 728 | } 729 | } else { 730 | if( file_is_gz ) { 731 | loaded2 = load_batch_data_PE_GZ( gfp2, readA, READS_PER_BATCH, false ); 732 | } else { 733 | loaded2 = load_batch_data_PE_C( fq2, readA, READS_PER_BATCH, false ); 734 | } 735 | } 736 | } 737 | if( loaded1 != loaded2 ) { 738 | cerr << "ERROR: unequal read number for read1 (" << loaded1 << ") and read2 (" << loaded2 << ")!\n"; 739 | return 1; 740 | } 741 | if( loaded1 == 0 ) break; 742 | 743 | // start analysis 744 | write_thread = 0; 745 | omp_set_num_threads( 2 ); 746 | #pragma omp parallel 747 | { 748 | unsigned int tn = omp_get_thread_num(); 749 | register int middle = loaded1 >> 1; 750 | if( tn == 0 ) { 751 | workingThread_PE_C( 0, 0, middle, readA, &kstat, &writebuffer, kp ); 752 | } else { 753 | workingThread_PE_C( 1, middle, loaded1, readA, &kstat, &writebuffer, kp ); 754 | } 755 | } // parallel body 756 | // write output and update fastq statistics 757 | /*if( ! 
kp.write2stdout ) { 758 | fwrite( writebuffer.buffer1[0], sizeof(char), writebuffer.b1stored[0], kp.fout1 ); 759 | fwrite( writebuffer.buffer1[1], sizeof(char), writebuffer.b1stored[1], kp.fout1 ); 760 | fwrite( writebuffer.buffer2[0], sizeof(char), writebuffer.b2stored[0], kp.fout2 ); 761 | fwrite( writebuffer.buffer2[1], sizeof(char), writebuffer.b2stored[1], kp.fout2 ); 762 | }*/ 763 | line += loaded1; 764 | if( metEOF )break; 765 | } 766 | 767 | if( file_is_gz ) { 768 | gzclose( gfp1 ); 769 | gzclose( gfp2 ); 770 | } else { 771 | fclose( fq1 ); 772 | fclose( fq2 ); 773 | } 774 | } // all input files are loaded 775 | //cerr << "\rDone: " << line << " lines processed.\n"; 776 | 777 | // write trim.log 778 | fprintf( kp.flog, "Total\t%u\nDropped\t%u\nAadaptor\t%u\nTailHit\t%u\nDimer\t%u\nPass\t%u\n", 779 | line, kstat.dropped[0]+kstat.dropped[1], kstat.real_adapter[0]+kstat.real_adapter[1], 780 | kstat.tail_adapter[0]+kstat.tail_adapter[1], kstat.dimer[0]+kstat.dimer[1], kstat.pass[0]+kstat.pass[1] ); 781 | 782 | //free memory 783 | for(unsigned int i=0; i!=kp.thread; ++i) { 784 | delete [] writebuffer.buffer1[i]; 785 | if( ! kp.write2stdout ) delete [] writebuffer.buffer2[i]; 786 | } 787 | delete [] writebuffer.buffer1; 788 | delete [] writebuffer.buffer2; 789 | delete [] kstat.dropped; 790 | delete [] kstat.real_adapter; 791 | delete [] kstat.tail_adapter; 792 | delete [] kstat.dimer; 793 | delete [] kstat.pass; 794 | 795 | delete [] readA; 796 | delete [] readA_data; 797 | 798 | return 0; 799 | } 800 | 801 | -------------------------------------------------------------------------------- /src/se_handler.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Author: Kun Sun (sunkun@szbl.ac.cn) 3 | * Date: May 2023 4 | * This program is part of the Ktrim package 5 | **/ 6 | 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include "common.h" 17 | #include "util.h" 18 | 
using namespace std; 19 | 20 | void inline CSEREAD_resize( CSEREAD * cr, int n ) { 21 | cr->seq[ n] = 0; 22 | cr->qual[n] = 0; 23 | cr->size = n; 24 | } 25 | 26 | void find_seed( vector<unsigned int> &seed, CSEREAD *read, const ktrim_param & kp ) { 27 | seed.clear(); 28 | register char *poffset = read->seq; 29 | register char *indexloc = poffset; 30 | while( true ) { 31 | indexloc = strstr( indexloc, kp.adapter_index1 ); 32 | if( indexloc == NULL ) 33 | break; 34 | seed.push_back( indexloc - poffset ); 35 | indexloc ++; 36 | } 37 | poffset = read->seq + OFFSET_INDEX3; 38 | indexloc = poffset; 39 | while( true ) { 40 | indexloc = strstr( indexloc, kp.adapter_index3 ); 41 | if( indexloc == NULL ) 42 | break; 43 | seed.push_back( indexloc - poffset ); 44 | indexloc ++; 45 | } 46 | sort( seed.begin(), seed.end() ); 47 | } 48 | 49 | void workingThread_SE_C( unsigned int tn, unsigned int start, unsigned int end, CSEREAD *workingReads, 50 | ktrim_stat *kstat, writeBuffer *writebuffer, const ktrim_param &kp ) { 51 | 52 | // fprintf( stderr, "=== working thread %d: %d - %d\n", tn, start, end ), "\n"; 53 | writebuffer->b1stored[tn] = 0; 54 | 55 | register int i, j; 56 | register unsigned int last_seed; 57 | vector<unsigned int> seed; 58 | vector<unsigned int> :: iterator it; 59 | const char *p, *q; 60 | 61 | register CSEREAD *wkr = workingReads + start; 62 | for( unsigned int ii=start; ii!=end; ++ii, ++wkr ) { 63 | // fprintf( stderr, "working: %d, %s\n", ii, wkr->id ); 64 | // quality control 65 | p = wkr->qual; 66 | j = wkr->size; 67 | 68 | // update in v1.2: support window check 69 | i = get_quality_trim_cycle_se( p, j, kp ); 70 | 71 | if( i == 0 ) { // not long enough 72 | ++ kstat->dropped[ tn ]; 73 | continue; 74 | } 75 | if( i != j ) { // quality-trim occurs 76 | CSEREAD_resize( wkr, i); 77 | } 78 | 79 | // looking for seed target, 1 mismatch is allowed for these 2 seeds 80 | // which means seq1 and seq2 at least should take 1 perfect seed match 81 | find_seed( seed, wkr, kp ); 82 | 83 | last_seed = 
impossible_seed; // a position which cannot be in seed 84 | for( it=seed.begin(); it!=seed.end(); ++it ) { 85 | if( *it != last_seed ) { 86 | // fprintf( stderr, " check seed: %d\n", *it ); 87 | // as there maybe the same value in seq1_seed and seq2_seed, 88 | // use this to avoid re-calculate that pos 89 | if( check_mismatch_dynamic_SE_C( wkr, *it, kp ) ) 90 | break; 91 | 92 | last_seed = *it; 93 | } 94 | } 95 | register bool no_valid_adapter = true; 96 | if( it != seed.end() ) { // adapter found 97 | no_valid_adapter = false; 98 | ++ kstat->real_adapter[tn]; 99 | if( *it >= kp.min_length ) { 100 | CSEREAD_resize( wkr, *it ); 101 | } else { // drop this read as its length is not enough 102 | ++ kstat->dropped[tn]; 103 | 104 | if( *it <= DIMER_INSERT ) 105 | ++ kstat->dimer[tn]; 106 | 107 | continue; 108 | } 109 | } else { // seed not found, now check the tail 2, if perfect match, drop these 2; Single-end reads do not check tail 1 110 | i = wkr->size - 2; 111 | p = wkr->seq; 112 | if( p[i]==kp.adapter_r1[0] && p[i+1]==kp.adapter_r1[1] ) { 113 | ++ kstat->tail_adapter[tn]; 114 | if( i < kp.min_length ) { 115 | ++ kstat->dropped[tn]; 116 | continue; 117 | } 118 | CSEREAD_resize( wkr, i ); 119 | } 120 | } 121 | 122 | if( kp.outputReadWithAdaptorOnly && no_valid_adapter ) 123 | continue; 124 | 125 | ++ kstat->pass[tn]; 126 | writebuffer->b1stored[tn] += sprintf( writebuffer->buffer1[tn]+writebuffer->b1stored[tn], 127 | "%s%s\n+\n%s\n", wkr->id, wkr->seq, wkr->qual); 128 | } 129 | 130 | // wait for my turn to write 131 | while( true ) { 132 | if( tn == write_thread ) { 133 | // cerr << "Thread " << tn << " is writing.\n"; 134 | fwrite( writebuffer->buffer1[tn], sizeof(char), writebuffer->b1stored[tn], kp.fout1 ); 135 | ++ write_thread; 136 | break; 137 | } else { 138 | this_thread::sleep_for( waiting_time_for_writing ); 139 | } 140 | } 141 | } 142 | 143 | int process_multi_thread_SE_C( const ktrim_param &kp ) { 144 | // IO speed-up 145 | // ios::sync_with_stdio( false 
); 146 | // cin.tie( NULL ); 147 | 148 | // in this version, two data containers are used and auto-swapped for working and loading data 149 | CSEREAD *readA = new CSEREAD[ READS_PER_BATCH ]; 150 | CSEREAD *readB = new CSEREAD[ READS_PER_BATCH ]; 151 | 152 | register char *readA_data = new char[ MEM_SE_READSET ]; 153 | register char *readB_data = new char[ MEM_SE_READSET ]; 154 | 155 | for( register int i=0, j=0; i!=READS_PER_BATCH; ++i ) { 156 | readA[i].id = readA_data + j; 157 | readB[i].id = readB_data + j; 158 | j += MAX_READ_ID; 159 | readA[i].seq = readA_data + j; 160 | readB[i].seq = readB_data + j; 161 | j += MAX_READ_CYCLE; 162 | readA[i].qual = readA_data + j; 163 | readB[i].qual = readB_data + j; 164 | j += MAX_READ_CYCLE; 165 | } 166 | 167 | CSEREAD *workingReads, *loadingReads, *swapReads; 168 | 169 | ktrim_stat kstat; 170 | kstat.dropped = new unsigned int [ kp.thread ]; 171 | kstat.real_adapter = new unsigned int [ kp.thread ]; 172 | kstat.tail_adapter = new unsigned int [ kp.thread ]; 173 | kstat.dimer = new unsigned int [ kp.thread ]; 174 | kstat.pass = new unsigned int [ kp.thread ]; 175 | 176 | // buffer for storing the modified reads per thread 177 | writeBuffer writebuffer; 178 | writebuffer.buffer1 = new char * [ kp.thread ]; 179 | writebuffer.b1stored = new unsigned int [ kp.thread ]; 180 | 181 | for(unsigned int i=0; i!=kp.thread; ++i) { 182 | writebuffer.buffer1[i] = new char[ BUFFER_SIZE_PER_BATCH_READ ]; 183 | writebuffer.b1stored[i] = 0; 184 | 185 | kstat.dropped[i] = 0; 186 | kstat.real_adapter[i] = 0; 187 | kstat.tail_adapter[i] = 0; 188 | kstat.dimer[i] = 0; 189 | kstat.pass[i] = 0; 190 | } 191 | 192 | // deal with multiple input files 193 | // vector R1s; 194 | // extractFileNames( kp.FASTQU, R1s ); //this is moved to param_handler 195 | unsigned int totalFiles = kp.R1s.size(); 196 | //cout << "\033[1;34mINFO: " << totalFiles << " single-end fastq files will be loaded.\033[0m\n"; 197 | 198 | register unsigned int line = 0; 199 | 
unsigned int threadCNT = kp.thread - 1; 200 | for( unsigned int fileCnt=0; fileCnt!=totalFiles; ++ fileCnt ) { 201 | bool file_is_gz = false; 202 | FILE *fq; 203 | gzFile gfp; 204 | register unsigned int i = kp.R1s[fileCnt].size() - 3; 205 | register const char * p = kp.R1s[fileCnt].c_str(); 206 | if( p[i]=='.' && p[i+1]=='g' && p[i+2]=='z' ) { 207 | file_is_gz = true; 208 | gfp = gzopen( p, "r" ); 209 | if( gfp == NULL ) { 210 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n"; 211 | return 104; 212 | } 213 | } else { 214 | fq = fopen( p, "rt" ); 215 | if( fq == NULL ) { 216 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n"; 217 | return 104; 218 | } 219 | } 220 | // initialization 221 | // get first batch of fastq reads 222 | unsigned int loaded; 223 | bool metEOF; 224 | if( file_is_gz ) { 225 | loaded = load_batch_data_SE_GZ( gfp, readA, READS_PER_BATCH ); 226 | metEOF = gzeof( gfp ); 227 | } else { 228 | loaded = load_batch_data_SE_C( fq, readA, READS_PER_BATCH ); 229 | metEOF = feof( fq ); 230 | } 231 | if( loaded == 0 ) break; 232 | // fprintf( stderr, "Loaded %d, metEOF=%d\n", loaded, metEOF ); 233 | 234 | loadingReads = readB; 235 | workingReads = readA; 236 | bool nextBatch = true; 237 | unsigned int threadLoaded; 238 | while( nextBatch ) { 239 | // start parallalization 240 | write_thread = 0; 241 | omp_set_num_threads( kp.thread ); 242 | #pragma omp parallel 243 | { 244 | unsigned int tn = omp_get_thread_num(); 245 | // if EOF is met, then all threads are used for analysis 246 | // otherwise 1 thread will do data loading 247 | if( metEOF ) { 248 | unsigned int start = loaded * tn / kp.thread; 249 | unsigned int end = loaded * (tn+1) / kp.thread; 250 | workingThread_SE_C( tn, start, end, workingReads, &kstat, &writebuffer, kp ); 251 | nextBatch = false; 252 | } else { // use 1 thread to load file, others for trimming 253 | if( tn == threadCNT ) { 254 | if( file_is_gz ) { 255 | threadLoaded = load_batch_data_SE_GZ( gfp, loadingReads, 
READS_PER_BATCH ); 256 | metEOF = gzeof( gfp ); 257 | } else { 258 | threadLoaded = load_batch_data_SE_C( fq, loadingReads, READS_PER_BATCH ); 259 | metEOF = feof( fq ); 260 | } 261 | nextBatch = (threadLoaded!=0); 262 | // cerr << "Thread " << tn << " has loaded " << threadLoaded << " reads.\n"; 263 | // fprintf( stderr, "Loaded %d, metEOF=%d\n", threadLoaded, metEOF ); 264 | //cerr << "Loading thread: " << threadLoaded << ", " << metEOF << ", " << nextBatch << '\n'; 265 | } else { 266 | unsigned int start = loaded * tn / threadCNT; 267 | unsigned int end = loaded * (tn+1) / threadCNT; 268 | workingThread_SE_C( tn, start, end, workingReads, &kstat, &writebuffer, kp ); 269 | // write output; fwrite is thread-safe 270 | } 271 | } 272 | } // parallel body 273 | // swap workingReads and loadingReads for next loop 274 | swapReads = loadingReads; 275 | loadingReads = workingReads; 276 | workingReads = swapReads; 277 | line += loaded; 278 | loaded = threadLoaded; 279 | //cerr << '\r' << line << " reads loaded"; 280 | } 281 | 282 | if( file_is_gz ) { 283 | gzclose( gfp ); 284 | } else { 285 | fclose( fq ); 286 | } 287 | } 288 | //cerr << "\rDone: " << line << " lines processed.\n"; 289 | 290 | // write trim.log 291 | unsigned int dropped_all=0, real_all=0, tail_all=0, dimer_all=0, pass_all=0; 292 | for( unsigned int i=0; i!=kp.thread; ++i ) { 293 | dropped_all += kstat.dropped[i]; 294 | real_all += kstat.real_adapter[i]; 295 | tail_all += kstat.tail_adapter[i]; 296 | dimer_all += kstat.dimer[i]; 297 | pass_all += kstat.pass[i]; 298 | } 299 | fprintf( kp.flog, "Total\t%u\nDropped\t%u\nAadaptor\t%u\nTailHit\t%u\nDimer\t%u\nPass\t%u\n", 300 | line, dropped_all, real_all, tail_all, dimer_all, pass_all ); 301 | 302 | //free memory 303 | for(unsigned int i=0; i!=kp.thread; ++i) { 304 | delete [] writebuffer.buffer1[i]; 305 | } 306 | delete [] writebuffer.buffer1; 307 | 308 | delete [] kstat.dropped; 309 | delete [] kstat.real_adapter; 310 | delete [] kstat.tail_adapter; 311 | delete [] 
kstat.dimer; 312 | delete [] kstat.pass; 313 | 314 | delete [] readA; 315 | delete [] readB; 316 | delete [] readA_data; 317 | delete [] readB_data; 318 | 319 | return 0; 320 | } 321 | 322 | int process_single_thread_SE_C( const ktrim_param &kp ) { 323 | // IO speed-up 324 | // ios::sync_with_stdio( false ); 325 | // cin.tie( NULL ); 326 | 327 | CSEREAD *read = new CSEREAD[ READS_PER_BATCH_ST ]; 328 | register char *read_data = new char[ MEM_SE_READSET_ST ]; 329 | 330 | for( register int i=0, j=0; i!=READS_PER_BATCH_ST; ++i ) { 331 | read[i].id = read_data + j; 332 | j += MAX_READ_ID; 333 | read[i].seq = read_data + j; 334 | j += MAX_READ_CYCLE; 335 | read[i].qual = read_data + j; 336 | j += MAX_READ_CYCLE; 337 | } 338 | 339 | ktrim_stat kstat; 340 | kstat.dropped = new unsigned int [ 1 ]; 341 | kstat.real_adapter = new unsigned int [ 1 ]; 342 | kstat.tail_adapter = new unsigned int [ 1 ]; 343 | kstat.dimer = new unsigned int [ 1 ]; 344 | kstat.pass = new unsigned int [ 1 ]; 345 | kstat.dropped[0] = 0; 346 | kstat.real_adapter[0] = 0; 347 | kstat.tail_adapter[0] = 0; 348 | kstat.dimer[0] = 0; 349 | kstat.pass[0] = 0; 350 | 351 | // buffer for storing the modified reads per thread 352 | writeBuffer writebuffer; 353 | writebuffer.buffer1 = new char * [ 1 ]; 354 | writebuffer.b1stored = new unsigned int [ 1 ]; 355 | writebuffer.buffer1[0] = new char[ BUFFER_SIZE_PER_BATCH_READ_ST ]; 356 | 357 | // deal with multiple input files 358 | unsigned int totalFiles = kp.R1s.size(); 359 | //cout << "\033[1;34mINFO: " << totalFiles << " single-end fastq files will be loaded.\033[0m\n"; 360 | 361 | register unsigned int line = 0; 362 | for( unsigned int fileCnt=0; fileCnt!=totalFiles; ++ fileCnt ) { 363 | //fq1.open( R1s[fileCnt].c_str() ); 364 | bool file_is_gz = false; 365 | FILE *fq; 366 | gzFile gfp; 367 | register unsigned int i = kp.R1s[fileCnt].size() - 3; 368 | register const char * p = kp.R1s[fileCnt].c_str(); 369 | if( p[i]=='.' 
&& p[i+1]=='g' && p[i+2]=='z' ) { 370 | // fprintf( stderr, "GZ file!\n" ); 371 | file_is_gz = true; 372 | gfp = gzopen( p, "r" ); 373 | if( gfp == NULL ) { 374 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n"; 375 | return 104; 376 | } 377 | } else { 378 | fq = fopen( p, "rt" ); 379 | if( fq == NULL ) { 380 | cerr << "\033[1;31mError: open fastq file failed!\033[0m\n"; 381 | return 104; 382 | } 383 | } 384 | 385 | while( true ) { 386 | // get fastq reads 387 | //unsigned int loaded = load_batch_data_SE( fq1, read, READS_PER_BATCH_ST ); 388 | unsigned int loaded; 389 | if( file_is_gz ) { 390 | loaded = load_batch_data_SE_GZ( gfp, read, READS_PER_BATCH_ST ); 391 | // fprintf( stderr, "Loaded=%d\n", loaded ); 392 | } else { 393 | loaded = load_batch_data_SE_C( fq, read, READS_PER_BATCH_ST ); 394 | } 395 | if( loaded == 0 ) break; 396 | 397 | write_thread = 0; 398 | workingThread_SE_C( 0, 0, loaded, read, &kstat, &writebuffer, kp ); 399 | // write output and update fastq statistics 400 | line += loaded; 401 | //cerr << '\r' << line << " reads loaded"; 402 | 403 | //if( fq1.eof() ) break; 404 | // fprintf( stderr, "Check\n" ); 405 | if( file_is_gz ) { 406 | if( gzeof( gfp ) ) break; 407 | } else { 408 | if( feof( fq ) ) break; 409 | } 410 | } 411 | //fq1.close(); 412 | if( file_is_gz ) { 413 | gzclose( gfp ); 414 | } else { 415 | fclose( fq ); 416 | } 417 | } 418 | 419 | // write trim.log 420 | fprintf( kp.flog, "Total\t%u\nDropped\t%u\nAadaptor\t%u\nTailHit\t%u\nDimer\t%u\nPass\t%u\n", 421 | line, kstat.dropped[0], kstat.real_adapter[0], kstat.tail_adapter[0], kstat.dimer[0], kstat.pass[0] ); 422 | 423 | //free memory 424 | // delete buffer1; 425 | // delete fq1buffer; 426 | delete [] writebuffer.buffer1[0]; 427 | delete [] read; 428 | delete [] read_data; 429 | 430 | return 0; 431 | } 432 | 433 | -------------------------------------------------------------------------------- /src/util.h: 
-------------------------------------------------------------------------------- 1 | /** 2 | * Author: Kun Sun (sunkun@szbl.ac.cn) 3 | * Date: Feb, 2020 4 | * This program is part of the Ktrim package 5 | **/ 6 | 7 | #ifndef _KTRIM_UTIL_ 8 | #define _KTRIM_UTIL_ 9 | 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include 22 | #include "common.h" 23 | 24 | using namespace std; 25 | 26 | // extract file names 27 | void extractFileNames( const char *str, vector<string> & Rs ) { 28 | string fileName = ""; 29 | for(unsigned int i=0; str[i]!='\0'; ++i) { 30 | if( str[i] == FILE_SEPARATOR ) { 31 | Rs.push_back( fileName ); 32 | fileName.clear(); 33 | } else { 34 | fileName += str[i]; 35 | } 36 | } 37 | if( fileName.size() ) // push the last name unless the string ended with a FILE_SEPARATOR 38 | Rs.push_back( fileName ); 39 | } 40 | 41 | void loadFQFileNames( ktrim_param & kp ) { 42 | ifstream fin; 43 | fin.open( kp.filelist ); 44 | if( fin.fail() ) { 45 | cerr << "Error: load file " << kp.filelist << " failed!\n"; 46 | exit(10); 47 | } 48 | 49 | string line; 50 | stringstream ss; 51 | string fq1, fq2; 52 | 53 | // read the first valid line, determine SE/PE data 54 | int lineCnt = 0; 55 | bool pe_data = false; 56 | while( true ) { 57 | getline(fin, line); 58 | if( fin.eof() )break; 59 | 60 | if( line[0] != '#' ) { // lines starting with "#" are ignored as comments 61 | ss.str( line ); 62 | ss.clear(); 63 | ss >> fq1; 64 | 65 | if( ss.rdbuf()->in_avail() ) { 66 | ss >> fq2; 67 | pe_data = true; 68 | 69 | kp.R1s.push_back( fq1 ); 70 | kp.R2s.push_back( fq2 ); 71 | } else { 72 | pe_data = false; 73 | kp.R1s.push_back( fq1 ); 74 | } 75 | 76 | ++ lineCnt; 77 | 78 | break; 79 | } 80 | } 81 | 82 | bool inconsistent = false; 83 | while( true ) { 84 | getline(fin, line); 85 | if( fin.eof() )break; 86 | 87 | if( line[0] == '#' ) { 88 | continue; 89 | } 90 | 91 | ss.str( line ); 92 | ss.clear(); 93 | ss >> 
fq1; 94 | 95 | if( pe_data ) { 96 | if( ss.rdbuf()->in_avail() ) { 97 | ss >> fq2; 98 | kp.R1s.push_back( fq1 ); 99 | kp.R2s.push_back( fq2 ); 100 | } else { 101 | inconsistent = true; 102 | break; 103 | } 104 | } else { 105 | if( ss.rdbuf()->in_avail() ) { 106 | inconsistent = true; 107 | break; 108 | } 109 | kp.R1s.push_back( fq1 ); 110 | } 111 | 112 | ++ lineCnt; 113 | } 114 | 115 | fin.close(); 116 | if( inconsistent ) { 117 | cerr << "\033[1;31mERROR: inconsistent PE/SE in '" << kp.filelist << "'!\033[0m\n"; 118 | exit(3); 119 | } 120 | if( pe_data ) { 121 | if( kp.R1s.size() != kp.R2s.size() ) { 122 | cerr << "\033[1;31mError: incorrect pairs in file list!\033[0m\n"; 123 | exit(110); 124 | } 125 | } 126 | // cerr << "Done: " << lineCnt << " lines of " << (pe_data) ? 'P' : 'S' << "E data loaded.\n"; 127 | kp.paired_end_data = pe_data; 128 | } 129 | 130 | //load 1 batch of data, using purely C-style 131 | unsigned int load_batch_data_SE_C( FILE *fp, CSEREAD *loadingReads, unsigned int num ) { 132 | register unsigned int loaded = 0; 133 | register CSEREAD *p = loadingReads; 134 | register CSEREAD *q = p + num; 135 | while( p != q ) { 136 | if( fgets( p->id, MAX_READ_ID, fp ) == NULL ) break; 137 | fgets( p->seq, MAX_READ_CYCLE, fp ); 138 | fgets( p->qual, MAX_READ_CYCLE, fp ); // this line is useless 139 | fgets( p->qual, MAX_READ_CYCLE, fp ); 140 | 141 | // remove the tail '\n' 142 | register unsigned int i = strlen( p->seq ) - 1; 143 | p->size = i; 144 | p->seq[i] = 0; 145 | p->qual[i] = 0; 146 | 147 | ++ p; 148 | } 149 | return p-loadingReads; 150 | } 151 | 152 | // update in v1.2: support Gzipped input file 153 | unsigned int load_batch_data_SE_GZ( gzFile gfp, CSEREAD *loadingReads, unsigned int num ) { 154 | register unsigned int loaded = 0; 155 | // string unk; 156 | register CSEREAD *p = loadingReads; 157 | register CSEREAD *q = p + num; 158 | register char c; 159 | register unsigned int i; 160 | 161 | while( p != q ) { 162 | if( gzgets( gfp, p->id, 
MAX_READ_ID ) == NULL ) break; 163 | gzgets( gfp, p->seq, MAX_READ_CYCLE ); 164 | gzgets( gfp, p->qual, MAX_READ_CYCLE ); // this line is useless 165 | gzgets( gfp, p->qual, MAX_READ_CYCLE ); 166 | 167 | // remove the tailing '\n' 168 | i = strlen( p->seq ) - 1; 169 | p->size = i; 170 | p->seq[i] = 0; 171 | p->qual[i] = 0; 172 | 173 | ++ p; 174 | } 175 | 176 | return p-loadingReads; 177 | } 178 | 179 | unsigned int load_batch_data_PE_C( FILE *fq, CPEREAD *loadingReads, const unsigned int num, const bool isRead1 ) { 180 | //register unsigned int loaded = 0; 181 | register CPEREAD *p = loadingReads; 182 | register CPEREAD *q = p + num; 183 | // load read1 184 | if( isRead1 ) { 185 | while( p != q ) { 186 | if( fgets( p->id1, MAX_READ_ID, fq ) == NULL ) break; 187 | fgets( p->seq1, MAX_READ_CYCLE, fq ); 188 | fgets( p->qual1, MAX_READ_CYCLE, fq ); // this line is useless 189 | fgets( p->qual1, MAX_READ_CYCLE, fq ); 190 | p->size = strlen( p->seq1 ); 191 | 192 | ++ p; 193 | } 194 | } else { 195 | while( p != q ) { 196 | if( fgets( p->id2, MAX_READ_ID, fq ) == NULL ) break; 197 | fgets( p->seq2, MAX_READ_CYCLE, fq ); 198 | fgets( p->qual2, MAX_READ_CYCLE, fq ); // this line is useless 199 | fgets( p->qual2, MAX_READ_CYCLE, fq ); 200 | p->size2 = strlen( p->seq2 ); 201 | 202 | ++ p; 203 | } 204 | } 205 | return p - loadingReads; 206 | } 207 | 208 | // update in v1.2: support Gzipped input file 209 | unsigned int load_batch_data_PE_GZ( gzFile gfp, CPEREAD *loadingReads, const unsigned int num, const bool isRead1 ) { 210 | //register unsigned int loaded = 0; 211 | register CPEREAD *p = loadingReads; 212 | register CPEREAD *q = p + num; 213 | if( isRead1 ) { 214 | while( p != q ) { 215 | if( gzgets( gfp, p->id1, MAX_READ_ID ) == NULL ) break; 216 | gzgets( gfp, p->seq1, MAX_READ_CYCLE ); 217 | gzgets( gfp, p->qual1, MAX_READ_CYCLE ); // this line is useless 218 | gzgets( gfp, p->qual1, MAX_READ_CYCLE ); 219 | p->size = strlen( p->seq1 ); 220 | 221 | ++ p; 222 | } 223 | } 
else { 224 | while( p != q ) { 225 | if( gzgets( gfp, p->id2, MAX_READ_ID ) == NULL ) break; 226 | gzgets( gfp, p->seq2, MAX_READ_CYCLE ); 227 | gzgets( gfp, p->qual2, MAX_READ_CYCLE ); // this line is useless 228 | gzgets( gfp, p->qual2, MAX_READ_CYCLE ); 229 | p->size2 = strlen( p->seq2 ); 230 | 231 | ++ p; 232 | } 233 | } 234 | return p - loadingReads; 235 | } 236 | 237 | unsigned int load_batch_data_PE_both_C( FILE *fq1, FILE *fq2, CPEREAD *loadingReads, unsigned int num ) { 238 | //register unsigned int loaded = 0; 239 | register CPEREAD *p = loadingReads; 240 | register CPEREAD *q = p + num; 241 | register CPEREAD *s = p; 242 | // load read1 243 | while( p != q ) { 244 | if( fgets( p->id1, MAX_READ_ID, fq1 ) == NULL ) break; 245 | fgets( p->seq1, MAX_READ_CYCLE, fq1 ); 246 | fgets( p->qual1, MAX_READ_CYCLE, fq1 ); // this line is useless 247 | fgets( p->qual1, MAX_READ_CYCLE, fq1 ); 248 | p->size = strlen( p->seq1 ); 249 | 250 | ++ p; 251 | } 252 | 253 | // load read2 254 | while( s != p ) { 255 | fgets( s->id2, MAX_READ_ID, fq2 ); 256 | fgets( s->seq2, MAX_READ_CYCLE, fq2 ); 257 | fgets( s->qual2, MAX_READ_CYCLE, fq2 ); // this line is useless 258 | fgets( s->qual2, MAX_READ_CYCLE, fq2 ); 259 | s->size2 = strlen( s->seq2 ); 260 | 261 | ++ s; 262 | } 263 | 264 | return p - loadingReads; 265 | } 266 | 267 | // update in v1.2: support Gzipped input file 268 | unsigned int load_batch_data_PE_both_GZ( gzFile gfp1, gzFile gfp2, CPEREAD *loadingReads, unsigned int num ) { 269 | //register unsigned int loaded = 0; 270 | register CPEREAD *p = loadingReads; 271 | register CPEREAD *q = p + num; 272 | register CPEREAD *s = p; 273 | while( p != q ) { 274 | if( gzgets( gfp1, p->id1, MAX_READ_ID ) == NULL ) break; 275 | gzgets( gfp1, p->seq1, MAX_READ_CYCLE ); 276 | gzgets( gfp1, p->qual1, MAX_READ_CYCLE ); // this line is useless 277 | gzgets( gfp1, p->qual1, MAX_READ_CYCLE ); 278 | p->size = strlen( p->seq1 ); 279 | 280 | ++ p; 281 | } 282 | while( s != p ) { 283 | 
gzgets( gfp2, s->id2, MAX_READ_ID ); 284 | gzgets( gfp2, s->seq2, MAX_READ_CYCLE ); 285 | gzgets( gfp2, s->qual2, MAX_READ_CYCLE ); // this line is useless 286 | gzgets( gfp2, s->qual2, MAX_READ_CYCLE ); 287 | s->size2 = strlen( s->seq2 ); 288 | 289 | ++ s; 290 | } 291 | 292 | return p - loadingReads; 293 | } 294 | 295 | /* 296 | * use dynamic max_mismatch as the covered size can range from 3 to a large number such as 50, 297 | * here the maximum mismatch allowed is LEN/8 298 | */ 299 | bool check_mismatch_dynamic_SE_C( CSEREAD *read, unsigned int pos, const ktrim_param & kp ) { 300 | register unsigned int mis=0; 301 | register unsigned int i, len; 302 | len = read->size - pos; 303 | if( len > kp.adapter_len ) 304 | len = kp.adapter_len; 305 | 306 | register unsigned int max_mismatch_dynamic; 307 | // update in v1.1.0: allows the users to set the proportion of mismatches 308 | if( kp.use_default_mismatch ) { 309 | max_mismatch_dynamic = len >> 3; 310 | if( (max_mismatch_dynamic<<3) != len ) 311 | ++ max_mismatch_dynamic; 312 | } else { 313 | max_mismatch_dynamic = ceil( len * kp.mismatch_rate ); 314 | } 315 | 316 | register const char *p = read->seq; 317 | for( i=0; i!=len; ++i ) { 318 | if( p[pos+i] != kp.adapter_r1[i] ) { 319 | if( mis == max_mismatch_dynamic ) 320 | return false; 321 | 322 | ++ mis; 323 | } 324 | } 325 | 326 | return true; 327 | } 328 | 329 | /* 330 | * use dynamic max_mismatch as the covered size can range from 3 to a large number such as 50, 331 | * here the maximum mismatch allowed is LEN/4 for read1 + read2 332 | */ 333 | bool check_mismatch_dynamic_PE_C( const CPEREAD *read, unsigned int pos, const ktrim_param &kp ) { 334 | register unsigned int mis1=0, mis2=0; 335 | register unsigned int i, len; 336 | len = read->size - pos; 337 | if( len > kp.adapter_len ) 338 | len = kp.adapter_len; 339 | 340 | register unsigned int max_mismatch_dynamic; 341 | // update in v1.1.0: allows the users to set the proportion of mismatches 342 | // BUT it is 
    // highly discouraged
    if( kp.use_default_mismatch ) {
        // each read allows 1/8 mismatches of the total comparable length
        max_mismatch_dynamic = len >> 3;
        if( (max_mismatch_dynamic<<3) != len )
            ++ max_mismatch_dynamic;
    } else {
        max_mismatch_dynamic = ceil( len * kp.mismatch_rate );
    }

    // check mismatch for each read
    register const char * p = read->seq1 + pos;
    register const char * q = kp.adapter_r1;
    for( i=0; i!=len; ++i, ++p, ++q ) {
        if( *p != *q ) {
            if( mis1 == max_mismatch_dynamic )
                return false;

            ++ mis1;
        }
    }
    p = read->seq2 + pos;
    q = kp.adapter_r2;
    for( i=0; i!=len; ++i, ++p, ++q ) {
        if( *p != *q ) {
            if( mis2 == max_mismatch_dynamic )
                return false;

            ++ mis2;
        }
    }

    return true;
}

// update in v1.2: support window check
int get_quality_trim_cycle_se( const char *p, const int total_size, const ktrim_param &kp ) {
    register int i, j, k;
    register int stop = kp.min_length-1;
    for( i=total_size-1; i>=stop; ) {
        if( p[i] >= kp.quality ) {
            k = i - kp.window;
            for( j=i-1; j!=k; --j ) {
                if( j<0 || p[j]= stop )
        return i + 1;
    else
        return 0;
}

int get_quality_trim_cycle_pe( const CPEREAD *read, const ktrim_param &kp ) {
    register int i, j, k;
    register int stop = kp.min_length - 1;
    const char *p = read->qual1;
    const char *q = read->qual2;
    for( i=read->size-1; i>=stop; ) {
        if( p[i]>=kp.quality && q[i]>=kp.quality ) {
            k = i - kp.window;
            for( j=i-1; j!=k; --j ) {
                if( j<0 || p[j]= stop )
        return i + 1;
    else
        return 0;
}

#endif

--------------------------------------------------------------------------------
/testing_dataset/adapter.fa:
--------------------------------------------------------------------------------
>testingPE/1
CTGTCTCTTATACACATCT
>testingPE/2
AGATGTGTATAAGAGACAG

--------------------------------------------------------------------------------
/testing_dataset/check.accuracy.pl:
--------------------------------------------------------------------------------
#!/usr/bin/perl

#
# Author: Kun Sun @ SZBL (sunkun@szbl.ac.cn)
#

use strict;
use warnings;

if( $#ARGV < 1 ) {
    print STDERR "\nUsage: $0 [min.size=36] [seq.cycle=100]\n\n";
    exit 2;
}

my $cutSize = $ARGV[2] || 36;
my $seqCycle = $ARGV[3] || 100;

my %raw;
open IN, "$ARGV[0]" or die( "$!" );
while( <IN> ) {
    /^\@(\d+):(\d+)\//;
    $raw{$1} = $2;
    <IN>;
    <IN>;
    <IN>;
}
close IN;

my %trim;
open IN, "$ARGV[1]" or die( "$!" );
while( <IN> ) {
    /^\@(\d+):/;
    my $id = $1;
    my $seq = <IN>;
    $trim{$id} = length($seq)-1;
    <IN>;
    <IN>;
}
close IN;

my $perfect = 0;
my $mistrim = 0;  ## reads that should be trimmed but kept
my $overtrim = 0; ## usually tail-hits
my $overkill = 0; ## the reads are overtrimmed and get discarded
my $error = 0;    ## the read is trimmed at a wrong position

foreach my $id ( keys %raw ) {
    if( $raw{$id} < $cutSize ) { ## too short reads, should be discarded
        if( exists $trim{$id} ) {
            ++ $mistrim;
        } else {
            ++ $perfect;
        }
    } else { ## long reads, should be kept
        unless( exists $trim{$id} ) {
            ++ $overkill;
            next;
        }
        if( $raw{$id} >= $seqCycle ) { ## very long read
            if( $trim{$id} == $seqCycle ) {
                ++ $perfect;
            } else {
                ++ $overtrim;
            }
        } else {
            if( $trim{$id} == $raw{$id} ) {
                ++ $perfect;
            } elsif( $trim{$id} < $raw{$id} ) {
                ++ $overtrim;
            } elsif( $trim{$id} >= $seqCycle-2) { ## seems a tail-trim
                ++ $mistrim;
            }
            else {
                ++ $error;
                #print STDERR "$id\n";
            }
        }
    }
}

print "Correct\t$perfect\n",
      "Over-trim\t", $overtrim + $overkill, "\n",
      "Missed-trim\t", $mistrim + $error, "\n";

--------------------------------------------------------------------------------
/testing_dataset/simu.reads.pl:
--------------------------------------------------------------------------------
#!/usr/bin/perl

#
# Author: Kun Sun @ SZBL (sunkun@szbl.ac.cn)
#

use strict;
use warnings;

if( $#ARGV < 0 ) {
    print STDERR "\nUsage: $0 [num=1e7] [min.size=10] [max.size=200] [read.cycle=100] [error.rate=0.01]\n\n";
    exit 2;
}

## fix the seed so that every run gives exactly the same result
srand( 7 );

our @nuc = ( 'A', 'C', 'G', 'T' );
## for testing purposes, use the adapters from the Nextera kits
our $adapter_1 = "CTGTCTCTTATACACATCT";
our $adapter_2 = "AGATGTGTATAAGAGACAG";

my $num = $ARGV[1] || 1e7;    ## number of fragments
my $min = $ARGV[2] || 10;     ## minimum fragment size
my $max = $ARGV[3] || 200;    ## maximum fragment size
my $cycle = $ARGV[4] || 100;  ## read cycle
my $eRate = $ARGV[5] || 0.01; ## sequencing error rate

## quality is NOT considered, for comparison purposes
my $qual = 'h' x $cycle;

open FQ1, ">$ARGV[0].read1.fq" or die( "$!" );
open FQ2, ">$ARGV[0].read2.fq" or die( "$!"
);

foreach my $i ( 1..$num ) {
    my $size = int( rand()*($max-$min) + $min );

    my $read1 = join('', map{ $nuc[int rand @nuc] } (1..$size));
    my $read2 = reverse $read1;
    $read2 =~ tr/ACGT/TGCA/;
    my $m1 = mut($adapter_1, $eRate);
    my $m2 = mut($adapter_2, $eRate);
    #print STDERR "$read1+$m1\n$read2+$m2\n";
    $read1 .= $m1;
    $read2 .= $m2;
    if( length($read1) < $cycle ) {
        my $tail = $cycle - length($read1);
        $read1 .= 'A' x $tail;
        $read2 .= 'T' x $tail;
    }

    print FQ1 "\@$i:$size/1\n", substr($read1, 0, $cycle), "\n+\n$qual\n";
    print FQ2 "\@$i:$size/2\n", substr($read2, 0, $cycle), "\n+\n$qual\n";
}

close FQ1;
close FQ2;


sub mut {
    my $raw = shift;
    my $eRate = shift;

    my $len = length($raw);

    my $mut = '';
    for(my $s=0; $s<$len; ++$s) {
        my $r = substr( $raw, $s, 1 );
        if( rand() < $eRate ) { ## add a mutation
            while( 1 ) {
                my $m = $nuc[ int rand @nuc ];
                if( $m ne $r ) {
                    $mut .= $m;
                    last;
                }
            }
        } else {
            $mut .= $r;
        }
    }

    return $mut;
}

--------------------------------------------------------------------------------