├── .gitignore ├── LICENSE ├── Makefile ├── README.md ├── custombuf.cpp ├── custombuf.h ├── epsapg.cpp ├── epsapg.png ├── utils.cpp └── utils.h /.gitignore: -------------------------------------------------------------------------------- 1 | #Do not track content of these folders 2 | bin/ 3 | logs/ 4 | *.log 5 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Issar Arab 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | 2 | # This Makefile compiles the software components, generates the executable, 3 | # and moves it to the bin folder so it can later be added to the system environment 4 | 5 | all: 6 | # for version `GLIBCXX_3.4.21' not found 7 | export LD_LIBRARY_PATH=/usr/local/lib:/usr/lib:/usr/local/lib64:/usr/lib64 8 | g++ -o epsapg -std=c++0x epsapg.cpp utils.cpp custombuf.cpp -pthread 9 | rm -rf bin 10 | mkdir bin 11 | mv epsapg bin/. 12 | 13 | 14 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # EPSAPG 2 | 3 | EPSAPG is a pipeline combining modules from state-of-the-art methods to quickly generate extensive protein sequence alignment profiles. It is a fast alignment pipeline that harnesses the speed of MMseqs2 and generates PSI-BLAST output (profile and PSSM). 4 | 5 |

6 | 7 |

8 | 9 | I started researching this project idea during my time as an AI researcher at Rostlab, Technical University of Munich. 10 | 11 | The main entry point of the software is "epsapg.cpp". 12 | 13 | ## Publication 14 | If you use EPSAPG in your work, please cite the following publication: 15 | 16 | - I. Arab, **EPSAPG: A Pipeline Combining MMseqs2 and PSI-BLAST to Quickly Generate Extensive Protein Sequence Alignment Profiles**, In 2023 IEEE/ACM 10th International Conference on Big Data Computing, Applications and Technologies (BDCAT ’23), December 4–7, 2023, Taormina, Messina, Italy. ACM, New York, NY, USA, 9 pages. [doi:10.1145/3632366.3632384](https://doi.org/10.1145/3632366.3632384) 17 | 18 | # Prerequisites 19 | ## Hardware requirements 20 | :exclamation: Before you start compiling any code, make sure you have a machine that supports SSE4.1 or AVX2. You can check that by typing the following commands: 21 | 22 | cat /proc/cpuinfo | grep -c sse4_1 > /dev/null && echo "SSE4.1 Supported" || echo "SSE4.1 Unsupported" 23 | 24 | OR 25 | 26 | cat /proc/cpuinfo | grep -c avx2 > /dev/null && echo "AVX2 Supported" || echo "AVX2 Unsupported" 27 | 28 | Once the hardware requirements are met, you can move forward to the software requirements. 29 | 30 | ## Software requirements 31 | :exclamation: Compiling the following software will require 'git', 'g++' (7.3 or higher) and 'cmake' (3.0 or higher). 32 | 33 | ### Installing g++ 34 | The commands below are tailored to CentOS 7, a Linux distribution. You can look up the equivalent command for each step on other distributions. Most of the time, you just need to replace 'yum' with 'apt-get'.
Refer to the support page below to figure out the equivalent in other Linux distributions: 35 | https://help.ubuntu.com/community/SwitchingToUbuntu/FromLinux/RedHatEnterpriseLinuxAndFedora 36 | 37 | > You can install gcc from the repository by typing: 38 | 39 | yum -y install gcc 40 | 41 | To include C++ support as well, you may want to install gcc-c++. With this package, the g++ driver compiles files whose extensions indicate C source as C++, instead of as C. You can install it by typing: 42 | 43 | yum -y install gcc-c++ 44 | gcc --version 45 | 46 | Depending on your CentOS version, the packaged GCC may not be the latest release. On my machine (CentOS 7), the version distributed with the OS is GCC 4.8.5. 47 | 48 | Since we need GCC 7.3 or higher, we may need to install it from source. To do so, follow the next steps: 49 | 50 | > Install GCC from source by typing 51 | 52 | wget http://ftp.mirrorservice.org/sites/sourceware.org/pub/gcc/releases/gcc-7.3.0/gcc-7.3.0.tar.gz 53 | tar zxf gcc-7.3.0.tar.gz 54 | cd gcc-7.3.0 55 | 56 | Then, you have to install bzip2 and run the ‘download_prerequisites’ script from the top level of the GCC source tree to download some prerequisites needed by GCC: 57 | 58 | yum -y install bzip2 59 | ./contrib/download_prerequisites 60 | 61 | Once all the prerequisites are installed, configure the GCC build environment by running the following 'configure' script: 62 | 63 | ./configure --disable-multilib --enable-languages=c,c++ 64 | 65 | Finally, run the following commands to compile the source code.
It is recommended to start a screen session before you run them, since the compilation may take more than an hour. 66 | 67 | screen -S gcc 68 | make -j 4 69 | make install 70 | 71 | ### Installing Boost libraries 72 | Boost is a set of libraries for the C++ programming language that provide support for tasks and structures such as linear algebra, pseudorandom number generation, multithreading, image processing, regular expressions, and unit testing. EPSAPG makes use of several optimized components available in Boost, namely the tokenizer, string algorithms, and uuid libraries. To install Boost, type the following: 73 | 74 | yum install boost-devel 75 | 76 | ### Installing MMseqs2 77 | 78 | MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein/nucleotide sequence sets. It is recommended to compile MMseqs2 from source, as the build is then optimized for the specific system, which should improve its performance. 79 | 80 | In order to compile from source, clone the repo from GitHub and run the following commands (copied from the GitHub README file): 81 | 82 | git clone https://github.com/soedinglab/MMseqs2.git 83 | cd MMseqs2 84 | mkdir build 85 | cd build 86 | cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. .. 87 | make 88 | make install 89 | export PATH=$(pwd)/bin/:$PATH 90 | 91 | Now you have to build the MMseqs2 database of your fasta database and then compile an index table for it in order to achieve faster runs later. 92 | For more details about the project and the commands to use for building the DB and index table, refer to the repository on GitHub: 93 | 94 | https://github.com/soedinglab/MMseqs2 95 | 96 | 97 | ### Installing PSI-BLAST 98 | BLAST is the Basic Local Alignment Search Tool. It uses an index to rapidly search large sequence databases.
Please refer to the following link to install it: 99 | 100 | https://angus.readthedocs.io/en/2019/running-command-line-blast.html#what-is-blast 101 | 102 | ### Compiling EPSAPG 103 | To compile EPSAPG, clone the repo then run 'make'. After successful compilation, the executable will be placed in the bin folder. Then add the bin folder to the PATH environment variable: 104 | 105 | git clone https://github.com/issararab/EPSAPG.git 106 | cd EPSAPG 107 | make 108 | export PATH=$(pwd)/bin/:$PATH 109 | 110 | ### EPSAPG Command parameters 111 | | Parameter | Mandatory/Optional | Input | Description | 112 | | --- | --- | --- | --- | 113 | | -query | Mandatory | Fasta_File_In | Input file name in fasta format | 114 | | -interm_path | Optional | folder_path | If run with indextable, which is the recommended option, MMseqs2 stores intermediate results in tmp folder. (If not specified, the default path is 'tmp' folder in the same location as targetDB. Same one used while indexing the DB.) | 115 | | -db_load_mode | Optional | int_value[1/2] | Option for using instant search power of MMseqs2, set to 2. (default 1, loads db from disk). This option enhances the search unless the indextable is loaded and locked into memory. Refer to: How to search small query sets fast in the MMseqs2 documentation. | 116 | | -dbsize | Optional | size | Effective length of the database. Default uses the amino acid length of MMseqs2 output result. For accurate statistics of the profile, provide the main database length. | 117 | | -use_sw_tback | Optional | boolean[0/1] | Compute locally optimal Smith-Waterman alignments (default is set to 0) | 118 | | -max_seqs | Optional | int_value | Maximum result sequences per query (this parameter affects the sensitivity). Default is set to 1000.
| 119 | | -threads | Optional | int_value | Number of cores used for the computation (uses all cores of the machine by default) | 120 | | -output_profile | Optional | boolean[0/1] | Enable or disable outputting the alignment profile of a query sequence against a database. If disabled, it will print on console. (enabled by default) | 121 | | -output_pssm | Optional | boolean[0/1] | Enable or disable outputting the position-specific scoring matrix checkpoint file (disabled by default) | 122 | | -output_ascii_pssm | Optional | boolean[0/1] | Enable or disable outputting the ASCII version of the position-specific scoring matrix checkpoint file (disabled by default) | 123 | 124 | ### How to Search Using EPSAPG 125 | 126 | :exclamation: To run the pipeline successfully, first create the MMseqs2 database and index table of your main DB with the name **targetDB**. The pipeline identifies **targetDB** as the database to search on. The following are the three main commands to search with EPSAPG (read details in the MMseqs2 documentation cited above): 127 | 128 | # Compile an MMseqs2 database 129 | mmseqs createdb main/DB.fasta targetDB 130 | # Compile an index table of the DB 131 | mmseqs createindex targetDB tmp 132 | # Launch the search 133 | epsapg -query query/example.fasta 134 | 135 | # Alternatively, specify the intermediate folder path explicitly 136 | mmseqs createindex targetDB /path/tmp 137 | epsapg -query query/example.fasta -interm_path /path/tmp 138 | 139 | **Note:** The MMseqs2 database and index table are compiled only once, as long as the main database does not change. Refer to the "EPSAPG Command parameters" section for more search options. 
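Each run also leaves an `isar.pair` file in the working directory which, judging from epsapg.cpp, maps every query header to the prefix of its result files (`isar.result.q<N>` followed by `.psiblast`, `.pssm`, or `.ascii.pssm`). As a minimal sketch, assuming that tab-separated layout, one line of `isar.pair` could be split as shown here; `parsePairLine` is a hypothetical helper and not part of EPSAPG:

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not part of EPSAPG): splits one line of isar.pair into
// {query header, result file prefix}. Assumes the layout written by epsapg.cpp:
//   <header>\tisar.result.q<N>{.psiblast | .pssm | .ascii.pssm}
std::pair<std::string, std::string> parsePairLine(const std::string& line) {
    // The header may contain spaces, so split on the last tab only.
    const std::size_t tab = line.rfind('\t');
    if (tab == std::string::npos)
        return {line, ""};  // no tab: treat the whole line as a header
    const std::string header = line.substr(0, tab);
    const std::string rest = line.substr(tab + 1);
    // Keep everything before the '{' that introduces the list of extensions.
    const std::size_t brace = rest.find('{');
    return {header, rest.substr(0, brace)};
}
```

For a line such as `seq1<TAB>isar.result.q1{.psiblast | .pssm | .ascii.pssm}`, this yields the header `seq1` and the prefix `isar.result.q1`, to which the wanted extension can then be appended.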
140 | 141 | Enjoy your super fast alignments :) 142 | -------------------------------------------------------------------------------- /custombuf.cpp: -------------------------------------------------------------------------------- 1 | #include "custombuf.h" 2 | #include <iostream> 3 | #include <streambuf> 4 | #include <string> 5 | 6 | using namespace std; 7 | 8 | custombuf::custombuf(string& target) : target_(target) { 9 | this->setp(this->buffer_, this->buffer_ + bufsize - 1); 10 | } 11 | 12 | int custombuf::overflow(int c) { 13 | if (!traits_type::eq_int_type(c, traits_type::eof())) { 14 | *this->pptr() = traits_type::to_char_type(c); 15 | this->pbump(1); 16 | } 17 | this->target_.append(this->pbase(), this->pptr() - this->pbase()); 18 | this->setp(this->buffer_, this->buffer_ + bufsize - 1); 19 | return traits_type::not_eof(c); 20 | } 21 | 22 | int custombuf::sync() { 23 | this->overflow(traits_type::eof()); 24 | return 0; 25 | } 26 | -------------------------------------------------------------------------------- /custombuf.h: -------------------------------------------------------------------------------- 1 | #ifndef CUSTOMBUF_H // Header File - for function prototypes and variables 2 | #define CUSTOMBUF_H 3 | #include <iostream> 4 | #include <streambuf> 5 | #include <string> 6 | 7 | using namespace std; 8 | 9 | // streambuf is a base class in C++ that provides an interface for reading from 10 | // or writing to a sequence of characters. The class improves performance in 11 | // the case of file reading by reading data from the underlying operating system 12 | // in larger chunks.
13 | class custombuf : public streambuf { 14 | public: 15 | custombuf(string& target); 16 | 17 | private: 18 | string& target_; 19 | enum { bufsize = 8192 }; 20 | char buffer_[bufsize]; 21 | int overflow(int c); 22 | int sync(); 23 | }; 24 | 25 | 26 | #endif // CUSTOMBUF_H -------------------------------------------------------------------------------- /epsapg.cpp: -------------------------------------------------------------------------------- 1 | #include "custombuf.h" 2 | #include "utils.h" 3 | #include <algorithm> 4 | #include <cstdio> 5 | #include <cstdlib> 6 | #include <future> 7 | #include <iostream> 8 | #include <map> 9 | #include <thread> 10 | #include <unistd.h> 11 | #include <vector> 12 | #include <boost/uuid/uuid.hpp> // uuid class 13 | #include <boost/uuid/uuid_generators.hpp> // generators 14 | #include <boost/uuid/uuid_io.hpp> // streaming operators etc. 15 | #include <boost/lexical_cast.hpp> // cast uuid to string 16 | #include <boost/tokenizer.hpp> 17 | #include <boost/algorithm/string.hpp> 18 | 19 | using namespace std; 20 | //using namespace boost; 21 | 22 | int main(int argc, char **argv) { 23 | typedef boost::tokenizer<boost::char_separator<char>> tokenizer; 24 | string isar_tuple,isar_pair,isar_pair_queries,single_query_data; 25 | string query,temp; 26 | int queryCount; 27 | map<string,int> queryDictionary; 28 | //parameters of the software command 29 | map<string,string> commandParameters; 30 | string parameters[] = { 31 | "-query", 32 | "-db_load_mode", 33 | "-dbsize", 34 | "-interm_path", 35 | "-use_sw_tback", 36 | "-max_seqs", 37 | "-threads", 38 | "-output_profile", 39 | "-output_pssm", 40 | "-output_ascii_pssm" 41 | }; 42 | // Create a std::promise object 43 | promise<void> exitSignal; 44 | //Fetch std::future object associated with promise 45 | future<void> futureObj = exitSignal.get_future(); 46 | //multiple threads datastructure 47 | vector<thread> threads; 48 | unsigned concurentThreadsSupported = 0; 49 | boost::uuids::uuid uuid = boost::uuids::random_generator()(); 50 | string UUID = "-"+boost::lexical_cast<string>(uuid); 51 | 52 | //parsing the command 53 | if (argc <= 2){ 54 | if(argc == 1) 55 | help(true); 56 | else if(string(argv[1]) == "-h") 57 | help(false); 58 | else if(string(argv[1]) == "-help") 59 | help(true); 60 | else { 61 | printf("You
have to provide a command in the following format:\n"); 62 | help(true); 63 | } 64 | return 0; 65 | } else { 66 | if(argc%2 == 0) { 67 | printf("You have to provide a command in the following format:\n"); 68 | help(false); 69 | return 0; 70 | } 71 | for(int i=1;i<argc;i+=2) { 72 | //accept only the known parameters listed above 73 | if(find(begin(parameters), end(parameters), string(argv[i])) != end(parameters)) 74 | commandParameters.insert( pair<string,string>(argv[i],argv[i+1]) ); 75 | else { 76 | printf("\n Unknown parameter: %s \nYou have to provide a command in the following format:\n",argv[i]); 77 | help(false); 78 | return 0; 79 | } 80 | } 81 | } 82 | 83 | //clean folder from previous results 84 | system("rm -f isar.result.*"); 85 | //check if the command includes an input file 86 | if ( commandParameters.find("-query") == commandParameters.end() ) { 87 | printf("\n Please provide an input fasta file. '-query' parameter is mandatory to run the software. \n" 88 | "You have to provide a command in the following format:\n"); 89 | help(false); 90 | return 0; 91 | } 92 | query = commandParameters.find("-query")->second; 93 | if (query == "queryDB" || boost::starts_with(query, "isar")) 94 | { 95 | printf("'queryDB' and names starting with 'isar' are reserved file names.\n" 96 | "Please consider changing your file name and run the search again.\nGood luck!"); 97 | return 0; 98 | } 99 | printf("###############################\n"); 100 | printf("# Starting MMSEQS2 #\n"); 101 | printf("###############################\n"); 102 | 103 | //Check if mmseqs2 is installed 104 | if (system("which mmseqs > /dev/null 2>&1")) { 105 | cout << "mmseqs: command not found.\n\tPlease make sure you have installed MMseqs2 on your" 106 | " machine and included its path to the environment variables.\n"; 107 | return 0; 108 | } 109 | 110 | //Check if psiblast is installed 111 | if (system("which psiblast > /dev/null 2>&1")) { 112 | cout << "psiblast: command not found.\n\tPlease make sure you have installed psiblast on" 113 | " your machine and included its path to the environment variables.\n"; 114 | return 0; 115 | } 116 | 117 | //Constructing the command according to the parameters
provided 118 | string max_seqs,interm_path; 119 | if ( commandParameters.find("-max_seqs") == commandParameters.end() ) 120 | max_seqs = "1000"; 121 | else 122 | max_seqs = commandParameters.find("-max_seqs")->second; 123 | if ( commandParameters.find("-interm_path") == commandParameters.end() ) 124 | interm_path = "tmp"; 125 | else 126 | interm_path = commandParameters.find("-interm_path")->second; 127 | 128 | //start mmseqs search 129 | system(("mmseqs createdb "+query+" queryDB"+UUID).data()); 130 | if ( commandParameters.find("-db_load_mode") != commandParameters.end() && commandParameters.find("-db_load_mode")->second == "2") { 131 | if ( commandParameters.find("-threads") != commandParameters.end() ) { 132 | system(("mmseqs search --threads " + commandParameters.find("-threads")->second + " --num-iterations 1 --max-seqs " + 133 | max_seqs + " queryDB" + UUID + " targetDB resultDB" + UUID + " " + interm_path + " --db-load-mode 2").data()); 134 | system(("mmseqs convertalis --threads " + commandParameters.find("-threads")->second + " queryDB" + UUID + " targetDB resultDB" + 135 | UUID + " isar.tuple --format-output \"qheader,theader,tseq\" --db-load-mode 2").data()); 136 | } else { 137 | system(("mmseqs search --num-iterations 1 --max-seqs " + max_seqs + " queryDB" + UUID + " targetDB resultDB" + UUID + " " + 138 | interm_path + " --db-load-mode 2").data()); 139 | system(("mmseqs convertalis queryDB" + UUID + " targetDB resultDB" + UUID + 140 | " isar.tuple --format-output \"qheader,theader,tseq\" --db-load-mode 2").data()); 141 | } 142 | } else { 143 | if ( commandParameters.find("-threads") != commandParameters.end() ) { 144 | system(("mmseqs search --threads " + commandParameters.find("-threads")->second + " --num-iterations 1 --max-seqs " + max_seqs + 145 | " queryDB" + UUID + " targetDB resultDB" + UUID + " " + interm_path).data()); 146 | system(("mmseqs convertalis --threads " + commandParameters.find("-threads")->second + " queryDB" + UUID + " targetDB 
resultDB" + 147 | UUID + " isar.tuple --format-output \"qheader,theader,tseq\"").data()); 148 | } else { 149 | system(("mmseqs search --num-iterations 1 --max-seqs " + max_seqs + " queryDB" + UUID + " targetDB resultDB" + 150 | UUID + " " + interm_path).data()); 151 | system(("mmseqs convertalis queryDB" + UUID + " targetDB resultDB" + UUID + 152 | " isar.tuple --format-output \"qheader,theader,tseq\"").data()); 153 | } 154 | } 155 | 156 | //delete intermediate files 157 | system(("rm -f queryDB" + UUID).data()); 158 | system(("rm -f queryDB" + UUID+".*").data()); 159 | system(("rm -f queryDB" + UUID+"_h").data()); 160 | system(("rm -f queryDB" + UUID+"_h.*").data()); 161 | system(("rm -f resultDB" + UUID+"").data()); 162 | system(("rm -f resultDB" + UUID+".*").data()); 163 | printf("###############################\n"); 164 | printf("# Parsing MMSEQS2's output #\n"); 165 | printf("###############################\n"); 166 | 167 | // Start the animation thread and move the future object into it 168 | thread th(&loadingAnimation, move(futureObj)); 169 | 170 | //get the list of query headers and create a file, isar.pair, referencing all the corresponding outputs 171 | queryCount = 0; 172 | custombuf sbuf(isar_pair_queries); 173 | //if(fexists("isar.pair")) 174 | system("rm -f isar.pair"); 175 | 176 | if (ostream(&sbuf) << ifstream(query).rdbuf() << flush) { 177 | boost::char_separator<char> sep{">"}; 178 | tokenizer tok{isar_pair_queries, sep}; 179 | for (const auto &q : tok) { 180 | temp = q; 181 | boost::trim(temp); 182 | writeToFile("isar.q." + to_string(++queryCount),">"+temp,false); 183 | writeToFile("isar.db."
+ to_string(queryCount),"",false); 184 | boost::char_separator<char> querySep{"\n"}; 185 | tokenizer tok2{temp, querySep}; 186 | tokenizer::iterator token = tok2.begin(); 187 | temp = *token; 188 | boost::trim(temp); 189 | writeToFile("isar.pair", temp + "\tisar.result.q" + to_string(queryCount) + 190 | "{.psiblast | .pssm | .ascii.pssm}\n",true); 191 | queryDictionary.insert ( pair<string,int>(temp,queryCount) ); 192 | } 193 | }else { 194 | cout << "failed to read the query file.\n"; 195 | exitSignal.set_value(); th.join(); return 0; //stop the animation thread before exiting 196 | } 197 | //Parsing isar.tuple - The output file containing all the alignments from MMseqs2 198 | string dbOutput = ""; 199 | custombuf sbuf2(isar_tuple); 200 | if (ostream(&sbuf2) << ifstream("isar.tuple").rdbuf() << flush) { 201 | string previousQueryHeader = ""; 202 | string targetFile = ""; 203 | boost::char_separator<char> sep1{"\n"}; 204 | tokenizer tok1{isar_tuple, sep1}; 205 | for (const auto &row : tok1) { 206 | boost::char_separator<char> sep2{"\t"}; 207 | tokenizer tok2{row, sep2}; 208 | tokenizer::iterator col = tok2.begin(); 209 | temp = *col;//get the query header 210 | boost::trim(temp);//trim the query id 211 | //block to avoid searching in the dictionary for the query id for each line/sequence retrieved 212 | if(previousQueryHeader != temp) { 213 | if(previousQueryHeader != "") { 214 | writeToFile(targetFile,dbOutput,false); 215 | dbOutput = ""; 216 | } 217 | //get the query id and assign it to the target result file 218 | targetFile = "isar.db."+to_string(queryDictionary.find(temp)->second); 219 | previousQueryHeader = temp; 220 | } 221 | temp = *(++col);//get the retrieved sequence header 222 | boost::trim(temp); 223 | //writeToFile(targetFile,">"+temp+"\n",true); 224 | dbOutput += ">"+temp+"\n"; 225 | temp = *(++col);//get the retrieved sequence 226 | boost::trim(temp); 227 | //writeToFile(targetFile,temp+"\n",true); 228 | dbOutput += temp+"\n"; 229 | } 230 | //write to file the db corresponding to the last query in the batch 231 | writeToFile(targetFile,dbOutput,false); 232 |
//Set the value in promise 233 | exitSignal.set_value(); 234 | 235 | //Wait for thread to join 236 | th.join(); 237 | printf("\nParsing Complete!\n"); 238 | sleep(1); 239 | }else { 240 | cout << "failed to read isar.tuple, the search result file from MMseqs2.\n"; 241 | exitSignal.set_value(); th.join(); return 0; //stop the animation thread before exiting 242 | } 243 | 244 | 245 | queryDictionary.clear(); 246 | system("rm -f isar.tuple"); 247 | printf("###############################\n"); 248 | printf("# Starting PsiBlast #\n"); 249 | printf("###############################\n"); 250 | 251 | sleep(2); 252 | system("rm -f isar.result.q*"); 253 | 254 | if ( commandParameters.find("-threads") != commandParameters.end() ) 255 | concurentThreadsSupported = std::stoul (commandParameters.find("-threads")->second,nullptr,0); 256 | else 257 | concurentThreadsSupported = thread::hardware_concurrency(); 258 | //extract parameter values for psiblast runs 259 | string dbsize, use_sw_tback, output_profile, output_pssm, output_ascii_pssm; 260 | if ( commandParameters.find("-dbsize") != commandParameters.end() ) 261 | dbsize = commandParameters.find("-dbsize")->second; 262 | if ( commandParameters.find("-use_sw_tback") != commandParameters.end() ) 263 | use_sw_tback = commandParameters.find("-use_sw_tback")->second; 264 | if ( commandParameters.find("-output_profile") != commandParameters.end() ) 265 | output_profile = commandParameters.find("-output_profile")->second; 266 | if ( commandParameters.find("-output_pssm") != commandParameters.end() ) 267 | output_pssm = commandParameters.find("-output_pssm")->second; 268 | if ( commandParameters.find("-output_ascii_pssm") != commandParameters.end() ) 269 | output_ascii_pssm = commandParameters.find("-output_ascii_pssm")->second; 270 | 271 | if(concurentThreadsSupported) { 272 | //Launch a group of threads 273 | int step = queryCount/concurentThreadsSupported; 274 | //Stratified distribution of thread jobs 275 | for (int i = 0; i < concurentThreadsSupported; ++i) { 276 | if(i <
queryCount%concurentThreadsSupported) 277 | threads.push_back(thread(runPsiblast, 1 + step * i,step * ( i + 1), queryCount-i, dbsize, 278 | use_sw_tback, output_profile, output_pssm, output_ascii_pssm)); 279 | else 280 | threads.push_back(thread(runPsiblast, 1 + step * i,step * ( i + 1), 0, dbsize, 281 | use_sw_tback, output_profile, output_pssm, output_ascii_pssm)); 282 | } 283 | } else 284 | threads.push_back(thread(runPsiblast, 1, queryCount, 0, dbsize, use_sw_tback, 285 | output_profile, output_pssm, output_ascii_pssm)); 286 | 287 | 288 | //Join the threads with the main thread 289 | for(auto &th : threads) { 290 | th.join(); 291 | } 292 | printf("###############################\n"); 293 | printf("# End of Search #\n"); 294 | printf("###############################\n"); 295 | return 0; 296 | } -------------------------------------------------------------------------------- /epsapg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/issararab/EPSAPG/3efa81516a13a281ca3d7ab3fa0b052f52b57aeb/epsapg.png -------------------------------------------------------------------------------- /utils.cpp: -------------------------------------------------------------------------------- 1 | #include "custombuf.h" 2 | #include "utils.h" 3 | #include <chrono> 4 | #include <cstdlib> 5 | #include <fstream> 6 | #include <future> 7 | #include <iostream> 8 | #include <unistd.h> 9 | 10 | using namespace std; 11 | 12 | void writeToFile(const string& name, const string& content, bool append) { 13 | ofstream outfile; 14 | if (append) 15 | outfile.open(name, ios_base::app); 16 | else 17 | outfile.open(name); 18 | outfile << content; 19 | } 20 | 21 | // Animation on console while parsing 22 | void loadingAnimation(future<void> futureObj) { 23 | cout << "Parsing the result"; 24 | while (futureObj.wait_for(chrono::milliseconds(1)) == future_status::timeout) { 25 | for (int i = 0; i < 4; i++) { 26 | cout << "."; 27 | cout.flush(); 28 | sleep(1); 29 | } 30 | cout << "\b\b\b \b\b\b\b"; 31 | }
32 | } 33 | 34 | bool is_file_empty(ifstream& pFile) { 35 | return pFile.peek() == ifstream::traits_type::eof(); 36 | } 37 | 38 | bool fexists(const char* filename) { 39 | ifstream ifile(filename); 40 | return (bool)ifile; 41 | } 42 | 43 | void psiBlast(int i, string dbsize, string use_sw_tback, string output_profile, 44 | string output_pssm, string output_ascii_pssm) { 45 | string single_query_data; 46 | // If the query db is empty (no hits), use the query itself as the db 47 | ifstream file("isar.db." + to_string(i)); 48 | if (is_file_empty(file)) { 49 | custombuf sbuf3(single_query_data); 50 | if (ostream(&sbuf3) << ifstream("isar.q." + to_string(i)).rdbuf() << flush) { 51 | writeToFile("isar.db." + to_string(i), single_query_data, false); 52 | } else { 53 | cout << "failed to read query file number " + to_string(i) + ".\n"; 54 | } 55 | } 56 | system(("makeblastdb -in isar.db." + to_string(i) + " -dbtype prot").data()); 57 | 58 | string cmd = "psiblast -query isar.q." + to_string(i) + " -db isar.db." + 59 | to_string(i) + " -evalue 10"; 60 | if (!dbsize.empty()) 61 | cmd += " -dbsize " + dbsize; 62 | 63 | if (use_sw_tback == "1") 64 | cmd += " -use_sw_tback"; 65 | 66 | if (output_profile.empty() || output_profile == "1") 67 | cmd += " -out isar.result.q" + to_string(i) + ".psiblast"; 68 | 69 | if (output_pssm == "1") 70 | cmd += " -out_pssm isar.result.q" + to_string(i) + ".pssm"; 71 | 72 | if (output_ascii_pssm == "1") 73 | cmd += " -out_ascii_pssm isar.result.q" + to_string(i) + ".ascii.pssm"; 74 | 75 | if (output_pssm == "1" || output_ascii_pssm == "1") 76 | cmd += " -save_pssm_after_last_round"; 77 | 78 | system((cmd).data()); 79 | 80 | system(("rm -f isar.q." + to_string(i)).data()); 81 | system(("rm -f isar.db." + to_string(i)).data()); 82 | system(("rm -f isar.db."
+ to_string(i) + ".*").data()); 83 | } 84 | 85 | void runPsiblast(int startQ, int endQ, int additionalQ, string dbsize, string use_sw_tback, 86 | string output_profile, string output_pssm, string output_ascii_pssm) { 87 | for (int i = startQ; i <= endQ; i++) 88 | psiBlast(i, dbsize, use_sw_tback, output_profile, output_pssm, output_ascii_pssm); 89 | 90 | if (additionalQ) 91 | psiBlast(additionalQ, dbsize, use_sw_tback, output_profile, output_pssm, output_ascii_pssm); 92 | } 93 | 94 | void help(bool detailed){ 95 | cout << " ++USAGE \n" 96 | << " epsapg [-h] [-help] [-query input_file] \n" 97 | << " [-db_load_mode int_value[1|2] ] [-dbsize num_amino_acids] \n" 98 | << " [-interm_path path_to_intermediate_folder] [-use_sw_tback boolean[0|1] ] \n" 99 | << " [-max_seqs int_value] [-threads num_threads] [-output_profile boolean [0|1] ] \n" 100 | << " [-output_pssm boolean [0|1] ] [-output_ascii_pssm boolean[0|1] ] \n" 101 | << "\n ++DESCRIPTION \n" 102 | << " EPSAPG: An Extensive Protein Sequence Alignment Profiles Generator Combining \n" 103 | << " MMseqs2 and PSI-BLAST in a Single Pipeline. It can generate both profiles and PSSMs.\n" 104 | << endl; 105 | if(!detailed) 106 | return; 107 | cout << "\n ++MANDATORY PARAMETERS \n" 108 | << " -query \n" 109 | << " Input file name in fasta format \n" 110 | << "\n ++OPTIONAL PARAMETERS " 111 | << "\n -interm_path " 112 | << "\n If run with indextable, which is the recommended option, " 113 | << "\n MMseqs2 stores intermediate results in tmp folder. Using a fast" 114 | << "\n local drive can reduce load on a shared filesystem and increase" 115 | << "\n speed.(If not specified, the default path is 'tmp' folder in the" 116 | << "\n same location as targetDB. Same one used while indexing the DB.)\n" 117 | << "\n -db_load_mode " 118 | << "\n Option for using instant search power of MMseqs2, set to 2." 119 | << "\n (default 1, loads db from disk).
This option enhances the search unless" 120 | << "\n the indextable is loaded and locked into memory. Refer to: How to search" 121 | << "\n small query sets fast in the MMseqs2 documentation.\n" 122 | << "\n -dbsize " 123 | << "\n Effective length of the database. Default uses the amino acid length of" 124 | << "\n MMseqs2 output result. For accurate statistics of the profile, provide " 125 | << "\n the main database length.\n" 126 | << "\n -use_sw_tback " 127 | << "\n Compute locally optimal Smith-Waterman alignments(default is set to 0) \n" 128 | << "\n -max_seqs " 129 | << "\n maximum result sequences per query (this parameter affects the sensitivity)." 130 | << "\n Default is set to 1000.\n" 131 | << "\n -threads " 132 | << "\n number of cores used for the computation (uses all cores of the machine by default)\n" 133 | << "\n -output_profile " 134 | << "\n Enable or disable outputting alignment profile of a query sequence against a database." 135 | << "\n If disabled, it will print on console.
(enabled by default)\n" 136 | << "\n -output_pssm " 137 | << "\n Enable or disable outputting position-specific scoring matrix checkpoint file" 138 | << "\n (disabled by default)\n" 139 | << "\n -output_ascii_pssm " 140 | << "\n Enable or disable outputting ASCII version of position-specific scoring matrix" 141 | << "\n checkpoint file (disabled by default)" 142 | << "\n" 143 | << "\n" 144 | << endl; 145 | } -------------------------------------------------------------------------------- /utils.h: -------------------------------------------------------------------------------- 1 | #ifndef UTILS_H // Header File - for function prototypes and variables 2 | #define UTILS_H 3 | #include <chrono> 4 | #include <fstream> 5 | #include <future> 6 | #include <iostream> 7 | #include <string> 8 | 9 | using namespace std; 10 | 11 | void writeToFile(const string& name, const string& content, bool append); 12 | 13 | //animation on console while parsing 14 | void loadingAnimation(future<void> futureObj); 15 | 16 | bool is_file_empty(ifstream& pFile); 17 | 18 | bool fexists(const char *filename); 19 | 20 | void psiBlast(int i, string dbsize, string use_sw_tback, string output_profile, 21 | string output_pssm, string output_ascii_pssm); 22 | 23 | void runPsiblast(int startQ, int endQ, int additionalQ, string dbsize, string use_sw_tback, 24 | string output_profile, string output_pssm, string output_ascii_pssm); 25 | 26 | void help(bool detailed); 27 | 28 | 29 | #endif // UTILS_H --------------------------------------------------------------------------------
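The way `main` in epsapg.cpp spreads queries over the PSI-BLAST worker threads can be sketched in isolation. The sketch assumes the same scheme as the source: each worker gets a contiguous block of `queryCount / workers` queries, and the remainder is handed out as one extra query per worker, counted down from the last query. `WorkerJob` and `partitionQueries` are hypothetical names introduced only for illustration and do not exist in the repository:

```cpp
#include <cassert>
#include <vector>

// One worker's share: the inclusive range [startQ, endQ] plus an optional
// extra query (0 means "none"), mirroring the arguments of runPsiblast.
struct WorkerJob {
    int startQ;
    int endQ;
    int additionalQ;
};

// Hypothetical helper: computes the jobs that main() would hand to each thread.
std::vector<WorkerJob> partitionQueries(int queryCount, unsigned workers) {
    assert(workers > 0);  // main() handles the zero-worker case separately
    const int n = static_cast<int>(workers);
    const int step = queryCount / n;        // size of each contiguous block
    const int remainder = queryCount % n;   // queries left after the blocks
    std::vector<WorkerJob> jobs;
    for (int i = 0; i < n; ++i) {
        // The first `remainder` workers each take one leftover query,
        // picked from the end of the query list.
        const int extra = (i < remainder) ? queryCount - i : 0;
        jobs.push_back({1 + step * i, step * (i + 1), extra});
    }
    return jobs;
}
```

With 10 queries and 4 workers, the jobs come out as (1 to 2, extra 10), (3 to 4, extra 9), (5 to 6) and (7 to 8), so every query index from 1 to 10 is assigned exactly once. When there are fewer queries than workers, `step` is 0, the ranges are empty, and only the extra slots carry work.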