├── .gitignore
├── LICENSE
├── Makefile
├── README.md
├── custombuf.cpp
├── custombuf.h
├── epsapg.cpp
├── epsapg.png
├── utils.cpp
└── utils.h
/.gitignore:
--------------------------------------------------------------------------------
1 | #Do not track content of these folders
2 | bin/
3 | logs/
4 | *.log
5 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Issar Arab
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 |
2 | # This Makefile compiles the software components, generates the executable,
3 | # and moves it to the bin folder, to be added later to the system environment
4 |
5 | all:
6 | # for version `GLIBCXX_3.4.21' not found
7 | export LD_LIBRARY_PATH=/usr/local/lib:/usr/lib:/usr/local/lib64:/usr/lib64
8 | g++ -o epsapg -std=c++0x epsapg.cpp utils.cpp custombuf.cpp -pthread
9 | rm -rf bin
10 | mkdir bin
11 | mv epsapg bin/.
12 |
13 |
14 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # EPSAPG
2 |
3 | EPSAPG is a pipeline combining modules from state-of-the-art methods to quickly generate extensive protein sequence alignment profiles. It is a fast alignment pipeline that relies on the speed of MMseqs2 for the search and generates PSI-BLAST output (profile and PSSM).
4 |
5 |
6 |
7 |
8 |
9 | I started researching this project idea back during my stay as an AI researcher at Rostlab, Technical University of Munich.
10 |
11 | Main entry point of the software is "epsapg.cpp".
12 |
13 | ## Publication
14 | If you use EPSAPG in your work, please cite the following publication:
15 |
16 | - I. Arab, **EPSAPG: A Pipeline Combining MMseqs2 and PSI-BLAST to Quickly Generate Extensive Protein Sequence Alignment Profiles**, In 2023 IEEE/ACM 10th International Conference on Big Data Computing, Applications and Technologies (BDCAT ’23), December 4–7, 2023, Taormina, Messina, Italy. ACM, New York, NY, USA, 9 pages. [doi:10.1145/3632366.3632384](https://doi.org/10.1145/3632366.3632384)
17 |
18 | # Prerequisites
19 | ## Hardware requirements
20 | :exclamation: Before you start compiling any code, make sure you have a machine that supports SSE4.1 or AVX2. You can check that by typing one of the following commands:
21 |
22 | cat /proc/cpuinfo | grep -c sse4_1 > /dev/null && echo "SSE4.1 Supported" || echo "SSE4.1 Unsupported"
23 |
24 | OR
25 |
26 | cat /proc/cpuinfo | grep -c avx2 > /dev/null && echo "AVX2 Supported" || echo "AVX2 Unsupported"
27 |
28 | Once the hardware requirements are met, you can move forward to the software requirements.
29 |
30 | ## Software requirements
31 | :exclamation: Compiling the following software will require 'git', 'g++' (7.3 or higher) and 'cmake' (3.0 or higher).
32 |
33 | ### Installing g++
34 | The commands below are tailored to CentOS 7, a Linux distribution. On other distributions, you can look up the equivalent command for each step. In most cases, you just need to replace 'yum' with 'apt-get'. Refer to the support page below to find the equivalents in other Linux distributions:
35 | https://help.ubuntu.com/community/SwitchingToUbuntu/FromLinux/RedHatEnterpriseLinuxAndFedora
36 |
37 | > You can install gcc from the repository by typing:
38 |
39 | yum -y install gcc
40 |
41 | To get the C++ compiler and library as well, you may want to install gcc-c++. With this package, files whose extensions indicate C source are compiled as C++ instead of as C. You can install it by typing:
42 |
43 | yum -y install gcc-c++
44 | gcc --version
45 |
46 | Depending on your CentOS version, the packaged GCC may not be the latest release available. On my machine, running CentOS 7, the version distributed with the OS is GCC 4.8.5.
47 |
48 | Since we need GCC 7.3 or higher, we may need to install it from source. To do so, follow the next steps:
49 |
50 | > Install GCC from source by typing
51 |
52 | wget http://ftp.mirrorservice.org/sites/sourceware.org/pub/gcc/releases/gcc-7.3.0/gcc-7.3.0.tar.gz
53 | tar zxf gcc-7.3.0.tar.gz
54 | cd gcc-7.3.0
55 |
56 | Then, you have to install bzip2 and run the ‘download_prerequisites’ script from the top level of the GCC source tree to download some prerequisites needed by GCC:
57 |
58 | yum -y install bzip2
59 | ./contrib/download_prerequisites
60 |
61 | After all the prerequisites are installed, configure the GCC build environment by running the following 'configure' script:
62 |
63 | ./configure --disable-multilib --enable-languages=c,c++
64 |
65 | Finally, run the following commands to compile the source code. It is recommended to start a screen session before you run them, since the compilation may take more than an hour:
66 |
67 | screen -S gcc
68 | make -j 4
69 | make install
70 |
71 | ### Installing Boost libraries
72 | Boost is a set of libraries for the C++ programming language that provide support for tasks and structures such as linear algebra, pseudorandom number generation, multithreading, image processing, regular expressions, and unit testing. EPSAPG makes use of several optimized algorithms available in Boost, namely the tokenizer, string algorithms, and uuid libraries. To install Boost, type the following:
73 |
74 | yum install boost-devel
75 |
76 | ### Installing MMseqs2
77 |
78 | MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein/nucleotide sequence sets. It is recommended to compile MMseqs2 from source, because the build is then optimized for the specific system, which should improve its performance.
79 |
80 | In order to compile from source, clone the repo from GitHub and run the following commands (copied from the GitHub README file):
81 |
82 | git clone https://github.com/soedinglab/MMseqs2.git
83 | cd MMseqs2
84 | mkdir build
85 | cd build
86 | cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
87 | make
88 | make install
89 | export PATH=$(pwd)/bin/:$PATH
90 |
91 | Now you have to build the MMseqs2 database of your fasta database and then compile an index table for it in order to achieve faster runs later.
92 | For more details about the project and the commands used to build the DB and index table, refer to the repository on GitHub:
93 |
94 | https://github.com/soedinglab/MMseqs2
95 |
96 |
97 | ### Installing PSI-BLAST
98 | BLAST is the Basic Local Alignment Search Tool. It uses an index to rapidly search large sequence databases. Please refer to the following link to install it:
99 |
100 | https://angus.readthedocs.io/en/2019/running-command-line-blast.html#what-is-blast
101 |
102 | ### Compiling EPSAPG
103 | To compile EPSAPG, clone the repo, then run 'make'. After successful compilation, the executable will be placed in the bin folder. After that, add the bin path to the PATH environment variable:
104 |
105 | git clone https://github.com/issararab/EPSAPG.git
106 | cd EPSAPG
107 | make
108 | export PATH=$(pwd)/bin/:$PATH
109 |
110 | ### EPSAPG Command parameters
111 | | Parameter | Mandatory/Optional | Input | Description |
112 | | --- | --- | --- | --- |
113 | | -query | Mandatory | Fasta_File_In | Input file name in fasta format |
114 | | -interm_path | Optional | folder_path | If run with an index table, which is the recommended option, MMseqs2 stores intermediate results in a tmp folder. (If not specified, the default path is the 'tmp' folder in the same location as targetDB, the same one used while indexing the DB.) |
115 | | -db_load_mode | Optional | int_value[1/2] | Set to 2 to use the instant search power of MMseqs2 (default 1, loads the DB from disk). This option enhances the search unless the index table is loaded and locked into memory. Refer to "How to search small query sets fast" in the MMseqs2 documentation. |
116 | | -dbsize | Optional | size | Effective length of the database. Default uses the amino acid length of MMseqs2 output result. For accurate statistics of the profile, provide the main database length. |
117 | | -use_sw_tback | Optional | boolean[0/1] | Compute locally optimal Smith-Waterman alignments (default is set to 0) |
118 | | -max_seqs | Optional | int_value | Maximum number of result sequences per query (this parameter affects the sensitivity). Default is set to 1000. |
119 | | -threads | Optional | int_value | Number of cores used for the computation (uses all cores of the machine by default) |
120 | | -output_profile | Optional | boolean[0/1] | Enable or disable outputting the alignment profile of a query sequence against a database. If disabled, it will print on console. (enabled by default) |
121 | | -output_pssm | Optional | boolean[0/1] | Enable or disable outputting the position-specific scoring matrix checkpoint file (disabled by default) |
122 | | -output_ascii_pssm | Optional | boolean[0/1] | Enable or disable outputting the ASCII version of the position-specific scoring matrix checkpoint file (disabled by default) |
123 |
124 | ### How to Search Using EPSAPG
125 |
126 | :exclamation: To run the pipeline successfully, first create the MMseqs2 database and index table of your main DB under the name **targetDB**. The pipeline identifies **targetDB** as the database to search. The following are the three main commands to search with EPSAPG (read the details in the MMseqs2 documentation cited above):
127 |
128 | # Compile an MMseqs2 database
129 | mmseqs createdb main/DB.fasta targetDB
130 | # Compile an index table of the DB
131 | mmseqs createindex targetDB tmp
132 | # Launch the search
133 | epsapg -query query/example.fasta
134 |
135 | # Alternative: specify the path of the intermediate folder explicitly
136 | mmseqs createindex targetDB /path/tmp
137 | epsapg -query query/example.fasta -interm_path /path/tmp
138 |
139 | **Note:** The MMseqs2 database and index table are compiled only once, as long as the main database does not change. Refer to the EPSAPG command parameters section above for more search options.
140 |
141 | Enjoy your super fast alignments :)
142 |
--------------------------------------------------------------------------------
/custombuf.cpp:
--------------------------------------------------------------------------------
1 | #include "custombuf.h"
2 | #include
3 | #include
4 | #include
5 |
6 | using namespace std;
7 |
8 | custombuf::custombuf(string& target) : target_(target) {
9 | this->setp(this->buffer_, this->buffer_ + bufsize - 1); // keep one spare byte for overflow()
10 | }
11 |
12 | int custombuf::overflow(int c) {
13 | if (!traits_type::eq_int_type(c, traits_type::eof())) {
14 | *this->pptr() = traits_type::to_char_type(c); // the spare byte guarantees room for c
15 | this->pbump(1);
16 | }
17 | this->target_.append(this->pbase(), this->pptr() - this->pbase()); // move buffered chars to the target string
18 | this->setp(this->buffer_, this->buffer_ + bufsize - 1); // reset the put area
19 | return traits_type::not_eof(c);
20 | }
21 |
22 | int custombuf::sync() {
23 | this->overflow(traits_type::eof()); // flush pending chars without adding a new one
24 | return 0;
25 | }
26 |
--------------------------------------------------------------------------------
/custombuf.h:
--------------------------------------------------------------------------------
1 | #ifndef CUSTOMBUF_H // Header File - for function prototypes and variables
2 | #define CUSTOMBUF_H
3 | #include
4 | #include
5 | #include
6 |
7 | using namespace std;
8 |
9 | // streambuf is a base class in C++ that provides a buffered interface for
10 | // reading from or writing to a sequence of characters. custombuf uses it for
11 | // output: characters are collected in a fixed-size buffer and appended to the
12 | // target string whenever the buffer fills up or the stream is flushed.
13 | class custombuf : public streambuf {
14 | public:
15 | custombuf(string& target);
16 |
17 | private:
18 | string& target_;
19 | enum { bufsize = 8192 };
20 | char buffer_[bufsize];
21 | int overflow(int c);
22 | int sync();
23 | };
24 |
25 |
26 | #endif // CUSTOMBUF_H
--------------------------------------------------------------------------------
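Usage note for the stream buffer defined in custombuf.h and custombuf.cpp above: the following is a minimal, self-contained sketch of how such a buffer can be attached to a `std::ostream` so that everything written to the stream ends up in a string. It is not taken from epsapg.cpp. The names `string_sink_buf` and `capture_demo` are hypothetical, and the 16-byte buffer is an assumption chosen only so that the example exercises the overflow path (the repository's class uses 8192 bytes).

```cpp
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical stand-in for the repository's custombuf: a stream buffer that
// collects everything written to it into a caller-owned std::string.
class string_sink_buf : public std::streambuf {
public:
    explicit string_sink_buf(std::string& target) : target_(target) {
        // Leave the last byte out of the put area so that overflow() always
        // has room to store the character it receives.
        setp(buffer_, buffer_ + kBufSize - 1);
    }

protected:
    // Called when the put area is full, or by sync() with EOF.
    int_type overflow(int_type c) override {
        if (!traits_type::eq_int_type(c, traits_type::eof())) {
            *pptr() = traits_type::to_char_type(c);
            pbump(1);
        }
        // Append the buffered characters to the target, then reset the put area.
        target_.append(pbase(), static_cast<std::size_t>(pptr() - pbase()));
        setp(buffer_, buffer_ + kBufSize - 1);
        return traits_type::not_eof(c);
    }

    // Called when the stream is flushed: push pending characters to the target.
    int sync() override {
        overflow(traits_type::eof());
        return 0;
    }

private:
    enum { kBufSize = 16 };  // deliberately small for this demo (assumption)
    std::string& target_;
    char buffer_[kBufSize];
};

// Writes through a std::ostream and returns what reached the string.
std::string capture_demo() {
    std::string captured;
    string_sink_buf buf(captured);
    std::ostream out(&buf);
    out << "query_1\t" << 42 << '\n';
    out << "a line longer than the sixteen-byte buffer\n";
    out.flush();  // without the flush, the last buffered bytes stay in buffer_
    return captured;
}
```

Calling `capture_demo()` from a `main` function returns both lines exactly as written to the stream. The explicit `flush()` matters: destroying a `std::ostream` does not flush its buffer, so any characters still held in the put area would otherwise never reach the string.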
/epsapg.cpp:
--------------------------------------------------------------------------------
1 | #include "custombuf.h"
2 | #include "utils.h"
3 | #include
4 | #include
5 | #include
6 | #include
7 | #include