├── logo.png
├── tests
│   ├── DMELA_1.fq.gz
│   ├── DMELA_2.fq.gz
│   ├── DWILL_1.fq.gz
│   ├── DWILL_2.fq.gz
│   ├── ref
│   │   ├── OG0003709.fa
│   │   ├── OG0004212.fa
│   │   ├── OG0003820.fa
│   │   ├── OG0003531.fa
│   │   └── OG0003977.fa
│   └── run_test.sh
├── lib
│   ├── __pycache__
│   │   ├── map.cpython-37.pyc
│   │   ├── main.cpython-312.pyc
│   │   ├── main.cpython-37.pyc
│   │   ├── map.cpython-312.pyc
│   │   ├── assemble.cpython-37.pyc
│   │   ├── library.cpython-312.pyc
│   │   ├── library.cpython-37.pyc
│   │   └── assemble.cpython-312.pyc
│   ├── library.py
│   ├── main.py
│   ├── map.py
│   └── assemble.py
├── PhyloAln
├── scripts
│   ├── root_tree.py
│   ├── prune_tree.py
│   ├── select_seqs.py
│   ├── trim_matrix.py
│   ├── merge_seqs.py
│   ├── check_aln.py
│   ├── transseq.pl
│   ├── test_effect.py
│   ├── alignseq.pl
│   ├── revertransseq.pl
│   └── connect.pl
├── LICENSE
├── .gitignore
└── README.md

/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/logo.png
--------------------------------------------------------------------------------
/tests/DMELA_1.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/tests/DMELA_1.fq.gz
--------------------------------------------------------------------------------
/tests/DMELA_2.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/tests/DMELA_2.fq.gz
--------------------------------------------------------------------------------
/tests/DWILL_1.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/tests/DWILL_1.fq.gz
--------------------------------------------------------------------------------
/tests/DWILL_2.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/tests/DWILL_2.fq.gz
--------------------------------------------------------------------------------
/lib/__pycache__/map.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/map.cpython-37.pyc
--------------------------------------------------------------------------------
/lib/__pycache__/main.cpython-312.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/main.cpython-312.pyc
--------------------------------------------------------------------------------
/lib/__pycache__/main.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/main.cpython-37.pyc
--------------------------------------------------------------------------------
/lib/__pycache__/map.cpython-312.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/map.cpython-312.pyc
--------------------------------------------------------------------------------
/lib/__pycache__/assemble.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/assemble.cpython-37.pyc
--------------------------------------------------------------------------------
/lib/__pycache__/library.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/library.cpython-312.pyc -------------------------------------------------------------------------------- /lib/__pycache__/library.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/library.cpython-37.pyc -------------------------------------------------------------------------------- /lib/__pycache__/assemble.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/assemble.cpython-312.pyc -------------------------------------------------------------------------------- /PhyloAln: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | #-*- coding = utf-8 -*- 3 | 4 | import sys 5 | import importlib 6 | 7 | def main(): 8 | mod = importlib.import_module('lib.main') 9 | if len(sys.argv) < 2: 10 | print("\nError: no argument was provided!\n") 11 | mod.main(['-h']) 12 | else: 13 | mod.main(sys.argv[1:]) 14 | 15 | if __name__ == '__main__': 16 | main() 17 | -------------------------------------------------------------------------------- /scripts/root_tree.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import sys 5 | import os 6 | from ete3 import Tree 7 | 8 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 9 | print("Usage: {} input.nwk output.nwk outgroup/outgroups(default=the midpoint outgroup, separated by comma)".format(sys.argv[0])) 10 | sys.exit(0) 11 | if len(sys.argv) < 3: 12 | print("Error: options < 2!\nUsage: {} input.nwk output.nwk outgroup/outgroups(default=the midpoint outgroup, separated by comma)".format(sys.argv[0])) 13 | sys.exit(1) 14 | 15 | tree = Tree(sys.argv[1]) 16 | if len(sys.argv) > 3: 17 | outgroup = sys.argv[3] 18 | if ',' in outgroup: 19 | outgroup = tree.get_common_ancestor(outgroup.split(',')) 20 | else: 21 | outgroup = tree.get_midpoint_outgroup() 22 | tree.set_outgroup(outgroup) 23 | tree.write(outfile=sys.argv[2]) 24 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 huangyh45 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /scripts/prune_tree.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import sys 5 | import os 6 | from ete3 import Tree 7 | 8 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 9 | print("Usage: {} input.nwk output.nwk seq/seqs(separated by comma)_in_clade1_for_deletion( seq/seqs_in_clade2_for_deletion ...)".format(sys.argv[0])) 10 | sys.exit(0) 11 | if len(sys.argv) < 4: 12 | print("Error: options < 3!\nUsage: {} input.nwk output.nwk seq/seqs(separated by comma)_in_clade1_for_deletion( seq/seqs_in_clade2_for_deletion ...)".format(sys.argv[0])) 13 | sys.exit(1) 14 | 15 | tree = Tree(sys.argv[1]) 16 | leafids = [] 17 | for leaf in tree: 18 | leafids.append(leaf.name) 19 | dists = {} 20 | for seqids in sys.argv[3:]: 21 | if ',' in seqids: 22 | clade = tree.get_common_ancestor(seqids.split(',')) 23 | for leaf in clade: 24 | leafids.remove(leaf.name) 25 | else: 26 | clade = tree.search_nodes(name=seqids)[0] 27 | leafids.remove(seqids) 28 | parent = clade.up 29 | if len(parent.children) == 2: 30 | sis_leaves = [] 31 | for leaf in parent: 32 | if leaf not in clade and leaf.name != clade.name: 33 | sis_leaves.append(leaf.name) 34 | dists[','.join(sis_leaves)] = parent.dist + parent.children[0].dist + parent.children[1].dist - clade.dist 35 | #clade.detach() 36 | tree.prune(leafids) 37 | for seqids, dist in dists.items(): 38 | if ',' in seqids: 39 | tree.get_common_ancestor(seqids.split(',')).dist = dist 40 | else: 41 | tree.search_nodes(name=seqids)[0].dist = dist 42 | tree.write(outfile=sys.argv[2]) 43 | -------------------------------------------------------------------------------- /scripts/select_seqs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import os 5 | import sys 6 | 7 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 8 | print("Usage: {} input_dir selected_species_or_sequences(separated by comma) output_dir fasta_suffix(default='.fa') separate_symbol(default='.') if_list_for_exclusion(default=no)".format(sys.argv[0])) 9 | sys.exit(0) 10 | if len(sys.argv) <= 3: 11 | print("Error: options < 3!\nUsage: {} input_dir selected_species_or_sequences(separated by comma) output_dir fasta_suffix(default='.fa') separate_symbol(default='.') if_list_for_exclusion(default=no)".format(sys.argv[0])) 12 | sys.exit(1) 13 | select_list = sys.argv[2].split(',') 14 | suffix = '.fa' 15 | if len(sys.argv) > 4: 16 | suffix = sys.argv[4] 17 | sep = '.' 18 | if len(sys.argv) > 5: 19 | sep = sys.argv[5] 20 | if_select = True 21 | if len(sys.argv) > 6: 22 | if sys.argv[6] == 'no': 23 | print("Error: ambiguous option! 
If you want to select the species or sequences in the list instead of to exclude them, please not input the last option!") 24 | sys.exit(1) 25 | if_select = False 26 | 27 | 28 | if not os.path.isdir(sys.argv[3]): 29 | os.mkdir(sys.argv[3]) 30 | 31 | files = os.listdir(sys.argv[1]) 32 | for filename in files: 33 | if not filename.endswith(suffix): 34 | continue 35 | outfile = open(os.path.join(sys.argv[3], filename), 'w') 36 | if_output = False 37 | for line in open(os.path.join(sys.argv[1], filename)): 38 | if line.startswith('>'): 39 | seqid = line.lstrip('>').rstrip().split(' ')[0] 40 | if if_select: 41 | if_output = False 42 | else: 43 | if_output = True 44 | for sp in select_list: 45 | if if_select: 46 | if seqid == sp or seqid.startswith(sp + sep): 47 | if_output = True 48 | break 49 | else: 50 | if seqid == sp or seqid.startswith(sp + sep): 51 | if_output = False 52 | break 53 | if if_output: 54 | outfile.write(line) 55 | outfile.close() 56 | 57 | -------------------------------------------------------------------------------- /tests/ref/OG0003709.fa: -------------------------------------------------------------------------------- 1 | >DBUSC.8329 2 | atGAATTCATTGGCACGTCTTGGTGGATTTGTTTGTCAATCTGTGCAAATTGCAGGCTGTGGCctgcagcaagtgcgAACCAAATATGCCGACTGGAAAATGATACGAGATGTTAAGCGACGCAAGTGTGTCTCGGAGCACGCCAAGGACCGGTTGCGAGTAAATTCGCTACGAAAGAATGACATACTGCCCGTGGAGCTGCGTGAAGTGGCGGATGCCCAGATTGCAGCATTTCCCAGGGACTCATCACTGGTGCGTGTGAGAGAACGCTGTGCTCTTACGTCACGACCCCGCGGTGTTGTGCACAAATACAGGCTGAGCCGTATTGTTTGGCGCCACTTGGCTGACTATAACAAGCTGTCTGGAGTCCAGAGAGCCATGTGG 3 | >DMOJA.2790 4 | ATGAACTCCTTGGCTAAGCTCGGAAGCTTCGTTAGCCAAAGCGTTCAAATCGCTGGATGTGGCTTGCAGCAAGTGCGCACCAAATATGCCGACTGGCGTATGATACGCGATGTCAAACGTCGCAAATGTGTCAGCGAACATGCCAAGGACCGCTTGCGTGTAAACTCGCTACGGAAGAATGACATACTGCCCGTTGAGCTCCGCGAGTTGGCGGATGCTGAGATAGCCGCATTTCCAAGGGACTCCTCACTGGTTCGCGTGCGAGAACGTTGCGCACTTACGTCACGGCCGCGCGGCGTAGTCCACAAATATCGACTTAGCCGCATTGTGTGGCGACACCTGGCCGATTATAACAAGCTGTCTGGAGTGCAGCGTGCCATGTGG 5 | >DPSEU.9672 6 | atgaatTCTCTGGCCAGAATTGGTGGTTTCGCTTGCCAAGCTGGGAAATTAGCTGGATTTGGCTTACAGCAAGTGCGAACAAAGTATGCCGACTGGAAGATGATCCGTGATGTCAAGCGCCGCAAGTGCGTGCAAGAGCATGCCAAGGAGAGGCTTCGAGTTAATTCACTTCGGAAGAATGACATACTGCCCATTGAGCTGAGGGAAGTGGCCGATGCGGAGATTGCTGCTTTTCCACGCGACTCATCATTGGTCCGTGTTCGGGAACGTTGCGCACTTACGTCACGGCCGCGCGGAGTTGTCCACAAATATCGCCTCAGCAGAATTGTATGGCGCCATTTGGCGGACTATAACAAGCTATCCGGTGTGCAGCGTGCCATGTGG 7 | >DYAKU.7011 8 | ATGAATTCTTTGGCCAGGATCGGGGGTTTTGTGTGCCAGTCCGTGCAGATAGCCGGCTGCGGGctgcagcaggtGCGCACCAAGTACGCCGATTGGAAGATGATTCGCGATGTCAAGCGGCGCAAGTGCGTCAAGGAGAACGCCGTGGAGCGACTACGGATCAACTCGCTGCGCAAGAACGACATCCTGCCGCCGGAGCTGCGCGAGGTGGCCGACGCCGAGATCGCTGCCTTTCCACGGGACTCATCGCTCGTCCGGGTGAGGGAACGCTGTGCGCTTACGTCACGGCCGCGCGGAGTCGTCCACAAGTACAGGCTTAGTCGAATCGTGTGGCGCCACCTCGCCGACTACAACAAGCTGTCCGGCGTGCAGCGTGCCATGTGG 9 | >SLEBA.7037 10 | ATGAACTCGTTGGCCAAAGTTAGTAGCTTTGTGTTCCAGTCTGTACAAATTTCTGGATGTGGTCTTCAACAAATACGTACAAAGTATGCCGACTGGAAGATGATACGTGACGTTAAACGTCGTAAGTGCGTGGAAAAATTTGCCAAGGAACGTTTGCAAATCAATTCAATACGCAAAAACGATATTCTTCCTCATGAACTACGACAGCTGGCCGATGTAGATATTGCAGCATTTCCTCGGGACTCTTCCTTGGTACGTGTTCGTGAACGTTGTGCACTTACGTCACGACCTCGGGGCGTCGTTCACAAATATCGGCTTAGCAGAATCGTGTGGCGGCACTTAGCAGATTACAATAAGCTATCTGGAGTTCAAAGAGCCATGTGG 11 | -------------------------------------------------------------------------------- /scripts/trim_matrix.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import os 5 | import sys 6 | 7 | def read_fasta(fasta): 8 | seqs = {} 9 | seqid = '' 10 | for line in open(fasta): 11 
| line = line.rstrip() 12 | if line.startswith('>'): 13 | arr = line.split(" ") 14 | seqid = arr[0].lstrip('>') 15 | seqs[seqid] = '' 16 | elif seqs.get(seqid) is not None: 17 | seqs[seqid] += line 18 | return seqs 19 | 20 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 21 | print("Usage: {} input_dir output_dir unknown_symbol(default='X') known_number(>=1)_or_percent(<1)_for_columns(default=0.5) known_number(>=1)_or_percent(<1)_for_rows(default=0) fasta_suffix(default='.fa')".format(sys.argv[0])) 22 | sys.exit(0) 23 | if len(sys.argv) < 3: 24 | print("Error: options < 2!\nUsage: {} input_dir output_dir unknown_symbol(default='X') known_percent_for_columns(default=50) known_percent_for_rows(default=0) fasta_suffix(default='.fa')".format(sys.argv[0])) 25 | sys.exit(1) 26 | unknow = 'X' 27 | if len(sys.argv) > 3: 28 | unknow = sys.argv[3] 29 | pcol = 0.5 30 | if len(sys.argv) > 4: 31 | pcol = float(sys.argv[4]) 32 | prow = 0 33 | if len(sys.argv) > 5: 34 | prow = float(sys.argv[5]) 35 | suffix = '.fa' 36 | if len(sys.argv) > 6: 37 | suffix = sys.argv[6] 38 | 39 | if not os.path.isdir(sys.argv[2]): 40 | os.mkdir(sys.argv[2]) 41 | 42 | files = os.listdir(sys.argv[1]) 43 | for filename in files: 44 | if not filename.endswith(suffix): 45 | continue 46 | seqs = read_fasta(os.path.join(sys.argv[1], filename)) 47 | if pcol > 0: 48 | if pcol < 1: 49 | ncol = pcol * len(seqs) 50 | else: 51 | ncol = pcol 52 | newseqs = {} 53 | for seqid in seqs.keys(): 54 | newseqs[seqid] = '' 55 | for i in range(len(list(seqs.values())[0])): 56 | n = 0 57 | for seqstr in seqs.values(): 58 | if seqstr[i] != unknow: 59 | n += 1 60 | if n >= ncol: 61 | for seqid, seqstr in seqs.items(): 62 | newseqs[seqid] += seqstr[i] 63 | seqs = newseqs 64 | if prow > 0: 65 | if prow < 1: 66 | nrow = prow * len(list(seqs.values())[0]) 67 | else: 68 | nrow = prow 69 | for seqid, seqstr in list(seqs.items()): 70 | n = 0 71 | for base in seqstr: 72 | if base != unknow: 73 | n += 1 74 | if n < nrow: 75 | print("Removing {} from {}: known sites {}/{} = {}%".format(seqid, filename, n, len(seqstr), int(n / len(seqstr) * 10000 + 0.5) / 100)) 76 | seqs.pop(seqid) 77 | outfile = open(os.path.join(sys.argv[2], filename), 'w') 78 | for seqid, seqstr in seqs.items(): 79 | outfile.write(">{}\n{}\n".format(seqid, seqstr)) 80 | outfile.close() 81 | 82 | -------------------------------------------------------------------------------- /scripts/merge_seqs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import os 5 | import sys 6 | 7 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 8 | print("Usage: {} output_dir PhyloAln_dir1 PhyloAln_dir2 (PhyloAln_dir3 ...)".format(sys.argv[0])) 9 | sys.exit(0) 10 | if len(sys.argv) < 3: 11 | print("Error: options < 2!\nUsage: {} output_dir PhyloAln_dir1 PhyloAln_dir2 (PhyloAln_dir3 ...)".format(sys.argv[0])) 12 | sys.exit(1) 13 | 14 | os.mkdir(sys.argv[1]) 15 | if os.path.isdir(os.path.join(sys.argv[2], 'aa_out')): 16 | os.mkdir(os.path.join(sys.argv[1], 'aa_out')) 17 | if os.path.isdir(os.path.join(sys.argv[2], 'nt_out')): 18 | os.mkdir(os.path.join(sys.argv[1], 'nt_out')) 19 | 20 | aafiles = {} 21 | ntfiles = {} 22 | if os.path.isdir(os.path.join(sys.argv[2], 'aa_out')): 23 | for dirname in sys.argv[2:]: 24 | filenames = os.listdir(os.path.join(dirname, 'aa_out')) 25 | for filename in filenames: 26 | if not filename.endswith('.fa'): 27 | continue 28 | if aafiles.get(filename) is None: 29 | aafiles[filename] = [] 30 | 
aafiles[filename].append(os.path.join(dirname, 'aa_out', filename)) 31 | if os.path.isdir(os.path.join(sys.argv[2], 'nt_out')): 32 | for dirname in sys.argv[2:]: 33 | filenames = os.listdir(os.path.join(dirname, 'nt_out')) 34 | for filename in filenames: 35 | if not filename.endswith('.fa'): 36 | continue 37 | if ntfiles.get(filename) is None: 38 | ntfiles[filename] = [] 39 | ntfiles[filename].append(os.path.join(dirname, 'nt_out', filename)) 40 | 41 | for aafile, filenames in aafiles.items(): 42 | seqs = {} 43 | for filename in filenames: 44 | seqid = '' 45 | for line in open(filename): 46 | line = line.rstrip() 47 | if line.startswith('>'): 48 | arr = line.split(" ") 49 | seqid = arr[0].lstrip('>') 50 | seqs[seqid] = '' 51 | elif seqs.get(seqid) is not None: 52 | seqs[seqid] += line 53 | outfile = open(os.path.join(sys.argv[1], 'aa_out', aafile), 'w') 54 | for seqid, seqstr in seqs.items(): 55 | outfile.write(">{}\n{}\n".format(seqid, seqstr)) 56 | outfile.close() 57 | for ntfile, filenames in ntfiles.items(): 58 | seqs = {} 59 | for filename in filenames: 60 | seqid = '' 61 | for line in open(filename): 62 | line = line.rstrip() 63 | if line.startswith('>'): 64 | arr = line.split(" ") 65 | seqid = arr[0].lstrip('>') 66 | seqs[seqid] = '' 67 | elif seqs.get(seqid) is not None: 68 | seqs[seqid] += line 69 | outfile = open(os.path.join(sys.argv[1], 'nt_out', ntfile), 'w') 70 | for seqid, seqstr in seqs.items(): 71 | outfile.write(">{}\n{}\n".format(seqid, seqstr)) 72 | outfile.close() 73 | 74 | -------------------------------------------------------------------------------- /tests/run_test.sh: -------------------------------------------------------------------------------- 1 | set -e 2 | dir=`pwd` 3 | if [ ! -d "tests/aln" ]; then 4 | mkdir tests/aln 5 | fi 6 | for file in tests/ref/*.fa; do 7 | og=`basename $file` 8 | alignseq.pl -i $file -o tests/aln/$og -a codon -n 5 9 | done 10 | echo "DMELA $dir/tests/DMELA_1.fq.gz,$dir/tests/DMELA_2.fq.gz" > tests/run_test.config 11 | echo "DWILL $dir/tests/DWILL_1.fq.gz,$dir/tests/DWILL_2.fq.gz" >> tests/run_test.config 12 | PhyloAln -d tests/aln -x .fa -c tests/run_test.config -o tests/PhyloAln_out -p 5 -m codon -u SLEBA 13 | if [ $1 ]; then 14 | # full usage experience of all the scripts with real examples through a simple phylogenomic flow 15 | PhyloAln -d tests/aln -x .fa -s DWILL2 -i $dir/tests/DWILL_1.fq.gz $dir/tests/DWILL_2.fq.gz -o tests/PhyloAln_out2 -p 5 -m codon -u SLEBA # additionally generate result alignments for only DWILL 16 | #test_effect.py tests/PhyloAln_out2/nt_out:DYAKU tests/PhyloAln_out/nt_out:DYAKU tests/PhyloAln_DYAKUvsDYAKU.tsv N . .fa DBUSC,DMOJA,DPSEU,DYAKU,SLEBA # compare result aligned sequences of DYAKU in two runs. The sequences should be 100% identical. 17 | merge_seqs.py tests/PhyloAln_all tests/PhyloAln_out tests/PhyloAln_out2 # merge results of two runs 18 | cp tests/PhyloAln_all/nt_out/OG0003820.fa tests/PhyloAln_all/nt_out/OG0003977.fa tests/PhyloAln_all/nt_out/OG0004212.fa tests/PhyloAln_all # pick out the result alignments containing DWILL and DWILL2 19 | test_effect.py tests/PhyloAln_all:DWILL tests/PhyloAln_all:DWILL2 tests/PhyloAln_DWILLvsDWILL2.tsv N . .fa DBUSC,DMOJA,DPSEU,DYAKU,SLEBA # compare result aligned sequences of DWILL and DWILL2. The sequences should be 100% identical. 20 | select_seqs.py tests/PhyloAln_all/nt_out DBUSC,DPSEU tests/PhyloAln_all/nt_sel .fa . 
1 # exclude the sequences of DBUSC and DPSEU from the result alignments 21 | trim_matrix.py tests/PhyloAln_all/nt_sel tests/PhyloAln_all/nt_trim N 0.5 10 # exclude the columns (sites) with known bases (not 'N') < 50% and the rows (species) with known bases (not 'N') < 10 22 | connect.pl -i tests/PhyloAln_all/nt_trim -f N -b all.block -n -c 123 # concatenate the alignments of five genes into a supermatrix and generate a partition file of codon positions for IQ-TREE 23 | iqtree -s all.fas -p all.block -m MFP+MERGE -B 1000 -T AUTO --threads-max 5 --prefix tests/PhyloAln_all/species_tree # reconstruct phylogeny of the supermatrix by IQ-TREE 24 | root_tree.py tests/PhyloAln_all/species_tree.treefile tests/PhyloAln_all/species_tree.rooted.tre SLEBA # root the species phylogeny using SLEBA (outgroup) and generate the rooted tree 'tests/PhyloAln_all/species_tree.rooted.tre' 25 | echo "Successfully complete running" 26 | else 27 | connect.pl -i tests/PhyloAln_out/nt_out -f N -b all.block -n -c 123 28 | check_aln.py -h 29 | merge_seqs.py -h 30 | prune_tree.py -h 31 | root_tree.py -h 32 | select_seqs.py -h 33 | test_effect.py -h 34 | trim_matrix.py -h 35 | echo "Successfully installed" 36 | fi 37 | -------------------------------------------------------------------------------- /scripts/check_aln.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import os 5 | import sys 6 | 7 | def read_fasta(fasta): 8 | seqs = {} 9 | seqid = '' 10 | for line in open(fasta): 11 | line = line.rstrip() 12 | if line.startswith('>'): 13 | arr = line.split(" ") 14 | seqid = arr[0].lstrip('>') 15 | seqs[seqid] = '' 16 | elif seqs.get(seqid) is not None: 17 | seqs[seqid] += line.upper() 18 | return seqs 19 | 20 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 21 | print("Usage: {} input_dir output_dir(default='none') aver_freq_per_site(default=0.75) gap_symbol(default='-') start_end_no_gap_number(>=1)_or_percent(<1)(default=0.6) fasta_suffix(default='.fa')".format(sys.argv[0])) 22 | sys.exit(0) 23 | if len(sys.argv) < 2: 24 | print("Error: options < 1!\nUsage: {} input_dir output_dir(default='none') aver_freq_per_site(default=0.75) gap_symbol(default='-') start_end_no_gap_number(>=1)_or_percent(<1)(default=0.6) fasta_suffix(default='.fa')".format(sys.argv[0])) 25 | sys.exit(1) 26 | outdir = None 27 | if len(sys.argv) > 2: 28 | if sys.argv[2].lower() != 'none': 29 | outdir = sys.argv[2] 30 | freq = 0.75 31 | if len(sys.argv) > 3: 32 | freq = float(sys.argv[3]) 33 | gap = '-' 34 | if len(sys.argv) > 4: 35 | gap = sys.argv[4] 36 | pgap = 0.6 37 | if len(sys.argv) > 5: 38 | pgap = float(sys.argv[5]) 39 | suffix = '.fa' 40 | if len(sys.argv) > 6: 41 | suffix = sys.argv[6] 42 | 43 | if outdir: 44 | if not os.path.isdir(outdir): 45 | os.mkdir(outdir) 46 | 47 | files = os.listdir(sys.argv[1]) 48 | for filename in files: 49 | if not filename.endswith(suffix): 50 | continue 51 | seqs = read_fasta(os.path.join(sys.argv[1], filename)) 52 | unaln = [] 53 | start = None 54 | end = None 55 | if pgap < 1: 56 | ngap = pgap * len(seqs) 57 | else: 58 | ngap = pgap 59 | for i in range(len(list(seqs.values())[0])): 60 | n = 0 61 | for seqstr in seqs.values(): 62 | if seqstr[i] != gap: 63 | n += 1 64 | if n >= ngap: 65 | if start is None: 66 | start = i 67 | end = i 68 | print(f"\n{filename}, length: {len(list(seqs.values())[0])}, valid regions: {start}-{end}") 69 | if start is None or end is None: 70 | for seqid in seqs.keys(): 71 | print(f"{seqid}, 
None-None: -Inf < {freq}, unaligned!") 72 | unaln.append(seqid) 73 | else: 74 | scores = {} 75 | for i in range(start, end + 1): 76 | scores[i] = {} 77 | for seqstr in seqs.values(): 78 | scores[i].setdefault(seqstr[i], 0) 79 | scores[i][seqstr[i]] += 1 80 | for seqid, seqstr in seqs.items(): 81 | seq_start = None 82 | seq_end = None 83 | for i in range(start, end + 1): 84 | if seqstr[i] != gap: 85 | if seq_start is None: 86 | seq_start = i 87 | seq_end = i 88 | if seq_start is None or seq_end is None: 89 | print(f"{seqid}, {seq_start}-{seq_end}: -Inf < {freq}, unaligned!") 90 | unaln.append(seqid) 91 | continue 92 | aver_freq = 0 93 | for i in range(seq_start, seq_end + 1): 94 | aver_freq += (scores[i][seqstr[i]] / len(seqs)) 95 | aver_freq = aver_freq / (seq_end - seq_start + 1) 96 | if aver_freq < freq: 97 | print(f"{seqid}, {seq_start}-{seq_end}: {aver_freq} < {freq}, unaligned!") 98 | unaln.append(seqid) 99 | else: 100 | print(f"{seqid}, {seq_start}-{seq_end}: {aver_freq}") 101 | if outdir: 102 | outfile = open(os.path.join(outdir, filename), 'w') 103 | for seqid, seqstr in seqs.items(): 104 | if seqid not in unaln: 105 | outfile.write(">{}\n{}\n".format(seqid, seqstr)) 106 | outfile.close() 107 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 
99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/#use-with-ide 110 | .pdm.toml 111 | 112 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 113 | __pypackages__/ 114 | 115 | # Celery stuff 116 | celerybeat-schedule 117 | celerybeat.pid 118 | 119 | # SageMath parsed files 120 | *.sage.py 121 | 122 | # Environments 123 | .env 124 | .venv 125 | env/ 126 | venv/ 127 | ENV/ 128 | env.bak/ 129 | venv.bak/ 130 | 131 | # Spyder project settings 132 | .spyderproject 133 | .spyproject 134 | 135 | # Rope project settings 136 | .ropeproject 137 | 138 | # mkdocs documentation 139 | /site 140 | 141 | # mypy 142 | .mypy_cache/ 143 | .dmypy.json 144 | dmypy.json 145 | 146 | # Pyre type checker 147 | .pyre/ 148 | 149 | # pytype static type analyzer 150 | .pytype/ 151 | 152 | # Cython debug symbols 153 | cython_debug/ 154 | 155 | # PyCharm 156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 158 | # and can be added to the global gitignore or merged into this file. For a more nuclear 159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 160 | #.idea/ 161 | -------------------------------------------------------------------------------- /scripts/transseq.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use Bio::SeqIO; 4 | use Getopt::Std; 5 | use Parallel::ForkManager; 6 | use strict; 7 | my %opt=('g'=>'1','t'=>'*','n'=>'1','l'=>'transseq.log'); 8 | getopts('i:o:g:t:c:a:n:l:h',\%opt); 9 | usage() if $opt{h}; 10 | 11 | my $ntfile=$opt{i}; 12 | my $aafile=$opt{o}; 13 | my $gencode=$opt{g}; 14 | my $termination=$opt{t}; 15 | my $incomplete=$opt{c}; 16 | my $transall=$opt{a}; 17 | our $numthreads=$opt{n}; 18 | our $logfile=$opt{l}; 19 | open(LOG,">>$logfile") or die "\nError: $logfile can't open!\n"; 20 | 21 | printdie("\nError: no input file or output file was set!\n") unless $ntfile&&$aafile; 22 | printlog("\n##### Sequences translation begins #####\n"); 23 | my $in=Bio::SeqIO->new(-file => "$ntfile", -format => 'fasta'); 24 | my @seqs; 25 | my $pm = new Parallel::ForkManager($numthreads); 26 | $pm -> run_on_finish( sub { 27 | my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_; 28 | if($data_structure_reference) { 29 | my @arr=@$data_structure_reference; 30 | if(@arr==1) {printdie($arr[0]);} 31 | else { 32 | push(@seqs, $arr[1]); 33 | printlog($arr[0]); 34 | } 35 | } 36 | }); 37 | while(my $seq=$in->next_seq()) { 38 | $pm->start and next; 39 | my $id=$seq->display_id; 40 | my @frames=(0,1,2,-1,-2,-3); 41 | my $finalseq; 42 | for(my $i=0;$i<@frames;$i++) { 43 | my ($transseq,$seq1); 44 | if($frames[$i]<0) { 45 | $seq1=$seq->revcom; 46 | $transseq=$seq1->translate(-codontable_id => $gencode, -terminator => $termination, -frame => -$frames[$i]-1); 47 | $seq1=$seq1->trunc(-$frames[$i],$seq1->length()); 48 | } 
49 | else { 50 | $transseq=$seq->translate(-codontable_id => $gencode, -terminator => $termination, -frame => $frames[$i]); 51 | $seq1=$seq->trunc($frames[$i]+1,$seq->length()); 52 | } 53 | my $seq2=$transseq; 54 | if($incomplete&&$seq2->length()*3!=$seq1->length()) { 55 | if($seq2->length()*3+1==$seq1->length()||$seq2->length()*3+2==$seq1->length()) { 56 | $seq2=Bio::Seq->new( -seq => ($transseq->seq).'X' , -id => $id ); 57 | } 58 | else {my $diestr="\nError: nt length does not match the aa length in $id!\n";$pm->finish(0,[$diestr]);} 59 | } 60 | unless($transall) {$finalseq=$seq2;last;} 61 | else { 62 | my $start=$frames[$i]<0 ? $seq->length()+$frames[$i]+1 : $frames[$i]+1; 63 | my $end=$frames[$i]<0 ? $start%3+1 : $seq->length()-($seq->length()-$start+1)%3; 64 | if($incomplete) {$end=$frames[$i]<0 ? 1 : $seq->length();} 65 | $finalseq=Bio::Seq->new( -seq => $seq2->seq , -id => "$id|$start-$end|$frames[$i]") 66 | } 67 | } 68 | $pm->finish(0,["Translation of $id finished.\n",$finalseq]); 69 | } 70 | $pm->wait_all_children; 71 | printlog("#####i Sequences translation complished #####\n\n"); 72 | close(LOG); 73 | my $out = Bio::SeqIO->new(-file => ">$aafile", -format => 'fasta'); 74 | foreach my $seq (@seqs) {$out->write_seq($seq);} 75 | 76 | sub printlog { 77 | print LOG $_[0]; 78 | printf $_[0]; 79 | } 80 | 81 | sub printdie { 82 | print LOG $_[0]; 83 | die "$_[0]\nYou can use '-h' to watch detailed help.\n"; 84 | } 85 | 86 | sub usage { 87 | 88 | die " 89 | perl $0 90 | Translate nucleotide sequences in a file to amino acid sequences. 91 | 92 | Usage: 93 | -i input nucleotide sequences file 94 | -o output amino acid sequences file 95 | -g genetic code(default=1, invertebrate mitochondrion=5) 96 | -t symbol of termination(default='*') 97 | -c if translate incomplete codons into 'X'(default=no) 98 | -a if translate all six ORF(default=no) 99 | -n num threads(default=1) 100 | -l log file(default='transseq.log') 101 | -h this help message 102 | 103 | Example: 104 | transseq.pl -i ntfile -o aafile -g gencode -t termination -c 1 -a 1 -n numthreads -l logfile 105 | 106 | Written by Yu-Hao Huang (2017-2024) huangyh45\@mail3.sysu.edu.cn 107 | "; 108 | 109 | } 110 | -------------------------------------------------------------------------------- /tests/ref/OG0004212.fa: -------------------------------------------------------------------------------- 1 | >DBUSC.10868 2 | ATGATGATGCAACGCACTGGATTATTGTTGCCAATGGCAATTGAAGCCACCATGCAGGCTCAGCAGCAACGTGGCATGGCTACGCTGAAGACAATTTCCATGCGCCTCAAATCCGTGAAGAACATTCAGAAAATTACGCAATCAATGAAGATGGTGTCCGCGGCCAAATACGCCCGTGCCGAGCGTGATTTGAGGGCAGCGCGTCCCTATGGCGCTGGTGCTCAGCAATTCTTTGAAAAGGTTGAAATCACTCCCGATGAGAAGGCCGAACCCAAGAAGTTGCTCATTGCCATGACATCGGATCGTGGTCTGTGCGGTGCAGTCCATACCGGTGTGGCGCGTCTAATTCGTGGTGAGCTGGCCCAGGATGATACAAACACTAAGGTGTTCTGCGTCGGTGACAAGTCACGCGCCATTCTAGCTCGTCTCTACggcaaaaacattttgatgGTAGCCAATGAGATTGGTCGCCTGCCACCCACTTTCCTAGACGCATCCAAGATTGCGCATGAAGTGTTGAACACCGGCTACGAGTATACCGAGGGCAAGATTGTTTACAACAGATTCAAGTCTGTCGTCTCCTATCAGTGCAGCACACTGCCCATCTTCAGCGGTTCTACTGTGGAGAAGTCAGAGAAGCTGGCTGTTTACGATTCGCTCGATAGCGATGTTGTCCAAAGCTATCTGGAATTTTCGTTGGCCTCGCTCATCTTCTACACCATGAAGGAAGGCGCTTGCTCCGAGCAATCGTCTCGTATGACTGCCATGGACAATGCTTCCAAGAACGCCGGTGAGATGATTGAAAAGCTAACACTGACATTCAACCGCACCAGACAGGCTGTCATTACCCGTGAGCTGATTGAAATCATCTCTGGTGCCGCTGCCCTGACA 3 | >DMOJA.8036 4 | 
ATGATGATGCAACGTACTACGCTTTTGCTGCCAATGGCTGTTGAAGCCACCAATGTTGCCCAACAGCAACGTGGTATGGCCACATTGAAGCATATTTCCATGCGCCTCAAATCCGTAAAGAACATCCAAAAAATTACGCAATCAATGAAAATGGTGTCCGCGGCCAAGTACTCCCGTGCCGAGCGTGATTTAAAGGCAGCGCGTCCCTATGGCATCGGTGCTCAACAATTCTTTGATAAGACCGAAGTGCAGGCTGATGGAGCTGTCGAGCCCAAGAAGCTGCTTATTGCCGTAACTTCGGATCGTGGCCTCTGCGGTGCCGTGCACACCGGTGTTGCACGTCTCATCCGTGGCGAGCTGCAGAAGGACGATTCTAACACCATGGTGTTCTGCGTTGGCGACAAGTCGCGTGCCATTCTGTCCCGTTTGTACGGTAAGAACATCCTGATGGTGGCCAACGAAGTGGGCCGTCTGCCACCTACTTTCCTGGATGCATCCAAGATTGCGCATGAGGTATTGTCGACCGGGTACGATTATACTGAGGGCAAGATCGTGTACAACCAGTTCAAGTCTGTGGTCTCGTACAAGTGCTCCACGTTGCCCATCTACAGTGGCCCCACTGTGGAGAAGTCGGAGAAGTTGGCCGTTTACGATTCGCTCGACAGCGATGTCATCAAGAGCTATCTGGAGTTCTCTCTGGCCTCGCTCATCTTCTACACCATGAAGGAGGGCGCTTGCTCTGAGCAATCGTCCCGTATGACTGCCATGGACAATGCCTCGAAGAACGCCGGTGAAATGATTGAGAAGCTGACCCTCACATTCAACCGCACCCGACAGGCTGTCATCACTCGCGAGTTGATTGAAATCATCTCTGGTGCCGCTGCCCTGACA 5 | >DPSEU.2270 6 | ATGATGATGCAACGCACTACGCTTTTGCTGCCCATGGCCATTGAAGCCACCATgctggcccagcagcagcgtggCATGGCCACTTTAAAGACCATTTCCATGCGTCTGAAATCCGTGAAGAACATTCAGAAAATTACGCAATCGATGAAGATGGTGTCCGCGGCCAAGTACGCCCGTGCCGAGCGTGATTTGAAGGCGGCGCGTCCCTACGGAATCGGTGCTCAGCAGTTTTTCGAAAAGACTGAGATCGTGCCCGATGAGAAGGCCGAGCCCAAGAAGCTCTTCATCGCCGTAACATCGGACCGTGGTCTGTGCGGTGCTGTTCACACTGGTGTGGCGCGTCTGATCCGTGGCGAGATGGCTACCGAACATGCCAACACCAAGATTTTCTGCGTGGGAGACAAGTCCCGTGCTATTCTGGCCCGTCTGTACGGCAAGAATATCCTGATGGTGGCCAACGAAATCGGCCGTCTGCCCCCCACTTTCCTGGATGCCTCGAAGATCGCCCATGAGGTCCTAAACACTGGCTACGAGTACACCGAGGGCAAGATCGTGTACAACAAGTTCAGGTCCGTTGTCTCGTACCAGTGCAGCACACTGCCCATCTACGGTGGCCCCACTGTCGAGAAGTCGGAGAAACTGGCCACTTACGACTCGCTCGACAGCGATGTCATCAAGAGCTACTTGGAGTTCTCGCTGGCCTCCCTCATCTTCTACACCATGAAGGAGGGTGCCTGCTCGGAGCAGTCCTCCCGTATGACGGCCATGGACAATGCTTCCAAGAACGCCGGTGAGATGATTGACAAACTTACTCTCACATTCAACCGCACCCGACAGGCTGTCATCACTCGCGAGCTGATTGAGATCAtctctggtgctgctgccctcaCA 7 | >DYAKU.7834 8 | ATGATGATGCAACGCACCCAGCTCCTGCTGCCCCTGGCCATGGAGGCCACCATGCTGGCCCAGCAGCAGCGTGGCATGGCCACCCTGAAGATGATTTCCATCCGCCTGAAGTCGGTGAAGAACATTCAGAAAATTACGCAATCGATGAAGATGGTGTCCGCTGCCAAGTACGCCCGTGCCGAGCGAGACTTGAAGGCGGCGCGTCCTTACGGCATCGGCGCCCAGCAGTTCTTCGAGAAGACGGAGATCCAGGCGGACGAGAAGGCGGAGCCCAAGAAGCTGCTCATCGCAGTCACTTCGGACCGTGGTCTTTGCGGCGCTGTCCACACTGGTGTGGCCCGTCTCATTCGTGGTGAACTGGCCCAGGACGAGGCCAACACTAAGGTGTTCTGCGTGGGCGACAAGTCGCGCGCTATCCTGTCCCGTCTGTACGGCAAGAACATCCTGATGGTGGCCAACGAGGTGGGCCGACTGCCGCCCACTTTCCTGGACGCCTCGAAGATTGCCAACGAGGTTCTGCAGACCGGTTACGATTACACCGAGGGCAAGATCGTCTACAACCGTTTCAAGTCGGTGGTGTCGTACCAGTGCTCCACCCTCCCCATCTTCAGCGGATCCACCGTGGAGAAGTCGGAGAAGCTGGCCGTCTACGACTCGCTCGACAGCGACGTGGTCAAGAGCTACCTGGAGTTCTCGCTGGCCTCGCTCATCTTCTACACCATGAAGGAGGGCGCCTGCTCGGAGCAGTCCTCCCGTATGACTGCCATGGACAACGCTTCCAAGAACGCCGGTGAGATGATCGACAAGCTGACCCTCACCTTCAACCGTACCCGACAGGCCGTCATCACTCGCGAGCTGATTGAAATCATCTCCGGTGCCGCCGCCCTCACA 9 | >SLEBA.12201 10 | 
ATGATGATGCAACGCTCCATGCTCTTGCTACCCCTGGCGGTTGAAGCCACGATGCATGCCCAACAGCAACGTGGTATGGCCACTTTGCAGTCAATTTCCATTCGCTTGAAATCAGTGAAGAACATTCAGAAAATTACGCAATCAATGAAGATGGTGTCCGCCGCAAAATACGCCAGGGCGGAGCGTGATTTGAAGGCGGCGCGTCCTTATGGCATTGGAGCGCAACAGTTCTTCGAAAAGACCGAGATCCAGGTCGATGAGAAGGCCGAACCCAAGAAGCTTCTTATCGCGATGACATCCGATCGTGGTCTCTGCGGCGCTGTGCACACCGGTGTGGCCCGTCACATCCGCAACGAGCTTGCTAAGGACGATGTTAACACCAAGATTTTCTGTGTCGGCGACAAGTCGCGCTCGATCCTGGCGCGTCTATACGGCAAGAATATTCTGATGGTAGCCAACGAAGTGGGTCGTCTGCCACCTACCTTCTTGGATGCCTCTAGGATTGCGCATGAGGTCCTGCAAACCGGTTACGAGTATTCCGAGGGACAAATCGTGTACAACAAGTTCAACTCGGTGGTATCTTACTCACTGTCCCAACTGCCCATCTACAGTGGCGCCACTGTGGAGAAGTCGGAGAAGCTGGCGGTCTTCGATTCCCTGGACGCCGATGTCATCCAGAGCTATTTGGAGTTCTCGCTGGCTTCCCTAATCTTTTACACCATGAAGGAAGGCGCTTGCTCTGAGCAATCGTCGCGTATGACTGCCATGGATAACGCCTCGAAGAATGCTGGTGAGATGATTGAAAAGTTGACGCTCATATTCAACCGCACCAGACAGGCTGTCATCACTCGCGAGCTGATTGAAATTATCTCCGGTGCCTCTGCTCTGGAG 11 | -------------------------------------------------------------------------------- /scripts/test_effect.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import os 5 | import sys 6 | 7 | def read_fasta(fasta, target, sep='.', select_list=None): 8 | seqs = {} 9 | if select_list: 10 | seqid = None 11 | for line in open(fasta): 12 | line = line.rstrip() 13 | if line.startswith('>'): 14 | arr = line.split(" ") 15 | seqid = arr[0].lstrip('>') 16 | if seqid in select_list or seqid.split(sep)[0] in select_list: 17 | seqs[seqid] = '' 18 | elif seqs.get(seqid) is not None: 19 | seqs[seqid] += line.upper() 20 | seqid = None 21 | seqstr = '' 22 | for line in open(fasta): 23 | line = line.rstrip() 24 | if line.startswith('>'): 25 | if seqstr: 26 | break 27 | arr = line.split(" ") 28 | seqid = arr[0].lstrip('>') 29 | if seqid != target and not seqid.startswith(target + sep): 30 | seqid = None 31 | elif seqid: 32 | seqstr += line.upper() 33 | # debug 34 | # print(seqid, seqstr) 35 | # print(seqs) 36 | # input() 37 | return seqstr, seqs 38 | 39 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 40 | print("Usage: {} reference_dir:ref_species_or_seq_name target_dir:target_species_or_seq_name output_tsv unknown_symbol(default='N') separate(default='.') fasta_suffix(default='.fa') selected_species_or_sequences(separated by comma)".format(sys.argv[0])) 41 | sys.exit(0) 42 | if len(sys.argv) <= 3: 43 | print("Error: options < 3!\nUsage: {} reference_dir:ref_species_or_seq_name target_dir:target_species_or_seq_name output_tsv unknown_symbol(default='N') separate(default='.') fasta_suffix(default='.fa') selected_species_or_sequences(separated by comma)".format(sys.argv[0])) 44 | sys.exit(1) 45 | unknow = 'N' 46 | if len(sys.argv) > 4: 47 | unknow = sys.argv[4] 48 | sep = '.' 
49 | if len(sys.argv) > 5: 50 | sep = sys.argv[5] 51 | suffix = '.fa' 52 | if len(sys.argv) > 6: 53 | suffix = sys.argv[6] 54 | select_list = None 55 | if len(sys.argv) > 7: 56 | select_list = sys.argv[7].split(',') 57 | 58 | ref_dir, ref_sp = sys.argv[1].split(':') 59 | target_dir, target_sp = sys.argv[2].split(':') 60 | outfile = open(sys.argv[3], 'w') 61 | outfile.write("file\treference length\ttarget length\tnident\tcompleteness\tpident\n") 62 | total_rlen = 0 63 | total_tlen = 0 64 | total_nident = 0 65 | files = os.listdir(ref_dir) 66 | for filename in files: 67 | if not filename.endswith(suffix): 68 | continue 69 | ref_seq, ref_sel = read_fasta(os.path.join(ref_dir, filename), ref_sp, sep, select_list) 70 | target_seq, target_sel = read_fasta(os.path.join(target_dir, filename), target_sp, sep, select_list) 71 | indexes = [] 72 | if select_list: 73 | i = 0 74 | j = 0 75 | while i < len(list(ref_sel.values())[0]): 76 | if j >= len(list(target_sel.values())[0]): 77 | if_match = False 78 | else: 79 | if_match = True 80 | for sel in ref_sel.keys(): 81 | # to use species 82 | # sp = sel.split(sep)[0] 83 | if ref_sel[sel][i] != target_sel[sel][j]: 84 | if_match = False 85 | break 86 | if if_match: 87 | j += 1 88 | else: 89 | indexes.append(i) 90 | i += 1 91 | print(filename) 92 | str_list = list(ref_seq) 93 | for i in reversed(indexes): 94 | str_list.pop(i) 95 | ref_seq = ''.join(str_list) 96 | start = None 97 | for i in range(len(ref_seq)): 98 | if ref_seq[i] not in [unknow, '-']: 99 | start = i 100 | break 101 | end = None 102 | for i in range(len(ref_seq)-1, -1, -1): 103 | if ref_seq[i] not in [unknow, '-']: 104 | end = i 105 | break 106 | if start is None or end is None: 107 | # continue 108 | print("\nError: invalid reference in '{}': '{}'!\n".format(filename, ref_seq)) 109 | sys.exit(1) 110 | if not target_seq: 111 | rlen = 0 112 | for i in range(start, end+1): 113 | if ref_seq[i] != unknow: 114 | rlen += 1 115 | total_rlen += rlen 116 | outfile.write("{}\t{}\t0\t0\t0\t0\n".format(filename, rlen)) 117 | continue 118 | rlen = 0 119 | tlen = 0 120 | nident = 0 121 | # print(filename, start, end) 122 | for i in range(start, end+1): 123 | if ref_seq[i] == unknow: 124 | continue 125 | rlen += 1 126 | if target_seq[i] == unknow: 127 | continue 128 | tlen += 1 129 | if target_seq[i] == ref_seq[i]: 130 | nident += 1 131 | total_rlen += rlen 132 | total_tlen += tlen 133 | total_nident += nident 134 | outfile.write("{}\t{}\t{}\t{}\t{}\t{}\n".format(filename, rlen, tlen, nident, int(10000*tlen/rlen+0.5)/100, int(10000*nident/tlen+0.5)/100)) 135 | outfile.write("total\t{}\t{}\t{}\t{}\t{}\n".format(total_rlen, total_tlen, total_nident, int(10000*total_tlen/total_rlen+0.5)/100, int(10000*total_nident/total_tlen+0.5)/100)) 136 | outfile.close() 137 | -------------------------------------------------------------------------------- /scripts/alignseq.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use FindBin qw($Bin); 4 | use Getopt::Std; 5 | use strict; 6 | my %opt=('g'=>'1','a'=>'direct','t'=>'X','n'=>'1','l'=>'alignseq.log'); 7 | getopts('i:o:a:g:t:m:f:n:l:h',\%opt); 8 | usage() if $opt{h}; 9 | 10 | my $input=$opt{i}; 11 | my $output=$opt{o}; 12 | my $aligntype=$opt{a}; 13 | my $gencode=$opt{g}; 14 | my $termination=$opt{t}; 15 | my $incomplete=$opt{c}; 16 | my $ifmediate=$opt{m}; 17 | my $mafftfolder=$opt{f}; 18 | if($mafftfolder) {$mafftfolder.="/";} 19 | our $numthreads=$opt{n}; 20 | our $logfile=$opt{l}; 21 | 
open(LOG,">>$logfile") or die "\nError: $logfile can't open!\n"; 22 | 23 | printdie("\nError: no input file or output file was set!\n") unless $input&&$output; 24 | printdie("\nError: invalid alignment type was set!\n") unless $aligntype eq 'direct'||$aligntype eq 'translate'||$aligntype eq 'codon'||$aligntype eq 'complement'||$aligntype eq 'ncRNA'; 25 | printlog("\n##### Alignment begins #####\n"); 26 | if($aligntype eq 'complement') { 27 | my $mafftcommand=$mafftfolder."mafft"; 28 | runcmd("$mafftcommand --thread $numthreads --adjustdirectionaccurately $input > $output"); 29 | } 30 | elsif($aligntype eq 'ncRNA') { 31 | my $mafftcommand=$mafftfolder."mafft-qinsi"; 32 | runcmd("$mafftcommand --thread $numthreads $input > $output"); 33 | } 34 | else { 35 | my ($mafftin,$mafftout,$transfile,$aafile); 36 | if($aligntype ne 'direct') { 37 | $transfile=$input; 38 | $transfile=~s/(\.(\w+))+/\.trans\.fas/; 39 | my %parameter=('-i',$input,'-o',$transfile,'-g',$gencode,'-t',$termination,'-c',$incomplete,'-n',$numthreads,'-l',$logfile); 40 | runpl("transseq.pl",\%parameter); 41 | $mafftin=$transfile; 42 | if($aligntype eq 'codon') { 43 | $aafile=$output; 44 | $aafile=~s/(\.(\w+))+/\.aa\.fas/; 45 | $mafftout=$aafile; 46 | } 47 | else {$mafftout=$output;} 48 | } 49 | else {$mafftin=$input;$mafftout=$output;} 50 | my $mafftcommand=$mafftfolder."linsi"; 51 | runcmd("$mafftcommand --thread $numthreads $mafftin > $mafftout"); 52 | if($aligntype eq 'codon') { 53 | my %parameter=('-i',$input,'-b',$aafile,'-o',$output,'-g',$gencode,'-t',$termination,'-n',$numthreads,'-l',$logfile); 54 | runpl("revertransseq.pl",\%parameter); 55 | } 56 | if($ifmediate) { 57 | if($transfile) {unlink("$transfile") or printdie("\nError: $transfile fail to delete!\n");} 58 | if($aafile) {unlink("$aafile") or printdie("\nError: $aafile fail to delete!\n");} 59 | } 60 | } 61 | printlog("##### Alignment complished #####\n\n"); 62 | 63 | close(LOG); 64 | 65 | sub runcmd { 66 | printlog("$_[0]\n"); 67 | my $iferr=system("$_[0] 2>$logfile.alignseq.temp"); 68 | open(TEMP,"<$logfile.alignseq.temp") or printdie("\nError: $logfile.alignseq.temp can't open!\n"); 69 | while() {printlog("$_");} 70 | close(TEMP); 71 | unlink("$logfile.alignseq.temp") or printdie("\nError: $logfile.alignseq.temp fail to delete!\n"); 72 | if($iferr) { 73 | my $command=$_[0]; 74 | if($command=~/(\S+)/o) {$command=$1;} 75 | printdie("\nError in $command!\n"); 76 | } 77 | } 78 | 79 | sub runpl { 80 | my $pl=$_[0]; 81 | my $ref=$_[1]; 82 | my %parameter=%$ref; 83 | my $command=""; 84 | foreach my $x (keys %parameter) { 85 | if($parameter{$x}) { 86 | if($parameter{$x}=~/\s/) {$parameter{$x}="\'$parameter{$x}\'";} 87 | $command.=" $x $parameter{$x}"; 88 | } 89 | } 90 | printlog("$Bin/$pl $command\n"); 91 | if(system("$Bin/$pl $command")) {printdie("\nError in $pl!\n");} 92 | } 93 | 94 | sub printlog { 95 | print LOG $_[0]; 96 | printf $_[0]; 97 | } 98 | 99 | sub printdie { 100 | print LOG $_[0]; 101 | die "$_[0]\nYou can use '-h' to watch detailed help.\n"; 102 | } 103 | 104 | sub usage { 105 | 106 | die " 107 | perl $0 108 | Align sequences in a file by mafft. 
109 | Requirement: mafft 110 | 111 | Usage: 112 | -i input sequences file 113 | -o output sequences file 114 | -a type of alignment(direct/translate/codon/complement/ncRNA, default='direct', 'translate' means alignment of translation of sequences) 115 | -g genetic code(default=1, invertebrate mitochondrion=5) 116 | -t symbol of termination(default='X', mafft will clean '*') 117 | -c if translate incomplete codons into 'X'(default=no) 118 | -m if delete the intermediate files, such as translated files and aligned aa files(default=no) 119 | -f the folder where mafft/linsi is, if mafft/linsi had been in PATH you can ignore this parameter 120 | -n num threads(default=1) 121 | -l log file(default='alignseq.log') 122 | -h this help message 123 | 124 | Example: 125 | alignseq.pl -i inputfile -o outputfile -a aligntype -g gencode -t termination -c 1 -m 1 -f mafftfolder -n numthreads -l logfile 126 | 127 | Written by Yu-Hao Huang (2017-2024) huangyh45\@mail3.sysu.edu.cn 128 | "; 129 | 130 | } 131 | -------------------------------------------------------------------------------- /scripts/revertransseq.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use Bio::SeqIO; 4 | use Bio::DB::Fasta; 5 | use Getopt::Std; 6 | use Parallel::ForkManager; 7 | use strict; 8 | my %opt=('g'=>'1','t'=>'*','n'=>'1','l'=>'revertransseq.log'); 9 | getopts('i:b:o:g:t:n:l:h',\%opt); 10 | usage() if $opt{h}; 11 | 12 | my @ntfiles=split(',',$opt{i}); 13 | my $aafile=$opt{b}; 14 | my $alignfile=$opt{o}; 15 | my $gencode=$opt{g}; 16 | my $termination=$opt{t}; 17 | our $numthreads=$opt{n}; 18 | our $logfile=$opt{l}; 19 | open(LOG,">>$logfile") or die "\nError: $logfile can't open!\n"; 20 | 21 | printdie("\nError: no input file, blueprint file or output file was set!\n") unless @ntfiles&&$aafile&&$alignfile; 22 | printlog("\n##### Sequences reverse-translation begins #####\n"); 23 | my $aln=Bio::SeqIO->new(-file => "$aafile", -format => 'fasta'); 24 | my @seqs; 25 | my $pm = new Parallel::ForkManager($numthreads); 26 | $pm -> run_on_finish( sub { 27 | my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_; 28 | my @arr=@$data_structure_reference; 29 | printlog($arr[0]); 30 | if(@arr>2) {push(@seqs,$arr[2]);} 31 | else {printdie($arr[1]);} 32 | }); 33 | while(my $seq=$aln->next_seq()) { 34 | $pm->start and next; 35 | my $outstr; 36 | my $seqid=$seq->display_id; 37 | my ($db,$ntseq); 38 | foreach my $ntfile (@ntfiles) { 39 | $db= Bio::DB::Fasta->new("$ntfile"); 40 | $ntseq=$db->get_Seq_by_id($seqid); 41 | if($ntseq) {last;} 42 | } 43 | unless($ntseq) {$outstr.="\nWarning: $seqid can not be extracted in $opt{i}!\n";$pm->finish(0,[$outstr]);next;} 44 | my $seqstr; 45 | my $j=1; 46 | my $i; 47 | for($i=1;$i<=$seq->length();$i++) { 48 | my $base=$seq->subseq($i, $i); 49 | if($base eq '-') {$seqstr.="---";} 50 | elsif($j>$ntseq->length()) {$pm->finish(0,[$outstr,"\nError: no codon can be extracted in $seqid: $i-$base!"]);} 51 | else { 52 | my $n=$j+2>$ntseq->length() ? 
$ntseq->length() : $j+2; 53 | my $codon=$ntseq->subseq($j, $n); 54 | my $transbase=(Bio::Tools::CodonTable->new(-id=>$gencode))->translate($codon); 55 | $j=$n+1; 56 | if(length($codon)<3&&$base eq 'X') {$seqstr.=$codon;$outstr.="\nWarning: there is an incomplete codon in $seqid: $i-$base-$transbase($codon)\n";} 57 | elsif($base eq $transbase) {$seqstr.=$codon;} 58 | elsif($transbase eq '*') { 59 | if($base eq $termination) {$seqstr.=$codon;} 60 | else {$outstr.="\nWarning: there is an unexpected termination codon in $seqid: $i-$base-$transbase($codon)\n";$i-=1;} 61 | if($j<=$ntseq->length()) {$outstr.="\nWarning: there is a middle termination codon in $seqid: $i-$base-$transbase($codon)\n";} 62 | } 63 | else {$pm->finish(0,[$outstr,"\nError: the codon does not match the amino acid in $seqid: $i-$base-$transbase($codon)\n"]);} 64 | } 65 | } 66 | if($j+2==$ntseq->length()&&(Bio::Tools::CodonTable->new(-id=>$gencode))->is_ter_codon($ntseq->subseq($j, $j+2))) { 67 | my $codon=$ntseq->subseq($j, $j+2); 68 | $outstr.="\nWarning: there is an unexpected termination codon in $seqid: $i-end-*($codon)\n"; 69 | } 70 | elsif($ntseq->length()==$j||$ntseq->length()==$j+1) { 71 | my $codon=$ntseq->subseq($j, $ntseq->length()); 72 | $outstr.="\nWarning: there is an incomplete codon in $seqid: $i-end-($codon)\n"; 73 | } 74 | elsif($ntseq->length()!=$j-1) { 75 | $pm->finish(0,[$outstr,"\nError: nt length does not match the aa length in $seqid!"]); 76 | } 77 | my $alignseq = Bio::Seq->new( -seq => $seqstr, 78 | -id => $seqid, 79 | ); 80 | $outstr.="Reverse-translation of $seqid finished.\n"; 81 | $pm->finish(0,[$outstr,'',$alignseq]); 82 | } 83 | $pm->wait_all_children; 84 | my $out = Bio::SeqIO->new(-file => ">$alignfile", -format => 'fasta'); 85 | foreach my $seq (@seqs) {$out->write_seq($seq);} 86 | printlog("##### Sequences reverse-translation complished #####\n\n"); 87 | close(LOG); 88 | 89 | 90 | sub printlog { 91 | print LOG $_[0]; 92 | printf $_[0]; 93 | } 94 | 95 | sub printdie { 96 | print LOG $_[0]; 97 | die "$_[0]\nYou can use '-h' to watch detailed help.\n"; 98 | } 99 | 100 | sub usage { 101 | 102 | die " 103 | perl $0 104 | Used the aligned translated sequences in a file as blueprint to aligned nucleotide sequences, which means reverse-translation. 105 | 106 | Usage: 107 | -i input nucleotide sequences file or files(separated by ',') 108 | -b aligned amino acid sequences file translated by input file as blueprint 109 | -o output aligned nucleotide sequences file 110 | -g genetic code(default=1, invertebrate mitochondrion=5) 111 | -t symbol of termination in blueprint(default='*') 112 | -n num threads(default=1) 113 | -l log file(default='revertransseq.log') 114 | -h this help message 115 | 116 | Example: 117 | revertransseq.pl -i ntfile1,ntfile2,ntfile3 -b aafile -o alignedfile -g gencode -t termination -n numthreads -l logfile 118 | 119 | Written by Yu-Hao Huang (2017-2024) huangyh45\@mail3.sysu.edu.cn 120 | "; 121 | 122 | } 123 | -------------------------------------------------------------------------------- /scripts/connect.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use Bio::SeqIO; 4 | use Getopt::Std; 5 | use strict; 6 | my %opt; 7 | getopts('i:o:l:t:b:f:s:x:c:nh',\%opt); 8 | usage() if $opt{h}; 9 | my $indir=$opt{i}; 10 | my $outfile=$opt{o} ? $opt{o} : 'all.fas'; 11 | my $type=$opt{t} ? $opt{t} : 'phyloaln'; 12 | my $fill=$opt{f} ? $opt{f} : '-'; 13 | my $sep=$opt{s} ? $opt{s} : '.'; 14 | my $suffix=$opt{x} ? 
$opt{x} : '.fa'; 15 | my $blockfile=$opt{b}; 16 | my $nexus=$opt{n}; 17 | my @codonpos=split('',$opt{c}); 18 | my $list=$opt{l} ? $opt{l} : makelist($indir,$type,$sep,$suffix); 19 | my $suffixq=quotemeta $suffix; 20 | my $sepq=quotemeta $sep; 21 | open(F,"<$list") or die "\nError: $list can't open!\n"; 22 | open(FF,">$outfile") or die "\nError: $outfile can't open!\n"; 23 | if($blockfile) { 24 | open(B,">$blockfile") or die "\nError: $blockfile can't open!\n"; 25 | if($nexus) {print B "#nexus\nbegin sets;\n";} 26 | } 27 | while() { 28 | chomp; 29 | my $taxon=$_; 30 | my $str; 31 | my $num=0; 32 | while(<$indir/*$suffix>) { 33 | my $in=Bio::SeqIO->new(-file => "$_", -format => 'fasta'); 34 | my $gene=$_; 35 | $gene=~s/(([\w\.]+)\/)+//; 36 | $gene=~s/$suffixq//; 37 | my ($find,$len); 38 | while(my $seq=$in->next_seq()) { 39 | my $id=$seq->display_id; 40 | $len=$seq->length() unless $len; 41 | if($type eq 'phyloaln'&&$id=~/([^$sepq]+)$sepq/) {$id=$1;} 42 | elsif($type eq 'blastsearch') {$id=~s/(\|(\S+))+//;$id=~s/\_$gene//;} 43 | elsif($type eq 'orthograph') { 44 | if($id=~/^(\w+)\|(\w+)\|/) {$gene=$1;$id=$2;} 45 | else {die "Error: Invalid format of $id in $_!\n";} 46 | } 47 | if($taxon eq $id) {$find=1;$str.=$seq->seq;last;} 48 | } 49 | unless($find) { 50 | printf "No $gene in $taxon!\n"; 51 | for(my $i=0;$i<$len;$i++) {$str.=$fill;} 52 | } 53 | if($blockfile) { 54 | my $start=$num+1; 55 | $num+=$len; 56 | if(@codonpos) { 57 | for(my $i=0;$i<@codonpos;$i++) { 58 | if($nexus) {print B "charset ";} 59 | my $cstart=$start+$codonpos[$i]-1; 60 | print B "$gene\_codon$codonpos[$i]\t=\t$cstart-$num\\3;\n"; 61 | } 62 | } 63 | else { 64 | if($nexus) {print B "charset ";} 65 | print B "$gene\t=\t$start-$num;\n"; 66 | } 67 | } 68 | } 69 | $str=~s/\s//g; 70 | print FF "\>$taxon\n$str\n"; 71 | if($nexus) {print B "end;\n";} 72 | close(B); 73 | $blockfile=""; 74 | } 75 | close(F); 76 | 77 | sub makelist { 78 | my $dir=$_[0]; 79 | my $type=$_[1]; 80 | my $sep=$_[2]; 81 | my $suffix=$_[3]; 82 | my %taxon; 83 | my $suffixq=quotemeta $suffix; 84 | my $sepq=quotemeta $sep; 85 | if($type eq 'phyloaln') { 86 | while(<$dir/*$suffix>) { 87 | my $in=Bio::SeqIO->new(-file => "$_", -format => 'fasta'); 88 | my $gene=$_; 89 | $gene=~s/(([\w\.]+)\/)+//; 90 | $gene=~s/$suffixq//; 91 | while(my $seq=$in->next_seq()) { 92 | my $id=$seq->display_id; 93 | if($id=~/([^$sepq]+)$sepq/) {$id=$1;} 94 | $taxon{$id}=1; 95 | } 96 | } 97 | } 98 | elsif($type eq 'blastsearch') { 99 | while(<$dir/*$suffix>) { 100 | my $in=Bio::SeqIO->new(-file => "$_", -format => 'fasta'); 101 | my $gene=$_; 102 | $gene=~s/(([\w\.]+)\/)+//; 103 | $gene=~s/$suffixq//; 104 | while(my $seq=$in->next_seq()) { 105 | my $id=$seq->display_id; 106 | $id=~s/(\|(\S+))+//; 107 | $id=~s/\_$gene//; 108 | $taxon{$id}=1; 109 | } 110 | } 111 | } 112 | elsif($type eq 'orthograph') { 113 | while(<$dir/*$suffix>) { 114 | my $in=Bio::SeqIO->new(-file => "$_", -format => 'fasta'); 115 | while(my $seq=$in->next_seq()) { 116 | my $id=$seq->display_id; 117 | if($id=~/^(\w+)\|(\w+)\|/) {my $taxa=$2;$taxon{$taxa}=1;} 118 | else {die "Error: Invalid format of $id in $_!\n";} 119 | } 120 | } 121 | } 122 | my $list='list'; 123 | open(L,">$list") or die "\nError: $list can't open!\n"; 124 | foreach my $taxa (keys %taxon) {print L "$taxa\n";} 125 | close(L); 126 | return $list; 127 | } 128 | 129 | sub usage { 130 | 131 | die " 132 | perl $0 133 | Concatenate multiple alignments into a matrix. 
134 | 135 | Usage: 136 | -i directory containing input FASTA alignment files 137 | -o output concatenated FASTA alignment file 138 | -t type of input format(phyloaln/orthograph/blastsearch, default='phyloaln', also suitable for the format with same species name in all alignments, but the name shuold not contain separate symbol) 139 | -f the symbol to fill the sites of absent species in the alignments(default='-') 140 | -s the symbol to separate the sequences name and the first space is the species name in the 'phyloaln' format(default='.') 141 | -x the suffix of the input FASTA alignment files(default='.fa') 142 | -b the block file of the positions of each alignments(default=not to output) 143 | -n output the block file with NEXUS format, suitable for IQ-TREE(default=no) 144 | -c the codon positions to be written in the block file(default=no codon position, '123' represents outputing all the three codon positions, '12' represents outputing first and second positions) 145 | -l the list file with all the involved species you want to be included in the output alignments, one species per line(default=automatically generated, with all species found at least once in all the alignments) 146 | -h this help message 147 | 148 | Example: 149 | connect.pl -i inputdir -o outputfile -t inputtype -f fillsymbol -s separate -x suffix -b block1file -n -c codonpos -l listfile 150 | 151 | Written by Yu-Hao Huang (2018-2024) huangyh45\@mail3.sysu.edu.cn 152 | "; 153 | 154 | } 155 | 156 | -------------------------------------------------------------------------------- /tests/ref/OG0003820.fa: -------------------------------------------------------------------------------- 1 | >DBUSC.3865 2 | ATGTTTGCGTTACGACATTTGTGTCTGCAGGGCCGCTTAAGAATGCATTTTCCACCCGCAgtagcagcgacagcaactacaattacaacaacaaaacacatgTCAGCTGTGCCGGCGCGTGTTTTGGTTGAGatgcacacacatgccaaGAGCTGCTACGGTAGACATAAGTCTCTATTGCTCCAAGAGCGAATGACAACCTGGACGCTAAACCAAAGAACAACGACCACGCTTTCTGCAACAGATTCCAATGAACAAAAGAAACCGCCGACAAAGTCGCCGCTGCAAGAGTTAGTGGCTGCTGCAAGACCATATGCACAGCTAATGCGTATTGATCGGCCAATTGGAACGTACTTGCTGTTCTGGCCATGCGGCTGGAGCATAGCTCTTAGCGCTAACGCTGGCTGTTGGCCAGATTTTGCCATGCTAGGATTGTTTGCCACGGGCGCGCTAATTATGCGTGGTGCCGGTTGCACCATTAACGATCTCTGGGACAAGGACATAGATGCAAAAGTTGAGCGTACGCGAAGTCGCCCCCTAGCTGCTGGAAAGATTACACAATTTGACGCTATAGTTTTTCTCTCGGCGCAACTCAGTTTGGGCTTGCTTGTGCTCGTGCAGCTGAACTGGCAATCTATACTGCTGGGTGCCAGTTCGCTGGGTCTAGTGATCACGTATCCACTAATGAAGCGTGTCACATACTGGCCACAACTAGTGCTTGGCATGTGCTTTAACTGGGGTGCGCTCTTGGGCTGGTGTGCTACACAGGGCAGCGTAAACCTAGATGCCTGTTTACCACTTTACTTGTCAGGCGTGTGCTGGACAATTGTGTATGACACCATCTATGCACACCAGGATAAACTGGACGATATGCAGATTGGCGTCAAGTCAACGGCACTGCGTTTTGGCGAGAATACTAAAATGTGGCTATCAGGATTTACTGCGGCGATGCTGACGGGTCTCTCGGCTGCTGGTTATGCCTGCGATCAAACACTTCCATACTACGCAGCCGTTAGCATAGTTGGTGCTCATCTGGCACAGCAGATCTACTCACTGAACATAGATAATCCTAGCGATTGCGCAAAAAAGTTCTTTTCGAACCAACAAGTTGgcctcattttatttttaggcatTGTGCTGGGCACACTGCTCAAGTCGGACGATCCTAAGCAGCAGCGTAAATCTTTAAGCACACCCGCATCCACAGCTGCTTATGTGCCGTTGCCCCAAACGAAACCCGATGTAATTAGC 3 | >DMOJA.1254 4 | 
ATGTTTGCGTTACGACACTTGCGACTACAAGGCAGGGCTAAAATACATTTGCCATATGCTGCGatggcaacaactgcaacagcaacaactacaaaacaCATGCCAGCTGTGCCGGCGCGTGTCTATGTGggtctacatacatacagcagCGAGTATCGCATGCCAAgactgctgatgctgcaggAACATATGACAAGTTGGAGGCTCCACCAAAGAACCACAACAACGCTGTCGAAATCTTCGAACCAGGAGAGGAAGACGCCGCATAAAGCAACACTGTTACAGGAACTGGCCGACGCCGCTAAGCCCTACGCACAGCTGATGCGTATAGACCGACCCATCGGGACATATCTGCTCTTCTGGCCGTGCGGCTGGAGCATAGCTCTGAGTGCTGACGCCGGCTGCTGGCCAGACTTTTCGATGCTGGGACTCTTTGCATTGGGCGCATTGATTATGCGCGGCGCGGGCTGCACCATCAACGACATGTGGGACAAAGATATCGACGCAAAGGTGGAAAGAACGCGCACTCGTCCTCTGGCCTCTGGACAGATTACGCAGTTCGATGCGATTGTATTTCTCTCGGCGCAGCTTAGCCTGGGTTTGCTTGTGCTGGTGCAGCTGAACTGGCAGTCCATCCTTCTGGGCGCCAGTTCGCTGGGTTTGGTAATTACGTACCCGCTGATGAAGCGCGTCACGTACTGGCCCCAGCTGGTGCTCGGCATGTGTTTTAACTGGGGCGCCCTGCTTGGCTGGTGTGCCACTCAGGGAAGCGTTAATCTGGAAGCCTGCCTGCCACTCTATCTATCCGGTGTGTGCTGGACCATTGTCTACGACACCATCTATGCCCACCAAGACAAGCTGGACGACTTGCAGATAGGCGTCAAGTCGACGGCGCTTCGTTTTGGAGAGAACACAAAAGTCTGGCTCTCCTGCTTTACAGCTGCTATGCTGGCGGGTCTCACCTCTGCCGGCTATGCCTGTGACCAGACTCTGCCCTATTATACCGCAGTGGGCGTCGTCGGAGCGCACTTGGTGCAGCAGATCTACTCCCTCAACATAGACAATCCCAGTGACTGCGCTAAGAAGTTTCTCTCCAACCAGCAGGTGGGCCTGATTCTTTTCCTGGGTATCGTGCTGGGCACTCTGCTGAAGTCGGACGACAGCAAGAAACAGAGCAAAGCGACGCTAGCACCTGTGACAGCTCCATCATATGTACCTTTACCCCAATCAAAGCCGGACGTAATCAGC 5 | >DPSEU.12475 6 | ATGTTTGCGGTACGACATTTGCTGAAGAGCAGAAAGCATTTTCCCTACGCTTATGCGGCggcgacaacaacaaagagcAGGCTGCCAATGCCAGCTGTGCCGGCGCGTGTTCTTATTGGCCTCCACACAGATAGTGATTGCCGCAACGAGAGGCTACCGCAGATCCAGGAGCTTTCTTTTCGCAAGATGTCTACGCTGCCAACATCCAAGAAGCCAGGATCGGTGCTCGAAGAGCTGTACGCAGCTACGAAACCATATGCCCAGCTGATGAGGATTGATCGACCCATTGGCACTTACTTGTTGTTCTGGCCCTGTGCGTGGAGCATAGCGTTGAGTGCAGATGCAGGCTGTTGGCCGGACCTTACCATGCTGGGCTTGTTTGGAACGGGAGCATTGATAATGCGTGGCGCCGGCTGCACTATAAACGATCTCTGGGACAAGGATATCGACGCCAAGGTGGAGCGAACGCGGACGCGGCCACTCGCATCCGGGCAGATTAGTCAGTTCGATGCTATTGTCTTTCTCTCGGCACAGCTGAGTCTCGGACTACTGGTTCTTGTCCAGCTCAACTGGCAGTCAATTCTGTTGGGCGCCAGCTCACTGGGGCTGGTGATCACTTATCCCTTGATGAAGAGGGTGACCTATTGGCCCCAGTTGGTTCTTGGCATGGCCTTCAACTGGGGTGCTTTGTTGGGATGGTGTGCAACACAGGGAAGCGTCAACCTGGCCGCTTGTCTTCCGCTCTACCTGTCTGGTGTTTGCTGGACCATTGTCTACGACACAATCTACGCCCACCAGGATAAGCTAGACGATTTGCAAATCGGTGTCAAATCCACTGCTTTGCGTTTTGGTGAGAACACCAAAGTTTGGCTATCAGGTTTCACCGCAGCCATGCTGACGGGTCTCTCTACCGCCGGCTGGGCCTGCGATCAAACGCTGCCGTACTATGCCGCTGTTGGAGTTGTTGGCGCTCATCTTGTGCAGCAGATCTACTCCCTCAACATTGACAATCCCACCGACTGCgccaagaaatttttatcgaATCATCAGGTAGGACTCATTCTGTTTCTTGGAATCGTCCTTGGCACACTACTGAAAGCGAACGACACTAAAAATCAGCCGCAACCCGCACTAACATCATCGGCAGCCAGCTCCTATGCTTCGCTAACTCAAAAACCAGAAGTTTTGAGC 7 | >DYAKU.10021 8 | 
ATGTATGCGCTACGACACCTGCGACTCCAGAGCGCACGACACCTCCGCAGCTCTTAtgcagcggcggcaacaacaaaacacatGCTGCCCCGGCAACCAGCGCGTGTTCTGATTGGAGATGGGAGCACCTGGGATAAGTACCAAGTACAGGATGTATACTCCAGGAGTTCGAGTACCGCCACTGAGCCCGTGAAGCCGCAAACGCCGCTGCAGGAACTGGTGTCAGCCGCCAAACCCTATGCCCAACTGATGCGGATCGACCGGCCTATTGGCACCTACCTCCTCTTCTGGCCCTGCGCCTGGAGCATAGCGCTCAGCGCGGATGCGGGTTGCTGGCCGGACCTGACCATGCTCGGTCTGTTTGGCACCGGGGCACTGATAATGCGCGGCGCCGGGTGCACCATTAACGATCTCTGGGACAAGGACATCGATGCCAAGGTGGAGCGCACAAGATTGCGGCCCTTGGCCTCGGGACAAATCAGCCAGTTCGATGCCATAGTATTCCTCTCGGCTCAGCTTAGTCTGGGTCTTTTGGTGCTGGTCCAGCTCAACTGGCAGTCCATATTGTTGGGCGCCAGTTCTCTGGGTCTCGTAATCACCTATCCACTCATGAAAAGAGTCACCTACTGGCCCCAGCTGGTTCTGGGCATGGCTTTCAACTGGGGCGCCCTACTGGGATGGTGTGCCACCCAGGGCAGTGTTAATCTGGCCGCCTGCCTGCCGCTCTACCTTTCCGGTGTATGCTGGACCATTGTGTACGACACCATATACGCCCACCAAGACAAGCTGGATGACCTGCAAATCGGCGTGAAATCCACGGCTCTGAGATTTGGCGAGAACACCAAGGCTTGGCTGTCTGGATTCACGGCAGCCATGCTGACTGGTCTTTCCGCCGCTGGCTGGGCCTGCGATCAAACGGTGCCCTACTACGCGGCTGTTGGAGTAGTGGGTGCCCATCTAGTGCAGCAGATCTACTCCCTCAACATTGACAACCCCAGCGACTGCGCCAAGAAGTTCCTATCGAACCATAAAGTAGGACTCATTCTATTCCTTGGCATTGTTTTGGGCACCCTTCTGAAATCAGACGAGACCAAGAAACAGCGCCAATCCTCACTGACAACATCTACGGCCAGCTCATACGTTCCAGCGCTGCCGCAAAAGCAAGAAGTTATAAGC 9 | >SLEBA.9410 10 | ATGTTCGCCCTACGCCAGATGCGACTCCAAGGTAGAATTCACATGCCATATTCAGCAGCAAGAGCGAATATAACACCGATCTTGCCGGCGCGTGTTCTAATTagcctgcacacacacacttatgaCCACCGACAGAGGCTATCACAACATTGCAAGCACATGGCAATTGCAAAGCCGCAAATGTATTTGCAGCGAACCAGTTCCACCCTAAGCGCGCCTAAAGAACAGAGTGAAACACCTGATGGGCGTTCCACAAAGTCCTTGATGGAGGAATTGAGCACCAGTGTCAGGCCTTACAAGCAGCTGATGCGCTTAGATCGCCCAATAGGAACGTACCTACTGTTTTGGCCCTGCGGCTGGAGTATCGCGTTGAGCGCGGATGCCGGCTGTTGGCCCGACTTAACGATGTTAGGGTTGTTTGCCACAGGTGCTTTAATTATGCGCGGTGCCGGTTGCACCATCAACGATCTGTGGGACAAGGACATTGATGCAAAGGTAGAGCGTACACGCTCTCGTCCATTAGCATCTGGTCAAATAACACAGTTCGATGCCATAGTGTTTTTATCGGCACAGCTGAGCTTGGgcctgctggtgctggtgcagCTGAACTGGCAGTCTATACTGCTGGGCGCCAGCTCCTTGGGCCTGGTTATTACGTACCCGCTTATGAAACGTGTAACATACTGGCCACAACTAGTGCTGGGCATGGCCTTCAACTGGGGCGCCCTGCTGGGTTGGTGTGCCACTCAAGAAAGCATCAATTTAGCCGCCTGCCTACCGCTTTATCTTTCGGGTGTATGCTGGACTATTGTATATGACACGATCTATGCGCACCAAGACAAGCTTGACGATTTGCAAATTGGCGTTAAGTCGACGGCTCTGCGATTCGGCGAAAATACGAAAGTGTGGCTCTCCGGTTTTACAGCAGCCATGCTCACCGGCCTTTCCGCGGCAGGATGGGCATGTGATCAAACGCTGCCCTACTACGCTTCTGTTGGAATAGTTGGCGCACATTTGGCTCAACAGATCTACTCGCTGAACATAGACAATCCGAGTGATTGCGCCAAGAAATTCTTTTCAAATCATCAGGTTGGTCTCATTCTCTTTCTTGGCATTGTGCTTGGCACGCTCCTTAAGTCGAAAGACGCAAAAAAACAACGACAAACTGCGCCCACATCTACAACAGCCAACACCTATGTAGCGCTACCAGCAAACCCCGAGGTTATAAGc 11 | -------------------------------------------------------------------------------- /lib/library.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import shutil 4 | import gzip 5 | import subprocess 6 | from multiprocessing import Pool, set_start_method 7 | from Bio.Seq import translate 8 | from Bio.SeqIO import index_db 9 | 10 | parentdir = os.path.dirname(os.path.realpath(__file__)) 11 | PhyloAlndir = os.path.dirname(parentdir) 12 | epath = os.environ['PATH'] 13 | 14 | # check if the programs exist in the path 15 | def check_programs(progs): 16 | for program in progs: 17 | fullpath = shutil.which(program) 18 | if not fullpath: 19 | print("\nError: fail to find {}! 
Please install it and add it to the PATH!".format(program)) 20 | sys.exit(1) 21 | 22 | # run the command 23 | def runcmd(cmd, log, env=None, stdout=True, error=False): 24 | if env: 25 | # add the environment into the path 26 | os.environ['PATH'] = env + ':' + epath 27 | if stdout: 28 | print('PATH: +' + env) 29 | log.write('PATH: +' + env + "\n") 30 | log.write(' '.join(cmd) + "\n") 31 | if stdout: 32 | print(' '.join(cmd)) 33 | p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 34 | # synchronous output to the log file and the screen 35 | while p.poll() is None: 36 | cmdout = p.stdout.readline().decode("utf8") 37 | if cmdout: 38 | log.write(cmdout) 39 | if stdout: 40 | print(cmdout.rstrip("\n")) 41 | 42 | # restore to the original environment 43 | os.environ['PATH'] = epath 44 | 45 | # if error occurs, stop the script and exit 46 | if p.returncode != 0: 47 | if error: 48 | print("\nError in '" + ' '.join(cmd) + "'!") 49 | sys.exit(1) 50 | else: 51 | # if not 'error', return the state 52 | return False 53 | elif not error: 54 | return True 55 | 56 | # run the function by single process with pbar 57 | def run_sp(function, args_list, kwds={}, total=None, finish=0): 58 | # If total number is set, the number will be used, otherwise the length of the list will be used as total number 59 | if total is None: 60 | total = len(args_list) 61 | results = [] 62 | 63 | for args in args_list: 64 | if kwds: 65 | # add the fixed parameters 66 | result = function(*args, **kwds) 67 | else: 68 | result = function(*args) 69 | results.append(result) 70 | # refresh the pbar 71 | finish += 1 72 | sys.stdout.write("[{}] {}{}/{} ({}%)\r".format(("+" * int((finish/total)*50)) + (" " * (50 - int((finish/total)*50))), (" " * (len(str(total))-len(str(finish)))), finish, total, '%.2f' %(finish/total*100))) 73 | sys.stdout.flush() 74 | 75 | return results 76 | 77 | # run the function by multiprocess with pbar 78 | def run_mp(function, args_list, cpus, kwds={}, sp_list=[]): 79 | multinum = len(args_list) 80 | total = multinum + len(sp_list) 81 | results = [] 82 | 83 | if cpus == 1: 84 | # when using 1 cpu, use the function directly instead of multiprocess 85 | results.extend(run_sp(function, args_list, kwds=kwds, total=total)) 86 | elif multinum > 0: 87 | # use multiprocess 88 | # setup multiprocess method using less memory 89 | try: 90 | set_start_method('spawn') 91 | except RuntimeError: 92 | # avoid setting repeatedly 93 | pass 94 | # setup pool 95 | p = Pool(cpus) 96 | finish = 0 97 | # setup results and split over cpus 98 | for args in args_list: 99 | if kwds: 100 | # add the fixed parameters 101 | results.append(p.apply_async(function, args=args, kwds=kwds)) 102 | else: 103 | results.append(p.apply_async(function, args=args)) 104 | # print the pbar 105 | while True: 106 | finish_task = sum(1 for result in results if result.ready()) 107 | if finish_task == finish: 108 | continue 109 | # when new task finished, refresh the pbar 110 | finish = finish_task 111 | sys.stdout.write("[{}] {}{}/{} ({}%)\r".format(("+" * int((finish/total)*50)) + (" " * (50 - int((finish/total)*50))), (" " * (len(str(total))-len(str(finish)))), finish, total, '%.2f' %(finish/total*100))) 112 | sys.stdout.flush() 113 | if finish == multinum: 114 | break 115 | p.close() 116 | p.join() 117 | # extract the outputs of the function 118 | results = [result.get() for result in results] 119 | 120 | # finally manage the list of single process if set 121 | if sp_list: 122 | results.extend(run_sp(function, sp_list, kwds=kwds, 
finish=multinum, total=total)) 123 | 124 | # end the pbar 125 | print("\n") 126 | return results 127 | 128 | # read the FASTQ or FASTA file 129 | def read_fastx(fastx, file_format='guess', select_list=None, low_mem=False): 130 | seqs = {} 131 | if file_format == 'large_fasta': 132 | db_dict = index_db(fastx + '.idx', fastx, 'fasta') 133 | for seqid, seqinfo in db_dict.items(): 134 | if select_list is None or seqid in select_list: 135 | seqs[seqid] = str(seqinfo.seq) 136 | db_dict.close() 137 | return seqs 138 | elif fastx.endswith('.gz'): 139 | reads = gzip.open(fastx, 'rt') 140 | else: 141 | reads = open(fastx) 142 | if file_format == 'guess': 143 | for line in reads: 144 | if line.startswith('@'): 145 | file_format = 'fastq' 146 | elif line.startswith('>'): 147 | file_format = 'fasta' 148 | break 149 | # recover the file iteration 150 | if fastx.endswith('.gz'): 151 | reads = gzip.open(fastx, 'rt') 152 | else: 153 | reads = open(fastx) 154 | print("Detected format: {}".format(file_format)) 155 | if low_mem: 156 | # if using low-memory mode, return the file iteration and file format instead of reading the sequences 157 | return reads, file_format 158 | seqid = None 159 | line_num = 0 160 | for line in reads: 161 | line = line.rstrip() 162 | if file_format == 'fastq' and line.startswith('@') and line_num % 4 == 0: 163 | seqid = line.replace('@', '', 1).replace(' ', '_') 164 | if select_list is not None: 165 | if seqid not in select_list: 166 | seqid = None 167 | elif file_format == 'fasta' and line.startswith('>'): 168 | arr = line.split(" ") 169 | seqid = arr[0].lstrip('>') 170 | if select_list is None or seqid in select_list: 171 | seqs[seqid] = '' 172 | else: 173 | seqid = None 174 | elif seqid: 175 | if file_format == 'fastq': 176 | seqs[seqid] = line 177 | seqid = None 178 | else: 179 | seqs[seqid] += line 180 | line_num += 1 181 | return seqs 182 | 183 | # translate the (gappy) sequences in a FASTA file 184 | def trans_seq(filename, output, gencode=1, dna_codon_unknow=None): 185 | seqs = read_fastx(filename, 'fasta') 186 | outfile = open(output, 'w') 187 | for seqid, seqstr in seqs.items(): 188 | tran_str = '' 189 | i = 0 190 | while i < len(seqstr): 191 | if seqstr[i:i+3] == '---': 192 | tran_str += '-' 193 | elif dna_codon_unknow: 194 | try: 195 | tran_str += translate(seqstr[i:i+3], table=gencode) 196 | except: 197 | tran_str += dna_codon_unknow 198 | else: 199 | tran_str += translate(seqstr[i:i+3], table=gencode) 200 | i += 3 201 | outfile.write(">{}\n{}\n".format(seqid, tran_str)) 202 | outfile.close() 203 | 204 | -------------------------------------------------------------------------------- /tests/ref/OG0003531.fa: -------------------------------------------------------------------------------- 1 | >DBUSC.12359 2 | 
ATGGTGCTAGACCTGGATTTGTTTCGCAGCGACAAGGGCGGCAACCCTGATGCAGTCCGTGAGAATCAAAAGAAGCGTTTTAAAGATGTGGGACTTGTAGAGGCAGTCATTGAAAAAGACACGGAATGGCGACAGCGCCGTCATCGAGCCGACAATCTGAACAAGGTGAAAAATGTCTGCAGCAAGGTAATCGGTGAGAAAATGAAGAAGAAGGAGCCCGTAGGCCCCGAGGGCGAGGAGGTGCCAGCTGCTATACGAGTAGATTTGACGCAAATAACTGCGGAGACACTACAAGCGTTGACAGTGAATCAGATCAAACAGCTGCGCTTACTCATCGATGATGCGATGACGGACAATCAAAAGTCTCTGGAGCTGGCTGAACAGACGCGCAATACGGCACTGCGTGAGGTGGGCAATCATCTGCACGATTCTGTGCCCGTGTCTAATGACGAGGAGGAGAATCGTGTTGAGCGAACCTTTGGGGACTGCGAGAAGCGCGGCAAGTACTCACATGTGGATCTAATTGTGATGATCGATGGCATGAATGCGGAGAAGGGCGCTGTCGTATCTGGAGGACGTGGCTACTTCCTTACTGGCGCTGCTGTATTTCTGGAGCAAGCGCTGATTCAGCATGCGCTTCAGCTGCTCTATACGCGTGACTATACTCCACTTTATACGCCCTTCTTTATGCGCAAAGAGGTGATGCAAGAGGTGGCGCAGCTCTCTCAGTTCGACGAGGAGCTTTACAAAGTGGTTGGCAAGGGAAGCGAGCGTGCCGAAGAGTGTGGCACCGATGAAAAGTACCTGATAGCCACCTCGGAGCAGCCTATTGCGGCGTATCATCGCGACGAGTGGCTGCCAGAGTCATCGCTACCCATTAAATACGCCGGCTTGTCGACTTGCTTCCGCCAGGAGGTAGGCTCGCATGGACGCGATACTCGTGGCATTTTCCGCGTGCACCAGTTTGAGAAGGTGGAACAGTTCGTACTGACTTCTCCACATGACAACAAATCGTGGGAGATGATGGACGAGATGATTGGCAACGCAGAGCAGTTCTGCCAATCTCTGGGCATACCATATCGCGTGGTTAACATTGTTTCTGGAGCTCTCAATCACGCCGCTTCCAAGAAACTCGATCTGGAGGCCTGGTTTGGCGGCAGCGGTGCCTTCAGGGAGCTGGTTTCCTGCTCCAATTGCCTAGACTACCAGGCTCGTCGCTTGCTGGTGCGCTACGGTCAGACCAAGAAGATGAACGCGGCTGTGGACTATGTGCACATGCTTAATGCTACAATGTGCGCTGCTACTCGTGTTATCTGCGCCATTCTCGAGACGCACCAAACAGAAACGGGCGTCAAGGTGCCCGAGCCCCTCAAGAAGTATATGCCGGCGAAGTTCCAGGACGAGATTCCTTTCGTTAAGCCAGCACCCATTGATCTGGAGCTTGCCGCTGCGGCCAATCAAAAGGCCAAGAAGGATAAGACCAAGAAGGATCCAGCCGCTGCC 3 | >DMOJA.5365 4 | ATGGTTTTGGATTTGGATCTATTTCGCAAAGACAAGGGCGGCAATCCCGATGCGGTGCGCGAGAATCAAAAGAAGCGCTTCAAAGATGTCGGCCTCGTCGAGACCGTCATCGAAAAGGACACCGAATGGCGTCAGCGTCGGCATCGAGCCGACAACCTGAACAAGGTGAAGAACGTCTGCAGCAAAGTTATTGGcgagaagatgaagaagaaggagcCCCTAGGCGCCGATGGCGAGGAGGTGCCCGCTTCGGTACGCTCGGATCTAACGCAAATCACAGCCGAAACCCTGCAAGCGTTGACGGTGAATCAAATCAAGCAATTGCGGCTGCTCATCGATGATGCGATGACCGAGAATCAAAAAGCGCTGGAGCTGGCAGAGCAAACCAGGAATACTTCGCTGCGTGAGGTGGGCAATCACTTACACGAATCGGTGCCCGTTTCCAACGACGAGGACGAGAATCGTGTGGAGCGCACATTTGGTGATTGTACGAAGCGTGGCAAATATTCGCATGTGGATTTGATTGTTATGATCGATGGCATGAATGCCGAGAAGGGCGCCGTGGTGTCCGGTGGCCGTGGTTACTTCCTAACTGGCGCCGCTGTCTTCCTGGAGCAAGCCGTCATACAGCATGCCCTGCATTCGCTCTACCAGAAGGACTATGTGCCCCTTTATACGCCCTTCTTTATGCGCAAGGAGGTGATGCAGGAAGTCGCCCAGCTGTCGCAGTTCGACGAAGAGCTGTACAAGGTGGTGGGCAAGGGCAGCGAACGGGCCGAGGAGAGCGGCACAGATGAGAAATACCTGATAGCCACCTCGGAGCAGCCGATAGCCGCCTATCATCGTGATGAATGGCTGCCAGAGGCATCGCTACCAATCAAGTACGCTGGCCTCTCCACGTGCTTCCGCCAAGAGGTGGGCTCCCATGGACGTGATACGCGAGGCATCTTCCGGGTGCATCAGTTCGAGAAGGTGGAACAATTTGTGCTGACCTCGCCGCATGACAACAAATCGTGGGAGATGATGGACGAGATGATTGGCAATGCGGAGCAGTTCTGCCAATCTCTAGGCATCCCATACCGCATTGTCAACATTGTTTCGGGCGCACTGAATCATGCCGCCTCTAAGAAGCTCGATTTAGAGGCCTGGTTCGGCGGCAGTGGCGCCTACAGAGAGCTTGTCTCCTGCTCCAATTGTCTGGACTACCAGGCACGTCGACTGCTGGTGCGCTATGGCCAGACCAAGAAGATGAATGCAGCTGTGGACTACGTCCACATGCTGAATGCCACAATGTGCGCTGCCACGCGTGTGATTTGCGCCATCCTGGAGACCCATCAGACGGAGACGGGCATCAAGGTGCCGGAGCCTTTAAAGAAATACATGCCCTCCAAATTTCAGGATGAAATTCCATATGTGAAGCCAGCGCCAATTGACTTGGAACAAGCTGCCGCCGACAAACAGAAGTCCAAGAAGGAGAAGCCGAAGAAGGATCCTGCAGCTGCC 5 | >DPSEU.1497 6 | 
ATGGTGCTGGATTTGGATCTGTTTCGCAGCGACAAGGGCGGCAATCCCAATGCCGTGCGCGACAACCAGAAGAAGCGCTTCAAAGATGTAGCCCTGGTGGAGACTGTGATTGAGAAAGACACCGAGTGGCGTCAGTGCCGTCATCGGGCCGACAACCTGAACAAGGTGAAGAATGTGTGCAGCAAGGTGATTGGCgagaagatgaagaagaaggAGCCAGTGGGTGCCGTGAGCGAGGAGCTGCCTGCTGCGGTAACCACAAGCCTAACTGAGATTGTGCCCGAAACTCTGCAGCCGCTGACGGTCAACCAGATCAAGCAGCTGCGCGTGCTCATTGACGATGCCATGACGGAGAACCAAAAGTCCCTGGAGCTGGCGGAGCAGACCAGAAACACCTCTTTACGAGAGGTCGGCAATCACCTTCACGATTCGGTCCCTGTGTCTAACGATGAAGAGGAGAACCGCGTCGAGAGGACCTTTGGCGATTGCGAAAAGCGTGGCAAGTACTCGCATGTGGATCTGATTGTAATGATTGATGGCATGAATGCCGAGAAAGGATCTGTGGTATCTGGGGGACGCGGCTATTTCCTCACCGGTGCCGCTGTCTTTCTTGAGCAGGCGCTCATTCAACATGCCCTGCACTTGCTGTACGCTAAGGACTATGTCCCCCTATATACGCCCTTCTTTATGCGCAAGGAGGTGATGCAAGAGGTGGCCCAGCTCTCGCAGTTCGACGAGGAGCTCTACAAGGTGGTGGGCAAGGGAAGCGAGAAAGCTGAAGAGGCTGGCACCGACGAGAAGTATCTGATTGCCACCTCTGAGCAGCCCATTGCCGCCTATCATCGCGACGAATGGCTCCCGGAGGCTTCGCTACCCATCAAATATGCCGGCTTGTCCACGTGTTTCCGTCAGGAGGTAGGCTCCCATGGCCGCGACACTCGTGGCATATTCCGCGTGCATCAGTTTGAGAAGGTCGAACAGTTTGTGCTCACTTCCCCACACGACAACAAATCCTGGGAGATGATGGACGAGATGATTGGCAATGCCGAGCAGTTCTGTCAGTCCCTGGGTATTCCATATCGCGTTGTGAACATAGTGTCCGGTGCCCTCAATCATGCCGCCTCCAAGAAGTTAGATCTGGAGGCCTGGTTTGGCGGCAGCGGTGCCTACAGAGAGCTCGTCTCCTGCTCCAATTGCCTGGACTACCAGGCGCGTCGTCTGCTGGTTCGCTTTGGCCAAACCAAGAAGATGAACGCCGCCGTGGACTACGTACACATGTTGAATGCGACAATGTGCGCTGCCACACGTGTCATTTGCGCCATTCTGGAAACGCATCAGACAGAGACGGGCATCAAGGTGCCGGAACCACTCAAGAAATACATGCCGGCTAAGTTCCAAGATGAGATTCCGTTTGTCAAGCCCGCTCCCATCGATCTGGAGCTGGCCGCGGCCGAGAAACAGAAGGGAAAGAAGGACAAGAGCAAGAAGGATCCAGCTGCCGGT 7 | >DYAKU.11913 8 | ATGGtgctggatctggatctgttTCGCAGCGACAAGGGAGGCAACCCGGACCTCGTGCGCGAAAACCAAAAGAAGCGCTTCAAGGATGTGGCGCTGGTGGAGACGGTGATCGCCAAGGATACTGAGTGGCGTCAGTGCCGCCACCGTGCCGACAACCTGAACAAGGTGAAGAACGTCTGCAGCAAGGTGATCGGCGAGAAGATGAAGAAGAAGGAGCCGGTGGGTGCAATGAGCGAGGACCTGCCCGCAGACGTGACCAAGGACCTCACCGAGATTGTGGCCGAGACACTGCAGCCGCTGACCGTCAACCAGATCAAGCAGCTGCGCGTGCTCATCGACGACGCAATGACGGAGAACCAGAAGTCCCTGGAGCTTGCCGAGCAAACGAGGAACACCTCACTGCGGGAGGTGGGCAACCACCTGCACGAGTCCGTCCCAGTGTCAAACGACGAGGACGAGAACCGTGTGGAGCGGACCTTTGGCGACTGCGAAAAGCGCGGCAAGTATTCGCATGTGGACCTCATCGTGATGATCGACGGCATGAACGCGGAGAAGGGTGCCGTGGTGTCCGGCGGACGTGGTTACTTCCTTACCGGAGCCGCAGTCTTCTTGGAGCAAGCTCTCATTCAGCACGCCCTGCACCTGCTGTACGCCCGTGACTACGTTCCCCTGTACACGCCCTTCTTCATGCGGAAGGAGGTAATGCAGGAGGTGGCCCAGCTGTCACAATTCGACGAGGAGCTCTACAAGGTGGTGGGTAAGGGCAGCGAGAAGGCCGAGGAGGTAGGCATCGATGAGAAGTACCTGATCGCCACCTCAGAGCAGCCCATCGCCGCCTACCATCGCGATGAGTGGCTGCCGGAGGCTTCGCTGCCCATCAAGTATGCCGGTCTGTCCACCTGCTTCCGGCAGGAAGTGGGCTCGCACGGACGCGACACTCGCGGCATTTTCCGCGTCCACCAGTTCGAGAAGGTGGAGCAGTTCGTACTGACCTCCCCACACGACAACAAGTCGTGGGAGATGATGGACGAGATGATCGGCAATGCGGAGCAGTTCTGCCAGTCACTGGGCATTCCATATCGCGTGGTGAACATCGTTTCCGGTGCGCTCAACCATGCAGCCTCCAAGAAACTGGATCTGGAGGCCTGGTTCGGCGGCAGCGGCGCTTACAGAGAACTTGTATCGTGCTCCAACTGCTTGGACTACCAGGCCCGTCGTCTGCTCGTACGTTTCGGCCAGACCAAGAAGATGAACGCCGCCGTCGACTATGTGCACATGCTGAACGCCACGATGTGCGCAGCCACTCGCGTCATCTGCGCCATCCTGGAGACGCATCAGACAGAGACGGGTATCAAGGTGCCGGAGCCATTGAAGAAGTACATGCCGGCGAAGTTCCAGGATGAGATTCCGTTCGTCAAGCCCGCTCCCATTGATCTGGAGTTGGCCGCCGCCGAGAAGCAAAAGGGCAAAAAGGAGAAAACCAAGAAGGACCCTGCCGCCGGT 9 | >SLEBA.7636 10 | 
atggtACTCGATTTGGATTTGTTTCGCAGCGATAAGGGCGGTAATCCCGATGCTGTGCGCGAGAACCAAAAGAAGCGCTTCAAGGATGTAGGACTGGTCGAGACGGTAATTGAGAAGGACTCCGAATGGCGTCAGCGACGTCATCGGGCCGACAACCTGAACAAGGTGAAAAATGTCTGCAGCAAAGTAATTGGCGAGAAGATGAAGAAAAAGGAGCCCGTCGGTGCTGAAGGCGAAGAAGTGCCGGCCGCCATACGCGCGGATCTGACCCAAATTACGGCCGAGACGCTGCAGTCATTGACTGTGAATCAAATTAAACAGCTGCGCTTACTCATCGACGATGCCATGACTGAAAACCAAAAGTTGCTGGAGGCTGCTGAACAAACCCGAAACACGGCACTGCGTGAGGTGGGCAATCACCTGCATGAGTCTGTGCCCGTCTCCAATGACGAGGATGAGAATCGAGTGGAGCGCACCTTTGGGGATTGCGAGAAGCGCGGAAAATATTCTCATGTTGATCTGATTGTTATGATCGATGGCATGAACGCCGAAAAGGGTGCTGTAGTATCTGGTGGGCGTGGCTATTTTCTCACTGGCGCCGCTGTATTCCTCGAACAAGCGCTCATTCAGCATGCCCTACACTTGTTGTATGCCCGCGAATATACGCCGCTTTATACACCATTCTTCATGCGCAAGGAAGTCATGCAGGAGGTGGCGCAGCTCTCGCAATTCGACGAGGAGCTGTATAAAGTTGTCGGCAAGGGTAGCGAGCGGGCTGAGGAGGGTGGCACTGATGAAAAGTATTTGATAGCTACTTCGGAACAGCCCATTGCGGCATATCATCGTGACGAGTGGCTGCCAGAGACTTCACTGCCTATTAAATATGCCGGTTTGTCTACATGCTTCCGCCAGGAGGTCGGCTCACATGGACGCGATACCCGTGGCATATTCCGCGTGCATCAATTTGAGAAGATCGAACAATTCGTGCTTACCTCGCCACATGATAATAAATCTTGGGAAATGATGGACGAAATGATTGGCAATGCAGAGAACTTCTGTCAATCGTTGGGCATTCCATATCGTGTGGTCAACATCGTCTCTGGCGCCCTCAATCATGCTGCCTCCAAGAAACTTGATCTGGAAGCCTGGTTCGGTGGCAGCGGCGCCTACAGAGAGCTGGTCTCCTGCTCCAATTGCCTAGACTATCAGGCACGTCGTCTACTTGTTCGCTATGGCCAGACGAAAAAGATGAATGCTGCAGTGGATTATGTGCACATGTTAAATGCTACTATGTGCGCCACAACGCGTGTCATTTGCGCCATTCTGGAAACACATCAGACGGAGACAGGTGTTAGGGTTCCAGAGCCtctgaaaaaatatatgcCGGCCAAGTTCCAAGATGAGATTCCTTTTGTTAAACCGGCGCCCATTGATTTGGAACaggctgctgccgctgccaagGGCAAAAAGGAGAAGAACAAGAAAGATGCAGCTGCC 11 | -------------------------------------------------------------------------------- /tests/ref/OG0003977.fa: -------------------------------------------------------------------------------- 1 | >DBUSC.10374 2 | ATGACTTCAAGTATATTTCTAACGACTGCGGAGAATGGTTTGCGACATGACAAAATTGTAATTCTGGATGCTGGAGCACAATATGGCAAGgtTATCGATCGCAAGGTGCGCGAACTGTTGGTGGAATCGGATATTCTGCCATTGGATacaccagcagcaactataCGCAATCATGGCTATAAGGGTATTATAATCTCAGGGGGCCCCAATTCCGTTTACGCTGAAGATGCGCCCACATACGATCCTGAGctgtttaagcttaaaataccGATGCTGGGCATCTGCTATGGCATGCAACTAATTAACAAAGAATTCGGCGGCACCGTCCTCAAGACGGATGTGCGGGAGGATGGACAGCagaatattgaaattgaaacctCATGTCCTTTGTTTAGTCGTCTCAGTCGCACACAGTCTGTGCTGTTAACCCATGGCGACAGCGTGGAGCGAGTGGGCGATAAACTGAAAGTGGGCGGCTGGTCAACAAATCGTATTGTGACTGCTATTTATAACGACGTATTGCGCATCTACGGTGTGCAATTTCATCCGGAGGTGGACCTAACCATCAACGGCAAGCAGATACTGTCCAATTTTCTATACGAAATCTGTGAGTTAACGCCAAACTTCACCATGGGTAGTCGAAAAGAGGAATGCATTCGATATATACGCAATAAAgtgggcaacaacaaagtgttgttaTTGGTCAGTGGCGGTGTTGACTCGAGCGTCTGTGCCGCGCTGCTGCGTTGCGCTCTACATCCTAGCCAAATTATAGCAGTGCATGTGGATAATGGTTTCATGCGTAAGAACGAAAGCGATAAAGTTGAACGCTCACTCAGAGAAATTGGCATTGATTTGATTGTACGCAAGGAAGGCTACACGTTTCTCAAAGGGACCACGCAAGTGAAGCGTGCCGGCCAGTATTCAGTAGTGGAAACGCCCATGTTGTGTCAAACTTATAATCCCGAGGAGAAGCGTAAAATCATTGGTGATATATTTGTGAAGGTGACCAATGATGTGGTGGCCGAGCTCAAGCTTAAACCAGAGGAAGTGCTGTTGGCGCAAGGAACTCTGCGACCGGACTTAATAGAGTCAGCATCGAATATGGTCAGCACGAATGCCGAAACCATCAAGACGCATCACAATGACACGGATCTGATTCGGGAACTGCGCAATGCCGGTCGAGTTGTAGAGCCATTATGTGACTTCCATAAGGATGAGGTGCGTGATCTGGGCAGTGATCTTGGCCTGCCAGCAGAGCTTGTGGAACGACAACCGTTTCCAGGACCTGGTCTGGCCATACGCGTACTCTGCGCTGAAGAAGCCTACATGGAAAAGGATTACTCAGAGACACAGGTTATAGCACGCGTCATTGTGGACTACAAGAACAAATTGCATAAGAATCATGCGCTTATCAATCGTGTAACGGGAGCTACTAGTGAATTAGAGCAAAAGGAACTCTTGCGTATCTCTGCCAATTCGGAAATCCAGGCCACGCTGTTGCCTATTCGCTCGGTGGGCGTGCAAGGCGACAAGCGCACTTATAGCTATGTGGTGGGACTGTCAACATCAACAGCCGAGCCCAATTGGGCGGACATGATGtttttagcaaaattaataCCACGCATTTTGCATAATGTTAACCGCGTCTGTTATATCTTTGGTGAGCCTGTGCAGTATCTGGTGAATGATATAACGCATACAACGCTCAATACTGTGGTGCTGGCACATCTGCGGCAAGCAGATGCCATTGCTAATGAGATTATAATGCAAGCGGGCCTTTATCGCAAAATCTCACAAATGCCTGTAGTTCTAATACCCGTGCACTTTGATCGTGATCCCAT
AAATCGCACACCCTCCTGCAGAAGATCCGTGGTGTTGCGACCCTTCATAACGAACGACTTTATGACTGGTGTGCCAGCAGTGCCTGGTTCCGTTCAACTGCCTTTGCAAGTCTTAAATCAAATGGTGCGAGAAATCTCTAAATTAGATGGCGTATCCCGAGTACTCTACGATTTAACAGCTAAACCGCCCGGCACCACAGAGTGGGAA 3 | >DMOJA.1896 4 | ATGAGTACAACTATATACCTAACGACAGCGGAGAATGGACTGCGACACGATAAAATTGTTATACTCGATGCTGGTGCACAGTATGGCAAGGTAATCGATCGCAAGGTTCGAGAACTTTTAGTAGAATCGGATATTCTGCCATTGGATACGCCTGCCGCAACGATACGTGATAATGGCTATCGAGGCATAATCATATCCGGCGGACCCAATTCAGTTTATGCCGAAGATGCGCCCACGTACGATCCCGATCTATTTAAACTGAAAATCCCAGTGCTTGGCATCTGCTATGGCATGCAGTTAATCAACAAAGAGTTCGGAGGCAGTGTCCTCAAGACTGACGTGCGAGAAGATggacaacaaaatattgagaTTGAAACATCATGCCCTTTGTTCAGTCGCCTTAGCCGTACACAGTCCGTGCTGCTCACCCATGGCGATAGCGTGGAGCGAGTGGGCGATAAGTTGAAAGTGGGTGGCTGGTCATCGAATCGGATTGTGACGGCTATTTACAGTGAGGTATTGAGAATCTATGGTGTTCAATTTCATCCAGAAGTCGATCTAACTATCAACGGCAAACAAATGCTATCCAATTTTCTATACGAAATTTGCGAACTAACGCCTAACTTTACCATGGGCAGTCGAAAAGAGGAATGCATTCGGTACATACGCGAAAAAGTGGGCAATAATAAAGTTTTGTTGCTCGTCAGTGGAGGCGTCGATTCAAGTGTCTGTGCTGCATTGCTGCGAAAAGCGCTGCATCCCAATCAGATTATAGCAGTTCATGTGGATAATGGTTTCATGCGAAaaaatgaaagcgaaaatGTAGTGCGTTCATTGCGAGATATTGGCATTGATTTGATAGTGCGTAAAGAATGCTACACGTTCCTCAAGGGCACCACGCAAGTGAAACGACCCGGCCAGTATTCGGTAGTTGAAACGCCCATGCTATGTCAGACCTACAACCCGGAGGAGAAGCGAAAAATAATCGGTGATATATTTGTGAAAGTGACTAACGATGTGGTGGCTGAGCTGAAACTGAAGCCCGAGGAGGTCCTCTTGGCGCAGGGCACACTGAGGCCGGATTTGATTGAGTCTGCTTCGAATATGGTCAGCACGAATGCCGAAACAATTAAAACGCATCACAATGACACGGATCTGATTCGgGAGCTGCGTAATGCTGGCCGTGTCGTAGAACCATTATGTGACTTCCACAAGGATGAGGTGCGTGACCTGGGCAATGATCTCGGCCTGCCAGCGGAATTGGTCGAGCGCCAGCCGTTCCCCGGTCCTGGTCTCGCCATACGCGTTCTTTGCGCCGAGGAAGCATACATGGAGAAGGATTACTCGGAAACACAGGTCATTGTGCGCGTTATAGTGGactacaaaaataaactacAAAAGAATCATGCGCTTATTAATCGTGTCACGGGCGCCACTAGTGAAGCTGAGCAAAAGGAGCTGCTGCGCATCTCTGCCAATTCCGATATCCAGGCTACTCTACTGCCGGTACGCTCGGTGGGTGTACAGGGTGATAAACGCACCTACAGCTATGTTGTTGGCTTGTCAACGACGACCCCGGAGCCCAACTGGACTGACATGttatttttggccaaaatcATACCACGCATATTGCATAACGTTAACCGAGTCTGTTATATCTTTGGCGAGCCCGTGCAGTATCTAATTACAGATATAACGCACACAACTCTGAACACTGTAGTACTGGCGCAACTCAGGCAAGCGGACGCTATAGCCAATGAGATTATAATGAAAGAGGGACTGTATCGCAAAATCTCACAAATGCCCGTAGTTCTGATACCCGTGCACTTCGATCGTGATCCCATTAATCGTACGCCTTCCTGCAGAAGATCAGTGGTATTGCGTCCATTTATAACGAACGATTTTATGACTGGTGTGCCAGCTGTGCCCGGATCTGCTCAACTGCCATTGCATGTTCTAAATCAAATTGTGCGAGATATTTCTAAATTGGATGGCATCTCGAGGGTACTCTACGACTTGACAGCCAAGCCGCCTGGCACCACGGAGTGGGAA 5 | >DPSEU.1121 6 | 
ATGAATTCAGGCATATTTCTGGGCACAGCGGAGAACGGCCTGCGGCACGATAAGATTGTTATACTGGATGCGGGAGCACAGTACGGCAAGGTTATCGATCGCAAGGTGCGGGAACTGCTCGTGGAGTCGGATATACTGCCCCTAGATACACCAGCGGCAACGATACGGAATAATGGCTATCGGGGCATCATCATATCCGGCGGTCCCAATTCCGTGTATGCCGAGGATGCGCCCACGTACGATCCGGATCTGTTTAAGCTGAAAATTCCCATGCTCGGAATTTGCTATGGCATGCAGCTAATCAACAAAGAGTTTGGCGGCAGCGTTCTCAAGACGGACGTCCGGGAAGATGGCCAACAGAACATTGAGATTGAGACCTCGTGCCCGCTGTTCAGCCGCCTCAGTCGCACACAGTCCGTACTGCTCACCCACGGGGACAGCGTGGAGCGAGTGGGCGAGAAGCTAAAGGTGGGGGGTTGGTCAACGAACCGCATCGTCACGGCCATCTACAGCGAGGTGTTGCGCATCTATGGCGTCCAGTTCCATCCAGAGGTGGATCTAACCATCAATGGCAAACAGATGCTATCAAATTTCCTCTATGAAATCTGCGAACTAACGCCAAACTTCACAATGGGCAGCAGGAAGGAGGagtgcatacggtacatacgAGAGAAAGTGGGCAGTAATAAAGTCTTGCTCTTGGTCAGTGGCGGTGTGGATTCGAGTGTATGTGCAGCCTTGTTGCGCCGTGCCCTATACCCTAATCAAATAATTGCCGTGCATGTAGATAATGGTTTCATGCGCAAGAACGAAAGTGAAAAGGTGGAGCGTTCGTTGCGCGAAATCGGCATCGATTTGATTGTGCGAAAGGAATGCTATACGTTCCTCAAGGGCACCACGCAAGTGAAGCGACCCGGCCAGTATTCGGTCGTTGAGACGCCCATGCTCTGCCAGACCTACAACCCCGAGGAGAAGCGCAAGATAATTGGTGATATATTCGTAAAGGTGACCAACGATGTGGTGGCCTTTCTGAAACTCAAACCCGAAGAAGTTATGCTCGCCCAGGGCACCCTCAGGCCGGATCTAATTGAGTCTGCATCGAATATGGTCAGCACGAATGCAGAAACAATCAAAACGCATCACAATGACACGGATCTGATTAGAGAACTCCGCAATGCGGGACGCGTCGTAGAGCCTCTGTGCGACTTTCACAAGGATGAGGTACGAGATCTGGGCAATGATCTGGGTCTGCCACCAGAGCTAGTCGAGCGACAACCCTTCCCAGGTCCCGGTCTGGCCATCCGCGTTCTGTGTGCCGAGGAGGCTTACATGGAGAAGGACTATTCAGAGACTCAGGTAATTGTGCGTGTTATTGTGGACTATAAAAACAAACTGCAGAAGAACCATGCGCTCATAAATCGAGTGACGGGCGCCACCAGTGAGGCCGAACAAATAGAGCTACTGCGCATCTCTGCCAACTCAACGATCCAGGCTACCCTGCTGCCCATCCGGTCTGTGGGTGTGCAAGGCGACAAACGCAGCTACAGCTACGTGGTGGGGCTGTCTACGAGTCAAGAGCCCAACTGGATGGACCTGCTCTTCCTAGCAAAGATCATACCGCGTATTTTGCATAATGTCAATAGGGTTTGCTATATCTTTGGGGAGCCCGTTCAGTATCTGGTCACGGACATAACGCACACAACCCTCAATACGGTGGTGTTGTCGCAGCTCCGGCAAGCCGATGCCATTGCAAATGAGattATCATGCAAGCGGGCTTGTACAGGAAAATCTCACAGATGCCTGTCGTTCTGATACCCGTGCACTTTGATCGCGATCCCATCAATAGGACGCCCTCCTGCAGAAGATCGGTGGTTCTGCGTCCCTTCATAACGAACGATTTTATGACTGGAGTGCCGGCTGTGCCTGGATCAGTGCAACTGCCATTGCAAGTCCTTAATCAAATGGTGCGCGACATAACCAAGCTGGATGGAATATCGCGGGTACTCTACGACTTGACCGCCAAGCCGCCGGGTACCACTGAGTGGGAA 7 | >DYAKU.12180 8 | 
ATGAACTCAAACATATTTCTGGGCACAGCAGAGAACGGCCTGCGGCACGATAAGATTGTTATACTTGATGCTGGAGCACAGTACGGCAAGGTTATCGACCGTAAGGTACGCGAACTCCTCGTTGAGTCGGATATCCTTCCACTGGACACCCCAGCGGCTACGATACGCAACAATGGCTATCGAGGCATCATCATCTCCGGGGGACCCAACTCAGTCTACGCAGAGGATGCACCCAGCTATGATCCCGATCTGTTCAAGCTAAAGATACCTATGCTGGGCATCTGCTACGGCATGCAGCTAATAAACAAAGAGTTCGGAGGCACAGTGCTCAAGAAGGATGTACGAGAGGATGGCCAACAAAATATCGAGATTGAGACCTCGTGTCCGCTCTTTAGTCGCCTCAGTCGCACACAGTCGGTGCTGTTAACCCACGGTGATAGCGTTGAGAGAGTGGGCGAGAATCTGAAGATTGGTGGCTGGTCCACAAACCGCATTGTGACAGCTATTTACAATGAAGTACTCCGCATCTACGGCGTCCAGTTCCATCCTGAGGTGGACCTCACTATCAATGGCAAACAGATGCTATCGAACTTCCTGTACGAAATCTGCGAACTGACGCCCAACTTTACCATGGGTAGTCGAAAGGAGGAGTGCATACGCTACATCCGCGAGAAAGTGGGCAGTAATAAAGTGTTGCTACTGGTCAGCGGCGGCGTGGATTCGAGTGTCTGTGCAGCTTTGCTCCGCCGTGCTTTGTACCCCAATCAGATAATTGCCGTACATGTAGATAATGGTTTCATGCGCAAAAACGAAAGTGAAAAGGTGGAGCGTTCACTGCGCGATATTGGCATTGATTTAATTGTCCGAAAAGAAGGCTACACGTTCCTTACAGGCACTACGCAAGTCAAGAGGCCCGGACAGTACTCCGTGGTGGAAACGCCTATGTTATGTCAGACCTATAATCCGGAGGAGAAACGCAAGATAATTGGTGATATATTCGTCAAAGTGACCAACGATGTAGTAGCCGAATTGAAACTAAAGCCCGAGGAAGTTATGTTGGCTCAGGGAACCCTCCGACCAGATCTAATCGAGTCCGCCTCGAACATGGTGAGCACGAATGCAGAAACAATCAAAACGCACCACAATGACACGGATCTGATCAGaGAGCTTCGTAACGCAGGACGTGTGGTTGAGCCGCTGTGTGACTTTCATAAGGATGAAGTGCGCGACTTAGGCAACGATCTTGGACTGCCCCAAGAGCTTGTGGAGAGACAACCCTTCCCAGGTCCTGGCCTGGCAATCCGCGTTCTCTGCGCCGAGGAGGCATACATGGAGAAGGACTACTCAGAGACTCAggTTATTATACGCGTGATTGTAGACTACAAGAATAAACTGCAGAAGAACCATGCTCTAATCAACCGCGTAACAGGGACCACGAGCGAGTCAGAACAGAAAGACCTATTGCGTATCTCTGCGAACTCGCAGATTCAGGCAACTTTGCTGCCCATCCGCTCCGTGGGCGTGCAAGGTGATAAACGGTCATACAGCTACGTAGTAGGTCTATCAACGAGCCAGGAGCCCAACTGGCAGGACCTTCTCTTCTTGGCTAAAATTATACCGCGCATACTGCACAACGTGAACAGGGTGTGCTACATTTTTGGCGAGCCCGTGCAGTACCTAGTAACGGATATAACGCACACCACACTGAATACGGTAGTTCTTTCGCAGCTAAGGCAAGCGGATGCTATTGCCAATGAAATCATAATGCAAGCTGGACTATACCGGAAAATCTCTCAGATGCCTGTTGTTCTCATACCCGTGCACTTTGACCGCGATCCCATTAACCGTACACCCTCATGCCTAAGGTCGGTAGTGCTGCGTCCGTTCATAACGAACGACTTTATGACTGGTGTGCCGGCTGAGCCCGGCTCCGTGCAACTGCCATTGCAGGTCCTAAATCAAATTGTACGCGATATATCCAAACTGGGCGGAATCTCGAGGGTGCTGTACGACTTGACAGCTAAGCCACCGGGCACCACCGAGTGGGAA 9 | >SLEBA.7381 10 | 
ATGTCTTCAAATCTATTTCtaacaacagcagaaaatgGTCTGCGACACGATAAAATTGTTATACTCGATGCTGGTGCACAGTACGGCAAGGTTATCGATCGTAAAGTGCGTGAATTGCTCGTTGAATCGGATATCCTGCCTCTAGATACGCCAGCATCGACGATACGCGATCATGGCTATCGCGGAATCATAATTTCCGGCGGACCCAACTCAGTATATGCCGAAGATGCGCCCTCTTATGATCCCGACTTATTTAAGCTGAAAATACCTATGCTAGGCATTTGCTATGGCATGCAGCTGATTAATAAGGAATTCGGCGGAACCGTACTCAAGACGGATGTACGCGAGGATGGTCAGCAAAGCATAGAAATTGAAACATCGTGTCCGTTGTTCAGCCGCCTCAGTCGCACTCAATCCGTGTTACTTACACATGGCGATAGTGTAGAGCGTGTTGGTGAAAAACTTAAAGTAGGGGGCTGGTCAACGAATCGAATTGTCACCGCCATTTATAATGAAGTGCTGCGCATTTATGGTGTACAATTCCATCCGGAAGTCGACTTAACAATCAATGGAAAACAGATGCTATCGAATTTTCTCTATGATATCTGTGAATTGACGCCGAATTTTACCATGGGTAGTCGAAAAGAGGAATGCATACGCTATATTAGAGAGAAAGTGGGCAACAATAAAGTACTgTTGCTGGTCAGCGGCGGCGTCGATTCCAGTGTGTGTGCTGCGCTTTTAAGACGCGCCTTACATCCTGGCCAAATTATAGCGGTGCATGTGGATAAtgGTTTCATGCGAAAAAACGAAAGTGAAAAGGTGGAGCGTTCCTTGCGAGAAATTGGCATAGATTTGATAGTTCGTAAAGAAAGCTACACGTTCCTCAAAGGCACCACGCAAGTGAAAAGGCCTGGACAGTATTCGGTGGTTGAAACGCCCATGCTATGCCAGACATACAACCCCGAAGAAAAACGCAAGATAATTGGTGATATATTCGTAAAAGTGACCAATGATGTGGTAGCCGAGCTAAAATTGAAACCCGAAGAGGTTTTGCTGGCTCAGGGTACCCTACGACCCGATCTAATAGAGTCCGCCTCAAATATGGTTAGCATGAATGCCGAAACGATCAAGACGCATCACAATGACACAGATCTGATAAGAGAACTGCGGAATGCAGGGCGCGTTGTCGAGCCACTGTGCGATTTCCATAAGGACGAAGTACGGGATCTTGGCAATGATTTGGGCTTACCAGCGGAGCTAGTAGAAAGGCAACCATTTCCGGGACCTGGTCTAGCAATTCGAGTACTTTGTGCTGAGGAGGCTTACATGGAGAAAGACTATTCAGAAACACAGGTCATAGTTCGTGTAATAGTGgactacaaaaacaaactacaGAAAAATCATGCACTCATCAATCGTGTTACGGGCGCTACCAATGAAGCCGAACAAAAGGAACTCATACGCATTtcaacaaatacacaaatcCAAGCCACATTGCTGCCTGTGCGCTCGGTGGGTGTGCAAGGTGATAAGCGCACCTACAGCTATGTTGTTGGTCTATCAACGAGTCAGGAACCCAATTGGACGGATCTGTTATTCTTAGCGAAAATCATACCGCGCATTTTACACAATGTAAATAGAGTTTGCTATATCTTCGGTGATCCCGTGCAGTATCTGGTAACCGATATAACGCATACGACACTCAATACTGTGGTGTTGGCGCAGCTACGCCAAGCGGATGCAATTGCCAATGAGaTTATTATGCAAGCTGGTCTGTATCGTAAAATTTCACAGATGCCTGTTGTGCTTATACCAGTACATTTTGATCGTGATCCAATAAATCGTACGCCGTCGTGTCGAAGATCCGTGGTGTTGCGTCCCTTTATAACAAATGATTTCATGACTGGTGTGCCAGCTGTGCCTGGTTCTGTGCAACTGCCATTGCAAGTCCTTAATCAAATAGTGAGGGATATATCCAAATTGGATGGAATCTCGAGGGTACTATACGATTTAACCGCAAAACCGCCCGGCACCACAGAATGGGAA 11 | -------------------------------------------------------------------------------- /lib/main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import sys 4 | import os 5 | import argparse 6 | try: 7 | import library as lib 8 | import map 9 | import assemble as ab 10 | except ImportError: 11 | import lib.library as lib 12 | import lib.map as map 13 | import lib.assemble as ab 14 | 15 | def parameter(x): 16 | return x.strip() 17 | 18 | def main(args): 19 | # the arguments 20 | parser = argparse.ArgumentParser(prog='PhyloAln', usage="%(prog)s [options] -a reference_alignment_file -s species -i fasta_file -f fasta -o output_directory\n%(prog)s [options] -d reference_alignments_directory -c config.tsv -f fastq -o output_directory", description="A program to directly generate multiple sequence alignments from FASTA/FASTQ files based on reference alignments for phylogenetic analyses.\nCitation: Huang Y-H, Sun Y-F, Li H, Li H-S, Pang H. 2024. MBE. 41(7):msae150. 
https://doi.org/10.1093/molbev/msae150", epilog="Written by Yu-Hao Huang (2023-2025) huangyh45@mail3.sysu.edu.cn", formatter_class=argparse.RawDescriptionHelpFormatter) 21 | parser.add_argument('-a', '--aln', type=os.path.abspath, help='the single reference FASTA alignment file') 22 | parser.add_argument('-d', '--aln_dir', type=os.path.abspath, help='the directory containing all the reference FASTA alignment files') 23 | parser.add_argument('-x', '--aln_suffix', default='.fa', help='the suffix of the reference FASTA alignment files when using "-d"(default:%(default)s)') 24 | parser.add_argument('-s', '--species', help='the studied species ID for the provided FASTA/FASTQ files(-i)') 25 | parser.add_argument('-i', '--input', type=os.path.abspath, nargs='+', help='the input FASTA/FASTQ file(s) of the single species(-s), compressed files ending with ".gz" are allowed') 26 | parser.add_argument('-c', '--config', type=os.path.abspath, help="the TSV file with the format of 'species sequence_file(s)(absolute path, files separated by commas)' per line for multiple species") 27 | parser.add_argument('-f', '--file_format', choices=['guess', 'fastq', 'fasta', 'large_fasta'], default='guess', help="the file format of the provided FASTA/FASTQ files, 'large_fasta' is recommended for speeding up reading the FASTA files with long sequences(e.g. genome sequences) and cannot be guessed(default:%(default)s)") 28 | parser.add_argument('-o', '--output', default='PhyloAln_out', type=os.path.abspath, help='the output directory containing the results(default:%(default)s)') 29 | parser.add_argument('-p', '--cpu', type=int, default=8, help="maximum threads to be totally used in parallel tasks(default:%(default)d)") 30 | parser.add_argument('--parallel', type=int, help="number of parallel tasks for each alignments, number of CPUs used for single alignment will be automatically calculated by '--cpu / --parallel'(default:the smaller value between number of alignments and the maximum threads to be used)") 31 | parser.add_argument('-e', '--mode', choices=['dna2reads', 'prot2reads', 'codon2reads', 'fast_dna2reads', 'fast_prot2reads', 'fast_codon2reads', 'dna2trans', 'prot2trans', 'codon2trans', 'dna2genome', 'prot2genome', 'codon2genome', 'rna2rna', 'prot2prot', 'codon2codon', 'gene_dna2dna', 'gene_rna2rna', 'gene_codon2codon', 'gene_codon2dna', 'gene_prot2prot'], help="the common mode to automatically set the parameters for easy use(**NOTICE: if you manually set those parameters, the parameters you set will be ignored and covered! 
See https://github.com/huangyh45/PhyloAln/blob/main/README.md#example-commands-for-different-data-and-common-mode-for-easy-use for detailed parameters)") 32 | parser.add_argument('-m', '--mol_type', choices=['dna', 'prot', 'codon', 'dna_codon'], default='dna', help="the molecular type of the reference alignments(default:%(default)s, 'dna' suitable for nucleotide-to-nucleotide or protein-to-protein alignment, 'prot' suitable for protein-to-nucleotide alignment, 'codon' and 'dna_codon' suitable for codon-to-nucleotide alignment based on protein and nucleotide alignments respectively)") 33 | parser.add_argument('-g', '--gencode', type=int, default=1, help="the genetic code used in translation(default:%(default)d = the standard code, see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)") 34 | parser.add_argument('--ref_split_len', type=int, help="If provided, split the reference alignments longer than this length into short alignments with this length, ~1000 may be recommended for concatenated alignments, and codon alignments should be devided by 3") 35 | parser.add_argument('-l','--split_len', type=int, help="If provided, split the sequences longer than this length into short sequences with this length, 200 may be recommended for long genomic reads or sequences") 36 | parser.add_argument('--split_slide', type=int, help="the slide to split the sequences using sliding window method(default:half of '--split_len')") 37 | parser.add_argument('-n', '--no_reverse', action='store_true', help="not to prepare and search the reverse strand of the sequences, recommended for searching protein or CDS sequences") 38 | parser.add_argument('--low_mem', action='store_true', help="use a low-memory but slower mode to prepare the reads, 'large_fasta' format is not supported and gz compressed files may still spend some memory") 39 | parser.add_argument('--hmmbuild_parameters', type=parameter, nargs='+', default=[], help="the parameters when using HMMER hmmbuild for reference preparation, with the format of ' --xxx' of each parameter, in which space is required(default:%(default)s)") 40 | parser.add_argument('--hmmsearch_parameters', type=parameter, nargs='+', default=[], help="the parameters when using HMMER hmmsearch for mapping the sequences, with the format of ' --xxx' of each parameter, in which space is required((default:%(default)s)") 41 | parser.add_argument('-b', '--no_assemble', action='store_true', help="not to assemble the raw sequences based on overlap regions") 42 | parser.add_argument('--overlap_len', type=int, default=30, help="minimum overlap length when assembling the raw sequences(default:%(default)d)") 43 | parser.add_argument('--overlap_pident', type=float, default=98, help="minimum overlap percent identity when assembling the raw sequences(default:%(default).2f)") 44 | parser.add_argument('-t', '--no_out_filter', action='store_true', help="not to filter the foreign or no-signal sequences based on conservative score") 45 | parser.add_argument('-u', '--outgroup', nargs='+', default=[], help="the outgroup species for foreign or no-signal sequences detection(default:all the sequences in the alignments with all sequences as ingroups)") 46 | parser.add_argument('--ingroup', nargs='+', default=[], help="the ingroup species for score calculation in foreign or no-signal sequences detection(default:all the sequences when all sequences are set as outgroups; all other sequences except the outgroups)") 47 | parser.add_argument('-q', '--sep', default='.', help="the separate symbol between species 
name and gene identifier in the sequence headers of the alignments(default:%(default)s)") 48 | parser.add_argument('--outgroup_weight', type=float, default=0.9, help="the weight coefficient to adjust strictness of the foreign or no-signal sequence filter, small number or decimal means ralaxed criterion (default:%(default).2f, 1 = not adjust)") 49 | parser.add_argument('-r', '--no_cross_species', action='store_true', help="not to remove the cross contamination for multiple species") 50 | parser.add_argument('--cross_overlap_len', type=int, default=30, help="minimum overlap length when cross contamination detection(default:%(default)d)") 51 | parser.add_argument('--cross_overlap_pident', type=float, default=98, help="minimum overlap percent identity when cross contamination detection(default:%(default).2f)") 52 | parser.add_argument('--min_exp', type=float, default=0.2, help="minimum expression value when cross contamination detection(default:%(default).2f)") 53 | parser.add_argument('--min_exp_fold', type=float, default=5, help="minimum expression fold when cross contamination detection(default:%(default).2f)") 54 | parser.add_argument('-w', '--unknow_symbol', default='unknow', help="the symbol representing unknown bases for missing regions(default:%(default)s = 'N' in nucleotide alignments and 'X' in protein alignments)") 55 | parser.add_argument('-z', '--final_seq', choices=['consensus', 'consensus_strict', 'all', 'expression', 'length'], default='consensus', help="the mode to output the sequences(default:%(default)s, 'consensus' means selecting most common bases from all sequences, 'consensus_strict' means only selecting the common bases and remaining the different bases unknow, 'all' means remaining all sequences, 'expression' means the sequence with highest read counts after assembly, 'length' means sequence with longest length") 56 | parser.add_argument('-y', '--no_ref', action='store_true', help="not to output the reference sequences") 57 | parser.add_argument('-k', '--keep_seqid', action='store_true', help="keep original sequence IDs in the output alignments instead of renaming them based on the species ID, not recommended when the output mode is 'consensus'/'consensus_strict' or the assembly step is on") 58 | parser.add_argument('-v', '--version', action='version', version="%(prog)s v1.1.0") 59 | args = parser.parse_args(args) 60 | 61 | # automatically set the parameters when mode is set for easy use 62 | if args.mode is not None: 63 | if args.mode in ['rna2rna', 'prot2prot', 'codon2codon']: 64 | args.no_reverse = True 65 | args.no_assemble = True 66 | args.no_cross_species = True 67 | if args.mode == 'codon2codon': 68 | args.mol_type = 'codon' 69 | else: 70 | args.mol_type = 'dna' 71 | if args.mode == 'prot2prot': 72 | args.unknow_symbol = 'X' 73 | elif args.mode.startswith('gene_'): 74 | args.no_assemble = True 75 | args.no_cross_species = True 76 | args.final_seq = 'all' 77 | args.keep_seqid = True 78 | args.unknow_symbol = '-' 79 | if args.mode.endswith('2dna'): 80 | args.no_reverse = False 81 | else: 82 | args.no_reverse = True 83 | if args.mode.startswith('gene_codon2'): 84 | args.mol_type = 'codon' 85 | else: 86 | args.mol_type = 'dna' 87 | #if args.mode == 'gene_prot2prot': 88 | # args.unknow_symbol = 'X' 89 | else: 90 | if 'dna2' in args.mode: 91 | args.mol_type = 'dna' 92 | elif 'prot2' in args.mode: 93 | args.mol_type = 'prot' 94 | elif 'codon2' in args.mode: 95 | args.mol_type = 'codon' 96 | if args.mode.endswith('reads') and not args.mode.startswith('fast_'): 97 | 
args.no_assemble = False 98 | else: 99 | args.no_assemble = True 100 | if args.mode.endswith('reads'): 101 | args.no_cross_species = False 102 | else: 103 | args.no_cross_species = True 104 | if args.mode.endswith('2genome'): 105 | args.split_len = 200 106 | args.file_format = 'large_fasta' 107 | 108 | # parse the alignment files 109 | alns = {} 110 | if args.ref_split_len: 111 | if args.aln_dir: 112 | print("\nError: split of the alignment is not supported for multiple alignments, please input a single alignment file through '-a' or '--aln'!") 113 | sys.exit(1) 114 | elif args.final_seq == 'all': 115 | print("\nError: split of the alignment is not supported to output all sequences, please choice other options to keep unqiue sequences instead!") 116 | sys.exit(1) 117 | elif args.aln: 118 | alns = map.split_ref(args.aln, args.ref_split_len) 119 | elif args.aln: 120 | alns['aln'] = args.aln 121 | elif args.aln_dir: 122 | for filename in os.listdir(args.aln_dir): 123 | if filename.endswith(args.aln_suffix): 124 | alns[filename.replace(args.aln_suffix, '')] = os.path.join(args.aln_dir, filename) 125 | if not alns: 126 | print("\nError: fail to find any alignment!") 127 | sys.exit(1) 128 | 129 | # parse the species data 130 | rawdata = {} 131 | if args.species and args.input: 132 | rawdata[args.species] = args.input 133 | elif args.config: 134 | for line in open(args.config): 135 | arr = line.rstrip().split("\t") 136 | rawdata[arr[0]] = arr[1].split(',') 137 | if not rawdata: 138 | print("\nError: fail to find any species data!") 139 | sys.exit(1) 140 | 141 | # check the unknow symbol, low-memory mode and outgroup, and parse the parallel task number and length to split 142 | if args.unknow_symbol != 'unknow' and len(args.unknow_symbol) > 1: 143 | print("\nError: the symbol representing unknown bases should be single character!") 144 | sys.exit(1) 145 | if args.low_mem and args.file_format == 'large_fasta': 146 | print("\nError: the format of 'large_fasta' is not supported in the low-memory mode! 
If you want to use the low-memory mode, you can use 'fasta' format and it will take a while!") 147 | sys.exit(1) 148 | if not args.outgroup and not args.no_out_filter: 149 | print("\nWarning: no outgroup was set, all the sequences in the alignments will be considered as outgroups with all sequences as ingroups in foreign sequence filter!") 150 | if args.parallel is None: 151 | args.parallel = min(len(alns), args.cpu) 152 | else: 153 | args.parallel = min(args.parallel, len(alns), args.cpu) 154 | if args.split_len is not None and args.split_slide is None: 155 | args.split_slide = int(args.split_len / 2) 156 | 157 | # create and enter the output directory, and output the splitted reference alignments 158 | if not os.path.isdir(args.output): 159 | os.makedirs(args.output) 160 | os.chdir(args.output) 161 | if not os.path.isdir('ok'): 162 | os.mkdir('ok') 163 | if args.ref_split_len: 164 | if not os.path.isdir('ref_split'): 165 | os.mkdir('ref_split') 166 | for aln, aln_seqstr in alns.items(): 167 | outfile = open(os.path.join('ref_split', aln + '.fa'), 'w') 168 | outfile.write(aln_seqstr) 169 | outfile.close() 170 | alns[aln] = os.path.join('ref_split', aln + '.fa') 171 | 172 | # prepare the reference HMMs 173 | map.prepare_ref(alns, cpu=args.cpu, np=args.parallel, moltype=args.mol_type, gencode=args.gencode, parameters=args.hmmbuild_parameters) 174 | 175 | # map (by HMMER), extract, assemble the sequences and remove foreign or no-signal sequences of each species 176 | total_reads = {} 177 | assemblers = {} 178 | for sp, fastxs in rawdata.items(): 179 | all_seqs, total_reads[sp] = map.map_reads(alns, sp, fastxs, file_format=args.file_format, cpu=args.cpu, np=args.parallel, moltype=args.mol_type, gencode=args.gencode, split_len=args.split_len, split_slide=args.split_slide, no_reverse=args.no_reverse, low_mem=args.low_mem, parameters=args.hmmsearch_parameters) 180 | all_hmmres = map.extract_reads(alns, sp, all_seqs, moltype=args.mol_type, split_len=args.split_len, low_mem=args.low_mem) 181 | del all_seqs 182 | assemblers[sp] = ab.generate_assembly_mp(alns, sp, all_hmmres, np = args.parallel, moltype=args.mol_type, gencode=args.gencode, no_assemble=args.no_assemble, overlap_len=args.overlap_len, overlap_pident=args.overlap_pident, no_out_filter=args.no_out_filter, outgroup=args.outgroup, ingroup=args.ingroup, sep=args.sep, outgroup_weight=args.outgroup_weight, final_seq=args.final_seq) 183 | 184 | # cross decontamination and output 185 | for group_name in alns.keys(): 186 | assemblers[group_name] = {} 187 | for sp in rawdata.keys(): 188 | assemblers[group_name][sp] = assemblers[sp][group_name] 189 | for sp in rawdata.keys(): 190 | assemblers.pop(sp) 191 | ab.cross_and_output_mp(alns.keys(), list(rawdata.keys()), assemblers, total_reads, np = args.parallel, moltype=args.mol_type, gencode=args.gencode, no_assemble=args.no_assemble, no_cross_species=args.no_cross_species, min_overlap=args.cross_overlap_len, min_pident=args.cross_overlap_pident, min_exp=args.min_exp, min_fold=args.min_exp_fold, unknow=args.unknow_symbol, final_seq=args.final_seq, no_ref=args.no_ref, sep=args.sep, keep_seqid=args.keep_seqid) 192 | 193 | # concatenate the output alignments if split the reference alignment 194 | if args.ref_split_len: 195 | if args.unknow_symbol == 'unknow': 196 | fill = 'N' 197 | else: 198 | fill = args.unknow_symbol 199 | ab.concatenate('nt_out', alns, fill) 200 | if args.mol_type != 'dna': 201 | if args.unknow_symbol == 'unknow': 202 | fill = 'X' 203 | else: 204 | fill = args.unknow_symbol 205 | 
ab.concatenate('aa_out', alns, fill) 206 | 207 | if __name__ == "__main__": 208 | main(sys.argv[1:]) 209 | -------------------------------------------------------------------------------- /lib/map.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import sys 4 | import os 5 | import warnings 6 | from Bio import SearchIO, BiopythonWarning 7 | from Bio.Seq import reverse_complement, translate 8 | try: 9 | import library as lib 10 | except ImportError: 11 | import lib.library as lib 12 | 13 | # split the reference alignments into short alignments 14 | def split_ref(alnfile, split_len): 15 | print("Splitting the reference alignment into short alignments...") 16 | seqs = lib.read_fastx(alnfile, 'fasta') 17 | aln_len = len(list(seqs.values())[0]) 18 | alns = {} 19 | i = 0 20 | while i * split_len < aln_len: 21 | alns['aln_' + str(i)] = '' 22 | end = i * split_len + split_len 23 | # for final alignments and avoid too short length < 30 24 | if end + 30 > aln_len: 25 | end = aln_len 26 | for seqid, seqstr in seqs.items(): 27 | alns['aln_' + str(i)] += ">{}\n{}\n".format(seqid, seqstr[i*split_len:end]) 28 | i += 1 29 | return alns 30 | 31 | # output the splited sequences to different FASTA file by single CPU 32 | def output_fasta_percpu(seq_list, output_fasta, fastx_num=1, moltype='dna', gencode=1, outfile0=None, split_len=None, split_slide=None, no_reverse=False, low_mem_iter=None, low_mem_format=None): 33 | if outfile0 is None: 34 | outfile = open(output_fasta, 'w') 35 | else: 36 | outfile = outfile0 37 | if low_mem_iter is not None: 38 | # low-memory mode: directly read the file instead of reading the store 39 | seqid = None 40 | seqstr = None 41 | count = 0 42 | line_num = 0 43 | for line in low_mem_iter: 44 | line = line.rstrip() 45 | if low_mem_format == 'fastq' and line.startswith('@') and line_num % 4 == 0: 46 | seqid = line.replace('@', '', 1).replace(' ', '_') 47 | elif low_mem_format == 'fasta' and line.startswith('>'): 48 | if seqid: 49 | # prepare the single sequence 50 | output_fasta_percpu([(seqid, seqstr)], output_fasta, fastx_num, moltype, gencode, outfile, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse) 51 | count += 1 52 | arr = line.split(" ") 53 | seqid = arr[0].lstrip('>') 54 | seqstr = '' 55 | elif seqid: 56 | if low_mem_format == 'fastq': 57 | # prepare the single sequence 58 | output_fasta_percpu([(seqid, line)], output_fasta, fastx_num, moltype, gencode, outfile, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse) 59 | count += 1 60 | seqid = None 61 | else: 62 | seqstr += line 63 | line_num += 1 64 | if low_mem_format == 'fasta' and seqstr: 65 | output_fasta_percpu([(seqid, seqstr)], output_fasta, fastx_num, moltype, gencode, outfile, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse) 66 | count += 1 67 | # low-memory mode for this FASTA/FASTQ file ends here and return the read count 68 | return count 69 | for seqid, seqstr0 in seq_list: 70 | if split_len: 71 | # split the sequences into short sequences with length of split_len 72 | i = 0 73 | while i + split_len < len(seqstr0): 74 | output_fasta_percpu([('_'.join([seqid, 'split', str(i+1), str(i+split_len)]), seqstr0[i:i+split_len])], output_fasta, fastx_num, moltype, gencode, outfile, no_reverse=no_reverse) 75 | i += split_slide 76 | output_fasta_percpu([('_'.join([seqid, 'split', str(i+1), str(len(seqstr0))]), seqstr0[i:])], output_fasta, fastx_num, moltype, gencode, outfile, 
no_reverse=no_reverse) 77 | elif moltype.startswith('dna'): 78 | outfile.write(">{}_fastx{}\n{}\n".format(seqid, fastx_num, seqstr0)) 79 | if not no_reverse: 80 | outfile.write(">{}_fastx{}_rev\n{}\n".format(seqid, fastx_num, reverse_complement(seqstr0))) 81 | else: 82 | # supress the warnings due to the last incomplete codons when translate the sequences 83 | with warnings.catch_warnings(): 84 | warnings.simplefilter('ignore', BiopythonWarning) 85 | for j in [1,2,3]: 86 | seqstr = seqstr0[(j-1):] 87 | outfile.write(">{}_fastx{}_pos{}\n{}\n".format(seqid, fastx_num, j, translate(seqstr, table=gencode))) 88 | if not no_reverse: 89 | seqstr0 = reverse_complement(seqstr0) 90 | for j in [1,2,3]: 91 | seqstr = seqstr0[(j-1):] 92 | outfile.write(">{}_fastx{}_pos{}rev\n{}\n".format(seqid, fastx_num, j, translate(seqstr, table=gencode))) 93 | if outfile0 is None: 94 | outfile.close() 95 | 96 | # convert the raw FASTQ/FASTA files to (translated) FASTA format 97 | def fastx2fasta(fastxs, fasta, file_format='guess', cpu=8, moltype='dna', gencode=1, split_len=None, split_slide=None, no_reverse=False, low_mem=False, output=True): 98 | if output: 99 | outfile = open(fasta, 'w') 100 | all_seqs = [] 101 | total_count = 0 102 | for i in range(len(fastxs)): 103 | if low_mem: 104 | fastx_iter, file_format = lib.read_fastx(fastxs[i], file_format, low_mem=True) 105 | all_seqs.append([fastxs[i], file_format]) 106 | seqs = {} 107 | else: 108 | seqs = lib.read_fastx(fastxs[i], file_format) 109 | all_seqs.append(seqs) 110 | total_count += len(seqs) 111 | if output: 112 | if not low_mem and cpu > 1 and (split_len or len(seqs) > 10000): 113 | # run in multiprocess when too much sequences 114 | print("Binning the reads in '{}' into {} parts and preparing in multiprocess...".format(fastxs[i], cpu)) 115 | nseq = int(len(seqs) / cpu) 116 | if nseq * cpu < len(seqs): 117 | nseq += 1 118 | kwds = {'fastx_num': i+1, 'moltype': moltype, 'gencode': gencode, 'split_len': split_len, 'split_slide': split_slide, 'no_reverse': no_reverse} 119 | args_list = [] 120 | for j in range(cpu-1): 121 | args_list.append((list(seqs.items())[(nseq*j):(nseq*(j+1))], fasta + '.' + str(j))) 122 | args_list.append((list(seqs.items())[(nseq*(cpu-1)):], fasta + '.' + str(cpu-1))) 123 | lib.run_mp(output_fasta_percpu, args_list, cpu, kwds=kwds) 124 | for j in range(cpu): 125 | for line in open(fasta + '.' + str(j)): 126 | outfile.write(line) 127 | os.remove(fasta + '.' 
+ str(j)) 128 | elif low_mem: 129 | # output using a low-memory method and obtain the read count 130 | total_count += output_fasta_percpu([], fasta, i+1, moltype, gencode, outfile, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse, low_mem_iter=fastx_iter, low_mem_format=file_format) 131 | else: 132 | # directly output by single cpu 133 | output_fasta_percpu(list(seqs.items()), fasta, i+1, moltype, gencode, outfile, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse) 134 | if output: 135 | outfile.close() 136 | return all_seqs, total_count 137 | 138 | # construct HMM file by hmmbuild 139 | def hmmbuild(alnfile, group_name, cpu=1, moltype='dna', gencode=1, parameters=[]): 140 | log = open(os.path.join('ref_hmm', group_name + '.log'), 'w') 141 | if moltype == 'codon': 142 | lib.trans_seq(alnfile, os.path.join('ref_hmm', group_name + '.aa.fas'), gencode=gencode) 143 | cmd = ['hmmbuild', '-O', os.path.join('ref_hmm', group_name + '.sto'), '--cpu', str(cpu)] 144 | cmd.extend(parameters) 145 | cmd.extend([os.path.join('ref_hmm', group_name + '.hmm'), os.path.join('ref_hmm', group_name + '.aa.fas')]) 146 | else: 147 | cmd = ['hmmbuild', '-O', os.path.join('ref_hmm', group_name + '.sto'), '--cpu', str(cpu)] 148 | cmd.extend(parameters) 149 | cmd.extend([os.path.join('ref_hmm', group_name + '.hmm'), alnfile]) 150 | ifcomplish = lib.runcmd(cmd, log, stdout=False) 151 | log.close() 152 | if ifcomplish: 153 | return 0, group_name 154 | else: 155 | return 1, group_name 156 | 157 | # HMM search by HMMER 158 | def hmmsearch(group_name, species, cpu=1, parameters=[]): 159 | if not os.path.isfile(os.path.join('ok', "map_{}_{}.ok".format(species, group_name))): 160 | fasta = species + '.temp.fasta' 161 | log = open(os.path.join('map_' + species, group_name + '.log'), 'w') 162 | cmd = ['hmmsearch', '-o', os.path.join('map_' + species, group_name + '.txt'), '--tblout', os.path.join('map_' + species, group_name + '.tbl'), '--cpu', str(cpu)] 163 | cmd.extend(parameters) 164 | cmd.extend([os.path.join('ref_hmm', group_name + '.hmm'), fasta]) 165 | ifcomplish = lib.runcmd(cmd, log, stdout=False) 166 | log.close() 167 | if ifcomplish: 168 | open(os.path.join('ok', "map_{}_{}.ok".format(species, group_name)), 'w').close() 169 | return 0, group_name 170 | else: 171 | return 1, group_name 172 | return 0, group_name 173 | 174 | # prepare the HMMs of alignments 175 | def prepare_ref(alns, cpu=8, np=8, moltype='dna', gencode=1, parameters=[]): 176 | lib.check_programs(['hmmbuild']) 177 | np = min(np, len(alns), cpu) 178 | ncpu = int(cpu / np) 179 | 180 | print("\nPreparing the reference alignments...") 181 | if os.path.isfile(os.path.join('ok', 'prepare_alignments.ok')): 182 | print("\nUsing the existing hmm files in directory 'ref_hmm'") 183 | else: 184 | print("\nBuilding HMMs for mapping...") 185 | if not os.path.isdir('ref_hmm'): 186 | os.mkdir('ref_hmm') 187 | args_list = [] 188 | kwds = {'moltype': moltype, 'gencode': gencode, 'parameters': parameters} 189 | usedcpu = ncpu * np 190 | for group_name, alnfile in alns.items(): 191 | if usedcpu < cpu: 192 | args_list.append((alnfile, group_name, ncpu+1)) 193 | usedcpu += 1 194 | else: 195 | args_list.append((alnfile, group_name, ncpu)) 196 | iferrors = lib.run_mp(hmmbuild, args_list, np, kwds=kwds) 197 | errors = [] 198 | for iferror in iferrors: 199 | if iferror[0] == 1: 200 | errors.append(iferror[1]) 201 | if errors: 202 | print("\nError in hmmbuild commands: {}".format(', '.join(errors))) 203 | sys.exit(1) 204 | 
open(os.path.join('ok', 'prepare_alignments.ok'), 'w').close() 205 | 206 | # map the reads to the HMMs of alignments 207 | def map_reads(alns, species, fastxs, file_format='guess', cpu=8, np=8, moltype='dna', gencode=1, split_len=None, split_slide=None, no_reverse=False, low_mem=False, parameters=[]): 208 | lib.check_programs(['hmmsearch']) 209 | if os.path.isfile(os.path.join('ok', "prepare_{}.ok".format(species))): 210 | print("\nUsing the existing temp FASTA file of {}".format(species)) 211 | all_seqs, total_reads = fastx2fasta(fastxs, species + '.temp.fasta', file_format, cpu=cpu, moltype=moltype, gencode=gencode, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse, low_mem=low_mem, output=False) 212 | else: 213 | print("\nPreparing the reads of {}...".format(species)) 214 | all_seqs, total_reads = fastx2fasta(fastxs, species + '.temp.fasta', file_format, cpu=cpu, moltype=moltype, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse, low_mem=low_mem, gencode=gencode) 215 | open(os.path.join('ok', "prepare_{}.ok".format(species)), 'w').close() 216 | 217 | if not os.path.isdir('map_' + species): 218 | os.mkdir('map_' + species) 219 | np = min(np, len(alns), cpu) 220 | ncpu = int(cpu / np) 221 | print("\nMapping the reads of {} to reference alignments...".format(species)) 222 | args_list = [] 223 | kwds = {'parameters': parameters} 224 | usedcpu = ncpu * np 225 | for group_name in alns.keys(): 226 | if usedcpu < cpu: 227 | args_list.append((group_name, species, ncpu+1)) 228 | usedcpu += 1 229 | else: 230 | args_list.append((group_name, species, ncpu)) 231 | iferrors = lib.run_mp(hmmsearch, args_list, np, kwds=kwds) 232 | errors = [] 233 | for iferror in iferrors: 234 | if iferror[0] == 1: 235 | errors.append(iferror[1]) 236 | if errors: 237 | print("\nError in hmmsearch commands: {}".format(', '.join(errors))) 238 | sys.exit(1) 239 | if os.path.exists(species + '.temp.fasta'): 240 | os.remove(species + '.temp.fasta') 241 | return all_seqs, total_reads 242 | 243 | # read the information of HMMER results 244 | def read_hmmer(hmmtxt): 245 | hmmresults = [] 246 | qresults = SearchIO.parse(hmmtxt, 'hmmer3-text') 247 | for qresult in qresults: 248 | for hit in qresult: 249 | for HSP in hit: 250 | for HSPfrag in HSP: 251 | hmmresults.append(HSPfrag) 252 | return hmmresults 253 | 254 | # extract the target reads from the raw data 255 | def extract_reads(alns, species, all_seqs, moltype='dna', split_len=None, low_mem=False): 256 | all_hmmres = {} 257 | target_seqids = {} 258 | for group_name in alns.keys(): 259 | hmmresults = read_hmmer(os.path.join('map_' + species, group_name + '.txt')) 260 | all_hmmres[group_name] = hmmresults 261 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'w') 262 | for hmmresult in hmmresults: 263 | if moltype.startswith('dna'): 264 | hitid = hmmresult.hit_id 265 | if hitid.endswith('_rev'): 266 | hitid = hitid.replace('_rev', '') 267 | seqid = '_'.join(hitid.split('_')[:-1]) 268 | fastx_num = hitid.split('_')[-1].replace('fastx', '') 269 | else: 270 | seqid = '_'.join(hmmresult.hit_id.split('_')[:-2]) 271 | fastx_num = hmmresult.hit_id.split('_')[-2].replace('fastx', '') 272 | if low_mem: 273 | if target_seqids.get(fastx_num) is None: 274 | target_seqids[fastx_num] = {} 275 | if split_len: 276 | seqid0, start_end = seqid.split('_split_') 277 | if target_seqids[fastx_num].get(seqid0) is None: 278 | target_seqids[fastx_num][seqid0] = {} 279 | if target_seqids[fastx_num][seqid0].get(group_name) is None: 280 | 
target_seqids[fastx_num][seqid0][group_name] = [] 281 | target_seqids[fastx_num][seqid0][group_name].append(start_end.split('_')) 282 | else: 283 | if target_seqids[fastx_num].get(seqid) is None: 284 | target_seqids[fastx_num][seqid] = [] 285 | target_seqids[fastx_num][seqid].append(group_name) 286 | elif split_len: 287 | seqid, start_end = seqid.split('_split_') 288 | start, end = start_end.split('_') 289 | outfile.write(">{}\n{}\n".format('_'.join([seqid, 'split', start, end, 'fastx' + fastx_num]), all_seqs[int(fastx_num)-1][seqid][int(start)-1:int(end)])) 290 | else: 291 | outfile.write(">{}\n{}\n".format(seqid + '_fastx' + fastx_num, all_seqs[int(fastx_num)-1][seqid])) 292 | outfile.close() 293 | if low_mem: 294 | # low-memory mode: extract the reads from the file instead of the store 295 | for fastx_num, seqids in target_seqids.items(): 296 | fastx_iter, file_format = lib.read_fastx(all_seqs[int(fastx_num)-1][0], all_seqs[int(fastx_num)-1][1], low_mem=True) 297 | seqid = None 298 | seqstr = None 299 | line_num = 0 300 | for line in fastx_iter: 301 | line = line.rstrip() 302 | if file_format == 'fastq' and line.startswith('@') and line_num % 4 == 0: 303 | seqid = line.replace('@', '', 1).replace(' ', '_') 304 | elif file_format == 'fasta' and line.startswith('>'): 305 | if seqid is not None and seqids.get(seqid) is not None: 306 | if split_len: 307 | for group_name, start_ends in seqids[seqid].items(): 308 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 309 | for start_end in start_ends: 310 | outfile.write(">{}\n{}\n".format('_'.join([seqid, 'split', '_'.join(start_end), 'fastx' + fastx_num]), seqstr[int(start_end[0])-1:int(start_end[1])])) 311 | outfile.close() 312 | else: 313 | for group_name in seqids[seqid]: 314 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 315 | outfile.write(">{}\n{}\n".format(seqid + '_fastx' + fastx_num, seqstr)) 316 | outfile.close() 317 | arr = line.split(" ") 318 | seqid = arr[0].lstrip('>') 319 | seqstr = '' 320 | elif seqid: 321 | if file_format == 'fastq': 322 | if seqids.get(seqid) is not None: 323 | if split_len: 324 | for group_name, start_ends in seqids[seqid].items(): 325 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 326 | for start_end in start_ends: 327 | outfile.write(">{}\n{}\n".format('_'.join([seqid, 'split', '_'.join(start_end), 'fastx' + fastx_num]), line[int(start_end[0])-1:int(start_end[1])])) 328 | outfile.close() 329 | else: 330 | for group_name in seqids[seqid]: 331 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 332 | outfile.write(">{}\n{}\n".format(seqid + '_fastx' + fastx_num, line)) 333 | outfile.close() 334 | seqid = None 335 | else: 336 | seqstr += line 337 | line_num += 1 338 | if file_format == 'fasta' and seqstr and seqids.get(seqid) is not None: 339 | if split_len: 340 | for group_name, start_ends in seqids[seqid].items(): 341 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 342 | for start_end in start_ends: 343 | outfile.write(">{}\n{}\n".format('_'.join([seqid, 'split', '_'.join(start_end), 'fastx' + fastx_num]), seqstr[int(start_end[0])-1:int(start_end[1])])) 344 | outfile.close() 345 | else: 346 | for group_name in seqids[seqid]: 347 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 348 | outfile.write(">{}\n{}\n".format(seqid + '_fastx' + fastx_num, seqstr)) 349 | outfile.close() 350 | return all_hmmres 351 | 
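# --- Illustrative usage sketch, not part of the module above ---
# A minimal example of how the functions in lib/map.py chain together when called
# directly (normally they are driven by the main program). It assumes HMMER is
# installed, the working directory already contains the 'ok' subdirectory created
# by the main program, and the hypothetical files 'ref_aln/OG0000001.fa',
# 'SP1_1.fq.gz' and 'SP1_2.fq.gz' exist.
if __name__ == '__main__':
    example_alns = {'OG0000001': 'ref_aln/OG0000001.fa'}  # hypothetical reference alignment
    # build one HMM per reference alignment into the 'ref_hmm' directory
    prepare_ref(example_alns, cpu=4, np=1, moltype='dna')
    # convert the reads of species 'SP1' into a temporary FASTA file and search them against the HMMs
    all_seqs, total_reads = map_reads(example_alns, 'SP1', ['SP1_1.fq.gz', 'SP1_2.fq.gz'], file_format='fastq', cpu=4, np=1)
    # write the mapped reads of each group to 'map_SP1/<group>.targets.fa' for downstream assembly
    extract_reads(example_alns, 'SP1', all_seqs)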
-------------------------------------------------------------------------------- /lib/assemble.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import sys 4 | import os 5 | from copy import deepcopy 6 | from Bio.Seq import reverse_complement, translate 7 | from Bio import AlignIO 8 | try: 9 | import library as lib 10 | except ImportError: 11 | import lib.library as lib 12 | 13 | # class to process the information of the read or sequence hits 14 | class read_hit: 15 | 16 | # inherit the key parameters from the HSPfrag objects 17 | def __init__(self, hmmresult): 18 | self.hit_id = hmmresult.hit_id 19 | self.hit_start = hmmresult.hit_start + 1 20 | self.hit_end = hmmresult.hit_end 21 | self.hit_seq = str(hmmresult.hit.seq).upper() 22 | self.query_start = hmmresult.query_start + 1 23 | self.query_end = hmmresult.query_end 24 | self.query_seq = str(hmmresult.query.seq).upper() 25 | 26 | # map the sequence of the target region of target read without assembly 27 | def map_read_seqstr(self, seqid, seqstr, reads, moltype='dna', gencode=1): 28 | self.moltype = moltype 29 | self.gencode = gencode 30 | self.seq_comp = {} 31 | self.reads = reads 32 | rep = len(reads) 33 | self.read_count = rep 34 | pos = 0 35 | j = 0 36 | jpos = 0 37 | if moltype.startswith('dna'): 38 | if seqid.endswith('_rev'): 39 | seqstr = reverse_complement(seqstr) 40 | for i in range(self.hit_start-1, self.hit_end): 41 | while self.hit_seq[pos] in ['.', '-']: 42 | self.seq_comp[self.query_start + j] = {'-': rep} 43 | j += 1 44 | pos += 1 45 | jpos += 1 46 | if self.query_seq[jpos] not in ['.', '-']: 47 | self.seq_comp[self.query_start + j] = {seqstr[i]: rep} 48 | j += 1 49 | pos += 1 50 | jpos += 1 51 | else: 52 | # for protein or codon-to-nucleotide alignments (codon alignments) 53 | if seqid.endswith('rev'): 54 | seqstr = reverse_complement(seqstr) 55 | i = int(seqid.split('_')[-1].replace('pos', '').replace('rev', '')) 56 | codon_pos = i - 1 57 | for protpos in range(self.hit_start-1, self.hit_end): 58 | i = protpos * 3 + codon_pos 59 | while self.hit_seq[pos] in ['.', '-']: 60 | self.seq_comp[self.query_start + j] = {1: {'-': rep}, 2: {'-': rep}, 3: {'-': rep}} 61 | j += 1 62 | pos += 1 63 | jpos += 1 64 | if self.query_seq[jpos] not in ['.', '-']: 65 | self.seq_comp[self.query_start + j] = {1: {seqstr[i]: rep}, 2: {seqstr[i+1]: rep}, 3: {seqstr[i+2]: rep}} 66 | j += 1 67 | pos += 1 68 | jpos += 1 69 | 70 | # print the sequence from the hit information 71 | def print_seq(self, seq, protseq=None): 72 | for pos, base in self.seq_comp.items(): 73 | if protseq: 74 | for i in [1, 2, 3]: 75 | seq[3*pos+i-4] = base['codon'][i-1] 76 | protseq[pos-1] = base['base'] 77 | else: 78 | seq[pos-1] = base 79 | return seq, protseq 80 | 81 | # calculate pseudo RPKM of the hit using read counts 82 | def addRPKM(self, total_count): 83 | self.pseudoRPKM = 1000 * 1000000 * self.read_count / total_count / len(self.seq_comp) 84 | 85 | # read the alignments with stockholm format for HMM construction (conservative sites for output alignments) 86 | def read_sto(stofile): 87 | seqs = {} 88 | alignment = AlignIO.read(stofile, "stockholm") 89 | cons_pos = alignment._per_col_annotations['reference_annotation'] 90 | for seq in alignment: 91 | seqstr = str(seq.seq) 92 | seqs[seq.id] = '' 93 | for i in range(len(seqstr)): 94 | if cons_pos[i] == 'x': 95 | seqs[seq.id] += seqstr[i] 96 | return seqs 97 | 98 | # calculate the most common base of the site 99 | def most_common(bases): 100 | 
return max(bases, key=bases.get) 101 | 102 | # calculate the identical site number between two hits 103 | def calculate_ident(bases1, bases2): 104 | total_bases1 = sum(bases1.values()) 105 | total_bases2 = sum(bases2.values()) 106 | ident = 0 107 | for base in set(bases1.keys()) & set(bases2.keys()): 108 | ident += bases1[base] / total_bases1 * bases2[base] / total_bases2 109 | return ident 110 | 111 | # calculate overlap length and percent identity between two hits for cross decontamination 112 | def compare_hits(hitA, hitB, min_overlap=30, min_pident=98): 113 | if hitA.query_start > hitB.query_start: 114 | hit1 = deepcopy(hitB) 115 | hit2 = deepcopy(hitA) 116 | else: 117 | hit1 = deepcopy(hitA) 118 | hit2 = deepcopy(hitB) 119 | nident = 0 120 | end_pos = min(hit1.query_end, hit2.query_end) 121 | if hit1.moltype.startswith('dna'): 122 | overlap_len = end_pos - hit2.query_start + 1 123 | if overlap_len < min_overlap: 124 | return 0, 0 125 | for i in range(hit2.query_start, end_pos+1): 126 | if hit1.seq_comp[i] == hit2.seq_comp[i]: 127 | nident += 1 128 | else: 129 | # for codon alignments 130 | overlap_len = (end_pos - hit2.query_start + 1) * 3 131 | if overlap_len < min_overlap: 132 | return 0, 0 133 | for i in range(hit2.query_start, end_pos+1): 134 | for k in [0, 1, 2]: 135 | if hit1.seq_comp[i]['codon'][k] == hit2.seq_comp[i]['codon'][k]: 136 | nident += 1 137 | if overlap_len == 0: 138 | pident = 0 139 | else: 140 | pident = nident / overlap_len * 100 141 | if pident < min_pident: 142 | return 0, 0 143 | else: 144 | return overlap_len, pident 145 | 146 | # class to assemble the reads and process the reads mapped to the alignment 147 | class Assembler: 148 | 149 | # construct the site composition of the alignments and calculate the conservative score of the outgroup sequence 150 | def __init__(self, group_name, alnfile, species, outgroup=[], ingroup=[], sep='.'): 151 | self.species = species 152 | self.name = group_name 153 | self.alnfile = alnfile 154 | stofile = os.path.join('ref_hmm', group_name + '.sto') 155 | self.aln = read_sto(stofile) 156 | self.aln_len = len(list(self.aln.values())[0]) 157 | outgroup_aln = {} 158 | ingroup_aln = {} 159 | # process the gap symbols 160 | for seqid, seqstr in self.aln.items(): 161 | self.aln[seqid] = seqstr.replace('.', '-') 162 | self.aln[seqid] = seqstr.replace('~', '-') 163 | if outgroup: 164 | for spid in outgroup: 165 | for seqid in self.aln.keys(): 166 | if seqid == spid or seqid.startswith(spid + sep): 167 | outgroup_aln[seqid] = self.aln[seqid] 168 | break 169 | else: 170 | outgroup_aln = deepcopy(self.aln) 171 | if ingroup: 172 | for spid in ingroup: 173 | for seqid in self.aln.keys(): 174 | if seqid == spid or seqid.startswith(spid + sep): 175 | ingroup_aln[seqid] = self.aln[seqid] 176 | break 177 | else: 178 | if len(outgroup_aln) == len(self.aln): 179 | ingroup_aln = deepcopy(self.aln) 180 | else: 181 | for seqid, seqstr in self.aln.items(): 182 | if outgroup_aln.get(seqid) is None: 183 | ingroup_aln[seqid] = seqstr 184 | self.site_composition = {} 185 | self.outgroup_score = {} 186 | for seqid in outgroup_aln.keys(): 187 | self.outgroup_score[seqid] = {} 188 | for i in range(self.aln_len): 189 | self.site_composition[i+1] = {} 190 | for seqstr in ingroup_aln.values(): 191 | base = seqstr[i] 192 | if self.site_composition[i+1].get(base) is None: 193 | self.site_composition[i+1][base] = 1 194 | else: 195 | self.site_composition[i+1][base] += 1 196 | for seqid in outgroup_aln.keys(): 197 | self.outgroup_score[seqid][i+1] = 
self.site_composition[i+1].get(outgroup_aln[seqid][i], 0) 198 | 199 | # map the new hits with or without assembly to the original mapped hits 200 | def map_binning(self, hit, seqid, seqstr, reads, min_overlap=30, min_pident=98): 201 | read_count = len(reads) 202 | for n in range(len(self.hits)): 203 | refhit = self.hits[n] 204 | newhit = deepcopy(refhit) 205 | end_pos = min(refhit.query_end, hit.query_end) 206 | pos = 0 207 | jpos = 0 208 | if self.moltype.startswith('dna'): 209 | overlap_len = end_pos - hit.query_start + 1 210 | if overlap_len < min_overlap: 211 | continue 212 | if seqid.endswith('_rev'): 213 | seqstr = reverse_complement(seqstr) 214 | nident = 0 215 | j = hit.hit_start - 1 216 | for i in range(hit.query_start, end_pos+1): 217 | while hit.query_seq[jpos] in ['.', '-']: 218 | j += 1 219 | pos += 1 220 | jpos += 1 221 | if hit.hit_seq[pos] in ['.', '-']: 222 | nident += calculate_ident(refhit.seq_comp[i], {'-': read_count}) 223 | if newhit.seq_comp[i].get('-') is None: 224 | newhit.seq_comp[i]['-'] = read_count 225 | else: 226 | newhit.seq_comp[i]['-'] += read_count 227 | else: 228 | nident += calculate_ident(refhit.seq_comp[i], {seqstr[j]: read_count}) 229 | if newhit.seq_comp[i].get(seqstr[j]) is None: 230 | newhit.seq_comp[i][seqstr[j]] = read_count 231 | else: 232 | newhit.seq_comp[i][seqstr[j]] += read_count 233 | j += 1 234 | pos += 1 235 | jpos += 1 236 | if nident / overlap_len * 100 < min_pident: 237 | continue 238 | i = end_pos + 1 239 | while i < hit.query_end + 1: 240 | if hit.hit_seq[pos] in ['.', '-']: 241 | newhit.seq_comp[i] = {'-': read_count} 242 | i += 1 243 | elif hit.query_seq[jpos] in ['.', '-']: 244 | j += 1 245 | else: 246 | newhit.seq_comp[i] = {seqstr[j]: read_count} 247 | i += 1 248 | j += 1 249 | pos += 1 250 | jpos += 1 251 | if end_pos < hit.query_end: 252 | newhit.query_end = hit.query_end 253 | newhit.read_count += read_count 254 | newhit.reads.extend(reads) 255 | self.hits[n] = newhit 256 | #debug 257 | #print("overlap length:", str(overlap_len), "pident:", str(nident / overlap_len * 100), "\nMerging\n", vars(refhit), "\nand\n", vars(hit), "seqid: ", seqid, "seq:", seqstr, "\ninto\n", vars(newhit)) 258 | #input() 259 | hit = None 260 | break 261 | else: 262 | # for codon alignments 263 | overlap_len = (end_pos - hit.query_start + 1) * 3 264 | if overlap_len < min_overlap: 265 | continue 266 | if seqid.endswith('rev'): 267 | seqstr = reverse_complement(seqstr) 268 | i = int(seqid.split('_')[-1].replace('pos', '').replace('rev', '')) 269 | codon_pos = i - 1 270 | nident = 0 271 | j = hit.hit_start * 3 + codon_pos - 3 272 | for i in range(hit.query_start, end_pos+1): 273 | while hit.query_seq[jpos] in ['.', '-']: 274 | j += 3 275 | pos += 1 276 | jpos += 1 277 | if hit.hit_seq[pos] in ['.', '-']: 278 | for k in [1, 2, 3]: 279 | nident += calculate_ident(refhit.seq_comp[i][k], {'-': read_count}) 280 | if newhit.seq_comp[i][k].get('-') is None: 281 | newhit.seq_comp[i][k]['-'] = read_count 282 | else: 283 | newhit.seq_comp[i][k]['-'] += read_count 284 | else: 285 | for k in [1, 2, 3]: 286 | nident += calculate_ident(refhit.seq_comp[i][k], {seqstr[j+k-1]: read_count}) 287 | if newhit.seq_comp[i][k].get(seqstr[j+k-1]) is None: 288 | newhit.seq_comp[i][k][seqstr[j+k-1]] = read_count 289 | else: 290 | newhit.seq_comp[i][k][seqstr[j+k-1]] += read_count 291 | j += 3 292 | pos += 1 293 | jpos += 1 294 | if nident / overlap_len * 100 < min_pident: 295 | continue 296 | i = end_pos + 1 297 | while i < hit.query_end + 1: 298 | if hit.hit_seq[pos] in ['.', 
'-']: 299 | newhit.seq_comp[i] = {1: {'-': read_count}, 2: {'-': read_count}, 3: {'-': read_count}} 300 | i += 1 301 | elif hit.query_seq[jpos] in ['.', '-']: 302 | j += 3 303 | else: 304 | newhit.seq_comp[i] = {1: {seqstr[j]: read_count}, 2: {seqstr[j+1]: read_count}, 3: {seqstr[j+2]: read_count}} 305 | i += 1 306 | j += 3 307 | pos += 1 308 | jpos += 1 309 | if end_pos < hit.query_end: 310 | newhit.query_end = hit.query_end 311 | newhit.read_count += read_count 312 | newhit.reads.extend(reads) 313 | self.hits[n] = newhit 314 | #debug 315 | #print("overlap length:", str(overlap_len), "pident:", str(nident / overlap_len * 100), "\nMerging\n", vars(refhit), "\nand\n", vars(hit), "seqid: ", seqid, "seq:", seqstr, "\ninto\n", vars(newhit)) 316 | #input() 317 | hit = None 318 | break 319 | 320 | # if not assembled to any mapped hits, mapped it simply to the alignments 321 | if hit is not None: 322 | hit.map_read_seqstr(seqid, seqstr, reads, self.moltype, self.gencode) 323 | self.hits.append(hit) 324 | 325 | # map and assemble the hits 326 | def map_read_seq(self, hmmresults, fasta, moltype='dna', gencode=1, no_assemble=False, min_overlap=30, min_pident=98): 327 | self.moltype = moltype 328 | self.gencode = gencode 329 | self.stat_num = {'raw hits': len(hmmresults)} 330 | self.hits = [] 331 | targets = [] 332 | readids = [] 333 | hmmhits = [] 334 | hmmresults.sort(key = lambda x : x.query_end) 335 | hmmresults.sort(key = lambda x : x.query_start) 336 | for hmmresult in hmmresults: 337 | if moltype.startswith('dna'): 338 | hitid = hmmresult.hit_id 339 | if hitid.endswith('_rev'): 340 | hitid = hitid.replace('_rev', '') 341 | targets.append(hmmresult.hit_id) 342 | readids.append(hitid) 343 | hmmhits.append(read_hit(hmmresult)) 344 | else: 345 | targets.append(hmmresult.hit_id) 346 | readids.append('_'.join(hmmresult.hit_id.split('_')[:-1])) 347 | hmmhits.append(read_hit(hmmresult)) 348 | reads = lib.read_fastx(fasta, 'fasta', select_list=readids) 349 | 350 | # cluster the identical hits first 351 | if not no_assemble: 352 | reps = {} 353 | for i in range(len(targets)): 354 | identifier = '_'.join([reads[readids[i]].upper(), str(hmmhits[i].query_start), str(hmmhits[i].query_end)]) 355 | if reps.get(identifier) is None: 356 | reps[identifier] = [] 357 | reps[identifier].append(readids[i]) 358 | 359 | # map and assemble each hit 360 | for i in range(len(targets)): 361 | hit = hmmhits[i] 362 | #debug 363 | #print(vars(hit)) 364 | #input() 365 | seqstr = reads[readids[i]].upper() 366 | if no_assemble: 367 | hit.map_read_seqstr(targets[i], seqstr, [readids[i]], moltype, gencode) 368 | self.hits.append(hit) 369 | else: 370 | identifier = '_'.join([seqstr, str(hit.query_start), str(hit.query_end)]) 371 | if reps[identifier][-1] != 'mapped': 372 | self.map_binning(hit, targets[i], seqstr, reps[identifier], min_overlap, min_pident) 373 | reps[identifier].append('mapped') 374 | 375 | # record the hit number 376 | if not no_assemble: 377 | self.stat_num['assembled contigs'] = len(self.hits) 378 | 379 | # convert the assembled contig information into simple sequence information (the most common bases instead of SNPs) 380 | def simplify_hit_info(self): 381 | for i in range(len(self.hits)): 382 | for j in self.hits[i].seq_comp.keys(): 383 | bases = self.hits[i].seq_comp[j] 384 | if self.hits[i].moltype.startswith('dna'): 385 | self.hits[i].seq_comp[j] = most_common(bases) 386 | else: 387 | base1 = most_common(bases[1]) 388 | base2 = most_common(bases[2]) 389 | base3 = most_common(bases[3]) 390 | if base1 == '-' 
and base2 == '-' and base3 == '-': 391 | base = '-' 392 | else: 393 | try: 394 | base = translate(base1 + base2 + base3, table=self.hits[i].gencode) 395 | except: 396 | base = 'X' 397 | self.hits[i].seq_comp[j] = {'base': base, 'codon': [base1, base2, base3]} 398 | 399 | # remove the foreign sequences based on conservative scores 400 | def remove_out_hit(self, weight=1): 401 | to_remove = [] 402 | for i in range(len(self.hits)): 403 | hit = self.hits[i] 404 | score = 0 405 | outgroup_score = {} 406 | for seqid in self.outgroup_score.keys(): 407 | outgroup_score[seqid] = 0 408 | for pos, seq_comp in hit.seq_comp.items(): 409 | if hit.moltype.startswith('dna'): 410 | base = seq_comp 411 | else: 412 | base = seq_comp['base'] 413 | score += self.site_composition[pos].get(base, 0) 414 | for seqid in self.outgroup_score.keys(): 415 | outgroup_score[seqid] += self.outgroup_score[seqid][pos] 416 | if score < min(outgroup_score.values()) * weight: 417 | to_remove.append(i) 418 | for i in reversed(to_remove): 419 | self.hits.pop(i) 420 | self.stat_num['ingroup clean seqs'] = len(self.hits) 421 | 422 | # output the final sequences 423 | def output_sequence(self, mode='consensus', nuclfill='N', protfill='X'): 424 | seqs = [] 425 | protseqs = [] 426 | seqids = [] 427 | if len(self.hits) == 0: 428 | return seqs, protseqs, seqids 429 | if mode.startswith('consensus'): 430 | seq_info = {} 431 | # the sequences consisting of more hits have more weights 432 | self.hits.sort(key = lambda x : x.read_count, reverse = True) 433 | for hit in self.hits: 434 | seqids.extend(hit.reads) 435 | for pos, base in hit.seq_comp.items(): 436 | if hit.moltype.startswith('dna'): 437 | if seq_info.get(pos) is None: 438 | seq_info[pos] = {} 439 | if seq_info[pos].get(base) is None: 440 | seq_info[pos][base] = 1 441 | else: 442 | seq_info[pos][base] += 1 443 | else: 444 | if seq_info.get(pos) is None: 445 | seq_info[pos] = {1: {}, 2: {}, 3:{}} 446 | for k in [1, 2, 3]: 447 | if seq_info[pos][k].get(base['codon'][k-1]) is None: 448 | seq_info[pos][k][base['codon'][k-1]] = 1 449 | else: 450 | seq_info[pos][k][base['codon'][k-1]] += 1 451 | if self.moltype.startswith('dna'): 452 | seq = list(nuclfill * self.aln_len) 453 | else: 454 | seq = list(nuclfill * (3*self.aln_len)) 455 | protseq = list(protfill * self.aln_len) 456 | for pos, bases in seq_info.items(): 457 | if self.moltype.startswith('dna'): 458 | if mode == 'consensus' or len(set(bases)) == 1: 459 | seq[pos-1] = most_common(bases) 460 | else: 461 | for i in [1, 2, 3]: 462 | if mode == 'consensus' or len(set(bases[i])) == 1: 463 | seq[3*pos+i-4] = most_common(bases[i]) 464 | codon = ''.join(seq[3*pos-3:3*pos]) 465 | if codon == '---': 466 | protseq[pos-1] = '-' 467 | else: 468 | try: 469 | protseq[pos-1] = translate(codon, table=self.gencode) 470 | except: 471 | protseq[pos-1] = 'X' 472 | seqs.append(''.join(seq)) 473 | if not self.moltype.startswith('dna'): 474 | protseqs.append(''.join(protseq)) 475 | seqids = [' '.join(seqids)] 476 | elif mode == 'all': 477 | # output all assembled sequences without consensus or selection 478 | for hit in self.hits: 479 | if self.moltype.startswith('dna'): 480 | seq = list(nuclfill * self.aln_len) 481 | protseq = None 482 | else: 483 | seq = list(nuclfill * (3*self.aln_len)) 484 | protseq = list(protfill * self.aln_len) 485 | seq, protseq = hit.print_seq(seq, protseq) 486 | seqs.append(''.join(seq)) 487 | if not self.moltype.startswith('dna'): 488 | protseqs.append(''.join(protseq)) 489 | seqids.append(' '.join(hit.reads)) 490 | else: 
491 | if mode == 'expression': 492 | self.hits.sort(key = lambda x : x.read_count, reverse = True) 493 | else: 494 | # for longest sequence 495 | self.hits.sort(key = lambda x : len(x.seq_comp), reverse = True) 496 | if self.moltype.startswith('dna'): 497 | seq = list(nuclfill * self.aln_len) 498 | protseq = None 499 | else: 500 | seq = list(nuclfill * (3*self.aln_len)) 501 | protseq = list(protfill * self.aln_len) 502 | seq, protseq = self.hits[0].print_seq(seq, protseq) 503 | seqs.append(''.join(seq)) 504 | if not self.moltype.startswith('dna'): 505 | protseqs.append(''.join(protseq)) 506 | seqids = [' '.join(self.hits[0].reads)] 507 | return seqs, protseqs, list(map(lambda x : x.split('_fastx')[0], seqids)) 508 | 509 | # a single task for assembly and foreign sequence removal 510 | def generate_assembly(group_name, alnfile, species, hmmresults, np=8, moltype='dna', gencode=1, no_assemble=False, overlap_len=30, overlap_pident=98, no_out_filter=False, outgroup=[], ingroup=[], sep='.', outgroup_weight=1, final_seq='consensus'): 511 | assembler = Assembler(group_name, alnfile, species, outgroup=outgroup, ingroup=ingroup, sep=sep) 512 | assembler.map_read_seq(hmmresults, os.path.join('map_' + species, group_name + '.targets.fa'), moltype=moltype, gencode=gencode, no_assemble=no_assemble, min_overlap=overlap_len, min_pident=overlap_pident) 513 | assembler.simplify_hit_info() 514 | if not no_out_filter: 515 | assembler.remove_out_hit(outgroup_weight) 516 | # pre-process the hits to reduce the memory when not to assemble (too many hits) 517 | if no_assemble and len(assembler.hits) > 0: 518 | if final_seq.startswith('consensus'): 519 | seq_info = {} 520 | for hit in assembler.hits: 521 | for pos, base in hit.seq_comp.items(): 522 | if hit.moltype.startswith('dna'): 523 | if seq_info.get(pos) is None: 524 | seq_info[pos] = {} 525 | if seq_info[pos].get(base) is None: 526 | seq_info[pos][base] = 1 527 | else: 528 | seq_info[pos][base] += 1 529 | else: 530 | if seq_info.get(pos) is None: 531 | seq_info[pos] = {1: {}, 2: {}, 3:{}} 532 | for k in [1, 2, 3]: 533 | if seq_info[pos][k].get(base['codon'][k-1]) is None: 534 | seq_info[pos][k][base['codon'][k-1]] = 1 535 | else: 536 | seq_info[pos][k][base['codon'][k-1]] += 1 537 | start_pos = 0 538 | end_pos = 0 539 | for i in range(1, assembler.aln_len + 1): 540 | if seq_info.get(i) is None: 541 | continue 542 | if assembler.moltype.startswith('dna'): 543 | if final_seq == 'consensus' or len(seq_info[i]) == 1: 544 | assembler.hits[0].seq_comp[i] = most_common(seq_info[i]) 545 | else: 546 | assembler.hits[0].seq_comp[i] = None 547 | else: 548 | if final_seq == 'consensus' or len(seq_info[i][1]) == 1: 549 | base1 = most_common(seq_info[i][1]) 550 | else: 551 | base1 = 'N' 552 | if final_seq == 'consensus' or len(seq_info[i][2]) == 1: 553 | base2 = most_common(seq_info[i][2]) 554 | else: 555 | base2 = 'N' 556 | if final_seq == 'consensus' or len(seq_info[i][3]) == 1: 557 | base3 = most_common(seq_info[i][3]) 558 | else: 559 | base3 = 'N' 560 | if base1 == '-' and base2 == '-' and base3 == '-': 561 | base = '-' 562 | else: 563 | try: 564 | base = translate(base1 + base2 + base3, table=self.hits[i].gencode) 565 | except: 566 | base = 'X' 567 | assembler.hits[0].seq_comp[i] = {'base': base, 'codon': [base1, base2, base3]} 568 | if not start_pos: 569 | start_pos = i 570 | end_pos = i 571 | assembler.hits[0].query_start = start_pos 572 | assembler.hits[0].query_end = end_pos 573 | elif final_seq == 'length': 574 | assembler.hits.sort(key = lambda x : 
len(x.seq_comp), reverse = True) 575 | if final_seq != 'all': 576 | assembler.hits = [assembler.hits[0]] 577 | return assembler 578 | 579 | # assemble the hits and remove putative foreign sequences in multiple processes 580 | def generate_assembly_mp(alns, species, all_hmmres, np=8, moltype='dna', gencode=1, no_assemble=False, overlap_len=30, overlap_pident=98, no_out_filter=False, outgroup=[], ingroup=[], sep='.', outgroup_weight=1, final_seq='consensus'): 581 | print("\nAssembling the reads of {} and removing putative foreign sequences...".format(species)) 582 | assemblers = {} 583 | args_list = [] 584 | kwds = {'moltype': moltype, 'gencode': gencode, 'no_assemble': no_assemble, 'overlap_len': overlap_len, 'overlap_pident': overlap_pident, 'no_out_filter': no_out_filter, 'outgroup': outgroup, 'ingroup': ingroup, 'sep': sep, 'outgroup_weight': outgroup_weight, 'final_seq': final_seq} 585 | for group_name, alnfile in alns.items(): 586 | args_list.append((group_name, alnfile, species, all_hmmres[group_name])) 587 | asbs = lib.run_mp(generate_assembly, args_list, np, kwds=kwds) 588 | for assembler in asbs: 589 | assemblers[assembler.name] = assembler 590 | return assemblers 591 | 592 | # detect and remove cross contamination of assembled sequences 593 | def cross_decontam(assemblers, total_reads, min_overlap=30, min_pident=98, min_expression=0.2, min_fold=2): 594 | to_remove = {} 595 | for sp1 in assemblers.keys(): 596 | to_remove[sp1] = [] 597 | for i in range(len(assemblers[sp1].hits)): 598 | assemblers[sp1].hits[i].addRPKM(total_reads[sp1]) 599 | statout = open(os.path.join('stat_info', list(assemblers.values())[0].name + '.cross_info.tsv'), 'w') 600 | statout.write("species1\tquery_start1\tquery_end1\tspecies2\tquery_start2\tquery_end2\toverlap_length\tpident\tpseudoRPKM1\tpseudoRPKM2\texpression_fold\tseq1\tseq2\n") 601 | for sp1, assembler1 in assemblers.items(): 602 | for sp2, assembler2 in assemblers.items(): 603 | if sp1 == sp2: 604 | continue 605 | for i in range(len(assembler1.hits)): 606 | for j in range(len(assembler2.hits)): 607 | overlap_len, pident = compare_hits(assembler1.hits[i], assembler2.hits[j], min_overlap, min_pident) 608 | if overlap_len: 609 | exp_fold = assembler1.hits[i].pseudoRPKM / assembler2.hits[j].pseudoRPKM 610 | seq1 = '' 611 | for pos in range(assembler1.hits[i].query_start, assembler1.hits[i].query_end+1): 612 | if assembler1.moltype.startswith('dna'): 613 | seq1 += assembler1.hits[i].seq_comp[pos] 614 | else: 615 | for k in [1, 2, 3]: 616 | seq1 += assembler1.hits[i].seq_comp[pos]['codon'][k-1] 617 | seq2 = '' 618 | for pos in range(assembler2.hits[j].query_start, assembler2.hits[j].query_end+1): 619 | if assembler2.moltype.startswith('dna'): 620 | seq2 += assembler2.hits[j].seq_comp[pos] 621 | else: 622 | for k in [1, 2, 3]: 623 | seq2 += assembler2.hits[j].seq_comp[pos]['codon'][k-1] 624 | statout.write("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n".format(assembler1.species, assembler1.hits[i].query_start, assembler1.hits[i].query_end, assembler2.species, assembler2.hits[j].query_start, assembler2.hits[j].query_end, overlap_len, pident, assembler1.hits[i].pseudoRPKM, assembler2.hits[j].pseudoRPKM, exp_fold, seq1, seq2)) 625 | if exp_fold > min_fold and assembler1.hits[i].pseudoRPKM >= min_expression and j not in to_remove[sp2]: 626 | to_remove[sp2].append(j) 627 | elif exp_fold < 1 / min_fold and assembler2.hits[j].pseudoRPKM >= min_expression and i not in to_remove[sp1]: 628 | to_remove[sp1].append(i) 629 | statout.close() 630 | for sp, 
remove_indexes in to_remove.items(): 631 | for i in sorted(remove_indexes, reverse=True): 632 | assemblers[sp].hits.pop(i) 633 | assemblers[sp].stat_num['cross clean seqs'] = len(assemblers[sp].hits) 634 | return assemblers 635 | 636 | # generate conservative reference codon alignments (corresponding to protein alignments for HMM construction) 637 | def generate_codon_ref(assembler): 638 | codon_seqs = lib.read_fastx(assembler.alnfile, 'fasta') 639 | prot_seqs = assembler.aln 640 | seqs = {} 641 | for seqid, protstr in prot_seqs.items(): 642 | seqs[seqid] = '' 643 | i = 0 644 | j = 0 645 | while j < len(protstr) and i + 3 <= len(codon_seqs[seqid]): 646 | protbase = protstr[j] 647 | codon = codon_seqs[seqid][i:i+3] 648 | #debug 649 | #print(seqid, 'codon', i, codon, 'protein', j, protbase) 650 | #input() 651 | if codon == '---': 652 | if protbase == '-': 653 | seqs[seqid] += codon 654 | j += 1 655 | i += 3 656 | elif translate(codon, table=assembler.gencode) == protbase: 657 | seqs[seqid] += codon 658 | j += 1 659 | i += 3 660 | elif protbase == '-': 661 | seqs[seqid] += '---' 662 | j += 1 663 | else: 664 | i += 3 665 | if j < len(protstr): 666 | print("\nError: fail to match the codon columns to the protein alignment HMM: {}-{}-{}!".format(assembler.name, seqid, j)) 667 | return None 668 | #old version 669 | #i = 0 670 | #for j in range(len(assembler.outgroup_seq)): 671 | # ifmatch = False 672 | # while not ifmatch: 673 | # if i + 3 > len(codon_seqs[assembler.outgroup]): 674 | # break 675 | # ifmatch = True 676 | # sites = {} 677 | # for seqid, protstr in prot_seqs.items(): 678 | # protbase = protstr[j] 679 | # codon = codon_seqs[seqid][i:i+3] 680 | # #debug 681 | # print(seqid, 'codon', i, codon, 'protein', j, protbase) 682 | # input() 683 | # if codon == '---': 684 | # if protbase != '-': 685 | # ifmatch = False 686 | # break 687 | # else: 688 | # sites[seqid] = codon 689 | # elif translate(codon, table=assembler.gencode) != protbase: 690 | # ifmatch = False 691 | # break 692 | # else: 693 | # sites[seqid] = codon 694 | # i += 3 695 | # if not ifmatch: 696 | # print("\nError: fail to match the codon columns to the protein alignment HMM: {}-{}!".format(assembler.name, j+1)) 697 | # return None 698 | # for seqid in seqs.keys(): 699 | # seqs[seqid] += sites[seqid] 700 | return seqs 701 | 702 | # a single task for cross contamination and final sequence output 703 | def cross_and_output(group_name, assemblers, sps, total_reads, moltype='dna', gencode=1, no_assemble=False, no_cross_species=False, min_overlap=30, min_pident=98, min_exp=0.2, min_fold=2, unknow='unknow', final_seq='consensus', no_ref=False, sep='.', keep_seqid=False): 704 | if not no_cross_species and len(sps) > 1 and not no_assemble: 705 | assemblers = cross_decontam(assemblers, total_reads, min_overlap=min_overlap, min_pident=min_pident, min_expression=min_exp, min_fold=min_fold) 706 | if unknow == 'unknow': 707 | nuclfill = 'N' 708 | protfill = 'X' 709 | else: 710 | nuclfill = unknow 711 | protfill = unknow 712 | seqs = {} 713 | protseqs = {} 714 | if not no_ref: 715 | # prepare the reference alignments 716 | if moltype.startswith('dna'): 717 | seqs.update(assemblers[sps[0]].aln) 718 | else: 719 | protseqs.update(assemblers[sps[0]].aln) 720 | if moltype == 'codon': 721 | seqs = generate_codon_ref(assemblers[sps[0]]) 722 | if seqs is None: 723 | return 1, group_name 724 | 725 | # print the stat information 726 | statout = open(os.path.join('stat_info', group_name + '.stat_info.tsv'), 'w') 727 | statout.write("species\t" + 
"\t".join(assemblers[sps[0]].stat_num.keys()) + "\tfinal seqs\n") 728 | for sp in sps: 729 | seq_list, protseq_list, seqid_list = assemblers[sp].output_sequence(mode=final_seq, nuclfill=nuclfill, protfill=protfill) 730 | statout.write("{}\t{}\t{}\n".format(sp, "\t".join(map(str, assemblers[sp].stat_num.values())), len(seq_list))) 731 | for i in range(len(seq_list)): 732 | if keep_seqid: 733 | if no_assemble and not final_seq.startswith('consensus'): 734 | seqid = seqid_list[i] 735 | else: 736 | seqid = sep.join([sp, group_name, str(i+1)]) + ' ' + seqid_list[i] 737 | else: 738 | seqid = sep.join([sp, group_name, str(i+1)]) 739 | seqs[seqid] = seq_list[i] 740 | if not moltype.startswith('dna'): 741 | protseqs[seqid] = protseq_list[i] 742 | statout.close() 743 | 744 | # cross decontamination for consensus without assembly 745 | if not no_cross_species and len(sps) > 1 and no_assemble and final_seq.startswith('consensus'): 746 | crossout = open(os.path.join('stat_info', group_name + '.cross_info.tsv'), 'w') 747 | crossout.write("seqid1\tseqid2\toverlap_length\tpident\tpseudoRPKM1\tpseudoRPKM2\texpression_fold\tseq1\tseq2\n") 748 | for sp1 in sps: 749 | for sp2 in sps: 750 | if sp1 == sp2: 751 | continue 752 | seqid1 = None 753 | seqid2 = None 754 | for seqid in seqs.keys(): 755 | if seqid.startswith(sp1 + sep): 756 | seqid1 = seqid 757 | elif seqid.startswith(sp2 + sep): 758 | seqid2 = seqid 759 | if seqid1 is None or seqid2 is None: 760 | continue 761 | overlap_len = 0 762 | nident = 0 763 | length1 = 0 764 | length2 = 0 765 | for i in range(len(seqs[seqid1])): 766 | if seqs[seqid1][i] != nuclfill: 767 | length1 += 1 768 | if seqs[seqid2][i] != nuclfill: 769 | length2 += 1 770 | if seqs[seqid1][i] == nuclfill or seqs[seqid2][i] == nuclfill: 771 | continue 772 | overlap_len += 1 773 | if seqs[seqid1][i] == seqs[seqid2][i]: 774 | nident += 1 775 | if overlap_len < min_overlap: 776 | continue 777 | pident = 0 778 | if overlap_len > 0: 779 | pident = nident / overlap_len * 100 780 | if pident < min_pident: 781 | continue 782 | pseudoRPKM1 = 1000 * 1000000 * assemblers[sp1].stat_num['raw hits'] / total_reads[sp1] / length1 783 | pseudoRPKM2 = 1000 * 1000000 * assemblers[sp2].stat_num['raw hits'] / total_reads[sp2] / length2 784 | exp_fold = pseudoRPKM1 / pseudoRPKM2 785 | crossout.write("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n".format(seqid1, seqid2, overlap_len, pident, pseudoRPKM1, pseudoRPKM2, exp_fold, seqs[seqid1], seqs[seqid2])) 786 | if exp_fold > min_fold and pseudoRPKM1 > min_exp: 787 | seqs.pop(seqid2) 788 | if not assemblers[sp2].moltype.startswith('dna'): 789 | protseqs.pop(seqid2) 790 | elif exp_fold < 1 / min_fold and pseudoRPKM2 > min_exp: 791 | seqs.pop(seqid1) 792 | if not assemblers[sp1].moltype.startswith('dna'): 793 | protseqs.pop(seqid1) 794 | crossout.close() 795 | 796 | # output the final sequences 797 | seqout = open(os.path.join('nt_out', group_name + '.fa'), 'w') 798 | for seqid, seqstr in seqs.items(): 799 | seqout.write(">{}\n{}\n".format(seqid, seqstr)) 800 | seqout.close() 801 | if not moltype.startswith('dna'): 802 | seqout = open(os.path.join('aa_out', group_name + '.fa'), 'w') 803 | for seqid, seqstr in protseqs.items(): 804 | seqout.write(">{}\n{}\n".format(seqid, seqstr)) 805 | seqout.close() 806 | elif moltype == 'dna_codon': 807 | lib.trans_seq(os.path.join('nt_out', group_name + '.fa'), os.path.join('aa_out', group_name + '.fa'), gencode, dna_codon_unknow=protfill) 808 | return 0, group_name 809 | 810 | # cross contamination and final sequence output in multiple 
processes 811 | def cross_and_output_mp(groups, sps, assemblers, total_reads, np = 8, moltype='dna', gencode=1, no_assemble=False, no_cross_species=False, min_overlap=30, min_pident=98, min_exp=0.2, min_fold=2, unknow='unknow', final_seq='consensus', no_ref=False, sep='.', keep_seqid=False): 812 | print("\nRemoving cross contamination and printing the final sequences...") 813 | if not os.path.isdir('stat_info'): 814 | os.mkdir('stat_info') 815 | if not os.path.isdir('nt_out'): 816 | os.mkdir('nt_out') 817 | if moltype != 'dna' and not os.path.isdir('aa_out'): 818 | os.mkdir('aa_out') 819 | if not no_cross_species and len(sps) > 1 and no_assemble: 820 | if final_seq.startswith('consensus'): 821 | print("\nWarning: not to assemble and {}, cross decontamination will be conducted after consensus!".format(final_seq)) 822 | else: 823 | print("\nWarning: not to assemble and {}, cross decontamination will be disabled due to no read counts!".format(final_seq)) 824 | args_list = [] 825 | kwds = {'sps': sps, 'total_reads': total_reads, 'moltype': moltype, 'gencode': gencode, 'no_assemble': no_assemble, 'no_cross_species': no_cross_species, 'min_overlap': min_overlap, 'min_pident': min_pident, 'min_exp': min_exp, 'min_fold': min_fold, 'unknow': unknow, 'final_seq': final_seq, 'no_ref': no_ref, 'sep': sep, 'keep_seqid': keep_seqid} 826 | for group_name in groups: 827 | args_list.append((group_name, assemblers[group_name])) 828 | iferrors = lib.run_mp(cross_and_output, args_list, np, kwds=kwds) 829 | # if errors occur, print the message 830 | errors = [] 831 | for iferror in iferrors: 832 | if iferror[0] == 1: 833 | errors.append(iferror[1]) 834 | if errors: 835 | print("\nError: fail to output due to invalid reference codon alignments: {}".format(', '.join(errors))) 836 | sys.exit(1) 837 | 838 | # concatenate the short alignments if split the reference alignements 839 | def concatenate(dirname, alns, fill='-'): 840 | seqs = {} 841 | total_len = 0 842 | for aln in alns.keys(): 843 | aln_seqs = lib.read_fastx(os.path.join(dirname, aln + '.fa'), 'fasta') 844 | for seqid, seqstr in aln_seqs.items(): 845 | if seqid.endswith('.' + aln + '.1'): 846 | seqid = seqid.replace('.' + aln + '.1', '') 847 | if seqs.get(seqid) is None: 848 | seqs[seqid] = fill * total_len + seqstr 849 | else: 850 | seqs[seqid] += seqstr 851 | aln_len = len(seqstr) 852 | for seqid in seqs.keys(): 853 | if aln_seqs.get(seqid) is None and aln_seqs.get(seqid + '.' + aln + '.1') is None: 854 | seqs[seqid] += fill * aln_len 855 | total_len += aln_len 856 | outfile = open(os.path.join(dirname, 'aln.concatenated.fa'), 'w') 857 | for seqid, seqstr in seqs.items(): 858 | outfile.write(">{}\n{}\n".format(seqid, seqstr)) 859 | outfile.close() 860 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # PhyloAln 2 | ## PhyloAln: a convenient reference-based tool to align sequences and high-throughput reads for phylogeny and evolution in the omic era 3 | 4 | ![logo](https://github.com/huangyh45/PhyloAln/blob/main/logo.png) 5 | 6 | PhyloAln is a reference-based multiple sequence alignment tool for phylogeny and evolution. 
PhyloAln can directly map not only the raw reads but also assembled or translated sequences into the reference alignments, which is suitable for different omic data and skips the complex preparation in the traditional method including data assembly, gene prediction, orthology assignment and multiple sequence alignment, with a relatively high accuracy in the aligned sites and the downstream phylogeny. It is also able to detect and remove foreign and cross contamination in the generated alignments, which is not considered in other reference-based methods, and thus improve the quality of the alignments for downstream analyses. 7 | 8 | ### Catalogue 9 | - [Installation](#installation) 10 | - [1) Installation from source](#1-installation-from-source) 11 | - [2) Installation using Conda](#2-installation-using-conda) 12 | - [Usage](#usage) 13 | - [Quick start](#quick-start) 14 | - [A practice using PhyloAln for phylogenomics](#a-practice-using-phyloaln-for-phylogenomics) 15 | - [A practice using PhyloAln for gene family analysis](#a-practice-using-phyloaln-for-gene-family-analysis) 16 | - [Input](#input) 17 | - [Output](#output) 18 | - [Example commands for different data and common mode for easy use](#example-commands-for-different-data-and-common-mode-for-easy-use) 19 | - [Detailed parameters](#detailed-parameters) 20 | - [Limitations](#limitations) 21 | - [Auxiliary scripts for PhyloAln and phylogenetic analyses](#auxiliary-scripts-for-phyloaln-and-phylogenetic-analyses) 22 | - [Script to translate sequences: transseq.pl](#transseqpl) 23 | - [Script to back-translate sequences: revertransseq.pl](#revertransseqpl) 24 | - [Script to align the sequences: alignseq.pl](#alignseqpl) 25 | - [Script to concatenate the alignments: connect.pl](#connectpl) 26 | - [Script to combine each result alignment of different PhyloAln runs: merge_seqs.py](#merge_seqspy) 27 | - [Script to select the sequences in the files in bulk: select_seqs.py](#select_seqspy) 28 | - [Script to trim the alignments based on unknown sites in bulk: trim_matrix.py](#trim_matrixpy) 29 | - [Script to root the phylogenetic tree: root_tree.py](#root_treepy) 30 | - [Script to prune the phylogenetic tree: prune_tree.py](#prune_treepy) 31 | - [Script to assist in checking the unaligned sequences in the reference alignments in bulk: check_aln.py](#check_alnpy) 32 | - [Script to test performance of PhyloAln: test_effect.py](#test_effectpy) 33 | - [Citation](#citation) 34 | - [Acknowledgments](#acknowledgments) 35 | - [Questions & Answers](#questions--answers) 36 | - [Fail to locate Bio/SeqIO.pm in @INC when installation](#fail-to-locate-bioseqiopm-in-inc-when-installation) 37 | - [How can I obtain the reference alignments and the final tree?](#how-can-i-obtain-the-reference-alignments-and-the-final-tree) 38 | - [Does selection of the outgroup influence detection of foreign contamination? 
How can I choose an appropriate outgroup?](#does-selection-of-the-outgroup-influence-detection-of-foreign-contamination-how-can-i-choose-an-appropriate-outgroup) 39 | - [The required memory is too large to run PhyloAln.](#the-required-memory-is-too-large-to-run-PhyloAln) 40 | - [The positions of sites in the reference alignments are changed in the output alignments.](#the-positions-of-sites-in-the-reference-alignments-are-changed-in-the-output-alignments) 41 | - [How can I assemble the paired-end reads?](#how-can-i-assemble-the-paired-end-reads) 42 | - [Can PhyloAln generate the alignments of multiple-copy genes for gene family analyses?](#can-phyloaln-generate-the-alignments-of-multiple-copy-genes-for-gene-family-analyses) 43 | 44 | ### Installation 45 | 46 | #### 1) Installation from source 47 | ##### Requirements 48 | - python >=3.7.4 (https://www.python.org/downloads/) 49 | - biopython >=1.77 (https://biopython.org/wiki/Download) 50 | - hmmer >=3.1 (http://hmmer.org/download.html) 51 | - mafft >=7.467 (optional for the auxiliary scripts, https://mafft.cbrc.jp/alignment/software/source.html) 52 | - ete3 >=3.1.2 (optional for the auxiliary scripts, http://etetoolkit.org/download/) 53 | - perl >=5.26.2 (optional for the auxiliary scripts, https://www.perl.org/get.html) 54 | - perl-bioperl >=1.7.2 (optional for the auxiliary scripts, https://github.com/bioperl/bioperl-live/blob/master/README.md) 55 | - perl-parallel-forkmanager >=2.02 (optional for the auxiliary scripts, https://github.com/dluxhu/perl-parallel-forkmanager) 56 | 57 | After installing these requirements, you can download the latest release of PhyloAln directly from this page or by using the following commands on your computer: 58 | ``` 59 | git clone https://github.com/huangyh45/PhyloAln.git 60 | cd PhyloAln 61 | git checkout v1.1.0 # switch to the latest stable release version 62 | ``` 63 | If your computer requires execute permissions to run the programs, such as on a Linux or macOS system, you should first run this command: 64 | ``` 65 | chmod -R +x /your/PhyloAln/path/ # the absolute path of 'PhyloAln' directory in the above commands 66 | ``` 67 | Then, you can test whether PhyloAln is available using these commands: 68 | ``` 69 | cd /your/PhyloAln/path/ 70 | export PATH=$PATH:/your/PhyloAln/path/:/your/PhyloAln/path/scripts 71 | bash tests/run_test.sh 72 | ``` 73 | When you see "Successfully installed" at the end of the screen output, PhyloAln and all its auxiliary scripts have been successfully installed and are available. 74 | If the test fails, you should check whether the requirements have been successfully installed and are executable in the current environment. 
75 | After the test, you can manually delete all the newly generated files, or run this command to delete them: 76 | ``` 77 | rm -rf alignseq.log all.block all.fas list tests/run_test.config tests/PhyloAln_* tests/aln tests/ref/*.fas tests/ref/*.index 78 | ``` 79 | ##### Full usage experience 80 | If you have installed [IQ-TREE](http://www.iqtree.org/#download) and want to experience the usage of all the scripts with real examples through a simple phylogenomic flow after installation, you can run this command (it will take a few minutes): 81 | ``` 82 | bash tests/run_test.sh full 83 | ``` 84 | After running, you can manually delete all the newly generated files, or run this command to delete them: 85 | ``` 86 | rm -rf alignseq.log all.block all.fas list tests/run_test.config tests/PhyloAln_* tests/aln tests/ref/*.fas tests/ref/*.index 87 | ``` 88 | #### 2) Installation using Conda 89 | PhyloAln is available on [Bioconda](https://bioconda.github.io/recipes/phyloaln/README.html); run this command to install it: 90 | ``` 91 | conda install phyloaln 92 | ``` 93 | If the base environment of your Conda installation already contains large numbers of packages, Conda may have difficulty managing the packages when installing PhyloAln. In this case, you can install the requirements in a newly created Conda environment using this command: 94 | ``` 95 | conda install -m -n your_env phyloaln 96 | ``` 97 | and activate your environment before using PhyloAln: 98 | ``` 99 | conda activate your_env 100 | ``` 101 | If the installation takes too much time, you can try to install the requirements and all their dependencies with fixed (but not the latest) versions. Download the [Conda configure file of requirements with fixed versions](https://github.com/huangyh45/PhyloAln/releases/download/v0.1.0/requirement_fix.txt), and install these requirements using the command: 102 | ``` 103 | conda install (-m -n your_env) --file requirement_fix.txt 104 | ``` 105 | Then, you can install PhyloAln using the command: 106 | ``` 107 | conda install (-n your_env) phyloaln 108 | ``` 109 | 110 | ### Usage 111 | 112 | #### Quick start 113 | If you have only one reference alignment FASTA file and sequence data from only one source/species, you can use -a to input the reference alignment file, -s to input the species name and -i to input the FASTA/FASTQ sequence/read file(s), like this command: 114 | ``` 115 | PhyloAln -a reference_alignment_file -s species -i sequence_file1 (sequence_file2) -o output_directory 116 | ``` 117 | 118 | You can also use -c to input a configure file representing information of sequence data from multiple sources/species. The configure file should be tab-separated and look like this: 119 | ``` 120 | species1 /absolute/path/sequence_file1 121 | species2 /absolute/path/sequence_file1,/absolute/path/sequence_file2 122 | ``` 123 | If you have a directory containing multiple reference alignment FASTA files with the same suffix, you can use -d to input the directory and -x to input the suffix. The command using multiple reference alignments and multiple sources/species is like this: 124 | ``` 125 | PhyloAln -d reference_alignments_directory -c config.tsv -x alignment_file_name_suffix -o output_directory 126 | ``` 127 | **Note: we found a bug when using unzipped FASTA/FASTQ sequence/read file(s) and guessed file format in the versions ≤ 1.0.0, which is fixed in the versions ≥ 1.1.0. 
Please always input the file format (-f) instead of guessing when you run the versions ≤ 1.0.0 of PhyloAln!** 128 | 129 | #### A practice using PhyloAln for phylogenomics 130 | The following practice is for phylogenomics using codon alignments of nuclear single-copy orthologous groups and 20 CPUs. 131 | ##### 1. obtain the reference orthologous sequences 132 | You can download the reference sequences from an ortholog database (e.g., [OrthoDB](https://www.orthodb.org/), [OMA](https://omabrowser.org/oma/home/)), or perform *de novo* orthology assignment (e.g., by [OrthoFinder](https://github.com/davidemms/OrthoFinder)). The reference species are recommended to contain one or several outgroups for PhyloAln. 133 | ##### 2. codon alignment for each ortholog group 134 | In this step, you can use our auxiliary script [alignseq.pl](#alignseqpl). 135 | Run the shell commands: 136 | ``` 137 | mkdir aln 138 | for file in orthogroup/*.fa; do 139 | name=`basename $file` 140 | scripts/alignseq.pl -i $file -o aln/$name -a codon -n 20 141 | done 142 | ``` 143 | ##### 3. trim the alignments (optional) 144 | In this step, you can use the tool [trimAl](https://github.com/inab/trimal). 145 | Run the shell commands to trim the codon alignments generated in the above step: 146 | ``` 147 | mkdir ref_aln 148 | for file in aln/*.aa.fas; do 149 | name=`basename $file .aa.fas` 150 | trimal -in $file -out ref_aln/$name.fa -automated1 -keepheader -backtrans orthogroup/$name.fa 151 | done 152 | ``` 153 | However, sometimes you may prefer to trim the alignments directly without considering the codons, like these commands: 154 | ``` 155 | mkdir ref_aln 156 | for file in aln/*.fa; do 157 | name=`basename $file` 158 | trimal -in $file -out ref_aln/$name -automated1 -keepheader 159 | done 160 | ``` 161 | The reference alignments have now been generated. Alternatively, you can directly use existing alignments as the reference instead of the above three steps, for example, from published supplementary data. 162 | ##### 4. write the configure file of the species and data 163 | The configure file is tab-separated (TSV) and looks like this: 164 | ``` 165 | species1 /absolute/path/sequence_file1 166 | species2 /absolute/path/sequence_file1,/absolute/path/sequence_file2 167 | ``` 168 | ##### 5. run PhyloAln to map the sequences/reads into the reference alignments 169 | ``` 170 | PhyloAln -d ref_aln -c config.tsv -p 20 -m codon -u outgroup 171 | ``` 172 | The output alignments can be trimmed to remove the sites with too many unknown bases using our auxiliary script [trim_matrix.py](#trim_matrixpy), and further edited for gappy or conservative sites using the tool [trimAl](https://github.com/inab/trimal). 173 | ##### 6. concatenate the alignments into a supermatrix 174 | This step can be done with our auxiliary script [connect.pl](#connectpl). 175 | For a codon dataset, you can run it like this: 176 | ``` 177 | scripts/connect.pl -i PhyloAln_out/nt_out -f N -b all.block -n -c 123 178 | ``` 179 | For a protein dataset, the command is like this: 180 | ``` 181 | scripts/connect.pl -i PhyloAln_out/aa_out -f X -b all.block -n 182 | ``` 183 | ##### 7. reconstruct the phylogenetic tree 184 | You can build the tree with [IQ-TREE](http://www.iqtree.org/#download) like this: 185 | ``` 186 | iqtree -s all.fas -p all.block -m MFP+MERGE -B 1000 -T AUTO --threads-max 20 --prefix species_tree 187 | ``` 188 | ##### 8. 
root the tree 189 | You can root the tree with the outgroup using our auxiliary script [root_tree.py](#root_treepy): 190 | ``` 191 | scripts/root_tree.py species_tree.treefile species_tree.rooted.tre outgroup 192 | ``` 193 | Finally, you obtain a species tree in NEWICK format, which you can then visualize or use in other downstream analyses. 194 | 195 | #### A practice using PhyloAln for gene family analysis 196 | The following practice is for gene family analysis or marker sequence polishing using a codon alignment of insect COX1 genes as the reference, undirected COX1 marker sequences as targets, and 20 CPUs. The idea for this usage is provided by **Yi-Fei Sun**. 197 | The commands here will use the easy mode (different modes are suitable for different data in different gene family analyses; see [Example commands for different data and common mode for easy use](#example-commands-for-different-data-and-common-mode-for-easy-use)) provided in versions ≥ 1.1.0. 198 | ##### 1. obtain the reference alignment 199 | You can download or extract the COX1 reference sequences from the mitochondrial genomes in the NCBI RefSeq database or other places, and then conduct codon alignment. 200 | Our auxiliary script [alignseq.pl](#alignseqpl) can be used to conduct the alignment. 201 | Run the shell commands: 202 | ``` 203 | scripts/alignseq.pl -i COX1.fa -o COX1.aln.fa -a codon -g 5 -n 20 204 | ``` 205 | The start and end regions are recommended to be trimmed. 206 | ##### 2. run PhyloAln to map the target sequences into the reference alignment 207 | ``` 208 | PhyloAln -a COX1.aln.fa -s anything -i targets.fa -e gene_codon2dna -g 5 -p 20 209 | ``` 210 | One or several outgroups in the reference alignment can be set with '-u'. 211 | ##### 3. trim the result alignment 212 | The output alignment is recommended to be trimmed to remove the sites with too many gaps and the sequences with short or virtually no regions mapped to the reference, using our auxiliary script [trim_matrix.py](#trim_matrixpy) like this: 213 | ``` 214 | scripts/trim_matrix.py PhyloAln_out/nt_out trim_out - 0.5 0.6 215 | ``` 216 | ##### 4. check the trimmed alignment 217 | You can use our auxiliary script [check_aln.py](#check_alnpy) to assist in checking whether the sequences are well aligned in the result alignments, like this: 218 | ``` 219 | scripts/check_aln.py trim_out 220 | ``` 221 | Based on the warnings output by check_aln.py, you should manually check the unaligned sequences and edit the alignments. 222 | ##### 5. reconstruct the phylogenetic tree 223 | You can build the gene tree with [IQ-TREE](http://www.iqtree.org/#download) like this: 224 | ``` 225 | iqtree -s trim_out/aln.fa -B 1000 -T AUTO --threads-max 20 --prefix gene_tree 226 | ``` 227 | ##### 6. root the tree 228 | You can root the tree with the midpoint outgroup (default) or your provided outgroup using our auxiliary script [root_tree.py](#root_treepy): 229 | ``` 230 | scripts/root_tree.py gene_tree.treefile gene_tree.rooted.tre (your_provided_outgroup) 231 | ``` 232 | Finally, you obtain a gene tree in NEWICK format, which you can then visualize or use in other downstream analyses. 233 | 234 | #### Input 235 | PhyloAln needs two types of files: 236 | - the alignment file(s) with FASTA format. Trimmed alignments with conservative sites are recommended. Multiple alignment files with the same suffix should be placed into a directory for input. 237 | - the sequence/read file(s) with FASTA or FASTQ format. Compressed files ending with ".gz" are allowed. 
Sequence/read files from multiple sources/species should be input through a configure file as described in the quick start. 238 | 239 | #### Output 240 | PhyloAln generates new alignment file(s) with FASTA format. Each output alignment in the `nt_out` directory corresponds to a reference alignment file, with the aligned target sequences from the provided sequence/read file(s). If using the prot, codon or dna_codon mode, the translated protein alignments will also be generated in the `aa_out` directory. These alignments are mainly for phylogenetic analyses and evolutionary analyses using conservative sites. 241 | 242 | #### Example commands for different data and common mode for easy use 243 | Notice: the following commands are only recommendations based on our practice, and you can manually set the options as you need, without setting '-e' or '--mode', if you want to change the specific options listed below. 244 | 245 | Map the reads into the DNA alignments(-e dna2reads): 246 | ``` 247 | PhyloAln [options] -m dna 248 | ``` 249 | Map the reads into large numbers of DNA alignments(-e fast_dna2reads): 250 | ``` 251 | PhyloAln [options] -m dna -b 252 | ``` 253 | Map the transcript assembly/sequences into the DNA alignments(-e dna2trans): 254 | ``` 255 | PhyloAln [options] -m dna -b -r 256 | ``` 257 | Map the genomic assembly/sequences with intron regions into the DNA alignments(-e dna2genome): 258 | ``` 259 | PhyloAln [options] -m dna -b -r -l 200 -f large_fasta 260 | ``` 261 | Map the reads into the protein alignments(-e prot2reads): 262 | ``` 263 | PhyloAln [options] -m prot 264 | ``` 265 | Map the reads into large numbers of protein alignments(-e fast_prot2reads): 266 | ``` 267 | PhyloAln [options] -m prot -b 268 | ``` 269 | Map the transcript assembly/sequences into the protein alignments(-e prot2trans): 270 | ``` 271 | PhyloAln [options] -m prot -b -r 272 | ``` 273 | Map the genomic assembly/sequences with intron regions into the protein alignments(-e prot2genome): 274 | ``` 275 | PhyloAln [options] -m prot -b -r -l 200 -f large_fasta 276 | ``` 277 | Map the reads into the codon alignments(-e codon2reads): 278 | ``` 279 | PhyloAln [options] -m codon 280 | ``` 281 | Map the reads into large numbers of codon alignments(-e fast_codon2reads): 282 | ``` 283 | PhyloAln [options] -m codon -b 284 | ``` 285 | Map the transcript assembly/sequences into the codon alignments(-e codon2trans): 286 | ``` 287 | PhyloAln [options] -m codon -b -r 288 | ``` 289 | Map the genomic assembly/sequences with intron regions into the codon alignments(-e codon2genome): 290 | ``` 291 | PhyloAln [options] -m codon -b -r -l 200 -f large_fasta 292 | ``` 293 | Map the directed RNA/cDNA sequences into the RNA/cDNA alignments(-e rna2rna): 294 | ``` 295 | PhyloAln [options] -m dna -n -b -r 296 | ``` 297 | Map the protein sequences into the protein alignments(-e prot2prot): 298 | ``` 299 | PhyloAln [options] -m dna -n -b -r -w X 300 | ``` 301 | Map the CDS or the directed transcript/cDNA sequences into the codon alignments(-e codon2codon): 302 | ``` 303 | PhyloAln [options] -m codon -n -b -r 304 | ``` 305 | Map the DNA sequences into the DNA alignments for gene family analysis or to polish the marker sequences(-e gene_dna2dna): 306 | ``` 307 | PhyloAln [options] -m dna -b -r -z all -k -w - 308 | ``` 309 | Map the directed RNA/cDNA/protein sequences into the RNA/cDNA/protein alignments for gene family analysis or to polish the marker sequences(-e gene_rna2rna or -e gene_prot2prot): 310 | ``` 311 | PhyloAln [options] -m dna -n -b -r -z 
#### Detailed parameters
```
usage: PhyloAln [options] -a reference_alignment_file -s species -i fasta_file -f fasta -o output_directory
       PhyloAln [options] -d reference_alignments_directory -c config.tsv -f fastq -o output_directory

A program to directly generate multiple sequence alignments from FASTA/FASTQ files based on reference alignments for
phylogenetic analyses.
Citation: Huang Y-H, Sun Y-F, Li H, Li H-S, Pang H. 2024. MBE. 41(7):msae150. https://doi.org/10.1093/molbev/msae150

options:
  -h, --help            show this help message and exit
  -a ALN, --aln ALN     the single reference FASTA alignment file
  -d ALN_DIR, --aln_dir ALN_DIR
                        the directory containing all the reference FASTA alignment files
  -x ALN_SUFFIX, --aln_suffix ALN_SUFFIX
                        the suffix of the reference FASTA alignment files when using "-d"(default:.fa)
  -s SPECIES, --species SPECIES
                        the studied species ID for the provided FASTA/FASTQ files(-i)
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        the input FASTA/FASTQ file(s) of the single species(-s), compressed files ending with ".gz" are
                        allowed
  -c CONFIG, --config CONFIG
                        the TSV file with the format of 'species sequence_file(s)(absolute path, files separated by
                        commas)' per line for multiple species
  -f {guess,fastq,fasta,large_fasta}, --file_format {guess,fastq,fasta,large_fasta}
                        the file format of the provided FASTA/FASTQ files, 'large_fasta' is recommended for speeding up
                        reading the FASTA files with long sequences(e.g. genome sequences) and cannot be
                        guessed(default:guess)
  -o OUTPUT, --output OUTPUT
                        the output directory containing the results(default:PhyloAln_out)
  -p CPU, --cpu CPU     maximum threads to be totally used in parallel tasks(default:8)
  --parallel PARALLEL   number of parallel tasks for the alignments, the number of CPUs used for a single alignment will
                        be automatically calculated by '--cpu / --parallel'(default:the smaller value between the number
                        of alignments and the maximum threads to be used)
  -e {dna2reads,prot2reads,codon2reads,fast_dna2reads,fast_prot2reads,fast_codon2reads,dna2trans,prot2trans,codon2trans,
      dna2genome,prot2genome,codon2genome,rna2rna,prot2prot,codon2codon,gene_dna2dna,gene_rna2rna,gene_codon2codon,
      gene_codon2dna,gene_prot2prot}, --mode {dna2reads,prot2reads,codon2reads,fast_dna2reads,fast_prot2reads,
      fast_codon2reads,dna2trans,prot2trans,codon2trans,dna2genome,prot2genome,codon2genome,rna2rna,prot2prot,
      codon2codon,gene_dna2dna,gene_rna2rna,gene_codon2codon,gene_codon2dna,gene_prot2prot}
                        the common mode to automatically set the parameters for easy use(**NOTICE: if you manually set
                        those parameters, the parameters you set will be ignored and overridden! See
                        https://github.com/huangyh45/PhyloAln/blob/main/README.md#example-commands-for-different-data-
                        and-common-mode-for-easy-use for detailed parameters)
  -m {dna,prot,codon,dna_codon}, --mol_type {dna,prot,codon,dna_codon}
                        the molecular type of the reference alignments(default:dna, 'dna' suitable for nucleotide-to-
                        nucleotide or protein-to-protein alignment, 'prot' suitable for protein-to-nucleotide alignment,
                        'codon' and 'dna_codon' suitable for codon-to-nucleotide alignment based on protein and
                        nucleotide alignments respectively)
  -g GENCODE, --gencode GENCODE
                        the genetic code used in translation(default:1 = the standard code, see
                        https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
  --ref_split_len REF_SPLIT_LEN
                        If provided, split the reference alignments longer than this length into short alignments with
                        this length, ~1000 may be recommended for concatenated alignments, and codon alignments should
                        be divided by 3
  -l SPLIT_LEN, --split_len SPLIT_LEN
                        If provided, split the sequences longer than this length into short sequences with this length,
                        200 may be recommended for long genomic reads or sequences
  --split_slide SPLIT_SLIDE
                        the slide to split the sequences using the sliding window method(default:half of '--split_len')
  -n, --no_reverse      not to prepare and search the reverse strand of the sequences, recommended for searching protein
                        or CDS sequences
  --low_mem             use a low-memory but slower mode to prepare the reads, 'large_fasta' format is not supported and
                        gz compressed files may still consume some memory
  --hmmbuild_parameters HMMBUILD_PARAMETERS [HMMBUILD_PARAMETERS ...]
                        the parameters when using HMMER hmmbuild for reference preparation, with the format of ' --xxx'
                        for each parameter, in which the leading space is required(default:[])
  --hmmsearch_parameters HMMSEARCH_PARAMETERS [HMMSEARCH_PARAMETERS ...]
                        the parameters when using HMMER hmmsearch for mapping the sequences, with the format of ' --xxx'
                        for each parameter, in which the leading space is required(default:[])
  -b, --no_assemble     not to assemble the raw sequences based on overlap regions
  --overlap_len OVERLAP_LEN
                        minimum overlap length when assembling the raw sequences(default:30)
  --overlap_pident OVERLAP_PIDENT
                        minimum overlap percent identity when assembling the raw sequences(default:98.00)
  -t, --no_out_filter   not to filter the foreign or no-signal sequences based on conservative score
  -u OUTGROUP [OUTGROUP ...], --outgroup OUTGROUP [OUTGROUP ...]
                        the outgroup species for foreign or no-signal sequence detection(default:all the sequences in
                        the alignments, with all sequences as ingroups)
  --ingroup INGROUP [INGROUP ...]
                        the ingroup species for score calculation in foreign or no-signal sequence detection(default:all
                        the sequences when all sequences are set as outgroups; otherwise all sequences except the
                        outgroups)
  -q SEP, --sep SEP     the separate symbol between species name and gene identifier in the sequence headers of the
                        alignments(default:.)
  --outgroup_weight OUTGROUP_WEIGHT
                        the weight coefficient to adjust the strictness of the foreign or no-signal sequence filter, a
                        small number or decimal means a relaxed criterion(default:0.90, 1 = not adjusted)
  -r, --no_cross_species
                        not to remove the cross contamination for multiple species
  --cross_overlap_len CROSS_OVERLAP_LEN
                        minimum overlap length in cross contamination detection(default:30)
  --cross_overlap_pident CROSS_OVERLAP_PIDENT
                        minimum overlap percent identity in cross contamination detection(default:98.00)
  --min_exp MIN_EXP     minimum expression value in cross contamination detection(default:0.20)
  --min_exp_fold MIN_EXP_FOLD
                        minimum expression fold in cross contamination detection(default:5.00)
  -w UNKNOW_SYMBOL, --unknow_symbol UNKNOW_SYMBOL
                        the symbol representing unknown bases for missing regions(default:unknow = 'N' in nucleotide
                        alignments and 'X' in protein alignments)
  -z {consensus,consensus_strict,all,expression,length}, --final_seq {consensus,consensus_strict,all,expression,length}
                        the mode to output the sequences(default:consensus, 'consensus' means selecting the most common
                        bases from all sequences, 'consensus_strict' means only selecting the common bases and leaving
                        the differing bases unknown, 'all' means retaining all sequences, 'expression' means the sequence
                        with the highest read counts after assembly, 'length' means the sequence with the longest length)
  -y, --no_ref          not to output the reference sequences
  -k, --keep_seqid      keep the original sequence IDs in the output alignments instead of renaming them based on the
                        species ID, not recommended when the output mode is 'consensus'/'consensus_strict' or the
                        assembly step is on
  -v, --version         show program's version number and exit

Written by Yu-Hao Huang (2023-2025) huangyh45@mail3.sysu.edu.cn
```

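As a hedged illustration of the foreign-sequence filter options above, the command below declares one outgroup and two ingroup species and slightly relaxes the filter; all species IDs and paths are placeholders, and the IDs must match the species names used in the reference alignment headers (before the separator set by `-q`).
```
PhyloAln -d reference_alignments_directory -c config.tsv -o PhyloAln_out -m codon \
  -u OUTGROUP_SP --ingroup INGROUP_SP1 INGROUP_SP2 --outgroup_weight 0.8
```
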
#### Limitations
- PhyloAln is only designed for phylogenetic analyses and evolutionary analyses based on reference-based conservative sites, and thus cannot perform *de novo* assembly of non-conservative sites or of sites not covered in the reference alignments. The unmapped sites will be ignored.
- We prioritized the flexibility of PhyloAln and thus do not provide the upstream steps of collecting the reference sequences and generating the reference alignments, nor the downstream phylogenetic analyses. However, you can use the auxiliary scripts to help with the preparation and to perform downstream analyses.
- In the current version, we did not heavily focus on optimizing the runtime and memory usage of PhyloAln. In particular, speed and memory usage may be influenced by the number of reference alignments and the number of target sequences/reads. A version with optimized parallel and storage operations, and optional accessories written in C or other fast languages, may be developed in the future. Faster sequence search tools are also candidates to be integrated into PhyloAln as an option to speed up the alignments.

### Auxiliary scripts for PhyloAln and phylogenetic analyses

#### transseq.pl
Requirements:
- perl >=5.26.2
- perl-bioperl >=1.7.2
- perl-parallel-forkmanager >=2.02

```
perl scripts/transseq.pl
Translate nucleotide sequences in a file to amino acid sequences.

Usage:
-i input nucleotide sequences file
-o output amino acid sequences file
-g genetic code(default=1, invertebrate mitochondrion=5)
-t symbol of termination(default='*')
-c whether to translate incomplete codons into 'X'(default=no)
-a whether to translate all six ORFs(default=no)
-n num threads(default=1)
-l log file(default='transseq.log')
-h this help message

Example:
transseq.pl -i ntfile -o aafile -g gencode -t termination -c 1 -a 1 -n numthreads -l logfile

Written by Yu-Hao Huang (2017-2024) huangyh45@mail3.sysu.edu.cn
```

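For instance, a hypothetical run translating a CDS file with the invertebrate mitochondrial code, translating incomplete codons into 'X' and using four threads (the file names are placeholders), could look like:
```
perl scripts/transseq.pl -i cds.fa -o cds.aa.fa -g 5 -c 1 -n 4
```
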
#### revertransseq.pl
Requirements:
- perl >=5.26.2
- perl-bioperl >=1.7.2
- perl-parallel-forkmanager >=2.02

```
perl scripts/revertransseq.pl
Use the aligned translated sequences in a file as a blueprint to align nucleotide sequences, i.e. reverse translation.

Usage:
-i input nucleotide sequences file or files(separated by ',')
-b aligned amino acid sequences file translated from the input file(s), used as the blueprint
-o output aligned nucleotide sequences file
-g genetic code(default=1, invertebrate mitochondrion=5)
-t symbol of termination in blueprint(default='*')
-n num threads(default=1)
-l log file(default='revertransseq.log')
-h this help message

Example:
revertransseq.pl -i ntfile1,ntfile2,ntfile3 -b aafile -o alignedfile -g gencode -t termination -n numthreads -l logfile

Written by Yu-Hao Huang (2017-2024) huangyh45@mail3.sysu.edu.cn
```

#### alignseq.pl
Requirements:
- perl >=5.26.2
- mafft >=7.467
- transseq.pl
- revertransseq.pl

```
perl scripts/alignseq.pl
Align sequences in a file by mafft.
Requirement: mafft

Usage:
-i input sequences file
-o output sequences file
-a type of alignment(direct/translate/codon/complement(experimental)/ncRNA(experimental), default='direct', 'translate' means alignment of the translation of the sequences)
-g genetic code(default=1, invertebrate mitochondrion=5)
-t symbol of termination(default='X', mafft will clean '*')
-c whether to translate incomplete codons into 'X'(default=no)
-m whether to delete the intermediate files, such as translated files and aligned aa files(default=no)
-f the folder where mafft/linsi is located; if mafft/linsi is in PATH you can ignore this parameter
-n num threads(default=1)
-l log file(default='alignseq.log')
-h this help message

Example:
alignseq.pl -i inputfile -o outputfile -a aligntype -g gencode -t termination -c 1 -m 1 -f mafftfolder -n numthreads -l logfile

Written by Yu-Hao Huang (2017-2024) huangyh45@mail3.sysu.edu.cn
```

#### connect.pl
Requirements:
- perl >=5.26.2
- perl-bioperl >=1.7.2

```
perl scripts/connect.pl
Concatenate multiple alignments into a matrix.

Usage:
-i directory containing input FASTA alignment files
-o output concatenated FASTA alignment file
-t type of input format(phyloaln/orthograph/blastsearch, default='phyloaln', also suitable for the format with the same species name in all alignments, but the name should not contain the separate symbol)
-f the symbol to fill the sites of absent species in the alignments(default='-')
-s the symbol to separate the sequence name, in which the first part is the species name, in the 'phyloaln' format(default='.')
-x the suffix of the input FASTA alignment files(default='.fa')
-b the block file with the positions of each alignment(default=not to output)
-n output the block file in NEXUS format, suitable for IQ-TREE(default=no)
-c the codon positions to be written in the block file(default=no codon position, '123' represents outputting all three codon positions, '12' represents outputting the first and second positions)
-l the list file with all the species you want to be included in the output alignments, one species per line(default=automatically generated, with all species found at least once in the alignments)
-h this help message

Example:
connect.pl -i inputdir -o outputfile -t inputtype -f fillsymbol -s separate -x suffix -b blockfile -n -c codonpos -l listfile

Written by Yu-Hao Huang (2018-2024) huangyh45@mail3.sysu.edu.cn
```

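For example, a hypothetical command concatenating the nucleotide alignments produced by PhyloAln (assumed here to be in the `nt_out` directory with the default '.fa' suffix) into a supermatrix, together with a NEXUS block file of the gene and codon positions for IQ-TREE, could be:
```
perl scripts/connect.pl -i nt_out -o concatenation.fa -x .fa -b partitions.nex -n -c 123
```
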
#### merge_seqs.py
Requirements:
- python >=3.7.4

The script can be used to merge the output alignments in different PhyloAln output directories based on the same reference alignments, for example, for data from different batches.
Usage:
```
scripts/merge_seqs.py output_dir PhyloAln_dir1 PhyloAln_dir2 (PhyloAln_dir3 ...)
```

#### select_seqs.py
Requirements:
- python >=3.7.4

The script can be used to select or exclude a list of species (the first field of the sequence name, split by separate_symbol) or sequences (the full sequence name) from the sequence FASTA files with the same suffix in a directory, and output the processed sequence files to a new directory.
Usage:
```
scripts/select_seqs.py input_dir selected_species_or_sequences(separated by comma) output_dir fasta_suffix(default='.fa') separate_symbol(default='.') if_list_for_exclusion(default=no)
```

#### trim_matrix.py
Requirements:
- python >=3.7.4

The script can be used to trim first the columns (sites) and/or then the rows (sequences) of the sequence matrices in FASTA files with the same suffix in a directory, based on the unknown sites, and output the processed sequence files to a new directory.
Usage:
```
scripts/trim_matrix.py input_dir output_dir unknown_symbol(default='X') known_number(>=1)_or_percent(<1)_for_columns(default=0.5) known_number(>=1)_or_percent(<1)_for_rows(default=0) fasta_suffix(default='.fa')
```

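For example, a hypothetical command keeping only the columns with known bases in at least half of the sequences, and then only the rows (sequences) with at least 100 known sites, in nucleotide alignments whose unknown symbol is 'N', could be:
```
scripts/trim_matrix.py nt_out nt_trimmed N 0.5 100 .fa
```
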
#### root_tree.py
Requirements:
- python >=3.7.4
- ete3 >=3.1.2

The script can be used to root a tree in NEWICK format with the ETE 3 package and output the rooted NEWICK tree file.
Usage:
```
scripts/root_tree.py input.nwk output.nwk outgroup/outgroups(default=the midpoint outgroup, separated by comma)
```

#### prune_tree.py
Requirements:
- python >=3.7.4
- ete3 >=3.1.2

The script can be used to prune a tree in NEWICK format with the ETE 3 package and output the pruned NEWICK tree file.
Usage:
```
scripts/prune_tree.py input.nwk output.nwk seq/seqs(separated by comma)_in_clade1_for_deletion (seq/seqs_in_clade2_for_deletion ...)
```

#### check_aln.py
Requirements:
- python >=3.7.4

The script can be used to assist in checking and finding the unaligned sequences in the reference alignments in FASTA files with the same suffix in a directory, and optionally to exclude the unaligned sequences and output the processed alignment files to a new directory (not recommended; manually checking the flagged alignments and curating them is better).
Usage:
```
scripts/check_aln.py input_dir output_dir(default='none') aver_freq_per_site(default=0.75) gap_symbol(default='-') start_end_no_gap_number(>=1)_or_percent(<1)(default=0.6) fasta_suffix(default='.fa')
```

#### test_effect.py
Requirements:
- python >=3.7.4

The script can be used to calculate the completeness and percent identity of the alignments in FASTA files with the same suffix in a directory compared with the reference alignments in another directory, mainly for testing the effect of reference-based alignment tools such as PhyloAln.
Usage:
```
scripts/test_effect.py reference_dir:ref_species_or_seq_name target_dir:target_species_or_seq_name output_tsv unknown_symbol(default='N') separate(default='.') fasta_suffix(default='.fa') selected_species_or_sequences(separated by comma)
```

### Citation
Huang Y-H, Sun Y-F, Li H, Li H-S, Pang H. 2024. PhyloAln: A Convenient Reference-Based Tool to Align Sequences and High-Throughput Reads for Phylogeny and Evolution in the Omic Era. Molecular Biology and Evolution 41(7):msae150. https://doi.org/10.1093/molbev/msae150

### Acknowledgments
We would like to thank these people for their help in improving PhyloAln:
- **Zong-Jin Jiang:** testing of the installation and suggestions on environment configuration
- **Yuan-Sen Liang:** testing of the installation
- **Xin-Hui Xia:** testing of the installation

### Questions & Answers
#### Fail to locate Bio/SeqIO.pm in @INC during installation
This is because the Perl modules, especially BioPerl, have not been successfully installed or are not in the Perl library path, which sometimes occurs when they are configured by Conda. You should set or add the Perl library path to solve the problem; for example, try this command if you use Conda to install the requirements:
```
export PERL5LIB=/your/Conda/path/lib/perl5/site_perl
```
If you install the requirements in a newly created Conda environment, you can try this command:
```
export PERL5LIB=/your/Conda/path/envs/your_env/lib/perl5/site_perl
```
You can also add the Perl library path to the Conda config to avoid setting it each time you run, using the command:
```
conda env config vars set PERL5LIB=/your/Conda/path/lib/perl5/site_perl
```
or
```
conda env config vars set -n your_env PERL5LIB=/your/Conda/path/envs/your_env/lib/perl5/site_perl
```
In addition, you can try mamba or other tools to install the requirements.
#### How can I obtain the reference alignments and the final tree?
We do not provide the upstream preparation of the reference alignments or the downstream phylogenetic analyses in PhyloAln. You can manually collect the reference sequences, align them to generate the reference alignments, and build the tree; these steps are flexible and up to you. The reference alignments are recommended to contain outgroup(s) for foreign decontamination in PhyloAln and for rooting the tree. A detailed practice of phylogenomics using nuclear single-copy protein-coding genes can be seen here ([A practice using PhyloAln for phylogenomics](#a-practice-using-phyloaln-for-phylogenomics)). For other types of data, such as non-protein-coding genes or genes with non-standard genetic codes, you can collect the reference sequences from [NCBI](https://www.ncbi.nlm.nih.gov/) or other sources, and additionally adjust the options of alignseq.pl and PhyloAln.
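For example, a hypothetical alignseq.pl command building a codon-based reference alignment from the unaligned coding sequences of one orthologous group (the file names are placeholders) could be:
```
perl scripts/alignseq.pl -i OG0000001.cds.fa -o OG0000001.aln.fa -a codon -g 1 -m 1 -n 4
```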
#### Does selection of the outgroup influence detection of foreign contamination? How can I choose an appropriate outgroup?
Actually, within a specific reference alignment, selection of the outgroup has minimal impact on the results in our tests (see [our article](https://doi.org/10.1093/molbev/msae150) for detail). Therefore, if you are not sure which species should be the outgroup, you can tentatively leave the outgroup undefined, and PhyloAln will by default use the first sequence in each reference alignment as the outgroup (versions ≤ 1.0.0), or all the sequences in the alignments as the outgroups with all sequences also treated as ingroups (versions ≥ 1.1.0).
But when preparing the reference alignments, it should be noted that the evolutionary distance between the ingroups and the defined outgroup may affect the detection of foreign contamination based on the conservative score. Contamination from species phylogenetically close to the reference species is relatively hard to distinguish from the clean ingroup sequences, compared with contamination from species distinct from all the reference species, such as symbiotic bacteria of a target eukaryotic species. If the defined outgroup species is too divergent from the ingroups, a large amount of foreign contamination, especially that from species closer to the ingroups than the defined outgroup species, may not be detected and removed.
Consequently, it is better if the users have a priori knowledge when choosing the defined outgroup while constructing or obtaining the reference alignments. In most cases, the defined outgroup in PhyloAln is recommended to be from a close or sister group of the monophyletic ingroup. If several outgroup species are used for the phylogenetic reconstruction, you can input all these outgroups or only the closest outgroup to PhyloAln (versions ≥ 1.1.0). Furthermore, you can set the ingroups in versions ≥ 1.1.0. In addition, the sensitivity of detection can be manually adjusted by setting a weight coefficient, which defaults to 0.9 (see `--outgroup_weight` in [parameters](#detailed-parameters) for detail).
#### The required memory is too large to run PhyloAln.
By default, the step to prepare the sequences/reads runs in parallel and is thus memory-consuming, especially when the data are large. You can try adding the option `--low_mem` to use a low-memory but slower mode to prepare the sequences/reads. In addition, decompression of the ".gz"-ended files will consume some memory; you can also try decompressing the files manually and then running PhyloAln.
#### The positions of sites in the reference alignments are changed in the output alignments.
During the HMMER3 search, some non-conservative sites are deleted (e.g., gappy sites) or sometimes realigned. This has little impact on the downstream phylogenetic or evolutionary analyses. If you want to keep the reference alignments unchanged or need a special HMMER3 search, you can try utilizing the options `--hmmbuild_parameters` and `--hmmsearch_parameters` to control the parameters of HMMER3. For example, you can try adding the option `--hmmbuild_parameters ' --symfrac' '0'` to retain the gappy sites. It should be noted that HMMER3 parameters starting with '-' or '--' can only be parsed by adding a space before them inside a pair of quotation marks.
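For clarity, on the command line the quoting described above looks like this (the leading space inside the first quoted argument is required so that the HMMER option is not parsed as a PhyloAln option):
```
PhyloAln [options] --hmmbuild_parameters ' --symfrac' '0'
```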
#### How can I assemble the paired-end reads?
PhyloAln does not have a method to specifically assemble the paired-end reads. It only maps all the sequences/reads into the alignments and builds a consensus in the assembly and/or output steps. You can input both paired-end read files for a single source/species (see `-i` and `-c` in [parameters](#detailed-parameters) for detail). Furthermore, if you care about the effect of assembly using paired-end reads, you can first merge them with other tools (e.g., [fastp](https://github.com/OpenGene/fastp)) and then input the merged read files, with or without the unpaired read files, into PhyloAln.
#### Can PhyloAln generate the alignments of multiple-copy genes for gene family analyses?
Actually, we have designed options for this purpose. You can try it like this:
```
PhyloAln -d reference_alignments_directory -c config.tsv -x alignment_file_name_suffix -o output_directory -p 20 -m codon -u outgroup -z all --overlap_len overlap_len --overlap_pident overlap_pident
```
`-z all` represents outputting all the assembled sequences instead of a consensus of them. You can also adjust `--overlap_len` and `--overlap_pident` to find the best assembly for the genes.
If the target sequences you provide contain complete gene sequences instead of reads or genomic sequences with introns, you can see [A practice using PhyloAln for gene family analysis](#a-practice-using-phyloaln-for-gene-family-analysis) and [Example commands for different data and common mode for easy use](#example-commands-for-different-data-and-common-mode-for-easy-use), and find the gene_xxx2xxx modes to help you.
--------------------------------------------------------------------------------