├── logo.png
├── tests
│   ├── DMELA_1.fq.gz
│   ├── DMELA_2.fq.gz
│   ├── DWILL_1.fq.gz
│   ├── DWILL_2.fq.gz
│   ├── ref
│   │   ├── OG0003709.fa
│   │   ├── OG0004212.fa
│   │   ├── OG0003820.fa
│   │   ├── OG0003531.fa
│   │   └── OG0003977.fa
│   └── run_test.sh
├── lib
│   ├── __pycache__
│   │   ├── map.cpython-37.pyc
│   │   ├── main.cpython-312.pyc
│   │   ├── main.cpython-37.pyc
│   │   ├── map.cpython-312.pyc
│   │   ├── assemble.cpython-37.pyc
│   │   ├── library.cpython-312.pyc
│   │   ├── library.cpython-37.pyc
│   │   └── assemble.cpython-312.pyc
│   ├── library.py
│   ├── main.py
│   ├── map.py
│   └── assemble.py
├── PhyloAln
├── scripts
│   ├── root_tree.py
│   ├── prune_tree.py
│   ├── select_seqs.py
│   ├── trim_matrix.py
│   ├── merge_seqs.py
│   ├── check_aln.py
│   ├── transseq.pl
│   ├── test_effect.py
│   ├── alignseq.pl
│   ├── revertransseq.pl
│   └── connect.pl
├── LICENSE
├── .gitignore
└── README.md

/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/logo.png
--------------------------------------------------------------------------------
/tests/DMELA_1.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/tests/DMELA_1.fq.gz
--------------------------------------------------------------------------------
/tests/DMELA_2.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/tests/DMELA_2.fq.gz
--------------------------------------------------------------------------------
/tests/DWILL_1.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/tests/DWILL_1.fq.gz
--------------------------------------------------------------------------------
/tests/DWILL_2.fq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/tests/DWILL_2.fq.gz
--------------------------------------------------------------------------------
/lib/__pycache__/map.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/map.cpython-37.pyc
--------------------------------------------------------------------------------
/lib/__pycache__/main.cpython-312.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/main.cpython-312.pyc
--------------------------------------------------------------------------------
/lib/__pycache__/main.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/main.cpython-37.pyc
--------------------------------------------------------------------------------
/lib/__pycache__/map.cpython-312.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/map.cpython-312.pyc
--------------------------------------------------------------------------------
/lib/__pycache__/assemble.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/assemble.cpython-37.pyc
--------------------------------------------------------------------------------
/lib/__pycache__/library.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/library.cpython-312.pyc -------------------------------------------------------------------------------- /lib/__pycache__/library.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/library.cpython-37.pyc -------------------------------------------------------------------------------- /lib/__pycache__/assemble.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huangyh45/PhyloAln/HEAD/lib/__pycache__/assemble.cpython-312.pyc -------------------------------------------------------------------------------- /PhyloAln: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | #-*- coding = utf-8 -*- 3 | 4 | import sys 5 | import importlib 6 | 7 | def main(): 8 | mod = importlib.import_module('lib.main') 9 | if len(sys.argv) < 2: 10 | print("\nError: no argument was provided!\n") 11 | mod.main(['-h']) 12 | else: 13 | mod.main(sys.argv[1:]) 14 | 15 | if __name__ == '__main__': 16 | main() 17 | -------------------------------------------------------------------------------- /scripts/root_tree.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import sys 5 | import os 6 | from ete3 import Tree 7 | 8 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 9 | print("Usage: {} input.nwk output.nwk outgroup/outgroups(default=the midpoint outgroup, separated by comma)".format(sys.argv[0])) 10 | sys.exit(0) 11 | if len(sys.argv) < 3: 12 | print("Error: options < 2!\nUsage: {} input.nwk output.nwk outgroup/outgroups(default=the midpoint outgroup, separated by comma)".format(sys.argv[0])) 13 | sys.exit(1) 14 | 15 | tree = Tree(sys.argv[1]) 16 | if len(sys.argv) > 3: 17 | outgroup = sys.argv[3] 18 | if ',' in outgroup: 19 | outgroup = tree.get_common_ancestor(outgroup.split(',')) 20 | else: 21 | outgroup = tree.get_midpoint_outgroup() 22 | tree.set_outgroup(outgroup) 23 | tree.write(outfile=sys.argv[2]) 24 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 huangyh45 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /scripts/prune_tree.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import sys 5 | import os 6 | from ete3 import Tree 7 | 8 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 9 | print("Usage: {} input.nwk output.nwk seq/seqs(separated by comma)_in_clade1_for_deletion( seq/seqs_in_clade2_for_deletion ...)".format(sys.argv[0])) 10 | sys.exit(0) 11 | if len(sys.argv) < 4: 12 | print("Error: options < 3!\nUsage: {} input.nwk output.nwk seq/seqs(separated by comma)_in_clade1_for_deletion( seq/seqs_in_clade2_for_deletion ...)".format(sys.argv[0])) 13 | sys.exit(1) 14 | 15 | tree = Tree(sys.argv[1]) 16 | leafids = [] 17 | for leaf in tree: 18 | leafids.append(leaf.name) 19 | dists = {} 20 | for seqids in sys.argv[3:]: 21 | if ',' in seqids: 22 | clade = tree.get_common_ancestor(seqids.split(',')) 23 | for leaf in clade: 24 | leafids.remove(leaf.name) 25 | else: 26 | clade = tree.search_nodes(name=seqids)[0] 27 | leafids.remove(seqids) 28 | parent = clade.up 29 | if len(parent.children) == 2: 30 | sis_leaves = [] 31 | for leaf in parent: 32 | if leaf not in clade and leaf.name != clade.name: 33 | sis_leaves.append(leaf.name) 34 | dists[','.join(sis_leaves)] = parent.dist + parent.children[0].dist + parent.children[1].dist - clade.dist 35 | #clade.detach() 36 | tree.prune(leafids) 37 | for seqids, dist in dists.items(): 38 | if ',' in seqids: 39 | tree.get_common_ancestor(seqids.split(',')).dist = dist 40 | else: 41 | tree.search_nodes(name=seqids)[0].dist = dist 42 | tree.write(outfile=sys.argv[2]) 43 | -------------------------------------------------------------------------------- /scripts/select_seqs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import os 5 | import sys 6 | 7 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 8 | print("Usage: {} input_dir selected_species_or_sequences(separated by comma) output_dir fasta_suffix(default='.fa') separate_symbol(default='.') if_list_for_exclusion(default=no)".format(sys.argv[0])) 9 | sys.exit(0) 10 | if len(sys.argv) <= 3: 11 | print("Error: options < 3!\nUsage: {} input_dir selected_species_or_sequences(separated by comma) output_dir fasta_suffix(default='.fa') separate_symbol(default='.') if_list_for_exclusion(default=no)".format(sys.argv[0])) 12 | sys.exit(1) 13 | select_list = sys.argv[2].split(',') 14 | suffix = '.fa' 15 | if len(sys.argv) > 4: 16 | suffix = sys.argv[4] 17 | sep = '.' 18 | if len(sys.argv) > 5: 19 | sep = sys.argv[5] 20 | if_select = True 21 | if len(sys.argv) > 6: 22 | if sys.argv[6] == 'no': 23 | print("Error: ambiguous option! 
If you want to select the species or sequences in the list instead of to exclude them, please not input the last option!") 24 | sys.exit(1) 25 | if_select = False 26 | 27 | 28 | if not os.path.isdir(sys.argv[3]): 29 | os.mkdir(sys.argv[3]) 30 | 31 | files = os.listdir(sys.argv[1]) 32 | for filename in files: 33 | if not filename.endswith(suffix): 34 | continue 35 | outfile = open(os.path.join(sys.argv[3], filename), 'w') 36 | if_output = False 37 | for line in open(os.path.join(sys.argv[1], filename)): 38 | if line.startswith('>'): 39 | seqid = line.lstrip('>').rstrip().split(' ')[0] 40 | if if_select: 41 | if_output = False 42 | else: 43 | if_output = True 44 | for sp in select_list: 45 | if if_select: 46 | if seqid == sp or seqid.startswith(sp + sep): 47 | if_output = True 48 | break 49 | else: 50 | if seqid == sp or seqid.startswith(sp + sep): 51 | if_output = False 52 | break 53 | if if_output: 54 | outfile.write(line) 55 | outfile.close() 56 | 57 | -------------------------------------------------------------------------------- /tests/ref/OG0003709.fa: -------------------------------------------------------------------------------- 1 | >DBUSC.8329 2 | atGAATTCATTGGCACGTCTTGGTGGATTTGTTTGTCAATCTGTGCAAATTGCAGGCTGTGGCctgcagcaagtgcgAACCAAATATGCCGACTGGAAAATGATACGAGATGTTAAGCGACGCAAGTGTGTCTCGGAGCACGCCAAGGACCGGTTGCGAGTAAATTCGCTACGAAAGAATGACATACTGCCCGTGGAGCTGCGTGAAGTGGCGGATGCCCAGATTGCAGCATTTCCCAGGGACTCATCACTGGTGCGTGTGAGAGAACGCTGTGCTCTTACGTCACGACCCCGCGGTGTTGTGCACAAATACAGGCTGAGCCGTATTGTTTGGCGCCACTTGGCTGACTATAACAAGCTGTCTGGAGTCCAGAGAGCCATGTGG 3 | >DMOJA.2790 4 | ATGAACTCCTTGGCTAAGCTCGGAAGCTTCGTTAGCCAAAGCGTTCAAATCGCTGGATGTGGCTTGCAGCAAGTGCGCACCAAATATGCCGACTGGCGTATGATACGCGATGTCAAACGTCGCAAATGTGTCAGCGAACATGCCAAGGACCGCTTGCGTGTAAACTCGCTACGGAAGAATGACATACTGCCCGTTGAGCTCCGCGAGTTGGCGGATGCTGAGATAGCCGCATTTCCAAGGGACTCCTCACTGGTTCGCGTGCGAGAACGTTGCGCACTTACGTCACGGCCGCGCGGCGTAGTCCACAAATATCGACTTAGCCGCATTGTGTGGCGACACCTGGCCGATTATAACAAGCTGTCTGGAGTGCAGCGTGCCATGTGG 5 | >DPSEU.9672 6 | atgaatTCTCTGGCCAGAATTGGTGGTTTCGCTTGCCAAGCTGGGAAATTAGCTGGATTTGGCTTACAGCAAGTGCGAACAAAGTATGCCGACTGGAAGATGATCCGTGATGTCAAGCGCCGCAAGTGCGTGCAAGAGCATGCCAAGGAGAGGCTTCGAGTTAATTCACTTCGGAAGAATGACATACTGCCCATTGAGCTGAGGGAAGTGGCCGATGCGGAGATTGCTGCTTTTCCACGCGACTCATCATTGGTCCGTGTTCGGGAACGTTGCGCACTTACGTCACGGCCGCGCGGAGTTGTCCACAAATATCGCCTCAGCAGAATTGTATGGCGCCATTTGGCGGACTATAACAAGCTATCCGGTGTGCAGCGTGCCATGTGG 7 | >DYAKU.7011 8 | ATGAATTCTTTGGCCAGGATCGGGGGTTTTGTGTGCCAGTCCGTGCAGATAGCCGGCTGCGGGctgcagcaggtGCGCACCAAGTACGCCGATTGGAAGATGATTCGCGATGTCAAGCGGCGCAAGTGCGTCAAGGAGAACGCCGTGGAGCGACTACGGATCAACTCGCTGCGCAAGAACGACATCCTGCCGCCGGAGCTGCGCGAGGTGGCCGACGCCGAGATCGCTGCCTTTCCACGGGACTCATCGCTCGTCCGGGTGAGGGAACGCTGTGCGCTTACGTCACGGCCGCGCGGAGTCGTCCACAAGTACAGGCTTAGTCGAATCGTGTGGCGCCACCTCGCCGACTACAACAAGCTGTCCGGCGTGCAGCGTGCCATGTGG 9 | >SLEBA.7037 10 | ATGAACTCGTTGGCCAAAGTTAGTAGCTTTGTGTTCCAGTCTGTACAAATTTCTGGATGTGGTCTTCAACAAATACGTACAAAGTATGCCGACTGGAAGATGATACGTGACGTTAAACGTCGTAAGTGCGTGGAAAAATTTGCCAAGGAACGTTTGCAAATCAATTCAATACGCAAAAACGATATTCTTCCTCATGAACTACGACAGCTGGCCGATGTAGATATTGCAGCATTTCCTCGGGACTCTTCCTTGGTACGTGTTCGTGAACGTTGTGCACTTACGTCACGACCTCGGGGCGTCGTTCACAAATATCGGCTTAGCAGAATCGTGTGGCGGCACTTAGCAGATTACAATAAGCTATCTGGAGTTCAAAGAGCCATGTGG 11 | -------------------------------------------------------------------------------- /scripts/trim_matrix.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import os 5 | import sys 6 | 7 | def read_fasta(fasta): 8 | seqs = {} 9 | seqid = '' 10 | for line in open(fasta): 11 
| line = line.rstrip() 12 | if line.startswith('>'): 13 | arr = line.split(" ") 14 | seqid = arr[0].lstrip('>') 15 | seqs[seqid] = '' 16 | elif seqs.get(seqid) is not None: 17 | seqs[seqid] += line 18 | return seqs 19 | 20 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 21 | print("Usage: {} input_dir output_dir unknown_symbol(default='X') known_number(>=1)_or_percent(<1)_for_columns(default=0.5) known_number(>=1)_or_percent(<1)_for_rows(default=0) fasta_suffix(default='.fa')".format(sys.argv[0])) 22 | sys.exit(0) 23 | if len(sys.argv) < 3: 24 | print("Error: options < 2!\nUsage: {} input_dir output_dir unknown_symbol(default='X') known_percent_for_columns(default=50) known_percent_for_rows(default=0) fasta_suffix(default='.fa')".format(sys.argv[0])) 25 | sys.exit(1) 26 | unknow = 'X' 27 | if len(sys.argv) > 3: 28 | unknow = sys.argv[3] 29 | pcol = 0.5 30 | if len(sys.argv) > 4: 31 | pcol = float(sys.argv[4]) 32 | prow = 0 33 | if len(sys.argv) > 5: 34 | prow = float(sys.argv[5]) 35 | suffix = '.fa' 36 | if len(sys.argv) > 6: 37 | suffix = sys.argv[6] 38 | 39 | if not os.path.isdir(sys.argv[2]): 40 | os.mkdir(sys.argv[2]) 41 | 42 | files = os.listdir(sys.argv[1]) 43 | for filename in files: 44 | if not filename.endswith(suffix): 45 | continue 46 | seqs = read_fasta(os.path.join(sys.argv[1], filename)) 47 | if pcol > 0: 48 | if pcol < 1: 49 | ncol = pcol * len(seqs) 50 | else: 51 | ncol = pcol 52 | newseqs = {} 53 | for seqid in seqs.keys(): 54 | newseqs[seqid] = '' 55 | for i in range(len(list(seqs.values())[0])): 56 | n = 0 57 | for seqstr in seqs.values(): 58 | if seqstr[i] != unknow: 59 | n += 1 60 | if n >= ncol: 61 | for seqid, seqstr in seqs.items(): 62 | newseqs[seqid] += seqstr[i] 63 | seqs = newseqs 64 | if prow > 0: 65 | if prow < 1: 66 | nrow = prow * len(list(seqs.values())[0]) 67 | else: 68 | nrow = prow 69 | for seqid, seqstr in list(seqs.items()): 70 | n = 0 71 | for base in seqstr: 72 | if base != unknow: 73 | n += 1 74 | if n < nrow: 75 | print("Removing {} from {}: known sites {}/{} = {}%".format(seqid, filename, n, len(seqstr), int(n / len(seqstr) * 10000 + 0.5) / 100)) 76 | seqs.pop(seqid) 77 | outfile = open(os.path.join(sys.argv[2], filename), 'w') 78 | for seqid, seqstr in seqs.items(): 79 | outfile.write(">{}\n{}\n".format(seqid, seqstr)) 80 | outfile.close() 81 | 82 | -------------------------------------------------------------------------------- /scripts/merge_seqs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import os 5 | import sys 6 | 7 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 8 | print("Usage: {} output_dir PhyloAln_dir1 PhyloAln_dir2 (PhyloAln_dir3 ...)".format(sys.argv[0])) 9 | sys.exit(0) 10 | if len(sys.argv) < 3: 11 | print("Error: options < 2!\nUsage: {} output_dir PhyloAln_dir1 PhyloAln_dir2 (PhyloAln_dir3 ...)".format(sys.argv[0])) 12 | sys.exit(1) 13 | 14 | os.mkdir(sys.argv[1]) 15 | if os.path.isdir(os.path.join(sys.argv[2], 'aa_out')): 16 | os.mkdir(os.path.join(sys.argv[1], 'aa_out')) 17 | if os.path.isdir(os.path.join(sys.argv[2], 'nt_out')): 18 | os.mkdir(os.path.join(sys.argv[1], 'nt_out')) 19 | 20 | aafiles = {} 21 | ntfiles = {} 22 | if os.path.isdir(os.path.join(sys.argv[2], 'aa_out')): 23 | for dirname in sys.argv[2:]: 24 | filenames = os.listdir(os.path.join(dirname, 'aa_out')) 25 | for filename in filenames: 26 | if not filename.endswith('.fa'): 27 | continue 28 | if aafiles.get(filename) is None: 29 | aafiles[filename] = [] 30 | 
aafiles[filename].append(os.path.join(dirname, 'aa_out', filename)) 31 | if os.path.isdir(os.path.join(sys.argv[2], 'nt_out')): 32 | for dirname in sys.argv[2:]: 33 | filenames = os.listdir(os.path.join(dirname, 'nt_out')) 34 | for filename in filenames: 35 | if not filename.endswith('.fa'): 36 | continue 37 | if ntfiles.get(filename) is None: 38 | ntfiles[filename] = [] 39 | ntfiles[filename].append(os.path.join(dirname, 'nt_out', filename)) 40 | 41 | for aafile, filenames in aafiles.items(): 42 | seqs = {} 43 | for filename in filenames: 44 | seqid = '' 45 | for line in open(filename): 46 | line = line.rstrip() 47 | if line.startswith('>'): 48 | arr = line.split(" ") 49 | seqid = arr[0].lstrip('>') 50 | seqs[seqid] = '' 51 | elif seqs.get(seqid) is not None: 52 | seqs[seqid] += line 53 | outfile = open(os.path.join(sys.argv[1], 'aa_out', aafile), 'w') 54 | for seqid, seqstr in seqs.items(): 55 | outfile.write(">{}\n{}\n".format(seqid, seqstr)) 56 | outfile.close() 57 | for ntfile, filenames in ntfiles.items(): 58 | seqs = {} 59 | for filename in filenames: 60 | seqid = '' 61 | for line in open(filename): 62 | line = line.rstrip() 63 | if line.startswith('>'): 64 | arr = line.split(" ") 65 | seqid = arr[0].lstrip('>') 66 | seqs[seqid] = '' 67 | elif seqs.get(seqid) is not None: 68 | seqs[seqid] += line 69 | outfile = open(os.path.join(sys.argv[1], 'nt_out', ntfile), 'w') 70 | for seqid, seqstr in seqs.items(): 71 | outfile.write(">{}\n{}\n".format(seqid, seqstr)) 72 | outfile.close() 73 | 74 | -------------------------------------------------------------------------------- /tests/run_test.sh: -------------------------------------------------------------------------------- 1 | set -e 2 | dir=`pwd` 3 | if [ ! -d "tests/aln" ]; then 4 | mkdir tests/aln 5 | fi 6 | for file in tests/ref/*.fa; do 7 | og=`basename $file` 8 | alignseq.pl -i $file -o tests/aln/$og -a codon -n 5 9 | done 10 | echo "DMELA $dir/tests/DMELA_1.fq.gz,$dir/tests/DMELA_2.fq.gz" > tests/run_test.config 11 | echo "DWILL $dir/tests/DWILL_1.fq.gz,$dir/tests/DWILL_2.fq.gz" >> tests/run_test.config 12 | PhyloAln -d tests/aln -x .fa -c tests/run_test.config -o tests/PhyloAln_out -p 5 -m codon -u SLEBA 13 | if [ $1 ]; then 14 | # full usage experience of all the scripts with real examples through a simple phylogenomic flow 15 | PhyloAln -d tests/aln -x .fa -s DWILL2 -i $dir/tests/DWILL_1.fq.gz $dir/tests/DWILL_2.fq.gz -o tests/PhyloAln_out2 -p 5 -m codon -u SLEBA # additionally generate result alignments for only DWILL 16 | #test_effect.py tests/PhyloAln_out2/nt_out:DYAKU tests/PhyloAln_out/nt_out:DYAKU tests/PhyloAln_DYAKUvsDYAKU.tsv N . .fa DBUSC,DMOJA,DPSEU,DYAKU,SLEBA # compare result aligned sequences of DYAKU in two runs. The sequences should be 100% identical. 17 | merge_seqs.py tests/PhyloAln_all tests/PhyloAln_out tests/PhyloAln_out2 # merge results of two runs 18 | cp tests/PhyloAln_all/nt_out/OG0003820.fa tests/PhyloAln_all/nt_out/OG0003977.fa tests/PhyloAln_all/nt_out/OG0004212.fa tests/PhyloAln_all # pick out the result alignments containing DWILL and DWILL2 19 | test_effect.py tests/PhyloAln_all:DWILL tests/PhyloAln_all:DWILL2 tests/PhyloAln_DWILLvsDWILL2.tsv N . .fa DBUSC,DMOJA,DPSEU,DYAKU,SLEBA # compare result aligned sequences of DWILL and DWILL2. The sequences should be 100% identical. 20 | select_seqs.py tests/PhyloAln_all/nt_out DBUSC,DPSEU tests/PhyloAln_all/nt_sel .fa . 
1 # exclude the sequences of DBUSC and DPSEU from the result alignments 21 | trim_matrix.py tests/PhyloAln_all/nt_sel tests/PhyloAln_all/nt_trim N 0.5 10 # exclude the columns (sites) with known bases (not 'N') < 50% and the rows (species) with known bases (not 'N') < 10 22 | connect.pl -i tests/PhyloAln_all/nt_trim -f N -b all.block -n -c 123 # concatenate the alignments of five genes into a supermatrix and generate a partition file of codon positions for IQ-TREE 23 | iqtree -s all.fas -p all.block -m MFP+MERGE -B 1000 -T AUTO --threads-max 5 --prefix tests/PhyloAln_all/species_tree # reconstruct phylogeny of the supermatrix by IQ-TREE 24 | root_tree.py tests/PhyloAln_all/species_tree.treefile tests/PhyloAln_all/species_tree.rooted.tre SLEBA # root the species phylogeny using SLEBA (outgroup) and generate the rooted tree 'tests/PhyloAln_all/species_tree.rooted.tre' 25 | echo "Successfully complete running" 26 | else 27 | connect.pl -i tests/PhyloAln_out/nt_out -f N -b all.block -n -c 123 28 | check_aln.py -h 29 | merge_seqs.py -h 30 | prune_tree.py -h 31 | root_tree.py -h 32 | select_seqs.py -h 33 | test_effect.py -h 34 | trim_matrix.py -h 35 | echo "Successfully installed" 36 | fi 37 | -------------------------------------------------------------------------------- /scripts/check_aln.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import os 5 | import sys 6 | 7 | def read_fasta(fasta): 8 | seqs = {} 9 | seqid = '' 10 | for line in open(fasta): 11 | line = line.rstrip() 12 | if line.startswith('>'): 13 | arr = line.split(" ") 14 | seqid = arr[0].lstrip('>') 15 | seqs[seqid] = '' 16 | elif seqs.get(seqid) is not None: 17 | seqs[seqid] += line.upper() 18 | return seqs 19 | 20 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 21 | print("Usage: {} input_dir output_dir(default='none') aver_freq_per_site(default=0.75) gap_symbol(default='-') start_end_no_gap_number(>=1)_or_percent(<1)(default=0.6) fasta_suffix(default='.fa')".format(sys.argv[0])) 22 | sys.exit(0) 23 | if len(sys.argv) < 2: 24 | print("Error: options < 1!\nUsage: {} input_dir output_dir(default='none') aver_freq_per_site(default=0.75) gap_symbol(default='-') start_end_no_gap_number(>=1)_or_percent(<1)(default=0.6) fasta_suffix(default='.fa')".format(sys.argv[0])) 25 | sys.exit(1) 26 | outdir = None 27 | if len(sys.argv) > 2: 28 | if sys.argv[2].lower() != 'none': 29 | outdir = sys.argv[2] 30 | freq = 0.75 31 | if len(sys.argv) > 3: 32 | freq = float(sys.argv[3]) 33 | gap = '-' 34 | if len(sys.argv) > 4: 35 | gap = sys.argv[4] 36 | pgap = 0.6 37 | if len(sys.argv) > 5: 38 | pgap = float(sys.argv[5]) 39 | suffix = '.fa' 40 | if len(sys.argv) > 6: 41 | suffix = sys.argv[6] 42 | 43 | if outdir: 44 | if not os.path.isdir(outdir): 45 | os.mkdir(outdir) 46 | 47 | files = os.listdir(sys.argv[1]) 48 | for filename in files: 49 | if not filename.endswith(suffix): 50 | continue 51 | seqs = read_fasta(os.path.join(sys.argv[1], filename)) 52 | unaln = [] 53 | start = None 54 | end = None 55 | if pgap < 1: 56 | ngap = pgap * len(seqs) 57 | else: 58 | ngap = pgap 59 | for i in range(len(list(seqs.values())[0])): 60 | n = 0 61 | for seqstr in seqs.values(): 62 | if seqstr[i] != gap: 63 | n += 1 64 | if n >= ngap: 65 | if start is None: 66 | start = i 67 | end = i 68 | print(f"\n{filename}, length: {len(list(seqs.values())[0])}, valid regions: {start}-{end}") 69 | if start is None or end is None: 70 | for seqid in seqs.keys(): 71 | print(f"{seqid}, 
None-None: -Inf < {freq}, unaligned!") 72 | unaln.append(seqid) 73 | else: 74 | scores = {} 75 | for i in range(start, end + 1): 76 | scores[i] = {} 77 | for seqstr in seqs.values(): 78 | scores[i].setdefault(seqstr[i], 0) 79 | scores[i][seqstr[i]] += 1 80 | for seqid, seqstr in seqs.items(): 81 | seq_start = None 82 | seq_end = None 83 | for i in range(start, end + 1): 84 | if seqstr[i] != gap: 85 | if seq_start is None: 86 | seq_start = i 87 | seq_end = i 88 | if seq_start is None or seq_end is None: 89 | print(f"{seqid}, {seq_start}-{seq_end}: -Inf < {freq}, unaligned!") 90 | unaln.append(seqid) 91 | continue 92 | aver_freq = 0 93 | for i in range(seq_start, seq_end + 1): 94 | aver_freq += (scores[i][seqstr[i]] / len(seqs)) 95 | aver_freq = aver_freq / (seq_end - seq_start + 1) 96 | if aver_freq < freq: 97 | print(f"{seqid}, {seq_start}-{seq_end}: {aver_freq} < {freq}, unaligned!") 98 | unaln.append(seqid) 99 | else: 100 | print(f"{seqid}, {seq_start}-{seq_end}: {aver_freq}") 101 | if outdir: 102 | outfile = open(os.path.join(outdir, filename), 'w') 103 | for seqid, seqstr in seqs.items(): 104 | if seqid not in unaln: 105 | outfile.write(">{}\n{}\n".format(seqid, seqstr)) 106 | outfile.close() 107 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 
99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/#use-with-ide 110 | .pdm.toml 111 | 112 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 113 | __pypackages__/ 114 | 115 | # Celery stuff 116 | celerybeat-schedule 117 | celerybeat.pid 118 | 119 | # SageMath parsed files 120 | *.sage.py 121 | 122 | # Environments 123 | .env 124 | .venv 125 | env/ 126 | venv/ 127 | ENV/ 128 | env.bak/ 129 | venv.bak/ 130 | 131 | # Spyder project settings 132 | .spyderproject 133 | .spyproject 134 | 135 | # Rope project settings 136 | .ropeproject 137 | 138 | # mkdocs documentation 139 | /site 140 | 141 | # mypy 142 | .mypy_cache/ 143 | .dmypy.json 144 | dmypy.json 145 | 146 | # Pyre type checker 147 | .pyre/ 148 | 149 | # pytype static type analyzer 150 | .pytype/ 151 | 152 | # Cython debug symbols 153 | cython_debug/ 154 | 155 | # PyCharm 156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 158 | # and can be added to the global gitignore or merged into this file. For a more nuclear 159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 160 | #.idea/ 161 | -------------------------------------------------------------------------------- /scripts/transseq.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use Bio::SeqIO; 4 | use Getopt::Std; 5 | use Parallel::ForkManager; 6 | use strict; 7 | my %opt=('g'=>'1','t'=>'*','n'=>'1','l'=>'transseq.log'); 8 | getopts('i:o:g:t:c:a:n:l:h',\%opt); 9 | usage() if $opt{h}; 10 | 11 | my $ntfile=$opt{i}; 12 | my $aafile=$opt{o}; 13 | my $gencode=$opt{g}; 14 | my $termination=$opt{t}; 15 | my $incomplete=$opt{c}; 16 | my $transall=$opt{a}; 17 | our $numthreads=$opt{n}; 18 | our $logfile=$opt{l}; 19 | open(LOG,">>$logfile") or die "\nError: $logfile can't open!\n"; 20 | 21 | printdie("\nError: no input file or output file was set!\n") unless $ntfile&&$aafile; 22 | printlog("\n##### Sequences translation begins #####\n"); 23 | my $in=Bio::SeqIO->new(-file => "$ntfile", -format => 'fasta'); 24 | my @seqs; 25 | my $pm = new Parallel::ForkManager($numthreads); 26 | $pm -> run_on_finish( sub { 27 | my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_; 28 | if($data_structure_reference) { 29 | my @arr=@$data_structure_reference; 30 | if(@arr==1) {printdie($arr[0]);} 31 | else { 32 | push(@seqs, $arr[1]); 33 | printlog($arr[0]); 34 | } 35 | } 36 | }); 37 | while(my $seq=$in->next_seq()) { 38 | $pm->start and next; 39 | my $id=$seq->display_id; 40 | my @frames=(0,1,2,-1,-2,-3); 41 | my $finalseq; 42 | for(my $i=0;$i<@frames;$i++) { 43 | my ($transseq,$seq1); 44 | if($frames[$i]<0) { 45 | $seq1=$seq->revcom; 46 | $transseq=$seq1->translate(-codontable_id => $gencode, -terminator => $termination, -frame => -$frames[$i]-1); 47 | $seq1=$seq1->trunc(-$frames[$i],$seq1->length()); 48 | } 
49 | else { 50 | $transseq=$seq->translate(-codontable_id => $gencode, -terminator => $termination, -frame => $frames[$i]); 51 | $seq1=$seq->trunc($frames[$i]+1,$seq->length()); 52 | } 53 | my $seq2=$transseq; 54 | if($incomplete&&$seq2->length()*3!=$seq1->length()) { 55 | if($seq2->length()*3+1==$seq1->length()||$seq2->length()*3+2==$seq1->length()) { 56 | $seq2=Bio::Seq->new( -seq => ($transseq->seq).'X' , -id => $id ); 57 | } 58 | else {my $diestr="\nError: nt length does not match the aa length in $id!\n";$pm->finish(0,[$diestr]);} 59 | } 60 | unless($transall) {$finalseq=$seq2;last;} 61 | else { 62 | my $start=$frames[$i]<0 ? $seq->length()+$frames[$i]+1 : $frames[$i]+1; 63 | my $end=$frames[$i]<0 ? $start%3+1 : $seq->length()-($seq->length()-$start+1)%3; 64 | if($incomplete) {$end=$frames[$i]<0 ? 1 : $seq->length();} 65 | $finalseq=Bio::Seq->new( -seq => $seq2->seq , -id => "$id|$start-$end|$frames[$i]") 66 | } 67 | } 68 | $pm->finish(0,["Translation of $id finished.\n",$finalseq]); 69 | } 70 | $pm->wait_all_children; 71 | printlog("#####i Sequences translation complished #####\n\n"); 72 | close(LOG); 73 | my $out = Bio::SeqIO->new(-file => ">$aafile", -format => 'fasta'); 74 | foreach my $seq (@seqs) {$out->write_seq($seq);} 75 | 76 | sub printlog { 77 | print LOG $_[0]; 78 | printf $_[0]; 79 | } 80 | 81 | sub printdie { 82 | print LOG $_[0]; 83 | die "$_[0]\nYou can use '-h' to watch detailed help.\n"; 84 | } 85 | 86 | sub usage { 87 | 88 | die " 89 | perl $0 90 | Translate nucleotide sequences in a file to amino acid sequences. 91 | 92 | Usage: 93 | -i input nucleotide sequences file 94 | -o output amino acid sequences file 95 | -g genetic code(default=1, invertebrate mitochondrion=5) 96 | -t symbol of termination(default='*') 97 | -c if translate incomplete codons into 'X'(default=no) 98 | -a if translate all six ORF(default=no) 99 | -n num threads(default=1) 100 | -l log file(default='transseq.log') 101 | -h this help message 102 | 103 | Example: 104 | transseq.pl -i ntfile -o aafile -g gencode -t termination -c 1 -a 1 -n numthreads -l logfile 105 | 106 | Written by Yu-Hao Huang (2017-2024) huangyh45\@mail3.sysu.edu.cn 107 | "; 108 | 109 | } 110 | -------------------------------------------------------------------------------- /tests/ref/OG0004212.fa: -------------------------------------------------------------------------------- 1 | >DBUSC.10868 2 | ATGATGATGCAACGCACTGGATTATTGTTGCCAATGGCAATTGAAGCCACCATGCAGGCTCAGCAGCAACGTGGCATGGCTACGCTGAAGACAATTTCCATGCGCCTCAAATCCGTGAAGAACATTCAGAAAATTACGCAATCAATGAAGATGGTGTCCGCGGCCAAATACGCCCGTGCCGAGCGTGATTTGAGGGCAGCGCGTCCCTATGGCGCTGGTGCTCAGCAATTCTTTGAAAAGGTTGAAATCACTCCCGATGAGAAGGCCGAACCCAAGAAGTTGCTCATTGCCATGACATCGGATCGTGGTCTGTGCGGTGCAGTCCATACCGGTGTGGCGCGTCTAATTCGTGGTGAGCTGGCCCAGGATGATACAAACACTAAGGTGTTCTGCGTCGGTGACAAGTCACGCGCCATTCTAGCTCGTCTCTACggcaaaaacattttgatgGTAGCCAATGAGATTGGTCGCCTGCCACCCACTTTCCTAGACGCATCCAAGATTGCGCATGAAGTGTTGAACACCGGCTACGAGTATACCGAGGGCAAGATTGTTTACAACAGATTCAAGTCTGTCGTCTCCTATCAGTGCAGCACACTGCCCATCTTCAGCGGTTCTACTGTGGAGAAGTCAGAGAAGCTGGCTGTTTACGATTCGCTCGATAGCGATGTTGTCCAAAGCTATCTGGAATTTTCGTTGGCCTCGCTCATCTTCTACACCATGAAGGAAGGCGCTTGCTCCGAGCAATCGTCTCGTATGACTGCCATGGACAATGCTTCCAAGAACGCCGGTGAGATGATTGAAAAGCTAACACTGACATTCAACCGCACCAGACAGGCTGTCATTACCCGTGAGCTGATTGAAATCATCTCTGGTGCCGCTGCCCTGACA 3 | >DMOJA.8036 4 | 
ATGATGATGCAACGTACTACGCTTTTGCTGCCAATGGCTGTTGAAGCCACCAATGTTGCCCAACAGCAACGTGGTATGGCCACATTGAAGCATATTTCCATGCGCCTCAAATCCGTAAAGAACATCCAAAAAATTACGCAATCAATGAAAATGGTGTCCGCGGCCAAGTACTCCCGTGCCGAGCGTGATTTAAAGGCAGCGCGTCCCTATGGCATCGGTGCTCAACAATTCTTTGATAAGACCGAAGTGCAGGCTGATGGAGCTGTCGAGCCCAAGAAGCTGCTTATTGCCGTAACTTCGGATCGTGGCCTCTGCGGTGCCGTGCACACCGGTGTTGCACGTCTCATCCGTGGCGAGCTGCAGAAGGACGATTCTAACACCATGGTGTTCTGCGTTGGCGACAAGTCGCGTGCCATTCTGTCCCGTTTGTACGGTAAGAACATCCTGATGGTGGCCAACGAAGTGGGCCGTCTGCCACCTACTTTCCTGGATGCATCCAAGATTGCGCATGAGGTATTGTCGACCGGGTACGATTATACTGAGGGCAAGATCGTGTACAACCAGTTCAAGTCTGTGGTCTCGTACAAGTGCTCCACGTTGCCCATCTACAGTGGCCCCACTGTGGAGAAGTCGGAGAAGTTGGCCGTTTACGATTCGCTCGACAGCGATGTCATCAAGAGCTATCTGGAGTTCTCTCTGGCCTCGCTCATCTTCTACACCATGAAGGAGGGCGCTTGCTCTGAGCAATCGTCCCGTATGACTGCCATGGACAATGCCTCGAAGAACGCCGGTGAAATGATTGAGAAGCTGACCCTCACATTCAACCGCACCCGACAGGCTGTCATCACTCGCGAGTTGATTGAAATCATCTCTGGTGCCGCTGCCCTGACA 5 | >DPSEU.2270 6 | ATGATGATGCAACGCACTACGCTTTTGCTGCCCATGGCCATTGAAGCCACCATgctggcccagcagcagcgtggCATGGCCACTTTAAAGACCATTTCCATGCGTCTGAAATCCGTGAAGAACATTCAGAAAATTACGCAATCGATGAAGATGGTGTCCGCGGCCAAGTACGCCCGTGCCGAGCGTGATTTGAAGGCGGCGCGTCCCTACGGAATCGGTGCTCAGCAGTTTTTCGAAAAGACTGAGATCGTGCCCGATGAGAAGGCCGAGCCCAAGAAGCTCTTCATCGCCGTAACATCGGACCGTGGTCTGTGCGGTGCTGTTCACACTGGTGTGGCGCGTCTGATCCGTGGCGAGATGGCTACCGAACATGCCAACACCAAGATTTTCTGCGTGGGAGACAAGTCCCGTGCTATTCTGGCCCGTCTGTACGGCAAGAATATCCTGATGGTGGCCAACGAAATCGGCCGTCTGCCCCCCACTTTCCTGGATGCCTCGAAGATCGCCCATGAGGTCCTAAACACTGGCTACGAGTACACCGAGGGCAAGATCGTGTACAACAAGTTCAGGTCCGTTGTCTCGTACCAGTGCAGCACACTGCCCATCTACGGTGGCCCCACTGTCGAGAAGTCGGAGAAACTGGCCACTTACGACTCGCTCGACAGCGATGTCATCAAGAGCTACTTGGAGTTCTCGCTGGCCTCCCTCATCTTCTACACCATGAAGGAGGGTGCCTGCTCGGAGCAGTCCTCCCGTATGACGGCCATGGACAATGCTTCCAAGAACGCCGGTGAGATGATTGACAAACTTACTCTCACATTCAACCGCACCCGACAGGCTGTCATCACTCGCGAGCTGATTGAGATCAtctctggtgctgctgccctcaCA 7 | >DYAKU.7834 8 | ATGATGATGCAACGCACCCAGCTCCTGCTGCCCCTGGCCATGGAGGCCACCATGCTGGCCCAGCAGCAGCGTGGCATGGCCACCCTGAAGATGATTTCCATCCGCCTGAAGTCGGTGAAGAACATTCAGAAAATTACGCAATCGATGAAGATGGTGTCCGCTGCCAAGTACGCCCGTGCCGAGCGAGACTTGAAGGCGGCGCGTCCTTACGGCATCGGCGCCCAGCAGTTCTTCGAGAAGACGGAGATCCAGGCGGACGAGAAGGCGGAGCCCAAGAAGCTGCTCATCGCAGTCACTTCGGACCGTGGTCTTTGCGGCGCTGTCCACACTGGTGTGGCCCGTCTCATTCGTGGTGAACTGGCCCAGGACGAGGCCAACACTAAGGTGTTCTGCGTGGGCGACAAGTCGCGCGCTATCCTGTCCCGTCTGTACGGCAAGAACATCCTGATGGTGGCCAACGAGGTGGGCCGACTGCCGCCCACTTTCCTGGACGCCTCGAAGATTGCCAACGAGGTTCTGCAGACCGGTTACGATTACACCGAGGGCAAGATCGTCTACAACCGTTTCAAGTCGGTGGTGTCGTACCAGTGCTCCACCCTCCCCATCTTCAGCGGATCCACCGTGGAGAAGTCGGAGAAGCTGGCCGTCTACGACTCGCTCGACAGCGACGTGGTCAAGAGCTACCTGGAGTTCTCGCTGGCCTCGCTCATCTTCTACACCATGAAGGAGGGCGCCTGCTCGGAGCAGTCCTCCCGTATGACTGCCATGGACAACGCTTCCAAGAACGCCGGTGAGATGATCGACAAGCTGACCCTCACCTTCAACCGTACCCGACAGGCCGTCATCACTCGCGAGCTGATTGAAATCATCTCCGGTGCCGCCGCCCTCACA 9 | >SLEBA.12201 10 | 
ATGATGATGCAACGCTCCATGCTCTTGCTACCCCTGGCGGTTGAAGCCACGATGCATGCCCAACAGCAACGTGGTATGGCCACTTTGCAGTCAATTTCCATTCGCTTGAAATCAGTGAAGAACATTCAGAAAATTACGCAATCAATGAAGATGGTGTCCGCCGCAAAATACGCCAGGGCGGAGCGTGATTTGAAGGCGGCGCGTCCTTATGGCATTGGAGCGCAACAGTTCTTCGAAAAGACCGAGATCCAGGTCGATGAGAAGGCCGAACCCAAGAAGCTTCTTATCGCGATGACATCCGATCGTGGTCTCTGCGGCGCTGTGCACACCGGTGTGGCCCGTCACATCCGCAACGAGCTTGCTAAGGACGATGTTAACACCAAGATTTTCTGTGTCGGCGACAAGTCGCGCTCGATCCTGGCGCGTCTATACGGCAAGAATATTCTGATGGTAGCCAACGAAGTGGGTCGTCTGCCACCTACCTTCTTGGATGCCTCTAGGATTGCGCATGAGGTCCTGCAAACCGGTTACGAGTATTCCGAGGGACAAATCGTGTACAACAAGTTCAACTCGGTGGTATCTTACTCACTGTCCCAACTGCCCATCTACAGTGGCGCCACTGTGGAGAAGTCGGAGAAGCTGGCGGTCTTCGATTCCCTGGACGCCGATGTCATCCAGAGCTATTTGGAGTTCTCGCTGGCTTCCCTAATCTTTTACACCATGAAGGAAGGCGCTTGCTCTGAGCAATCGTCGCGTATGACTGCCATGGATAACGCCTCGAAGAATGCTGGTGAGATGATTGAAAAGTTGACGCTCATATTCAACCGCACCAGACAGGCTGTCATCACTCGCGAGCTGATTGAAATTATCTCCGGTGCCTCTGCTCTGGAG 11 | -------------------------------------------------------------------------------- /scripts/test_effect.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import os 5 | import sys 6 | 7 | def read_fasta(fasta, target, sep='.', select_list=None): 8 | seqs = {} 9 | if select_list: 10 | seqid = None 11 | for line in open(fasta): 12 | line = line.rstrip() 13 | if line.startswith('>'): 14 | arr = line.split(" ") 15 | seqid = arr[0].lstrip('>') 16 | if seqid in select_list or seqid.split(sep)[0] in select_list: 17 | seqs[seqid] = '' 18 | elif seqs.get(seqid) is not None: 19 | seqs[seqid] += line.upper() 20 | seqid = None 21 | seqstr = '' 22 | for line in open(fasta): 23 | line = line.rstrip() 24 | if line.startswith('>'): 25 | if seqstr: 26 | break 27 | arr = line.split(" ") 28 | seqid = arr[0].lstrip('>') 29 | if seqid != target and not seqid.startswith(target + sep): 30 | seqid = None 31 | elif seqid: 32 | seqstr += line.upper() 33 | # debug 34 | # print(seqid, seqstr) 35 | # print(seqs) 36 | # input() 37 | return seqstr, seqs 38 | 39 | if len(sys.argv) == 1 or sys.argv[1] == '-h': 40 | print("Usage: {} reference_dir:ref_species_or_seq_name target_dir:target_species_or_seq_name output_tsv unknown_symbol(default='N') separate(default='.') fasta_suffix(default='.fa') selected_species_or_sequences(separated by comma)".format(sys.argv[0])) 41 | sys.exit(0) 42 | if len(sys.argv) <= 3: 43 | print("Error: options < 3!\nUsage: {} reference_dir:ref_species_or_seq_name target_dir:target_species_or_seq_name output_tsv unknown_symbol(default='N') separate(default='.') fasta_suffix(default='.fa') selected_species_or_sequences(separated by comma)".format(sys.argv[0])) 44 | sys.exit(1) 45 | unknow = 'N' 46 | if len(sys.argv) > 4: 47 | unknow = sys.argv[4] 48 | sep = '.' 
49 | if len(sys.argv) > 5: 50 | sep = sys.argv[5] 51 | suffix = '.fa' 52 | if len(sys.argv) > 6: 53 | suffix = sys.argv[6] 54 | select_list = None 55 | if len(sys.argv) > 7: 56 | select_list = sys.argv[7].split(',') 57 | 58 | ref_dir, ref_sp = sys.argv[1].split(':') 59 | target_dir, target_sp = sys.argv[2].split(':') 60 | outfile = open(sys.argv[3], 'w') 61 | outfile.write("file\treference length\ttarget length\tnident\tcompleteness\tpident\n") 62 | total_rlen = 0 63 | total_tlen = 0 64 | total_nident = 0 65 | files = os.listdir(ref_dir) 66 | for filename in files: 67 | if not filename.endswith(suffix): 68 | continue 69 | ref_seq, ref_sel = read_fasta(os.path.join(ref_dir, filename), ref_sp, sep, select_list) 70 | target_seq, target_sel = read_fasta(os.path.join(target_dir, filename), target_sp, sep, select_list) 71 | indexes = [] 72 | if select_list: 73 | i = 0 74 | j = 0 75 | while i < len(list(ref_sel.values())[0]): 76 | if j >= len(list(target_sel.values())[0]): 77 | if_match = False 78 | else: 79 | if_match = True 80 | for sel in ref_sel.keys(): 81 | # to use species 82 | # sp = sel.split(sep)[0] 83 | if ref_sel[sel][i] != target_sel[sel][j]: 84 | if_match = False 85 | break 86 | if if_match: 87 | j += 1 88 | else: 89 | indexes.append(i) 90 | i += 1 91 | print(filename) 92 | str_list = list(ref_seq) 93 | for i in reversed(indexes): 94 | str_list.pop(i) 95 | ref_seq = ''.join(str_list) 96 | start = None 97 | for i in range(len(ref_seq)): 98 | if ref_seq[i] not in [unknow, '-']: 99 | start = i 100 | break 101 | end = None 102 | for i in range(len(ref_seq)-1, -1, -1): 103 | if ref_seq[i] not in [unknow, '-']: 104 | end = i 105 | break 106 | if start is None or end is None: 107 | # continue 108 | print("\nError: invalid reference in '{}': '{}'!\n".format(filename, ref_seq)) 109 | sys.exit(1) 110 | if not target_seq: 111 | rlen = 0 112 | for i in range(start, end+1): 113 | if ref_seq[i] != unknow: 114 | rlen += 1 115 | total_rlen += rlen 116 | outfile.write("{}\t{}\t0\t0\t0\t0\n".format(filename, rlen)) 117 | continue 118 | rlen = 0 119 | tlen = 0 120 | nident = 0 121 | # print(filename, start, end) 122 | for i in range(start, end+1): 123 | if ref_seq[i] == unknow: 124 | continue 125 | rlen += 1 126 | if target_seq[i] == unknow: 127 | continue 128 | tlen += 1 129 | if target_seq[i] == ref_seq[i]: 130 | nident += 1 131 | total_rlen += rlen 132 | total_tlen += tlen 133 | total_nident += nident 134 | outfile.write("{}\t{}\t{}\t{}\t{}\t{}\n".format(filename, rlen, tlen, nident, int(10000*tlen/rlen+0.5)/100, int(10000*nident/tlen+0.5)/100)) 135 | outfile.write("total\t{}\t{}\t{}\t{}\t{}\n".format(total_rlen, total_tlen, total_nident, int(10000*total_tlen/total_rlen+0.5)/100, int(10000*total_nident/total_tlen+0.5)/100)) 136 | outfile.close() 137 | -------------------------------------------------------------------------------- /scripts/alignseq.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use FindBin qw($Bin); 4 | use Getopt::Std; 5 | use strict; 6 | my %opt=('g'=>'1','a'=>'direct','t'=>'X','n'=>'1','l'=>'alignseq.log'); 7 | getopts('i:o:a:g:t:m:f:n:l:h',\%opt); 8 | usage() if $opt{h}; 9 | 10 | my $input=$opt{i}; 11 | my $output=$opt{o}; 12 | my $aligntype=$opt{a}; 13 | my $gencode=$opt{g}; 14 | my $termination=$opt{t}; 15 | my $incomplete=$opt{c}; 16 | my $ifmediate=$opt{m}; 17 | my $mafftfolder=$opt{f}; 18 | if($mafftfolder) {$mafftfolder.="/";} 19 | our $numthreads=$opt{n}; 20 | our $logfile=$opt{l}; 21 | 
open(LOG,">>$logfile") or die "\nError: $logfile can't open!\n"; 22 | 23 | printdie("\nError: no input file or output file was set!\n") unless $input&&$output; 24 | printdie("\nError: invalid alignment type was set!\n") unless $aligntype eq 'direct'||$aligntype eq 'translate'||$aligntype eq 'codon'||$aligntype eq 'complement'||$aligntype eq 'ncRNA'; 25 | printlog("\n##### Alignment begins #####\n"); 26 | if($aligntype eq 'complement') { 27 | my $mafftcommand=$mafftfolder."mafft"; 28 | runcmd("$mafftcommand --thread $numthreads --adjustdirectionaccurately $input > $output"); 29 | } 30 | elsif($aligntype eq 'ncRNA') { 31 | my $mafftcommand=$mafftfolder."mafft-qinsi"; 32 | runcmd("$mafftcommand --thread $numthreads $input > $output"); 33 | } 34 | else { 35 | my ($mafftin,$mafftout,$transfile,$aafile); 36 | if($aligntype ne 'direct') { 37 | $transfile=$input; 38 | $transfile=~s/(\.(\w+))+/\.trans\.fas/; 39 | my %parameter=('-i',$input,'-o',$transfile,'-g',$gencode,'-t',$termination,'-c',$incomplete,'-n',$numthreads,'-l',$logfile); 40 | runpl("transseq.pl",\%parameter); 41 | $mafftin=$transfile; 42 | if($aligntype eq 'codon') { 43 | $aafile=$output; 44 | $aafile=~s/(\.(\w+))+/\.aa\.fas/; 45 | $mafftout=$aafile; 46 | } 47 | else {$mafftout=$output;} 48 | } 49 | else {$mafftin=$input;$mafftout=$output;} 50 | my $mafftcommand=$mafftfolder."linsi"; 51 | runcmd("$mafftcommand --thread $numthreads $mafftin > $mafftout"); 52 | if($aligntype eq 'codon') { 53 | my %parameter=('-i',$input,'-b',$aafile,'-o',$output,'-g',$gencode,'-t',$termination,'-n',$numthreads,'-l',$logfile); 54 | runpl("revertransseq.pl",\%parameter); 55 | } 56 | if($ifmediate) { 57 | if($transfile) {unlink("$transfile") or printdie("\nError: $transfile fail to delete!\n");} 58 | if($aafile) {unlink("$aafile") or printdie("\nError: $aafile fail to delete!\n");} 59 | } 60 | } 61 | printlog("##### Alignment complished #####\n\n"); 62 | 63 | close(LOG); 64 | 65 | sub runcmd { 66 | printlog("$_[0]\n"); 67 | my $iferr=system("$_[0] 2>$logfile.alignseq.temp"); 68 | open(TEMP,"<$logfile.alignseq.temp") or printdie("\nError: $logfile.alignseq.temp can't open!\n"); 69 | while() {printlog("$_");} 70 | close(TEMP); 71 | unlink("$logfile.alignseq.temp") or printdie("\nError: $logfile.alignseq.temp fail to delete!\n"); 72 | if($iferr) { 73 | my $command=$_[0]; 74 | if($command=~/(\S+)/o) {$command=$1;} 75 | printdie("\nError in $command!\n"); 76 | } 77 | } 78 | 79 | sub runpl { 80 | my $pl=$_[0]; 81 | my $ref=$_[1]; 82 | my %parameter=%$ref; 83 | my $command=""; 84 | foreach my $x (keys %parameter) { 85 | if($parameter{$x}) { 86 | if($parameter{$x}=~/\s/) {$parameter{$x}="\'$parameter{$x}\'";} 87 | $command.=" $x $parameter{$x}"; 88 | } 89 | } 90 | printlog("$Bin/$pl $command\n"); 91 | if(system("$Bin/$pl $command")) {printdie("\nError in $pl!\n");} 92 | } 93 | 94 | sub printlog { 95 | print LOG $_[0]; 96 | printf $_[0]; 97 | } 98 | 99 | sub printdie { 100 | print LOG $_[0]; 101 | die "$_[0]\nYou can use '-h' to watch detailed help.\n"; 102 | } 103 | 104 | sub usage { 105 | 106 | die " 107 | perl $0 108 | Align sequences in a file by mafft. 
109 | Requirement: mafft 110 | 111 | Usage: 112 | -i input sequences file 113 | -o output sequences file 114 | -a type of alignment(direct/translate/codon/complement/ncRNA, default='direct', 'translate' means alignment of translation of sequences) 115 | -g genetic code(default=1, invertebrate mitochondrion=5) 116 | -t symbol of termination(default='X', mafft will clean '*') 117 | -c if translate incomplete codons into 'X'(default=no) 118 | -m if delete the intermediate files, such as translated files and aligned aa files(default=no) 119 | -f the folder where mafft/linsi is, if mafft/linsi had been in PATH you can ignore this parameter 120 | -n num threads(default=1) 121 | -l log file(default='alignseq.log') 122 | -h this help message 123 | 124 | Example: 125 | alignseq.pl -i inputfile -o outputfile -a aligntype -g gencode -t termination -c 1 -m 1 -f mafftfolder -n numthreads -l logfile 126 | 127 | Written by Yu-Hao Huang (2017-2024) huangyh45\@mail3.sysu.edu.cn 128 | "; 129 | 130 | } 131 | -------------------------------------------------------------------------------- /scripts/revertransseq.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use Bio::SeqIO; 4 | use Bio::DB::Fasta; 5 | use Getopt::Std; 6 | use Parallel::ForkManager; 7 | use strict; 8 | my %opt=('g'=>'1','t'=>'*','n'=>'1','l'=>'revertransseq.log'); 9 | getopts('i:b:o:g:t:n:l:h',\%opt); 10 | usage() if $opt{h}; 11 | 12 | my @ntfiles=split(',',$opt{i}); 13 | my $aafile=$opt{b}; 14 | my $alignfile=$opt{o}; 15 | my $gencode=$opt{g}; 16 | my $termination=$opt{t}; 17 | our $numthreads=$opt{n}; 18 | our $logfile=$opt{l}; 19 | open(LOG,">>$logfile") or die "\nError: $logfile can't open!\n"; 20 | 21 | printdie("\nError: no input file, blueprint file or output file was set!\n") unless @ntfiles&&$aafile&&$alignfile; 22 | printlog("\n##### Sequences reverse-translation begins #####\n"); 23 | my $aln=Bio::SeqIO->new(-file => "$aafile", -format => 'fasta'); 24 | my @seqs; 25 | my $pm = new Parallel::ForkManager($numthreads); 26 | $pm -> run_on_finish( sub { 27 | my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_; 28 | my @arr=@$data_structure_reference; 29 | printlog($arr[0]); 30 | if(@arr>2) {push(@seqs,$arr[2]);} 31 | else {printdie($arr[1]);} 32 | }); 33 | while(my $seq=$aln->next_seq()) { 34 | $pm->start and next; 35 | my $outstr; 36 | my $seqid=$seq->display_id; 37 | my ($db,$ntseq); 38 | foreach my $ntfile (@ntfiles) { 39 | $db= Bio::DB::Fasta->new("$ntfile"); 40 | $ntseq=$db->get_Seq_by_id($seqid); 41 | if($ntseq) {last;} 42 | } 43 | unless($ntseq) {$outstr.="\nWarning: $seqid can not be extracted in $opt{i}!\n";$pm->finish(0,[$outstr]);next;} 44 | my $seqstr; 45 | my $j=1; 46 | my $i; 47 | for($i=1;$i<=$seq->length();$i++) { 48 | my $base=$seq->subseq($i, $i); 49 | if($base eq '-') {$seqstr.="---";} 50 | elsif($j>$ntseq->length()) {$pm->finish(0,[$outstr,"\nError: no codon can be extracted in $seqid: $i-$base!"]);} 51 | else { 52 | my $n=$j+2>$ntseq->length() ? 
$ntseq->length() : $j+2; 53 | my $codon=$ntseq->subseq($j, $n); 54 | my $transbase=(Bio::Tools::CodonTable->new(-id=>$gencode))->translate($codon); 55 | $j=$n+1; 56 | if(length($codon)<3&&$base eq 'X') {$seqstr.=$codon;$outstr.="\nWarning: there is an incomplete codon in $seqid: $i-$base-$transbase($codon)\n";} 57 | elsif($base eq $transbase) {$seqstr.=$codon;} 58 | elsif($transbase eq '*') { 59 | if($base eq $termination) {$seqstr.=$codon;} 60 | else {$outstr.="\nWarning: there is an unexpected termination codon in $seqid: $i-$base-$transbase($codon)\n";$i-=1;} 61 | if($j<=$ntseq->length()) {$outstr.="\nWarning: there is a middle termination codon in $seqid: $i-$base-$transbase($codon)\n";} 62 | } 63 | else {$pm->finish(0,[$outstr,"\nError: the codon does not match the amino acid in $seqid: $i-$base-$transbase($codon)\n"]);} 64 | } 65 | } 66 | if($j+2==$ntseq->length()&&(Bio::Tools::CodonTable->new(-id=>$gencode))->is_ter_codon($ntseq->subseq($j, $j+2))) { 67 | my $codon=$ntseq->subseq($j, $j+2); 68 | $outstr.="\nWarning: there is an unexpected termination codon in $seqid: $i-end-*($codon)\n"; 69 | } 70 | elsif($ntseq->length()==$j||$ntseq->length()==$j+1) { 71 | my $codon=$ntseq->subseq($j, $ntseq->length()); 72 | $outstr.="\nWarning: there is an incomplete codon in $seqid: $i-end-($codon)\n"; 73 | } 74 | elsif($ntseq->length()!=$j-1) { 75 | $pm->finish(0,[$outstr,"\nError: nt length does not match the aa length in $seqid!"]); 76 | } 77 | my $alignseq = Bio::Seq->new( -seq => $seqstr, 78 | -id => $seqid, 79 | ); 80 | $outstr.="Reverse-translation of $seqid finished.\n"; 81 | $pm->finish(0,[$outstr,'',$alignseq]); 82 | } 83 | $pm->wait_all_children; 84 | my $out = Bio::SeqIO->new(-file => ">$alignfile", -format => 'fasta'); 85 | foreach my $seq (@seqs) {$out->write_seq($seq);} 86 | printlog("##### Sequences reverse-translation complished #####\n\n"); 87 | close(LOG); 88 | 89 | 90 | sub printlog { 91 | print LOG $_[0]; 92 | printf $_[0]; 93 | } 94 | 95 | sub printdie { 96 | print LOG $_[0]; 97 | die "$_[0]\nYou can use '-h' to watch detailed help.\n"; 98 | } 99 | 100 | sub usage { 101 | 102 | die " 103 | perl $0 104 | Used the aligned translated sequences in a file as blueprint to aligned nucleotide sequences, which means reverse-translation. 105 | 106 | Usage: 107 | -i input nucleotide sequences file or files(separated by ',') 108 | -b aligned amino acid sequences file translated by input file as blueprint 109 | -o output aligned nucleotide sequences file 110 | -g genetic code(default=1, invertebrate mitochondrion=5) 111 | -t symbol of termination in blueprint(default='*') 112 | -n num threads(default=1) 113 | -l log file(default='revertransseq.log') 114 | -h this help message 115 | 116 | Example: 117 | revertransseq.pl -i ntfile1,ntfile2,ntfile3 -b aafile -o alignedfile -g gencode -t termination -n numthreads -l logfile 118 | 119 | Written by Yu-Hao Huang (2017-2024) huangyh45\@mail3.sysu.edu.cn 120 | "; 121 | 122 | } 123 | -------------------------------------------------------------------------------- /scripts/connect.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use Bio::SeqIO; 4 | use Getopt::Std; 5 | use strict; 6 | my %opt; 7 | getopts('i:o:l:t:b:f:s:x:c:nh',\%opt); 8 | usage() if $opt{h}; 9 | my $indir=$opt{i}; 10 | my $outfile=$opt{o} ? $opt{o} : 'all.fas'; 11 | my $type=$opt{t} ? $opt{t} : 'phyloaln'; 12 | my $fill=$opt{f} ? $opt{f} : '-'; 13 | my $sep=$opt{s} ? $opt{s} : '.'; 14 | my $suffix=$opt{x} ? 
$opt{x} : '.fa'; 15 | my $blockfile=$opt{b}; 16 | my $nexus=$opt{n}; 17 | my @codonpos=split('',$opt{c}); 18 | my $list=$opt{l} ? $opt{l} : makelist($indir,$type,$sep,$suffix); 19 | my $suffixq=quotemeta $suffix; 20 | my $sepq=quotemeta $sep; 21 | open(F,"<$list") or die "\nError: $list can't open!\n"; 22 | open(FF,">$outfile") or die "\nError: $outfile can't open!\n"; 23 | if($blockfile) { 24 | open(B,">$blockfile") or die "\nError: $blockfile can't open!\n"; 25 | if($nexus) {print B "#nexus\nbegin sets;\n";} 26 | } 27 | while() { 28 | chomp; 29 | my $taxon=$_; 30 | my $str; 31 | my $num=0; 32 | while(<$indir/*$suffix>) { 33 | my $in=Bio::SeqIO->new(-file => "$_", -format => 'fasta'); 34 | my $gene=$_; 35 | $gene=~s/(([\w\.]+)\/)+//; 36 | $gene=~s/$suffixq//; 37 | my ($find,$len); 38 | while(my $seq=$in->next_seq()) { 39 | my $id=$seq->display_id; 40 | $len=$seq->length() unless $len; 41 | if($type eq 'phyloaln'&&$id=~/([^$sepq]+)$sepq/) {$id=$1;} 42 | elsif($type eq 'blastsearch') {$id=~s/(\|(\S+))+//;$id=~s/\_$gene//;} 43 | elsif($type eq 'orthograph') { 44 | if($id=~/^(\w+)\|(\w+)\|/) {$gene=$1;$id=$2;} 45 | else {die "Error: Invalid format of $id in $_!\n";} 46 | } 47 | if($taxon eq $id) {$find=1;$str.=$seq->seq;last;} 48 | } 49 | unless($find) { 50 | printf "No $gene in $taxon!\n"; 51 | for(my $i=0;$i<$len;$i++) {$str.=$fill;} 52 | } 53 | if($blockfile) { 54 | my $start=$num+1; 55 | $num+=$len; 56 | if(@codonpos) { 57 | for(my $i=0;$i<@codonpos;$i++) { 58 | if($nexus) {print B "charset ";} 59 | my $cstart=$start+$codonpos[$i]-1; 60 | print B "$gene\_codon$codonpos[$i]\t=\t$cstart-$num\\3;\n"; 61 | } 62 | } 63 | else { 64 | if($nexus) {print B "charset ";} 65 | print B "$gene\t=\t$start-$num;\n"; 66 | } 67 | } 68 | } 69 | $str=~s/\s//g; 70 | print FF "\>$taxon\n$str\n"; 71 | if($nexus) {print B "end;\n";} 72 | close(B); 73 | $blockfile=""; 74 | } 75 | close(F); 76 | 77 | sub makelist { 78 | my $dir=$_[0]; 79 | my $type=$_[1]; 80 | my $sep=$_[2]; 81 | my $suffix=$_[3]; 82 | my %taxon; 83 | my $suffixq=quotemeta $suffix; 84 | my $sepq=quotemeta $sep; 85 | if($type eq 'phyloaln') { 86 | while(<$dir/*$suffix>) { 87 | my $in=Bio::SeqIO->new(-file => "$_", -format => 'fasta'); 88 | my $gene=$_; 89 | $gene=~s/(([\w\.]+)\/)+//; 90 | $gene=~s/$suffixq//; 91 | while(my $seq=$in->next_seq()) { 92 | my $id=$seq->display_id; 93 | if($id=~/([^$sepq]+)$sepq/) {$id=$1;} 94 | $taxon{$id}=1; 95 | } 96 | } 97 | } 98 | elsif($type eq 'blastsearch') { 99 | while(<$dir/*$suffix>) { 100 | my $in=Bio::SeqIO->new(-file => "$_", -format => 'fasta'); 101 | my $gene=$_; 102 | $gene=~s/(([\w\.]+)\/)+//; 103 | $gene=~s/$suffixq//; 104 | while(my $seq=$in->next_seq()) { 105 | my $id=$seq->display_id; 106 | $id=~s/(\|(\S+))+//; 107 | $id=~s/\_$gene//; 108 | $taxon{$id}=1; 109 | } 110 | } 111 | } 112 | elsif($type eq 'orthograph') { 113 | while(<$dir/*$suffix>) { 114 | my $in=Bio::SeqIO->new(-file => "$_", -format => 'fasta'); 115 | while(my $seq=$in->next_seq()) { 116 | my $id=$seq->display_id; 117 | if($id=~/^(\w+)\|(\w+)\|/) {my $taxa=$2;$taxon{$taxa}=1;} 118 | else {die "Error: Invalid format of $id in $_!\n";} 119 | } 120 | } 121 | } 122 | my $list='list'; 123 | open(L,">$list") or die "\nError: $list can't open!\n"; 124 | foreach my $taxa (keys %taxon) {print L "$taxa\n";} 125 | close(L); 126 | return $list; 127 | } 128 | 129 | sub usage { 130 | 131 | die " 132 | perl $0 133 | Concatenate multiple alignments into a matrix. 
134 | 135 | Usage: 136 | -i directory containing input FASTA alignment files 137 | -o output concatenated FASTA alignment file 138 | -t type of input format(phyloaln/orthograph/blastsearch, default='phyloaln', also suitable for the format with same species name in all alignments, but the name shuold not contain separate symbol) 139 | -f the symbol to fill the sites of absent species in the alignments(default='-') 140 | -s the symbol to separate the sequences name and the first space is the species name in the 'phyloaln' format(default='.') 141 | -x the suffix of the input FASTA alignment files(default='.fa') 142 | -b the block file of the positions of each alignments(default=not to output) 143 | -n output the block file with NEXUS format, suitable for IQ-TREE(default=no) 144 | -c the codon positions to be written in the block file(default=no codon position, '123' represents outputing all the three codon positions, '12' represents outputing first and second positions) 145 | -l the list file with all the involved species you want to be included in the output alignments, one species per line(default=automatically generated, with all species found at least once in all the alignments) 146 | -h this help message 147 | 148 | Example: 149 | connect.pl -i inputdir -o outputfile -t inputtype -f fillsymbol -s separate -x suffix -b block1file -n -c codonpos -l listfile 150 | 151 | Written by Yu-Hao Huang (2018-2024) huangyh45\@mail3.sysu.edu.cn 152 | "; 153 | 154 | } 155 | 156 | -------------------------------------------------------------------------------- /tests/ref/OG0003820.fa: -------------------------------------------------------------------------------- 1 | >DBUSC.3865 2 | ATGTTTGCGTTACGACATTTGTGTCTGCAGGGCCGCTTAAGAATGCATTTTCCACCCGCAgtagcagcgacagcaactacaattacaacaacaaaacacatgTCAGCTGTGCCGGCGCGTGTTTTGGTTGAGatgcacacacatgccaaGAGCTGCTACGGTAGACATAAGTCTCTATTGCTCCAAGAGCGAATGACAACCTGGACGCTAAACCAAAGAACAACGACCACGCTTTCTGCAACAGATTCCAATGAACAAAAGAAACCGCCGACAAAGTCGCCGCTGCAAGAGTTAGTGGCTGCTGCAAGACCATATGCACAGCTAATGCGTATTGATCGGCCAATTGGAACGTACTTGCTGTTCTGGCCATGCGGCTGGAGCATAGCTCTTAGCGCTAACGCTGGCTGTTGGCCAGATTTTGCCATGCTAGGATTGTTTGCCACGGGCGCGCTAATTATGCGTGGTGCCGGTTGCACCATTAACGATCTCTGGGACAAGGACATAGATGCAAAAGTTGAGCGTACGCGAAGTCGCCCCCTAGCTGCTGGAAAGATTACACAATTTGACGCTATAGTTTTTCTCTCGGCGCAACTCAGTTTGGGCTTGCTTGTGCTCGTGCAGCTGAACTGGCAATCTATACTGCTGGGTGCCAGTTCGCTGGGTCTAGTGATCACGTATCCACTAATGAAGCGTGTCACATACTGGCCACAACTAGTGCTTGGCATGTGCTTTAACTGGGGTGCGCTCTTGGGCTGGTGTGCTACACAGGGCAGCGTAAACCTAGATGCCTGTTTACCACTTTACTTGTCAGGCGTGTGCTGGACAATTGTGTATGACACCATCTATGCACACCAGGATAAACTGGACGATATGCAGATTGGCGTCAAGTCAACGGCACTGCGTTTTGGCGAGAATACTAAAATGTGGCTATCAGGATTTACTGCGGCGATGCTGACGGGTCTCTCGGCTGCTGGTTATGCCTGCGATCAAACACTTCCATACTACGCAGCCGTTAGCATAGTTGGTGCTCATCTGGCACAGCAGATCTACTCACTGAACATAGATAATCCTAGCGATTGCGCAAAAAAGTTCTTTTCGAACCAACAAGTTGgcctcattttatttttaggcatTGTGCTGGGCACACTGCTCAAGTCGGACGATCCTAAGCAGCAGCGTAAATCTTTAAGCACACCCGCATCCACAGCTGCTTATGTGCCGTTGCCCCAAACGAAACCCGATGTAATTAGC 3 | >DMOJA.1254 4 | 
ATGTTTGCGTTACGACACTTGCGACTACAAGGCAGGGCTAAAATACATTTGCCATATGCTGCGatggcaacaactgcaacagcaacaactacaaaacaCATGCCAGCTGTGCCGGCGCGTGTCTATGTGggtctacatacatacagcagCGAGTATCGCATGCCAAgactgctgatgctgcaggAACATATGACAAGTTGGAGGCTCCACCAAAGAACCACAACAACGCTGTCGAAATCTTCGAACCAGGAGAGGAAGACGCCGCATAAAGCAACACTGTTACAGGAACTGGCCGACGCCGCTAAGCCCTACGCACAGCTGATGCGTATAGACCGACCCATCGGGACATATCTGCTCTTCTGGCCGTGCGGCTGGAGCATAGCTCTGAGTGCTGACGCCGGCTGCTGGCCAGACTTTTCGATGCTGGGACTCTTTGCATTGGGCGCATTGATTATGCGCGGCGCGGGCTGCACCATCAACGACATGTGGGACAAAGATATCGACGCAAAGGTGGAAAGAACGCGCACTCGTCCTCTGGCCTCTGGACAGATTACGCAGTTCGATGCGATTGTATTTCTCTCGGCGCAGCTTAGCCTGGGTTTGCTTGTGCTGGTGCAGCTGAACTGGCAGTCCATCCTTCTGGGCGCCAGTTCGCTGGGTTTGGTAATTACGTACCCGCTGATGAAGCGCGTCACGTACTGGCCCCAGCTGGTGCTCGGCATGTGTTTTAACTGGGGCGCCCTGCTTGGCTGGTGTGCCACTCAGGGAAGCGTTAATCTGGAAGCCTGCCTGCCACTCTATCTATCCGGTGTGTGCTGGACCATTGTCTACGACACCATCTATGCCCACCAAGACAAGCTGGACGACTTGCAGATAGGCGTCAAGTCGACGGCGCTTCGTTTTGGAGAGAACACAAAAGTCTGGCTCTCCTGCTTTACAGCTGCTATGCTGGCGGGTCTCACCTCTGCCGGCTATGCCTGTGACCAGACTCTGCCCTATTATACCGCAGTGGGCGTCGTCGGAGCGCACTTGGTGCAGCAGATCTACTCCCTCAACATAGACAATCCCAGTGACTGCGCTAAGAAGTTTCTCTCCAACCAGCAGGTGGGCCTGATTCTTTTCCTGGGTATCGTGCTGGGCACTCTGCTGAAGTCGGACGACAGCAAGAAACAGAGCAAAGCGACGCTAGCACCTGTGACAGCTCCATCATATGTACCTTTACCCCAATCAAAGCCGGACGTAATCAGC 5 | >DPSEU.12475 6 | ATGTTTGCGGTACGACATTTGCTGAAGAGCAGAAAGCATTTTCCCTACGCTTATGCGGCggcgacaacaacaaagagcAGGCTGCCAATGCCAGCTGTGCCGGCGCGTGTTCTTATTGGCCTCCACACAGATAGTGATTGCCGCAACGAGAGGCTACCGCAGATCCAGGAGCTTTCTTTTCGCAAGATGTCTACGCTGCCAACATCCAAGAAGCCAGGATCGGTGCTCGAAGAGCTGTACGCAGCTACGAAACCATATGCCCAGCTGATGAGGATTGATCGACCCATTGGCACTTACTTGTTGTTCTGGCCCTGTGCGTGGAGCATAGCGTTGAGTGCAGATGCAGGCTGTTGGCCGGACCTTACCATGCTGGGCTTGTTTGGAACGGGAGCATTGATAATGCGTGGCGCCGGCTGCACTATAAACGATCTCTGGGACAAGGATATCGACGCCAAGGTGGAGCGAACGCGGACGCGGCCACTCGCATCCGGGCAGATTAGTCAGTTCGATGCTATTGTCTTTCTCTCGGCACAGCTGAGTCTCGGACTACTGGTTCTTGTCCAGCTCAACTGGCAGTCAATTCTGTTGGGCGCCAGCTCACTGGGGCTGGTGATCACTTATCCCTTGATGAAGAGGGTGACCTATTGGCCCCAGTTGGTTCTTGGCATGGCCTTCAACTGGGGTGCTTTGTTGGGATGGTGTGCAACACAGGGAAGCGTCAACCTGGCCGCTTGTCTTCCGCTCTACCTGTCTGGTGTTTGCTGGACCATTGTCTACGACACAATCTACGCCCACCAGGATAAGCTAGACGATTTGCAAATCGGTGTCAAATCCACTGCTTTGCGTTTTGGTGAGAACACCAAAGTTTGGCTATCAGGTTTCACCGCAGCCATGCTGACGGGTCTCTCTACCGCCGGCTGGGCCTGCGATCAAACGCTGCCGTACTATGCCGCTGTTGGAGTTGTTGGCGCTCATCTTGTGCAGCAGATCTACTCCCTCAACATTGACAATCCCACCGACTGCgccaagaaatttttatcgaATCATCAGGTAGGACTCATTCTGTTTCTTGGAATCGTCCTTGGCACACTACTGAAAGCGAACGACACTAAAAATCAGCCGCAACCCGCACTAACATCATCGGCAGCCAGCTCCTATGCTTCGCTAACTCAAAAACCAGAAGTTTTGAGC 7 | >DYAKU.10021 8 | 
ATGTATGCGCTACGACACCTGCGACTCCAGAGCGCACGACACCTCCGCAGCTCTTAtgcagcggcggcaacaacaaaacacatGCTGCCCCGGCAACCAGCGCGTGTTCTGATTGGAGATGGGAGCACCTGGGATAAGTACCAAGTACAGGATGTATACTCCAGGAGTTCGAGTACCGCCACTGAGCCCGTGAAGCCGCAAACGCCGCTGCAGGAACTGGTGTCAGCCGCCAAACCCTATGCCCAACTGATGCGGATCGACCGGCCTATTGGCACCTACCTCCTCTTCTGGCCCTGCGCCTGGAGCATAGCGCTCAGCGCGGATGCGGGTTGCTGGCCGGACCTGACCATGCTCGGTCTGTTTGGCACCGGGGCACTGATAATGCGCGGCGCCGGGTGCACCATTAACGATCTCTGGGACAAGGACATCGATGCCAAGGTGGAGCGCACAAGATTGCGGCCCTTGGCCTCGGGACAAATCAGCCAGTTCGATGCCATAGTATTCCTCTCGGCTCAGCTTAGTCTGGGTCTTTTGGTGCTGGTCCAGCTCAACTGGCAGTCCATATTGTTGGGCGCCAGTTCTCTGGGTCTCGTAATCACCTATCCACTCATGAAAAGAGTCACCTACTGGCCCCAGCTGGTTCTGGGCATGGCTTTCAACTGGGGCGCCCTACTGGGATGGTGTGCCACCCAGGGCAGTGTTAATCTGGCCGCCTGCCTGCCGCTCTACCTTTCCGGTGTATGCTGGACCATTGTGTACGACACCATATACGCCCACCAAGACAAGCTGGATGACCTGCAAATCGGCGTGAAATCCACGGCTCTGAGATTTGGCGAGAACACCAAGGCTTGGCTGTCTGGATTCACGGCAGCCATGCTGACTGGTCTTTCCGCCGCTGGCTGGGCCTGCGATCAAACGGTGCCCTACTACGCGGCTGTTGGAGTAGTGGGTGCCCATCTAGTGCAGCAGATCTACTCCCTCAACATTGACAACCCCAGCGACTGCGCCAAGAAGTTCCTATCGAACCATAAAGTAGGACTCATTCTATTCCTTGGCATTGTTTTGGGCACCCTTCTGAAATCAGACGAGACCAAGAAACAGCGCCAATCCTCACTGACAACATCTACGGCCAGCTCATACGTTCCAGCGCTGCCGCAAAAGCAAGAAGTTATAAGC 9 | >SLEBA.9410 10 | ATGTTCGCCCTACGCCAGATGCGACTCCAAGGTAGAATTCACATGCCATATTCAGCAGCAAGAGCGAATATAACACCGATCTTGCCGGCGCGTGTTCTAATTagcctgcacacacacacttatgaCCACCGACAGAGGCTATCACAACATTGCAAGCACATGGCAATTGCAAAGCCGCAAATGTATTTGCAGCGAACCAGTTCCACCCTAAGCGCGCCTAAAGAACAGAGTGAAACACCTGATGGGCGTTCCACAAAGTCCTTGATGGAGGAATTGAGCACCAGTGTCAGGCCTTACAAGCAGCTGATGCGCTTAGATCGCCCAATAGGAACGTACCTACTGTTTTGGCCCTGCGGCTGGAGTATCGCGTTGAGCGCGGATGCCGGCTGTTGGCCCGACTTAACGATGTTAGGGTTGTTTGCCACAGGTGCTTTAATTATGCGCGGTGCCGGTTGCACCATCAACGATCTGTGGGACAAGGACATTGATGCAAAGGTAGAGCGTACACGCTCTCGTCCATTAGCATCTGGTCAAATAACACAGTTCGATGCCATAGTGTTTTTATCGGCACAGCTGAGCTTGGgcctgctggtgctggtgcagCTGAACTGGCAGTCTATACTGCTGGGCGCCAGCTCCTTGGGCCTGGTTATTACGTACCCGCTTATGAAACGTGTAACATACTGGCCACAACTAGTGCTGGGCATGGCCTTCAACTGGGGCGCCCTGCTGGGTTGGTGTGCCACTCAAGAAAGCATCAATTTAGCCGCCTGCCTACCGCTTTATCTTTCGGGTGTATGCTGGACTATTGTATATGACACGATCTATGCGCACCAAGACAAGCTTGACGATTTGCAAATTGGCGTTAAGTCGACGGCTCTGCGATTCGGCGAAAATACGAAAGTGTGGCTCTCCGGTTTTACAGCAGCCATGCTCACCGGCCTTTCCGCGGCAGGATGGGCATGTGATCAAACGCTGCCCTACTACGCTTCTGTTGGAATAGTTGGCGCACATTTGGCTCAACAGATCTACTCGCTGAACATAGACAATCCGAGTGATTGCGCCAAGAAATTCTTTTCAAATCATCAGGTTGGTCTCATTCTCTTTCTTGGCATTGTGCTTGGCACGCTCCTTAAGTCGAAAGACGCAAAAAAACAACGACAAACTGCGCCCACATCTACAACAGCCAACACCTATGTAGCGCTACCAGCAAACCCCGAGGTTATAAGc 11 | -------------------------------------------------------------------------------- /lib/library.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import shutil 4 | import gzip 5 | import subprocess 6 | from multiprocessing import Pool, set_start_method 7 | from Bio.Seq import translate 8 | from Bio.SeqIO import index_db 9 | 10 | parentdir = os.path.dirname(os.path.realpath(__file__)) 11 | PhyloAlndir = os.path.dirname(parentdir) 12 | epath = os.environ['PATH'] 13 | 14 | # check if the programs exist in the path 15 | def check_programs(progs): 16 | for program in progs: 17 | fullpath = shutil.which(program) 18 | if not fullpath: 19 | print("\nError: fail to find {}! 
Please install it and add it to the PATH!".format(program)) 20 | sys.exit(1) 21 | 22 | # run the command 23 | def runcmd(cmd, log, env=None, stdout=True, error=False): 24 | if env: 25 | # add the environment into the path 26 | os.environ['PATH'] = env + ':' + epath 27 | if stdout: 28 | print('PATH: +' + env) 29 | log.write('PATH: +' + env + "\n") 30 | log.write(' '.join(cmd) + "\n") 31 | if stdout: 32 | print(' '.join(cmd)) 33 | p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 34 | # synchronous output to the log file and the screen 35 | while p.poll() is None: 36 | cmdout = p.stdout.readline().decode("utf8") 37 | if cmdout: 38 | log.write(cmdout) 39 | if stdout: 40 | print(cmdout.rstrip("\n")) 41 | 42 | # restore to the original environment 43 | os.environ['PATH'] = epath 44 | 45 | # if error occurs, stop the script and exit 46 | if p.returncode != 0: 47 | if error: 48 | print("\nError in '" + ' '.join(cmd) + "'!") 49 | sys.exit(1) 50 | else: 51 | # if not 'error', return the state 52 | return False 53 | elif not error: 54 | return True 55 | 56 | # run the function by single process with pbar 57 | def run_sp(function, args_list, kwds={}, total=None, finish=0): 58 | # If total number is set, the number will be used, otherwise the length of the list will be used as total number 59 | if total is None: 60 | total = len(args_list) 61 | results = [] 62 | 63 | for args in args_list: 64 | if kwds: 65 | # add the fixed parameters 66 | result = function(*args, **kwds) 67 | else: 68 | result = function(*args) 69 | results.append(result) 70 | # refresh the pbar 71 | finish += 1 72 | sys.stdout.write("[{}] {}{}/{} ({}%)\r".format(("+" * int((finish/total)*50)) + (" " * (50 - int((finish/total)*50))), (" " * (len(str(total))-len(str(finish)))), finish, total, '%.2f' %(finish/total*100))) 73 | sys.stdout.flush() 74 | 75 | return results 76 | 77 | # run the function by multiprocess with pbar 78 | def run_mp(function, args_list, cpus, kwds={}, sp_list=[]): 79 | multinum = len(args_list) 80 | total = multinum + len(sp_list) 81 | results = [] 82 | 83 | if cpus == 1: 84 | # when using 1 cpu, use the function directly instead of multiprocess 85 | results.extend(run_sp(function, args_list, kwds=kwds, total=total)) 86 | elif multinum > 0: 87 | # use multiprocess 88 | # setup multiprocess method using less memory 89 | try: 90 | set_start_method('spawn') 91 | except RuntimeError: 92 | # avoid setting repeatedly 93 | pass 94 | # setup pool 95 | p = Pool(cpus) 96 | finish = 0 97 | # setup results and split over cpus 98 | for args in args_list: 99 | if kwds: 100 | # add the fixed parameters 101 | results.append(p.apply_async(function, args=args, kwds=kwds)) 102 | else: 103 | results.append(p.apply_async(function, args=args)) 104 | # print the pbar 105 | while True: 106 | finish_task = sum(1 for result in results if result.ready()) 107 | if finish_task == finish: 108 | continue 109 | # when new task finished, refresh the pbar 110 | finish = finish_task 111 | sys.stdout.write("[{}] {}{}/{} ({}%)\r".format(("+" * int((finish/total)*50)) + (" " * (50 - int((finish/total)*50))), (" " * (len(str(total))-len(str(finish)))), finish, total, '%.2f' %(finish/total*100))) 112 | sys.stdout.flush() 113 | if finish == multinum: 114 | break 115 | p.close() 116 | p.join() 117 | # extract the outputs of the function 118 | results = [result.get() for result in results] 119 | 120 | # finally manage the list of single process if set 121 | if sp_list: 122 | results.extend(run_sp(function, sp_list, kwds=kwds, 
finish=multinum, total=total)) 123 | 124 | # end the pbar 125 | print("\n") 126 | return results 127 | 128 | # read the FASTQ or FASTA file 129 | def read_fastx(fastx, file_format='guess', select_list=None, low_mem=False): 130 | seqs = {} 131 | if file_format == 'large_fasta': 132 | db_dict = index_db(fastx + '.idx', fastx, 'fasta') 133 | for seqid, seqinfo in db_dict.items(): 134 | if select_list is None or seqid in select_list: 135 | seqs[seqid] = str(seqinfo.seq) 136 | db_dict.close() 137 | return seqs 138 | elif fastx.endswith('.gz'): 139 | reads = gzip.open(fastx, 'rt') 140 | else: 141 | reads = open(fastx) 142 | if file_format == 'guess': 143 | for line in reads: 144 | if line.startswith('@'): 145 | file_format = 'fastq' 146 | elif line.startswith('>'): 147 | file_format = 'fasta' 148 | break 149 | # recover the file iteration 150 | if fastx.endswith('.gz'): 151 | reads = gzip.open(fastx, 'rt') 152 | else: 153 | reads = open(fastx) 154 | print("Detected format: {}".format(file_format)) 155 | if low_mem: 156 | # if using low-memory mode, return the file iteration and file format instead of reading the sequences 157 | return reads, file_format 158 | seqid = None 159 | line_num = 0 160 | for line in reads: 161 | line = line.rstrip() 162 | if file_format == 'fastq' and line.startswith('@') and line_num % 4 == 0: 163 | seqid = line.replace('@', '', 1).replace(' ', '_') 164 | if select_list is not None: 165 | if seqid not in select_list: 166 | seqid = None 167 | elif file_format == 'fasta' and line.startswith('>'): 168 | arr = line.split(" ") 169 | seqid = arr[0].lstrip('>') 170 | if select_list is None or seqid in select_list: 171 | seqs[seqid] = '' 172 | else: 173 | seqid = None 174 | elif seqid: 175 | if file_format == 'fastq': 176 | seqs[seqid] = line 177 | seqid = None 178 | else: 179 | seqs[seqid] += line 180 | line_num += 1 181 | return seqs 182 | 183 | # translate the (gappy) sequences in a FASTA file 184 | def trans_seq(filename, output, gencode=1, dna_codon_unknow=None): 185 | seqs = read_fastx(filename, 'fasta') 186 | outfile = open(output, 'w') 187 | for seqid, seqstr in seqs.items(): 188 | tran_str = '' 189 | i = 0 190 | while i < len(seqstr): 191 | if seqstr[i:i+3] == '---': 192 | tran_str += '-' 193 | elif dna_codon_unknow: 194 | try: 195 | tran_str += translate(seqstr[i:i+3], table=gencode) 196 | except: 197 | tran_str += dna_codon_unknow 198 | else: 199 | tran_str += translate(seqstr[i:i+3], table=gencode) 200 | i += 3 201 | outfile.write(">{}\n{}\n".format(seqid, tran_str)) 202 | outfile.close() 203 | 204 | -------------------------------------------------------------------------------- /tests/ref/OG0003531.fa: -------------------------------------------------------------------------------- 1 | >DBUSC.12359 2 | 
ATGGTGCTAGACCTGGATTTGTTTCGCAGCGACAAGGGCGGCAACCCTGATGCAGTCCGTGAGAATCAAAAGAAGCGTTTTAAAGATGTGGGACTTGTAGAGGCAGTCATTGAAAAAGACACGGAATGGCGACAGCGCCGTCATCGAGCCGACAATCTGAACAAGGTGAAAAATGTCTGCAGCAAGGTAATCGGTGAGAAAATGAAGAAGAAGGAGCCCGTAGGCCCCGAGGGCGAGGAGGTGCCAGCTGCTATACGAGTAGATTTGACGCAAATAACTGCGGAGACACTACAAGCGTTGACAGTGAATCAGATCAAACAGCTGCGCTTACTCATCGATGATGCGATGACGGACAATCAAAAGTCTCTGGAGCTGGCTGAACAGACGCGCAATACGGCACTGCGTGAGGTGGGCAATCATCTGCACGATTCTGTGCCCGTGTCTAATGACGAGGAGGAGAATCGTGTTGAGCGAACCTTTGGGGACTGCGAGAAGCGCGGCAAGTACTCACATGTGGATCTAATTGTGATGATCGATGGCATGAATGCGGAGAAGGGCGCTGTCGTATCTGGAGGACGTGGCTACTTCCTTACTGGCGCTGCTGTATTTCTGGAGCAAGCGCTGATTCAGCATGCGCTTCAGCTGCTCTATACGCGTGACTATACTCCACTTTATACGCCCTTCTTTATGCGCAAAGAGGTGATGCAAGAGGTGGCGCAGCTCTCTCAGTTCGACGAGGAGCTTTACAAAGTGGTTGGCAAGGGAAGCGAGCGTGCCGAAGAGTGTGGCACCGATGAAAAGTACCTGATAGCCACCTCGGAGCAGCCTATTGCGGCGTATCATCGCGACGAGTGGCTGCCAGAGTCATCGCTACCCATTAAATACGCCGGCTTGTCGACTTGCTTCCGCCAGGAGGTAGGCTCGCATGGACGCGATACTCGTGGCATTTTCCGCGTGCACCAGTTTGAGAAGGTGGAACAGTTCGTACTGACTTCTCCACATGACAACAAATCGTGGGAGATGATGGACGAGATGATTGGCAACGCAGAGCAGTTCTGCCAATCTCTGGGCATACCATATCGCGTGGTTAACATTGTTTCTGGAGCTCTCAATCACGCCGCTTCCAAGAAACTCGATCTGGAGGCCTGGTTTGGCGGCAGCGGTGCCTTCAGGGAGCTGGTTTCCTGCTCCAATTGCCTAGACTACCAGGCTCGTCGCTTGCTGGTGCGCTACGGTCAGACCAAGAAGATGAACGCGGCTGTGGACTATGTGCACATGCTTAATGCTACAATGTGCGCTGCTACTCGTGTTATCTGCGCCATTCTCGAGACGCACCAAACAGAAACGGGCGTCAAGGTGCCCGAGCCCCTCAAGAAGTATATGCCGGCGAAGTTCCAGGACGAGATTCCTTTCGTTAAGCCAGCACCCATTGATCTGGAGCTTGCCGCTGCGGCCAATCAAAAGGCCAAGAAGGATAAGACCAAGAAGGATCCAGCCGCTGCC 3 | >DMOJA.5365 4 | ATGGTTTTGGATTTGGATCTATTTCGCAAAGACAAGGGCGGCAATCCCGATGCGGTGCGCGAGAATCAAAAGAAGCGCTTCAAAGATGTCGGCCTCGTCGAGACCGTCATCGAAAAGGACACCGAATGGCGTCAGCGTCGGCATCGAGCCGACAACCTGAACAAGGTGAAGAACGTCTGCAGCAAAGTTATTGGcgagaagatgaagaagaaggagcCCCTAGGCGCCGATGGCGAGGAGGTGCCCGCTTCGGTACGCTCGGATCTAACGCAAATCACAGCCGAAACCCTGCAAGCGTTGACGGTGAATCAAATCAAGCAATTGCGGCTGCTCATCGATGATGCGATGACCGAGAATCAAAAAGCGCTGGAGCTGGCAGAGCAAACCAGGAATACTTCGCTGCGTGAGGTGGGCAATCACTTACACGAATCGGTGCCCGTTTCCAACGACGAGGACGAGAATCGTGTGGAGCGCACATTTGGTGATTGTACGAAGCGTGGCAAATATTCGCATGTGGATTTGATTGTTATGATCGATGGCATGAATGCCGAGAAGGGCGCCGTGGTGTCCGGTGGCCGTGGTTACTTCCTAACTGGCGCCGCTGTCTTCCTGGAGCAAGCCGTCATACAGCATGCCCTGCATTCGCTCTACCAGAAGGACTATGTGCCCCTTTATACGCCCTTCTTTATGCGCAAGGAGGTGATGCAGGAAGTCGCCCAGCTGTCGCAGTTCGACGAAGAGCTGTACAAGGTGGTGGGCAAGGGCAGCGAACGGGCCGAGGAGAGCGGCACAGATGAGAAATACCTGATAGCCACCTCGGAGCAGCCGATAGCCGCCTATCATCGTGATGAATGGCTGCCAGAGGCATCGCTACCAATCAAGTACGCTGGCCTCTCCACGTGCTTCCGCCAAGAGGTGGGCTCCCATGGACGTGATACGCGAGGCATCTTCCGGGTGCATCAGTTCGAGAAGGTGGAACAATTTGTGCTGACCTCGCCGCATGACAACAAATCGTGGGAGATGATGGACGAGATGATTGGCAATGCGGAGCAGTTCTGCCAATCTCTAGGCATCCCATACCGCATTGTCAACATTGTTTCGGGCGCACTGAATCATGCCGCCTCTAAGAAGCTCGATTTAGAGGCCTGGTTCGGCGGCAGTGGCGCCTACAGAGAGCTTGTCTCCTGCTCCAATTGTCTGGACTACCAGGCACGTCGACTGCTGGTGCGCTATGGCCAGACCAAGAAGATGAATGCAGCTGTGGACTACGTCCACATGCTGAATGCCACAATGTGCGCTGCCACGCGTGTGATTTGCGCCATCCTGGAGACCCATCAGACGGAGACGGGCATCAAGGTGCCGGAGCCTTTAAAGAAATACATGCCCTCCAAATTTCAGGATGAAATTCCATATGTGAAGCCAGCGCCAATTGACTTGGAACAAGCTGCCGCCGACAAACAGAAGTCCAAGAAGGAGAAGCCGAAGAAGGATCCTGCAGCTGCC 5 | >DPSEU.1497 6 | 
ATGGTGCTGGATTTGGATCTGTTTCGCAGCGACAAGGGCGGCAATCCCAATGCCGTGCGCGACAACCAGAAGAAGCGCTTCAAAGATGTAGCCCTGGTGGAGACTGTGATTGAGAAAGACACCGAGTGGCGTCAGTGCCGTCATCGGGCCGACAACCTGAACAAGGTGAAGAATGTGTGCAGCAAGGTGATTGGCgagaagatgaagaagaaggAGCCAGTGGGTGCCGTGAGCGAGGAGCTGCCTGCTGCGGTAACCACAAGCCTAACTGAGATTGTGCCCGAAACTCTGCAGCCGCTGACGGTCAACCAGATCAAGCAGCTGCGCGTGCTCATTGACGATGCCATGACGGAGAACCAAAAGTCCCTGGAGCTGGCGGAGCAGACCAGAAACACCTCTTTACGAGAGGTCGGCAATCACCTTCACGATTCGGTCCCTGTGTCTAACGATGAAGAGGAGAACCGCGTCGAGAGGACCTTTGGCGATTGCGAAAAGCGTGGCAAGTACTCGCATGTGGATCTGATTGTAATGATTGATGGCATGAATGCCGAGAAAGGATCTGTGGTATCTGGGGGACGCGGCTATTTCCTCACCGGTGCCGCTGTCTTTCTTGAGCAGGCGCTCATTCAACATGCCCTGCACTTGCTGTACGCTAAGGACTATGTCCCCCTATATACGCCCTTCTTTATGCGCAAGGAGGTGATGCAAGAGGTGGCCCAGCTCTCGCAGTTCGACGAGGAGCTCTACAAGGTGGTGGGCAAGGGAAGCGAGAAAGCTGAAGAGGCTGGCACCGACGAGAAGTATCTGATTGCCACCTCTGAGCAGCCCATTGCCGCCTATCATCGCGACGAATGGCTCCCGGAGGCTTCGCTACCCATCAAATATGCCGGCTTGTCCACGTGTTTCCGTCAGGAGGTAGGCTCCCATGGCCGCGACACTCGTGGCATATTCCGCGTGCATCAGTTTGAGAAGGTCGAACAGTTTGTGCTCACTTCCCCACACGACAACAAATCCTGGGAGATGATGGACGAGATGATTGGCAATGCCGAGCAGTTCTGTCAGTCCCTGGGTATTCCATATCGCGTTGTGAACATAGTGTCCGGTGCCCTCAATCATGCCGCCTCCAAGAAGTTAGATCTGGAGGCCTGGTTTGGCGGCAGCGGTGCCTACAGAGAGCTCGTCTCCTGCTCCAATTGCCTGGACTACCAGGCGCGTCGTCTGCTGGTTCGCTTTGGCCAAACCAAGAAGATGAACGCCGCCGTGGACTACGTACACATGTTGAATGCGACAATGTGCGCTGCCACACGTGTCATTTGCGCCATTCTGGAAACGCATCAGACAGAGACGGGCATCAAGGTGCCGGAACCACTCAAGAAATACATGCCGGCTAAGTTCCAAGATGAGATTCCGTTTGTCAAGCCCGCTCCCATCGATCTGGAGCTGGCCGCGGCCGAGAAACAGAAGGGAAAGAAGGACAAGAGCAAGAAGGATCCAGCTGCCGGT 7 | >DYAKU.11913 8 | ATGGtgctggatctggatctgttTCGCAGCGACAAGGGAGGCAACCCGGACCTCGTGCGCGAAAACCAAAAGAAGCGCTTCAAGGATGTGGCGCTGGTGGAGACGGTGATCGCCAAGGATACTGAGTGGCGTCAGTGCCGCCACCGTGCCGACAACCTGAACAAGGTGAAGAACGTCTGCAGCAAGGTGATCGGCGAGAAGATGAAGAAGAAGGAGCCGGTGGGTGCAATGAGCGAGGACCTGCCCGCAGACGTGACCAAGGACCTCACCGAGATTGTGGCCGAGACACTGCAGCCGCTGACCGTCAACCAGATCAAGCAGCTGCGCGTGCTCATCGACGACGCAATGACGGAGAACCAGAAGTCCCTGGAGCTTGCCGAGCAAACGAGGAACACCTCACTGCGGGAGGTGGGCAACCACCTGCACGAGTCCGTCCCAGTGTCAAACGACGAGGACGAGAACCGTGTGGAGCGGACCTTTGGCGACTGCGAAAAGCGCGGCAAGTATTCGCATGTGGACCTCATCGTGATGATCGACGGCATGAACGCGGAGAAGGGTGCCGTGGTGTCCGGCGGACGTGGTTACTTCCTTACCGGAGCCGCAGTCTTCTTGGAGCAAGCTCTCATTCAGCACGCCCTGCACCTGCTGTACGCCCGTGACTACGTTCCCCTGTACACGCCCTTCTTCATGCGGAAGGAGGTAATGCAGGAGGTGGCCCAGCTGTCACAATTCGACGAGGAGCTCTACAAGGTGGTGGGTAAGGGCAGCGAGAAGGCCGAGGAGGTAGGCATCGATGAGAAGTACCTGATCGCCACCTCAGAGCAGCCCATCGCCGCCTACCATCGCGATGAGTGGCTGCCGGAGGCTTCGCTGCCCATCAAGTATGCCGGTCTGTCCACCTGCTTCCGGCAGGAAGTGGGCTCGCACGGACGCGACACTCGCGGCATTTTCCGCGTCCACCAGTTCGAGAAGGTGGAGCAGTTCGTACTGACCTCCCCACACGACAACAAGTCGTGGGAGATGATGGACGAGATGATCGGCAATGCGGAGCAGTTCTGCCAGTCACTGGGCATTCCATATCGCGTGGTGAACATCGTTTCCGGTGCGCTCAACCATGCAGCCTCCAAGAAACTGGATCTGGAGGCCTGGTTCGGCGGCAGCGGCGCTTACAGAGAACTTGTATCGTGCTCCAACTGCTTGGACTACCAGGCCCGTCGTCTGCTCGTACGTTTCGGCCAGACCAAGAAGATGAACGCCGCCGTCGACTATGTGCACATGCTGAACGCCACGATGTGCGCAGCCACTCGCGTCATCTGCGCCATCCTGGAGACGCATCAGACAGAGACGGGTATCAAGGTGCCGGAGCCATTGAAGAAGTACATGCCGGCGAAGTTCCAGGATGAGATTCCGTTCGTCAAGCCCGCTCCCATTGATCTGGAGTTGGCCGCCGCCGAGAAGCAAAAGGGCAAAAAGGAGAAAACCAAGAAGGACCCTGCCGCCGGT 9 | >SLEBA.7636 10 | 
atggtACTCGATTTGGATTTGTTTCGCAGCGATAAGGGCGGTAATCCCGATGCTGTGCGCGAGAACCAAAAGAAGCGCTTCAAGGATGTAGGACTGGTCGAGACGGTAATTGAGAAGGACTCCGAATGGCGTCAGCGACGTCATCGGGCCGACAACCTGAACAAGGTGAAAAATGTCTGCAGCAAAGTAATTGGCGAGAAGATGAAGAAAAAGGAGCCCGTCGGTGCTGAAGGCGAAGAAGTGCCGGCCGCCATACGCGCGGATCTGACCCAAATTACGGCCGAGACGCTGCAGTCATTGACTGTGAATCAAATTAAACAGCTGCGCTTACTCATCGACGATGCCATGACTGAAAACCAAAAGTTGCTGGAGGCTGCTGAACAAACCCGAAACACGGCACTGCGTGAGGTGGGCAATCACCTGCATGAGTCTGTGCCCGTCTCCAATGACGAGGATGAGAATCGAGTGGAGCGCACCTTTGGGGATTGCGAGAAGCGCGGAAAATATTCTCATGTTGATCTGATTGTTATGATCGATGGCATGAACGCCGAAAAGGGTGCTGTAGTATCTGGTGGGCGTGGCTATTTTCTCACTGGCGCCGCTGTATTCCTCGAACAAGCGCTCATTCAGCATGCCCTACACTTGTTGTATGCCCGCGAATATACGCCGCTTTATACACCATTCTTCATGCGCAAGGAAGTCATGCAGGAGGTGGCGCAGCTCTCGCAATTCGACGAGGAGCTGTATAAAGTTGTCGGCAAGGGTAGCGAGCGGGCTGAGGAGGGTGGCACTGATGAAAAGTATTTGATAGCTACTTCGGAACAGCCCATTGCGGCATATCATCGTGACGAGTGGCTGCCAGAGACTTCACTGCCTATTAAATATGCCGGTTTGTCTACATGCTTCCGCCAGGAGGTCGGCTCACATGGACGCGATACCCGTGGCATATTCCGCGTGCATCAATTTGAGAAGATCGAACAATTCGTGCTTACCTCGCCACATGATAATAAATCTTGGGAAATGATGGACGAAATGATTGGCAATGCAGAGAACTTCTGTCAATCGTTGGGCATTCCATATCGTGTGGTCAACATCGTCTCTGGCGCCCTCAATCATGCTGCCTCCAAGAAACTTGATCTGGAAGCCTGGTTCGGTGGCAGCGGCGCCTACAGAGAGCTGGTCTCCTGCTCCAATTGCCTAGACTATCAGGCACGTCGTCTACTTGTTCGCTATGGCCAGACGAAAAAGATGAATGCTGCAGTGGATTATGTGCACATGTTAAATGCTACTATGTGCGCCACAACGCGTGTCATTTGCGCCATTCTGGAAACACATCAGACGGAGACAGGTGTTAGGGTTCCAGAGCCtctgaaaaaatatatgcCGGCCAAGTTCCAAGATGAGATTCCTTTTGTTAAACCGGCGCCCATTGATTTGGAACaggctgctgccgctgccaagGGCAAAAAGGAGAAGAACAAGAAAGATGCAGCTGCC 11 | -------------------------------------------------------------------------------- /tests/ref/OG0003977.fa: -------------------------------------------------------------------------------- 1 | >DBUSC.10374 2 | ATGACTTCAAGTATATTTCTAACGACTGCGGAGAATGGTTTGCGACATGACAAAATTGTAATTCTGGATGCTGGAGCACAATATGGCAAGgtTATCGATCGCAAGGTGCGCGAACTGTTGGTGGAATCGGATATTCTGCCATTGGATacaccagcagcaactataCGCAATCATGGCTATAAGGGTATTATAATCTCAGGGGGCCCCAATTCCGTTTACGCTGAAGATGCGCCCACATACGATCCTGAGctgtttaagcttaaaataccGATGCTGGGCATCTGCTATGGCATGCAACTAATTAACAAAGAATTCGGCGGCACCGTCCTCAAGACGGATGTGCGGGAGGATGGACAGCagaatattgaaattgaaacctCATGTCCTTTGTTTAGTCGTCTCAGTCGCACACAGTCTGTGCTGTTAACCCATGGCGACAGCGTGGAGCGAGTGGGCGATAAACTGAAAGTGGGCGGCTGGTCAACAAATCGTATTGTGACTGCTATTTATAACGACGTATTGCGCATCTACGGTGTGCAATTTCATCCGGAGGTGGACCTAACCATCAACGGCAAGCAGATACTGTCCAATTTTCTATACGAAATCTGTGAGTTAACGCCAAACTTCACCATGGGTAGTCGAAAAGAGGAATGCATTCGATATATACGCAATAAAgtgggcaacaacaaagtgttgttaTTGGTCAGTGGCGGTGTTGACTCGAGCGTCTGTGCCGCGCTGCTGCGTTGCGCTCTACATCCTAGCCAAATTATAGCAGTGCATGTGGATAATGGTTTCATGCGTAAGAACGAAAGCGATAAAGTTGAACGCTCACTCAGAGAAATTGGCATTGATTTGATTGTACGCAAGGAAGGCTACACGTTTCTCAAAGGGACCACGCAAGTGAAGCGTGCCGGCCAGTATTCAGTAGTGGAAACGCCCATGTTGTGTCAAACTTATAATCCCGAGGAGAAGCGTAAAATCATTGGTGATATATTTGTGAAGGTGACCAATGATGTGGTGGCCGAGCTCAAGCTTAAACCAGAGGAAGTGCTGTTGGCGCAAGGAACTCTGCGACCGGACTTAATAGAGTCAGCATCGAATATGGTCAGCACGAATGCCGAAACCATCAAGACGCATCACAATGACACGGATCTGATTCGGGAACTGCGCAATGCCGGTCGAGTTGTAGAGCCATTATGTGACTTCCATAAGGATGAGGTGCGTGATCTGGGCAGTGATCTTGGCCTGCCAGCAGAGCTTGTGGAACGACAACCGTTTCCAGGACCTGGTCTGGCCATACGCGTACTCTGCGCTGAAGAAGCCTACATGGAAAAGGATTACTCAGAGACACAGGTTATAGCACGCGTCATTGTGGACTACAAGAACAAATTGCATAAGAATCATGCGCTTATCAATCGTGTAACGGGAGCTACTAGTGAATTAGAGCAAAAGGAACTCTTGCGTATCTCTGCCAATTCGGAAATCCAGGCCACGCTGTTGCCTATTCGCTCGGTGGGCGTGCAAGGCGACAAGCGCACTTATAGCTATGTGGTGGGACTGTCAACATCAACAGCCGAGCCCAATTGGGCGGACATGATGtttttagcaaaattaataCCACGCATTTTGCATAATGTTAACCGCGTCTGTTATATCTTTGGTGAGCCTGTGCAGTATCTGGTGAATGATATAACGCATACAACGCTCAATACTGTGGTGCTGGCACATCTGCGGCAAGCAGATGCCATTGCTAATGAGATTATAATGCAAGCGGGCCTTTATCGCAAAATCTCACAAATGCCTGTAGTTCTAATACCCGTGCACTTTGATCGTGATCCCAT
AAATCGCACACCCTCCTGCAGAAGATCCGTGGTGTTGCGACCCTTCATAACGAACGACTTTATGACTGGTGTGCCAGCAGTGCCTGGTTCCGTTCAACTGCCTTTGCAAGTCTTAAATCAAATGGTGCGAGAAATCTCTAAATTAGATGGCGTATCCCGAGTACTCTACGATTTAACAGCTAAACCGCCCGGCACCACAGAGTGGGAA 3 | >DMOJA.1896 4 | ATGAGTACAACTATATACCTAACGACAGCGGAGAATGGACTGCGACACGATAAAATTGTTATACTCGATGCTGGTGCACAGTATGGCAAGGTAATCGATCGCAAGGTTCGAGAACTTTTAGTAGAATCGGATATTCTGCCATTGGATACGCCTGCCGCAACGATACGTGATAATGGCTATCGAGGCATAATCATATCCGGCGGACCCAATTCAGTTTATGCCGAAGATGCGCCCACGTACGATCCCGATCTATTTAAACTGAAAATCCCAGTGCTTGGCATCTGCTATGGCATGCAGTTAATCAACAAAGAGTTCGGAGGCAGTGTCCTCAAGACTGACGTGCGAGAAGATggacaacaaaatattgagaTTGAAACATCATGCCCTTTGTTCAGTCGCCTTAGCCGTACACAGTCCGTGCTGCTCACCCATGGCGATAGCGTGGAGCGAGTGGGCGATAAGTTGAAAGTGGGTGGCTGGTCATCGAATCGGATTGTGACGGCTATTTACAGTGAGGTATTGAGAATCTATGGTGTTCAATTTCATCCAGAAGTCGATCTAACTATCAACGGCAAACAAATGCTATCCAATTTTCTATACGAAATTTGCGAACTAACGCCTAACTTTACCATGGGCAGTCGAAAAGAGGAATGCATTCGGTACATACGCGAAAAAGTGGGCAATAATAAAGTTTTGTTGCTCGTCAGTGGAGGCGTCGATTCAAGTGTCTGTGCTGCATTGCTGCGAAAAGCGCTGCATCCCAATCAGATTATAGCAGTTCATGTGGATAATGGTTTCATGCGAAaaaatgaaagcgaaaatGTAGTGCGTTCATTGCGAGATATTGGCATTGATTTGATAGTGCGTAAAGAATGCTACACGTTCCTCAAGGGCACCACGCAAGTGAAACGACCCGGCCAGTATTCGGTAGTTGAAACGCCCATGCTATGTCAGACCTACAACCCGGAGGAGAAGCGAAAAATAATCGGTGATATATTTGTGAAAGTGACTAACGATGTGGTGGCTGAGCTGAAACTGAAGCCCGAGGAGGTCCTCTTGGCGCAGGGCACACTGAGGCCGGATTTGATTGAGTCTGCTTCGAATATGGTCAGCACGAATGCCGAAACAATTAAAACGCATCACAATGACACGGATCTGATTCGgGAGCTGCGTAATGCTGGCCGTGTCGTAGAACCATTATGTGACTTCCACAAGGATGAGGTGCGTGACCTGGGCAATGATCTCGGCCTGCCAGCGGAATTGGTCGAGCGCCAGCCGTTCCCCGGTCCTGGTCTCGCCATACGCGTTCTTTGCGCCGAGGAAGCATACATGGAGAAGGATTACTCGGAAACACAGGTCATTGTGCGCGTTATAGTGGactacaaaaataaactacAAAAGAATCATGCGCTTATTAATCGTGTCACGGGCGCCACTAGTGAAGCTGAGCAAAAGGAGCTGCTGCGCATCTCTGCCAATTCCGATATCCAGGCTACTCTACTGCCGGTACGCTCGGTGGGTGTACAGGGTGATAAACGCACCTACAGCTATGTTGTTGGCTTGTCAACGACGACCCCGGAGCCCAACTGGACTGACATGttatttttggccaaaatcATACCACGCATATTGCATAACGTTAACCGAGTCTGTTATATCTTTGGCGAGCCCGTGCAGTATCTAATTACAGATATAACGCACACAACTCTGAACACTGTAGTACTGGCGCAACTCAGGCAAGCGGACGCTATAGCCAATGAGATTATAATGAAAGAGGGACTGTATCGCAAAATCTCACAAATGCCCGTAGTTCTGATACCCGTGCACTTCGATCGTGATCCCATTAATCGTACGCCTTCCTGCAGAAGATCAGTGGTATTGCGTCCATTTATAACGAACGATTTTATGACTGGTGTGCCAGCTGTGCCCGGATCTGCTCAACTGCCATTGCATGTTCTAAATCAAATTGTGCGAGATATTTCTAAATTGGATGGCATCTCGAGGGTACTCTACGACTTGACAGCCAAGCCGCCTGGCACCACGGAGTGGGAA 5 | >DPSEU.1121 6 | 
ATGAATTCAGGCATATTTCTGGGCACAGCGGAGAACGGCCTGCGGCACGATAAGATTGTTATACTGGATGCGGGAGCACAGTACGGCAAGGTTATCGATCGCAAGGTGCGGGAACTGCTCGTGGAGTCGGATATACTGCCCCTAGATACACCAGCGGCAACGATACGGAATAATGGCTATCGGGGCATCATCATATCCGGCGGTCCCAATTCCGTGTATGCCGAGGATGCGCCCACGTACGATCCGGATCTGTTTAAGCTGAAAATTCCCATGCTCGGAATTTGCTATGGCATGCAGCTAATCAACAAAGAGTTTGGCGGCAGCGTTCTCAAGACGGACGTCCGGGAAGATGGCCAACAGAACATTGAGATTGAGACCTCGTGCCCGCTGTTCAGCCGCCTCAGTCGCACACAGTCCGTACTGCTCACCCACGGGGACAGCGTGGAGCGAGTGGGCGAGAAGCTAAAGGTGGGGGGTTGGTCAACGAACCGCATCGTCACGGCCATCTACAGCGAGGTGTTGCGCATCTATGGCGTCCAGTTCCATCCAGAGGTGGATCTAACCATCAATGGCAAACAGATGCTATCAAATTTCCTCTATGAAATCTGCGAACTAACGCCAAACTTCACAATGGGCAGCAGGAAGGAGGagtgcatacggtacatacgAGAGAAAGTGGGCAGTAATAAAGTCTTGCTCTTGGTCAGTGGCGGTGTGGATTCGAGTGTATGTGCAGCCTTGTTGCGCCGTGCCCTATACCCTAATCAAATAATTGCCGTGCATGTAGATAATGGTTTCATGCGCAAGAACGAAAGTGAAAAGGTGGAGCGTTCGTTGCGCGAAATCGGCATCGATTTGATTGTGCGAAAGGAATGCTATACGTTCCTCAAGGGCACCACGCAAGTGAAGCGACCCGGCCAGTATTCGGTCGTTGAGACGCCCATGCTCTGCCAGACCTACAACCCCGAGGAGAAGCGCAAGATAATTGGTGATATATTCGTAAAGGTGACCAACGATGTGGTGGCCTTTCTGAAACTCAAACCCGAAGAAGTTATGCTCGCCCAGGGCACCCTCAGGCCGGATCTAATTGAGTCTGCATCGAATATGGTCAGCACGAATGCAGAAACAATCAAAACGCATCACAATGACACGGATCTGATTAGAGAACTCCGCAATGCGGGACGCGTCGTAGAGCCTCTGTGCGACTTTCACAAGGATGAGGTACGAGATCTGGGCAATGATCTGGGTCTGCCACCAGAGCTAGTCGAGCGACAACCCTTCCCAGGTCCCGGTCTGGCCATCCGCGTTCTGTGTGCCGAGGAGGCTTACATGGAGAAGGACTATTCAGAGACTCAGGTAATTGTGCGTGTTATTGTGGACTATAAAAACAAACTGCAGAAGAACCATGCGCTCATAAATCGAGTGACGGGCGCCACCAGTGAGGCCGAACAAATAGAGCTACTGCGCATCTCTGCCAACTCAACGATCCAGGCTACCCTGCTGCCCATCCGGTCTGTGGGTGTGCAAGGCGACAAACGCAGCTACAGCTACGTGGTGGGGCTGTCTACGAGTCAAGAGCCCAACTGGATGGACCTGCTCTTCCTAGCAAAGATCATACCGCGTATTTTGCATAATGTCAATAGGGTTTGCTATATCTTTGGGGAGCCCGTTCAGTATCTGGTCACGGACATAACGCACACAACCCTCAATACGGTGGTGTTGTCGCAGCTCCGGCAAGCCGATGCCATTGCAAATGAGattATCATGCAAGCGGGCTTGTACAGGAAAATCTCACAGATGCCTGTCGTTCTGATACCCGTGCACTTTGATCGCGATCCCATCAATAGGACGCCCTCCTGCAGAAGATCGGTGGTTCTGCGTCCCTTCATAACGAACGATTTTATGACTGGAGTGCCGGCTGTGCCTGGATCAGTGCAACTGCCATTGCAAGTCCTTAATCAAATGGTGCGCGACATAACCAAGCTGGATGGAATATCGCGGGTACTCTACGACTTGACCGCCAAGCCGCCGGGTACCACTGAGTGGGAA 7 | >DYAKU.12180 8 | 
ATGAACTCAAACATATTTCTGGGCACAGCAGAGAACGGCCTGCGGCACGATAAGATTGTTATACTTGATGCTGGAGCACAGTACGGCAAGGTTATCGACCGTAAGGTACGCGAACTCCTCGTTGAGTCGGATATCCTTCCACTGGACACCCCAGCGGCTACGATACGCAACAATGGCTATCGAGGCATCATCATCTCCGGGGGACCCAACTCAGTCTACGCAGAGGATGCACCCAGCTATGATCCCGATCTGTTCAAGCTAAAGATACCTATGCTGGGCATCTGCTACGGCATGCAGCTAATAAACAAAGAGTTCGGAGGCACAGTGCTCAAGAAGGATGTACGAGAGGATGGCCAACAAAATATCGAGATTGAGACCTCGTGTCCGCTCTTTAGTCGCCTCAGTCGCACACAGTCGGTGCTGTTAACCCACGGTGATAGCGTTGAGAGAGTGGGCGAGAATCTGAAGATTGGTGGCTGGTCCACAAACCGCATTGTGACAGCTATTTACAATGAAGTACTCCGCATCTACGGCGTCCAGTTCCATCCTGAGGTGGACCTCACTATCAATGGCAAACAGATGCTATCGAACTTCCTGTACGAAATCTGCGAACTGACGCCCAACTTTACCATGGGTAGTCGAAAGGAGGAGTGCATACGCTACATCCGCGAGAAAGTGGGCAGTAATAAAGTGTTGCTACTGGTCAGCGGCGGCGTGGATTCGAGTGTCTGTGCAGCTTTGCTCCGCCGTGCTTTGTACCCCAATCAGATAATTGCCGTACATGTAGATAATGGTTTCATGCGCAAAAACGAAAGTGAAAAGGTGGAGCGTTCACTGCGCGATATTGGCATTGATTTAATTGTCCGAAAAGAAGGCTACACGTTCCTTACAGGCACTACGCAAGTCAAGAGGCCCGGACAGTACTCCGTGGTGGAAACGCCTATGTTATGTCAGACCTATAATCCGGAGGAGAAACGCAAGATAATTGGTGATATATTCGTCAAAGTGACCAACGATGTAGTAGCCGAATTGAAACTAAAGCCCGAGGAAGTTATGTTGGCTCAGGGAACCCTCCGACCAGATCTAATCGAGTCCGCCTCGAACATGGTGAGCACGAATGCAGAAACAATCAAAACGCACCACAATGACACGGATCTGATCAGaGAGCTTCGTAACGCAGGACGTGTGGTTGAGCCGCTGTGTGACTTTCATAAGGATGAAGTGCGCGACTTAGGCAACGATCTTGGACTGCCCCAAGAGCTTGTGGAGAGACAACCCTTCCCAGGTCCTGGCCTGGCAATCCGCGTTCTCTGCGCCGAGGAGGCATACATGGAGAAGGACTACTCAGAGACTCAggTTATTATACGCGTGATTGTAGACTACAAGAATAAACTGCAGAAGAACCATGCTCTAATCAACCGCGTAACAGGGACCACGAGCGAGTCAGAACAGAAAGACCTATTGCGTATCTCTGCGAACTCGCAGATTCAGGCAACTTTGCTGCCCATCCGCTCCGTGGGCGTGCAAGGTGATAAACGGTCATACAGCTACGTAGTAGGTCTATCAACGAGCCAGGAGCCCAACTGGCAGGACCTTCTCTTCTTGGCTAAAATTATACCGCGCATACTGCACAACGTGAACAGGGTGTGCTACATTTTTGGCGAGCCCGTGCAGTACCTAGTAACGGATATAACGCACACCACACTGAATACGGTAGTTCTTTCGCAGCTAAGGCAAGCGGATGCTATTGCCAATGAAATCATAATGCAAGCTGGACTATACCGGAAAATCTCTCAGATGCCTGTTGTTCTCATACCCGTGCACTTTGACCGCGATCCCATTAACCGTACACCCTCATGCCTAAGGTCGGTAGTGCTGCGTCCGTTCATAACGAACGACTTTATGACTGGTGTGCCGGCTGAGCCCGGCTCCGTGCAACTGCCATTGCAGGTCCTAAATCAAATTGTACGCGATATATCCAAACTGGGCGGAATCTCGAGGGTGCTGTACGACTTGACAGCTAAGCCACCGGGCACCACCGAGTGGGAA 9 | >SLEBA.7381 10 | 
ATGTCTTCAAATCTATTTCtaacaacagcagaaaatgGTCTGCGACACGATAAAATTGTTATACTCGATGCTGGTGCACAGTACGGCAAGGTTATCGATCGTAAAGTGCGTGAATTGCTCGTTGAATCGGATATCCTGCCTCTAGATACGCCAGCATCGACGATACGCGATCATGGCTATCGCGGAATCATAATTTCCGGCGGACCCAACTCAGTATATGCCGAAGATGCGCCCTCTTATGATCCCGACTTATTTAAGCTGAAAATACCTATGCTAGGCATTTGCTATGGCATGCAGCTGATTAATAAGGAATTCGGCGGAACCGTACTCAAGACGGATGTACGCGAGGATGGTCAGCAAAGCATAGAAATTGAAACATCGTGTCCGTTGTTCAGCCGCCTCAGTCGCACTCAATCCGTGTTACTTACACATGGCGATAGTGTAGAGCGTGTTGGTGAAAAACTTAAAGTAGGGGGCTGGTCAACGAATCGAATTGTCACCGCCATTTATAATGAAGTGCTGCGCATTTATGGTGTACAATTCCATCCGGAAGTCGACTTAACAATCAATGGAAAACAGATGCTATCGAATTTTCTCTATGATATCTGTGAATTGACGCCGAATTTTACCATGGGTAGTCGAAAAGAGGAATGCATACGCTATATTAGAGAGAAAGTGGGCAACAATAAAGTACTgTTGCTGGTCAGCGGCGGCGTCGATTCCAGTGTGTGTGCTGCGCTTTTAAGACGCGCCTTACATCCTGGCCAAATTATAGCGGTGCATGTGGATAAtgGTTTCATGCGAAAAAACGAAAGTGAAAAGGTGGAGCGTTCCTTGCGAGAAATTGGCATAGATTTGATAGTTCGTAAAGAAAGCTACACGTTCCTCAAAGGCACCACGCAAGTGAAAAGGCCTGGACAGTATTCGGTGGTTGAAACGCCCATGCTATGCCAGACATACAACCCCGAAGAAAAACGCAAGATAATTGGTGATATATTCGTAAAAGTGACCAATGATGTGGTAGCCGAGCTAAAATTGAAACCCGAAGAGGTTTTGCTGGCTCAGGGTACCCTACGACCCGATCTAATAGAGTCCGCCTCAAATATGGTTAGCATGAATGCCGAAACGATCAAGACGCATCACAATGACACAGATCTGATAAGAGAACTGCGGAATGCAGGGCGCGTTGTCGAGCCACTGTGCGATTTCCATAAGGACGAAGTACGGGATCTTGGCAATGATTTGGGCTTACCAGCGGAGCTAGTAGAAAGGCAACCATTTCCGGGACCTGGTCTAGCAATTCGAGTACTTTGTGCTGAGGAGGCTTACATGGAGAAAGACTATTCAGAAACACAGGTCATAGTTCGTGTAATAGTGgactacaaaaacaaactacaGAAAAATCATGCACTCATCAATCGTGTTACGGGCGCTACCAATGAAGCCGAACAAAAGGAACTCATACGCATTtcaacaaatacacaaatcCAAGCCACATTGCTGCCTGTGCGCTCGGTGGGTGTGCAAGGTGATAAGCGCACCTACAGCTATGTTGTTGGTCTATCAACGAGTCAGGAACCCAATTGGACGGATCTGTTATTCTTAGCGAAAATCATACCGCGCATTTTACACAATGTAAATAGAGTTTGCTATATCTTCGGTGATCCCGTGCAGTATCTGGTAACCGATATAACGCATACGACACTCAATACTGTGGTGTTGGCGCAGCTACGCCAAGCGGATGCAATTGCCAATGAGaTTATTATGCAAGCTGGTCTGTATCGTAAAATTTCACAGATGCCTGTTGTGCTTATACCAGTACATTTTGATCGTGATCCAATAAATCGTACGCCGTCGTGTCGAAGATCCGTGGTGTTGCGTCCCTTTATAACAAATGATTTCATGACTGGTGTGCCAGCTGTGCCTGGTTCTGTGCAACTGCCATTGCAAGTCCTTAATCAAATAGTGAGGGATATATCCAAATTGGATGGAATCTCGAGGGTACTATACGATTTAACCGCAAAACCGCCCGGCACCACAGAATGGGAA 11 | -------------------------------------------------------------------------------- /lib/main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import sys 4 | import os 5 | import argparse 6 | try: 7 | import library as lib 8 | import map 9 | import assemble as ab 10 | except ImportError: 11 | import lib.library as lib 12 | import lib.map as map 13 | import lib.assemble as ab 14 | 15 | def parameter(x): 16 | return x.strip() 17 | 18 | def main(args): 19 | # the arguments 20 | parser = argparse.ArgumentParser(prog='PhyloAln', usage="%(prog)s [options] -a reference_alignment_file -s species -i fasta_file -f fasta -o output_directory\n%(prog)s [options] -d reference_alignments_directory -c config.tsv -f fastq -o output_directory", description="A program to directly generate multiple sequence alignments from FASTA/FASTQ files based on reference alignments for phylogenetic analyses.\nCitation: Huang Y-H, Sun Y-F, Li H, Li H-S, Pang H. 2024. MBE. 41(7):msae150. 
https://doi.org/10.1093/molbev/msae150", epilog="Written by Yu-Hao Huang (2023-2025) huangyh45@mail3.sysu.edu.cn", formatter_class=argparse.RawDescriptionHelpFormatter) 21 | parser.add_argument('-a', '--aln', type=os.path.abspath, help='the single reference FASTA alignment file') 22 | parser.add_argument('-d', '--aln_dir', type=os.path.abspath, help='the directory containing all the reference FASTA alignment files') 23 | parser.add_argument('-x', '--aln_suffix', default='.fa', help='the suffix of the reference FASTA alignment files when using "-d"(default:%(default)s)') 24 | parser.add_argument('-s', '--species', help='the studied species ID for the provided FASTA/FASTQ files(-i)') 25 | parser.add_argument('-i', '--input', type=os.path.abspath, nargs='+', help='the input FASTA/FASTQ file(s) of the single species(-s), compressed files ending with ".gz" are allowed') 26 | parser.add_argument('-c', '--config', type=os.path.abspath, help="the TSV file with the format of 'species sequence_file(s)(absolute path, files separated by commas)' per line for multiple species") 27 | parser.add_argument('-f', '--file_format', choices=['guess', 'fastq', 'fasta', 'large_fasta'], default='guess', help="the file format of the provided FASTA/FASTQ files, 'large_fasta' is recommended for speeding up reading the FASTA files with long sequences(e.g. genome sequences) and cannot be guessed(default:%(default)s)") 28 | parser.add_argument('-o', '--output', default='PhyloAln_out', type=os.path.abspath, help='the output directory containing the results(default:%(default)s)') 29 | parser.add_argument('-p', '--cpu', type=int, default=8, help="maximum threads to be totally used in parallel tasks(default:%(default)d)") 30 | parser.add_argument('--parallel', type=int, help="number of parallel tasks for each alignments, number of CPUs used for single alignment will be automatically calculated by '--cpu / --parallel'(default:the smaller value between number of alignments and the maximum threads to be used)") 31 | parser.add_argument('-e', '--mode', choices=['dna2reads', 'prot2reads', 'codon2reads', 'fast_dna2reads', 'fast_prot2reads', 'fast_codon2reads', 'dna2trans', 'prot2trans', 'codon2trans', 'dna2genome', 'prot2genome', 'codon2genome', 'rna2rna', 'prot2prot', 'codon2codon', 'gene_dna2dna', 'gene_rna2rna', 'gene_codon2codon', 'gene_codon2dna', 'gene_prot2prot'], help="the common mode to automatically set the parameters for easy use(**NOTICE: if you manually set those parameters, the parameters you set will be ignored and covered! 
See https://github.com/huangyh45/PhyloAln/blob/main/README.md#example-commands-for-different-data-and-common-mode-for-easy-use for detailed parameters)") 32 | parser.add_argument('-m', '--mol_type', choices=['dna', 'prot', 'codon', 'dna_codon'], default='dna', help="the molecular type of the reference alignments(default:%(default)s, 'dna' suitable for nucleotide-to-nucleotide or protein-to-protein alignment, 'prot' suitable for protein-to-nucleotide alignment, 'codon' and 'dna_codon' suitable for codon-to-nucleotide alignment based on protein and nucleotide alignments respectively)") 33 | parser.add_argument('-g', '--gencode', type=int, default=1, help="the genetic code used in translation(default:%(default)d = the standard code, see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)") 34 | parser.add_argument('--ref_split_len', type=int, help="If provided, split the reference alignments longer than this length into short alignments with this length, ~1000 may be recommended for concatenated alignments, and codon alignments should be devided by 3") 35 | parser.add_argument('-l','--split_len', type=int, help="If provided, split the sequences longer than this length into short sequences with this length, 200 may be recommended for long genomic reads or sequences") 36 | parser.add_argument('--split_slide', type=int, help="the slide to split the sequences using sliding window method(default:half of '--split_len')") 37 | parser.add_argument('-n', '--no_reverse', action='store_true', help="not to prepare and search the reverse strand of the sequences, recommended for searching protein or CDS sequences") 38 | parser.add_argument('--low_mem', action='store_true', help="use a low-memory but slower mode to prepare the reads, 'large_fasta' format is not supported and gz compressed files may still spend some memory") 39 | parser.add_argument('--hmmbuild_parameters', type=parameter, nargs='+', default=[], help="the parameters when using HMMER hmmbuild for reference preparation, with the format of ' --xxx' of each parameter, in which space is required(default:%(default)s)") 40 | parser.add_argument('--hmmsearch_parameters', type=parameter, nargs='+', default=[], help="the parameters when using HMMER hmmsearch for mapping the sequences, with the format of ' --xxx' of each parameter, in which space is required((default:%(default)s)") 41 | parser.add_argument('-b', '--no_assemble', action='store_true', help="not to assemble the raw sequences based on overlap regions") 42 | parser.add_argument('--overlap_len', type=int, default=30, help="minimum overlap length when assembling the raw sequences(default:%(default)d)") 43 | parser.add_argument('--overlap_pident', type=float, default=98, help="minimum overlap percent identity when assembling the raw sequences(default:%(default).2f)") 44 | parser.add_argument('-t', '--no_out_filter', action='store_true', help="not to filter the foreign or no-signal sequences based on conservative score") 45 | parser.add_argument('-u', '--outgroup', nargs='+', default=[], help="the outgroup species for foreign or no-signal sequences detection(default:all the sequences in the alignments with all sequences as ingroups)") 46 | parser.add_argument('--ingroup', nargs='+', default=[], help="the ingroup species for score calculation in foreign or no-signal sequences detection(default:all the sequences when all sequences are set as outgroups; all other sequences except the outgroups)") 47 | parser.add_argument('-q', '--sep', default='.', help="the separate symbol between species 
name and gene identifier in the sequence headers of the alignments(default:%(default)s)") 48 | parser.add_argument('--outgroup_weight', type=float, default=0.9, help="the weight coefficient to adjust strictness of the foreign or no-signal sequence filter, small number or decimal means ralaxed criterion (default:%(default).2f, 1 = not adjust)") 49 | parser.add_argument('-r', '--no_cross_species', action='store_true', help="not to remove the cross contamination for multiple species") 50 | parser.add_argument('--cross_overlap_len', type=int, default=30, help="minimum overlap length when cross contamination detection(default:%(default)d)") 51 | parser.add_argument('--cross_overlap_pident', type=float, default=98, help="minimum overlap percent identity when cross contamination detection(default:%(default).2f)") 52 | parser.add_argument('--min_exp', type=float, default=0.2, help="minimum expression value when cross contamination detection(default:%(default).2f)") 53 | parser.add_argument('--min_exp_fold', type=float, default=5, help="minimum expression fold when cross contamination detection(default:%(default).2f)") 54 | parser.add_argument('-w', '--unknow_symbol', default='unknow', help="the symbol representing unknown bases for missing regions(default:%(default)s = 'N' in nucleotide alignments and 'X' in protein alignments)") 55 | parser.add_argument('-z', '--final_seq', choices=['consensus', 'consensus_strict', 'all', 'expression', 'length'], default='consensus', help="the mode to output the sequences(default:%(default)s, 'consensus' means selecting most common bases from all sequences, 'consensus_strict' means only selecting the common bases and remaining the different bases unknow, 'all' means remaining all sequences, 'expression' means the sequence with highest read counts after assembly, 'length' means sequence with longest length") 56 | parser.add_argument('-y', '--no_ref', action='store_true', help="not to output the reference sequences") 57 | parser.add_argument('-k', '--keep_seqid', action='store_true', help="keep original sequence IDs in the output alignments instead of renaming them based on the species ID, not recommended when the output mode is 'consensus'/'consensus_strict' or the assembly step is on") 58 | parser.add_argument('-v', '--version', action='version', version="%(prog)s v1.1.0") 59 | args = parser.parse_args(args) 60 | 61 | # automatically set the parameters when mode is set for easy use 62 | if args.mode is not None: 63 | if args.mode in ['rna2rna', 'prot2prot', 'codon2codon']: 64 | args.no_reverse = True 65 | args.no_assemble = True 66 | args.no_cross_species = True 67 | if args.mode == 'codon2codon': 68 | args.mol_type = 'codon' 69 | else: 70 | args.mol_type = 'dna' 71 | if args.mode == 'prot2prot': 72 | args.unknow_symbol = 'X' 73 | elif args.mode.startswith('gene_'): 74 | args.no_assemble = True 75 | args.no_cross_species = True 76 | args.final_seq = 'all' 77 | args.keep_seqid = True 78 | args.unknow_symbol = '-' 79 | if args.mode.endswith('2dna'): 80 | args.no_reverse = False 81 | else: 82 | args.no_reverse = True 83 | if args.mode.startswith('gene_codon2'): 84 | args.mol_type = 'codon' 85 | else: 86 | args.mol_type = 'dna' 87 | #if args.mode == 'gene_prot2prot': 88 | # args.unknow_symbol = 'X' 89 | else: 90 | if 'dna2' in args.mode: 91 | args.mol_type = 'dna' 92 | elif 'prot2' in args.mode: 93 | args.mol_type = 'prot' 94 | elif 'codon2' in args.mode: 95 | args.mol_type = 'codon' 96 | if args.mode.endswith('reads') and not args.mode.startswith('fast_'): 97 | 
args.no_assemble = False 98 | else: 99 | args.no_assemble = True 100 | if args.mode.endswith('reads'): 101 | args.no_cross_species = False 102 | else: 103 | args.no_cross_species = True 104 | if args.mode.endswith('2genome'): 105 | args.split_len = 200 106 | args.file_format = 'large_fasta' 107 | 108 | # parse the alignment files 109 | alns = {} 110 | if args.ref_split_len: 111 | if args.aln_dir: 112 | print("\nError: split of the alignment is not supported for multiple alignments, please input a single alignment file through '-a' or '--aln'!") 113 | sys.exit(1) 114 | elif args.final_seq == 'all': 115 | print("\nError: split of the alignment is not supported to output all sequences, please choice other options to keep unqiue sequences instead!") 116 | sys.exit(1) 117 | elif args.aln: 118 | alns = map.split_ref(args.aln, args.ref_split_len) 119 | elif args.aln: 120 | alns['aln'] = args.aln 121 | elif args.aln_dir: 122 | for filename in os.listdir(args.aln_dir): 123 | if filename.endswith(args.aln_suffix): 124 | alns[filename.replace(args.aln_suffix, '')] = os.path.join(args.aln_dir, filename) 125 | if not alns: 126 | print("\nError: fail to find any alignment!") 127 | sys.exit(1) 128 | 129 | # parse the species data 130 | rawdata = {} 131 | if args.species and args.input: 132 | rawdata[args.species] = args.input 133 | elif args.config: 134 | for line in open(args.config): 135 | arr = line.rstrip().split("\t") 136 | rawdata[arr[0]] = arr[1].split(',') 137 | if not rawdata: 138 | print("\nError: fail to find any species data!") 139 | sys.exit(1) 140 | 141 | # check the unknow symbol, low-memory mode and outgroup, and parse the parallel task number and length to split 142 | if args.unknow_symbol != 'unknow' and len(args.unknow_symbol) > 1: 143 | print("\nError: the symbol representing unknown bases should be single character!") 144 | sys.exit(1) 145 | if args.low_mem and args.file_format == 'large_fasta': 146 | print("\nError: the format of 'large_fasta' is not supported in the low-memory mode! 
If you want to use the low-memory mode, you can use 'fasta' format and it will take a while!") 147 | sys.exit(1) 148 | if not args.outgroup and not args.no_out_filter: 149 | print("\nWarning: no outgroup was set, all the sequences in the alignments will be considered as outgroups with all sequences as ingroups in foreign sequence filter!") 150 | if args.parallel is None: 151 | args.parallel = min(len(alns), args.cpu) 152 | else: 153 | args.parallel = min(args.parallel, len(alns), args.cpu) 154 | if args.split_len is not None and args.split_slide is None: 155 | args.split_slide = int(args.split_len / 2) 156 | 157 | # create and enter the output directory, and output the splitted reference alignments 158 | if not os.path.isdir(args.output): 159 | os.makedirs(args.output) 160 | os.chdir(args.output) 161 | if not os.path.isdir('ok'): 162 | os.mkdir('ok') 163 | if args.ref_split_len: 164 | if not os.path.isdir('ref_split'): 165 | os.mkdir('ref_split') 166 | for aln, aln_seqstr in alns.items(): 167 | outfile = open(os.path.join('ref_split', aln + '.fa'), 'w') 168 | outfile.write(aln_seqstr) 169 | outfile.close() 170 | alns[aln] = os.path.join('ref_split', aln + '.fa') 171 | 172 | # prepare the reference HMMs 173 | map.prepare_ref(alns, cpu=args.cpu, np=args.parallel, moltype=args.mol_type, gencode=args.gencode, parameters=args.hmmbuild_parameters) 174 | 175 | # map (by HMMER), extract, assemble the sequences and remove foreign or no-signal sequences of each species 176 | total_reads = {} 177 | assemblers = {} 178 | for sp, fastxs in rawdata.items(): 179 | all_seqs, total_reads[sp] = map.map_reads(alns, sp, fastxs, file_format=args.file_format, cpu=args.cpu, np=args.parallel, moltype=args.mol_type, gencode=args.gencode, split_len=args.split_len, split_slide=args.split_slide, no_reverse=args.no_reverse, low_mem=args.low_mem, parameters=args.hmmsearch_parameters) 180 | all_hmmres = map.extract_reads(alns, sp, all_seqs, moltype=args.mol_type, split_len=args.split_len, low_mem=args.low_mem) 181 | del all_seqs 182 | assemblers[sp] = ab.generate_assembly_mp(alns, sp, all_hmmres, np = args.parallel, moltype=args.mol_type, gencode=args.gencode, no_assemble=args.no_assemble, overlap_len=args.overlap_len, overlap_pident=args.overlap_pident, no_out_filter=args.no_out_filter, outgroup=args.outgroup, ingroup=args.ingroup, sep=args.sep, outgroup_weight=args.outgroup_weight, final_seq=args.final_seq) 183 | 184 | # cross decontamination and output 185 | for group_name in alns.keys(): 186 | assemblers[group_name] = {} 187 | for sp in rawdata.keys(): 188 | assemblers[group_name][sp] = assemblers[sp][group_name] 189 | for sp in rawdata.keys(): 190 | assemblers.pop(sp) 191 | ab.cross_and_output_mp(alns.keys(), list(rawdata.keys()), assemblers, total_reads, np = args.parallel, moltype=args.mol_type, gencode=args.gencode, no_assemble=args.no_assemble, no_cross_species=args.no_cross_species, min_overlap=args.cross_overlap_len, min_pident=args.cross_overlap_pident, min_exp=args.min_exp, min_fold=args.min_exp_fold, unknow=args.unknow_symbol, final_seq=args.final_seq, no_ref=args.no_ref, sep=args.sep, keep_seqid=args.keep_seqid) 192 | 193 | # concatenate the output alignments if split the reference alignment 194 | if args.ref_split_len: 195 | if args.unknow_symbol == 'unknow': 196 | fill = 'N' 197 | else: 198 | fill = args.unknow_symbol 199 | ab.concatenate('nt_out', alns, fill) 200 | if args.mol_type != 'dna': 201 | if args.unknow_symbol == 'unknow': 202 | fill = 'X' 203 | else: 204 | fill = args.unknow_symbol 205 | 
ab.concatenate('aa_out', alns, fill) 206 | 207 | if __name__ == "__main__": 208 | main(sys.argv[1:]) 209 | -------------------------------------------------------------------------------- /lib/map.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import sys 4 | import os 5 | import warnings 6 | from Bio import SearchIO, BiopythonWarning 7 | from Bio.Seq import reverse_complement, translate 8 | try: 9 | import library as lib 10 | except ImportError: 11 | import lib.library as lib 12 | 13 | # split the reference alignments into short alignments 14 | def split_ref(alnfile, split_len): 15 | print("Splitting the reference alignment into short alignments...") 16 | seqs = lib.read_fastx(alnfile, 'fasta') 17 | aln_len = len(list(seqs.values())[0]) 18 | alns = {} 19 | i = 0 20 | while i * split_len < aln_len: 21 | alns['aln_' + str(i)] = '' 22 | end = i * split_len + split_len 23 | # for final alignments and avoid too short length < 30 24 | if end + 30 > aln_len: 25 | end = aln_len 26 | for seqid, seqstr in seqs.items(): 27 | alns['aln_' + str(i)] += ">{}\n{}\n".format(seqid, seqstr[i*split_len:end]) 28 | i += 1 29 | return alns 30 | 31 | # output the splited sequences to different FASTA file by single CPU 32 | def output_fasta_percpu(seq_list, output_fasta, fastx_num=1, moltype='dna', gencode=1, outfile0=None, split_len=None, split_slide=None, no_reverse=False, low_mem_iter=None, low_mem_format=None): 33 | if outfile0 is None: 34 | outfile = open(output_fasta, 'w') 35 | else: 36 | outfile = outfile0 37 | if low_mem_iter is not None: 38 | # low-memory mode: directly read the file instead of reading the store 39 | seqid = None 40 | seqstr = None 41 | count = 0 42 | line_num = 0 43 | for line in low_mem_iter: 44 | line = line.rstrip() 45 | if low_mem_format == 'fastq' and line.startswith('@') and line_num % 4 == 0: 46 | seqid = line.replace('@', '', 1).replace(' ', '_') 47 | elif low_mem_format == 'fasta' and line.startswith('>'): 48 | if seqid: 49 | # prepare the single sequence 50 | output_fasta_percpu([(seqid, seqstr)], output_fasta, fastx_num, moltype, gencode, outfile, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse) 51 | count += 1 52 | arr = line.split(" ") 53 | seqid = arr[0].lstrip('>') 54 | seqstr = '' 55 | elif seqid: 56 | if low_mem_format == 'fastq': 57 | # prepare the single sequence 58 | output_fasta_percpu([(seqid, line)], output_fasta, fastx_num, moltype, gencode, outfile, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse) 59 | count += 1 60 | seqid = None 61 | else: 62 | seqstr += line 63 | line_num += 1 64 | if low_mem_format == 'fasta' and seqstr: 65 | output_fasta_percpu([(seqid, seqstr)], output_fasta, fastx_num, moltype, gencode, outfile, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse) 66 | count += 1 67 | # low-memory mode for this FASTA/FASTQ file ends here and return the read count 68 | return count 69 | for seqid, seqstr0 in seq_list: 70 | if split_len: 71 | # split the sequences into short sequences with length of split_len 72 | i = 0 73 | while i + split_len < len(seqstr0): 74 | output_fasta_percpu([('_'.join([seqid, 'split', str(i+1), str(i+split_len)]), seqstr0[i:i+split_len])], output_fasta, fastx_num, moltype, gencode, outfile, no_reverse=no_reverse) 75 | i += split_slide 76 | output_fasta_percpu([('_'.join([seqid, 'split', str(i+1), str(len(seqstr0))]), seqstr0[i:])], output_fasta, fastx_num, moltype, gencode, outfile, 
no_reverse=no_reverse) 77 | elif moltype.startswith('dna'): 78 | outfile.write(">{}_fastx{}\n{}\n".format(seqid, fastx_num, seqstr0)) 79 | if not no_reverse: 80 | outfile.write(">{}_fastx{}_rev\n{}\n".format(seqid, fastx_num, reverse_complement(seqstr0))) 81 | else: 82 | # supress the warnings due to the last incomplete codons when translate the sequences 83 | with warnings.catch_warnings(): 84 | warnings.simplefilter('ignore', BiopythonWarning) 85 | for j in [1,2,3]: 86 | seqstr = seqstr0[(j-1):] 87 | outfile.write(">{}_fastx{}_pos{}\n{}\n".format(seqid, fastx_num, j, translate(seqstr, table=gencode))) 88 | if not no_reverse: 89 | seqstr0 = reverse_complement(seqstr0) 90 | for j in [1,2,3]: 91 | seqstr = seqstr0[(j-1):] 92 | outfile.write(">{}_fastx{}_pos{}rev\n{}\n".format(seqid, fastx_num, j, translate(seqstr, table=gencode))) 93 | if outfile0 is None: 94 | outfile.close() 95 | 96 | # convert the raw FASTQ/FASTA files to (translated) FASTA format 97 | def fastx2fasta(fastxs, fasta, file_format='guess', cpu=8, moltype='dna', gencode=1, split_len=None, split_slide=None, no_reverse=False, low_mem=False, output=True): 98 | if output: 99 | outfile = open(fasta, 'w') 100 | all_seqs = [] 101 | total_count = 0 102 | for i in range(len(fastxs)): 103 | if low_mem: 104 | fastx_iter, file_format = lib.read_fastx(fastxs[i], file_format, low_mem=True) 105 | all_seqs.append([fastxs[i], file_format]) 106 | seqs = {} 107 | else: 108 | seqs = lib.read_fastx(fastxs[i], file_format) 109 | all_seqs.append(seqs) 110 | total_count += len(seqs) 111 | if output: 112 | if not low_mem and cpu > 1 and (split_len or len(seqs) > 10000): 113 | # run in multiprocess when too much sequences 114 | print("Binning the reads in '{}' into {} parts and preparing in multiprocess...".format(fastxs[i], cpu)) 115 | nseq = int(len(seqs) / cpu) 116 | if nseq * cpu < len(seqs): 117 | nseq += 1 118 | kwds = {'fastx_num': i+1, 'moltype': moltype, 'gencode': gencode, 'split_len': split_len, 'split_slide': split_slide, 'no_reverse': no_reverse} 119 | args_list = [] 120 | for j in range(cpu-1): 121 | args_list.append((list(seqs.items())[(nseq*j):(nseq*(j+1))], fasta + '.' + str(j))) 122 | args_list.append((list(seqs.items())[(nseq*(cpu-1)):], fasta + '.' + str(cpu-1))) 123 | lib.run_mp(output_fasta_percpu, args_list, cpu, kwds=kwds) 124 | for j in range(cpu): 125 | for line in open(fasta + '.' + str(j)): 126 | outfile.write(line) 127 | os.remove(fasta + '.' 
+ str(j)) 128 | elif low_mem: 129 | # output using a low-memory method and obtain the read count 130 | total_count += output_fasta_percpu([], fasta, i+1, moltype, gencode, outfile, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse, low_mem_iter=fastx_iter, low_mem_format=file_format) 131 | else: 132 | # directly output by single cpu 133 | output_fasta_percpu(list(seqs.items()), fasta, i+1, moltype, gencode, outfile, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse) 134 | if output: 135 | outfile.close() 136 | return all_seqs, total_count 137 | 138 | # construct HMM file by hmmbuild 139 | def hmmbuild(alnfile, group_name, cpu=1, moltype='dna', gencode=1, parameters=[]): 140 | log = open(os.path.join('ref_hmm', group_name + '.log'), 'w') 141 | if moltype == 'codon': 142 | lib.trans_seq(alnfile, os.path.join('ref_hmm', group_name + '.aa.fas'), gencode=gencode) 143 | cmd = ['hmmbuild', '-O', os.path.join('ref_hmm', group_name + '.sto'), '--cpu', str(cpu)] 144 | cmd.extend(parameters) 145 | cmd.extend([os.path.join('ref_hmm', group_name + '.hmm'), os.path.join('ref_hmm', group_name + '.aa.fas')]) 146 | else: 147 | cmd = ['hmmbuild', '-O', os.path.join('ref_hmm', group_name + '.sto'), '--cpu', str(cpu)] 148 | cmd.extend(parameters) 149 | cmd.extend([os.path.join('ref_hmm', group_name + '.hmm'), alnfile]) 150 | ifcomplish = lib.runcmd(cmd, log, stdout=False) 151 | log.close() 152 | if ifcomplish: 153 | return 0, group_name 154 | else: 155 | return 1, group_name 156 | 157 | # HMM search by HMMER 158 | def hmmsearch(group_name, species, cpu=1, parameters=[]): 159 | if not os.path.isfile(os.path.join('ok', "map_{}_{}.ok".format(species, group_name))): 160 | fasta = species + '.temp.fasta' 161 | log = open(os.path.join('map_' + species, group_name + '.log'), 'w') 162 | cmd = ['hmmsearch', '-o', os.path.join('map_' + species, group_name + '.txt'), '--tblout', os.path.join('map_' + species, group_name + '.tbl'), '--cpu', str(cpu)] 163 | cmd.extend(parameters) 164 | cmd.extend([os.path.join('ref_hmm', group_name + '.hmm'), fasta]) 165 | ifcomplish = lib.runcmd(cmd, log, stdout=False) 166 | log.close() 167 | if ifcomplish: 168 | open(os.path.join('ok', "map_{}_{}.ok".format(species, group_name)), 'w').close() 169 | return 0, group_name 170 | else: 171 | return 1, group_name 172 | return 0, group_name 173 | 174 | # prepare the HMMs of alignments 175 | def prepare_ref(alns, cpu=8, np=8, moltype='dna', gencode=1, parameters=[]): 176 | lib.check_programs(['hmmbuild']) 177 | np = min(np, len(alns), cpu) 178 | ncpu = int(cpu / np) 179 | 180 | print("\nPreparing the reference alignments...") 181 | if os.path.isfile(os.path.join('ok', 'prepare_alignments.ok')): 182 | print("\nUsing the existing hmm files in directory 'ref_hmm'") 183 | else: 184 | print("\nBuilding HMMs for mapping...") 185 | if not os.path.isdir('ref_hmm'): 186 | os.mkdir('ref_hmm') 187 | args_list = [] 188 | kwds = {'moltype': moltype, 'gencode': gencode, 'parameters': parameters} 189 | usedcpu = ncpu * np 190 | for group_name, alnfile in alns.items(): 191 | if usedcpu < cpu: 192 | args_list.append((alnfile, group_name, ncpu+1)) 193 | usedcpu += 1 194 | else: 195 | args_list.append((alnfile, group_name, ncpu)) 196 | iferrors = lib.run_mp(hmmbuild, args_list, np, kwds=kwds) 197 | errors = [] 198 | for iferror in iferrors: 199 | if iferror[0] == 1: 200 | errors.append(iferror[1]) 201 | if errors: 202 | print("\nError in hmmbuild commands: {}".format(', '.join(errors))) 203 | sys.exit(1) 204 | 
open(os.path.join('ok', 'prepare_alignments.ok'), 'w').close() 205 | 206 | # map the reads to the HMMs of alignments 207 | def map_reads(alns, species, fastxs, file_format='guess', cpu=8, np=8, moltype='dna', gencode=1, split_len=None, split_slide=None, no_reverse=False, low_mem=False, parameters=[]): 208 | lib.check_programs(['hmmsearch']) 209 | if os.path.isfile(os.path.join('ok', "prepare_{}.ok".format(species))): 210 | print("\nUsing the existing temp FASTA file of {}".format(species)) 211 | all_seqs, total_reads = fastx2fasta(fastxs, species + '.temp.fasta', file_format, cpu=cpu, moltype=moltype, gencode=gencode, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse, low_mem=low_mem, output=False) 212 | else: 213 | print("\nPreparing the reads of {}...".format(species)) 214 | all_seqs, total_reads = fastx2fasta(fastxs, species + '.temp.fasta', file_format, cpu=cpu, moltype=moltype, split_len=split_len, split_slide=split_slide, no_reverse=no_reverse, low_mem=low_mem, gencode=gencode) 215 | open(os.path.join('ok', "prepare_{}.ok".format(species)), 'w').close() 216 | 217 | if not os.path.isdir('map_' + species): 218 | os.mkdir('map_' + species) 219 | np = min(np, len(alns), cpu) 220 | ncpu = int(cpu / np) 221 | print("\nMapping the reads of {} to reference alignments...".format(species)) 222 | args_list = [] 223 | kwds = {'parameters': parameters} 224 | usedcpu = ncpu * np 225 | for group_name in alns.keys(): 226 | if usedcpu < cpu: 227 | args_list.append((group_name, species, ncpu+1)) 228 | usedcpu += 1 229 | else: 230 | args_list.append((group_name, species, ncpu)) 231 | iferrors = lib.run_mp(hmmsearch, args_list, np, kwds=kwds) 232 | errors = [] 233 | for iferror in iferrors: 234 | if iferror[0] == 1: 235 | errors.append(iferror[1]) 236 | if errors: 237 | print("\nError in hmmsearch commands: {}".format(', '.join(errors))) 238 | sys.exit(1) 239 | if os.path.exists(species + '.temp.fasta'): 240 | os.remove(species + '.temp.fasta') 241 | return all_seqs, total_reads 242 | 243 | # read the information of HMMER results 244 | def read_hmmer(hmmtxt): 245 | hmmresults = [] 246 | qresults = SearchIO.parse(hmmtxt, 'hmmer3-text') 247 | for qresult in qresults: 248 | for hit in qresult: 249 | for HSP in hit: 250 | for HSPfrag in HSP: 251 | hmmresults.append(HSPfrag) 252 | return hmmresults 253 | 254 | # extract the target reads from the raw data 255 | def extract_reads(alns, species, all_seqs, moltype='dna', split_len=None, low_mem=False): 256 | all_hmmres = {} 257 | target_seqids = {} 258 | for group_name in alns.keys(): 259 | hmmresults = read_hmmer(os.path.join('map_' + species, group_name + '.txt')) 260 | all_hmmres[group_name] = hmmresults 261 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'w') 262 | for hmmresult in hmmresults: 263 | if moltype.startswith('dna'): 264 | hitid = hmmresult.hit_id 265 | if hitid.endswith('_rev'): 266 | hitid = hitid.replace('_rev', '') 267 | seqid = '_'.join(hitid.split('_')[:-1]) 268 | fastx_num = hitid.split('_')[-1].replace('fastx', '') 269 | else: 270 | seqid = '_'.join(hmmresult.hit_id.split('_')[:-2]) 271 | fastx_num = hmmresult.hit_id.split('_')[-2].replace('fastx', '') 272 | if low_mem: 273 | if target_seqids.get(fastx_num) is None: 274 | target_seqids[fastx_num] = {} 275 | if split_len: 276 | seqid0, start_end = seqid.split('_split_') 277 | if target_seqids[fastx_num].get(seqid0) is None: 278 | target_seqids[fastx_num][seqid0] = {} 279 | if target_seqids[fastx_num][seqid0].get(group_name) is None: 280 | 
target_seqids[fastx_num][seqid0][group_name] = [] 281 | target_seqids[fastx_num][seqid0][group_name].append(start_end.split('_')) 282 | else: 283 | if target_seqids[fastx_num].get(seqid) is None: 284 | target_seqids[fastx_num][seqid] = [] 285 | target_seqids[fastx_num][seqid].append(group_name) 286 | elif split_len: 287 | seqid, start_end = seqid.split('_split_') 288 | start, end = start_end.split('_') 289 | outfile.write(">{}\n{}\n".format('_'.join([seqid, 'split', start, end, 'fastx' + fastx_num]), all_seqs[int(fastx_num)-1][seqid][int(start)-1:int(end)])) 290 | else: 291 | outfile.write(">{}\n{}\n".format(seqid + '_fastx' + fastx_num, all_seqs[int(fastx_num)-1][seqid])) 292 | outfile.close() 293 | if low_mem: 294 | # low-memory mode: extract the reads from the file instead of the store 295 | for fastx_num, seqids in target_seqids.items(): 296 | fastx_iter, file_format = lib.read_fastx(all_seqs[int(fastx_num)-1][0], all_seqs[int(fastx_num)-1][1], low_mem=True) 297 | seqid = None 298 | seqstr = None 299 | line_num = 0 300 | for line in fastx_iter: 301 | line = line.rstrip() 302 | if file_format == 'fastq' and line.startswith('@') and line_num % 4 == 0: 303 | seqid = line.replace('@', '', 1).replace(' ', '_') 304 | elif file_format == 'fasta' and line.startswith('>'): 305 | if seqid is not None and seqids.get(seqid) is not None: 306 | if split_len: 307 | for group_name, start_ends in seqids[seqid].items(): 308 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 309 | for start_end in start_ends: 310 | outfile.write(">{}\n{}\n".format('_'.join([seqid, 'split', '_'.join(start_end), 'fastx' + fastx_num]), seqstr[int(start_end[0])-1:int(start_end[1])])) 311 | outfile.close() 312 | else: 313 | for group_name in seqids[seqid]: 314 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 315 | outfile.write(">{}\n{}\n".format(seqid + '_fastx' + fastx_num, seqstr)) 316 | outfile.close() 317 | arr = line.split(" ") 318 | seqid = arr[0].lstrip('>') 319 | seqstr = '' 320 | elif seqid: 321 | if file_format == 'fastq': 322 | if seqids.get(seqid) is not None: 323 | if split_len: 324 | for group_name, start_ends in seqids[seqid].items(): 325 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 326 | for start_end in start_ends: 327 | outfile.write(">{}\n{}\n".format('_'.join([seqid, 'split', '_'.join(start_end), 'fastx' + fastx_num]), line[int(start_end[0])-1:int(start_end[1])])) 328 | outfile.close() 329 | else: 330 | for group_name in seqids[seqid]: 331 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 332 | outfile.write(">{}\n{}\n".format(seqid + '_fastx' + fastx_num, line)) 333 | outfile.close() 334 | seqid = None 335 | else: 336 | seqstr += line 337 | line_num += 1 338 | if file_format == 'fasta' and seqstr and seqids.get(seqid) is not None: 339 | if split_len: 340 | for group_name, start_ends in seqids[seqid].items(): 341 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 342 | for start_end in start_ends: 343 | outfile.write(">{}\n{}\n".format('_'.join([seqid, 'split', '_'.join(start_end), 'fastx' + fastx_num]), seqstr[int(start_end[0])-1:int(start_end[1])])) 344 | outfile.close() 345 | else: 346 | for group_name in seqids[seqid]: 347 | outfile = open(os.path.join('map_' + species, group_name + '.targets.fa'), 'a') 348 | outfile.write(">{}\n{}\n".format(seqid + '_fastx' + fastx_num, seqstr)) 349 | outfile.close() 350 | return all_hmmres 351 | 
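# --- Illustrative usage sketch, not part of the module above ---
# A minimal example of how the functions in lib/map.py chain together when called
# directly (normally they are driven by the main program). It assumes HMMER is
# installed, the working directory already contains the 'ok' subdirectory created
# by the main program, and the hypothetical files 'ref_aln/OG0000001.fa',
# 'SP1_1.fq.gz' and 'SP1_2.fq.gz' exist.
if __name__ == '__main__':
    example_alns = {'OG0000001': 'ref_aln/OG0000001.fa'}  # hypothetical reference alignment
    # build one HMM per reference alignment into the 'ref_hmm' directory
    prepare_ref(example_alns, cpu=4, np=1, moltype='dna')
    # convert the reads of species 'SP1' into a temporary FASTA file and search them against the HMMs
    all_seqs, total_reads = map_reads(example_alns, 'SP1', ['SP1_1.fq.gz', 'SP1_2.fq.gz'], file_format='fastq', cpu=4, np=1)
    # write the mapped reads of each group to 'map_SP1/<group>.targets.fa' for downstream assembly
    extract_reads(example_alns, 'SP1', all_seqs)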
-------------------------------------------------------------------------------- /lib/assemble.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import sys 4 | import os 5 | from copy import deepcopy 6 | from Bio.Seq import reverse_complement, translate 7 | from Bio import AlignIO 8 | try: 9 | import library as lib 10 | except ImportError: 11 | import lib.library as lib 12 | 13 | # class to process the information of the read or sequence hits 14 | class read_hit: 15 | 16 | # inherit the key parameters from the HSPfrag objects 17 | def __init__(self, hmmresult): 18 | self.hit_id = hmmresult.hit_id 19 | self.hit_start = hmmresult.hit_start + 1 20 | self.hit_end = hmmresult.hit_end 21 | self.hit_seq = str(hmmresult.hit.seq).upper() 22 | self.query_start = hmmresult.query_start + 1 23 | self.query_end = hmmresult.query_end 24 | self.query_seq = str(hmmresult.query.seq).upper() 25 | 26 | # map the sequence of the target region of target read without assembly 27 | def map_read_seqstr(self, seqid, seqstr, reads, moltype='dna', gencode=1): 28 | self.moltype = moltype 29 | self.gencode = gencode 30 | self.seq_comp = {} 31 | self.reads = reads 32 | rep = len(reads) 33 | self.read_count = rep 34 | pos = 0 35 | j = 0 36 | jpos = 0 37 | if moltype.startswith('dna'): 38 | if seqid.endswith('_rev'): 39 | seqstr = reverse_complement(seqstr) 40 | for i in range(self.hit_start-1, self.hit_end): 41 | while self.hit_seq[pos] in ['.', '-']: 42 | self.seq_comp[self.query_start + j] = {'-': rep} 43 | j += 1 44 | pos += 1 45 | jpos += 1 46 | if self.query_seq[jpos] not in ['.', '-']: 47 | self.seq_comp[self.query_start + j] = {seqstr[i]: rep} 48 | j += 1 49 | pos += 1 50 | jpos += 1 51 | else: 52 | # for protein or codon-to-nucleotide alignments (codon alignments) 53 | if seqid.endswith('rev'): 54 | seqstr = reverse_complement(seqstr) 55 | i = int(seqid.split('_')[-1].replace('pos', '').replace('rev', '')) 56 | codon_pos = i - 1 57 | for protpos in range(self.hit_start-1, self.hit_end): 58 | i = protpos * 3 + codon_pos 59 | while self.hit_seq[pos] in ['.', '-']: 60 | self.seq_comp[self.query_start + j] = {1: {'-': rep}, 2: {'-': rep}, 3: {'-': rep}} 61 | j += 1 62 | pos += 1 63 | jpos += 1 64 | if self.query_seq[jpos] not in ['.', '-']: 65 | self.seq_comp[self.query_start + j] = {1: {seqstr[i]: rep}, 2: {seqstr[i+1]: rep}, 3: {seqstr[i+2]: rep}} 66 | j += 1 67 | pos += 1 68 | jpos += 1 69 | 70 | # print the sequence from the hit information 71 | def print_seq(self, seq, protseq=None): 72 | for pos, base in self.seq_comp.items(): 73 | if protseq: 74 | for i in [1, 2, 3]: 75 | seq[3*pos+i-4] = base['codon'][i-1] 76 | protseq[pos-1] = base['base'] 77 | else: 78 | seq[pos-1] = base 79 | return seq, protseq 80 | 81 | # calculate pseudo RPKM of the hit using read counts 82 | def addRPKM(self, total_count): 83 | self.pseudoRPKM = 1000 * 1000000 * self.read_count / total_count / len(self.seq_comp) 84 | 85 | # read the alignments with stockholm format for HMM construction (conservative sites for output alignments) 86 | def read_sto(stofile): 87 | seqs = {} 88 | alignment = AlignIO.read(stofile, "stockholm") 89 | cons_pos = alignment._per_col_annotations['reference_annotation'] 90 | for seq in alignment: 91 | seqstr = str(seq.seq) 92 | seqs[seq.id] = '' 93 | for i in range(len(seqstr)): 94 | if cons_pos[i] == 'x': 95 | seqs[seq.id] += seqstr[i] 96 | return seqs 97 | 98 | # calculate the most common base of the site 99 | def most_common(bases): 100 | 
return max(bases, key=bases.get) 101 | 102 | # calculate the identical site number between two hits 103 | def calculate_ident(bases1, bases2): 104 | total_bases1 = sum(bases1.values()) 105 | total_bases2 = sum(bases2.values()) 106 | ident = 0 107 | for base in set(bases1.keys()) & set(bases2.keys()): 108 | ident += bases1[base] / total_bases1 * bases2[base] / total_bases2 109 | return ident 110 | 111 | # calculate overlap length and percent identity between two hits for cross decontamination 112 | def compare_hits(hitA, hitB, min_overlap=30, min_pident=98): 113 | if hitA.query_start > hitB.query_start: 114 | hit1 = deepcopy(hitB) 115 | hit2 = deepcopy(hitA) 116 | else: 117 | hit1 = deepcopy(hitA) 118 | hit2 = deepcopy(hitB) 119 | nident = 0 120 | end_pos = min(hit1.query_end, hit2.query_end) 121 | if hit1.moltype.startswith('dna'): 122 | overlap_len = end_pos - hit2.query_start + 1 123 | if overlap_len < min_overlap: 124 | return 0, 0 125 | for i in range(hit2.query_start, end_pos+1): 126 | if hit1.seq_comp[i] == hit2.seq_comp[i]: 127 | nident += 1 128 | else: 129 | # for codon alignments 130 | overlap_len = (end_pos - hit2.query_start + 1) * 3 131 | if overlap_len < min_overlap: 132 | return 0, 0 133 | for i in range(hit2.query_start, end_pos+1): 134 | for k in [0, 1, 2]: 135 | if hit1.seq_comp[i]['codon'][k] == hit2.seq_comp[i]['codon'][k]: 136 | nident += 1 137 | if overlap_len == 0: 138 | pident = 0 139 | else: 140 | pident = nident / overlap_len * 100 141 | if pident < min_pident: 142 | return 0, 0 143 | else: 144 | return overlap_len, pident 145 | 146 | # class to assemble the reads and process the reads mapped to the alignment 147 | class Assembler: 148 | 149 | # construct the site composition of the alignments and calculate the conservative score of the outgroup sequence 150 | def __init__(self, group_name, alnfile, species, outgroup=[], ingroup=[], sep='.'): 151 | self.species = species 152 | self.name = group_name 153 | self.alnfile = alnfile 154 | stofile = os.path.join('ref_hmm', group_name + '.sto') 155 | self.aln = read_sto(stofile) 156 | self.aln_len = len(list(self.aln.values())[0]) 157 | outgroup_aln = {} 158 | ingroup_aln = {} 159 | # process the gap symbols 160 | for seqid, seqstr in self.aln.items(): 161 | self.aln[seqid] = seqstr.replace('.', '-') 162 | self.aln[seqid] = seqstr.replace('~', '-') 163 | if outgroup: 164 | for spid in outgroup: 165 | for seqid in self.aln.keys(): 166 | if seqid == spid or seqid.startswith(spid + sep): 167 | outgroup_aln[seqid] = self.aln[seqid] 168 | break 169 | else: 170 | outgroup_aln = deepcopy(self.aln) 171 | if ingroup: 172 | for spid in ingroup: 173 | for seqid in self.aln.keys(): 174 | if seqid == spid or seqid.startswith(spid + sep): 175 | ingroup_aln[seqid] = self.aln[seqid] 176 | break 177 | else: 178 | if len(outgroup_aln) == len(self.aln): 179 | ingroup_aln = deepcopy(self.aln) 180 | else: 181 | for seqid, seqstr in self.aln.items(): 182 | if outgroup_aln.get(seqid) is None: 183 | ingroup_aln[seqid] = seqstr 184 | self.site_composition = {} 185 | self.outgroup_score = {} 186 | for seqid in outgroup_aln.keys(): 187 | self.outgroup_score[seqid] = {} 188 | for i in range(self.aln_len): 189 | self.site_composition[i+1] = {} 190 | for seqstr in ingroup_aln.values(): 191 | base = seqstr[i] 192 | if self.site_composition[i+1].get(base) is None: 193 | self.site_composition[i+1][base] = 1 194 | else: 195 | self.site_composition[i+1][base] += 1 196 | for seqid in outgroup_aln.keys(): 197 | self.outgroup_score[seqid][i+1] = 
self.site_composition[i+1].get(outgroup_aln[seqid][i], 0) 198 | 199 | # map the new hits with or without assembly to the original mapped hits 200 | def map_binning(self, hit, seqid, seqstr, reads, min_overlap=30, min_pident=98): 201 | read_count = len(reads) 202 | for n in range(len(self.hits)): 203 | refhit = self.hits[n] 204 | newhit = deepcopy(refhit) 205 | end_pos = min(refhit.query_end, hit.query_end) 206 | pos = 0 207 | jpos = 0 208 | if self.moltype.startswith('dna'): 209 | overlap_len = end_pos - hit.query_start + 1 210 | if overlap_len < min_overlap: 211 | continue 212 | if seqid.endswith('_rev'): 213 | seqstr = reverse_complement(seqstr) 214 | nident = 0 215 | j = hit.hit_start - 1 216 | for i in range(hit.query_start, end_pos+1): 217 | while hit.query_seq[jpos] in ['.', '-']: 218 | j += 1 219 | pos += 1 220 | jpos += 1 221 | if hit.hit_seq[pos] in ['.', '-']: 222 | nident += calculate_ident(refhit.seq_comp[i], {'-': read_count}) 223 | if newhit.seq_comp[i].get('-') is None: 224 | newhit.seq_comp[i]['-'] = read_count 225 | else: 226 | newhit.seq_comp[i]['-'] += read_count 227 | else: 228 | nident += calculate_ident(refhit.seq_comp[i], {seqstr[j]: read_count}) 229 | if newhit.seq_comp[i].get(seqstr[j]) is None: 230 | newhit.seq_comp[i][seqstr[j]] = read_count 231 | else: 232 | newhit.seq_comp[i][seqstr[j]] += read_count 233 | j += 1 234 | pos += 1 235 | jpos += 1 236 | if nident / overlap_len * 100 < min_pident: 237 | continue 238 | i = end_pos + 1 239 | while i < hit.query_end + 1: 240 | if hit.hit_seq[pos] in ['.', '-']: 241 | newhit.seq_comp[i] = {'-': read_count} 242 | i += 1 243 | elif hit.query_seq[jpos] in ['.', '-']: 244 | j += 1 245 | else: 246 | newhit.seq_comp[i] = {seqstr[j]: read_count} 247 | i += 1 248 | j += 1 249 | pos += 1 250 | jpos += 1 251 | if end_pos < hit.query_end: 252 | newhit.query_end = hit.query_end 253 | newhit.read_count += read_count 254 | newhit.reads.extend(reads) 255 | self.hits[n] = newhit 256 | #debug 257 | #print("overlap length:", str(overlap_len), "pident:", str(nident / overlap_len * 100), "\nMerging\n", vars(refhit), "\nand\n", vars(hit), "seqid: ", seqid, "seq:", seqstr, "\ninto\n", vars(newhit)) 258 | #input() 259 | hit = None 260 | break 261 | else: 262 | # for codon alignments 263 | overlap_len = (end_pos - hit.query_start + 1) * 3 264 | if overlap_len < min_overlap: 265 | continue 266 | if seqid.endswith('rev'): 267 | seqstr = reverse_complement(seqstr) 268 | i = int(seqid.split('_')[-1].replace('pos', '').replace('rev', '')) 269 | codon_pos = i - 1 270 | nident = 0 271 | j = hit.hit_start * 3 + codon_pos - 3 272 | for i in range(hit.query_start, end_pos+1): 273 | while hit.query_seq[jpos] in ['.', '-']: 274 | j += 3 275 | pos += 1 276 | jpos += 1 277 | if hit.hit_seq[pos] in ['.', '-']: 278 | for k in [1, 2, 3]: 279 | nident += calculate_ident(refhit.seq_comp[i][k], {'-': read_count}) 280 | if newhit.seq_comp[i][k].get('-') is None: 281 | newhit.seq_comp[i][k]['-'] = read_count 282 | else: 283 | newhit.seq_comp[i][k]['-'] += read_count 284 | else: 285 | for k in [1, 2, 3]: 286 | nident += calculate_ident(refhit.seq_comp[i][k], {seqstr[j+k-1]: read_count}) 287 | if newhit.seq_comp[i][k].get(seqstr[j+k-1]) is None: 288 | newhit.seq_comp[i][k][seqstr[j+k-1]] = read_count 289 | else: 290 | newhit.seq_comp[i][k][seqstr[j+k-1]] += read_count 291 | j += 3 292 | pos += 1 293 | jpos += 1 294 | if nident / overlap_len * 100 < min_pident: 295 | continue 296 | i = end_pos + 1 297 | while i < hit.query_end + 1: 298 | if hit.hit_seq[pos] in ['.', 
'-']: 299 | newhit.seq_comp[i] = {1: {'-': read_count}, 2: {'-': read_count}, 3: {'-': read_count}} 300 | i += 1 301 | elif hit.query_seq[jpos] in ['.', '-']: 302 | j += 3 303 | else: 304 | newhit.seq_comp[i] = {1: {seqstr[j]: read_count}, 2: {seqstr[j+1]: read_count}, 3: {seqstr[j+2]: read_count}} 305 | i += 1 306 | j += 3 307 | pos += 1 308 | jpos += 1 309 | if end_pos < hit.query_end: 310 | newhit.query_end = hit.query_end 311 | newhit.read_count += read_count 312 | newhit.reads.extend(reads) 313 | self.hits[n] = newhit 314 | #debug 315 | #print("overlap length:", str(overlap_len), "pident:", str(nident / overlap_len * 100), "\nMerging\n", vars(refhit), "\nand\n", vars(hit), "seqid: ", seqid, "seq:", seqstr, "\ninto\n", vars(newhit)) 316 | #input() 317 | hit = None 318 | break 319 | 320 | # if not assembled to any mapped hits, mapped it simply to the alignments 321 | if hit is not None: 322 | hit.map_read_seqstr(seqid, seqstr, reads, self.moltype, self.gencode) 323 | self.hits.append(hit) 324 | 325 | # map and assemble the hits 326 | def map_read_seq(self, hmmresults, fasta, moltype='dna', gencode=1, no_assemble=False, min_overlap=30, min_pident=98): 327 | self.moltype = moltype 328 | self.gencode = gencode 329 | self.stat_num = {'raw hits': len(hmmresults)} 330 | self.hits = [] 331 | targets = [] 332 | readids = [] 333 | hmmhits = [] 334 | hmmresults.sort(key = lambda x : x.query_end) 335 | hmmresults.sort(key = lambda x : x.query_start) 336 | for hmmresult in hmmresults: 337 | if moltype.startswith('dna'): 338 | hitid = hmmresult.hit_id 339 | if hitid.endswith('_rev'): 340 | hitid = hitid.replace('_rev', '') 341 | targets.append(hmmresult.hit_id) 342 | readids.append(hitid) 343 | hmmhits.append(read_hit(hmmresult)) 344 | else: 345 | targets.append(hmmresult.hit_id) 346 | readids.append('_'.join(hmmresult.hit_id.split('_')[:-1])) 347 | hmmhits.append(read_hit(hmmresult)) 348 | reads = lib.read_fastx(fasta, 'fasta', select_list=readids) 349 | 350 | # cluster the identical hits first 351 | if not no_assemble: 352 | reps = {} 353 | for i in range(len(targets)): 354 | identifier = '_'.join([reads[readids[i]].upper(), str(hmmhits[i].query_start), str(hmmhits[i].query_end)]) 355 | if reps.get(identifier) is None: 356 | reps[identifier] = [] 357 | reps[identifier].append(readids[i]) 358 | 359 | # map and assemble each hit 360 | for i in range(len(targets)): 361 | hit = hmmhits[i] 362 | #debug 363 | #print(vars(hit)) 364 | #input() 365 | seqstr = reads[readids[i]].upper() 366 | if no_assemble: 367 | hit.map_read_seqstr(targets[i], seqstr, [readids[i]], moltype, gencode) 368 | self.hits.append(hit) 369 | else: 370 | identifier = '_'.join([seqstr, str(hit.query_start), str(hit.query_end)]) 371 | if reps[identifier][-1] != 'mapped': 372 | self.map_binning(hit, targets[i], seqstr, reps[identifier], min_overlap, min_pident) 373 | reps[identifier].append('mapped') 374 | 375 | # record the hit number 376 | if not no_assemble: 377 | self.stat_num['assembled contigs'] = len(self.hits) 378 | 379 | # convert the assembled contig information into simple sequence information (the most common bases instead of SNPs) 380 | def simplify_hit_info(self): 381 | for i in range(len(self.hits)): 382 | for j in self.hits[i].seq_comp.keys(): 383 | bases = self.hits[i].seq_comp[j] 384 | if self.hits[i].moltype.startswith('dna'): 385 | self.hits[i].seq_comp[j] = most_common(bases) 386 | else: 387 | base1 = most_common(bases[1]) 388 | base2 = most_common(bases[2]) 389 | base3 = most_common(bases[3]) 390 | if base1 == '-' 
and base2 == '-' and base3 == '-': 391 | base = '-' 392 | else: 393 | try: 394 | base = translate(base1 + base2 + base3, table=self.hits[i].gencode) 395 | except: 396 | base = 'X' 397 | self.hits[i].seq_comp[j] = {'base': base, 'codon': [base1, base2, base3]} 398 | 399 | # remove the foreign sequences based on conservative scores 400 | def remove_out_hit(self, weight=1): 401 | to_remove = [] 402 | for i in range(len(self.hits)): 403 | hit = self.hits[i] 404 | score = 0 405 | outgroup_score = {} 406 | for seqid in self.outgroup_score.keys(): 407 | outgroup_score[seqid] = 0 408 | for pos, seq_comp in hit.seq_comp.items(): 409 | if hit.moltype.startswith('dna'): 410 | base = seq_comp 411 | else: 412 | base = seq_comp['base'] 413 | score += self.site_composition[pos].get(base, 0) 414 | for seqid in self.outgroup_score.keys(): 415 | outgroup_score[seqid] += self.outgroup_score[seqid][pos] 416 | if score < min(outgroup_score.values()) * weight: 417 | to_remove.append(i) 418 | for i in reversed(to_remove): 419 | self.hits.pop(i) 420 | self.stat_num['ingroup clean seqs'] = len(self.hits) 421 | 422 | # output the final sequences 423 | def output_sequence(self, mode='consensus', nuclfill='N', protfill='X'): 424 | seqs = [] 425 | protseqs = [] 426 | seqids = [] 427 | if len(self.hits) == 0: 428 | return seqs, protseqs, seqids 429 | if mode.startswith('consensus'): 430 | seq_info = {} 431 | # the sequences consisting of more hits have more weights 432 | self.hits.sort(key = lambda x : x.read_count, reverse = True) 433 | for hit in self.hits: 434 | seqids.extend(hit.reads) 435 | for pos, base in hit.seq_comp.items(): 436 | if hit.moltype.startswith('dna'): 437 | if seq_info.get(pos) is None: 438 | seq_info[pos] = {} 439 | if seq_info[pos].get(base) is None: 440 | seq_info[pos][base] = 1 441 | else: 442 | seq_info[pos][base] += 1 443 | else: 444 | if seq_info.get(pos) is None: 445 | seq_info[pos] = {1: {}, 2: {}, 3:{}} 446 | for k in [1, 2, 3]: 447 | if seq_info[pos][k].get(base['codon'][k-1]) is None: 448 | seq_info[pos][k][base['codon'][k-1]] = 1 449 | else: 450 | seq_info[pos][k][base['codon'][k-1]] += 1 451 | if self.moltype.startswith('dna'): 452 | seq = list(nuclfill * self.aln_len) 453 | else: 454 | seq = list(nuclfill * (3*self.aln_len)) 455 | protseq = list(protfill * self.aln_len) 456 | for pos, bases in seq_info.items(): 457 | if self.moltype.startswith('dna'): 458 | if mode == 'consensus' or len(set(bases)) == 1: 459 | seq[pos-1] = most_common(bases) 460 | else: 461 | for i in [1, 2, 3]: 462 | if mode == 'consensus' or len(set(bases[i])) == 1: 463 | seq[3*pos+i-4] = most_common(bases[i]) 464 | codon = ''.join(seq[3*pos-3:3*pos]) 465 | if codon == '---': 466 | protseq[pos-1] = '-' 467 | else: 468 | try: 469 | protseq[pos-1] = translate(codon, table=self.gencode) 470 | except: 471 | protseq[pos-1] = 'X' 472 | seqs.append(''.join(seq)) 473 | if not self.moltype.startswith('dna'): 474 | protseqs.append(''.join(protseq)) 475 | seqids = [' '.join(seqids)] 476 | elif mode == 'all': 477 | # output all assembled sequences without consensus or selection 478 | for hit in self.hits: 479 | if self.moltype.startswith('dna'): 480 | seq = list(nuclfill * self.aln_len) 481 | protseq = None 482 | else: 483 | seq = list(nuclfill * (3*self.aln_len)) 484 | protseq = list(protfill * self.aln_len) 485 | seq, protseq = hit.print_seq(seq, protseq) 486 | seqs.append(''.join(seq)) 487 | if not self.moltype.startswith('dna'): 488 | protseqs.append(''.join(protseq)) 489 | seqids.append(' '.join(hit.reads)) 490 | else: 
491 | if mode == 'expression': 492 | self.hits.sort(key = lambda x : x.read_count, reverse = True) 493 | else: 494 | # for longest sequence 495 | self.hits.sort(key = lambda x : len(x.seq_comp), reverse = True) 496 | if self.moltype.startswith('dna'): 497 | seq = list(nuclfill * self.aln_len) 498 | protseq = None 499 | else: 500 | seq = list(nuclfill * (3*self.aln_len)) 501 | protseq = list(protfill * self.aln_len) 502 | seq, protseq = self.hits[0].print_seq(seq, protseq) 503 | seqs.append(''.join(seq)) 504 | if not self.moltype.startswith('dna'): 505 | protseqs.append(''.join(protseq)) 506 | seqids = [' '.join(self.hits[0].reads)] 507 | return seqs, protseqs, list(map(lambda x : x.split('_fastx')[0], seqids)) 508 | 509 | # a single task for assembly and foreign sequence removal 510 | def generate_assembly(group_name, alnfile, species, hmmresults, np=8, moltype='dna', gencode=1, no_assemble=False, overlap_len=30, overlap_pident=98, no_out_filter=False, outgroup=[], ingroup=[], sep='.', outgroup_weight=1, final_seq='consensus'): 511 | assembler = Assembler(group_name, alnfile, species, outgroup=outgroup, ingroup=ingroup, sep=sep) 512 | assembler.map_read_seq(hmmresults, os.path.join('map_' + species, group_name + '.targets.fa'), moltype=moltype, gencode=gencode, no_assemble=no_assemble, min_overlap=overlap_len, min_pident=overlap_pident) 513 | assembler.simplify_hit_info() 514 | if not no_out_filter: 515 | assembler.remove_out_hit(outgroup_weight) 516 | # pre-process the hits to reduce the memory when not to assemble (too many hits) 517 | if no_assemble and len(assembler.hits) > 0: 518 | if final_seq.startswith('consensus'): 519 | seq_info = {} 520 | for hit in assembler.hits: 521 | for pos, base in hit.seq_comp.items(): 522 | if hit.moltype.startswith('dna'): 523 | if seq_info.get(pos) is None: 524 | seq_info[pos] = {} 525 | if seq_info[pos].get(base) is None: 526 | seq_info[pos][base] = 1 527 | else: 528 | seq_info[pos][base] += 1 529 | else: 530 | if seq_info.get(pos) is None: 531 | seq_info[pos] = {1: {}, 2: {}, 3:{}} 532 | for k in [1, 2, 3]: 533 | if seq_info[pos][k].get(base['codon'][k-1]) is None: 534 | seq_info[pos][k][base['codon'][k-1]] = 1 535 | else: 536 | seq_info[pos][k][base['codon'][k-1]] += 1 537 | start_pos = 0 538 | end_pos = 0 539 | for i in range(1, assembler.aln_len + 1): 540 | if seq_info.get(i) is None: 541 | continue 542 | if assembler.moltype.startswith('dna'): 543 | if final_seq == 'consensus' or len(seq_info[i]) == 1: 544 | assembler.hits[0].seq_comp[i] = most_common(seq_info[i]) 545 | else: 546 | assembler.hits[0].seq_comp[i] = None 547 | else: 548 | if final_seq == 'consensus' or len(seq_info[i][1]) == 1: 549 | base1 = most_common(seq_info[i][1]) 550 | else: 551 | base1 = 'N' 552 | if final_seq == 'consensus' or len(seq_info[i][2]) == 1: 553 | base2 = most_common(seq_info[i][2]) 554 | else: 555 | base2 = 'N' 556 | if final_seq == 'consensus' or len(seq_info[i][3]) == 1: 557 | base3 = most_common(seq_info[i][3]) 558 | else: 559 | base3 = 'N' 560 | if base1 == '-' and base2 == '-' and base3 == '-': 561 | base = '-' 562 | else: 563 | try: 564 | base = translate(base1 + base2 + base3, table=self.hits[i].gencode) 565 | except: 566 | base = 'X' 567 | assembler.hits[0].seq_comp[i] = {'base': base, 'codon': [base1, base2, base3]} 568 | if not start_pos: 569 | start_pos = i 570 | end_pos = i 571 | assembler.hits[0].query_start = start_pos 572 | assembler.hits[0].query_end = end_pos 573 | elif final_seq == 'length': 574 | assembler.hits.sort(key = lambda x : 
len(x.seq_comp), reverse = True) 575 | if final_seq != 'all': 576 | assembler.hits = [assembler.hits[0]] 577 | return assembler 578 | 579 | # assemble the hits and remove putative foreign sequences in multiple processes 580 | def generate_assembly_mp(alns, species, all_hmmres, np=8, moltype='dna', gencode=1, no_assemble=False, overlap_len=30, overlap_pident=98, no_out_filter=False, outgroup=[], ingroup=[], sep='.', outgroup_weight=1, final_seq='consensus'): 581 | print("\nAssembling the reads of {} and removing putative foreign sequences...".format(species)) 582 | assemblers = {} 583 | args_list = [] 584 | kwds = {'moltype': moltype, 'gencode': gencode, 'no_assemble': no_assemble, 'overlap_len': overlap_len, 'overlap_pident': overlap_pident, 'no_out_filter': no_out_filter, 'outgroup': outgroup, 'ingroup': ingroup, 'sep': sep, 'outgroup_weight': outgroup_weight, 'final_seq': final_seq} 585 | for group_name, alnfile in alns.items(): 586 | args_list.append((group_name, alnfile, species, all_hmmres[group_name])) 587 | asbs = lib.run_mp(generate_assembly, args_list, np, kwds=kwds) 588 | for assembler in asbs: 589 | assemblers[assembler.name] = assembler 590 | return assemblers 591 | 592 | # detect and remove cross contamination of assembled sequences 593 | def cross_decontam(assemblers, total_reads, min_overlap=30, min_pident=98, min_expression=0.2, min_fold=2): 594 | to_remove = {} 595 | for sp1 in assemblers.keys(): 596 | to_remove[sp1] = [] 597 | for i in range(len(assemblers[sp1].hits)): 598 | assemblers[sp1].hits[i].addRPKM(total_reads[sp1]) 599 | statout = open(os.path.join('stat_info', list(assemblers.values())[0].name + '.cross_info.tsv'), 'w') 600 | statout.write("species1\tquery_start1\tquery_end1\tspecies2\tquery_start2\tquery_end2\toverlap_length\tpident\tpseudoRPKM1\tpseudoRPKM2\texpression_fold\tseq1\tseq2\n") 601 | for sp1, assembler1 in assemblers.items(): 602 | for sp2, assembler2 in assemblers.items(): 603 | if sp1 == sp2: 604 | continue 605 | for i in range(len(assembler1.hits)): 606 | for j in range(len(assembler2.hits)): 607 | overlap_len, pident = compare_hits(assembler1.hits[i], assembler2.hits[j], min_overlap, min_pident) 608 | if overlap_len: 609 | exp_fold = assembler1.hits[i].pseudoRPKM / assembler2.hits[j].pseudoRPKM 610 | seq1 = '' 611 | for pos in range(assembler1.hits[i].query_start, assembler1.hits[i].query_end+1): 612 | if assembler1.moltype.startswith('dna'): 613 | seq1 += assembler1.hits[i].seq_comp[pos] 614 | else: 615 | for k in [1, 2, 3]: 616 | seq1 += assembler1.hits[i].seq_comp[pos]['codon'][k-1] 617 | seq2 = '' 618 | for pos in range(assembler2.hits[j].query_start, assembler2.hits[j].query_end+1): 619 | if assembler2.moltype.startswith('dna'): 620 | seq2 += assembler2.hits[j].seq_comp[pos] 621 | else: 622 | for k in [1, 2, 3]: 623 | seq2 += assembler2.hits[j].seq_comp[pos]['codon'][k-1] 624 | statout.write("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n".format(assembler1.species, assembler1.hits[i].query_start, assembler1.hits[i].query_end, assembler2.species, assembler2.hits[j].query_start, assembler2.hits[j].query_end, overlap_len, pident, assembler1.hits[i].pseudoRPKM, assembler2.hits[j].pseudoRPKM, exp_fold, seq1, seq2)) 625 | if exp_fold > min_fold and assembler1.hits[i].pseudoRPKM >= min_expression and j not in to_remove[sp2]: 626 | to_remove[sp2].append(j) 627 | elif exp_fold < 1 / min_fold and assembler2.hits[j].pseudoRPKM >= min_expression and i not in to_remove[sp1]: 628 | to_remove[sp1].append(i) 629 | statout.close() 630 | for sp, 
remove_indexes in to_remove.items(): 631 | for i in sorted(remove_indexes, reverse=True): 632 | assemblers[sp].hits.pop(i) 633 | assemblers[sp].stat_num['cross clean seqs'] = len(assemblers[sp].hits) 634 | return assemblers 635 | 636 | # generate conservative reference codon alignments (corresponding to protein alignments for HMM construction) 637 | def generate_codon_ref(assembler): 638 | codon_seqs = lib.read_fastx(assembler.alnfile, 'fasta') 639 | prot_seqs = assembler.aln 640 | seqs = {} 641 | for seqid, protstr in prot_seqs.items(): 642 | seqs[seqid] = '' 643 | i = 0 644 | j = 0 645 | while j < len(protstr) and i + 3 <= len(codon_seqs[seqid]): 646 | protbase = protstr[j] 647 | codon = codon_seqs[seqid][i:i+3] 648 | #debug 649 | #print(seqid, 'codon', i, codon, 'protein', j, protbase) 650 | #input() 651 | if codon == '---': 652 | if protbase == '-': 653 | seqs[seqid] += codon 654 | j += 1 655 | i += 3 656 | elif translate(codon, table=assembler.gencode) == protbase: 657 | seqs[seqid] += codon 658 | j += 1 659 | i += 3 660 | elif protbase == '-': 661 | seqs[seqid] += '---' 662 | j += 1 663 | else: 664 | i += 3 665 | if j < len(protstr): 666 | print("\nError: fail to match the codon columns to the protein alignment HMM: {}-{}-{}!".format(assembler.name, seqid, j)) 667 | return None 668 | #old version 669 | #i = 0 670 | #for j in range(len(assembler.outgroup_seq)): 671 | # ifmatch = False 672 | # while not ifmatch: 673 | # if i + 3 > len(codon_seqs[assembler.outgroup]): 674 | # break 675 | # ifmatch = True 676 | # sites = {} 677 | # for seqid, protstr in prot_seqs.items(): 678 | # protbase = protstr[j] 679 | # codon = codon_seqs[seqid][i:i+3] 680 | # #debug 681 | # print(seqid, 'codon', i, codon, 'protein', j, protbase) 682 | # input() 683 | # if codon == '---': 684 | # if protbase != '-': 685 | # ifmatch = False 686 | # break 687 | # else: 688 | # sites[seqid] = codon 689 | # elif translate(codon, table=assembler.gencode) != protbase: 690 | # ifmatch = False 691 | # break 692 | # else: 693 | # sites[seqid] = codon 694 | # i += 3 695 | # if not ifmatch: 696 | # print("\nError: fail to match the codon columns to the protein alignment HMM: {}-{}!".format(assembler.name, j+1)) 697 | # return None 698 | # for seqid in seqs.keys(): 699 | # seqs[seqid] += sites[seqid] 700 | return seqs 701 | 702 | # a single task for cross contamination and final sequence output 703 | def cross_and_output(group_name, assemblers, sps, total_reads, moltype='dna', gencode=1, no_assemble=False, no_cross_species=False, min_overlap=30, min_pident=98, min_exp=0.2, min_fold=2, unknow='unknow', final_seq='consensus', no_ref=False, sep='.', keep_seqid=False): 704 | if not no_cross_species and len(sps) > 1 and not no_assemble: 705 | assemblers = cross_decontam(assemblers, total_reads, min_overlap=min_overlap, min_pident=min_pident, min_expression=min_exp, min_fold=min_fold) 706 | if unknow == 'unknow': 707 | nuclfill = 'N' 708 | protfill = 'X' 709 | else: 710 | nuclfill = unknow 711 | protfill = unknow 712 | seqs = {} 713 | protseqs = {} 714 | if not no_ref: 715 | # prepare the reference alignments 716 | if moltype.startswith('dna'): 717 | seqs.update(assemblers[sps[0]].aln) 718 | else: 719 | protseqs.update(assemblers[sps[0]].aln) 720 | if moltype == 'codon': 721 | seqs = generate_codon_ref(assemblers[sps[0]]) 722 | if seqs is None: 723 | return 1, group_name 724 | 725 | # print the stat information 726 | statout = open(os.path.join('stat_info', group_name + '.stat_info.tsv'), 'w') 727 | statout.write("species\t" + 
"\t".join(assemblers[sps[0]].stat_num.keys()) + "\tfinal seqs\n") 728 | for sp in sps: 729 | seq_list, protseq_list, seqid_list = assemblers[sp].output_sequence(mode=final_seq, nuclfill=nuclfill, protfill=protfill) 730 | statout.write("{}\t{}\t{}\n".format(sp, "\t".join(map(str, assemblers[sp].stat_num.values())), len(seq_list))) 731 | for i in range(len(seq_list)): 732 | if keep_seqid: 733 | if no_assemble and not final_seq.startswith('consensus'): 734 | seqid = seqid_list[i] 735 | else: 736 | seqid = sep.join([sp, group_name, str(i+1)]) + ' ' + seqid_list[i] 737 | else: 738 | seqid = sep.join([sp, group_name, str(i+1)]) 739 | seqs[seqid] = seq_list[i] 740 | if not moltype.startswith('dna'): 741 | protseqs[seqid] = protseq_list[i] 742 | statout.close() 743 | 744 | # cross decontamination for consensus without assembly 745 | if not no_cross_species and len(sps) > 1 and no_assemble and final_seq.startswith('consensus'): 746 | crossout = open(os.path.join('stat_info', group_name + '.cross_info.tsv'), 'w') 747 | crossout.write("seqid1\tseqid2\toverlap_length\tpident\tpseudoRPKM1\tpseudoRPKM2\texpression_fold\tseq1\tseq2\n") 748 | for sp1 in sps: 749 | for sp2 in sps: 750 | if sp1 == sp2: 751 | continue 752 | seqid1 = None 753 | seqid2 = None 754 | for seqid in seqs.keys(): 755 | if seqid.startswith(sp1 + sep): 756 | seqid1 = seqid 757 | elif seqid.startswith(sp2 + sep): 758 | seqid2 = seqid 759 | if seqid1 is None or seqid2 is None: 760 | continue 761 | overlap_len = 0 762 | nident = 0 763 | length1 = 0 764 | length2 = 0 765 | for i in range(len(seqs[seqid1])): 766 | if seqs[seqid1][i] != nuclfill: 767 | length1 += 1 768 | if seqs[seqid2][i] != nuclfill: 769 | length2 += 1 770 | if seqs[seqid1][i] == nuclfill or seqs[seqid2][i] == nuclfill: 771 | continue 772 | overlap_len += 1 773 | if seqs[seqid1][i] == seqs[seqid2][i]: 774 | nident += 1 775 | if overlap_len < min_overlap: 776 | continue 777 | pident = 0 778 | if overlap_len > 0: 779 | pident = nident / overlap_len * 100 780 | if pident < min_pident: 781 | continue 782 | pseudoRPKM1 = 1000 * 1000000 * assemblers[sp1].stat_num['raw hits'] / total_reads[sp1] / length1 783 | pseudoRPKM2 = 1000 * 1000000 * assemblers[sp2].stat_num['raw hits'] / total_reads[sp2] / length2 784 | exp_fold = pseudoRPKM1 / pseudoRPKM2 785 | crossout.write("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n".format(seqid1, seqid2, overlap_len, pident, pseudoRPKM1, pseudoRPKM2, exp_fold, seqs[seqid1], seqs[seqid2])) 786 | if exp_fold > min_fold and pseudoRPKM1 > min_exp: 787 | seqs.pop(seqid2) 788 | if not assemblers[sp2].moltype.startswith('dna'): 789 | protseqs.pop(seqid2) 790 | elif exp_fold < 1 / min_fold and pseudoRPKM2 > min_exp: 791 | seqs.pop(seqid1) 792 | if not assemblers[sp1].moltype.startswith('dna'): 793 | protseqs.pop(seqid1) 794 | crossout.close() 795 | 796 | # output the final sequences 797 | seqout = open(os.path.join('nt_out', group_name + '.fa'), 'w') 798 | for seqid, seqstr in seqs.items(): 799 | seqout.write(">{}\n{}\n".format(seqid, seqstr)) 800 | seqout.close() 801 | if not moltype.startswith('dna'): 802 | seqout = open(os.path.join('aa_out', group_name + '.fa'), 'w') 803 | for seqid, seqstr in protseqs.items(): 804 | seqout.write(">{}\n{}\n".format(seqid, seqstr)) 805 | seqout.close() 806 | elif moltype == 'dna_codon': 807 | lib.trans_seq(os.path.join('nt_out', group_name + '.fa'), os.path.join('aa_out', group_name + '.fa'), gencode, dna_codon_unknow=protfill) 808 | return 0, group_name 809 | 810 | # cross contamination and final sequence output in multiple 
processes 811 | def cross_and_output_mp(groups, sps, assemblers, total_reads, np = 8, moltype='dna', gencode=1, no_assemble=False, no_cross_species=False, min_overlap=30, min_pident=98, min_exp=0.2, min_fold=2, unknow='unknow', final_seq='consensus', no_ref=False, sep='.', keep_seqid=False): 812 | print("\nRemoving cross contamination and printing the final sequences...") 813 | if not os.path.isdir('stat_info'): 814 | os.mkdir('stat_info') 815 | if not os.path.isdir('nt_out'): 816 | os.mkdir('nt_out') 817 | if moltype != 'dna' and not os.path.isdir('aa_out'): 818 | os.mkdir('aa_out') 819 | if not no_cross_species and len(sps) > 1 and no_assemble: 820 | if final_seq.startswith('consensus'): 821 | print("\nWarning: not to assemble and {}, cross decontamination will be conducted after consensus!".format(final_seq)) 822 | else: 823 | print("\nWarning: not to assemble and {}, cross decontamination will be disabled due to no read counts!".format(final_seq)) 824 | args_list = [] 825 | kwds = {'sps': sps, 'total_reads': total_reads, 'moltype': moltype, 'gencode': gencode, 'no_assemble': no_assemble, 'no_cross_species': no_cross_species, 'min_overlap': min_overlap, 'min_pident': min_pident, 'min_exp': min_exp, 'min_fold': min_fold, 'unknow': unknow, 'final_seq': final_seq, 'no_ref': no_ref, 'sep': sep, 'keep_seqid': keep_seqid} 826 | for group_name in groups: 827 | args_list.append((group_name, assemblers[group_name])) 828 | iferrors = lib.run_mp(cross_and_output, args_list, np, kwds=kwds) 829 | # if errors occur, print the message 830 | errors = [] 831 | for iferror in iferrors: 832 | if iferror[0] == 1: 833 | errors.append(iferror[1]) 834 | if errors: 835 | print("\nError: fail to output due to invalid reference codon alignments: {}".format(', '.join(errors))) 836 | sys.exit(1) 837 | 838 | # concatenate the short alignments if split the reference alignements 839 | def concatenate(dirname, alns, fill='-'): 840 | seqs = {} 841 | total_len = 0 842 | for aln in alns.keys(): 843 | aln_seqs = lib.read_fastx(os.path.join(dirname, aln + '.fa'), 'fasta') 844 | for seqid, seqstr in aln_seqs.items(): 845 | if seqid.endswith('.' + aln + '.1'): 846 | seqid = seqid.replace('.' + aln + '.1', '') 847 | if seqs.get(seqid) is None: 848 | seqs[seqid] = fill * total_len + seqstr 849 | else: 850 | seqs[seqid] += seqstr 851 | aln_len = len(seqstr) 852 | for seqid in seqs.keys(): 853 | if aln_seqs.get(seqid) is None and aln_seqs.get(seqid + '.' + aln + '.1') is None: 854 | seqs[seqid] += fill * aln_len 855 | total_len += aln_len 856 | outfile = open(os.path.join(dirname, 'aln.concatenated.fa'), 'w') 857 | for seqid, seqstr in seqs.items(): 858 | outfile.write(">{}\n{}\n".format(seqid, seqstr)) 859 | outfile.close() 860 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # PhyloAln 2 | ## PhyloAln: a convenient reference-based tool to align sequences and high-throughput reads for phylogeny and evolution in the omic era 3 | 4 | ![logo](https://github.com/huangyh45/PhyloAln/blob/main/logo.png) 5 | 6 | PhyloAln is a reference-based multiple sequence alignment tool for phylogeny and evolution. 
PhyloAln can directly map not only the raw reads but also assembled or translated sequences into the reference alignments, which is suitable for different omic data and skips the complex preparation in the traditional method including data assembly, gene prediction, orthology assignment and multiple sequence alignment, with a relatively high accuracy in the aligned sites and the downstream phylogeny. It is also able to detect and remove foreign and cross contamination in the generated alignments, which is not considered in other reference-based methods, and thus improve the quality of the alignments for downstream analyses. 7 | 8 | ### Catalogue 9 | - [Installation](#installation) 10 | - [1) Installation from source](#1-installation-from-source) 11 | - [2) Installation using Conda](#2-installation-using-conda) 12 | - [Usage](#usage) 13 | - [Quick start](#quick-start) 14 | - [A practice using PhyloAln for phylogenomics](#a-practice-using-phyloaln-for-phylogenomics) 15 | - [A practice using PhyloAln for gene family analysis](#a-practice-using-phyloaln-for-gene-family-analysis) 16 | - [Input](#input) 17 | - [Output](#output) 18 | - [Example commands for different data and common mode for easy use](#example-commands-for-different-data-and-common-mode-for-easy-use) 19 | - [Detailed parameters](#detailed-parameters) 20 | - [Limitations](#limitations) 21 | - [Auxiliary scripts for PhyloAln and phylogenetic analyses](#auxiliary-scripts-for-phyloaln-and-phylogenetic-analyses) 22 | - [Script to translate sequences: transseq.pl](#transseqpl) 23 | - [Script to back-translate sequences: revertransseq.pl](#revertransseqpl) 24 | - [Script to align the sequences: alignseq.pl](#alignseqpl) 25 | - [Script to concatenate the alignments: connect.pl](#connectpl) 26 | - [Script to combine each result alignment of different PhyloAln runs: merge_seqs.py](#merge_seqspy) 27 | - [Script to select the sequences in the files in bulk: select_seqs.py](#select_seqspy) 28 | - [Script to trim the alignments based on unknown sites in bulk: trim_matrix.py](#trim_matrixpy) 29 | - [Script to root the phylogenetic tree: root_tree.py](#root_treepy) 30 | - [Script to prune the phylogenetic tree: prune_tree.py](#prune_treepy) 31 | - [Script to assist in checking the unaligned sequences in the reference alignments in bulk: check_aln.py](#check_alnpy) 32 | - [Script to test performance of PhyloAln: test_effect.py](#test_effectpy) 33 | - [Citation](#citation) 34 | - [Acknowledgments](#acknowledgments) 35 | - [Questions & Answers](#questions--answers) 36 | - [Fail to locate Bio/SeqIO.pm in @INC when installation](#fail-to-locate-bioseqiopm-in-inc-when-installation) 37 | - [How can I obtain the reference alignments and the final tree?](#how-can-i-obtain-the-reference-alignments-and-the-final-tree) 38 | - [Does selection of the outgroup influence detection of foreign contamination? 
How can I choose an appropriate outgroup?](#does-selection-of-the-outgroup-influence-detection-of-foreign-contamination-how-can-i-choose-an-appropriate-outgroup) 39 | - [The required memory is too large to run PhyloAln.](#the-required-memory-is-too-large-to-run-PhyloAln) 40 | - [The positions of sites in the reference alignments are changed in the output alignments.](#the-positions-of-sites-in-the-reference-alignments-are-changed-in-the-output-alignments) 41 | - [How can I assemble the paired-end reads?](#how-can-i-assemble-the-paired-end-reads) 42 | - [Can PhyloAln generate the alignments of multiple-copy genes for gene family analyses?](#can-phyloaln-generate-the-alignments-of-multiple-copy-genes-for-gene-family-analyses) 43 | 44 | ### Installation 45 | 46 | #### 1) Installation from source 47 | ##### Requirements 48 | - python >=3.7.4 (https://www.python.org/downloads/) 49 | - biopython >=1.77 (https://biopython.org/wiki/Download) 50 | - hmmer >=3.1 (http://hmmer.org/download.html) 51 | - mafft >=7.467 (optional for the auxiliary scripts, https://mafft.cbrc.jp/alignment/software/source.html) 52 | - ete3 >=3.1.2 (optional for the auxiliary scripts, http://etetoolkit.org/download/) 53 | - perl >=5.26.2 (optional for the auxiliary scripts, https://www.perl.org/get.html) 54 | - perl-bioperl >=1.7.2 (optional for the auxiliary scripts, https://github.com/bioperl/bioperl-live/blob/master/README.md) 55 | - perl-parallel-forkmanager >=2.02 (optional for the auxiliary scripts, https://github.com/dluxhu/perl-parallel-forkmanager) 56 | 57 | After installing these requirements, you can download the latest release of PhyloAln directly from this page or by using the following commands on your computer: 58 | ``` 59 | git clone https://github.com/huangyh45/PhyloAln.git 60 | cd PhyloAln 61 | git checkout v1.1.0 # switch to the latest stable release version 62 | ``` 63 | If your computer requires execute permissions to run the programs, such as on a Linux or macOS system, you should first run this command: 64 | ``` 65 | chmod -R +x /your/PhyloAln/path/ # the absolute path of 'PhyloAln' directory in the above commands 66 | ``` 67 | Then, you can test whether PhyloAln is available using these commands: 68 | ``` 69 | cd /your/PhyloAln/path/ 70 | export PATH=$PATH:/your/PhyloAln/path/:/your/PhyloAln/path/scripts 71 | bash tests/run_test.sh 72 | ``` 73 | When you see "Successfully installed" at the end of the screen output, PhyloAln and all its auxiliary scripts have been successfully installed and are available. 74 | If the test fails, you should check whether the requirements have been successfully installed and are executable in the current environment. 
75 | After the test, you can manually delete all the newly generated files, or run this command to delete them: 76 | ``` 77 | rm -rf alignseq.log all.block all.fas list tests/run_test.config tests/PhyloAln_* tests/aln tests/ref/*.fas tests/ref/*.index 78 | ``` 79 | ##### Full usage experience 80 | If you have installed [IQ-TREE](http://www.iqtree.org/#download) and want to experience the usage of all the scripts with real examples through a simple phylogenomic flow after installation, you can run this command (it will take a few minutes): 81 | ``` 82 | bash tests/run_test.sh full 83 | ``` 84 | After running, you can manually delete all the newly generated files, or run this command to delete them: 85 | ``` 86 | rm -rf alignseq.log all.block all.fas list tests/run_test.config tests/PhyloAln_* tests/aln tests/ref/*.fas tests/ref/*.index 87 | ``` 88 | #### 2) Installation using Conda 89 | PhyloAln is available on [Bioconda](https://bioconda.github.io/recipes/phyloaln/README.html); run this command to install it: 90 | ``` 91 | conda install phyloaln 92 | ``` 93 | If the base environment of your Conda installation already contains large numbers of packages, Conda may have difficulty managing the packages when installing PhyloAln. In this case, you can install the requirements in a newly created Conda environment using this command: 94 | ``` 95 | conda install -m -n your_env phyloaln 96 | ``` 97 | and activate your environment before using PhyloAln: 98 | ``` 99 | conda activate your_env 100 | ``` 101 | If the installation takes too much time, you can try to install the requirements and all their dependencies with fixed (but not the latest) versions. Download the [Conda configure file of requirements with fixed versions](https://github.com/huangyh45/PhyloAln/releases/download/v0.1.0/requirement_fix.txt), and install these requirements using the command: 102 | ``` 103 | conda install (-m -n your_env) --file requirement_fix.txt 104 | ``` 105 | Then, you can install PhyloAln using the command: 106 | ``` 107 | conda install (-n your_env) phyloaln 108 | ``` 109 | 110 | ### Usage 111 | 112 | #### Quick start 113 | If you have only one reference alignment FASTA file and sequence data from only one source/species, you can use -a to input the reference alignment file, -s to input the species name and -i to input the FASTA/FASTQ sequence/read file(s), like this command: 114 | ``` 115 | PhyloAln -a reference_alignment_file -s species -i sequence_file1 (sequence_file2) -o output_directory 116 | ``` 117 | 118 | You can also use -c to input a configure file representing information of sequence data from multiple sources/species. The configure file should be tab-separated and look like this: 119 | ``` 120 | species1 /absolute/path/sequence_file1 121 | species2 /absolute/path/sequence_file1,/absolute/path/sequence_file2 122 | ``` 123 | If you have a directory containing multiple reference alignment FASTA files with the same suffix, you can use -d to input the directory and -x to input the suffix. The command using multiple reference alignments and multiple sources/species is like this: 124 | ``` 125 | PhyloAln -d reference_alignments_directory -c config.tsv -x alignment_file_name_suffix -o output_directory 126 | ``` 127 | **Note: we found a bug when using unzipped FASTA/FASTQ sequence/read file(s) and guessed file format in the versions ≤ 1.0.0, which is fixed in the versions ≥ 1.1.0. 
Please always input the file format (-f) instead of guessing when you run the versions ≤ 1.0.0 of PhyloAln!** 128 | 129 | #### A practice using PhyloAln for phylogenomics 130 | The following practice is for phylogenomics using codon alignments of nuclear single-copy orthologous groups and 20 CPUs. 131 | ##### 1. obtain the reference orthologous sequences 132 | You can download the reference sequences from an ortholog database (e.g., [OrthoDB](https://www.orthodb.org/), [OMA](https://omabrowser.org/oma/home/)), or perform *de novo* orthology assignment (e.g., by [OrthoFinder](https://github.com/davidemms/OrthoFinder)). The reference species are recommended to contain one or several outgroups for PhyloAln. 133 | ##### 2. codon alignment for each ortholog group 134 | In this step, you can use our auxiliary script [alignseq.pl](#alignseqpl). 135 | Run the shell commands: 136 | ``` 137 | mkdir aln 138 | for file in orthogroup/*.fa; do 139 | name=`basename $file` 140 | scripts/alignseq.pl -i $file -o aln/$name -a codon -n 20 141 | done 142 | ``` 143 | ##### 3. trim the alignments (optional) 144 | In this step, you can use the tool [trimAl](https://github.com/inab/trimal). 145 | Run the shell commands to trim the codon alignments generated in the above step: 146 | ``` 147 | mkdir ref_aln 148 | for file in aln/*.aa.fas; do 149 | name=`basename $file .aa.fas` 150 | trimal -in $file -out ref_aln/$name.fa -automated1 -keepheader -backtrans orthogroup/$name.fa 151 | done 152 | ``` 153 | However, sometimes you may prefer to trim the alignments directly without considering the codons, like these commands: 154 | ``` 155 | mkdir ref_aln 156 | for file in aln/*.fa; do 157 | name=`basename $file` 158 | trimal -in $file -out ref_aln/$name -automated1 -keepheader 159 | done 160 | ``` 161 | The reference alignments have now been generated. Alternatively, you can directly use existing alignments as the reference instead of the above three steps, for example, from published supplementary data. 162 | ##### 4. write the configure file of the species and data 163 | The configure file is tab-separated (TSV) and looks like this: 164 | ``` 165 | species1 /absolute/path/sequence_file1 166 | species2 /absolute/path/sequence_file1,/absolute/path/sequence_file2 167 | ``` 168 | ##### 5. run PhyloAln to map the sequences/reads into the reference alignments 169 | ``` 170 | PhyloAln -d ref_aln -c config.tsv -p 20 -m codon -u outgroup 171 | ``` 172 | The output alignments can be trimmed to remove the sites with too many unknown bases using our auxiliary script [trim_matrix.py](#trim_matrixpy), and further edited for gappy or conservative sites using the tool [trimAl](https://github.com/inab/trimal). 173 | ##### 6. concatenate the alignments into a supermatrix 174 | This step can be done with our auxiliary script [connect.pl](#connectpl). 175 | For a codon dataset, you can run it like this: 176 | ``` 177 | scripts/connect.pl -i PhyloAln_out/nt_out -f N -b all.block -n -c 123 178 | ``` 179 | For a protein dataset, the command is like this: 180 | ``` 181 | scripts/connect.pl -i PhyloAln_out/aa_out -f X -b all.block -n 182 | ``` 183 | ##### 7. reconstruct the phylogenetic tree 184 | You can build the tree with [IQ-TREE](http://www.iqtree.org/#download) like this: 185 | ``` 186 | iqtree -s all.fas -p all.block -m MFP+MERGE -B 1000 -T AUTO --threads-max 20 --prefix species_tree 187 | ``` 188 | ##### 8. 
root the tree 189 | You can root the tree with the outgroup using our auxiliary script [root_tree.py](#root_treepy): 190 | ``` 191 | scripts/root_tree.py species_tree.treefile species_tree.rooted.tre outgroup 192 | ``` 193 | Finally, you obtain a species tree in NEWICK format, which you can then visualize or use in other downstream analyses. 194 | 195 | #### A practice using PhyloAln for gene family analysis 196 | The following practice is for gene family analysis or marker sequence polishing using a codon alignment of insect COX1 genes as the reference, undirected COX1 marker sequences as targets, and 20 CPUs. The idea for this usage is provided by **Yi-Fei Sun**. 197 | The commands here will use the easy mode (different modes are suitable for different data in different gene family analyses; see [Example commands for different data and common mode for easy use](#example-commands-for-different-data-and-common-mode-for-easy-use)) provided in versions ≥ 1.1.0. 198 | ##### 1. obtain the reference alignment 199 | You can download or extract the COX1 reference sequences from the mitochondrial genomes in the NCBI RefSeq database or other places, and then conduct codon alignment. 200 | Our auxiliary script [alignseq.pl](#alignseqpl) can be used to conduct the alignment. 201 | Run the shell commands: 202 | ``` 203 | scripts/alignseq.pl -i COX1.fa -o COX1.aln.fa -a codon -g 5 -n 20 204 | ``` 205 | The start and end regions are recommended to be trimmed. 206 | ##### 2. run PhyloAln to map the target sequences into the reference alignment 207 | ``` 208 | PhyloAln -a COX1.aln.fa -s anything -i targets.fa -e gene_codon2dna -g 5 -p 20 209 | ``` 210 | One or several outgroups in the reference alignment can be set with '-u'. 211 | ##### 3. trim the result alignment 212 | The output alignment is recommended to be trimmed to remove the sites with too many gaps and the sequences with short or virtually no regions mapped to the reference, using our auxiliary script [trim_matrix.py](#trim_matrixpy) like this: 213 | ``` 214 | scripts/trim_matrix.py PhyloAln_out/nt_out trim_out - 0.5 0.6 215 | ``` 216 | ##### 4. check the trimmed alignment 217 | You can use our auxiliary script [check_aln.py](#check_alnpy) to assist in checking whether the sequences are well aligned in the result alignments, like this: 218 | ``` 219 | scripts/check_aln.py trim_out 220 | ``` 221 | Based on the warnings output by check_aln.py, you should manually check the unaligned sequences and edit the alignments. 222 | ##### 5. reconstruct the phylogenetic tree 223 | You can build the gene tree with [IQ-TREE](http://www.iqtree.org/#download) like this: 224 | ``` 225 | iqtree -s trim_out/aln.fa -B 1000 -T AUTO --threads-max 20 --prefix gene_tree 226 | ``` 227 | ##### 6. root the tree 228 | You can root the tree with the midpoint outgroup (default) or your provided outgroup using our auxiliary script [root_tree.py](#root_treepy): 229 | ``` 230 | scripts/root_tree.py gene_tree.treefile gene_tree.rooted.tre (your_provided_outgroup) 231 | ``` 232 | Finally, you obtain a gene tree in NEWICK format, which you can then visualize or use in other downstream analyses. 233 | 234 | #### Input 235 | PhyloAln needs two types of files: 236 | - the alignment file(s) with FASTA format. Trimmed alignments with conservative sites are recommended. Multiple alignment files with the same suffix should be placed into a directory for input. 237 | - the sequence/read file(s) with FASTA or FASTQ format. Compressed files ending with ".gz" are allowed. 
Sequence/read files from multiple sources/species should be input through a configure file as described in the quick start. 238 | 239 | #### Output 240 | PhyloAln generates new alignment file(s) with FASTA format. Each output alignment in the `nt_out` directory corresponds to a reference alignment file, with the aligned target sequences from the provided sequence/read file(s). If using the prot, codon or dna_codon mode, the translated protein alignments will also be generated in the `aa_out` directory. These alignments are mainly for phylogenetic analyses and evolutionary analyses using conservative sites. 241 | 242 | #### Example commands for different data and common mode for easy use 243 | Notice: the following commands are only recommendations based on our practice, and you can manually set the options as you need, without setting '-e' or '--mode', if you want to change the specific options listed below. 244 | 245 | Map the reads into the DNA alignments(-e dna2reads): 246 | ``` 247 | PhyloAln [options] -m dna 248 | ``` 249 | Map the reads into large numbers of DNA alignments(-e fast_dna2reads): 250 | ``` 251 | PhyloAln [options] -m dna -b 252 | ``` 253 | Map the transcript assembly/sequences into the DNA alignments(-e dna2trans): 254 | ``` 255 | PhyloAln [options] -m dna -b -r 256 | ``` 257 | Map the genomic assembly/sequences with intron regions into the DNA alignments(-e dna2genome): 258 | ``` 259 | PhyloAln [options] -m dna -b -r -l 200 -f large_fasta 260 | ``` 261 | Map the reads into the protein alignments(-e prot2reads): 262 | ``` 263 | PhyloAln [options] -m prot 264 | ``` 265 | Map the reads into large numbers of protein alignments(-e fast_prot2reads): 266 | ``` 267 | PhyloAln [options] -m prot -b 268 | ``` 269 | Map the transcript assembly/sequences into the protein alignments(-e prot2trans): 270 | ``` 271 | PhyloAln [options] -m prot -b -r 272 | ``` 273 | Map the genomic assembly/sequences with intron regions into the protein alignments(-e prot2genome): 274 | ``` 275 | PhyloAln [options] -m prot -b -r -l 200 -f large_fasta 276 | ``` 277 | Map the reads into the codon alignments(-e codon2reads): 278 | ``` 279 | PhyloAln [options] -m codon 280 | ``` 281 | Map the reads into large numbers of codon alignments(-e fast_codon2reads): 282 | ``` 283 | PhyloAln [options] -m codon -b 284 | ``` 285 | Map the transcript assembly/sequences into the codon alignments(-e codon2trans): 286 | ``` 287 | PhyloAln [options] -m codon -b -r 288 | ``` 289 | Map the genomic assembly/sequences with intron regions into the codon alignments(-e codon2genome): 290 | ``` 291 | PhyloAln [options] -m codon -b -r -l 200 -f large_fasta 292 | ``` 293 | Map the directed RNA/cDNA sequences into the RNA/cDNA alignments(-e rna2rna): 294 | ``` 295 | PhyloAln [options] -m dna -n -b -r 296 | ``` 297 | Map the protein sequences into the protein alignments(-e prot2prot): 298 | ``` 299 | PhyloAln [options] -m dna -n -b -r -w X 300 | ``` 301 | Map the CDS or the directed transcript/cDNA sequences into the codon alignments(-e codon2codon): 302 | ``` 303 | PhyloAln [options] -m codon -n -b -r 304 | ``` 305 | Map the DNA sequences into the DNA alignments for gene family analysis or to polish the marker sequences(-e gene_dna2dna): 306 | ``` 307 | PhyloAln [options] -m dna -b -r -z all -k -w - 308 | ``` 309 | Map the directed RNA/cDNA/protein sequences into the RNA/cDNA/protein alignments for gene family analysis or to polish the marker sequences(-e gene_rna2rna or -e gene_prot2prot): 310 | ``` 311 | PhyloAln [options] -m dna -n -b -r -z 
#### Detailed parameters
```
usage: PhyloAln [options] -a reference_alignment_file -s species -i fasta_file -f fasta -o output_directory
       PhyloAln [options] -d reference_alignments_directory -c config.tsv -f fastq -o output_directory

A program to directly generate multiple sequence alignments from FASTA/FASTQ files based on reference alignments for
phylogenetic analyses.
Citation: Huang Y-H, Sun Y-F, Li H, Li H-S, Pang H. 2024. MBE. 41(7):msae150. https://doi.org/10.1093/molbev/msae150

options:
  -h, --help            show this help message and exit
  -a ALN, --aln ALN     the single reference FASTA alignment file
  -d ALN_DIR, --aln_dir ALN_DIR
                        the directory containing all the reference FASTA alignment files
  -x ALN_SUFFIX, --aln_suffix ALN_SUFFIX
                        the suffix of the reference FASTA alignment files when using "-d"(default:.fa)
  -s SPECIES, --species SPECIES
                        the studied species ID for the provided FASTA/FASTQ files(-i)
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        the input FASTA/FASTQ file(s) of the single species(-s), compressed files ending with ".gz" are
                        allowed
  -c CONFIG, --config CONFIG
                        the TSV file with the format of 'species sequence_file(s)(absolute path, files separated by
                        commas)' per line for multiple species
  -f {guess,fastq,fasta,large_fasta}, --file_format {guess,fastq,fasta,large_fasta}
                        the file format of the provided FASTA/FASTQ files, 'large_fasta' is recommended for speeding up
                        reading the FASTA files with long sequences(e.g. genome sequences) and cannot be
                        guessed(default:guess)
  -o OUTPUT, --output OUTPUT
                        the output directory containing the results(default:PhyloAln_out)
  -p CPU, --cpu CPU     maximum threads to be totally used in parallel tasks(default:8)
  --parallel PARALLEL   number of parallel tasks for the alignments, the number of CPUs used for a single alignment will
                        be automatically calculated by '--cpu / --parallel'(default:the smaller value between the number
                        of alignments and the maximum threads to be used)
  -e {dna2reads,prot2reads,codon2reads,fast_dna2reads,fast_prot2reads,fast_codon2reads,dna2trans,prot2trans,codon2trans,
      dna2genome,prot2genome,codon2genome,rna2rna,prot2prot,codon2codon,gene_dna2dna,gene_rna2rna,gene_codon2codon,
      gene_codon2dna,gene_prot2prot}, --mode {dna2reads,prot2reads,codon2reads,fast_dna2reads,fast_prot2reads,
      fast_codon2reads,dna2trans,prot2trans,codon2trans,dna2genome,prot2genome,codon2genome,rna2rna,prot2prot,
      codon2codon,gene_dna2dna,gene_rna2rna,gene_codon2codon,gene_codon2dna,gene_prot2prot}
                        the common mode to automatically set the parameters for easy use(**NOTICE: if you manually set
                        those parameters, the parameters you set will be ignored and overridden! See
                        https://github.com/huangyh45/PhyloAln/blob/main/README.md#example-commands-for-different-data-
                        and-common-mode-for-easy-use for detailed parameters)
  -m {dna,prot,codon,dna_codon}, --mol_type {dna,prot,codon,dna_codon}
                        the molecular type of the reference alignments(default:dna, 'dna' suitable for nucleotide-to-
                        nucleotide or protein-to-protein alignment, 'prot' suitable for protein-to-nucleotide alignment,
                        'codon' and 'dna_codon' suitable for codon-to-nucleotide alignment based on protein and
                        nucleotide alignments respectively)
  -g GENCODE, --gencode GENCODE
                        the genetic code used in translation(default:1 = the standard code, see
                        https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
  --ref_split_len REF_SPLIT_LEN
                        If provided, split the reference alignments longer than this length into short alignments with
                        this length, ~1000 may be recommended for concatenated alignments, and codon alignments should
                        be divided by 3
  -l SPLIT_LEN, --split_len SPLIT_LEN
                        If provided, split the sequences longer than this length into short sequences with this length,
                        200 may be recommended for long genomic reads or sequences
  --split_slide SPLIT_SLIDE
                        the slide to split the sequences using the sliding window method(default:half of '--split_len')
  -n, --no_reverse      not to prepare and search the reverse strand of the sequences, recommended for searching protein
                        or CDS sequences
  --low_mem             use a low-memory but slower mode to prepare the reads, 'large_fasta' format is not supported and
                        gz compressed files may still consume some memory
  --hmmbuild_parameters HMMBUILD_PARAMETERS [HMMBUILD_PARAMETERS ...]
                        the parameters when using HMMER hmmbuild for reference preparation, with the format of ' --xxx'
                        for each parameter, in which the leading space is required(default:[])
  --hmmsearch_parameters HMMSEARCH_PARAMETERS [HMMSEARCH_PARAMETERS ...]
                        the parameters when using HMMER hmmsearch for mapping the sequences, with the format of ' --xxx'
                        for each parameter, in which the leading space is required(default:[])
  -b, --no_assemble     not to assemble the raw sequences based on overlap regions
  --overlap_len OVERLAP_LEN
                        minimum overlap length when assembling the raw sequences(default:30)
  --overlap_pident OVERLAP_PIDENT
                        minimum overlap percent identity when assembling the raw sequences(default:98.00)
  -t, --no_out_filter   not to filter the foreign or no-signal sequences based on conservative score
  -u OUTGROUP [OUTGROUP ...], --outgroup OUTGROUP [OUTGROUP ...]
                        the outgroup species for foreign or no-signal sequence detection(default:all the sequences in
                        the alignments, with all sequences as ingroups)
  --ingroup INGROUP [INGROUP ...]
                        the ingroup species for score calculation in foreign or no-signal sequence detection(default:all
                        the sequences when all sequences are set as outgroups; otherwise all sequences except the
                        outgroups)
  -q SEP, --sep SEP     the separate symbol between species name and gene identifier in the sequence headers of the
                        alignments(default:.)
  --outgroup_weight OUTGROUP_WEIGHT
                        the weight coefficient to adjust the strictness of the foreign or no-signal sequence filter, a
                        small number or decimal means a relaxed criterion(default:0.90, 1 = not adjusted)
  -r, --no_cross_species
                        not to remove the cross contamination for multiple species
  --cross_overlap_len CROSS_OVERLAP_LEN
                        minimum overlap length in cross contamination detection(default:30)
  --cross_overlap_pident CROSS_OVERLAP_PIDENT
                        minimum overlap percent identity in cross contamination detection(default:98.00)
  --min_exp MIN_EXP     minimum expression value in cross contamination detection(default:0.20)
  --min_exp_fold MIN_EXP_FOLD
                        minimum expression fold in cross contamination detection(default:5.00)
  -w UNKNOW_SYMBOL, --unknow_symbol UNKNOW_SYMBOL
                        the symbol representing unknown bases for missing regions(default:unknow = 'N' in nucleotide
                        alignments and 'X' in protein alignments)
  -z {consensus,consensus_strict,all,expression,length}, --final_seq {consensus,consensus_strict,all,expression,length}
                        the mode to output the sequences(default:consensus, 'consensus' means selecting the most common
                        bases from all sequences, 'consensus_strict' means only selecting the common bases and leaving
                        the differing bases unknown, 'all' means retaining all sequences, 'expression' means the sequence
                        with the highest read counts after assembly, 'length' means the sequence with the longest length)
  -y, --no_ref          not to output the reference sequences
  -k, --keep_seqid      keep the original sequence IDs in the output alignments instead of renaming them based on the
                        species ID, not recommended when the output mode is 'consensus'/'consensus_strict' or the
                        assembly step is on
  -v, --version         show program's version number and exit

Written by Yu-Hao Huang (2023-2025) huangyh45@mail3.sysu.edu.cn
```

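As a hedged illustration of the foreign-sequence filter options above, the command below declares one outgroup and two ingroup species and slightly relaxes the filter; all species IDs and paths are placeholders, and the IDs must match the species names used in the reference alignment headers (before the separator set by `-q`).
```
PhyloAln -d reference_alignments_directory -c config.tsv -o PhyloAln_out -m codon \
  -u OUTGROUP_SP --ingroup INGROUP_SP1 INGROUP_SP2 --outgroup_weight 0.8
```
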
#### Limitations
- PhyloAln is only designed for phylogenetic analyses and evolutionary analyses based on reference-based conservative sites, and thus cannot perform *de novo* assembly of non-conservative sites or of sites not covered in the reference alignments. The unmapped sites will be ignored.
- We prioritized the flexibility of PhyloAln and thus do not provide the upstream steps of collecting the reference sequences and generating the reference alignments, nor the downstream phylogenetic analyses. However, you can use the auxiliary scripts to help with the preparation and to perform downstream analyses.
- In the current version, we did not heavily focus on optimizing the runtime and memory usage of PhyloAln. In particular, speed and memory usage may be influenced by the number of reference alignments and the number of target sequences/reads. A version with optimized parallel and storage operations, and optional accessories written in C or other fast languages, may be developed in the future. Faster sequence search tools are also candidates to be integrated into PhyloAln as an option to speed up the alignments.

### Auxiliary scripts for PhyloAln and phylogenetic analyses

#### transseq.pl
Requirements:
- perl >=5.26.2
- perl-bioperl >=1.7.2
- perl-parallel-forkmanager >=2.02

```
perl scripts/transseq.pl
Translate nucleotide sequences in a file to amino acid sequences.

Usage:
-i input nucleotide sequences file
-o output amino acid sequences file
-g genetic code(default=1, invertebrate mitochondrion=5)
-t symbol of termination(default='*')
-c whether to translate incomplete codons into 'X'(default=no)
-a whether to translate all six ORFs(default=no)
-n num threads(default=1)
-l log file(default='transseq.log')
-h this help message

Example:
transseq.pl -i ntfile -o aafile -g gencode -t termination -c 1 -a 1 -n numthreads -l logfile

Written by Yu-Hao Huang (2017-2024) huangyh45@mail3.sysu.edu.cn
```

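For instance, a hypothetical run translating a CDS file with the invertebrate mitochondrial code, translating incomplete codons into 'X' and using four threads (the file names are placeholders), could look like:
```
perl scripts/transseq.pl -i cds.fa -o cds.aa.fa -g 5 -c 1 -n 4
```
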
#### revertransseq.pl
Requirements:
- perl >=5.26.2
- perl-bioperl >=1.7.2
- perl-parallel-forkmanager >=2.02

```
perl scripts/revertransseq.pl
Use the aligned translated sequences in a file as a blueprint to align nucleotide sequences, i.e. reverse translation.

Usage:
-i input nucleotide sequences file or files(separated by ',')
-b aligned amino acid sequences file translated from the input file(s), used as the blueprint
-o output aligned nucleotide sequences file
-g genetic code(default=1, invertebrate mitochondrion=5)
-t symbol of termination in blueprint(default='*')
-n num threads(default=1)
-l log file(default='revertransseq.log')
-h this help message

Example:
revertransseq.pl -i ntfile1,ntfile2,ntfile3 -b aafile -o alignedfile -g gencode -t termination -n numthreads -l logfile

Written by Yu-Hao Huang (2017-2024) huangyh45@mail3.sysu.edu.cn
```

#### alignseq.pl
Requirements:
- perl >=5.26.2
- mafft >=7.467
- transseq.pl
- revertransseq.pl

```
perl scripts/alignseq.pl
Align sequences in a file by mafft.
Requirement: mafft

Usage:
-i input sequences file
-o output sequences file
-a type of alignment(direct/translate/codon/complement(experimental)/ncRNA(experimental), default='direct', 'translate' means alignment of the translation of the sequences)
-g genetic code(default=1, invertebrate mitochondrion=5)
-t symbol of termination(default='X', mafft will clean '*')
-c whether to translate incomplete codons into 'X'(default=no)
-m whether to delete the intermediate files, such as translated files and aligned aa files(default=no)
-f the folder where mafft/linsi is located; if mafft/linsi is in PATH you can ignore this parameter
-n num threads(default=1)
-l log file(default='alignseq.log')
-h this help message

Example:
alignseq.pl -i inputfile -o outputfile -a aligntype -g gencode -t termination -c 1 -m 1 -f mafftfolder -n numthreads -l logfile

Written by Yu-Hao Huang (2017-2024) huangyh45@mail3.sysu.edu.cn
```

#### connect.pl
Requirements:
- perl >=5.26.2
- perl-bioperl >=1.7.2

```
perl scripts/connect.pl
Concatenate multiple alignments into a matrix.

Usage:
-i directory containing input FASTA alignment files
-o output concatenated FASTA alignment file
-t type of input format(phyloaln/orthograph/blastsearch, default='phyloaln', also suitable for the format with the same species name in all alignments, but the name should not contain the separate symbol)
-f the symbol to fill the sites of absent species in the alignments(default='-')
-s the symbol to separate the sequence name, in which the first part is the species name, in the 'phyloaln' format(default='.')
-x the suffix of the input FASTA alignment files(default='.fa')
-b the block file with the positions of each alignment(default=not to output)
-n output the block file in NEXUS format, suitable for IQ-TREE(default=no)
-c the codon positions to be written in the block file(default=no codon position, '123' represents outputting all three codon positions, '12' represents outputting the first and second positions)
-l the list file with all the species you want to be included in the output alignments, one species per line(default=automatically generated, with all species found at least once in the alignments)
-h this help message

Example:
connect.pl -i inputdir -o outputfile -t inputtype -f fillsymbol -s separate -x suffix -b blockfile -n -c codonpos -l listfile

Written by Yu-Hao Huang (2018-2024) huangyh45@mail3.sysu.edu.cn
```

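For example, a hypothetical command concatenating the nucleotide alignments produced by PhyloAln (assumed here to be in the `nt_out` directory with the default '.fa' suffix) into a supermatrix, together with a NEXUS block file of the gene and codon positions for IQ-TREE, could be:
```
perl scripts/connect.pl -i nt_out -o concatenation.fa -x .fa -b partitions.nex -n -c 123
```
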
#### merge_seqs.py
Requirements:
- python >=3.7.4

The script can be used to merge the output alignments in different PhyloAln output directories based on the same reference alignments, for example, for data from different batches.
Usage:
```
scripts/merge_seqs.py output_dir PhyloAln_dir1 PhyloAln_dir2 (PhyloAln_dir3 ...)
```

#### select_seqs.py
Requirements:
- python >=3.7.4

The script can be used to select or exclude a list of species (the first field of the sequence name, split by separate_symbol) or sequences (the full sequence name) from the sequence FASTA files with the same suffix in a directory, and output the processed sequence files to a new directory.
Usage:
```
scripts/select_seqs.py input_dir selected_species_or_sequences(separated by comma) output_dir fasta_suffix(default='.fa') separate_symbol(default='.') if_list_for_exclusion(default=no)
```

#### trim_matrix.py
Requirements:
- python >=3.7.4

The script can be used to trim first the columns (sites) and/or then the rows (sequences) of the sequence matrices in FASTA files with the same suffix in a directory, based on the unknown sites, and output the processed sequence files to a new directory.
Usage:
```
scripts/trim_matrix.py input_dir output_dir unknown_symbol(default='X') known_number(>=1)_or_percent(<1)_for_columns(default=0.5) known_number(>=1)_or_percent(<1)_for_rows(default=0) fasta_suffix(default='.fa')
```

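For example, a hypothetical command keeping only the columns with known bases in at least half of the sequences, and then only the rows (sequences) with at least 100 known sites, in nucleotide alignments whose unknown symbol is 'N', could be:
```
scripts/trim_matrix.py nt_out nt_trimmed N 0.5 100 .fa
```
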
#### root_tree.py
Requirements:
- python >=3.7.4
- ete3 >=3.1.2

The script can be used to root a tree in NEWICK format with the ETE 3 package and output the rooted NEWICK tree file.
Usage:
```
scripts/root_tree.py input.nwk output.nwk outgroup/outgroups(default=the midpoint outgroup, separated by comma)
```

#### prune_tree.py
Requirements:
- python >=3.7.4
- ete3 >=3.1.2

The script can be used to prune a tree in NEWICK format with the ETE 3 package and output the pruned NEWICK tree file.
Usage:
```
scripts/prune_tree.py input.nwk output.nwk seq/seqs(separated by comma)_in_clade1_for_deletion (seq/seqs_in_clade2_for_deletion ...)
```

#### check_aln.py
Requirements:
- python >=3.7.4

The script can be used to assist in checking and finding the unaligned sequences in the reference alignments in FASTA files with the same suffix in a directory, and optionally to exclude the unaligned sequences and output the processed alignment files to a new directory (not recommended; manually checking the flagged alignments and curating them is better).
Usage:
```
scripts/check_aln.py input_dir output_dir(default='none') aver_freq_per_site(default=0.75) gap_symbol(default='-') start_end_no_gap_number(>=1)_or_percent(<1)(default=0.6) fasta_suffix(default='.fa')
```

#### test_effect.py
Requirements:
- python >=3.7.4

The script can be used to calculate the completeness and percent identity of the alignments in FASTA files with the same suffix in a directory compared with the reference alignments in another directory, mainly for testing the effect of reference-based alignment tools such as PhyloAln.
Usage:
```
scripts/test_effect.py reference_dir:ref_species_or_seq_name target_dir:target_species_or_seq_name output_tsv unknown_symbol(default='N') separate(default='.') fasta_suffix(default='.fa') selected_species_or_sequences(separated by comma)
```

### Citation
Huang Y-H, Sun Y-F, Li H, Li H-S, Pang H. 2024. PhyloAln: A Convenient Reference-Based Tool to Align Sequences and High-Throughput Reads for Phylogeny and Evolution in the Omic Era. Molecular Biology and Evolution 41(7):msae150. https://doi.org/10.1093/molbev/msae150

### Acknowledgments
We would like to thank these people for their help in improving PhyloAln:
- **Zong-Jin Jiang:** testing of the installation and suggestions on environment configuration
- **Yuan-Sen Liang:** testing of the installation
- **Xin-Hui Xia:** testing of the installation

### Questions & Answers
#### Fail to locate Bio/SeqIO.pm in @INC during installation
This is because the Perl modules, especially BioPerl, have not been successfully installed or are not in the Perl library path, which sometimes occurs when they are configured by Conda. You should set or add the Perl library path to solve the problem; for example, try this command if you use Conda to install the requirements:
```
export PERL5LIB=/your/Conda/path/lib/perl5/site_perl
```
If you install the requirements in a newly created Conda environment, you can try this command:
```
export PERL5LIB=/your/Conda/path/envs/your_env/lib/perl5/site_perl
```
You can also add the Perl library path to the Conda config to avoid setting it each time you run, using the command:
```
conda env config vars set PERL5LIB=/your/Conda/path/lib/perl5/site_perl
```
or
```
conda env config vars set -n your_env PERL5LIB=/your/Conda/path/envs/your_env/lib/perl5/site_perl
```
In addition, you can try mamba or other tools to install the requirements.
#### How can I obtain the reference alignments and the final tree?
We do not provide the upstream preparation of the reference alignments or the downstream phylogenetic analyses in PhyloAln. You can manually collect the reference sequences, align them to generate the reference alignments, and build the tree; these steps are flexible and up to you. The reference alignments are recommended to contain outgroup(s) for foreign decontamination in PhyloAln and for rooting the tree. A detailed practice of phylogenomics using nuclear single-copy protein-coding genes can be seen here ([A practice using PhyloAln for phylogenomics](#a-practice-using-phyloaln-for-phylogenomics)). For other types of data, such as non-protein-coding genes or genes with non-standard genetic codes, you can collect the reference sequences from [NCBI](https://www.ncbi.nlm.nih.gov/) or other sources, and additionally adjust the options of alignseq.pl and PhyloAln.
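For example, a hypothetical alignseq.pl command building a codon-based reference alignment from the unaligned coding sequences of one orthologous group (the file names are placeholders) could be:
```
perl scripts/alignseq.pl -i OG0000001.cds.fa -o OG0000001.aln.fa -a codon -g 1 -m 1 -n 4
```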
#### Does selection of the outgroup influence detection of foreign contamination? How can I choose an appropriate outgroup?
Actually, within a specific reference alignment, selection of the outgroup has minimal impact on the results in our tests (see [our article](https://doi.org/10.1093/molbev/msae150) for detail). Therefore, if you are not sure which species should be the outgroup, you can tentatively leave the outgroup undefined, and PhyloAln will by default use the first sequence in each reference alignment as the outgroup (versions ≤ 1.0.0), or all the sequences in the alignments as the outgroups with all sequences also treated as ingroups (versions ≥ 1.1.0).
But when preparing the reference alignments, it should be noted that the evolutionary distance between the ingroups and the defined outgroup may affect the detection of foreign contamination based on the conservative score. Contamination from species phylogenetically close to the reference species is relatively hard to distinguish from the clean ingroup sequences, compared with contamination from species distinct from all the reference species, such as symbiotic bacteria of a target eukaryotic species. If the defined outgroup species is too divergent from the ingroups, a large amount of foreign contamination, especially that from species closer to the ingroups than the defined outgroup species, may not be detected and removed.
Consequently, it is better if the users have a priori knowledge when choosing the defined outgroup while constructing or obtaining the reference alignments. In most cases, the defined outgroup in PhyloAln is recommended to be from a close or sister group of the monophyletic ingroup. If several outgroup species are used for the phylogenetic reconstruction, you can input all these outgroups or only the closest outgroup to PhyloAln (versions ≥ 1.1.0). Furthermore, you can set the ingroups in versions ≥ 1.1.0. In addition, the sensitivity of detection can be manually adjusted by setting a weight coefficient, which defaults to 0.9 (see `--outgroup_weight` in [parameters](#detailed-parameters) for detail).
#### The required memory is too large to run PhyloAln.
By default, the step to prepare the sequences/reads runs in parallel and is thus memory-consuming, especially when the data are large. You can try adding the option `--low_mem` to use a low-memory but slower mode to prepare the sequences/reads. In addition, decompression of the ".gz"-ended files will consume some memory; you can also try decompressing the files manually and then running PhyloAln.
#### The positions of sites in the reference alignments are changed in the output alignments.
During the HMMER3 search, some non-conservative sites are deleted (e.g., gappy sites) or sometimes realigned. This has little impact on the downstream phylogenetic or evolutionary analyses. If you want to keep the reference alignments unchanged or need a special HMMER3 search, you can try utilizing the options `--hmmbuild_parameters` and `--hmmsearch_parameters` to control the parameters of HMMER3. For example, you can try adding the option `--hmmbuild_parameters ' --symfrac' '0'` to retain the gappy sites. It should be noted that HMMER3 parameters starting with '-' or '--' can only be parsed by adding a space before them inside a pair of quotation marks.
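For clarity, on the command line the quoting described above looks like this (the leading space inside the first quoted argument is required so that the HMMER option is not parsed as a PhyloAln option):
```
PhyloAln [options] --hmmbuild_parameters ' --symfrac' '0'
```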
#### How can I assemble the paired-end reads?
PhyloAln does not have a method to specifically assemble the paired-end reads. It only maps all the sequences/reads into the alignments and builds a consensus in the assembly and/or output steps. You can input both paired-end read files for a single source/species (see `-i` and `-c` in [parameters](#detailed-parameters) for detail). Furthermore, if you care about the effect of assembly using paired-end reads, you can first merge them with other tools (e.g., [fastp](https://github.com/OpenGene/fastp)) and then input the merged read files, with or without the unpaired read files, into PhyloAln.
#### Can PhyloAln generate the alignments of multiple-copy genes for gene family analyses?
Actually, we have designed options for this purpose. You can try it like this:
```
PhyloAln -d reference_alignments_directory -c config.tsv -x alignment_file_name_suffix -o output_directory -p 20 -m codon -u outgroup -z all --overlap_len overlap_len --overlap_pident overlap_pident
```
`-z all` represents outputting all the assembled sequences instead of a consensus of them. You can also adjust `--overlap_len` and `--overlap_pident` to find the best assembly for the genes.
If the target sequences you provide contain complete gene sequences instead of reads or genomic sequences with introns, you can see [A practice using PhyloAln for gene family analysis](#a-practice-using-phyloaln-for-gene-family-analysis) and [Example commands for different data and common mode for easy use](#example-commands-for-different-data-and-common-mode-for-easy-use), and find the gene_xxx2xxx modes to help you.
--------------------------------------------------------------------------------