├── README.md
├── docker
│   ├── Dockerfile
│   ├── scripts
│   │   ├── run_dpam.py
│   │   ├── run_step1.py
│   │   ├── run_step10.py
│   │   ├── run_step11.py
│   │   ├── run_step12.py
│   │   ├── run_step13.py
│   │   ├── run_step14.py
│   │   ├── run_step15.py
│   │   ├── run_step16.py
│   │   ├── run_step17.py
│   │   ├── run_step18.py
│   │   ├── run_step19.py
│   │   ├── run_step2.py
│   │   ├── run_step21.py
│   │   ├── run_step23.py
│   │   ├── run_step3.py
│   │   ├── run_step4.py
│   │   ├── run_step5.py
│   │   ├── run_step6.py
│   │   ├── run_step7.py
│   │   ├── run_step8.py
│   │   ├── run_step9.py
│   │   ├── step10_get_support.py
│   │   ├── step11_get_good_domains.py
│   │   ├── step12_get_sse.py
│   │   ├── step13_get_diso.py
│   │   ├── step14_parse_domains.py
│   │   ├── step15_prepare_domass.py
│   │   ├── step16_run_domass.py
│   │   ├── step17_get_confident.py
│   │   ├── step18_get_mapping.py
│   │   ├── step19_get_merge_candidates.py
│   │   ├── step1_get_AFDB_seqs.py
│   │   ├── step20_extract_domains.py
│   │   ├── step21_compare_domains.py
│   │   ├── step22_merge_domains.py
│   │   ├── step23_get_predictions.py
│   │   ├── step24_integrate_results.py
│   │   ├── step25_generate_pdbs.py
│   │   ├── step2_get_AFDB_pdbs.py
│   │   ├── step3_run_hhsearch.py
│   │   ├── step4_run_foldseek.py
│   │   ├── step5_process_hhsearch.py
│   │   ├── step6_process_foldseek.py
│   │   ├── step7_prepare_dali.py
│   │   ├── step8_iterative_dali.py
│   │   ├── step9_analyze_dali.py
│   │   └── summarize_check.py
│   └── utilities
│       ├── DaliLite.v5.tar.gz
│       ├── HHPaths.pm
│       ├── foldseek
│       └── pdb2fasta
├── example
│   ├── test
│   │   ├── O05011.json
│   │   ├── O05011.pdb
│   │   ├── O05012.cif
│   │   ├── O05012.json
│   │   ├── O05023.cif
│   │   └── O05023.json
│   └── test_struc.list
├── run_dpam_docker.py
├── run_dpam_singularity.py
└── v1.0
    ├── A0A0K2WPR7.zip
    ├── DPAM.py
    ├── LICENSE
    ├── README.md
    ├── check_dependencies.py
    ├── download_all_data.sh
    ├── mkdssp
    ├── model_organisms
    │   ├── Caenorhabditis_elegans.tgz
    │   ├── Danio_rerio.tgz
    │   ├── Drosophila_melanogaster.tgz
    │   ├── Homo_Sapiens.tgz
    │   ├── Mus_musculus.tgz
    │   └── Pan_paniscus.tgz
    ├── pdb2fasta
    ├── step10_get_good_domains.py
    ├── step11_get_sse.py
    ├── step12_get_diso.py
    ├── step13_parse_domains.py
    ├── step1_get_AFDB_pdbs.py
    ├── step1_get_AFDB_seqs.py
    ├── step2_run_hhsearch.py
    ├── step3_run_foldseek.py
    ├── step4_filter_foldseek.py
    ├── step5_map_to_ecod.py
    ├── step6_get_dali_candidates.py
    ├── step7_iterative_dali_aug_multi.py
    ├── step8_analyze_dali.py
    └── step9_get_support.py
/README.md:
--------------------------------------------------------------------------------
1 | # DPAM
2 | A **D**omain **P**arser for **A**lphaFold **M**odels
3 |
4 | DPAM: A Domain Parser for AlphaFold Models (https://onlinelibrary.wiley.com/doi/full/10.1002/pro.4548)
5 |
6 | ## Updates:
7 | A Docker image for DPAM v2.0 can be downloaded with **docker pull conglab/dpam:latest**; the previous version (v1.0) has been moved to the v1.0 directory (2023-12-10). The new version adds domain classification based on the ECOD database and addresses over-segmentation for some proteins. **Warning**: the current Docker image only works on AMD/Intel x86 CPUs, not Apple M-series chips. We are updating it for compatibility. Stay tuned!
8 | Uploaded domain parser results for six model organisms. (2022-12-6)
9 |
10 | Replaced Dali with Foldseek for initial hit searching. (2022-11-30)
11 |
12 | Fixed a bug in analyze_PDB.py that prevented proper usage of Dali results. (2022-10-31)
13 | ## Prerequisites (required):
14 | Docker/Singularity
15 |
16 | Python3
17 |
18 | [Databases and supporting files](https://conglab.swmed.edu/DPAM/databases.tar.gz)
19 |
20 | ### Supporting databases for DPAM:
21 |
22 | The databases necessary for DPAM, along with all supporting files, are available for download from our lab server at [https://conglab.swmed.edu/DPAM/](https://conglab.swmed.edu/DPAM/). The compressed file is around 89GB and expands to about **400GB** when uncompressed, so DPAM is best run on a computing cluster or workstation; the storage requirements may exceed the capacity of typical personal computers. Please make sure you have sufficient hard drive space before decompressing. Because of their size, downloading the databases may take several hours to a few days, depending on your internet connection speed.
23 |
24 | After downloading databases.tar.gz, please decompress the file. The resulting directory (`[download_path]/databases`) must then be provided to `run_dpam_docker.py` as `--databases_dir`.
25 |
26 | ## Installation
27 | For Docker:
28 |
29 | docker pull conglab/dpam:latest
30 | git clone https://github.com/CongLabCode/DPAM
31 | cd ./DPAM
32 | wget https://conglab.swmed.edu/DPAM/databases.tar.gz
33 | tar -xzf databases.tar.gz
34 |
35 | For Singularity:
36 |
37 | git clone https://github.com/CongLabCode/DPAM
38 | cd ./DPAM
39 | wget https://conglab.swmed.edu/DPAM/databases.tar.gz
40 | tar -xzf databases.tar.gz
41 | singularity pull dpam.sif docker://conglab/dpam
42 |
43 |
44 |
45 |
46 | ### Quick test
47 | For Docker:
48 |
49 | python run_dpam_docker.py --dataset test --input_dir example --databases_dir databases --threads 32
50 |
51 | For Singularity:
52 |
53 | python ./run_dpam_singularity.py --databases_dir databases --input_dir example --dataset test --threads 32 --image_name dpam.sif
54 |
55 | ## Usage
56 |
57 | python run_dpam_docker.py [-h] --databases_dir DATABASES_DIR --input_dir INPUT_DIR --dataset DATASET
58 |                           [--image_name IMAGE_NAME] [--threads THREADS]
59 |                           [--log_file LOG_FILE]
60 |
61 | ### Arguments
62 |
63 | - `-h`, `--help`
64 | Show this help message and exit. Use this argument if you need information about different command options.
65 |
66 | - `--databases_dir DATABASES_DIR`
66 | **(Required)** Specify the path to the databases directory (downloaded and uncompressed as described above) that will be mounted into the Docker container. Please make sure you have downloaded the databases before running.
68 |
69 | - `--input_dir INPUT_DIR`
70 | **(Required)** Specify the path to the input directory that needs to be mounted.
71 |
72 | - `--dataset DATASET`
73 | **(Required)** Provide the name of the dataset for domain segmentation and classification.
74 |
75 | - `--image_name IMAGE_NAME`
76 | Specify the Docker image name. If not provided, a default image name will be used.
77 |
78 | - `--threads THREADS`
79 | Define the number of threads to be used. By default, the script is configured to utilize all available CPUs.
80 |
81 | - `--log_file LOG_FILE`
82 | Specify a file where the logs should be saved. If not provided, logs will be displayed in the standard output.
83 |
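For example, a run that writes its logs to a file and names the image explicitly could look like this (an illustrative sketch; `dpam_test.log` is a placeholder file name):

    python run_dpam_docker.py --databases_dir databases --input_dir example --dataset test --image_name conglab/dpam:latest --threads 16 --log_file dpam_test.log
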
84 | ### Input organization
85 |
86 | Before running the wrapper, the `INPUT_DIR` needs to be in the following structure:
87 |
88 | <INPUT_DIR>/
89 |     <dataset1>/
90 |     <dataset1>_struc.list
91 |     <dataset2>/
92 |     <dataset2>_struc.list
93 |     ...
94 |
95 |
96 | The `<dataset1>/` and `<dataset2>/` directories contain the PDB/mmCIF files and the JSON files with PAE, and `dataset1_struc.list` and `dataset2_struc.list` list the targets (the shared prefix of the PDB/mmCIF and JSON files), one target per line. `<dataset>` can be any name, but the `_struc.list` suffix has to be maintained.
97 |
98 | In the example test in **Quick test** above,
99 |
100 | `example/` is the `<INPUT_DIR>` and `test` under `example/` is the `<dataset>`.
101 |
102 | **example command**:
103 |
104 | `python run_dpam_docker.py --dataset test --input_dir example --databases_dir databases --threads 32`
105 |
106 | `databases` is the directory obtained by uncompressing databases.tar.gz from our lab server.
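
To prepare your own dataset following the same layout as the bundled `example`/`test` data, the steps might look like this (a minimal sketch; `my_input`, `mydata`, and `P12345` are placeholder names, and the structure file and PAE JSON of each target must share the prefix listed in the `_struc.list` file):

    mkdir -p my_input/mydata
    # one predicted structure (PDB or mmCIF) and one PAE JSON per target, sharing the same prefix
    cp AF-P12345-F1-model_v4.pdb my_input/mydata/P12345.pdb
    cp AF-P12345-F1-predicted_aligned_error_v4.json my_input/mydata/P12345.json
    # one target prefix per line
    echo "P12345" > my_input/mydata_struc.list
    python run_dpam_docker.py --dataset mydata --input_dir my_input --databases_dir databases --threads 8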
107 |
108 | ### Output
109 | The pipeline will generate log files for each step for debugging.
110 |
111 | The final output is `<dataset>_domains` under `<INPUT_DIR>`.
112 |
113 | For the example, it should be `test_domains` under `example/`.
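
Once the run completes, the final result for the quick test can be checked directly (a minimal sanity check; the path follows the naming convention above):

    ls -l example/test_domains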
114 |
--------------------------------------------------------------------------------
/docker/Dockerfile:
--------------------------------------------------------------------------------
1 | ARG CUDA=11.1.1
2 | FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04
3 | ARG CUDA
4 |
5 |
6 | SHELL ["/bin/bash", "-o", "pipefail", "-c"]
7 |
8 | RUN apt-get update \
9 | && DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
10 | build-essential \
11 | cmake \
12 | cuda-command-line-tools-$(cut -f1,2 -d- <<< ${CUDA//./-}) \
13 | git \
14 | tzdata \
15 | wget \
16 | dialog \
17 | gfortran \
18 | && rm -rf /var/lib/apt/lists/* \
19 | && apt-get autoremove -y \
20 | && apt-get clean
21 |
22 | RUN wget -q -P /tmp \
23 | https://repo.anaconda.com/miniconda/Miniconda3-py37_23.1.0-1-Linux-x86_64.sh \
24 | && bash /tmp/Miniconda3-py37_23.1.0-1-Linux-x86_64.sh -b -p /opt/conda \
25 | && rm /tmp/Miniconda3-py37_23.1.0-1-Linux-x86_64.sh
26 |
27 | ENV PATH="/opt/conda/bin:$PATH"
28 | RUN conda install -y -c bioconda blast-legacy
29 | RUN conda install -y -c biocore psipred
30 | RUN conda install -y -c salilab dssp
31 |
32 |
33 | RUN pip install numpy
34 | RUN pip install tensorflow==1.14
35 | RUN pip install protobuf==3.20.*
36 |
37 |
38 | COPY utilities/DaliLite.v5.tar.gz /opt
39 | RUN cd /opt \
40 | && tar -zxvf DaliLite.v5.tar.gz \
41 | && cd /opt/DaliLite.v5/bin \
42 | && make clean \
43 | && make \
44 | && ln -s /opt/DaliLite.v5/bin/dali.pl /usr/bin \
45 | && rm /opt/DaliLite.v5.tar.gz
46 |
47 | RUN git clone https://github.com/soedinglab/pdbx.git /opt/pdbx \
48 | && mkdir /opt/pdbx/build \
49 | && pushd /opt/pdbx/build \
50 | && cmake ../ \
51 | && make install \
52 | && popd
53 |
54 | RUN git clone --branch v3.3.0 https://github.com/soedinglab/hh-suite.git /tmp/hh-suite \
55 | && mkdir /tmp/hh-suite/build \
56 | && pushd /tmp/hh-suite/build \
57 | && cmake -DCMAKE_INSTALL_PREFIX=/opt/hhsuite .. \
58 | && make -j 4 && make install \
59 | && ln -s /opt/hhsuite/bin/* /usr/bin \
60 | && popd \
61 | && rm -rf /tmp/hh-suite
62 |
63 | RUN mkdir /opt/DPAM && mkdir /opt/DPAM/scripts
64 | COPY scripts/*.py /opt/DPAM/scripts
65 | COPY utilities/HHPaths.pm /opt/hhsuite/scripts
66 | COPY utilities/pdb2fasta /usr/bin
67 | COPY utilities/foldseek /usr/bin
68 |
69 | RUN chmod -R +x /opt/DPAM/scripts
70 |
71 | ENV PATH="/opt/DPAM/scripts:/opt/hhsuite/scripts:/opt/hhsuite/bin:/opt/DaliLite.v5/bin:$PATH"
72 | ENV LD_LIBRARY_PATH="/opt/conda/lib:$LD_LIBRARY_PATH"
73 | ENV PERL5LIB="/usr/local/lib/perl5:/opt/hhsuite/scripts"
74 | ENV OPENBLAS_NUM_THREADS=1
75 |
76 |
--------------------------------------------------------------------------------
/docker/scripts/run_dpam.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import os, sys, time, subprocess
3 | dataset = sys.argv[1]
4 | ncore = sys.argv[2]
5 | wd = os.getcwd()
6 |
7 | for step in range(1,25):  # pipeline steps 1-24
8 |     if 1 <= step <= 19 or step == 21 or step == 23:  # steps driven by a run_stepN.py wrapper
9 |         if os.path.exists(dataset + '_step' + str(step) + '.log'):
10 |             with open(dataset + '_step' + str(step) + '.log') as f:
11 |                 step_logs = f.read()
12 |             if 'done\n' != step_logs:
13 |                 rcode = subprocess.run('run_step' + str(step) + '.py ' + dataset + ' ' + ncore,shell=True).returncode
14 |                 if rcode != 0:
15 |                     print(f'Error in step{step}')
16 |                     sys.exit(1)
17 |         else:
18 |             for s in range(step,25):
19 |                 os.system('rm ' + dataset + '_step' + str(s) + '.log')
20 |                 os.system('rm -rf step_' + str(s) + '/' + dataset + '/*')
21 |             rcode = subprocess.run('run_step' + str(step) + '.py ' + dataset + ' ' + ncore,shell=True).returncode
22 |             if rcode != 0:
23 |                 print(f'Error in step{step}')
24 |                 sys.exit(1)
25 |     elif step == 20:  # step20_extract_domains.py is invoked directly
26 |         run_flag = 0
27 |         if os.path.exists(dataset + '_step' + str(step) + '.log'):
28 |             with open(dataset + '_step' + str(step) + '.log') as f:
29 |                 step_logs = f.read()
30 |             if 'done\n' != step_logs:
31 |                 run_flag = 1
32 |         else:
33 |             run_flag = 1
34 |         if run_flag == 1:
35 |             for s in range(step,25):
36 |                 os.system('rm ' + dataset + '_step' + str(s) + '.log')
37 |                 os.system('rm -rf step_' + str(s) + '/' + dataset + '/*')
38 |             status_code = subprocess.run('step20_extract_domains.py ' + dataset,shell=True).returncode
39 |             if status_code == 0:
40 |                 with open(dataset + '_step20.log','w') as f:
41 |                     f.write('done\n')
42 |             else:
43 |                 with open(dataset + '_step20.log','w') as f:
44 |                     f.write('fail\n')
45 |                 print(f'Error in step{step}')
46 |                 sys.exit(1)
47 |     elif step == 22:  # step22_merge_domains.py is invoked directly
48 |         run_flag = 0
49 |         if os.path.exists(dataset + '_step' + str(step) + '.log'):
50 |             with open(dataset + '_step' + str(step) + '.log') as f:
51 |                 step_logs = f.read()
52 |             if 'done\n' != step_logs:
53 |                 run_flag = 1
54 |         else:
55 |             run_flag = 1
56 |         if run_flag == 1:
57 |             for s in range(step,25):
58 |                 os.system('rm ' + dataset + '_step' + str(s) + '.log')
59 |                 os.system('rm -rf step_' + str(s) + '/' + dataset + '/*')
60 |             status_code = subprocess.run('step22_merge_domains.py ' + dataset,shell=True).returncode
61 |             if status_code == 0:
62 |                 with open(dataset + '_step22.log','w') as f:
63 |                     f.write('done\n')
64 |             else:
65 |                 with open(dataset + '_step22.log','w') as f:
66 |                     f.write('fail\n')
67 |                 print(f'Error in step{step}')
68 |                 sys.exit(1)
69 |     elif step == 24:  # step24_integrate_results.py is invoked directly
70 |         run_flag = 0
71 |         if os.path.exists(dataset + '_step' + str(step) + '.log'):
72 |             with open(dataset + '_step' + str(step) + '.log') as f:
73 |                 step_logs = f.read()
74 |             if 'done\n' != step_logs:
75 |                 run_flag = 1
76 |         else:
77 |             run_flag = 1
78 |         if run_flag == 1:
79 |             for s in range(step,25):
80 |                 os.system('rm ' + dataset + '_step' + str(s) + '.log')
81 |                 os.system('rm -rf step_' + str(s) + '/' + dataset + '/*')
82 |             status_code = subprocess.run('step24_integrate_results.py ' + dataset,shell=True).returncode
83 |             if status_code == 0:
84 |                 with open(dataset + '_step24.log','w') as f:
85 |                     f.write('done\n')
86 |             else:
87 |                 with open(dataset + '_step24.log','w') as f:
88 |                     f.write('fail\n')
89 |                 print(f'Error in step{step}')
90 |                 sys.exit(1)
91 | filelist = [wd + '/' + dataset + '_step' + str(k) + '.log' for k in range(1,25)]  # final check: every step log must report 'done'
92 | undone = 24
93 | for name in filelist:
94 |     with open(name) as f:
95 |         info = f.read()
96 |     if info.strip() == 'done':
97 |         undone = undone - 1
98 |     else:
99 |         print(dataset + ' ' + name.split('/')[-1].split(dataset + '_')[1] + ' has errors..Fail')
100 |         break
101 | if undone == 0:
102 |     print(dataset + ' done')
103 |
--------------------------------------------------------------------------------
/docker/scripts/run_step1.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(sample,cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return sample+' succeed'
9 | else:
10 | return sample+' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | sample=cmd.split()[2]
18 | process = pool.apply_async(run_cmd,(sample,cmd,))
19 | result.append(process)
20 | for process in result:
21 | log.append(process.get())
22 | return log
23 |
24 |
25 |
26 | dataset = sys.argv[1]
27 | ncore = int(sys.argv[2])
28 |
29 | if not os.path.exists('step1/'):
30 | os.system('mkdir step1')
31 | if not os.path.exists('step1/' + dataset):
32 | os.system('mkdir step1/' + dataset)
33 |
34 | fp = open(dataset + '_struc.list', 'r')
35 | cases = []
36 | for line in fp:
37 | words = line.split()
38 | accession = words[0]
39 | cases.append(accession)
40 | # cases.append([accession, version])
41 | fp.close()
42 |
43 | cmds = []
44 | for case in cases:
45 | if os.path.exists('step1/' + dataset + '/' + case + '.fa'):
46 | fp = open('step1/' + dataset + '/' + case + '.fa', 'r')
47 | check_header = 0
48 | check_seq = 0
49 | check_length = 0
50 | for line in fp:
51 | check_length += 1
52 | if line[0] == '>':
53 | if line[1:-1] == case:
54 | check_header = 1
55 | else:
56 | if len(line) > 10:
57 | check_seq = 1
58 | fp.close()
59 | if check_header and check_seq and check_length == 2:
60 | pass
61 | else:
62 | os.system('rm step1/' + dataset + '/' + case + '.fa')
63 | cmds.append('python /opt/DPAM/scripts/step1_get_AFDB_seqs.py ' + dataset + ' ' + case)
64 | else:
65 | cmds.append('python /opt/DPAM/scripts/step1_get_AFDB_seqs.py ' + dataset + ' ' + case)
66 |
67 |
68 | if cmds:
69 | logs = batch_run(cmds,ncore)
70 | fail = [i for i in logs if 'fail' in i]
71 | if fail:
72 | with open(dataset + '_step1.log','w') as f:
73 | for i in fail:
74 | f.write(i+'\n')
75 | else:
76 | with open(dataset + '_step1.log','w') as f:
77 | f.write('done\n')
78 | else:
79 | with open(dataset + '_step1.log','w') as f:
80 | f.write('done\n')
81 |
--------------------------------------------------------------------------------
/docker/scripts/run_step10.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return cmd + ' succeed'
9 | else:
10 | return cmd + ' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 |
25 | dataset = sys.argv[1]
26 | ncore = int(sys.argv[2])
27 |
28 | if not os.path.exists('step10'):
29 | os.system('mkdir step10/')
30 | if not os.path.exists('step10/' + dataset):
31 | os.system('mkdir step10/' + dataset)
32 |
33 | fp = open(dataset + '_struc.list', 'r')
34 | prots = []
35 | for line in fp:
36 | words = line.split()
37 | prots.append(words[0])
38 | fp.close()
39 |
40 | need_prots = set([])
41 | for prot in prots:
42 | get_seq = 0
43 | if os.path.exists('step10/' + dataset + '/' + prot + '_sequence.result'):
44 | fp = open('step10/' + dataset + '/' + prot + '_sequence.result','r')
45 | word_counts = set([])
46 | for line in fp:
47 | words = line.split()
48 | word_counts.add(len(words))
49 | fp.close()
50 | if len(word_counts) == 1 and 8 in word_counts:
51 | get_seq = 1
52 | else:
53 | os.system('rm step10/' + dataset + '/' + prot + '_sequence.result')
54 | need_prots.add(prot)
55 |
56 | get_str = 0
57 | if os.path.exists('step10/' + dataset + '/' + prot + '_structure.result'):
58 | fp = open('step10/' + dataset + '/' + prot + '_structure.result','r')
59 | word_counts = set([])
60 | for line in fp:
61 | words = line.split()
62 | word_counts.add(len(words))
63 | fp.close()
64 | if len(word_counts) == 1 and 12 in word_counts:
65 | get_str = 1
66 | else:
67 | os.system('rm step10/' + dataset + '/' + prot + '_structure.result')
68 | need_prots.add(prot)
69 |
70 | if get_seq and get_str:
71 | pass
72 | elif os.path.exists('step10/' + dataset + '/' + prot + '.done'):
73 | pass
74 | else:
75 | need_prots.add(prot)
76 |
77 |
78 | if need_prots:
79 | cmds = []
80 | for prot in need_prots:
81 | cmds.append('step10_get_support.py ' + dataset + ' ' + prot + '\n')
82 | logs = batch_run(cmds, ncore)
83 | fail = [i for i in logs if 'fail' in i]
84 | if fail:
85 | with open(dataset + '_step10.log','w') as f:
86 | for i in fail:
87 | f.write(i+'\n')
88 | else:
89 | with open(dataset + '_step10.log','w') as f:
90 | f.write('done\n')
91 | else:
92 | with open(dataset + '_step10.log','w') as f:
93 | f.write('done\n')
94 |
--------------------------------------------------------------------------------
/docker/scripts/run_step11.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return cmd + ' succeed'
9 | else:
10 | return cmd + ' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 |
25 | dataset = sys.argv[1]
26 | ncore = int(sys.argv[2])
27 |
28 | if not os.path.exists('step11'):
29 | os.system('mkdir step11/')
30 | if not os.path.exists('step11/' + dataset):
31 | os.system('mkdir step11/' + dataset)
32 |
33 | fp = open(dataset + '_struc.list', 'r')
34 | prots = []
35 | for line in fp:
36 | words = line.split()
37 | prots.append(words[0])
38 | fp.close()
39 |
40 | need_prots = []
41 | for prot in prots:
42 | if os.path.exists('step11/' + dataset + '/' + prot + '.goodDomains'):
43 | fp = open('step11/' + dataset + '/' + prot + '.goodDomains','r')
44 | seq_word_counts = set([])
45 | str_word_counts = set([])
46 | for line in fp:
47 | words = line.split()
48 | if words[0] == 'sequence':
49 | seq_word_counts.add(len(words))
50 | elif words[0] == 'structure':
51 | str_word_counts.add(len(words))
52 | fp.close()
53 |
54 | bad_seq = 0
55 | bad_str = 0
56 | if seq_word_counts:
57 | if len(seq_word_counts) == 1 and 10 in seq_word_counts:
58 | pass
59 | else:
60 | bad_seq = 1
61 | if str_word_counts:
62 | if len(str_word_counts) == 1 and 16 in str_word_counts:
63 | pass
64 | else:
65 | bad_str = 1
66 |
67 | if bad_seq or bad_str:
68 | os.system('rm step11/' + dataset + '/' + prot + '.goodDomains')
69 | need_prots.append(prot)
70 | elif os.path.exists('step11/' + dataset + '/' + prot + '.done'):
71 | pass
72 | else:
73 | need_prots.append(prot)
74 |
75 |
76 | if need_prots:
77 | cmds = []
78 | for prot in need_prots:
79 | cmds.append('step11_get_good_domains.py ' + dataset + ' ' + prot)
80 | logs = batch_run(cmds, ncore)
81 | fail = [i for i in logs if 'fail' in i]
82 | if fail:
83 | with open(dataset + '_step11.log','w') as f:
84 | for i in fail:
85 | f.write(i+'\n')
86 | else:
87 | with open(dataset + '_step11.log','w') as f:
88 | f.write('done\n')
89 | else:
90 | with open(dataset + '_step11.log','w') as f:
91 | f.write('done\n')
92 |
--------------------------------------------------------------------------------
/docker/scripts/run_step12.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return cmd + ' succeed'
9 | else:
10 | return cmd + ' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 |
25 | dataset = sys.argv[1]
26 | ncore = int(sys.argv[2])
27 |
28 | if not os.path.exists('step12'):
29 | os.system('mkdir step12/')
30 | if not os.path.exists('step12/' + dataset):
31 | os.system('mkdir step12/' + dataset)
32 |
33 | fp = open(dataset + '_struc.list', 'r')
34 | prots = []
35 | for line in fp:
36 | words = line.split()
37 | prots.append(words[0])
38 | fp.close()
39 |
40 | need_prots = []
41 | for prot in prots:
42 | if os.path.exists('step12/' + dataset + '/' + prot + '.sse'):
43 | fp = open('step12/' + dataset + '/' + prot + '.sse', 'r')
44 | word_counts = set([])
45 | for line in fp:
46 | words = line.split()
47 | word_counts.add(len(words))
48 | fp.close()
49 | if len(word_counts) == 1 and 4 in word_counts:
50 | pass
51 | else:
52 | os.system('rm step12/' + dataset + '/' + prot + '.sse')
53 | need_prots.append(prot)
54 | else:
55 | need_prots.append(prot)
56 |
57 | if need_prots:
58 | cmds = []
59 | for prot in need_prots:
60 | cmds.append('step12_get_sse.py ' + dataset + ' ' + prot + '\n')
61 | logs = batch_run(cmds, ncore)
62 | fail = [i for i in logs if 'fail' in i]
63 | if fail:
64 | with open(dataset + '_step12.log','w') as f:
65 | for i in fail:
66 | f.write(i+'\n')
67 | else:
68 | with open(dataset + '_step12.log','w') as f:
69 | f.write('done\n')
70 | else:
71 | with open(dataset + '_step12.log','w') as f:
72 | f.write('done\n')
73 |
--------------------------------------------------------------------------------
/docker/scripts/run_step13.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return cmd + ' succeed'
9 | else:
10 | return cmd + ' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 | dataset = sys.argv[1]
25 | ncore = int(sys.argv[2])
26 |
27 | if not os.path.exists('step13'):
28 | os.system('mkdir step13/')
29 |
30 | if not os.path.exists('step13/' + dataset):
31 | os.system('mkdir step13/' + dataset)
32 |
33 | fp = open(dataset + '_struc.list', 'r')
34 | cases = []
35 | for line in fp:
36 | words = line.split()
37 | cases.append(words[0])
38 | fp.close()
39 |
40 | need_cases = []
41 | for case in cases:
42 | prot = case
43 | if os.path.exists('step13/' + dataset + '/' + prot + '.diso'):
44 | pass
45 | else:
46 | need_cases.append(case)
47 |
48 | if need_cases:
49 | cmds = []
50 | for case in need_cases:
51 | cmds.append('step13_get_diso.py ' + dataset + ' ' + case)
52 | logs = batch_run(cmds, ncore)
53 | fail = [i for i in logs if 'fail' in i]
54 | if fail:
55 | with open(dataset + '_step13.log','w') as f:
56 | for i in fail:
57 | f.write(i+'\n')
58 | else:
59 | with open(dataset + '_step13.log','w') as f:
60 | f.write('done\n')
61 | else:
62 | with open(dataset + '_step13.log','w') as f:
63 | f.write('done\n')
64 |
--------------------------------------------------------------------------------
/docker/scripts/run_step14.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 | def run_cmd(cmd):
5 | status=subprocess.run(cmd,shell=True).returncode
6 | if status==0:
7 | return cmd + ' succeed'
8 | else:
9 | return cmd + ' fail'
10 |
11 | def batch_run(cmds,process_num):
12 | log=[]
13 | pool = Pool(processes=process_num)
14 | result = []
15 | for cmd in cmds:
16 | process = pool.apply_async(run_cmd,(cmd,))
17 | result.append(process)
18 | for process in result:
19 | log.append(process.get())
20 | return log
21 |
22 |
23 | dataset = sys.argv[1]
24 | ncore = int(sys.argv[2])
25 |
26 | if not os.path.exists('step14'):
27 | os.system('mkdir step14/')
28 | if not os.path.exists('step14/' + dataset):
29 | os.system('mkdir step14/' + dataset)
30 |
31 | fp = open(dataset + '_struc.list', 'r')
32 | cases = []
33 | for line in fp:
34 | words = line.split()
35 | cases.append(words[0])
36 | fp.close()
37 |
38 |
39 | need_cases = []
40 | for case in cases:
41 | prot = case
42 | if os.path.exists('step14/' + dataset + '/' + prot + '.domains'):
43 | word_counts = set([])
44 | fp = open('step14/' + dataset + '/' + prot + '.domains', 'r')
45 | for line in fp:
46 | words = line.split()
47 | word_counts.add(len(words))
48 | fp.close()
49 | if len(word_counts) == 1 and 2 in word_counts:
50 | pass
51 | else:
52 | os.system('rm step14/' + dataset + '/' + prot + '.domains')
53 | need_cases.append(case)
54 | elif os.path.exists('step14/' + dataset + '/' + prot + '.done'):
55 | pass
56 | else:
57 | need_cases.append(case)
58 |
59 | if need_cases:
60 | cmds = []
61 | for case in need_cases:
62 | cmds.append('step14_parse_domains.py ' + dataset + ' ' + case)
63 | logs = batch_run(cmds, ncore)
64 | fail = [i for i in logs if 'fail' in i]
65 | if fail:
66 | with open(dataset + '_step14.log','w') as f:
67 | for i in fail:
68 | f.write(i+'\n')
69 | else:
70 | with open(dataset + '_step14.log','w') as f:
71 | f.write('done\n')
72 | else:
73 | with open(dataset + '_step14.log','w') as f:
74 | f.write('done\n')
75 |
--------------------------------------------------------------------------------
/docker/scripts/run_step15.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return cmd + ' succeed'
9 | else:
10 | return cmd + ' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 |
25 | dataset = sys.argv[1]
26 | ncore = int(sys.argv[2])
27 |
28 | if not os.path.exists('step15'):
29 | os.system('mkdir step15/')
30 | if not os.path.exists('step15/' + dataset):
31 | os.system('mkdir step15/' + dataset)
32 |
33 | fp = open(dataset + '_struc.list', 'r')
34 | prots = []
35 | for line in fp:
36 | words = line.split()
37 | prots.append(words[0])
38 | fp.close()
39 |
40 | need_prots = []
41 | for prot in prots:
42 | if os.path.exists('step15/' + dataset + '/' + prot + '.data'):
43 | word_counts = set([])
44 | fp = open('step15/' + dataset + '/' + prot + '.data', 'r')
45 | for line in fp:
46 | words = line.split()
47 | word_counts.add(len(words))
48 | fp.close()
49 | if len(word_counts) == 1 and 23 in word_counts:
50 | pass
51 | else:
52 | os.system('rm step15/' + dataset + '/' + prot + '.data')
53 | need_prots.append(prot)
54 | else:
55 | if os.path.exists('step15/' + dataset + '/' + prot + '.done'):
56 | pass
57 | else:
58 | need_prots.append(prot)
59 |
60 | if need_prots:
61 | cmds = []
62 | for prot in need_prots:
63 | cmds.append('step15_prepare_domass.py ' + dataset + ' ' + prot)
64 | logs = batch_run(cmds, ncore)
65 | fail = [i for i in logs if 'fail' in i]
66 | if fail:
67 | with open(dataset + '_step15.log','w') as f:
68 | for i in fail:
69 | f.write(i+'\n')
70 | else:
71 | with open(dataset + '_step15.log','w') as f:
72 | f.write('done\n')
73 | else:
74 | with open(dataset + '_step15.log','w') as f:
75 | f.write('done\n')
76 |
--------------------------------------------------------------------------------
/docker/scripts/run_step16.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 |
4 | dataset = sys.argv[1]
5 |
6 | if not os.path.exists('step16'):
7 | os.system('mkdir step16/')
8 | if not os.path.exists('step16/' + dataset):
9 | os.system('mkdir step16/' + dataset)
10 |
11 | fp = open(dataset + '_struc.list', 'r')
12 | prots = []
13 | for line in fp:
14 | words = line.split()
15 | prots.append(words[0])
16 | fp.close()
17 |
18 | need_prots = []
19 | for prot in prots:
20 | if os.path.exists('step16/' + dataset + '/' + prot + '.result'):
21 | word_counts = set([])
22 | fp = open('step16/' + dataset + '/' + prot + '.result', 'r')
23 | for line in fp:
24 | words = line.split()
25 | word_counts.add(len(words))
26 | fp.close()
27 | if len(word_counts) == 1 and 21 in word_counts:
28 | pass
29 | else:
30 | os.system('rm step16/' + dataset + '/' + prot + '.result')
31 | need_prots.append(prot)
32 | elif os.path.exists('step16/' + dataset + '/' + prot + '.done'):
33 | pass
34 | else:
35 | need_prots.append(prot)
36 |
37 |
38 | if need_prots:
39 | rp = open('step16_' + dataset + '.list', 'w')
40 | for prot in need_prots:
41 | rp.write(prot + '\n')
42 | rp.close()
43 | rcode=subprocess.run('step16_run_domass.py ' + dataset,shell=True).returncode
44 | if rcode!=0:
45 | with open(dataset + '_step16.log','w')as f:
46 | f.write(' '.join(need_prots)+' fail\n')
47 | else:
48 | with open(dataset + '_step16.log','w')as f:
49 | f.write('done\n')
50 | os.system('rm step16_' + dataset + '*.list\n')
51 | else:
52 | with open(dataset + '_step16.log','w')as f:
53 | f.write('done\n')
54 |
--------------------------------------------------------------------------------
/docker/scripts/run_step17.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return cmd+' succeed'
9 | else:
10 | return cmd+' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 | dataset = sys.argv[1]
25 | ncore = int(sys.argv[2])
26 | if not os.path.exists('step17'):
27 | os.system('mkdir step17/')
28 | if not os.path.exists('step17/' + dataset):
29 | os.system('mkdir step17/' + dataset)
30 |
31 | fp = open(dataset + '_struc.list', 'r')
32 | prots = []
33 | for line in fp:
34 | words = line.split()
35 | prots.append(words[0])
36 | fp.close()
37 |
38 | need_prots = []
39 | for prot in prots:
40 | if os.path.exists('step17/' + dataset + '/' + prot + '.result'):
41 | word_counts = set([])
42 | fp = open('step17/' + dataset + '/' + prot + '.result', 'r')
43 | for line in fp:
44 | words = line.split()
45 | word_counts.add(len(words))
46 | fp.close()
47 | if len(word_counts) == 1 and 6 in word_counts:
48 | pass
49 | else:
50 | os.system('rm step17/' + dataset + '/' + prot + '.result')
51 | need_prots.append(prot)
52 | elif os.path.exists('step17/' + dataset + '/' + prot + '.done'):
53 | pass
54 | else:
55 | need_prots.append(prot)
56 |
57 | if need_prots:
58 | cmds = []
59 | for prot in need_prots:
60 | cmds.append('step17_get_confident.py ' + dataset + ' ' + prot)
61 | logs = batch_run(cmds, ncore)
62 | fail = [i for i in logs if 'fail' in i]
63 | if fail:
64 | with open(dataset + '_step17.log','w') as f:
65 | for i in fail:
66 | f.write(i+'\n')
67 | else:
68 | with open(dataset + '_step17.log','w') as f:
69 | f.write('done\n')
70 | else:
71 | with open(dataset + '_step17.log','w') as f:
72 | f.write('done\n')
73 |
--------------------------------------------------------------------------------
/docker/scripts/run_step18.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return cmd+' succeed'
9 | else:
10 | return cmd+' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 |
25 | dataset = sys.argv[1]
26 | ncore = int(sys.argv[2])
27 | if not os.path.exists('step18'):
28 | os.system('mkdir step18/')
29 | if not os.path.exists('step18/' + dataset):
30 | os.system('mkdir step18/' + dataset)
31 |
32 | fp = open(dataset + '_struc.list', 'r')
33 | prots = []
34 | for line in fp:
35 | words = line.split()
36 | prots.append(words[0])
37 | fp.close()
38 |
39 | need_prots = []
40 | for prot in prots:
41 | if os.path.exists('step18/' + dataset + '/' + prot + '.data'):
42 | word_counts = set([])
43 | fp = open('step18/' + dataset + '/' + prot + '.data', 'r')
44 | for line in fp:
45 | words = line.split()
46 | word_counts.add(len(words))
47 | fp.close()
48 | if len(word_counts) == 1 and 8 in word_counts:
49 | pass
50 | else:
51 | os.system('rm step18/' + dataset + '/' + prot + '.data')
52 | need_prots.append(prot)
53 | elif os.path.exists('step18/' + dataset + '/' + prot + '.done'):
54 | pass
55 | else:
56 | need_prots.append(prot)
57 |
58 | if need_prots:
59 | cmds = []
60 | for prot in need_prots:
61 | cmds.append('step18_get_mapping.py ' + dataset + ' ' + prot)
62 | logs = batch_run(cmds, ncore)
63 | fail = [i for i in logs if 'fail' in i]
64 | if fail:
65 | with open(dataset + '_step18.log','w') as f:
66 | for i in fail:
67 | f.write(i+'\n')
68 | else:
69 | with open(dataset + '_step18.log','w') as f:
70 | f.write('done\n')
71 | else:
72 | with open(dataset + '_step18.log','w') as f:
73 | f.write('done\n')
74 |
--------------------------------------------------------------------------------
/docker/scripts/run_step19.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return cmd+' succeed'
9 | else:
10 | return cmd+' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 | dataset = sys.argv[1]
24 | ncore = int(sys.argv[2])
25 | if not os.path.exists('step19'):
26 | os.system('mkdir step19/')
27 | if not os.path.exists('step19/' + dataset):
28 | os.system('mkdir step19/' + dataset)
29 |
30 | fp = open(dataset + '_struc.list', 'r')
31 | prots = []
32 | for line in fp:
33 | words = line.split()
34 | prots.append(words[0])
35 | fp.close()
36 |
37 | need_prots = set([])
38 | for prot in prots:
39 | check_info = 0
40 | if os.path.exists('step19/' + dataset + '/' + prot + '.info'):
41 | word_counts = set([])
42 | fp = open('step19/' + dataset + '/' + prot + '.info', 'r')
43 | for line in fp:
44 | words = line.split()
45 | word_counts.add(len(words))
46 | fp.close()
47 | if len(word_counts) == 1 and 2 in word_counts:
48 | check_info = 1
49 | else:
50 | os.system('rm step19/' + dataset + '/' + prot + '.info')
51 | need_prots.add(prot)
52 |
53 | check_result = 0
54 | if os.path.exists('step19/' + dataset + '/' + prot + '.result'):
55 | word_counts = set([])
56 | fp = open('step19/' + dataset + '/' + prot + '.result', 'r')
57 | for line in fp:
58 | words = line.split()
59 | word_counts.add(len(words))
60 | fp.close()
61 | if len(word_counts) == 1 and 4 in word_counts:
62 | check_result = 1
63 | else:
64 | os.system('rm step19/' + dataset + '/' + prot + '.result')
65 | need_prots.add(prot)
66 |
67 | if check_info and check_result:
68 | pass
69 | elif os.path.exists('step19/' + dataset + '/' + prot + '.done'):
70 | pass
71 | else:
72 | need_prots.add(prot)
73 |
74 | if need_prots:
75 | cmds = []
76 | for prot in need_prots:
77 | cmds.append('step19_get_merge_candidates.py ' + dataset + ' ' + prot)
78 | logs = batch_run(cmds, ncore)
79 | fail = [i for i in logs if 'fail' in i]
80 | if fail:
81 | with open(dataset + '_step19.log','w') as f:
82 | for i in fail:
83 | f.write(i+'\n')
84 | else:
85 | with open(dataset + '_step19.log','w') as f:
86 | f.write('done\n')
87 | else:
88 | with open(dataset + '_step19.log','w') as f:
89 | f.write('done\n')
90 |
--------------------------------------------------------------------------------
/docker/scripts/run_step2.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 |
3 | import os, sys, subprocess
4 | from multiprocessing import Pool
5 |
6 | def run_cmd(sample,cmd):
7 | status=subprocess.run(cmd,shell=True).returncode
8 | if status==0:
9 | return sample+' succeed'
10 | else:
11 | return sample+' fail'
12 |
13 | def batch_run(cmds,process_num):
14 | log=[]
15 | pool = Pool(processes=process_num)
16 | result = []
17 | for cmd in cmds:
18 | sample=cmd.split()[2]
19 | process = pool.apply_async(run_cmd,(sample,cmd,))
20 | result.append(process)
21 | for process in result:
22 | log.append(process.get())
23 | return log
24 |
25 |
26 |
27 | dataset = sys.argv[1]
28 | ncore = int(sys.argv[2])
29 | if not os.path.exists('step2'):
30 | os.system('mkdir step2/')
31 | if not os.path.exists('step2/' + dataset):
32 | os.system('mkdir step2/' + dataset)
33 |
34 | fp = open(dataset + '_struc.list', 'r')
35 | cases = []
36 | for line in fp:
37 | words = line.split()
38 | cases.append(words[0])
39 | fp.close()
40 |
41 | cmds = []
42 | for case in cases:
43 | fasta_length = 0
44 | if os.path.exists('step1/' + dataset + '/' + case + '.fa'):
45 | fp = open('step1/' + dataset + '/' + case + '.fa', 'r')
46 | for line in fp:
47 | if line[0] != '>':
48 | fasta_length = len(line[:-1])
49 | fp.close()
50 |
51 | pdb_resids = set([])
52 | if os.path.exists('step2/' + dataset + '/' + case + '.pdb'):
53 | fp = open('step2/' + dataset + '/' + case + '.pdb', 'r')
54 | for line in fp:
55 | if len(line) >= 50:
56 | if line[:4] == 'ATOM':
57 | resid = int(line[22:26])
58 | pdb_resids.add(resid)
59 | fp.close()
60 | pdb_length = len(pdb_resids)
61 |
62 | if fasta_length == pdb_length:
63 | if fasta_length:
64 | pass
65 | else:
66 | if os.path.exists('step2/' + dataset + '/' + case + '.pdb'):
67 | os.system('rm step2/' + dataset + '/' + case + '.pdb')
68 | cmds.append('python /opt/DPAM/scripts/step2_get_AFDB_pdbs.py ' + dataset + ' ' + case + ' ' + case + '\n')
69 | else:
70 | if os.path.exists('step2/' + dataset + '/' + case + '.pdb'):
71 | os.system('rm step2/' + dataset + '/' + case + '.pdb')
72 | cmds.append('python /opt/DPAM/scripts/step2_get_AFDB_pdbs.py ' + dataset + ' ' + case + ' ' + case + '\n')
73 |
74 | if cmds:
75 | logs=batch_run(cmds, ncore)
76 | fail = [i for i in logs if 'fail' in i]
77 | if fail:
78 | with open(dataset + '_step2.log','w') as f:
79 | for i in fail:
80 | f.write(i+'\n')
81 | else:
82 | with open(dataset + '_step2.log','w') as f:
83 | f.write('done\n')
84 | else:
85 | with open(dataset + '_step2.log','w') as f:
86 | f.write('done\n')
87 |
--------------------------------------------------------------------------------
/docker/scripts/run_step21.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return cmd+' succeed'
9 | else:
10 | return cmd+' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 | dataset = sys.argv[1]
25 | ncore = int(sys.argv[2])
26 | os.system('rm step21_' + dataset + '_*.list')
27 |
28 | fp = os.popen('ls -1 step19/' + dataset + '/*.result')
29 | prots = []
30 | for line in fp:
31 | prot = line.split('/')[2].split('.')[0]
32 | prots.append(prot)
33 | fp.close()
34 |
35 | cases = []
36 | all_cases = set([])
37 | for prot in prots:
38 | fp = open('step19/' + dataset + '/' + prot + '.result', 'r')
39 | for line in fp:
40 | words = line.split()
41 | domain1 = words[0]
42 | resids1 = words[1]
43 | domain2 = words[2]
44 | resids2 = words[3]
45 | cases.append([prot, domain1, resids1, domain2, resids2])
46 | all_cases.add(prot + '_' + domain1 + '_' + domain2)
47 | fp.close()
48 |
49 | get_cases = set([])
50 | if os.path.exists('step21_' + dataset + '.result'):
51 | fp = open('step21_' + dataset + '.result','r')
52 | for line in fp:
53 | words = line.split()
54 | get_cases.add(words[0] + '_' + words[1] + '_' + words[2])
55 | fp.close()
56 |
57 | if all_cases == get_cases:
58 | with open(dataset + '_step21.log','w') as f:
59 | f.write('done\n')
60 | else:
61 | total = len(cases)
62 | batchsize = total // ncore + 1
63 | cmds = []
64 | for i in range(ncore):
65 | rp = open('step21_' + dataset + '_' + str(i) + '.list', 'w')
66 | for case in cases[batchsize * i : batchsize * i + batchsize]:
67 | rp.write(case[0] + '\t' + case[1] + '\t' + case[2] + '\t' + case[3] + '\t' + case[4] + '\n')
68 | rp.close()
69 | cmds.append('step21_compare_domains.py ' + dataset + ' ' + str(i))
70 | logs = batch_run(cmds, ncore)
71 | fail = [i for i in logs if 'fail' in i]
72 | if fail:
73 | with open(dataset + '_step21.log','w') as f:
74 | for i in fail:
75 | f.write(i+'\n')
76 | else:
77 | with open(dataset + '_step21.log','w') as f:
78 | f.write('done\n')
79 | status=subprocess.run('cat step21_' + dataset + '_*.result >> step21_' + dataset + '.result',shell=True).returncode
80 | os.system('rm step21_' + dataset + '_*.list')
81 | os.system('rm step21_' + dataset + '_*.result')
82 |
--------------------------------------------------------------------------------
/docker/scripts/run_step23.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 |
5 | def run_cmd(cmd):
6 | status=subprocess.run(cmd,shell=True).returncode
7 | if status==0:
8 | return cmd+' succeed'
9 | else:
10 | return cmd+' fail'
11 |
12 | def batch_run(cmds,process_num):
13 | log=[]
14 | pool = Pool(processes=process_num)
15 | result = []
16 | for cmd in cmds:
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 |
25 | dataset = sys.argv[1]
26 | ncore = int(sys.argv[2])
27 | if not os.path.exists('step23'):
28 | os.system('mkdir step23/')
29 | if not os.path.exists('step23/' + dataset):
30 | os.system('mkdir step23/' + dataset)
31 |
32 | fp = open(dataset + '_struc.list', 'r')
33 | prots = []
34 | for line in fp:
35 | words = line.split()
36 | prots.append(words[0])
37 | fp.close()
38 |
39 | need_prots = []
40 | for prot in prots:
41 | if os.path.exists('step23/' + dataset + '/' + prot + '.assign'):
42 | word_counts = set([])
43 | fp = open('step23/' + dataset + '/' + prot + '.assign', 'r')
44 | for line in fp:
45 | words = line.split()
46 | word_counts.add(len(words))
47 | fp.close()
48 | if len(word_counts) == 1 and 10 in word_counts:
49 | pass
50 | else:
51 | os.system('rm step23/' + dataset + '/' + prot + '.assign')
52 | need_prots.append(prot)
53 | else:
54 | if os.path.exists('step23/' + dataset + '/' + prot + '.done'):
55 | pass
56 | else:
57 | need_prots.append(prot)
58 |
59 |
60 | if need_prots:
61 | cmds = []
62 | for prot in need_prots:
63 | cmds.append('step23_get_predictions.py ' + dataset + ' ' + prot + '\n')
64 | logs = batch_run(cmds, ncore)
65 | fail = [i for i in logs if 'fail' in i]
66 | if fail:
67 | with open(dataset + '_step23.log','w') as f:
68 | for i in fail:
69 | f.write(i+'\n')
70 | else:
71 | with open(dataset + '_step23.log','w') as f:
72 | f.write('done\n')
73 | else:
74 | with open(dataset + '_step23.log','w') as f:
75 | f.write('done\n')
76 |
--------------------------------------------------------------------------------
/docker/scripts/run_step3.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess,time
3 | def run_cmd(cmd):
4 | status = subprocess.run(cmd,shell = True).returncode
5 | if status == 0:
6 | return cmd + ' succeed'
7 | else:
8 | return cmd + ' fail'
9 |
10 |
11 | dataset = sys.argv[1]
12 | ncore = sys.argv[2]
13 | if not os.path.exists('step3'):
14 | os.system('mkdir step3/')
15 | if not os.path.exists('step3/' + dataset):
16 | os.system('mkdir step3/' + dataset)
17 |
18 | fp = open(dataset + '_struc.list', 'r')
19 | prots = []
20 | for line in fp:
21 | words = line.split()
22 | prots.append(words[0])
23 | fp.close()
24 |
25 | cmds = []
26 | for prot in prots:
27 | if os.path.exists('step3/' + dataset + '/' + prot + '.hmm') and os.path.exists('step3/' + dataset + '/' + prot + '.hhsearch'):
28 | fp = open('step3/' + dataset + '/' + prot + '.hmm', 'r')
29 | get_sspred = 0
30 | get_ssconf = 0
31 | for line in fp:
32 | if len(line) >= 10:
33 | if line[0] == '>' and line[1:8] == 'ss_pred':
34 | get_sspred = 1
35 | elif line[0] == '>' and line[1:8] == 'ss_conf':
36 | get_ssconf = 1
37 | if get_sspred and get_ssconf:
38 | break
39 | fp.close()
40 |
41 | if get_sspred and get_ssconf:
42 | pass
43 | elif os.path.exists('step3/' + dataset + '/' + prot + '.a3m'):
44 | fp = open('step3/' + dataset + '/' + prot + '.a3m', 'r')
45 | count_line = 0
46 | for line in fp:
47 | count_line += 1
48 | fp.close()
49 | if count_line == 2:
50 | get_sspred = 1
51 | get_ssconf = 1
52 |
53 | fp = open('step3/' + dataset + '/' + prot + '.hhsearch', 'r')
54 | start = 0
55 | end = 0
56 | hitsA = set([])
57 | hitsB = set([])
58 | for line in fp:
59 | words = line.split()
60 | if len(words) >= 2:
61 | if words[0] == 'No' and words[1] == 'Hit':
62 | start = 1
63 | elif words[0] == 'No' and words[1] == '1':
64 | hitsB.add(int(words[1]))
65 | end = 1
66 | elif start and not end:
67 | hitsA.add(int(words[0]))
68 | elif end:
69 | if words[0] == 'No':
70 | hitsB.add(int(words[1]))
71 | fp.close()
72 | last_words = line.split()
73 |
74 | if get_sspred and get_ssconf and hitsA == hitsB and not last_words:
75 | pass
76 | else:
77 | os.system('rm step1/' + dataset + '/' + prot + '.hhr')
78 | os.system('rm step3/' + dataset + '/' + prot + '.a3m')
79 | os.system('rm step3/' + dataset + '/' + prot + '.hmm')
80 | os.system('rm step3/' + dataset + '/' + prot + '.hhsearch')
81 | cmds.append('step3_run_hhsearch.py ' + dataset + ' ' + prot + ' ' + ncore)
82 | else:
83 | if os.path.exists('step1/' + dataset + '/' + prot + '.hhr'):
84 | os.system('rm step1/' + dataset + '/' + prot + '.hhr')
85 | if os.path.exists('step3/' + dataset + '/' + prot + '.a3m'):
86 | os.system('rm step3/' + dataset + '/' + prot + '.a3m')
87 | if os.path.exists('step3/' + dataset + '/' + prot + '.hmm'):
88 | os.system('rm step3/' + dataset + '/' + prot + '.hmm')
89 | if os.path.exists('step3/' + dataset + '/' + prot + '.hhsearch'):
90 | os.system('rm step3/' + dataset + '/' + prot + '.hhsearch')
91 | cmds.append('step3_run_hhsearch.py ' + dataset + ' ' + prot + ' ' + ncore)
92 |
93 | if cmds:
94 | fail=[]
95 | for cmd in cmds:
96 | for i in range(5):
97 | log=run_cmd(cmd)
98 | if 'succeed' in log:
99 | break
100 | time.sleep(1)
101 | else:
102 | fail.append(log)
103 | if fail:
104 | with open(dataset + '_step3.log','w') as f:
105 | for i in fail:
106 | f.write(i + '\n')
107 | sys.exit(1)
108 | else:
109 | with open(dataset + '_step3.log','w') as f:
110 | f.write('done\n')
111 |
--------------------------------------------------------------------------------
/docker/scripts/run_step4.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 |
3 | import os, sys, subprocess
4 | def run_cmd(cmd):
5 | status = subprocess.run(cmd,shell = True).returncode
6 | if status == 0:
7 | return cmd + ' succeed'
8 | else:
9 | return cmd + ' fail'
10 |
11 | dataset = sys.argv[1]
12 | ncore = sys.argv[2]
13 |
14 | if not os.path.exists('step4'):
15 | os.system('mkdir step4/')
16 |
17 | if not os.path.exists('step4/' + dataset):
18 | os.system('mkdir step4/' + dataset)
19 |
20 | fp = open(dataset + '_struc.list', 'r')
21 | prots = []
22 | for line in fp:
23 | words = line.split()
24 | prots.append(words[0])
25 | fp.close()
26 |
27 | need_prots = []
28 | for prot in prots:
29 | if os.path.exists('step4/' + dataset + '/' + prot + '.foldseek'):
30 | fp = open('step4/' + dataset + '/' + prot + '.foldseek','r')
31 | word_counts = set([])
32 | for line in fp:
33 | words = line.split()
34 | word_counts.add(len(words))
35 | fp.close()
36 | if len(word_counts) == 1 and 12 in word_counts:
37 | pass
38 | elif os.path.exists('step4/' + dataset + '/' + prot + '.done'):
39 | pass
40 | else:
41 | os.system('rm step4/' + dataset + '/' + prot + '.foldseek')
42 | need_prots.append(prot)
43 | else:
44 | need_prots.append(prot)
45 |
46 |
47 | if need_prots:
48 | with open('step4/' + dataset + '_step4.list','w') as f:
49 | for i in need_prots:
50 | f.write(i+'\n')
51 | log = run_cmd('step4_run_foldseek.py ' + dataset + ' ' + ncore + ' \n')
52 | if 'fail' in log:
53 | with open(dataset + '_step4.log','w') as f:
54 | f.write(dataset + ' fail\n')
55 | else:
56 | with open(dataset + '_step4.log','w') as f:
57 | f.write('done\n')
58 | os.system('rm step4/' + dataset + '_step4.list')
59 | else:
60 | if os.path.exists('step4/' + dataset + '_step4.list'):
61 | os.system('rm step4/' + dataset + '_step4.list')
62 | with open(dataset + '_step4.log','w') as f:
63 | f.write('done\n')
64 |
--------------------------------------------------------------------------------
/docker/scripts/run_step5.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 | def run_cmd(sample,cmd):
5 | status=subprocess.run(cmd,shell=True).returncode
6 | if status==0:
7 | return sample+' succeed'
8 | else:
9 | return sample+' fail'
10 |
11 | def batch_run(cmds,process_num):
12 | log=[]
13 | pool = Pool(processes=process_num)
14 | result = []
15 | for cmd in cmds:
16 | sample=cmd.split()[2]
17 | process = pool.apply_async(run_cmd,(sample,cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 |
25 |
26 | dataset = sys.argv[1]
27 | ncore = int(sys.argv[2])
28 |
29 | if not os.path.exists('step5'):
30 | os.system('mkdir step5/')
31 | if not os.path.exists('step5/' + dataset):
32 | os.system('mkdir step5/' + dataset)
33 |
34 | if os.path.exists('step5_' + dataset + '.cmds'):
35 | os.system('rm step5_' + dataset + '.cmds')
36 |
37 | fp = open(dataset + '_struc.list', 'r')
38 | prots = []
39 | for line in fp:
40 | words = line.split()
41 | prots.append(words[0])
42 | fp.close()
43 |
44 | need_prots = []
45 | for prot in prots:
46 | if os.path.exists('step5/' + dataset + '/' + prot + '.result'):
47 | fp = open('step5/' + dataset + '/' + prot + '.result','r')
48 | word_counts = set([])
49 | for line in fp:
50 | words = line.split()
51 | word_counts.add(len(words))
52 | fp.close()
53 | if len(word_counts) == 1 and 15 in word_counts:
54 | pass
55 | else:
56 | os.system('rm step5/' + dataset + '/' + prot + '.result')
57 | need_prots.append(prot)
58 | else:
59 | if os.path.exists('step5/' + dataset + '/' + prot + '.done'):
60 | pass
61 | else:
62 | need_prots.append(prot)
63 |
64 | if need_prots:
65 | cmds = []
66 | for prot in need_prots:
67 | cmds.append('step5_process_hhsearch.py ' + dataset + ' ' + prot)
68 | logs = batch_run(cmds,ncore)
69 | fail = [i for i in logs if 'fail' in i]
70 | if fail:
71 | with open(dataset + '_step5.log','w') as f:
72 | for i in fail:
73 | f.write(i+'\n')
74 | else:
75 | with open(dataset + '_step5.log','w') as f:
76 | f.write('done\n')
77 | else:
78 | with open(dataset + '_step5.log','w') as f:
79 | f.write('done\n')
80 |
--------------------------------------------------------------------------------
/docker/scripts/run_step6.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys,subprocess
3 | from multiprocessing import Pool
4 | def run_cmd(sample,cmd):
5 | status=subprocess.run(cmd,shell=True).returncode
6 | if status==0:
7 | return sample+' succeed'
8 | else:
9 | return sample+' fail'
10 |
11 | def batch_run(cmds,process_num):
12 | log=[]
13 | pool = Pool(processes=process_num)
14 | result = []
15 | for cmd in cmds:
16 | sample=cmd.split()[2]
17 | process = pool.apply_async(run_cmd,(sample,cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 |
25 | dataset = sys.argv[1]
26 | ncore = int(sys.argv[2])
27 | if not os.path.exists('step6'):
28 | os.system('mkdir step6/')
29 | if not os.path.exists('step6/' + dataset):
30 | os.system('mkdir step6/' + dataset)
31 |
32 | if os.path.exists('step6_' + dataset + '.cmds'):
33 | os.system('rm step6_' + dataset + '.cmds')
34 |
35 | fp = open(dataset + '_struc.list', 'r')
36 | prots = []
37 | for line in fp:
38 | words = line.split()
39 | prots.append(words[0])
40 | fp.close()
41 |
42 | need_prots = []
43 | for prot in prots:
44 | if os.path.exists('step6/' + dataset + '/' + prot + '.result'):
45 | fp = open('step6/' + dataset + '/' + prot + '.result','r')
46 | word_counts = set([])
47 | for line in fp:
48 | words = line.split()
49 | word_counts.add(len(words))
50 | fp.close()
51 | if len(word_counts) == 1 and 3 in word_counts:
52 | pass
53 | else:
54 | os.system('rm step6/' + dataset + '/' + prot + '.result')
55 | need_prots.append(prot)
56 | else:
57 | need_prots.append(prot)
58 |
59 | if need_prots:
60 | cmds = []
61 | for prot in need_prots:
62 | cmds.append('step6_process_foldseek.py ' + dataset + ' ' + prot)
63 | logs = batch_run(cmds, ncore)
64 | fail = [i for i in logs if 'fail' in i]
65 | if fail:
66 | with open(dataset + '_step6.log','w') as f:
67 | for i in fail:
68 | f.write(i+'\n')
69 | else:
70 | with open(dataset + '_step6.log','w') as f:
71 | f.write('done\n')
72 | else:
73 | with open(dataset + '_step6.log','w') as f:
74 | f.write('done\n')
75 |
--------------------------------------------------------------------------------
/docker/scripts/run_step7.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 | def run_cmd(cmd):
5 | status=subprocess.run(cmd,shell=True).returncode
6 | if status==0:
7 | return cmd + ' succeed'
8 | else:
9 | return cmd + ' fail'
10 |
11 | def batch_run(cmds,process_num):
12 | log=[]
13 | pool = Pool(processes=process_num)
14 | result = []
15 | for cmd in cmds:
16 | sample=cmd.split()[2]
17 | process = pool.apply_async(run_cmd,(cmd,))
18 | result.append(process)
19 | for process in result:
20 | log.append(process.get())
21 | return log
22 |
23 |
24 | dataset = sys.argv[1]
25 | ncore = int(sys.argv[2])
26 | if not os.path.exists('step7'):
27 | os.system('mkdir step7/')
28 | if not os.path.exists('step7/' + dataset):
29 | os.system('mkdir step7/' + dataset)
30 |
31 | fp = open(dataset + '_struc.list', 'r')
32 | prots = []
33 | for line in fp:
34 | words = line.split()
35 | prots.append(words[0])
36 | fp.close()
37 |
38 | need_prots = []
39 | for prot in prots:
40 | if os.path.exists('step7/' + dataset + '/' + prot + '_hits'):
41 | pass
42 | elif os.path.exists('step7/' + dataset + '/' + prot + '.done'):
43 | pass
44 | else:
45 | need_prots.append(prot)
46 |
47 | if need_prots:
48 | cmds = []
49 | for prot in need_prots:
50 | cmds.append('step7_prepare_dali.py ' + dataset + ' ' + prot)
51 | logs = batch_run(cmds, ncore)
52 | fail = [i for i in logs if 'fail' in i]
53 | if fail:
54 | with open(dataset + '_step7.log','w') as f :
55 | for i in fail:
56 | f.write(i+'\n')
57 | else:
58 | with open(dataset + '_step7.log','w') as f:
59 | f.write('done\n')
60 | else:
61 | with open(dataset + '_step7.log','w') as f:
62 | f.write('done\n')
63 |
--------------------------------------------------------------------------------
/docker/scripts/run_step8.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | def run_cmd(cmd):
4 | status=subprocess.run(cmd,shell=True).returncode
5 | if status==0:
6 | return cmd +' succeed'
7 | else:
8 | return cmd +' fail'
9 |
10 | dataset = sys.argv[1]
11 | ncore = sys.argv[2]
12 | if not os.path.exists('step8'):
13 | os.system('mkdir step8/')
14 |
15 | if not os.path.exists('step8/' + dataset):
16 | os.system('mkdir step8/' + dataset)
17 |
18 | fp = open(dataset + '_struc.list', 'r')
19 | prots = []
20 | for line in fp:
21 | words = line.split()
22 | prots.append(words[0])
23 | fp.close()
24 |
25 | need_prots = []
26 | for prot in prots:
27 | if os.path.exists('step8/' + dataset + '/' + prot + '_hits'):
28 | hit_count = 0
29 | fp = open('step8/' + dataset + '/' + prot + '_hits', 'r')
30 | hit_lines = []
31 | hit_line_count = 0
32 | bad = 0
33 | for line in fp:
34 | if line[0] == '>':
35 | hit_count += 1
36 | if hit_line_count:
37 | if hit_line_count + 4 != len(hit_lines):
38 | bad = 1
39 | break
40 | words = line.split()
41 | hit_line_count = int(words[2])
42 | hit_lines = []
43 | else:
44 | hit_lines.append(line)
45 | fp.close()
46 | if hit_line_count:
47 | if hit_line_count + 4 != len(hit_lines):
48 | bad = 1
49 | if bad:
50 | os.system('rm step8/' + dataset + '/' + prot + '_hits')
51 | need_prots.append(prot)
52 | elif hit_count:
53 | pass
54 | else:
55 | if os.path.exists('step8/' + dataset + '/' + prot + '.done'):
56 | pass
57 | else:
58 | os.system('rm step8/' + dataset + '/' + prot + '_hits')
59 | need_prots.append(prot)
60 | else:
61 | if os.path.exists('step8/' + dataset + '/' + prot + '.done'):
62 | pass
63 | else:
64 | need_prots.append(prot)
65 |
66 | if need_prots:
67 | print(need_prots)
68 | fail = []
69 | for prot in need_prots:
70 | log = run_cmd ('step8_iterative_dali.py ' + dataset + ' ' + prot + ' ' + ncore )
71 | if 'fail' in log:
72 | fail.append(log)
73 | if fail:
74 | with open(dataset + '_step8.log','w') as f:
75 | for i in fail:
76 | f.write(i+'\n')
77 | else:
78 | with open(dataset + '_step8.log','w') as f:
79 | f.write('done\n')
80 | else:
81 | with open(dataset + '_step8.log','w') as f:
82 | f.write('done\n')
83 |
--------------------------------------------------------------------------------
/docker/scripts/run_step9.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | from multiprocessing import Pool
4 | def run_cmd(cmd):
5 | status=subprocess.run(cmd,shell=True).returncode
6 | if status==0:
7 | return cmd+' succeed'
8 | else:
9 | return cmd+' fail'
10 |
11 | def batch_run(cmds,process_num):
12 | log=[]
13 | pool = Pool(processes=process_num)
14 | result = []
15 | for cmd in cmds:
16 | process = pool.apply_async(run_cmd,(cmd,))
17 | result.append(process)
18 | for process in result:
19 | log.append(process.get())
20 | return log
21 |
22 |
23 | dataset = sys.argv[1]
24 | ncore = int(sys.argv[2])
25 |
26 | if not os.path.exists('step9'):
27 | os.system('mkdir step9/')
28 |
29 | if not os.path.exists('step9/' + dataset):
30 | os.system('mkdir step9/' + dataset)
31 |
32 | fp = open(dataset + '_struc.list', 'r')
33 | prots = []
34 | for line in fp:
35 | words = line.split()
36 | prots.append(words[0])
37 | fp.close()
38 |
39 | need_prots = []
40 | for prot in prots:
41 | if os.path.exists('step9/' + dataset + '/' + prot + '_good_hits'):
42 | fp = open('step9/' + dataset + '/' + prot + '_good_hits','r')
43 | word_counts = set([])
44 | for line in fp:
45 | words = line.split()
46 | word_counts.add(len(words))
47 | fp.close()
48 | if len(word_counts) == 1 and 15 in word_counts:
49 | pass
50 | else:
51 | os.system('rm step9/' + dataset + '/' + prot + '_good_hits')
52 | need_prots.append(prot)
53 | else:
54 | if os.path.exists('step9/' + dataset + '/' + prot + '.done'):
55 | pass
56 | else:
57 | need_prots.append(prot)
58 |
59 | if need_prots:
60 | cmds = []
61 | for prot in need_prots:
62 | cmds.append('step9_analyze_dali.py ' + dataset + ' ' + prot)
63 | logs = batch_run(cmds, ncore)
64 | fail = [i for i in logs if 'fail' in i]
65 | if fail:
66 | with open(dataset + '_step9.log','w') as f:
67 | for i in fail:
68 | f.write(i+'\n')
69 | else:
70 | with open(dataset + '_step9.log','w') as f:
71 | f.write('done\n')
72 | else:
73 | with open(dataset + '_step9.log','w') as f:
74 | f.write('done\n')
75 |
--------------------------------------------------------------------------------
/docker/scripts/step11_get_good_domains.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys
3 |
4 | fp = open('/mnt/databases/ECOD_norms', 'r')
5 | ecod2norm = {}
6 | for line in fp:
7 | words = line.split()
8 | ecod2norm[words[0]] = float(words[1])
9 | fp.close()
10 |
11 | spname = sys.argv[1]
12 | prot= sys.argv[2]
13 | results = []
14 | if os.path.exists(f'step10/{spname}/{prot}_sequence.result'):
15 | fp = open(f'step10/{spname}/{prot}_sequence.result', 'r')
16 | for line in fp:
17 | words = line.split()
18 | filt_segs = []
19 | for seg in words[6].split(','):
20 | start = int(seg.split('-')[0])
21 | end = int(seg.split('-')[1])
22 | for res in range(start, end + 1):
23 | if not filt_segs:
24 | filt_segs.append([res])
25 | else:
26 | if res > filt_segs[-1][-1] + 10:
27 | filt_segs.append([res])
28 | else:
29 | filt_segs[-1].append(res)
30 |
31 | filt_seg_strings = []
32 | total_good_count = 0
33 | for seg in filt_segs:
34 | start = seg[0]
35 | end = seg[-1]
36 | good_count = 0
37 | for res in range(start, end + 1):
38 | good_count += 1
39 | if good_count >= 5:
40 | total_good_count += good_count
41 | filt_seg_strings.append(f'{str(start)}-{str(end)}')
42 | if total_good_count >= 25:
43 | results.append('sequence\t' + prot + '\t' + '\t'.join(words[:7]) + '\t' + ','.join(filt_seg_strings) + '\n')
44 | fp.close()
45 |
46 | if os.path.exists(f'step10/{spname}/{prot}_structure.result'):
47 | fp = open(f'step10/{spname}/{prot}_structure.result', 'r')
48 | for line in fp:
49 | words = line.split()
50 | ecodnum = words[0].split('_')[0]
51 | edomain = words[1]
52 | zscore = float(words[3])
53 | try:
54 | znorm = round(zscore / ecod2norm[ecodnum], 2)
55 | except KeyError:
56 | znorm = 0.0
57 | qscore = float(words[4])
58 | ztile = float(words[5])
59 | qtile = float(words[6])
60 | rank = float(words[7])
61 | bestprob = float(words[8])
62 | bestcov = float(words[9])
63 |
64 | judge = 0
65 | if rank < 1.5:
66 | judge += 1
67 | if qscore > 0.5:
68 | judge += 1
69 | if ztile < 0.75 and ztile >= 0:
70 | judge += 1
71 | if qtile < 0.75 and qtile >= 0:
72 | judge += 1
73 | if znorm > 0.225:
74 | judge += 1
75 |
76 | seqjudge = 'no'
77 | if bestprob >= 20 and bestcov >= 0.2:
78 | judge += 1
79 | seqjudge = 'low'
80 | if bestprob >= 50 and bestcov >= 0.3:
81 | judge += 1
82 | seqjudge = 'medium'
83 | if bestprob >= 80 and bestcov >= 0.4:
84 | judge += 1
85 | seqjudge = 'high'
86 | if bestprob >= 95 and bestcov >= 0.6:
87 | judge += 1
88 | seqjudge = 'superb'
89 |
90 | if judge:
91 | seg_strings = words[10].split(',')
92 | filt_segs = []
93 | for seg in words[10].split(','):
94 | start = int(seg.split('-')[0])
95 | end = int(seg.split('-')[1])
96 | for res in range(start, end + 1):
97 | if not filt_segs:
98 | filt_segs.append([res])
99 | else:
100 | if res > filt_segs[-1][-1] + 10:
101 | filt_segs.append([res])
102 | else:
103 | filt_segs[-1].append(res)
104 |
105 | filt_seg_strings = []
106 | total_good_count = 0
107 | for seg in filt_segs:
108 | start = seg[0]
109 | end = seg[-1]
110 | good_count = 0
111 | for res in range(start, end + 1):
112 | good_count += 1
113 | if good_count >= 5:
114 | total_good_count += good_count
115 | filt_seg_strings.append(f'{str(start)}-{str(end)}')
116 | if total_good_count >= 25:
117 | results.append('structure\t' + seqjudge + '\t' + prot + '\t' + str(znorm) + '\t' + '\t'.join(words[:10]) + '\t' + ','.join(seg_strings) + '\t' + ','.join(filt_seg_strings) + '\n')
118 | fp.close()
119 |
120 | if results:
121 | rp = open(f'step11/{spname}/{prot}.goodDomains', 'w')
122 | for line in results:
123 | rp.write(line)
124 | rp.close()
125 | else:
126 | os.system(f'echo \'done\' > step11/{spname}/{prot}.done')
127 |
--------------------------------------------------------------------------------
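Both branches of step11_get_good_domains.py apply the same segment filter: the residues of each reported range are regrouped into segments wherever the gap to the previous kept residue exceeds 10, segments shorter than 5 residues are discarded, and the hit is written out only if the surviving segments cover at least 25 residues in total. Below is a minimal, self-contained sketch of that filter; the function name and the example ranges are illustrative and not part of the repository.

    # Sketch of the step11 segment filter; thresholds mirror the script above.
    def filter_segments(range_string, max_gap=10, min_seg=5, min_total=25):
        resids = []
        for seg in range_string.split(','):
            start, end = (int(x) for x in seg.split('-'))
            resids.extend(range(start, end + 1))
        # regroup residues, starting a new segment after a gap larger than max_gap
        segs = []
        for res in resids:
            if segs and res <= segs[-1][-1] + max_gap:
                segs[-1].append(res)
            else:
                segs.append([res])
        # keep segments with at least min_seg residues
        kept = [s for s in segs if len(s) >= min_seg]
        if sum(len(s) for s in kept) >= min_total:
            return ','.join(f'{s[0]}-{s[-1]}' for s in kept)
        return None

    # the 3-residue island between the gaps is dropped, the rest is kept
    print(filter_segments('10-40,60-62,80-95'))   # 10-40,80-95
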
/docker/scripts/step12_get_sse.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys
3 | import numpy as np
4 |
5 | dataset = sys.argv[1]
6 | prot = sys.argv[2]
7 |
8 | os.system(f'mkdssp -i step2/{dataset}/{prot}.pdb -o step12/{dataset}/{prot}.dssp')
9 | fp = open(f'step1/{dataset}/{prot}.fa', 'r')
10 | for line in fp:
11 | if line[0] != '>':
12 | seq = line[:-1]
13 | fp.close()
14 |
15 | fp = open(f'step12/{dataset}/{prot}.dssp', 'r')
16 | start = 0
17 | dssp_result = ''
18 | resids = []
19 | for line in fp:
20 | words = line.split()
21 | if len(words) > 3:
22 | if words[0] == '#' and words[1] == 'RESIDUE':
23 | start = 1
24 | elif start:
25 | try:
26 | resid = int(line[5:10])
27 | getit = 1
28 | except ValueError:
29 | getit = 0
30 |
31 | if getit:
32 | pred = line[16]
33 | resids.append(resid)
34 | pred = line[16]
35 | if pred == 'E' or pred == 'B':
36 | newpred = 'E'
37 | elif pred == 'G' or pred == 'H' or pred == 'I':
38 | newpred = 'H'
39 | else:
40 | newpred = '-'
41 | dssp_result += newpred
42 | fp.close()
43 |
44 | res2sse = {}
45 | dssp_segs = dssp_result.split('--')
46 | posi = 0
47 | Nsse = 0
48 | for dssp_seg in dssp_segs:
49 | judge = 0
50 | if dssp_seg.count('E') >= 3 or dssp_seg.count('H') >= 6:
51 | Nsse += 1
52 | judge = 1
53 | for char in dssp_seg:
54 | resid = resids[posi]
55 | if char != '-':
56 | if judge:
57 | res2sse[resid] = [Nsse, char]
58 | posi += 1
59 | posi += 2
60 |
61 | os.system(f'rm step12/{dataset}/{prot}.dssp')
62 | if len(resids) != len(seq):
63 |     print (f'error\t{prot}\t{str(len(resids))}\t{str(len(seq))}')
64 |     sys.exit(1)
65 | else:
66 | rp = open(f'step12/{dataset}/{prot}.sse', 'w')
67 | for resid in resids:
68 | try:
69 | rp.write(f'{str(resid)}\t{seq[resid - 1]}\t{str(res2sse[resid][0])}\t{res2sse[resid][1]}\n')
70 | except KeyError:
71 | rp.write(f'{str(resid)}\t{seq[resid - 1]}\tna\tC\n')
72 | rp.close()
73 |
--------------------------------------------------------------------------------
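step12_get_sse.py reduces the per-residue DSSP states to three classes (E/B become strand, G/H/I become helix, everything else coil), splits the reduced string on runs of two or more coil residues, and counts a segment as a secondary-structure element only if it contains at least 3 strand or 6 helix residues. A minimal stand-alone sketch of that reduction; the function names and the example string are illustrative.

    # Sketch of the SSE reduction used above; reduce_dssp/count_sses are
    # illustrative names, the thresholds mirror step12_get_sse.py.
    def reduce_dssp(dssp_states):
        out = []
        for s in dssp_states:
            if s in ('E', 'B'):          # strand / isolated beta bridge
                out.append('E')
            elif s in ('G', 'H', 'I'):   # 3-10, alpha and pi helix
                out.append('H')
            else:
                out.append('-')
        return ''.join(out)

    def count_sses(reduced):
        # segments are separated by runs of two or more coil residues; only
        # segments with >=3 strand or >=6 helix residues count as SSEs
        return sum(1 for seg in reduced.split('--')
                   if seg.count('E') >= 3 or seg.count('H') >= 6)

    reduced = reduce_dssp('CCEEEEECCHHHHHHHHCCTTC')
    print(reduced)              # --EEEEE--HHHHHHHH-----
    print(count_sses(reduced))  # 2
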
/docker/scripts/step13_get_diso.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import sys, os, time, json, math, string
3 | import numpy as np
4 |
5 | dataset = sys.argv[1]
6 | prot = sys.argv[2]
7 |
8 | insses = set([])
9 | res2sse = {}
10 | fp = open(f'step12/{dataset}/{prot}.sse', 'r')
11 | for line in fp:
12 | words = line.split()
13 | if words[2] != 'na':
14 | sseid = int(words[2])
15 | resid = int(words[0])
16 | insses.add(resid)
17 | res2sse[resid] = sseid
18 | fp.close()
19 |
20 | hit_resids = set([])
21 | if os.path.exists(f'step11/{dataset}/{prot}.goodDomains'):
22 | fp = open(f'step11/{dataset}/{prot}.goodDomains', 'r')
23 | for line in fp:
24 | words = line.split()
25 | if words[0] == 'sequence':
26 | segs = words[8].split(',')
27 | elif words[0] == 'structure':
28 | segs = words[14].split(',')
29 | for seg in segs:
30 | if '-' in seg:
31 | start = int(seg.split('-')[0])
32 | end = int(seg.split('-')[1])
33 | for resid in range(start, end+1):
34 | hit_resids.add(resid)
35 | else:
36 | resid = int(seg)
37 | hit_resids.add(resid)
38 | fp.close()
39 |
40 |
41 | fp = open(f'{dataset}/{prot}.json','r')
42 | text = fp.read()[1:-1]
43 | fp.close()
44 | get_json = 0
45 | try:
46 | json_dict = json.loads(text)
47 | get_json = 1
48 | except:
49 | pass
50 |
51 | if get_json:
52 | if 'predicted_aligned_error' in json_dict.keys():
53 | paes = json_dict['predicted_aligned_error']
54 | length = len(paes)
55 | rpair2error = {}
56 | for i in range(length):
57 | res1 = i + 1
58 | try:
59 | rpair2error[res1]
60 | except KeyError:
61 | rpair2error[res1] = {}
62 | for j in range(length):
63 | res2 = j + 1
64 | rpair2error[res1][res2] = paes[i][j]
65 |
66 | elif 'distance' in json_dict.keys():
67 | resid1s = json_dict['residue1']
68 | resid2s = json_dict['residue2']
69 | prot_len1 = max(resid1s)
70 | prot_len2 = max(resid2s)
71 | if prot_len1 != prot_len2:
72 | print (f'error, matrix is not a square with shape ({str(prot_len1)}, {str(prot_len2)})')
73 | else:
74 | length = prot_len1
75 |
76 | allerrors = json_dict['distance']
77 | mtx_size = len(allerrors)
78 |
79 | rpair2error = {}
80 | for i in range(mtx_size):
81 | res1 = resid1s[i]
82 | res2 = resid2s[i]
83 | try:
84 | rpair2error[res1]
85 | except KeyError:
86 | rpair2error[res1] = {}
87 | rpair2error[res1][res2] = allerrors[i]
88 | else:
89 |         print ('error\t' + prot); sys.exit(1)  # no PAE or distance matrix in the json
90 | else:
91 |     print ('error\t' + prot); sys.exit(1)  # json could not be parsed
92 |
93 |
94 | res2contacts = {}
95 | for i in range(length):
96 | res1 = i + 1
97 | for j in range (length):
98 | res2 = j + 1
99 | err = rpair2error[res1][res2]
100 | if res1 + 10 <= res2 and err < 12:
101 | if res2 in insses:
102 | if res1 in insses and res2sse[res1] == res2sse[res2]:
103 | pass
104 | else:
105 | try:
106 | res2contacts[res1].append(res2)
107 | except KeyError:
108 | res2contacts[res1] = [res2]
109 | if res1 in insses:
110 | if res2 in insses and res2sse[res2] == res2sse[res1]:
111 | pass
112 | else:
113 | try:
114 | res2contacts[res2].append(res1)
115 | except KeyError:
116 | res2contacts[res2] = [res1]
117 |
118 |
119 | diso_resids = set([])
120 | for start in range (1, length - 9):
121 | total_contact = 0
122 | hitres_count = 0
123 | for res in range(start, start + 10):
124 | if res in hit_resids:
125 | hitres_count += 1
126 | if res in insses:
127 | try:
128 | total_contact += len(res2contacts[res])
129 | except KeyError:
130 | pass
131 | if total_contact <= 30 and hitres_count <= 5:
132 | for res in range(start, start + 10):
133 | diso_resids.add(res)
134 |
135 | diso_resids_list = list(diso_resids)
136 | diso_resids_list.sort()
137 |
138 | rp = open(f'step13/{dataset}/{prot}.diso', 'w')
139 | for resid in diso_resids_list:
140 | rp.write(f'{str(resid)}\n')
141 | rp.close()
142 |
--------------------------------------------------------------------------------
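step13_get_diso.py slides a 10-residue window over the chain and marks the whole window as disordered when the residues inside secondary-structure elements contribute at most 30 long-range, low-PAE contacts in total and at most 5 of the window's residues are already covered by good hits. A condensed sketch of that window logic; the function name is illustrative and the window bounds follow the script above.

    # Sketch of the step13 sliding-window disorder call; thresholds mirror
    # the script, res2contacts maps residue -> list of contacting residues.
    def mark_disordered(length, res2contacts, hit_resids, insses,
                        window=10, max_contacts=30, max_hits=5):
        diso = set()
        for start in range(1, length - window + 1):
            total_contact = 0
            hit_count = 0
            for res in range(start, start + window):
                if res in hit_resids:
                    hit_count += 1
                if res in insses:
                    total_contact += len(res2contacts.get(res, []))
            if total_contact <= max_contacts and hit_count <= max_hits:
                diso.update(range(start, start + window))
        return sorted(diso)

    # a 30-residue chain with no contacts and no hits is flagged up to residue 29;
    # the last residue is never reached by a window start (as in the script)
    print(mark_disordered(30, {}, set(), set()))
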
/docker/scripts/step16_run_domass.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys
3 | import numpy as np
4 | import tensorflow as tf
5 | from tensorflow.python.client import device_lib
6 |
7 | dataset = sys.argv[1]
8 | local_device_protos = device_lib.list_local_devices()
9 | gpus = [x.name for x in local_device_protos if x.device_type == 'GPU']
10 | if not gpus:
11 | print("No GPUs found. Falling back to CPU.")
12 | config = tf.ConfigProto()
13 | else:
14 | config = tf.ConfigProto(gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.9))
15 | config.gpu_options.allow_growth = True
16 |
17 | fp = open('step16_' + dataset + '.list', 'r')
18 | prots = []
19 | for line in fp:
20 | words = line.split()
21 | prots.append(words[0])
22 | fp.close()
23 |
24 | all_cases = []
25 | all_inputs = []
26 | for prot in prots:
27 | if os.path.exists('step15/' + dataset + '/' + prot + '.data'):
28 | fp = open('step15/' + dataset + '/' + prot + '.data', 'r')
29 | for countl, line in enumerate(fp):
30 | if countl:
31 | words = line.split()
32 | all_cases.append([prot, words[0], words[1], words[2], words[3], words[17], words[18], words[19], words[20], words[21], words[22]])
33 | all_inputs.append([float(words[4]), float(words[5]), float(words[6]), float(words[7]), float(words[8]), float(words[9]), float(words[10]), float(words[11]), float(words[12]), float(words[13]), float(words[14]), float(words[15]), float(words[16])])
34 | fp.close()
35 | total_case = len(all_cases)
36 |
37 |
38 | def get_feed(batch_inputs):
39 | inputs = np.zeros((100, 13), dtype = np.float32)
40 | for i in range(100):
41 | for j, value in enumerate(batch_inputs[i]):
42 | inputs[i, j] = value
43 | feed_dict = {myinputs: inputs}
44 | return feed_dict
45 |
46 | dense = tf.compat.v1.layers.dense
47 | with tf.Graph().as_default():
48 | with tf.name_scope('input'):
49 | myinputs = tf.placeholder(dtype = tf.float32, shape = (100, 13))
50 | layers = [myinputs]
51 | layers.append(dense(layers[-1], 64, activation = tf.nn.relu))
52 | preds = dense(layers[-1], 2, activation = tf.nn.softmax)
53 | saver = tf.train.Saver()
54 |
55 | with tf.Session(config = config) as sess:
56 | saver.restore(sess, '/mnt/databases/domass_epo29')
57 | all_preds = []
58 | if total_case >= 100:
59 | batch_count = total_case // 100
60 | get_case = 0
61 | for i in range(batch_count):
62 | batch_inputs = all_inputs[i * 100 : i * 100 + 100]
63 | batch_preds = sess.run(preds, feed_dict = get_feed(batch_inputs))
64 | get_case += 100
65 | for j in range(100):
66 | all_preds.append(batch_preds[j,:])
67 | if i % 1000 == 0:
68 | print ('prediction for batch ' + str(i))
69 |
70 | remain_case = total_case - get_case
71 | add_case = 100 - remain_case
72 | batch_inputs = all_inputs[get_case:] + all_inputs[:add_case]
73 | batch_preds = sess.run(preds, feed_dict = get_feed(batch_inputs))
74 | for j in range(remain_case):
75 | all_preds.append(batch_preds[j,:])
76 |
77 | else:
78 | fold = 100 // total_case + 1
79 | pseudo_inputs = all_inputs * fold
80 | batch_inputs = pseudo_inputs[:100]
81 | batch_preds = sess.run(preds, feed_dict = get_feed(batch_inputs))
82 | for j in range(total_case):
83 | all_preds.append(batch_preds[j,:])
84 |
85 | prot2results = {}
86 | for prot in prots:
87 | prot2results[prot] = []
88 | for i in range(total_case):
89 | this_case = all_cases[i]
90 | this_input = all_inputs[i]
91 | this_pred = all_preds[i]
92 | prot = this_case[0]
93 | prot2results[prot].append([this_case[1], this_case[2], this_case[3], this_case[4], this_pred[1], this_input[3], this_input[4], this_input[5], this_input[6], this_input[7], this_input[8], this_input[9], this_input[10], this_input[11], this_input[12], this_case[5], this_case[6], this_case[7], this_case[8], this_case[9], this_case[10]])
94 | for prot in prots:
95 | if prot2results[prot]:
96 | rp = open('step16/' + dataset + '/' + prot + '.result', 'w')
97 | rp.write('Domain\tRange\tTgroup\tECOD_ref\tDPAM_prob\tHH_prob\tHH_cov\tHH_rank\tDALI_zscore\tDALI_qscore\tDALI_ztile\tDALI_qtile\tDALI_rank\tConsensus_diff\tConsensus_cov\tHH_hit\tDALI_hit\tDALI_rot1\tDALI_rot2\tDALI_rot3\tDALI_trans\n')
98 | for item in prot2results[prot]:
99 | rp.write(f'{item[0]}\t{item[1]}\t{item[2]}\t{item[3]}\t{str(round(item[4], 4))}\t{str(item[5])}\t{str(item[6])}\t{str(item[7])}\t{str(item[8])}\t{str(item[9])}\t{str(item[10])}\t{str(item[11])}\t{str(item[12])}\t{str(item[13])}\t{str(item[14])}\t{item[15]}\t{item[16]}\t{item[17]}\t{item[18]}\t{item[19]}\t{item[20]}\n')
100 | rp.close()
101 | else:
102 | os.system('echo \'done\' > step16/' + dataset + '/' + prot + '.done')
103 |
--------------------------------------------------------------------------------
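The domass network in step16_run_domass.py consumes fixed batches of 100 feature rows, so the script pads the last (or only) partial batch by recycling inputs and afterwards keeps only the predictions that correspond to real cases. A minimal sketch of that batching scheme; predict stands in for the TensorFlow session call and the names are illustrative.

    # Sketch of the fixed-size batching used above: partial batches are padded
    # by recycling inputs and the padded predictions are discarded.
    def batched_predict(all_inputs, predict, batch_size=100):
        total = len(all_inputs)
        preds = []
        if total >= batch_size:
            full = total // batch_size
            for i in range(full):
                preds.extend(predict(all_inputs[i * batch_size:(i + 1) * batch_size]))
            done = full * batch_size
            remain = total - done
            if remain:
                # wrap around to fill the final batch, keep only the real cases
                batch = all_inputs[done:] + all_inputs[:batch_size - remain]
                preds.extend(predict(batch)[:remain])
        else:
            # replicate the inputs until one full batch is available
            fold = batch_size // total + 1
            preds.extend(predict((all_inputs * fold)[:batch_size])[:total])
        return preds

    # toy usage with an identity "model"
    print(batched_predict(list(range(7)), lambda batch: batch, batch_size=4))  # [0, 1, 2, 3, 4, 5, 6]
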
/docker/scripts/step17_get_confident.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 |
3 | import os, sys
4 |
5 | dataset = sys.argv[1]
6 | prot = sys.argv[2]
7 | if os.path.exists('step16/' + dataset + '/' + prot + '.result'):
8 | fp = open('step16/' + dataset + '/' + prot + '.result','r')
9 | domains = []
10 | domain2range = {}
11 | domain2hits = {}
12 | for countl, line in enumerate(fp):
13 | if countl:
14 | words = line.split()
15 | domain = words[0]
16 | drange = words[1]
17 | tgroup = words[2]
18 | refdom = words[3]
19 | prob = float(words[4])
20 | domain2range[domain] = drange
21 | try:
22 | domain2hits[domain].append([tgroup, refdom, prob])
23 | except KeyError:
24 | domains.append(domain)
25 | domain2hits[domain] = [[tgroup, refdom, prob]]
26 | fp.close()
27 |
28 | results = []
29 | for domain in domains:
30 | drange = domain2range[domain]
31 | tgroups = []
32 | tgroup2best = {}
33 | for hit in domain2hits[domain]:
34 | tgroup = hit[0]
35 | refdom = hit[1]
36 | prob = hit[2]
37 | try:
38 | if prob > tgroup2best[tgroup]:
39 | tgroup2best[tgroup] = prob
40 | except KeyError:
41 | tgroups.append(tgroup)
42 | tgroup2best[tgroup] = prob
43 |
44 | domain2hits[domain].sort(key = lambda x:x[2], reverse = True)
45 | for hit in domain2hits[domain]:
46 | tgroup = hit[0]
47 | refdom = hit[1]
48 | prob = hit[2]
49 | if prob >= 0.6:
50 | similar_tgroups = set([])
51 | for ogroup in tgroups:
52 | if prob < tgroup2best[ogroup] + 0.05:
53 | similar_tgroups.add(ogroup)
54 | similar_hgroups = set([])
55 | for group in similar_tgroups:
56 | hgroup = group.split('.')[0] + '.' + group.split('.')[1]
57 | similar_hgroups.add(hgroup)
58 |
59 | if len(similar_tgroups) == 1:
60 | judge = 'good'
61 | elif len(similar_hgroups) == 1:
62 | judge = 'ok'
63 | else:
64 | judge = 'bad'
65 | results.append(domain + '\t' + drange + '\t' + tgroup + '\t' + refdom + '\t' + str(prob) + '\t' + judge + '\n')
66 |
67 | if results:
68 | rp = open('step17/' + dataset + '/' + prot + '.result','w')
69 | for line in results:
70 | rp.write(line)
71 | rp.close()
72 | else:
73 | os.system('echo \'done\' > step17/' + dataset + '/' + prot + '.done')
74 | else:
75 | os.system('echo \'done\' > step17/' + dataset + '/' + prot + '.done')
76 |
--------------------------------------------------------------------------------
/docker/scripts/step18_get_mapping.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys
3 | import numpy as np
4 |
5 | def get_resids(domain_range):
6 | domain_resids = []
7 | for seg in domain_range.split(','):
8 | if '-' in seg:
9 | start = int(seg.split('-')[0])
10 | end = int(seg.split('-')[1])
11 | for res in range(start, end + 1):
12 | domain_resids.append(res)
13 | else:
14 | domain_resids.append(int(seg))
15 | return domain_resids
16 |
17 | def check_overlap(residsA, residsB):
18 | overlap = set(residsA).intersection(set(residsB))
19 | if len(overlap) >= len(residsA) * 0.33:
20 | if len(overlap) >= len(residsA) * 0.5 or len(overlap) >= len(residsB) * 0.5:
21 | return 1
22 | else:
23 | return 0
24 | else:
25 | return 0
26 |
27 | def get_range(resids):
28 | resids.sort()
29 | segs = []
30 | for resid in resids:
31 | if not segs:
32 | segs.append([resid])
33 | else:
34 | if resid > segs[-1][-1] + 1:
35 | segs.append([resid])
36 | else:
37 | segs[-1].append(resid)
38 | ranges = []
39 | for seg in segs:
40 | ranges.append(f'{str(seg[0])}-{str(seg[-1])}')
41 | return ','.join(ranges)
42 |
43 |
44 | spname = sys.argv[1]
45 | prot = sys.argv[2]
46 | HHhits = []
47 | if os.path.exists('step5/' + spname + '/' + prot + '.result'):
48 | fp = open('step5/' + spname + '/' + prot + '.result', 'r')
49 | for countl, line in enumerate(fp):
50 | if countl:
51 | words = line.split()
52 | ecodid = words[1]
53 | getres = set([])
54 | resmap = {}
55 | fp1 = open('/mnt/databases/ECOD_maps/' + ecodid + '.map', 'r')
56 | for line1 in fp1:
57 | words1 = line1.split()
58 | getres.add(int(words1[1]))
59 | resmap[int(words1[1])] = int(words1[0])
60 | fp1.close()
61 | hhprob = float(words[3]) / 100
62 |
63 | raw_qresids = get_resids(words[12])
64 | raw_tresids = get_resids(words[13])
65 | qresids = []
66 | tresids = []
67 | for i in range(len(raw_qresids)):
68 | if raw_tresids[i] in getres:
69 | qresid = raw_qresids[i]
70 | tresid = resmap[raw_tresids[i]]
71 | qresids.append(qresid)
72 | tresids.append(tresid)
73 | HHhits.append([ecodid, hhprob, qresids, tresids])
74 | fp.close()
75 |
76 | DALIhits = []
77 | if os.path.exists('step9/' + spname + '/' + prot + '_good_hits'):
78 | fp = open('step9/' + spname + '/' + prot + '_good_hits', 'r')
79 | for countl, line in enumerate(fp):
80 | if countl:
81 | words = line.split()
82 | ecodid = words[1]
83 | zscore = float(words[4]) / 10
84 | qresids = get_resids(words[9])
85 | tresids = get_resids(words[10])
86 | DALIhits.append([ecodid, zscore, qresids, tresids])
87 | fp.close()
88 |
89 | if os.path.exists('step17/' + spname + '/' + prot + '.result'):
90 | fp = open('step17/' + spname + '/' + prot + '.result', 'r')
91 | domains = []
92 | domain2def = {}
93 | domain2resids = {}
94 | domain2hits = {}
95 | for line in fp:
96 | words = line.split()
97 | dname = words[0]
98 | try:
99 | domain2resids[dname]
100 | except KeyError:
101 | domains.append(dname)
102 | domain2resids[dname] = get_resids(words[1])
103 | domain2def[dname] = words[1]
104 | tgroup = words[2]
105 | ecodhit = words[3]
106 | prob = float(words[4])
107 | judge = words[5]
108 | try:
109 | domain2hits[dname]
110 | except KeyError:
111 | domain2hits[dname] = {}
112 | domain2hits[dname][ecodhit] = [prob, tgroup, judge]
113 | fp.close()
114 |
115 | results = []
116 | for domain in domains:
117 | domain_resids = domain2resids[domain]
118 | domain_residset = set(domain_resids)
119 | hitinfo = domain2hits[domain]
120 | good_hits = list(hitinfo.keys())
121 |
122 | Hecods = set([])
123 | ecod2Hhit = {}
124 | for hit in HHhits:
125 | ecodid = hit[0]
126 | Hprob = hit[1]
127 | Hqresids = hit[2]
128 | Htresids = hit[3]
129 | if check_overlap(domain_resids, Hqresids):
130 | try:
131 | if Hprob > ecod2Hhit[ecodid][0]:
132 | ecod2Hhit[ecodid] = [Hprob, Hqresids, Htresids]
133 | except KeyError:
134 | Hecods.add(ecodid)
135 | ecod2Hhit[ecodid] = [Hprob, Hqresids, Htresids]
136 |
137 | Decods = set([])
138 | ecod2Dhit = {}
139 | for hit in DALIhits:
140 | ecodid = hit[0]
141 | Dzscore = hit[1]
142 | Dqresids = hit[2]
143 | Dtresids = hit[3]
144 | if check_overlap(domain_resids, Dqresids):
145 | try:
146 | if Dzscore > ecod2Dhit[ecodid][0]:
147 | ecod2Dhit[ecodid] = [Dzscore, Dqresids, Dtresids]
148 | except KeyError:
149 | Decods.add(ecodid)
150 | ecod2Dhit[ecodid] = [Dzscore, Dqresids, Dtresids]
151 |
152 | for hit in good_hits:
153 | [DPAMprob, tgroup, judge] = hitinfo[hit]
154 | if hit in Hecods:
155 | HQresids = ecod2Hhit[hit][1]
156 | HTresids = ecod2Hhit[hit][2]
157 | Hresids = []
158 | if len(HQresids) != len(HTresids):
159 | print (spname, prot, domain, hit)
160 | Hresid_string = 'na'
161 | elif HQresids:
162 | for i in range(len(HQresids)):
163 | if HQresids[i] in domain_residset:
164 | Hresids.append(HTresids[i])
165 | Hresid_string = get_range(Hresids)
166 | else:
167 | Hresid_string = 'na'
168 | else:
169 | Hresid_string = 'na'
170 |
171 | if hit in Decods:
172 | DQresids = ecod2Dhit[hit][1]
173 | DTresids = ecod2Dhit[hit][2]
174 | Dresids = []
175 | if len(DQresids) != len(DTresids):
176 | print (spname, prot, domain, hit)
177 | Dresid_string = 'na'
178 | elif DQresids:
179 | for i in range(len(DQresids)):
180 | if DQresids[i] in domain_residset:
181 | Dresids.append(DTresids[i])
182 | Dresid_string = get_range(Dresids)
183 | else:
184 | Dresid_string = 'na'
185 | else:
186 | Dresid_string = 'na'
187 | results.append(domain + '\t' + domain2def[domain] + '\t' + hit + '\t' + tgroup + '\t' + str(DPAMprob) + '\t' + judge + '\t' + Hresid_string + '\t' + Dresid_string + '\n')
188 |
189 | if results:
190 | rp = open('step18/' + spname + '/' + prot + '.data', 'w')
191 | for line in results:
192 | rp.write(line)
193 | rp.close()
194 | else:
195 | os.system('echo \'done\' > step18/' + spname + '/' + prot + '.done')
196 | else:
197 | os.system('echo \'done\' > step18/' + spname + '/' + prot + '.done')
198 |
--------------------------------------------------------------------------------
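The mapping in step18_get_mapping.py hinges on check_overlap: a hit's aligned query residues are assigned to a parsed domain only when the shared residues cover at least 33% of the domain and at least 50% of either the domain or the hit. A self-contained restatement of that rule with illustrative values; the function name overlaps is not part of the repository.

    # Restatement of the step18 overlap rule.
    def overlaps(resids_a, resids_b):
        shared = set(resids_a) & set(resids_b)
        return (len(shared) >= 0.33 * len(resids_a)
                and (len(shared) >= 0.5 * len(resids_a)
                     or len(shared) >= 0.5 * len(resids_b)))

    print(overlaps(range(1, 101), range(60, 161)))  # False: only 41 shared residues
    print(overlaps(range(1, 101), range(40, 141)))  # True: 61 shared residues
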
/docker/scripts/step19_get_merge_candidates.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys
3 |
4 | def get_resids(domain_range):
5 | domain_resids = []
6 | for seg in domain_range.split(','):
7 | if '-' in seg:
8 | start = int(seg.split('-')[0])
9 | end = int(seg.split('-')[1])
10 | for res in range(start, end + 1):
11 | domain_resids.append(res)
12 | else:
13 | domain_resids.append(int(seg))
14 | return domain_resids
15 |
16 |
17 | spname = sys.argv[1]
18 | prot = sys.argv[2]
19 | if os.path.exists('step18/' + spname + '/' + prot + '.data'):
20 | fp = open('step18/' + spname + '/' + prot + '.data','r')
21 | need_ecods = set([])
22 | for line in fp:
23 | words = line.split()
24 | need_ecods.add(words[2])
25 | fp.close()
26 |
27 | fp = open('/mnt/databases/ECOD_length','r')
28 | ecod2length = {}
29 | for line in fp:
30 | words = line.split()
31 | if words[0] in need_ecods:
32 | ecod2length[words[0]] = int(words[2])
33 | fp.close()
34 |
35 | ecod2totW = {}
36 | ecod2posW = {}
37 | for ecod in need_ecods:
38 | ecod2totW[ecod] = 0
39 | ecod2posW[ecod] = {}
40 | if os.path.exists('/mnt/databases/posi_weights/' + ecod + '.weight'):
41 | fp = open('/mnt/databases/posi_weights/' + ecod + '.weight','r')
42 | for line in fp:
43 | words = line.split()
44 | resid = int(words[0])
45 | weight = float(words[3])
46 | ecod2totW[ecod] += weight
47 | ecod2posW[ecod][resid] = weight
48 | fp.close()
49 | else:
50 | ecod2totW[ecod] = ecod2length[ecod]
51 | for i in range(ecod2length[ecod]):
52 | ecod2posW[ecod][i + 1] = 1
53 |
54 |
55 | fp = open('step18/' + spname + '/' + prot + '.data','r')
56 | domains = []
57 | ecods = []
58 | domain2def = {}
59 | domain2prob = {}
60 | domain2hits = {}
61 | ecod2hits = {}
62 | for line in fp:
63 | words = line.split()
64 | domain = words[0]
65 | domdef = words[1]
66 | ecod = words[2]
67 | tgroup = words[3]
68 | prob = float(words[4])
69 | try:
70 | if prob > domain2prob[domain]:
71 | domain2prob[domain] = prob
72 | except KeyError:
73 | domain2def[domain] = domdef
74 | domain2prob[domain] = prob
75 |
76 | if words[6] == 'na':
77 | Hresids = []
78 | else:
79 | Hresids = get_resids(words[6])
80 | if words[7] == 'na':
81 | Dresids = []
82 | else:
83 | Dresids = get_resids(words[7])
84 | if len(Dresids) > len(Hresids) * 0.5:
85 | HDresids = set(Dresids)
86 | else:
87 | HDresids = set(Hresids)
88 |
89 | total_weight = ecod2totW[ecod]
90 | get_weight = 0
91 | for resid in HDresids:
92 | try:
93 | get_weight += ecod2posW[ecod][resid]
94 | except KeyError:
95 | print (prot, ecod, resid)
96 | try:
97 | domain2hits[domain].append([ecod, tgroup, prob, get_weight / total_weight])
98 | except KeyError:
99 | domains.append(domain)
100 | domain2hits[domain] = [[ecod, tgroup, prob, get_weight / total_weight]]
101 | try:
102 | ecod2hits[ecod].append([domain, tgroup, prob, HDresids])
103 | except KeyError:
104 | ecods.append(ecod)
105 | ecod2hits[ecod] = [[domain, tgroup, prob, HDresids]]
106 | fp.close()
107 |
108 |
109 | domain_pairs = []
110 | dpair2supports = {}
111 | for ecod in ecods:
112 | if len(ecod2hits[ecod]) > 1:
113 | for c1, hit1 in enumerate(ecod2hits[ecod]):
114 | for c2, hit2 in enumerate(ecod2hits[ecod]):
115 | if c1 < c2:
116 | domain1 = hit1[0]
117 | tgroup1 = hit1[1]
118 | prob1 = hit1[2]
119 | get_resids1 = hit1[3]
120 | domain2 = hit2[0]
121 | tgroup2 = hit2[1]
122 | prob2 = hit2[2]
123 | get_resids2 = hit2[3]
124 | if prob1 + 0.1 > domain2prob[domain1] and prob2 + 0.1 > domain2prob[domain2]:
125 | common_resids = get_resids1.intersection(get_resids2)
126 | if len(common_resids) < 0.25 * len(get_resids1) or len(common_resids) < 0.25 * len(get_resids2):
127 | domain_pair = domain1 + '_' + domain2
128 | try:
129 | dpair2supports[domain_pair].append([ecod, tgroup1, prob1, prob2])
130 | except KeyError:
131 | domain_pairs.append(domain_pair)
132 | dpair2supports[domain_pair] = [[ecod, tgroup1, prob1, prob2]]
133 |
134 |
135 | merge_pairs = []
136 | merge_info = []
137 | for domain_pair in domain_pairs:
138 | domain1 = domain_pair.split('_')[0]
139 | domain2 = domain_pair.split('_')[1]
140 | support_ecods = set([])
141 | for item in dpair2supports[domain_pair]:
142 | ecod = item[0]
143 | support_ecods.add(ecod)
144 | against_ecods1 = set([])
145 | against_ecods2 = set([])
146 | merge_info.append(domain1 + ',' + domain2 + '\t' + ','.join(support_ecods))
147 |
148 | for item in domain2hits[domain1]:
149 | ecod = item[0]
150 | tgroup = item[1]
151 | prob = item[2]
152 | ratio = item[3]
153 | if prob + 0.1 > domain2prob[domain1]:
154 | if ratio > 0.5:
155 | if not ecod in support_ecods:
156 | against_ecods1.add(ecod)
157 |
158 | for item in domain2hits[domain2]:
159 | ecod = item[0]
160 | tgroup = item[1]
161 | prob = item[2]
162 | ratio = item[3]
163 | if prob + 0.1 > domain2prob[domain2]:
164 | if ratio > 0.5:
165 | if not ecod in support_ecods:
166 | against_ecods2.add(ecod)
167 |
168 | if len(support_ecods) > len(against_ecods1) or len(support_ecods) > len(against_ecods2):
169 | merge_pairs.append([domain1, domain2])
170 |
171 | if merge_info:
172 | rp = open('step19/' + spname + '/' + prot + '.info','w')
173 | for merge_line in merge_info:
174 | rp.write(merge_line + '\n')
175 | rp.close()
176 |
177 | if merge_pairs:
178 | rp = open('step19/' + spname + '/' + prot + '.result','w')
179 | for merge_pair in merge_pairs:
180 | domain1 = merge_pair[0]
181 | domain2 = merge_pair[1]
182 | rp.write(domain1 + '\t' + domain2def[domain1] + '\t' + domain2 + '\t' + domain2def[domain2] + '\n')
183 | rp.close()
184 |
185 | if merge_info and merge_pairs:
186 | pass
187 | else:
188 | os.system('echo \'done\' > step19/' + spname + '/' + prot + '.done')
189 | else:
190 | os.system('echo \'done\' > step19/' + spname + '/' + prot + '.done')
191 |
--------------------------------------------------------------------------------
/docker/scripts/step1_get_AFDB_seqs.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys
3 | import pdbx
4 | from pdbx.reader.PdbxReader import PdbxReader
5 |
6 | three2one = {}
7 | three2one["ALA"] = 'A'
8 | three2one["CYS"] = 'C'
9 | three2one["ASP"] = 'D'
10 | three2one["GLU"] = 'E'
11 | three2one["PHE"] = 'F'
12 | three2one["GLY"] = 'G'
13 | three2one["HIS"] = 'H'
14 | three2one["ILE"] = 'I'
15 | three2one["LYS"] = 'K'
16 | three2one["LEU"] = 'L'
17 | three2one["MET"] = 'M'
18 | three2one["MSE"] = 'M'
19 | three2one["ASN"] = 'N'
20 | three2one["PRO"] = 'P'
21 | three2one["GLN"] = 'Q'
22 | three2one["ARG"] = 'R'
23 | three2one["SER"] = 'S'
24 | three2one["THR"] = 'T'
25 | three2one["VAL"] = 'V'
26 | three2one["TRP"] = 'W'
27 | three2one["TYR"] = 'Y'
28 |
29 | dataset = sys.argv[1]
30 | prot = sys.argv[2]
31 |
32 | flag=1
33 |
34 | if os.path.exists(dataset + "/" + prot + ".cif"):
35 | cif = open(dataset + "/" + prot + ".cif")
36 | pRd = PdbxReader(cif)
37 | data = []
38 | pRd.read(data)
39 | block = data[0]
40 |
41 | modinfo = {}
42 | mod_residues = block.getObj("pdbx_struct_mod_residue")
43 | if mod_residues:
44 | chainid = mod_residues.getIndex("label_asym_id")
45 | posiid = mod_residues.getIndex("label_seq_id")
46 | parentid = mod_residues.getIndex("parent_comp_id")
47 | resiid = mod_residues.getIndex("label_comp_id")
48 | for i in range(mod_residues.getRowCount()):
49 | words = mod_residues.getRow(i)
50 | try:
51 | modinfo[words[chainid]]
52 | except KeyError:
53 | modinfo[words[chainid]] = {}
54 | modinfo[words[chainid]][words[posiid]] = [words[resiid], words[parentid]]
55 |
56 | entity_poly = block.getObj("entity_poly")
57 | pdbx_poly_seq_scheme = block.getObj("pdbx_poly_seq_scheme")
58 | if pdbx_poly_seq_scheme and entity_poly:
59 | typeid = entity_poly.getIndex("type")
60 | entityid1 = entity_poly.getIndex("entity_id")
61 | entityid2 = pdbx_poly_seq_scheme.getIndex("entity_id")
62 | chainid = pdbx_poly_seq_scheme.getIndex("asym_id")
63 | resiid = pdbx_poly_seq_scheme.getIndex("mon_id")
64 | posiid = pdbx_poly_seq_scheme.getIndex("seq_id")
65 |
66 | good_entities = []
67 | for i in range(entity_poly.getRowCount()):
68 | words = entity_poly.getRow(i)
69 | entity = words[entityid1]
70 | type = words[typeid]
71 | if type == "polypeptide(L)":
72 | good_entities.append(entity)
73 |
74 | if good_entities:
75 | chains = []
76 | residues = {}
77 | seqs = {}
78 | rp = open("step1/" + dataset + "/" + prot + ".fa","w")
79 | for i in range(pdbx_poly_seq_scheme.getRowCount()):
80 | words = pdbx_poly_seq_scheme.getRow(i)
81 | entity = words[entityid2]
82 | if entity in good_entities:
83 | chain = words[chainid]
84 |
85 | try:
86 | aa = three2one[words[resiid]]
87 | except KeyError:
88 | try:
89 | modinfo[chain][words[posiid]]
90 | resiname = modinfo[chain][words[posiid]][0]
91 | if words[resiid] == resiname:
92 | new_resiname = modinfo[chain][words[posiid]][1]
93 | try:
94 | aa = three2one[new_resiname]
95 | except KeyError:
96 | aa = "X"
97 | print ("error1 " + new_resiname)
98 | else:
99 | aa = "X"
100 | print ("error2 " + words[resiid] + " " + resiname)
101 | except KeyError:
102 | print (modinfo)
103 | print (words[resiid])
104 | aa = "X"
105 | try:
106 | seqs[chain]
107 | except KeyError:
108 | chains.append(chain)
109 | seqs[chain] = {}
110 |
111 | try:
112 | if seqs[chain][int(words[posiid])] == "X" and aa != "X":
113 | seqs[chain][int(words[posiid])] = aa
114 | except KeyError:
115 | seqs[chain][int(words[posiid])] = aa
116 |
117 | try:
118 | residues[chain].add(int(words[posiid]))
119 | except KeyError:
120 | residues[chain] = set([int(words[posiid])])
121 |
122 | for chain in chains:
123 | for i in range(len(residues[chain])):
124 | if not i + 1 in residues[chain]:
125 | flag = 0
126 | print ("error3 " + prot + " " + chain)
127 | break
128 | else:
129 | rp.write(">" + prot + "\n")
130 | finalseq = []
131 | for i in range(len(residues[chain])):
132 | finalseq.append(seqs[chain][i+1])
133 | rp.write("".join(finalseq) + "\n")
134 | rp.close()
135 | else:
136 | flag = 0
137 | print ("empty " + prot)
138 | else:
139 | flag = 0
140 | print ("bad " + prot)
141 | elif os.path.exists(dataset + "/" + prot + ".pdb"):
142 | os.system(f'pdb2fasta '+ dataset + "/" + prot + ".pdb > step1/" + dataset + "/" + prot + ".fa")
143 | with open("step1/" + dataset + "/" + prot + ".fa") as f:
144 | fa = f.readlines()
145 | fa[0] = fa[0].split(':')[0] + '\n'
146 | with open("step1/" + dataset + "/" + prot + ".fa",'w') as f:
147 | f.write(''.join(fa))
148 | else:
149 | flag = 0
150 |     print("No recognized structure file (*.cif or *.pdb). Exiting...")
151 | if flag == 0:
152 | sys.exit(1)
153 |
--------------------------------------------------------------------------------
/docker/scripts/step20_extract_domains.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys
3 |
4 | dataset = sys.argv[1]
5 | if not os.path.exists('step20'):
6 | os.system('mkdir step20')
7 | os.system('mkdir step20/' + dataset)
8 |
9 | fp = os.popen('ls -1 step19/' + dataset + '/*.result')
10 | prots = []
11 | for line in fp:
12 | prot = line.split('/')[2].split('.result')[0]
13 | prots.append(prot)
14 | fp.close()
15 |
16 | domains = []
17 | for prot in prots:
18 | get_domains = set([])
19 | fp = open('step19/' + dataset + '/' + prot + '.result', 'r')
20 | for line in fp:
21 | words = line.split()
22 | domain1 = words[0]
23 | if not domain1 in get_domains:
24 | get_domains.add(domain1)
25 | resids1 = set([])
26 | for seg in words[1].split(','):
27 | if '-' in seg:
28 | start = int(seg.split('-')[0])
29 | end = int(seg.split('-')[1])
30 | for res in range(start, end + 1):
31 | resids1.add(res)
32 | else:
33 | resids1.add(int(seg))
34 | domains.append([prot, domain1, resids1])
35 |
36 | domain2 = words[2]
37 | if not domain2 in get_domains:
38 | get_domains.add(domain2)
39 | resids2 = set([])
40 | for seg in words[3].split(','):
41 | if '-' in seg:
42 | start = int(seg.split('-')[0])
43 | end = int(seg.split('-')[1])
44 | for res in range(start, end + 1):
45 | resids2.add(res)
46 | else:
47 | resids2.add(int(seg))
48 | domains.append([prot, domain2, resids2])
49 | fp.close()
50 |
51 | for item in domains:
52 | prot = item[0]
53 | dname = item[1]
54 | resids = item[2]
55 | fp = open('step2/' + dataset + '/' + prot + '.pdb', 'r')
56 | rp = open('step20/' + dataset + '/' + prot + '_' + dname + '.pdb', 'w')
57 | for line in fp:
58 | resid = int(line[22:26])
59 | if resid in resids:
60 | rp.write(line)
61 | fp.close()
62 | rp.close()
63 |
--------------------------------------------------------------------------------
/docker/scripts/step21_compare_domains.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys
3 |
4 | def get_seq_dist(residsA, residsB, good_resids):
5 | indsA = []
6 | for ind, resid in enumerate(good_resids):
7 | if resid in residsA:
8 | indsA.append(ind)
9 | indsB = []
10 | for ind, resid in enumerate(good_resids):
11 | if resid in residsB:
12 | indsB.append(ind)
13 |
14 | connected = 0
15 | for indA in indsA:
16 | for indB in indsB:
17 | if abs(indA - indB) <= 5:
18 | connected = 1
19 | break
20 | if connected:
21 | break
22 | return connected
23 |
24 | dataset = sys.argv[1]
25 | part = sys.argv[2]
26 | fp = open('step21_' + dataset + '_' + part + '.list','r')
27 | cases = []
28 | for line in fp:
29 | words = line.split()
30 | cases.append(words)
31 | fp.close()
32 |
33 | rp = open('step21_' + dataset + '_' + part + '.result', 'w')
34 | for case in cases:
35 | prot = case[0]
36 | good_resids = []
37 | fp = open('step14/' + dataset + '/' + prot + '.domains','r')
38 | for line in fp:
39 | words = line.split()
40 | for seg in words[1].split(','):
41 | if '-' in seg:
42 | start = int(seg.split('-')[0])
43 | end = int(seg.split('-')[1])
44 | for res in range(start, end + 1):
45 | good_resids.append(res)
46 | else:
47 | good_resids.append(int(seg))
48 | fp.close()
49 | good_resids.sort()
50 |
51 | dom1 = case[1]
52 | segs1 = case[2]
53 | residsA = []
54 | for seg in segs1.split(','):
55 | if '-' in seg:
56 | start = int(seg.split('-')[0])
57 | end = int(seg.split('-')[1])
58 | for res in range(start, end + 1):
59 | residsA.append(res)
60 | else:
61 | residsA.append(int(seg))
62 |
63 | dom2 = case[3]
64 | segs2 = case[4]
65 | residsB = []
66 | for seg in segs2.split(','):
67 | if '-' in seg:
68 | start = int(seg.split('-')[0])
69 | end = int(seg.split('-')[1])
70 | for res in range(start, end + 1):
71 | residsB.append(res)
72 | else:
73 | residsB.append(int(seg))
74 |
75 | if get_seq_dist(set(residsA), set(residsB), good_resids):
76 | judge = 1
77 | else:
78 | resid2coors = {}
79 | fp = open('step20/' + dataset + '/' + prot + '_' + dom1 + '.pdb', 'r')
80 | for line in fp:
81 | resid = int(line[22:26])
82 | coorx = float(line[30:38])
83 | coory = float(line[38:46])
84 | coorz = float(line[46:54])
85 | try:
86 | resid2coors[resid].append([coorx, coory, coorz])
87 | except KeyError:
88 | resid2coors[resid] = [[coorx, coory, coorz]]
89 | fp.close()
90 |
91 | fp = open('step20/' + dataset + '/' + prot + '_' + dom2 + '.pdb', 'r')
92 | for line in fp:
93 | resid = int(line[22:26])
94 | coorx = float(line[30:38])
95 | coory = float(line[38:46])
96 | coorz = float(line[46:54])
97 | try:
98 | resid2coors[resid].append([coorx, coory, coorz])
99 | except KeyError:
100 | resid2coors[resid] = [[coorx, coory, coorz]]
101 | fp.close()
102 |
103 | interface_count = 0
104 | for residA in residsA:
105 | for residB in residsB:
106 | dists = []
107 | coorsA = resid2coors[residA]
108 | coorsB = resid2coors[residB]
109 | for coorA in coorsA:
110 | for coorB in coorsB:
111 | dist = ((coorA[0] - coorB[0]) ** 2 + (coorA[1] - coorB[1]) ** 2 + (coorA[2] - coorB[2]) ** 2) ** 0.5
112 | dists.append(dist)
113 | min_dist = min(dists)
114 | if min_dist <= 8:
115 | interface_count += 1
116 | if interface_count >= 9:
117 | judge = 2
118 | else:
119 | judge = 0
120 | rp.write(prot + '\t' + dom1 + '\t' + dom2 + '\t' + str(judge) + '\t' + segs1 + '\t' + segs2 + '\n')
121 | rp.close()
122 |
--------------------------------------------------------------------------------
/docker/scripts/step22_merge_domains.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import sys
3 |
4 | def get_range(resids):
5 | if resids:
6 | resids = list(resids)
7 | resids.sort()
8 | segs = []
9 | for resid in resids:
10 | if not segs:
11 | segs.append([resid])
12 | else:
13 | if resid > segs[-1][-1] + 1:
14 | segs.append([resid])
15 | else:
16 | segs[-1].append(resid)
17 | ranges = []
18 | for seg in segs:
19 | start = seg[0]
20 | end = seg[-1]
21 | ranges.append(str(start) + '-' + str(end))
22 | return ','.join(ranges)
23 | else:
24 | return 'na'
25 |
26 |
27 | dataset = sys.argv[1]
28 | fp = open('step21_' + dataset + '.result', 'r')
29 | get_prots = set([])
30 | prot2merges = {}
31 | domain2resids = {}
32 | for line in fp:
33 | words = line.split()
34 | prot = words[0]
35 | dom1 = words[1]
36 | dom2 = words[2]
37 | resids1 = []
38 | for seg in words[4].split(','):
39 | if '-' in seg:
40 | start = int(seg.split('-')[0])
41 | end = int(seg.split('-')[1])
42 | for res in range(start, end + 1):
43 | resids1.append(res)
44 | else:
45 | resids1.append(int(seg))
46 | resids2 = []
47 | for seg in words[5].split(','):
48 | if '-' in seg:
49 | start = int(seg.split('-')[0])
50 | end = int(seg.split('-')[1])
51 | for res in range(start, end + 1):
52 | resids2.append(res)
53 | else:
54 | resids2.append(int(seg))
55 |
56 | if int(words[3]) > 0:
57 | try:
58 | prot2merges[prot].append(set([dom1, dom2]))
59 | except KeyError:
60 | prot2merges[prot] = [set([dom1, dom2])]
61 | get_prots.add(prot)
62 | try:
63 | domain2resids[prot]
64 | except KeyError:
65 | domain2resids[prot] = {}
66 | domain2resids[prot][dom1] = resids1
67 | domain2resids[prot][dom2] = resids2
68 | fp.close()
69 |
70 |
71 | rp = open('step22_' + dataset + '.result','w')
72 | for prot in get_prots:
73 | pairs = prot2merges[prot]
74 | groups = []
75 | for pair in pairs:
76 | groups.append(pair)
77 | while 1:
78 | newgroups = []
79 | for group in groups:
80 | if not newgroups:
81 | newgroups.append(group)
82 | else:
83 | for newgroup in newgroups:
84 | if group.intersection(newgroup):
85 | for item in group:
86 | newgroup.add(item)
87 | break
88 | else:
89 | newgroups.append(group)
90 |
91 | if len(groups) == len(newgroups):
92 | break
93 | groups = []
94 | for newgroup in newgroups:
95 | groups.append(newgroup)
96 |
97 | for group in groups:
98 | group_domains = []
99 | group_resids = set([])
100 | for domain in group:
101 | group_domains.append(domain)
102 | group_resids = group_resids.union(domain2resids[prot][domain])
103 | group_range = get_range(group_resids)
104 | rp.write(prot + '\t' + ','.join(group_domains) + '\t' + group_range + '\n')
105 | rp.close()
106 |
--------------------------------------------------------------------------------
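step22_merge_domains.py folds the pairwise merge decisions from step21 into connected groups: any two groups that share a domain are unioned, and the pass is repeated until the grouping stops changing. A minimal sketch of that grouping; the function name and the toy pairs are illustrative.

    # Sketch of the transitive group merging used above.
    def merge_groups(pairs):
        groups = [set(p) for p in pairs]
        changed = True
        while changed:
            changed = False
            merged = []
            for group in groups:
                for existing in merged:
                    if existing & group:
                        existing |= group      # absorb into an earlier group
                        changed = True
                        break
                else:
                    merged.append(set(group))
            groups = merged
        return groups

    # D2 links the first pair to the third, D3 links the third to the second,
    # so all four domains end up in one group (set order may vary when printed)
    print(merge_groups([('D1', 'D2'), ('D3', 'D4'), ('D2', 'D3')]))
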
/docker/scripts/step2_get_AFDB_pdbs.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, string
3 | import pdbx
4 | from pdbx.reader.PdbxReader import PdbxReader
5 |
6 | three2one = {}
7 | three2one["ALA"] = "A"
8 | three2one["CYS"] = "C"
9 | three2one["ASP"] = "D"
10 | three2one["GLU"] = "E"
11 | three2one["PHE"] = "F"
12 | three2one["GLY"] = "G"
13 | three2one["HIS"] = "H"
14 | three2one["ILE"] = "I"
15 | three2one["LYS"] = "K"
16 | three2one["LEU"] = "L"
17 | three2one["MET"] = "M"
18 | three2one["MSE"] = "M"
19 | three2one["ASN"] = "N"
20 | three2one["PRO"] = "P"
21 | three2one["GLN"] = "Q"
22 | three2one["ARG"] = "R"
23 | three2one["SER"] = "S"
24 | three2one["THR"] = "T"
25 | three2one["VAL"] = "V"
26 | three2one["TRP"] = "W"
27 | three2one["TYR"] = "Y"
28 |
29 | dataset = sys.argv[1]
30 | prot = sys.argv[2]
31 |
32 | if os.path.exists(dataset + "/" + prot + ".cif") and os.path.exists("step1/" + dataset + "/" + prot + ".fa"):
33 | fp = open("step1/" + dataset + "/" + prot + ".fa", "r")
34 | myseq = ""
35 | for line in fp:
36 | if line[0] == ">":
37 | pass
38 | else:
39 | myseq += line[:-1]
40 | fp.close()
41 |
42 | cif = open(dataset + "/" + prot + ".cif", "r")
43 | pRd = PdbxReader(cif)
44 | data = []
45 | pRd.read(data)
46 | block = data[0]
47 |
48 | atom_site = block.getObj("atom_site")
49 | record_type_index = atom_site.getIndex("group_PDB")
50 | atom_type_index = atom_site.getIndex("type_symbol")
51 | atom_identity_index = atom_site.getIndex("label_atom_id")
52 | residue_type_index = atom_site.getIndex("label_comp_id")
53 | chain_id_index = atom_site.getIndex("label_asym_id")
54 | residue_id_index = atom_site.getIndex("label_seq_id")
55 | coor_x_index = atom_site.getIndex("Cartn_x")
56 | coor_y_index = atom_site.getIndex("Cartn_y")
57 | coor_z_index = atom_site.getIndex("Cartn_z")
58 | alt_id_index = atom_site.getIndex("label_alt_id")
59 | model_num_index = atom_site.getIndex("pdbx_PDB_model_num")
60 | occupancy_index = atom_site.getIndex("occupancy")
61 | bfactor_index = atom_site.getIndex("B_iso_or_equiv")
62 |
63 | if model_num_index == -1:
64 | mylines = []
65 | for i in range(atom_site.getRowCount()):
66 | words = atom_site.getRow(i)
67 | chain_id = words[chain_id_index]
68 | record_type = words[record_type_index]
69 | if chain_id == "A" and record_type == "ATOM":
70 | mylines.append(words)
71 | else:
72 | model2lines = {}
73 | models = []
74 | for i in range(atom_site.getRowCount()):
75 | words = atom_site.getRow(i)
76 | chain_id = words[chain_id_index]
77 | record_type = words[record_type_index]
78 | model_num = int(words[model_num_index])
79 | if chain_id == "A" and record_type == "ATOM":
80 | try:
81 | model2lines[model_num].append(words)
82 | except KeyError:
83 | model2lines[model_num] = [words]
84 | models.append(model_num)
85 | best_model = min(models)
86 | mylines = model2lines[best_model]
87 |
88 | goodlines = []
89 | resid2altid = {}
90 | resid2aa = {}
91 | atom_count = 0
92 | for words in mylines:
93 | atom_type = words[atom_type_index]
94 | atom_identity = words[atom_identity_index]
95 | residue_type = words[residue_type_index]
96 | residue_id = int(words[residue_id_index])
97 | alt_id = words[alt_id_index]
98 |
99 | if atom_identity == "CA":
100 | try:
101 | resid2aa[residue_id] = three2one[residue_type]
102 | except KeyError:
103 | resid2aa[residue_id] = "X"
104 |
105 | get_line = 0
106 | if alt_id == ".":
107 | get_line = 1
108 | else:
109 | try:
110 | if resid2altid[residue_id] == alt_id:
111 | get_line = 1
112 | else:
113 | get_line = 0
114 | except KeyError:
115 | resid2altid[residue_id] = alt_id
116 | get_line = 1
117 |
118 | if get_line:
119 | atom_count += 1
120 | coor_x_info = words[coor_x_index].split(".")
121 | if len(coor_x_info) >= 2:
122 | coor_x = coor_x_info[0] + "." + coor_x_info[1][:3]
123 | else:
124 | coor_x = coor_x_info[0]
125 | coor_y_info = words[coor_y_index].split(".")
126 | if len(coor_y_info) >= 2:
127 | coor_y = coor_y_info[0] + "." + coor_y_info[1][:3]
128 | else:
129 | coor_y = coor_y_info[0]
130 | coor_z_info = words[coor_z_index].split(".")
131 | if len(coor_z_info) >= 2:
132 | coor_z = coor_z_info[0] + "." + coor_z_info[1][:3]
133 | else:
134 | coor_z = coor_z_info[0]
135 |
136 | occupancy_info = words[occupancy_index].split(".")
137 | if len(occupancy_info) == 1:
138 | occupancy = occupancy_info[0] + ".00"
139 | else:
140 | if len(occupancy_info[1]) == 1:
141 | occupancy = occupancy_info[0] + "." + occupancy_info[1] + "0"
142 | else:
143 | occupancy = occupancy_info[0] + "." + occupancy_info[1][:2]
144 | bfactor_info = words[bfactor_index].split(".")
145 | if len(bfactor_info) == 1:
146 | bfactor = bfactor_info[0] + ".00"
147 | else:
148 | if len(bfactor_info[1]) == 1:
149 | bfactor = bfactor_info[0] + "." + bfactor_info[1] + "0"
150 | else:
151 | bfactor = bfactor_info[0] + "." + bfactor_info[1][:2]
152 |
153 | if len(atom_identity) < 4:
154 | goodlines.append("ATOM " + str(atom_count).rjust(5) + " " + atom_identity.ljust(3) + " " + residue_type.ljust(3) + " A" + str(residue_id).rjust(4) + " " + coor_x.rjust(8) + coor_y.rjust(8) + coor_z.rjust(8) + occupancy.rjust(6) + bfactor.rjust(6) + " " + atom_type + "\n")
155 | elif len(atom_identity) == 4:
156 | goodlines.append("ATOM " + str(atom_count).rjust(5) + " " + atom_identity + " " + residue_type.ljust(3) + " A" + str(residue_id).rjust(4) + " " + coor_x.rjust(8) + coor_y.rjust(8) + coor_z.rjust(8) + occupancy.rjust(6) + bfactor.rjust(6) + " " + atom_type + "\n")
157 |
158 | newseq = ""
159 | for i in range(len(myseq)):
160 | resid = i + 1
161 | try:
162 | newseq += resid2aa[resid]
163 | if resid2aa[resid] == "X":
164 | pass
165 | elif resid2aa[resid] == myseq[i]:
166 | pass
167 | else:
168 | print ("error\t" + dataset + "\t" + prot)
169 | except KeyError:
170 | newseq += "-"
171 | if newseq == myseq:
172 | rp = open("step2/" + dataset + "/" + prot + ".pdb","w")
173 | for goodline in goodlines:
174 | rp.write(goodline)
175 | rp.close()
176 | else:
177 |         print ("error\t" + dataset + "\t" + prot)
178 |         sys.exit(1)
179 | elif os.path.exists(dataset + "/" + prot + ".pdb") and os.path.exists("step1/" + dataset + "/" + prot + ".fa"):
180 | with open(dataset + "/" + prot + ".pdb") as f:
181 | pdblines = f.readlines()
182 | pdblines = [i for i in pdblines if i[:4]=='ATOM']
183 | with open("step2/" + dataset + "/" + prot + ".pdb",'w') as f:
184 | for i in pdblines:
185 | f.write(i)
186 | else:
187 | sys.exit(1)
188 |
--------------------------------------------------------------------------------
/docker/scripts/step3_run_hhsearch.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, subprocess
3 | def run_cmd(cmd):
4 | status = subprocess.run(cmd,shell = True).returncode
5 | return status
6 |
7 | dataset = sys.argv[1]
8 | prot = sys.argv[2]
9 | cpu = sys.argv[3]
10 |
11 | if os.path.exists('step3/' + dataset + '/' + prot + '.hhsearch'):
12 | pass
13 | else:
14 | if os.path.exists('step3/' + dataset + '/' + prot + '.hmm'):
15 | status = run_cmd('hhsearch -cpu ' + cpu + ' -Z 100000 -B 100000 -i step3/' + dataset + '/' + prot + '.hmm -d /mnt/databases/pdb70/pdb70 -o step3/' + dataset + '/' + prot + '.hhsearch')
16 | if status != 0:
17 | sys.exit(1)
18 | elif os.path.exists('step3/' + dataset + '/' + prot + '.hhm'):
19 | os.system('mv step3/' + dataset + '/' + prot + '.hhm step3/' + dataset + '/' + prot + '.hmm')
20 | status = run_cmd('hhsearch -cpu ' + cpu + ' -Z 100000 -B 100000 -i step3/' + dataset + '/' + prot + '.hmm -d /mnt/databases/pdb70/pdb70 -o step3/' + dataset + '/' + prot + '.hhsearch')
21 | if status != 0:
22 | sys.exit(1)
23 | else:
24 | cmds= ['hhblits -cpu ' + cpu + ' -i step1/' + dataset + '/' + prot + '.fa -d /mnt/databases/UniRef30_2022_02/UniRef30_2022_02 -oa3m step3/' + dataset + '/' + prot + '.a3m','addss.pl step3/' + dataset + '/' + prot + '.a3m step3/' + dataset + '/' + prot + '.a3m.ss -a3m','mv step3/' + dataset + '/' + prot + '.a3m.ss step3/' + dataset + '/' + prot + '.a3m','hhmake -i step3/' + dataset + '/' + prot + '.a3m -o step3/' + dataset + '/' + prot + '.hmm','hhsearch -cpu ' + cpu + ' -Z 100000 -B 100000 -i step3/' + dataset + '/' + prot + '.hmm -d /mnt/databases/pdb70/pdb70 -o step3/' + dataset + '/' + prot + '.hhsearch']
25 | for cmd in cmds:
26 | status = run_cmd(cmd)
27 | if status != 0:
28 | sys.exit(1)
29 |
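30 | # Sketch of the fallback path above, inferred from the commands themselves (not separate documentation):
31 | # when no .hmm/.hhm profile exists yet, hhblits builds an a3m alignment against UniRef30_2022_02,
32 | # addss.pl annotates it with secondary structure, hhmake converts it into an .hmm profile,
33 | # and hhsearch scans that profile against pdb70; any non-zero exit status aborts the step.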
--------------------------------------------------------------------------------
/docker/scripts/step4_run_foldseek.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys, random, string
3 | def generate_random_directory_name():
4 | characters = string.ascii_lowercase + string.digits
5 | random_string = ''.join(random.choice(characters) for _ in range(8))
6 | return random_string
7 |
8 |
9 | dataset = sys.argv[1]
10 | ncore = sys.argv[2]
11 | tmp_dir_name = generate_random_directory_name()
12 |
13 | if not os.path.exists('/tmp/' + dataset + '_' + tmp_dir_name):
14 | os.system('mkdir /tmp/' + dataset + '_' + tmp_dir_name)
15 |
16 | fp = open('step4/' + dataset + '_step4.list', 'r')
17 | prots = []
18 | for line in fp:
19 | words = line.split()
20 | prots.append(words[0])
21 | fp.close()
22 |
23 | for prot in prots:
24 | os.system('foldseek easy-search step2/' + dataset + '/' + prot + '.pdb /mnt/databases/ECOD_foldseek_DB/ECOD_foldseek_DB step4/' + dataset + '/' + prot + '.foldseek /tmp/' + dataset + '_' + tmp_dir_name + ' -e 1000 --max-seqs 1000000 --threads ' + ncore + ' >> /tmp/step4_' + dataset + '_' + tmp_dir_name + '.log')
25 | fp = open('step4/' + dataset + '/' + prot + '.foldseek', 'r')
26 | countline = 0
27 | for line in fp:
28 | countline += 1
29 | fp.close()
30 | if not countline:
31 | os.system('echo \'done\' > step4/' + dataset + '/' + prot + '.done')
32 |
33 | os.system('rm -rf /tmp/' + dataset + '_' + tmp_dir_name)
34 | os.system('mv /tmp/step4_' + dataset + '_' + tmp_dir_name + '.log ./')
35 |
--------------------------------------------------------------------------------
/docker/scripts/step6_process_foldseek.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import sys
3 |
4 | dataset = sys.argv[1]
5 | prot = sys.argv[2]
6 |
7 | fp = open('step1/' + dataset + '/' + prot + '.fa','r')
8 | query_seq = ''
9 | for line in fp:
10 | if line[0] != '>':
11 | query_seq += line[:-1]
12 | fp.close()
13 | qlen = len(query_seq)
14 |
15 | fp = open('step4/' + dataset + '/' + prot + '.foldseek', 'r')
16 | hits = []
17 | for line in fp:
18 | words = line.split()
19 | dnum = words[1].split('.')[0]
20 | qstart = int(words[6])
21 | qend = int(words[7])
22 | qresids = set([])
23 | for qres in range(qstart, qend + 1):
24 | qresids.add(qres)
25 | evalue = float(words[10])
26 | hits.append([dnum, evalue, qstart, qend, qresids])
27 | fp.close()
28 | hits.sort(key = lambda x:x[1])
29 |
30 | qres2count = {}
31 | for res in range(1, qlen + 1):
32 | qres2count[res] = 0
33 |
34 | rp = open('step6/' + dataset + '/' + prot + '.result', 'w')
35 | rp.write('ecodnum\tevalue\trange\n')
36 | for hit in hits:
37 | dnum = hit[0]
38 | evalue = hit[1]
39 | qstart = hit[2]
40 | qend = hit[3]
41 | qresids = hit[4]
42 | for res in qresids:
43 | qres2count[res] += 1
44 | good_res = 0
45 | for res in qresids:
46 | if qres2count[res] <= 100:
47 | good_res += 1
48 | if good_res >= 10:
49 | rp.write(dnum + '\t' + str(evalue) + '\t' + str(qstart) + '-' + str(qend) + '\n')
50 | rp.close()
51 |
--------------------------------------------------------------------------------
/docker/scripts/step7_prepare_dali.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys
3 |
4 | dataset = sys.argv[1]
5 | prot = sys.argv[2]
6 |
7 |
8 | domains = set([])
9 | if os.path.exists('step5/' + dataset + '/' + prot + '.result'):
10 | fp = open('step5/' + dataset + '/' + prot + '.result', 'r')
11 | for countl, line in enumerate(fp):
12 | if countl:
13 | words = line.split()
14 | domains.add(words[1])
15 | fp.close()
16 |
17 | if os.path.exists('step6/' + dataset + '/' + prot + '.result'):
18 | fp = open('step6/' + dataset + '/' + prot + '.result','r')
19 | for countl, line in enumerate(fp):
20 | if countl:
21 | words = line.split()
22 | domains.add(words[0])
23 | fp.close()
24 |
25 | if domains:
26 | rp = open('step7/' + dataset + '/' + prot + '_hits', 'w')
27 | for domain in domains:
28 | rp.write(domain + '\n')
29 | rp.close()
30 | else:
31 | os.system('echo \'done\' > step7/' + dataset + '/' + prot + '.done')
32 |
--------------------------------------------------------------------------------
/docker/scripts/step9_analyze_dali.py:
--------------------------------------------------------------------------------
1 | #!/opt/conda/bin/python
2 | import os, sys
3 | import numpy as np
4 |
5 |
6 | def get_range(resids):
7 | resids.sort()
8 | segs = []
9 | for resid in resids:
10 | if not segs:
11 | segs.append([resid])
12 | else:
13 | if resid > segs[-1][-1] + 1:
14 | segs.append([resid])
15 | else:
16 | segs[-1].append(resid)
17 | ranges = []
18 | for seg in segs:
19 | ranges.append(f'{str(seg[0])}-{str(seg[-1])}')
20 | return ','.join(ranges)
21 |
22 |
23 | spname = sys.argv[1]
24 | prot = sys.argv[2]
25 | fp = open('/mnt/databases/ecod.latest.domains','r')
26 | ecod2id = {}
27 | ecod2fam = {}
28 | for line in fp:
29 | if line[0] != '#':
30 | words = line[:-1].split('\t')
31 | ecodnum = words[0]
32 | ecodid = words[1]
33 | ecodfam = '.'.join(words[3].split('.')[:2])
34 | ecod2id[ecodnum] = ecodid
35 | ecod2fam[ecodnum] = ecodfam
36 | fp.close()
37 |
38 | if os.path.exists(f'step8/{spname}/{prot}_hits'):
39 | fp = open(f'step8/{spname}/{prot}_hits','r')
40 | ecodnum = ''
41 | ecodid = ''
42 | ecodfam = ''
43 | hitname = ''
44 | rot1 = ''
45 | rot2 = ''
46 | rot3 = ''
47 | trans = ''
48 | maps = []
49 | hits = []
50 | for line in fp:
51 | if line[0] == '>':
52 | if ecodnum and ecodid and ecodfam and hitname and zscore and maps:
53 | hits.append([hitname, ecodnum, ecodid, ecodfam, zscore, maps, rot1, rot2, rot3, trans])
54 | words = line[1:].split()
55 | zscore = float(words[1])
56 | hitname = words[0]
57 | ecodnum = hitname.split('_')[0]
58 | ecodid = ecod2id[ecodnum]
59 | ecodfam = ecod2fam[ecodnum]
60 | maps = []
61 | rotcount = 0
62 | else:
63 | words = line.split()
64 | if words[0] == 'rotation':
65 | rotcount += 1
66 | if rotcount == 1:
67 | rot1 = ','.join(words[1:])
68 | elif rotcount == 2:
69 | rot2 = ','.join(words[1:])
70 | elif rotcount == 3:
71 | rot3 = ','.join(words[1:])
72 | elif words[0] == 'translation':
73 | trans = ','.join(words[1:])
74 | else:
75 | pres = int(words[0])
76 | eres = int(words[1])
77 | maps.append([pres, eres])
78 | fp.close()
79 | if ecodnum and ecodid and ecodfam and hitname and zscore and maps:
80 | hits.append([hitname, ecodnum, ecodid, ecodfam, zscore, maps, rot1, rot2, rot3, trans])
81 |
82 |
83 | newhits = []
84 | for hit in hits:
85 | hitname = hit[0]
86 | ecodnum = hit[1]
87 | total_weight = 0
88 | posi2weight = {}
89 | zscores = []
90 | qscores = []
91 | if os.path.exists(f'/mnt/databases/posi_weights/{ecodnum}.weight'):
92 | fp = open(f'/mnt/databases/posi_weights/{ecodnum}.weight','r')
93 | posi2weight = {}
94 | for line in fp:
95 | words = line.split()
96 | total_weight += float(words[3])
97 | posi2weight[int(words[0])] = float(words[3])
98 | fp.close()
99 | if os.path.exists(f'/mnt/databases/ecod_internal/{ecodnum}.info'):
100 | fp = open(f'/mnt/databases/ecod_internal/{ecodnum}.info','r')
101 | for line in fp:
102 | words = line.split()
103 | zscores.append(float(words[1]))
104 | qscores.append(float(words[2]))
105 | fp.close()
106 | ecodid = hit[2]
107 | ecodfam = hit[3]
108 | zscore = hit[4]
109 | maps = hit[5]
110 | rot1 = hit[6]
111 | rot2 = hit[7]
112 | rot3 = hit[8]
113 | trans = hit[9]
114 |
115 | if zscores and qscores:
116 | qscore = 0
117 | for item in maps:
118 | try:
119 | qscore += posi2weight[item[1]]
120 | except KeyError:
121 | pass
122 |
123 | better = 0
124 | worse = 0
125 | for other_qscore in qscores:
126 | if other_qscore > qscore:
127 | better += 1
128 | else:
129 | worse += 1
130 | qtile = better / (better + worse)
131 |
132 | better = 0
133 | worse = 0
134 | for other_zscore in zscores:
135 | if other_zscore > zscore:
136 | better += 1
137 | else:
138 | worse += 1
139 | ztile = better / (better + worse)
140 | newhits.append([hitname, ecodnum, ecodid, ecodfam, zscore, qscore / total_weight, ztile, qtile, maps, rot1, rot2, rot3, trans])
141 | else:
142 | newhits.append([hitname, ecodnum, ecodid, ecodfam, zscore, -1, -1, -1, maps, rot1, rot2, rot3, trans])
143 |
144 |
145 | newhits.sort(key = lambda x:x[4], reverse = True)
146 | finalhits = []
147 | posi2fams = {}
148 | for hit in newhits:
149 | ecodfam = hit[3]
150 | maps = hit[8]
151 | rot1 = hit[9]
152 | rot2 = hit[10]
153 | rot3 = hit[11]
154 | trans = hit[12]
155 | qposis = []
156 | eposis = []
157 | ranks = []
158 | for item in maps:
159 | qposis.append(item[0])
160 | eposis.append(item[1])
161 | try:
162 | posi2fams[item[0]].add(ecodfam)
163 | except KeyError:
164 | posi2fams[item[0]] = set([ecodfam])
165 | ranks.append(len(posi2fams[item[0]]))
166 | ave_rank = round(np.mean(ranks), 2)
167 | qrange = get_range(qposis)
168 | erange = get_range(eposis)
169 | finalhits.append([hit[0], hit[1], hit[2], hit[3], round(hit[4], 2), round(hit[5], 2), round(hit[6], 2), round(hit[7], 2), ave_rank, qrange, erange, rot1, rot2, rot3, trans])
170 |
171 | rp = open(f'step9/{spname}/{prot}_good_hits', 'w')
172 | rp.write('hitname\tecodnum\tecodkey\thgroup\tzscore\tqscore\tztile\tqtile\trank\tqrange\terange\trotation1\trotation2\trotation3\ttranslation\n')
173 | for hit in finalhits:
174 | rp.write(f'{hit[0]}\t{hit[1]}\t{hit[2]}\t{hit[3]}\t{str(hit[4])}\t{str(hit[5])}\t{str(hit[6])}\t{str(hit[7])}\t{str(hit[8])}\t{hit[9]}\t{hit[10]}\t{hit[11]}\t{hit[12]}\t{hit[13]}\t{hit[14]}\n')
175 | rp.close()
176 | else:
177 | os.system(f'echo \'done\' > step9/{spname}/{prot}.done')
178 |
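179 | # Reading guide for the _good_hits columns, inferred from the code above (a sketch, not separate documentation):
180 | # qscore is the weight-covered fraction of the ECOD domain (summed per-position weights over the total weight),
181 | # ztile and qtile are the fractions of reference hits for that domain that score better, so lower values are better,
182 | # and rank is the average number of distinct ECOD H-groups already covering each aligned query position.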
--------------------------------------------------------------------------------
/docker/scripts/summarize_check.py:
--------------------------------------------------------------------------------
1 | import sys
2 |
3 | fp = open(sys.argv[1] + '_check','r')
4 | check1 = 0
5 | check2 = 0
6 | check3 = 0
7 | check4 = 0
8 | for line in fp:
9 | words = line.split()
10 | check1 += int(words[1])
11 | check2 += int(words[2])
12 | check3 += int(words[3])
13 | check4 += int(words[4])
14 | fp.close()
15 |
16 | print (sys.argv[1] + '\t' + str(check1) + '\t' + str(check2) + '\t' + str(check3) + '\t' + str(check4))
17 |
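18 | # Usage sketch (the dataset name is whatever prefix the pipeline used elsewhere): python summarize_check.py <dataset>
19 | # It reads <dataset>_check and prints the sums of the four per-protein check columns for that dataset.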
--------------------------------------------------------------------------------
/docker/utilities/DaliLite.v5.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/docker/utilities/DaliLite.v5.tar.gz
--------------------------------------------------------------------------------
/docker/utilities/HHPaths.pm:
--------------------------------------------------------------------------------
1 | # HHPaths.pm
2 |
3 | # HHsuite version 3.0.0 (15-03-2015)
4 | # (C) J. Soeding, A. Hauser 2012
5 |
6 | # This program is free software: you can redistribute it and/or modify
7 | # it under the terms of the GNU General Public License as published by
8 | # the Free Software Foundation, either version 3 of the License, or
9 | # (at your option) any later version.
10 |
11 | # This program is distributed in the hope that it will be useful,
12 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
13 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 | # GNU General Public License for more details.
15 |
16 | # You should have received a copy of the GNU General Public License
17 | # along with this program. If not, see <http://www.gnu.org/licenses/>.
18 |
19 | # We are very grateful for bug reports! Please contact us at soeding@mpibpc.mpg.de
20 |
21 | # PLEASE INSERT CORRECT PATHS AT POSITIONS INDICATED BY ... BELOW
22 | # THE ENVIRONMENT VARIABLE HHLIB NEEDS TO BE SET TO YOUR LOCAL HH-SUITE DIRECTORY,
23 | # AS DESCRIBED IN THE HH-SUITE USER GUIDE AND README FILE
24 |
25 | package HHPaths;
26 |
27 | # This block can stay unmodified
28 | use vars qw(@ISA @EXPORT @EXPORT_OK %EXPORT_TAGS $VERSION);
29 | use Exporter;
30 | our $v;
31 | our $VERSION = "version 3.0.0 (15-03-2015)";
32 | our @ISA = qw(Exporter);
33 | our @EXPORT = qw($VERSION $hhlib $hhdata $hhbin $hhscripts $execdir $datadir $ncbidir $dummydb $pdbdir $dsspdir $dssp $cs_lib $context_lib $v);
34 | push @EXPORT, qw($hhshare $hhbdata);
35 |
36 | ##############################################################################################
37 | # PLEASE COMPLETE THE PATHS ... TO PSIPRED AND OLD-STYLE BLAST (NOT BLAST+) (NEEDED FOR PSIPRED)
38 | #our $execdir = ".../psipred/bin"; # path to PSIPRED V2 binaries
39 | #our $datadir = ".../psipred/data"; # path to PSIPRED V2 data files
40 | #our $ncbidir = ".../blast/bin"; # path to NCBI binaries (for PSIPRED in addss.pl)
41 | our $execdir = "/opt/conda/bin"; # path to PSIPRED V2 binaries
42 | our $datadir = "/opt/conda/pkgs/psipred-4.01-1/share/psipred_4.01/data"; # path to PSIPRED V2 data files
43 | our $ncbidir = "/opt/conda/bin"; # path to NCBI binaries (for PSIPRED in addss.pl)
44 |
45 | ##############################################################################################
46 | # PLEASE COMPLETE THE PATHS ... TO YOUR LOCAL PDB FILES, DSSP FILES ETC.
47 | #our $pdbdir = ".../pdb/all"; # where are the pdb files? (pdb/divided directory will also work)
48 | #our $dsspdir = ".../dssp/data"; # where are the dssp files? Used in addss.pl.
49 | #our $dssp = ".../dssp/bin/dsspcmbi"; # where is the dssp binary? Used in addss.pl.
50 | our $pdbdir = "/cluster/databases/pdb/all"; # where are the pdb files? (pdb/divided directory will also work)
51 | our $dsspdir = "/cluster/databases/dssp/data"; # where are the dssp files? Used in addss.pl
52 | our $dssp = "/usr/bin/mkdssp"; # where is the dssp binary? Used in addss.pl
53 | ##############################################################################################
54 |
55 | # The lines below probably do not need to be changed
56 |
57 | # Setting paths for hh-suite perl scripts
58 | #our $hhlib = $ENV{"HHLIB"} || "/usr/lib/hhsuite"; # main hh-suite directory
59 | #our $hhshare = $ENV{"HHLIB"} || "/usr/share/hhsuite"; # main hh-suite directory
60 | our $hhlib = "/opt/hhsuite";
61 | our $hhshare = "/opt/hhsuite";
62 | our $hhdata = $hhshare."/data"; # path to arch indep data directory for hhblits, example files
63 | our $hhbdata = $hhlib."/data"; # path to arch dep data directory for hhblits, example files
64 | our $hhbin = $hhlib."/bin"; # path to cstranslate (path to hhsearch, hhblits etc. should be in environment variable PATH)
65 | our $hhscripts= $hhshare."/scripts"; # path to hh perl scripts (addss.pl, reformat.pl, hhblitsdb.pl etc.)
66 | our $dummydb = $hhbdata."/do_not_delete"; # Name of dummy blast db for PSIPRED (single sequence formatted with NCBI formatdb)
67 |
68 | # HHblits data files
69 | our $cs_lib = "$hhdata/cs219.lib";
70 | our $context_lib = "$hhdata/context_data.lib";
71 |
72 | # Add hh-suite scripts directory to search path
73 | $ENV{"PATH"} = $hhscripts.":".$ENV{"PATH"}; # Add hh scripts directory to environment variable PATH
74 |
75 | ################################################################################################
76 | ### System command with return value parsed from output
77 | ################################################################################################
78 | sub System()
79 | {
80 | if ($v>=2) {printf(STDERR "\$ %s\n",$_[0]);}
81 | system($_[0]);
82 | if ($? == -1) {
83 | die("\nError: failed to execute '$_[0]': $!\n\n");
84 | } elsif ($? != 0) {
85 | printf(STDERR "\nError: command '$_[0]' returned error code %d\n\n", $? >> 8);
86 | return 1;
87 | }
88 | return $?;
89 | }
90 |
91 | return 1;
92 |
--------------------------------------------------------------------------------
/docker/utilities/foldseek:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/docker/utilities/foldseek
--------------------------------------------------------------------------------
/docker/utilities/pdb2fasta:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/docker/utilities/pdb2fasta
--------------------------------------------------------------------------------
/example/test_struc.list:
--------------------------------------------------------------------------------
1 | O05011
2 | O05012
3 | O05023
4 |
--------------------------------------------------------------------------------
/run_dpam_docker.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import docker
3 | import os,sys
4 |
5 | def check_image_exists(image_name):
6 | client = docker.from_env()
7 | try:
8 | client.images.get(image_name)
9 | return True
10 | except docker.errors.ImageNotFound:
11 | return False
12 |
13 |
14 | def check_databases(databases_dir):
15 | path = os.getcwd()
16 | flag = 1
17 | if not os.path.exists(databases_dir):
18 | print (databases_dir, 'does not exist')
19 | flag = 0
20 | else:
21 | missing = []
22 | with open(f'{databases_dir}/all_files') as f:
23 | all_files = f.readlines()
24 | all_files = [i.strip() for i in all_files]
25 | for fn in all_files:
26 | if not os.path.exists(f'{databases_dir}/{fn}'):
27 | missing.append(fn)
28 | if missing:
29 | flag = 0
30 | with open('dpam_databases_missing_files','w') as f:
31 | f.write('\n'.join(missing)+'\n')
32 | print(f"Files missing for databases. Please check {path}/dpam_databases_missing_files for details")
33 | else:
34 | if os.path.exists('dpam_databases_missing_files'):
35 | os.system('rm dpam_databases_missing_files')
36 | return flag
37 |
38 | def check_inputs(input_dir,dataset):
39 | flag = 1
40 | if not os.path.exists(input_dir):
41 | flag = 0
42 | print('Error!', input_dir, 'does not exist.')
43 | else:
44 | if os.path.exists(f'{input_dir}/{dataset}') and os.path.exists(f'{input_dir}/{dataset}_struc.list'):
45 | with open(f'{input_dir}/{dataset}_struc.list') as f:
46 | alist = f.readlines()
47 | alist = [i.strip() for i in alist]
48 | missing = []
49 | for name in alist:
50 | if not os.path.exists(f'{input_dir}/{dataset}/{name}.cif') and not os.path.exists(f'{input_dir}/{dataset}/{name}.pdb'):
51 | missing.append([dataset, name, ':PDB/CIF missing'])
52 | if not os.path.exists(f'{input_dir}/{dataset}/{name}.json'):
53 | missing.append([dataset, name, ':PAE json missing'])
54 | if missing:
55 | flag = 0
56 | with open(f'{input_dir}/dpam_{dataset}_inputs_missing_files','w') as f:
57 | for i in missing:
58 | f.write(' '.join(i)+'\n')
59 | print(f'Error! Please check {input_dir}/dpam_{dataset}_inputs_missing_files for details')
60 | if not os.path.exists(f'{input_dir}/{dataset}'):
61 | flag = 0
62 | print('Error!', dataset, 'containing PDB/CIF and PAE does not exist.')
63 | if not os.path.exists(f'{input_dir}/{dataset}_struc.list'):
64 | flag = 0
65 | print('Error!', f'{input_dir}/{dataset}_struc.list for targets does not exist.')
66 | return flag
67 |
68 |
69 |
70 | def run_docker_container(image_name, databases_dir, input_dir, dataset, threads, log_file):
71 | client = docker.from_env()
72 | wdir = f'/home/'+input_dir.split('/')[-1]
73 |
74 | # Mount the directories to the container
75 | volumes = {
76 | databases_dir: {'bind': '/mnt/databases', 'mode': 'ro'},
77 | input_dir: {'bind': wdir, 'mode': 'rw'}
78 | }
79 |
80 | container = client.containers.run(image_name, detach=True, volumes=volumes, working_dir=wdir, command='tail -f /dev/null')
81 |
82 | # Example of running a script inside the container
83 | # Modify as needed for your specific script execution
84 | try:
85 | exec_log = container.exec_run(f"/bin/bash -c 'run_dpam.py {dataset} {threads}'", stdout=False, stderr=True)
86 | final_status = f'DPAM run for {dataset} under {input_dir} done\n'
87 | except Exception:
88 | exec_log, final_status = None, f'DPAM run for {dataset} under {input_dir} failed\n'
89 | 
90 | with open(log_file, 'w') as file:
91 | file.write(exec_log.output.decode() if exec_log else '')
92 | file.write(final_status)
93 | # Stop the container after the script execution
94 | container.stop()
95 |
96 | # Optionally, remove the container if not needed anymore
97 | container.remove()
98 |
99 | if __name__ == "__main__":
100 | parser = argparse.ArgumentParser(description="Run a DPAM docker container.")
101 | parser.add_argument("--databases_dir", help="Path to the databases directory to mount (required)", required=True)
102 | parser.add_argument("--input_dir", help="Path to the input directory to mount (required)", required=True)
103 | parser.add_argument("--dataset", help="Name of dataset (required)", required=True)
104 | parser.add_argument("--image_name", help="Image name", default="conglab/dpam")
105 | parser.add_argument("--threads", type=int, default=os.cpu_count(), help="Number of threads. Default is to use all CPUs")
106 | parser.add_argument("--log_file", help="File to save the logs")
107 |
108 | args = parser.parse_args()
109 |
110 | image_flag = check_image_exists(args.image_name)
111 | if not image_flag:
112 | print(args.image_name, 'does not exist!')
113 | sys.exit(1)
114 |
115 | db_flag = check_databases(args.databases_dir)
116 | if db_flag == 0:
117 | print("Databases are not complete")
118 | sys.exit(1)
119 |
120 | input_flag = check_inputs(args.input_dir,args.dataset)
121 | if input_flag == 0:
122 | print('Error(s)! Inputs missing')
123 | sys.exit(1)
124 |
125 | if '/' != args.input_dir[0]:
126 | path = os.path.join(os.getcwd(), args.input_dir)
127 | input_dir = os.path.abspath(path)
128 | else:
129 | input_dir = os.path.abspath(args.input_dir)
130 |
131 | if '/' != args.databases_dir[0]:
132 | path = os.path.join(os.getcwd(), args.databases_dir)
133 | databases_dir = os.path.abspath(path)
134 | else:
135 | databases_dir = os.path.abspath(args.databases_dir)
136 |
137 | if args.log_file is None:
138 | log_file = input_dir + '/' + args.dataset + '_docker.log'
139 | else:
140 | log_file = args.log_file
141 |
142 | run_docker_container(args.image_name,databases_dir, input_dir, args.dataset, args.threads,log_file)
143 |
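144 | # Usage sketch; the databases path below is illustrative, while example/test matches the repository's example data:
145 | #   python run_dpam_docker.py --databases_dir /data/dpam_databases --input_dir example --dataset test --threads 8
146 | # The script mounts the databases read-only at /mnt/databases, mounts the input directory read-write under /home/,
147 | # and then runs run_dpam.py <dataset> <threads> inside the conglab/dpam container.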
--------------------------------------------------------------------------------
/run_dpam_singularity.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import argparse
3 | import os,sys,subprocess
4 |
5 | def check_singularity_image_existence(image_path):
6 | result = subprocess.run(['singularity', 'inspect', image_path], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
7 | if result.returncode != 0:
8 | return 0
9 | else:
10 | return 1
11 |
12 |
13 | def check_databases(databases_dir):
14 | path = os.getcwd()
15 | flag = 1
16 | if not os.path.exists(databases_dir):
17 | print (databases_dir, 'does not exist')
18 | flag = 0
19 | else:
20 | missing = []
21 | with open(f'{databases_dir}/all_files') as f:
22 | all_files = f.readlines()
23 | all_files = [i.strip() for i in all_files]
24 | for fn in all_files:
25 | if not os.path.exists(f'{databases_dir}/{fn}'):
26 | missing.append(fn)
27 | if missing:
28 | flag = 0
29 | with open('dpam_databases_missing_files','w') as f:
30 | f.write('\n'.join(missing)+'\n')
31 | print(f"Files missing for databases. Please check {path}/dpam_databases_missing_files for details")
32 | else:
33 | if os.path.exists('dpam_databases_missing_files'):
34 | os.system('rm dpam_databases_missing_files')
35 | return flag
36 |
37 | def check_inputs(input_dir,dataset):
38 | flag = 1
39 | if not os.path.exists(input_dir):
40 | flag = 0
41 | print('Error!', input_dir, 'does not exist.')
42 | else:
43 | if os.path.exists(f'{input_dir}/{dataset}') and os.path.exists(f'{input_dir}/{dataset}_struc.list'):
44 | with open(f'{input_dir}/{dataset}_struc.list') as f:
45 | alist = f.readlines()
46 | alist = [i.strip() for i in alist]
47 | missing = []
48 | for name in alist:
49 | if not os.path.exists(f'{input_dir}/{dataset}/{name}.cif') and not os.path.exists(f'{input_dir}/{dataset}/{name}.pdb'):
50 | missing.append([dataset, name, ':PDB/CIF missing'])
51 | if not os.path.exists(f'{input_dir}/{dataset}/{name}.json'):
52 | missing.append([dataset, name, ':PAE json missing'])
53 | if missing:
54 | flag = 0
55 | with open(f'{input_dir}/dpam_{dataset}_inputs_missing_files','w') as f:
56 | for i in missing:
57 | f.write(' '.join(i)+'\n')
58 | print(f'Error! Please check {input_dir}/dpam_{dataset}_inputs_missing_files for details')
59 | if not os.path.exists(f'{input_dir}/{dataset}'):
60 | flag = 0
61 | print('Error!', dataset, 'containing PDB/CIF and PAE does not exist.')
62 | if not os.path.exists(f'{input_dir}/{dataset}_struc.list'):
63 | flag = 0
64 | print('Error!', f'{input_dir}/{dataset}_struc.list for targets does not exist.')
65 | return flag
66 |
67 |
68 |
69 |
70 | def run_singularity_container(image_name, databases_dir, input_dir, dataset, threads, log_file):
71 | wdir = f'/home/{input_dir.split("/")[-1]}'
72 |
73 | # Building the Singularity exec command with bind mounts
74 | exec_command = (
75 | f"singularity exec --fakeroot --bind {databases_dir}:/mnt/databases:ro "
76 | f"--bind {input_dir}:{wdir}:rw {image_name} "
77 | f"/bin/bash -c 'cd {wdir};run_dpam.py {dataset} {threads}'"
78 | )
79 |
80 | # Running the container
81 | try:
82 | print(exec_command)
83 | result = subprocess.run(exec_command, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
84 | exec_output = result.stdout
85 | exec_error = result.stderr
86 | final_status = f'DPAM run for {dataset} under {input_dir} done\n'
87 | except subprocess.CalledProcessError as e:
88 | final_status = f'DPAM run for {dataset} under {input_dir} failed\n'
89 | exec_output = e.stdout
90 | exec_error = e.stderr
91 |
92 | # Writing to log file
93 | with open(log_file, 'w') as file:
94 | file.write(exec_output + exec_error)
95 | file.write(final_status)
96 |
97 |
98 | if __name__ == "__main__":
99 | parser = argparse.ArgumentParser(description="Run a DPAM Singularity container.")
100 | parser.add_argument("--databases_dir", help="Path to the databases directory to mount (required)", required=True)
101 | parser.add_argument("--input_dir", help="Path to the input directory to mount (required)", required=True)
102 | parser.add_argument("--dataset", help="Name of dataset (required)", required=True)
103 | parser.add_argument("--image_name", help="Image name")
104 | parser.add_argument("--threads", type=int, default=os.cpu_count(), help="Number of threads. Default is to use all CPUs")
105 | parser.add_argument("--log_file", help="File to save the logs. Default is <dataset>_docker.log under <input_dir>.")
106 |
107 | args = parser.parse_args()
108 |
109 | image_flag = check_singularity_image_existence(args.image_name)
110 | if not image_flag:
111 | print(args.image_name, 'or Singularity does not exist!')
112 | sys.exit(1)
113 |
114 | db_flag = check_databases(args.databases_dir)
115 | if db_flag == 0:
116 | print("Databases are not complete")
117 | sys.exit(1)
118 |
119 | input_flag = check_inputs(args.input_dir,args.dataset)
120 | if input_flag == 0:
121 | print('Error(s)! Inputs missing')
122 | sys.exit(1)
123 |
124 | if '/' != args.input_dir[0]:
125 | path = os.path.join(os.getcwd(), args.input_dir)
126 | input_dir = os.path.abspath(path)
127 | else:
128 | input_dir = os.path.abspath(args.input_dir)
129 |
130 | if '/' != args.databases_dir[0]:
131 | path = os.path.join(os.getcwd(), args.databases_dir)
132 | databases_dir = os.path.abspath(path)
133 | else:
134 | databases_dir = os.path.abspath(args.databases_dir)
135 |
136 | if args.log_file is None:
137 | log_file = input_dir + '/' + args.dataset + '_docker.log'
138 | else:
139 | log_file = args.log_file
140 |
141 | run_singularity_container(args.image_name,databases_dir, input_dir, args.dataset, args.threads,log_file)
142 |
143 |
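144 | # Usage sketch; the .sif path and databases path are illustrative, while example/test matches the repository's example data:
145 | #   python run_dpam_singularity.py --image_name dpam.sif --databases_dir /data/dpam_databases --input_dir example --dataset test --threads 8
146 | # The script binds the databases read-only at /mnt/databases, binds the input directory read-write under /home/,
147 | # and executes run_dpam.py <dataset> <threads> inside the image via singularity exec --fakeroot.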
--------------------------------------------------------------------------------
/v1.0/A0A0K2WPR7.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/A0A0K2WPR7.zip
--------------------------------------------------------------------------------
/v1.0/DPAM.py:
--------------------------------------------------------------------------------
1 | import sys,os,time
2 | from datetime import datetime
3 | import subprocess
4 | script_dir=os.path.dirname(os.path.realpath(__file__))
5 |
6 | def print_usage ():
7 | print("usage: DPAM.py <input_structure> <input_PAE> <basename> <basedir> <threads> <datadir>")
8 |
9 | def check_progress(basedir, basename):
10 | full_progress=range(1,13)
11 | if os.path.exists(basedir):
12 | if os.path.exists(basedir + '/' + basename + '_progress_logs'):
13 | with open(basedir + '/' + basename + '_progress_logs') as f:
14 | logs=f.readlines()
15 | logs=[i.strip() for i in logs if i.strip()!='']
16 | if logs:
17 | logs=[int(i.split()[0]) for i in logs]
18 | full_progress=set(full_progress)-set(logs)
19 | full_progress=sorted(full_progress)
20 | if full_progress:
21 | progress=full_progress[0]
22 | else:
23 | progress=[]
24 | return progress
25 |
26 | if len(sys.argv) != 7:
27 | print_usage()
28 | else:
29 | logs=[]
30 | input_struc = sys.argv[1]
31 | input_pae = sys.argv[2]
32 | basename = sys.argv[3]
33 | basedir = sys.argv[4]
34 | threads = sys.argv[5]
35 | datadir = sys.argv[6]
36 | if basedir[0] != '/':
37 | basedir = os.getcwd() + '/' + basedir
38 | if not os.path.exists(basedir):
39 | os.system('mkdir ' + basedir)
40 | if '.cif' == input_struc[-4:]:
41 | os.system('cp ' + input_struc + ' ' + basedir + '/' + basename + '.cif')
42 | elif '.pdb' == input_struc[-4:]:
43 | os.system('cp ' + input_struc + ' ' + basedir + '/' + basename + '.pdb')
44 | else:
45 | print("Cannot recognize the structure file. Please use either mmCIF or PDB as input. Exiting...")
46 | sys.exit()
47 | os.system('cp ' + input_pae + ' ' + basedir + '/' + basename + '.json')
48 | print('start input processing', datetime.now())
49 | status = subprocess.call(f'python {script_dir}/step1_get_AFDB_seqs.py {basename} {basedir}',shell=True)
50 | if status != 0:
51 | print('Cannot get protein sequence. Exiting...')
52 | sys.exit()
53 | status = subprocess.call(f'python {script_dir}/step1_get_AFDB_pdbs.py {basename} {basedir}',shell=True)
54 | if status != 0:
55 | print('Cannot process structure file. Exiting...')
56 | sys.exit()
57 | logs.append('0')
58 | progress=check_progress(basedir, basename)
59 | if progress!=[]:
60 | cmds=[]
61 | cmds.append(f'python {script_dir}/step2_run_hhsearch.py {basename} {threads} {basedir} {datadir}')
62 | cmds.append(f'python {script_dir}/step3_run_foldseek.py {basename} {threads} {basedir} {datadir}')
63 | cmds.append(f'python {script_dir}/step4_filter_foldseek.py {basename} {basedir}')
64 | cmds.append(f'python {script_dir}/step5_map_to_ecod.py {basename} {basedir} {datadir}')
65 | cmds.append(f'python {script_dir}/step6_get_dali_candidates.py {basename} {basedir}')
66 | cmds.append(f'python {script_dir}/step7_iterative_dali_aug_multi.py {basename} {threads} {basedir} {datadir}')
67 | cmds.append(f'python {script_dir}/step8_analyze_dali.py {basename} {basedir} {datadir}')
68 | cmds.append(f'python {script_dir}/step9_get_support.py {basename} {basedir} {datadir}')
69 | cmds.append(f'python {script_dir}/step10_get_good_domains.py {basename} {basedir} {datadir}')
70 | cmds.append(f'python {script_dir}/step11_get_sse.py {basename} {basedir}')
71 | cmds.append(f'python {script_dir}/step12_get_diso.py {basename} {basedir}')
72 | cmds.append(f'python {script_dir}/step13_parse_domains.py {basename} {basedir}')
73 | step=1
74 | for cmd in cmds[progress-1:]:
75 | print(f'start {cmd}', datetime.now())
76 | status = subprocess.call(cmd,shell=True)
77 | if status != 0:
78 | print(f"Error in {cmd}. Exiting...")
79 | with open(f'{basedir}/{basename}_progress_logs','w') as f:
80 | for i in logs:
81 | f.write(i+'\n')
82 | sys.exit()
83 | else:
84 | logs.append(str(step))
85 | step = step + 1
86 | print(f'end {cmd}', datetime.now())
87 | print(f'Domain Parsing for {basename} done')
88 | with open(f'{basedir}/{basename}_progress_logs','w') as f:
89 | for i in logs:
90 | f.write(i+'\n')
91 | else:
92 | print(f'Previous domain parsing result for {basename} is complete')
93 |
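94 | # Usage sketch; the file names and paths below are illustrative:
95 | #   python DPAM.py AF-P12345-F1.cif AF-P12345-F1.json P12345 P12345_out 8 /data/dpam_databases
96 | # The six arguments map to input_struc, input_pae, basename, basedir, threads and datadir (sys.argv[1..6] above).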
--------------------------------------------------------------------------------
/v1.0/README.md:
--------------------------------------------------------------------------------
1 | # DPAM
2 | A **D**omain **P**arser for **A**lphafold **M**odels
3 |
4 | DPAM: A Domain Parser for AlphaFold Models (https://www.biorxiv.org/content/10.1101/2022.09.22.509116v1, accepted by Protein Science)
5 |
6 | ## Updates:
7 | A docker image can be downloaded with **docker pull conglab/dpam:latest** (this is an enhanced version of the current DPAM; we will update the repository soon).
8 |
9 | Upload domain parser results for six model organisms. (2022-12-6)
10 |
11 | Replace Dali with Foldseek for initial hits searching. (2022-11-30)
12 |
13 | Fix a bug in analyze_PDB.py which prevents the proper usage of Dali results. (2022-10-31)
14 | ## Prerequisites:
15 |
16 | ### Software and packages
17 | - HH-suite3: https://github.com/soedinglab/hh-suite (enable addss.pl to add secondary structure)
18 | - DaliLite.v5: http://ekhidna2.biocenter.helsinki.fi/dali/
19 | - Python 3.8
20 | - Foldseek
21 | - mkdssp
22 | - pdbx: https://github.com/soedinglab/pdbx
23 | - pdb2fasta (https://zhanggroup.org/pdb2fasta)
24 |
25 | Please add the above software to the environment PATH for DPAM. We also provide a script `check_dependencies.py` to check whether the above programs can be found.
26 | ### Supporting database:
27 | - hhsearch UniRef database (https://wwwuser.gwdg.de/~compbiol/uniclust/2022_02/)
28 | - pdb70 (https://conglab.swmed.edu/DPAM/pdb70.tgz)
29 | - ECOD database
30 | - ECOD ID map to pdb
31 | - ECOD domain length
32 | - ECOD domain list
33 | - ECOD norms
34 | - ECOD domain quality information
35 | - ECOD residue weight in domains
36 | - ECOD70 domain structures
37 | - ECOD70 foldseek database
38 |
39 | We provide a script `download_all_data.sh` that downloads all of these databases.
40 |
41 | `bash download_all_data.sh <download_directory>`
42 |
43 | After downloading the databases, please decompress the files. All supporting database files should be put in the same directory, and that directory should be provided to `DPAM.py` as `<datadir>`. The `<datadir>` should have the following structure and files.
44 |
45 | <datadir>/
46 | ECOD70/
47 | ecod_domain_info/
48 | ECOD_foldseek_DB/
49 | ecod_weights/
50 | pdb70/
51 | UniRef30_2022_02/
52 | ecod.latest.domains
53 | ECOD_length
54 | ECOD_norms
55 | ECOD_pdbmap
56 |
57 |
58 | ## Installation
59 | git clone https://github.com/CongLabCode/DPAM.git
60 |
61 | conda install -c qianlabcode dpam
62 |
63 | ## Usage
64 | `python DPAM.py <input_structure> <input_PAE> <basename> <basedir> <threads> <datadir>`
65 |
66 | ## Future improvements
67 | - Incorporate MMseqs2 to improve search speed
68 | - Provide a public server and integrate with the ECOD database
69 |
--------------------------------------------------------------------------------
/v1.0/check_dependencies.py:
--------------------------------------------------------------------------------
1 | import shutil
2 | hhsuite=['hhblits','hhsearch','hhmake','addss.pl']
3 | programs=['hhblits','hhsearch','hhmake','addss.pl','foldseek','dali.pl','mkdssp']
4 | missing=[]
5 | pdbx=0
6 | for prog in programs:
7 | check = shutil.which(prog)
8 | if check == None:
9 | missing.append(prog)
10 | try:
11 | import pdbx
12 | from pdbx.reader.PdbxReader import PdbxReader
13 | except:
14 | pdbx = 1
15 |
16 |
17 | if missing or pdbx == 1:
18 | if missing:
19 | text = "Please add"
20 | hhsuite_missing = [i for i in missing if i in hhsuite]
21 | if hhsuite_missing:
22 | if len(hhsuite_missing) >= 2:
23 | hhsuite_missing = ','.join(hhsuite_missing[:-1]) +' and '+hhsuite_missing[-1] + " in HH-suite3"
24 | else:
25 | hhsuite_missing = hhsuite_missing[0] + " in HH-suite3"
26 | text = text + " " + hhsuite_missing
27 | others = [i for i in missing if i not in hhsuite_missing]
28 | if others:
29 | if len(others) >= 2:
30 | others = ','.join(others[:-1]) + ' and ' + others[-1]
31 | else:
32 | others = others[0]
33 | text = text + " and " + others
34 | text = text + " to environment path"
35 | print(text)
36 | if pdbx == 1:
37 | print('pdbx is not installed properly. Please refer to https://github.com/soedinglab/pdbx for installation')
38 | else:
39 | print('HH-suite, Foldseek and dali.pl are found')
40 |
--------------------------------------------------------------------------------
/v1.0/download_all_data.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # Usage: bash download_all_data.sh /path/to/download/directory
3 | set -e
4 |
5 | if [[ $# -eq 0 ]]; then
6 | echo "Error: download directory must be provided as an input argument."
7 | exit 1
8 | fi
9 |
10 | if ! command -v wget &> /dev/null ; then
11 | echo "Error: wget could not be found. Please install wget."
12 | exit 1
13 | fi
14 |
15 |
16 | DOWNLOAD_DIR="$1"
17 | ### Download ECOD70
18 | echo "Downloading ECOD70..."
19 | SOURCE_URL="https://conglab.swmed.edu/DPAM/ECOD70.tgz"
20 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}"
21 |
22 | ### Download pdb70
23 | echo "Downloading pdb70..."
24 | SOURCE_URL="https://conglab.swmed.edu/DPAM/pdb70.tgz"
25 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}"
26 |
27 | ### Download UniRef30
28 | echo "Downloading UniRef30..."
29 | SOURCE_URL="https://wwwuser.gwdg.de/~compbiol/uniclust/2022_02/UniRef30_2022_02_hhsuite.tar.gz"
30 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}"
31 |
32 | ### Download ECOD70 foldseek database
33 | echo "Downloading ECOD70 foldseek database"
34 | SOURCE_URL="https://conglab.swmed.edu/DPAM/ECOD_foldseek_DB.tgz"
35 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}"
36 |
37 | ### Download ECOD position weights
38 | echo "Downloading ECOD position weights"
39 | SOURCE_URL="https://conglab.swmed.edu/DPAM/ecod_weights.tgz"
40 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}"
41 |
42 | ### Download ECOD domain information
43 | echo "Downloading ECOD domain information"
44 | SOURCE_URL="https://conglab.swmed.edu/DPAM/ecod_domain_info.tgz"
45 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}"
46 |
47 |
48 | ### Download ECOD domain list, length, relationship to pdb and normalization
49 | echo "Downloading other ECOD related data"
50 | files=(ECOD_norms ecod.latest.domains ECOD_length ECOD_pdbmap)
51 | for str in ${files[@]}
52 | do
53 | SOURCE_URL="https://conglab.swmed.edu/DPAM/${str}"
54 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}"
55 | done
56 |
57 | ### Download benchmark data
58 | echo "Download benchmark data"
59 | SOURCE_URL="https://conglab.swmed.edu/DPAM/ECOD_benchmark.tgz"
60 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}"
61 |
62 | echo "Download done"
63 |
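64 | # Usage sketch (the target directory below is illustrative): bash download_all_data.sh /data/dpam_databases
65 | # All archives are fetched with wget into that directory; decompress them there afterwards as described in the README.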
--------------------------------------------------------------------------------
/v1.0/mkdssp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/mkdssp
--------------------------------------------------------------------------------
/v1.0/model_organisms/Caenorhabditis_elegans.tgz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Caenorhabditis_elegans.tgz
--------------------------------------------------------------------------------
/v1.0/model_organisms/Danio_rerio.tgz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Danio_rerio.tgz
--------------------------------------------------------------------------------
/v1.0/model_organisms/Drosophila_melanogaster.tgz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Drosophila_melanogaster.tgz
--------------------------------------------------------------------------------
/v1.0/model_organisms/Homo_Sapiens.tgz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Homo_Sapiens.tgz
--------------------------------------------------------------------------------
/v1.0/model_organisms/Mus_musculus.tgz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Mus_musculus.tgz
--------------------------------------------------------------------------------
/v1.0/model_organisms/Pan_paniscus.tgz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Pan_paniscus.tgz
--------------------------------------------------------------------------------
/v1.0/pdb2fasta:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/pdb2fasta
--------------------------------------------------------------------------------
/v1.0/step10_get_good_domains.py:
--------------------------------------------------------------------------------
1 | import os, sys
2 |
3 | prefix = sys.argv[1]
4 | wd = sys.argv[2]
5 | data_dir = sys.argv[3]
6 | if os.getcwd() != wd:
7 | os.chdir(wd)
8 |
9 | fp = open(f'{data_dir}/ECOD_norms', 'r')
10 | ecod2norm = {}
11 | for line in fp:
12 | words = line.split()
13 | ecod2norm[words[0]] = float(words[1])
14 | fp.close()
15 |
16 | results = []
17 | fp = open(f'{prefix}_sequence.result', 'r')
18 | for line in fp:
19 | words = line.split()
20 | filt_segs = []
21 | for seg in words[6].split(','):
22 | start = int(seg.split('-')[0])
23 | end = int(seg.split('-')[1])
24 | for res in range(start, end + 1):
25 | if not filt_segs:
26 | filt_segs.append([res])
27 | else:
28 | if res > filt_segs[-1][-1] + 10:
29 | filt_segs.append([res])
30 | else:
31 | filt_segs[-1].append(res)
32 |
33 | filt_seg_strings = []
34 | total_good_count = 0
35 | for seg in filt_segs:
36 | start = seg[0]
37 | end = seg[-1]
38 | good_count = 0
39 | for res in range(start, end + 1):
40 | good_count += 1
41 | if good_count >= 5:
42 | total_good_count += good_count
43 | filt_seg_strings.append(f'{start}-{end}')
44 | if total_good_count >= 25:
45 | results.append('sequence\t' + prefix + '\t' + '\t'.join(words[:7]) + '\t' + ','.join(filt_seg_strings) + '\n')
46 | fp.close()
47 |
48 | if os.path.exists(f'{prefix}_structure.result'):
49 | fp = open(f'{prefix}_structure.result', 'r')
50 | for line in fp:
51 | words = line.split()
52 | ecodnum = words[0].split('_')[0]
53 | edomain = words[1]
54 | zscore = float(words[3])
55 | try:
56 | znorm = round(zscore / ecod2norm[ecodnum], 2)
57 | except KeyError:
58 | znorm = 0.0
59 | qscore = float(words[4])
60 | ztile = float(words[5])
61 | qtile = float(words[6])
62 | rank = float(words[7])
63 | bestprob = float(words[8])
64 | bestcov = float(words[9])
65 |
66 | judge = 0
67 | if rank < 1.5:
68 | judge += 1
69 | if qscore > 0.5:
70 | judge += 1
71 | if ztile < 0.75 and ztile >= 0:
72 | judge += 1
73 | if qtile < 0.75 and qtile >= 0:
74 | judge += 1
75 | if znorm > 0.225:
76 | judge += 1
77 |
78 | seqjudge = 'no'
79 | if bestprob >= 20 and bestcov >= 0.2:
80 | judge += 1
81 | seqjudge = 'low'
82 | if bestprob >= 50 and bestcov >= 0.3:
83 | judge += 1
84 | seqjudge = 'medium'
85 | if bestprob >= 80 and bestcov >= 0.4:
86 | judge += 1
87 | seqjudge = 'high'
88 | if bestprob >= 95 and bestcov >= 0.6:
89 | judge += 1
90 | seqjudge = 'superb'
91 |
92 | if judge:
93 | seg_strings = words[10].split(',')
94 | filt_segs = []
95 | for seg in words[10].split(','):
96 | start = int(seg.split('-')[0])
97 | end = int(seg.split('-')[1])
98 | for res in range(start, end + 1):
99 | if not filt_segs:
100 | filt_segs.append([res])
101 | else:
102 | if res > filt_segs[-1][-1] + 10:
103 | filt_segs.append([res])
104 | else:
105 | filt_segs[-1].append(res)
106 |
107 | filt_seg_strings = []
108 | total_good_count = 0
109 | for seg in filt_segs:
110 | start = seg[0]
111 | end = seg[-1]
112 | good_count = 0
113 | for res in range(start, end + 1):
114 | good_count += 1
115 | if good_count >= 5:
116 | total_good_count += good_count
117 | filt_seg_strings.append(f'{str(start)}-{str(end)}')
118 | if total_good_count >= 25:
119 | results.append('structure\t' + seqjudge + '\t' + prefix + '\t' + str(znorm) + '\t' + '\t'.join(words[:10]) + '\t' + ','.join(seg_strings) + '\t' + ','.join(filt_seg_strings) + '\n')
120 | fp.close()
121 |
122 | if results:
123 | rp = open(f'{prefix}.goodDomains', 'w')
124 | for line in results:
125 | rp.write(line)
126 | rp.close()
127 |
--------------------------------------------------------------------------------
/v1.0/step11_get_sse.py:
--------------------------------------------------------------------------------
1 | import os, sys
2 | import numpy as np
3 |
4 | prefix = sys.argv[1]
5 | wd = sys.argv[2]
6 | if os.getcwd() != wd:
7 | os.chdir(wd)
8 |
9 | os.system(f'mkdssp -i {prefix}.pdb -o {prefix}.dssp')
10 | fp = open(f'{prefix}.fa', 'r')
11 | for line in fp:
12 | if line[0] != '>':
13 | seq = line[:-1]
14 | fp.close()
15 |
16 | fp = open(f'{prefix}.dssp', 'r')
17 | start = 0
18 | dssp_result = ''
19 | resids = []
20 | for line in fp:
21 | words = line.split()
22 | if len(words) > 3:
23 | if words[0] == '#' and words[1] == 'RESIDUE':
24 | start = 1
25 | elif start:
26 | try:
27 | resid = int(line[5:10])
28 | getit = 1
29 | except ValueError:
30 | getit = 0
31 |
32 | if getit:
33 | pred = line[16]
34 | resids.append(resid)
35 | pred = line[16]
36 | if pred == 'E' or pred == 'B':
37 | newpred = 'E'
38 | elif pred == 'G' or pred == 'H' or pred == 'I':
39 | newpred = 'H'
40 | else:
41 | newpred = '-'
42 | dssp_result += newpred
43 | fp.close()
44 |
45 | res2sse = {}
46 | dssp_segs = dssp_result.split('--')
47 | posi = 0
48 | Nsse = 0
49 | for dssp_seg in dssp_segs:
50 | judge = 0
51 | if dssp_seg.count('E') >= 3 or dssp_seg.count('H') >= 6:
52 | Nsse += 1
53 | judge = 1
54 | for char in dssp_seg:
55 | resid = resids[posi]
56 | if char != '-':
57 | if judge:
58 | res2sse[resid] = [Nsse, char]
59 | posi += 1
60 | posi += 2
61 |
62 | os.system(f'rm {prefix}.dssp')
63 | if len(resids) != len(seq):
64 | print (f'error\t{prefix}\t{len(resids)}\t{len(seq)}')
65 | else:
66 | rp = open(f'{prefix}.sse', 'w')
67 | for resid in resids:
68 | try:
69 | rp.write(f'{resid}\t{seq[resid - 1]}\t{res2sse[resid][0]}\t{res2sse[resid][1]}\n')
70 | except KeyError:
71 | rp.write(f'{resid}\t{seq[resid - 1]}\tna\tC\n')
72 | rp.close()
73 |
--------------------------------------------------------------------------------
/v1.0/step12_get_diso.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import os, sys, json, string
3 |
4 | prefix = sys.argv[1]
5 | wd = sys.argv[2]
6 | if os.getcwd() != wd:
7 | os.chdir(wd)
8 |
9 | insses = set([])
10 | res2sse = {}
11 | fp = open(f'{prefix}.sse', 'r')
12 | for line in fp:
13 | words = line.split()
14 | if words[2] != 'na':
15 | sseid = int(words[2])
16 | resid = int(words[0])
17 | insses.add(resid)
18 | res2sse[resid] = sseid
19 | fp.close()
20 |
21 | hit_resids = set([])
22 | if os.path.exists(f'{prefix}.goodDomains'):
23 | fp = open(f'{prefix}.goodDomains', 'r')
24 | for line in fp:
25 | words = line.split()
26 | if words[0] == 'sequence':
27 | segs = words[8].split(',')
28 | elif words[0] == 'structure':
29 | segs = words[14].split(',')
30 | for seg in segs:
31 | if '-' in seg:
32 | start = int(seg.split('-')[0])
33 | end = int(seg.split('-')[1])
34 | for resid in range(start, end+1):
35 | hit_resids.add(resid)
36 | else:
37 | resid = int(seg)
38 | hit_resids.add(resid)
39 | fp.close()
40 |
41 | fp = open(f'{prefix}.json','r')
42 | text = fp.read()[1:-1]
43 | fp.close()
44 | json_dict = json.loads(text)
45 |
46 | if 'predicted_aligned_error' in json_dict.keys():
47 | paes = json_dict['predicted_aligned_error']
48 | length = len(paes)
49 | rpair2error = {}
50 | for i in range(length):
51 | res1 = i + 1
52 | try:
53 | rpair2error[res1]
54 | except KeyError:
55 | rpair2error[res1] = {}
56 | for j in range(length):
57 | res2 = j + 1
58 | rpair2error[res1][res2] = paes[i][j]
59 |
60 | elif 'distance' in json_dict.keys():
61 | resid1s = json_dict['residue1']
62 | resid2s = json_dict['residue2']
63 | prot_len1 = max(resid1s)
64 | prot_len2 = max(resid2s)
65 | if prot_len1 != prot_len2:
66 | print (f'error, matrix is not a square with shape ({str(prot_len1)}, {str(prot_len2)})')
67 | else:
68 | length = prot_len1
69 |
70 | allerrors = json_dict['distance']
71 | mtx_size = len(allerrors)
72 |
73 | rpair2error = {}
74 | for i in range(mtx_size):
75 | res1 = resid1s[i]
76 | res2 = resid2s[i]
77 | try:
78 | rpair2error[res1]
79 | except KeyError:
80 | rpair2error[res1] = {}
81 | rpair2error[res1][res2] = allerrors[i]
82 |
83 | else:
84 | print ('error')
85 |
86 | res2contacts = {}
87 | for i in range(length):
88 | for j in range(length):
89 | res1 = i + 1
90 | res2 = j + 1
91 | error = rpair2error[res1][res2]
92 | if res1 + 20 <= res2 and error < 6:
93 | if res2 in insses:
94 | if res1 in insses and res2sse[res1] == res2sse[res2]:
95 | pass
96 | else:
97 | try:
98 | res2contacts[res1].append(res2)
99 | except KeyError:
100 | res2contacts[res1] = [res2]
101 | if res1 in insses:
102 | if res2 in insses and res2sse[res2] == res2sse[res1]:
103 | pass
104 | else:
105 | try:
106 | res2contacts[res2].append(res1)
107 | except KeyError:
108 | res2contacts[res2] = [res1]
109 |
110 |
111 | diso_resids = set([])
112 | for start in range(1, length - 4):
113 | total_contact = 0
114 | hitres_count = 0
115 | for res in range(start, start + 5):
116 | if res in hit_resids:
117 | hitres_count += 1
118 | if res in insses:
119 | try:
120 | total_contact += len(res2contacts[res])
121 | except KeyError:
122 | pass
123 | if total_contact <= 5 and hitres_count <= 2:
124 | for res in range(start, start + 5):
125 | diso_resids.add(res)
126 |
127 | diso_resids_list = list(diso_resids)
128 | diso_resids_list.sort()
129 |
130 | rp = open(f'{prefix}.diso', 'w')
131 | for resid in diso_resids_list:
132 | rp.write(f'{resid}\n')
133 | rp.close()
134 |
--------------------------------------------------------------------------------
/v1.0/step1_get_AFDB_pdbs.py:
--------------------------------------------------------------------------------
1 | #!/usr1/local/bin/python
2 | import os, sys, string
3 | import pdbx
4 | from pdbx.reader.PdbxReader import PdbxReader
5 |
6 | three2one = {}
7 | three2one["ALA"] = "A"
8 | three2one["CYS"] = "C"
9 | three2one["ASP"] = "D"
10 | three2one["GLU"] = "E"
11 | three2one["PHE"] = "F"
12 | three2one["GLY"] = "G"
13 | three2one["HIS"] = "H"
14 | three2one["ILE"] = "I"
15 | three2one["LYS"] = "K"
16 | three2one["LEU"] = "L"
17 | three2one["MET"] = "M"
18 | three2one["MSE"] = "M"
19 | three2one["ASN"] = "N"
20 | three2one["PRO"] = "P"
21 | three2one["GLN"] = "Q"
22 | three2one["ARG"] = "R"
23 | three2one["SER"] = "S"
24 | three2one["THR"] = "T"
25 | three2one["VAL"] = "V"
26 | three2one["TRP"] = "W"
27 | three2one["TYR"] = "Y"
28 |
29 | prefix = sys.argv[1]
30 | wd=sys.argv[2]
31 | if os.getcwd() != wd:
32 | os.chdir(wd)
33 |
34 | if os.path.exists(prefix+'.cif') and os.path.exists(prefix + ".fa"):
35 | fp = open(prefix + ".fa", "r")
36 | myseq = ""
37 | for line in fp:
38 | if line[0] == ">":
39 | pass
40 | else:
41 | myseq += line[:-1]
42 | fp.close()
43 |
44 | cif = open(prefix + ".cif", "r")
45 | pRd = PdbxReader(cif)
46 | data = []
47 | pRd.read(data)
48 | block = data[0]
49 |
50 | atom_site = block.getObj("atom_site")
51 | record_type_index = atom_site.getIndex("group_PDB")
52 | atom_type_index = atom_site.getIndex("type_symbol")
53 | atom_identity_index = atom_site.getIndex("label_atom_id")
54 | residue_type_index = atom_site.getIndex("label_comp_id")
55 | chain_id_index = atom_site.getIndex("label_asym_id")
56 | residue_id_index = atom_site.getIndex("label_seq_id")
57 | coor_x_index = atom_site.getIndex("Cartn_x")
58 | coor_y_index = atom_site.getIndex("Cartn_y")
59 | coor_z_index = atom_site.getIndex("Cartn_z")
60 | alt_id_index = atom_site.getIndex("label_alt_id")
61 | model_num_index = atom_site.getIndex("pdbx_PDB_model_num")
62 |
63 | if model_num_index == -1:
64 | mylines = []
65 | for i in range(atom_site.getRowCount()):
66 | words = atom_site.getRow(i)
67 | chain_id = words[chain_id_index]
68 | record_type = words[record_type_index]
69 | if chain_id == "A" and record_type == "ATOM":
70 | mylines.append(words)
71 | else:
72 | model2lines = {}
73 | models = []
74 | for i in range(atom_site.getRowCount()):
75 | words = atom_site.getRow(i)
76 | chain_id = words[chain_id_index]
77 | record_type = words[record_type_index]
78 | model_num = int(words[model_num_index])
79 | if chain_id == "A" and record_type == "ATOM":
80 | try:
81 | model2lines[model_num].append(words)
82 | except KeyError:
83 | model2lines[model_num] = [words]
84 | models.append(model_num)
85 | best_model = min(models)
86 | mylines = model2lines[best_model]
87 |
88 | goodlines = []
89 | resid2altid = {}
90 | resid2aa = {}
91 | atom_count = 0
92 | for words in mylines:
93 | atom_type = words[atom_type_index]
94 | atom_identity = words[atom_identity_index]
95 | residue_type = words[residue_type_index]
96 | residue_id = int(words[residue_id_index])
97 | alt_id = words[alt_id_index]
98 |
99 | if atom_identity == "CA":
100 | try:
101 | resid2aa[residue_id] = three2one[residue_type]
102 | except KeyError:
103 | resid2aa[residue_id] = "X"
104 |
105 | get_line = 0
106 | if alt_id == ".":
107 | get_line = 1
108 | else:
109 | try:
110 | if resid2altid[residue_id] == alt_id:
111 | get_line = 1
112 | else:
113 | get_line = 0
114 | except KeyError:
115 | resid2altid[residue_id] = alt_id
116 | get_line = 1
117 |
118 | if get_line:
119 | atom_count += 1
120 | coor_x_info = words[coor_x_index].split(".")
121 | if len(coor_x_info) >= 2:
122 | coor_x = coor_x_info[0] + "." + coor_x_info[1][:3]
123 | else:
124 | coor_x = coor_x_info[0]
125 | coor_y_info = words[coor_y_index].split(".")
126 | if len(coor_y_info) >= 2:
127 | coor_y = coor_y_info[0] + "." + coor_y_info[1][:3]
128 | else:
129 | coor_y = coor_y_info[0]
130 | coor_z_info = words[coor_z_index].split(".")
131 | if len(coor_z_info) >= 2:
132 | coor_z = coor_z_info[0] + "." + coor_z_info[1][:3]
133 | else:
134 | coor_z = coor_z_info[0]
135 | if len(atom_identity) < 4:
136 | goodlines.append("ATOM " + str(atom_count).rjust(5) + " " + atom_identity.ljust(3) + " " + residue_type.ljust(3) + " A" + str(residue_id).rjust(4) + " " + coor_x.rjust(8) + coor_y.rjust(8) + coor_z.rjust(8) + " 1.00 0.00 " + atom_type + "\n")
137 | elif len(atom_identity) == 4:
138 | goodlines.append("ATOM " + str(atom_count).rjust(5) + " " + atom_identity + " " + residue_type.ljust(3) + " A" + str(residue_id).rjust(4) + " " + coor_x.rjust(8) + coor_y.rjust(8) + coor_z.rjust(8) + " 1.00 0.00 " + atom_type + "\n")
139 |
140 | newseq = ""
141 | for i in range(len(myseq)):
142 | resid = i + 1
143 | try:
144 | newseq += resid2aa[resid]
145 | if resid2aa[resid] == "X":
146 | pass
147 | elif resid2aa[resid] == myseq[i]:
148 | pass
149 | else:
150 | print ("error\t" + prefix)
151 | except KeyError:
152 | newseq += "-"
153 | if newseq == myseq:
154 | rp = open(prefix + ".pdb","w")
155 | for goodline in goodlines:
156 | rp.write(goodline)
157 | rp.close()
158 | else:
159 | sys.exit(1)
160 | elif os.path.exists(prefix + ".pdb") and os.path.exists(prefix + ".fa"):
161 | fp = open(prefix + ".fa", "r")
162 | myseq = ""
163 | for line in fp:
164 | if line[0] == ">":
165 | pass
166 | else:
167 | myseq += line[:-1]
168 | fp.close()
169 |
170 | pdb_new = []
171 | res = []
172 | f = open(prefix + ".pdb")
173 | for line in f:
174 | if line[:4] == 'ATOM':
175 | pdb_new.append(line)
176 | if line[13:16].strip()=='CA':
177 | try:
178 | res.append(three2one[line[17:20]])
179 | except:
180 | res.append('X')
181 | f.close()
182 | if ''.join(res) == myseq:
183 | rp = open(prefix + ".pdb","w")
184 | for i in pdb_new:
185 | rp.write(i)
186 | rp.close()
187 | else:
188 | sys.exit(1)
189 | else:
190 | sys.exit(1)
191 |
--------------------------------------------------------------------------------
/v1.0/step1_get_AFDB_seqs.py:
--------------------------------------------------------------------------------
1 | #!/usr1/local/bin/python
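  | # Extract the protein sequence into <prefix>.fa: from the mmCIF poly_seq records
  | # (mapping modified residues to their parent amino acids) when a .cif is present,
  | # or via pdb2fasta when only a .pdb is available.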
2 | import os, sys
3 | import pdbx
4 | from pdbx.reader.PdbxReader import PdbxReader
5 |
6 | three2one = {}
7 | three2one["ALA"] = 'A'
8 | three2one["CYS"] = 'C'
9 | three2one["ASP"] = 'D'
10 | three2one["GLU"] = 'E'
11 | three2one["PHE"] = 'F'
12 | three2one["GLY"] = 'G'
13 | three2one["HIS"] = 'H'
14 | three2one["ILE"] = 'I'
15 | three2one["LYS"] = 'K'
16 | three2one["LEU"] = 'L'
17 | three2one["MET"] = 'M'
18 | three2one["MSE"] = 'M'
19 | three2one["ASN"] = 'N'
20 | three2one["PRO"] = 'P'
21 | three2one["GLN"] = 'Q'
22 | three2one["ARG"] = 'R'
23 | three2one["SER"] = 'S'
24 | three2one["THR"] = 'T'
25 | three2one["VAL"] = 'V'
26 | three2one["TRP"] = 'W'
27 | three2one["TYR"] = 'Y'
28 |
29 |
30 | prefix = sys.argv[1]
31 | wd = sys.argv[2]
32 | if os.getcwd() != wd:
33 | os.chdir(wd)
34 |
35 |
36 | struc_fn = prefix + '.cif'
37 | if os.path.exists(struc_fn):
38 | cif = open(prefix + ".cif")
39 | else:
40 | if os.path.exists(prefix + ".pdb"):
41 |         os.system('pdb2fasta ' + prefix + '.pdb > ' + prefix + '.fa')
42 | with open(prefix + ".fa") as f:
43 | fa = f.readlines()
44 | fa[0] = fa[0].split(':')[0] + '\n'
45 | with open(prefix + ".fa",'w') as f:
46 |             f.write(''.join(fa))
  |         sys.exit()  # the .fa was produced from the PDB; the mmCIF parsing below only applies to .cif input
47 | else:
48 |         print("No recognized structure file (*.cif or *.pdb). Exiting...")
49 | sys.exit()
50 |
51 | pRd = PdbxReader(cif)
52 | data = []
53 | pRd.read(data)
54 | block = data[0]
55 |
56 | modinfo = {}
57 | mod_residues = block.getObj("pdbx_struct_mod_residue")
58 | if mod_residues:
59 | chainid = mod_residues.getIndex("label_asym_id")
60 | posiid = mod_residues.getIndex("label_seq_id")
61 | parentid = mod_residues.getIndex("parent_comp_id")
62 | resiid = mod_residues.getIndex("label_comp_id")
63 | for i in range(mod_residues.getRowCount()):
64 | words = mod_residues.getRow(i)
65 | try:
66 | modinfo[words[chainid]]
67 | except KeyError:
68 | modinfo[words[chainid]] = {}
69 | modinfo[words[chainid]][words[posiid]] = [words[resiid], words[parentid]]
70 |
71 | entity_poly = block.getObj("entity_poly")
72 | pdbx_poly_seq_scheme = block.getObj("pdbx_poly_seq_scheme")
73 | if pdbx_poly_seq_scheme and entity_poly:
74 | typeid = entity_poly.getIndex("type")
75 | entityid1 = entity_poly.getIndex("entity_id")
76 | entityid2 = pdbx_poly_seq_scheme.getIndex("entity_id")
77 | chainid = pdbx_poly_seq_scheme.getIndex("asym_id")
78 | resiid = pdbx_poly_seq_scheme.getIndex("mon_id")
79 | posiid = pdbx_poly_seq_scheme.getIndex("seq_id")
80 |
81 | good_entities = []
82 | for i in range(entity_poly.getRowCount()):
83 | words = entity_poly.getRow(i)
84 | entity = words[entityid1]
85 | type = words[typeid]
86 | if type == "polypeptide(L)":
87 | good_entities.append(entity)
88 |
89 | if good_entities:
90 | chains = []
91 | residues = {}
92 | seqs = {}
93 | rp = open(prefix + ".fa","w")
94 | for i in range(pdbx_poly_seq_scheme.getRowCount()):
95 | words = pdbx_poly_seq_scheme.getRow(i)
96 | entity = words[entityid2]
97 | if entity in good_entities:
98 | chain = words[chainid]
99 |
100 | try:
101 | aa = three2one[words[resiid]]
102 | except KeyError:
103 | try:
104 | modinfo[chain][words[posiid]]
105 | resiname = modinfo[chain][words[posiid]][0]
106 | if words[resiid] == resiname:
107 | new_resiname = modinfo[chain][words[posiid]][1]
108 | try:
109 | aa = three2one[new_resiname]
110 | except KeyError:
111 | aa = "X"
112 | print ("error1 " + new_resiname)
113 | else:
114 | aa = "X"
115 | print ("error2 " + words[resiid] + " " + resiname)
116 | except KeyError:
117 | print (modinfo)
118 | print (words[resiid])
119 | aa = "X"
120 |
121 | try:
122 | seqs[chain]
123 | except KeyError:
124 | chains.append(chain)
125 | seqs[chain] = {}
126 |
127 | try:
128 | if seqs[chain][int(words[posiid])] == "X" and aa != "X":
129 | seqs[chain][int(words[posiid])] = aa
130 | except KeyError:
131 | seqs[chain][int(words[posiid])] = aa
132 |
133 | try:
134 | residues[chain].add(int(words[posiid]))
135 | except KeyError:
136 | residues[chain] = set([int(words[posiid])])
137 |
138 | for chain in chains:
139 | for i in range(len(residues[chain])):
140 | if not i + 1 in residues[chain]:
141 | print ("error3 " + prefix + " " + chain)
142 | break
143 | else:
144 | rp.write(">" + prefix + "\n")
145 | finalseq = []
146 | for i in range(len(residues[chain])):
147 | finalseq.append(seqs[chain][i+1])
148 | rp.write("".join(finalseq) + "\n")
149 | rp.close()
150 | else:
151 | print ("empty " + prefix)
152 | else:
153 | print ("bad " + prefix)
154 |
--------------------------------------------------------------------------------
/v1.0/step2_run_hhsearch.py:
--------------------------------------------------------------------------------
1 | import os, sys
2 |
3 | prefix = sys.argv[1]
4 | CPUs = sys.argv[2]
5 | wd = sys.argv[3]
6 | data_dir=sys.argv[4]
7 | if os.getcwd() != wd:
8 | os.chdir(wd)
9 |
10 | print (f'hhblits -cpu {CPUs} -i {prefix}.fa -d {data_dir}/UniRef30_2022_02/UniRef30_2022_02 -oa3m {prefix}.a3m')
11 | os.system(f'hhblits -cpu {CPUs} -i {prefix}.fa -d {data_dir}/UniRef30_2022_02/UniRef30_2022_02 -oa3m {prefix}.a3m')
12 | print (f'addss.pl {prefix}.a3m {prefix}.a3m.ss -a3m')
13 | os.system(f'addss.pl {prefix}.a3m {prefix}.a3m.ss -a3m')
14 | os.system(f'mv {prefix}.a3m.ss {prefix}.a3m')
15 | print (f'hhmake -i {prefix}.a3m -o {prefix}.hmm')
16 | os.system(f'hhmake -i {prefix}.a3m -o {prefix}.hmm')
17 | print (f'hhsearch -cpu {CPUs} -Z 100000 -B 100000 -i {prefix}.hmm -d {data_dir}/pdb70/pdb70 -o {prefix}.hhsearch')
18 | os.system(f'hhsearch -cpu {CPUs} -Z 100000 -B 100000 -i {prefix}.hmm -d {data_dir}/pdb70/pdb70 -o {prefix}.hhsearch')
19 |
--------------------------------------------------------------------------------
/v1.0/step3_run_foldseek.py:
--------------------------------------------------------------------------------
1 | import os, sys,time
2 |
3 |
4 | prefix = sys.argv[1]
5 | threads = sys.argv[2]
6 | wd = sys.argv[3]
7 | data_dir = sys.argv[4]
8 |
9 | if os.getcwd() != wd:
10 | os.chdir(wd)
11 | if not os.path.exists('foldseek_tmp'):
12 | os.system('mkdir foldseek_tmp')
13 |
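  | # structure search of the query model against the ECOD domain library with Foldseek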
14 | os.system(f'foldseek easy-search {prefix}.pdb {data_dir}/ECOD_foldseek_DB/ECOD_foldseek_DB {prefix}.foldseek foldseek_tmp -e 1000000 --max-seqs 1000000 --threads {threads} > {prefix}_foldseek.log')
15 | os.system('rm -rf foldseek_tmp')
16 |
--------------------------------------------------------------------------------
/v1.0/step4_filter_foldseek.py:
--------------------------------------------------------------------------------
1 | import sys,os
2 |
3 | prefix = sys.argv[1]
4 | wd=sys.argv[2]
5 | if os.getcwd() != wd:
6 | os.chdir(wd)
7 |
8 | fp = open(prefix + '.fa','r')
9 | query_seq = ''
10 | for line in fp:
11 | if line[0] != '>':
12 | query_seq += line[:-1]
13 | fp.close()
14 | qlen = len(query_seq)
15 |
16 | fp = open(prefix + '.foldseek', 'r')
17 | hits = []
18 | for line in fp:
19 | words = line.split()
20 | dnum = words[1].split('.')[0]
21 | qstart = int(words[6])
22 | qend = int(words[7])
23 | qresids = set([])
24 | for qres in range(qstart, qend + 1):
25 | qresids.add(qres)
26 | evalue = float(words[10])
27 | hits.append([dnum, evalue, qstart, qend, qresids])
28 | fp.close()
29 | hits.sort(key = lambda x:x[1])
30 |
31 | qres2count = {}
32 | for res in range(1, qlen + 1):
33 | qres2count[res] = 0
34 |
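  | # hits are processed in order of increasing e-value; a hit is kept only if at least
  | # 10 of its query residues are covered by no more than 100 hits so far (a per-residue
  | # redundancy cap)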
35 | rp = open(prefix + '.foldseek.flt.result', 'w')
36 | rp.write('ecodnum\tevalue\trange\n')
37 | for hit in hits:
38 | dnum = hit[0]
39 | evalue = hit[1]
40 | qstart = hit[2]
41 | qend = hit[3]
42 | qresids = hit[4]
43 | for res in qresids:
44 | qres2count[res] += 1
45 | good_res = 0
46 | for res in qresids:
47 | if qres2count[res] <= 100:
48 | good_res += 1
49 | if good_res >= 10:
50 | rp.write(dnum + '\t' + str(evalue) + '\t' + str(qstart) + '-' + str(qend) + '\n')
51 | rp.close()
52 |
--------------------------------------------------------------------------------
/v1.0/step5_map_to_ecod.py:
--------------------------------------------------------------------------------
1 | import sys,os
2 |
3 | def get_range(resids, chainid):
4 | resids.sort()
5 | segs = []
6 | for resid in resids:
7 | if not segs:
8 | segs.append([resid])
9 | else:
10 | if resid > segs[-1][-1] + 1:
11 | segs.append([resid])
12 | else:
13 | segs[-1].append(resid)
14 | ranges = []
15 | for seg in segs:
16 | if chainid:
17 | ranges.append(chainid + ':' + str(seg[0]) + '-' + str(seg[-1]))
18 | else:
19 | ranges.append(str(seg[0]) + '-' + str(seg[-1]))
20 | return ','.join(ranges)
21 |
22 |
23 | prefix = sys.argv[1]
24 | wd = sys.argv[2]
25 | data_dir = sys.argv[3]
26 | if os.getcwd() != wd:
27 | os.chdir(wd)
28 |
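  | # parse the HHsearch output: record the summary statistics and the aligned
  | # query/template sequences and ranges for every hit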
29 | fp = open(prefix + '.hhsearch', 'r')
30 | info = fp.read().split('\n>')
31 | fp.close()
32 | allhits = []
33 | need_pdbchains = set([])
34 | need_pdbs = set([])
35 | for hit in info[1:]:
36 | lines = hit.split('\n')
37 | qstart = 0
38 | qend = 0
39 | qseq = ''
40 | hstart = 0
41 | hend = 0
42 | hseq = ''
43 | for line in lines:
44 | if len(line) >= 6:
45 | if line[:6] == 'Probab':
46 | words = line.split()
47 | for word in words:
48 | subwords = word.split('=')
49 | if subwords[0] == 'Probab':
50 | hh_prob = subwords[1]
51 | elif subwords[0] == 'E-value':
52 | hh_eval = subwords[1]
53 | elif subwords[0] == 'Score':
54 | hh_score = subwords[1]
55 | elif subwords[0] == 'Aligned_cols':
56 | aligned_cols = subwords[1]
57 | elif subwords[0] == 'Identities':
58 | idents = subwords[1]
59 | elif subwords[0] == 'Similarity':
60 | similarities = subwords[1]
61 | elif subwords[0] == 'Sum_probs':
62 | sum_probs = subwords[1]
63 |
64 | elif line[:2] == 'Q ':
65 | words = line.split()
66 | if words[1] != 'ss_pred' and words[1] != 'Consensus':
67 | qseq += words[3]
68 | if not qstart:
69 | qstart = int(words[2])
70 | qend = int(words[4])
71 |
72 | elif line[:2] == 'T ':
73 | words = line.split()
74 | if words[1] != 'Consensus' and words[1] != 'ss_dssp' and words[1] != 'ss_pred':
75 | hid = words[1]
76 | hseq += words[3]
77 | if not hstart:
78 | hstart = int(words[2])
79 | hend = int(words[4])
  |     # record the hit once, after all of its alignment lines have been parsed
80 |     allhits.append([hid, hh_prob, hh_eval, hh_score, aligned_cols, idents, similarities, sum_probs, qstart, qend, qseq, hstart, hend, hseq])
81 |     need_pdbchains.add(hid)
82 |     need_pdbs.add(hid.split('_')[0].lower())
83 |
84 |
85 | fp = open(data_dir + '/ECOD_pdbmap','r')
86 | pdb2ecod = {}
87 | good_hids = set([])
88 | for line in fp:
89 | words = line.split()
90 | pdbid = words[1]
91 | segments = words[2].split(',')
92 | chainids = set([])
93 | resids = []
94 | for segment in segments:
95 | chainids.add(segment.split(':')[0])
96 | if '-' in segment:
97 | start = int(segment.split(':')[1].split('-')[0])
98 | end = int(segment.split(':')[1].split('-')[1])
99 | for res in range(start, end + 1):
100 | resids.append(res)
101 | else:
102 | resid = int(segment.split(':')[1])
103 | resids.append(resid)
104 | if len(chainids) == 1:
105 | chainid = list(chainids)[0]
106 | pdbchain = pdbid.upper() + '_' + chainid
107 | if pdbchain in need_pdbchains:
108 | good_hids.add(pdbchain)
109 | pdb2ecod[pdbchain] = {}
110 | for i, resid in enumerate(resids):
111 | pdb2ecod[pdbchain][resid] = words[0] + ':' + str(i + 1)
112 | else:
113 | print (line[:-1])
114 | fp.close()
115 |
116 | ecod2key = {}
117 | ecod2len = {}
118 | fp = open(data_dir + '/ECOD_length','r')
119 | for line in fp:
120 | words = line.split()
121 | ecod2key[words[0]] = words[1]
122 | ecod2len[words[0]] = int(words[2])
123 | fp.close()
124 |
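  | # transfer each HHsearch alignment onto ECOD domains via ECOD_pdbmap; hits mapping
  | # at least 10 residues to a domain are reported with their coverage and ranges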
125 | rp = open(prefix + '.map2ecod.result', 'w')
126 | rp.write('uid\tecod_domain_id\thh_prob\thh_eval\thh_score\taligned_cols\tidents\tsimilarities\tsum_probs\tcoverage\tungapped_coverage\tquery_range\ttemplate_range\ttemplate_seqid_range\n')
127 | for hit in allhits:
128 | hid = hit[0]
129 | pdbid = hid.split('_')[0]
130 | chainid = hid.split('_')[1]
131 | ecods = []
132 | ecod2hres = {}
133 | ecod2hresmap = {}
134 | if hid in good_hids:
135 | for pdbres in pdb2ecod[hid].keys():
136 | for item in pdb2ecod[hid][pdbres].split(','):
137 | ecod = item.split(':')[0]
138 | ecodres = int(item.split(':')[1])
139 | try:
140 | ecod2hres[ecod]
141 | ecod2hresmap[ecod]
142 | except KeyError:
143 | ecods.append(ecod)
144 | ecod2hres[ecod] = set([])
145 | ecod2hresmap[ecod] = {}
146 | ecod2hres[ecod].add(pdbres)
147 | ecod2hresmap[ecod][pdbres] = ecodres
148 |
149 | hh_prob = hit[1]
150 | hh_eval = hit[2]
151 | hh_score = hit[3]
152 | aligned_cols = hit[4]
153 | idents = hit[5]
154 | similarities = hit[6]
155 | sum_probs = hit[7]
156 | qstart = hit[8]
157 | qseq = hit[10]
158 | hstart = hit[11]
159 | hseq = hit[13]
160 |
161 | for ecod in ecods:
162 | ecodkey = ecod2key[ecod]
163 | ecodlen = ecod2len[ecod]
164 | qposi = qstart - 1
165 | hposi = hstart - 1
166 | qresids = []
167 | hresids = []
168 | eresids = []
169 | if len(qseq) == len(hseq):
170 | for i in range(len(hseq)):
171 | if qseq[i] != '-':
172 | qposi += 1
173 | if hseq[i] != '-':
174 | hposi += 1
175 | if qseq[i] != '-' and hseq[i] != '-':
176 | if hposi in ecod2hres[ecod]:
177 | eposi = ecod2hresmap[ecod][hposi]
178 | qresids.append(qposi)
179 | hresids.append(hposi)
180 | eresids.append(eposi)
181 | if len(qresids) >= 10 and len(eresids) >= 10:
182 | qrange = get_range(qresids,'')
183 | hrange = get_range(hresids, chainid)
184 | erange = get_range(eresids,'')
185 | coverage = round(len(eresids) / ecodlen, 3)
186 | ungapped_coverage = round((max(eresids) - min(eresids) + 1) / ecodlen, 3)
187 | rp.write(ecod + '\t' + ecodkey + '\t' + hh_prob + '\t' + hh_eval + '\t' + hh_score + '\t' + aligned_cols + '\t' + idents + '\t' + similarities + '\t' + sum_probs + '\t' + str(coverage) + '\t' + str(ungapped_coverage) + '\t' + qrange + '\t' + erange + '\t' + hrange + '\n')
188 | else:
189 |             print ('error\t' + prefix + '\t' + ecod)
190 | rp.close()
191 |
--------------------------------------------------------------------------------
/v1.0/step6_get_dali_candidates.py:
--------------------------------------------------------------------------------
1 | import sys,os
2 |
3 | prefix = sys.argv[1]
4 | wd = sys.argv[2]
5 |
6 | if os.getcwd() != wd:
7 | os.chdir(wd)
8 |
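  | # Dali candidates are the union of ECOD domains found by sequence (map2ecod)
  | # and by structure (filtered Foldseek hits)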
9 | domains = set([])
10 | fp = open(prefix + '.map2ecod.result', 'r')
11 | for countl, line in enumerate(fp):
12 | if countl:
13 | words = line.split()
14 | domains.add(words[0])
15 | fp.close()
16 |
17 | fp = open(prefix + '.foldseek.flt.result','r')
18 | for countl, line in enumerate(fp):
19 | if countl:
20 | words = line.split()
21 | domains.add(words[0])
22 | fp.close()
23 |
24 | rp = open(prefix + '_hits4Dali', 'w')
25 | for domain in domains:
26 | rp.write(domain + '\n')
27 | rp.close()
28 |
--------------------------------------------------------------------------------
/v1.0/step7_iterative_dali_aug_multi.py:
--------------------------------------------------------------------------------
1 | import os, sys
2 | import time
3 | from multiprocessing import Pool
4 |
5 | prefix = sys.argv[1]
6 | CPUs = sys.argv[2]
7 | wd = sys.argv[3]
8 | data_dir=sys.argv[4]
9 |
10 |
11 | def get_domain_range(resids):
12 | segs = []
13 | resids.sort()
14 | cutoff1 = 5
15 | cutoff2 = len(resids) * 0.05
16 | cutoff = max(cutoff1, cutoff2)
17 | for resid in resids:
18 | if not segs:
19 | segs.append([resid])
20 | else:
21 | if resid > segs[-1][-1] + cutoff:
22 | segs.append([resid])
23 | else:
24 | segs[-1].append(resid)
25 | seg_string = []
26 | for seg in segs:
27 | start = str(seg[0])
28 | end = str(seg[-1])
29 | seg_string.append(start + '-' + end)
30 | return ','.join(seg_string)
31 |
32 |
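  | # Iterative Dali against one candidate ECOD domain (run per domain via a
  | # multiprocessing Pool): alignments with >=20 matched residues are recorded, the
  | # aligned region is removed from the query, and the search repeats until fewer
  | # than 20 residues remain or no further alignment is found.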
33 | def run_dali(edomain):
34 | alicount = 0
35 | os.system(f'mkdir {wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}')
36 | os.system(f'cp {wd}/{prefix}.pdb {wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/{prefix}_{edomain}.pdb')
37 | os.system(f'mkdir {wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp')
38 | os.chdir(f'{wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp')
39 | while True:
40 |         os.system(f'dali.pl --pdbfile1 ../{prefix}_{edomain}.pdb --pdbfile2 {data_dir}/ECOD70/{edomain}.pdb --dat1 ./ --dat2 ./ --outfmt summary,alignments,transrot > log 2>&1')
41 | fp = os.popen('ls -1 mol*.txt')
42 | info = fp.readlines()
43 | fp.close()
44 | filenames = []
45 | for line in info:
46 | filenames.append(line[:-1])
47 |
48 | if filenames:
49 | fp = open('../' + prefix + '_' + edomain + '.pdb', 'r')
50 | Qresids_set = set([])
51 | for line in fp:
52 | resid = int(line[22:26])
53 | Qresids_set.add(resid)
54 | fp.close()
55 |                 Qresids = sorted(Qresids_set)  # sort so alignment positions map back to residue ids in file order
56 |
57 | info = []
58 | for filename in filenames:
59 | fp = open(filename, "r")
60 | lines = fp.readlines()
61 | fp.close()
62 | for line in lines:
63 | info.append(line)
64 |
65 | qali = ''
66 | sali = ''
67 | getit = 1
68 | zscore = 0
69 | for line in info:
70 | words = line.split()
71 | if len(words) >= 2 and getit:
72 | if words[0] == 'Query':
73 | qali += words[1]
74 | elif words[0] == 'Sbjct':
75 | sali += words[1]
76 | elif words[0] == 'No' and words[1] == '1:':
77 | for word in words:
78 | if '=' in word:
79 | subwords = word.split('=')
80 | if subwords[0] == 'Z-score':
81 | zinfo = subwords[1].split('.')
82 | zscore = float(zinfo[0] + '.' + zinfo[1])
83 | elif words[0] == 'No' and words[1] == '2:':
84 | getit = 0
85 |
86 | qinds = []
87 | sinds = []
88 | length = len(qali)
89 | qposi = 0
90 | sposi = 0
91 | match = 0
92 | for i in range(length):
93 | if qali[i] != '-':
94 | qposi += 1
95 | if sali[i] != '-':
96 | sposi += 1
97 | if qali[i] != '-' and sali[i] != '-':
98 | if qali[i].isupper() and sali[i].isupper():
99 | match += 1
100 | qinds.append(qposi)
101 | sinds.append(sposi)
102 | qlen = qposi
103 | slen = sposi
104 |
105 | if match >= 20:
106 | alicount += 1
107 | rp = open(f'{wd}/iterativeDali_{prefix}/{prefix}_{edomain}_hits', 'a')
108 | rp.write('>' + edomain + '_' + str(alicount) + '\t' + str(zscore) + '\t' + str(match) + '\t' + str(qlen) + '\t' + str(slen) + '\n')
109 | for i in range(len(qinds)):
110 | qind = qinds[i] - 1
111 | sind = sinds[i]
112 | rp.write(str(Qresids[qind]) + '\t' + str(sind) + '\n')
113 | rp.close()
114 |
115 | raw_qresids = []
116 | for qind in qinds:
117 | raw_qresids.append(Qresids[qind - 1])
118 | qrange = get_domain_range(raw_qresids)
119 | qresids = set([])
120 | qsegs = qrange.split(',')
121 | for qseg in qsegs:
122 | qedges = qseg.split('-')
123 | qstart = int(qedges[0])
124 | qend = int(qedges[1])
125 | for qres in range(qstart, qend + 1):
126 | qresids.add(qres)
127 | remain_resids = Qresids_set.difference(qresids)
128 |
129 | if len(remain_resids) >= 20:
130 | rp = open('../' + prefix + '_' + edomain + '.pdbnew', 'w')
131 | fp = open('../' + prefix + '_' + edomain + '.pdb', 'r')
132 | for line in fp:
133 | resid = int(line[22:26])
134 | if resid in remain_resids:
135 | rp.write(line)
136 | fp.close()
137 | rp.close()
138 | os.system('mv ../' + prefix + '_' + edomain + '.pdbnew ../' + prefix + '_' + edomain + '.pdb')
139 | if os.getcwd() == f'{wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp':
140 | os.system('rm *')
141 | else:
142 | if os.getcwd() == f'{wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp':
143 | os.system('rm *')
144 | break
145 | else:
146 | if os.getcwd() == f'{wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp':
147 | os.system('rm *')
148 | break
149 | else:
150 | if os.getcwd() == f'{wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp':
151 | os.system('rm *')
152 | break
153 | os.chdir(wd)
154 | time.sleep(1)
155 | os.system(f'rm -rf {wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}')
156 |
157 | if os.getcwd() != wd:
158 | os.chdir(wd)
159 |
160 | if os.path.exists(prefix + '.iterativeDali.done'):
161 | pass
162 | else:
163 | if not os.path.exists(f'{wd}/iterativeDali_{prefix}'):
164 | os.system(f'mkdir {wd}/iterativeDali_{prefix}')
165 | fp = open(prefix + '_hits4Dali','r')
166 | edomains = []
167 | for line in fp:
168 | words = line.split()
169 | edomains.append(words[0])
170 | fp.close()
171 |
172 | inputs = []
173 | for edomain in edomains:
174 | inputs.append([edomain])
175 | pool = Pool(processes = int(CPUs))
176 | results = []
177 | for item in inputs:
178 | process = pool.apply_async(run_dali, item)
179 | results.append(process)
180 | for process in results:
181 | process.get()
182 |
183 | os.system(f'cat {wd}/iterativeDali_{prefix}/{prefix}_*_hits > {prefix}_iterativdDali_hits')
184 |     os.system(f'rm -rf {wd}/iterativeDali_{prefix}/tmp_*')
185 |     os.system(f'rm -rf {wd}/iterativeDali_{prefix}')
186 | os.system(f'echo "done" > {prefix}.iterativeDali.done')
187 |
--------------------------------------------------------------------------------
/v1.0/step8_analyze_dali.py:
--------------------------------------------------------------------------------
1 | import os, sys
2 | import numpy as np
3 |
4 |
5 | def get_range(resids):
6 | resids.sort()
7 | segs = []
8 | for resid in resids:
9 | if not segs:
10 | segs.append([resid])
11 | else:
12 | if resid > segs[-1][-1] + 1:
13 | segs.append([resid])
14 | else:
15 | segs[-1].append(resid)
16 | ranges = []
17 | for seg in segs:
18 | ranges.append(f'{seg[0]}-{seg[-1]}')
19 | return ','.join(ranges)
20 |
21 |
22 | prefix = sys.argv[1]
23 | wd = sys.argv[2]
24 | data_dir = sys.argv[3]
25 | if os.getcwd() != wd:
26 | os.chdir(wd)
27 |
28 | if os.path.exists(f'{prefix}_iterativdDali_hits'):
29 | fp = open(f'{data_dir}/ecod.latest.domains','r')
30 | ecod2id = {}
31 | ecod2fam = {}
32 | for line in fp:
33 | if line[0] != '#':
34 | words = line[:-1].split('\t')
35 | ecodnum = words[0]
36 | ecodid = words[1]
37 | ecodfam = '.'.join(words[3].split('.')[:2])
38 | ecod2id[ecodnum] = ecodid
39 | ecod2fam[ecodnum] = ecodfam
40 | fp.close()
41 |
42 | fp = open(f'{prefix}_iterativdDali_hits','r')
43 | ecodnum = ''
44 | ecodid = ''
45 | ecodfam = ''
46 | hitname = ''
47 | maps = []
48 | hits = []
49 | for line in fp:
50 | if line[0] == '>':
51 | if ecodnum and ecodid and ecodfam and hitname and zscore and maps:
52 | hits.append([hitname, ecodnum, ecodid, ecodfam, zscore, maps])
53 | words = line[1:].split()
54 | zscore = float(words[1])
55 | hitname = words[0]
56 | ecodnum = hitname.split('_')[0]
57 | ecodid = ecod2id[ecodnum]
58 | ecodfam = ecod2fam[ecodnum]
59 | maps = []
60 | else:
61 | words = line.split()
62 | pres = int(words[0])
63 | eres = int(words[1])
64 | maps.append([pres, eres])
65 | fp.close()
66 | if ecodnum and ecodid and ecodfam and hitname and zscore and maps:
67 | hits.append([hitname, ecodnum, ecodid, ecodfam, zscore, maps])
68 |
69 |
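  | # score each Dali hit: qscore is the weighted coverage of the ECOD domain, and
  | # ztile/qtile are the fractions of reference alignments for that domain scoring
  | # higher (lower is better); -1 is used when no reference data are available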
70 | newhits = []
71 | for hit in hits:
72 | hitname = hit[0]
73 | ecodnum = hit[1]
74 | total_weight = 0
75 | posi2weight = {}
76 | zscores = []
77 | qscores = []
78 | if os.path.exists(f'{data_dir}/ecod_weights/{ecodnum}.weight'):
79 | fp = open(f'{data_dir}/ecod_weights/{ecodnum}.weight','r')
80 | posi2weight = {}
81 | for line in fp:
82 | words = line.split()
83 | total_weight += float(words[3])
84 | posi2weight[int(words[0])] = float(words[3])
85 | fp.close()
86 | if os.path.exists(f'{data_dir}/ecod_domain_info/{ecodnum}.info'):
87 | fp = open(f'{data_dir}/ecod_domain_info/{ecodnum}.info','r')
88 | for line in fp:
89 | words = line.split()
90 | zscores.append(float(words[1]))
91 | qscores.append(float(words[2]))
92 | fp.close()
93 | ecodid = hit[2]
94 | ecodfam = hit[3]
95 | zscore = hit[4]
96 | maps = hit[5]
97 |
98 | if zscores and qscores:
99 | qscore = 0
100 | for item in maps:
101 | try:
102 | qscore += posi2weight[item[1]]
103 | except KeyError:
104 | pass
105 |
106 | better = 0
107 | worse = 0
108 | for other_qscore in qscores:
109 | if other_qscore > qscore:
110 | better += 1
111 | else:
112 | worse += 1
113 | qtile = better / (better + worse)
114 |
115 | better = 0
116 | worse = 0
117 | for other_zscore in zscores:
118 | if other_zscore > zscore:
119 | better += 1
120 | else:
121 | worse += 1
122 | ztile = better / (better + worse)
123 | newhits.append([hitname, ecodnum, ecodid, ecodfam, zscore, qscore / total_weight, ztile, qtile, maps])
124 | else:
125 | newhits.append([hitname, ecodnum, ecodid, ecodfam, zscore, -1, -1, -1, maps])
126 |
127 |
128 | newhits.sort(key = lambda x:x[4], reverse = True)
129 | finalhits = []
130 | posi2fams = {}
131 | for hit in newhits:
132 | ecodfam = hit[3]
133 | maps = hit[8]
134 | qposis = []
135 | eposis = []
136 | ranks = []
137 | for item in maps:
138 | qposis.append(item[0])
139 | eposis.append(item[1])
140 | try:
141 | posi2fams[item[0]].add(ecodfam)
142 | except KeyError:
143 | posi2fams[item[0]] = set([ecodfam])
144 | ranks.append(len(posi2fams[item[0]]))
145 | ave_rank = round(np.mean(ranks), 2)
146 | qrange = get_range(qposis)
147 | erange = get_range(eposis)
148 | finalhits.append([hit[0], hit[1], hit[2], hit[3], round(hit[4], 2), round(hit[5], 2), round(hit[6], 2), round(hit[7], 2), ave_rank, qrange, erange])
149 |
150 | rp = open(f'{prefix}_good_hits', 'w')
151 | rp.write('hitname\tecodnum\tecodkey\thgroup\tzscore\tqscore\tztile\tqtile\trank\tqrange\terange\n')
152 | for hit in finalhits:
153 | rp.write(f'{hit[0]}\t{hit[1]}\t{hit[2]}\t{hit[3]}\t{hit[4]}\t{hit[5]}\t{hit[6]}\t{hit[7]}\t{hit[8]}\t{hit[9]}\t{hit[10]}\n')
154 | rp.close()
155 |
--------------------------------------------------------------------------------
/v1.0/step9_get_support.py:
--------------------------------------------------------------------------------
1 | import os, sys
2 |
3 | def get_range(resids):
4 | resids.sort()
5 | segs = []
6 | for resid in resids:
7 | if not segs:
8 | segs.append([resid])
9 | else:
10 | if resid > segs[-1][-1] + 1:
11 | segs.append([resid])
12 | else:
13 | segs[-1].append(resid)
14 | ranges = []
15 | for seg in segs:
16 | ranges.append(f'{seg[0]}-{seg[-1]}')
17 | return ','.join(ranges)
18 |
19 | prefix = sys.argv[1]
20 | wd = sys.argv[2]
21 | data_dir = sys.argv[3]
22 | if os.getcwd() != wd:
23 | os.chdir(wd)
24 |
25 | fp = open(f'{data_dir}/ECOD_length','r')
26 | ecod2len = {}
27 | for line in fp:
28 | words = line.split()
29 | ecod2len[words[0]] = int(words[2])
30 | fp.close()
31 |
32 | fp = open(f'{data_dir}/ecod.latest.domains','r')
33 | ecod2id = {}
34 | ecod2fam = {}
35 | for line in fp:
36 | if line[0] != '#':
37 | words = line[:-1].split('\t')
38 | ecodnum = words[0]
39 | ecodid = words[1]
40 | ecodfam = '.'.join(words[3].split('.')[:2])
41 | ecod2id[ecodnum] = ecodid
42 | ecod2fam[ecodnum] = ecodfam
43 | fp.close()
44 |
45 |
46 | seqhits = []
47 | fp = open(f'{prefix}.map2ecod.result', 'r')
48 | for countl, line in enumerate(fp):
49 | if countl:
50 | words = line.split()
51 | ecodnum = words[0]
52 | ecodlen = ecod2len[ecodnum]
53 | ecodfam = ecod2fam[ecodnum]
54 | prob = float(words[2])
55 | Qsegs = words[11].split(',')
56 | Tsegs = words[12].split(',')
57 | Qresids = []
58 | for seg in Qsegs:
59 | start = int(seg.split('-')[0])
60 | end = int(seg.split('-')[1])
61 | for res in range(start, end + 1):
62 | Qresids.append(res)
63 | Tresids = []
64 | for seg in Tsegs:
65 | start = int(seg.split('-')[0])
66 | end = int(seg.split('-')[1])
67 | for res in range(start, end + 1):
68 | Tresids.append(res)
69 | seqhits.append([ecodnum, ecodlen, ecodfam, prob, Qresids, Tresids])
70 | fp.close()
71 |
72 |
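  | # group the sequence (HHsearch) hits by ECOD H-group and by ECOD domain for the
  | # support calculations below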
73 | fam2hits = {}
74 | fams = set([])
75 | for hit in seqhits:
76 | fam = hit[2]
77 | fams.add(fam)
78 | try:
79 | fam2hits[fam].append([hit[3], hit[1], hit[4], hit[5]])
80 | except KeyError:
81 | fam2hits[fam] = [[hit[3], hit[1], hit[4], hit[5]]]
82 |
83 | ecods = []
84 | ecod2hits = {}
85 | for hit in seqhits:
86 | ecodnum = hit[0]
87 | ecodlen = hit[1]
88 | ecodfam = hit[2]
89 | prob = hit[3]
90 | Qresids = hit[4]
91 | Tresids = hit[5]
92 | Qset = set(Qresids)
93 | try:
94 | ecod2hits[ecodnum].append([prob, ecodfam, ecodlen, Qresids, Tresids, Qset])
95 | except KeyError:
96 | ecod2hits[ecodnum] = [[prob, ecodfam, ecodlen, Qresids, Tresids, Qset]]
97 | ecods.append(ecodnum)
98 |
99 |
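  | # sequence support: per ECOD domain, keep HHsearch hits with probability >= 50 and
  | # template coverage >= 0.4, skipping hits whose template residues overlap already
  | # accepted hits by more than half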
100 | rp = open(f'{prefix}_sequence.result', 'w')
101 | for ecodnum in ecods:
102 | ecodid = ecod2id[ecodnum]
103 | ecod2hits[ecodnum].sort(key = lambda x:x[0], reverse = True)
104 | get_resids = set([])
105 | mycount = 0
106 | for hit in ecod2hits[ecodnum]:
107 | hit_prob = hit[0]
108 | hit_fam = hit[1]
109 | hit_ecodlen = hit[2]
110 | query_resids = hit[3]
111 | query_range = get_range(query_resids)
112 | hit_resids = hit[4]
113 | hit_range = get_range(hit_resids)
114 | hit_resids_set = hit[5]
115 | hit_coverage = round(len(hit_resids_set) / hit_ecodlen, 2)
116 | if hit_coverage >= 0.4 and hit_prob >= 50:
117 | new_resids = hit_resids_set.difference(get_resids)
118 | if len(new_resids) >= len(hit_resids_set) * 0.5:
119 | mycount += 1
120 | get_resids = get_resids.union(hit_resids_set)
121 | rp.write(f'{ecodnum}_{str(mycount)}\t{ecodid}\t{hit_fam}\t{hit_prob}\t{hit_coverage}\t{hit_ecodlen}\t{query_range}\t{hit_range}\n')
122 | rp.close()
123 |
124 |
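  | # structure support: for each Dali hit, report the best HHsearch probability from
  | # the same H-group over the hit's query region, and the best template coverage among
  | # hits within 0.1 of that probability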
125 | if os.path.exists(f'{prefix}_good_hits'):
126 | fp = open(f'{prefix}_good_hits', 'r')
127 | rp = open(f'{prefix}_structure.result', 'w')
128 | for countl, line in enumerate(fp):
129 | if countl:
130 | words = line.split()
131 | hitname = words[0]
132 | ecodnum = words[1]
133 | ecodid = words[2]
134 | ecodfam = words[3]
135 | zscore = words[4]
136 | qscore = words[5]
137 | ztile = words[6]
138 | qtile = words[7]
139 | rank = words[8]
140 | qsegments = words[9]
141 | ssegments = words[10]
142 | segs = []
143 | for seg in qsegments.split(','):
144 | start = int(seg.split('-')[0])
145 | end = int(seg.split('-')[1])
146 | for res in range(start, end + 1):
147 | if not segs:
148 | segs.append([res])
149 | else:
150 | if res > segs[-1][-1] + 10:
151 | segs.append([res])
152 | else:
153 | segs[-1].append(res)
154 | resids = set([])
155 | for seg in segs:
156 | start = seg[0]
157 | end = seg[-1]
158 | for res in range(start, end + 1):
159 | resids.add(res)
160 |
161 | good_hits = []
162 | try:
163 | for hit in fam2hits[ecodfam]:
164 | prob = float(hit[0])
165 | Tlen = hit[1]
166 | Qresids = hit[2]
167 | Tresids = hit[3]
168 | get_Tresids = set([])
169 | for i in range(len(Qresids)):
170 | if Qresids[i] in resids:
171 | get_Tresids.add(Tresids[i])
172 | Tcov = len(get_Tresids) / Tlen
173 | good_hits.append([prob, Tcov])
174 | except KeyError:
175 | pass
176 |
177 | bestprob = 0
178 | bestcov = 0
179 | if good_hits:
180 | for item in good_hits:
181 | if item[0] > bestprob:
182 | bestprob = item[0]
183 | bestcovs = []
184 | for item in good_hits:
185 | if item[0] >= bestprob - 0.1:
186 | bestcovs.append(item[1])
187 | bestcov = round(max(bestcovs), 2)
188 | rp.write(f'{hitname}\t{ecodid}\t{ecodfam}\t{zscore}\t{qscore}\t{ztile}\t{qtile}\t{rank}\t{bestprob}\t{bestcov}\t{qsegments}\t{ssegments}\n')
189 | fp.close()
190 | rp.close()
191 |
--------------------------------------------------------------------------------