├── README.md ├── docker ├── Dockerfile ├── scripts │ ├── run_dpam.py │ ├── run_step1.py │ ├── run_step10.py │ ├── run_step11.py │ ├── run_step12.py │ ├── run_step13.py │ ├── run_step14.py │ ├── run_step15.py │ ├── run_step16.py │ ├── run_step17.py │ ├── run_step18.py │ ├── run_step19.py │ ├── run_step2.py │ ├── run_step21.py │ ├── run_step23.py │ ├── run_step3.py │ ├── run_step4.py │ ├── run_step5.py │ ├── run_step6.py │ ├── run_step7.py │ ├── run_step8.py │ ├── run_step9.py │ ├── step10_get_support.py │ ├── step11_get_good_domains.py │ ├── step12_get_sse.py │ ├── step13_get_diso.py │ ├── step14_parse_domains.py │ ├── step15_prepare_domass.py │ ├── step16_run_domass.py │ ├── step17_get_confident.py │ ├── step18_get_mapping.py │ ├── step19_get_merge_candidates.py │ ├── step1_get_AFDB_seqs.py │ ├── step20_extract_domains.py │ ├── step21_compare_domains.py │ ├── step22_merge_domains.py │ ├── step23_get_predictions.py │ ├── step24_integrate_results.py │ ├── step25_generate_pdbs.py │ ├── step2_get_AFDB_pdbs.py │ ├── step3_run_hhsearch.py │ ├── step4_run_foldseek.py │ ├── step5_process_hhsearch.py │ ├── step6_process_foldseek.py │ ├── step7_prepare_dali.py │ ├── step8_iterative_dali.py │ ├── step9_analyze_dali.py │ └── summarize_check.py └── utilities │ ├── DaliLite.v5.tar.gz │ ├── HHPaths.pm │ ├── foldseek │ └── pdb2fasta ├── example ├── test │ ├── O05011.json │ ├── O05011.pdb │ ├── O05012.cif │ ├── O05012.json │ ├── O05023.cif │ └── O05023.json └── test_struc.list ├── run_dpam_docker.py ├── run_dpam_singularity.py └── v1.0 ├── A0A0K2WPR7.zip ├── DPAM.py ├── LICENSE ├── README.md ├── check_dependencies.py ├── download_all_data.sh ├── mkdssp ├── model_organisms ├── Caenorhabditis_elegans.tgz ├── Danio_rerio.tgz ├── Drosophila_melanogaster.tgz ├── Homo_Sapiens.tgz ├── Mus_musculus.tgz └── Pan_paniscus.tgz ├── pdb2fasta ├── step10_get_good_domains.py ├── step11_get_sse.py ├── step12_get_diso.py ├── step13_parse_domains.py ├── step1_get_AFDB_pdbs.py ├── step1_get_AFDB_seqs.py ├── step2_run_hhsearch.py ├── step3_run_foldseek.py ├── step4_filter_foldseek.py ├── step5_map_to_ecod.py ├── step6_get_dali_candidates.py ├── step7_iterative_dali_aug_multi.py ├── step8_analyze_dali.py └── step9_get_support.py /README.md: -------------------------------------------------------------------------------- 1 | # DPAM 2 | A **D**omain **P**arser for **A**lphaFold **M**odels 3 | 4 | DPAM: A Domain Parser for AlphaFold Models (https://onlinelibrary.wiley.com/doi/full/10.1002/pro.4548) 5 | 6 | ## Updates: 7 | A docker image for DPAM v2.0 can be downloaded with **docker pull conglab/dpam:latest**, and the previous version (v1.0) has been moved to the v1.0 directory (2023-12-10). The new version includes domain classification based on the ECOD database and addresses over-segmentation for some proteins. **Warning**: the current Docker image only works on x86-64 (AMD) CPUs, not Apple M series chips. We're updating it for compatibility. Stay tuned! 8 | Upload domain parser results for six model organisms. (2022-12-6) 9 | 10 | Replace Dali with Foldseek for initial hits searching. (2022-11-30) 11 | 12 | Fix a bug in analyze_PDB.py which prevented the proper usage of Dali results.
(2022-10-31) 13 | ## Prerequisites (required): 14 | Docker/Singularity 15 | 16 | Python3 17 | 18 | [Databases and supporting files](https://conglab.swmed.edu/DPAM/databases.tar.gz) 19 | 20 | ### Supporting databases for DPAM: 21 | 22 | The databases necessary for DPAM, along with all supporting files, are available for download from our lab server at [https://conglab.swmed.edu/DPAM/](https://conglab.swmed.edu/DPAM/). The compressed file is around 89GB and expands to about **400GB** when uncompressed, so DPAM is best run on a computing cluster or workstation; the storage needs may surpass the capacity of typical personal computers. Please make sure you have sufficient hard drive space to accommodate these databases. Additionally, due to their substantial size, downloading these databases may take several hours to a few days, depending on your internet connection speed. 23 | 24 | After downloading databases.tar.gz, please decompress the file. The resulting directory (`[download_path]/databases`) must then be provided to `run_dpam_docker.py` as `--databases_dir`. 25 | 26 | ## Installation 27 | For Docker: 28 | 29 | docker pull conglab/dpam:latest 30 | git clone https://github.com/CongLabCode/DPAM 31 | cd ./DPAM 32 | wget https://conglab.swmed.edu/DPAM/databases.tar.gz 33 | tar -xzf databases.tar.gz 34 | 35 | For Singularity: 36 | 37 | git clone https://github.com/CongLabCode/DPAM 38 | cd ./DPAM 39 | wget https://conglab.swmed.edu/DPAM/databases.tar.gz 40 | tar -xzf databases.tar.gz 41 | singularity pull dpam.sif docker://conglab/dpam 42 | 43 | 44 | 45 | 46 | ### Quick test 47 | For Docker: 48 | 49 | python run_dpam_docker.py --dataset test --input_dir example --databases_dir databases --threads 32 50 | 51 | For Singularity: 52 | 53 | python ./run_dpam_singularity.py --databases_dir databases --input_dir example --dataset test --threads 32 --image_name dpam.sif 54 | 55 | ## Usage 56 |
python run_dpam_docker.py [-h] --databases_dir DATABASES_DIR --input_dir
 57 |                     INPUT_DIR --dataset DATASET
 58 |                     [--image_name IMAGE_NAME] [--threads THREADS]
 59 |                     [--log_file LOG_FILE]
60 | 61 | ### Arguments 62 | 63 | - `-h`, `--help` 64 | Show this help message and exit. Use this argument if you need information about the different command options. 65 | 66 | - `--databases_dir DATABASES_DIR` 67 | **(Required)** Path to the databases directory (downloaded and uncompressed beforehand) that needs to be mounted into the container. Please make sure you download the databases before running. 68 | 69 | - `--input_dir INPUT_DIR` 70 | **(Required)** Path to the input directory that needs to be mounted. 71 | 72 | - `--dataset DATASET` 73 | **(Required)** Name of the dataset for domain segmentation and classification. 74 | 75 | - `--image_name IMAGE_NAME` 76 | Docker image name. If not provided, a default image name will be used. 77 | 78 | - `--threads THREADS` 79 | Number of threads to be used. By default, the script is configured to utilize all available CPUs. 80 | 81 | - `--log_file LOG_FILE` 82 | File where the logs should be saved. If not provided, logs will be displayed in the standard output. 83 | 84 | ### Input organization 85 | 86 | Before running the wrapper, the `INPUT_DIR` needs to have the following structure: 87 | 88 | <INPUT_DIR>/ 89 | <dataset1>/ 90 | <dataset1>_struc.list 91 | <dataset2>/ 92 | <dataset2>_struc.list 93 | ... 94 | 95 | 96 | The `<dataset1>/` and `<dataset2>/` directories contain the PDB/mmCIF files and the JSON files with PAE values, and `<dataset1>_struc.list` and `<dataset2>_struc.list` list the targets (the shared prefix of each PDB/mmCIF and JSON file pair), one target per line. `<dataset>` can be any name, but the `_struc.list` suffix has to be maintained. 97 | 98 | In the example test in **Quick test** above, 99 | 100 | `example/` is `<INPUT_DIR>` and `test` under `example/` is `<dataset>`. 101 | 102 | **Example command**: 103 | 104 | `python run_dpam_docker.py --dataset test --input_dir example --databases_dir databases --threads 32` 105 | 106 | `databases` is the directory uncompressed from databases.tar.gz downloaded from our lab server. 107 | 108 | ### Output 109 | The pipeline will generate log files for each step for debugging. 110 | 111 | The final output is `<dataset>_domains` under `<INPUT_DIR>`. 112 | 113 | For the example, it should be `test_domains` under `example/`.
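The `_struc.list` file can of course be written by hand. As a concrete illustration, here is a minimal helper sketch (not part of DPAM; the script name `make_struc_list.py` and the function name are hypothetical). It assumes the structure (`.pdb`/`.cif`) and PAE (`.json`) files have already been placed in `<INPUT_DIR>/<dataset>/`, and it writes the matching `<dataset>_struc.list` with one target prefix per line:

    #!/usr/bin/env python3
    # Hypothetical helper, not part of DPAM: build <dataset>_struc.list from the
    # structure (.pdb/.cif) and PAE (.json) files placed in <INPUT_DIR>/<dataset>/.
    import os, sys

    def make_struc_list(input_dir, dataset):
        data_dir = os.path.join(input_dir, dataset)
        prefixes = set()
        for name in os.listdir(data_dir):
            base, ext = os.path.splitext(name)
            # keep only targets that have both a structure file and a PAE json
            if ext in ('.pdb', '.cif') and os.path.exists(os.path.join(data_dir, base + '.json')):
                prefixes.add(base)
        list_path = os.path.join(input_dir, dataset + '_struc.list')
        with open(list_path, 'w') as out:
            for prefix in sorted(prefixes):
                out.write(prefix + '\n')  # one target prefix per line, as DPAM expects
        return list_path

    if __name__ == '__main__':
        # usage: python make_struc_list.py <INPUT_DIR> <dataset>
        print(make_struc_list(sys.argv[1], sys.argv[2]))

Run on the bundled example data (`python make_struc_list.py example test`), this should regenerate `example/test_struc.list` listing O05011, O05012 and O05023.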
114 | -------------------------------------------------------------------------------- /docker/Dockerfile: -------------------------------------------------------------------------------- 1 | ARG CUDA=11.1.1 2 | FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04 3 | ARG CUDA 4 | 5 | 6 | SHELL ["/bin/bash", "-o", "pipefail", "-c"] 7 | 8 | RUN apt-get update \ 9 | && DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \ 10 | build-essential \ 11 | cmake \ 12 | cuda-command-line-tools-$(cut -f1,2 -d- <<< ${CUDA//./-}) \ 13 | git \ 14 | tzdata \ 15 | wget \ 16 | dialog \ 17 | gfortran \ 18 | && rm -rf /var/lib/apt/lists/* \ 19 | && apt-get autoremove -y \ 20 | && apt-get clean 21 | 22 | RUN wget -q -P /tmp \ 23 | https://repo.anaconda.com/miniconda/Miniconda3-py37_23.1.0-1-Linux-x86_64.sh \ 24 | && bash /tmp/Miniconda3-py37_23.1.0-1-Linux-x86_64.sh -b -p /opt/conda \ 25 | && rm /tmp/Miniconda3-py37_23.1.0-1-Linux-x86_64.sh 26 | 27 | ENV PATH="/opt/conda/bin:$PATH" 28 | RUN conda install -y -c bioconda blast-legacy 29 | RUN conda install -y -c biocore psipred 30 | RUN conda install -y -c salilab dssp 31 | 32 | 33 | RUN pip install numpy 34 | RUN pip install tensorflow==1.14 35 | RUN pip install protobuf==3.20.* 36 | 37 | 38 | COPY utilities/DaliLite.v5.tar.gz /opt 39 | RUN cd /opt \ 40 | && tar -zxvf DaliLite.v5.tar.gz \ 41 | && cd /opt/DaliLite.v5/bin \ 42 | && make clean \ 43 | && make \ 44 | && ln -s /opt/DaliLite.v5/bin/dali.pl /usr/bin \ 45 | && rm /opt/DaliLite.v5.tar.gz 46 | 47 | RUN git clone https://github.com/soedinglab/pdbx.git /opt/pdbx \ 48 | && mkdir /opt/pdbx/build \ 49 | && pushd /opt/pdbx/build \ 50 | && cmake ../ \ 51 | && make install \ 52 | && popd 53 | 54 | RUN git clone --branch v3.3.0 https://github.com/soedinglab/hh-suite.git /tmp/hh-suite \ 55 | && mkdir /tmp/hh-suite/build \ 56 | && pushd /tmp/hh-suite/build \ 57 | && cmake -DCMAKE_INSTALL_PREFIX=/opt/hhsuite .. 
\ 58 | && make -j 4 && make install \ 59 | && ln -s /opt/hhsuite/bin/* /usr/bin \ 60 | && popd \ 61 | && rm -rf /tmp/hh-suite 62 | 63 | RUN mkdir /opt/DPAM && mkdir /opt/DPAM/scripts 64 | COPY scripts/*.py /opt/DPAM/scripts 65 | COPY utilities/HHPaths.pm /opt/hhsuite/scripts 66 | COPY utilities/pdb2fasta /usr/bin 67 | COPY utilities/foldseek /usr/bin 68 | 69 | RUN chmod -R +x /opt/DPAM/scripts 70 | 71 | ENV PATH="/opt/DPAM/scripts:/opt/hhsuite/scripts:/opt/hhsuite/bin:/opt/DaliLite.v5/bin:$PATH" 72 | ENV LD_LIBRARY_PATH="/opt/conda/lib:$LD_LIBRARY_PATH" 73 | ENV PERL5LIB="/usr/local/lib/perl5:/opt/hhsuite/scripts" 74 | ENV OPENBLAS_NUM_THREADS=1 75 | 76 | -------------------------------------------------------------------------------- /docker/scripts/run_dpam.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import os, sys, time, subprocess 3 | dataset = sys.argv[1] 4 | ncore = sys.argv[2] 5 | wd = os.getcwd() 6 | 7 | for step in range(1,25): 8 | if 1 <= step <= 19 or step == 21 or step == 23: 9 | if os.path.exists(dataset + '_step' + str(step) + '.log'): 10 | with open(dataset + '_step' + str(step) + '.log') as f: 11 | step_logs = f.read() 12 | if 'done\n' != step_logs: 13 | rcode = subprocess.run('run_step' + str(step) + '.py ' + dataset + ' ' + ncore,shell=True).returncode 14 | if rcode != 0: 15 | print(f'Error in step{step}') 16 | sys.exit() 17 | else: 18 | for s in range(step,25): 19 | os.system('rm ' + dataset + '_step' + str(s) + '.log') 20 | os.system('rm -rf step_' + str(s) + '/' + dataset + '/*') 21 | rcode = subprocess.run('run_step' + str(step) + '.py ' + dataset + ' ' + ncore,shell=True).returncode 22 | if rcode != 0: 23 | print(f'Error in step{step}') 24 | sys.exit() 25 | elif step == 20: 26 | run_flag = 0 27 | if os.path.exists(dataset + '_step' + str(step) + '.log'): 28 | with open(dataset + '_step' + str(step) + '.log') as f: 29 | step_logs=f.read() 30 | if 'done\n' != step_logs: 31 | run_flag = 1 32 | else: 33 | run_flag = 1 34 | if run_flag == 1: 35 | for s in range(step,25): 36 | os.system('rm ' + dataset + '_step' + str(s) + '.log') 37 | os.system('rm -rf step_' + str(s) + '/' + dataset + '/*') 38 | status_code = subprocess.run('step20_extract_domains.py ' + dataset,shell=True).returncode 39 | if status_code == 0: 40 | with open(dataset + '_step20.log','w') as f: 41 | f.write('done\n') 42 | else: 43 | with open(dataset + '_step20.log','w') as f: 44 | f.write('fail\n') 45 | print(f'Error in step{step}') 46 | sys.exit() 47 | elif step == 22: 48 | run_flag = 0 49 | if os.path.exists(dataset + '_step' + str(step) + '.log'): 50 | with open(dataset + '_step' + str(step) + '.log') as f: 51 | step_logs=f.read() 52 | if 'done\n' != step_logs: 53 | run_flag = 1 54 | else: 55 | run_flag = 1 56 | if run_flag == 1: 57 | for s in range(step,25): 58 | os.system('rm ' + dataset + '_step' + str(s) + '.log') 59 | os.system('rm -rf step_' + str(s) + '/' + dataset + '/*') 60 | status_code = subprocess.run('step22_merge_domains.py ' + dataset,shell=True).returncode 61 | if status_code == 0: 62 | with open(dataset + '_step22.log','w') as f: 63 | f.write('done\n') 64 | else: 65 | with open(dataset + '_step22.log','w') as f: 66 | f.write('fail\n') 67 | print(f'Error in step{step}') 68 | sys.exit() 69 | elif step == 24: 70 | run_flag = 0 71 | if os.path.exists(dataset + '_step' + str(step) + '.log'): 72 | with open(dataset + '_step' + str(step) + '.log') as f: 73 | step_logs = f.read() 74 | if 'done\n' != step_logs: 75 | run_flag 
= 1 76 | else: 77 | run_flag = 1 78 | if run_flag == 1: 79 | for s in range(step,25): 80 | os.system('rm ' + dataset + '_step' + str(s) + '.log') 81 | os.system('rm -rf step_' + str(s) + '/' + dataset + '/*') 82 | status_code = subprocess.run('step24_integrate_results.py ' + dataset,shell=True).returncode 83 | if status_code == 0: 84 | with open(dataset + '_step24.log','w') as f: 85 | f.write('done\n') 86 | else: 87 | with open(dataset + '_step24.log','w') as f: 88 | f.write('fail\n') 89 | print(f'Error in step{step}') 90 | sys.exit() 91 | filelist=[wd + '/' + dataset + '_step' + str(k)+'.log' for k in range(1,25)] 92 | undone = 24 93 | for name in filelist: 94 | with open(name) as f: 95 | info = f.read() 96 | if info.strip() == 'done': 97 | undone = undone - 1 98 | else: 99 | print(dataset + ' ' + name.split('/')[-1].split(dataset + '_')[1]+' has errors..Fail') 100 | break 101 | if undone == 0: 102 | print(dataset + ' done') 103 | -------------------------------------------------------------------------------- /docker/scripts/run_step1.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(sample,cmd): 6 | status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return sample+' succeed' 9 | else: 10 | return sample+' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | sample=cmd.split()[2] 18 | process = pool.apply_async(run_cmd,(sample,cmd,)) 19 | result.append(process) 20 | for process in result: 21 | log.append(process.get()) 22 | return log 23 | 24 | 25 | 26 | dataset = sys.argv[1] 27 | ncore = int(sys.argv[2]) 28 | 29 | if not os.path.exists('step1/'): 30 | os.system('mkdir step1') 31 | if not os.path.exists('step1/' + dataset): 32 | os.system('mkdir step1/' + dataset) 33 | 34 | fp = open(dataset + '_struc.list', 'r') 35 | cases = [] 36 | for line in fp: 37 | words = line.split() 38 | accession = words[0] 39 | cases.append(accession) 40 | # cases.append([accession, version]) 41 | fp.close() 42 | 43 | cmds = [] 44 | for case in cases: 45 | if os.path.exists('step1/' + dataset + '/' + case + '.fa'): 46 | fp = open('step1/' + dataset + '/' + case + '.fa', 'r') 47 | check_header = 0 48 | check_seq = 0 49 | check_length = 0 50 | for line in fp: 51 | check_length += 1 52 | if line[0] == '>': 53 | if line[1:-1] == case: 54 | check_header = 1 55 | else: 56 | if len(line) > 10: 57 | check_seq = 1 58 | fp.close() 59 | if check_header and check_seq and check_length == 2: 60 | pass 61 | else: 62 | os.system('rm step1/' + dataset + '/' + case + '.fa') 63 | cmds.append('python /opt/DPAM/scripts/step1_get_AFDB_seqs.py ' + dataset + ' ' + case) 64 | else: 65 | cmds.append('python /opt/DPAM/scripts/step1_get_AFDB_seqs.py ' + dataset + ' ' + case) 66 | 67 | 68 | if cmds: 69 | logs = batch_run(cmds,ncore) 70 | fail = [i for i in logs if 'fail' in i] 71 | if fail: 72 | with open(dataset + '_step1.log','w') as f: 73 | for i in fail: 74 | f.write(i+'\n') 75 | else: 76 | with open(dataset + '_step1.log','w') as f: 77 | f.write('done\n') 78 | else: 79 | with open(dataset + '_step1.log','w') as f: 80 | f.write('done\n') 81 | -------------------------------------------------------------------------------- /docker/scripts/run_step10.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, 
sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(cmd): 6 | status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return cmd + ' succeed' 9 | else: 10 | return cmd + ' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | 25 | dataset = sys.argv[1] 26 | ncore = int(sys.argv[2]) 27 | 28 | if not os.path.exists('step10'): 29 | os.system('mkdir step10/') 30 | if not os.path.exists('step10/' + dataset): 31 | os.system('mkdir step10/' + dataset) 32 | 33 | fp = open(dataset + '_struc.list', 'r') 34 | prots = [] 35 | for line in fp: 36 | words = line.split() 37 | prots.append(words[0]) 38 | fp.close() 39 | 40 | need_prots = set([]) 41 | for prot in prots: 42 | get_seq = 0 43 | if os.path.exists('step10/' + dataset + '/' + prot + '_sequence.result'): 44 | fp = open('step10/' + dataset + '/' + prot + '_sequence.result','r') 45 | word_counts = set([]) 46 | for line in fp: 47 | words = line.split() 48 | word_counts.add(len(words)) 49 | fp.close() 50 | if len(word_counts) == 1 and 8 in word_counts: 51 | get_seq = 1 52 | else: 53 | os.system('rm step10/' + dataset + '/' + prot + '_sequence.result') 54 | need_prots.add(prot) 55 | 56 | get_str = 0 57 | if os.path.exists('step10/' + dataset + '/' + prot + '_structure.result'): 58 | fp = open('step10/' + dataset + '/' + prot + '_structure.result','r') 59 | word_counts = set([]) 60 | for line in fp: 61 | words = line.split() 62 | word_counts.add(len(words)) 63 | fp.close() 64 | if len(word_counts) == 1 and 12 in word_counts: 65 | get_str = 1 66 | else: 67 | os.system('rm step10/' + dataset + '/' + prot + '_structure.result') 68 | need_prots.add(prot) 69 | 70 | if get_seq and get_str: 71 | pass 72 | elif os.path.exists('step10/' + dataset + '/' + prot + '.done'): 73 | pass 74 | else: 75 | need_prots.add(prot) 76 | 77 | 78 | if need_prots: 79 | cmds = [] 80 | for prot in need_prots: 81 | cmds.append('step10_get_support.py ' + dataset + ' ' + prot + '\n') 82 | logs = batch_run(cmds, ncore) 83 | fail = [i for i in logs if 'fail' in i] 84 | if fail: 85 | with open(dataset + '_step10.log','w') as f: 86 | for i in fail: 87 | f.write(i+'\n') 88 | else: 89 | with open(dataset + '_step10.log','w') as f: 90 | f.write('done\n') 91 | else: 92 | with open(dataset + '_step10.log','w') as f: 93 | f.write('done\n') 94 | -------------------------------------------------------------------------------- /docker/scripts/run_step11.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(cmd): 6 | status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return cmd + ' succeed' 9 | else: 10 | return cmd + ' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | 25 | dataset = sys.argv[1] 26 | ncore = int(sys.argv[2]) 27 | 28 | if not os.path.exists('step11'): 29 | os.system('mkdir step11/') 30 | if not os.path.exists('step11/' + dataset): 31 | os.system('mkdir step11/' + dataset) 32 | 33 | fp 
= open(dataset + '_struc.list', 'r') 34 | prots = [] 35 | for line in fp: 36 | words = line.split() 37 | prots.append(words[0]) 38 | fp.close() 39 | 40 | need_prots = [] 41 | for prot in prots: 42 | if os.path.exists('step11/' + dataset + '/' + prot + '.goodDomains'): 43 | fp = open('step11/' + dataset + '/' + prot + '.goodDomains','r') 44 | seq_word_counts = set([]) 45 | str_word_counts = set([]) 46 | for line in fp: 47 | words = line.split() 48 | if words[0] == 'sequence': 49 | seq_word_counts.add(len(words)) 50 | elif words[0] == 'structure': 51 | str_word_counts.add(len(words)) 52 | fp.close() 53 | 54 | bad_seq = 0 55 | bad_str = 0 56 | if seq_word_counts: 57 | if len(seq_word_counts) == 1 and 10 in seq_word_counts: 58 | pass 59 | else: 60 | bad_seq = 1 61 | if str_word_counts: 62 | if len(str_word_counts) == 1 and 16 in str_word_counts: 63 | pass 64 | else: 65 | bad_str = 1 66 | 67 | if bad_seq or bad_str: 68 | os.system('rm step11/' + dataset + '/' + prot + '.goodDomains') 69 | need_prots.append(prot) 70 | elif os.path.exists('step11/' + dataset + '/' + prot + '.done'): 71 | pass 72 | else: 73 | need_prots.append(prot) 74 | 75 | 76 | if need_prots: 77 | cmds = [] 78 | for prot in need_prots: 79 | cmds.append('step11_get_good_domains.py ' + dataset + ' ' + prot) 80 | logs = batch_run(cmds, ncore) 81 | fail = [i for i in logs if 'fail' in i] 82 | if fail: 83 | with open(dataset + '_step11.log','w') as f: 84 | for i in fail: 85 | f.write(i+'\n') 86 | else: 87 | with open(dataset + '_step11.log','w') as f: 88 | f.write('done\n') 89 | else: 90 | with open(dataset + '_step11.log','w') as f: 91 | f.write('done\n') 92 | -------------------------------------------------------------------------------- /docker/scripts/run_step12.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(cmd): 6 | status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return cmd + ' succeed' 9 | else: 10 | return cmd + ' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | 25 | dataset = sys.argv[1] 26 | ncore = int(sys.argv[2]) 27 | 28 | if not os.path.exists('step12'): 29 | os.system('mkdir step12/') 30 | if not os.path.exists('step12/' + dataset): 31 | os.system('mkdir step12/' + dataset) 32 | 33 | fp = open(dataset + '_struc.list', 'r') 34 | prots = [] 35 | for line in fp: 36 | words = line.split() 37 | prots.append(words[0]) 38 | fp.close() 39 | 40 | need_prots = [] 41 | for prot in prots: 42 | if os.path.exists('step12/' + dataset + '/' + prot + '.sse'): 43 | fp = open('step12/' + dataset + '/' + prot + '.sse', 'r') 44 | word_counts = set([]) 45 | for line in fp: 46 | words = line.split() 47 | word_counts.add(len(words)) 48 | fp.close() 49 | if len(word_counts) == 1 and 4 in word_counts: 50 | pass 51 | else: 52 | os.system('rm step12/' + dataset + '/' + prot + '.sse') 53 | need_prots.append(prot) 54 | else: 55 | need_prots.append(prot) 56 | 57 | if need_prots: 58 | cmds = [] 59 | for prot in need_prots: 60 | cmds.append('step12_get_sse.py ' + dataset + ' ' + prot + '\n') 61 | logs = batch_run(cmds, ncore) 62 | fail = [i for i in logs if 'fail' in i] 63 | if fail: 64 | with open(dataset + 
'_step12.log','w') as f: 65 | for i in fail: 66 | f.write(i+'\n') 67 | else: 68 | with open(dataset + '_step12.log','w') as f: 69 | f.write('done\n') 70 | else: 71 | with open(dataset + '_step12.log','w') as f: 72 | f.write('done\n') 73 | -------------------------------------------------------------------------------- /docker/scripts/run_step13.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(cmd): 6 | status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return cmd + ' succeed' 9 | else: 10 | return cmd + ' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | dataset = sys.argv[1] 25 | ncore = int(sys.argv[2]) 26 | 27 | if not os.path.exists('step13'): 28 | os.system('mkdir step13/') 29 | 30 | if not os.path.exists('step13/' + dataset): 31 | os.system('mkdir step13/' + dataset) 32 | 33 | fp = open(dataset + '_struc.list', 'r') 34 | cases = [] 35 | for line in fp: 36 | words = line.split() 37 | cases.append(words[0]) 38 | fp.close() 39 | 40 | need_cases = [] 41 | for case in cases: 42 | prot = case 43 | if os.path.exists('step13/' + dataset + '/' + prot + '.diso'): 44 | pass 45 | else: 46 | need_cases.append(case) 47 | 48 | if need_cases: 49 | cmds = [] 50 | for case in need_cases: 51 | cmds.append('step13_get_diso.py ' + dataset + ' ' + case) 52 | logs = batch_run(cmds, ncore) 53 | fail = [i for i in logs if 'fail' in i] 54 | if fail: 55 | with open(dataset + '_step13.log','w') as f: 56 | for i in fail: 57 | f.write(i+'\n') 58 | else: 59 | with open(dataset + '_step13.log','w') as f: 60 | f.write('done\n') 61 | else: 62 | with open(dataset + '_step13.log','w') as f: 63 | f.write('done\n') 64 | -------------------------------------------------------------------------------- /docker/scripts/run_step14.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | def run_cmd(cmd): 5 | status=subprocess.run(cmd,shell=True).returncode 6 | if status==0: 7 | return cmd + ' succeed' 8 | else: 9 | return cmd + ' fail' 10 | 11 | def batch_run(cmds,process_num): 12 | log=[] 13 | pool = Pool(processes=process_num) 14 | result = [] 15 | for cmd in cmds: 16 | process = pool.apply_async(run_cmd,(cmd,)) 17 | result.append(process) 18 | for process in result: 19 | log.append(process.get()) 20 | return log 21 | 22 | 23 | dataset = sys.argv[1] 24 | ncore = int(sys.argv[2]) 25 | 26 | if not os.path.exists('step14'): 27 | os.system('mkdir step14/') 28 | if not os.path.exists('step14/' + dataset): 29 | os.system('mkdir step14/' + dataset) 30 | 31 | fp = open(dataset + '_struc.list', 'r') 32 | cases = [] 33 | for line in fp: 34 | words = line.split() 35 | cases.append(words[0]) 36 | fp.close() 37 | 38 | 39 | need_cases = [] 40 | for case in cases: 41 | prot = case 42 | if os.path.exists('step14/' + dataset + '/' + prot + '.domains'): 43 | word_counts = set([]) 44 | fp = open('step14/' + dataset + '/' + prot + '.domains', 'r') 45 | for line in fp: 46 | words = line.split() 47 | word_counts.add(len(words)) 48 | fp.close() 49 | if len(word_counts) == 1 and 2 in word_counts: 50 | 
pass 51 | else: 52 | os.system('rm step14/' + dataset + '/' + prot + '.domains') 53 | need_cases.append(case) 54 | elif os.path.exists('step14/' + dataset + '/' + prot + '.done'): 55 | pass 56 | else: 57 | need_cases.append(case) 58 | 59 | if need_cases: 60 | cmds = [] 61 | for case in need_cases: 62 | cmds.append('step14_parse_domains.py ' + dataset + ' ' + case) 63 | logs = batch_run(cmds, ncore) 64 | fail = [i for i in logs if 'fail' in i] 65 | if fail: 66 | with open(dataset + '_step14.log','w') as f: 67 | for i in fail: 68 | f.write(i+'\n') 69 | else: 70 | with open(dataset + '_step14.log','w') as f: 71 | f.write('done\n') 72 | else: 73 | with open(dataset + '_step14.log','w') as f: 74 | f.write('done\n') 75 | -------------------------------------------------------------------------------- /docker/scripts/run_step15.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(cmd): 6 | status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return cmd + ' succeed' 9 | else: 10 | return cmd + ' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | 25 | dataset = sys.argv[1] 26 | ncore = int(sys.argv[2]) 27 | 28 | if not os.path.exists('step15'): 29 | os.system('mkdir step15/') 30 | if not os.path.exists('step15/' + dataset): 31 | os.system('mkdir step15/' + dataset) 32 | 33 | fp = open(dataset + '_struc.list', 'r') 34 | prots = [] 35 | for line in fp: 36 | words = line.split() 37 | prots.append(words[0]) 38 | fp.close() 39 | 40 | need_prots = [] 41 | for prot in prots: 42 | if os.path.exists('step15/' + dataset + '/' + prot + '.data'): 43 | word_counts = set([]) 44 | fp = open('step15/' + dataset + '/' + prot + '.data', 'r') 45 | for line in fp: 46 | words = line.split() 47 | word_counts.add(len(words)) 48 | fp.close() 49 | if len(word_counts) == 1 and 23 in word_counts: 50 | pass 51 | else: 52 | os.system('rm step15/' + dataset + '/' + prot + '.data') 53 | need_prots.append(prot) 54 | else: 55 | if os.path.exists('step15/' + dataset + '/' + prot + '.done'): 56 | pass 57 | else: 58 | need_prots.append(prot) 59 | 60 | if need_prots: 61 | cmds = [] 62 | for prot in need_prots: 63 | cmds.append('step15_prepare_domass.py ' + dataset + ' ' + prot) 64 | logs = batch_run(cmds, ncore) 65 | fail = [i for i in logs if 'fail' in i] 66 | if fail: 67 | with open(dataset + '_step15.log','w') as f: 68 | for i in fail: 69 | f.write(i+'\n') 70 | else: 71 | with open(dataset + '_step15.log','w') as f: 72 | f.write('done\n') 73 | else: 74 | with open(dataset + '_step15.log','w') as f: 75 | f.write('done\n') 76 | -------------------------------------------------------------------------------- /docker/scripts/run_step16.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | 4 | dataset = sys.argv[1] 5 | 6 | if not os.path.exists('step16'): 7 | os.system('mkdir step16/') 8 | if not os.path.exists('step16/' + dataset): 9 | os.system('mkdir step16/' + dataset) 10 | 11 | fp = open(dataset + '_struc.list', 'r') 12 | prots = [] 13 | for line in fp: 14 | words = line.split() 15 | prots.append(words[0]) 16 | fp.close() 17 | 
18 | need_prots = [] 19 | for prot in prots: 20 | if os.path.exists('step16/' + dataset + '/' + prot + '.result'): 21 | word_counts = set([]) 22 | fp = open('step16/' + dataset + '/' + prot + '.result', 'r') 23 | for line in fp: 24 | words = line.split() 25 | word_counts.add(len(words)) 26 | fp.close() 27 | if len(word_counts) == 1 and 21 in word_counts: 28 | pass 29 | else: 30 | os.system('rm step16/' + dataset + '/' + prot + '.result') 31 | need_prots.append(prot) 32 | elif os.path.exists('step16/' + dataset + '/' + prot + '.done'): 33 | pass 34 | else: 35 | need_prots.append(prot) 36 | 37 | 38 | if need_prots: 39 | rp = open('step16_' + dataset + '.list', 'w') 40 | for prot in need_prots: 41 | rp.write(prot + '\n') 42 | rp.close() 43 | rcode=subprocess.run('step16_run_domass.py ' + dataset,shell=True).returncode 44 | if rcode!=0: 45 | with open(dataset + '_step16.log','w')as f: 46 | f.write(' '.join(need_prots)+' fail\n') 47 | else: 48 | with open(dataset + '_step16.log','w')as f: 49 | f.write('done\n') 50 | os.system('rm step16_' + dataset + '*.list\n') 51 | else: 52 | with open(dataset + '_step16.log','w')as f: 53 | f.write('done\n') 54 | -------------------------------------------------------------------------------- /docker/scripts/run_step17.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(cmd): 6 | status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return cmd+' succeed' 9 | else: 10 | return cmd+' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | dataset = sys.argv[1] 25 | ncore = int(sys.argv[2]) 26 | if not os.path.exists('step17'): 27 | os.system('mkdir step17/') 28 | if not os.path.exists('step17/' + dataset): 29 | os.system('mkdir step17/' + dataset) 30 | 31 | fp = open(dataset + '_struc.list', 'r') 32 | prots = [] 33 | for line in fp: 34 | words = line.split() 35 | prots.append(words[0]) 36 | fp.close() 37 | 38 | need_prots = [] 39 | for prot in prots: 40 | if os.path.exists('step17/' + dataset + '/' + prot + '.result'): 41 | word_counts = set([]) 42 | fp = open('step17/' + dataset + '/' + prot + '.result', 'r') 43 | for line in fp: 44 | words = line.split() 45 | word_counts.add(len(words)) 46 | fp.close() 47 | if len(word_counts) == 1 and 6 in word_counts: 48 | pass 49 | else: 50 | os.system('rm step17/' + dataset + '/' + prot + '.result') 51 | need_prots.append(prot) 52 | elif os.path.exists('step17/' + dataset + '/' + prot + '.done'): 53 | pass 54 | else: 55 | need_prots.append(prot) 56 | 57 | if need_prots: 58 | cmds = [] 59 | for prot in need_prots: 60 | cmds.append('step17_get_confident.py ' + dataset + ' ' + prot) 61 | logs = batch_run(cmds, ncore) 62 | fail = [i for i in logs if 'fail' in i] 63 | if fail: 64 | with open(dataset + '_step17.log','w') as f: 65 | for i in fail: 66 | f.write(i+'\n') 67 | else: 68 | with open(dataset + '_step17.log','w') as f: 69 | f.write('done\n') 70 | else: 71 | with open(dataset + '_step17.log','w') as f: 72 | f.write('done\n') 73 | -------------------------------------------------------------------------------- /docker/scripts/run_step18.py: 
-------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(cmd): 6 | status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return cmd+' succeed' 9 | else: 10 | return cmd+' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | 25 | dataset = sys.argv[1] 26 | ncore = int(sys.argv[2]) 27 | if not os.path.exists('step18'): 28 | os.system('mkdir step18/') 29 | if not os.path.exists('step18/' + dataset): 30 | os.system('mkdir step18/' + dataset) 31 | 32 | fp = open(dataset + '_struc.list', 'r') 33 | prots = [] 34 | for line in fp: 35 | words = line.split() 36 | prots.append(words[0]) 37 | fp.close() 38 | 39 | need_prots = [] 40 | for prot in prots: 41 | if os.path.exists('step18/' + dataset + '/' + prot + '.data'): 42 | word_counts = set([]) 43 | fp = open('step18/' + dataset + '/' + prot + '.data', 'r') 44 | for line in fp: 45 | words = line.split() 46 | word_counts.add(len(words)) 47 | fp.close() 48 | if len(word_counts) == 1 and 8 in word_counts: 49 | pass 50 | else: 51 | os.system('rm step18/' + dataset + '/' + prot + '.data') 52 | need_prots.append(prot) 53 | elif os.path.exists('step18/' + dataset + '/' + prot + '.done'): 54 | pass 55 | else: 56 | need_prots.append(prot) 57 | 58 | if need_prots: 59 | cmds = [] 60 | for prot in need_prots: 61 | cmds.append('step18_get_mapping.py ' + dataset + ' ' + prot) 62 | logs = batch_run(cmds, ncore) 63 | fail = [i for i in logs if 'fail' in i] 64 | if fail: 65 | with open(dataset + '_step18.log','w') as f: 66 | for i in fail: 67 | f.write(i+'\n') 68 | else: 69 | with open(dataset + '_step18.log','w') as f: 70 | f.write('done\n') 71 | else: 72 | with open(dataset + '_step18.log','w') as f: 73 | f.write('done\n') 74 | -------------------------------------------------------------------------------- /docker/scripts/run_step19.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(cmd): 6 | status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return cmd+' succeed' 9 | else: 10 | return cmd+' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | dataset = sys.argv[1] 24 | ncore = int(sys.argv[2]) 25 | if not os.path.exists('step19'): 26 | os.system('mkdir step19/') 27 | if not os.path.exists('step19/' + dataset): 28 | os.system('mkdir step19/' + dataset) 29 | 30 | fp = open(dataset + '_struc.list', 'r') 31 | prots = [] 32 | for line in fp: 33 | words = line.split() 34 | prots.append(words[0]) 35 | fp.close() 36 | 37 | need_prots = set([]) 38 | for prot in prots: 39 | check_info = 0 40 | if os.path.exists('step19/' + dataset + '/' + prot + '.info'): 41 | word_counts = set([]) 42 | fp = open('step19/' + dataset + '/' + prot + '.info', 'r') 43 | for line in fp: 44 | words = line.split() 45 | word_counts.add(len(words)) 46 | fp.close() 47 | if 
len(word_counts) == 1 and 2 in word_counts: 48 | check_info = 1 49 | else: 50 | os.system('rm step19/' + dataset + '/' + prot + '.info') 51 | need_prots.add(prot) 52 | 53 | check_result = 0 54 | if os.path.exists('step19/' + dataset + '/' + prot + '.result'): 55 | word_counts = set([]) 56 | fp = open('step19/' + dataset + '/' + prot + '.result', 'r') 57 | for line in fp: 58 | words = line.split() 59 | word_counts.add(len(words)) 60 | fp.close() 61 | if len(word_counts) == 1 and 4 in word_counts: 62 | check_result = 1 63 | else: 64 | os.system('rm step19/' + dataset + '/' + prot + '.result') 65 | need_prots.add(prot) 66 | 67 | if check_info and check_result: 68 | pass 69 | elif os.path.exists('step19/' + dataset + '/' + prot + '.done'): 70 | pass 71 | else: 72 | need_prots.add(prot) 73 | 74 | if need_prots: 75 | cmds = [] 76 | for prot in need_prots: 77 | cmds.append('step19_get_merge_candidates.py ' + dataset + ' ' + prot) 78 | logs = batch_run(cmds, ncore) 79 | fail = [i for i in logs if 'fail' in i] 80 | if fail: 81 | with open(dataset + '_step19.log','w') as f: 82 | for i in fail: 83 | f.write(i+'\n') 84 | else: 85 | with open(dataset + '_step19.log','w') as f: 86 | f.write('done\n') 87 | else: 88 | with open(dataset + '_step19.log','w') as f: 89 | f.write('done\n') 90 | -------------------------------------------------------------------------------- /docker/scripts/run_step2.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | 3 | import os, sys, subprocess 4 | from multiprocessing import Pool 5 | 6 | def run_cmd(sample,cmd): 7 | status=subprocess.run(cmd,shell=True).returncode 8 | if status==0: 9 | return sample+' succeed' 10 | else: 11 | return sample+' fail' 12 | 13 | def batch_run(cmds,process_num): 14 | log=[] 15 | pool = Pool(processes=process_num) 16 | result = [] 17 | for cmd in cmds: 18 | sample=cmd.split()[2] 19 | process = pool.apply_async(run_cmd,(sample,cmd,)) 20 | result.append(process) 21 | for process in result: 22 | log.append(process.get()) 23 | return log 24 | 25 | 26 | 27 | dataset = sys.argv[1] 28 | ncore = int(sys.argv[2]) 29 | if not os.path.exists('step2'): 30 | os.system('mkdir step2/') 31 | if not os.path.exists('step2/' + dataset): 32 | os.system('mkdir step2/' + dataset) 33 | 34 | fp = open(dataset + '_struc.list', 'r') 35 | cases = [] 36 | for line in fp: 37 | words = line.split() 38 | cases.append(words[0]) 39 | fp.close() 40 | 41 | cmds = [] 42 | for case in cases: 43 | fasta_length = 0 44 | if os.path.exists('step1/' + dataset + '/' + case + '.fa'): 45 | fp = open('step1/' + dataset + '/' + case + '.fa', 'r') 46 | for line in fp: 47 | if line[0] != '>': 48 | fasta_length = len(line[:-1]) 49 | fp.close() 50 | 51 | pdb_resids = set([]) 52 | if os.path.exists('step2/' + dataset + '/' + case + '.pdb'): 53 | fp = open('step2/' + dataset + '/' + case + '.pdb', 'r') 54 | for line in fp: 55 | if len(line) >= 50: 56 | if line[:4] == 'ATOM': 57 | resid = int(line[22:26]) 58 | pdb_resids.add(resid) 59 | fp.close() 60 | pdb_length = len(pdb_resids) 61 | 62 | if fasta_length == pdb_length: 63 | if fasta_length: 64 | pass 65 | else: 66 | if os.path.exists('step2/' + dataset + '/' + case + '.pdb'): 67 | os.system('rm step2/' + dataset + '/' + case + '.pdb') 68 | cmds.append('python /opt/DPAM/scripts/step2_get_AFDB_pdbs.py ' + dataset + ' ' + case + ' ' + case + '\n') 69 | else: 70 | if os.path.exists('step2/' + dataset + '/' + case + '.pdb'): 71 | os.system('rm step2/' + dataset + '/' + case + '.pdb') 72 | 
cmds.append('python /opt/DPAM/scripts/step2_get_AFDB_pdbs.py ' + dataset + ' ' + case + ' ' + case + '\n') 73 | 74 | if cmds: 75 | logs=batch_run(cmds, ncore) 76 | fail = [i for i in logs if 'fail' in i] 77 | if fail: 78 | with open(dataset + '_step2.log','w') as f: 79 | for i in fail: 80 | f.write(i+'\n') 81 | else: 82 | with open(dataset + '_step2.log','w') as f: 83 | f.write('done\n') 84 | else: 85 | with open(dataset + '_step2.log','w') as f: 86 | f.write('done\n') 87 | -------------------------------------------------------------------------------- /docker/scripts/run_step21.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(cmd): 6 | status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return cmd+' succeed' 9 | else: 10 | return cmd+' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | dataset = sys.argv[1] 25 | ncore = int(sys.argv[2]) 26 | os.system('rm step21_' + dataset + '_*.list') 27 | 28 | fp = os.popen('ls -1 step19/' + dataset + '/*.result') 29 | prots = [] 30 | for line in fp: 31 | prot = line.split('/')[2].split('.')[0] 32 | prots.append(prot) 33 | fp.close() 34 | 35 | cases = [] 36 | all_cases = set([]) 37 | for prot in prots: 38 | fp = open('step19/' + dataset + '/' + prot + '.result', 'r') 39 | for line in fp: 40 | words = line.split() 41 | domain1 = words[0] 42 | resids1 = words[1] 43 | domain2 = words[2] 44 | resids2 = words[3] 45 | cases.append([prot, domain1, resids1, domain2, resids2]) 46 | all_cases.add(prot + '_' + domain1 + '_' + domain2) 47 | fp.close() 48 | 49 | get_cases = set([]) 50 | if os.path.exists('step21_' + dataset + '.result'): 51 | fp = open('step21_' + dataset + '.result','r') 52 | for line in fp: 53 | words = line.split() 54 | get_cases.add(words[0] + '_' + words[1] + '_' + words[2]) 55 | fp.close() 56 | 57 | if all_cases == get_cases: 58 | with open(dataset + '_step21.log','w') as f: 59 | f.write('done\n') 60 | else: 61 | total = len(cases) 62 | batchsize = total // ncore + 1 63 | cmds = [] 64 | for i in range(ncore): 65 | rp = open('step21_' + dataset + '_' + str(i) + '.list', 'w') 66 | for case in cases[batchsize * i : batchsize * i + batchsize]: 67 | rp.write(case[0] + '\t' + case[1] + '\t' + case[2] + '\t' + case[3] + '\t' + case[4] + '\n') 68 | rp.close() 69 | cmds.append('step21_compare_domains.py ' + dataset + ' ' + str(i)) 70 | logs = batch_run(cmds, ncore) 71 | fail = [i for i in logs if 'fail' in i] 72 | if fail: 73 | with open(dataset + '_step21.log','w') as f: 74 | for i in fail: 75 | f.write(i+'\n') 76 | else: 77 | with open(dataset + '_step21.log','w') as f: 78 | f.write('done\n') 79 | status=subprocess.run('cat step21_' + dataset + '_*.result >> step21_' + dataset + '.result',shell=True).returncode 80 | os.system('rm step21_' + dataset + '_*.list') 81 | os.system('rm step21_' + dataset + '_*.result') 82 | -------------------------------------------------------------------------------- /docker/scripts/run_step23.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | 5 | def run_cmd(cmd): 6 | 
status=subprocess.run(cmd,shell=True).returncode 7 | if status==0: 8 | return cmd+' succeed' 9 | else: 10 | return cmd+' fail' 11 | 12 | def batch_run(cmds,process_num): 13 | log=[] 14 | pool = Pool(processes=process_num) 15 | result = [] 16 | for cmd in cmds: 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | 25 | dataset = sys.argv[1] 26 | ncore = int(sys.argv[2]) 27 | if not os.path.exists('step23'): 28 | os.system('mkdir step23/') 29 | if not os.path.exists('step23/' + dataset): 30 | os.system('mkdir step23/' + dataset) 31 | 32 | fp = open(dataset + '_struc.list', 'r') 33 | prots = [] 34 | for line in fp: 35 | words = line.split() 36 | prots.append(words[0]) 37 | fp.close() 38 | 39 | need_prots = [] 40 | for prot in prots: 41 | if os.path.exists('step23/' + dataset + '/' + prot + '.assign'): 42 | word_counts = set([]) 43 | fp = open('step23/' + dataset + '/' + prot + '.assign', 'r') 44 | for line in fp: 45 | words = line.split() 46 | word_counts.add(len(words)) 47 | fp.close() 48 | if len(word_counts) == 1 and 10 in word_counts: 49 | pass 50 | else: 51 | os.system('rm step23/' + dataset + '/' + prot + '.assign') 52 | need_prots.append(prot) 53 | else: 54 | if os.path.exists('step23/' + dataset + '/' + prot + '.done'): 55 | pass 56 | else: 57 | need_prots.append(prot) 58 | 59 | 60 | if need_prots: 61 | cmds = [] 62 | for prot in need_prots: 63 | cmds.append('step23_get_predictions.py ' + dataset + ' ' + prot + '\n') 64 | logs = batch_run(cmds, ncore) 65 | fail = [i for i in logs if 'fail' in i] 66 | if fail: 67 | with open(dataset + '_step23.log','w') as f: 68 | for i in fail: 69 | f.write(i+'\n') 70 | else: 71 | with open(dataset + '_step23.log','w') as f: 72 | f.write('done\n') 73 | else: 74 | with open(dataset + '_step23.log','w') as f: 75 | f.write('done\n') 76 | -------------------------------------------------------------------------------- /docker/scripts/run_step3.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess,time 3 | def run_cmd(cmd): 4 | status = subprocess.run(cmd,shell = True).returncode 5 | if status == 0: 6 | return cmd + ' succeed' 7 | else: 8 | return cmd + ' fail' 9 | 10 | 11 | dataset = sys.argv[1] 12 | ncore = sys.argv[2] 13 | if not os.path.exists('step3'): 14 | os.system('mkdir step3/') 15 | if not os.path.exists('step3/' + dataset): 16 | os.system('mkdir step3/' + dataset) 17 | 18 | fp = open(dataset + '_struc.list', 'r') 19 | prots = [] 20 | for line in fp: 21 | words = line.split() 22 | prots.append(words[0]) 23 | fp.close() 24 | 25 | cmds = [] 26 | for prot in prots: 27 | if os.path.exists('step3/' + dataset + '/' + prot + '.hmm') and os.path.exists('step3/' + dataset + '/' + prot + '.hhsearch'): 28 | fp = open('step3/' + dataset + '/' + prot + '.hmm', 'r') 29 | get_sspred = 0 30 | get_ssconf = 0 31 | for line in fp: 32 | if len(line) >= 10: 33 | if line[0] == '>' and line[1:8] == 'ss_pred': 34 | get_sspred = 1 35 | elif line[0] == '>' and line[1:8] == 'ss_conf': 36 | get_ssconf = 1 37 | if get_sspred and get_ssconf: 38 | break 39 | fp.close() 40 | 41 | if get_sspred and get_ssconf: 42 | pass 43 | elif os.path.exists('step3/' + dataset + '/' + prot + '.a3m'): 44 | fp = open('step3/' + dataset + '/' + prot + '.a3m', 'r') 45 | count_line = 0 46 | for line in fp: 47 | count_line += 1 48 | fp.close() 49 | if count_line == 2: 50 | get_sspred = 1 51 | 
get_ssconf = 1 52 | 53 | fp = open('step3/' + dataset + '/' + prot + '.hhsearch', 'r') 54 | start = 0 55 | end = 0 56 | hitsA = set([]) 57 | hitsB = set([]) 58 | for line in fp: 59 | words = line.split() 60 | if len(words) >= 2: 61 | if words[0] == 'No' and words[1] == 'Hit': 62 | start = 1 63 | elif words[0] == 'No' and words[1] == '1': 64 | hitsB.add(int(words[1])) 65 | end = 1 66 | elif start and not end: 67 | hitsA.add(int(words[0])) 68 | elif end: 69 | if words[0] == 'No': 70 | hitsB.add(int(words[1])) 71 | fp.close() 72 | last_words = line.split() 73 | 74 | if get_sspred and get_ssconf and hitsA == hitsB and not last_words: 75 | pass 76 | else: 77 | os.system('rm step1/' + dataset + '/' + prot + '.hhr') 78 | os.system('rm step3/' + dataset + '/' + prot + '.a3m') 79 | os.system('rm step3/' + dataset + '/' + prot + '.hmm') 80 | os.system('rm step3/' + dataset + '/' + prot + '.hhsearch') 81 | cmds.append('step3_run_hhsearch.py ' + dataset + ' ' + prot + ' ' + ncore) 82 | else: 83 | if os.path.exists('step1/' + dataset + '/' + prot + '.hhr'): 84 | os.system('rm step1/' + dataset + '/' + prot + '.hhr') 85 | if os.path.exists('step3/' + dataset + '/' + prot + '.a3m'): 86 | os.system('rm step3/' + dataset + '/' + prot + '.a3m') 87 | if os.path.exists('step3/' + dataset + '/' + prot + '.hmm'): 88 | os.system('rm step3/' + dataset + '/' + prot + '.hmm') 89 | if os.path.exists('step3/' + dataset + '/' + prot + '.hhsearch'): 90 | os.system('rm step3/' + dataset + '/' + prot + '.hhsearch') 91 | cmds.append('step3_run_hhsearch.py ' + dataset + ' ' + prot + ' ' + ncore) 92 | 93 | if cmds: 94 | fail=[] 95 | for cmd in cmds: 96 | for i in range(5): 97 | log=run_cmd(cmd) 98 | if 'succeed' in log: 99 | break 100 | time.sleep(1) 101 | else: 102 | fail.append(log) 103 | if fail: 104 | with open(dataset + '_step3.log','w') as f: 105 | for i in fail: 106 | f.write(i + '\n') 107 | sys.exit(1) 108 | else: 109 | with open(dataset + '_step3.log','w') as f: 110 | f.write('done\n') 111 | -------------------------------------------------------------------------------- /docker/scripts/run_step4.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | 3 | import os, sys, subprocess 4 | def run_cmd(cmd): 5 | status = subprocess.run(cmd,shell = True).returncode 6 | if status == 0: 7 | return cmd + ' succeed' 8 | else: 9 | return cmd + ' fail' 10 | 11 | dataset = sys.argv[1] 12 | ncore = sys.argv[2] 13 | 14 | if not os.path.exists('step4'): 15 | os.system('mkdir step4/') 16 | 17 | if not os.path.exists('step4/' + dataset): 18 | os.system('mkdir step4/' + dataset) 19 | 20 | fp = open(dataset + '_struc.list', 'r') 21 | prots = [] 22 | for line in fp: 23 | words = line.split() 24 | prots.append(words[0]) 25 | fp.close() 26 | 27 | need_prots = [] 28 | for prot in prots: 29 | if os.path.exists('step4/' + dataset + '/' + prot + '.foldseek'): 30 | fp = open('step4/' + dataset + '/' + prot + '.foldseek','r') 31 | word_counts = set([]) 32 | for line in fp: 33 | words = line.split() 34 | word_counts.add(len(words)) 35 | fp.close() 36 | if len(word_counts) == 1 and 12 in word_counts: 37 | pass 38 | elif os.path.exists('step4/' + dataset + '/' + prot + '.done'): 39 | pass 40 | else: 41 | os.system('rm step4/' + dataset + '/' + prot + '.foldseek') 42 | need_prots.append(prot) 43 | else: 44 | need_prots.append(prot) 45 | 46 | 47 | if need_prots: 48 | with open('step4/' + dataset + '_step4.list','w') as f: 49 | for i in need_prots: 50 | f.write(i+'\n') 51 | log = 
run_cmd('step4_run_foldseek.py ' + dataset + ' ' + ncore + ' \n') 52 | if 'fail' in log: 53 | with open(dataset + '_step4.log','w') as f: 54 | f.write(dataset + ' fail\n') 55 | else: 56 | with open(dataset + '_step4.log','w') as f: 57 | f.write('done\n') 58 | os.system('rm step4/' + dataset + '_step4.list') 59 | else: 60 | if os.path.exists('step4/' + dataset + '_step4.list'): 61 | os.system('rm step4/' + dataset + '_step4.list') 62 | with open(dataset + '_step4.log','w') as f: 63 | f.write('done\n') 64 | -------------------------------------------------------------------------------- /docker/scripts/run_step5.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | def run_cmd(sample,cmd): 5 | status=subprocess.run(cmd,shell=True).returncode 6 | if status==0: 7 | return sample+' succeed' 8 | else: 9 | return sample+' fail' 10 | 11 | def batch_run(cmds,process_num): 12 | log=[] 13 | pool = Pool(processes=process_num) 14 | result = [] 15 | for cmd in cmds: 16 | sample=cmd.split()[2] 17 | process = pool.apply_async(run_cmd,(sample,cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | 25 | 26 | dataset = sys.argv[1] 27 | ncore = int(sys.argv[2]) 28 | 29 | if not os.path.exists('step5'): 30 | os.system('mkdir step5/') 31 | if not os.path.exists('step5/' + dataset): 32 | os.system('mkdir step5/' + dataset) 33 | 34 | if os.path.exists('step5_' + dataset + '.cmds'): 35 | os.system('rm step5_' + dataset + '.cmds') 36 | 37 | fp = open(dataset + '_struc.list', 'r') 38 | prots = [] 39 | for line in fp: 40 | words = line.split() 41 | prots.append(words[0]) 42 | fp.close() 43 | 44 | need_prots = [] 45 | for prot in prots: 46 | if os.path.exists('step5/' + dataset + '/' + prot + '.result'): 47 | fp = open('step5/' + dataset + '/' + prot + '.result','r') 48 | word_counts = set([]) 49 | for line in fp: 50 | words = line.split() 51 | word_counts.add(len(words)) 52 | fp.close() 53 | if len(word_counts) == 1 and 15 in word_counts: 54 | pass 55 | else: 56 | os.system('rm step5/' + dataset + '/' + prot + '.result') 57 | need_prots.append(prot) 58 | else: 59 | if os.path.exists('step5/' + dataset + '/' + prot + '.done'): 60 | pass 61 | else: 62 | need_prots.append(prot) 63 | 64 | if need_prots: 65 | cmds = [] 66 | for prot in need_prots: 67 | cmds.append('step5_process_hhsearch.py ' + dataset + ' ' + prot) 68 | logs = batch_run(cmds,ncore) 69 | fail = [i for i in logs if 'fail' in i] 70 | if fail: 71 | with open(dataset + '_step5.log','w') as f: 72 | for i in fail: 73 | f.write(i+'\n') 74 | else: 75 | with open(dataset + '_step5.log','w') as f: 76 | f.write('done\n') 77 | else: 78 | with open(dataset + '_step5.log','w') as f: 79 | f.write('done\n') 80 | -------------------------------------------------------------------------------- /docker/scripts/run_step6.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys,subprocess 3 | from multiprocessing import Pool 4 | def run_cmd(sample,cmd): 5 | status=subprocess.run(cmd,shell=True).returncode 6 | if status==0: 7 | return sample+' succeed' 8 | else: 9 | return sample+' fail' 10 | 11 | def batch_run(cmds,process_num): 12 | log=[] 13 | pool = Pool(processes=process_num) 14 | result = [] 15 | for cmd in cmds: 16 | sample=cmd.split()[2] 17 | process = pool.apply_async(run_cmd,(sample,cmd,)) 18 | 
result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | 25 | dataset = sys.argv[1] 26 | ncore = int(sys.argv[2]) 27 | if not os.path.exists('step6'): 28 | os.system('mkdir step6/') 29 | if not os.path.exists('step6/' + dataset): 30 | os.system('mkdir step6/' + dataset) 31 | 32 | if os.path.exists('step6_' + dataset + '.cmds'): 33 | os.system('rm step6_' + dataset + '.cmds') 34 | 35 | fp = open(dataset + '_struc.list', 'r') 36 | prots = [] 37 | for line in fp: 38 | words = line.split() 39 | prots.append(words[0]) 40 | fp.close() 41 | 42 | need_prots = [] 43 | for prot in prots: 44 | if os.path.exists('step6/' + dataset + '/' + prot + '.result'): 45 | fp = open('step6/' + dataset + '/' + prot + '.result','r') 46 | word_counts = set([]) 47 | for line in fp: 48 | words = line.split() 49 | word_counts.add(len(words)) 50 | fp.close() 51 | if len(word_counts) == 1 and 3 in word_counts: 52 | pass 53 | else: 54 | os.system('rm step6/' + dataset + '/' + prot + '.result') 55 | need_prots.append(prot) 56 | else: 57 | need_prots.append(prot) 58 | 59 | if need_prots: 60 | cmds = [] 61 | for prot in need_prots: 62 | cmds.append('step6_process_foldseek.py ' + dataset + ' ' + prot) 63 | logs = batch_run(cmds, ncore) 64 | fail = [i for i in logs if 'fail' in i] 65 | if fail: 66 | with open(dataset + '_step6.log','w') as f: 67 | for i in fail: 68 | f.write(i+'\n') 69 | else: 70 | with open(dataset + '_step6.log','w') as f: 71 | f.write('done\n') 72 | else: 73 | with open(dataset + '_step6.log','w') as f: 74 | f.write('done\n') 75 | -------------------------------------------------------------------------------- /docker/scripts/run_step7.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | def run_cmd(cmd): 5 | status=subprocess.run(cmd,shell=True).returncode 6 | if status==0: 7 | return cmd + ' succeed' 8 | else: 9 | return cmd + ' fail' 10 | 11 | def batch_run(cmds,process_num): 12 | log=[] 13 | pool = Pool(processes=process_num) 14 | result = [] 15 | for cmd in cmds: 16 | sample=cmd.split()[2] 17 | process = pool.apply_async(run_cmd,(cmd,)) 18 | result.append(process) 19 | for process in result: 20 | log.append(process.get()) 21 | return log 22 | 23 | 24 | dataset = sys.argv[1] 25 | ncore = int(sys.argv[2]) 26 | if not os.path.exists('step7'): 27 | os.system('mkdir step7/') 28 | if not os.path.exists('step7/' + dataset): 29 | os.system('mkdir step7/' + dataset) 30 | 31 | fp = open(dataset + '_struc.list', 'r') 32 | prots = [] 33 | for line in fp: 34 | words = line.split() 35 | prots.append(words[0]) 36 | fp.close() 37 | 38 | need_prots = [] 39 | for prot in prots: 40 | if os.path.exists('step7/' + dataset + '/' + prot + '_hits'): 41 | pass 42 | elif os.path.exists('step7/' + dataset + '/' + prot + '.done'): 43 | pass 44 | else: 45 | need_prots.append(prot) 46 | 47 | if need_prots: 48 | cmds = [] 49 | for prot in need_prots: 50 | cmds.append('step7_prepare_dali.py ' + dataset + ' ' + prot) 51 | logs = batch_run(cmds, ncore) 52 | fail = [i for i in logs if 'fail' in i] 53 | if fail: 54 | with open(dataset + '_step7.log','w') as f : 55 | for i in fail: 56 | f.write(i+'\n') 57 | else: 58 | with open(dataset + '_step7.log','w') as f: 59 | f.write('done\n') 60 | else: 61 | with open(dataset + '_step7.log','w') as f: 62 | f.write('done\n') 63 | -------------------------------------------------------------------------------- 
/docker/scripts/run_step8.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | def run_cmd(cmd): 4 | status=subprocess.run(cmd,shell=True).returncode 5 | if status==0: 6 | return cmd +' succeed' 7 | else: 8 | return cmd +' fail' 9 | 10 | dataset = sys.argv[1] 11 | ncore = sys.argv[2] 12 | if not os.path.exists('step8'): 13 | os.system('mkdir step8/') 14 | 15 | if not os.path.exists('step8/' + dataset): 16 | os.system('mkdir step8/' + dataset) 17 | 18 | fp = open(dataset + '_struc.list', 'r') 19 | prots = [] 20 | for line in fp: 21 | words = line.split() 22 | prots.append(words[0]) 23 | fp.close() 24 | 25 | need_prots = [] 26 | for prot in prots: 27 | if os.path.exists('step8/' + dataset + '/' + prot + '_hits'): 28 | hit_count = 0 29 | fp = open('step8/' + dataset + '/' + prot + '_hits', 'r') 30 | hit_lines = [] 31 | hit_line_count = 0 32 | bad = 0 33 | for line in fp: 34 | if line[0] == '>': 35 | hit_count += 1 36 | if hit_line_count: 37 | if hit_line_count + 4 != len(hit_lines): 38 | bad = 1 39 | break 40 | words = line.split() 41 | hit_line_count = int(words[2]) 42 | hit_lines = [] 43 | else: 44 | hit_lines.append(line) 45 | fp.close() 46 | if hit_line_count: 47 | if hit_line_count + 4 != len(hit_lines): 48 | bad = 1 49 | if bad: 50 | os.system('rm step8/' + dataset + '/' + prot + '_hits') 51 | need_prots.append(prot) 52 | elif hit_count: 53 | pass 54 | else: 55 | if os.path.exists('step8/' + dataset + '/' + prot + '.done'): 56 | pass 57 | else: 58 | os.system('rm step8/' + dataset + '/' + prot + '_hits') 59 | need_prots.append(prot) 60 | else: 61 | if os.path.exists('step8/' + dataset + '/' + prot + '.done'): 62 | pass 63 | else: 64 | need_prots.append(prot) 65 | 66 | if need_prots: 67 | print(need_prots) 68 | fail = [] 69 | for prot in need_prots: 70 | log = run_cmd ('step8_iterative_dali.py ' + dataset + ' ' + prot + ' ' + ncore ) 71 | if 'fail' in log: 72 | fail.append(log) 73 | if fail: 74 | with open(dataset + '_step8.log','w') as f: 75 | for i in fail: 76 | f.write(i+'\n') 77 | else: 78 | with open(dataset + '_step8.log','w') as f: 79 | f.write('done\n') 80 | else: 81 | with open(dataset + '_step8.log','w') as f: 82 | f.write('done\n') 83 | -------------------------------------------------------------------------------- /docker/scripts/run_step9.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | from multiprocessing import Pool 4 | def run_cmd(cmd): 5 | status=subprocess.run(cmd,shell=True).returncode 6 | if status==0: 7 | return cmd+' succeed' 8 | else: 9 | return cmd+' fail' 10 | 11 | def batch_run(cmds,process_num): 12 | log=[] 13 | pool = Pool(processes=process_num) 14 | result = [] 15 | for cmd in cmds: 16 | process = pool.apply_async(run_cmd,(cmd,)) 17 | result.append(process) 18 | for process in result: 19 | log.append(process.get()) 20 | return log 21 | 22 | 23 | dataset = sys.argv[1] 24 | ncore = int(sys.argv[2]) 25 | 26 | if not os.path.exists('step9'): 27 | os.system('mkdir step9/') 28 | 29 | if not os.path.exists('step9/' + dataset): 30 | os.system('mkdir step9/' + dataset) 31 | 32 | fp = open(dataset + '_struc.list', 'r') 33 | prots = [] 34 | for line in fp: 35 | words = line.split() 36 | prots.append(words[0]) 37 | fp.close() 38 | 39 | need_prots = [] 40 | for prot in prots: 41 | if os.path.exists('step9/' + dataset + '/' + prot + '_good_hits'): 42 | fp = open('step9/' + dataset 
+ '/' + prot + '_good_hits','r') 43 | word_counts = set([]) 44 | for line in fp: 45 | words = line.split() 46 | word_counts.add(len(words)) 47 | fp.close() 48 | if len(word_counts) == 1 and 15 in word_counts: 49 | pass 50 | else: 51 | os.system('rm step9/' + dataset + '/' + prot + '_good_hits') 52 | need_prots.append(prot) 53 | else: 54 | if os.path.exists('step9/' + dataset + '/' + prot + '.done'): 55 | pass 56 | else: 57 | need_prots.append(prot) 58 | 59 | if need_prots: 60 | cmds = [] 61 | for prot in need_prots: 62 | cmds.append('step9_analyze_dali.py ' + dataset + ' ' + prot) 63 | logs = batch_run(cmds, ncore) 64 | fail = [i for i in logs if 'fail' in i] 65 | if fail: 66 | with open(dataset + '_step9.log','w') as f: 67 | for i in fail: 68 | f.write(i+'\n') 69 | else: 70 | with open(dataset + '_step9.log','w') as f: 71 | f.write('done\n') 72 | else: 73 | with open(dataset + '_step9.log','w') as f: 74 | f.write('done\n') 75 | -------------------------------------------------------------------------------- /docker/scripts/step11_get_good_domains.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys 3 | 4 | fp = open('/mnt/databases/ECOD_norms', 'r') 5 | ecod2norm = {} 6 | for line in fp: 7 | words = line.split() 8 | ecod2norm[words[0]] = float(words[1]) 9 | fp.close() 10 | 11 | spname = sys.argv[1] 12 | prot= sys.argv[2] 13 | results = [] 14 | if os.path.exists(f'step10/{spname}/{prot}_sequence.result'): 15 | fp = open(f'step10/{spname}/{prot}_sequence.result', 'r') 16 | for line in fp: 17 | words = line.split() 18 | filt_segs = [] 19 | for seg in words[6].split(','): 20 | start = int(seg.split('-')[0]) 21 | end = int(seg.split('-')[1]) 22 | for res in range(start, end + 1): 23 | if not filt_segs: 24 | filt_segs.append([res]) 25 | else: 26 | if res > filt_segs[-1][-1] + 10: 27 | filt_segs.append([res]) 28 | else: 29 | filt_segs[-1].append(res) 30 | 31 | filt_seg_strings = [] 32 | total_good_count = 0 33 | for seg in filt_segs: 34 | start = seg[0] 35 | end = seg[-1] 36 | good_count = 0 37 | for res in range(start, end + 1): 38 | good_count += 1 39 | if good_count >= 5: 40 | total_good_count += good_count 41 | filt_seg_strings.append(f'{str(start)}-{str(end)}') 42 | if total_good_count >= 25: 43 | results.append('sequence\t' + prot + '\t' + '\t'.join(words[:7]) + '\t' + ','.join(filt_seg_strings) + '\n') 44 | fp.close() 45 | 46 | if os.path.exists(f'step10/{spname}/{prot}_structure.result'): 47 | fp = open(f'step10/{spname}/{prot}_structure.result', 'r') 48 | for line in fp: 49 | words = line.split() 50 | ecodnum = words[0].split('_')[0] 51 | edomain = words[1] 52 | zscore = float(words[3]) 53 | try: 54 | znorm = round(zscore / ecod2norm[ecodnum], 2) 55 | except KeyError: 56 | znorm = 0.0 57 | qscore = float(words[4]) 58 | ztile = float(words[5]) 59 | qtile = float(words[6]) 60 | rank = float(words[7]) 61 | bestprob = float(words[8]) 62 | bestcov = float(words[9]) 63 | 64 | judge = 0 65 | if rank < 1.5: 66 | judge += 1 67 | if qscore > 0.5: 68 | judge += 1 69 | if ztile < 0.75 and ztile >= 0: 70 | judge += 1 71 | if qtile < 0.75 and qtile >= 0: 72 | judge += 1 73 | if znorm > 0.225: 74 | judge += 1 75 | 76 | seqjudge = 'no' 77 | if bestprob >= 20 and bestcov >= 0.2: 78 | judge += 1 79 | seqjudge = 'low' 80 | if bestprob >= 50 and bestcov >= 0.3: 81 | judge += 1 82 | seqjudge = 'medium' 83 | if bestprob >= 80 and bestcov >= 0.4: 84 | judge += 1 85 | seqjudge = 'high' 86 | if bestprob >= 95 and bestcov >= 0.6: 87 | 
judge += 1 88 | seqjudge = 'superb' 89 | 90 | if judge: 91 | seg_strings = words[10].split(',') 92 | filt_segs = [] 93 | for seg in words[10].split(','): 94 | start = int(seg.split('-')[0]) 95 | end = int(seg.split('-')[1]) 96 | for res in range(start, end + 1): 97 | if not filt_segs: 98 | filt_segs.append([res]) 99 | else: 100 | if res > filt_segs[-1][-1] + 10: 101 | filt_segs.append([res]) 102 | else: 103 | filt_segs[-1].append(res) 104 | 105 | filt_seg_strings = [] 106 | total_good_count = 0 107 | for seg in filt_segs: 108 | start = seg[0] 109 | end = seg[-1] 110 | good_count = 0 111 | for res in range(start, end + 1): 112 | good_count += 1 113 | if good_count >= 5: 114 | total_good_count += good_count 115 | filt_seg_strings.append(f'{str(start)}-{str(end)}') 116 | if total_good_count >= 25: 117 | results.append('structure\t' + seqjudge + '\t' + prot + '\t' + str(znorm) + '\t' + '\t'.join(words[:10]) + '\t' + ','.join(seg_strings) + '\t' + ','.join(filt_seg_strings) + '\n') 118 | fp.close() 119 | 120 | if results: 121 | rp = open(f'step11/{spname}/{prot}.goodDomains', 'w') 122 | for line in results: 123 | rp.write(line) 124 | rp.close() 125 | else: 126 | os.system(f'echo \'done\' > step11/{spname}/{prot}.done') 127 | -------------------------------------------------------------------------------- /docker/scripts/step12_get_sse.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys 3 | import numpy as np 4 | 5 | dataset = sys.argv[1] 6 | prot = sys.argv[2] 7 | 8 | os.system(f'mkdssp -i step2/{dataset}/{prot}.pdb -o step12/{dataset}/{prot}.dssp') 9 | fp = open(f'step1/{dataset}/{prot}.fa', 'r') 10 | for line in fp: 11 | if line[0] != '>': 12 | seq = line[:-1] 13 | fp.close() 14 | 15 | fp = open(f'step12/{dataset}/{prot}.dssp', 'r') 16 | start = 0 17 | dssp_result = '' 18 | resids = [] 19 | for line in fp: 20 | words = line.split() 21 | if len(words) > 3: 22 | if words[0] == '#' and words[1] == 'RESIDUE': 23 | start = 1 24 | elif start: 25 | try: 26 | resid = int(line[5:10]) 27 | getit = 1 28 | except ValueError: 29 | getit = 0 30 | 31 | if getit: 32 | pred = line[16] 33 | resids.append(resid) 34 | pred = line[16] 35 | if pred == 'E' or pred == 'B': 36 | newpred = 'E' 37 | elif pred == 'G' or pred == 'H' or pred == 'I': 38 | newpred = 'H' 39 | else: 40 | newpred = '-' 41 | dssp_result += newpred 42 | fp.close() 43 | 44 | res2sse = {} 45 | dssp_segs = dssp_result.split('--') 46 | posi = 0 47 | Nsse = 0 48 | for dssp_seg in dssp_segs: 49 | judge = 0 50 | if dssp_seg.count('E') >= 3 or dssp_seg.count('H') >= 6: 51 | Nsse += 1 52 | judge = 1 53 | for char in dssp_seg: 54 | resid = resids[posi] 55 | if char != '-': 56 | if judge: 57 | res2sse[resid] = [Nsse, char] 58 | posi += 1 59 | posi += 2 60 | 61 | os.system(f'rm step12/{dataset}/{prot}.dssp') 62 | if len(resids) != len(seq): 63 | sys.exit(1) 64 | print (f'error\t{prot}\t{str(len(resids))}\t{str(len(seq))}') 65 | else: 66 | rp = open(f'step12/{dataset}/{prot}.sse', 'w') 67 | for resid in resids: 68 | try: 69 | rp.write(f'{str(resid)}\t{seq[resid - 1]}\t{str(res2sse[resid][0])}\t{res2sse[resid][1]}\n') 70 | except KeyError: 71 | rp.write(f'{str(resid)}\t{seq[resid - 1]}\tna\tC\n') 72 | rp.close() 73 | -------------------------------------------------------------------------------- /docker/scripts/step13_get_diso.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import sys, os, time, json, math, 
string 3 | import numpy as np 4 | 5 | dataset = sys.argv[1] 6 | prot = sys.argv[2] 7 | 8 | insses = set([]) 9 | res2sse = {} 10 | fp = open(f'step12/{dataset}/{prot}.sse', 'r') 11 | for line in fp: 12 | words = line.split() 13 | if words[2] != 'na': 14 | sseid = int(words[2]) 15 | resid = int(words[0]) 16 | insses.add(resid) 17 | res2sse[resid] = sseid 18 | fp.close() 19 | 20 | hit_resids = set([]) 21 | if os.path.exists(f'step11/{dataset}/{prot}.goodDomains'): 22 | fp = open(f'step11/{dataset}/{prot}.goodDomains', 'r') 23 | for line in fp: 24 | words = line.split() 25 | if words[0] == 'sequence': 26 | segs = words[8].split(',') 27 | elif words[0] == 'structure': 28 | segs = words[14].split(',') 29 | for seg in segs: 30 | if '-' in seg: 31 | start = int(seg.split('-')[0]) 32 | end = int(seg.split('-')[1]) 33 | for resid in range(start, end+1): 34 | hit_resids.add(resid) 35 | else: 36 | resid = int(seg) 37 | hit_resids.add(resid) 38 | fp.close() 39 | 40 | 41 | fp = open(f'{dataset}/{prot}.json','r') 42 | text = fp.read()[1:-1] 43 | fp.close() 44 | get_json = 0 45 | try: 46 | json_dict = json.loads(text) 47 | get_json = 1 48 | except: 49 | pass 50 | 51 | if get_json: 52 | if 'predicted_aligned_error' in json_dict.keys(): 53 | paes = json_dict['predicted_aligned_error'] 54 | length = len(paes) 55 | rpair2error = {} 56 | for i in range(length): 57 | res1 = i + 1 58 | try: 59 | rpair2error[res1] 60 | except KeyError: 61 | rpair2error[res1] = {} 62 | for j in range(length): 63 | res2 = j + 1 64 | rpair2error[res1][res2] = paes[i][j] 65 | 66 | elif 'distance' in json_dict.keys(): 67 | resid1s = json_dict['residue1'] 68 | resid2s = json_dict['residue2'] 69 | prot_len1 = max(resid1s) 70 | prot_len2 = max(resid2s) 71 | if prot_len1 != prot_len2: 72 | print (f'error, matrix is not a square with shape ({str(prot_len1)}, {str(prot_len2)})') 73 | else: 74 | length = prot_len1 75 | 76 | allerrors = json_dict['distance'] 77 | mtx_size = len(allerrors) 78 | 79 | rpair2error = {} 80 | for i in range(mtx_size): 81 | res1 = resid1s[i] 82 | res2 = resid2s[i] 83 | try: 84 | rpair2error[res1] 85 | except KeyError: 86 | rpair2error[res1] = {} 87 | rpair2error[res1][res2] = allerrors[i] 88 | else: 89 | print ('error\t' + prot) 90 | else: 91 | print ('error\t' + prot) 92 | 93 | 94 | res2contacts = {} 95 | for i in range(length): 96 | res1 = i + 1 97 | for j in range (length): 98 | res2 = j + 1 99 | err = rpair2error[res1][res2] 100 | if res1 + 10 <= res2 and err < 12: 101 | if res2 in insses: 102 | if res1 in insses and res2sse[res1] == res2sse[res2]: 103 | pass 104 | else: 105 | try: 106 | res2contacts[res1].append(res2) 107 | except KeyError: 108 | res2contacts[res1] = [res2] 109 | if res1 in insses: 110 | if res2 in insses and res2sse[res2] == res2sse[res1]: 111 | pass 112 | else: 113 | try: 114 | res2contacts[res2].append(res1) 115 | except KeyError: 116 | res2contacts[res2] = [res1] 117 | 118 | 119 | diso_resids = set([]) 120 | for start in range (1, length - 9): 121 | total_contact = 0 122 | hitres_count = 0 123 | for res in range(start, start + 10): 124 | if res in hit_resids: 125 | hitres_count += 1 126 | if res in insses: 127 | try: 128 | total_contact += len(res2contacts[res]) 129 | except KeyError: 130 | pass 131 | if total_contact <= 30 and hitres_count <= 5: 132 | for res in range(start, start + 10): 133 | diso_resids.add(res) 134 | 135 | diso_resids_list = list(diso_resids) 136 | diso_resids_list.sort() 137 | 138 | rp = open(f'step13/{dataset}/{prot}.diso', 'w') 139 | for resid in diso_resids_list: 
140 | rp.write(f'{str(resid)}\n') 141 | rp.close() 142 | -------------------------------------------------------------------------------- /docker/scripts/step16_run_domass.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys 3 | import numpy as np 4 | import tensorflow as tf 5 | from tensorflow.python.client import device_lib 6 | 7 | dataset = sys.argv[1] 8 | local_device_protos = device_lib.list_local_devices() 9 | gpus = [x.name for x in local_device_protos if x.device_type == 'GPU'] 10 | if not gpus: 11 | print("No GPUs found. Falling back to CPU.") 12 | config = tf.ConfigProto() 13 | else: 14 | config = tf.ConfigProto(gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.9)) 15 | config.gpu_options.allow_growth = True 16 | 17 | fp = open('step16_' + dataset + '.list', 'r') 18 | prots = [] 19 | for line in fp: 20 | words = line.split() 21 | prots.append(words[0]) 22 | fp.close() 23 | 24 | all_cases = [] 25 | all_inputs = [] 26 | for prot in prots: 27 | if os.path.exists('step15/' + dataset + '/' + prot + '.data'): 28 | fp = open('step15/' + dataset + '/' + prot + '.data', 'r') 29 | for countl, line in enumerate(fp): 30 | if countl: 31 | words = line.split() 32 | all_cases.append([prot, words[0], words[1], words[2], words[3], words[17], words[18], words[19], words[20], words[21], words[22]]) 33 | all_inputs.append([float(words[4]), float(words[5]), float(words[6]), float(words[7]), float(words[8]), float(words[9]), float(words[10]), float(words[11]), float(words[12]), float(words[13]), float(words[14]), float(words[15]), float(words[16])]) 34 | fp.close() 35 | total_case = len(all_cases) 36 | 37 | 38 | def get_feed(batch_inputs): 39 | inputs = np.zeros((100, 13), dtype = np.float32) 40 | for i in range(100): 41 | for j, value in enumerate(batch_inputs[i]): 42 | inputs[i, j] = value 43 | feed_dict = {myinputs: inputs} 44 | return feed_dict 45 | 46 | dense = tf.compat.v1.layers.dense 47 | with tf.Graph().as_default(): 48 | with tf.name_scope('input'): 49 | myinputs = tf.placeholder(dtype = tf.float32, shape = (100, 13)) 50 | layers = [myinputs] 51 | layers.append(dense(layers[-1], 64, activation = tf.nn.relu)) 52 | preds = dense(layers[-1], 2, activation = tf.nn.softmax) 53 | saver = tf.train.Saver() 54 | 55 | with tf.Session(config = config) as sess: 56 | saver.restore(sess, '/mnt/databases/domass_epo29') 57 | all_preds = [] 58 | if total_case >= 100: 59 | batch_count = total_case // 100 60 | get_case = 0 61 | for i in range(batch_count): 62 | batch_inputs = all_inputs[i * 100 : i * 100 + 100] 63 | batch_preds = sess.run(preds, feed_dict = get_feed(batch_inputs)) 64 | get_case += 100 65 | for j in range(100): 66 | all_preds.append(batch_preds[j,:]) 67 | if i % 1000 == 0: 68 | print ('prediction for batch ' + str(i)) 69 | 70 | remain_case = total_case - get_case 71 | add_case = 100 - remain_case 72 | batch_inputs = all_inputs[get_case:] + all_inputs[:add_case] 73 | batch_preds = sess.run(preds, feed_dict = get_feed(batch_inputs)) 74 | for j in range(remain_case): 75 | all_preds.append(batch_preds[j,:]) 76 | 77 | else: 78 | fold = 100 // total_case + 1 79 | pseudo_inputs = all_inputs * fold 80 | batch_inputs = pseudo_inputs[:100] 81 | batch_preds = sess.run(preds, feed_dict = get_feed(batch_inputs)) 82 | for j in range(total_case): 83 | all_preds.append(batch_preds[j,:]) 84 | 85 | prot2results = {} 86 | for prot in prots: 87 | prot2results[prot] = [] 88 | for i in range(total_case): 89 | this_case = 
all_cases[i] 90 | this_input = all_inputs[i] 91 | this_pred = all_preds[i] 92 | prot = this_case[0] 93 | prot2results[prot].append([this_case[1], this_case[2], this_case[3], this_case[4], this_pred[1], this_input[3], this_input[4], this_input[5], this_input[6], this_input[7], this_input[8], this_input[9], this_input[10], this_input[11], this_input[12], this_case[5], this_case[6], this_case[7], this_case[8], this_case[9], this_case[10]]) 94 | for prot in prots: 95 | if prot2results[prot]: 96 | rp = open('step16/' + dataset + '/' + prot + '.result', 'w') 97 | rp.write('Domain\tRange\tTgroup\tECOD_ref\tDPAM_prob\tHH_prob\tHH_cov\tHH_rank\tDALI_zscore\tDALI_qscore\tDALI_ztile\tDALI_qtile\tDALI_rank\tConsensus_diff\tConsensus_cov\tHH_hit\tDALI_hit\tDALI_rot1\tDALI_rot2\tDALI_rot3\tDALI_trans\n') 98 | for item in prot2results[prot]: 99 | rp.write(f'{item[0]}\t{item[1]}\t{item[2]}\t{item[3]}\t{str(round(item[4], 4))}\t{str(item[5])}\t{str(item[6])}\t{str(item[7])}\t{str(item[8])}\t{str(item[9])}\t{str(item[10])}\t{str(item[11])}\t{str(item[12])}\t{str(item[13])}\t{str(item[14])}\t{item[15]}\t{item[16]}\t{item[17]}\t{item[18]}\t{item[19]}\t{item[20]}\n') 100 | rp.close() 101 | else: 102 | os.system('echo \'done\' > step16/' + dataset + '/' + prot + '.done') 103 | -------------------------------------------------------------------------------- /docker/scripts/step17_get_confident.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | 3 | import os, sys 4 | 5 | dataset = sys.argv[1] 6 | prot = sys.argv[2] 7 | if os.path.exists('step16/' + dataset + '/' + prot + '.result'): 8 | fp = open('step16/' + dataset + '/' + prot + '.result','r') 9 | domains = [] 10 | domain2range = {} 11 | domain2hits = {} 12 | for countl, line in enumerate(fp): 13 | if countl: 14 | words = line.split() 15 | domain = words[0] 16 | drange = words[1] 17 | tgroup = words[2] 18 | refdom = words[3] 19 | prob = float(words[4]) 20 | domain2range[domain] = drange 21 | try: 22 | domain2hits[domain].append([tgroup, refdom, prob]) 23 | except KeyError: 24 | domains.append(domain) 25 | domain2hits[domain] = [[tgroup, refdom, prob]] 26 | fp.close() 27 | 28 | results = [] 29 | for domain in domains: 30 | drange = domain2range[domain] 31 | tgroups = [] 32 | tgroup2best = {} 33 | for hit in domain2hits[domain]: 34 | tgroup = hit[0] 35 | refdom = hit[1] 36 | prob = hit[2] 37 | try: 38 | if prob > tgroup2best[tgroup]: 39 | tgroup2best[tgroup] = prob 40 | except KeyError: 41 | tgroups.append(tgroup) 42 | tgroup2best[tgroup] = prob 43 | 44 | domain2hits[domain].sort(key = lambda x:x[2], reverse = True) 45 | for hit in domain2hits[domain]: 46 | tgroup = hit[0] 47 | refdom = hit[1] 48 | prob = hit[2] 49 | if prob >= 0.6: 50 | similar_tgroups = set([]) 51 | for ogroup in tgroups: 52 | if prob < tgroup2best[ogroup] + 0.05: 53 | similar_tgroups.add(ogroup) 54 | similar_hgroups = set([]) 55 | for group in similar_tgroups: 56 | hgroup = group.split('.')[0] + '.' 
+ group.split('.')[1] 57 | similar_hgroups.add(hgroup) 58 | 59 | if len(similar_tgroups) == 1: 60 | judge = 'good' 61 | elif len(similar_hgroups) == 1: 62 | judge = 'ok' 63 | else: 64 | judge = 'bad' 65 | results.append(domain + '\t' + drange + '\t' + tgroup + '\t' + refdom + '\t' + str(prob) + '\t' + judge + '\n') 66 | 67 | if results: 68 | rp = open('step17/' + dataset + '/' + prot + '.result','w') 69 | for line in results: 70 | rp.write(line) 71 | rp.close() 72 | else: 73 | os.system('echo \'done\' > step17/' + dataset + '/' + prot + '.done') 74 | else: 75 | os.system('echo \'done\' > step17/' + dataset + '/' + prot + '.done') 76 | -------------------------------------------------------------------------------- /docker/scripts/step18_get_mapping.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys 3 | import numpy as np 4 | 5 | def get_resids(domain_range): 6 | domain_resids = [] 7 | for seg in domain_range.split(','): 8 | if '-' in seg: 9 | start = int(seg.split('-')[0]) 10 | end = int(seg.split('-')[1]) 11 | for res in range(start, end + 1): 12 | domain_resids.append(res) 13 | else: 14 | domain_resids.append(int(seg)) 15 | return domain_resids 16 | 17 | def check_overlap(residsA, residsB): 18 | overlap = set(residsA).intersection(set(residsB)) 19 | if len(overlap) >= len(residsA) * 0.33: 20 | if len(overlap) >= len(residsA) * 0.5 or len(overlap) >= len(residsB) * 0.5: 21 | return 1 22 | else: 23 | return 0 24 | else: 25 | return 0 26 | 27 | def get_range(resids): 28 | resids.sort() 29 | segs = [] 30 | for resid in resids: 31 | if not segs: 32 | segs.append([resid]) 33 | else: 34 | if resid > segs[-1][-1] + 1: 35 | segs.append([resid]) 36 | else: 37 | segs[-1].append(resid) 38 | ranges = [] 39 | for seg in segs: 40 | ranges.append(f'{str(seg[0])}-{str(seg[-1])}') 41 | return ','.join(ranges) 42 | 43 | 44 | spname = sys.argv[1] 45 | prot = sys.argv[2] 46 | HHhits = [] 47 | if os.path.exists('step5/' + spname + '/' + prot + '.result'): 48 | fp = open('step5/' + spname + '/' + prot + '.result', 'r') 49 | for countl, line in enumerate(fp): 50 | if countl: 51 | words = line.split() 52 | ecodid = words[1] 53 | getres = set([]) 54 | resmap = {} 55 | fp1 = open('/mnt/databases/ECOD_maps/' + ecodid + '.map', 'r') 56 | for line1 in fp1: 57 | words1 = line1.split() 58 | getres.add(int(words1[1])) 59 | resmap[int(words1[1])] = int(words1[0]) 60 | fp1.close() 61 | hhprob = float(words[3]) / 100 62 | 63 | raw_qresids = get_resids(words[12]) 64 | raw_tresids = get_resids(words[13]) 65 | qresids = [] 66 | tresids = [] 67 | for i in range(len(raw_qresids)): 68 | if raw_tresids[i] in getres: 69 | qresid = raw_qresids[i] 70 | tresid = resmap[raw_tresids[i]] 71 | qresids.append(qresid) 72 | tresids.append(tresid) 73 | HHhits.append([ecodid, hhprob, qresids, tresids]) 74 | fp.close() 75 | 76 | DALIhits = [] 77 | if os.path.exists('step9/' + spname + '/' + prot + '_good_hits'): 78 | fp = open('step9/' + spname + '/' + prot + '_good_hits', 'r') 79 | for countl, line in enumerate(fp): 80 | if countl: 81 | words = line.split() 82 | ecodid = words[1] 83 | zscore = float(words[4]) / 10 84 | qresids = get_resids(words[9]) 85 | tresids = get_resids(words[10]) 86 | DALIhits.append([ecodid, zscore, qresids, tresids]) 87 | fp.close() 88 | 89 | if os.path.exists('step17/' + spname + '/' + prot + '.result'): 90 | fp = open('step17/' + spname + '/' + prot + '.result', 'r') 91 | domains = [] 92 | domain2def = {} 93 | domain2resids = {} 94 | 
domain2hits = {} 95 | for line in fp: 96 | words = line.split() 97 | dname = words[0] 98 | try: 99 | domain2resids[dname] 100 | except KeyError: 101 | domains.append(dname) 102 | domain2resids[dname] = get_resids(words[1]) 103 | domain2def[dname] = words[1] 104 | tgroup = words[2] 105 | ecodhit = words[3] 106 | prob = float(words[4]) 107 | judge = words[5] 108 | try: 109 | domain2hits[dname] 110 | except KeyError: 111 | domain2hits[dname] = {} 112 | domain2hits[dname][ecodhit] = [prob, tgroup, judge] 113 | fp.close() 114 | 115 | results = [] 116 | for domain in domains: 117 | domain_resids = domain2resids[domain] 118 | domain_residset = set(domain_resids) 119 | hitinfo = domain2hits[domain] 120 | good_hits = list(hitinfo.keys()) 121 | 122 | Hecods = set([]) 123 | ecod2Hhit = {} 124 | for hit in HHhits: 125 | ecodid = hit[0] 126 | Hprob = hit[1] 127 | Hqresids = hit[2] 128 | Htresids = hit[3] 129 | if check_overlap(domain_resids, Hqresids): 130 | try: 131 | if Hprob > ecod2Hhit[ecodid][0]: 132 | ecod2Hhit[ecodid] = [Hprob, Hqresids, Htresids] 133 | except KeyError: 134 | Hecods.add(ecodid) 135 | ecod2Hhit[ecodid] = [Hprob, Hqresids, Htresids] 136 | 137 | Decods = set([]) 138 | ecod2Dhit = {} 139 | for hit in DALIhits: 140 | ecodid = hit[0] 141 | Dzscore = hit[1] 142 | Dqresids = hit[2] 143 | Dtresids = hit[3] 144 | if check_overlap(domain_resids, Dqresids): 145 | try: 146 | if Dzscore > ecod2Dhit[ecodid][0]: 147 | ecod2Dhit[ecodid] = [Dzscore, Dqresids, Dtresids] 148 | except KeyError: 149 | Decods.add(ecodid) 150 | ecod2Dhit[ecodid] = [Dzscore, Dqresids, Dtresids] 151 | 152 | for hit in good_hits: 153 | [DPAMprob, tgroup, judge] = hitinfo[hit] 154 | if hit in Hecods: 155 | HQresids = ecod2Hhit[hit][1] 156 | HTresids = ecod2Hhit[hit][2] 157 | Hresids = [] 158 | if len(HQresids) != len(HTresids): 159 | print (spname, prot, domain, hit) 160 | Hresid_string = 'na' 161 | elif HQresids: 162 | for i in range(len(HQresids)): 163 | if HQresids[i] in domain_residset: 164 | Hresids.append(HTresids[i]) 165 | Hresid_string = get_range(Hresids) 166 | else: 167 | Hresid_string = 'na' 168 | else: 169 | Hresid_string = 'na' 170 | 171 | if hit in Decods: 172 | DQresids = ecod2Dhit[hit][1] 173 | DTresids = ecod2Dhit[hit][2] 174 | Dresids = [] 175 | if len(DQresids) != len(DTresids): 176 | print (spname, prot, domain, hit) 177 | Dresid_string = 'na' 178 | elif DQresids: 179 | for i in range(len(DQresids)): 180 | if DQresids[i] in domain_residset: 181 | Dresids.append(DTresids[i]) 182 | Dresid_string = get_range(Dresids) 183 | else: 184 | Dresid_string = 'na' 185 | else: 186 | Dresid_string = 'na' 187 | results.append(domain + '\t' + domain2def[domain] + '\t' + hit + '\t' + tgroup + '\t' + str(DPAMprob) + '\t' + judge + '\t' + Hresid_string + '\t' + Dresid_string + '\n') 188 | 189 | if results: 190 | rp = open('step18/' + spname + '/' + prot + '.data', 'w') 191 | for line in results: 192 | rp.write(line) 193 | rp.close() 194 | else: 195 | os.system('echo \'done\' > step18/' + spname + '/' + prot + '.done') 196 | else: 197 | os.system('echo \'done\' > step18/' + spname + '/' + prot + '.done') 198 | -------------------------------------------------------------------------------- /docker/scripts/step19_get_merge_candidates.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys 3 | 4 | def get_resids(domain_range): 5 | domain_resids = [] 6 | for seg in domain_range.split(','): 7 | if '-' in seg: 8 | start = int(seg.split('-')[0]) 9 | end 
= int(seg.split('-')[1]) 10 | for res in range(start, end + 1): 11 | domain_resids.append(res) 12 | else: 13 | domain_resids.append(int(seg)) 14 | return domain_resids 15 | 16 | 17 | spname = sys.argv[1] 18 | prot = sys.argv[2] 19 | if os.path.exists('step18/' + spname + '/' + prot + '.data'): 20 | fp = open('step18/' + spname + '/' + prot + '.data','r') 21 | need_ecods = set([]) 22 | for line in fp: 23 | words = line.split() 24 | need_ecods.add(words[2]) 25 | fp.close() 26 | 27 | fp = open('/mnt/databases/ECOD_length','r') 28 | ecod2length = {} 29 | for line in fp: 30 | words = line.split() 31 | if words[0] in need_ecods: 32 | ecod2length[words[0]] = int(words[2]) 33 | fp.close() 34 | 35 | ecod2totW = {} 36 | ecod2posW = {} 37 | for ecod in need_ecods: 38 | ecod2totW[ecod] = 0 39 | ecod2posW[ecod] = {} 40 | if os.path.exists('/mnt/databases/posi_weights/' + ecod + '.weight'): 41 | fp = open('/mnt/databases/posi_weights/' + ecod + '.weight','r') 42 | for line in fp: 43 | words = line.split() 44 | resid = int(words[0]) 45 | weight = float(words[3]) 46 | ecod2totW[ecod] += weight 47 | ecod2posW[ecod][resid] = weight 48 | fp.close() 49 | else: 50 | ecod2totW[ecod] = ecod2length[ecod] 51 | for i in range(ecod2length[ecod]): 52 | ecod2posW[ecod][i + 1] = 1 53 | 54 | 55 | fp = open('step18/' + spname + '/' + prot + '.data','r') 56 | domains = [] 57 | ecods = [] 58 | domain2def = {} 59 | domain2prob = {} 60 | domain2hits = {} 61 | ecod2hits = {} 62 | for line in fp: 63 | words = line.split() 64 | domain = words[0] 65 | domdef = words[1] 66 | ecod = words[2] 67 | tgroup = words[3] 68 | prob = float(words[4]) 69 | try: 70 | if prob > domain2prob[domain]: 71 | domain2prob[domain] = prob 72 | except KeyError: 73 | domain2def[domain] = domdef 74 | domain2prob[domain] = prob 75 | 76 | if words[6] == 'na': 77 | Hresids = [] 78 | else: 79 | Hresids = get_resids(words[6]) 80 | if words[7] == 'na': 81 | Dresids = [] 82 | else: 83 | Dresids = get_resids(words[7]) 84 | if len(Dresids) > len(Hresids) * 0.5: 85 | HDresids = set(Dresids) 86 | else: 87 | HDresids = set(Hresids) 88 | 89 | total_weight = ecod2totW[ecod] 90 | get_weight = 0 91 | for resid in HDresids: 92 | try: 93 | get_weight += ecod2posW[ecod][resid] 94 | except KeyError: 95 | print (prot, ecod, resid) 96 | try: 97 | domain2hits[domain].append([ecod, tgroup, prob, get_weight / total_weight]) 98 | except KeyError: 99 | domains.append(domain) 100 | domain2hits[domain] = [[ecod, tgroup, prob, get_weight / total_weight]] 101 | try: 102 | ecod2hits[ecod].append([domain, tgroup, prob, HDresids]) 103 | except KeyError: 104 | ecods.append(ecod) 105 | ecod2hits[ecod] = [[domain, tgroup, prob, HDresids]] 106 | fp.close() 107 | 108 | 109 | domain_pairs = [] 110 | dpair2supports = {} 111 | for ecod in ecods: 112 | if len(ecod2hits[ecod]) > 1: 113 | for c1, hit1 in enumerate(ecod2hits[ecod]): 114 | for c2, hit2 in enumerate(ecod2hits[ecod]): 115 | if c1 < c2: 116 | domain1 = hit1[0] 117 | tgroup1 = hit1[1] 118 | prob1 = hit1[2] 119 | get_resids1 = hit1[3] 120 | domain2 = hit2[0] 121 | tgroup2 = hit2[1] 122 | prob2 = hit2[2] 123 | get_resids2 = hit2[3] 124 | if prob1 + 0.1 > domain2prob[domain1] and prob2 + 0.1 > domain2prob[domain2]: 125 | common_resids = get_resids1.intersection(get_resids2) 126 | if len(common_resids) < 0.25 * len(get_resids1) or len(common_resids) < 0.25 * len(get_resids2): 127 | domain_pair = domain1 + '_' + domain2 128 | try: 129 | dpair2supports[domain_pair].append([ecod, tgroup1, prob1, prob2]) 130 | except KeyError: 131 | 
domain_pairs.append(domain_pair) 132 | dpair2supports[domain_pair] = [[ecod, tgroup1, prob1, prob2]] 133 | 134 | 135 | merge_pairs = [] 136 | merge_info = [] 137 | for domain_pair in domain_pairs: 138 | domain1 = domain_pair.split('_')[0] 139 | domain2 = domain_pair.split('_')[1] 140 | support_ecods = set([]) 141 | for item in dpair2supports[domain_pair]: 142 | ecod = item[0] 143 | support_ecods.add(ecod) 144 | against_ecods1 = set([]) 145 | against_ecods2 = set([]) 146 | merge_info.append(domain1 + ',' + domain2 + '\t' + ','.join(support_ecods)) 147 | 148 | for item in domain2hits[domain1]: 149 | ecod = item[0] 150 | tgroup = item[1] 151 | prob = item[2] 152 | ratio = item[3] 153 | if prob + 0.1 > domain2prob[domain1]: 154 | if ratio > 0.5: 155 | if not ecod in support_ecods: 156 | against_ecods1.add(ecod) 157 | 158 | for item in domain2hits[domain2]: 159 | ecod = item[0] 160 | tgroup = item[1] 161 | prob = item[2] 162 | ratio = item[3] 163 | if prob + 0.1 > domain2prob[domain2]: 164 | if ratio > 0.5: 165 | if not ecod in support_ecods: 166 | against_ecods2.add(ecod) 167 | 168 | if len(support_ecods) > len(against_ecods1) or len(support_ecods) > len(against_ecods2): 169 | merge_pairs.append([domain1, domain2]) 170 | 171 | if merge_info: 172 | rp = open('step19/' + spname + '/' + prot + '.info','w') 173 | for merge_line in merge_info: 174 | rp.write(merge_line + '\n') 175 | rp.close() 176 | 177 | if merge_pairs: 178 | rp = open('step19/' + spname + '/' + prot + '.result','w') 179 | for merge_pair in merge_pairs: 180 | domain1 = merge_pair[0] 181 | domain2 = merge_pair[1] 182 | rp.write(domain1 + '\t' + domain2def[domain1] + '\t' + domain2 + '\t' + domain2def[domain2] + '\n') 183 | rp.close() 184 | 185 | if merge_info and merge_pairs: 186 | pass 187 | else: 188 | os.system('echo \'done\' > step19/' + spname + '/' + prot + '.done') 189 | else: 190 | os.system('echo \'done\' > step19/' + spname + '/' + prot + '.done') 191 | -------------------------------------------------------------------------------- /docker/scripts/step1_get_AFDB_seqs.py: -------------------------------------------------------------------------------- 1 | #!/usr1/local/bin/python 2 | import os, sys 3 | import pdbx 4 | from pdbx.reader.PdbxReader import PdbxReader 5 | 6 | three2one = {} 7 | three2one["ALA"] = 'A' 8 | three2one["CYS"] = 'C' 9 | three2one["ASP"] = 'D' 10 | three2one["GLU"] = 'E' 11 | three2one["PHE"] = 'F' 12 | three2one["GLY"] = 'G' 13 | three2one["HIS"] = 'H' 14 | three2one["ILE"] = 'I' 15 | three2one["LYS"] = 'K' 16 | three2one["LEU"] = 'L' 17 | three2one["MET"] = 'M' 18 | three2one["MSE"] = 'M' 19 | three2one["ASN"] = 'N' 20 | three2one["PRO"] = 'P' 21 | three2one["GLN"] = 'Q' 22 | three2one["ARG"] = 'R' 23 | three2one["SER"] = 'S' 24 | three2one["THR"] = 'T' 25 | three2one["VAL"] = 'V' 26 | three2one["TRP"] = 'W' 27 | three2one["TYR"] = 'Y' 28 | 29 | dataset = sys.argv[1] 30 | prot = sys.argv[2] 31 | 32 | flag=1 33 | 34 | if os.path.exists(dataset + "/" + prot + ".cif"): 35 | cif = open(dataset + "/" + prot + ".cif") 36 | pRd = PdbxReader(cif) 37 | data = [] 38 | pRd.read(data) 39 | block = data[0] 40 | 41 | modinfo = {} 42 | mod_residues = block.getObj("pdbx_struct_mod_residue") 43 | if mod_residues: 44 | chainid = mod_residues.getIndex("label_asym_id") 45 | posiid = mod_residues.getIndex("label_seq_id") 46 | parentid = mod_residues.getIndex("parent_comp_id") 47 | resiid = mod_residues.getIndex("label_comp_id") 48 | for i in range(mod_residues.getRowCount()): 49 | words = mod_residues.getRow(i) 50 | 
try: 51 | modinfo[words[chainid]] 52 | except KeyError: 53 | modinfo[words[chainid]] = {} 54 | modinfo[words[chainid]][words[posiid]] = [words[resiid], words[parentid]] 55 | 56 | entity_poly = block.getObj("entity_poly") 57 | pdbx_poly_seq_scheme = block.getObj("pdbx_poly_seq_scheme") 58 | if pdbx_poly_seq_scheme and entity_poly: 59 | typeid = entity_poly.getIndex("type") 60 | entityid1 = entity_poly.getIndex("entity_id") 61 | entityid2 = pdbx_poly_seq_scheme.getIndex("entity_id") 62 | chainid = pdbx_poly_seq_scheme.getIndex("asym_id") 63 | resiid = pdbx_poly_seq_scheme.getIndex("mon_id") 64 | posiid = pdbx_poly_seq_scheme.getIndex("seq_id") 65 | 66 | good_entities = [] 67 | for i in range(entity_poly.getRowCount()): 68 | words = entity_poly.getRow(i) 69 | entity = words[entityid1] 70 | type = words[typeid] 71 | if type == "polypeptide(L)": 72 | good_entities.append(entity) 73 | 74 | if good_entities: 75 | chains = [] 76 | residues = {} 77 | seqs = {} 78 | rp = open("step1/" + dataset + "/" + prot + ".fa","w") 79 | for i in range(pdbx_poly_seq_scheme.getRowCount()): 80 | words = pdbx_poly_seq_scheme.getRow(i) 81 | entity = words[entityid2] 82 | if entity in good_entities: 83 | chain = words[chainid] 84 | 85 | try: 86 | aa = three2one[words[resiid]] 87 | except KeyError: 88 | try: 89 | modinfo[chain][words[posiid]] 90 | resiname = modinfo[chain][words[posiid]][0] 91 | if words[resiid] == resiname: 92 | new_resiname = modinfo[chain][words[posiid]][1] 93 | try: 94 | aa = three2one[new_resiname] 95 | except KeyError: 96 | aa = "X" 97 | print ("error1 " + new_resiname) 98 | else: 99 | aa = "X" 100 | print ("error2 " + words[resiid] + " " + resiname) 101 | except KeyError: 102 | print (modinfo) 103 | print (words[resiid]) 104 | aa = "X" 105 | try: 106 | seqs[chain] 107 | except KeyError: 108 | chains.append(chain) 109 | seqs[chain] = {} 110 | 111 | try: 112 | if seqs[chain][int(words[posiid])] == "X" and aa != "X": 113 | seqs[chain][int(words[posiid])] = aa 114 | except KeyError: 115 | seqs[chain][int(words[posiid])] = aa 116 | 117 | try: 118 | residues[chain].add(int(words[posiid])) 119 | except KeyError: 120 | residues[chain] = set([int(words[posiid])]) 121 | 122 | for chain in chains: 123 | for i in range(len(residues[chain])): 124 | if not i + 1 in residues[chain]: 125 | flag = 0 126 | print ("error3 " + prot + " " + chain) 127 | break 128 | else: 129 | rp.write(">" + prot + "\n") 130 | finalseq = [] 131 | for i in range(len(residues[chain])): 132 | finalseq.append(seqs[chain][i+1]) 133 | rp.write("".join(finalseq) + "\n") 134 | rp.close() 135 | else: 136 | flag = 0 137 | print ("empty " + prot) 138 | else: 139 | flag = 0 140 | print ("bad " + prot) 141 | elif os.path.exists(dataset + "/" + prot + ".pdb"): 142 | os.system(f'pdb2fasta '+ dataset + "/" + prot + ".pdb > step1/" + dataset + "/" + prot + ".fa") 143 | with open("step1/" + dataset + "/" + prot + ".fa") as f: 144 | fa = f.readlines() 145 | fa[0] = fa[0].split(':')[0] + '\n' 146 | with open("step1/" + dataset + "/" + prot + ".fa",'w') as f: 147 | f.write(''.join(fa)) 148 | else: 149 | flag = 0 150 | print("No recognized structure file (*.cif or *.pdb). 
Existing...") 151 | if flag == 0: 152 | sys.exit(1) 153 | -------------------------------------------------------------------------------- /docker/scripts/step20_extract_domains.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys 3 | 4 | dataset = sys.argv[1] 5 | if not os.path.exists('step20'): 6 | os.system('mkdir step20') 7 | os.system('mkdir step20/' + dataset) 8 | 9 | fp = os.popen('ls -1 step19/' + dataset + '/*.result') 10 | prots = [] 11 | for line in fp: 12 | prot = line.split('/')[2].split('.result')[0] 13 | prots.append(prot) 14 | fp.close() 15 | 16 | domains = [] 17 | for prot in prots: 18 | get_domains = set([]) 19 | fp = open('step19/' + dataset + '/' + prot + '.result', 'r') 20 | for line in fp: 21 | words = line.split() 22 | domain1 = words[0] 23 | if not domain1 in get_domains: 24 | get_domains.add(domain1) 25 | resids1 = set([]) 26 | for seg in words[1].split(','): 27 | if '-' in seg: 28 | start = int(seg.split('-')[0]) 29 | end = int(seg.split('-')[1]) 30 | for res in range(start, end + 1): 31 | resids1.add(res) 32 | else: 33 | resids1.add(int(seg)) 34 | domains.append([prot, domain1, resids1]) 35 | 36 | domain2 = words[2] 37 | if not domain2 in get_domains: 38 | get_domains.add(domain2) 39 | resids2 = set([]) 40 | for seg in words[3].split(','): 41 | if '-' in seg: 42 | start = int(seg.split('-')[0]) 43 | end = int(seg.split('-')[1]) 44 | for res in range(start, end + 1): 45 | resids2.add(res) 46 | else: 47 | resids2.add(int(seg)) 48 | domains.append([prot, domain2, resids2]) 49 | fp.close() 50 | 51 | for item in domains: 52 | prot = item[0] 53 | dname = item[1] 54 | resids = item[2] 55 | fp = open('step2/' + dataset + '/' + prot + '.pdb', 'r') 56 | rp = open('step20/' + dataset + '/' + prot + '_' + dname + '.pdb', 'w') 57 | for line in fp: 58 | resid = int(line[22:26]) 59 | if resid in resids: 60 | rp.write(line) 61 | fp.close() 62 | rp.close() 63 | -------------------------------------------------------------------------------- /docker/scripts/step21_compare_domains.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys 3 | 4 | def get_seq_dist(residsA, residsB, good_resids): 5 | indsA = [] 6 | for ind, resid in enumerate(good_resids): 7 | if resid in residsA: 8 | indsA.append(ind) 9 | indsB = [] 10 | for ind, resid in enumerate(good_resids): 11 | if resid in residsB: 12 | indsB.append(ind) 13 | 14 | connected = 0 15 | for indA in indsA: 16 | for indB in indsB: 17 | if abs(indA - indB) <= 5: 18 | connected = 1 19 | break 20 | if connected: 21 | break 22 | return connected 23 | 24 | dataset = sys.argv[1] 25 | part = sys.argv[2] 26 | fp = open('step21_' + dataset + '_' + part + '.list','r') 27 | cases = [] 28 | for line in fp: 29 | words = line.split() 30 | cases.append(words) 31 | fp.close() 32 | 33 | rp = open('step21_' + dataset + '_' + part + '.result', 'w') 34 | for case in cases: 35 | prot = case[0] 36 | good_resids = [] 37 | fp = open('step14/' + dataset + '/' + prot + '.domains','r') 38 | for line in fp: 39 | words = line.split() 40 | for seg in words[1].split(','): 41 | if '-' in seg: 42 | start = int(seg.split('-')[0]) 43 | end = int(seg.split('-')[1]) 44 | for res in range(start, end + 1): 45 | good_resids.append(res) 46 | else: 47 | good_resids.append(int(seg)) 48 | fp.close() 49 | good_resids.sort() 50 | 51 | dom1 = case[1] 52 | segs1 = case[2] 53 | residsA = [] 54 | for seg in segs1.split(','): 55 | if 
'-' in seg: 56 | start = int(seg.split('-')[0]) 57 | end = int(seg.split('-')[1]) 58 | for res in range(start, end + 1): 59 | residsA.append(res) 60 | else: 61 | residsA.append(int(seg)) 62 | 63 | dom2 = case[3] 64 | segs2 = case[4] 65 | residsB = [] 66 | for seg in segs2.split(','): 67 | if '-' in seg: 68 | start = int(seg.split('-')[0]) 69 | end = int(seg.split('-')[1]) 70 | for res in range(start, end + 1): 71 | residsB.append(res) 72 | else: 73 | residsB.append(int(seg)) 74 | 75 | if get_seq_dist(set(residsA), set(residsB), good_resids): 76 | judge = 1 77 | else: 78 | resid2coors = {} 79 | fp = open('step20/' + dataset + '/' + prot + '_' + dom1 + '.pdb', 'r') 80 | for line in fp: 81 | resid = int(line[22:26]) 82 | coorx = float(line[30:38]) 83 | coory = float(line[38:46]) 84 | coorz = float(line[46:54]) 85 | try: 86 | resid2coors[resid].append([coorx, coory, coorz]) 87 | except KeyError: 88 | resid2coors[resid] = [[coorx, coory, coorz]] 89 | fp.close() 90 | 91 | fp = open('step20/' + dataset + '/' + prot + '_' + dom2 + '.pdb', 'r') 92 | for line in fp: 93 | resid = int(line[22:26]) 94 | coorx = float(line[30:38]) 95 | coory = float(line[38:46]) 96 | coorz = float(line[46:54]) 97 | try: 98 | resid2coors[resid].append([coorx, coory, coorz]) 99 | except KeyError: 100 | resid2coors[resid] = [[coorx, coory, coorz]] 101 | fp.close() 102 | 103 | interface_count = 0 104 | for residA in residsA: 105 | for residB in residsB: 106 | dists = [] 107 | coorsA = resid2coors[residA] 108 | coorsB = resid2coors[residB] 109 | for coorA in coorsA: 110 | for coorB in coorsB: 111 | dist = ((coorA[0] - coorB[0]) ** 2 + (coorA[1] - coorB[1]) ** 2 + (coorA[2] - coorB[2]) ** 2) ** 0.5 112 | dists.append(dist) 113 | min_dist = min(dists) 114 | if min_dist <= 8: 115 | interface_count += 1 116 | if interface_count >= 9: 117 | judge = 2 118 | else: 119 | judge = 0 120 | rp.write(prot + '\t' + dom1 + '\t' + dom2 + '\t' + str(judge) + '\t' + segs1 + '\t' + segs2 + '\n') 121 | rp.close() 122 | -------------------------------------------------------------------------------- /docker/scripts/step22_merge_domains.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import sys 3 | 4 | def get_range(resids): 5 | if resids: 6 | resids = list(resids) 7 | resids.sort() 8 | segs = [] 9 | for resid in resids: 10 | if not segs: 11 | segs.append([resid]) 12 | else: 13 | if resid > segs[-1][-1] + 1: 14 | segs.append([resid]) 15 | else: 16 | segs[-1].append(resid) 17 | ranges = [] 18 | for seg in segs: 19 | start = seg[0] 20 | end = seg[-1] 21 | ranges.append(str(start) + '-' + str(end)) 22 | return ','.join(ranges) 23 | else: 24 | return 'na' 25 | 26 | 27 | dataset = sys.argv[1] 28 | fp = open('step21_' + dataset + '.result', 'r') 29 | get_prots = set([]) 30 | prot2merges = {} 31 | domain2resids = {} 32 | for line in fp: 33 | words = line.split() 34 | prot = words[0] 35 | dom1 = words[1] 36 | dom2 = words[2] 37 | resids1 = [] 38 | for seg in words[4].split(','): 39 | if '-' in seg: 40 | start = int(seg.split('-')[0]) 41 | end = int(seg.split('-')[1]) 42 | for res in range(start, end + 1): 43 | resids1.append(res) 44 | else: 45 | resids1.append(int(seg)) 46 | resids2 = [] 47 | for seg in words[5].split(','): 48 | if '-' in seg: 49 | start = int(seg.split('-')[0]) 50 | end = int(seg.split('-')[1]) 51 | for res in range(start, end + 1): 52 | resids2.append(res) 53 | else: 54 | resids2.append(int(seg)) 55 | 56 | if int(words[3]) > 0: 57 | try: 58 | 
prot2merges[prot].append(set([dom1, dom2])) 59 | except KeyError: 60 | prot2merges[prot] = [set([dom1, dom2])] 61 | get_prots.add(prot) 62 | try: 63 | domain2resids[prot] 64 | except KeyError: 65 | domain2resids[prot] = {} 66 | domain2resids[prot][dom1] = resids1 67 | domain2resids[prot][dom2] = resids2 68 | fp.close() 69 | 70 | 71 | rp = open('step22_' + dataset + '.result','w') 72 | for prot in get_prots: 73 | pairs = prot2merges[prot] 74 | groups = [] 75 | for pair in pairs: 76 | groups.append(pair) 77 | while 1: 78 | newgroups = [] 79 | for group in groups: 80 | if not newgroups: 81 | newgroups.append(group) 82 | else: 83 | for newgroup in newgroups: 84 | if group.intersection(newgroup): 85 | for item in group: 86 | newgroup.add(item) 87 | break 88 | else: 89 | newgroups.append(group) 90 | 91 | if len(groups) == len(newgroups): 92 | break 93 | groups = [] 94 | for newgroup in newgroups: 95 | groups.append(newgroup) 96 | 97 | for group in groups: 98 | group_domains = [] 99 | group_resids = set([]) 100 | for domain in group: 101 | group_domains.append(domain) 102 | group_resids = group_resids.union(domain2resids[prot][domain]) 103 | group_range = get_range(group_resids) 104 | rp.write(prot + '\t' + ','.join(group_domains) + '\t' + group_range + '\n') 105 | rp.close() 106 | -------------------------------------------------------------------------------- /docker/scripts/step2_get_AFDB_pdbs.py: -------------------------------------------------------------------------------- 1 | #!/usr1/local/bin/python 2 | import os, sys, string 3 | import pdbx 4 | from pdbx.reader.PdbxReader import PdbxReader 5 | 6 | three2one = {} 7 | three2one["ALA"] = "A" 8 | three2one["CYS"] = "C" 9 | three2one["ASP"] = "D" 10 | three2one["GLU"] = "E" 11 | three2one["PHE"] = "F" 12 | three2one["GLY"] = "G" 13 | three2one["HIS"] = "H" 14 | three2one["ILE"] = "I" 15 | three2one["LYS"] = "K" 16 | three2one["LEU"] = "L" 17 | three2one["MET"] = "M" 18 | three2one["MSE"] = "M" 19 | three2one["ASN"] = "N" 20 | three2one["PRO"] = "P" 21 | three2one["GLN"] = "Q" 22 | three2one["ARG"] = "R" 23 | three2one["SER"] = "S" 24 | three2one["THR"] = "T" 25 | three2one["VAL"] = "V" 26 | three2one["TRP"] = "W" 27 | three2one["TYR"] = "Y" 28 | 29 | dataset = sys.argv[1] 30 | prot = sys.argv[2] 31 | 32 | if os.path.exists(dataset + "/" + prot + ".cif") and os.path.exists("step1/" + dataset + "/" + prot + ".fa"): 33 | fp = open("step1/" + dataset + "/" + prot + ".fa", "r") 34 | myseq = "" 35 | for line in fp: 36 | if line[0] == ">": 37 | pass 38 | else: 39 | myseq += line[:-1] 40 | fp.close() 41 | 42 | cif = open(dataset + "/" + prot + ".cif", "r") 43 | pRd = PdbxReader(cif) 44 | data = [] 45 | pRd.read(data) 46 | block = data[0] 47 | 48 | atom_site = block.getObj("atom_site") 49 | record_type_index = atom_site.getIndex("group_PDB") 50 | atom_type_index = atom_site.getIndex("type_symbol") 51 | atom_identity_index = atom_site.getIndex("label_atom_id") 52 | residue_type_index = atom_site.getIndex("label_comp_id") 53 | chain_id_index = atom_site.getIndex("label_asym_id") 54 | residue_id_index = atom_site.getIndex("label_seq_id") 55 | coor_x_index = atom_site.getIndex("Cartn_x") 56 | coor_y_index = atom_site.getIndex("Cartn_y") 57 | coor_z_index = atom_site.getIndex("Cartn_z") 58 | alt_id_index = atom_site.getIndex("label_alt_id") 59 | model_num_index = atom_site.getIndex("pdbx_PDB_model_num") 60 | occupancy_index = atom_site.getIndex("occupancy") 61 | bfactor_index = atom_site.getIndex("B_iso_or_equiv") 62 | 63 | if model_num_index == -1: 64 | 
mylines = [] 65 | for i in range(atom_site.getRowCount()): 66 | words = atom_site.getRow(i) 67 | chain_id = words[chain_id_index] 68 | record_type = words[record_type_index] 69 | if chain_id == "A" and record_type == "ATOM": 70 | mylines.append(words) 71 | else: 72 | model2lines = {} 73 | models = [] 74 | for i in range(atom_site.getRowCount()): 75 | words = atom_site.getRow(i) 76 | chain_id = words[chain_id_index] 77 | record_type = words[record_type_index] 78 | model_num = int(words[model_num_index]) 79 | if chain_id == "A" and record_type == "ATOM": 80 | try: 81 | model2lines[model_num].append(words) 82 | except KeyError: 83 | model2lines[model_num] = [words] 84 | models.append(model_num) 85 | best_model = min(models) 86 | mylines = model2lines[best_model] 87 | 88 | goodlines = [] 89 | resid2altid = {} 90 | resid2aa = {} 91 | atom_count = 0 92 | for words in mylines: 93 | atom_type = words[atom_type_index] 94 | atom_identity = words[atom_identity_index] 95 | residue_type = words[residue_type_index] 96 | residue_id = int(words[residue_id_index]) 97 | alt_id = words[alt_id_index] 98 | 99 | if atom_identity == "CA": 100 | try: 101 | resid2aa[residue_id] = three2one[residue_type] 102 | except KeyError: 103 | resid2aa[residue_id] = "X" 104 | 105 | get_line = 0 106 | if alt_id == ".": 107 | get_line = 1 108 | else: 109 | try: 110 | if resid2altid[residue_id] == alt_id: 111 | get_line = 1 112 | else: 113 | get_line = 0 114 | except KeyError: 115 | resid2altid[residue_id] = alt_id 116 | get_line = 1 117 | 118 | if get_line: 119 | atom_count += 1 120 | coor_x_info = words[coor_x_index].split(".") 121 | if len(coor_x_info) >= 2: 122 | coor_x = coor_x_info[0] + "." + coor_x_info[1][:3] 123 | else: 124 | coor_x = coor_x_info[0] 125 | coor_y_info = words[coor_y_index].split(".") 126 | if len(coor_y_info) >= 2: 127 | coor_y = coor_y_info[0] + "." + coor_y_info[1][:3] 128 | else: 129 | coor_y = coor_y_info[0] 130 | coor_z_info = words[coor_z_index].split(".") 131 | if len(coor_z_info) >= 2: 132 | coor_z = coor_z_info[0] + "." + coor_z_info[1][:3] 133 | else: 134 | coor_z = coor_z_info[0] 135 | 136 | occupancy_info = words[occupancy_index].split(".") 137 | if len(occupancy_info) == 1: 138 | occupancy = occupancy_info[0] + ".00" 139 | else: 140 | if len(occupancy_info[1]) == 1: 141 | occupancy = occupancy_info[0] + "." + occupancy_info[1] + "0" 142 | else: 143 | occupancy = occupancy_info[0] + "." + occupancy_info[1][:2] 144 | bfactor_info = words[bfactor_index].split(".") 145 | if len(bfactor_info) == 1: 146 | bfactor = bfactor_info[0] + ".00" 147 | else: 148 | if len(bfactor_info[1]) == 1: 149 | bfactor = bfactor_info[0] + "." + bfactor_info[1] + "0" 150 | else: 151 | bfactor = bfactor_info[0] + "." 
+ bfactor_info[1][:2] 152 | 153 | if len(atom_identity) < 4: 154 | goodlines.append("ATOM " + str(atom_count).rjust(5) + " " + atom_identity.ljust(3) + " " + residue_type.ljust(3) + " A" + str(residue_id).rjust(4) + " " + coor_x.rjust(8) + coor_y.rjust(8) + coor_z.rjust(8) + occupancy.rjust(6) + bfactor.rjust(6) + " " + atom_type + "\n") 155 | elif len(atom_identity) == 4: 156 | goodlines.append("ATOM " + str(atom_count).rjust(5) + " " + atom_identity + " " + residue_type.ljust(3) + " A" + str(residue_id).rjust(4) + " " + coor_x.rjust(8) + coor_y.rjust(8) + coor_z.rjust(8) + occupancy.rjust(6) + bfactor.rjust(6) + " " + atom_type + "\n") 157 | 158 | newseq = "" 159 | for i in range(len(myseq)): 160 | resid = i + 1 161 | try: 162 | newseq += resid2aa[resid] 163 | if resid2aa[resid] == "X": 164 | pass 165 | elif resid2aa[resid] == myseq[i]: 166 | pass 167 | else: 168 | print ("error\t" + dataset + "\t" + prot) 169 | except KeyError: 170 | newseq += "-" 171 | if newseq == myseq: 172 | rp = open("step2/" + dataset + "/" + prot + ".pdb","w") 173 | for goodline in goodlines: 174 | rp.write(goodline) 175 | rp.close() 176 | else: 177 | sys.exit(1) 178 | print ("error\t" + dataset + "\t" + prot) 179 | elif os.path.exists(dataset + "/" + prot + ".pdb") and os.path.exists("step1/" + dataset + "/" + prot + ".fa"): 180 | with open(dataset + "/" + prot + ".pdb") as f: 181 | pdblines = f.readlines() 182 | pdblines = [i for i in pdblines if i[:4]=='ATOM'] 183 | with open("step2/" + dataset + "/" + prot + ".pdb",'w') as f: 184 | for i in pdblines: 185 | f.write(i) 186 | else: 187 | sys.exit(1) 188 | -------------------------------------------------------------------------------- /docker/scripts/step3_run_hhsearch.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, subprocess 3 | def run_cmd(cmd): 4 | status = subprocess.run(cmd,shell = True).returncode 5 | return status 6 | 7 | dataset = sys.argv[1] 8 | prot = sys.argv[2] 9 | cpu = sys.argv[3] 10 | 11 | if os.path.exists('step3/' + dataset + '/' + prot + '.hhsearch'): 12 | pass 13 | else: 14 | if os.path.exists('step3/' + dataset + '/' + prot + '.hmm'): 15 | status = run_cmd('hhsearch -cpu ' + cpu + ' -Z 100000 -B 100000 -i step3/' + dataset + '/' + prot + '.hmm -d /mnt/databases/pdb70/pdb70 -o step3/' + dataset + '/' + prot + '.hhsearch') 16 | if status != 0: 17 | sys.exit(1) 18 | elif os.path.exists('step3/' + dataset + '/' + prot + '.hhm'): 19 | os.system('mv step3/' + dataset + '/' + prot + '.hhm step3/' + dataset + '/' + prot + '.hmm') 20 | status = run_cmd('hhsearch -cpu ' + cpu + ' -Z 100000 -B 100000 -i step3/' + dataset + '/' + prot + '.hmm -d /mnt/databases/pdb70/pdb70 -o step3/' + dataset + '/' + prot + '.hhsearch') 21 | if status != 0: 22 | sys.exit(1) 23 | else: 24 | cmds= ['hhblits -cpu ' + cpu + ' -i step1/' + dataset + '/' + prot + '.fa -d /mnt/databases/UniRef30_2022_02/UniRef30_2022_02 -oa3m step3/' + dataset + '/' + prot + '.a3m','addss.pl step3/' + dataset + '/' + prot + '.a3m step3/' + dataset + '/' + prot + '.a3m.ss -a3m','mv step3/' + dataset + '/' + prot + '.a3m.ss step3/' + dataset + '/' + prot + '.a3m','hhmake -i step3/' + dataset + '/' + prot + '.a3m -o step3/' + dataset + '/' + prot + '.hmm','hhsearch -cpu ' + cpu + ' -Z 100000 -B 100000 -i step3/' + dataset + '/' + prot + '.hmm -d /mnt/databases/pdb70/pdb70 -o step3/' + dataset + '/' + prot + '.hhsearch'] 25 | for cmd in cmds: 26 | status = run_cmd(cmd) 27 | if status != 0: 28 | sys.exit(1) 29 | 
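step3_run_hhsearch.py resumes from whatever intermediate file already exists: a finished .hhsearch result is left untouched, an existing .hmm (or legacy .hhm) profile only needs the final hhsearch run, and otherwise the full hhblits → addss.pl → hhmake → hhsearch chain is executed. The chain is strictly sequential, and the script exits non-zero at the first failing stage so the calling driver can log the protein as failed and retry it later. Below is a condensed sketch of that chain; the commands and database paths are copied from the script above, while the dataset name and accession are placeholders.

    #!/opt/conda/bin/python
    # Condensed sketch of the profile-building chain in step3_run_hhsearch.py.
    # 'test' and 'O05011' are placeholder names; the databases are expected under
    # /mnt/databases as mounted in the Docker image.
    import subprocess, sys

    dataset, prot, cpu = 'test', 'O05011', '8'
    chain = [
        f'hhblits -cpu {cpu} -i step1/{dataset}/{prot}.fa '
        f'-d /mnt/databases/UniRef30_2022_02/UniRef30_2022_02 -oa3m step3/{dataset}/{prot}.a3m',
        f'addss.pl step3/{dataset}/{prot}.a3m step3/{dataset}/{prot}.a3m.ss -a3m',
        f'mv step3/{dataset}/{prot}.a3m.ss step3/{dataset}/{prot}.a3m',
        f'hhmake -i step3/{dataset}/{prot}.a3m -o step3/{dataset}/{prot}.hmm',
        f'hhsearch -cpu {cpu} -Z 100000 -B 100000 -i step3/{dataset}/{prot}.hmm '
        f'-d /mnt/databases/pdb70/pdb70 -o step3/{dataset}/{prot}.hhsearch',
    ]
    for cmd in chain:
        if subprocess.run(cmd, shell=True).returncode != 0:
            sys.exit(1)  # stop at the first failing stage; the driver records the failure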
-------------------------------------------------------------------------------- /docker/scripts/step4_run_foldseek.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys, random, string 3 | def generate_random_directory_name(): 4 | characters = string.ascii_lowercase + string.digits 5 | random_string = ''.join(random.choice(characters) for _ in range(8)) 6 | return random_string 7 | 8 | 9 | dataset = sys.argv[1] 10 | ncore = sys.argv[2] 11 | tmp_dir_name = generate_random_directory_name() 12 | 13 | if not os.path.exists('/tmp/' + dataset + '_' + tmp_dir_name): 14 | os.system('mkdir /tmp/' + dataset + '_' + tmp_dir_name) 15 | 16 | fp = open('step4/' + dataset + '_step4.list', 'r') 17 | prots = [] 18 | for line in fp: 19 | words = line.split() 20 | prots.append(words[0]) 21 | fp.close() 22 | 23 | for prot in prots: 24 | os.system('foldseek easy-search step2/' + dataset + '/' + prot + '.pdb /mnt/databases/ECOD_foldseek_DB/ECOD_foldseek_DB step4/' + dataset + '/' + prot + '.foldseek /tmp/' + dataset + '_' + tmp_dir_name + ' -e 1000 --max-seqs 1000000 --threads ' + ncore + ' >> /tmp/step4_' + dataset + '_' + tmp_dir_name + '.log') 25 | fp = open('step4/' + dataset + '/' + prot + '.foldseek', 'r') 26 | countline = 0 27 | for line in fp: 28 | countline += 1 29 | fp.close() 30 | if not countline: 31 | os.system('echo \'done\' > step4/' + dataset + '/' + prot + '.done') 32 | 33 | os.system('rm -rf /tmp/' + dataset + '_' + tmp_dir_name) 34 | os.system('mv /tmp/step4_' + dataset + '_' + tmp_dir_name + '.log ./') 35 | -------------------------------------------------------------------------------- /docker/scripts/step6_process_foldseek.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import sys 3 | 4 | dataset = sys.argv[1] 5 | prot = sys.argv[2] 6 | 7 | fp = open('step1/' + dataset + '/' + prot + '.fa','r') 8 | query_seq = '' 9 | for line in fp: 10 | if line[0] != '>': 11 | query_seq += line[:-1] 12 | fp.close() 13 | qlen = len(query_seq) 14 | 15 | fp = open('step4/' + dataset + '/' + prot + '.foldseek', 'r') 16 | hits = [] 17 | for line in fp: 18 | words = line.split() 19 | dnum = words[1].split('.')[0] 20 | qstart = int(words[6]) 21 | qend = int(words[7]) 22 | qresids = set([]) 23 | for qres in range(qstart, qend + 1): 24 | qresids.add(qres) 25 | evalue = float(words[10]) 26 | hits.append([dnum, evalue, qstart, qend, qresids]) 27 | fp.close() 28 | hits.sort(key = lambda x:x[1]) 29 | 30 | qres2count = {} 31 | for res in range(1, qlen + 1): 32 | qres2count[res] = 0 33 | 34 | rp = open('step6/' + dataset + '/' + prot + '.result', 'w') 35 | rp.write('ecodnum\tevalue\trange\n') 36 | for hit in hits: 37 | dnum = hit[0] 38 | evalue = hit[1] 39 | qstart = hit[2] 40 | qend = hit[3] 41 | qresids = hit[4] 42 | for res in qresids: 43 | qres2count[res] += 1 44 | good_res = 0 45 | for res in qresids: 46 | if qres2count[res] <= 100: 47 | good_res += 1 48 | if good_res >= 10: 49 | rp.write(dnum + '\t' + str(evalue) + '\t' + str(qstart) + '-' + str(qend) + '\n') 50 | rp.close() 51 | -------------------------------------------------------------------------------- /docker/scripts/step7_prepare_dali.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys 3 | 4 | dataset = sys.argv[1] 5 | prot = sys.argv[2] 6 | 7 | 8 | domains = set([]) 9 | if os.path.exists('step5/' + dataset + '/' + prot + 
'.result'): 10 | fp = open('step5/' + dataset + '/' + prot + '.result', 'r') 11 | for countl, line in enumerate(fp): 12 | if countl: 13 | words = line.split() 14 | domains.add(words[1]) 15 | fp.close() 16 | 17 | if os.path.exists('step6/' + dataset + '/' + prot + '.result'): 18 | fp = open('step6/' + dataset + '/' + prot + '.result','r') 19 | for countl, line in enumerate(fp): 20 | if countl: 21 | words = line.split() 22 | domains.add(words[0]) 23 | fp.close() 24 | 25 | if domains: 26 | rp = open('step7/' + dataset + '/' + prot + '_hits', 'w') 27 | for domain in domains: 28 | rp.write(domain + '\n') 29 | rp.close() 30 | else: 31 | os.system('echo \'done\' > step7/' + dataset + '/' + prot + '.done') 32 | -------------------------------------------------------------------------------- /docker/scripts/step9_analyze_dali.py: -------------------------------------------------------------------------------- 1 | #!/opt/conda/bin/python 2 | import os, sys 3 | import numpy as np 4 | 5 | 6 | def get_range(resids): 7 | resids.sort() 8 | segs = [] 9 | for resid in resids: 10 | if not segs: 11 | segs.append([resid]) 12 | else: 13 | if resid > segs[-1][-1] + 1: 14 | segs.append([resid]) 15 | else: 16 | segs[-1].append(resid) 17 | ranges = [] 18 | for seg in segs: 19 | ranges.append(f'{str(seg[0])}-{str(seg[-1])}') 20 | return ','.join(ranges) 21 | 22 | 23 | spname = sys.argv[1] 24 | prot = sys.argv[2] 25 | fp = open('/mnt/databases/ecod.latest.domains','r') 26 | ecod2id = {} 27 | ecod2fam = {} 28 | for line in fp: 29 | if line[0] != '#': 30 | words = line[:-1].split('\t') 31 | ecodnum = words[0] 32 | ecodid = words[1] 33 | ecodfam = '.'.join(words[3].split('.')[:2]) 34 | ecod2id[ecodnum] = ecodid 35 | ecod2fam[ecodnum] = ecodfam 36 | fp.close() 37 | 38 | if os.path.exists(f'step8/{spname}/{prot}_hits'): 39 | fp = open(f'step8/{spname}/{prot}_hits','r') 40 | ecodnum = '' 41 | ecodid = '' 42 | ecodfam = '' 43 | hitname = '' 44 | rot1 = '' 45 | rot2 = '' 46 | rot3 = '' 47 | trans = '' 48 | maps = [] 49 | hits = [] 50 | for line in fp: 51 | if line[0] == '>': 52 | if ecodnum and ecodid and ecodfam and hitname and zscore and maps: 53 | hits.append([hitname, ecodnum, ecodid, ecodfam, zscore, maps, rot1, rot2, rot3, trans]) 54 | words = line[1:].split() 55 | zscore = float(words[1]) 56 | hitname = words[0] 57 | ecodnum = hitname.split('_')[0] 58 | ecodid = ecod2id[ecodnum] 59 | ecodfam = ecod2fam[ecodnum] 60 | maps = [] 61 | rotcount = 0 62 | else: 63 | words = line.split() 64 | if words[0] == 'rotation': 65 | rotcount += 1 66 | if rotcount == 1: 67 | rot1 = ','.join(words[1:]) 68 | elif rotcount == 2: 69 | rot2 = ','.join(words[1:]) 70 | elif rotcount == 3: 71 | rot3 = ','.join(words[1:]) 72 | elif words[0] == 'translation': 73 | trans = ','.join(words[1:]) 74 | else: 75 | pres = int(words[0]) 76 | eres = int(words[1]) 77 | maps.append([pres, eres]) 78 | fp.close() 79 | if ecodnum and ecodid and ecodfam and hitname and zscore and maps: 80 | hits.append([hitname, ecodnum, ecodid, ecodfam, zscore, maps, rot1, rot2, rot3, trans]) 81 | 82 | 83 | newhits = [] 84 | for hit in hits: 85 | hitname = hit[0] 86 | ecodnum = hit[1] 87 | total_weight = 0 88 | posi2weight = {} 89 | zscores = [] 90 | qscores = [] 91 | if os.path.exists(f'/mnt/databases/posi_weights/{ecodnum}.weight'): 92 | fp = open(f'/mnt/databases/posi_weights/{ecodnum}.weight','r') 93 | posi2weight = {} 94 | for line in fp: 95 | words = line.split() 96 | total_weight += float(words[3]) 97 | posi2weight[int(words[0])] = float(words[3]) 98 | fp.close() 99 | 
if os.path.exists(f'/mnt/databases/ecod_internal/{ecodnum}.info'): 100 | fp = open(f'/mnt/databases/ecod_internal/{ecodnum}.info','r') 101 | for line in fp: 102 | words = line.split() 103 | zscores.append(float(words[1])) 104 | qscores.append(float(words[2])) 105 | fp.close() 106 | ecodid = hit[2] 107 | ecodfam = hit[3] 108 | zscore = hit[4] 109 | maps = hit[5] 110 | rot1 = hit[6] 111 | rot2 = hit[7] 112 | rot3 = hit[8] 113 | trans = hit[9] 114 | 115 | if zscores and qscores: 116 | qscore = 0 117 | for item in maps: 118 | try: 119 | qscore += posi2weight[item[1]] 120 | except KeyError: 121 | pass 122 | 123 | better = 0 124 | worse = 0 125 | for other_qscore in qscores: 126 | if other_qscore > qscore: 127 | better += 1 128 | else: 129 | worse += 1 130 | qtile = better / (better + worse) 131 | 132 | better = 0 133 | worse = 0 134 | for other_zscore in zscores: 135 | if other_zscore > zscore: 136 | better += 1 137 | else: 138 | worse += 1 139 | ztile = better / (better + worse) 140 | newhits.append([hitname, ecodnum, ecodid, ecodfam, zscore, qscore / total_weight, ztile, qtile, maps, rot1, rot2, rot3, trans]) 141 | else: 142 | newhits.append([hitname, ecodnum, ecodid, ecodfam, zscore, -1, -1, -1, maps, rot1, rot2, rot3, trans]) 143 | 144 | 145 | newhits.sort(key = lambda x:x[4], reverse = True) 146 | finalhits = [] 147 | posi2fams = {} 148 | for hit in newhits: 149 | ecodfam = hit[3] 150 | maps = hit[8] 151 | rot1 = hit[9] 152 | rot2 = hit[10] 153 | rot3 = hit[11] 154 | trans = hit[12] 155 | qposis = [] 156 | eposis = [] 157 | ranks = [] 158 | for item in maps: 159 | qposis.append(item[0]) 160 | eposis.append(item[1]) 161 | try: 162 | posi2fams[item[0]].add(ecodfam) 163 | except KeyError: 164 | posi2fams[item[0]] = set([ecodfam]) 165 | ranks.append(len(posi2fams[item[0]])) 166 | ave_rank = round(np.mean(ranks), 2) 167 | qrange = get_range(qposis) 168 | erange = get_range(eposis) 169 | finalhits.append([hit[0], hit[1], hit[2], hit[3], round(hit[4], 2), round(hit[5], 2), round(hit[6], 2), round(hit[7], 2), ave_rank, qrange, erange, rot1, rot2, rot3, trans]) 170 | 171 | rp = open(f'step9/{spname}/{prot}_good_hits', 'w') 172 | rp.write('hitname\tecodnum\tecodkey\thgroup\tzscore\tqscore\tztile\tqtile\trank\tqrange\terange\trotation1\trotation2\trotation3\ttranslation\n') 173 | for hit in finalhits: 174 | rp.write(f'{hit[0]}\t{hit[1]}\t{hit[2]}\t{hit[3]}\t{str(hit[4])}\t{str(hit[5])}\t{str(hit[6])}\t{str(hit[7])}\t{str(hit[8])}\t{hit[9]}\t{hit[10]}\t{hit[11]}\t{hit[12]}\t{hit[13]}\t{hit[14]}\n') 175 | rp.close() 176 | else: 177 | os.system(f'echo \'done\' > step9/{spname}/{prot}.done') 178 | -------------------------------------------------------------------------------- /docker/scripts/summarize_check.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | fp = open(sys.argv[1] + '_check','r') 4 | check1 = 0 5 | check2 = 0 6 | check3 = 0 7 | check4 = 0 8 | for line in fp: 9 | words = line.split() 10 | check1 += int(words[1]) 11 | check2 += int(words[2]) 12 | check3 += int(words[3]) 13 | check4 += int(words[4]) 14 | fp.close() 15 | 16 | print (sys.argv[1] + '\t' + str(check1) + '\t' + str(check2) + '\t' + str(check3) + '\t' + str(check4)) 17 | -------------------------------------------------------------------------------- /docker/utilities/DaliLite.v5.tar.gz: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/docker/utilities/DaliLite.v5.tar.gz -------------------------------------------------------------------------------- /docker/utilities/HHPaths.pm: -------------------------------------------------------------------------------- 1 | # HHPaths.pm 2 | 3 | # HHsuite version 3.0.0 (15-03-2015) 4 | # (C) J. Soeding, A. Hauser 2012 5 | 6 | # This program is free software: you can redistribute it and/or modify 7 | # it under the terms of the GNU General Public License as published by 8 | # the Free Software Foundation, either version 3 of the License, or 9 | # (at your option) any later version. 10 | 11 | # This program is distributed in the hope that it will be useful, 12 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 13 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 14 | # GNU General Public License for more details. 15 | 16 | # You should have received a copy of the GNU General Public License 17 | # along with this program. If not, see . 18 | 19 | # We are very grateful for bug reports! Please contact us at soeding@mpibpc.mpg.de 20 | 21 | # PLEASE INSERT CORRECT PATHS AT POSITIONS INDICATED BY ... BELOW 22 | # THE ENVIRONMENT VARIABLE HHLIB NEEDS TO BE SET TO YOUR LOCAL HH-SUITE DIRECTORY, 23 | # AS DESCRIBED IN THE HH-SUITE USER GUIDE AND README FILE 24 | 25 | package HHPaths; 26 | 27 | # This block can stay unmodified 28 | use vars qw(@ISA @EXPORT @EXPORT_OK %EXPORT_TAGS $VERSION); 29 | use Exporter; 30 | our $v; 31 | our $VERSION = "version 3.0.0 (15-03-2015)"; 32 | our @ISA = qw(Exporter); 33 | our @EXPORT = qw($VERSION $hhlib $hhdata $hhbin $hhscripts $execdir $datadir $ncbidir $dummydb $pdbdir $dsspdir $dssp $cs_lib $context_lib $v); 34 | push @EXPORT, qw($hhshare $hhbdata); 35 | 36 | ############################################################################################## 37 | # PLEASE COMPLETE THE PATHS ... TO PSIPRED AND OLD-STYLE BLAST (NOT BLAST+) (NEEDED FOR PSIPRED) 38 | #our $execdir = ".../psipred/bin"; # path to PSIPRED V2 binaries 39 | #our $datadir = ".../psipred/data"; # path to PSIPRED V2 data files 40 | #our $ncbidir = ".../blast/bin"; # path to NCBI binaries (for PSIPRED in addss.pl) 41 | our $execdir = "/opt/conda/bin"; # path to PSIPRED V2 binaries 42 | our $datadir = "/opt/conda/pkgs/psipred-4.01-1/share/psipred_4.01/data"; # path to PSIPRED V2 data files 43 | our $ncbidir = "/opt/conda/bin"; # path to NCBI binaries (for PSIPRED in addss.pl) 44 | 45 | ############################################################################################## 46 | # PLEASE COMPLETE THE PATHS ... TO YOUR LOCAL PDB FILES, DSSP FILES ETC. 47 | #our $pdbdir = ".../pdb/all"; # where are the pdb files? (pdb/divided directory will also work) 48 | #our $dsspdir = ".../dssp/data"; # where are the dssp files? Used in addss.pl. 49 | #our $dssp = ".../dssp/bin/dsspcmbi"; # where is the dssp binary? Used in addss.pl. 50 | our $pdbdir = "/cluster/databases/pdb/all"; # where are the pdb files? (pdb/divided directory will also work) 51 | our $dsspdir = "/cluster/databases/dssp/data"; # where are the dssp files? Used in addss.pl 52 | our $dssp = "/usr/bin/mkdssp"; # where is the dssp binary? 
Used in addss.pl 53 | ############################################################################################## 54 | 55 | # The lines below probably do not need to be changed 56 | 57 | # Setting paths for hh-suite perl scripts 58 | #our $hhlib = $ENV{"HHLIB"} || "/usr/lib/hhsuite"; # main hh-suite directory 59 | #our $hhshare = $ENV{"HHLIB"} || "/usr/share/hhsuite"; # main hh-suite directory 60 | our $hhlib = "/opt/hhsuite"; 61 | our $hhshare = "/opt/hhsuite"; 62 | our $hhdata = $hhshare."/data"; # path to arch indep data directory for hhblits, example files 63 | our $hhbdata = $hhlib."/data"; # path to arch dep data directory for hhblits, example files 64 | our $hhbin = $hhlib."/bin"; # path to cstranslate (path to hhsearch, hhblits etc. should be in environment variable PATH) 65 | our $hhscripts= $hhshare."/scripts"; # path to hh perl scripts (addss.pl, reformat.pl, hhblitsdb.pl etc.) 66 | our $dummydb = $hhbdata."/do_not_delete"; # Name of dummy blast db for PSIPRED (single sequence formatted with NCBI formatdb) 67 | 68 | # HHblits data files 69 | our $cs_lib = "$hhdata/cs219.lib"; 70 | our $context_lib = "$hhdata/context_data.lib"; 71 | 72 | # Add hh-suite scripts directory to search path 73 | $ENV{"PATH"} = $hhscripts.":".$ENV{"PATH"}; # Add hh scripts directory to environment variable PATH 74 | 75 | ################################################################################################ 76 | ### System command with return value parsed from output 77 | ################################################################################################ 78 | sub System() 79 | { 80 | if ($v>=2) {printf(STDERR "\$ %s\n",$_[0]);} 81 | system($_[0]); 82 | if ($? == -1) { 83 | die("\nError: failed to execute '$_[0]': $!\n\n"); 84 | } elsif ($? != 0) { 85 | printf(STDERR "\nError: command '$_[0]' returned error code %d\n\n", $? 
>> 8); 86 | return 1; 87 | } 88 | return $?; 89 | } 90 | 91 | return 1; 92 | -------------------------------------------------------------------------------- /docker/utilities/foldseek: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/docker/utilities/foldseek -------------------------------------------------------------------------------- /docker/utilities/pdb2fasta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/docker/utilities/pdb2fasta -------------------------------------------------------------------------------- /example/test_struc.list: -------------------------------------------------------------------------------- 1 | O05011 2 | O05012 3 | O05023 4 | -------------------------------------------------------------------------------- /run_dpam_docker.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import docker 3 | import os,sys 4 | 5 | def check_image_exists(image_name): 6 | client = docker.from_env() 7 | try: 8 | client.images.get(image_name) 9 | return True 10 | except docker.errors.ImageNotFound: 11 | return False 12 | 13 | 14 | def check_databases(databases_dir): 15 | path = os.getcwd() 16 | flag = 1 17 | if not os.path.exists(databases_dir): 18 | print (databases_dir, 'does not exist') 19 | flag = 0 20 | else: 21 | missing = [] 22 | with open(f'{databases_dir}/all_files') as f: 23 | all_files = f.readlines() 24 | all_files = [i.strip() for i in all_files] 25 | for fn in all_files: 26 | if not os.path.exists(f'{databases_dir}/{fn}'): 27 | missing.append(fn) 28 | if missing: 29 | flag = 0 30 | with open('dpam_databases_missing_files','w') as f: 31 | f.write('\n'.join(missing)+'\n') 32 | print(f"Files missing for databases. Please check {path}/dpam_databases_missing_files for details") 33 | else: 34 | if os.path.exists('dpam_databases_missing_files'): 35 | os.system('rm dpam_databases_missing_files') 36 | return flag 37 | 38 | def check_inputs(input_dir,dataset): 39 | flag = 1 40 | if not os.path.exists(input_dir): 41 | flag = 0 42 | print('Error!', input_dir, 'does not exist.') 43 | else: 44 | if os.path.exists(f'{input_dir}/{dataset}') and os.path.exists(f'{input_dir}/{dataset}_struc.list'): 45 | with open(f'{input_dir}/{dataset}_struc.list') as f: 46 | alist = f.readlines() 47 | alist = [i.strip() for i in alist] 48 | missing = [] 49 | for name in alist: 50 | if not os.path.exists(f'{input_dir}/{dataset}/{name}.cif') and not os.path.exists(f'{input_dir}/{dataset}/{name}.pdb'): 51 | missing.append([dataset, name, ':PDB/CIF missing']) 52 | if not os.path.exists(f'{input_dir}/{dataset}/{name}.json'): 53 | missing.append([dataset, name, ':PAE json missing']) 54 | if missing: 55 | flag = 0 56 | with open(f'{input_dir}/dpam_{dataset}_inputs_missing_files','w') as f: 57 | for i in missing: 58 | f.write(' '.join(i)+'\n') 59 | print(f'Error! 
Please check {input_dir}/dpam_{dataset}_inputs_missing_files for details') 60 | if not os.path.exists(f'{input_dir}/{dataset}'): 61 | flag = 0 62 | print('Error!', dataset, 'containing PDB/CIF and PAE does not exist.') 63 | if not os.path.exists(f'{input_dir}/{dataset}_struc.list'): 64 | flag = 0 65 | print('Error!', f'{input_dir}/{dataset}_struc.list for targets does not exist.') 66 | return flag 67 | 68 | 69 | 70 | def run_docker_container(image_name, databases_dir, input_dir, dataset, threads, log_file): 71 | client = docker.from_env() 72 | wdir = f'/home/'+input_dir.split('/')[-1] 73 | 74 | # Mount the directories to the container 75 | volumes = { 76 | databases_dir: {'bind': '/mnt/databases', 'mode': 'ro'}, 77 | input_dir: {'bind': wdir, 'mode': 'rw'} 78 | } 79 | 80 | container = client.containers.run(image_name, detach=True, volumes=volumes, working_dir=wdir, command='tail -f /dev/null') 81 | 82 | # Example of running a script inside the container 83 | # Modify as needed for your specific script execution 84 | try: 85 | exec_log = container.exec_run(f"/bin/bash -c 'run_dpam.py {dataset} {threads}'", stdout=False, stderr=True) 86 | final_status = f'DPAM run for {dataset} under {input_dir} done\n' 87 | except: 88 | final_status = f'DPAM run for {dataset} under {input_dir} failed\n' 89 | 90 | with open(log_file, 'w') as file: 91 | file.write(exec_log.output.decode()) 92 | file.write(final_status) 93 | # Stop the container after the script execution 94 | container.stop() 95 | 96 | # Optionally, remove the container if not needed anymore 97 | container.remove() 98 | 99 | if __name__ == "__main__": 100 | parser = argparse.ArgumentParser(description="Run a DPAM docker container.") 101 | parser.add_argument("--databases_dir", help="Path to the databases directory to mount (required)", required=True) 102 | parser.add_argument("--input_dir", help="Path to the input directory to mount (required)", required=True) 103 | parser.add_argument("--dataset", help="Name of dataset (required)", required=True) 104 | parser.add_argument("--image_name", help="Image name", default="conglab/dpam") 105 | parser.add_argument("--threads", type=int, default=os.cpu_count(), help="Number of threads. Default is to use all CPUs") 106 | parser.add_argument("--log_file", help="File to save the logs") 107 | 108 | args = parser.parse_args() 109 | 110 | image_flag = check_image_exists(args.image_name) 111 | if not image_flag: 112 | print(args.image_name, 'does not exist!') 113 | sys.exit(1) 114 | 115 | db_flag = check_databases(args.databases_dir) 116 | if db_flag == 0: 117 | print("Databases are not complete") 118 | sys.exit(1) 119 | 120 | input_flag = check_inputs(args.input_dir,args.dataset) 121 | if input_flag == 0: 122 | print('Error(s)! 
Inputs missing') 123 | sys.exit(1) 124 | 125 | if '/' != args.input_dir[0]: 126 | path = os.path.join(os.getcwd(), args.input_dir) 127 | input_dir = os.path.abspath(path) 128 | else: 129 | input_dir = os.path.abspath(args.input_dir) 130 | 131 | if '/' != args.databases_dir[0]: 132 | path = os.path.join(os.getcwd(), args.databases_dir) 133 | databases_dir = os.path.abspath(path) 134 | else: 135 | databases_dir = os.path.abspath(args.databases_dir) 136 | 137 | if args.log_file is None: 138 | log_file = input_dir + '/' + args.dataset + '_docker.log' 139 | else: 140 | log_file = args.log_file 141 | 142 | run_docker_container(args.image_name,databases_dir, input_dir, args.dataset, args.threads,log_file) 143 | -------------------------------------------------------------------------------- /run_dpam_singularity.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import argparse 3 | import os,sys,subprocess 4 | 5 | def check_singularity_image_existence(image_path): 6 | result = subprocess.run(['singularity', 'inspect', image_path], stdout=subprocess.PIPE, stderr=subprocess.PIPE) 7 | if result.returncode != 0: 8 | return 0 9 | else: 10 | return 1 11 | 12 | 13 | def check_databases(databases_dir): 14 | path = os.getcwd() 15 | flag = 1 16 | if not os.path.exists(databases_dir): 17 | print (databases_dir, 'does not exist') 18 | flag = 0 19 | else: 20 | missing = [] 21 | with open(f'{databases_dir}/all_files') as f: 22 | all_files = f.readlines() 23 | all_files = [i.strip() for i in all_files] 24 | for fn in all_files: 25 | if not os.path.exists(f'{databases_dir}/{fn}'): 26 | missing.append(fn) 27 | if missing: 28 | flag = 0 29 | with open('dpam_databases_missing_files','w') as f: 30 | f.write('\n'.join(missing)+'\n') 31 | print(f"Files missing for databases. Please check {path}/dpam_databases_missing_files for details") 32 | else: 33 | if os.path.exists('dpam_databases_missing_files'): 34 | os.system('rm dpam_databases_missing_files') 35 | return flag 36 | 37 | def check_inputs(input_dir,dataset): 38 | flag = 1 39 | if not os.path.exists(input_dir): 40 | flag = 0 41 | print('Error!', input_dir, 'does not exist.') 42 | else: 43 | if os.path.exists(f'{input_dir}/{dataset}') and os.path.exists(f'{input_dir}/{dataset}_struc.list'): 44 | with open(f'{input_dir}/{dataset}_struc.list') as f: 45 | alist = f.readlines() 46 | alist = [i.strip() for i in alist] 47 | missing = [] 48 | for name in alist: 49 | if not os.path.exists(f'{input_dir}/{dataset}/{name}.cif') and not os.path.exists(f'{input_dir}/{dataset}/{name}.pdb'): 50 | missing.append([dataset, name, ':PDB/CIF missing']) 51 | if not os.path.exists(f'{input_dir}/{dataset}/{name}.json'): 52 | missing.append([dataset, name, ':PAE json missing']) 53 | if missing: 54 | flag = 0 55 | with open(f'{input_dir}/dpam_{dataset}_inputs_missing_files','w') as f: 56 | for i in missing: 57 | f.write(' '.join(i)+'\n') 58 | print(f'Error! 
Please check {input_dir}/dpam_{dataset}_inputs_missing_files for details') 59 | if not os.path.exists(f'{input_dir}/{dataset}'): 60 | flag = 0 61 | print('Error!', dataset, 'containing PDB/CIF and PAE does not exist.') 62 | if not os.path.exists(f'{input_dir}/{dataset}_struc.list'): 63 | flag = 0 64 | print('Error!', f'{input_dir}/{dataset}_struc.list for targets does not exist.') 65 | return flag 66 | 67 | 68 | 69 | 70 | def run_singularity_container(image_name, databases_dir, input_dir, dataset, threads, log_file): 71 | wdir = f'/home/{input_dir.split("/")[-1]}' 72 | 73 | # Building the Singularity exec command with bind mounts 74 | exec_command = ( 75 | f"singularity exec --fakeroot --bind {databases_dir}:/mnt/databases:ro " 76 | f"--bind {input_dir}:{wdir}:rw {image_name} " 77 | f"/bin/bash -c 'cd {wdir};run_dpam.py {dataset} {threads}'" 78 | ) 79 | 80 | # Running the container 81 | try: 82 | print(exec_command) 83 | result = subprocess.run(exec_command, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True) 84 | exec_output = result.stdout 85 | exec_error = result.stderr 86 | final_status = f'DPAM run for {dataset} under {input_dir} done\n' 87 | except subprocess.CalledProcessError as e: 88 | final_status = f'DPAM run for {dataset} under {input_dir} failed\n' 89 | exec_output = e.stdout 90 | exec_error = e.stderr 91 | 92 | # Writing to log file 93 | with open(log_file, 'w') as file: 94 | file.write(exec_output + exec_error) 95 | file.write(final_status) 96 | 97 | 98 | if __name__ == "__main__": 99 | parser = argparse.ArgumentParser(description="Run a DPAM docker container.") 100 | parser.add_argument("--databases_dir", help="Path to the databases directory to mount (required)", required=True) 101 | parser.add_argument("--input_dir", help="Path to the input directory to mount (required)", required=True) 102 | parser.add_argument("--dataset", help="Name of dataset (required)", required=True) 103 | parser.add_argument("--image_name", help="Image name") 104 | parser.add_argument("--threads", type=int, default=os.cpu_count(), help="Number of threads. Default is to use all CPUs") 105 | parser.add_argument("--log_file", help="File to save the logs. Default is _docker.log under .") 106 | 107 | args = parser.parse_args() 108 | 109 | image_flag = check_singularity_image_existence(args.image_name) 110 | if not image_flag: 111 | print(args.image_name, 'or Singularity does not exist!') 112 | sys.exit(1) 113 | 114 | db_flag = check_databases(args.databases_dir) 115 | if db_flag == 0: 116 | print("Databases are not complete") 117 | sys.exit(1) 118 | 119 | input_flag = check_inputs(args.input_dir,args.dataset) 120 | if input_flag == 0: 121 | print('Error(s)! 
Inputs missing') 122 | sys.exit(1) 123 | 124 | if '/' != args.input_dir[0]: 125 | path = os.path.join(os.getcwd(), args.input_dir) 126 | input_dir = os.path.abspath(path) 127 | else: 128 | input_dir = os.path.abspath(args.input_dir) 129 | 130 | if '/' != args.databases_dir[0]: 131 | path = os.path.join(os.getcwd(), args.databases_dir) 132 | databases_dir = os.path.abspath(path) 133 | else: 134 | databases_dir = os.path.abspath(args.databases_dir) 135 | 136 | if args.log_file is None: 137 | log_file = input_dir + '/' + args.dataset + '_docker.log' 138 | else: 139 | log_file = args.log_file 140 | 141 | run_singularity_container(args.image_name,databases_dir, input_dir, args.dataset, args.threads,log_file) 142 | 143 | -------------------------------------------------------------------------------- /v1.0/A0A0K2WPR7.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/A0A0K2WPR7.zip -------------------------------------------------------------------------------- /v1.0/DPAM.py: -------------------------------------------------------------------------------- 1 | import sys,os,time 2 | from datetime import datetime 3 | import subprocess 4 | script_dir=os.path.dirname(os.path.realpath(__file__)) 5 | 6 | def print_usage (): 7 | print("usage: DPAM.py ") 8 | 9 | def check_progress(basedir, basename): 10 | full_progress=range(1,13) 11 | if os.path.exists(basedir): 12 | if os.path.exists(basedir + '/' + basename + '_progress_logs'): 13 | with open(basedir + '/' + basename + '_progress_logs') as f: 14 | logs=f.readlines() 15 | logs=[i.strip() for i in logs if i.strip()!=''] 16 | if logs: 17 | logs=[int(i.split()[0]) for i in logs] 18 | full_progress=set(full_progress)-set(logs) 19 | full_progress=sorted(full_progress) 20 | if full_progress: 21 | progress=full_progress[0] 22 | else: 23 | progress=[] 24 | return progress 25 | 26 | if len(sys.argv) != 7: 27 | print_usage() 28 | else: 29 | logs=[] 30 | input_struc = sys.argv[1] 31 | input_pae = sys.argv[2] 32 | basename = sys.argv[3] 33 | basedir = sys.argv[4] 34 | threads = sys.argv[5] 35 | datadir = sys.argv[6] 36 | if basedir[0] != '/': 37 | basedir = os.getcwd() + '/' + basedir 38 | if not os.path.exists(basedir): 39 | os.system('mkdir ' + basedir) 40 | if '.cif' == input_struc[-4:]: 41 | os.system('cp ' + input_struc + ' ' + basedir + '/' + basename + '.cif') 42 | elif '.pdb' == input_struc[-4:]: 43 | os.system('cp ' + input_struc + ' ' + basedir + '/' + basename + '.pdb') 44 | else: 45 | print("Cannot recognize the structure file.Please use either mmcif or PDB as input. Exiting...") 46 | sys.exit() 47 | os.system('cp ' + input_pae + ' ' + basedir + '/' + basename + '.json') 48 | print('start input processing', datetime.now()) 49 | status = subprocess.call(f'python {script_dir}/step1_get_AFDB_seqs.py {basename} {basedir}',shell=True) 50 | if status != 0: 51 | print('Cannot get protein sequence. 
Exiting...') 52 | sys.exit() 53 | status = subprocess.call(f'python {script_dir}/step1_get_AFDB_pdbs.py {basename} {basedir}',shell=True) 54 | if status != 0: 55 | print('Cannot process structure file.Exiting...') 56 | sys.exit() 57 | logs.append('0') 58 | progress=check_progress(basedir, basename) 59 | if progress!=[]: 60 | cmds=[] 61 | cmds.append(f'python {script_dir}/step2_run_hhsearch.py {basename} {threads} {basedir} {datadir}') 62 | cmds.append(f'python {script_dir}/step3_run_foldseek.py {basename} {threads} {basedir} {datadir}') 63 | cmds.append(f'python {script_dir}/step4_filter_foldseek.py {basename} {basedir}') 64 | cmds.append(f'python {script_dir}/step5_map_to_ecod.py {basename} {basedir} {datadir}') 65 | cmds.append(f'python {script_dir}/step6_get_dali_candidates.py {basename} {basedir}') 66 | cmds.append(f'python {script_dir}/step7_iterative_dali_aug_multi.py {basename} {threads} {basedir} {datadir}') 67 | cmds.append(f'python {script_dir}/step8_analyze_dali.py {basename} {basedir} {datadir}') 68 | cmds.append(f'python {script_dir}/step9_get_support.py {basename} {basedir} {datadir}') 69 | cmds.append(f'python {script_dir}/step10_get_good_domains.py {basename} {basedir} {datadir}') 70 | cmds.append(f'python {script_dir}/step11_get_sse.py {basename} {basedir}') 71 | cmds.append(f'python {script_dir}/step12_get_diso.py {basename} {basedir}') 72 | cmds.append(f'python {script_dir}/step13_parse_domains.py {basename} {basedir}') 73 | step=1 74 | for cmd in cmds[progress-1:]: 75 | print(f'start {cmd}', datetime.now()) 76 | status = subprocess.call(cmd,shell=True) 77 | if status != 0: 78 | print(f"Error in {cmd}.Exiting...") 79 | with open(f'{basedir}/{basename}_progress_logs','w') as f: 80 | for i in logs: 81 | f.write(i+'\n') 82 | sys.exit() 83 | else: 84 | logs.append(str(step)) 85 | step = step + 1 86 | print(f'end {cmd}', datetime.now()) 87 | print(f'Domain Parsing for {basename} done') 88 | with open(f'{basedir}/{basename}_progress_logs','w') as f: 89 | for i in logs: 90 | f.write(i+'\n') 91 | else: 92 | print(f'Previous domain parsing result for {basename} is complete') 93 | -------------------------------------------------------------------------------- /v1.0/README.md: -------------------------------------------------------------------------------- 1 | # DPAM 2 | A **D**omain **P**arser for **A**lphafold **M**odels 3 | 4 | DPAM: A Domain Parser for AlphaFold Models (https://www.biorxiv.org/content/10.1101/2022.09.22.509116v1, accepted by Protein Science) 5 | 6 | ## Updates: 7 | A docker image can be dowloaded by **docker pull conglab/dpam:latest** (this is an enhanced version of current DPAM, we will soon update the repository too) 8 | 9 | Upload domain parser results for six model organisms. (2022-12-6) 10 | 11 | Replace Dali with Foldseek for initial hits searching. (2022-11-30) 12 | 13 | Fix a bug in analyze_PDB.py which prevents the proper usage of Dali results. (2022-10-31) 14 | ## Prerequisites: 15 | 16 | ### Software and packages 17 | - HH-suite3: https://github.com/soedinglab/hh-suite (enable addss.pl to add secondary structure) 18 | - DaliLite.v5: http://ekhidna2.biocenter.helsinki.fi/dali/ 19 | - Python 3.8 20 | - Foldseek 21 | - mkdssp 22 | - pdbx: https://github.com/soedinglab/pdbx 23 | - pdb2fasta (https://zhanggroup.org/pdb2fasta) 24 | 25 | Please add above software to environment path for DPAM. We also provide a script `check_dependencies.py` to check if above programs can be found. 
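For a quick sanity check, the script can be invoked directly (an illustrative command; run it from the v1.0 directory or point to its full path). It prints `HH-suite, Foldseek and dali.pl are found` when everything is on the PATH, and otherwise lists the missing programs:

    python check_dependencies.py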
26 | ### Supporting database: 27 | - hhsearch UniRef database (https://wwwuser.gwdg.de/~compbiol/uniclust/2022_02/) 28 | - pdb70 (https://conglab.swmed.edu/DPAM/pdb70.tgz) 29 | - ECOD database 30 | - ECOD ID map to pdb 31 | - ECOD domain length 32 | - ECOD domain list 33 | - ECOD norms 34 | - ECOD domain quality information 35 | - ECOD residue weight in domains 36 | - ECOD70 domain structures 37 | - ECOD70 foldseek database 38 | 39 | We provide a script download_all_data.sh that can be used to download all of these databases. 40 | 41 | `bash download_all_data.sh <data_dir>` 42 | 43 | After downloading the databases, please decompress the files. All supporting database files should be put in the same directory, and that directory should be provided to `DPAM.py` as `<data_dir>`. The `<data_dir>` should have the following structure and files. 44 | 45 | <data_dir>/ 46 | ECOD70/ 47 | ecod_domain_info/ 48 | ECOD_foldseek_DB/ 49 | ecod_weights/ 50 | pdb70/ 51 | UniRef30_2022_02/ 52 | ecod.latest.domains 53 | ECOD_length 54 | ECOD_norms 55 | ECOD_pdbmap 56 | 57 | 58 | ## Installation 59 | git clone https://github.com/CongLabCode/DPAM.git 60 | 61 | conda install -c qianlabcode dpam 62 | 63 | ## Usage 64 | `python DPAM.py <input_structure> <input_PAE> <basename> <basedir> <threads> <data_dir>` 65 | 66 | ## Future improvements 67 | - Incorporate MMseqs2 to improve search speed 68 | - Provide a public server and integrate with the ECOD database 69 | -------------------------------------------------------------------------------- /v1.0/check_dependencies.py: -------------------------------------------------------------------------------- 1 | import shutil 2 | hhsuite=['hhblits','hhsearch','hhmake','addss.pl'] 3 | programs=['hhblits','hhsearch','hhmake','addss.pl','foldseek','dali.pl','mkdssp'] 4 | missing=[] 5 | pdbx=0 6 | for prog in programs: 7 | check = shutil.which(prog) 8 | if check == None: 9 | missing.append(prog) 10 | try: 11 | import pdbx 12 | from pdbx.reader.PdbxReader import PdbxReader 13 | except: 14 | pdbx = 1 15 | 16 | 17 | if missing or pdbx == 1: 18 | if missing: 19 | text = "Please add" 20 | hhsuite_missing = [i for i in missing if i in hhsuite] 21 | if hhsuite_missing: 22 | if len(hhsuite_missing) >= 2: 23 | hhsuite_missing = ','.join(hhsuite_missing[:-1]) +' and '+hhsuite_missing[-1] + " in HH-suite3" 24 | else: 25 | hhsuite_missing = hhsuite_missing[0] + " in HH-suite3" 26 | text = text + " " + hhsuite_missing 27 | others = [i for i in missing if i not in hhsuite_missing] 28 | if others: 29 | if len(others) >= 2: 30 | others = ','.join(others[:-1]) + ' and ' + others[-1] 31 | else: 32 | others = others[0] 33 | text = text + " and " + others 34 | text = text + " to environment path" 35 | print(text) 36 | if pdbx == 1: 37 | print('pdbx is not installed properly. Please refer to https://github.com/soedinglab/pdbx for installation') 38 | else: 39 | print('HH-suite, Foldseek and dali.pl are found') 40 | -------------------------------------------------------------------------------- /v1.0/download_all_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Usage: bash download_all_data.sh /path/to/download/directory 3 | set -e 4 | 5 | if [[ $# -eq 0 ]]; then 6 | echo "Error: download directory must be provided as an input argument." 7 | exit 1 8 | fi 9 | 10 | if ! command -v aria2c &> /dev/null ; then 11 | echo "Error: aria2c could not be found. Please install aria2c." 12 | exit 1 13 | fi 14 | 15 | 16 | DOWNLOAD_DIR="$1" 17 | ### Download ECOD70 18 | echo "Downloading ECOD70..."
19 | SOURCE_URL="https://conglab.swmed.edu/DPAM/ECOD70.tgz" 20 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}" 21 | 22 | ### Download pdb70 23 | echo "Downloading pdb70..." 24 | SOURCE_URL="https://conglab.swmed.edu/DPAM/pdb70.tgz" 25 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}" 26 | 27 | ### Download UniRef30 28 | echo "Downloading UniRef30..." 29 | SOURCE_URL="https://wwwuser.gwdg.de/~compbiol/uniclust/2022_02/UniRef30_2022_02_hhsuite.tar.gz" 30 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}" 31 | 32 | ### Download ECOD70 foldseek database 33 | echo "Downloading ECOD70 foldseek database" 34 | SOURCE_URL="https://conglab.swmed.edu/DPAM/ECOD_foldseek_DB.tgz" 35 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}" 36 | 37 | ### Download ECOD position weights 38 | echo "Downloading ECOD position weights" 39 | SOURCE_URL="https://conglab.swmed.edu/DPAM/ecod_weights.tgz" 40 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}" 41 | 42 | ### Download ECOD domain information 43 | echo "Downloading ECOD domain information" 44 | SOURCE_URL="https://conglab.swmed.edu/DPAM/ecod_domain_info.tgz" 45 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}" 46 | 47 | 48 | ### Download ECOD domain list, length, relationship to pdb and normalization 49 | echo "Downloading other ECOD related data" 50 | files=(ECOD_norms ecod.latest.domains ECOD_length ECOD_pdbmap) 51 | for str in ${files[@]} 52 | do 53 | SOURCE_URL="https://conglab.swmed.edu/DPAM/${str}" 54 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}" 55 | done 56 | 57 | ### Download benchmark data 58 | echo "Download benchmark data" 59 | SOURCE_URL="https://conglab.swmed.edu/DPAM/ECOD_benchmark.tgz" 60 | wget --no-check-certificate "${SOURCE_URL}" -P "${DOWNLOAD_DIR}" 61 | 62 | echo "Download done" 63 | -------------------------------------------------------------------------------- /v1.0/mkdssp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/mkdssp -------------------------------------------------------------------------------- /v1.0/model_organisms/Caenorhabditis_elegans.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Caenorhabditis_elegans.tgz -------------------------------------------------------------------------------- /v1.0/model_organisms/Danio_rerio.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Danio_rerio.tgz -------------------------------------------------------------------------------- /v1.0/model_organisms/Drosophila_melanogaster.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Drosophila_melanogaster.tgz -------------------------------------------------------------------------------- /v1.0/model_organisms/Homo_Sapiens.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Homo_Sapiens.tgz 
-------------------------------------------------------------------------------- /v1.0/model_organisms/Mus_musculus.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Mus_musculus.tgz -------------------------------------------------------------------------------- /v1.0/model_organisms/Pan_paniscus.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/model_organisms/Pan_paniscus.tgz -------------------------------------------------------------------------------- /v1.0/pdb2fasta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongLabCode/DPAM/6b20ab490a86ff8a9d5f733381a46f9d4fceb64d/v1.0/pdb2fasta -------------------------------------------------------------------------------- /v1.0/step10_get_good_domains.py: -------------------------------------------------------------------------------- 1 | import os, sys 2 | 3 | prefix = sys.argv[1] 4 | wd = sys.argv[2] 5 | data_dir = sys.argv[3] 6 | if os.getcwd() != wd: 7 | os.chdir(wd) 8 | 9 | fp = open(f'{data_dir}/ECOD_norms', 'r') 10 | ecod2norm = {} 11 | for line in fp: 12 | words = line.split() 13 | ecod2norm[words[0]] = float(words[1]) 14 | fp.close() 15 | 16 | results = [] 17 | fp = open(f'{prefix}_sequence.result', 'r') 18 | for line in fp: 19 | words = line.split() 20 | filt_segs = [] 21 | for seg in words[6].split(','): 22 | start = int(seg.split('-')[0]) 23 | end = int(seg.split('-')[1]) 24 | for res in range(start, end + 1): 25 | if not filt_segs: 26 | filt_segs.append([res]) 27 | else: 28 | if res > filt_segs[-1][-1] + 10: 29 | filt_segs.append([res]) 30 | else: 31 | filt_segs[-1].append(res) 32 | 33 | filt_seg_strings = [] 34 | total_good_count = 0 35 | for seg in filt_segs: 36 | start = seg[0] 37 | end = seg[-1] 38 | good_count = 0 39 | for res in range(start, end + 1): 40 | good_count += 1 41 | if good_count >= 5: 42 | total_good_count += good_count 43 | filt_seg_strings.append(f'{start}-{end}') 44 | if total_good_count >= 25: 45 | results.append('sequence\t' + prefix + '\t' + '\t'.join(words[:7]) + '\t' + ','.join(filt_seg_strings) + '\n') 46 | fp.close() 47 | 48 | if os.path.exists(f'{prefix}_structure.result'): 49 | fp = open(f'{prefix}_structure.result', 'r') 50 | for line in fp: 51 | words = line.split() 52 | ecodnum = words[0].split('_')[0] 53 | edomain = words[1] 54 | zscore = float(words[3]) 55 | try: 56 | znorm = round(zscore / ecod2norm[ecodnum], 2) 57 | except KeyError: 58 | znorm = 0.0 59 | qscore = float(words[4]) 60 | ztile = float(words[5]) 61 | qtile = float(words[6]) 62 | rank = float(words[7]) 63 | bestprob = float(words[8]) 64 | bestcov = float(words[9]) 65 | 66 | judge = 0 67 | if rank < 1.5: 68 | judge += 1 69 | if qscore > 0.5: 70 | judge += 1 71 | if ztile < 0.75 and ztile >= 0: 72 | judge += 1 73 | if qtile < 0.75 and qtile >= 0: 74 | judge += 1 75 | if znorm > 0.225: 76 | judge += 1 77 | 78 | seqjudge = 'no' 79 | if bestprob >= 20 and bestcov >= 0.2: 80 | judge += 1 81 | seqjudge = 'low' 82 | if bestprob >= 50 and bestcov >= 0.3: 83 | judge += 1 84 | seqjudge = 'medium' 85 | if bestprob >= 80 and bestcov >= 0.4: 86 | judge += 1 87 | seqjudge = 'high' 88 | if bestprob >= 95 and bestcov >= 0.6: 89 | judge += 1 90 | seqjudge = 'superb' 91 | 92 | if judge: 93 | seg_strings = 
words[10].split(',') 94 | filt_segs = [] 95 | for seg in words[10].split(','): 96 | start = int(seg.split('-')[0]) 97 | end = int(seg.split('-')[1]) 98 | for res in range(start, end + 1): 99 | if not filt_segs: 100 | filt_segs.append([res]) 101 | else: 102 | if res > filt_segs[-1][-1] + 10: 103 | filt_segs.append([res]) 104 | else: 105 | filt_segs[-1].append(res) 106 | 107 | filt_seg_strings = [] 108 | total_good_count = 0 109 | for seg in filt_segs: 110 | start = seg[0] 111 | end = seg[-1] 112 | good_count = 0 113 | for res in range(start, end + 1): 114 | good_count += 1 115 | if good_count >= 5: 116 | total_good_count += good_count 117 | filt_seg_strings.append(f'{str(start)}-{str(end)}') 118 | if total_good_count >= 25: 119 | results.append('structure\t' + seqjudge + '\t' + prefix + '\t' + str(znorm) + '\t' + '\t'.join(words[:10]) + '\t' + ','.join(seg_strings) + '\t' + ','.join(filt_seg_strings) + '\n') 120 | fp.close() 121 | 122 | if results: 123 | rp = open(f'{prefix}.goodDomains', 'w') 124 | for line in results: 125 | rp.write(line) 126 | rp.close() 127 | -------------------------------------------------------------------------------- /v1.0/step11_get_sse.py: -------------------------------------------------------------------------------- 1 | import os, sys 2 | import numpy as np 3 | 4 | prefix = sys.argv[1] 5 | wd = sys.argv[2] 6 | if os.getcwd() != wd: 7 | os.chdir(wd) 8 | 9 | os.system(f'mkdssp -i {prefix}.pdb -o {prefix}.dssp') 10 | fp = open(f'{prefix}.fa', 'r') 11 | for line in fp: 12 | if line[0] != '>': 13 | seq = line[:-1] 14 | fp.close() 15 | 16 | fp = open(f'{prefix}.dssp', 'r') 17 | start = 0 18 | dssp_result = '' 19 | resids = [] 20 | for line in fp: 21 | words = line.split() 22 | if len(words) > 3: 23 | if words[0] == '#' and words[1] == 'RESIDUE': 24 | start = 1 25 | elif start: 26 | try: 27 | resid = int(line[5:10]) 28 | getit = 1 29 | except ValueError: 30 | getit = 0 31 | 32 | if getit: 33 | pred = line[16] 34 | resids.append(resid) 35 | pred = line[16] 36 | if pred == 'E' or pred == 'B': 37 | newpred = 'E' 38 | elif pred == 'G' or pred == 'H' or pred == 'I': 39 | newpred = 'H' 40 | else: 41 | newpred = '-' 42 | dssp_result += newpred 43 | fp.close() 44 | 45 | res2sse = {} 46 | dssp_segs = dssp_result.split('--') 47 | posi = 0 48 | Nsse = 0 49 | for dssp_seg in dssp_segs: 50 | judge = 0 51 | if dssp_seg.count('E') >= 3 or dssp_seg.count('H') >= 6: 52 | Nsse += 1 53 | judge = 1 54 | for char in dssp_seg: 55 | resid = resids[posi] 56 | if char != '-': 57 | if judge: 58 | res2sse[resid] = [Nsse, char] 59 | posi += 1 60 | posi += 2 61 | 62 | os.system(f'rm {prefix}.dssp') 63 | if len(resids) != len(seq): 64 | print (f'error\t{prefix}\t{len(resids)}\t{len(seq)}') 65 | else: 66 | rp = open(f'{prefix}.sse', 'w') 67 | for resid in resids: 68 | try: 69 | rp.write(f'{resid}\t{seq[resid - 1]}\t{res2sse[resid][0]}\t{res2sse[resid][1]}\n') 70 | except KeyError: 71 | rp.write(f'{resid}\t{seq[resid - 1]}\tna\tC\n') 72 | rp.close() 73 | -------------------------------------------------------------------------------- /v1.0/step12_get_diso.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os, sys, json, string 3 | 4 | prefix = sys.argv[1] 5 | wd = sys.argv[2] 6 | if os.getcwd() != wd: 7 | os.chdir(wd) 8 | 9 | insses = set([]) 10 | res2sse = {} 11 | fp = open(f'{prefix}.sse', 'r') 12 | for line in fp: 13 | words = line.split() 14 | if words[2] != 'na': 15 | sseid = int(words[2]) 16 | resid = int(words[0]) 17 | 
insses.add(resid) 18 | res2sse[resid] = sseid 19 | fp.close() 20 | 21 | hit_resids = set([]) 22 | if os.path.exists(f'{prefix}.goodDomains'): 23 | fp = open(f'{prefix}.goodDomains', 'r') 24 | for line in fp: 25 | words = line.split() 26 | if words[0] == 'sequence': 27 | segs = words[8].split(',') 28 | elif words[0] == 'structure': 29 | segs = words[14].split(',') 30 | for seg in segs: 31 | if '-' in seg: 32 | start = int(seg.split('-')[0]) 33 | end = int(seg.split('-')[1]) 34 | for resid in range(start, end+1): 35 | hit_resids.add(resid) 36 | else: 37 | resid = int(seg) 38 | hit_resids.add(resid) 39 | fp.close() 40 | 41 | fp = open(f'{prefix}.json','r') 42 | text = fp.read()[1:-1] 43 | fp.close() 44 | json_dict = json.loads(text) 45 | 46 | if 'predicted_aligned_error' in json_dict.keys(): 47 | paes = json_dict['predicted_aligned_error'] 48 | length = len(paes) 49 | rpair2error = {} 50 | for i in range(length): 51 | res1 = i + 1 52 | try: 53 | rpair2error[res1] 54 | except KeyError: 55 | rpair2error[res1] = {} 56 | for j in range(length): 57 | res2 = j + 1 58 | rpair2error[res1][res2] = paes[i][j] 59 | 60 | elif 'distance' in json_dict.keys(): 61 | resid1s = json_dict['residue1'] 62 | resid2s = json_dict['residue2'] 63 | prot_len1 = max(resid1s) 64 | prot_len2 = max(resid2s) 65 | if prot_len1 != prot_len2: 66 | print (f'error, matrix is not a square with shape ({str(prot_len1)}, {str(prot_len2)})') 67 | else: 68 | length = prot_len1 69 | 70 | allerrors = json_dict['distance'] 71 | mtx_size = len(allerrors) 72 | 73 | rpair2error = {} 74 | for i in range(mtx_size): 75 | res1 = resid1s[i] 76 | res2 = resid2s[i] 77 | try: 78 | rpair2error[res1] 79 | except KeyError: 80 | rpair2error[res1] = {} 81 | rpair2error[res1][res2] = allerrors[i] 82 | 83 | else: 84 | print ('error') 85 | 86 | res2contacts = {} 87 | for i in range(length): 88 | for j in range(length): 89 | res1 = i + 1 90 | res2 = j + 1 91 | error = rpair2error[res1][res2] 92 | if res1 + 20 <= res2 and error < 6: 93 | if res2 in insses: 94 | if res1 in insses and res2sse[res1] == res2sse[res2]: 95 | pass 96 | else: 97 | try: 98 | res2contacts[res1].append(res2) 99 | except KeyError: 100 | res2contacts[res1] = [res2] 101 | if res1 in insses: 102 | if res2 in insses and res2sse[res2] == res2sse[res1]: 103 | pass 104 | else: 105 | try: 106 | res2contacts[res2].append(res1) 107 | except KeyError: 108 | res2contacts[res2] = [res1] 109 | 110 | 111 | diso_resids = set([]) 112 | for start in range(1, length - 4): 113 | total_contact = 0 114 | hitres_count = 0 115 | for res in range(start, start + 5): 116 | if res in hit_resids: 117 | hitres_count += 1 118 | if res in insses: 119 | try: 120 | total_contact += len(res2contacts[res]) 121 | except KeyError: 122 | pass 123 | if total_contact <= 5 and hitres_count <= 2: 124 | for res in range(start, start + 5): 125 | diso_resids.add(res) 126 | 127 | diso_resids_list = list(diso_resids) 128 | diso_resids_list.sort() 129 | 130 | rp = open(f'{prefix}.diso', 'w') 131 | for resid in diso_resids_list: 132 | rp.write(f'{resid}\n') 133 | rp.close() 134 | -------------------------------------------------------------------------------- /v1.0/step1_get_AFDB_pdbs.py: -------------------------------------------------------------------------------- 1 | #!/usr1/local/bin/python 2 | import os, sys, string 3 | import pdbx 4 | from pdbx.reader.PdbxReader import PdbxReader 5 | 6 | three2one = {} 7 | three2one["ALA"] = "A" 8 | three2one["CYS"] = "C" 9 | three2one["ASP"] = "D" 10 | three2one["GLU"] = "E" 11 | 
three2one["PHE"] = "F" 12 | three2one["GLY"] = "G" 13 | three2one["HIS"] = "H" 14 | three2one["ILE"] = "I" 15 | three2one["LYS"] = "K" 16 | three2one["LEU"] = "L" 17 | three2one["MET"] = "M" 18 | three2one["MSE"] = "M" 19 | three2one["ASN"] = "N" 20 | three2one["PRO"] = "P" 21 | three2one["GLN"] = "Q" 22 | three2one["ARG"] = "R" 23 | three2one["SER"] = "S" 24 | three2one["THR"] = "T" 25 | three2one["VAL"] = "V" 26 | three2one["TRP"] = "W" 27 | three2one["TYR"] = "Y" 28 | 29 | prefix = sys.argv[1] 30 | wd=sys.argv[2] 31 | if os.getcwd() != wd: 32 | os.chdir(wd) 33 | 34 | if os.path.exists(prefix+'.cif') and os.path.exists(prefix + ".fa"): 35 | fp = open(prefix + ".fa", "r") 36 | myseq = "" 37 | for line in fp: 38 | if line[0] == ">": 39 | pass 40 | else: 41 | myseq += line[:-1] 42 | fp.close() 43 | 44 | cif = open(prefix + ".cif", "r") 45 | pRd = PdbxReader(cif) 46 | data = [] 47 | pRd.read(data) 48 | block = data[0] 49 | 50 | atom_site = block.getObj("atom_site") 51 | record_type_index = atom_site.getIndex("group_PDB") 52 | atom_type_index = atom_site.getIndex("type_symbol") 53 | atom_identity_index = atom_site.getIndex("label_atom_id") 54 | residue_type_index = atom_site.getIndex("label_comp_id") 55 | chain_id_index = atom_site.getIndex("label_asym_id") 56 | residue_id_index = atom_site.getIndex("label_seq_id") 57 | coor_x_index = atom_site.getIndex("Cartn_x") 58 | coor_y_index = atom_site.getIndex("Cartn_y") 59 | coor_z_index = atom_site.getIndex("Cartn_z") 60 | alt_id_index = atom_site.getIndex("label_alt_id") 61 | model_num_index = atom_site.getIndex("pdbx_PDB_model_num") 62 | 63 | if model_num_index == -1: 64 | mylines = [] 65 | for i in range(atom_site.getRowCount()): 66 | words = atom_site.getRow(i) 67 | chain_id = words[chain_id_index] 68 | record_type = words[record_type_index] 69 | if chain_id == "A" and record_type == "ATOM": 70 | mylines.append(words) 71 | else: 72 | model2lines = {} 73 | models = [] 74 | for i in range(atom_site.getRowCount()): 75 | words = atom_site.getRow(i) 76 | chain_id = words[chain_id_index] 77 | record_type = words[record_type_index] 78 | model_num = int(words[model_num_index]) 79 | if chain_id == "A" and record_type == "ATOM": 80 | try: 81 | model2lines[model_num].append(words) 82 | except KeyError: 83 | model2lines[model_num] = [words] 84 | models.append(model_num) 85 | best_model = min(models) 86 | mylines = model2lines[best_model] 87 | 88 | goodlines = [] 89 | resid2altid = {} 90 | resid2aa = {} 91 | atom_count = 0 92 | for words in mylines: 93 | atom_type = words[atom_type_index] 94 | atom_identity = words[atom_identity_index] 95 | residue_type = words[residue_type_index] 96 | residue_id = int(words[residue_id_index]) 97 | alt_id = words[alt_id_index] 98 | 99 | if atom_identity == "CA": 100 | try: 101 | resid2aa[residue_id] = three2one[residue_type] 102 | except KeyError: 103 | resid2aa[residue_id] = "X" 104 | 105 | get_line = 0 106 | if alt_id == ".": 107 | get_line = 1 108 | else: 109 | try: 110 | if resid2altid[residue_id] == alt_id: 111 | get_line = 1 112 | else: 113 | get_line = 0 114 | except KeyError: 115 | resid2altid[residue_id] = alt_id 116 | get_line = 1 117 | 118 | if get_line: 119 | atom_count += 1 120 | coor_x_info = words[coor_x_index].split(".") 121 | if len(coor_x_info) >= 2: 122 | coor_x = coor_x_info[0] + "." + coor_x_info[1][:3] 123 | else: 124 | coor_x = coor_x_info[0] 125 | coor_y_info = words[coor_y_index].split(".") 126 | if len(coor_y_info) >= 2: 127 | coor_y = coor_y_info[0] + "." 
+ coor_y_info[1][:3] 128 | else: 129 | coor_y = coor_y_info[0] 130 | coor_z_info = words[coor_z_index].split(".") 131 | if len(coor_z_info) >= 2: 132 | coor_z = coor_z_info[0] + "." + coor_z_info[1][:3] 133 | else: 134 | coor_z = coor_z_info[0] 135 | if len(atom_identity) < 4: 136 | goodlines.append("ATOM " + str(atom_count).rjust(5) + " " + atom_identity.ljust(3) + " " + residue_type.ljust(3) + " A" + str(residue_id).rjust(4) + " " + coor_x.rjust(8) + coor_y.rjust(8) + coor_z.rjust(8) + " 1.00 0.00 " + atom_type + "\n") 137 | elif len(atom_identity) == 4: 138 | goodlines.append("ATOM " + str(atom_count).rjust(5) + " " + atom_identity + " " + residue_type.ljust(3) + " A" + str(residue_id).rjust(4) + " " + coor_x.rjust(8) + coor_y.rjust(8) + coor_z.rjust(8) + " 1.00 0.00 " + atom_type + "\n") 139 | 140 | newseq = "" 141 | for i in range(len(myseq)): 142 | resid = i + 1 143 | try: 144 | newseq += resid2aa[resid] 145 | if resid2aa[resid] == "X": 146 | pass 147 | elif resid2aa[resid] == myseq[i]: 148 | pass 149 | else: 150 | print ("error\t" + prefix) 151 | except KeyError: 152 | newseq += "-" 153 | if newseq == myseq: 154 | rp = open(prefix + ".pdb","w") 155 | for goodline in goodlines: 156 | rp.write(goodline) 157 | rp.close() 158 | else: 159 | sys.exit(1) 160 | elif os.path.exists(prefix + ".pdb") and os.path.exists(prefix + ".fa"): 161 | fp = open(prefix + ".fa", "r") 162 | myseq = "" 163 | for line in fp: 164 | if line[0] == ">": 165 | pass 166 | else: 167 | myseq += line[:-1] 168 | fp.close() 169 | 170 | pdb_new = [] 171 | res = [] 172 | f = open(prefix + ".pdb") 173 | for line in f: 174 | if line[:4] == 'ATOM': 175 | pdb_new.append(line) 176 | if line[13:16].strip()=='CA': 177 | try: 178 | res.append(three2one[line[17:20]]) 179 | except: 180 | res.append('X') 181 | f.close() 182 | if ''.join(res) == myseq: 183 | rp = open(prefix + ".pdb","w") 184 | for i in pdb_new: 185 | rp.write(i) 186 | rp.close() 187 | else: 188 | sys.exit(1) 189 | else: 190 | sys.exit(1) 191 | -------------------------------------------------------------------------------- /v1.0/step1_get_AFDB_seqs.py: -------------------------------------------------------------------------------- 1 | #!/usr1/local/bin/python 2 | import os, sys 3 | import pdbx 4 | from pdbx.reader.PdbxReader import PdbxReader 5 | 6 | three2one = {} 7 | three2one["ALA"] = 'A' 8 | three2one["CYS"] = 'C' 9 | three2one["ASP"] = 'D' 10 | three2one["GLU"] = 'E' 11 | three2one["PHE"] = 'F' 12 | three2one["GLY"] = 'G' 13 | three2one["HIS"] = 'H' 14 | three2one["ILE"] = 'I' 15 | three2one["LYS"] = 'K' 16 | three2one["LEU"] = 'L' 17 | three2one["MET"] = 'M' 18 | three2one["MSE"] = 'M' 19 | three2one["ASN"] = 'N' 20 | three2one["PRO"] = 'P' 21 | three2one["GLN"] = 'Q' 22 | three2one["ARG"] = 'R' 23 | three2one["SER"] = 'S' 24 | three2one["THR"] = 'T' 25 | three2one["VAL"] = 'V' 26 | three2one["TRP"] = 'W' 27 | three2one["TYR"] = 'Y' 28 | 29 | 30 | prefix = sys.argv[1] 31 | wd = sys.argv[2] 32 | if os.getcwd() != wd: 33 | os.chdir(wd) 34 | 35 | 36 | struc_fn = prefix + '.cif' 37 | if os.path.exists(struc_fn): 38 | cif = open(prefix + ".cif") 39 | else: 40 | if os.path.exists(prefix + ".pdb"): 41 | os.system(f'pdb2fasta '+ prefix + ".pdb > " + prefix + ".fa") 42 | with open(prefix + ".fa") as f: 43 | fa = f.readlines() 44 | fa[0] = fa[0].split(':')[0] + '\n' 45 | with open(prefix + ".fa",'w') as f: 46 | f.write(''.join(fa)) 47 | else: 48 | print("No recognized structure file (*.cif or *.pdb). 
Existing...") 49 | sys.exit() 50 | 51 | pRd = PdbxReader(cif) 52 | data = [] 53 | pRd.read(data) 54 | block = data[0] 55 | 56 | modinfo = {} 57 | mod_residues = block.getObj("pdbx_struct_mod_residue") 58 | if mod_residues: 59 | chainid = mod_residues.getIndex("label_asym_id") 60 | posiid = mod_residues.getIndex("label_seq_id") 61 | parentid = mod_residues.getIndex("parent_comp_id") 62 | resiid = mod_residues.getIndex("label_comp_id") 63 | for i in range(mod_residues.getRowCount()): 64 | words = mod_residues.getRow(i) 65 | try: 66 | modinfo[words[chainid]] 67 | except KeyError: 68 | modinfo[words[chainid]] = {} 69 | modinfo[words[chainid]][words[posiid]] = [words[resiid], words[parentid]] 70 | 71 | entity_poly = block.getObj("entity_poly") 72 | pdbx_poly_seq_scheme = block.getObj("pdbx_poly_seq_scheme") 73 | if pdbx_poly_seq_scheme and entity_poly: 74 | typeid = entity_poly.getIndex("type") 75 | entityid1 = entity_poly.getIndex("entity_id") 76 | entityid2 = pdbx_poly_seq_scheme.getIndex("entity_id") 77 | chainid = pdbx_poly_seq_scheme.getIndex("asym_id") 78 | resiid = pdbx_poly_seq_scheme.getIndex("mon_id") 79 | posiid = pdbx_poly_seq_scheme.getIndex("seq_id") 80 | 81 | good_entities = [] 82 | for i in range(entity_poly.getRowCount()): 83 | words = entity_poly.getRow(i) 84 | entity = words[entityid1] 85 | type = words[typeid] 86 | if type == "polypeptide(L)": 87 | good_entities.append(entity) 88 | 89 | if good_entities: 90 | chains = [] 91 | residues = {} 92 | seqs = {} 93 | rp = open(prefix + ".fa","w") 94 | for i in range(pdbx_poly_seq_scheme.getRowCount()): 95 | words = pdbx_poly_seq_scheme.getRow(i) 96 | entity = words[entityid2] 97 | if entity in good_entities: 98 | chain = words[chainid] 99 | 100 | try: 101 | aa = three2one[words[resiid]] 102 | except KeyError: 103 | try: 104 | modinfo[chain][words[posiid]] 105 | resiname = modinfo[chain][words[posiid]][0] 106 | if words[resiid] == resiname: 107 | new_resiname = modinfo[chain][words[posiid]][1] 108 | try: 109 | aa = three2one[new_resiname] 110 | except KeyError: 111 | aa = "X" 112 | print ("error1 " + new_resiname) 113 | else: 114 | aa = "X" 115 | print ("error2 " + words[resiid] + " " + resiname) 116 | except KeyError: 117 | print (modinfo) 118 | print (words[resiid]) 119 | aa = "X" 120 | 121 | try: 122 | seqs[chain] 123 | except KeyError: 124 | chains.append(chain) 125 | seqs[chain] = {} 126 | 127 | try: 128 | if seqs[chain][int(words[posiid])] == "X" and aa != "X": 129 | seqs[chain][int(words[posiid])] = aa 130 | except KeyError: 131 | seqs[chain][int(words[posiid])] = aa 132 | 133 | try: 134 | residues[chain].add(int(words[posiid])) 135 | except KeyError: 136 | residues[chain] = set([int(words[posiid])]) 137 | 138 | for chain in chains: 139 | for i in range(len(residues[chain])): 140 | if not i + 1 in residues[chain]: 141 | print ("error3 " + prefix + " " + chain) 142 | break 143 | else: 144 | rp.write(">" + prefix + "\n") 145 | finalseq = [] 146 | for i in range(len(residues[chain])): 147 | finalseq.append(seqs[chain][i+1]) 148 | rp.write("".join(finalseq) + "\n") 149 | rp.close() 150 | else: 151 | print ("empty " + prefix) 152 | else: 153 | print ("bad " + prefix) 154 | -------------------------------------------------------------------------------- /v1.0/step2_run_hhsearch.py: -------------------------------------------------------------------------------- 1 | import os, sys 2 | 3 | prefix = sys.argv[1] 4 | CPUs = sys.argv[2] 5 | wd = sys.argv[3] 6 | data_dir=sys.argv[4] 7 | if os.getcwd() != wd: 8 | os.chdir(wd) 9 | 10 | print 
(f'hhblits -cpu {CPUs} -i {prefix}.fa -d {data_dir}/UniRef30_2022_02/UniRef30_2022_02 -oa3m {prefix}.a3m') 11 | os.system(f'hhblits -cpu {CPUs} -i {prefix}.fa -d {data_dir}/UniRef30_2022_02/UniRef30_2022_02 -oa3m {prefix}.a3m') 12 | print (f'addss.pl {prefix}.a3m {prefix}.a3m.ss -a3m') 13 | os.system(f'addss.pl {prefix}.a3m {prefix}.a3m.ss -a3m') 14 | os.system(f'mv {prefix}.a3m.ss {prefix}.a3m') 15 | print (f'hhmake -i {prefix}.a3m -o {prefix}.hmm') 16 | os.system(f'hhmake -i {prefix}.a3m -o {prefix}.hmm') 17 | print (f'hhsearch -cpu {CPUs} -Z 100000 -B 100000 -i {prefix}.hmm -d {data_dir}/pdb70/pdb70 -o {prefix}.hhsearch') 18 | os.system(f'hhsearch -cpu {CPUs} -Z 100000 -B 100000 -i {prefix}.hmm -d {data_dir}/pdb70/pdb70 -o {prefix}.hhsearch') 19 | -------------------------------------------------------------------------------- /v1.0/step3_run_foldseek.py: -------------------------------------------------------------------------------- 1 | import os, sys,time 2 | 3 | 4 | prefix = sys.argv[1] 5 | threads = sys.argv[2] 6 | wd = sys.argv[3] 7 | data_dir = sys.argv[4] 8 | 9 | if os.getcwd() != wd: 10 | os.chdir(wd) 11 | if not os.path.exists('foldseek_tmp'): 12 | os.system('mkdir foldseek_tmp') 13 | 14 | os.system(f'foldseek easy-search {prefix}.pdb {data_dir}/ECOD_foldseek_DB/ECOD_foldseek_DB {prefix}.foldseek foldseek_tmp -e 1000000 --max-seqs 1000000 --threads {threads} > {prefix}_foldseek.log') 15 | os.system('rm -rf foldseek_tmp') 16 | -------------------------------------------------------------------------------- /v1.0/step4_filter_foldseek.py: -------------------------------------------------------------------------------- 1 | import sys,os 2 | 3 | prefix = sys.argv[1] 4 | wd=sys.argv[2] 5 | if os.getcwd() != wd: 6 | os.chdir(wd) 7 | 8 | fp = open(prefix + '.fa','r') 9 | query_seq = '' 10 | for line in fp: 11 | if line[0] != '>': 12 | query_seq += line[:-1] 13 | fp.close() 14 | qlen = len(query_seq) 15 | 16 | fp = open(prefix + '.foldseek', 'r') 17 | hits = [] 18 | for line in fp: 19 | words = line.split() 20 | dnum = words[1].split('.')[0] 21 | qstart = int(words[6]) 22 | qend = int(words[7]) 23 | qresids = set([]) 24 | for qres in range(qstart, qend + 1): 25 | qresids.add(qres) 26 | evalue = float(words[10]) 27 | hits.append([dnum, evalue, qstart, qend, qresids]) 28 | fp.close() 29 | hits.sort(key = lambda x:x[1]) 30 | 31 | qres2count = {} 32 | for res in range(1, qlen + 1): 33 | qres2count[res] = 0 34 | 35 | rp = open(prefix + '.foldseek.flt.result', 'w') 36 | rp.write('ecodnum\tevalue\trange\n') 37 | for hit in hits: 38 | dnum = hit[0] 39 | evalue = hit[1] 40 | qstart = hit[2] 41 | qend = hit[3] 42 | qresids = hit[4] 43 | for res in qresids: 44 | qres2count[res] += 1 45 | good_res = 0 46 | for res in qresids: 47 | if qres2count[res] <= 100: 48 | good_res += 1 49 | if good_res >= 10: 50 | rp.write(dnum + '\t' + str(evalue) + '\t' + str(qstart) + '-' + str(qend) + '\n') 51 | rp.close() 52 | -------------------------------------------------------------------------------- /v1.0/step5_map_to_ecod.py: -------------------------------------------------------------------------------- 1 | import sys,os 2 | 3 | def get_range(resids, chainid): 4 | resids.sort() 5 | segs = [] 6 | for resid in resids: 7 | if not segs: 8 | segs.append([resid]) 9 | else: 10 | if resid > segs[-1][-1] + 1: 11 | segs.append([resid]) 12 | else: 13 | segs[-1].append(resid) 14 | ranges = [] 15 | for seg in segs: 16 | if chainid: 17 | ranges.append(chainid + ':' + str(seg[0]) + '-' + str(seg[-1])) 18 | else: 19 | 
ranges.append(str(seg[0]) + '-' + str(seg[-1])) 20 | return ','.join(ranges) 21 | 22 | 23 | prefix = sys.argv[1] 24 | wd = sys.argv[2] 25 | data_dir = sys.argv[3] 26 | if os.getcwd() != wd: 27 | os.chdir(wd) 28 | 29 | fp = open(prefix + '.hhsearch', 'r') 30 | info = fp.read().split('\n>') 31 | fp.close() 32 | allhits = [] 33 | need_pdbchains = set([]) 34 | need_pdbs = set([]) 35 | for hit in info[1:]: 36 | lines = hit.split('\n') 37 | qstart = 0 38 | qend = 0 39 | qseq = '' 40 | hstart = 0 41 | hend = 0 42 | hseq = '' 43 | for line in lines: 44 | if len(line) >= 6: 45 | if line[:6] == 'Probab': 46 | words = line.split() 47 | for word in words: 48 | subwords = word.split('=') 49 | if subwords[0] == 'Probab': 50 | hh_prob = subwords[1] 51 | elif subwords[0] == 'E-value': 52 | hh_eval = subwords[1] 53 | elif subwords[0] == 'Score': 54 | hh_score = subwords[1] 55 | elif subwords[0] == 'Aligned_cols': 56 | aligned_cols = subwords[1] 57 | elif subwords[0] == 'Identities': 58 | idents = subwords[1] 59 | elif subwords[0] == 'Similarity': 60 | similarities = subwords[1] 61 | elif subwords[0] == 'Sum_probs': 62 | sum_probs = subwords[1] 63 | 64 | elif line[:2] == 'Q ': 65 | words = line.split() 66 | if words[1] != 'ss_pred' and words[1] != 'Consensus': 67 | qseq += words[3] 68 | if not qstart: 69 | qstart = int(words[2]) 70 | qend = int(words[4]) 71 | 72 | elif line[:2] == 'T ': 73 | words = line.split() 74 | if words[1] != 'Consensus' and words[1] != 'ss_dssp' and words[1] != 'ss_pred': 75 | hid = words[1] 76 | hseq += words[3] 77 | if not hstart: 78 | hstart = int(words[2]) 79 | hend = int(words[4]) 80 | allhits.append([hid, hh_prob, hh_eval, hh_score, aligned_cols, idents, similarities, sum_probs, qstart, qend, qseq, hstart, hend, hseq]) 81 | need_pdbchains.add(hid) 82 | need_pdbs.add(hid.split('_')[0].lower()) 83 | 84 | 85 | fp = open(data_dir + '/ECOD_pdbmap','r') 86 | pdb2ecod = {} 87 | good_hids = set([]) 88 | for line in fp: 89 | words = line.split() 90 | pdbid = words[1] 91 | segments = words[2].split(',') 92 | chainids = set([]) 93 | resids = [] 94 | for segment in segments: 95 | chainids.add(segment.split(':')[0]) 96 | if '-' in segment: 97 | start = int(segment.split(':')[1].split('-')[0]) 98 | end = int(segment.split(':')[1].split('-')[1]) 99 | for res in range(start, end + 1): 100 | resids.append(res) 101 | else: 102 | resid = int(segment.split(':')[1]) 103 | resids.append(resid) 104 | if len(chainids) == 1: 105 | chainid = list(chainids)[0] 106 | pdbchain = pdbid.upper() + '_' + chainid 107 | if pdbchain in need_pdbchains: 108 | good_hids.add(pdbchain) 109 | pdb2ecod[pdbchain] = {} 110 | for i, resid in enumerate(resids): 111 | pdb2ecod[pdbchain][resid] = words[0] + ':' + str(i + 1) 112 | else: 113 | print (line[:-1]) 114 | fp.close() 115 | 116 | ecod2key = {} 117 | ecod2len = {} 118 | fp = open(data_dir + '/ECOD_length','r') 119 | for line in fp: 120 | words = line.split() 121 | ecod2key[words[0]] = words[1] 122 | ecod2len[words[0]] = int(words[2]) 123 | fp.close() 124 | 125 | rp = open(prefix + '.map2ecod.result', 'w') 126 | rp.write('uid\tecod_domain_id\thh_prob\thh_eval\thh_score\taligned_cols\tidents\tsimilarities\tsum_probs\tcoverage\tungapped_coverage\tquery_range\ttemplate_range\ttemplate_seqid_range\n') 127 | for hit in allhits: 128 | hid = hit[0] 129 | pdbid = hid.split('_')[0] 130 | chainid = hid.split('_')[1] 131 | ecods = [] 132 | ecod2hres = {} 133 | ecod2hresmap = {} 134 | if hid in good_hids: 135 | for pdbres in pdb2ecod[hid].keys(): 136 | for item in 
pdb2ecod[hid][pdbres].split(','): 137 | ecod = item.split(':')[0] 138 | ecodres = int(item.split(':')[1]) 139 | try: 140 | ecod2hres[ecod] 141 | ecod2hresmap[ecod] 142 | except KeyError: 143 | ecods.append(ecod) 144 | ecod2hres[ecod] = set([]) 145 | ecod2hresmap[ecod] = {} 146 | ecod2hres[ecod].add(pdbres) 147 | ecod2hresmap[ecod][pdbres] = ecodres 148 | 149 | hh_prob = hit[1] 150 | hh_eval = hit[2] 151 | hh_score = hit[3] 152 | aligned_cols = hit[4] 153 | idents = hit[5] 154 | similarities = hit[6] 155 | sum_probs = hit[7] 156 | qstart = hit[8] 157 | qseq = hit[10] 158 | hstart = hit[11] 159 | hseq = hit[13] 160 | 161 | for ecod in ecods: 162 | ecodkey = ecod2key[ecod] 163 | ecodlen = ecod2len[ecod] 164 | qposi = qstart - 1 165 | hposi = hstart - 1 166 | qresids = [] 167 | hresids = [] 168 | eresids = [] 169 | if len(qseq) == len(hseq): 170 | for i in range(len(hseq)): 171 | if qseq[i] != '-': 172 | qposi += 1 173 | if hseq[i] != '-': 174 | hposi += 1 175 | if qseq[i] != '-' and hseq[i] != '-': 176 | if hposi in ecod2hres[ecod]: 177 | eposi = ecod2hresmap[ecod][hposi] 178 | qresids.append(qposi) 179 | hresids.append(hposi) 180 | eresids.append(eposi) 181 | if len(qresids) >= 10 and len(eresids) >= 10: 182 | qrange = get_range(qresids,'') 183 | hrange = get_range(hresids, chainid) 184 | erange = get_range(eresids,'') 185 | coverage = round(len(eresids) / ecodlen, 3) 186 | ungapped_coverage = round((max(eresids) - min(eresids) + 1) / ecodlen, 3) 187 | rp.write(ecod + '\t' + ecodkey + '\t' + hh_prob + '\t' + hh_eval + '\t' + hh_score + '\t' + aligned_cols + '\t' + idents + '\t' + similarities + '\t' + sum_probs + '\t' + str(coverage) + '\t' + str(ungapped_coverage) + '\t' + qrange + '\t' + erange + '\t' + hrange + '\n') 188 | else: 189 | print ('error\t' + prot + '\t' + ecod) 190 | rp.close() 191 | -------------------------------------------------------------------------------- /v1.0/step6_get_dali_candidates.py: -------------------------------------------------------------------------------- 1 | import sys,os 2 | 3 | prefix = sys.argv[1] 4 | wd = sys.argv[2] 5 | 6 | if os.getcwd() != wd: 7 | os.chdir(wd) 8 | 9 | domains = set([]) 10 | fp = open(prefix + '.map2ecod.result', 'r') 11 | for countl, line in enumerate(fp): 12 | if countl: 13 | words = line.split() 14 | domains.add(words[0]) 15 | fp.close() 16 | 17 | fp = open(prefix + '.foldseek.flt.result','r') 18 | for countl, line in enumerate(fp): 19 | if countl: 20 | words = line.split() 21 | domains.add(words[0]) 22 | fp.close() 23 | 24 | rp = open(prefix + '_hits4Dali', 'w') 25 | for domain in domains: 26 | rp.write(domain + '\n') 27 | rp.close() 28 | -------------------------------------------------------------------------------- /v1.0/step7_iterative_dali_aug_multi.py: -------------------------------------------------------------------------------- 1 | import os, sys 2 | import time 3 | from multiprocessing import Pool 4 | 5 | prefix = sys.argv[1] 6 | CPUs = sys.argv[2] 7 | wd = sys.argv[3] 8 | data_dir=sys.argv[4] 9 | 10 | 11 | def get_domain_range(resids): 12 | segs = [] 13 | resids.sort() 14 | cutoff1 = 5 15 | cutoff2 = len(resids) * 0.05 16 | cutoff = max(cutoff1, cutoff2) 17 | for resid in resids: 18 | if not segs: 19 | segs.append([resid]) 20 | else: 21 | if resid > segs[-1][-1] + cutoff: 22 | segs.append([resid]) 23 | else: 24 | segs[-1].append(resid) 25 | seg_string = [] 26 | for seg in segs: 27 | start = str(seg[0]) 28 | end = str(seg[-1]) 29 | seg_string.append(start + '-' + end) 30 | return ','.join(seg_string) 31 | 32 | 33 | 
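# Editorial note (added; not in the original script): run_dali below carries out
# the iterative Dali search against one ECOD template (edomain). Roughly: the
# full-length query PDB is copied into a per-template temp directory, dali.pl is
# run against {data_dir}/ECOD70/{edomain}.pdb, and the top alignment is parsed
# from the mol*.txt output. If at least 20 structurally equivalent (uppercase)
# positions align, the hit is appended to iterativeDali_{prefix}/{prefix}_{edomain}_hits,
# the aligned region (expanded into contiguous segments by get_domain_range) is
# stripped from the query PDB, and the search repeats on the remaining residues
# until fewer than 20 are left or no further alignment passes the threshold.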
def run_dali(edomain): 34 | alicount = 0 35 | os.system(f'mkdir {wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}') 36 | os.system(f'cp {wd}/{prefix}.pdb {wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/{prefix}_{edomain}.pdb') 37 | os.system(f'mkdir {wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp') 38 | os.chdir(f'{wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp') 39 | while True: 40 | os.system(f'dali.pl --pdbfile1 ../{prefix}_{edomain}.pdb --pdbfile2 {data_dir}/ECOD70/{edomain}.pdb --dat1 ./ -dat2 ./ --outfmt summary,alignments,transrot >& log') 41 | fp = os.popen('ls -1 mol*.txt') 42 | info = fp.readlines() 43 | fp.close() 44 | filenames = [] 45 | for line in info: 46 | filenames.append(line[:-1]) 47 | 48 | if filenames: 49 | fp = open('../' + prefix + '_' + edomain + '.pdb', 'r') 50 | Qresids_set = set([]) 51 | for line in fp: 52 | resid = int(line[22:26]) 53 | Qresids_set.add(resid) 54 | fp.close() 55 | Qresids = list(Qresids_set) 56 | 57 | info = [] 58 | for filename in filenames: 59 | fp = open(filename, "r") 60 | lines = fp.readlines() 61 | fp.close() 62 | for line in lines: 63 | info.append(line) 64 | 65 | qali = '' 66 | sali = '' 67 | getit = 1 68 | zscore = 0 69 | for line in info: 70 | words = line.split() 71 | if len(words) >= 2 and getit: 72 | if words[0] == 'Query': 73 | qali += words[1] 74 | elif words[0] == 'Sbjct': 75 | sali += words[1] 76 | elif words[0] == 'No' and words[1] == '1:': 77 | for word in words: 78 | if '=' in word: 79 | subwords = word.split('=') 80 | if subwords[0] == 'Z-score': 81 | zinfo = subwords[1].split('.') 82 | zscore = float(zinfo[0] + '.' + zinfo[1]) 83 | elif words[0] == 'No' and words[1] == '2:': 84 | getit = 0 85 | 86 | qinds = [] 87 | sinds = [] 88 | length = len(qali) 89 | qposi = 0 90 | sposi = 0 91 | match = 0 92 | for i in range(length): 93 | if qali[i] != '-': 94 | qposi += 1 95 | if sali[i] != '-': 96 | sposi += 1 97 | if qali[i] != '-' and sali[i] != '-': 98 | if qali[i].isupper() and sali[i].isupper(): 99 | match += 1 100 | qinds.append(qposi) 101 | sinds.append(sposi) 102 | qlen = qposi 103 | slen = sposi 104 | 105 | if match >= 20: 106 | alicount += 1 107 | rp = open(f'{wd}/iterativeDali_{prefix}/{prefix}_{edomain}_hits', 'a') 108 | rp.write('>' + edomain + '_' + str(alicount) + '\t' + str(zscore) + '\t' + str(match) + '\t' + str(qlen) + '\t' + str(slen) + '\n') 109 | for i in range(len(qinds)): 110 | qind = qinds[i] - 1 111 | sind = sinds[i] 112 | rp.write(str(Qresids[qind]) + '\t' + str(sind) + '\n') 113 | rp.close() 114 | 115 | raw_qresids = [] 116 | for qind in qinds: 117 | raw_qresids.append(Qresids[qind - 1]) 118 | qrange = get_domain_range(raw_qresids) 119 | qresids = set([]) 120 | qsegs = qrange.split(',') 121 | for qseg in qsegs: 122 | qedges = qseg.split('-') 123 | qstart = int(qedges[0]) 124 | qend = int(qedges[1]) 125 | for qres in range(qstart, qend + 1): 126 | qresids.add(qres) 127 | remain_resids = Qresids_set.difference(qresids) 128 | 129 | if len(remain_resids) >= 20: 130 | rp = open('../' + prefix + '_' + edomain + '.pdbnew', 'w') 131 | fp = open('../' + prefix + '_' + edomain + '.pdb', 'r') 132 | for line in fp: 133 | resid = int(line[22:26]) 134 | if resid in remain_resids: 135 | rp.write(line) 136 | fp.close() 137 | rp.close() 138 | os.system('mv ../' + prefix + '_' + edomain + '.pdbnew ../' + prefix + '_' + edomain + '.pdb') 139 | if os.getcwd() == f'{wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp': 140 | os.system('rm *') 141 | else: 142 | if os.getcwd() == 
f'{wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp': 143 | os.system('rm *') 144 | break 145 | else: 146 | if os.getcwd() == f'{wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp': 147 | os.system('rm *') 148 | break 149 | else: 150 | if os.getcwd() == f'{wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}/output_tmp': 151 | os.system('rm *') 152 | break 153 | os.chdir(wd) 154 | time.sleep(1) 155 | os.system(f'rm -rf {wd}/iterativeDali_{prefix}/tmp_{prefix}_{edomain}') 156 | 157 | if os.getcwd() != wd: 158 | os.chdir(wd) 159 | 160 | if os.path.exists(prefix + '.iterativeDali.done'): 161 | pass 162 | else: 163 | if not os.path.exists(f'{wd}/iterativeDali_{prefix}'): 164 | os.system(f'mkdir {wd}/iterativeDali_{prefix}') 165 | fp = open(prefix + '_hits4Dali','r') 166 | edomains = [] 167 | for line in fp: 168 | words = line.split() 169 | edomains.append(words[0]) 170 | fp.close() 171 | 172 | inputs = [] 173 | for edomain in edomains: 174 | inputs.append([edomain]) 175 | pool = Pool(processes = int(CPUs)) 176 | results = [] 177 | for item in inputs: 178 | process = pool.apply_async(run_dali, item) 179 | results.append(process) 180 | for process in results: 181 | process.get() 182 | 183 | os.system(f'cat {wd}/iterativeDali_{prefix}/{prefix}_*_hits > {prefix}_iterativdDali_hits') 184 | os.system(f'rm -rf {wd}/iterativeDal_{prefix}/tmp_*') 185 | os.system(f'rm -rf {wd}/iterativeDal_{prefix}') 186 | os.system(f'echo "done" > {prefix}.iterativeDali.done') 187 | -------------------------------------------------------------------------------- /v1.0/step8_analyze_dali.py: -------------------------------------------------------------------------------- 1 | import os, sys 2 | import numpy as np 3 | 4 | 5 | def get_range(resids): 6 | resids.sort() 7 | segs = [] 8 | for resid in resids: 9 | if not segs: 10 | segs.append([resid]) 11 | else: 12 | if resid > segs[-1][-1] + 1: 13 | segs.append([resid]) 14 | else: 15 | segs[-1].append(resid) 16 | ranges = [] 17 | for seg in segs: 18 | ranges.append(f'{seg[0]}-{seg[-1]}') 19 | return ','.join(ranges) 20 | 21 | 22 | prefix = sys.argv[1] 23 | wd = sys.argv[2] 24 | data_dir = sys.argv[3] 25 | if os.getcwd() != wd: 26 | os.chdir(wd) 27 | 28 | if os.path.exists(f'{prefix}_iterativdDali_hits'): 29 | fp = open(f'{data_dir}/ecod.latest.domains','r') 30 | ecod2id = {} 31 | ecod2fam = {} 32 | for line in fp: 33 | if line[0] != '#': 34 | words = line[:-1].split('\t') 35 | ecodnum = words[0] 36 | ecodid = words[1] 37 | ecodfam = '.'.join(words[3].split('.')[:2]) 38 | ecod2id[ecodnum] = ecodid 39 | ecod2fam[ecodnum] = ecodfam 40 | fp.close() 41 | 42 | fp = open(f'{prefix}_iterativdDali_hits','r') 43 | ecodnum = '' 44 | ecodid = '' 45 | ecodfam = '' 46 | hitname = '' 47 | maps = [] 48 | hits = [] 49 | for line in fp: 50 | if line[0] == '>': 51 | if ecodnum and ecodid and ecodfam and hitname and zscore and maps: 52 | hits.append([hitname, ecodnum, ecodid, ecodfam, zscore, maps]) 53 | words = line[1:].split() 54 | zscore = float(words[1]) 55 | hitname = words[0] 56 | ecodnum = hitname.split('_')[0] 57 | ecodid = ecod2id[ecodnum] 58 | ecodfam = ecod2fam[ecodnum] 59 | maps = [] 60 | else: 61 | words = line.split() 62 | pres = int(words[0]) 63 | eres = int(words[1]) 64 | maps.append([pres, eres]) 65 | fp.close() 66 | if ecodnum and ecodid and ecodfam and hitname and zscore and maps: 67 | hits.append([hitname, ecodnum, ecodid, ecodfam, zscore, maps]) 68 | 69 | 70 | newhits = [] 71 | for hit in hits: 72 | hitname = hit[0] 73 | ecodnum = hit[1] 74 | total_weight 
= 0 75 | posi2weight = {} 76 | zscores = [] 77 | qscores = [] 78 | if os.path.exists(f'{data_dir}/ecod_weights/{ecodnum}.weight'): 79 | fp = open(f'{data_dir}/ecod_weights/{ecodnum}.weight','r') 80 | posi2weight = {} 81 | for line in fp: 82 | words = line.split() 83 | total_weight += float(words[3]) 84 | posi2weight[int(words[0])] = float(words[3]) 85 | fp.close() 86 | if os.path.exists(f'{data_dir}/ecod_domain_info/{ecodnum}.info'): 87 | fp = open(f'{data_dir}/ecod_domain_info/{ecodnum}.info','r') 88 | for line in fp: 89 | words = line.split() 90 | zscores.append(float(words[1])) 91 | qscores.append(float(words[2])) 92 | fp.close() 93 | ecodid = hit[2] 94 | ecodfam = hit[3] 95 | zscore = hit[4] 96 | maps = hit[5] 97 | 98 | if zscores and qscores: 99 | qscore = 0 100 | for item in maps: 101 | try: 102 | qscore += posi2weight[item[1]] 103 | except KeyError: 104 | pass 105 | 106 | better = 0 107 | worse = 0 108 | for other_qscore in qscores: 109 | if other_qscore > qscore: 110 | better += 1 111 | else: 112 | worse += 1 113 | qtile = better / (better + worse) 114 | 115 | better = 0 116 | worse = 0 117 | for other_zscore in zscores: 118 | if other_zscore > zscore: 119 | better += 1 120 | else: 121 | worse += 1 122 | ztile = better / (better + worse) 123 | newhits.append([hitname, ecodnum, ecodid, ecodfam, zscore, qscore / total_weight, ztile, qtile, maps]) 124 | else: 125 | newhits.append([hitname, ecodnum, ecodid, ecodfam, zscore, -1, -1, -1, maps]) 126 | 127 | 128 | newhits.sort(key = lambda x:x[4], reverse = True) 129 | finalhits = [] 130 | posi2fams = {} 131 | for hit in newhits: 132 | ecodfam = hit[3] 133 | maps = hit[8] 134 | qposis = [] 135 | eposis = [] 136 | ranks = [] 137 | for item in maps: 138 | qposis.append(item[0]) 139 | eposis.append(item[1]) 140 | try: 141 | posi2fams[item[0]].add(ecodfam) 142 | except KeyError: 143 | posi2fams[item[0]] = set([ecodfam]) 144 | ranks.append(len(posi2fams[item[0]])) 145 | ave_rank = round(np.mean(ranks), 2) 146 | qrange = get_range(qposis) 147 | erange = get_range(eposis) 148 | finalhits.append([hit[0], hit[1], hit[2], hit[3], round(hit[4], 2), round(hit[5], 2), round(hit[6], 2), round(hit[7], 2), ave_rank, qrange, erange]) 149 | 150 | rp = open(f'{prefix}_good_hits', 'w') 151 | rp.write('hitname\tecodnum\tecodkey\thgroup\tzscore\tqscore\tztile\tqtile\trank\tqrange\terange\n') 152 | for hit in finalhits: 153 | rp.write(f'{hit[0]}\t{hit[1]}\t{hit[2]}\t{hit[3]}\t{hit[4]}\t{hit[5]}\t{hit[6]}\t{hit[7]}\t{hit[8]}\t{hit[9]}\t{hit[10]}\n') 154 | rp.close() 155 | -------------------------------------------------------------------------------- /v1.0/step9_get_support.py: -------------------------------------------------------------------------------- 1 | import os, sys 2 | 3 | def get_range(resids): 4 | resids.sort() 5 | segs = [] 6 | for resid in resids: 7 | if not segs: 8 | segs.append([resid]) 9 | else: 10 | if resid > segs[-1][-1] + 1: 11 | segs.append([resid]) 12 | else: 13 | segs[-1].append(resid) 14 | ranges = [] 15 | for seg in segs: 16 | ranges.append(f'{seg[0]}-{seg[-1]}') 17 | return ','.join(ranges) 18 | 19 | prefix = sys.argv[1] 20 | wd = sys.argv[2] 21 | data_dir = sys.argv[3] 22 | if os.getcwd() != wd: 23 | os.chdir(wd) 24 | 25 | fp = open(f'{data_dir}/ECOD_length','r') 26 | ecod2len = {} 27 | for line in fp: 28 | words = line.split() 29 | ecod2len[words[0]] = int(words[2]) 30 | fp.close() 31 | 32 | fp = open(f'{data_dir}/ecod.latest.domains','r') 33 | ecod2id = {} 34 | ecod2fam = {} 35 | for line in fp: 36 | if line[0] != '#': 37 | words = 
line[:-1].split('\t') 38 | ecodnum = words[0] 39 | ecodid = words[1] 40 | ecodfam = '.'.join(words[3].split('.')[:2]) 41 | ecod2id[ecodnum] = ecodid 42 | ecod2fam[ecodnum] = ecodfam 43 | fp.close() 44 | 45 | 46 | seqhits = [] 47 | fp = open(f'{prefix}.map2ecod.result', 'r') 48 | for countl, line in enumerate(fp): 49 | if countl: 50 | words = line.split() 51 | ecodnum = words[0] 52 | ecodlen = ecod2len[ecodnum] 53 | ecodfam = ecod2fam[ecodnum] 54 | prob = float(words[2]) 55 | Qsegs = words[11].split(',') 56 | Tsegs = words[12].split(',') 57 | Qresids = [] 58 | for seg in Qsegs: 59 | start = int(seg.split('-')[0]) 60 | end = int(seg.split('-')[1]) 61 | for res in range(start, end + 1): 62 | Qresids.append(res) 63 | Tresids = [] 64 | for seg in Tsegs: 65 | start = int(seg.split('-')[0]) 66 | end = int(seg.split('-')[1]) 67 | for res in range(start, end + 1): 68 | Tresids.append(res) 69 | seqhits.append([ecodnum, ecodlen, ecodfam, prob, Qresids, Tresids]) 70 | fp.close() 71 | 72 | 73 | fam2hits = {} 74 | fams = set([]) 75 | for hit in seqhits: 76 | fam = hit[2] 77 | fams.add(fam) 78 | try: 79 | fam2hits[fam].append([hit[3], hit[1], hit[4], hit[5]]) 80 | except KeyError: 81 | fam2hits[fam] = [[hit[3], hit[1], hit[4], hit[5]]] 82 | 83 | ecods = [] 84 | ecod2hits = {} 85 | for hit in seqhits: 86 | ecodnum = hit[0] 87 | ecodlen = hit[1] 88 | ecodfam = hit[2] 89 | prob = hit[3] 90 | Qresids = hit[4] 91 | Tresids = hit[5] 92 | Qset = set(Qresids) 93 | try: 94 | ecod2hits[ecodnum].append([prob, ecodfam, ecodlen, Qresids, Tresids, Qset]) 95 | except KeyError: 96 | ecod2hits[ecodnum] = [[prob, ecodfam, ecodlen, Qresids, Tresids, Qset]] 97 | ecods.append(ecodnum) 98 | 99 | 100 | rp = open(f'{prefix}_sequence.result', 'w') 101 | for ecodnum in ecods: 102 | ecodid = ecod2id[ecodnum] 103 | ecod2hits[ecodnum].sort(key = lambda x:x[0], reverse = True) 104 | get_resids = set([]) 105 | mycount = 0 106 | for hit in ecod2hits[ecodnum]: 107 | hit_prob = hit[0] 108 | hit_fam = hit[1] 109 | hit_ecodlen = hit[2] 110 | query_resids = hit[3] 111 | query_range = get_range(query_resids) 112 | hit_resids = hit[4] 113 | hit_range = get_range(hit_resids) 114 | hit_resids_set = hit[5] 115 | hit_coverage = round(len(hit_resids_set) / hit_ecodlen, 2) 116 | if hit_coverage >= 0.4 and hit_prob >= 50: 117 | new_resids = hit_resids_set.difference(get_resids) 118 | if len(new_resids) >= len(hit_resids_set) * 0.5: 119 | mycount += 1 120 | get_resids = get_resids.union(hit_resids_set) 121 | rp.write(f'{ecodnum}_{str(mycount)}\t{ecodid}\t{hit_fam}\t{hit_prob}\t{hit_coverage}\t{hit_ecodlen}\t{query_range}\t{hit_range}\n') 122 | rp.close() 123 | 124 | 125 | if os.path.exists(f'{prefix}_good_hits'): 126 | fp = open(f'{prefix}_good_hits', 'r') 127 | rp = open(f'{prefix}_structure.result', 'w') 128 | for countl, line in enumerate(fp): 129 | if countl: 130 | words = line.split() 131 | hitname = words[0] 132 | ecodnum = words[1] 133 | ecodid = words[2] 134 | ecodfam = words[3] 135 | zscore = words[4] 136 | qscore = words[5] 137 | ztile = words[6] 138 | qtile = words[7] 139 | rank = words[8] 140 | qsegments = words[9] 141 | ssegments = words[10] 142 | segs = [] 143 | for seg in qsegments.split(','): 144 | start = int(seg.split('-')[0]) 145 | end = int(seg.split('-')[1]) 146 | for res in range(start, end + 1): 147 | if not segs: 148 | segs.append([res]) 149 | else: 150 | if res > segs[-1][-1] + 10: 151 | segs.append([res]) 152 | else: 153 | segs[-1].append(res) 154 | resids = set([]) 155 | for seg in segs: 156 | start = seg[0] 157 | end = 
seg[-1] 158 | for res in range(start, end + 1): 159 | resids.add(res) 160 | 161 | good_hits = [] 162 | try: 163 | for hit in fam2hits[ecodfam]: 164 | prob = float(hit[0]) 165 | Tlen = hit[1] 166 | Qresids = hit[2] 167 | Tresids = hit[3] 168 | get_Tresids = set([]) 169 | for i in range(len(Qresids)): 170 | if Qresids[i] in resids: 171 | get_Tresids.add(Tresids[i]) 172 | Tcov = len(get_Tresids) / Tlen 173 | good_hits.append([prob, Tcov]) 174 | except KeyError: 175 | pass 176 | 177 | bestprob = 0 178 | bestcov = 0 179 | if good_hits: 180 | for item in good_hits: 181 | if item[0] > bestprob: 182 | bestprob = item[0] 183 | bestcovs = [] 184 | for item in good_hits: 185 | if item[0] >= bestprob - 0.1: 186 | bestcovs.append(item[1]) 187 | bestcov = round(max(bestcovs), 2) 188 | rp.write(f'{hitname}\t{ecodid}\t{ecodfam}\t{zscore}\t{qscore}\t{ztile}\t{qtile}\t{rank}\t{bestprob}\t{bestcov}\t{qsegments}\t{ssegments}\n') 189 | fp.close() 190 | rp.close() 191 | --------------------------------------------------------------------------------
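Editor's note: the residue-to-range helper is duplicated, with minor variations, in step5_map_to_ecod.py, step8_analyze_dali.py and step9_get_support.py above. The sketch below restates the shared idiom in a self-contained form with a usage example; the function body mirrors the scripts, but the example values are illustrative only.

    # Collapse a collection of residue numbers into "start-end" segments,
    # starting a new segment wherever the numbering jumps by more than one.
    # (The step5 variant can additionally prefix each segment with a chain id.)
    def get_range(resids):
        resids = sorted(resids)
        segs = []
        for resid in resids:
            if segs and resid <= segs[-1][-1] + 1:
                segs[-1].append(resid)   # contiguous with the current segment
            else:
                segs.append([resid])     # gap found: open a new segment
        return ','.join(f'{seg[0]}-{seg[-1]}' for seg in segs)

    # Example: residues {3, 4, 5, 9, 10} -> "3-5,9-10"
    assert get_range([10, 3, 4, 9, 5]) == '3-5,9-10'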
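Editor's note: each v1.0 step script reads its inputs from sys.argv in the order visible above (prefix, optionally a CPU/thread count, the working directory, and usually the database directory) and chdir()s into the working directory on its own. The sketch below is a hypothetical driver chaining steps 2-9 for a single model to illustrate that calling convention; it is not the repository's own entry point, and it assumes the step scripts are invoked from the v1.0 directory with <prefix>.fa and <prefix>.pdb already present in the working directory.

    # Hypothetical driver (illustration only, not part of DPAM): run the v1.0
    # step scripts for one model using the argv order shown in each script.
    import subprocess
    import sys

    prefix, cpus, wd, data_dir = sys.argv[1:5]

    steps = [
        ['step2_run_hhsearch.py',             prefix, cpus, wd, data_dir],
        ['step3_run_foldseek.py',             prefix, cpus, wd, data_dir],
        ['step4_filter_foldseek.py',          prefix, wd],
        ['step5_map_to_ecod.py',              prefix, wd, data_dir],
        ['step6_get_dali_candidates.py',      prefix, wd],
        ['step7_iterative_dali_aug_multi.py', prefix, cpus, wd, data_dir],
        ['step8_analyze_dali.py',             prefix, wd, data_dir],
        ['step9_get_support.py',              prefix, wd, data_dir],
    ]
    for step in steps:
        # every step script changes into wd itself, so wd is only passed through
        subprocess.run([sys.executable] + step, check=True)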