├── .github
│   └── ISSUE_TEMPLATE
│       └── question.md
├── LICENSE
├── README.md
├── README_ja.md
├── beta_install_colabbatch_linux.sh
├── beta_update_linux.sh
├── install_colabbatch_M1mac.sh
├── install_colabbatch_intelmac.sh
├── install_colabbatch_linux.sh
├── update_M1mac.sh
├── update_intelmac.sh
├── update_linux.sh
└── v1.0.0
    ├── README.md
    ├── README_ja.md
    ├── colabfold_alphafold.patch
    ├── gpurelaxation.patch
    ├── install_colabfold_M1mac.sh
    ├── install_colabfold_intelmac.sh
    ├── install_colabfold_linux.sh
    ├── residue_constants.patch
    ├── runner.py
    ├── runner_af2advanced.py
    └── runner_af2advanced_old.py
/.github/ISSUE_TEMPLATE/question.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Question 3 | about: Question template 4 | title: 'Question:' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Caution: Please only report issues related to the installation on your local PC or macOS.** If you can get the help message by `colabfold_batch --help` or run a test prediction successfully, your installation is successful. Requests or questions regarding ColabFold features should be directed to [ColabFold repo's issues](https://github.com/sokrypton/ColabFold/issues). 11 | 12 | ---- 13 | 14 | **What is your installation issue?** 15 | 16 | Describe your question here. 17 | 18 | **Computational environment** 19 | 20 | - OS: [e.g. Ubuntu 22.04, Windows10 & WSL2, macOS...] 21 | - CUDA version if Linux (Show the output of `/usr/local/cuda/bin/nvcc --version`.) 22 | 23 | **To Reproduce** 24 | 25 | Steps to reproduce the behavior: 26 | 1. Go to '...' 27 | 2. Click on '....' 28 | 3. Scroll down to '....' 29 | 4. See error 30 | 31 | **Expected behavior** 32 | 33 | A clear and concise description of what you expected to happen. 34 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Yoshitaka Moriwaki 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # LocalColabFold 2 | 3 | [ColabFold](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) on your local PC (or macOS). See also [ColabFold repository](https://github.com/sokrypton/ColabFold). <br>
 4 | 5 | ## What is LocalColabFold? 6 | 7 | LocalColabFold is an installer script designed to make ColabFold functionality available on users' local machines. It supports a wide range of operating systems, such as Windows 10 or later (using Windows Subsystem for Linux 2), macOS, and Linux. 8 | 9 | **If you only intend to predict a small number of naturally occurring proteins, I recommend using the [ColabFold notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) or downloading structures from the [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/) or [UniProt](https://www.uniprot.org/). LocalColabFold is suitable for more advanced applications, such as batch processing of structure predictions for natural complexes, non-natural proteins, or predictions with manually specified MSAs/templates.** 10 | 11 | ## Advantages of LocalColabFold 12 | 13 | - **Structure inference and relaxation will be accelerated if your PC has an Nvidia GPU and CUDA drivers.** 14 | - **No time-outs (no 90-minute idle or 12-hour session limits as on Google Colab)** 15 | - **No GPU limitations** 16 | - **NOT necessary to prepare the large database required for native AlphaFold2**. 17 | 18 | ## Note (May 21, 2024) 19 | 20 | - Since the current GPU-enabled jax (> 0.4.26) requires CUDA 12.1 or later and cudnn 9, please upgrade or install your CUDA driver and cudnn accordingly. CUDA 12.4 is recommended. 21 | 22 | ## Note (Jan 30, 2024) 23 | 24 | - ColabFold has been upgraded to 1.5.5 (compatible with AlphaFold 2.3.2). LocalColabFold now requires **CUDA 12.1 or later**. Please update your CUDA driver if you have not done so. 25 | - (Local)ColabFold can now predict protein structures without connecting to the Internet. Use the [`setup_databases.sh`](https://github.com/sokrypton/ColabFold/blob/main/setup_databases.sh) script to download and build the databases (see also [ColabFold Downloads](https://colabfold.mmseqs.com/)). Instructions for running `colabfold_search` to obtain the MSA and templates locally are given in [this comment](https://github.com/sokrypton/ColabFold/issues/563). 26 | 27 | ## New Updates 28 | 29 | - 30Jan2024, ColabFold 1.5.5 (compatible with AlphaFold 2.3.2). LocalColabFold now requires **CUDA 12.1 or later**. Please update your CUDA driver. 30 | - 30Apr2023, Updated to use python 3.10 for compatibility with Google Colaboratory. 31 | - 09Mar2023, version 1.5.1 released. The base directory has been changed to `localcolabfold` from `colabfold_batch` to distinguish it from the execution command. 32 | - 09Mar2023, version 1.5.0 released. See [Release v1.5.0](https://github.com/YoshitakaMo/localcolabfold/releases/tag/v1.5.0) 33 | - 05Feb2023, version 1.5.0-pre released. 34 | - 16Jun2022, version 1.4.0 released. See [Release v1.4.0](https://github.com/YoshitakaMo/localcolabfold/releases/tag/v1.4.0) 35 | - 07May2022, **Updated `update_linux.sh`.** See also [How to update](#how-to-update). Please use the new option `--use-gpu-relax` if GPU relaxation is required (recommended). 36 | - 12Apr2022, version 1.3.0 released. See [Release v1.3.0](https://github.com/YoshitakaMo/localcolabfold/releases/tag/v1.3.0) 37 | - 09Dec2021, version 1.2.0-beta released. Easy-to-use updater scripts added. See [How to update](#how-to-update). 38 | - 04Dec2021, LocalColabFold is now compatible with the latest [pip installable ColabFold](https://github.com/sokrypton/ColabFold#running-locally). This repository provides a script to install ColabFold along with some external parameter files needed to perform relaxation with AMBER. <br>
The weight parameters of AlphaFold and AlphaFold-Multimer will be downloaded automatically at your first run. 39 | 40 | ## Installation 41 | 42 | ### For Linux 43 | 44 | 1. Make sure the `curl`, `git`, and `wget` commands are already installed on your PC. If not, you need to install them first. For Ubuntu, type `sudo apt -y install curl git wget`. 45 | 2. Make sure your Cuda compiler driver is **12.1 or later** (the latest version 12.4 is preferable), as required since ColabFold 1.5.5. If you don't have a GPU or don't plan to use one, you can skip this step:<br>
$ nvcc --version
 46 | nvcc: NVIDIA (R) Cuda compiler driver
 47 | Copyright (c) 2005-2022 NVIDIA Corporation
 48 | Built on Wed_Sep_21_10:33:58_PDT_2022
 49 | Cuda compilation tools, release 11.8, V11.8.89
 50 | Build cuda_11.8.r11.8/compiler.31833905_0
 51 | 
DO NOT use `nvidia-smi` to check the version.
See [NVIDIA CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html) if you haven't installed it. 52 | 3. Make sure your GNU compiler version is **12.0 or later** (check with `gcc --version`), because `GLIBCXX_3.4.30` is required by openmm 8.0.0 for `--amber` relaxation. 53 | If your version is old (e.g. on CentOS 7, Rocky/AlmaLinux 8, etc.), install a newer GCC and add it to your `PATH`. 54 | 4. Download `install_colabbatch_linux.sh` from this repository:<br>
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_linux.sh
and run it in the directory where you want to install:
$ bash install_colabbatch_linux.sh
About 5 minutes later, the `localcolabfold` directory will be created. Do not move this directory after the installation. 55 | 56 | Keep the network connection open, and **check the log** output to see if there are any errors. 57 | 58 | If you find errors in the log, the easiest fix is to check your network connection, delete the `localcolabfold` directory, and re-run the installation script. 59 | 60 | 5. Add the environment variable `PATH`:<br>
# For bash or zsh
# e.g. export PATH="/home/moriwaki/Desktop/localcolabfold/colabfold-conda/bin:$PATH"<br>
export PATH="/path/to/your/localcolabfold/colabfold-conda/bin:\$PATH"
 61 | It is recommended to add this export command to `~/.bashrc` (as sketched above) and restart bash (`~/.bashrc` is executed every time bash is started). 62 | 63 | 6. To run the prediction, type<br>
colabfold_batch input outputdir/
The result files will be created in `outputdir/`. This command runs the prediction without templates or relaxation (energy minimization). If you want to use templates and relaxation, add the `--templates` and `--amber` flags, respectively. For example, 64 | 65 |<br>
colabfold_batch --templates --amber input outputdir/
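# a few optional variations (sketches only; the flags are documented under "Flags" below, input/outputdir are placeholders):<br>
# colabfold_batch --model-type alphafold2_multimer_v3 input outputdir/   # force a multimer model<br>
# colabfold_batch --templates --amber --use-gpu-relax input outputdir/   # relax on an Nvidia GPU<br>
# colabfold_batch --num-recycle 10 input outputdir/                      # increase recycles<br>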
 66 | 67 | `colabfold_batch` will automatically detect whether the input is a monomer or a complex. In most cases, users don't have to add `--model-type alphafold2_multimer_v3` to turn on multimer prediction. `alphafold2_multimer_v1` and `alphafold2_multimer_v2` are also available. The default is `auto` (`alphafold2_ptm` for monomers and `alphafold2_multimer_v3` for complexes). 68 | 69 | If you encounter errors during `--amber` relaxation, adding `export LD_LIBRARY_PATH="/path/to/your/localcolabfold/colabfold-conda/lib:${LD_LIBRARY_PATH}"` before running `colabfold_batch` may solve the issue. 70 | For more details, see [Flags](#flags) and `colabfold_batch --help`. 71 | 72 | ### For WSL2 (in Windows) 73 | 74 | **Caution: If your installation fails due to symbolic link (`symlink`) creation issues, this is because the Windows file system is case-insensitive (while the Linux file system is case-sensitive).** To resolve this, run the following command in Windows PowerShell: 75 | ``` 76 | fsutil file SetCaseSensitiveInfo path\to\localcolabfold\installation enable 77 | ``` 78 | 79 | Replace `path\to\localcolabfold\installation` with the path to the directory where you are installing LocalColabFold. Also, make sure that you are running the command in Windows PowerShell (not WSL). For more details, see [Adjust Case Sensitivity (Microsoft)](https://learn.microsoft.com/en-us/windows/wsl/case-sensitivity). 80 | 81 | Before running the prediction: 82 | 83 | ``` 84 | export TF_FORCE_UNIFIED_MEMORY="1" 85 | export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0" 86 | export XLA_PYTHON_CLIENT_ALLOCATOR="platform" 87 | export TF_FORCE_GPU_ALLOW_GROWTH="true" 88 | ``` 89 | 90 | It is recommended to add these export commands to `~/.bashrc` and restart bash (`~/.bashrc` is executed every time bash is started). 91 | 92 | ### For macOS 93 | 94 | **Caution: Due to the lack of an Nvidia GPU/CUDA driver, structure prediction on macOS is 5-10 times slower than on Linux+GPU**. For the test sequence (58 a.a.), it may take 30 minutes. However, it may be useful to play with it before preparing a Linux+GPU environment. 95 | 96 | You can check whether your Mac is Intel or Apple Silicon by typing `uname -m` in Terminal. 97 | 98 | ```bash 99 | $ uname -m 100 | x86_64 # Intel 101 | arm64 # Apple Silicon 102 | ``` 103 | 104 | Please use the correct installer for your Mac. 105 | 106 | #### For Mac with Intel CPU 107 | 108 | 1. Install [Homebrew](https://brew.sh/index_ja) if not present:<br>
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
109 | 2. Install `wget`, `gnu-sed`, [HH-suite](https://github.com/soedinglab/hh-suite) and [kalign](https://github.com/TimoLassmann/kalign) using Homebrew:
$ brew install wget gnu-sed
$ brew install brewsci/bio/hh-suite brewsci/bio/kalign<br>
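# note: install_colabbatch_intelmac.sh also checks for mmseqs, so install it as well if it is missing:<br>
# brew install mmseqs2<br>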
110 | 3. Download `install_colabbatch_intelmac.sh` from this repository:
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_intelmac.sh
and run it in the directory where you want to install:
$ bash install_colabbatch_intelmac.sh
About 5 minutes later, the `localcolabfold` directory will be created. Do not move this directory after the installation. 111 | 4. The rest of the procedure is the same as "For Linux". 112 | 113 | #### For Mac with Apple Silicon (M1 chip) 114 | 115 | **Note: This installer is experimental because most of the dependent packages are not fully tested on Apple Silicon Macs.** 116 | 117 | 1. Install [Homebrew](https://brew.sh/index_ja) if not present:<br>
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
118 | 2. Install several commands using Homebrew (Now kalign 3.3.2 is available!):
$ brew install wget cmake gnu-sed
$ brew install brewsci/bio/hh-suite
$ brew install brewsci/bio/kalign
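# note: install_colabbatch_M1mac.sh also checks for mmseqs, so install it as well if it is missing:<br>
# brew install mmseqs2<br>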
119 | 3. Install `miniforge` command using Homebrew:
$ brew install --cask miniforge
120 | 4. Download `install_colabbatch_M1mac.sh` from this repository:
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_M1mac.sh
and run it in the directory where you want to install:
$ bash install_colabbatch_M1mac.sh
About 5 minutes later, the `localcolabfold` directory will be created. Do not move this directory after the installation. **You can ignore the installation errors that appear along the way**. 121 | 5. The rest of the procedure is the same as "For Linux". 122 | 123 | ### Input Examples 124 | 125 | ColabFold can accept multiple file formats or a directory as input. 126 | 127 | ``` 128 | positional arguments: 129 | input Can be one of the following: Directory with fasta/a3m 130 | files, a csv/tsv file, a fasta file or an a3m file 131 | results Directory to write the results to 132 | ``` 133 | 134 | #### fasta format 135 | 136 | It is recommended that the header line starting with `>` be short, since the description will be the prefix of the output files. It is acceptable to insert line breaks in the amino acid sequence. 137 | 138 | ```:P61823.fasta 139 | >sp|P61823 140 | MALKSLVLLSLLVLVLLLVRVQPSLGKETAAAKFERQHMDSSTSAASSSNYCNQMMKSRN 141 | LTKDRCKPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPN 142 | CAYKTTQANKHIIVACEGNPYVPVHFDASV 143 | ``` 144 | 145 | **For prediction of multimers, insert `:` between the protein sequences.** 146 | 147 | ``` 148 | >1BJP_homohexamer 149 | PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR: 150 | PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR: 151 | PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR: 152 | PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR: 153 | PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR: 154 | PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR 155 | ``` 156 | 157 | ``` 158 | >3KUD_RasRaf_complex 159 | MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQ 160 | YMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIP 161 | YIETSAKTRQGVEDAFYTLVREIRQH: 162 | PSKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARLDWNTDAAS 163 | LIGEELQVDFL 164 | ``` 165 | 166 | Multiple `>` header lines with sequences in a FASTA format file yield multiple predictions at once in the specified output directory. 167 | 168 | #### csv format 169 | 170 | In the csv format, `id` and `sequence` should be separated by `,`. 171 | 172 | ```:test.csv 173 | id,sequence 174 | 5AWL_1,YYDPETGTWY 175 | 3G5O_A_3G5O_B,MRILPISTIKGKLNEFVDAVSSTQDQITITKNGAPAAVLVGADEWESLQETLYWLAQPGIRESIAEADADIASGRTYGEDEIRAEFGVPRRPH:MPYTVRFTTTARRDLHKLPPRILAAVVEFAFGDLSREPLRVGKPLRRELAGTFSARRGTYRLLYRIDDEHTTVVILRVDHRADIYRR 176 | ``` 177 | 178 | #### a3m format 179 | 180 | You can input your own MSA file in a3m format. For multimer predictions, the a3m file should be in ColabFold-compatible format. 181 | 182 | ### Flags 183 | 184 | These flags are useful for predictions. 185 | 186 | - **`--amber`** : Use amber for structure refinement (relaxation / energy minimization). To control how many of the top-ranked structures are relaxed, set `--num-relax`. 187 | - **`--templates`** : Use templates from the PDB. 188 | - **`--use-gpu-relax`** : Run amber on an Nvidia GPU instead of the CPU. This feature is only available on machines with Nvidia GPUs. 189 | - **`--num-recycle <int>`** : Number of prediction recycles. Increasing recycles can improve the quality but slows down the prediction. Default is `3`. (e.g. `--num-recycle 10`) 190 | - `--custom-template-path <directory>` : Restrict the template files used for `--templates` to those contained in the specified directory. This flag enables the use of non-public PDB files for the prediction. See also https://github.com/sokrypton/ColabFold/issues/177 . <br>
 191 | - `--random-seed <int>` : **Changing the seed for the random number generator can result in different structure predictions.** (e.g. `--random-seed 42`) 192 | - `--num-seeds <int>` : Number of seeds to try. Will iterate from range(random_seed, random_seed+num_seeds). (e.g. `--num-seeds 5`) 193 | - `--max-msa` : Defines the maximum numbers of sequences to use as `max-seq:max-extra-seq` (e.g. `--max-msa 512:1024`). The `--max-seq` and `--max-extra-seq` arguments are also available if you want to specify them separately. This is a reimplementation of the approach demonstrated by del Alamo *et al*. in [Sampling alternative conformational states of transporters and receptors with AlphaFold2](https://elifesciences.org/articles/75751). 194 | - `--use-dropout` : Activate dropout during inference to sample from the uncertainty of the models. 195 | - `--overwrite-existing-results` : Overwrite the result files. 196 | - For more information, run `colabfold_batch --help`. 197 | 198 | ## How to update 199 | 200 | Since [ColabFold](https://github.com/sokrypton/ColabFold) is still a work in progress, your localcolabfold should also be updated frequently to use the latest features. An easy-to-use update script is provided for this purpose. 201 | 202 | To update your localcolabfold, simply execute the following: 203 | 204 | ```bash 205 | # set your OS. Select one of the following variables {linux,intelmac,M1mac} 206 | $ OS=linux # if Linux 207 | # navigate to the directory where you installed localcolabfold, e.g. 208 | $ cd /home/moriwaki/Desktop/localcolabfold/ 209 | # get the latest updater 210 | $ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/update_${OS}.sh -O update_${OS}.sh 211 | $ chmod +x update_${OS}.sh 212 | # execute it. 213 | $ ./update_${OS}.sh . 214 | ``` 215 | 216 | ## FAQ 217 | - What else do I need to do before installation? Do I need sudo privileges? 218 | - No, except for the installation of the `curl` and `wget` commands. 219 | - Do I need to prepare large databases such as PDB70, BFD, Uniclust30, MGnify? 220 | - **No, it is not necessary.** Generation of MSAs is performed by the MMseqs2 web server, just as implemented in ColabFold. 221 | - Are the pLDDT score and PAE figures available? 222 | - Yes, they will be generated just like in ColabFold. 223 | - Is it possible to predict homooligomers and complexes? 224 | - Yes, the input sequence format is the same as in ColabFold. See `query_sequence:` and its use in [ColabFold: AlphaFold2 using MMseqs2](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb). 225 | - Is it possible to create MSAs with jackhmmer? 226 | - **No, it is not currently supported**. 227 | - I want to use multiple GPUs to perform the prediction. 228 | - **AlphaFold and ColabFold do not support multiple GPUs**. Only one GPU can be used to model your protein. 229 | - I have multiple GPUs. Can I choose which GPU LocalColabFold runs on? 230 | - Use the `CUDA_VISIBLE_DEVICES` environment variable. See https://github.com/YoshitakaMo/localcolabfold/issues/200. 231 | - I got an error message `CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered`. 232 | - You may not have updated to CUDA 12.1 or later. Please check the version of your Cuda compiler with the `nvcc --version` command, not `nvidia-smi`. 233 | - Is this available on Windows 10? 234 | - You can run LocalColabFold on Windows 10 with [WSL2](https://docs.microsoft.com/en-us/windows/wsl/install-win10). 235 | - (New!) I want to use a custom MSA file in the format of a3m. <br>
 236 | - **ColabFold can accept various input files now**. See the help message. You can supply your own A3M file, a fasta file containing multiple sequences (in FASTA format), or a directory containing multiple fasta files. 237 | 238 | 239 | ## Tutorials & Presentations 240 | 241 | - ColabFold Tutorial presented at the Boston Protein Design and Modeling Club. [[video]](https://www.youtube.com/watch?v=Rfw7thgGTwI) [[slides]](https://docs.google.com/presentation/d/1mnffk23ev2QMDzGZ5w1skXEadTe54l8-Uei6ACce8eI). 242 | 243 | ## Acknowledgments 244 | 245 | - The original colabfold was first created by Sergey Ovchinnikov ([@sokrypton](https://twitter.com/sokrypton)), Milot Mirdita ([@milot_mirdita](https://twitter.com/milot_mirdita)) and Martin Steinegger ([@thesteinegger](https://twitter.com/thesteinegger)). 246 | 247 | ## How do I reference this work? 248 | 249 | - Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold - Making protein folding accessible to all.<br>
250 | *Nature Methods* (2022) doi: [10.1038/s41592-022-01488-1](https://www.nature.com/articles/s41592-022-01488-1) 251 | - If you’re using **AlphaFold**, please also cite:
252 | Jumper et al. "Highly accurate protein structure prediction with AlphaFold."
253 | *Nature* (2021) doi: [10.1038/s41586-021-03819-2](https://doi.org/10.1038/s41586-021-03819-2) 254 | - If you’re using **AlphaFold-multimer**, please also cite:
255 | Evans et al. "Protein complex prediction with AlphaFold-Multimer."
 256 | *bioRxiv* (2022) doi: [10.1101/2021.10.04.463034v2](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2) -------------------------------------------------------------------------------- /README_ja.md: -------------------------------------------------------------------------------- 1 | # LocalColabFold 2 | 3 | [ColabFold](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) running on your personal PC (or macOS). 4 | 5 | ## Update Information 6 | 7 | - 30Jan2024, ColabFold 1.5.5 (compatible with AlphaFold 2.3.2). **CUDA 11.8 or later** is now required. 8 | - 05Feb2023, version 1.5.0-pre released. 9 | - 18Jun2022, version 1.4.0 released. [Release v1.4.0](https://github.com/YoshitakaMo/localcolabfold/releases/tag/v1.4.0) 10 | - 09Dec2021, beta version. Easy-to-use update scripts added. See [How to update](#how-to-update). 11 | - 04Dec2021, LocalColabFold now supports the latest [pip-installable ColabFold](https://github.com/sokrypton/ColabFold#running-locally). This repository provides a script that installs ColabFold together with the other parameter files needed to run relax (structure optimization). The weight parameters of AlphaFold and AlphaFold-Multimer are downloaded automatically at the first run. 12 | 13 | ## Installation 14 | 15 | ### For Linux+GPU 16 | 17 | 1. Make sure the `curl`, `git`, and `wget` commands are already installed in your terminal environment. If not, install them first. On Ubuntu, they can be installed with `sudo apt -y install curl git wget`. 18 | 2. **Make sure your Cuda compiler version is 11.8 or later.**<br>
$ nvcc --version
 19 | nvcc: NVIDIA (R) Cuda compiler driver
 20 | Copyright (c) 2005-2022 NVIDIA Corporation
 21 | Built on Wed_Sep_21_10:33:58_PDT_2022
 22 | Cuda compilation tools, release 11.8, V11.8.89
 23 | Build cuda_11.8.r11.8/compiler.31833905_0
 24 | 
DO NOT use the `nvidia-smi` command to check the version; it is inaccurate for this purpose.<br>
If you have not installed the CUDA Compiler yet, see the [NVIDIA CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html). 25 | 3. **Make sure your GNU compiler version is 4.9 or later**, since `GLIBCXX_3.4.20` is required at runtime.<br>
$ gcc --version
 26 | gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
 27 | Copyright (C) 2019 Free Software Foundation, Inc.
 28 | This is free software; see the source for copying conditions.  There is NO
 29 | warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 30 | 
If your version is 4.8.5 or older (as is often the case on CentOS 7), install a newer GCC and add it to your PATH. On a supercomputer, one may be available through the Environment Modules system (`module avail`). 31 | 1. Download `install_colabbatch_linux.sh` from this repository:<br>
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_linux.sh
Place it in the directory where you want to install, then type the following command:<br>
$ bash install_colabbatch_linux.sh
About 5 minutes later, the `colabfold_batch` directory will be created. Do not move this directory after the installation. 32 | 2. Type `cd colabfold_batch` to enter the directory. 33 | 3. Add the environment variable `PATH`:<br>
# For bash or zsh
# e.g. export PATH="/home/moriwaki/Desktop/colabfold_batch/bin:$PATH"<br>
export PATH="/bin:\$PATH"
It is convenient to append this line to your `~/.bashrc` or `~/.zshrc`, as sketched above. 34 | 4. Run ColabFold with the following command:<br>
colabfold_batch --amber --templates --num-recycle 3 inputfile outputdir/ 
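# the flags above are optional; a minimal run looks like this (inputfile/outputdir are placeholders):<br>
# colabfold_batch inputfile outputdir/<br>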
The result files will be created in `outputdir`. For detailed usage, check the `colabfold_batch --help` command. 35 | 36 | ### For macOS 37 | 38 | **Caution: Because macOS lacks an Nvidia GPU and CUDA driver, the structure inference part is 5-10 times slower than in a Linux+GPU environment**. For the test amino acid sequence (58 residues), the calculation takes about 30 minutes. Still, it may be worth playing with before preparing a Linux+GPU environment. 39 | 40 | Also, check in advance whether your Mac has an Intel CPU or an M1 chip (Apple Silicon). The result of `uname -m` in Terminal tells you which one you have. 41 | 42 | ```bash 43 | $ uname -m 44 | x86_64 # Intel 45 | arm64 # Apple Silicon 46 | ``` 47 | 48 | (If you are using Rosetta 2 on Apple Silicon, this shows x86_64 even on Apple Silicon... this case is not supported at the moment.) 49 | 50 | Choose the appropriate installer based on this result. 51 | 52 | #### For Mac with Intel CPU 53 | 54 | 1. Install [Homebrew](https://qiita.com/zaburo/items/29fe23c1ceb6056109fd) if not present:<br>
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
 55 | 2. Install `wget`, `gnu-sed`, [HH-suite](https://github.com/soedinglab/hh-suite), and [kalign](https://github.com/TimoLassmann/kalign) using Homebrew:<br>
$ brew install wget gnu-sed
$ brew install brewsci/bio/hh-suite brewsci/bio/kalign<br>
 56 | 3. Download `install_colabbatch_intelmac.sh` from this repository:<br>
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_intelmac.sh
Place it in the directory where you want to install, then type the following command:<br>
$ bash install_colabbatch_intelmac.sh
About 5 minutes later, the `colabfold_batch` directory will be created. Do not move this directory after the installation. 57 | 4. The rest of the procedure is the same as "For Linux+GPU". 58 | 59 | #### For Mac with Apple Silicon (M1 chip) 60 | 61 | **Note: This installer is experimental because most of the dependent Python packages have not been fully tested on Apple Silicon Macs.** 62 | 63 | 1. Install [Homebrew](https://qiita.com/zaburo/items/29fe23c1ceb6056109fd) if not present:<br>
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
 64 | 1. Install several commands using Homebrew (kalign currently does not seem to be installable on M1 Macs, but this is not a problem):<br>
$ brew install wget cmake gnu-sed
$ brew install brewsci/bio/hh-suite
 65 | 2. Install `miniforge` using Homebrew:<br>
$ brew install --cask miniforge
 66 | 3. Download the installer `install_colabbatch_M1mac.sh` from this repository:<br>
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_M1mac.sh
Place it in the directory where you want to install, then type the following command:<br>
$ bash install_colabbatch_M1mac.sh
About 5 minutes later, the `colabfold_batch` directory will be created. Various warnings and errors may appear along the way, but that is fine. Do not move this directory after the installation. 67 | 4. The rest of the procedure is the same as "For Linux+GPU". 68 | 69 | ## How to update 70 | 71 | Since [ColabFold](https://github.com/sokrypton/ColabFold) is still under development, this localcolabfold also needs to be updated frequently to use the latest features. Easy-to-use update scripts are provided for this purpose. 72 | 73 | To update, simply type the following in the `localcolabfold` directory. 74 | 75 | ```bash 76 | $ ./update_linux.sh . # if Linux 77 | $ ./update_intelmac.sh . # if Intel Mac 78 | $ ./update_M1mac.sh . # if M1 Mac 79 | ``` 80 | 81 | If you installed localcolabfold from version 1.2.0-beta or earlier, first download these update scripts and then run them, for example as follows. 82 | 83 | ```bash 84 | # set your OS. Select one of the following variables {linux,intelmac,M1mac} 85 | $ OS=linux # if Linux 86 | # navigate to the directory where you installed localcolabfold, e.g. 87 | $ cd /home/moriwaki/Desktop/localcolabfold/ 88 | $ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/update_${OS}.sh 89 | $ chmod +x update_${OS}.sh 90 | $ ./update_${OS}.sh /path/to/your/localcolabfold 91 | ``` 92 | 93 | ## Advantages of LocalColabFold 94 | 95 | - **If your PC has an Nvidia GPU and CUDA drivers, structure inference and relaxation by AlphaFold2 will be faster.** 96 | - **Google Colab times out after 90 minutes of idle time or more than 12 hours of use; there are no such limits here, and of course no limits on GPU usage either.** 97 | - **No need to download the databases.** 98 | 99 | ## FAQ 100 | - What do I need to prepare before installation? 101 | - Nothing except the `curl` and `wget` commands. 102 | - Do I need to prepare huge databases such as BFD, Mgnify, PDB70, or Uniclust30? 103 | - **No, it is not necessary.** 104 | - How is the MSA generation required for the first step of AlphaFold2 performed? 105 | - MSA generation is performed by the MMseqs2 web server, just as in ColabFold. 106 | - Are the pLDDT score and PAE figures shown by ColabFold also generated? 107 | - Yes, they are generated. 108 | - Are homooligomer and complex predictions also possible? 109 | - Yes, they are. The sequence input format is the same as in [ColabFold: AlphaFold2 using MMseqs2](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb). 110 | - Is it possible to create MSAs with jackhmmer? 111 | - **No, it is not currently supported**. 112 | - I want to use multiple GPUs for the calculation. 113 | - **AlphaFold and ColabFold do not seem to support structure prediction with multiple GPUs**. Only one GPU can be used for the calculation. 114 | - I want to solve the `ResourceExhausted` error that occurs when trying to predict a long amino acid sequence. 115 | - Please read the same issue as above. 116 | - I get the error message `CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered`. 117 | - You may not have updated to CUDA 11.1 or later. Please check the version of your Cuda compiler with the `nvcc --version` command. 118 | - Can I use this on Windows 10? 119 | - You can run it on Windows 10 as well by installing [WSL2](https://docs.microsoft.com/en-us/windows/wsl/install-win10). 120 | - (New!) I want to run structure prediction using my own A3M file. 121 | - **ColabFold can now accept various inputs besides FASTA files.** Read the help message for detailed usage. You can specify your own A3M-format file, a single fasta file containing multiple amino acid sequences in FASTA format, or even a directory itself as the input. 122 | 123 | ## Tutorials & Presentations 124 | 125 | - ColabFold Tutorial presented at the Boston Protein Design and Modeling Club. [[video]](https://www.youtube.com/watch?v=Rfw7thgGTwI) [[slides]](https://docs.google.com/presentation/d/1mnffk23ev2QMDzGZ5w1skXEadTe54l8-Uei6ACce8eI). 126 | 127 | ## Acknowledgments 128 | 129 | - The original colabfold was first created by Sergey Ovchinnikov ([@sokrypton](https://twitter.com/sokrypton)), Milot Mirdita ([@milot_mirdita](https://twitter.com/milot_mirdita)) and Martin Steinegger ([@thesteinegger](https://twitter.com/thesteinegger)). 130 | 131 | ## How do I reference this work? 132 | 133 | - Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold - Making protein folding accessible to all.<br>
134 | *Nature Methods* (2022) doi: [10.1038/s41592-022-01488-1](https://www.nature.com/articles/s41592-022-01488-1) 135 | - If you’re using **AlphaFold**, please also cite:
136 | Jumper et al. "Highly accurate protein structure prediction with AlphaFold."
137 | *Nature* (2021) doi: [10.1038/s41586-021-03819-2](https://doi.org/10.1038/s41586-021-03819-2) 138 | - If you’re using **AlphaFold-multimer**, please also cite:
139 | Evans et al. "Protein complex prediction with AlphaFold-Multimer."
 140 | *bioRxiv* (2021) doi: [10.1101/2021.10.04.463034v1](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1) 141 | - If you are using **RoseTTAFold**, please also cite:<br>
 142 | Baek et al. "Accurate prediction of protein structures and interactions using a three-track neural network."<br>
143 | *Science* (2021) doi: [10.1126/science.abj8754](https://doi.org/10.1126/science.abj8754) 144 | 145 | [![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.5123296.svg)](https://doi.org/10.5281/zenodo.5123296) -------------------------------------------------------------------------------- /beta_install_colabbatch_linux.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | type wget || { echo "wget command is not installed. Please install it at first using apt or yum." ; exit 1 ; } 4 | type curl || { echo "curl command is not installed. Please install it at first using apt or yum. " ; exit 1 ; } 5 | 6 | CURRENTPATH=`pwd` 7 | COLABFOLDDIR="${CURRENTPATH}/localcolabfold" 8 | 9 | mkdir -p ${COLABFOLDDIR} 10 | cd ${COLABFOLDDIR} 11 | wget -q -P . https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh 12 | bash ./Mambaforge-Linux-x86_64.sh -b -p ${COLABFOLDDIR}/conda 13 | rm Mambaforge-Linux-x86_64.sh 14 | . "${COLABFOLDDIR}/conda/etc/profile.d/conda.sh" 15 | export PATH="${COLABFOLDDIR}/conda/condabin:${PATH}" 16 | conda create -p $COLABFOLDDIR/colabfold-conda python=3.9 -y 17 | conda activate $COLABFOLDDIR/colabfold-conda 18 | conda update -n base conda -y 19 | conda install -c conda-forge python=3.9 cudnn==8.2.1.32 cudatoolkit==11.1.1 openmm==7.5.1 pdbfixer -y 20 | # Download the updater 21 | wget -qnc https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/update_linux.sh --no-check-certificate 22 | chmod +x update_linux.sh 23 | # install alignment tools 24 | conda install -c conda-forge -c bioconda kalign2=2.04 hhsuite=3.3.0 mmseqs2=14.7e284 -y 25 | # install ColabFold and Jaxlib 26 | # colabfold-conda/bin/python3.9 -m pip install "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold" 27 | colabfold-conda/bin/python3.9 -m pip install -q --no-warn-conflicts "colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold@beta" 28 | colabfold-conda/bin/python3.9 -m pip install https://storage.googleapis.com/jax-releases/cuda11/jaxlib-0.3.25+cuda11.cudnn82-cp39-cp39-manylinux2014_x86_64.whl 29 | colabfold-conda/bin/python3.9 -m pip install jax==0.3.25 biopython==1.79 30 | 31 | # Use 'Agg' for non-GUI backend 32 | cd ${COLABFOLDDIR}/colabfold-conda/lib/python3.9/site-packages/colabfold 33 | sed -i -e "s#from matplotlib import pyplot as plt#import matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt#g" plot.py 34 | # modify the default params directory 35 | sed -i -e "s#appdirs.user_cache_dir(__package__ or \"colabfold\")#\"${COLABFOLDDIR}/colabfold\"#g" download.py 36 | 37 | # start downloading weights 38 | cd ${COLABFOLDDIR} 39 | colabfold-conda/bin/python3.9 -m colabfold.download 40 | cd ${CURRENTPATH} 41 | 42 | echo "Download of alphafold2 weights finished." 43 | echo "-----------------------------------------" 44 | echo "Installation of colabfold_batch finished." 45 | echo "Add ${COLABFOLDDIR}/colabfold-conda/bin to your environment variable PATH to run 'colabfold_batch'." 46 | echo "i.e. For Bash, export PATH=\"${COLABFOLDDIR}/colabfold-conda/bin:\$PATH\"" 47 | echo "For more details, please type 'colabfold_batch --help'." -------------------------------------------------------------------------------- /beta_update_linux.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | COLABFOLDDIR=$1 4 | 5 | if [ ! -d $COLABFOLDDIR/colabfold-conda ]; then 6 | echo "Error! 
colabfold-conda directory is not present in $COLABFOLDDIR." 7 | exit 1 8 | fi 9 | 10 | pushd $COLABFOLDDIR || { echo "${COLABFOLDDIR} is not present." ; exit 1 ; } 11 | 12 | # get absolute path of COLABFOLDDIR 13 | COLABFOLDDIR=$(cd $(dirname colabfold_batch); pwd) 14 | # activate conda in $COLABFOLDDIR/conda 15 | . ${COLABFOLDDIR}/conda/etc/profile.d/conda.sh 16 | export PATH="${COLABFOLDDIR}/conda/condabin:${PATH}" 17 | conda activate $COLABFOLDDIR/colabfold-conda 18 | # reinstall colabfold and alphafold-colabfold (the colabfold-conda env provides python3.9) 19 | python3.9 -m pip uninstall -q "colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold@beta" -y 20 | python3.9 -m pip uninstall alphafold-colabfold -y 21 | python3.9 -m pip install --no-warn-conflicts "colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold@beta" 22 | 23 | # use 'agg' for non-GUI backend 24 | pushd ${COLABFOLDDIR}/colabfold-conda/lib/python3.9/site-packages/colabfold 25 | sed -i -e "s#from matplotlib import pyplot as plt#import matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt#g" plot.py 26 | sed -i -e "s#appdirs.user_cache_dir(__package__ or \"colabfold\")#\"${COLABFOLDDIR}/colabfold\"#g" download.py 27 | popd 28 | popd -------------------------------------------------------------------------------- /install_colabbatch_M1mac.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -e 2 | 3 | # check commands 4 | type wget 2>/dev/null || { echo -e "Please install wget using Homebrew:\n\tbrew install wget" ; exit 1 ; } 5 | type hhsearch 2>/dev/null || { echo -e "Please install hh-suite using Homebrew:\n\tbrew install brewsci/bio/hh-suite" ; exit 1 ; } 6 | type kalign 2>/dev/null || { echo -e "Please install kalign using Homebrew:\n\tbrew install kalign" ; exit 1 ; } 7 | type mmseqs 2>/dev/null || { echo -e "Please install mmseqs2 using Homebrew:\n\tbrew install mmseqs2" ; exit 1 ; } 8 | 9 | # check whether Apple Silicon (M1 mac) or Intel Mac 10 | arch_name="$(uname -m)" 11 | if [ "${arch_name}" = "x86_64" ]; then 12 | if [ "$(sysctl -in sysctl.proc_translated)" = "1" ]; then 13 | echo "Running on Rosetta 2" 14 | else 15 | echo "Running on native Intel" 16 | fi 17 | echo "This installer is only for Apple Silicon. Use install_colabfold_intelmac.sh to install on this Mac." 18 | exit 1 19 | elif [ "${arch_name}" = "arm64" ]; then 20 | echo "Running on Apple Silicon (M1 mac)" 21 | else 22 | echo "Unknown architecture: ${arch_name}" 23 | exit 1 24 | fi 25 | 26 | # Maybe required for Apple Silicon (M1 mac) when installing mambaforge 27 | ulimit -n 99999 28 | 29 | CURRENTPATH="$(pwd)" 30 | COLABFOLDDIR="${CURRENTPATH}/localcolabfold" 31 | 32 | mkdir -p "${COLABFOLDDIR}" 33 | cd "${COLABFOLDDIR}" 34 | wget -q -P . \
https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh 35 | bash ./Miniforge3-MacOSX-arm64.sh -b -p "${COLABFOLDDIR}/conda" 36 | rm Miniforge3-MacOSX-arm64.sh 37 | 38 | source "${COLABFOLDDIR}/conda/etc/profile.d/conda.sh" 39 | export PATH="${COLABFOLDDIR}/conda/condabin:${PATH}" 40 | conda update -n base conda -y 41 | conda create -p "$COLABFOLDDIR/colabfold-conda" -c conda-forge \ 42 | git python=3.10 openmm==8.0.0 pdbfixer==1.9 -y 43 | conda activate "$COLABFOLDDIR/colabfold-conda" 44 | 45 | # install colabfold 46 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install --no-warn-conflicts \ 47 | "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold" 48 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install jax==0.4.23 jaxlib==0.4.23 49 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install "colabfold[alphafold]" 50 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install silence_tensorflow 51 | 52 | # Download the updater 53 | wget -qnc -O "$COLABFOLDDIR/update_M1mac.sh" \ 54 | https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/update_M1mac.sh 55 | chmod +x "$COLABFOLDDIR/update_M1mac.sh" 56 | 57 | # Download weights 58 | "$COLABFOLDDIR/colabfold-conda/bin/python3" -m colabfold.download 59 | echo "Download of alphafold2 weights finished." 60 | echo "-----------------------------------------" 61 | echo "Installation of ColabFold finished." 62 | echo "Add ${COLABFOLDDIR}/colabfold-conda/bin to your environment variable PATH to run 'colabfold_batch'." 63 | echo -e "i.e. for Bash:\n\texport PATH=\"${COLABFOLDDIR}/colabfold-conda/bin:\$PATH\"" 64 | echo "For more details, please run 'colabfold_batch --help'." 65 | -------------------------------------------------------------------------------- /install_colabbatch_intelmac.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -e 2 | 3 | # check commands 4 | type wget 2>/dev/null || { echo "Please install wget using Homebrew:\n\tbrew install wget" ; exit 1 ; } 5 | type hhsearch 2>/dev/null || { echo -e "Please install hh-suite using Homebrew:\n\tbrew install brewsci/bio/hh-suite" ; exit 1 ; } 6 | type kalign 2>/dev/null || { echo -e "Please install kalign using Homebrew:\n\tbrew install kalign" ; exit 1 ; } 7 | type mmseqs 2>/dev/null || { echo -e "Please install mmseqs2 using Homebrew:\n\tbrew install mmseqs2" ; exit 1 ; } 8 | 9 | # check whether Apple Silicon (M1 mac) or Intel Mac 10 | arch_name="$(uname -m)" 11 | if [ "${arch_name}" = "x86_64" ]; then 12 | if [ "$(sysctl -in sysctl.proc_translated)" = "1" ]; then 13 | echo "Running on Rosetta 2" 14 | else 15 | echo "Running on native Intel" 16 | fi 17 | elif [ "${arch_name}" = "arm64" ]; then 18 | echo "Running on Apple Silicon (M1 mac)" 19 | echo "This installer is only for intel Mac. Use install_colabfold_M1mac.sh to install on this Mac." 20 | exit 1 21 | else 22 | echo "Unknown architecture: ${arch_name}" 23 | exit 1 24 | fi 25 | 26 | CURRENTPATH="$(pwd)" 27 | COLABFOLDDIR="${CURRENTPATH}/localcolabfold" 28 | 29 | mkdir -p "${COLABFOLDDIR}" 30 | cd "${COLABFOLDDIR}" 31 | wget -q -P . 
https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-x86_64.sh 32 | bash ./Miniforge3-MacOSX-x86_64.sh -b -p "${COLABFOLDDIR}/conda" 33 | rm Miniforge3-MacOSX-x86_64.sh 34 | 35 | source "${COLABFOLDDIR}/conda/etc/profile.d/conda.sh" 36 | export PATH="${COLABFOLDDIR}/conda/condabin:${PATH}" 37 | conda update -n base conda -y 38 | conda create -p "$COLABFOLDDIR/colabfold-conda" -c conda-forge -c bioconda \ 39 | git python=3.10 openmm==8.0.0 pdbfixer==1.9 \ 40 | kalign2=2.04 hhsuite=3.3.0 mmseqs2=15.6f452 -y 41 | conda activate "$COLABFOLDDIR/colabfold-conda" 42 | 43 | # install colabfold 44 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install --no-warn-conflicts \ 45 | "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold" 46 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install jax==0.4.23 jaxlib==0.4.23 47 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install "colabfold[alphafold]" 48 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install silence_tensorflow 49 | 50 | # Download the updater 51 | wget -qnc -O "$COLABFOLDDIR/update_intelmac.sh" \ 52 | https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/update_intelmac.sh 53 | chmod +x "$COLABFOLDDIR/update_intelmac.sh" 54 | 55 | # Download weights 56 | "$COLABFOLDDIR/colabfold-conda/bin/python3" -m colabfold.download 57 | echo "Download of alphafold2 weights finished." 58 | echo "-----------------------------------------" 59 | echo "Installation of ColabFold finished." 60 | echo "Note: AlphaFold2 weights were downloaded to the ~/Library/Caches/colabfold/params directory." 61 | echo "Add ${COLABFOLDDIR}/colabfold-conda/bin to your PATH environment variable to run 'colabfold_batch'." 62 | echo -e "i.e. for Bash:\n\texport PATH=\"${COLABFOLDDIR}/colabfold-conda/bin:\$PATH\"" 63 | echo "For more details, please run 'colabfold_batch --help'." 64 | -------------------------------------------------------------------------------- /install_colabbatch_linux.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -e 2 | 3 | type wget 2>/dev/null || { echo "wget is not installed. Please install it using apt or yum." ; exit 1 ; } 4 | 5 | CURRENTPATH=`pwd` 6 | COLABFOLDDIR="${CURRENTPATH}/localcolabfold" 7 | 8 | mkdir -p "${COLABFOLDDIR}" 9 | cd "${COLABFOLDDIR}" 10 | wget -q -P . 
https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh 11 | bash ./Miniforge3-Linux-x86_64.sh -b -p "${COLABFOLDDIR}/conda" 12 | rm Miniforge3-Linux-x86_64.sh 13 | 14 | source "${COLABFOLDDIR}/conda/etc/profile.d/conda.sh" 15 | export PATH="${COLABFOLDDIR}/conda/condabin:${PATH}" 16 | conda update -n base conda -y 17 | conda create -p "$COLABFOLDDIR/colabfold-conda" -c conda-forge -c bioconda \ 18 | git python=3.10 openmm==8.2.0 pdbfixer \ 19 | kalign2=2.04 hhsuite=3.3.0 mmseqs2 -y 20 | conda activate "$COLABFOLDDIR/colabfold-conda" 21 | 22 | # install ColabFold and Jaxlib 23 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install --no-warn-conflicts \ 24 | "colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold" 25 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install "colabfold[alphafold]" 26 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade "jax[cuda12]==0.5.3" 27 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade tensorflow 28 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install silence_tensorflow 29 | 30 | # Download the updater 31 | wget -qnc -O "$COLABFOLDDIR/update_linux.sh" \ 32 | https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/update_linux.sh 33 | chmod +x "$COLABFOLDDIR/update_linux.sh" 34 | 35 | pushd "${COLABFOLDDIR}/colabfold-conda/lib/python3.10/site-packages/colabfold" 36 | # Use 'Agg' for non-GUI backend 37 | sed -i -e "s#from matplotlib import pyplot as plt#import matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt#g" plot.py 38 | # modify the default params directory 39 | sed -i -e "s#appdirs.user_cache_dir(__package__ or \"colabfold\")#\"${COLABFOLDDIR}/colabfold\"#g" download.py 40 | # suppress warnings related to tensorflow 41 | sed -i -e "s#from io import StringIO#from io import StringIO\nfrom silence_tensorflow import silence_tensorflow\nsilence_tensorflow()#g" batch.py 42 | # remove cache directory 43 | rm -rf __pycache__ 44 | popd 45 | 46 | # Download weights 47 | "$COLABFOLDDIR/colabfold-conda/bin/python3" -m colabfold.download 48 | echo "Download of alphafold2 weights finished." 49 | echo "-----------------------------------------" 50 | echo "Installation of ColabFold finished." 51 | echo "Add ${COLABFOLDDIR}/colabfold-conda/bin to your PATH environment variable to run 'colabfold_batch'." 52 | echo -e "i.e. for Bash:\n\texport PATH=\"${COLABFOLDDIR}/colabfold-conda/bin:\$PATH\"" 53 | echo "For more details, please run 'colabfold_batch --help'." 54 | -------------------------------------------------------------------------------- /update_M1mac.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -e 2 | 3 | # check whether Apple Silicon (M1 mac) or Intel Mac 4 | arch_name="$(uname -m)" 5 | if [ "${arch_name}" = "x86_64" ]; then 6 | if [ "$(sysctl -in sysctl.proc_translated)" = "1" ]; then 7 | echo "Running on Rosetta 2" 8 | else 9 | echo "Running on native Intel" 10 | fi 11 | echo "This installer is only for Apple Silicon. Use update_intelmac.sh to install on this Mac." 12 | exit 1 13 | elif [ "${arch_name}" = "arm64" ]; then 14 | echo "Running on Apple Silicon (M1 mac)" 15 | else 16 | echo "Unknown architecture: ${arch_name}" 17 | exit 1 18 | fi 19 | 20 | # Maybe required for Apple Silicon (M1 mac) when installing mambaforge 21 | ulimit -n 99999 22 | 23 | COLABFOLDDIR="$1" 24 | if [ ! -d "$COLABFOLDDIR/colabfold-conda" ]; then 25 | echo "Error! colabfold-conda directory is not present in $COLABFOLDDIR." 
26 | exit 1 27 | fi 28 | 29 | # activate conda in $COLABFOLDDIR/conda 30 | source "${COLABFOLDDIR}/conda/etc/profile.d/conda.sh" 31 | conda activate "$COLABFOLDDIR/colabfold-conda" 32 | 33 | # reinstall colabfold 34 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install --no-warn-conflicts --upgrade --force-reinstall \ 35 | "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold" 36 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install jax==0.4.23 jaxlib==0.4.23 37 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install "colabfold[alphafold]" 38 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install silence_tensorflow -------------------------------------------------------------------------------- /update_intelmac.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -e 2 | 3 | # check whether Apple Silicon (M1 mac) or Intel Mac 4 | arch_name="$(uname -m)" 5 | if [ "${arch_name}" = "x86_64" ]; then 6 | if [ "$(sysctl -in sysctl.proc_translated)" = "1" ]; then 7 | echo "Running on Rosetta 2" 8 | else 9 | echo "Running on native Intel" 10 | fi 11 | elif [ "${arch_name}" = "arm64" ]; then 12 | echo "Running on Apple Silicon (M1 mac)" 13 | echo "This installer is only for intel Mac." 14 | exit 1 15 | else 16 | echo "Unknown architecture: ${arch_name}" 17 | exit 1 18 | fi 19 | 20 | COLABFOLDDIR="$1" 21 | if [ ! -d "$COLABFOLDDIR/colabfold-conda" ]; then 22 | echo "Error! colabfold-conda directory is not present in $COLABFOLDDIR." 23 | exit 1 24 | fi 25 | 26 | # activate conda in $COLABFOLDDIR/conda 27 | source "${COLABFOLDDIR}/conda/etc/profile.d/conda.sh" 28 | conda activate "$COLABFOLDDIR/colabfold-conda" 29 | 30 | # reinstall colabfold 31 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install --no-warn-conflicts --upgrade --force-reinstall \ 32 | "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold" 33 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install jax==0.4.23 jaxlib==0.4.23 34 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install "colabfold[alphafold]" 35 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install silence_tensorflow -------------------------------------------------------------------------------- /update_linux.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -e 2 | 3 | # get absolute path of COLABFOLDDIR 4 | COLABFOLDDIR=$(realpath $(dirname $0)) 5 | 6 | if [ ! -d "$COLABFOLDDIR/colabfold-conda" ]; then 7 | echo "Error! colabfold-conda directory is not present in $COLABFOLDDIR." 
 8 | exit 1 9 | fi 10 | 11 | # activate conda in $COLABFOLDDIR/conda 12 | source "${COLABFOLDDIR}/conda/etc/profile.d/conda.sh" 13 | conda activate "$COLABFOLDDIR/colabfold-conda" 14 | 15 | # reinstall colabfold and alphafold-colabfold 16 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install --no-warn-conflicts --upgrade --force-reinstall \ 17 | "colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold" 18 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install "colabfold[alphafold]" 19 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install --force-reinstall "jax[cuda12]==0.5.3" "numpy==2.2.5" 20 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade tensorflow 21 | "$COLABFOLDDIR/colabfold-conda/bin/pip" install silence_tensorflow 22 | 23 | # use 'agg' for non-GUI backend 24 | cd "${COLABFOLDDIR}/colabfold-conda/lib/python3.10/site-packages/colabfold" 25 | sed -i -e "s#from matplotlib import pyplot as plt#import matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt#g" plot.py 26 | # modify the default params directory 27 | sed -i -e "s#appdirs.user_cache_dir(__package__ or \"colabfold\")#\"${COLABFOLDDIR}/colabfold\"#g" download.py 28 | # suppress warnings related to tensorflow 29 | sed -i -e "s#from io import StringIO#from io import StringIO\nfrom silence_tensorflow import silence_tensorflow\nsilence_tensorflow()#g" batch.py 30 | # remove cache directory 31 | rm -rf __pycache__ 32 | -------------------------------------------------------------------------------- /v1.0.0/README.md: -------------------------------------------------------------------------------- 1 | # LocalColabFold 2 | 3 | [ColabFold / AlphaFold2_advanced](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb) on your local PC (or macOS) 4 | 5 | ## Installation 6 | 7 | ### For Linux 8 | 9 | 1. Make sure the `curl`, `git`, and `wget` commands are already installed on your PC. If not, you need to install them first. For Ubuntu, type `sudo apt -y install curl git wget`. 10 | 2. Make sure your Cuda compiler driver is **11.1 or later**:<br>
$ nvcc --version
 11 | nvcc: NVIDIA (R) Cuda compiler driver
 12 | Copyright (c) 2005-2020 NVIDIA Corporation
 13 | Built on Mon_Oct_12_20:09:46_PDT_2020
 14 | Cuda compilation tools, release 11.1, V11.1.105
 15 | Build cuda_11.1.TC455_06.29190527_0
 16 | 
DO NOT use `nvidia-smi` for checking the version.
See [NVIDIA CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html) if you haven't installed it. 17 | 1. Download `install_colabfold_linux.sh` from this repository:
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabfold_linux.sh
and run it in the directory where you want to install:
$ bash install_colabfold_linux.sh
About 5 minutes later, the `colabfold` directory will be created. Do not move this directory after the installation. 18 | 1. Type `cd colabfold` to enter the directory. 19 | 1. Modify variables such as `sequence = 'PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK'`, `jobname = "test"`, etc. in `runner.py` for your prediction. For more information, please refer to the original [ColabFold / AlphaFold2_advanced](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb). 20 | 1. To run the prediction, type<br>
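# before running, edit the job settings near the top of runner.py, e.g. (example values taken from this README):<br>
# sequence = 'PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK'<br>
# jobname = "test"<br>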
$ colabfold-conda/bin/python3.7 runner.py
in the `colabfold` directory. The result files will be created in a `prediction_*` directory inside the `colabfold` directory. After the prediction has finished, you may move the results out of the `colabfold` directory. 21 | 22 | ### For macOS 23 | 24 | **Caution: Due to the lack of an Nvidia GPU/CUDA driver, structure prediction on macOS is 5-10 times slower than on Linux+GPU**. For the test sequence (58 a.a.), it may take 30 minutes. However, it may be useful to play with it before preparing a Linux+GPU environment. 25 | 26 | You can check whether your Mac is Intel or Apple Silicon by typing `uname -m` in Terminal. 27 | 28 | ```bash 29 | $ uname -m 30 | x86_64 # Intel 31 | arm64 # Apple Silicon 32 | ``` 33 | 34 | Please use the correct installer for your Mac. 35 | 36 | #### For Mac with Intel CPU 37 | 38 | 1. Install [Homebrew](https://brew.sh/index_ja) if not present:<br>
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
39 | 1. Install `wget` command using Homebrew:
$ brew install wget
40 | 1. Download `install_colabfold_intelmac.sh` from this repository:
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabfold_intelmac.sh
and run it in the directory where you want to install:
$ bash install_colabfold_intelmac.sh
About 5 minutes later, `colabfold` directory will be created. Do not move this directory after the installation. 41 | 1. The rest procedure is the same as "For Linux". 42 | 43 | #### For Mac with Apple Silicon (M1 chip) 44 | 45 | **Note: This installer is experimental because most of the dependent packages are not fully tested on Apple Silicon Mac.** 46 | 47 | 1. Install [Homebrew](https://brew.sh/index_ja) if not present:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
48 | 1. Install `wget` and `cmake` commands using Homebrew:
$ brew install wget cmake
49 | 1. Install `miniforge` command using Homebrew:
$ brew install --cask miniforge
50 | 1. Download `install_colabfold_M1mac.sh` from this repository:
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabfold_M1mac.sh
and run it in the directory where you want to install:
$ bash install_colabfold_M1mac.sh
About 5 minutes later, the `colabfold` directory will be created. Do not move this directory after the installation. 51 | 1. Type `cd colabfold` to enter the directory. 52 | 1. Modify variables such as `sequence = 'PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK'`, `jobname = "test"`, etc. in `runner.py` for your prediction. For more information, please refer to the original [ColabFold / AlphaFold2_advanced](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb). 53 | 1. To run the prediction, type<br>
$ colabfold-conda/bin/python3.8 runner.py
in the `colabfold` directory. The result files will be created in a `prediction_*` directory inside the `colabfold` directory. After the prediction has finished, you may move the results out of the `colabfold` directory. 54 | 55 | A warning message appears when you run the prediction: 56 | ``` 57 | You are using an experimental build of OpenMM v7.5.1. 58 | This is NOT SUITABLE for production! 59 | It has not been properly tested on this platform and we cannot guarantee it provides accurate results. 60 | ``` 61 | 62 | This message is due to Apple Silicon, but I think we can ignore it. 63 | 64 | ## Usage of `colabfold` shell script (Linux) 65 | 66 | An executable `colabfold` shell script is installed in the `/path/to/colabfold/bin` directory. This is especially helpful for installations on a shared computer and for users who want to predict many sequences. 67 | 68 | 1. Prepare a FASTA file containing the amino acid sequence for which you want to predict the structure (e.g. `6x9z.fasta`).<br>
>6X9Z_1|Chain A|Transmembrane beta-barrels|synthetic construct (32630)
 69 | MEQKPGTLMVYVVVGYNTDNTVDVVGGAQYAVSPYLFLDVGYGWNNSSLNFLEVGGGVSYKVSPDLEPYVKAGFEYNTDNTIKPTAGAGALYRVSPNLALMVEYGWNNSSLQKVAIGIAYKVKD
70 | 2. Type `export PATH="/path/to/colabfold/bin:$PATH"` to add the script directory to your PATH environment variable. For example, use `export PATH="/home/foo/bar/colabfold/bin:$PATH"` if you installed localcolabfold in `/home/foo/bar/colabfold`. 71 | 3. Run the `colabfold` command with your FASTA file. For example,
$ colabfold --input 6x9z.fasta \\
 72 |    --output_dir 6x9z \\
 73 |    --max_recycle 18 \\
 74 |    --use_ptm \\
 75 |    --use_turbo \\
 76 |    --num_relax Top5
This will predict the protein structure [6x9z](https://www.rcsb.org/structure/6x9z) while increasing the number of 'recycling' iterations to 18, which may be effective for *de novo* structure prediction. For another example, [PDB: 3KUD](https://www.rcsb.org/structure/3KUD),
$ colabfold --input 3kud_complex.fasta \\
 77 |    --output_dir 3kud \\
 78 |    --homooligomer 1:1 \\
 79 |    --use_ptm \\
 80 |    --use_turbo \\
 81 |    --max_recycle 3 \\
 82 |    --num_relax Top5
where the input sequence `3kud_complex.fasta` is
>3KUD_complex
 83 |    MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQH:
 84 |    PSKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARLDWNTDAASLIGEELQVDFL
This will predict a heterooligomer. For more information about the options, type `colabfold --help` or refer to the original [ColabFold / AlphaFold2_advanced](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb). 85 | 86 | ## Advantages of LocalColabFold 87 | - **Structure inference and relaxation will be accelerated if your PC has Nvidia GPU and CUDA drivers.** 88 | - **No Time out (90 minutes and 12 hours)** 89 | - **No GPU limitations** 90 | - **NOT necessary to prepare the large database required for native AlphaFold2**. 91 | 92 | ## FAQ 93 | - What else do I need to do before installation? Do I need sudo privileges? 94 | - No, except for the installation of the `curl` and `wget` commands. 95 | - Do I need to prepare large databases such as PDB70, BFD, Uniclust30, MGnify...? 96 | - **No, it is not necessary.** Generation of the MSA is performed by the MMseqs2 web server, just as implemented in ColabFold. 97 | - Are the pLDDT score and PAE figures available? 98 | - Yes, they will be generated just like in ColabFold. 99 | - Is it possible to predict homooligomers and complexes? 100 | - Yes, the sequence input is the same as in ColabFold. See [ColabFold / AlphaFold2_advanced](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb). 101 | - Is it possible to create the MSA with jackhmmer? 102 | - **No, it is not currently supported**. 103 | - I want to run the predictions step-by-step like in Google Colab. 104 | - You can use VSCode and its Python plugin to do the same. See https://code.visualstudio.com/docs/python/jupyter-support-py. 105 | - I want to use multiple GPUs to perform the prediction. 106 | - You need to set the environment variables `TF_FORCE_UNIFIED_MEMORY` and `XLA_PYTHON_CLIENT_MEM_FRACTION` before execution (see the sketch after this FAQ). See [this discussion](https://github.com/YoshitakaMo/localcolabfold/issues/7#issuecomment-923027641). 107 | - I want to solve the `ResourceExhausted` error when trying to predict a sequence with > 1000 residues. 108 | - See the same discussion as above. 109 | - I got an error message `CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered`. 110 | - You may not have updated to CUDA 11.1 or later. Please check the version of the CUDA compiler with the `nvcc --version` command, not `nvidia-smi`. 111 | - Is this available on Windows 10? 112 | - You can run LocalColabFold on Windows 10 with [WSL2](https://docs.microsoft.com/en-us/windows/wsl/install-win10). 113 |
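A minimal sketch of the environment-variable setup mentioned in the two FAQ items above (the values mirror the defaults that `install_colabfold_linux.sh` writes into the `colabfold` wrapper script; see the linked discussion for tuning them to your hardware):

```bash
# Allow JAX/TensorFlow to spill GPU allocations into host RAM (unified memory)
export TF_FORCE_UNIFIED_MEMORY="1"
# With unified memory, values > 1.0 let a process allocate more than one GPU's worth of memory
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"

colabfold --input 6x9z.fasta --output_dir 6x9z --use_ptm --use_turbo --num_relax Top5
```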
114 | ## Tutorials & Presentations 115 | 116 | - ColabFold Tutorial presented at the Boston Protein Design and Modeling Club. [[video]](https://www.youtube.com/watch?v=Rfw7thgGTwI) [[slides]](https://docs.google.com/presentation/d/1mnffk23ev2QMDzGZ5w1skXEadTe54l8-Uei6ACce8eI). 117 | 118 | ## Acknowledgments 119 | 120 | - The original colabfold was first created by Sergey Ovchinnikov ([@sokrypton](https://twitter.com/sokrypton)), Milot Mirdita ([@milot_mirdita](https://twitter.com/milot_mirdita)) and Martin Steinegger ([@thesteinegger](https://twitter.com/thesteinegger)). 121 | 122 | ## How do I reference this work? 123 | 124 | - Mirdita M, Schuetze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold - Making protein folding accessible to all. *bioRxiv*, doi: [10.1101/2021.08.15.456425](https://www.biorxiv.org/content/10.1101/2021.08.15.456425v2) (2021) 125 | - John Jumper, Richard Evans, Alexander Pritzel, et al. - Highly accurate protein structure prediction with AlphaFold.<br>*Nature*, 1–11, doi: [10.1038/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2) (2021) 126 | 127 | [![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.5123296.svg)](https://doi.org/10.5281/zenodo.5123296) 128 | -------------------------------------------------------------------------------- /v1.0.0/README_ja.md: -------------------------------------------------------------------------------- 1 | # LocalColabFold 2 | 3 | [ColabFold / AlphaFold2_advanced](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb) running on the CPU and GPU of your personal computer. 4 | 5 | ## How to install 6 | 7 | ### For Linux+GPU 8 | 9 | 1. Make sure the `curl`, `git`, and `wget` commands are already installed on your system. If they are not present, install them first. On Ubuntu, you can install them by typing `sudo apt -y install curl git wget`. 10 | 2. **Make sure the version of your CUDA compiler is 11.1 or later.**
$ nvcc --version
 11 | nvcc: NVIDIA (R) Cuda compiler driver
 12 | Copyright (c) 2005-2020 NVIDIA Corporation
 13 | Built on Mon_Oct_12_20:09:46_PDT_2020
 14 | Cuda compilation tools, release 11.1, V11.1.105
 15 | Build cuda_11.1.TC455_06.29190527_0
 16 | 
Do not use the `nvidia-smi` command for this version check; it is inaccurate here.
If you have not installed the CUDA compiler yet, see the [NVIDIA CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html). 17 | 3. Download `install_colabfold_linux.sh` from this repository:
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabfold_linux.sh
After placing it in the directory where you want to install, type the following command:
$ bash install_colabfold_linux.sh
After about 5 minutes, the `colabfold` directory will be created. Do not move this directory after the installation. 18 | 4. Type `cd colabfold` to enter the directory. 19 | 5. Modify parameters such as `sequence = 'PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK'` and `jobname = "test"` in `runner.py` and enter the information required for the structure prediction. For detailed settings, please refer to the original [ColabFold / AlphaFold2_advanced](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb); almost all of the settings available there can also be used here (except MSA_methods). 20 | 6. To run the prediction, type the following command in a terminal inside the `colabfold` directory:
$ colabfold-conda/bin/python3.7 runner.py
The result files will be created as `prediction_<jobname>_<hash>` inside the `colabfold` directory. After the prediction has finished, you may move the result files out of the `colabfold` directory or rename the result directory. 21 | 22 | ### For macOS 23 | 24 | **Caution: Because macOS has no Nvidia GPU or CUDA driver, the structure inference part is about 5-10 times slower than in a Linux+GPU environment**. For the test amino acid sequence (58 amino acids), the computation takes about 30 minutes. Still, it may be worth playing with it before preparing a Linux+GPU environment. 25 | 26 | Also, check in advance whether your Mac has an Intel CPU or an M1 chip (Apple Silicon). The output of `uname -m` in a terminal tells you which one you have. 27 | 28 | ```bash 29 | $ uname -m 30 | x86_64 # Intel 31 | arm64 # Apple Silicon 32 | ``` 33 | 34 | (If you are using Rosetta 2 on Apple Silicon, this shows x86_64 even though the machine is Apple Silicon... this case is not supported at the moment.) 35 | 36 | Based on this result, choose the appropriate installer. 37 | 38 | #### For a Mac with an Intel CPU 39 | 40 | 1. Install [Homebrew](https://qiita.com/zaburo/items/29fe23c1ceb6056109fd):
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
41 | 2. Install the `wget` command with Homebrew:
$ brew install wget
42 | 3. Download `install_colabfold_intelmac.sh` from this repository:
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabfold_intelmac.sh
After placing it in the directory where you want to install, type the following command:
$ bash install_colabfold_intelmac.sh
After about 5 minutes, the `colabfold` directory will be created. Do not move this directory after the installation. 43 | 4. The rest of the procedure is the same as in "For Linux+GPU". 44 | 45 | #### For a Mac with Apple Silicon (M1 chip) 46 | 47 | **Note: Because most of the dependent Python packages have not been fully tested on Apple Silicon Macs yet, this installer is experimental.** 48 | 49 | 1. Install [Homebrew](https://qiita.com/zaburo/items/29fe23c1ceb6056109fd):
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
50 | 1. Install the `wget` and `cmake` commands with Homebrew:
$ brew install wget cmake
51 | 1. Install `miniforge` with Homebrew:
$ brew install --cask miniforge
52 | 1. Download the installer `install_colabfold_M1mac.sh` from this repository:
$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabfold_M1mac.sh
After placing it in the directory where you want to install, type the following command:
$ bash install_colabfold_M1mac.sh
After about 5 minutes, the `colabfold` directory will be created. Various warnings and errors may appear along the way. Do not move this directory after the installation. 53 | 1. Type `cd colabfold` to enter the directory. 54 | 1. Modify parameters such as `sequence = 'PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK'` and `jobname = "test"` in `runner.py` and enter the information required for the structure prediction. For detailed settings, please refer to the original [ColabFold / AlphaFold2_advanced](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb); almost all of the settings available there can also be used here (except MSA_methods). 55 | 1. To run the prediction, type the following command in a terminal inside the `colabfold` directory:
$ colabfold-conda/bin/python3.8 runner.py
The result files will be created as `prediction_<jobname>_<hash>` inside the `colabfold` directory. After the prediction has finished, you may move the result files out of the `colabfold` directory or rename the result directory. 56 | 57 | A message like the following appears while the prediction is running: 58 | 59 | ``` 60 | You are using an experimental build of OpenMM v7.5.1. 61 | This is NOT SUITABLE for production! 62 | It has not been properly tested on this platform and we cannot guarantee it provides accurate results. 63 | ``` 64 | 65 | This message appears only when running on Apple Silicon, but it can probably be ignored. 66 | 67 | ## Usage of the `colabfold` command (for Linux) 68 | 69 | `colabfold` is an executable shell script that can take command-line arguments instead of `runner.py`. It only needs to be installed once on a shared computer, which is useful when multiple users want to predict many sequences with localcolabfold. 70 | 71 | 1. Prepare a FASTA-format file containing the amino acid sequence you want to predict in the same directory; as an example, call it `6x9z.fasta`:
>6X9Z_1|Chain A|Transmembrane beta-barrels|synthetic construct (32630)
 72 | MEQKPGTLMVYVVVGYNTDNTVDVVGGAQYAVSPYLFLDVGYGWNNSSLNFLEVGGGVSYKVSPDLEPYVKAGFEYNTDNTIKPTAGAGALYRVSPNLALMVEYGWNNSSLQKVAIGIAYKVKD
73 | 1. Type `export PATH="/path/to/colabfold/bin:$PATH"` to set the file path of this `colabfold` shell script in the PATH environment variable. For example, if you installed LocalColabFold in `/home/foo/bar/colabfold`, type `export PATH="/home/foo/bar/colabfold/bin:$PATH"`. 74 | 1. Specify the input amino acid sequence file as the `--input` argument and run the `colabfold` command. For example:
$ colabfold --input 6x9z.fasta \\
 75 |    --output_dir 6x9z \\
 76 |    --max_recycle 18 \\
 77 |    --use_ptm \\
 78 |    --use_turbo \\
 79 |    --num_relax Top5
The command above raises the maximum number of 'recycling' iterations to 18 when predicting the *de novo* protein structure [PDB: 6X9Z](https://www.rcsb.org/structure/6x9z). Raising this number has been shown to be effective for predicting *de novo* protein structures (for ordinary proteins, 3 recycles are almost always sufficient).
As another input example, when you want to run a **complex prediction** for [PDB: 3KUD](https://www.rcsb.org/structure/3KUD):
$ colabfold --input 3kud_complex.fasta \\
 80 |    --output_dir 3kud \\
 81 |    --homooligomer 1:1 \\
 82 |    --use_ptm \\
 83 |    --use_turbo \\
 84 |    --max_recycle 3 \\
 85 |    --num_relax Top5
Here, the input sequence `3kud_complex.fasta` is as follows:
>3KUD_complex
 86 |    MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQH:
 87 |    PSKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARLDWNTDAASLIGEELQVDFL
 88 |    
You can run a complex prediction by separating the amino acid sequences with the `:` symbol; in this case, it is a hetero-complex prediction. For other settings, such as homooligomer prediction, read the usage shown by `colabfold --help` or the description in the original [ColabFold / AlphaFold2_advanced](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb). 89 | 90 | ## Advantages of LocalColabFold 91 | 92 | - **If your PC has an Nvidia GPU and CUDA drivers, structure inference and relaxation by AlphaFold2 are accelerated.** 93 | - **Google Colab times out after 90 minutes of idling or more than 12 hours of use; there are no such limits here, and of course no limits on GPU usage either.** 94 | - **There is no need to download any databases**. 95 | 96 | ## FAQ 97 | - What do I need to prepare before installation? 98 | - Nothing other than the `curl` and `wget` commands. 99 | - Do I need to prepare huge databases such as BFD, MGnify, PDB70, or Uniclust30? 100 | - **No, you do not**. 101 | - How is the MSA that AlphaFold2 needs for its first step generated? 102 | - The MSA generation is performed by the MMseqs2 web server, just as in ColabFold. 103 | - Are the pLDDT score and PAE figures shown in ColabFold also generated? 104 | - Yes, they are generated. 105 | - Are homo-oligomer and complex predictions also possible? 106 | - Yes, they are. The sequence input method is the same as in Google Colab. 107 | - Is MSA generation with jackhmmer possible? 108 | - **It is not supported at the moment**. 109 | - I want to execute the run cell-by-cell like in Google Colab. 110 | - You can do the same with VSCode and its Python plugin. See https://code.visualstudio.com/docs/python/jupyter-support-py . 111 | - I want to run the computation with multiple GPUs. 112 | - You need to set the environment variables `TF_FORCE_UNIFIED_MEMORY` and `XLA_PYTHON_CLIENT_MEM_FRACTION` before execution (see the sketch after this FAQ). Please read [this issue](https://github.com/YoshitakaMo/localcolabfold/issues/7#issuecomment-923027641). 113 | - I want to solve the `ResourceExhausted` error that occurs when predicting a long amino acid sequence. 114 | - Please read the same issue as above. 115 | - I get the error message `CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered`. 116 | - You may not have updated to CUDA 11.1 or later. Check the version of the CUDA compiler with the `nvcc --version` command. 117 | - Can I use it on Windows 10? 118 | - With [WSL2](https://docs.microsoft.com/en-us/windows/wsl/install-win10) installed, it runs on Windows 10 in the same way. 119 |
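A minimal sketch of the environment-variable setup mentioned in the FAQ above (the values mirror the defaults that the installer writes into the `colabfold` wrapper script; read the linked issue before tuning them):

```bash
# Treat host RAM as spill space for GPU allocations (unified memory)
export TF_FORCE_UNIFIED_MEMORY="1"
# With unified memory, > 1.0 allows allocating more than a single GPU's memory
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
```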
120 | ## Tutorials & Presentations 121 | 122 | - ColabFold Tutorial presented at the Boston Protein Design and Modeling Club. [[video]](https://www.youtube.com/watch?v=Rfw7thgGTwI) [[slides]](https://docs.google.com/presentation/d/1mnffk23ev2QMDzGZ5w1skXEadTe54l8-Uei6ACce8eI). 123 | 124 | ## Acknowledgments 125 | 126 | - The original colabfold was first created by Sergey Ovchinnikov ([@sokrypton](https://twitter.com/sokrypton)), Milot Mirdita ([@milot_mirdita](https://twitter.com/milot_mirdita)) and Martin Steinegger ([@thesteinegger](https://twitter.com/thesteinegger)). 127 | 128 | ## How do I reference this work? 129 | 130 | - Mirdita M, Schuetze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold - Making protein folding accessible to all. *bioRxiv*, doi: [10.1101/2021.08.15.456425](https://www.biorxiv.org/content/10.1101/2021.08.15.456425v2) (2021) 131 | - John Jumper, Richard Evans, Alexander Pritzel, et al. - Highly accurate protein structure prediction with AlphaFold.<br>*Nature*, 1–11, doi: [10.1038/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2) (2021) 132 | 133 | 134 | [![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.5123296.svg)](https://doi.org/10.5281/zenodo.5123296) 135 | -------------------------------------------------------------------------------- /v1.0.0/colabfold_alphafold.patch: -------------------------------------------------------------------------------- 1 | --- colabfold_alphafold.py.orig 2021-10-24 10:56:09.887461716 +0900 2 | +++ colabfold_alphafold.py 2021-10-24 11:25:12.811888920 +0900 3 | @@ -32,6 +32,13 @@ try: 4 | except: 5 | IN_COLAB = False 6 | 7 | +if os.getenv('COLABFOLD_PATH'): 8 | + print("COLABFOLD_PATH is set to " + os.getenv('COLABFOLD_PATH')) 9 | + colabfold_path = os.getenv('COLABFOLD_PATH') 10 | +else: 11 | + print("COLABFOLD_PATH is not set.") 12 | + colabfold_path = '.' 13 | + 14 | import tqdm.notebook 15 | TQDM_BAR_FORMAT = '{l_bar}{bar}| {n_fmt}/{total_fmt} [elapsed: {elapsed} remaining: {remaining}]' 16 | 17 | @@ -641,7 +648,7 @@ def prep_model_runner(opt=None, model_na 18 | cfg.model.recycle_tol = opt["tol"] 19 | cfg.data.eval.num_ensemble = opt["num_ensemble"] 20 | 21 | - params = data.get_model_haiku_params(name, params_loc) 22 | + params = data.get_model_haiku_params(name, colabfold_path + "/" + params_loc) 23 | return {"model":model.RunModel(cfg, params, is_training=opt["is_training"]), "opt":opt} 24 | else: 25 | return old_runner 26 | @@ -749,7 +756,7 @@ def run_alphafold(feature_dict, opt=None 27 | pbar.set_description(f'Running {key}') 28 | 29 | # replace model parameters 30 | - params = data.get_model_haiku_params(name, params_loc) 31 | + params = data.get_model_haiku_params(name, colabfold_path + "/" + params_loc) 32 | for k in runner["model"].params.keys(): 33 | runner["model"].params[k] = params[k] 34 | 35 | -------------------------------------------------------------------------------- /v1.0.0/gpurelaxation.patch: -------------------------------------------------------------------------------- 1 | --- alphafold/relax/amber_minimize.py.org 2021-08-31 16:59:21.161164190 +0900 2 | +++ alphafold/relax/amber_minimize.py 2021-08-31 16:59:32.073226369 +0900 3 | @@ -90,7 +90,7 @@ def _openmm_minimize( 4 | _add_restraints(system, pdb, stiffness, restraint_set, exclude_residues) 5 | 6 | integrator = openmm.LangevinIntegrator(0, 0.01, 0.0) 7 | - platform = openmm.Platform.getPlatformByName("CPU") 8 | + platform = openmm.Platform.getPlatformByName("CUDA") 9 | simulation = openmm_app.Simulation( 10 | pdb.topology, system, integrator, platform) 11 | simulation.context.setPositions(pdb.positions) 12 | @@ -530,7 +530,7 @@ def get_initial_energies(pdb_strs: Seque 13 | simulation = openmm_app.Simulation(openmm_pdbs[0].topology, 14 | system, 15 | openmm.LangevinIntegrator(0, 0.01, 0.0), 16 | - openmm.Platform.getPlatformByName("CPU")) 17 | + openmm.Platform.getPlatformByName("CUDA")) 18 | energies = [] 19 | for pdb in openmm_pdbs: 20 | try: 21 | -------------------------------------------------------------------------------- /v1.0.0/install_colabfold_M1mac.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # check whether `wget` and `cmake` are installed 4 | type wget || { echo "wget command is not installed. Please install it at first using Homebrew." ; exit 1 ; } 5 | type cmake || { echo "cmake command is not installed. Please install it at first using Homebrew."
; exit 1 ; } 6 | 7 | # check whether miniforge is present 8 | test -f "/opt/homebrew/Caskroom/miniforge/base/etc/profile.d/conda.sh" || { echo "Install miniforge by using Homebrew before installation. \n 'brew install --cask miniforge'" ; exit 1 ; } 9 | 10 | # check whether Apple Silicon (M1 mac) or Intel Mac 11 | arch_name="$(uname -m)" 12 | 13 | if [ "${arch_name}" = "x86_64" ]; then 14 | if [ "$(sysctl -in sysctl.proc_translated)" = "1" ]; then 15 | echo "Running on Rosetta 2" 16 | else 17 | echo "Running on native Intel" 18 | fi 19 | echo "This installer is only for Apple Silicon. Use install_colabfold_intelmac.sh to install on this Mac." 20 | exit 1 21 | elif [ "${arch_name}" = "arm64" ]; then 22 | echo "Running on Apple Silicon (M1 mac)" 23 | else 24 | echo "Unknown architecture: ${arch_name}" 25 | exit 1 26 | fi 27 | 28 | GIT_REPO="https://github.com/deepmind/alphafold" 29 | SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar" 30 | CURRENTPATH=`pwd` 31 | COLABFOLDDIR="${CURRENTPATH}/colabfold" 32 | PARAMS_DIR="${COLABFOLDDIR}/alphafold/data/params" 33 | MSATOOLS="${COLABFOLDDIR}/tools" 34 | 35 | # download the original alphafold as "${COLABFOLDDIR}" 36 | echo "downloading the original alphafold as ${COLABFOLDDIR}..." 37 | rm -rf ${COLABFOLDDIR} 38 | git clone ${GIT_REPO} ${COLABFOLDDIR} 39 | (cd ${COLABFOLDDIR}; git checkout 1d43aaff941c84dc56311076b58795797e49107b --quiet) 40 | 41 | # colabfold patches 42 | echo "Applying several patches to be Alphafold2_advanced..." 43 | cd ${COLABFOLDDIR} 44 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/colabfold.py 45 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/colabfold_alphafold.py 46 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/pairmsa.py 47 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/protein.patch 48 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/config.patch 49 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/model.patch 50 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/modules.patch 51 | # GPU relaxation patch 52 | # wget -qnc https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/gpurelaxation.patch -O gpurelaxation.patch 53 | 54 | # donwload reformat.pl from hh-suite 55 | wget -qnc https://raw.githubusercontent.com/soedinglab/hh-suite/master/scripts/reformat.pl 56 | # Apply multi-chain patch from Lim Heo @huhlim 57 | patch -u alphafold/common/protein.py -i protein.patch 58 | patch -u alphafold/model/model.py -i model.patch 59 | patch -u alphafold/model/modules.py -i modules.patch 60 | patch -u alphafold/model/config.py -i config.patch 61 | cd .. 62 | 63 | # Downloading parameter files 64 | echo "Downloading AlphaFold2 trained parameters..." 65 | mkdir -p ${PARAMS_DIR} 66 | curl -fL ${SOURCE_URL} | tar x -C ${PARAMS_DIR} 67 | 68 | # Downloading stereo_chemical_props.txt from https://git.scicore.unibas.ch/schwede/openstructure 69 | echo "Downloading stereo_chemical_props.txt..." 70 | wget -q https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt 71 | mkdir -p ${COLABFOLDDIR}/alphafold/common 72 | mv stereo_chemical_props.txt ${COLABFOLDDIR}/alphafold/common 73 | 74 | # echo "installing HH-suite 3.3.0..." 
75 | # mkdir -p ${MSATOOLS} 76 | # git clone --branch v3.3.0 https://github.com/soedinglab/hh-suite.git hh-suite-3.3.0 77 | # (cd hh-suite-3.3.0 ; mkdir build ; cd build ; cmake -DCMAKE_INSTALL_PREFIX=${MSATOOLS}/hh-suite .. ; make -j4 ; make install) 78 | # rm -rf hh-suite-3.3.0 79 | 80 | # echo "installing HMMER 3.3.2..." 81 | # wget http://eddylab.org/software/hmmer/hmmer-3.3.2.tar.gz 82 | # (tar xzvf hmmer-3.3.2.tar.gz ; cd hmmer-3.3.2 ; ./configure --prefix=${MSATOOLS}/hmmer ; make -j4 ; make install) 83 | # rm -rf hmmer-3.3.2.tar.gz hmmer-3.3.2 84 | 85 | echo "Creating conda environments with python3.8 as ${COLABFOLDDIR}/colabfold-conda" 86 | . "/opt/homebrew/Caskroom/miniforge/base/etc/profile.d/conda.sh" 87 | conda create -p $COLABFOLDDIR/colabfold-conda python=3.8 -y 88 | conda activate $COLABFOLDDIR/colabfold-conda 89 | conda update -y conda 90 | 91 | echo "Installing conda-forge packages" 92 | conda install -y -c conda-forge python=3.8 openmm==7.5.1 pdbfixer jupyter matplotlib py3Dmol tqdm biopython==1.79 immutabledict==2.0.0 93 | conda install -y -c conda-forge jax==0.2.20 94 | conda install -y -c apple tensorflow-deps 95 | python3.8 -m pip install tensorflow-macos 96 | python3.8 -m pip install jaxlib==0.1.70 -f "https://dfm.io/custom-wheels/jaxlib/index.html" 97 | python3.8 -m pip install numpy==1.21.2 98 | python3.8 -m pip install git+git://github.com/deepmind/tree.git 99 | python3.8 -m pip install git+git://github.com/google/ml_collections.git 100 | python3.8 -m pip install git+git://github.com/deepmind/dm-haiku.git 101 | 102 | # Apply OpenMM patch. 103 | echo "Applying OpenMM patch..." 104 | (cd ${COLABFOLDDIR}/colabfold-conda/lib/python3.8/site-packages/ && patch -p0 < ${COLABFOLDDIR}/docker/openmm.patch) 105 | 106 | # Enable GPU-accelerated relaxation. 107 | # echo "Enable GPU-accelerated relaxation..." 108 | # (cd ${COLABFOLDDIR} && patch -u alphafold/relax/amber_minimize.py -i gpurelaxation.patch) 109 | 110 | echo "Downloading runner.py" 111 | (cd ${COLABFOLDDIR} && wget -q "https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/runner.py") 112 | 113 | echo "Installation of Alphafold2_advanced finished." 114 | -------------------------------------------------------------------------------- /v1.0.0/install_colabfold_intelmac.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # check whether `wget` are installed 4 | type wget || { echo "wget command is not installed. Please install it at first using Homebrew." ; exit 1 ; } 5 | 6 | # check whether Apple Silicon (M1 mac) or Intel Mac 7 | arch_name="$(uname -m)" 8 | 9 | if [ "${arch_name}" = "x86_64" ]; then 10 | if [ "$(sysctl -in sysctl.proc_translated)" = "1" ]; then 11 | echo "Running on Rosetta 2" 12 | else 13 | echo "Running on native Intel" 14 | fi 15 | elif [ "${arch_name}" = "arm64" ]; then 16 | echo "Running on Apple Silicon (M1 mac)" 17 | echo "This installer is only for intel Mac. Use install_colabfold_M1mac.sh to install on this Mac." 
18 | exit 1 19 | else 20 | echo "Unknown architecture: ${arch_name}" 21 | exit 1 22 | fi 23 | 24 | GIT_REPO="https://github.com/deepmind/alphafold" 25 | SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar" 26 | CURRENTPATH=`pwd` 27 | COLABFOLDDIR="${CURRENTPATH}/colabfold" 28 | PARAMS_DIR="${COLABFOLDDIR}/alphafold/data/params" 29 | MSATOOLS="${COLABFOLDDIR}/tools" 30 | 31 | # download the original alphafold as "${COLABFOLDDIR}" 32 | echo "downloading the original alphafold as ${COLABFOLDDIR}..." 33 | rm -rf ${COLABFOLDDIR} 34 | git clone ${GIT_REPO} ${COLABFOLDDIR} 35 | (cd ${COLABFOLDDIR}; git checkout 1d43aaff941c84dc56311076b58795797e49107b --quiet) 36 | 37 | # colabfold patches 38 | echo "Applying several patches to be Alphafold2_advanced..." 39 | cd ${COLABFOLDDIR} 40 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/colabfold.py 41 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/colabfold_alphafold.py 42 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/pairmsa.py 43 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/protein.patch 44 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/config.patch 45 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/model.patch 46 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/modules.patch 47 | # GPU relaxation patch 48 | # wget -qnc https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/gpurelaxation.patch -O gpurelaxation.patch 49 | 50 | # donwload reformat.pl from hh-suite 51 | wget -qnc https://raw.githubusercontent.com/soedinglab/hh-suite/master/scripts/reformat.pl 52 | # Apply multi-chain patch from Lim Heo @huhlim 53 | patch -u alphafold/common/protein.py -i protein.patch 54 | patch -u alphafold/model/model.py -i model.patch 55 | patch -u alphafold/model/modules.py -i modules.patch 56 | patch -u alphafold/model/config.py -i config.patch 57 | cd .. 58 | 59 | # Downloading parameter files 60 | echo "Downloading AlphaFold2 trained parameters..." 61 | mkdir -p ${PARAMS_DIR} 62 | curl -fL ${SOURCE_URL} | tar x -C ${PARAMS_DIR} 63 | 64 | # Downloading stereo_chemical_props.txt from https://git.scicore.unibas.ch/schwede/openstructure 65 | echo "Downloading stereo_chemical_props.txt..." 66 | wget -q https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt 67 | mkdir -p ${COLABFOLDDIR}/alphafold/common 68 | mv stereo_chemical_props.txt ${COLABFOLDDIR}/alphafold/common 69 | 70 | # Install Miniconda3 for Linux 71 | echo "Installing Miniconda3 for macOS..." 72 | cd ${COLABFOLDDIR} 73 | wget -q -P . https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh 74 | bash ./Miniconda3-latest-MacOSX-x86_64.sh -b -p ${COLABFOLDDIR}/conda 75 | rm Miniconda3-latest-MacOSX-x86_64.sh 76 | cd .. 77 | 78 | echo "Creating conda environments with python3.7 as ${COLABFOLDDIR}/colabfold-conda" 79 | . 
"${COLABFOLDDIR}/conda/etc/profile.d/conda.sh" 80 | export PATH="${COLABFOLDDIR}/conda/condabin:${PATH}" 81 | conda create -p $COLABFOLDDIR/colabfold-conda python=3.7 -y 82 | conda activate $COLABFOLDDIR/colabfold-conda 83 | conda update -y conda 84 | 85 | echo "Installing conda-forge packages" 86 | conda install -c conda-forge python=3.7 openmm==7.5.1 pdbfixer -y 87 | conda install -c bioconda hmmer==3.3.2 hhsuite==3.3.0 -y 88 | echo "Installing alphafold dependencies by pip" 89 | python3.7 -m pip install absl-py==0.13.0 biopython==1.79 chex==0.0.7 dm-haiku==0.0.4 dm-tree==0.1.6 immutabledict==2.0.0 jax==0.2.14 jaxlib==0.1.69 ml-collections==0.1.0 numpy==1.19.5 scipy==1.7.0 tensorflow==2.5.0 90 | python3.7 -m pip install jupyter matplotlib py3Dmol tqdm 91 | 92 | # Apply OpenMM patch. 93 | echo "Applying OpenMM patch..." 94 | (cd ${COLABFOLDDIR}/colabfold-conda/lib/python3.7/site-packages/ && patch -p0 < ${COLABFOLDDIR}/docker/openmm.patch) 95 | 96 | # Enable GPU-accelerated relaxation. 97 | # echo "Enable GPU-accelerated relaxation..." 98 | # (cd ${COLABFOLDDIR} && patch -u alphafold/relax/amber_minimize.py -i gpurelaxation.patch) 99 | 100 | echo "Downloading runner.py" 101 | (cd ${COLABFOLDDIR} && wget -q "https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/runner.py") 102 | 103 | echo "Installation of Alphafold2_advanced finished." 104 | -------------------------------------------------------------------------------- /v1.0.0/install_colabfold_linux.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # check whether `wget` and `curl` are installed 4 | type wget || { echo "wget command is not installed. Please install it at first using apt or yum." ; exit 1 ; } 5 | type curl || { echo "curl command is not installed. Please install it at first using apt or yum. " ; exit 1 ; } 6 | 7 | GIT_REPO="https://github.com/deepmind/alphafold" 8 | SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar" 9 | CURRENTPATH=`pwd` 10 | COLABFOLDDIR="${CURRENTPATH}/colabfold" 11 | PARAMS_DIR="${COLABFOLDDIR}/alphafold/data/params" 12 | MSATOOLS="${COLABFOLDDIR}/tools" 13 | 14 | # download the original alphafold as "${COLABFOLDDIR}" 15 | echo "downloading the original alphafold as ${COLABFOLDDIR}..." 16 | rm -rf ${COLABFOLDDIR} 17 | git clone ${GIT_REPO} ${COLABFOLDDIR} 18 | (cd ${COLABFOLDDIR}; git checkout 1d43aaff941c84dc56311076b58795797e49107b --quiet) 19 | 20 | # colabfold patches 21 | echo "Applying several patches to be Alphafold2_advanced..." 
22 | cd ${COLABFOLDDIR} 23 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/colabfold.py 24 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/colabfold_alphafold.py 25 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/pairmsa.py 26 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/protein.patch 27 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/config.patch 28 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/model.patch 29 | wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/modules.patch 30 | # GPU relaxation patch 31 | wget -qnc https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/gpurelaxation.patch -O gpurelaxation.patch 32 | 33 | # donwload reformat.pl from hh-suite 34 | wget -qnc https://raw.githubusercontent.com/soedinglab/hh-suite/master/scripts/reformat.pl 35 | # Apply multi-chain patch from Lim Heo @huhlim 36 | patch -u alphafold/common/protein.py -i protein.patch 37 | patch -u alphafold/model/model.py -i model.patch 38 | patch -u alphafold/model/modules.py -i modules.patch 39 | patch -u alphafold/model/config.py -i config.patch 40 | cd .. 41 | 42 | # Downloading parameter files 43 | echo "Downloading AlphaFold2 trained parameters..." 44 | mkdir -p ${PARAMS_DIR} 45 | curl -fL ${SOURCE_URL} | tar x -C ${PARAMS_DIR} 46 | 47 | # Downloading stereo_chemical_props.txt from https://git.scicore.unibas.ch/schwede/openstructure 48 | echo "Downloading stereo_chemical_props.txt..." 49 | wget -q https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt --no-check-certificate 50 | mkdir -p ${COLABFOLDDIR}/alphafold/common 51 | mv stereo_chemical_props.txt ${COLABFOLDDIR}/alphafold/common 52 | 53 | # Install Miniconda3 for Linux 54 | echo "Installing Miniconda3 for Linux..." 55 | cd ${COLABFOLDDIR} 56 | wget -q -P . https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh 57 | bash ./Miniconda3-latest-Linux-x86_64.sh -b -p ${COLABFOLDDIR}/conda 58 | rm Miniconda3-latest-Linux-x86_64.sh 59 | cd .. 60 | 61 | echo "Creating conda environments with python3.7 as ${COLABFOLDDIR}/colabfold-conda" 62 | . "${COLABFOLDDIR}/conda/etc/profile.d/conda.sh" 63 | export PATH="${COLABFOLDDIR}/conda/condabin:${PATH}" 64 | conda create -p $COLABFOLDDIR/colabfold-conda python=3.7 -y 65 | conda activate $COLABFOLDDIR/colabfold-conda 66 | conda update -n base conda -y 67 | 68 | echo "Installing conda-forge packages" 69 | conda install -c conda-forge python=3.7 cudnn==8.2.1.32 cudatoolkit==11.1.1 openmm==7.5.1 pdbfixer -y 70 | conda install -c bioconda hmmer==3.3.2 hhsuite==3.3.0 -y 71 | echo "Installing alphafold dependencies by pip" 72 | python3.7 -m pip install absl-py==0.13.0 biopython==1.79 chex==0.0.7 dm-haiku==0.0.4 dm-tree==0.1.6 immutabledict==2.0.0 jax==0.2.14 ml-collections==0.1.0 numpy==1.19.5 scipy==1.7.0 tensorflow-gpu==2.5.0 73 | python3.7 -m pip install jupyter matplotlib py3Dmol tqdm 74 | python3.7 -m pip install --upgrade jax jaxlib==0.1.69+cuda111 -f https://storage.googleapis.com/jax-releases/jax_releases.html 75 | 76 | # Apply OpenMM patch. 77 | echo "Applying OpenMM patch..." 78 | (cd ${COLABFOLDDIR}/colabfold-conda/lib/python3.7/site-packages/ && patch -p0 < ${COLABFOLDDIR}/docker/openmm.patch) 79 | 80 | # Enable GPU-accelerated relaxation. 81 | echo "Enable GPU-accelerated relaxation..." 
82 | (cd ${COLABFOLDDIR} && patch -u alphafold/relax/amber_minimize.py -i gpurelaxation.patch) 83 | 84 | echo "Downloading runner.py..." 85 | (cd ${COLABFOLDDIR} && wget -q "https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/runner.py") 86 | (cd ${COLABFOLDDIR} && wget -q "https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/runner_af2advanced.py") 87 | 88 | echo "Making standalone command 'colabfold'..." 89 | cd ${COLABFOLDDIR} 90 | mkdir -p bin && cd bin 91 | cat << EOF > colabfold 92 | #!/bin/sh 93 | 94 | . "${COLABFOLDDIR}/conda/etc/profile.d/conda.sh" 95 | conda activate ${COLABFOLDDIR}/colabfold-conda 96 | export NVIDIA_VISIBLE_DEVICES="all" 97 | export TF_FORCE_UNIFIED_MEMORY="1" 98 | export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0" 99 | export COLABFOLD_PATH="${COLABFOLDDIR}" 100 | python3.7 ${COLABFOLDDIR}/runner_af2advanced.py \$@ 101 | EOF 102 | chmod +x ./colabfold 103 | cd ${COLABFOLDDIR} 104 | wget -qnc https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/residue_constants.patch -O residue_constants.patch 105 | wget -qnc https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/colabfold_alphafold.patch -O colabfold_alphafold.patch 106 | patch -u alphafold/common/residue_constants.py -i residue_constants.patch 107 | patch -u colabfold_alphafold.py -i colabfold_alphafold.patch 108 | 109 | echo "Installation of Alphafold2_advanced finished." 110 | -------------------------------------------------------------------------------- /v1.0.0/residue_constants.patch: -------------------------------------------------------------------------------- 1 | --- residue_constants.py.orig 2021-10-24 11:30:58.275400080 +0900 2 | +++ residue_constants.py 2021-10-24 11:20:08.028085425 +0900 3 | @@ -20,6 +20,8 @@ from typing import List, Mapping, Tuple 4 | 5 | import numpy as np 6 | import tree 7 | +import os 8 | +colabfold_path = os.getenv('COLABFOLD_PATH', '.') 9 | 10 | # Internal import (35fd). 
11 | 12 | @@ -403,7 +405,7 @@ def load_stereo_chemical_props() -> Tupl 13 | residue_bond_angles: dict that maps resname --> list of BondAngle tuples 14 | """ 15 | stereo_chemical_props_path = ( 16 | - 'alphafold/common/stereo_chemical_props.txt') 17 | + colabfold_path + '/alphafold/common/stereo_chemical_props.txt') 18 | with open(stereo_chemical_props_path, 'rt') as f: 19 | stereo_chemical_props = f.read() 20 | lines_iter = iter(stereo_chemical_props.splitlines()) 21 | 22 | -------------------------------------------------------------------------------- /v1.0.0/runner.py: -------------------------------------------------------------------------------- 1 | #%% 2 | import os 3 | import tensorflow as tf 4 | tf.config.set_visible_devices([], 'GPU') 5 | 6 | import jax 7 | 8 | from IPython.utils import io 9 | import subprocess 10 | import tqdm.notebook 11 | 12 | # --- Python imports --- 13 | import colabfold as cf 14 | import pairmsa 15 | import sys 16 | import pickle 17 | 18 | from urllib import request 19 | from concurrent import futures 20 | import json 21 | from matplotlib import gridspec 22 | import matplotlib.pyplot as plt 23 | import numpy as np 24 | import py3Dmol 25 | 26 | from urllib import request 27 | from concurrent import futures 28 | import json 29 | from matplotlib import gridspec 30 | import matplotlib.pyplot as plt 31 | import numpy as np 32 | import py3Dmol 33 | 34 | from alphafold.model import model 35 | from alphafold.model import config 36 | from alphafold.model import data 37 | 38 | from alphafold.data import parsers 39 | from alphafold.data import pipeline 40 | from alphafold.data.tools import jackhmmer 41 | 42 | from alphafold.common import protein 43 | 44 | ### Check your OS for localcolabfold 45 | import platform 46 | pf = platform.system() 47 | if pf == 'Windows': 48 | print('ColabFold on Windows') 49 | elif pf == 'Darwin': 50 | print('ColabFold on Mac') 51 | device="cpu" 52 | elif pf == 'Linux': 53 | print('ColabFold on Linux') 54 | device="gpu" 55 | #%% 56 | 57 | def run_jackhmmer(sequence, prefix): 58 | 59 | fasta_path = f"{prefix}.fasta" 60 | with open(fasta_path, 'wt') as f: 61 | f.write(f'>query\n{sequence}') 62 | 63 | pickled_msa_path = f"{prefix}.jackhmmer.pickle" 64 | if os.path.isfile(pickled_msa_path): 65 | msas_dict = pickle.load(open(pickled_msa_path,"rb")) 66 | msas, deletion_matrices, names = (msas_dict[k] for k in ['msas', 'deletion_matrices', 'names']) 67 | full_msa = [] 68 | for msa in msas: 69 | full_msa += msa 70 | else: 71 | # --- Find the closest source --- 72 | test_url_pattern = 'https://storage.googleapis.com/alphafold-colab{:s}/latest/uniref90_2021_03.fasta.1' 73 | ex = futures.ThreadPoolExecutor(3) 74 | def fetch(source): 75 | request.urlretrieve(test_url_pattern.format(source)) 76 | return source 77 | fs = [ex.submit(fetch, source) for source in ['', '-europe', '-asia']] 78 | source = None 79 | for f in futures.as_completed(fs): 80 | source = f.result() 81 | ex.shutdown() 82 | break 83 | 84 | jackhmmer_binary_path = '/usr/bin/jackhmmer' 85 | dbs = [] 86 | 87 | num_jackhmmer_chunks = {'uniref90': 59, 'smallbfd': 17, 'mgnify': 71} 88 | total_jackhmmer_chunks = sum(num_jackhmmer_chunks.values()) 89 | with tqdm.notebook.tqdm(total=total_jackhmmer_chunks, bar_format=TQDM_BAR_FORMAT) as pbar: 90 | def jackhmmer_chunk_callback(i): 91 | pbar.update(n=1) 92 | 93 | pbar.set_description('Searching uniref90') 94 | jackhmmer_uniref90_runner = jackhmmer.Jackhmmer( 95 | binary_path=jackhmmer_binary_path, 96 | 
database_path=f'https://storage.googleapis.com/alphafold-colab{source}/latest/uniref90_2021_03.fasta', 97 | get_tblout=True, 98 | num_streamed_chunks=num_jackhmmer_chunks['uniref90'], 99 | streaming_callback=jackhmmer_chunk_callback, 100 | z_value=135301051) 101 | dbs.append(('uniref90', jackhmmer_uniref90_runner.query(fasta_path))) 102 | 103 | pbar.set_description('Searching smallbfd') 104 | jackhmmer_smallbfd_runner = jackhmmer.Jackhmmer( 105 | binary_path=jackhmmer_binary_path, 106 | database_path=f'https://storage.googleapis.com/alphafold-colab{source}/latest/bfd-first_non_consensus_sequences.fasta', 107 | get_tblout=True, 108 | num_streamed_chunks=num_jackhmmer_chunks['smallbfd'], 109 | streaming_callback=jackhmmer_chunk_callback, 110 | z_value=65984053) 111 | dbs.append(('smallbfd', jackhmmer_smallbfd_runner.query(fasta_path))) 112 | 113 | pbar.set_description('Searching mgnify') 114 | jackhmmer_mgnify_runner = jackhmmer.Jackhmmer( 115 | binary_path=jackhmmer_binary_path, 116 | database_path=f'https://storage.googleapis.com/alphafold-colab{source}/latest/mgy_clusters_2019_05.fasta', 117 | get_tblout=True, 118 | num_streamed_chunks=num_jackhmmer_chunks['mgnify'], 119 | streaming_callback=jackhmmer_chunk_callback, 120 | z_value=304820129) 121 | dbs.append(('mgnify', jackhmmer_mgnify_runner.query(fasta_path))) 122 | 123 | # --- Extract the MSAs and visualize --- 124 | # Extract the MSAs from the Stockholm files. 125 | # NB: deduplication happens later in pipeline.make_msa_features. 126 | 127 | mgnify_max_hits = 501 128 | msas = [] 129 | deletion_matrices = [] 130 | names = [] 131 | for db_name, db_results in dbs: 132 | unsorted_results = [] 133 | for i, result in enumerate(db_results): 134 | msa, deletion_matrix, target_names = parsers.parse_stockholm(result['sto']) 135 | e_values_dict = parsers.parse_e_values_from_tblout(result['tbl']) 136 | e_values = [e_values_dict[t.split('/')[0]] for t in target_names] 137 | zipped_results = zip(msa, deletion_matrix, target_names, e_values) 138 | if i != 0: 139 | # Only take query from the first chunk 140 | zipped_results = [x for x in zipped_results if x[2] != 'query'] 141 | unsorted_results.extend(zipped_results) 142 | sorted_by_evalue = sorted(unsorted_results, key=lambda x: x[3]) 143 | db_msas, db_deletion_matrices, db_names, _ = zip(*sorted_by_evalue) 144 | if db_msas: 145 | if db_name == 'mgnify': 146 | db_msas = db_msas[:mgnify_max_hits] 147 | db_deletion_matrices = db_deletion_matrices[:mgnify_max_hits] 148 | db_names = db_names[:mgnify_max_hits] 149 | msas.append(db_msas) 150 | deletion_matrices.append(db_deletion_matrices) 151 | names.append(db_names) 152 | msa_size = len(set(db_msas)) 153 | print(f'{msa_size} Sequences Found in {db_name}') 154 | 155 | pickle.dump({"msas":msas, 156 | "deletion_matrices":deletion_matrices, 157 | "names":names}, open(pickled_msa_path,"wb")) 158 | return msas, deletion_matrices, names 159 | 160 | import re 161 | 162 | # define sequence 163 | sequence = 'PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK' #@param {type:"string"} 164 | sequence = re.sub("[^A-Z:/]", "", sequence.upper()) 165 | sequence = re.sub(":+",":",sequence) 166 | sequence = re.sub("/+","/",sequence) 167 | sequence = re.sub("^[:/]+","",sequence) 168 | sequence = re.sub("[:/]+$","",sequence) 169 | 170 | jobname = "test" #@param {type:"string"} 171 | jobname = re.sub(r'\W+', '', jobname) 172 | 173 | # define number of copies 174 | homooligomer = "1" #@param {type:"string"} 175 | homooligomer = re.sub("[:/]+",":",homooligomer) 176 | 
homooligomer = re.sub("^[:/]+","",homooligomer) 177 | homooligomer = re.sub("[:/]+$","",homooligomer) 178 | 179 | if len(homooligomer) == 0: homooligomer = "1" 180 | homooligomer = re.sub("[^0-9:]", "", homooligomer) 181 | homooligomers = [int(h) for h in homooligomer.split(":")] 182 | 183 | #@markdown - `sequence` Specify protein sequence to be modelled. 184 | #@markdown - Use `/` to specify intra-protein chainbreaks (for trimming regions within protein). 185 | #@markdown - Use `:` to specify inter-protein chainbreaks (for modeling protein-protein hetero-complexes). 186 | #@markdown - For example, sequence `AC/DE:FGH` will be modelled as polypeptides: `AC`, `DE` and `FGH`. A separate MSA will be generates for `ACDE` and `FGH`. 187 | #@markdown If `pair_msa` is enabled, `ACDE`'s MSA will be paired with `FGH`'s MSA. 188 | #@markdown - `homooligomer` Define number of copies in a homo-oligomeric assembly. 189 | #@markdown - Use `:` to specify different homooligomeric state (copy numer) for each component of the complex. 190 | #@markdown - For example, **sequence:**`ABC:DEF`, **homooligomer:** `2:1`, the first protein `ABC` will be modeled as a homodimer (2 copies) and second `DEF` a monomer (1 copy). 191 | 192 | ori_sequence = sequence 193 | sequence = sequence.replace("/","").replace(":","") 194 | seqs = ori_sequence.replace("/","").split(":") 195 | 196 | if len(seqs) != len(homooligomers): 197 | if len(homooligomers) == 1: 198 | homooligomers = [homooligomers[0]] * len(seqs) 199 | homooligomer = ":".join([str(h) for h in homooligomers]) 200 | else: 201 | while len(seqs) > len(homooligomers): 202 | homooligomers.append(1) 203 | homooligomers = homooligomers[:len(seqs)] 204 | homooligomer = ":".join([str(h) for h in homooligomers]) 205 | print("WARNING: Mismatch between number of breaks ':' in 'sequence' and 'homooligomer' definition") 206 | 207 | full_sequence = "".join([s*h for s,h in zip(seqs,homooligomers)]) 208 | 209 | # prediction directory 210 | output_dir = 'prediction_' + jobname + '_' + cf.get_hash(full_sequence)[:5] 211 | os.makedirs(output_dir, exist_ok=True) 212 | # delete existing files in working directory 213 | for f in os.listdir(output_dir): 214 | os.remove(os.path.join(output_dir, f)) 215 | 216 | MIN_SEQUENCE_LENGTH = 16 217 | MAX_SEQUENCE_LENGTH = 2500 218 | 219 | aatypes = set('ACDEFGHIKLMNPQRSTVWY') # 20 standard aatypes 220 | if not set(full_sequence).issubset(aatypes): 221 | raise Exception(f'Input sequence contains non-amino acid letters: {set(sequence) - aatypes}. AlphaFold only supports 20 standard amino acids as inputs.') 222 | if len(full_sequence) < MIN_SEQUENCE_LENGTH: 223 | raise Exception(f'Input sequence is too short: {len(full_sequence)} amino acids, while the minimum is {MIN_SEQUENCE_LENGTH}') 224 | if len(full_sequence) > MAX_SEQUENCE_LENGTH: 225 | raise Exception(f'Input sequence is too long: {len(full_sequence)} amino acids, while the maximum is {MAX_SEQUENCE_LENGTH}. Please use the full AlphaFold system for long sequences.') 226 | 227 | if len(full_sequence) > 1400: 228 | print(f"WARNING: For a typical Google-Colab-GPU (16G) session, the max total length is ~1400 residues. You are at {len(full_sequence)}! 
Run Alphafold may crash.") 229 | 230 | print(f"homooligomer: '{homooligomer}'") 231 | print(f"total_length: '{len(full_sequence)}'") 232 | print(f"working_directory: '{output_dir}'") 233 | #%% 234 | TQDM_BAR_FORMAT = '{l_bar}{bar}| {n_fmt}/{total_fmt} [elapsed: {elapsed} remaining: {remaining}]' 235 | #@markdown Once this cell has been executed, you will see 236 | #@markdown statistics about the multiple sequence alignment 237 | #@markdown (MSA) that will be used by AlphaFold. In particular, 238 | #@markdown you’ll see how well each residue is covered by similar 239 | #@markdown sequences in the MSA. 240 | #@markdown (Note that the search against databases and the actual prediction can take some time, from minutes to hours, depending on the length of the protein and what type of GPU you are allocated by Colab.) 241 | 242 | #@markdown --- 243 | msa_method = "mmseqs2" #@param ["mmseqs2","jackhmmer","single_sequence","precomputed"] 244 | #@markdown - `mmseqs2` - FAST method from [ColabFold](https://github.com/sokrypton/ColabFold) 245 | #@markdown - `jackhmmer` - default method from Deepmind (SLOW, but may find more/less sequences). 246 | #@markdown - `single_sequence` - use single sequence input 247 | #@markdown - `precomputed` If you have previously run this notebook and saved the results, 248 | #@markdown you can skip this step by uploading 249 | #@markdown the previously generated `prediction_?????/msa.pickle` 250 | 251 | 252 | #@markdown --- 253 | #@markdown **custom msa options** 254 | add_custom_msa = False #@param {type:"boolean"} 255 | msa_format = "fas" #@param ["fas","a2m","a3m","sto","psi","clu"] 256 | #@markdown - `add_custom_msa` - If enabled, you'll get an option to upload your custom MSA in the specified `msa_format`. Note: Your MSA will be supplemented with those from 'mmseqs2' or 'jackhmmer', unless `msa_method` is set to 'single_sequence'. 257 | 258 | #@markdown --- 259 | #@markdown **pair msa options** 260 | 261 | #@markdown Experimental option for protein complexes. Pairing currently only supported for proteins in same operon (prokaryotic genomes). 262 | pair_mode = "unpaired" #@param ["unpaired","unpaired+paired","paired"] {type:"string"} 263 | #@markdown - `unpaired` - generate separate MSA for each protein. 264 | #@markdown - `unpaired+paired` - attempt to pair sequences from the same operon within the genome. 265 | #@markdown - `paired` - only use sequences that were successfully paired. 266 | 267 | #@markdown Options to prefilter each MSA before pairing. (It might help if there are any paralogs in the complex.) 268 | pair_cov = 50 #@param [0,25,50,75,90] {type:"raw"} 269 | pair_qid = 20 #@param [0,15,20,30,40,50] {type:"raw"} 270 | #@markdown - `pair_cov` prefilter each MSA to minimum coverage with query (%) before pairing. 271 | #@markdown - `pair_qid` prefilter each MSA to minimum sequence identity with query (%) before pairing. 
272 | 273 | # --- Search against genetic databases --- 274 | os.makedirs('tmp', exist_ok=True) 275 | msas, deletion_matrices = [],[] 276 | 277 | if add_custom_msa: 278 | print(f"upload custom msa in '{msa_format}' format") 279 | msa_dict = files.upload() 280 | lines = msa_dict[list(msa_dict.keys())[0]].decode() 281 | 282 | # convert to a3m 283 | with open(f"tmp/upload.{msa_format}","w") as tmp_upload: 284 | tmp_upload.write(lines) 285 | os.system(f"reformat.pl {msa_format} a3m tmp/upload.{msa_format} tmp/upload.a3m") 286 | a3m_lines = open("tmp/upload.a3m","r").read() 287 | 288 | # parse 289 | msa, mtx = parsers.parse_a3m(a3m_lines) 290 | msas.append(msa) 291 | deletion_matrices.append(mtx) 292 | 293 | if len(msas[0][0]) != len(sequence): 294 | raise ValueError("ERROR: the length of msa does not match input sequence") 295 | 296 | if msa_method == "precomputed": 297 | print("upload precomputed pickled msa from previous run") 298 | pickled_msa_dict = files.upload() 299 | msas_dict = pickle.loads(pickled_msa_dict[list(pickled_msa_dict.keys())[0]]) 300 | msas, deletion_matrices = (msas_dict[k] for k in ['msas', 'deletion_matrices']) 301 | 302 | elif msa_method == "single_sequence": 303 | if len(msas) == 0: 304 | msas.append([sequence]) 305 | deletion_matrices.append([[0]*len(sequence)]) 306 | 307 | else: 308 | seqs = ori_sequence.replace('/','').split(':') 309 | _blank_seq = ["-" * len(seq) for seq in seqs] 310 | _blank_mtx = [[0] * len(seq) for seq in seqs] 311 | def _pad(ns,vals,mode): 312 | if mode == "seq": _blank = _blank_seq.copy() 313 | if mode == "mtx": _blank = _blank_mtx.copy() 314 | if isinstance(ns, list): 315 | for n,val in zip(ns,vals): _blank[n] = val 316 | else: _blank[ns] = vals 317 | if mode == "seq": return "".join(_blank) 318 | if mode == "mtx": return sum(_blank,[]) 319 | 320 | if len(seqs) == 1 or "unpaired" in pair_mode: 321 | # gather msas 322 | if msa_method == "mmseqs2": 323 | prefix = cf.get_hash("".join(seqs)) 324 | prefix = os.path.join('tmp',prefix) 325 | print(f"running mmseqs2") 326 | A3M_LINES = cf.run_mmseqs2(seqs, prefix, filter=True) 327 | 328 | for n, seq in enumerate(seqs): 329 | # tmp directory 330 | prefix = cf.get_hash(seq) 331 | prefix = os.path.join('tmp',prefix) 332 | 333 | if msa_method == "mmseqs2": 334 | # run mmseqs2 335 | a3m_lines = A3M_LINES[n] 336 | msa, mtx = parsers.parse_a3m(a3m_lines) 337 | msas_, mtxs_ = [msa],[mtx] 338 | 339 | elif msa_method == "jackhmmer": 340 | print(f"running jackhmmer on seq_{n}") 341 | # run jackhmmer 342 | msas_, mtxs_, names_ = ([sum(x,())] for x in run_jackhmmer(seq, prefix)) 343 | 344 | # pad sequences 345 | for msa_,mtx_ in zip(msas_,mtxs_): 346 | msa,mtx = [sequence],[[0]*len(sequence)] 347 | for s,m in zip(msa_,mtx_): 348 | msa.append(_pad(n,s,"seq")) 349 | mtx.append(_pad(n,m,"mtx")) 350 | 351 | msas.append(msa) 352 | deletion_matrices.append(mtx) 353 | 354 | #################################################################################### 355 | # PAIR_MSA 356 | #################################################################################### 357 | 358 | if len(seqs) > 1 and (pair_mode == "paired" or pair_mode == "unpaired+paired"): 359 | print("attempting to pair some sequences...") 360 | 361 | if msa_method == "mmseqs2": 362 | prefix = cf.get_hash("".join(seqs)) 363 | prefix = os.path.join('tmp',prefix) 364 | print(f"running mmseqs2_noenv_nofilter on all seqs") 365 | A3M_LINES = cf.run_mmseqs2(seqs, prefix, use_env=False, use_filter=False) 366 | 367 | _data = [] 368 | for a in range(len(seqs)): 369 
| print(f"prepping seq_{a}") 370 | _seq = seqs[a] 371 | _prefix = os.path.join('tmp',cf.get_hash(_seq)) 372 | 373 | if msa_method == "mmseqs2": 374 | a3m_lines = A3M_LINES[a] 375 | _msa, _mtx, _lab = pairmsa.parse_a3m(a3m_lines, 376 | filter_qid=pair_qid/100, 377 | filter_cov=pair_cov/100) 378 | 379 | elif msa_method == "jackhmmer": 380 | _msas, _mtxs, _names = run_jackhmmer(_seq, _prefix) 381 | _msa, _mtx, _lab = pairmsa.get_uni_jackhmmer(_msas[0], _mtxs[0], _names[0], 382 | filter_qid=pair_qid/100, 383 | filter_cov=pair_cov/100) 384 | 385 | if len(_msa) > 1: 386 | _data.append(pairmsa.hash_it(_msa, _lab, _mtx, call_uniprot=False)) 387 | else: 388 | _data.append(None) 389 | 390 | Ln = len(seqs) 391 | O = [[None for _ in seqs] for _ in seqs] 392 | for a in range(Ln): 393 | if _data[a] is not None: 394 | for b in range(a+1,Ln): 395 | if _data[b] is not None: 396 | print(f"attempting pairwise stitch for {a} {b}") 397 | O[a][b] = pairmsa._stitch(_data[a],_data[b]) 398 | _seq_a, _seq_b, _mtx_a, _mtx_b = (*O[a][b]["seq"],*O[a][b]["mtx"]) 399 | 400 | ############################################## 401 | # filter to remove redundant sequences 402 | ############################################## 403 | ok = [] 404 | with open("tmp/tmp.fas","w") as fas_file: 405 | fas_file.writelines([f">{n}\n{a+b}\n" for n,(a,b) in enumerate(zip(_seq_a,_seq_b))]) 406 | os.system("hhfilter -maxseq 1000000 -i tmp/tmp.fas -o tmp/tmp.id90.fas -id 90") 407 | for line in open("tmp/tmp.id90.fas","r"): 408 | if line.startswith(">"): ok.append(int(line[1:])) 409 | ############################################## 410 | print(f"found {len(_seq_a)} pairs ({len(ok)} after filtering)") 411 | 412 | if len(_seq_a) > 0: 413 | msa,mtx = [sequence],[[0]*len(sequence)] 414 | for s_a,s_b,m_a,m_b in zip(_seq_a, _seq_b, _mtx_a, _mtx_b): 415 | msa.append(_pad([a,b],[s_a,s_b],"seq")) 416 | mtx.append(_pad([a,b],[m_a,m_b],"mtx")) 417 | msas.append(msa) 418 | deletion_matrices.append(mtx) 419 | 420 | ''' 421 | # triwise stitching (WIP) 422 | if Ln > 2: 423 | for a in range(Ln): 424 | for b in range(a+1,Ln): 425 | for c in range(b+1,Ln): 426 | if O[a][b] is not None and O[b][c] is not None: 427 | print(f"attempting triwise stitch for {a} {b} {c}") 428 | list_ab = O[a][b]["lab"][1] 429 | list_bc = O[b][c]["lab"][0] 430 | msa,mtx = [sequence],[[0]*len(sequence)] 431 | for i,l_b in enumerate(list_ab): 432 | if l_b in list_bc: 433 | j = list_bc.index(l_b) 434 | s_a = O[a][b]["seq"][0][i] 435 | s_b = O[a][b]["seq"][1][i] 436 | s_c = O[b][c]["seq"][1][j] 437 | 438 | m_a = O[a][b]["mtx"][0][i] 439 | m_b = O[a][b]["mtx"][1][i] 440 | m_c = O[b][c]["mtx"][1][j] 441 | 442 | msa.append(_pad([a,b,c],[s_a,s_b,s_c],"seq")) 443 | mtx.append(_pad([a,b,c],[m_a,m_b,m_c],"mtx")) 444 | if len(msa) > 1: 445 | msas.append(msa) 446 | deletion_matrices.append(mtx) 447 | print(f"found {len(msa)} triplets") 448 | ''' 449 | #################################################################################### 450 | #################################################################################### 451 | 452 | # save MSA as pickle 453 | pickle.dump({"msas":msas,"deletion_matrices":deletion_matrices}, 454 | open(os.path.join(output_dir,"msa.pickle"),"wb")) 455 | 456 | make_msa_plot = len(msas[0]) > 1 457 | if make_msa_plot: 458 | plt = cf.plot_msas(msas, ori_sequence) 459 | plt.savefig(os.path.join(output_dir,"msa_coverage.png"), bbox_inches = 'tight', dpi=300) 460 | #%% 461 | #@title run alphafold 462 | num_relax = "None" 463 | rank_by = "pLDDT" #@param ["pLDDT","pTMscore"] 
464 | use_turbo = True #@param {type:"boolean"} 465 | max_msa = "512:1024" #@param ["512:1024", "256:512", "128:256", "64:128", "32:64"] 466 | max_msa_clusters, max_extra_msa = [int(x) for x in max_msa.split(":")] 467 | 468 | 469 | 470 | #@markdown - `rank_by` specify metric to use for ranking models (For protein-protein complexes, we recommend pTMscore) 471 | #@markdown - `use_turbo` introduces a few modifications (compile once, swap params, adjust max_msa) to speedup and reduce memory requirements. Disable for default behavior. 472 | #@markdown - `max_msa` defines: `max_msa_clusters:max_extra_msa` number of sequences to use. When adjusting after GPU crash, be sure to `Runtime` → `Restart runtime`. (Lowering will reduce GPU requirements, but may result in poor model quality. This option ignored if `use_turbo` is disabled) 473 | show_images = True #@param {type:"boolean"} 474 | #@markdown - `show_images` To make things more exciting we show images of the predicted structures as they are being generated. (WARNING: the order of images displayed does not reflect any ranking). 475 | #@markdown --- 476 | #@markdown #### Sampling options 477 | #@markdown There are two stochastic parts of the pipeline. Within the feature generation (choice of cluster centers) and within the model (dropout). 478 | #@markdown To get structure diversity, you can iterate through a fixed number of random_seeds (using `num_samples`) and/or enable dropout (using `is_training`). 479 | 480 | num_models = 5 #@param [1,2,3,4,5] {type:"raw"} 481 | use_ptm = True #@param {type:"boolean"} 482 | num_ensemble = 1 #@param [1,8] {type:"raw"} 483 | max_recycles = 3 #@param [1,3,6,12,24,48] {type:"raw"} 484 | tol = 0 #@param [0,0.1,0.5,1] {type:"raw"} 485 | is_training = False #@param {type:"boolean"} 486 | num_samples = 1 #@param [1,2,4,8,16,32] {type:"raw"} 487 | 488 | subsample_msa = True #@param {type:"boolean"} 489 | #@markdown - `subsample_msa` subsample large MSA to `3E7/length` sequences to avoid crashing the preprocessing protocol. (This option ignored if `use_turbo` is disabled.) 
490 | 491 | save_pae_json = True 492 | save_tmp_pdb = True 493 | 494 | 495 | if not use_ptm and rank_by == "pTMscore": 496 | print("WARNING: models will be ranked by pLDDT, 'use_ptm' is needed to compute pTMscore") 497 | rank_by = "pLDDT" 498 | 499 | ############################# 500 | # delete old files 501 | ############################# 502 | for f in os.listdir(output_dir): 503 | if "rank_" in f: 504 | os.remove(os.path.join(output_dir, f)) 505 | 506 | ############################# 507 | # homooligomerize 508 | ############################# 509 | lengths = [len(seq) for seq in seqs] 510 | msas_mod, deletion_matrices_mod = cf.homooligomerize_heterooligomer(msas, deletion_matrices, 511 | lengths, homooligomers) 512 | ############################# 513 | # define input features 514 | ############################# 515 | def _placeholder_template_feats(num_templates_, num_res_): 516 | return { 517 | 'template_aatype': np.zeros([num_templates_, num_res_, 22], np.float32), 518 | 'template_all_atom_masks': np.zeros([num_templates_, num_res_, 37], np.float32), # mask is [T, L, 37] 519 | 'template_all_atom_positions': np.zeros([num_templates_, num_res_, 37, 3], np.float32), # positions are [T, L, 37, 3] 520 | 'template_domain_names': np.zeros([num_templates_], np.float32), 521 | 'template_sum_probs': np.zeros([num_templates_], np.float32), 522 | } 523 | 524 | num_res = len(full_sequence) 525 | feature_dict = {} 526 | feature_dict.update(pipeline.make_sequence_features(full_sequence, 'test', num_res)) 527 | feature_dict.update(pipeline.make_msa_features(msas_mod, deletion_matrices=deletion_matrices_mod)) 528 | if not use_turbo: 529 | feature_dict.update(_placeholder_template_feats(0, num_res)) 530 | 531 | def do_subsample_msa(F, random_seed=0): 532 | '''subsample msa to avoid running out of memory''' 533 | N = len(F["msa"]) 534 | L = len(F["residue_index"]) 535 | N_ = int(3E7/L) 536 | if N > N_: 537 | print(f"whhhaaa... 
too many sequences ({N}) subsampling to {N_}") 538 | np.random.seed(random_seed) 539 | idx = np.append(0,np.random.permutation(np.arange(1,N)))[:N_] 540 | F_ = {} 541 | F_["msa"] = F["msa"][idx] 542 | F_["deletion_matrix_int"] = F["deletion_matrix_int"][idx] 543 | F_["num_alignments"] = np.full_like(F["num_alignments"],N_) 544 | for k in ['aatype', 'between_segment_residues', 545 | 'domain_name', 'residue_index', 546 | 'seq_length', 'sequence']: 547 | F_[k] = F[k] 548 | return F_ 549 | else: 550 | return F 551 | 552 | ################################ 553 | # set chain breaks 554 | ################################ 555 | Ls = [] 556 | for seq,h in zip(ori_sequence.split(":"),homooligomers): 557 | Ls += [len(s) for s in seq.split("/")] * h 558 | Ls_plot = sum([[len(seq)]*h for seq,h in zip(seqs,homooligomers)],[]) 559 | feature_dict['residue_index'] = cf.chain_break(feature_dict['residue_index'], Ls) 560 | 561 | ########################### 562 | # run alphafold 563 | ########################### 564 | def parse_results(prediction_result, processed_feature_dict): 565 | b_factors = prediction_result['plddt'][:,None] * prediction_result['structure_module']['final_atom_mask'] 566 | dist_bins = jax.numpy.append(0,prediction_result["distogram"]["bin_edges"]) 567 | dist_mtx = dist_bins[prediction_result["distogram"]["logits"].argmax(-1)] 568 | contact_mtx = jax.nn.softmax(prediction_result["distogram"]["logits"])[:,:,dist_bins < 8].sum(-1) 569 | 570 | out = {"unrelaxed_protein": protein.from_prediction(processed_feature_dict, prediction_result, b_factors=b_factors), 571 | "plddt": prediction_result['plddt'], 572 | "pLDDT": prediction_result['plddt'].mean(), 573 | "dists": dist_mtx, 574 | "adj": contact_mtx} 575 | 576 | if "ptm" in prediction_result: 577 | out.update({"pae": prediction_result['predicted_aligned_error'], 578 | "pTMscore": prediction_result['ptm']}) 579 | return out 580 | 581 | model_names = ['model_1', 'model_2', 'model_3', 'model_4', 'model_5'][:num_models] 582 | total = len(model_names) * num_samples 583 | with tqdm.notebook.tqdm(total=total, bar_format=TQDM_BAR_FORMAT) as pbar: 584 | ####################################################################### 585 | # precompile model and recompile only if length changes 586 | ####################################################################### 587 | if use_turbo: 588 | name = "model_5_ptm" if use_ptm else "model_5" 589 | N = len(feature_dict["msa"]) 590 | L = len(feature_dict["residue_index"]) 591 | compiled = (N, L, use_ptm, max_recycles, tol, num_ensemble, max_msa, is_training) 592 | if "COMPILED" in dir(): 593 | if COMPILED != compiled: recompile = True 594 | else: recompile = True 595 | if recompile: 596 | cf.clear_mem(device) 597 | cfg = config.model_config(name) 598 | 599 | # set size of msa (to reduce memory requirements) 600 | msa_clusters = min(N, max_msa_clusters) 601 | cfg.data.eval.max_msa_clusters = msa_clusters 602 | cfg.data.common.max_extra_msa = max(min(N-msa_clusters,max_extra_msa),1) 603 | 604 | cfg.data.common.num_recycle = max_recycles 605 | cfg.model.num_recycle = max_recycles 606 | cfg.model.recycle_tol = tol 607 | cfg.data.eval.num_ensemble = num_ensemble 608 | 609 | params = data.get_model_haiku_params(name,'./alphafold/data') 610 | model_runner = model.RunModel(cfg, params, is_training=is_training) 611 | COMPILED = compiled 612 | recompile = False 613 | 614 | else: 615 | cf.clear_mem(device) 616 | recompile = True 617 | 618 | # cleanup 619 | if "outs" in dir(): del outs 620 | outs = {} 621 | 
cf.clear_mem("cpu") 622 | 623 | ####################################################################### 624 | def report(key): 625 | pbar.update(n=1) 626 | o = outs[key] 627 | line = f"{key} recycles:{o['recycles']} tol:{o['tol']:.2f} pLDDT:{o['pLDDT']:.2f}" 628 | if use_ptm: line += f" pTMscore:{o['pTMscore']:.2f}" 629 | print(line) 630 | if show_images: 631 | fig = cf.plot_protein(o['unrelaxed_protein'], Ls=Ls_plot, dpi=100) 632 | # plt.show() 633 | plt.ion() 634 | if save_tmp_pdb: 635 | tmp_pdb_path = os.path.join(output_dir,f'unranked_{key}_unrelaxed.pdb') 636 | pdb_lines = protein.to_pdb(o['unrelaxed_protein']) 637 | with open(tmp_pdb_path, 'w') as f: f.write(pdb_lines) 638 | 639 | if use_turbo: 640 | # go through each random_seed 641 | for seed in range(num_samples): 642 | 643 | # prep input features 644 | if subsample_msa: 645 | sampled_feats_dict = do_subsample_msa(feature_dict, random_seed=seed) 646 | processed_feature_dict = model_runner.process_features(sampled_feats_dict, random_seed=seed) 647 | else: 648 | processed_feature_dict = model_runner.process_features(feature_dict, random_seed=seed) 649 | 650 | # go through each model 651 | for num, model_name in enumerate(model_names): 652 | name = model_name+"_ptm" if use_ptm else model_name 653 | key = f"{name}_seed_{seed}" 654 | pbar.set_description(f'Running {key}') 655 | 656 | # replace model parameters 657 | params = data.get_model_haiku_params(name, './alphafold/data') 658 | for k in model_runner.params.keys(): 659 | model_runner.params[k] = params[k] 660 | 661 | # predict 662 | prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed),"cpu") 663 | 664 | # save results 665 | outs[key] = parse_results(prediction_result, processed_feature_dict) 666 | outs[key].update({"recycles":r, "tol":t}) 667 | report(key) 668 | 669 | del prediction_result, params 670 | del sampled_feats_dict, processed_feature_dict 671 | 672 | else: 673 | # go through each model 674 | for num, model_name in enumerate(model_names): 675 | name = model_name+"_ptm" if use_ptm else model_name 676 | params = data.get_model_haiku_params(name, './alphafold/data') 677 | cfg = config.model_config(name) 678 | cfg.data.common.num_recycle = cfg.model.num_recycle = max_recycles 679 | cfg.model.recycle_tol = tol 680 | cfg.data.eval.num_ensemble = num_ensemble 681 | model_runner = model.RunModel(cfg, params, is_training=is_training) 682 | 683 | # go through each random_seed 684 | for seed in range(num_samples): 685 | key = f"{name}_seed_{seed}" 686 | pbar.set_description(f'Running {key}') 687 | processed_feature_dict = model_runner.process_features(feature_dict, random_seed=seed) 688 | prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed),"cpu") 689 | outs[key] = parse_results(prediction_result, processed_feature_dict) 690 | outs[key].update({"recycles":r, "tol":t}) 691 | report(key) 692 | 693 | # cleanup 694 | del processed_feature_dict, prediction_result 695 | 696 | del params, model_runner, cfg 697 | cf.clear_mem("gpu") 698 | 699 | # delete old files 700 | for f in os.listdir(output_dir): 701 | if "rank" in f: 702 | os.remove(os.path.join(output_dir, f)) 703 | 704 | # Find the best model according to the mean pLDDT. 
705 | model_rank = list(outs.keys()) 706 | model_rank = [model_rank[i] for i in np.argsort([outs[x][rank_by] for x in model_rank])[::-1]] 707 | 708 | # Write out the prediction 709 | for n,key in enumerate(model_rank): 710 | prefix = f"rank_{n+1}_{key}" 711 | pred_output_path = os.path.join(output_dir,f'{prefix}_unrelaxed.pdb') 712 | fig = cf.plot_protein(outs[key]["unrelaxed_protein"], Ls=Ls_plot, dpi=200) 713 | plt.savefig(os.path.join(output_dir,f'{prefix}.png'), bbox_inches = 'tight') 714 | plt.close(fig) 715 | 716 | pdb_lines = protein.to_pdb(outs[key]["unrelaxed_protein"]) 717 | with open(pred_output_path, 'w') as f: 718 | f.write(pdb_lines) 719 | 720 | ############################################################ 721 | print(f"model rank based on {rank_by}") 722 | for n,key in enumerate(model_rank): 723 | print(f"rank_{n+1}_{key} {rank_by}:{outs[key][rank_by]:.2f}") 724 | #%% 725 | #@title Refine structures with Amber-Relax (Optional) 726 | num_relax = "None" #@param ["None", "Top1", "Top5", "All"] {type:"string"} 727 | if num_relax == "None": 728 | num_relax = 0 729 | elif num_relax == "Top1": 730 | num_relax = 1 731 | elif num_relax == "Top5": 732 | num_relax = 5 733 | else: 734 | num_relax = len(model_names) * num_samples 735 | 736 | if num_relax > 0: 737 | if "relax" not in dir(): 738 | # add conda environment to path 739 | sys.path.append('./colabfold-conda/lib/python3.7/site-packages') 740 | 741 | # import libraries 742 | from alphafold.relax import relax 743 | from alphafold.relax import utils 744 | 745 | with tqdm.notebook.tqdm(total=num_relax, bar_format=TQDM_BAR_FORMAT) as pbar: 746 | pbar.set_description(f'AMBER relaxation') 747 | for n,key in enumerate(model_rank): 748 | if n < num_relax: 749 | prefix = f"rank_{n+1}_{key}" 750 | pred_output_path = os.path.join(output_dir,f'{prefix}_relaxed.pdb') 751 | if not os.path.isfile(pred_output_path): 752 | amber_relaxer = relax.AmberRelaxation( 753 | max_iterations=0, 754 | tolerance=2.39, 755 | stiffness=10.0, 756 | exclude_residues=[], 757 | max_outer_iterations=20) 758 | relaxed_pdb_lines, _, _ = amber_relaxer.process(prot=outs[key]["unrelaxed_protein"]) 759 | with open(pred_output_path, 'w') as f: 760 | f.write(relaxed_pdb_lines) 761 | pbar.update(n=1) 762 | #%% 763 | #@title Display 3D structure {run: "auto"} 764 | rank_num = 1 #@param ["1", "2", "3", "4", "5"] {type:"raw"} 765 | color = "lDDT" #@param ["chain", "lDDT", "rainbow"] 766 | show_sidechains = False #@param {type:"boolean"} 767 | show_mainchains = False #@param {type:"boolean"} 768 | 769 | key = model_rank[rank_num-1] 770 | prefix = f"rank_{rank_num}_{key}" 771 | pred_output_path = os.path.join(output_dir,f'{prefix}_relaxed.pdb') 772 | if not os.path.isfile(pred_output_path): 773 | pred_output_path = os.path.join(output_dir,f'{prefix}_unrelaxed.pdb') 774 | 775 | cf.show_pdb(pred_output_path, show_sidechains, show_mainchains, color, Ls=Ls_plot).show() 776 | if color == "lDDT": cf.plot_plddt_legend().show() 777 | if use_ptm: 778 | cf.plot_confidence(outs[key]["plddt"], outs[key]["pae"], Ls=Ls_plot).show() 779 | else: 780 | cf.plot_confidence(outs[key]["plddt"], Ls=Ls_plot).show() 781 | #%% 782 | #@title Extra outputs 783 | dpi = 300#@param {type:"integer"} 784 | save_to_txt = True #@param {type:"boolean"} 785 | save_pae_json = True #@param {type:"boolean"} 786 | #@markdown - save data used to generate contact and distogram plots below to text file (pae values can be found in json file if `use_ptm` is enabled) 787 | 788 | if use_ptm: 789 | print("predicted alignment 
error") 790 | cf.plot_paes([outs[k]["pae"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 791 | plt.savefig(os.path.join(output_dir,f'predicted_alignment_error.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 792 | # plt.show() 793 | 794 | print("predicted contacts") 795 | cf.plot_adjs([outs[k]["adj"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 796 | plt.savefig(os.path.join(output_dir,f'predicted_contacts.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 797 | # plt.show() 798 | 799 | print("predicted distogram") 800 | cf.plot_dists([outs[k]["dists"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 801 | plt.savefig(os.path.join(output_dir,f'predicted_distogram.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 802 | # plt.show() 803 | 804 | print("predicted LDDT") 805 | cf.plot_plddts([outs[k]["plddt"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 806 | plt.savefig(os.path.join(output_dir,f'predicted_LDDT.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 807 | # plt.show() 808 | 809 | def do_save_to_txt(filename, adj, dists): 810 | adj = np.asarray(adj) 811 | dists = np.asarray(dists) 812 | L = len(adj) 813 | with open(filename,"w") as out: 814 | out.write("i\tj\taa_i\taa_j\tp(cbcb<8)\tmaxdistbin\n") 815 | for i in range(L): 816 | for j in range(i+1,L): 817 | if dists[i][j] < 21.68 or adj[i][j] >= 0.001: 818 | line = f"{i+1}\t{j+1}\t{full_sequence[i]}\t{full_sequence[j]}\t{adj[i][j]:.3f}" 819 | line += f"\t>{dists[i][j]:.2f}" if dists[i][j] == 21.6875 else f"\t{dists[i][j]:.2f}" 820 | out.write(f"{line}\n") 821 | 822 | for n,key in enumerate(model_rank): 823 | if save_to_txt: 824 | txt_filename = os.path.join(output_dir,f'rank_{n+1}_{key}.raw.txt') 825 | do_save_to_txt(txt_filename,adj=outs[key]["adj"],dists=outs[key]["dists"]) 826 | 827 | if use_ptm and save_pae_json: 828 | pae = outs[key]["pae"] 829 | max_pae = pae.max() 830 | # Save pLDDT and predicted aligned error (if it exists) 831 | pae_output_path = os.path.join(output_dir,f'rank_{n+1}_{key}_pae.json') 832 | # Save predicted aligned error in the same format as the AF EMBL DB 833 | rounded_errors = np.round(np.asarray(pae), decimals=1) 834 | indices = np.indices((len(rounded_errors), len(rounded_errors))) + 1 835 | indices_1 = indices[0].flatten().tolist() 836 | indices_2 = indices[1].flatten().tolist() 837 | pae_data = json.dumps([{ 838 | 'residue1': indices_1, 839 | 'residue2': indices_2, 840 | 'distance': rounded_errors.flatten().tolist(), 841 | 'max_predicted_aligned_error': max_pae.item() 842 | }], 843 | indent=None, 844 | separators=(',', ':')) 845 | with open(pae_output_path, 'w') as f: 846 | f.write(pae_data) 847 | #%% -------------------------------------------------------------------------------- /v1.0.0/runner_af2advanced.py: -------------------------------------------------------------------------------- 1 | #%% 2 | ## command-line arguments 3 | import argparse 4 | parser = argparse.ArgumentParser(description="Runner script that can take command-line arguments") 5 | parser.add_argument("-i", "--input", help="Path to a FASTA file. Required.", required=True) 6 | parser.add_argument("-o", "--output_dir", default="", type=str, 7 | help="Path to a directory that will store the results. " 8 | "The default name is 'prediction_'. ") 9 | parser.add_argument("-ho", "--homooligomer", default="1", type=str, 10 | help="homooligomer: Define number of copies in a homo-oligomeric assembly. 
" 11 | "For example, sequence:ABC:DEF, homooligomer: 2:1, " 12 | "the first protein ABC will be modeled as a homodimer (2 copies) and second DEF a monomer (1 copy). Default is 1.") 13 | parser.add_argument("-m", "--msa_method", default="mmseqs2", type=str, choices=["mmseqs2", "single_sequence", "precomputed"], 14 | help="Options to generate MSA." 15 | "mmseqs2 - FAST method from ColabFold (default) " 16 | "single_sequence - use single sequence input." 17 | "precomputed - specify 'msa.pickle' file generated previously if you have." 18 | "Default is 'mmseqs2'.") 19 | parser.add_argument("--precomputed", default=None, type=str, 20 | help="Specify the file path of a precomputed pickled msa from previous run. " 21 | ) 22 | parser.add_argument("-p", "--pair_mode", default="unpaired", choices=["unpaired", "unpaired+paired", "paired"], 23 | help="Experimental option for protein complexes. " 24 | "Pairing currently only supported for proteins in same operon (prokaryotic genomes). " 25 | "unpaired - generate separate MSA for each protein. (default) " 26 | "unpaired+paired - attempt to pair sequences from the same operon within the genome. " 27 | "paired - only use sequences that were successfully paired. " 28 | "Default is 'unpaired'.") 29 | parser.add_argument("-pc", "--pair_cov", default=50, type=int, 30 | help="Options to prefilter each MSA before pairing. It might help if there are any paralogs in the complex. " 31 | "prefilter each MSA to minimum coverage with query (%%) before pairing. " 32 | "Default is 50.") 33 | parser.add_argument("-pq", "--pair_qid", default=20, type=int, 34 | help="Options to prefilter each MSA before pairing. It might help if there are any paralogs in the complex. " 35 | "prefilter each MSA to minimum sequence identity with query (%%) before pairing. " 36 | "Default is 20.") 37 | parser.add_argument("-b", "--rank_by", default="pLDDT", type=str, choices=["pLDDT", "pTMscore"], 38 | help="specify metric to use for ranking models (For protein-protein complexes, we recommend pTMscore). " 39 | "Default is 'pLDDT'.") 40 | parser.add_argument("-t", "--use_turbo", action='store_true', 41 | help="introduces a few modifications (compile once, swap params, adjust max_msa) to speedup and reduce memory requirements. " 42 | "Disable for default behavior.") 43 | parser.add_argument("-mm", "--max_msa", default="512:1024", type=str, 44 | help="max_msa defines: max_msa_clusters:max_extra_msa number of sequences to use. " 45 | "This option ignored if use_turbo is disabled. Default is '512:1024'.") 46 | parser.add_argument("-n", "--num_models", default=5, type=int, help="specify how many model params to try. (Default is 5)") 47 | parser.add_argument("-pt", "--use_ptm", action='store_true', 48 | help="uses Deepmind's ptm finetuned model parameters to get PAE per structure. " 49 | "Disable to use the original model params. (Disabling may give alternative structures.)") 50 | parser.add_argument("-e", "--num_ensemble", default=1, type=int, choices=[1, 8], 51 | help="the trunk of the network is run multiple times with different random choices for the MSA cluster centers. " 52 | "(1=default, 8=casp14 setting)") 53 | parser.add_argument("-r", "--max_recycles", default=3, type=int, help="controls the maximum number of times the structure is fed back into the neural network for refinement. 
(default is 3)") 54 | parser.add_argument("--tol", default=0, type=float, help="tolerance for deciding when to stop (CA-RMS between recycles)") 55 | parser.add_argument("--is_training", action='store_true', 56 | help="enables the stochastic part of the model (dropout), when coupled with num_samples can be used to 'sample' a diverse set of structures. False (NOT specifying this option) is recommended at first.") 57 | parser.add_argument("--num_samples", default=1, type=int, help="number of random_seeds to try. Default is 1.") 58 | parser.add_argument("--num_relax", default="None", choices=["None", "Top1", "Top5", "All"], 59 | help="num_relax is 'None' (default), 'Top1', 'Top5' or 'All'. Specify how many of the top ranked structures to relax.") 60 | args = parser.parse_args() 61 | ## command-line arguments 62 | ### Check your OS for localcolabfold 63 | import platform 64 | pf = platform.system() 65 | if pf == 'Windows': 66 | print('ColabFold on Windows') 67 | elif pf == 'Darwin': 68 | print('ColabFold on Mac') 69 | device="cpu" 70 | elif pf == 'Linux': 71 | print('ColabFold on Linux') 72 | device="gpu" 73 | #%% 74 | ### python code of AlphaFold2_advanced.ipynb 75 | import os 76 | import tensorflow as tf 77 | tf.config.set_visible_devices([], 'GPU') 78 | 79 | import jax 80 | 81 | from IPython.utils import io 82 | import subprocess 83 | import tqdm.notebook 84 | 85 | # --- Python imports --- 86 | import colabfold as cf 87 | import colabfold_alphafold as cf_af 88 | import pairmsa 89 | import sys 90 | import pickle 91 | 92 | from urllib import request 93 | from concurrent import futures 94 | import json 95 | from matplotlib import gridspec 96 | import matplotlib.pyplot as plt 97 | import numpy as np 98 | 99 | TMP_DIR = "tmp" 100 | os.makedirs(TMP_DIR, exist_ok=True) 101 | 102 | try: 103 | from google.colab import files 104 | IN_COLAB = True 105 | except: 106 | IN_COLAB = False 107 | 108 | #%% 109 | import re 110 | # define sequence 111 | # --read sequence from input file-- 112 | from Bio import SeqIO 113 | 114 | def readfastafile(fastafile): 115 | records = list(SeqIO.parse(fastafile, "fasta")) 116 | if(len(records) != 1): 117 | raise ValueError('Input FASTA file must have a single ID/sequence.') 118 | else: 119 | return records[0].id, records[0].seq 120 | 121 | 122 | print("Input ID: {}".format(readfastafile(args.input)[0])) 123 | print("Input Sequence: {}".format(readfastafile(args.input)[1])) 124 | sequence = str(readfastafile(args.input)[1]) 125 | # --read sequence from input file-- 126 | jobname = "test" #@param {type:"string"} 127 | homooligomer = args.homooligomer #@param {type:"string"} 128 | 129 | TQDM_BAR_FORMAT = '{l_bar}{bar}| {n_fmt}/{total_fmt} [elapsed: {elapsed} remaining: {remaining}]' 130 | 131 | # prediction directory 132 | # --set the output directory from command-line arguments 133 | if args.output_dir != "": 134 | output_dir = args.output_dir 135 | # --set the output directory from command-line arguments 136 | 137 | I = cf_af.prep_inputs(sequence, jobname, homooligomer, output_dir, clean=IN_COLAB) 138 | 139 | msa_method = args.msa_method #@param ["mmseqs2","single_sequence"] 140 | 141 | if msa_method == "precomputed": 142 | if args.precomputed is None: 143 | raise ValueError("ERROR: `--precomputed` undefined. 
" 144 | "You must specify the file path of previously generated 'msa.pickle' if you set '--msa_method precomputed'.") 145 | else: 146 | precomputed = args.precomputed 147 | print("Use precomputed msa.pickle: {}".format(precomputed)) 148 | else: 149 | precomputed = args.precomputed 150 | 151 | add_custom_msa = False #@param {type:"boolean"} 152 | msa_format = "fas" #@param ["fas","a2m","a3m","sto","psi","clu"] 153 | 154 | # --set the output directory from command-line arguments 155 | pair_mode = args.pair_mode #@param ["unpaired","unpaired+paired","paired"] {type:"string"} 156 | pair_cov = args.pair_cov #@param [0,25,50,75,90] {type:"raw"} 157 | pair_qid = args.pair_qid #@param [0,15,20,30,40,50] {type:"raw"} 158 | # --set the output directory from command-line arguments 159 | 160 | # --- Search against genetic databases --- 161 | 162 | I = cf_af.prep_msa(I, msa_method, add_custom_msa, msa_format, pair_mode, pair_cov, pair_qid, 163 | hhfilter_loc="colabfold-conda/bin/hhfilter", precomputed=precomputed, TMP_DIR=output_dir) 164 | mod_I = I 165 | 166 | if len(I["msas"][0]) > 1: 167 | plt = cf.plot_msas(I["msas"], I["ori_sequence"]) 168 | plt.savefig(os.path.join(I["output_dir"],"msa_coverage.png"), bbox_inches = 'tight', dpi=200) 169 | # plt.show() 170 | #%% 171 | trim = "" #@param {type:"string"} 172 | trim_inverse = False #@param {type:"boolean"} 173 | cov = 0 #@param [0,25,50,75,90,95] {type:"raw"} 174 | qid = 0 #@param [0,15,20,25,30,40,50] {type:"raw"} 175 | 176 | mod_I = cf_af.prep_filter(I, trim, trim_inverse, cov, qid) 177 | 178 | if I["msas"] != mod_I["msas"]: 179 | plt.figure(figsize=(16,5),dpi=100) 180 | plt.subplot(1,2,1) 181 | plt.title("Sequence coverage (Before)") 182 | cf.plot_msas(I["msas"], I["ori_sequence"], return_plt=False) 183 | plt.subplot(1,2,2) 184 | plt.title("Sequence coverage (After)") 185 | cf.plot_msas(mod_I["msas"], mod_I["ori_sequence"], return_plt=False) 186 | plt.savefig(os.path.join(I["output_dir"],"msa_coverage.filtered.png"), bbox_inches = 'tight', dpi=200) 187 | plt.show() 188 | 189 | #%% 190 | ##@title run alphafold 191 | # --------set parameters from command-line arguments-------- 192 | num_relax = args.num_relax 193 | rank_by = args.rank_by 194 | 195 | use_turbo = True if args.use_turbo else False 196 | max_msa = args.max_msa 197 | # --------set parameters from command-line arguments-------- 198 | 199 | max_msa_clusters, max_extra_msa = [int(x) for x in max_msa.split(":")] 200 | 201 | show_images = True #@param {type:"boolean"} 202 | 203 | # --------set parameters from command-line arguments-------- 204 | num_models = args.num_models 205 | use_ptm = True if args.use_ptm else False 206 | num_ensemble = args.num_ensemble 207 | max_recycles = args.max_recycles 208 | tol = args.tol 209 | is_training = True if args.is_training else False 210 | num_samples = args.num_samples 211 | # --------set parameters from command-line arguments-------- 212 | 213 | subsample_msa = True #@param {type:"boolean"} 214 | 215 | if not use_ptm and rank_by == "pTMscore": 216 | print("WARNING: models will be ranked by pLDDT, 'use_ptm' is needed to compute pTMscore") 217 | rank_by = "pLDDT" 218 | 219 | # prep input features 220 | feature_dict = cf_af.prep_feats(mod_I, clean=IN_COLAB) 221 | Ls_plot = feature_dict["Ls"] 222 | 223 | # prep model options 224 | opt = {"N":len(feature_dict["msa"]), 225 | "L":len(feature_dict["residue_index"]), 226 | "use_ptm":use_ptm, 227 | "use_turbo":use_turbo, 228 | "max_recycles":max_recycles, 229 | "tol":tol, 230 | "num_ensemble":num_ensemble, 231 | 
"max_msa_clusters":max_msa_clusters, 232 | "max_extra_msa":max_extra_msa, 233 | "is_training":is_training} 234 | 235 | if use_turbo: 236 | if "runner" in dir(): 237 | # only recompile if options changed 238 | runner = cf_af.prep_model_runner(opt, old_runner=runner) 239 | else: 240 | runner = cf_af.prep_model_runner(opt) 241 | else: 242 | runner = None 243 | 244 | ########################### 245 | # run alphafold 246 | ########################### 247 | outs, model_rank = cf_af.run_alphafold(feature_dict, opt, runner, num_models, num_samples, subsample_msa, 248 | rank_by=rank_by, show_images=show_images) 249 | 250 | #%% 251 | #@title Refine structures with Amber-Relax (Optional) 252 | 253 | # --------set parameters from command-line arguments-------- 254 | num_relax = args.num_relax 255 | # --------set parameters from command-line arguments-------- 256 | 257 | if num_relax == "None": 258 | num_relax = 0 259 | elif num_relax == "Top1": 260 | num_relax = 1 261 | elif num_relax == "Top5": 262 | num_relax = 5 263 | else: 264 | model_names = ['model_1', 'model_2', 'model_3', 'model_4', 'model_5'][:num_models] 265 | num_relax = len(model_names) * num_samples 266 | 267 | if num_relax > 0: 268 | if "relax" not in dir(): 269 | # add conda environment to path 270 | sys.path.append('./colabfold-conda/lib/python3.7/site-packages') 271 | 272 | # import libraries 273 | from alphafold.relax import relax 274 | from alphafold.relax import utils 275 | 276 | with tqdm.notebook.tqdm(total=num_relax, bar_format=TQDM_BAR_FORMAT) as pbar: 277 | pbar.set_description(f'AMBER relaxation') 278 | for n,key in enumerate(model_rank): 279 | if n < num_relax: 280 | prefix = f"rank_{n+1}_{key}" 281 | pred_output_path = os.path.join(I["output_dir"],f'{prefix}_relaxed.pdb') 282 | if not os.path.isfile(pred_output_path): 283 | amber_relaxer = relax.AmberRelaxation( 284 | max_iterations=0, 285 | tolerance=2.39, 286 | stiffness=10.0, 287 | exclude_residues=[], 288 | max_outer_iterations=20) 289 | relaxed_pdb_lines, _, _ = amber_relaxer.process(prot=outs[key]["unrelaxed_protein"]) 290 | with open(pred_output_path, 'w') as f: 291 | f.write(relaxed_pdb_lines) 292 | pbar.update(n=1) 293 | #%% 294 | #@title Display 3D structure {run: "auto"} 295 | rank_num = 1 #@param ["1", "2", "3", "4", "5"] {type:"raw"} 296 | color = "lDDT" #@param ["chain", "lDDT", "rainbow"] 297 | show_sidechains = False #@param {type:"boolean"} 298 | show_mainchains = False #@param {type:"boolean"} 299 | 300 | key = model_rank[rank_num-1] 301 | prefix = f"rank_{rank_num}_{key}" 302 | pred_output_path = os.path.join(I["output_dir"],f'{prefix}_relaxed.pdb') 303 | if not os.path.isfile(pred_output_path): 304 | pred_output_path = os.path.join(I["output_dir"],f'{prefix}_unrelaxed.pdb') 305 | 306 | cf.show_pdb(pred_output_path, show_sidechains, show_mainchains, color, Ls=Ls_plot).show() 307 | if color == "lDDT": cf.plot_plddt_legend().show() 308 | if use_ptm: 309 | cf.plot_confidence(outs[key]["plddt"], outs[key]["pae"], Ls=Ls_plot).show() 310 | else: 311 | cf.plot_confidence(outs[key]["plddt"], Ls=Ls_plot).show() 312 | #%% 313 | #@title Extra outputs 314 | dpi = 300#@param {type:"integer"} 315 | save_to_txt = True #@param {type:"boolean"} 316 | save_pae_json = True #@param {type:"boolean"} 317 | 318 | if use_ptm: 319 | print("predicted alignment error") 320 | cf.plot_paes([outs[k]["pae"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 321 | plt.savefig(os.path.join(I["output_dir"],f'predicted_alignment_error.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 322 | 
plt.show() 323 | 324 | print("predicted contacts") 325 | cf.plot_adjs([outs[k]["adj"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 326 | plt.savefig(os.path.join(I["output_dir"],f'predicted_contacts.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 327 | plt.show() 328 | 329 | print("predicted distogram") 330 | cf.plot_dists([outs[k]["dists"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 331 | plt.savefig(os.path.join(I["output_dir"],f'predicted_distogram.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 332 | plt.show() 333 | 334 | print("predicted LDDT") 335 | cf.plot_plddts([outs[k]["plddt"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 336 | plt.savefig(os.path.join(I["output_dir"],f'predicted_LDDT.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 337 | plt.show() 338 | 339 | def do_save_to_txt(filename, adj, dists, sequence): 340 | adj = np.asarray(adj) 341 | dists = np.asarray(dists) 342 | L = len(adj) 343 | with open(filename,"w") as out: 344 | out.write("i\tj\taa_i\taa_j\tp(cbcb<8)\tmaxdistbin\n") 345 | for i in range(L): 346 | for j in range(i+1,L): 347 | if dists[i][j] < 21.68 or adj[i][j] >= 0.001: 348 | line = f"{i}\t{j}\t{sequence[i]}\t{sequence[j]}\t{adj[i][j]:.3f}" 349 | line += f"\t>{dists[i][j]:.2f}" if dists[i][j] == 21.6875 else f"\t{dists[i][j]:.2f}" 350 | out.write(f"{line}\n") 351 | 352 | for n,key in enumerate(model_rank): 353 | if save_to_txt: 354 | txt_filename = os.path.join(I["output_dir"],f'rank_{n+1}_{key}.raw.txt') 355 | do_save_to_txt(txt_filename, 356 | outs[key]["adj"], 357 | outs[key]["dists"], 358 | mod_I["full_sequence"]) 359 | 360 | if use_ptm and save_pae_json: 361 | pae = outs[key]["pae"] 362 | max_pae = pae.max() 363 | # Save pLDDT and predicted aligned error (if it exists) 364 | pae_output_path = os.path.join(I["output_dir"],f'rank_{n+1}_{key}_pae.json') 365 | # Save predicted aligned error in the same format as the AF EMBL DB 366 | rounded_errors = np.round(np.asarray(pae), decimals=1) 367 | indices = np.indices((len(rounded_errors), len(rounded_errors))) + 1 368 | indices_1 = indices[0].flatten().tolist() 369 | indices_2 = indices[1].flatten().tolist() 370 | pae_data = json.dumps([{ 371 | 'residue1': indices_1, 372 | 'residue2': indices_2, 373 | 'distance': rounded_errors.flatten().tolist(), 374 | 'max_predicted_aligned_error': max_pae.item() 375 | }], 376 | indent=None, 377 | separators=(',', ':')) 378 | with open(pae_output_path, 'w') as f: 379 | f.write(pae_data) 380 | #%% 381 | -------------------------------------------------------------------------------- /v1.0.0/runner_af2advanced_old.py: -------------------------------------------------------------------------------- 1 | #%% 2 | ## command-line arguments 3 | import argparse 4 | parser = argparse.ArgumentParser(description="Runner script that can take command-line arguments") 5 | parser.add_argument("-i", "--input", help="Path to a FASTA file. Required.", required=True) 6 | parser.add_argument("-o", "--output_dir", default="", type=str, 7 | help="Path to a directory that will store the results. " 8 | "The default name is 'prediction_<jobname>_<first 5 chars of the sequence hash>'. ") 9 | parser.add_argument("-ho", "--homooligomer", default="1", type=str,  # "-ho": a plain "-h" would collide with argparse's built-in -h/--help 10 | help="homooligomer: Define number of copies in a homo-oligomeric assembly. " 11 | "For example, sequence:ABC:DEF, homooligomer: 2:1, " 12 | "the first protein ABC will be modeled as a homodimer (2 copies) and second DEF a monomer (1 copy). 
Default is 1.") 13 | parser.add_argument("-m", "--msa_method", default="mmseqs2", type=str, choices=["mmseqs2", "single_sequence"], 14 | help="Options to generate MSA." 15 | "mmseqs2 - FAST method from ColabFold (default) " 16 | "single_sequence - use single sequence input." 17 | "Default is 'mmseqs2'.") 18 | parser.add_argument("-p", "--pair_mode", default="unpaired", choices=["unpaired", "unpaired+paired", "paired"], 19 | help="Experimental option for protein complexes. " 20 | "Pairing currently only supported for proteins in same operon (prokaryotic genomes). " 21 | "unpaired - generate separate MSA for each protein. (default) " 22 | "unpaired+paired - attempt to pair sequences from the same operon within the genome. " 23 | "paired - only use sequences that were successfully paired. " 24 | "Default is 'unpaired'.") 25 | parser.add_argument("-pc", "--pair_cov", default=50, type=int, 26 | help="Options to prefilter each MSA before pairing. It might help if there are any paralogs in the complex. " 27 | "prefilter each MSA to minimum coverage with query (%%) before pairing. " 28 | "Default is 50.") 29 | parser.add_argument("-pq", "--pair_qid", default=20, type=int, 30 | help="Options to prefilter each MSA before pairing. It might help if there are any paralogs in the complex. " 31 | "prefilter each MSA to minimum sequence identity with query (%%) before pairing. " 32 | "Default is 20.") 33 | parser.add_argument("-b", "--rank_by", default="pLDDT", type=str, choices=["pLDDT", "pTMscore"], 34 | help="specify metric to use for ranking models (For protein-protein complexes, we recommend pTMscore). " 35 | "Default is 'pLDDT'.") 36 | parser.add_argument("-t", "--use_turbo", action='store_true', 37 | help="introduces a few modifications (compile once, swap params, adjust max_msa) to speedup and reduce memory requirements. " 38 | "Disable for default behavior.") 39 | parser.add_argument("-mm", "--max_msa", default="512:1024", type=str, 40 | help="max_msa defines: max_msa_clusters:max_extra_msa number of sequences to use. " 41 | "This option ignored if use_turbo is disabled. Default is '512:1024'.") 42 | parser.add_argument("-n", "--num_models", default=5, type=int, help="specify how many model params to try. (Default is 5)") 43 | parser.add_argument("-pt", "--use_ptm", action='store_true', 44 | help="uses Deepmind's ptm finetuned model parameters to get PAE per structure. " 45 | "Disable to use the original model params. (Disabling may give alternative structures.)") 46 | parser.add_argument("-e", "--num_ensemble", default=1, type=int, choices=[1, 8], 47 | help="the trunk of the network is run multiple times with different random choices for the MSA cluster centers. " 48 | "(1=default, 8=casp14 setting)") 49 | parser.add_argument("-r", "--max_recycles", default=3, type=int, help="controls the maximum number of times the structure is fed back into the neural network for refinement. (default is 3)") 50 | parser.add_argument("--tol", default=0, type=float, help="tolerance for deciding when to stop (CA-RMS between recycles)") 51 | parser.add_argument("--is_training", action='store_true', 52 | help="enables the stochastic part of the model (dropout), when coupled with num_samples can be used to 'sample' a diverse set of structures. False (NOT specifying this option) is recommended at first.") 53 | parser.add_argument("--num_samples", default=1, type=int, help="number of random_seeds to try. 
Default is 1.") 54 | parser.add_argument("--num_relax", default="None", choices=["None", "Top1", "Top5", "All"], 55 | help="num_relax is 'None' (default), 'Top1', 'Top5' or 'All'. Specify how many of the top ranked structures to relax.") 56 | args = parser.parse_args() 57 | ## command-line arguments 58 | 59 | ### Check your OS for localcolabfold 60 | import platform 61 | pf = platform.system() 62 | if pf == 'Windows': 63 | print('ColabFold on Windows') 64 | elif pf == 'Darwin': 65 | print('ColabFold on Mac') 66 | device="cpu" 67 | elif pf == 'Linux': 68 | print('ColabFold on Linux') 69 | device="gpu" 70 | #%% 71 | ### python code of AlphaFold2_advanced.ipynb 72 | import os 73 | import tensorflow as tf 74 | tf.config.set_visible_devices([], 'GPU') 75 | 76 | import jax 77 | 78 | from IPython.utils import io 79 | import subprocess 80 | import tqdm.notebook 81 | 82 | # --- Python imports --- 83 | import colabfold as cf 84 | import pairmsa 85 | import sys 86 | import pickle 87 | 88 | from urllib import request 89 | from concurrent import futures 90 | import json 91 | from matplotlib import gridspec 92 | import matplotlib.pyplot as plt 93 | import numpy as np 94 | import py3Dmol 95 | 96 | from urllib import request 97 | from concurrent import futures 98 | import json 99 | from matplotlib import gridspec 100 | import matplotlib.pyplot as plt 101 | import numpy as np 102 | import py3Dmol 103 | 104 | from alphafold.model import model 105 | from alphafold.model import config 106 | from alphafold.model import data 107 | 108 | from alphafold.data import parsers 109 | from alphafold.data import pipeline 110 | from alphafold.data.tools import jackhmmer 111 | 112 | from alphafold.common import protein 113 | 114 | def run_jackhmmer(sequence, prefix): 115 | 116 | fasta_path = f"{prefix}.fasta" 117 | with open(fasta_path, 'wt') as f: 118 | f.write(f'>query\n{sequence}') 119 | 120 | pickled_msa_path = f"{prefix}.jackhmmer.pickle" 121 | if os.path.isfile(pickled_msa_path): 122 | msas_dict = pickle.load(open(pickled_msa_path,"rb")) 123 | msas, deletion_matrices, names = (msas_dict[k] for k in ['msas', 'deletion_matrices', 'names']) 124 | full_msa = [] 125 | for msa in msas: 126 | full_msa += msa 127 | else: 128 | # --- Find the closest source --- 129 | test_url_pattern = 'https://storage.googleapis.com/alphafold-colab{:s}/latest/uniref90_2021_03.fasta.1' 130 | ex = futures.ThreadPoolExecutor(3) 131 | def fetch(source): 132 | request.urlretrieve(test_url_pattern.format(source)) 133 | return source 134 | fs = [ex.submit(fetch, source) for source in ['', '-europe', '-asia']] 135 | source = None 136 | for f in futures.as_completed(fs): 137 | source = f.result() 138 | ex.shutdown() 139 | break 140 | 141 | jackhmmer_binary_path = '/usr/bin/jackhmmer' 142 | dbs = [] 143 | 144 | num_jackhmmer_chunks = {'uniref90': 59, 'smallbfd': 17, 'mgnify': 71} 145 | total_jackhmmer_chunks = sum(num_jackhmmer_chunks.values()) 146 | with tqdm.notebook.tqdm(total=total_jackhmmer_chunks, bar_format=TQDM_BAR_FORMAT) as pbar: 147 | def jackhmmer_chunk_callback(i): 148 | pbar.update(n=1) 149 | 150 | pbar.set_description('Searching uniref90') 151 | jackhmmer_uniref90_runner = jackhmmer.Jackhmmer( 152 | binary_path=jackhmmer_binary_path, 153 | database_path=f'https://storage.googleapis.com/alphafold-colab{source}/latest/uniref90_2021_03.fasta', 154 | get_tblout=True, 155 | num_streamed_chunks=num_jackhmmer_chunks['uniref90'], 156 | streaming_callback=jackhmmer_chunk_callback, 157 | z_value=135301051) 158 | dbs.append(('uniref90', 
jackhmmer_uniref90_runner.query(fasta_path))) 159 | 160 | pbar.set_description('Searching smallbfd') 161 | jackhmmer_smallbfd_runner = jackhmmer.Jackhmmer( 162 | binary_path=jackhmmer_binary_path, 163 | database_path=f'https://storage.googleapis.com/alphafold-colab{source}/latest/bfd-first_non_consensus_sequences.fasta', 164 | get_tblout=True, 165 | num_streamed_chunks=num_jackhmmer_chunks['smallbfd'], 166 | streaming_callback=jackhmmer_chunk_callback, 167 | z_value=65984053) 168 | dbs.append(('smallbfd', jackhmmer_smallbfd_runner.query(fasta_path))) 169 | 170 | pbar.set_description('Searching mgnify') 171 | jackhmmer_mgnify_runner = jackhmmer.Jackhmmer( 172 | binary_path=jackhmmer_binary_path, 173 | database_path=f'https://storage.googleapis.com/alphafold-colab{source}/latest/mgy_clusters_2019_05.fasta', 174 | get_tblout=True, 175 | num_streamed_chunks=num_jackhmmer_chunks['mgnify'], 176 | streaming_callback=jackhmmer_chunk_callback, 177 | z_value=304820129) 178 | dbs.append(('mgnify', jackhmmer_mgnify_runner.query(fasta_path))) 179 | 180 | # --- Extract the MSAs and visualize --- 181 | # Extract the MSAs from the Stockholm files. 182 | # NB: deduplication happens later in pipeline.make_msa_features. 183 | 184 | mgnify_max_hits = 501 185 | msas = [] 186 | deletion_matrices = [] 187 | names = [] 188 | for db_name, db_results in dbs: 189 | unsorted_results = [] 190 | for i, result in enumerate(db_results): 191 | msa, deletion_matrix, target_names = parsers.parse_stockholm(result['sto']) 192 | e_values_dict = parsers.parse_e_values_from_tblout(result['tbl']) 193 | e_values = [e_values_dict[t.split('/')[0]] for t in target_names] 194 | zipped_results = zip(msa, deletion_matrix, target_names, e_values) 195 | if i != 0: 196 | # Only take query from the first chunk 197 | zipped_results = [x for x in zipped_results if x[2] != 'query'] 198 | unsorted_results.extend(zipped_results) 199 | sorted_by_evalue = sorted(unsorted_results, key=lambda x: x[3]) 200 | db_msas, db_deletion_matrices, db_names, _ = zip(*sorted_by_evalue) 201 | if db_msas: 202 | if db_name == 'mgnify': 203 | db_msas = db_msas[:mgnify_max_hits] 204 | db_deletion_matrices = db_deletion_matrices[:mgnify_max_hits] 205 | db_names = db_names[:mgnify_max_hits] 206 | msas.append(db_msas) 207 | deletion_matrices.append(db_deletion_matrices) 208 | names.append(db_names) 209 | msa_size = len(set(db_msas)) 210 | print(f'{msa_size} Sequences Found in {db_name}') 211 | 212 | pickle.dump({"msas":msas, 213 | "deletion_matrices":deletion_matrices, 214 | "names":names}, open(pickled_msa_path,"wb")) 215 | return msas, deletion_matrices, names 216 | 217 | #%% 218 | import re 219 | 220 | # --read sequence from input file-- 221 | from Bio import SeqIO 222 | 223 | def readfastafile(fastafile): 224 | records = list(SeqIO.parse(fastafile, "fasta")) 225 | if(len(records) != 1): 226 | raise ValueError('Input FASTA file must have a single ID/sequence.') 227 | else: 228 | return records[0].id, records[0].seq 229 | 230 | 231 | print("Input ID: {}".format(readfastafile(args.input)[0])) 232 | print("Input Sequence: {}".format(readfastafile(args.input)[1])) 233 | sequence = str(readfastafile(args.input)[1]) 234 | # --read sequence from input file-- 235 | sequence = re.sub("[^A-Z:/]", "", sequence.upper()) 236 | sequence = re.sub(":+",":",sequence) 237 | sequence = re.sub("/+","/",sequence) 238 | sequence = re.sub("^[:/]+","",sequence) 239 | sequence = re.sub("[:/]+$","",sequence) 240 | 241 | jobname = "test" #@param {type:"string"} 242 | jobname = re.sub(r'\W+', 
'', jobname) 243 | 244 | # define number of copies 245 | homooligomer = args.homooligomer #@param {type:"string"} 246 | homooligomer = re.sub("[:/]+",":",homooligomer) 247 | homooligomer = re.sub("^[:/]+","",homooligomer) 248 | homooligomer = re.sub("[:/]+$","",homooligomer) 249 | 250 | if len(homooligomer) == 0: homooligomer = "1" 251 | homooligomer = re.sub("[^0-9:]", "", homooligomer) 252 | homooligomers = [int(h) for h in homooligomer.split(":")] 253 | 254 | #@markdown - `sequence` Specify protein sequence to be modelled. 255 | #@markdown - Use `/` to specify intra-protein chainbreaks (for trimming regions within protein). 256 | #@markdown - Use `:` to specify inter-protein chainbreaks (for modeling protein-protein hetero-complexes). 257 | #@markdown - For example, sequence `AC/DE:FGH` will be modelled as polypeptides: `AC`, `DE` and `FGH`. A separate MSA will be generated for `ACDE` and `FGH`. 258 | #@markdown If `pair_msa` is enabled, `ACDE`'s MSA will be paired with `FGH`'s MSA. 259 | #@markdown - `homooligomer` Define number of copies in a homo-oligomeric assembly. 260 | #@markdown - Use `:` to specify a different homooligomeric state (copy number) for each component of the complex. 261 | #@markdown - For example, **sequence:**`ABC:DEF`, **homooligomer:** `2:1`, the first protein `ABC` will be modeled as a homodimer (2 copies) and second `DEF` a monomer (1 copy). 262 | 263 | ori_sequence = sequence 264 | sequence = sequence.replace("/","").replace(":","") 265 | seqs = ori_sequence.replace("/","").split(":") 266 | 267 | if len(seqs) != len(homooligomers): 268 | if len(homooligomers) == 1: 269 | homooligomers = [homooligomers[0]] * len(seqs) 270 | homooligomer = ":".join([str(h) for h in homooligomers]) 271 | else: 272 | while len(seqs) > len(homooligomers): 273 | homooligomers.append(1) 274 | homooligomers = homooligomers[:len(seqs)] 275 | homooligomer = ":".join([str(h) for h in homooligomers]) 276 | print("WARNING: Mismatch between number of breaks ':' in 'sequence' and 'homooligomer' definition") 277 | 278 | full_sequence = "".join([s*h for s,h in zip(seqs,homooligomers)]) 279 | 280 | # prediction directory 281 | # --set the output directory from command-line arguments 282 | if args.output_dir == "": 283 | output_dir = 'prediction_' + jobname + '_' + cf.get_hash(full_sequence)[:5] 284 | else: 285 | output_dir = args.output_dir 286 | # --set the output directory from command-line arguments 287 | 288 | os.makedirs(output_dir, exist_ok=True) 289 | # delete existing files in working directory 290 | for f in os.listdir(output_dir): 291 | os.remove(os.path.join(output_dir, f)) 292 | 293 | MIN_SEQUENCE_LENGTH = 16 294 | MAX_SEQUENCE_LENGTH = 2500 295 | 296 | aatypes = set('ACDEFGHIKLMNPQRSTVWY') # 20 standard aatypes 297 | if not set(full_sequence).issubset(aatypes): 298 | raise Exception(f'Input sequence contains non-amino acid letters: {set(sequence) - aatypes}. AlphaFold only supports 20 standard amino acids as inputs.') 299 | if len(full_sequence) < MIN_SEQUENCE_LENGTH: 300 | raise Exception(f'Input sequence is too short: {len(full_sequence)} amino acids, while the minimum is {MIN_SEQUENCE_LENGTH}') 301 | if len(full_sequence) > MAX_SEQUENCE_LENGTH: 302 | raise Exception(f'Input sequence is too long: {len(full_sequence)} amino acids, while the maximum is {MAX_SEQUENCE_LENGTH}. Please use the full AlphaFold system for long sequences.') 303 | 304 | if len(full_sequence) > 1400: 305 | print(f"WARNING: For a typical Google-Colab-GPU (16G) session, the max total length is ~1400 residues. 
You are at {len(full_sequence)}! Run Alphafold may crash.") 306 | 307 | print(f"homooligomer: '{homooligomer}'") 308 | print(f"total_length: '{len(full_sequence)}'") 309 | print(f"working_directory: '{output_dir}'") 310 | #%% 311 | TQDM_BAR_FORMAT = '{l_bar}{bar}| {n_fmt}/{total_fmt} [elapsed: {elapsed} remaining: {remaining}]' 312 | #@markdown Once this cell has been executed, you will see 313 | #@markdown statistics about the multiple sequence alignment 314 | #@markdown (MSA) that will be used by AlphaFold. In particular, 315 | #@markdown you’ll see how well each residue is covered by similar 316 | #@markdown sequences in the MSA. 317 | #@markdown (Note that the search against databases and the actual prediction can take some time, from minutes to hours, depending on the length of the protein and what type of GPU you are allocated by Colab.) 318 | 319 | #@markdown --- 320 | msa_method = args.msa_method #@param ["mmseqs2","jackhmmer","single_sequence","precomputed"] 321 | #@markdown --- 322 | #@markdown **custom msa options** 323 | add_custom_msa = False #@param {type:"boolean"} 324 | msa_format = "fas" #@param ["fas","a2m","a3m","sto","psi","clu"] 325 | #@markdown - `add_custom_msa` - If enabled, you'll get an option to upload your custom MSA in the specified `msa_format`. Note: Your MSA will be supplemented with those from 'mmseqs2' or 'jackhmmer', unless `msa_method` is set to 'single_sequence'. 326 | 327 | # --set the output directory from command-line arguments 328 | pair_mode = args.pair_mode #@param ["unpaired","unpaired+paired","paired"] {type:"string"} 329 | 330 | pair_cov = args.pair_cov #@param [0,25,50,75,90] {type:"raw"} 331 | pair_qid = args.pair_qid #@param [0,15,20,30,40,50] {type:"raw"} 332 | # --set the output directory from command-line arguments 333 | 334 | # --- Search against genetic databases --- 335 | os.makedirs('tmp', exist_ok=True) 336 | msas, deletion_matrices = [],[] 337 | 338 | if add_custom_msa: 339 | print(f"upload custom msa in '{msa_format}' format") 340 | msa_dict = files.upload() 341 | lines = msa_dict[list(msa_dict.keys())[0]].decode() 342 | 343 | # convert to a3m 344 | with open(f"tmp/upload.{msa_format}","w") as tmp_upload: 345 | tmp_upload.write(lines) 346 | os.system(f"reformat.pl {msa_format} a3m tmp/upload.{msa_format} tmp/upload.a3m") 347 | a3m_lines = open("tmp/upload.a3m","r").read() 348 | 349 | # parse 350 | msa, mtx = parsers.parse_a3m(a3m_lines) 351 | msas.append(msa) 352 | deletion_matrices.append(mtx) 353 | 354 | if len(msas[0][0]) != len(sequence): 355 | raise ValueError("ERROR: the length of msa does not match input sequence") 356 | 357 | if msa_method == "precomputed": 358 | print("upload precomputed pickled msa from previous run") 359 | pickled_msa_dict = files.upload() 360 | msas_dict = pickle.loads(pickled_msa_dict[list(pickled_msa_dict.keys())[0]]) 361 | msas, deletion_matrices = (msas_dict[k] for k in ['msas', 'deletion_matrices']) 362 | 363 | elif msa_method == "single_sequence": 364 | if len(msas) == 0: 365 | msas.append([sequence]) 366 | deletion_matrices.append([[0]*len(sequence)]) 367 | 368 | else: 369 | seqs = ori_sequence.replace('/','').split(':') 370 | _blank_seq = ["-" * len(seq) for seq in seqs] 371 | _blank_mtx = [[0] * len(seq) for seq in seqs] 372 | def _pad(ns,vals,mode): 373 | if mode == "seq": _blank = _blank_seq.copy() 374 | if mode == "mtx": _blank = _blank_mtx.copy() 375 | if isinstance(ns, list): 376 | for n,val in zip(ns,vals): _blank[n] = val 377 | else: _blank[ns] = vals 378 | if mode == "seq": return 
"".join(_blank) 379 | if mode == "mtx": return sum(_blank,[]) 380 | 381 | if len(seqs) == 1 or "unpaired" in pair_mode: 382 | # gather msas 383 | if msa_method == "mmseqs2": 384 | prefix = cf.get_hash("".join(seqs)) 385 | prefix = os.path.join('tmp',prefix) 386 | print(f"running mmseqs2") 387 | A3M_LINES = cf.run_mmseqs2(seqs, prefix, filter=True) 388 | 389 | for n, seq in enumerate(seqs): 390 | # tmp directory 391 | prefix = cf.get_hash(seq) 392 | prefix = os.path.join('tmp',prefix) 393 | 394 | if msa_method == "mmseqs2": 395 | # run mmseqs2 396 | a3m_lines = A3M_LINES[n] 397 | msa, mtx = parsers.parse_a3m(a3m_lines) 398 | msas_, mtxs_ = [msa],[mtx] 399 | 400 | elif msa_method == "jackhmmer": 401 | print(f"running jackhmmer on seq_{n}") 402 | # run jackhmmer 403 | msas_, mtxs_, names_ = ([sum(x,())] for x in run_jackhmmer(seq, prefix)) 404 | 405 | # pad sequences 406 | for msa_,mtx_ in zip(msas_,mtxs_): 407 | msa,mtx = [sequence],[[0]*len(sequence)] 408 | for s,m in zip(msa_,mtx_): 409 | msa.append(_pad(n,s,"seq")) 410 | mtx.append(_pad(n,m,"mtx")) 411 | 412 | msas.append(msa) 413 | deletion_matrices.append(mtx) 414 | 415 | #################################################################################### 416 | # PAIR_MSA 417 | #################################################################################### 418 | 419 | if len(seqs) > 1 and (pair_mode == "paired" or pair_mode == "unpaired+paired"): 420 | print("attempting to pair some sequences...") 421 | 422 | if msa_method == "mmseqs2": 423 | prefix = cf.get_hash("".join(seqs)) 424 | prefix = os.path.join('tmp',prefix) 425 | print(f"running mmseqs2_noenv_nofilter on all seqs") 426 | A3M_LINES = cf.run_mmseqs2(seqs, prefix, use_env=False, use_filter=False) 427 | 428 | _data = [] 429 | for a in range(len(seqs)): 430 | print(f"prepping seq_{a}") 431 | _seq = seqs[a] 432 | _prefix = os.path.join('tmp',cf.get_hash(_seq)) 433 | 434 | if msa_method == "mmseqs2": 435 | a3m_lines = A3M_LINES[a] 436 | _msa, _mtx, _lab = pairmsa.parse_a3m(a3m_lines, 437 | filter_qid=pair_qid/100, 438 | filter_cov=pair_cov/100) 439 | 440 | elif msa_method == "jackhmmer": 441 | _msas, _mtxs, _names = run_jackhmmer(_seq, _prefix) 442 | _msa, _mtx, _lab = pairmsa.get_uni_jackhmmer(_msas[0], _mtxs[0], _names[0], 443 | filter_qid=pair_qid/100, 444 | filter_cov=pair_cov/100) 445 | 446 | if len(_msa) > 1: 447 | _data.append(pairmsa.hash_it(_msa, _lab, _mtx, call_uniprot=False)) 448 | else: 449 | _data.append(None) 450 | 451 | Ln = len(seqs) 452 | O = [[None for _ in seqs] for _ in seqs] 453 | for a in range(Ln): 454 | if _data[a] is not None: 455 | for b in range(a+1,Ln): 456 | if _data[b] is not None: 457 | print(f"attempting pairwise stitch for {a} {b}") 458 | O[a][b] = pairmsa._stitch(_data[a],_data[b]) 459 | _seq_a, _seq_b, _mtx_a, _mtx_b = (*O[a][b]["seq"],*O[a][b]["mtx"]) 460 | 461 | ############################################## 462 | # filter to remove redundant sequences 463 | ############################################## 464 | ok = [] 465 | with open("tmp/tmp.fas","w") as fas_file: 466 | fas_file.writelines([f">{n}\n{a+b}\n" for n,(a,b) in enumerate(zip(_seq_a,_seq_b))]) 467 | os.system("hhfilter -maxseq 1000000 -i tmp/tmp.fas -o tmp/tmp.id90.fas -id 90") 468 | for line in open("tmp/tmp.id90.fas","r"): 469 | if line.startswith(">"): ok.append(int(line[1:])) 470 | ############################################## 471 | print(f"found {len(_seq_a)} pairs ({len(ok)} after filtering)") 472 | 473 | if len(_seq_a) > 0: 474 | msa,mtx = [sequence],[[0]*len(sequence)] 
475 | for s_a,s_b,m_a,m_b in zip(_seq_a, _seq_b, _mtx_a, _mtx_b): 476 | msa.append(_pad([a,b],[s_a,s_b],"seq")) 477 | mtx.append(_pad([a,b],[m_a,m_b],"mtx")) 478 | msas.append(msa) 479 | deletion_matrices.append(mtx) 480 | 481 | ''' 482 | # triwise stitching (WIP) 483 | if Ln > 2: 484 | for a in range(Ln): 485 | for b in range(a+1,Ln): 486 | for c in range(b+1,Ln): 487 | if O[a][b] is not None and O[b][c] is not None: 488 | print(f"attempting triwise stitch for {a} {b} {c}") 489 | list_ab = O[a][b]["lab"][1] 490 | list_bc = O[b][c]["lab"][0] 491 | msa,mtx = [sequence],[[0]*len(sequence)] 492 | for i,l_b in enumerate(list_ab): 493 | if l_b in list_bc: 494 | j = list_bc.index(l_b) 495 | s_a = O[a][b]["seq"][0][i] 496 | s_b = O[a][b]["seq"][1][i] 497 | s_c = O[b][c]["seq"][1][j] 498 | 499 | m_a = O[a][b]["mtx"][0][i] 500 | m_b = O[a][b]["mtx"][1][i] 501 | m_c = O[b][c]["mtx"][1][j] 502 | 503 | msa.append(_pad([a,b,c],[s_a,s_b,s_c],"seq")) 504 | mtx.append(_pad([a,b,c],[m_a,m_b,m_c],"mtx")) 505 | if len(msa) > 1: 506 | msas.append(msa) 507 | deletion_matrices.append(mtx) 508 | print(f"found {len(msa)} triplets") 509 | ''' 510 | #################################################################################### 511 | #################################################################################### 512 | 513 | # save MSA as pickle 514 | pickle.dump({"msas":msas,"deletion_matrices":deletion_matrices}, 515 | open(os.path.join(output_dir,"msa.pickle"),"wb")) 516 | 517 | make_msa_plot = len(msas[0]) > 1 518 | if make_msa_plot: 519 | plt = cf.plot_msas(msas, ori_sequence) 520 | plt.savefig(os.path.join(output_dir,"msa_coverage.png"), bbox_inches = 'tight', dpi=300) 521 | #%% 522 | ##@title run alphafold 523 | # --------set parameters from command-line arguments-------- 524 | num_relax = args.num_relax 525 | rank_by = args.rank_by 526 | 527 | use_turbo = True if args.use_turbo else False 528 | max_msa = args.max_msa 529 | # --------set parameters from command-line arguments-------- 530 | 531 | max_msa_clusters, max_extra_msa = [int(x) for x in max_msa.split(":")] 532 | 533 | 534 | 535 | #@markdown - `rank_by` specify metric to use for ranking models (For protein-protein complexes, we recommend pTMscore) 536 | #@markdown - `use_turbo` introduces a few modifications (compile once, swap params, adjust max_msa) to speedup and reduce memory requirements. Disable for default behavior. 537 | #@markdown - `max_msa` defines: `max_msa_clusters:max_extra_msa` number of sequences to use. When adjusting after GPU crash, be sure to `Runtime` → `Restart runtime`. (Lowering will reduce GPU requirements, but may result in poor model quality. This option ignored if `use_turbo` is disabled) 538 | show_images = True #@param {type:"boolean"} 539 | #@markdown - `show_images` To make things more exciting we show images of the predicted structures as they are being generated. (WARNING: the order of images displayed does not reflect any ranking). 540 | #@markdown --- 541 | #@markdown #### Sampling options 542 | #@markdown There are two stochastic parts of the pipeline. Within the feature generation (choice of cluster centers) and within the model (dropout). 543 | #@markdown To get structure diversity, you can iterate through a fixed number of random_seeds (using `num_samples`) and/or enable dropout (using `is_training`). 
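# ---------------------------------------------------------------------------
# [editor's note] How the sampling options above multiply out: each model is
# run once per random seed, so num_models * num_samples structures are
# produced, keyed exactly as in the run loop further below. Values here are
# illustrative.
'''
num_models, num_samples, use_ptm = 2, 3, True
model_names = ['model_1', 'model_2', 'model_3', 'model_4', 'model_5'][:num_models]
keys = [f"{m}{'_ptm' if use_ptm else ''}_seed_{s}"
        for s in range(num_samples) for m in model_names]
# -> 6 runs: model_1_ptm_seed_0, model_2_ptm_seed_0, ..., model_2_ptm_seed_2
'''
# ---------------------------------------------------------------------------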
544 | 
545 | # --------set parameters from command-line arguments--------
546 | num_models = args.num_models
547 | use_ptm = True if args.use_ptm else False
548 | num_ensemble = args.num_ensemble
549 | max_recycles = args.max_recycles
550 | tol = args.tol
551 | is_training = True if args.is_training else False
552 | num_samples = args.num_samples
553 | # --------set parameters from command-line arguments--------
554 | 
555 | subsample_msa = True #@param {type:"boolean"}
556 | #@markdown - `subsample_msa` subsamples a large MSA to `3E7/length` sequences (e.g. 30,000 sequences for a 1,000-residue input) to avoid crashing the preprocessing protocol. (This option is ignored if `use_turbo` is disabled.)
557 | 
558 | save_pae_json = True
559 | save_tmp_pdb = True
560 | 
561 | 
562 | if not use_ptm and rank_by == "pTMscore":
563 |   print("WARNING: models will be ranked by pLDDT, 'use_ptm' is needed to compute pTMscore")
564 |   rank_by = "pLDDT"
565 | 
566 | #############################
567 | # delete old files
568 | #############################
569 | for f in os.listdir(output_dir):
570 |   if "rank_" in f:
571 |     os.remove(os.path.join(output_dir, f))
572 | 
573 | #############################
574 | # homooligomerize
575 | #############################
576 | lengths = [len(seq) for seq in seqs]
577 | msas_mod, deletion_matrices_mod = cf.homooligomerize_heterooligomer(msas, deletion_matrices,
578 |                                                                    lengths, homooligomers)
579 | #############################
580 | # define input features
581 | #############################
582 | def _placeholder_template_feats(num_templates_, num_res_):
583 |   return {
584 |       'template_aatype': np.zeros([num_templates_, num_res_, 22], np.float32),
585 |       'template_all_atom_masks': np.zeros([num_templates_, num_res_, 37], np.float32),        # per-atom mask: [templates, residues, 37]
586 |       'template_all_atom_positions': np.zeros([num_templates_, num_res_, 37, 3], np.float32), # coordinates: [templates, residues, 37, 3]
587 |       'template_domain_names': np.zeros([num_templates_], np.float32),
588 |       'template_sum_probs': np.zeros([num_templates_], np.float32),
589 |   }
590 | 
591 | num_res = len(full_sequence)
592 | feature_dict = {}
593 | feature_dict.update(pipeline.make_sequence_features(full_sequence, 'test', num_res))
594 | feature_dict.update(pipeline.make_msa_features(msas_mod, deletion_matrices=deletion_matrices_mod))
595 | if not use_turbo:
596 |   feature_dict.update(_placeholder_template_feats(0, num_res))
597 | 
598 | def do_subsample_msa(F, random_seed=0):
599 |   '''subsample msa to avoid running out of memory'''
600 |   N = len(F["msa"])
601 |   L = len(F["residue_index"])
602 |   N_ = int(3E7/L)
603 |   if N > N_:
604 |     print(f"whhhaaa... 
too many sequences ({N}) subsampling to {N_}") 605 | np.random.seed(random_seed) 606 | idx = np.append(0,np.random.permutation(np.arange(1,N)))[:N_] 607 | F_ = {} 608 | F_["msa"] = F["msa"][idx] 609 | F_["deletion_matrix_int"] = F["deletion_matrix_int"][idx] 610 | F_["num_alignments"] = np.full_like(F["num_alignments"],N_) 611 | for k in ['aatype', 'between_segment_residues', 612 | 'domain_name', 'residue_index', 613 | 'seq_length', 'sequence']: 614 | F_[k] = F[k] 615 | return F_ 616 | else: 617 | return F 618 | 619 | ################################ 620 | # set chain breaks 621 | ################################ 622 | Ls = [] 623 | for seq,h in zip(ori_sequence.split(":"),homooligomers): 624 | Ls += [len(s) for s in seq.split("/")] * h 625 | Ls_plot = sum([[len(seq)]*h for seq,h in zip(seqs,homooligomers)],[]) 626 | feature_dict['residue_index'] = cf.chain_break(feature_dict['residue_index'], Ls) 627 | 628 | ########################### 629 | # run alphafold 630 | ########################### 631 | def parse_results(prediction_result, processed_feature_dict): 632 | b_factors = prediction_result['plddt'][:,None] * prediction_result['structure_module']['final_atom_mask'] 633 | dist_bins = jax.numpy.append(0,prediction_result["distogram"]["bin_edges"]) 634 | dist_mtx = dist_bins[prediction_result["distogram"]["logits"].argmax(-1)] 635 | contact_mtx = jax.nn.softmax(prediction_result["distogram"]["logits"])[:,:,dist_bins < 8].sum(-1) 636 | 637 | out = {"unrelaxed_protein": protein.from_prediction(processed_feature_dict, prediction_result, b_factors=b_factors), 638 | "plddt": prediction_result['plddt'], 639 | "pLDDT": prediction_result['plddt'].mean(), 640 | "dists": dist_mtx, 641 | "adj": contact_mtx} 642 | 643 | if "ptm" in prediction_result: 644 | out.update({"pae": prediction_result['predicted_aligned_error'], 645 | "pTMscore": prediction_result['ptm']}) 646 | return out 647 | 648 | model_names = ['model_1', 'model_2', 'model_3', 'model_4', 'model_5'][:num_models] 649 | total = len(model_names) * num_samples 650 | with tqdm.notebook.tqdm(total=total, bar_format=TQDM_BAR_FORMAT) as pbar: 651 | ####################################################################### 652 | # precompile model and recompile only if length changes 653 | ####################################################################### 654 | if use_turbo: 655 | name = "model_5_ptm" if use_ptm else "model_5" 656 | N = len(feature_dict["msa"]) 657 | L = len(feature_dict["residue_index"]) 658 | compiled = (N, L, use_ptm, max_recycles, tol, num_ensemble, max_msa, is_training) 659 | if "COMPILED" in dir(): 660 | if COMPILED != compiled: recompile = True 661 | else: recompile = True 662 | if recompile: 663 | cf.clear_mem("gpu") 664 | cfg = config.model_config(name) 665 | 666 | # set size of msa (to reduce memory requirements) 667 | msa_clusters = min(N, max_msa_clusters) 668 | cfg.data.eval.max_msa_clusters = msa_clusters 669 | cfg.data.common.max_extra_msa = max(min(N-msa_clusters,max_extra_msa),1) 670 | 671 | cfg.data.common.num_recycle = max_recycles 672 | cfg.model.num_recycle = max_recycles 673 | cfg.model.recycle_tol = tol 674 | cfg.data.eval.num_ensemble = num_ensemble 675 | 676 | params = data.get_model_haiku_params(name,'./alphafold/data') 677 | model_runner = model.RunModel(cfg, params, is_training=is_training) 678 | COMPILED = compiled 679 | recompile = False 680 | 681 | else: 682 | cf.clear_mem("gpu") 683 | recompile = True 684 | 685 | # cleanup 686 | if "outs" in dir(): del outs 687 | outs = {} 688 | cf.clear_mem("cpu") 
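To make the chain-break bookkeeping above concrete, here is a small worked example; it is a standalone sketch, and the toy sequence and homooligomer counts are invented. `ori_sequence` uses `:` to separate chains and `/` to mark intra-chain breaks, and `Ls` collects the segment lengths that `cf.chain_break` turns into `residue_index` jumps, so concatenated segments are treated as separate chains:

```python
# Standalone worked example; the toy sequence and homooligomer counts are invented.
ori_sequence = "AAAA/BBB:CC"   # ":" separates chains, "/" marks an intra-chain break
homooligomers = [2, 1]         # two copies of the first chain, one of the second

seqs = [s.replace("/", "") for s in ori_sequence.split(":")]  # ["AAAABBB", "CC"]

# same construction as in the runner above
Ls = []
for seq, h in zip(ori_sequence.split(":"), homooligomers):
    Ls += [len(s) for s in seq.split("/")] * h

print(Ls)  # [4, 3, 4, 3, 2] -> residue_index gets a jump at each segment boundary
```

Because AlphaFold only sees the offset `residue_index`, this is what lets a single concatenated input behave like a multi-chain complex.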
689 | 
690 |   #######################################################################
691 |   def report(key):
692 |     pbar.update(n=1)
693 |     o = outs[key]
694 |     line = f"{key} recycles:{o['recycles']} tol:{o['tol']:.2f} pLDDT:{o['pLDDT']:.2f}"
695 |     if use_ptm: line += f" pTMscore:{o['pTMscore']:.2f}"
696 |     print(line)
697 |     if show_images:
698 |       fig = cf.plot_protein(o['unrelaxed_protein'], Ls=Ls_plot, dpi=100)
699 |       # plt.show()
700 |       plt.ion()
701 |     if save_tmp_pdb:
702 |       tmp_pdb_path = os.path.join(output_dir,f'unranked_{key}_unrelaxed.pdb')
703 |       pdb_lines = protein.to_pdb(o['unrelaxed_protein'])
704 |       with open(tmp_pdb_path, 'w') as f: f.write(pdb_lines)
705 | 
706 |   if use_turbo:
707 |     # go through each random_seed
708 |     for seed in range(num_samples):
709 | 
710 |       # prep input features (subsampling the MSA when requested)
711 |       if subsample_msa:
712 |         feats = do_subsample_msa(feature_dict, random_seed=seed)
713 |       else:
714 |         feats = feature_dict
715 |       processed_feature_dict = model_runner.process_features(feats, random_seed=seed)
716 | 
717 |       # go through each model
718 |       for num, model_name in enumerate(model_names):
719 |         name = model_name+"_ptm" if use_ptm else model_name
720 |         key = f"{name}_seed_{seed}"
721 |         pbar.set_description(f'Running {key}')
722 | 
723 |         # replace model parameters
724 |         params = data.get_model_haiku_params(name, './alphafold/data')
725 |         for k in model_runner.params.keys():
726 |           model_runner.params[k] = params[k]
727 | 
728 |         # predict
729 |         prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed),"cpu")
730 | 
731 |         # save results
732 |         outs[key] = parse_results(prediction_result, processed_feature_dict)
733 |         outs[key].update({"recycles":r, "tol":t})
734 |         report(key)
735 | 
736 |         del prediction_result, params
737 |       del feats, processed_feature_dict  # del only unbinds the names, so this is safe whether or not subsampling ran
738 | 
739 |   else:
740 |     # go through each model
741 |     for num, model_name in enumerate(model_names):
742 |       name = model_name+"_ptm" if use_ptm else model_name
743 |       params = data.get_model_haiku_params(name, './alphafold/data')
744 |       cfg = config.model_config(name)
745 |       cfg.data.common.num_recycle = cfg.model.num_recycle = max_recycles
746 |       cfg.model.recycle_tol = tol
747 |       cfg.data.eval.num_ensemble = num_ensemble
748 |       model_runner = model.RunModel(cfg, params, is_training=is_training)
749 | 
750 |       # go through each random_seed
751 |       for seed in range(num_samples):
752 |         key = f"{name}_seed_{seed}"
753 |         pbar.set_description(f'Running {key}')
754 |         processed_feature_dict = model_runner.process_features(feature_dict, random_seed=seed)
755 |         prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed),"cpu")
756 |         outs[key] = parse_results(prediction_result, processed_feature_dict)
757 |         outs[key].update({"recycles":r, "tol":t})
758 |         report(key)
759 | 
760 |         # cleanup
761 |         del processed_feature_dict, prediction_result
762 | 
763 |       del params, model_runner, cfg
764 |       cf.clear_mem("gpu")
765 | 
766 | # delete old ranked/unranked outputs before writing the final set
767 | for f in os.listdir(output_dir):
768 |   if "rank" in f:
769 |     os.remove(os.path.join(output_dir, f))
770 | 
771 | # Rank models by the chosen metric (rank_by: pLDDT or pTMscore). 
772 | model_rank = list(outs.keys()) 773 | model_rank = [model_rank[i] for i in np.argsort([outs[x][rank_by] for x in model_rank])[::-1]] 774 | 775 | # Write out the prediction 776 | for n,key in enumerate(model_rank): 777 | prefix = f"rank_{n+1}_{key}" 778 | pred_output_path = os.path.join(output_dir,f'{prefix}_unrelaxed.pdb') 779 | fig = cf.plot_protein(outs[key]["unrelaxed_protein"], Ls=Ls_plot, dpi=200) 780 | plt.savefig(os.path.join(output_dir,f'{prefix}.png'), bbox_inches = 'tight') 781 | plt.close(fig) 782 | 783 | pdb_lines = protein.to_pdb(outs[key]["unrelaxed_protein"]) 784 | with open(pred_output_path, 'w') as f: 785 | f.write(pdb_lines) 786 | 787 | ############################################################ 788 | print(f"model rank based on {rank_by}") 789 | for n,key in enumerate(model_rank): 790 | print(f"rank_{n+1}_{key} {rank_by}:{outs[key][rank_by]:.2f}") 791 | #%% 792 | #@title Refine structures with Amber-Relax (Optional) 793 | 794 | # --------set parameters from command-line arguments-------- 795 | num_relax = args.num_relax 796 | # --------set parameters from command-line arguments-------- 797 | 798 | if num_relax == "None": 799 | num_relax = 0 800 | elif num_relax == "Top1": 801 | num_relax = 1 802 | elif num_relax == "Top5": 803 | num_relax = 5 804 | else: 805 | num_relax = len(model_names) * num_samples 806 | 807 | if num_relax > 0: 808 | if "relax" not in dir(): 809 | # add conda environment to path 810 | sys.path.append('./colabfold-conda/lib/python3.7/site-packages') 811 | 812 | # import libraries 813 | from alphafold.relax import relax 814 | from alphafold.relax import utils 815 | 816 | with tqdm.notebook.tqdm(total=num_relax, bar_format=TQDM_BAR_FORMAT) as pbar: 817 | pbar.set_description(f'AMBER relaxation') 818 | for n,key in enumerate(model_rank): 819 | if n < num_relax: 820 | prefix = f"rank_{n+1}_{key}" 821 | pred_output_path = os.path.join(output_dir,f'{prefix}_relaxed.pdb') 822 | if not os.path.isfile(pred_output_path): 823 | amber_relaxer = relax.AmberRelaxation( 824 | max_iterations=0, 825 | tolerance=2.39, 826 | stiffness=10.0, 827 | exclude_residues=[], 828 | max_outer_iterations=20) 829 | relaxed_pdb_lines, _, _ = amber_relaxer.process(prot=outs[key]["unrelaxed_protein"]) 830 | with open(pred_output_path, 'w') as f: 831 | f.write(relaxed_pdb_lines) 832 | pbar.update(n=1) 833 | #%% 834 | #@title Display 3D structure {run: "auto"} 835 | rank_num = 1 #@param ["1", "2", "3", "4", "5"] {type:"raw"} 836 | color = "lDDT" #@param ["chain", "lDDT", "rainbow"] 837 | show_sidechains = False #@param {type:"boolean"} 838 | show_mainchains = False #@param {type:"boolean"} 839 | 840 | key = model_rank[rank_num-1] 841 | prefix = f"rank_{rank_num}_{key}" 842 | pred_output_path = os.path.join(output_dir,f'{prefix}_relaxed.pdb') 843 | if not os.path.isfile(pred_output_path): 844 | pred_output_path = os.path.join(output_dir,f'{prefix}_unrelaxed.pdb') 845 | 846 | cf.show_pdb(pred_output_path, show_sidechains, show_mainchains, color, Ls=Ls_plot).show() 847 | if color == "lDDT": cf.plot_plddt_legend().show() 848 | if use_ptm: 849 | cf.plot_confidence(outs[key]["plddt"], outs[key]["pae"], Ls=Ls_plot).show() 850 | else: 851 | cf.plot_confidence(outs[key]["plddt"], Ls=Ls_plot).show() 852 | #%% 853 | #@title Extra outputs 854 | dpi = 300#@param {type:"integer"} 855 | save_to_txt = True #@param {type:"boolean"} 856 | save_pae_json = True #@param {type:"boolean"} 857 | #@markdown - save data used to generate contact and distogram plots below to text file (pae values can be 
found in json file if `use_ptm` is enabled) 858 | 859 | if use_ptm: 860 | print("predicted alignment error") 861 | cf.plot_paes([outs[k]["pae"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 862 | plt.savefig(os.path.join(output_dir,f'predicted_alignment_error.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 863 | # plt.show() 864 | 865 | print("predicted contacts") 866 | cf.plot_adjs([outs[k]["adj"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 867 | plt.savefig(os.path.join(output_dir,f'predicted_contacts.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 868 | # plt.show() 869 | 870 | print("predicted distogram") 871 | cf.plot_dists([outs[k]["dists"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 872 | plt.savefig(os.path.join(output_dir,f'predicted_distogram.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 873 | # plt.show() 874 | 875 | print("predicted LDDT") 876 | cf.plot_plddts([outs[k]["plddt"] for k in model_rank], Ls=Ls_plot, dpi=dpi) 877 | plt.savefig(os.path.join(output_dir,f'predicted_LDDT.png'), bbox_inches = 'tight', dpi=np.maximum(200,dpi)) 878 | # plt.show() 879 | 880 | def do_save_to_txt(filename, adj, dists): 881 | adj = np.asarray(adj) 882 | dists = np.asarray(dists) 883 | L = len(adj) 884 | with open(filename,"w") as out: 885 | out.write("i\tj\taa_i\taa_j\tp(cbcb<8)\tmaxdistbin\n") 886 | for i in range(L): 887 | for j in range(i+1,L): 888 | if dists[i][j] < 21.68 or adj[i][j] >= 0.001: 889 | line = f"{i+1}\t{j+1}\t{full_sequence[i]}\t{full_sequence[j]}\t{adj[i][j]:.3f}" 890 | line += f"\t>{dists[i][j]:.2f}" if dists[i][j] == 21.6875 else f"\t{dists[i][j]:.2f}" 891 | out.write(f"{line}\n") 892 | 893 | for n,key in enumerate(model_rank): 894 | if save_to_txt: 895 | txt_filename = os.path.join(output_dir,f'rank_{n+1}_{key}.raw.txt') 896 | do_save_to_txt(txt_filename,adj=outs[key]["adj"],dists=outs[key]["dists"]) 897 | 898 | if use_ptm and save_pae_json: 899 | pae = outs[key]["pae"] 900 | max_pae = pae.max() 901 | # Save pLDDT and predicted aligned error (if it exists) 902 | pae_output_path = os.path.join(output_dir,f'rank_{n+1}_{key}_pae.json') 903 | # Save predicted aligned error in the same format as the AF EMBL DB 904 | rounded_errors = np.round(np.asarray(pae), decimals=1) 905 | indices = np.indices((len(rounded_errors), len(rounded_errors))) + 1 906 | indices_1 = indices[0].flatten().tolist() 907 | indices_2 = indices[1].flatten().tolist() 908 | pae_data = json.dumps([{ 909 | 'residue1': indices_1, 910 | 'residue2': indices_2, 911 | 'distance': rounded_errors.flatten().tolist(), 912 | 'max_predicted_aligned_error': max_pae.item() 913 | }], 914 | indent=None, 915 | separators=(',', ':')) 916 | with open(pae_output_path, 'w') as f: 917 | f.write(pae_data) 918 | #%% 919 | --------------------------------------------------------------------------------
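A closing note on the PAE output above: each `rank_{n}_{key}_pae.json` follows the AlphaFold-EBI database layout, a one-element list whose `distance` field is the row-major flattened PAE matrix with 1-based `residue1`/`residue2` indices. Below is a minimal sketch for reading such a file back; the file name is a hypothetical instance of the pattern, and everything else mirrors the keys the code writes:

```python
import json
import numpy as np

# hypothetical file name following the rank_{n}_{key}_pae.json pattern above
with open("prediction/rank_1_model_1_ptm_seed_0_pae.json") as fh:
    entry = json.load(fh)[0]  # one-element list, AF-EBI database style

n = int(round(len(entry["distance"]) ** 0.5))  # the square matrix was flattened row-major
pae = np.asarray(entry["distance"], dtype=float).reshape(n, n)
print(pae.shape, entry["max_predicted_aligned_error"])
```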