├── LICENSE
├── pLM.md
├── proteinsequencedesign.md
└── README.md
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Christian Dallago
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/pLM.md:
--------------------------------------------------------------------------------
1 | # Protein Language Models
2 | (sorted by number of parameters)
3 | | Name | Params | Paper | Code | Notes |
4 | | :-------- | ------- | --------- | ------- | --------- |
5 | | ESM2 | 8M - 15B | [bioRxiv](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1)| [Code](https://github.com/facebookresearch/esm) ||
6 | | ProGen2 | 151M - 6.4B | [arXiv](https://arxiv.org/abs/2206.13517) | [Code](https://github.com/salesforce/progen/tree/main/progen2) ||
7 | | ProtTrans | 420M - 3B | [Paper](https://ieeexplore.ieee.org/document/9477085/) | [Code](https://github.com/agemagician/ProtTrans) |BFD+UniRef50|
8 | | ProteinLM | 200M, 3B | [arXiv](https://arxiv.org/abs/2108.07435) | [Code](https://github.com/THUDM/ProteinLM) ||
9 | | RITA | 85M - 1.2B | [arXiv](https://arxiv.org/abs/2205.05789) | [Code](https://github.com/lightonai/RITA) ||
10 | | ProGen1 | 1.2B | [bioRxiv](https://www.biorxiv.org/content/10.1101/2020.03.07.982272v2) | [Code](https://github.com/salesforce/progen) ||
11 | | ProtGPT2 | 738M | [Paper](https://www.nature.com/articles/s41467-022-32007-7) | [Code](https://huggingface.co/nferruz/ProtGPT2) ||
12 | | Tranception | 700M | [Paper](https://proceedings.mlr.press/v162/notin22a.html) | [Code](https://github.com/OATML-Markslab/Tranception) ||
13 | | ESM1 | 43M - 670M | [Paper](https://www.pnas.org/doi/10.1073/pnas.2016239118) | [Code](https://github.com/facebookresearch/esm) ||
14 | | DistilProtBert | 230M |[bioRxiv](https://www.biorxiv.org/content/early/2022/05/10/2022.05.09.491157) | [Code](https://github.com/yarongef/DistilProtBert) ||
15 | | DARK | 128M | [bioRxiv](https://www.biorxiv.org/content/10.1101/2022.01.27.478087v1)|||
16 | | TAPE | 38M | [arXiv](https://arxiv.org/abs/1906.08230) | [Code](https://github.com/songlab-cal/tape) ||
17 | | ProteinBERT | 16M | [Paper](https://doi.org/10.1093/bioinformatics/btac020) | [Code](https://github.com/nadavbra/protein_bert), [PyTorch](https://github.com/lucidrains/protein-bert-pytorch) |~106M proteins from UniRef90; 28 days over ~670M records (i.e. ~6.4 iterations)|
18 | | AminoBERT || [bioRxiv](https://www.biorxiv.org/content/10.1101/2021.08.02.454840v1) |||
19 |
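Most entries above expose a similar embed-a-sequence workflow. As a minimal, hedged sketch using the `fair-esm` package (`pip install fair-esm`); the checkpoint name and layer index are assumptions that may differ between releases:

```python
import torch
import esm  # pip install fair-esm

# Load a small ESM-2 checkpoint (8M parameters, 6 layers); larger checkpoints follow the same pattern
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[6])  # per-residue representations from the final (6th) layer

embeddings = out["representations"][6]  # shape: (batch, sequence_length + 2, embedding_dim)
print(embeddings.shape)
```

Mean-pooling the per-residue vectors (excluding the BOS/EOS positions) is a common way to obtain a single fixed-length embedding per protein.
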
20 | ## Special-purpose pLMs
21 | | Name | Params | Paper | Code | Notes |
22 | | :-------- | ------- | --------- | ------- | --------- |
23 | | PeTriBERT | 40M | [bioRxiv](https://www.biorxiv.org/content/10.1101/2022.08.10.503344v1) | N/A | Optimized for protein design |
24 |
25 | ## Non-transformer-based sequence models
26 | | Name | Params | Paper | Code | Notes |
27 | | :-------- | ------- | --------- | ------- | --------- |
28 | | CARP | 600K - 640M | [bioRxiv](https://www.biorxiv.org/content/10.1101/2022.05.19.492714v2) | [Code](https://github.com/microsoft/protein-sequence-models)| CNN |
29 | | SeqVec | 93M | [Paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8) | [Code](https://github.com/mheinzinger/SeqVec)| bidirectional LSTM; UniRef50 |
30 | | UniRep | 90M | [Paper](https://www.nature.com/articles/s41592-019-0598-1)| [Code](https://github.com/churchlab/UniRep)| mLSTM |
31 | | ProSE | 24M | [Paper](https://www.sciencedirect.com/science/article/pii/S2405471221002039) | [Code](https://github.com/tbepler/prose) | LSTM |
32 |
33 | ## pLMs specific to antibody and TCR sequences
34 | | Name | Params | Paper | Code | Notes |
35 | | :-------- | ------- | --------- | ------- | --------- |
36 | | TCR-BERT | 100M | [bioRxiv](https://www.biorxiv.org/content/10.1101/2021.11.18.469186v1) |[Code](https://github.com/wukevin/tcr-bert)||
37 | | AntiBERTa | 86M | [Paper](https://www.sciencedirect.com/science/article/pii/S2666389922001052) | [Code](https://github.com/alchemab/antiberta) ||
38 | | AntiBERTy | 26M | [arXiv](https://arxiv.org/abs/2112.07782) | [Code](https://pypi.org/project/antiberty) ||
39 | | IgLM |1.5M, 13M| [bioRxiv](https://www.biorxiv.org/content/10.1101/2021.12.13.472419v1.full) | [Code](https://github.com/Graylab/IgLM) ||
40 | | Sapiens | 0.6M | [Paper](https://www.tandfonline.com/doi/full/10.1080/19420862.2021.2020203) | [Code](https://github.com/Merck/BioPhi) ||
41 | | AbLang || [Paper](https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac046/6609807) | [Code](https://github.com/oxpig/AbLang) ||
42 |
43 | # Building on pLMs
44 | - [protGPT2_gradioFold](https://huggingface.co/spaces/Gradio-Blocks/protGPT2_gradioFold)
45 |
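For the autoregressive models above that are hosted on the HuggingFace Hub, generation can go through the standard `transformers` pipeline. A rough sketch for ProtGPT2 (the sampling parameters loosely follow the ProtGPT2 model card and are only a starting point):

```python
from transformers import pipeline  # pip install transformers

# Text-generation pipeline around the ProtGPT2 checkpoint on the HuggingFace Hub
protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

# Sample a few de novo sequences
sequences = protgpt2(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)

for s in sequences:
    # Generated text is FASTA-like, with newlines every 60 residues
    print(s["generated_text"])
```
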
--------------------------------------------------------------------------------
/proteinsequencedesign.md:
--------------------------------------------------------------------------------
1 |
2 | 💡 **Notes**
3 | - The following lists are curated by humans and, as such, may be incomplete
4 | - We only include software targeting the inverse folding problem, i.e., given a structure, predict a sequence that folds into it. This is also referred to as protein sequence design. Note that most models here model P(sequence | structure); a usage sketch with one such model follows the first table below.
5 | - We do not wish to advertise one tool over any other, but simply list the tools we are aware of, ordered by the year of the model's first preprint or publication (whichever came first)
6 | - Any suggestions for improvements and additions are welcome as issues or pull requests
7 |
8 | ⚡️ **Brought to you by:**
9 | - [@simonduerr](https://twitter.com/simonduerr)
10 | - [@ginaelnesr](https://twitter.com/ginaelnesr)
11 |
12 |
13 | # Protein Sequence Design / Inverse folding models
14 |
15 | (sorted by year of release)
16 | | Name | Release year |Architecture | Paper | Code | Notes |Experimental validation |
17 | | :-------- | ------- |--------- | --------- | ------- | --------- |---|
18 | | ABACUS-R| 2022 |Transformer|[Nat Comput Sci](https://www.nature.com/articles/s43588-022-00273-6)|[code](https://doi.org/10.24433/CO.3351944.v1)||✅|
19 | | ProteinMPNN | 2022 | MPNN |[biorxiv](https://www.biorxiv.org/content/10.1101/2022.06.03.494563v1)|[git](https://github.com/dauparas/proteinMPNN)|[webserver](https://hf.space/simonduerr/ProteinMPNN) |✅|
20 | | ProDESIGN-LE | 2022 | Transformer |[biorxiv](https://www.biorxiv.org/content/10.1101/2022.06.25.497605v4)|-| |-|
21 | | RaSP | 2022 | 3DCNN |[biorxiv](https://www.biorxiv.org/content/10.1101/2022.07.14.500157v2)|[git](https://github.com/KULL-Centre/papers/tree/main/2022/ML-ddG-Blaabjerg-et-al)| [colab](https://colab.research.google.com/github/KULL-Centre/papers/blob/main/2022/ML-ddG-Blaabjerg-et-al/RaSPLab.ipynb) |-|
22 | |MIF|2022| SGNN|[biorxiv](https://www.biorxiv.org/content/10.1101/2022.05.25.493516v1)|[git](https://github.com/microsoft/protein-sequence-models)||-|
23 | | ESM-IF1 | 2022 | Transformer |[biorxiv](https://www.biorxiv.org/content/10.1101/2022.04.10.487779v1)|[git](https://github.com/facebookresearch/esm)| |-|
24 | | McPartlon et al. | 2022 | Transformer |[biorxiv](https://www.biorxiv.org/content/10.1101/2022.04.15.488492v1)|-| |-|
25 | | TIMED-* | 2022 | 3DCNN |[arxiv](https://arxiv.org/pdf/2109.07925.pdf)|[git](https://github.com/wells-wood-research/timed-design)| |not published yet|
26 | | GCNDesign | 2021 | GCN |[pdf](https://github.com/ShintaroMinami/GCNdesign/blob/master/documents/Method_Summary.pdf)|[git](https://github.com/ShintaroMinami/GCNdesign)| [colab](https://github.com/naokob/ColabGCNdesign) |-|
27 | | GX | 2021 | 3DCNN+GNN |[arxiv](https://arxiv.org/pdf/2109.07925.pdf)|[git](https://github.com/wells-wood-research/timed-design)| |not published yet|
28 | | Fold2Seq | 2021 | Transformer | [arxiv](https://arxiv.org/abs/2106.13058) | [git](https://github.com/IBM/fold2seq)| |-|
29 | | CNN_protein_landscape | 2021 | 3DCNN |[Journal of Biological Physics](https://link.springer.com/article/10.1007/s10867-021-09593-6#Abs1)|[git](https://github.com/akulikova64/CNN_protein_landscape)||-|
30 | | Orellana et al. | 2021 | GVP |[biorxiv](https://www.biorxiv.org/content/10.1101/2021.09.06.459171v3)|-||-|
31 | | Jing et al. | 2020 | GVP |[arxiv](https://arxiv.org/abs/2009.01411)|[git](https://github.com/drorlab/gvp-pytorch)||-|
32 | | DenseCPD | 2020 | 3DCNN |[JCIM](https://pubs.acs.org/doi/full/10.1021/acs.jcim.0c00043)|-|[webserver](http://protein.org.cn/densecpd.html)|-|
33 | | ProDCoNN | 2020 | 3DCNN |[Proteins](https://onlinelibrary.wiley.com/doi/10.1002/prot.25868)|-||-|
34 | | ProteinSolver | 2020 | GNN |[Cell Systems](https://www.sciencedirect.com/science/article/pii/S2405471220303276)|[gitlab](https://gitlab.com/ostrokach/proteinsolver)|[webserver](http://design.ccbr.proteinsolver.org/)|✅|
35 | | MutCompute | 2020 | 3DCNN |[ACS SynBio](https://pubs.acs.org/doi/full/10.1021/acssynbio.0c00345)|-| [webserver](https://mutcompute.com)|✅|
36 | | Ingraham et al. | 2019 | Transformer |[NeurIPS Proceedings](https://papers.nips.cc/paper/2019/hash/f3a4ff4839c56a5f460c88cce3666a2b-Abstract.html)|[git](https://github.com/jingraham/neurips19-graph-protein-design)| |-|
37 | | 3DCNN | 2017 | 3DCNN |[BMC Bioinformatics](https://link.springer.com/article/10.1186/s12859-017-1702-0)|-||-|
38 |
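As a concrete illustration of the P(sequence | structure) formulation mentioned in the notes above, here is a rough sketch of backbone-conditioned sequence sampling with ESM-IF1. It assumes the `fair-esm` package with its inverse-folding extras (torch-geometric, biotite and friends) and a local PDB file; helper names may differ between versions:

```python
import esm
import esm.inverse_folding  # requires the inverse-folding extras (e.g. torch-geometric, biotite)

# Load ESM-IF1, a GVP-Transformer trained to model P(sequence | backbone coordinates)
model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()
model.eval()

# Backbone (N, CA, C) coordinates and native sequence for chain A of a local PDB file
coords, native_seq = esm.inverse_folding.util.load_coords("example.pdb", "A")

# Sample a new sequence conditioned on the fixed backbone; lower temperature sharpens the distribution
sampled_seq = model.sample(coords, temperature=1.0)
print(sampled_seq)
```

Scoring an existing sequence against the same backbone (i.e., evaluating log P(sequence | structure)) is exposed through a similar utility in `esm.inverse_folding.util`.
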
39 | # Protein Sequence & Rotamer Design / Inverse folding models
40 |
41 | (sorted by year of release)
42 | | Name | Release year |Architecture | Paper | Code | Notes |Experimental validation |
43 | | :-------- | ------- |--------- | --------- | ------- | --------- | ---|
44 | | TIMED_Rotamer | 2022 | 3DCNN |-|[git](https://github.com/wells-wood-research/timed-design)| | not published yet |
45 | | SeqDes | 2021 | 3DCNN |[Nature Comm](https://www.nature.com/articles/s41467-022-28313-9) - [BioRxiv](https://www.biorxiv.org/content/10.1101/2020.01.06.895466v3)|[git](https://github.com/ProteinDesignLab/protein_seq_des)| |✅|
46 |
47 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 📖 **Table of contents**
2 | * [Predictors](#predictors)
3 | * [Tools and Extensions](#tools-and-extensions)
4 | * [Databases and Datasets](#databases-of-predictions)
5 | * [Webservers](#webservers)
6 | * [Discontinued](#discontinued)
7 |
8 |
9 | 💡 **Notes**
10 | - The following lists are curated by humans and, as such, may be incomplete
11 | - We only include software targeting the folding problem that combines learnings from AlphaFold 2 and protein language models. You may find other ML-for-proteins tools in [Kevin's incredible ML for proteins list](https://github.com/yangkky/Machine-learning-for-proteins).
12 | - We do not wish to advertise one tool over any other, but simply list the tools we are aware of in either random or alphabetical order
13 | - Any suggestions for improvements and additions are welcome as issues or pull requests
14 | - Projects we identify as discontinued are marked with 💀 and in a section at the end
15 |
16 | ⚡️ **Brought to you by:**
17 | - [@sacdallago](https://twitter.com/sacdallago)
18 | - [@sokrypton](https://twitter.com/sokrypton)
19 |
20 | ----
21 |
22 |
23 | ### Predictors
24 | [_in alphabetical order_]
25 | - **MSA-based** (uses Multiple Sequence Alignments (MSAs) as input)
26 | - AlphaFold2
27 | [](https://github.com/deepmind/alphafold)
28 | [](https://www.nature.com/articles/s41586-021-03819-2)
29 | - The original AlphaFold 2 method
30 | - Features: monomer, multimer
31 | - Other: [Colab Notebook](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb)
32 | - ColabFold
33 | [](https://github.com/sokrypton/ColabFold)
34 | [](https://www.nature.com/articles/s41592-022-01488-1)
35 |     - Faster AF2 compilation and MSA generation
36 | - Features: monomer, multimer
37 | - Other: [localcolabfold](https://github.com/YoshitakaMo/localcolabfold)
38 | - FastFold
39 | [](https://github.com/hpcaitech/FastFold)
40 | [](https://arxiv.org/abs/2203.00854)
41 | - Runtime improvements to OpenFold (see below)
42 | - Features: monomer
43 | - HelixFold
44 | [](https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold)
45 | [](https://arxiv.org/abs/2207.05477)
46 | - Reimplementation of AF2 in PaddlePaddle
47 | - Features: monomer
48 | - MEGA-Fold
49 | [](https://gitee.com/mindspore/mindscience/tree/master/MindSPONGE/applications/MEGAProtein)
50 | [](https://arxiv.org/abs/2206.12240)
51 | - Reimplementation of AF2 in MindSpore; provides training code, training dataset and new model params.
52 | - Features: monomer
53 | - OpenFold
54 | [](https://github.com/aqlaboratory/openfold)
55 | - Reimplementation of AF2 in PyTorch; provides training code, training dataset and new model params.
56 | - Features: monomer
57 | - Other: [Colab Notebook](https://colab.research.google.com/github/aqlaboratory/openfold/blob/main/notebooks/OpenFold.ipynb)
58 | - RoseTTAFold
59 | [](https://github.com/RosettaCommons/RoseTTAFold)
60 | [](https://www.science.org/doi/10.1126/science.abj8754)
61 |     - Independent three-track model implemented in PyTorch, developed before AF2 details were available; new model parameters.
62 | - Features: monomer
63 | - Other: [Unofficial Colab Notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/RoseTTAFold.ipynb)
64 | - Uni-Fold
65 | [](https://github.com/dptech-corp/Uni-Fold)
66 | [](https://github.com/dptech-corp/Uni-Fold-jax)
67 | [](https://doi.org/10.1101/2022.08.04.502811)
68 | - Reimplementation of AF2 in PyTorch; provides training code and new (monomer/multimer) model parameters.
69 | - Features: monomer, multimer
70 | - Resources: [Colab Notebook](https://colab.research.google.com/github/dptech-corp/Uni-Fold/blob/main/notebooks/unifold.ipynb)
71 |
72 | - **pLM-based** (uses embeddings from protein language models (pLMs) as input)
73 |   - ESMFold
74 | [](https://doi.org/10.1101/2022.07.20.500902)
75 | - Features: monomer
76 |     - Other: [[tweet] Alex's announcement](https://twitter.com/alexrives/status/1550148755206414341) (a usage sketch follows this list)
77 | - EMBER3D
78 |   [](https://github.com/kWeissenow/EMBER3D)
79 | - Features: monomer
80 | - HelixFold-single
81 | [](https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single)
82 | [](https://arxiv.org/abs/2207.13921)
83 | - Features: monomer
84 | - Resource: https://paddlehelix.baidu.com/app/drug/protein-single/forecast
85 | - IgFold
86 | [](https://github.com/Graylab/IgFold)
87 | [](https://doi.org/10.1101/2022.04.20.488972)
88 |     - Antibody structure prediction built on the AntiBERTy pLM
89 | - Features: monomer
90 | - Other: [Colab Notebook](https://colab.research.google.com/github/Graylab/IgFold/blob/main/IgFold.ipynb)
91 | - OmegaFold
92 | [](https://github.com/HeliXonProtein/OmegaFold)
93 | [](https://doi.org/10.1101/2022.07.21.500999)
94 | - Features: monomer
95 | - Other:
96 | [Unofficial Colab Notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/omegafold.ipynb),
97 | [[tweet] Martin comparing structures](https://twitter.com/thesteinegger/status/1554881669718573062),
98 | [[tweet] Sergey's positional encoding observation](https://twitter.com/sokrypton/status/1555536325176168448)
99 |
100 |
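ESMFold weights are distributed through the `fair-esm` package (install with `pip install "fair-esm[esmfold]"` plus its OpenFold dependency). The sketch below illustrates that packaged interface; it is an assumption about the released package, not something described in the preprint linked above:

```python
import torch
import esm  # pip install "fair-esm[esmfold]" plus the OpenFold dependency

# Load the released ESMFold model (downloads several GB of weights on first use)
model = esm.pretrained.esmfold_v1()
model = model.eval()  # add .cuda() for anything beyond short test sequences

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # predicted structure as PDB-format text

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)
```
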
101 | ### Tools and Extensions
102 | - gget (AF2) (see the usage sketch after this list)
103 |   [](https://github.com/pachterlab/gget#gget-alphafold-)
104 | [](https://doi.org/10.1101/2022.05.17.492392)
105 | - alphafold_finetune
106 |   [](https://github.com/phbradley/alphafold_finetune)
107 | [](https://doi.org/10.1101/2022.07.12.499365)
108 | - finetune AlphaFold for Protein-Peptide prediction
109 | - Other: [[tweet] Amir's announcement](https://twitter.com/AMotmaen/status/1547435940011945984)
110 | - AlphaPulldown
111 | [](https://www.embl-hamburg.de/AlphaPulldown/)
112 | [](https://doi.org/10.1101/2022.08.05.502961)
113 | - protein-protein interaction screens using AlphaFold-Multimer
114 | - ColabDesign
115 | [](https://github.com/sokrypton/ColabDesign)
116 | - Backprop through AlphaFold for protein design
117 | - AF2Rank
118 | [](https://github.com/jproney/AF2Rank)
119 | [](https://doi.org/10.1101/2022.03.11.484043)
120 | - Rank Decoy Structures/Sequences using AlphaFold
121 | - Resource: [Colab Notebook](https://colab.research.google.com/github/sokrypton/ColabDesign/blob/main/af/examples/AF2Rank.ipynb)
122 |
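As a usage sketch for the gget AlphaFold module referenced above (assuming a recent `gget` release; the setup step, argument names, and output handling may differ, so check the gget docs):

```python
import gget  # pip install gget

# One-time setup: download the (simplified) AlphaFold2 parameters gget uses
gget.setup("alphafold")

# Predict the structure of a single amino-acid sequence; see the gget docs for output options
gget.alphafold("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR")
```
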
123 | ----
124 |
125 | ### Databases of predictions
126 | - AlphaFold Database
127 | [](https://doi.org/10.1093/nar/gkab1061)
128 |   - All sequences in UniRef90, excluding viral sequences; based on AlphaFold 2 (see the download sketch after this list)
129 | - Resource: https://alphafold.ebi.ac.uk
130 | - Eukaryotic interactomes
131 | [](https://www.science.org/doi/10.1126/science.abm4805)
132 | - Protein-Protein interactions; Based on RoseTTAFold and AlphaFold 2
133 | - Resource: https://www.ebi.ac.uk/pdbe/news/predicted-complexes-modelarchive-now-pdbe-kb-pages
134 | - Structures of human-transcriptome isoforms
135 | [](https://doi.org/10.1101/2022.06.08.495354)
136 | - Based on ColabFold (AlphaFold 2)
137 | - Resource: https://www.isoform.io
138 | - AlphaFill
139 | [](https://doi.org/10.1101/2021.11.26.470110)
140 | - Enriching the AlphaFold models with ligands and co-factors (AlphaFold 2)
141 | - Resource: https://alphafill.eu/
142 | - IgFold Database
143 | [](https://doi.org/10.1101/2022.04.20.488972)
144 | - Predictions specific to antibody sequences; based on OAS dataset and IgFold
145 | - Resource: https://data.graylab.jhu.edu/igfold_oas_paired95.tar.gz
146 |
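Predictions from the AlphaFold Database can also be fetched programmatically by UniProt accession. A hedged sketch using the public per-entry file URLs (the model version suffix changes between database releases; the accession below is just an example):

```python
import urllib.request

def fetch_afdb_model(uniprot_accession: str, version: str = "v4") -> str:
    """Download an AlphaFold DB prediction as PDB text for a UniProt accession.

    The URL pattern matches the files served at alphafold.ebi.ac.uk;
    the version suffix (v2/v3/v4/...) changes with database releases.
    """
    url = (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{uniprot_accession}-F1-model_{version}.pdb"
    )
    with urllib.request.urlopen(url) as response:
        return response.read().decode()

# Example: human hemoglobin subunit alpha (UniProt P69905)
pdb_text = fetch_afdb_model("P69905")
with open("AF-P69905-F1.pdb", "w") as f:
    f.write(pdb_text)
```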
147 |
148 | ### Datasets for training
149 | - OpenFold
150 |   - MSAs for 132K PDBs + 270K UniClust30 predictions for distillation
151 | - Resource: https://registry.opendata.aws/openfold/
152 | - MindSpore
153 |   - MSAs for 570K PDBs + 745K distillation predictions
154 | - Manuscript: https://arxiv.org/abs/2206.12240
155 | - Resource: http://ftp.cbi.pku.edu.cn/psp/
156 |
157 | ----
158 |
159 |
160 | ### Webservers
161 | - Lambda PredictProtein
162 | [](https://doi.org/10.1101/2022.08.04.502750)
163 |   - Based on ColabFold; limited to sequences up to 500 AAs
164 | - Resource: http://embed.predictprotein.org/
165 | - Robetta
166 | - Based on RoseTTAFold
167 | - Resource: https://robetta.bakerlab.org/
168 |
169 | ----
170 |
171 |
172 | ### Discontinued
173 |
174 | - 💀 Moonbear
175 | - Resource: https://www.getmoonbear.com/
176 | - Other: [[tweet] Stephanie's announcement](https://twitter.com/stephanieszhang/status/1427773598199164937)
177 | - 💀 Lucidrains' AlphaFold2
178 | [](https://github.com/lucidrains/alphafold2)
179 | - AF2 reproduction attempt
180 | - Features: monomer
181 | - 💀 Lupoglaz's OpenFold2
182 | [](https://github.com/lupoglaz/OpenFold2)
183 | - AF2 reproduction attempt
184 | - Features: monomer
185 |
--------------------------------------------------------------------------------