├── .gitignore ├── CONTRIBUTE.md ├── README.md ├── docs ├── index.html ├── materials.html └── molecules.html ├── images ├── AiMat_logo_purple.png └── ChemMatData_logo_final.png ├── materials.json └── molecules.json /.gitignore: -------------------------------------------------------------------------------- 1 | # See https://help.github.com/articles/ignoring-files/ for more about ignoring files. 2 | 3 | # dependencies 4 | website/node_modules 5 | website/.pnp 6 | .pnp.js 7 | 8 | # testing 9 | website/coverage 10 | 11 | # production 12 | website/build 13 | 14 | # misc 15 | .DS_Store 16 | .env.local 17 | .env.development.local 18 | .env.test.local 19 | .env.production.local 20 | 21 | npm-debug.log* 22 | yarn-debug.log* 23 | yarn-error.log* 24 | -------------------------------------------------------------------------------- /CONTRIBUTE.md: -------------------------------------------------------------------------------- 1 | ## Option 1: 2 | Please send a link to a missing dataset to Jana (jana.zeller@student.kit.edu) and Pascal (pascal.friederich@kit.edu). 3 | Or, if you have a new dataset, please send it to us along with a short description. 4 | 5 | ## Option 2: 6 | Click on the `branch` button and then click on the green `new branch` button. 7 | Edit the `molecules.json` or `materials.json` file directly online on GitHub. 8 | 9 | ## Option 3: 10 | 1. Fork the Project (`git clone https://github.com/aimat-lab/ChemMatData.git`) 11 | 2. Create your Dataset / Feature Branch (`git checkout -b feature/AmazingDataset`) 12 | 3. Open the desired JSON file (molecules.json or materials.json) in a text editor. 13 | 4. Create a new JSON object that represents your dataset. Make sure to include all the relevant fields: Dataset Name, Domain, Task Type, Data Type, #Compounds, #Tasks, Short Description, Papers, and DownloadLink. 14 | 5. If the dataset has multiple values for Task Type, Data Type, or DownloadLink, separate each entry with a comma followed by a space (", "). 
15 | 6. For the Papers field, create an array of objects where each object represents a paper with the fields Name and Link. 16 | 17 | Your JSON could now look like this: 18 | ``` 19 | { 20 | "Dataset Name": "QM9", 21 | "Domain": "Quantum Mechanics", 22 | "Short Description": "QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of GDB-17 database, comprising 134 thousand stable organic molecules with up to nine heavy atoms. All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT)", 23 | "#Tasks": 16, 24 | "#Compounds": 134000, 25 | "Task Type": "Regression", 26 | "Data Type": "SMILES, 3D coordinates", 27 | "DownloadLink": "http://quantum-machine.org/datasets/#:~:text=Available%20via-,figshare,-.", 28 | "Papers" : [ 29 | { 30 | "Name": "Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17", 31 | "Link": "http://pubs.acs.org/doi/abs/10.1021/ci300415d" 32 | }, 33 | { 34 | "Name": "Quantum chemistry structures and properties of 134 kilo molecules", 35 | "Link": "http://quantum-machine.org/datasets/#:~:text=A.%20von%20Lilienfeld%2C-,Quantum%20chemistry%20structures%20and%20properties%20of%20134%20kilo%20molecules,-%2C%20Scientific%20Data" 36 | } 37 | ] 38 | } 39 | ``` 40 | 7. Add your new dataset object to the array, making sure to place a comma before it. 41 | 8. Commit your Changes (`git commit -m 'Add some AmazingDataset'`) 42 | 9. Push to the Branch (`git push origin feature/AmazingDataset`) 43 | 10. Open a Pull Request. You open a pull request by clicking on the branch icon in the start page and navigating to the branch you just added. In the yellow banner click the `Compare & pull request` button. Select the `main` branch as the branch you want to merge into. Write a short description about the dataset you added. Then click on `Create Pull Request`. 
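The field list from steps 4 to 6 can be checked mechanically before committing. The following is a minimal sketch and not part of this repository: `REQUIRED_FIELDS` mirrors the field names listed in step 4, while the `validate_entry` helper and the shortened QM9 entry (abbreviated description and links) are hypothetical and introduced here only for illustration.

```python
import json

# Field names copied from step 4 of this guide.
REQUIRED_FIELDS = [
    "Dataset Name", "Domain", "Task Type", "Data Type",
    "#Compounds", "#Tasks", "Short Description", "Papers", "DownloadLink",
]


def validate_entry(entry):
    """Return a list of problems found in one dataset object (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    papers = entry.get("Papers", [])
    if not isinstance(papers, list):
        problems.append("Papers must be an array")
        return problems
    for index, paper in enumerate(papers):
        if not isinstance(paper, dict):
            problems.append(f"Papers[{index}] must be an object")
            continue
        # Step 6: every paper needs both a Name and a Link.
        for key in ("Name", "Link"):
            if key not in paper:
                problems.append(f"Papers[{index}] is missing {key}")
    return problems


# Shortened version of the QM9 example (description and links abbreviated).
example = json.loads("""
{
  "Dataset Name": "QM9",
  "Domain": "Quantum Mechanics",
  "Short Description": "Properties of 134 thousand stable organic molecules.",
  "#Tasks": 16,
  "#Compounds": 134000,
  "Task Type": "Regression",
  "Data Type": "SMILES, 3D coordinates",
  "DownloadLink": "http://quantum-machine.org/datasets/",
  "Papers": [
    {
      "Name": "Quantum chemistry structures and properties of 134 kilo molecules",
      "Link": "http://quantum-machine.org/datasets/"
    }
  ]
}
""")

print(validate_entry(example))
```

Because `json.loads` raises an error on malformed input, running the sketch on a new entry also catches a missing comma or quote before the change is pushed.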
See [this detailed description](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) from GitHub on how to open pull requests. 44 | 45 | 46 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ChemMatData 2 | ## Global collection of molecular and materials datasets 3 | Website at [aimat-lab.github.io/ChemMatData](https://aimat-lab.github.io/ChemMatData/index.html) 4 | 5 | 6 | 7 | --- 8 | 9 | ## About The Project 10 | Our main goal is to create a collaborative platform 11 | where we can gather and categorize various datasets, 12 | making them conveniently accessible in one place. 13 | We are actively collecting datasets for [molecules](https://github.com/aimat-lab/ChemMatData/blob/main/molecules.json) as well as [crystalline structures](https://github.com/aimat-lab/ChemMatData/blob/main/materials.json) to provide a comprehensive resource for researchers, scientists, and enthusiasts. 14 | 15 | See the list of all [molecular datasets](https://github.com/aimat-lab/ChemMatData/blob/main/molecules.json). 16 | See the list of all [materials datasets](https://github.com/aimat-lab/ChemMatData/blob/main/materials.json). 17 | 18 | You can sort and explore all available datasets using our [website](https://aimat-lab.github.io/ChemMatData/index.html). 19 | 20 | 21 | --- 22 | 23 | ## Contributing 24 | If you have additional datasets that you believe should be included in our repository, we encourage you to [contribute](https://github.com/aimat-lab/ChemMatData/blob/main/CONTRIBUTE.md). 25 | Here's how you can do it: 26 | 1. Send a link to a missing dataset, or your own dataset with a short description, to Pascal (pascal.friederich@kit.edu). 27 | 2. Directly add your dataset to the table in your browser. 28 | 3. 
Clone and extend this repository ([detailed description here](https://github.com/aimat-lab/ChemMatData/blob/main/CONTRIBUTE.md)) 29 | 30 | We appreciate your contribution and look forward to incorporating your suggested datasets into our growing collection! 31 | 32 | --- 33 | 34 | ## List of contributors 35 | 36 | Send us pull requests or emails with new datasets if you want to see your name here! 37 | 38 | --- 39 | 40 | ## About Us 41 | An open-source project hosted by the [AiMat Group](https://aimat.iti.kit.edu/) at the [Karlsruhe Institute of Technology (KIT)](https://www.kit.edu/). 42 | The initial version of this project was developed by Jana Zeller and Pascal Friederich. 43 | 44 | 45 | 46 | 47 | 48 | -------------------------------------------------------------------------------- /docs/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 32 | 33 |
34 | ChemMatData logo 35 |
36 | 37 |
38 | 39 | 45 | 46 | 47 | 48 | 54 | 55 | 56 | 57 | 63 | 64 |
65 | 66 |

67 | Our main goal is to create a collaborative platform 68 | where we can gather and categorize various datasets, 69 | making them conveniently accessible in one place. 70 | We are actively collecting datasets for molecules as well as crystalline structures 71 | to provide a comprehensive resource for researchers, scientists, and enthusiasts. 72 |

73 | 74 |
75 |

Contributors

76 |

Send us pull requests or emails with new datasets if you want to see your name here!

77 |
78 | 79 |
80 |

About Us

81 |

An open-source project hosted by the AiMat Group at the Karlsruhe Institute of Technology (KIT).

82 |
83 | 84 |
85 | 86 | 87 | 88 |
89 | 90 | 98 | -------------------------------------------------------------------------------- /docs/materials.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 32 | 33 |

Materials

34 |

Coming Soon!

35 | 36 | 44 | -------------------------------------------------------------------------------- /docs/molecules.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 134 | 135 | 136 | 137 | 138 | 139 | 159 | 160 |

Molecules

161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 |
Dataset Name | Domain | #Tasks | #Compounds | Task Type | Data Type | More Infos
177 | 178 | 186 | -------------------------------------------------------------------------------- /images/AiMat_logo_purple.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aimat-lab/ChemMatData/c99907247ff053318e7479dcf6e6e6d9e2198a72/images/AiMat_logo_purple.png -------------------------------------------------------------------------------- /images/ChemMatData_logo_final.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aimat-lab/ChemMatData/c99907247ff053318e7479dcf6e6e6d9e2198a72/images/ChemMatData_logo_final.png -------------------------------------------------------------------------------- /materials.json: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aimat-lab/ChemMatData/c99907247ff053318e7479dcf6e6e6d9e2198a72/materials.json -------------------------------------------------------------------------------- /molecules.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "Dataset Name": "QM9", 4 | "Domain": "Quantum Mechanics", 5 | "Short Description": "QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of GDB-17 database, comprising 134 thousand stable organic molecules with up to nine heavy atoms. 
All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT)", 6 | "#Tasks": 16, 7 | "#Compounds": 134000, 8 | "Task Type": "Regression", 9 | "Data Type": "SMILES, 3D coordinates", 10 | "DownloadLink": "http://quantum-machine.org/datasets/#:~:text=Available%20via-,figshare,-.", 11 | "Papers" : [ 12 | { 13 | "Name": "Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17", 14 | "Link": "http://pubs.acs.org/doi/abs/10.1021/ci300415d" 15 | }, 16 | { 17 | "Name": "Quantum chemistry structures and properties of 134 kilo molecules", 18 | "Link": "http://quantum-machine.org/datasets/#:~:text=A.%20von%20Lilienfeld%2C-,Quantum%20chemistry%20structures%20and%20properties%20of%20134%20kilo%20molecules,-%2C%20Scientific%20Data" 19 | } 20 | ] 21 | }, 22 | { 23 | "Dataset Name": "PCQM4Mv2", 24 | "Domain": "Quantum Mechanics", 25 | "Short Description": "Based on the PubChemQC, we define a meaningful ML task of predicting DFT-calculated HOMO-LUMO energy gap of molecules given their 2D molecular graphs. 
The HOMO-LUMO gap is one of the most practically relevant quantum chemical properties of molecules since it is related to reactivity, photoexcitation, and charge transport.", 26 | "#Tasks": 1, 27 | "#Compounds": 3378606, 28 | "Task Type": "Regression", 29 | "Data Type": "SMILES", 30 | "DownloadLink": "https://ogb.stanford.edu/docs/lsc/pcqm4mv2/#dataset", 31 | "Papers": [] 32 | }, 33 | { 34 | "Dataset Name": "Alchemy", 35 | "Domain": "Quantum Mechanics", 36 | "Short Description": "The dataset comprises 12 quantum mechanical properties of 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database.", 37 | "#Tasks": 12, 38 | "#Compounds": 202579, 39 | "Task Type": "Regression", 40 | "Data Type": "SMILES, 3D coordinates", 41 | "DownloadLink": "https://chrsmrrs.github.io/datasets/docs/datasets/", 42 | "Papers": [ 43 | { 44 | "Name": "Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models", 45 | "Link": "https://arxiv.org/pdf/1906.09427.pdf" 46 | } 47 | ] 48 | }, 49 | { 50 | "Dataset Name": "BACE", 51 | "Domain": "Biophysics", 52 | "Short Description": "The BACE dataset provides quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1).", 53 | "#Tasks": 2, 54 | "#Compounds": 1522, 55 | "Task Type": "Regression, Classification", 56 | "Data Type": "SMILES", 57 | "DownloadLink": "https://moleculenet.org/datasets-1", 58 | "Papers": [] 59 | }, 60 | { 61 | "Dataset Name": "Freesolv", 62 | "Domain": "Physical Chemistry", 63 | "Short Description": "A collection of experimental and calculated hydration free energies for small molecules in water. 
The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations.", 64 | "#Tasks": 1, 65 | "#Compounds": 643, 66 | "Task Type": "Regression", 67 | "Data Type": "SMILES", 68 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=organic%20small%20molecules.-,FreeSolv,-%3A%20Experimental%20and%20calculated, https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=4-,FreeSolv,-Solvation%20free%20energy", 69 | "Papers": [ 70 | { 71 | "Name": "FreeSolv: a database of experimental and calculated hydration free energies, with input files", 72 | "Link": "https://pubmed.ncbi.nlm.nih.gov/24928188/" 73 | } 74 | ] 75 | }, 76 | { 77 | "Dataset Name": "ESOL (delaney)", 78 | "Domain": "Physical Chemistry", 79 | "Short Description": "Water solubility data (log solubility in mols per litre) for common organic small molecules.", 80 | "#Tasks": 1, 81 | "#Compounds": 1128, 82 | "Task Type": "Regression", 83 | "Data Type": "SMILES", 84 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=modelled%20small%20molecules.-,ESOL,-%3A%20Water%20solubility%20data, https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=5-,ESOL,-ESOL%20(delaney)%20is", 85 | "Papers": [ 86 | { 87 | "Name": "ESOL: Estimating Aqueous Solubility Directly from Molecular Structure", 88 | "Link": "https://pubs.acs.org/doi/10.1021/ci034243x" 89 | } 90 | ] 91 | }, 92 | { 93 | "Dataset Name": "Lipophilicity", 94 | "Domain": "Physical Chemistry", 95 | "Short Description": "Experimental results of octanol/water distribution coefficient (logD at pH 7.4).", 96 | "#Tasks": 1, 97 | "#Compounds": 4200, 98 | "Task Type": "Regression", 99 | "Data Type": "SMILES", 100 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=molecules%20in%20water.-,Lipophilicity,-%3A%20Experimental%20results%20of, https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=3%2C%205-,Lipophilicity,-SMILES%20strings%20are", 101 | "Papers": [] 102 | }, 103 | { 104 | 
"Dataset Name": "MUV", 105 | "Domain": "Biophysics", 106 | "Short Description": "Subset of PubChem BioAssay by applying a refined nearest neighbor analysis, designed for validation of virtual screening techniques.", 107 | "#Tasks": 17, 108 | "#Compounds": 93087, 109 | "Task Type": "Classification", 110 | "Data Type": "SMILES", 111 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=high%2Dthroughput%20screening.-,MUV,-%3A%20Subset%20of%20PubChem", 112 | "Papers": [ 113 | { 114 | "Name": "MoleculeNet: A Benchmark for Molecular Machine Learning", 115 | "Link": "https://arxiv.org/abs/1703.00564" 116 | } 117 | ] 118 | }, 119 | { 120 | "Dataset Name": "HIV", 121 | "Domain": "Biophysics", 122 | "Short Description": "Experimentally measured abilities to inhibit HIV replication.", 123 | "#Tasks": 1, 124 | "#Compounds": 41127, 125 | "Task Type": "Classification", 126 | "Data Type": "SMILES", 127 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=virtual%20screening%20techniques.-,HIV,-%3A%20Experimentally%20measured%20abilities", 128 | "Papers": [ 129 | { 130 | "Name": "MoleculeNet: a benchmark for molecular machine learning", 131 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a" 132 | } 133 | ] 134 | }, 135 | { 136 | "Dataset Name": "AIDS", 137 | "Domain": "Biophysics", 138 | "Short Description": "The DTP AIDS Antiviral Screen has checked tens of thousands of compounds for evidence of anti-HIV activity. 
Screening results and chemical structural data are available for compounds that are not covered by a confidentiality agreement.", 139 | "#Tasks": 2, 140 | "#Compounds": 2000, 141 | "Task Type": "Classification", 142 | "Data Type": "molecular graph", 143 | "DownloadLink": "https://chrsmrrs.github.io/datasets/docs/datasets/#:~:text=%E2%80%93-,AIDS,-alchemy_full", 144 | "Papers": [ 145 | { 146 | "Name": "IAM Graph Database Repository for Graph Based Pattern Recognition and Machine Learning", 147 | "Link": "https://link.springer.com/chapter/10.1007/978-3-540-89689-0_33" 148 | }, 149 | { 150 | "Name": "AIDS Antiviral Screen Data (2004)", 151 | "Link": "https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data" 152 | } 153 | ] 154 | }, 155 | { 156 | "Dataset Name": "PDBbind", 157 | "Domain": "Biophysics", 158 | "Short Description": "Binding affinities for biomolecular complexes; structures of both proteins and ligands are provided.", 159 | "#Tasks": 1, 160 | "#Compounds": 11908, 161 | "Task Type": "Regression", 162 | "Data Type": "SMILES, 3D coordinates", 163 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=inhibit%20HIV%20replication.-,PDBbind,-%3A%20Binding%20affinities%20for", 164 | "Papers": [ 165 | { 166 | "Name": "Comparative assessment of scoring functions on a diverse test set", 167 | "Link": "https://pubmed.ncbi.nlm.nih.gov/19358517/" 168 | } 169 | ] 170 | }, 171 | { 172 | "Dataset Name": "BBBP", 173 | "Domain": "Physiology", 174 | "Short Description": "Binary labels of blood-brain barrier penetration (permeability).", 175 | "#Tasks": 1, 176 | "#Compounds": 2039, 177 | "Task Type": "Classification", 178 | "Data Type": "SMILES", 179 | "DownloadLink": "https://moleculenet.org/datasets-1, https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=5-,BBBP,-Blood%E2%80%93brain%20barrier", 180 | "Papers": [ 181 | { 182 | "Name": "MoleculeNet: a benchmark for molecular machine learning", 183 | "Link": 
"https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a" 184 | } 185 | ] 186 | }, 187 | { 188 | "Dataset Name": "Tox21", 189 | "Domain": "Physiology", 190 | "Short Description": "Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways.", 191 | "#Tasks": 12, 192 | "#Compounds": 7831, 193 | "Task Type": "Classification", 194 | "Data Type": "SMILES", 195 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=barrier%20penetration(permeability).-,Tox21,-%3A%20Qualitative%20toxicity%20measurements", 196 | "Papers": [ 197 | { 198 | "Name": "MoleculeNet: a benchmark for molecular machine learning", 199 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a" 200 | } 201 | ] 202 | }, 203 | { 204 | "Dataset Name": "ToxCast", 205 | "Domain": "Physiology", 206 | "Short Description": "Toxicology data for a large library of compounds based on in vitro high-throughput screening, including experiments on over 600 tasks.", 207 | "#Tasks": 617, 208 | "#Compounds": 8575, 209 | "Task Type": "Classification", 210 | "Data Type": "SMILES", 211 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=stress%20response%20pathways.-,ToxCast,-%3A%20Toxicology%20data%20for", 212 | "Papers": [ 213 | { 214 | "Name": "MoleculeNet: a benchmark for molecular machine learning", 215 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a" 216 | } 217 | ] 218 | }, 219 | { 220 | "Dataset Name": "SIDER", 221 | "Domain": "Physiology", 222 | "Short Description": "Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes.", 223 | "#Tasks": 27, 224 | "#Compounds": 1427, 225 | "Task Type": "Classification", 226 | "Data Type": "SMILES", 227 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=over%20600%20tasks.-,SIDER,-%3A%20Database%20of%20marketed", 228 | "Papers": [ 229 | { 230 | "Name": "MoleculeNet: a benchmark for molecular machine 
learning", 231 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a" 232 | } 233 | ] 234 | }, 235 | { 236 | "Dataset Name": "ClinTOX", 237 | "Domain": "Physiology", 238 | "Short Description": "Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons.", 239 | "#Tasks": 2, 240 | "#Compounds": 1478, 241 | "Task Type": "Classification", 242 | "Data Type": "SMILES", 243 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=system%20organ%20classes.-,ClinTox,-%3A%20Qualitative%20data%20of, https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=5-,ClinTox,-The%20ClinTox%20dataset", 244 | "Papers": [ 245 | { 246 | "Name": "MoleculeNet: a benchmark for molecular machine learning", 247 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a" 248 | } 249 | ] 250 | }, 251 | { 252 | "Dataset Name": "Quantitative toxicity - LD50", 253 | "Domain": "Physiology", 254 | "Short Description": "The oral rat LD50 dataset (LD50).", 255 | "#Tasks": 1, 256 | "#Compounds": 7413, 257 | "Task Type": "Regression", 258 | "Data Type": "SMILES", 259 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=1-,Quantitative%20toxicity,-LD50", 260 | "Papers": [ 261 | { 262 | "Name": "Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks", 263 | "Link": "https://users.math.msu.edu/users/weig/PaperName/p222.pdf" 264 | }, 265 | { 266 | "Name": "Algebraic graph-assisted bidirectional transformers for molecular property prediction", 267 | "Link": "https://www.nature.com/articles/s41467-021-23720-w.pdf" 268 | } 269 | ] 270 | }, 271 | { 272 | "Dataset Name": "Quantitative toxicity - IGC50", 273 | "Domain": "Physiology", 274 | "Short Description": "Tetrahymena pyriformis IGC50 dataset (IGC50).", 275 | "#Tasks": 1, 276 | "#Compounds": 1792, 277 | "Task Type": "Regression", 278 | "Data Type": "SMILES", 279 | "DownloadLink": 
"https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=1-,Quantitative%20toxicity,-LD50", 280 | "Papers": [ 281 | { 282 | "Name": "Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks", 283 | "Link": "https://users.math.msu.edu/users/weig/PaperName/p222.pdf" 284 | }, 285 | { 286 | "Name": "Algebraic graph-assisted bidirectional transformers for molecular property prediction", 287 | "Link": "https://www.nature.com/articles/s41467-021-23720-w.pdf" 288 | } 289 | ] 290 | }, 291 | { 292 | "Dataset Name": "Quantitative toxicity - LC50", 293 | "Domain": "Physiology", 294 | "Short Description": "96 h fathead minnow LC50 dataset.", 295 | "#Tasks": 1, 296 | "#Compounds": 813, 297 | "Task Type": "Regression", 298 | "Data Type": "SMILES", 299 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=1-,Quantitative%20toxicity,-LD50", 300 | "Papers": [ 301 | { 302 | "Name": "Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks", 303 | "Link": "https://users.math.msu.edu/users/weig/PaperName/p222.pdf" 304 | }, 305 | { 306 | "Name": "Algebraic graph-assisted bidirectional transformers for molecular property prediction", 307 | "Link": "https://www.nature.com/articles/s41467-021-23720-w.pdf" 308 | } 309 | ] 310 | }, 311 | { 312 | "Dataset Name": "Quantitative toxicity - LC50DM", 313 | "Domain": "Physiology", 314 | "Short Description": "The oral rat LD50 dataset (LD50).", 315 | "#Tasks": 1, 316 | "#Compounds": 353, 317 | "Task Type": "Regression", 318 | "Data Type": "SMILES", 319 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=1-,Quantitative%20toxicity,-LD50", 320 | "Papers": [ 321 | { 322 | "Name": "Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks", 323 | "Link": "https://users.math.msu.edu/users/weig/PaperName/p222.pdf" 324 | }, 325 | { 326 | "Name": "Algebraic graph-assisted bidirectional transformers for molecular property 
prediction", 327 | "Link": "https://www.nature.com/articles/s41467-021-23720-w.pdf" 328 | } 329 | ] 330 | }, 331 | { 332 | "Dataset Name": "beet", 333 | "Domain": "Physiology", 334 | "Short Description": "The toxicity in honey bees (beet) dataset was extracted from a study on the prediction of acute contact toxicity of pesticides in honeybees. The data set contains 254 compounds with their experimental values.", 335 | "#Tasks": 2, 336 | "#Compounds": 254, 337 | "Task Type": "Classification", 338 | "Data Type": "SMILES", 339 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=5-,beet,-The%20toxicity%20in", 340 | "Papers": [ 341 | { 342 | "Name": "Extracting Predictive Representations from Hundreds of Millions of Molecules", 343 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03058" 344 | } 345 | ] 346 | }, 347 | { 348 | "Dataset Name": "logP", 349 | "Domain": "", 350 | "Short Description": "Partition coefficient datasets, including a training set (8199 compounds), a Food and Drug Administration (FDA) set, a Star set, and a Nonstar set.", 351 | "#Tasks": 3, 352 | "#Compounds": "8199(train), 406(test-FDA), 223(test-Star), 43(test-Nonstar)", 353 | "Task Type": "Regression", 354 | "Data Type": "SMILES, 3D coordinates", 355 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/3D/#:~:text=Reference-,logP,-Partition%20coefficient%20datasets", 356 | "Papers": [ 357 | { 358 | "Name": "Algebraic graph-assisted bidirectional transformers for molecular property prediction", 359 | "Link": "https://www.nature.com/articles/s41467-021-23720-w.pdf" 360 | }, 361 | { 362 | "Name": "TopP–S: Persistent Homology-Based Multi-Task Deep Neural Networks for Simultaneous Predictions of Partition Coefficient and Aqueous Solubility", 363 | "Link": "https://users.math.msu.edu/users/weig/paper/p223.pdf" 364 | } 365 | ] 366 | }, 367 | { 368 | "Dataset Name": "logS(1)", 369 | "Domain": "", 370 | "Short Description": "Small aqueous solubility datasets.", 371 | "#Tasks": 
2, 372 | "#Compounds": 1431, 373 | "Task Type": "Regression", 374 | "Data Type": "SMILES, 3D coordinates", 375 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=1%2C%203-,logS(1),-A%20diverse%20dataset", 376 | "Papers": [ 377 | { 378 | "Name": "TopP–S: Persistent Homology-Based Multi-Task Deep Neural Networks for Simultaneous Predictions of Partition Coefficient and Aqueous Solubility", 379 | "Link": "https://users.math.msu.edu/users/weig/paper/p223.pdf" 380 | } 381 | ] 382 | }, 383 | { 384 | "Dataset Name": "DPP4", 385 | "Domain": "", 386 | "Short Description": "The DPP-4 inhibitors (DPP4) dataset was extracted from ChEMBL using the DPP-4 target. The data was processed by removing salts, normalizing molecular structures, and checking for duplicate molecules, leaving 3933 molecules.", 387 | "#Tasks": 1, 388 | "#Compounds": 3933, 389 | "Task Type": "Regression", 390 | "Data Type": "SMILES", 391 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=3%2C%205-,DPP4,-DPP%2D4%20inhibitors", 392 | "Papers": [ 393 | { 394 | "Name": "Extracting Predictive Representations from Hundreds of Millions of Molecules", 395 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03058" 396 | } 397 | ] 398 | }, 399 | { 400 | "Dataset Name": "Ames", 401 | "Domain": "", 402 | "Short Description": "Ames mutagenicity. 
The dataset includes 6512 compounds and corresponding binary labels from Ames Mutagenicity results.", 403 | "#Tasks": 1, 404 | "#Compounds": 6512, 405 | "Task Type": "Classification", 406 | "Data Type": "SMILES", 407 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=3%2C%205-,Ames,-Ames%20mutagenicity.%20The", 408 | "Papers": [ 409 | { 410 | "Name": "Extracting Predictive Representations from Hundreds of Millions of Molecules", 411 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03058" 412 | } 413 | ] 414 | }, 415 | { 416 | "Dataset Name": "DUD", 417 | "Domain": "", 418 | "Short Description": "A Directory of Useful Decoys (DUD).", 419 | "#Tasks": 21, 420 | "#Compounds": "between 31 and 365 actives and 1,344 and 15,560 decoys depending on target", 421 | "Task Type": "Rank", 422 | "Data Type": "SMILES", 423 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#:~:text=5-,DUD,-A%20Directory%20of", 424 | "Papers": [ 425 | { 426 | "Name": "Extracting Predictive Representations from Hundreds of Millions of Molecules", 427 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03058" 428 | } 429 | ] 430 | }, 431 | { 432 | "Dataset Name": "MUV", 433 | "Domain": "", 434 | "Short Description": "Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening.", 435 | "#Tasks": 17, 436 | "#Compounds": "30 actives and 1,500 decoys per target", 437 | "Task Type": "Rank", 438 | "Data Type": "SMILES", 439 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=5-,MUV,-Maximum%20Unbiased%20Validation", 440 | "Papers": [ 441 | { 442 | "Name": "Extracting Predictive Representations from Hundreds of Millions of Molecules", 443 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03058" 444 | } 445 | ] 446 | }, 447 | { 448 | "Dataset Name": "Cocaine addiction datasets", 449 | "Domain": "", 450 | "Short Description": "The 36 cocaine-addiction related datasets are collected from the ChEMBL database 
(https://www.ebi.ac.uk/chembl/) and the literature (references 1 and 2 in the README file), which involve 32 cocaine-addiction protein targets. The labels are binding affinities to these targets.", 451 | "#Tasks": 36, 452 | "#Compounds": "between 114 and 6,923 depending on the target", 453 | "Task Type": "Regression", 454 | "Data Type": "SMILES", 455 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref6:~:text=5-,Cocaine%20addiction%20datasets,-The%2036%20cocaine", 456 | "Papers": [ 457 | { 458 | "Name": "Proteome-informed machine learning studies of cocaine addiction", 459 | "Link": "https://weilab.math.msu.edu/DataLibrary/2D/#ref6:~:text=of%20cocaine%20addiction%22.-,PDF,-%5B7%5D%20Hongsong" 460 | } 461 | ] 462 | }, 463 | { 464 | "Dataset Name": "Cocaine addiction datasets 2", 465 | "Domain": "", 466 | "Short Description": "The 30 additional cocaine-addiction related datasets collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/), which involve 30 cocaine-addiction protein targets. 
The labels are binding affinities to these targets.", 467 | "#Tasks": 36, 468 | "#Compounds": "between 123 and 6,923 depending on the target", 469 | "Task Type": "Regression", 470 | "Data Type": "SMILES", 471 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/Downloads/cocaine_addiction-datasets2.zip", 472 | "Papers": [ 473 | { 474 | "Name": "Proteome-informed machine learning studies of cocaine addiction", 475 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03133?casa_token=H4K9rfMLmasAAAAA:_C3oLB_pkvc5Lbd-aklaIASqvHZwue_Z3ghqfUgBkjj4LtmD9kU4urhC5zT5zegGO2ncig5v3dL_Qg" 476 | } 477 | ] 478 | }, 479 | { 480 | "Dataset Name": "Drug_addiction_related", 481 | "Domain": "", 482 | "Short Description": "Receptors related to opioid or cocaine addiction.", 483 | "#Tasks": 11, 484 | "#Compounds": "between 815 and 11,297 depending on target", 485 | "Task Type": "Regression", 486 | "Data Type": "SMILES", 487 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/Downloads/drug_addiction_related_WeiWeb_2D.zip", 488 | "Papers": [ 489 | { 490 | "Name": "TIDAL: Topology-Inferred Drug Addiction Learning", 491 | "Link": "https://pubs.acs.org/doi/full/10.1021/acs.jcim.3c00046?casa_token=C4B_jMAbt4AAAAAA:BLEYP4-f1E8ZP1-3umVhxzrrXuGUzVLJkhOCFneHCeQOwXG6eb8e0NyVeOis8xBwz3jgxdawRDrKwQ" 492 | } 493 | ] 494 | }, 495 | { 496 | "Dataset Name": "hERG blocker/non-blocker datasets", 497 | "Domain": "", 498 | "Short Description": "Seven datasets are provided for the classification of hERG blocker/non-blockers. 
These datasets are taken from the literature, and the original datasets are included.", 499 | "#Tasks": 7, 500 | "#Compounds": "between 927 and 203,853 (train) and 407 and 87,366 (test) depending on the task", 501 | "Task Type": "Classification", 502 | "Data Type": "SMILES", 503 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/Downloads/hERG-classification.zip", 504 | "Papers": [ 505 | { 506 | "Name": "Virtual screening of DrugBank database for hERG blockers using topological Laplacian-assisted AI models", 507 | "Link": "https://www.sciencedirect.com/science/article/pii/S0010482522011994" 508 | } 509 | ] 510 | }, 511 | { 512 | "Dataset Name": "Opioid use disorder datasets", 513 | "Domain": "", 514 | "Short Description": "75 datasets collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and used in the machine-learning study of opioid use disorder. The labels are binding affinities to these targets.", 515 | "#Tasks": 75, 516 | "#Compounds": "between 268 and 6,298 depending on the task", 517 | "Task Type": "Regression", 518 | "Data Type": "SMILES", 519 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/Downloads/OUD-datasets.zip", 520 | "Papers": [ 521 | { 522 | "Name": "Machine-learning Analysis of Opioid Use Disorder Informed by MOR, DOR, KOR, NOR and ZOR-Based Interactome Networks", 523 | "Link": "https://arxiv.org/abs/2301.04815" 524 | }, 525 | { 526 | "Name": "Machine-learning Repurposing of DrugBank Compounds for Opioid Use Disorder", 527 | "Link": "https://arxiv.org/abs/2303.00240" 528 | } 529 | ] 530 | }, 531 | { 532 | "Dataset Name": "SVS datasets", 533 | "Domain": "", 534 | "Short Description": "Nine datasets for biomolecular interactions, including 4 regression and 5 classification tasks.", 535 | "#Tasks": 9, 536 | "#Compounds": "between 186 and 11,188 depending on the task", 537 | "Task Type": "Regression, Classification", 538 | "Data Type": "SMILES", 539 | "DownloadLink": 
"https://weilab.math.msu.edu/DataLibrary/2D/Downloads/SVS_datasets.zip", 540 | "Papers": [ 541 | { 542 | "Name": "SVSBI: Sequence-based virtual screening of biomolecular interactions", 543 | "Link": "https://arxiv.org/abs/2212.13617" 544 | } 545 | ] 546 | } 547 | ] 548 | --------------------------------------------------------------------------------