├── .gitignore
├── CONTRIBUTE.md
├── README.md
├── docs
│   ├── index.html
│   ├── materials.html
│   └── molecules.html
├── images
│   ├── AiMat_logo_purple.png
│   └── ChemMatData_logo_final.png
├── materials.json
└── molecules.json
/.gitignore:
--------------------------------------------------------------------------------
1 | # See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
2 |
3 | # dependencies
4 | website/node_modules
5 | website/.pnp
6 | .pnp.js
7 |
8 | # testing
9 | website/coverage
10 |
11 | # production
12 | website/build
13 |
14 | # misc
15 | .DS_Store
16 | .env.local
17 | .env.development.local
18 | .env.test.local
19 | .env.production.local
20 |
21 | npm-debug.log*
22 | yarn-debug.log*
23 | yarn-error.log*
24 |
--------------------------------------------------------------------------------
/CONTRIBUTE.md:
--------------------------------------------------------------------------------
1 | ## Option 1:
2 | Please send a link to a missing dataset to Jana (jana.zeller@student.kit.edu) and Pascal (pascal.friederich@kit.edu).
3 | Or, if you have a new dataset, please send it to us along with a short description.
4 |
5 | ## Option 2:
6 | Click on the `branch` button and then click on the green `new branch` button.
7 | Edit the `molecules.json` or `materials.json` file directly online on GitHub.
8 |
9 | ## Option 3:
10 | 1. Fork the Project on GitHub and clone your fork (`git clone https://github.com/<your-username>/ChemMatData.git`)
11 | 2. Create your Dataset / Feature Branch (`git checkout -b feature/AmazingDataset`)
12 | 3. Open the desired JSON file (molecules.json or materials.json) in a text editor.
13 | 4. Create a new JSON object that represents your dataset. Make sure to include all the relevant fields: Dataset Name, Domain, Task Type, Data Type, #Compounds, #Tasks, Short Description, Papers, and DownloadLink.
14 | 5. If the dataset has multiple values for Task Type, Data Type, or DownloadLink, separate each entry with a comma followed by a whitespace (", ").
15 | 6. For the Papers field, create an array of objects where each object represents a paper with the fields Name and Link.
16 |
17 | Your JSON could now look like this:
18 | ```
19 | {
20 | "Dataset Name": "QM9",
21 | "Domain": "Quantum Mechanics",
22 | "Short Description": "QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of GDB-17 database, comprising 134 thousand stable organic molecules with up to nine heavy atoms. All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT)",
23 | "#Tasks": 16,
24 | "#Compounds": 134000,
25 | "Task Type": "Regression",
26 | "Data Type": "SMILES, 3D coordinates",
27 | "DownloadLink": "http://quantum-machine.org/datasets/#:~:text=Available%20via-,figshare,-.",
28 | "Papers" : [
29 | {
30 | "Name": "Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17",
31 | "Link": "http://pubs.acs.org/doi/abs/10.1021/ci300415d"
32 | },
33 | {
34 | "Name": "Quantum chemistry structures and properties of 134 kilo molecules",
35 | "Link": "http://quantum-machine.org/datasets/#:~:text=A.%20von%20Lilienfeld%2C-,Quantum%20chemistry%20structures%20and%20properties%20of%20134%20kilo%20molecules,-%2C%20Scientific%20Data"
36 | }
37 | ]
38 | }
39 | ```
40 | 7. Add your new dataset object to the end of the array, making sure to place a comma after the preceding object.
41 | 8. Commit your Changes (`git commit -m 'Add some AmazingDataset'`)
42 | 9. Push to the Branch (`git push origin feature/AmazingDataset`)
43 | 10. Open a Pull Request. To do so, click the branch icon on the repository's start page and navigate to the branch you just added. In the yellow banner, click the `Compare & pull request` button. Select the `main` branch as the branch you want to merge into, write a short description of the dataset you added, and click `Create Pull Request`. See [GitHub's detailed description](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) of how to open pull requests.
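
Before committing, the field rules from steps 4 to 6 can be checked locally. The following is a minimal Python sketch and not part of this repository: the function names `check_entry` and `check_file` are hypothetical, and the assumption that every paper object carries at least `Name` and `Link` is read from step 6 rather than from any schema the project publishes.

```python
import json

# Field names copied from step 4 of this guide.
REQUIRED_FIELDS = [
    "Dataset Name", "Domain", "Task Type", "Data Type",
    "#Compounds", "#Tasks", "Short Description", "Papers", "DownloadLink",
]


def check_entry(entry):
    """Return a list of human-readable problems found in one dataset object."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    papers = entry.get("Papers", [])
    if not isinstance(papers, list):
        problems.append("Papers must be an array")
    else:
        for i, paper in enumerate(papers):
            # Assumption from step 6: each paper is an object with Name and Link.
            if not isinstance(paper, dict) or not {"Name", "Link"} <= set(paper):
                problems.append(f"Papers[{i}] needs the fields Name and Link")
    return problems


def check_file(path):
    """Parse a JSON file and check every entry; returns {dataset name: problems}."""
    with open(path, encoding="utf-8") as handle:
        # json.load raises json.JSONDecodeError on syntax slips such as a missing comma.
        entries = json.load(handle)
    report = {}
    for index, entry in enumerate(entries):
        problems = check_entry(entry)
        if problems:
            report[entry.get("Dataset Name", f"entry {index}")] = problems
    return report
```

Calling `check_file("molecules.json")` from the repository root returns an empty dictionary when every entry has all listed fields; a JSON syntax error surfaces as an exception from `json.load` instead.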
44 |
45 |
46 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ChemMatData
2 | ## Global collection of molecular and materials datasets
3 | Website at [aimat-lab.github.io/ChemMatData](https://aimat-lab.github.io/ChemMatData/index.html)
4 |
5 |
6 |
7 | ---
8 |
9 | ## About The Project
10 | Our main goal is to create a collaborative platform
11 | where we can gather and categorize various datasets,
12 | making them conveniently accessible in one place.
13 | We are actively collecting datasets for [molecules](https://github.com/aimat-lab/ChemMatData/blob/main/molecules.json) as well as [crystalline structures](https://github.com/aimat-lab/ChemMatData/blob/main/materials.json) to provide a comprehensive resource for researchers, scientists, and enthusiasts.
14 |
15 | See the list of all [molecular datasets](https://github.com/aimat-lab/ChemMatData/blob/main/molecules.json).
16 | See the list of all [materials datasets](https://github.com/aimat-lab/ChemMatData/blob/main/materials.json).
17 |
18 | You can sort and explore all available datasets using our [website](https://aimat-lab.github.io/ChemMatData/index.html).
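
The same JSON files can also be explored offline. The sketch below is a hedged illustration, not the website's implementation: the helper names (`load_datasets`, `split_values`, `filter_datasets`, `compound_count`) are hypothetical. It relies on two things visible in `molecules.json`: multi-valued fields are joined with a comma, and `#Compounds` is sometimes a number and sometimes free text.

```python
import json


def load_datasets(path="molecules.json"):
    """Read the list of dataset objects from one of the repository's JSON files."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def split_values(value):
    """Split a comma-separated field such as "Regression, Classification"."""
    return [part.strip() for part in str(value).split(",") if part.strip()]


def filter_datasets(entries, task_type=None, data_type=None):
    """Keep entries whose Task Type / Data Type lists contain the given values."""
    selected = []
    for entry in entries:
        if task_type and task_type not in split_values(entry.get("Task Type", "")):
            continue
        if data_type and data_type not in split_values(entry.get("Data Type", "")):
            continue
        selected.append(entry)
    return selected


def compound_count(entry):
    """Sort key: numeric #Compounds in ascending order, free-text counts last."""
    count = entry.get("#Compounds")
    return (0, count) if isinstance(count, int) else (1, 0)
```

For example, `sorted(filter_datasets(load_datasets(), task_type="Regression"), key=compound_count)` lists the regression datasets from smallest to largest, with entries whose compound count is given as text placed at the end.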
19 |
20 |
21 | ---
22 |
23 | ## Contributing
24 | If you have additional datasets that you believe should be included in our repository, we encourage you to [contribute](https://github.com/aimat-lab/ChemMatData/blob/main/CONTRIBUTE.md).
25 | Here's how you can do it:
26 | 1. Send a link to a missing dataset, or your own dataset with a short description, to Pascal (pascal.friederich@kit.edu).
27 | 2. Directly add your dataset to the table in your browser.
28 | 3. Clone and extend this repository ([detailed description here](https://github.com/aimat-lab/ChemMatData/blob/main/CONTRIBUTE.md))
29 |
30 | We appreciate your contribution and look forward to incorporating your suggested datasets into our growing collection!
31 |
32 | ---
33 |
34 | ## List of contributors
35 |
36 | Send us pull requests or emails with new datasets if you want to see your name here!
37 |
38 | ---
39 |
40 | ## About Us
41 | An open-source project hosted by the [AiMat Group](https://aimat.iti.kit.edu/) at the [Karlsruhe Institute of Technology (KIT)](https://www.kit.edu/).
42 | The initial version of this project was developed by Jana Zeller and Pascal Friederich.
43 |
44 |
45 |
46 |
47 |
48 |
--------------------------------------------------------------------------------
/docs/index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 | ChemMatData
14 |
15 |
16 |
17 |
18 |
31 |
32 |
33 |
34 |
35 |
36 |
37 |
65 |
66 |
67 | Our main goal is to create a collaborative platform
68 | where we can gather and categorize various datasets,
69 | making them conveniently accessible in one place.
70 | We are actively collecting datasets for molecules as well as crystalline structures
71 | to provide a comprehensive resource for researchers, scientists, and enthusiasts.
72 |
73 |
74 |
75 | Contributors
76 | Send us pull requests or emails with new datasets if you want to see your name here!
77 |
78 |
79 |
83 |
84 |
89 |
90 |
91 |
96 | © 2023 AiMAT
97 |
98 |
--------------------------------------------------------------------------------
/docs/materials.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 | ChemMatData
14 |
15 |
16 |
17 |
18 |
31 |
32 |
33 | Materials
34 | Coming Soon!
35 |
36 |
37 |
42 | © 2023 AiMAT
43 |
44 |
--------------------------------------------------------------------------------
/docs/molecules.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
134 |
135 |
136 |
137 |
138 |
139 |
140 | ChemMatData
141 |
142 |
143 |
144 |
145 |
158 |
159 |
160 | Molecules
161 |
162 |
163 |
164 | Dataset Name
165 | Domain
166 |
167 | #Tasks
168 | #Compounds
169 | Task Type
170 | Data Type
171 | More Infos
172 |
173 |
174 |
175 |
176 |
177 |
178 |
186 |
--------------------------------------------------------------------------------
/images/AiMat_logo_purple.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aimat-lab/ChemMatData/c99907247ff053318e7479dcf6e6e6d9e2198a72/images/AiMat_logo_purple.png
--------------------------------------------------------------------------------
/images/ChemMatData_logo_final.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aimat-lab/ChemMatData/c99907247ff053318e7479dcf6e6e6d9e2198a72/images/ChemMatData_logo_final.png
--------------------------------------------------------------------------------
/materials.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aimat-lab/ChemMatData/c99907247ff053318e7479dcf6e6e6d9e2198a72/materials.json
--------------------------------------------------------------------------------
/molecules.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "Dataset Name": "QM9",
4 | "Domain": "Quantum Mechanics",
5 | "Short Description": "QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of GDB-17 database, comprising 134 thousand stable organic molecules with up to nine heavy atoms. All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT)",
6 | "#Tasks": 16,
7 | "#Compounds": 134000,
8 | "Task Type": "Regression",
9 | "Data Type": "SMILES, 3D coordinates",
10 | "DownloadLink": "http://quantum-machine.org/datasets/#:~:text=Available%20via-,figshare,-.",
11 | "Papers" : [
12 | {
13 | "Name": "Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17",
14 | "Link": "http://pubs.acs.org/doi/abs/10.1021/ci300415d"
15 | },
16 | {
17 | "Name": "Quantum chemistry structures and properties of 134 kilo molecules",
18 | "Link": "http://quantum-machine.org/datasets/#:~:text=A.%20von%20Lilienfeld%2C-,Quantum%20chemistry%20structures%20and%20properties%20of%20134%20kilo%20molecules,-%2C%20Scientific%20Data"
19 | }
20 | ]
21 | },
22 | {
23 | "Dataset Name": "PCQM4Mv2",
24 | "Domain": "Quantum Mechanics",
25 | "Short Description": "Based on the PubChemQC, we define a meaningful ML task of predicting DFT-calculated HOMO-LUMO energy gap of molecules given their 2D molecular graphs. The HOMO-LUMO gap is one of the most practically-relevant quantum chemical properties of molecules since it is related to reactivity, photoexcitation, and charge transport.",
26 | "#Tasks": 1,
27 | "#Compounds": 3378606,
28 | "Task Type": "Regression",
29 | "Data Type": "SMILES",
30 | "DownloadLink": "https://ogb.stanford.edu/docs/lsc/pcqm4mv2/#dataset",
31 | "Papers": []
32 | },
33 | {
34 | "Dataset Name": "Alchemy",
35 | "Domain": "Quantum Mechanics",
36 | "Short Description": "The dataset comprises 12 quantum mechanical properties of 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database.",
37 | "#Tasks": 12,
38 | "#Compounds": 202579,
39 | "Task Type": "Regression",
40 | "Data Type": "SMILES, 3D coordinates",
41 | "DownloadLink": "https://chrsmrrs.github.io/datasets/docs/datasets/",
42 | "Papers": [
43 | {
44 | "Name": "Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models",
45 | "Link": "https://arxiv.org/pdf/1906.09427.pdf"
46 | }
47 | ]
48 | },
49 | {
50 | "Dataset Name": "BACE",
51 | "Domain": "Biophysics",
52 | "Short Description": "The BACE dataset provides quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1)",
53 | "#Tasks": 2,
54 | "#Compounds": 1522,
55 | "Task Type": "Regression, Classification",
56 | "Data Type": "SMILES",
57 | "DownloadLink": "https://moleculenet.org/datasets-1",
58 | "Papers": []
59 | },
60 | {
61 | "Dataset Name": "FreeSolv",
62 | "Domain": "Physical Chemistry",
63 | "Short Description": "A collection of experimental and calculated hydration free energies for small molecules in water. The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations.",
64 | "#Tasks": 1,
65 | "#Compounds": 643,
66 | "Task Type": "Regression",
67 | "Data Type": "SMILES",
68 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=organic%20small%20molecules.-,FreeSolv,-%3A%20Experimental%20and%20calculated, https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=4-,FreeSolv,-Solvation%20free%20energy",
69 | "Papers": [
70 | {
71 | "Name": "FreeSolv: a database of experimental and calculated hydration free energies, with input files",
72 | "Link": "https://pubmed.ncbi.nlm.nih.gov/24928188/"
73 | }
74 | ]
75 | },
76 | {
77 | "Dataset Name": "ESOL (delaney)",
78 | "Domain": "Physical Chemistry",
79 | "Short Description": "Water solubility data (log solubility in mols per litre) for common organic small molecules.",
80 | "#Tasks": 1,
81 | "#Compounds": 1128,
82 | "Task Type": "Regression",
83 | "Data Type": "SMILES",
84 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=modelled%20small%20molecules.-,ESOL,-%3A%20Water%20solubility%20data, https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=5-,ESOL,-ESOL%20(delaney)%20is",
85 | "Papers": [
86 | {
87 | "Name": "ESOL: Estimating Aqueous Solubility Directly from Molecular Structure",
88 | "Link": "https://pubs.acs.org/doi/10.1021/ci034243x"
89 | }
90 | ]
91 | },
92 | {
93 | "Dataset Name": "Lipophilicity",
94 | "Domain": "Physical Chemistry",
95 | "Short Description": "Experimental results of octanol/water distribution coefficient (logD at pH 7.4).",
96 | "#Tasks": 1,
97 | "#Compounds": 4200,
98 | "Task Type": "Regression",
99 | "Data Type": "SMILES",
100 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=molecules%20in%20water.-,Lipophilicity,-%3A%20Experimental%20results%20of, https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=3%2C%205-,Lipophilicity,-SMILES%20strings%20are",
101 | "Papers": []
102 | },
103 | {
104 | "Dataset Name": "MUV",
105 | "Domain": "Biophysics",
106 | "Short Description": "Subset of PubChem BioAssay selected by applying a refined nearest neighbor analysis, designed for validation of virtual screening techniques.",
107 | "#Tasks": 17,
108 | "#Compounds": 93087,
109 | "Task Type": "Classification",
110 | "Data Type": "SMILES",
111 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=high%2Dthroughput%20screening.-,MUV,-%3A%20Subset%20of%20PubChem",
112 | "Papers": [
113 | {
114 | "Name": "MoleculeNet: A Benchmark for Molecular Machine Learning",
115 | "Link": "https://arxiv.org/abs/1703.00564"
116 | }
117 | ]
118 | },
119 | {
120 | "Dataset Name": "HIV",
121 | "Domain": "Biophysics",
122 | "Short Description": "Experimentally measured abilities to inhibit HIV replication.",
123 | "#Tasks": 1,
124 | "#Compounds": 41127,
125 | "Task Type": "Classification",
126 | "Data Type": "SMILES",
127 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=virtual%20screening%20techniques.-,HIV,-%3A%20Experimentally%20measured%20abilities",
128 | "Papers": [
129 | {
130 | "Name": "MoleculeNet: a benchmark for molecular machine learning",
131 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a"
132 | }
133 | ]
134 | },
135 | {
136 | "Dataset Name": "AIDS",
137 | "Domain": "Biophysics",
138 | "Short Description": "The DTP AIDS Antiviral Screen has checked tens of thousands of compounds for evidence of anti-HIV activity. Available are screening results and chemical structural data on compounds that are not covered by a confidentiality agreement.",
139 | "#Tasks": 2,
140 | "#Compounds": 2000,
141 | "Task Type": "Classification",
142 | "Data Type": "molecular graph",
143 | "DownloadLink": "https://chrsmrrs.github.io/datasets/docs/datasets/#:~:text=%E2%80%93-,AIDS,-alchemy_full",
144 | "Papers": [
145 | {
146 | "Name": "IAM Graph Database Repository for Graph Based Pattern Recognition and Machine Learning",
147 | "Link": "https://link.springer.com/chapter/10.1007/978-3-540-89689-0_33"
148 | },
149 | {
150 | "Name": "AIDS Antiviral Screen Data (2004)",
151 | "Link": "https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data"
152 | }
153 | ]
154 | },
155 | {
156 | "Dataset Name": "PDBbind",
157 | "Domain": "Biophysics",
158 | "Short Description": "Binding affinities for bio-molecular complexes; structures of both proteins and ligands are provided.",
159 | "#Tasks": 1,
160 | "#Compounds": 11908,
161 | "Task Type": "Regression",
162 | "Data Type": "SMILES, 3D coordinates",
163 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=inhibit%20HIV%20replication.-,PDBbind,-%3A%20Binding%20affinities%20for",
164 | "Papers": [
165 | {
166 | "Name": "Comparative assessment of scoring functions on a diverse test set",
167 | "Link": "https://pubmed.ncbi.nlm.nih.gov/19358517/"
168 | }
169 | ]
170 | },
171 | {
172 | "Dataset Name": "BBBP",
173 | "Domain": "Physiology",
174 | "Short Description": "Binary labels of blood-brain barrier penetration (permeability).",
175 | "#Tasks": 1,
176 | "#Compounds": 2039,
177 | "Task Type": "Classification",
178 | "Data Type": "SMILES",
179 | "DownloadLink": "https://moleculenet.org/datasets-1, https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=5-,BBBP,-Blood%E2%80%93brain%20barrier",
180 | "Papers": [
181 | {
182 | "Name": "MoleculeNet: a benchmark for molecular machine learning",
183 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a"
184 | }
185 | ]
186 | },
187 | {
188 | "Dataset Name": "Tox21",
189 | "Domain": "Physiology",
190 | "Short Description": "Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways.",
191 | "#Tasks": 12,
192 | "#Compounds": 7831,
193 | "Task Type": "Classification",
194 | "Data Type": "SMILES",
195 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=barrier%20penetration(permeability).-,Tox21,-%3A%20Qualitative%20toxicity%20measurements",
196 | "Papers": [
197 | {
198 | "Name": "MoleculeNet: a benchmark for molecular machine learning",
199 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a"
200 | }
201 | ]
202 | },
203 | {
204 | "Dataset Name": "ToxCast",
205 | "Domain": "Physiology",
206 | "Short Description": "Toxicology data for a large library of compounds based on in vitro high-throughput screening, including experiments on over 600 tasks.",
207 | "#Tasks": 617,
208 | "#Compounds": 8575,
209 | "Task Type": "Classification",
210 | "Data Type": "SMILES",
211 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=stress%20response%20pathways.-,ToxCast,-%3A%20Toxicology%20data%20for",
212 | "Papers": [
213 | {
214 | "Name": "MoleculeNet: a benchmark for molecular machine learning",
215 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a"
216 | }
217 | ]
218 | },
219 | {
220 | "Dataset Name": "SIDER",
221 | "Domain": "Physiology",
222 | "Short Description": "Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes.",
223 | "#Tasks": 27,
224 | "#Compounds": 1427,
225 | "Task Type": "Classification",
226 | "Data Type": "SMILES",
227 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=over%20600%20tasks.-,SIDER,-%3A%20Database%20of%20marketed",
228 | "Papers": [
229 | {
230 | "Name": "MoleculeNet: a benchmark for molecular machine learning",
231 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a"
232 | }
233 | ]
234 | },
235 | {
236 | "Dataset Name": "ClinTox",
237 | "Domain": "Physiology",
238 | "Short Description": "Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons.",
239 | "#Tasks": 2,
240 | "#Compounds": 1478,
241 | "Task Type": "Classification",
242 | "Data Type": "SMILES",
243 | "DownloadLink": "https://moleculenet.org/datasets-1#:~:text=system%20organ%20classes.-,ClinTox,-%3A%20Qualitative%20data%20of, https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=5-,ClinTox,-The%20ClinTox%20dataset",
244 | "Papers": [
245 | {
246 | "Name": "MoleculeNet: a benchmark for molecular machine learning",
247 | "Link": "https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a"
248 | }
249 | ]
250 | },
251 | {
252 | "Dataset Name": "Quantitative toxicity - LD50",
253 | "Domain": "Physiology",
254 | "Short Description": "The oral rat LD50 dataset (LD50).",
255 | "#Tasks": 1,
256 | "#Compounds": 7413,
257 | "Task Type": "Regression",
258 | "Data Type": "SMILES",
259 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=1-,Quantitative%20toxicity,-LD50",
260 | "Papers": [
261 | {
262 | "Name": "Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks",
263 | "Link": "https://users.math.msu.edu/users/weig/paper/p222.pdf"
264 | },
265 | {
266 | "Name": "Algebraic graph-assisted bidirectional transformers for molecular property prediction",
267 | "Link": "https://www.nature.com/articles/s41467-021-23720-w.pdf"
268 | }
269 | ]
270 | },
271 | {
272 | "Dataset Name": "Quantitative toxicity - IGC50",
273 | "Domain": "Physiology",
274 | "Short Description": "Tetrahymena pyriformis IGC50 dataset (IGC50).",
275 | "#Tasks": 1,
276 | "#Compounds": 1792,
277 | "Task Type": "Regression",
278 | "Data Type": "SMILES",
279 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=1-,Quantitative%20toxicity,-LD50",
280 | "Papers": [
281 | {
282 | "Name": "Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks",
283 | "Link": "https://users.math.msu.edu/users/weig/paper/p222.pdf"
284 | },
285 | {
286 | "Name": "Algebraic graph-assisted bidirectional transformers for molecular property prediction",
287 | "Link": "https://www.nature.com/articles/s41467-021-23720-w.pdf"
288 | }
289 | ]
290 | },
291 | {
292 | "Dataset Name": "Quantitative toxicity - LC50",
293 | "Domain": "Physiology",
294 | "Short Description": "96 h fathead minnow LC50 dataset.",
295 | "#Tasks": 1,
296 | "#Compounds": 813,
297 | "Task Type": "Regression",
298 | "Data Type": "SMILES",
299 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=1-,Quantitative%20toxicity,-LD50",
300 | "Papers": [
301 | {
302 | "Name": "Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks",
303 | "Link": "https://users.math.msu.edu/users/weig/paper/p222.pdf"
304 | },
305 | {
306 | "Name": "Algebraic graph-assisted bidirectional transformers for molecular property prediction",
307 | "Link": "https://www.nature.com/articles/s41467-021-23720-w.pdf"
308 | }
309 | ]
310 | },
311 | {
312 | "Dataset Name": "Quantitative toxicity - LC50DM",
313 | "Domain": "Physiology",
314 | "Short Description": "48 h Daphnia magna LC50 dataset (LC50DM).",
315 | "#Tasks": 1,
316 | "#Compounds": 353,
317 | "Task Type": "Regression",
318 | "Data Type": "SMILES",
319 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=1-,Quantitative%20toxicity,-LD50",
320 | "Papers": [
321 | {
322 | "Name": "Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks",
323 | "Link": "https://users.math.msu.edu/users/weig/paper/p222.pdf"
324 | },
325 | {
326 | "Name": "Algebraic graph-assisted bidirectional transformers for molecular property prediction",
327 | "Link": "https://www.nature.com/articles/s41467-021-23720-w.pdf"
328 | }
329 | ]
330 | },
331 | {
332 | "Dataset Name": "beet",
333 | "Domain": "Physiology",
334 | "Short Description": "The toxicity in honey bees (beet) dataset was extracted from a study on the prediction of acute contact toxicity of pesticides in honeybees. The data set contains 254 compounds with their experimental values.",
335 | "#Tasks": 2,
336 | "#Compounds": 254,
337 | "Task Type": "Classification",
338 | "Data Type": "SMILES",
339 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=5-,beet,-The%20toxicity%20in",
340 | "Papers": [
341 | {
342 | "Name": "Extracting Predictive Representations from Hundreds of Millions of Molecules",
343 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03058"
344 | }
345 | ]
346 | },
347 | {
348 | "Dataset Name": "logP",
349 | "Domain": "",
350 | "Short Description": "Partition coefficient datasets, including training set (8199 compounds), Food and Drug Administration (FDA) set, Star, and Nonstar set.",
351 | "#Tasks": 3,
352 | "#Compounds": "8199(train), 406(test-FDA), 223(test-Star), 43(test-Nonstar)",
353 | "Task Type": "Regression",
354 | "Data Type": "SMILES, 3D coordinates",
355 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/3D/#:~:text=Reference-,logP,-Partition%20coefficient%20datasets",
356 | "Papers": [
357 | {
358 | "Name": "Algebraic graph-assisted bidirectional transformers for molecular property prediction",
359 | "Link": "https://www.nature.com/articles/s41467-021-23720-w.pdf"
360 | },
361 | {
362 | "Name": "TopP–S: Persistent Homology-Based Multi-Task Deep Neural Networks for Simultaneous Predictions of Partition Coefficient and Aqueous Solubility",
363 | "Link": "https://users.math.msu.edu/users/weig/paper/p223.pdf"
364 | }
365 | ]
366 | },
367 | {
368 | "Dataset Name": "logS(1)",
369 | "Domain": "",
370 | "Short Description": "Small aqueous solubility datasets.",
371 | "#Tasks": 2,
372 | "#Compounds": 1431,
373 | "Task Type": "Regression",
374 | "Data Type": "SMILES, 3D coordinates",
375 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=1%2C%203-,logS(1),-A%20diverse%20dataset",
376 | "Papers": [
377 | {
378 | "Name": "TopP–S: Persistent Homology-Based Multi-Task Deep Neural Networks for Simultaneous Predictions of Partition Coefficient and Aqueous Solubility",
379 | "Link": "https://users.math.msu.edu/users/weig/paper/p223.pdf"
380 | }
381 | ]
382 | },
383 | {
384 | "Dataset Name": "DPP4",
385 | "Domain": "",
386 | "Short Description": "The DPP-4 inhibitors (DPP4) dataset was extracted from ChEMBL for the DPP-4 target. The data were processed by removing salts, normalizing molecular structures, and checking for duplicate molecules, leaving 3933 molecules.",
387 | "#Tasks": 1,
388 | "#Compounds": 3933,
389 | "Task Type": "Regression",
390 | "Data Type": "SMILES",
391 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=3%2C%205-,DPP4,-DPP%2D4%20inhibitors",
392 | "Papers": [
393 | {
394 | "Name": "Extracting Predictive Representations from Hundreds of Millions of Molecules",
395 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03058"
396 | }
397 | ]
398 | },
399 | {
400 | "Dataset Name": "Ames",
401 | "Domain": "",
402 | "Short Description": "Ames mutagenicity. The dataset includes 6512 compounds and corresponding binary labels from Ames Mutagenicity results.",
403 | "#Tasks": 1,
404 | "#Compounds": 6512,
405 | "Task Type": "Classification",
406 | "Data Type": "SMILES",
407 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=3%2C%205-,Ames,-Ames%20mutagenicity.%20The",
408 | "Papers": [
409 | {
410 | "Name": "Extracting Predictive Representations from Hundreds of Millions of Molecules",
411 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03058"
412 | }
413 | ]
414 | },
415 | {
416 | "Dataset Name": "DUD",
417 | "Domain": "",
418 | "Short Description": "A Directory of Useful Decoys (DUD).",
419 | "#Tasks": 21,
420 | "#Compounds": "between 31 and 365 actives and 1,344 and 15,560 decoys depending on target",
421 | "Task Type": "Rank",
422 | "Data Type": "SMILES",
423 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#:~:text=5-,DUD,-A%20Directory%20of",
424 | "Papers": [
425 | {
426 | "Name": "Extracting Predictive Representations from Hundreds of Millions of Molecules",
427 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03058"
428 | }
429 | ]
430 | },
431 | {
432 | "Dataset Name": "MUV",
433 | "Domain": "",
434 | "Short Description": "Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening.",
435 | "#Tasks": 17,
436 | "#Compounds": "30 actives and 1,500 decoys per target",
437 | "Task Type": "Rank",
438 | "Data Type": "SMILES",
439 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref5:~:text=5-,MUV,-Maximum%20Unbiased%20Validation",
440 | "Papers": [
441 | {
442 | "Name": "Extracting Predictive Representations from Hundreds of Millions of Molecules",
443 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03058"
444 | }
445 | ]
446 | },
447 | {
448 | "Dataset Name": "Cocaine addiction datasets",
449 | "Domain": "",
450 | "Short Description": "The 36 cocaine-addiction related datasets were collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and the literature (references 1 and 2 in the README file) and involve 32 cocaine-addiction protein targets. The labels are binding affinities to these targets.",
451 | "#Tasks": 36,
452 | "#Compounds": "between 114 and 6,923 depending on the target",
453 | "Task Type": "Regression",
454 | "Data Type": "SMILES",
455 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/#ref6:~:text=5-,Cocaine%20addiction%20datasets,-The%2036%20cocaine",
456 | "Papers": [
457 | {
458 | "Name": "Proteome-informed machine learning studies of cocaine addiction",
459 | "Link": "https://weilab.math.msu.edu/DataLibrary/2D/#ref6:~:text=of%20cocaine%20addiction%22.-,PDF,-%5B7%5D%20Hongsong"
460 | }
461 | ]
462 | },
463 | {
464 | "Dataset Name": "Cocaine addiction datasets 2",
465 | "Domain": "",
466 | "Short Description": "The 30 additional cocaine-addiction related datasets were collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and involve 30 cocaine-addiction protein targets. The labels are binding affinities to these targets.",
467 | "#Tasks": 36,
468 | "#Compounds": "between 123 and 6,923 depending on the target",
469 | "Task Type": "Regression",
470 | "Data Type": "SMILES",
471 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/Downloads/cocaine_addiction-datasets2.zip",
472 | "Papers": [
473 | {
474 | "Name": "Proteome-informed machine learning studies of cocaine addiction",
475 | "Link": "https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.1c03133?casa_token=H4K9rfMLmasAAAAA:_C3oLB_pkvc5Lbd-aklaIASqvHZwue_Z3ghqfUgBkjj4LtmD9kU4urhC5zT5zegGO2ncig5v3dL_Qg"
476 | }
477 | ]
478 | },
479 | {
480 | "Dataset Name": "Drug_addiction_related",
481 | "Domain": "",
482 | "Short Description": "Receptors related to opioid or cocaine addiction.",
483 | "#Tasks": 11,
484 | "#Compounds": "between 815 and 11,297 depending on target",
485 | "Task Type": "Regression",
486 | "Data Type": "SMILES",
487 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/Downloads/drug_addiction_related_WeiWeb_2D.zip",
488 | "Papers": [
489 | {
490 | "Name": "TIDAL: Topology-Inferred Drug Addiction Learning",
491 | "Link": "https://pubs.acs.org/doi/full/10.1021/acs.jcim.3c00046?casa_token=C4B_jMAbt4AAAAAA:BLEYP4-f1E8ZP1-3umVhxzrrXuGUzVLJkhOCFneHCeQOwXG6eb8e0NyVeOis8xBwz3jgxdawRDrKwQ"
492 | }
493 | ]
494 | },
495 | {
496 | "Dataset Name": "hERG blocker/non-blocker datasets",
497 | "Domain": "",
498 | "Short Description": "Seven datasets are provided for the classification of hERG blockers/non-blockers. These datasets come from the literature, and the original datasets are included.",
499 | "#Tasks": 7,
500 | "#Compounds": "between 927 and 203,853 (train) and 407 and 87,366 (test) depending on the task",
501 | "Task Type": "Classification",
502 | "Data Type": "SMILES",
503 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/Downloads/hERG-classification.zip",
504 | "Papers": [
505 | {
506 | "Name": "Virtual screening of DrugBank database for hERG blockers using topological Laplacian-assisted AI models",
507 | "Link": "https://www.sciencedirect.com/science/article/pii/S0010482522011994"
508 | }
509 | ]
510 | },
511 | {
512 | "Dataset Name": "Opioid use disorder datasets",
513 | "Domain": "",
514 | "Short Description": "75 datasets collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and used in the machine-learning study of opioid use disorder. The labels are binding affinities to these targets.",
515 | "#Tasks": 75,
516 | "#Compounds": "between 268 and 6,298 depending on the task",
517 | "Task Type": "Regression",
518 | "Data Type": "SMILES",
519 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/Downloads/OUD-datasets.zip",
520 | "Papers": [
521 | {
522 | "Name": "Machine-learning Analysis of Opioid Use Disorder Informed by MOR, DOR, KOR, NOR and ZOR-Based Interactome Networks",
523 | "Link": "https://arxiv.org/abs/2301.04815"
524 | },
525 | {
526 | "Name": "Machine-learning Repurposing of DrugBank Compounds for Opioid Use Disorder",
527 | "Link": "https://arxiv.org/abs/2303.00240"
528 | }
529 | ]
530 | },
531 | {
532 | "Dataset Name": "SVS datasets",
533 | "Domain": "",
534 | "Short Description": "The 9 datasets for biomolecular interactions, including 4 regressions and 5 classifications.",
535 | "#Tasks": 9,
536 | "#Compounds": "between 186 and 11,188 depending on the task",
537 | "Task Type": "Regression, Classification",
538 | "Data Type": "SMILES",
539 | "DownloadLink": "https://weilab.math.msu.edu/DataLibrary/2D/Downloads/SVS_datasets.zip",
540 | "Papers": [
541 | {
542 | "Name": "SVSBI: Sequence-based virtual screening of biomolecular interactions",
543 | "Link": "https://arxiv.org/abs/2212.13617"
544 | }
545 | ]
546 | }
547 | ]
548 |
--------------------------------------------------------------------------------