├── README.md
├── dataset_overview.json
└── dataset_selection.ipynb

/README.md:
--------------------------------------------------------------------------------
# Molecular Quantum Chemical Data Sets and Databases for Machine Learning Potentials

This GitHub dashboard provides links and brief descriptions for the quantum chemistry data sets and databases covered in our review paper, https://doi.org/10.1088/2632-2153/ad8f13. Entries are listed in alphabetical order. **New additions to the repo are listed at the end: [Click me](#new-updates)**

## 1. AIMEl-DB Dataset
- **Description**: AIMEl-DB provides atomic properties (e.g., energy, dipole, quadrupole) for 44,000 molecules randomly selected from QM9. These properties are calculated using DFT at the B3LYP/6-31G(2df,p) level with Gaussian 16 and AIMAll.
- **Data Accessibility**:
  - [Zenodo](https://zenodo.org/records/11406726)

## 2. Alchemy Dataset
- **Description**: The Alchemy dataset contains 12 quantum mechanical properties for 119,487 organic molecules (up to 14 heavy atoms) from GDB-17. DFT calculations (B3LYP/6-31G(2df,p)) with PySCF provide molecular geometries together with electronic and thermochemical properties.
- **Data Accessibility**:
  - [GitHub](https://github.com/tencent-alchemy/Alchemy)

## 3. ANI-1 Dataset
- **Description**: Non-equilibrium DFT total energies for approximately 20 million conformations of 57,462 small organic molecules, used to train the ANI-1 potential. Calculations used the MMFF94 force field and $\omega$B97X/6-31G(d) with Gaussian 09.
- **Data Accessibility**:
  - [ANI-1 Dataset on Figshare](https://doi.org/10.6084/m9.figshare.5287732.v1)
  - [ANI-1 Dataset on GitHub](https://github.com/isayev/ANI1_dataset)

## 4. ANI-1x and ANI-1ccx Datasets
- **Description**: The ANI-1x data set comprises DFT calculations ($\omega$B97X/6-31G* and $\omega$B97X/def2-TZVPP) for approximately 5 million organic molecule conformations. The ANI-1ccx data set is a subset of ANI-1x, recomputed at the CCSD(T)/CBS level of theory. Several software packages were used in data generation, including RDKit, ASE, Gaussian 09, ORCA, and the HORTON library.
- **Data Accessibility**:
  - [ANI-1x and ANI-1ccx Datasets on Figshare](https://doi.org/10.6084/m9.figshare.c.4712477)
  - [GitHub Repository](https://github.com/aiqm/ANI1x_datasets)

## 5. ANI-2x
- **Description**: The ANI-2x dataset (8.9 million molecules) is used to train the ANI-2x ML model. It covers 7 elements (H, C, N, O, S, F, Cl), extending ANI-1x (4 elements). ANI-2x uses S66x8 to better describe non-bonded interactions and a new sampling technique for bulk water. It is built from GDB-11, ChEMBL, S66x8, amino acids, and dipeptides. Active learning generates non-equilibrium geometries, refined for torsion, non-bonded interactions, and bulk water. Energies and forces are calculated using $\omega$B97X/6-31G* with Gaussian 09.
- **Data Accessibility**:
  - [Zenodo](https://doi.org/10.5281/zenodo.10108942)

## 6. bigQM7ω
- **Description**: Ground-state properties and electronic spectra for over 12,880 molecules, calculated using DFT and TDDFT with various functionals and basis sets in the Gaussian software.
- **Data Accessibility**:
  - [Core Data](https://moldisgroup.github.io/bigQM7w)
  - [NOMAD Repository](https://dx.doi.org/10.17172/NOMAD/2021.09.30-1)
  - [Data-mining Platform](https://moldis.tifrh.res.in/index.html)

## 7. **C7O2H10-17**
- **Description**: Molecular dynamics trajectories for 113 randomly selected isomers of C7O2H10 (the largest set of isomers within the QM9 data set), calculated using DFT (PBE functional) with the FHI-aims software.
- **Data Accessibility**:
  - [quantum-machine.org](http://quantum-machine.org/data%20sets/)

## 8. **CheMFi**
- **Description**: A multifidelity compilation of quantum chemical properties derived from a subset of the WS22 database, featuring 135,000 geometries sampled from nine distinct molecules. It includes five levels of fidelity, each corresponding to a specific basis set size (STO-3G, 3-21G, 6-31G, def2-SVP, def2-TZVP). The dataset was generated using TD-DFT with the CAM-B3LYP functional in the ORCA software.
- **Data Accessibility**:
  - [GitHub](https://github.com/SM4DA/CheMFi)

## 9. **COMPAS Project**
- **Description**: The COMPAS project provides structures and properties for polycyclic aromatic systems; COMPAS-1 (43k), COMPAS-2 (0.5 million), and COMPAS-3 (40k) focus on different types of polycyclic molecules. All include molecules up to 11 rings (xTB), with up to 10 rings calculated at DFT level using CaGe, GFN1/2-xTB, B3LYP-D3(BJ)/def2-SVP, and CAM-B3LYP-D3BJ/aug-cc-pVDZ.
- **Data Accessibility**:
  - [GitLab](https://gitlab.com/porannegroup/compas)
  - [Webpage](https://compas.net.technion.ac.il/)

## 10. **CREMP**
- **Description**: The CREMP dataset (36k macrocyclic peptides, 31.3 million conformers) is intended for ML training. It was generated using CREST, RDKit, ETKDGv3, MMFF94, GFN2-xTB, and metadynamics, with properties calculated at the semiempirical level. CREMP-CycPeptMPDB (8.7 million conformers) was added for passive membrane permeability.
- **Data Accessibility**:
  - [GitHub](https://github.com/Genentech/cremp)
  - [CREMP on Zenodo](https://doi.org/10.5281/zenodo.7931444)
  - [CREMP-CycPeptMPDB on Zenodo](https://doi.org/10.5281/zenodo.10798261)

## 11. **GEOM**
- **Description**: The GEOM dataset (451,186 molecules) provides high-quality conformers for drug-like species and QM9 molecules. It was generated using RDKit, MMFF, GFN2-xTB, CREST, CENSO, and ORCA. The dataset includes conformers with energies, vibrational frequencies, and other properties.
- **Data Accessibility**:
  - [GitHub](https://github.com/learningmatter-mit/geom)

## 12. **ISO17**
- **Description**: Extends the C7O2H10-17 dataset with 129 isomers and additional data, calculated using DFT (PBE GGA functional with the Tkatchenko-Scheffler (TS) van der Waals correction) with the FHI-aims software.
- **Data Accessibility**:
  - [quantum-machine.org](http://quantum-machine.org/data%20sets/)

## 13. MD17 and its later versions
- **Description**: A collection of ab initio molecular dynamics trajectories with total energies and forces for ten small organic molecules, calculated using PIMD, PBE+vdW-TS, PBE/def2-SVP, CCSD, and CCSD(T), with software including the i-PI code, ORCA, and FHI-aims.
- **Data Accessibility**:
  - [MD17 Data Sets](http://www.sgdml.org/#data_sets)
  - [rMD17 Data Set](https://dx.doi.org/10.6084/m9.figshare.12672038)

## 14. MD22
- **Description**: MD trajectories for seven systems spanning four major classes of biomolecules and supramolecular structures, calculated at the PBE+MBD level of theory with "light" and "tight" basis sets within the FHI-aims framework.
- **Data Accessibility**:
  - [MD22 Data Set](http://www.sgdml.org/#data_sets)

## 15. MultiXC-QM9
- **Description**: Expands upon QM9 with data from 76 different DFT functionals, three basis sets, and a semi-empirical method (GFN2-xTB) for small organic molecules, using the ADF software package.
- **Data Accessibility**:
  - [MultiXC-QM9 on Figshare](https://doi.org/10.11583/DTU.c.6185986.v3)

## 16. $\nabla^2$ DFT Dataset
- **Description**: A collection of approximately 16 million conformers for around 2 million drug-like molecules, calculated at the $\omega$B97X-D/def2-SVP level of theory with the Psi4 software.
- **Data Accessibility**:
  - [nablaDFT Dataset on GitHub](https://github.com/AIRI-Institute/nablaDFT)

## 17. OFF-ON Dataset
- **Description**: The OFF-ON database contains 75,326 organic molecules (7,869 equilibrium, 67,457 non-equilibrium) for training ML models on functional organic molecules (e.g., photoswitchable catalysts). It was generated using molecular databases (OSCAR, CSD, PubChem) and MD simulations. DFTB calculations provided baseline energies, normalized using multilinear regression. Reference energies were computed using PBE0-D3/def2-SVP. DFTB+ was used for the DFTB calculations, and the PBE0-D3/def2-SVP calculations were performed with TeraChem.
- **Data Accessibility**:
  - [Materials Cloud](https://archive.materialscloud.org/record/2023.189)

## 18. OrbNet Denali
- **Description**: Data set used to develop a machine learning potential for electronic structure calculations. It includes over 2.3 million molecules, calculated using several levels of theory including GFN1-xTB, AIMD, and $\omega$B97X-D3/def2-TZVP, with software including ENTOS BREEZE, DIMORPHITE-DL, and ENTOS QCORE.
- **Data Accessibility**:
  - [OrbNet Denali on Figshare](https://doi.org/10.6084/m9.figshare.14883867)

## 19. PC9
- **Description**: A counterpart to QM9 comprising 99,234 distinct molecules, selected from the PubChemQC project under the same constraints as QM9 (up to 9 heavy atoms drawn from C, N, O, and F).
- **Data Accessibility**:
  - [PC9 on Figshare](https://doi.org/10.6084/m9.figshare.9033977.v1)
  - [PC9 on Zenodo](https://doi.org/10.5281/zenodo.3588370)

## 20. PubChemQC B3LYP/6-31G*//PM6
- **Description**: Electronic properties for 85,938,443 molecules computed with DFT at the B3LYP/6-31G* level of theory, following initial geometry optimization with the PM6 method, using the GAMESS software.
- **Data Accessibility**:
  - [PubChemQC B3LYP/6-31G*//PM6](https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_pm6_datasets.html)

## 21. PubChemQC Database
- **Description**: Contains electronic structures for approximately 3 million molecules, optimized using DFT at the B3LYP/6-31G* level for ground states and TD-DFT with the B3LYP functional and 6-31+G* basis set for low-lying excited states. The software used includes Firefly, SMASH, and GAMESS.
- **Data Accessibility**:
  - [PubChemQC Database](https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_2017.html)

## 22. PubChemQC PM6
- **Description**: PM6 data for 221 million molecules, including optimized geometries and electronic structures, computed using Gaussian 09.
- **Data Accessibility**:
  - [PubChemQC PM6](https://nakatamaho.riken.jp/pubchemqc.riken.jp/pm6_datasets.html)

## 23. QCDGE
- **Description**: An extensive collection of ground- and excited-state properties for 443,106 organic molecules, each containing up to ten heavy atoms (carbon, nitrogen, oxygen, and fluorine). The molecules are sourced from well-known databases such as QM9, PubChemQC, and GDB-11. Ground-state geometry optimizations and frequency calculations for all compounds were carried out at the B3LYP/6-31G* level of theory with BJD3 dispersion correction, while excited-state single-point calculations were performed at the $\omega$B97X-D/6-31G* level. All computational work was conducted using Gaussian 16.
- **Data Accessibility**:
  - [Lan's group webpage](http://langroup.site/QCDGE/)

## 24. QM1B
- **Description**: A large dataset of one billion training examples for machine learning applications in quantum chemistry, focusing on molecules with 9-11 heavy atoms, calculated at the B3LYP/STO-3G level with PySCF$_{\rm IPU}$.
- **Data Accessibility**:
  - [QM1B on GitHub](https://github.com/graphcore-research/qm1b-dataset)

## 25. QM7
- **Description**: Focuses on a subset of 7,165 small organic molecules from the GDB-13 database, providing Coulomb matrices, atomization energies, atomic charges, and Cartesian coordinates. DFT calculations use PBE0 with the tier2 basis set as implemented in the FHI-aims code.
- **Data Accessibility**:
  - [QM7 on Quantum-Machine.org](http://quantum-machine.org/datasets/)

## 26. QM7-X
- **Description**: A data set of 4.2 million structures of small organic molecules containing up to seven non-hydrogen atoms, with 42 properties per molecule, calculated at the PBE0+MBD level of theory with the FHI-aims code.
- **Data Accessibility**:
  - [QM7-X on Zenodo](https://doi.org/10.5281/zenodo.4288677)

## 27. QM7b
- **Description**: An extension of QM7, providing data on 7,211 small organic molecules including 14 properties such as atomization energy, static polarizability, and frontier orbital eigenvalues, calculated using DFT and other quantum chemistry methods. The software used includes OpenBabel, ORCA, and the FHI-aims code (tight settings/tier2 basis set).
- **Data Accessibility**:
  - [Supplementary material of the reference paper](http://stacks.iop.org/NJP/15/095003/mmedia)

## 28. QM8
- **Description**: Provides electronic spectra data for 21,786 small organic molecules derived from QM9, calculated using TDDFT and other excited-state methods with the TURBOMOLE program.
- **Data Accessibility**:
  - [Supplementary material of the reference paper](https://pubs.aip.org/jcp/article-supplement/73278/zip/084111_1_supplements/)

## 29. QM9
- **Description**: A collection of molecular structures and properties for approximately 134,000 small organic molecules, generated using density functional theory (DFT) at the B3LYP/6-31G(2df,p) level with the Gaussian 09 software.
- **Data Accessibility**:
  - [QM9 on Figshare](https://doi.org/10.6084/m9.figshare.978904)

## 30. QM9S
- **Description**: A collection of 33,885 organic molecules from QM9 for training and testing DetaNet, with geometries re-optimized in Gaussian 16 (B3LYP/def-TZVP). Molecular properties (scalar, vector, tensor) were calculated at the same level, including IR, Raman, and UV-Vis spectra from frequency analysis and TD-DFT.
- **Data Accessibility**:
  - [QM9S on Figshare](https://doi.org/10.6084/m9.figshare.24235333)

## 31. QM9-G4MP2
- **Description**: Provides highly accurate G4MP2 calculations for the molecular structures within QM9, focusing on small organic molecules, using the Gaussian 16 software.
- **Data Accessibility**:
  - [QM9-G4MP2 on Figshare](https://doi.org/10.6084/m9.figshare.c.4351631.v1)

## 32. QM-sym Database
- **Description**: A database documenting the $C_{nh}$ symmetry of each of its 135,000 structures, calculated using DFT at the B3LYP/6-31G(2df,p) level with the Gaussian 09 software.
- **Data Accessibility**:
  - [GitHub Repository](https://github.com/XI-Lab/QM-sym-database)
  - [Figshare](https://doi.org/10.6084/m9.Figshare.9638093)

## 33. QM-symex Database
- **Description**: An expansion of the QM-sym dataset with an additional 38,000 molecules and information on excited-state properties. Calculations were performed using DFT at the B3LYP/6-31G(2df,p) level with the Gaussian 09 software.
- **Data Accessibility**:
  - [QM-symex on Figshare](https://doi.org/10.6084/m9.Figshare.12815276)

## 34. **QM-22**
- **Description**: A compilation of molecular datasets curated for Diffusion Monte Carlo (DMC) calculations of the zero-point state. Each dataset within QM-22 uses methodology tailored to the specific molecules involved, with the computational details available in the corresponding publications.
- **Data Accessibility**:
  - [GitHub](https://github.com/jmbowma/QM-22)

## 35. QMugs
- **Description**: Quantum-mechanical properties of over 665,000 drug-like molecules, calculated at the GFN2-xTB and $\omega$B97X-D/def2-SVP levels of theory using the xTB and Psi4 software packages.
- **Data Accessibility**:
  - [QMugs](https://doi.org/10.3929/ethz-b-000482129)

## 36. revQM9
- **Description**: A revised version of QM9 with properties recalculated at the aPBE0 level (PBE0 with ML-determined exact exchange). PySCF 2.4.0 was used for simulations, with aPBE0 parameters from a trained ML model targeting CCSD(T)/cc-pVTZ atomization energies. The dataset includes total and atomization energies, orbital energies, dipole moments, and density matrices.
- **Data Accessibility**:
  - [Zenodo](https://zenodo.org/doi/10.5281/zenodo.10689883)

## 37. SPICE
- **Description**: A dataset of 2,008,628 conformations of 113,999 drug-like small molecules and proteins, computed at the $\omega$B97M-D3BJ/def2-TZVPPD level of theory in Psi4.
- **Data Accessibility**:
  - [Zenodo](https://doi.org/10.5281/zenodo.7338495)
  - [GitHub](https://github.com/openmm/spice-dataset)

## 38. **TensorMol ChemSpider**
- **Description**: Energies for 3 million conformations from 15,000 different molecules, calculated using the Q-Chem software.
- **Data Accessibility**:
  - According to the supplementary information, the TensorMol ChemSpider data set was available for download at [Google Drive](https://drive.google.com/drive/folders/1IfWPs7i5kfmErIRyuhGv95dSVtNFo0e_). However, the dataset is no longer accessible.

## 39. tmQM
- **Description**: Electronic energy, dispersion energy, dipole moment, natural charge at the metal center, HOMO-LUMO gap, HOMO energy, and LUMO energy for 108k transition-metal complexes computed at the TPSSh/def2-SVP // GFN2-xTB level of theory. xTB calculations were performed using the xtb program, and quantum properties were derived from single-point DFT calculations on the GFN2-xTB-optimized geometries using Gaussian 16.
- **Data Accessibility**:
  - [UiO Computational Catalysis Site](https://www.uiocompcat.info/tmqmdataset?s=03)
  - [GitHub](https://github.com/bbskjelstad/tmqm?tab=readme-ov-file)

## 40. Transition1x Dataset
- **Description**: A collection of 9.6 million DFT data points (energies and forces) covering 10,000 organic reactions. The calculations used the $\omega$B97X/6-31G(d) level of theory, with reaction pathways explored using the NEB and CINEB techniques. The computations were performed using software such as ASE and ORCA.
- **Data Accessibility**:
  - [Transition1x Dataset on Figshare](https://doi.org/10.6084/m9.figshare.19614657.v4)
  - [Dataloaders and example scripts on GitLab](https://gitlab.com/matschreiner/T1x)

## 41. VIB5 Database
- **Description**: A collection of high-quality ab initio quantum chemical data for five small polyatomic molecules of astrophysical relevance, calculated using coupled-cluster and HF theory with the MOLPRO and CFOUR software packages.
- **Data Accessibility**:
  - [VIB5 Database on Figshare](https://doi.org/10.6084/m9.figshare.16903288)

## 42. **VQM24**
- **Description**: Provides quantum mechanical properties for 258,242 unique constitutional isomers and 577,705 conformers of varying stoichiometries, focusing on molecules composed of up to five heavy atoms from C, N, O, F, Si, P, S, Cl, and Br. The dataset uses methods including MMFF94, GFN2-xTB, $\omega$B97X-D3/cc-pVDZ, and DMC@PBE0/ccECP/cc-pVQZ, with calculations performed using Surge, RDKit, CREST, Psi4, and QMCPACK.
- **Data Accessibility**:
  - [Zenodo](https://doi.org/10.5281/zenodo.11164951)

## 43. WS22 Database
- **Description**: A database of 1.18 million geometries of ten organic molecules with up to 22 atoms, covering a range of quantum mechanical properties. Geometries and properties were obtained using Wigner sampling, a geometry interpolation scheme, B3LYP/6-31G*, and classical AIMD. The computations were performed using software including Newton-X and Gaussian 09.
- **Data Accessibility**:
  - [WS22 Database on Zenodo](https://doi.org/10.5281/zenodo.7032334)

## 44. xxMD17
- **Description**: Excited-state molecular dynamics data set for four molecular systems chosen for their photochemical activity: azobenzene, malonaldehyde, stilbene, and dithiophene. Calculated using surface hopping dynamics, SA-CASSCF electronic structure theory, and unrestricted KS-DFT (M069 meta-GGA and 6-31G), with software including SHARC, OpenMolcas 22.06, and Psi4.
- **Data Accessibility**:
  - [GitHub Repository](https://github.com/zpengmei/xxMD)
  - [Zenodo Repository](https://doi.org/10.5281/zenodo.10393859)

## New Updates
1. QM9star, two Million DFT-computed Equilibrium Structures for Ions and Radicals with Atomic Information, https://doi.org/10.1038/s41597-024-03933-6
2. An Open Quantum Chemistry Property Database of 120 Kilo Molecules with 20 Million Conformers, https://doi.org/10.48550/arXiv.2410.19316
3. QM40, Realistic Quantum Mechanical Dataset for Machine Learning in Molecular Science, https://doi.org/10.1038/s41597-024-04206-y
4. ANI-1xBB, an ANI based reactive potential, https://doi.org/10.26434/chemrxiv-2025-m2nqq
5. The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models, https://arxiv.org/abs/2505.08762

## Contributions

We welcome contributions! If you have new data sets or databases to add, or updates to existing ones, please submit a pull request.
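The same catalogue is available in machine-readable form in `dataset_overview.json`, so entries can be filtered programmatically. The following is a minimal sketch, assuming the field names used in that file (`dataset_overview`, `name`, `chemical_elements`, `data_size.number_of_structures`). The helper names `load_overview` and `select_datasets` are hypothetical and are not taken from `dataset_selection.ipynb`.

```python
import json
from pathlib import Path


def load_overview(path="dataset_overview.json"):
    """Return the list of entries stored under the top-level "dataset_overview" key."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["dataset_overview"]


def select_datasets(entries, required_elements=(), min_structures=0):
    """Return names of entries that cover all required elements and are large enough."""
    required = set(required_elements)
    selected = []
    for entry in entries:
        elements = set(entry.get("chemical_elements", []))
        size = entry.get("data_size", {}).get("number_of_structures", 0)
        # Some fields hold the placeholder string "checking required";
        # entries without a numeric size are skipped rather than guessed.
        if not isinstance(size, (int, float)):
            continue
        if required <= elements and size >= min_structures:
            selected.append(entry["name"])
    return selected


if __name__ == "__main__":
    overview_path = Path("dataset_overview.json")
    if overview_path.exists():
        # Example query: data sets containing S and Cl with at least one million structures.
        print(select_datasets(load_overview(overview_path), ["S", "Cl"], 1e6))
```

Entries whose size is not numeric are dropped on purpose; depending on the use case, it may be preferable to report them separately instead.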
--------------------------------------------------------------------------------
/dataset_overview.json:
--------------------------------------------------------------------------------
{
  "dataset_overview": [
    {
      "name": "AIMEl-DB",
      "full_name": "AIMEl-DB",
      "description": "Atomic Properties for 44K small organic molecules",
      "methods": {
        "geometry_optimization": {
          "B3LYP/6-31G(2df,p)": "Gaussian 09"
        },
        "energy": {
          "B3LYP/6-31G(2df,p)": "Gaussian 16"
        }
      },
      "data_size": {
        "number_of_structures": 4.4e4
      },
      "data_access": {
        "Zenodo": "https://zenodo.org/records/11406726"
      },
      "chemical_elements": ["H","C","N","O"],
      "number_of_heavy_atoms": ["checking required","checking required",9],
      "initial_source": [
        "QM9"
      ],
      "non-equilibrium structures": "False",
      "charges": [0],
      "multiplicities": [1],
      "excited_states": "False",
      "solvent": ["gas_phase"],
      "properties": [
        "total_energy",
        "atomic_dipole_moment",
        "atomic_quadrupole_moment",
        "atomic_energy",
        "atomic_population"
      ],
      "doi": "10.1038/s41597-024-03723-0",
      "reference": [
        "@article{meza2024quantum,\n title={Quantum topological atomic Properties of 44K molecules},\n author={Meza-Gonz{\\'a}lez, Brandon and Ram{\\'\\i}rez-Palma, David I and Carpio-Mart{\\'\\i}nez, Pablo and V{\\'a}zquez-Cuevas, David and Mart{\\'\\i}nez-Mayorga, Karina and Cort{\\'e}s-Guzm{\\'a}n, Fernando},\n journal={Scientific Data},\n volume={11},\n number={1},\n pages={945},\n year={2024},\n publisher={Nature Publishing Group UK London}\n}\n",
        "@dataset{meza_gonzalez_2024_11406726,\n author = {Meza-González, Brandon and Ramírez-Palma, David I. and Carpio-Martínez, Pablo and Vázquez-Cuevas, David and Martinez-Mayorga, Karina and Cortés-Guzmán, Fernando},\n title = {{AIMEl-DB: Atomic Properties for 44K small organic molecules}},\n year = 2024,\n publisher = {Universidad Nacional Autónoma de México},\n version = {2.0},\n doi = {10.5281/zenodo.11406726},\n url = {https://doi.org/10.5281/zenodo.11406726}\n}\n"
      ]
    },
    {
      "name": "Alchemy",
      "full_name": "Alchemy",
      "description": "119,487 organic molecules containing up to 14 heavy atoms from the GDB-17 database with 12 quantum mechanical properties",
      "methods": {
        "geometry_optimization": {
          "B3LYP/6-31G(2df,p)": "PySCF"
        },
        "energy": {
          "B3LYP/6-31G(2df,p)": "PySCF"
        }
      },
      "data_size": {
        "number_of_structures": 119487
      },
      "data_access": {
        "GitHub": "https://github.com/tencent-alchemy/Alchemy"
      },
      "chemical_elements": ["H","C","N","O","F","S","Cl"],
      "number_of_heavy_atoms": ["checking required","checking required",14],
      "initial_source": [
        "GDB MedChem"
      ],
      "non-equilibrium structures": "False",
      "charges": ["checking required"],
      "multiplicities": ["checking required"],
      "excited_states": "False",
      "solvent": ["gas_phase"],
      "temperature": 298.15,
      "properties": [
        "total_energy",
        "dipole_moment",
        "HOMO/LUMO_energy",
        "HOMO/LUMO_gap",
        "zero_point_energy",
        "internal_energy",
        "enthalpy",
        "free_energy",
        "heat_capacity"
      ],
      "doi": "10.48550/arXiv.1906.09427",
      "reference": [
        "@article{chen2019alchemy,\n author={Guangyong Chen and Pengfei Chen and Chang-Yu Hsieh and Chee-Kong Lee and Benben Liao and Renjie Liao and Weiwen Liu and Jiezhong Qiu and Qiming Sun and Jie Tang and Richard Zemel and Shengyu Zhang},\n title={Alchemy: A Quantum Chemistry Dataset for Benchmarking {AI} Models}, \n journal={arXiv preprint arXiv:1906.09427 [cs.LG]},\n year={2019} \n}\n"
      ]
    },
    {
      "name": "ANI-1",
      "full_name": "ANI-1",
      "description": "A collection of non-equilibrium DFT total energy calculations for organic molecules, encompassing approximately 20 million conformations of 57,462 small organic molecules",
      "methods": {
        "geometry_optimization": {
          "ωB97X/6-31G(d)": "Gaussian 09"
        },
        "energy": {
          "ωB97X/6-31G(d)": "Gaussian 09"
        }
      },
      "data_size": {
        "number_of_structures": 2e7
      },
      "data_access": {
        "Figshare": "https://doi.org/10.6084/m9.figshare.5287732.v1",
        "GitHub": "https://github.com/isayev/ANI1_dataset"
      },
      "chemical_elements": ["H","C","N","O"],
      "number_of_heavy_atoms": [1,"checking required",8],
      "initial_source": [
        "GDB-11"
      ],
      "non-equilibrium structures": "True",
      "charges": [0],
      "multiplicities": [1],
      "excited_states": "False",
      "solvent": ["gas_phase"],
      "temperature": "False",
      "properties": [
        "total_energy",
        "force"
      ],
      "doi": "10.1038/sdata.2017.193",
      "reference": [
        "@article{smith2017ani,\n title={ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic molecules},\n author={Smith, Justin S and Isayev, Olexandr and Roitberg, Adrian E},\n journal={Scientific data},\n volume={4},\n number={1},\n pages={1--8},\n year={2017},\n publisher={Nature Publishing Group}\n}\n"
      ]
    },
    {
      "name": "ANI-1x",
      "full_name": "ANI-1x",
      "description": "The ANI-1x data set includes DFT calculations for approximately 5 million organic molecule conformations",
      "methods": {
        "geometry_optimization": {
          "False": "False"
        },
        "energy": {
          "ωB97X/def2-TZVPP": "ORCA 4.2.0"
        }
      },
      "data_size": {
        "number_of_structures": 5e6
      },
      "data_access": {
        "Figshare": "https://doi.org/10.6084/m9.figshare.c.4712477",
        "GitHub": "https://github.com/aiqm/ANI1x_datasets"
      },
      "chemical_elements": ["H","C","N","O"],
      "number_of_heavy_atoms": ["checking required","checking required","checking required"],
      "initial_source": [
        "GDB-11",
        "ChEMBL"
      ],
      "non-equilibrium structures": "True",
      "charges": [0],
      "multiplicities": [1],
      "excited_states": "False",
      "solvent": ["gas_phase"],
      "temperature": "False",
      "properties": [
        "total_energy",
        "force"
      ],
      "doi": "10.1038/s41597-020-0473-z",
      "reference": [
        "@article{smith2020ani,\n title={The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules},\n author={Smith, Justin S and Zubatyuk, Roman and Nebgen, Benjamin and Lubbers, Nicholas and Barros, Kipton and Roitberg, Adrian E and Isayev, Olexandr and Tretiak, Sergei},\n journal={Scientific data},\n volume={7},\n number={1},\n pages={134},\n year={2020},\n publisher={Nature Publishing Group UK London}\n}\n"
      ]
    },
    {
      "name": "ANI-1ccx",
      "full_name": "ANI-1ccx",
      "description": "The ANI-1ccx data set is a subset of the ANI-1x data set, recomputed at the CCSD(T)/CBS level of theory",
      "methods": {
        "geometry_optimization": {
          "False": "False"
        },
        "energy": {
          "CCSD(T)*/CBS": "ORCA 4.2.0",
          "HF/cc-pVDZ": "ORCA 4.2.0",
          "HF/cc-pVTZ": "ORCA 4.2.0",
          "HF/cc-pVQZ": "ORCA 4.2.0",
          "DLPNO-CCSD(T)/cc-pVDZ": "ORCA 4.2.0",
          "DLPNO-CCSD(T)/cc-pVTZ": "ORCA 4.2.0",
          "DLPNO-CCSD(T)/cc-pVQZ": "ORCA 4.2.0"
        }
      },
      "data_size": {
        "number_of_structures": 5e5
      },
      "data_access": {
        "Figshare": "https://doi.org/10.6084/m9.figshare.c.4712477",
        "GitHub": "https://github.com/aiqm/ANI1x_datasets"
      },
      "chemical_elements": ["H","C","N","O"],
      "number_of_heavy_atoms": ["checking required","checking required","checking required"],
      "initial_source": [
        "GDB-11",
        "ChEMBL"
      ],
      "non-equilibrium structures": "True",
      "charges": [0],
      "multiplicities": [1],
      "excited_states": "False",
      "solvent": ["gas_phase"],
      "temperature": "False",
      "properties": [
        "total_energy"
      ],
      "doi": "10.1038/s41597-020-0473-z",
      "reference": [
        "@article{smith2020ani,\n title={The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules},\n author={Smith, Justin S and Zubatyuk, Roman and Nebgen, Benjamin and Lubbers, Nicholas and Barros, Kipton and Roitberg, Adrian E and Isayev, Olexandr and Tretiak, Sergei},\n journal={Scientific data},\n volume={7},\n number={1},\n pages={134},\n year={2020},\n publisher={Nature Publishing Group UK London}\n}\n"
      ]
    },
    {
      "name": "ANI-2x",
      "full_name": "ANI-2x",
      "description": "Extension of the previously developed ANI-1x data set to include S, F, and Cl, with non-equilibrium geometries at the DFT level",
      "methods": {
        "energy": {
          "ωB97X/6-31G*": "Gaussian 09",
          "ωB97X/def2-TZVPP": "ORCA 4.2.1",
          "B97-3c/def2-mTZVP": "ORCA 4.2.1",
          "ωB97M-V/def2-TZVPP": "ORCA 4.2.1",
          "ωB97M-D3(BJ)/def2-TZVPP": "ORCA 4.2.1"
        }
      },
      "data_size": {
        "number_of_structures": 8.9e6
      },
      "data_access": {
        "Zenodo": "https://doi.org/10.5281/zenodo.10108942"
      },
      "chemical_elements": ["H","C","N","O","F","S","Cl"],
      "number_of_heavy_atoms": ["checking required"],
      "initial_source": [
        "GDB-11",
        "ChEMBL",
        "s66x8"
      ],
      "non-equilibrium structures": "True",
      "charges": [0],
      "multiplicities": [1],
      "excited_states": "False",
      "solvent": ["gas_phase"],
      "properties": [
        "total_energy",
        "force",
        "dipole_moment"
      ],
      "doi": "10.1021/acs.jctc.0c00121",
      "reference": [
"@article{devereux2020extending,\n title={Extending the applicability of the ANI deep learning molecular potential to sulfur and halogens},\n author={Devereux, Christian and Smith, Justin S and Huddleston, Kate K and Barros, Kipton and Zubatyuk, Roman and Isayev, Olexandr and Roitberg, Adrian E},\n journal={Journal of Chemical Theory and Computation},\n volume={16},\n number={7},\n pages={4192--4202},\n year={2020},\n publisher={ACS Publications}\n}\n", 252 | "@misc{ani2zenodo,\n title={ANI-2 data set},\n author={Devereux, Christian and Smith, Justin S and Huddleston, Kate K and Barros, Kipton and Zubatyuk, Roman and Isayev, Olexandr and Roitberg, Adrian E},\n journal={Journal of Chemical Theory and Computation},\n year={2020},\nhowpublished={Zenodo \\url{https://doi.org/10.5281/zenodo.10108942} (accessed on August 22, 2024)}\n}\n" 253 | ] 254 | }, 255 | { 256 | "name": "bigQM7ω", 257 | "full_name": "", 258 | "description": " Ground-state properties and electronic spectra for over 12,880 molecules with 7 CONF atoms ", 259 | "methods":{ 260 | "geometry_optimization":{ 261 | "ωB97XD/3-21G":"Gaussian 16", 262 | "ωB97XD/def2-SVP":"Gaussian 16", 263 | "ωB97XD/def2-TZVP":"Gaussian 16" 264 | }, 265 | "energy":{ 266 | "ωB97XD/3-21G":"Gaussian 16", 267 | "ωB97XD/def2-SVP":"Gaussian 16", 268 | "ωB97XD/def2-TZVP":"Gaussian 16", 269 | "TDωB97XD/3-21G":"Gaussian 16", 270 | "TDωB97XD/def2-SVP":"Gaussian 16", 271 | "TDωB97XD/def2-TZVP":"Gaussian 16", 272 | "TDωB97XD/def2-SVPD":"Gaussian 16" 273 | 274 | } 275 | }, 276 | "data_size": { 277 | "number_of_structures": 12880 278 | }, 279 | "data_access": { 280 | "Link1": "https://moldisgroup.github.io/bigQM7w", 281 | "Link2": "https://dx.doi.org/10.17172/NOMAD/2021.09.30-1", 282 | "Link3": "https://moldis.tifrh.res.in/index.html" 283 | }, 284 | "chemical_elements": ["H","C","N","O","F"], 285 | "number_of_heavy_atoms": ["checking required","checking required",7], 286 | "initial_source": [ 287 | "GDB-11" 288 | ], 289 | "non-equilibrium 
structures": "False", 290 | "charges": ["checking required"], 291 | "multiplicities": ["checking required"], 292 | "excited_states": "True", 293 | "solvent": ["gas_phase"], 294 | "temperature": 298.15, 295 | "properties": [ 296 | "total_energy", 297 | "excitation_energy", 298 | "dipole_moment", 299 | "polarizability", 300 | "internal_energy", 301 | "enthalpy", 302 | "free_energy", 303 | "heat_capacity", 304 | "harmonic_frequency" 305 | ], 306 | "doi": "10.1039/D1DD00031D", 307 | "reference": [ 308 | "@article{schutt2017quantum,\n title={Quantum-chemical insights from deep tensor neural networks},\n author={Sch{\\\"u}tt, Kristof T and Arbabzadah, Farhad and Chmiela, Stefan and M{\\\"u}ller, Klaus R and Tkatchenko, Alexandre},\n journal={Nature communications},\n volume={8},\n number={1},\n pages={13890},\n year={2017},\n publisher={Nature Publishing Group UK London}\n}\n" 309 | ] 310 | }, 311 | { 312 | "name": "C7O2H10-17", 313 | "full_name": "", 314 | "description": " MD trajectories of 5,000 steps for 113 isomers of C7O2H10", 315 | "methods":{ 316 | "geometry_optimization":{ 317 | "False":"False" 318 | }, 319 | "energy":{ 320 | "PBE":"checking required" 321 | } 322 | }, 323 | "data_size": { 324 | "number_of_structures": 565000 325 | }, 326 | "data_access": { 327 | "Link": "http://quantum-machine.org/datasets/" 328 | }, 329 | "chemical_elements": ["H","C","O"], 330 | "number_of_heavy_atoms": [9,9,9], 331 | "initial_source": [ 332 | "GDB-9" 333 | ], 334 | "non-equilibrium structures": "False", 335 | "charges": [0], 336 | "multiplicities": [1], 337 | "excited_states": "False", 338 | "solvent": ["gas_phase"], 339 | "temperature": "False", 340 | "properties": [ 341 | "total_energy" 342 | ], 343 | "doi": "10.1038/ncomms13890", 344 | "reference": [ 345 | "@article{schutt2017quantum,\n title={Quantum-chemical insights from deep tensor neural networks},\n author={Sch{\\\"u}tt, Kristof T and Arbabzadah, Farhad and Chmiela, Stefan and M{\\\"u}ller, Klaus R and Tkatchenko, 
Alexandre},\n journal={Nature communications},\n volume={8},\n number={1},\n pages={13890},\n year={2017},\n publisher={Nature Publishing Group UK London}\n}\n" 346 | ] 347 | }, 348 | { 349 | "name": "CheMFi", 350 | "full_name": "quantum Chemistry MultiFidelity", 351 | "description": " A multifidelity compilation of quantum chemical properties derived from a subset of the WS22 database, featuring 135,000 geometries sampled from nine distinct molecules. It encompasses five different levels of fidelity, each corresponding to a specific basis set size (STO-3G, 3-21G, 6-31G, def2-SVP, def2-TZVP) ", 352 | "methods":{ 353 | "geometry_optimization":{ 354 | "PBE0/6-311G(d)":"Gaussian 09" 355 | }, 356 | "energy":{ 357 | "(TD-)CAM-B3LYP/STO-3G":"ORCA 5.0.1", 358 | "(TD-)CAM-B3LYP/3-21G":"ORCA 5.0.1", 359 | "(TD-)CAM-B3LYP/6-31G":"ORCA 5.0.1", 360 | "(TD-)CAM-B3LYP/def2-SVP":"ORCA 5.0.1", 361 | "(TD-)CAM-B3LYP/def2-TZVP":"ORCA 5.0.1" 362 | } 363 | }, 364 | "data_size": { 365 | "number_of_structures": 135000 366 | }, 367 | "data_access": { 368 | "GitHub": "https://github.com/SM4DA/CheMFi" 369 | }, 370 | "chemical_elements": ["H","C","N","O"], 371 | "number_of_heavy_atoms": [4,"checking required",14], 372 | "initial_source": [ 373 | "WS22" 374 | ], 375 | "non-equilibrium structures": "True", 376 | "charges": [0], 377 | "multiplicities": ["checking required"], 378 | "excited_states": "True", 379 | "solvent": ["gas_phase"], 380 | "temperature": 298.15, 381 | "properties": [ 382 | "total_energy", 383 | "dipole_moment", 384 | "oscillator_strength", 385 | "time" 386 | ], 387 | "doi": "10.48550/arXiv.2406.14149", 388 | "reference": [ 389 | "@article{vinod2024chemfi,\n title={CheMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules},\n author={Vinod, Vivin and Zaspel, Peter},\n journal={arXiv preprint arXiv:2406.14149},\n year={2024}\n}\n\n" 390 | ] 391 | }, 392 | { 393 | "name": "ISO17", 394 | "full_name": "", 395 | "description": " An extension of
the C7O2H10-17 data set, consisting of 129 isomers ", 396 | "methods":{ 397 | "geometry_optimization":{ 398 | "False":"False" 399 | }, 400 | "energy":{ 401 | "PBE+vdW-TS":"FHI-aims" 402 | } 403 | }, 404 | "data_size": { 405 | "number_of_structures": 645000 406 | }, 407 | "data_access": { 408 | "Link": "http://quantum-machine.org/datasets/" 409 | }, 410 | "chemical_elements": ["H","C","O"], 411 | "number_of_heavy_atoms": [9,9,9], 412 | "initial_source": [ 413 | "QM9" 414 | ], 415 | "non-equilibrium structures": "True", 416 | "charges": [0], 417 | "multiplicities": ["checking required"], 418 | "excited_states": "False", 419 | "solvent": ["gas_phase"], 420 | "temperature": "False", 421 | "properties": [ 422 | "total_energy", 423 | "force" 424 | ], 425 | "doi": "10.48550/arXiv.1706.08566", 426 | "reference": [ 427 | "@article{schutt2017schnet,\nauthor = {Sch\\\"{u}tt, K. T. and Kindermans, P.-J. and Sauceda, H. E. and Chmiela, S. and Tkatchenko, A. and M\\\"{u}ller, K.-R.},\ntitle = {SchNet: a continuous-filter convolutional neural network for modeling quantum interactions},\nyear = {2017},\nisbn = {9781510860964},\npublisher = {Curran Associates Inc.},\naddress = {Red Hook, NY, USA},\nbooktitle = {Proceedings of the 31st International Conference on Neural Information Processing Systems},\npages = {992\u20131002},\nnumpages = {11},\nlocation = {Long Beach, California, USA},\nseries = {NIPS'17}\n}\n" 428 | ] 429 | }, 430 | { 431 | "name": "MD17 and its later versions", 432 | "full_name": "", 433 | "description": " Ab initio molecular dynamics trajectories with total energies and forces for ten small organic molecules", 434 | "methods":{ 435 | "geometry_optimization":{ 436 | "False":"False" 437 | }, 438 | "energy":{ 439 | "PBE+vdW-TS":"FHI-aims" 440 | } 441 | }, 442 | "data_size": { 443 | "number_of_structures": 3.8e6 444 | }, 445 | "data_access": { 446 | "Link": "http://www.sgdml.org/#data_sets" 447 | }, 448 | "chemical_elements":
["H","C","N","O"], 449 | "number_of_heavy_atoms": [3,"checking required",13], 450 | "initial_source": [], 451 | "non-equilibrium structures": "True", 452 | "charges": ["checking required"], 453 | "multiplicities": ["checking required"], 454 | "excited_states": "False", 455 | "solvent": ["gas_phase"], 456 | "temperature": "checking required", 457 | "properties": [ 458 | "total_energy", 459 | "force" 460 | ], 461 | "doi": "10.1126/sciadv.1603015", 462 | "reference": [ 463 | "@article{chmiela2017machine,\n title={Machine learning of accurate energy-conserving molecular force fields},\n author={Chmiela, Stefan and Tkatchenko, Alexandre and Sauceda, Huziel E and Poltavsky, Igor and Sch{\\\"u}tt, Kristof T and M{\\\"u}ller, Klaus-Robert},\n journal={Science advances},\n volume={3},\n number={5},\n pages={e1603015},\n year={2017},\n publisher={American Association for the Advancement of Science}\n}\n", 464 | "@article{christensen2020role,\n title={On the role of gradients for machine learning of molecular energies and forces},\n author={Christensen, Anders S and Von Lilienfeld, O Anatole},\n journal={Machine Learning: Science and Technology},\n volume={1},\n number={4},\n pages={045018},\n year={2020},\n publisher={IOP Publishing}\n}\n", 465 | "@article{chmiela2018towards,\n title={Towards exact molecular dynamics simulations with machine-learned force fields},\n author={Chmiela, Stefan and Sauceda, Huziel E and M{\\\"u}ller, Klaus-Robert and Tkatchenko, Alexandre},\n journal={Nature communications},\n volume={9},\n number={1},\n pages={3887},\n year={2018},\n publisher={Nature Publishing Group UK London}\n}\n" 466 | ] 467 | }, 468 | { 469 | "name": "MD22", 470 | "full_name": "MD22", 471 | "description": " MD trajectories for seven systems spanning four major classes of biomolecules and supramolecular structures ", 472 | "methods":{ 473 | "geometry_optimization":{ 474 | "False":"False" 475 | }, 476 | "energy":{ 477 | "PBE+MBD":"FHI-aims" 478 | } 479 | }, 480 | 
"data_size": { 481 | "number_of_structures": "checking required" 482 | }, 483 | "data_access": { 484 | "Link": "http://www.sgdml.org/#data_sets" 485 | }, 486 | "chemical_elements": ["checking required"], 487 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 488 | "number_of_atoms": [42,"checking required",370], 489 | "initial_source": [], 490 | "non-equilibrium structures": "True", 491 | "charges": ["checking required"], 492 | "multiplicities": ["checking required"], 493 | "excited_states": "False", 494 | "solvent": ["gas_phase"], 495 | "temperature": 400, 496 | "properties": [ 497 | "total_energy", 498 | "force" 499 | ], 500 | "doi": "10.1126/sciadv.adf0873", 501 | "reference": [ 502 | "@article{chmiela2023accurate,\n title={Accurate global machine learning force fields for molecules with hundreds of atoms},\n author={Chmiela, Stefan and Vassilev-Galindo, Valentin and Unke, Oliver T and Kabylda, Adil and Sauceda, Huziel E and Tkatchenko, Alexandre and M{\\\"u}ller, Klaus-Robert},\n journal={Science Advances},\n volume={9},\n number={2},\n pages={eadf0873},\n year={2023},\n publisher={American Association for the Advancement of Science}\n}\n" 503 | ] 504 | }, 505 | { 506 | "name": "MultiXC-QM9", 507 | "full_name": "MultiXC-QM9", 508 | "description": " Expands upon QM9 by including data from 76 different DFT functionals alongside three basis sets and a semi-empirical method (GFN2-xTB) ", 509 | "methods":{ 510 | "geometry_optimization":{ 511 | "B3LYP/6-31 G(2df,p)": "Gaussian 16" 512 | }, 513 | "energy":{ 514 | "GFN2-xTB": "xTB", 515 | "76_functionals/SZ":"ADF(SCM)", 516 | "76_functionals/DZP":"ADF(SCM)", 517 | "76_functionals/TZP":"ADF(SCM)" 518 | } 519 | }, 520 | "data_size": { 521 | "number_of_structures": 133858 522 | }, 523 | "data_access": { 524 | "Figshare": "https://doi.org/10.11583/DTU.c.6185986.v3", 525 | "GitHub": "https://github.com/chemsurajit/largeDFTdata" 526 | }, 527 | "chemical_elements": ["H","C","N","O","F"], 
528 | "number_of_heavy_atoms": ["checking required","checking required",9], 529 | "initial_source": [ 530 | "QM9", 531 | "QM9-G4MP2" 532 | ], 533 | "non-equilibrium structures": "False", 534 | "charges": [0], 535 | "multiplicities": ["checking required"], 536 | "excited_states": "False", 537 | "solvent": ["gas_phase"], 538 | "temperature": 298.15, 539 | "properties": [ 540 | "total_energy" 541 | ], 542 | "doi": "10.1038/s41597-023-02690-2", 543 | "reference": [ 544 | "@article{nandi2023multixc,\n title={MultiXC-QM9: Large dataset of molecular and reaction energies from multi-level quantum chemical methods},\n author={Nandi, Surajit and Vegge, Tejs and Bhowmik, Arghya},\n journal={Scientific Data},\n volume={10},\n number={1},\n pages={783},\n year={2023},\n publisher={Nature Publishing Group UK London}\n}\n" 545 | ] 546 | }, 547 | { 548 | "name": "$\\nabla^2$DFT", 549 | "full_name": "", 550 | "description": " A comprehensive collection of approximately 16 million conformers for around 2 million drug-like molecules ", 551 | "methods":{ 552 | "geometry_optimization":{ 553 | "checking required":"checking required" 554 | }, 555 | "energy":{ 556 | "ωB97X-D/def2-SVP":"Psi4" 557 | } 558 | }, 559 | "data_size": { 560 | "number_of_structures": 1.5e7 561 | }, 562 | "data_access": { 563 | "GitHub": "https://github.com/AIRI-Institute/nablaDFT" 564 | }, 565 | "chemical_elements": ["H","C","N","O","S","Cl","F","Br"], 566 | "number_of_heavy_atoms": [8,"checking required",27], 567 | "initial_source": [ 568 | "MOSES", 569 | "ZINC21 (Zinc Clean Leads)" 570 | ], 571 | "non-equilibrium structures": "True", 572 | "charges": ["checking required"], 573 | "multiplicities": ["checking required"], 574 | "excited_states": "False", 575 | "solvent": ["gas_phase"], 576 | "temperature": "False", 577 | "properties": [ 578 | "total_energy", 579 | "force", 580 | "fock_matrix", 581 | "overlap_matrix", 582 | "coefficients_matrix" 583 | ], 584 | "doi": "10.48550/arXiv.2406.14347", 585 | "reference": [ 
586 | "@article{khrabrov2024nabla,\n title={$\\nabla ^2$ DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials},\n author={Khrabrov, Kuzma and Ber, Anton and Tsypin, Artem and Ushenin, Konstantin and Rumiantsev, Egor and Telepov, Alexander and Protasov, Dmitry and Shenbin, Ilya and Alekseev, Anton and Shirokikh, Mikhail and others},\n journal={arXiv preprint arXiv:2406.14347},\n year={2024}\n}\n" 587 | ] 588 | }, 589 | { 590 | "name": "OrbNet Denali", 591 | "full_name": "", 592 | "description": " 2.3 million molecules based on the ChEMBL27 database ", 593 | "methods":{ 594 | "geometry_optimization":{ 595 | "GFN1-xTB":"xTB" 596 | }, 597 | "energy":{ 598 | "ωB97X-D3/def2-TZVP":"ENTOS QCORE version 0.8.17" 599 | } 600 | }, 601 | "data_size": { 602 | "number_of_structures": 1.8e7 603 | }, 604 | "data_access": { 605 | "Figshare": "https://doi.org/10.6084/m9.figshare.14883867" 606 | }, 607 | "chemical_elements": ["H", "C", "O", "N", "F", "S", "Cl", "Br", "I", "P", "Si", "B", "Na", "K", "Li", "Ca", "Mg"], 608 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 609 | "number_of_atoms": ["checking required","checking required",50], 610 | "initial_source": [ 611 | "ChEMBL" 612 | ], 613 | "non-equilibrium structures": "True", 614 | "charges": [0], 615 | "multiplicities": [1], 616 | "excited_states": "False", 617 | "solvent": ["gas_phase"], 618 | "temperature": "False", 619 | "properties": [ 620 | "total_energy" 621 | ], 622 | "doi": "10.1063/5.0061990", 623 | "reference": [ 624 | "@article{christensen2021orbnet,\n title={OrbNet Denali: A machine learning potential for biological and organic chemistry with semi-empirical cost and DFT accuracy},\n author={Christensen, Anders S and Sirumalla, Sai Krishna and Qiao, Zhuoran and O\u2019Connor, Michael B and Smith, Daniel GA and Ding, Feizhi and Bygrave, Peter J and Anandkumar, Animashree and Welborn, Matthew and Manby, Frederick R and others},\n
journal={The Journal of Chemical Physics},\n volume={155},\n number={20},\n year={2021},\n publisher={AIP Publishing}\n}\n" 625 | ] 626 | }, 627 | { 628 | "name": "PC9", 629 | "full_name": "", 630 | "description": " 99,234 distinct molecules, a subset of PubChemQC project selected based on the limitations of QM9 data set (size of up to 9 heavy atoms in the range C, N, O and F) ", 631 | "methods":{ 632 | "geometry_optimization":{ 633 | "B3LYP/6-31G*":"GAMESS" 634 | }, 635 | "energy":{ 636 | "B3LYP/6-31G*":"GAMESS" 637 | } 638 | }, 639 | "data_size": { 640 | "number_of_structures": 99234 641 | }, 642 | "data_access": { 643 | "Figshare": "https://doi.org/10.6084/m9.figshare.9033977.v1", 644 | "Zenodo": "https://doi.org/10.5281/zenodo.3588370" 645 | }, 646 | "chemical_elements": ["H","C","N","O","F"], 647 | "number_of_heavy_atoms": ["checking required","checking required",9], 648 | "initial_source": [ 649 | "PubChemQC" 650 | ], 651 | "non-equilibrium structures": "False", 652 | "charges": [0], 653 | "multiplicities": [1,2,3], 654 | "excited_states": "False", 655 | "solvent": ["gas_phase"], 656 | "temperature": "False", 657 | "properties": [ 658 | "total_energy", 659 | "HOMO/LUMO_energy" 660 | ], 661 | "doi": "10.1186/s13321-019-0391-2", 662 | "reference": [ 663 | "@article{nakata2023pubchemqc,\n title={{PubChemQC} {B3LYP}/6-31{G}*//{PM6} Data Set: The Electronic Structures of 86 Million Molecules Using {B3LYP/6-31G*} Calculations},\n author={Nakata, Maho and Maeda, Toshiyuki},\n journal={Journal of Chemical Information and Modeling},\n volume={63},\n number={18},\n pages={5734--5754},\n year={2023},\n publisher={ACS Publications}\n}\n" 664 | ] 665 | }, 666 | { 667 | "name": "PubChemQC B3LYP/6-31G*//PM6", 668 | "full_name": "", 669 | "description": " A collection of electronic properties for nearly 86 million molecules, encompassing a broad spectrum of essential compounds and biomolecules with molecular weights up to 1000 ", 670 | "methods":{ 671 | 
"geometry_optimization":{ 672 | "PM6":"Gaussian 09" 673 | }, 674 | "energy":{ 675 | "B3LYP/6-31G*":"GAMESS" 676 | } 677 | }, 678 | "data_size": { 679 | "number_of_structures": 8.6e7 680 | }, 681 | "data_access": { 682 | "Link": "https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_pm6_datasets.html" 683 | }, 684 | "chemical_elements": ["H","C","N","O","P","S","F","Cl","Na","K","Mg","Ca"], 685 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 686 | "initial_source": [ 687 | "PubChemQC" 688 | ], 689 | "non-equilibrium structures": "False", 690 | "charges": [0], 691 | "multiplicities": ["checking required"], 692 | "excited_states": "False", 693 | "solvent": ["gas_phase"], 694 | "temperature": "False", 695 | "properties": [ 696 | "total_energy", 697 | "HOMO/LUMO_energy", 698 | "dipole_moment" 699 | ], 700 | "doi": "10.1021/acs.jcim.3c00899", 701 | "reference": [ 702 | "@article{nakata2023pubchemqc,\n title={{PubChemQC} {B3LYP}/6-31{G}*//{PM6} Data Set: The Electronic Structures of 86 Million Molecules Using {B3LYP/6-31G*} Calculations},\n author={Nakata, Maho and Maeda, Toshiyuki},\n journal={Journal of Chemical Information and Modeling},\n volume={63},\n number={18},\n pages={5734--5754},\n year={2023},\n publisher={ACS Publications}\n}\n" 703 | ] 704 | }, 705 | { 706 | "name": "PubChemQC Database", 707 | "full_name": "", 708 | "description": " 3 million molecules optimized for their ground states and over 2 million molecules with low-lying excited states ", 709 | "methods":{ 710 | "geometry_optimization":{ 711 | "B3LYP/6-31G*":"GAMESS" 712 | }, 713 | "energy":{ 714 | "TD-B3LYP/6-31+G*":"GAMESS" 715 | } 716 | }, 717 | "data_size": { 718 | "number_of_structures": 3e6 719 | }, 720 | "data_access": { 721 | "Link": "https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_2017.html" 722 | }, 723 | "chemical_elements": ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne", "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar", "K", "Ca", "Sc", 
"Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn"], 724 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 725 | "initial_source": [ 726 | "PubChem" 727 | ], 728 | "non-equilibrium structures": "False", 729 | "charges": [0], 730 | "multiplicities": ["checking required"], 731 | "excited_states": "True", 732 | "solvent": ["gas_phase"], 733 | "temperature": "False", 734 | "properties": [ 735 | "total_energy", 736 | "dipole_moment", 737 | "HOMO/LUMO_energy" 738 | ], 739 | "doi": "10.1021/acs.jcim.7b00083", 740 | "reference": [ 741 | "@article{nakata2020pubchemqc,\n title={{PubChemQC PM6}: Data sets of 221 million molecules with optimized molecular geometries and electronic properties},\n author={Nakata, Maho and Shimazaki, Tomomi and Hashimoto, Masatomo and Maeda, Toshiyuki},\n journal={Journal of Chemical Information and Modeling},\n volume={60},\n number={12},\n pages={5891--5899},\n year={2020},\n publisher={ACS Publications}\n}\n" 742 | ] 743 | }, 744 | { 745 | "name": "PubChemQC PM6", 746 | "full_name": "", 747 | "description": " PM6 data for 221 million molecules, including optimized geometries and electronic structures ", 748 | "methods":{ 749 | "geometry_optimization":{ 750 | "PM6":"Gaussian 09" 751 | }, 752 | "energy":{ 753 | "PM6":"Gaussian 09" 754 | } 755 | }, 756 | "data_size": { 757 | "number_of_structures": 2.2e8 758 | }, 759 | "data_access": { 760 | "Link": "https://nakatamaho.riken.jp/pubchemqc.riken.jp/pm6_datasets.html" 761 | }, 762 | "chemical_elements": ["C","H","N","O","P","S","F","Cl","Na","K","Mg","Ca"], 763 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 764 | "initial_source": [ 765 | "PubChem" 766 | ], 767 | "non-equilibrium structures": "False", 768 | "charges": ["checking required"], 769 | "multiplicities": [1,3], 770 | "excited_states": "False", 771 | "solvent": ["gas_phase"], 772 | "temperature": "False", 773 | "properties": [ 774 | "total_energy", 775 | 
"orbital_energy", 776 | "dipole_moment", 777 | "enthalpy" 778 | 779 | ], 780 | "doi": "10.1021/acs.jcim.0c00740", 781 | "reference": [ 782 | "@article{nakata2020pubchemqc,\n title={{PubChemQC PM6}: Data sets of 221 million molecules with optimized molecular geometries and electronic properties},\n author={Nakata, Maho and Shimazaki, Tomomi and Hashimoto, Masatomo and Maeda, Toshiyuki},\n journal={Journal of Chemical Information and Modeling},\n volume={60},\n number={12},\n pages={5891--5899},\n year={2020},\n publisher={ACS Publications}\n}\n" 783 | ] 784 | }, 785 | { 786 | "name": "QCDGE", 787 | "full_name": "Quantum Chemistry Dataset with Ground- and Excited-State Properties", 788 | "description": " An extensive collection of ground and excited-state properties for 443,106 organic molecules, each containing up to ten heavy atoms, including carbon, nitrogen, oxygen, and fluorine. These molecules are sourced from well-known databases such as QM9, PubChemQC, and GDB-11. ", 789 | "methods":{ 790 | "geometry_optimization":{ 791 | "B3LYP-D3(BJ)/6-31G*":"Gaussian 16" 792 | }, 793 | "energy":{ 794 | "B3LYP-D3(BJ)/6-31G*":"Gaussian 16", 795 | "TD-ωB97X-D/6-31G*":"Gaussian 16" 796 | } 797 | }, 798 | "data_size": { 799 | "number_of_structures": 443106 800 | }, 801 | "data_access": { 802 | "Link": "http://langroup.site/QCDGE/" 803 | }, 804 | "chemical_elements": ["H","C","N","O","F"], 805 | "number_of_heavy_atoms": ["checking required","checking required",10], 806 | "initial_source": [ 807 | "QM9", 808 | "PubChemQC", 809 | "GDB-11" 810 | ], 811 | "non-equilibrium structures": "False", 812 | "charges": ["checking required"], 813 | "multiplicities": ["checking required"], 814 | "excited_states": "True", 815 | "solvent": ["gas_phase"], 816 | "temperature": "False", 817 | "properties": [ 818 | "total_energy", 819 | "HOMO/LUMO_energy", 820 | "dipole_moment" 821 | ], 822 | "doi": "10.1038/s41597-024-03788-x", 823 | "reference": [ 824 | "@article{zhu2024quantum,\n title={Quantum 
Chemistry Dataset with Ground- and Excited-state Properties of 450 Kilo Molecules},\n author={Zhu, Yifei and Li, Mengge and Xu, Chao and Lan, Zhenggang},\n journal={Scientific Data},\n volume={11},\n number={1},\n pages={948},\n year={2024},\n publisher={Nature Publishing Group UK London}\n}\n" 825 | ] 826 | }, 827 | { 828 | "name": "QM1B", 829 | "full_name": "", 830 | "description": " One billion conformers for molecules with 9-11 heavy atoms ", 831 | "methods":{ 832 | "geometry_optimization":{ 833 | "False":"False" 834 | }, 835 | "energy":{ 836 | "B3LYP/STO-3G":"PySCF-IPU" 837 | } 838 | }, 839 | "data_size": { 840 | "number_of_structures": 1e9 841 | }, 842 | "data_access": { 843 | "GitHub": "https://github.com/graphcore-research/qm1b-dataset" 844 | }, 845 | "chemical_elements": ["H","C","N","O","F"], 846 | "number_of_heavy_atoms": [9,"checking required",11], 847 | "initial_source": [ 848 | "GDB-11" 849 | ], 850 | "non-equilibrium structures": "True", 851 | "charges": ["checking required"], 852 | "multiplicities": ["checking required"], 853 | "excited_states": "False", 854 | "solvent": ["gas_phase"], 855 | "temperature": "False", 856 | "properties": [ 857 | "total_energy", 858 | "HOMO/LUMO_energy" 859 | ], 860 | "doi": "10.48550/arXiv.2311.01135", 861 | "reference": [ 862 | "@article{mathiasen2024generating,\n title={Generating {QM1B} with {PySCF}$_\\text{IPU}$},\n author={Mathiasen, Alexander and Helal, Hatem and Klaser, Kerstin and Balanca, Paul and Dean, Josef and Luschi, Carlo and Beaini, Dominique and Fitzgibbon, Andrew and Masters, Dominic},\n journal={Advances in Neural Information Processing Systems},\n volume={36},\n year={2024}\n}\n" 863 | ] 864 | }, 865 | { 866 | "name": "QM7", 867 | "full_name": "", 868 | "description": " Focuses on a subset of 7,165 small organic molecules from the GDB-13 database, providing Coulomb matrices, atomization energies, atomic charge, and Cartesian coordinates ", 869 | "methods":{ 870 | "geometry_optimization":{ 871 |
"PBE0":"FHI-aims" 872 | }, 873 | "energy":{ 874 | "PBE0":"FHI-aims" 875 | } 876 | }, 877 | "data_size": { 878 | "number_of_structures": 7165 879 | }, 880 | "data_access": { 881 | "Link": "http://quantum-machine.org/datasets/" 882 | }, 883 | "chemical_elements": ["H","C","N","O","S"], 884 | "number_of_heavy_atoms": ["checking required","checking required",7], 885 | "initial_source": [ 886 | "GDB-13" 887 | ], 888 | "non-equilibrium structures": "False", 889 | "charges": [0], 890 | "multiplicities": ["checking required"], 891 | "excited_states": "checking required", 892 | "solvent": ["gas_phase"], 893 | "temperature": "False", 894 | "properties": [ 895 | "atomization_energy", 896 | "coulomb_matrix" 897 | ], 898 | "doi": "10.1103/PhysRevLett.108.058301", 899 | "reference": [ 900 | "@article{rupp2012fast,\n title={Fast and accurate modeling of molecular atomization energies with machine learning},\n author={Rupp, Matthias and Tkatchenko, Alexandre and M{\\\"u}ller, Klaus-Robert and Von Lilienfeld, O Anatole},\n journal={Physical review letters},\n volume={108},\n number={5},\n pages={058301},\n year={2012},\n publisher={APS}\n}\n\n" 901 | ] 902 | }, 903 | { 904 | "name": "QM7-X", 905 | "full_name": "", 906 | "description": " 4.2 million structures of small organic molecules (containing up to seven non-hydrogen atoms) with a rich set of 42 properties ", 907 | "methods":{ 908 | "geometry_optimization":{ 909 | "DFTB3+MBD": "DFTB+" 910 | }, 911 | "energy":{ 912 | "PBE0+MBD": "FHI-aims" 913 | } 914 | }, 915 | "data_size": { 916 | "number_of_structures": 4.2e6 917 | }, 918 | "data_access": { 919 | "Zenodo": "https://doi.org/10.5281/zenodo.4288677" 920 | }, 921 | "chemical_elements": ["H","C","N","O","S","Cl"], 922 | "number_of_heavy_atoms": [1,"checking required",7], 923 | "initial_source": [ 924 | "GDB-13" 925 | ], 926 | "non-equilibrium structures": "True", 927 | "charges": ["checking required"], 928 | "multiplicities": ["checking required"], 929 | "excited_states": 
"False", 930 | "solvent": ["gas_phase"], 931 | "temperature": "False", 932 | "properties": [ 933 | "total_energy", 934 | "force", 935 | "dipole_moment", 936 | "quadrupole_moment", 937 | "HOMO/LUMO_energy", 938 | "HOMO/LUMO_gap", 939 | "polarizability" 940 | 941 | ], 942 | "doi": "10.1038/s41597-021-00812-2", 943 | "reference": [ 944 | "@article{hoja2021qm7,\n title={{QM7-X}, a comprehensive dataset of quantum-mechanical properties spanning the chemical space of small organic molecules},\n author={Hoja, Johannes and Medrano Sandonas, Leonardo and Ernst, Brian G and Vazquez-Mayagoitia, Alvaro and DiStasio Jr, Robert A and Tkatchenko, Alexandre},\n journal={Scientific data},\n volume={8},\n number={1},\n pages={43},\n year={2021},\n publisher={Nature Publishing Group UK London}\n}\n" 945 | ] 946 | }, 947 | { 948 | "name": "QM7b", 949 | "full_name": "", 950 | "description": " An extension of QM7, providing data on 7,211 small organic molecules including 14 properties such as atomization energy, static polarizability, and frontier orbital eigenvalues, calculated using DFT and other quantum chemistry methods ", 951 | "methods":{ 952 | "geometry_optimization":{ 953 | "PBE0":"FHI-aims" 954 | }, 955 | "energy":{ 956 | "PBE0":"FHI-aims" 957 | } 958 | }, 959 | "data_size": { 960 | "number_of_structures": 7211 961 | }, 962 | "data_access": { 963 | "Link": "http://stacks.iop.org/NJP/15/095003/mmedia" 964 | }, 965 | "chemical_elements": ["H","C","N","O","S","Cl"], 966 | "number_of_heavy_atoms": ["checking required","checking required",7], 967 | "initial_source": [ 968 | "GDB-13" 969 | ], 970 | "non-equilibrium structures": "False", 971 | "charges": ["checking required"], 972 | "multiplicities": ["checking required"], 973 | "excited_states": "True", 974 | "solvent": ["gas_phase"], 975 | "temperature": "False", 976 | "properties": [ 977 | "atomization_energy", 978 | "HOMO/LUMO_energy", 979 | "ionization_potential", 980 | "electron_affinity" 981 | ], 982 | "doi": 
"10.1088/1367-2630/15/9/095003", 983 | "reference": [ 984 | "@article{montavon2013machine,\n title={Machine learning of molecular electronic properties in chemical compound space},\n author={Montavon, Gr{\\'e}goire and Rupp, Matthias and Gobre, Vivekanand and Vazquez-Mayagoitia, Alvaro and Hansen, Katja and Tkatchenko, Alexandre and M{\\\"u}ller, Klaus-Robert and Von Lilienfeld, O Anatole},\n journal={New Journal of Physics},\n volume={15},\n number={9},\n pages={095003},\n year={2013},\n publisher={IOP Publishing}\n}\n" 985 | ] 986 | }, 987 | { 988 | "name": "QM8", 989 | "full_name": "", 990 | "description": " Provides electronic spectra data for approximately 21,786 small organic molecules, derived from QM9, calculated using TD-DFT and other excited-state methods ", 991 | "methods":{ 992 | "geometry_optimization":{ 993 | "B3LYP/6-31G(2df,p)":"Gaussian 09" 994 | }, 995 | "energy":{ 996 | "TD-PBE0/def2-SVP":"TURBOMOLE", 997 | "TD-RI-CC2/def2-TZVP":"TURBOMOLE", 998 | "TD-CAM-B3LYP/def2-SVP":"Gaussian 09" 999 | } 1000 | }, 1001 | "data_size": { 1002 | "number_of_structures": 21800 1003 | }, 1004 | "data_access": { 1005 | "Link": "https://pubs.aip.org/jcp/article-supplement/73278/zip/084111_1_supplements/" 1006 | }, 1007 | "chemical_elements": ["H","C","N","O","F"], 1008 | "number_of_heavy_atoms": ["checking required","checking required",8], 1009 | "initial_source": [ 1010 | "GDB-17" 1011 | ], 1012 | "non-equilibrium structures": "False", 1013 | "charges": [0], 1014 | "multiplicities": [1], 1015 | "excited_states": "True", 1016 | "solvent": ["gas_phase"], 1017 | "temperature": "False", 1018 | "properties": [ 1019 | "total_energy", 1020 | "oscillator_strength" 1021 | ], 1022 | "doi": "10.1063/1.4928757", 1023 | "reference": [ 1024 | "@article{ramakrishnan2015electronic,\n title={Electronic spectra from TDDFT and machine learning in chemical space},\n author={Ramakrishnan, Raghunathan and Hartmann, Mia and Tapavicza, Enrico and Von Lilienfeld, O Anatole},\n journal={The 
Journal of chemical physics},\n volume={143},\n number={8},\n year={2015},\n publisher={AIP Publishing}\n}\n" 1025 | ] 1026 | }, 1027 | { 1028 | "name": "QM9", 1029 | "full_name": "QM9", 1030 | "description": " A collection of molecular structures and properties for 134,000 small organic molecules ", 1031 | "methods": { 1032 | "geometry_optimization":{ 1033 | "B3LYP/6-31G(2df,p)": "Gaussian 09" 1034 | }, 1035 | "energy":{ 1036 | "B3LYP/6-31G(2df,p)": "Gaussian 09", 1037 | "G4MP2": "Gaussian 09" 1038 | } 1039 | }, 1040 | "data_size": { 1041 | "number_of_structures": 133885 1042 | }, 1043 | "data_access": { 1044 | "Figshare": "https://doi.org/10.6084/m9.figshare.978904" 1045 | }, 1046 | "chemical_elements": ["H","C","N","O","F"], 1047 | "number_of_heavy_atoms": ["checking required","checking required",9], 1048 | "initial_source": [ 1049 | "GDB-17" 1050 | ], 1051 | "non-equilibrium structures": "False", 1052 | "charges": [0], 1053 | "multiplicities": [1], 1054 | "excited_states": "False", 1055 | "solvent": ["gas_phase"], 1056 | "temperature": 298.15, 1057 | "properties": [ 1058 | "total_energy", 1059 | "enthalpy", 1060 | "entropy", 1061 | "heat_capacity", 1062 | "zero_point_energy", 1063 | "HOMO/LUMO_energy", 1064 | "HOMO/LUMO_gap", 1065 | "atomization_free_energy", 1066 | "harmonic_frequency", 1067 | "dipole_moment", 1068 | "polarizability" 1069 | ], 1070 | "doi": "10.1038/sdata.2014.22", 1071 | "reference": [ 1072 | "@article{ramakrishnan2014quantum,\n title={Quantum chemistry structures and properties of 134 kilo molecules},\n author={Ramakrishnan, Raghunathan and Dral, Pavlo O and Rupp, Matthias and Von Lilienfeld, O Anatole},\n journal={Scientific data},\n volume={1},\n number={1},\n pages={1--7},\n year={2014},\n publisher={Nature Publishing Group}\n}\n\n" 1073 | ] 1074 | }, 1075 | { 1076 | "name": "QM9-G4MP2", 1077 | "full_name": "QM9-G4MP2", 1078 | "description": " Provides highly accurate G4MP2 calculations for the molecular structures within QM9 ", 1079 |
"methods": { 1080 | "geometry_optimization":{ 1081 | "B3LYP/6-31g(2df,p)": "Gaussian 16" 1082 | }, 1083 | "energy":{ 1084 | "G4MP2": "Gaussian 16", 1085 | "B3LYP/6-31g(2df,p)": "Gaussian 16", 1086 | "HF/6-31g(d)": "Gaussian 16", 1087 | "MP2/6-31g(d)": "Gaussian 16", 1088 | "MP3/6-31g(d)": "Gaussian 16", 1089 | "MP4D/6-31g(d)": "Gaussian 16", 1090 | "MP4DQ/6-31g(d)": "Gaussian 16", 1091 | "MP4SDTQ/6-31g(d)": "Gaussian 16", 1092 | "MP4SDQ/6-31g(d)": "Gaussian 16", 1093 | "CCSD/6-31g(d)": "Gaussian 16", 1094 | "CCSD(T)/6-31g(d)": "Gaussian 16", 1095 | "HF/G3MP2largeXP": "Gaussian 16", 1096 | "MP2/G3MP2largeXP": "Gaussian 16", 1097 | "HF/mod-aug-cc-pVTZ": "Gaussian 16", 1098 | "HF/mod-aug-cc-pvQZ": "Gaussian 16" 1099 | } 1100 | }, 1101 | "data_size": { 1102 | "number_of_structures": 133858 1103 | }, 1104 | "data_access": { 1105 | "Figshare": "https://doi.org/10.6084/m9.figshare.c.4351631.v1" 1106 | }, 1107 | "chemical_elements": ["H","C","N","O","F"], 1108 | "number_of_heavy_atoms": ["checking required","checking required",9], 1109 | "initial_source": [ 1110 | "QM9" 1111 | ], 1112 | "non-equilibrium structures": "False", 1113 | "charges": [0], 1114 | "multiplicities": ["checking required"], 1115 | "excited_states": "False", 1116 | "solvent": ["gas_phase"], 1117 | "temperature": 298.15, 1118 | "properties": [ 1119 | "total_energy", 1120 | "zero_point_energy", 1121 | "Gibbs_free_energy", 1122 | "enthalpy" 1123 | ], 1124 | "doi": "10.1038/s41597-019-0121-7", 1125 | "reference": [ 1126 | "@article{kim2019energy,\n title={Energy refinement and analysis of structures in the QM9 database via a highly accurate quantum chemical method},\n author={Kim, Hyungjun and Park, Ji Young and Choi, Sunghwan},\n journal={Scientific data},\n volume={6},\n number={1},\n pages={109},\n year={2019},\n publisher={Nature Publishing Group UK London}\n}\n" 1127 | ] 1128 | }, 1129 | { 1130 | "name": "QM-sym", 1131 | "description": " 135,000 structures with C$_{n\\text{h}}$ symmetry ", 1132 | 
"methods":{ 1133 | "geometry_optimization":{ 1134 | "B3LYP/6-31G(2df,p)":"Gaussian 09" 1135 | }, 1136 | "energy":{ 1137 | "B3LYP/6-31G(2df,p)":"Gaussian 09" 1138 | } 1139 | }, 1140 | "data_size": { 1141 | "number_of_structures": 1.35e5 1142 | }, 1143 | "data_access": { 1144 | "GitHub": "https://github.com/XI-Lab/QM-sym-database", 1145 | "Figshare": "https://doi.org/10.6084/m9.Figshare.9638093" 1146 | }, 1147 | "chemical_elements": ["H","B","C","N","O","F","Cl","Br"], 1148 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 1149 | "initial_source": [], 1150 | "non-equilibrium structures": "False", 1151 | "charges": ["checking required"], 1152 | "multiplicities": [1], 1153 | "excited_states": "False", 1154 | "solvent": ["gas_phase"], 1155 | "temperature": "False", 1156 | "properties": [ 1157 | "total_energy", 1158 | "HOMO/LUMO_energy", 1159 | "symmetry", 1160 | "enthalpy", 1161 | "atomization_energy" 1162 | ], 1163 | "doi": "10.1038/s41597-019-0237-9", 1164 | "reference": [ 1165 | "@article{liang2019qm,\n title={QM-sym, a symmetrized quantum chemistry database of 135 kilo molecules},\n author={Liang, Jiechun and Xu, Yanheng and Liu, Rulin and Zhu, Xi},\n journal={Scientific Data},\n volume={6},\n number={1},\n pages={213},\n year={2019},\n publisher={Nature Publishing Group UK London}\n}\n" 1166 | ] 1167 | }, 1168 | { 1169 | "name": "QM-symex", 1170 | "full_name": "", 1171 | "description": " The QM-sym data set has been expanded to include an additional 38,000 molecules, providing valuable information on excited-state properties at TD-B3LYP/6-31G//B3LYP/6-31G(2df,p); Gaussian 09 ", 1172 | "methods":{ 1173 | "geometry_optimization":{ 1174 | "B3LYP/6-31G(2df,p)":"Gaussian 09" 1175 | }, 1176 | "energy":{ 1177 | "B3LYP/6-31G(2df,p)":"Gaussian 09", 1178 | "TD-B3LYP/6-31G(2df,p)":"Gaussian 09" 1179 | } 1180 | }, 1181 | "data_size": { 1182 | "number_of_structures": 1.73e5 1183 | }, 1184 | "data_access": { 1185 | "Figshare": 
"https://doi.org/10.6084/m9.Figshare.12815276" 1186 | }, 1187 | "chemical_elements": ["H","B","C","N","O","F","Cl","Br"], 1188 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 1189 | "initial_source": [], 1190 | "non-equilibrium structures": "False", 1191 | "charges": ["checking required"], 1192 | "multiplicities": [1,3], 1193 | "excited_states": "True", 1194 | "solvent": ["gas_phase"], 1195 | "temperature": "False", 1196 | "properties": [ 1197 | "total_energy", 1198 | "wavelength", 1199 | "symmetry", 1200 | "oscillator_strength", 1201 | "spin" 1202 | ], 1203 | "doi": "10.1038/s41597-020-00746-1", 1204 | "reference": [ 1205 | "@article{liang2020qm,\n title={QM-symex, update of the QM-sym database with excited state information for 173 kilo molecules},\n author={Liang, Jiechun and Ye, Shuqian and Dai, Tianshu and Zha, Ziyue and Gao, Yuechen and Zhu, Xi},\n journal={Scientific Data},\n volume={7},\n number={1},\n pages={400},\n year={2020},\n publisher={Nature Publishing Group UK London}\n}\n" 1206 | ] 1207 | }, 1208 | { 1209 | "name": "QM-22", 1210 | "full_name": "", 1211 | "description": " A compilation of molecular data sets specifically curated for DMC calculations of the zero-point state ", 1212 | "methods":{ 1213 | "geometry_optimization":{ 1214 | "checking required":"checking required" 1215 | }, 1216 | "energy":{ 1217 | "CCSD(T)":"checking required", 1218 | "MRCI":"checking required", 1219 | "B3LYP":"checking required", 1220 | "CCSD(T)-MRCI":"checking required", 1221 | "MP2":"checking required" 1222 | } 1223 | }, 1224 | "data_size": { 1225 | "number_of_structures": 1.15e6 1226 | }, 1227 | "data_access": { 1228 | "GitHub": "https://github.com/jmbowma/QM-22" 1229 | }, 1230 | "chemical_elements": ["H","C","N","O"], 1231 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 1232 | "number_of_atoms": [4,"checking required",15], 1233 | "initial_source": [ 1234 | "checking required" 1235 | 
], 1236 | "non-equilibrium structures": "True", 1237 | "charges": [0,1], 1238 | "multiplicities": [1,3], 1239 | "excited_states": "False", 1240 | "solvent": ["gas_phase"], 1241 | "temperature": "checking required", 1242 | "properties": [ 1243 | "total_energy", 1244 | "force" 1245 | ], 1246 | "doi": "10.1063/5.0089200", 1247 | "reference": [ 1248 | "@article{bowman2022md17,\n title={The {MD17} datasets from the perspective of datasets for gas-phase \u201csmall\u201d molecule potentials},\n author={Bowman, Joel M and Qu, Chen and Conte, Riccardo and Nandi, Apurba and Houston, Paul L and Yu, Qi},\n journal={The Journal of chemical physics},\n volume={156},\n number={24},\n year={2022},\n publisher={AIP Publishing}\n}\n" 1249 | ] 1250 | }, 1251 | { 1252 | "name": "QMugs", 1253 | "full_name": "Quantum-Mechanical Properties of Druglike Molecules", 1254 | "description": " Quantum-mechanical properties of over 665,000 drug-like molecules extracted from the ChEMBL database ", 1255 | "methods":{ 1256 | "geometry_optimization":{ 1257 | "GFN2-xTB":"xTB" 1258 | }, 1259 | "energy":{ 1260 | "ωB97X-D/def2-SVP":"Psi4" 1261 | } 1262 | }, 1263 | "data_size": { 1264 | "number_of_structures": 2e6 1265 | }, 1266 | "data_access": { 1267 | "Link": "https://doi.org/10.3929/ethz-b-000482129" 1268 | }, 1269 | "chemical_elements": ["H","C","N","O","F","P","S","Cl","Br","I"], 1270 | "number_of_heavy_atoms": ["checking required",30.6,100], 1271 | "initial_source": [ 1272 | "ChEMBL" 1273 | ], 1274 | "non-equilibrium structures": "False", 1275 | "charges": [0], 1276 | "multiplicities": ["checking required"], 1277 | "excited_states": "checking required", 1278 | "solvent": ["gas_phase"], 1279 | "temperature": "False", 1280 | "properties": [ 1281 | "total_energy", 1282 | "dipole_moment", 1283 | "rotational_constant", 1284 | "partial_charge", 1285 | "density_matrix" 1286 | ], 1287 | "doi": "10.1038/s41597-022-01390-7", 1288 | "reference": [ 1289 | "@article{isert2022qmugs,\n title={QMugs, quantum 
mechanical properties of drug-like molecules},\n author={Isert, Clemens and Atz, Kenneth and Jim{\\'e}nez-Luna, Jos{\\'e} and Schneider, Gisbert},\n journal={Scientific Data},\n volume={9},\n number={1},\n pages={273},\n year={2022},\n publisher={Nature Publishing Group UK London}\n}\n" 1290 | ] 1291 | }, 1292 | { 1293 | "name": "SPICE", 1294 | "full_name": "", 1295 | "description": " 2,008,628 conformations of 113,999 drug-like small molecules and proteins ", 1296 | "methods":{ 1297 | "geometry_optimization":{ 1298 | "False":"False" 1299 | }, 1300 | "energy":{ 1301 | "ωB97M-D3(BJ)/def2-TZVPPD":"Psi4 1.4.1" 1302 | } 1303 | }, 1304 | "data_size": { 1305 | "number_of_structures": 1132808 1306 | }, 1307 | "data_access": { 1308 | "Zenodo": "https://doi.org/10.5281/zenodo.7338495", 1309 | "GitHub": "https://github.com/openmm/spice-dataset" 1310 | }, 1311 | "chemical_elements": ["H","Li","C","N","O","F","Na","Mg","P","S","Cl","K","Ca","Br","I"], 1312 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 1313 | "number_of_atoms" : [2,"checking required",96], 1314 | "initial_source": [ 1315 | "PubChem", 1316 | "DES370K" 1317 | 1318 | ], 1319 | "non-equilibrium structures": "True", 1320 | "charges": [-8,-6,-5,-4,-3,-2,-1,0,1,2], 1321 | "multiplicities": ["checking required"], 1322 | "excited_states": "False", 1323 | "solvent": ["gas_phase","water"], 1324 | "temperature": "False", 1325 | "properties": [ 1326 | "total_energy", 1327 | "force", 1328 | "dipole_moment" 1329 | ], 1330 | "doi": "10.1038/s41597-022-01882-6", 1331 | "reference": [ 1332 | "@article{eastman2023spice,\n title={Spice, a dataset of drug-like molecules and peptides for training machine learning potentials},\n author={Eastman, Peter and Behara, Pavan Kumar and Dotson, David L and Galvelis, Raimondas and Herr, John E and Horton, Josh T and Mao, Yuezhi and Chodera, John D and Pritchard, Benjamin P and Wang, Yuanqing and others},\n journal={Scientific Data},\n volume={10},\n 
number={1},\n pages={11},\n year={2023},\n publisher={Nature Publishing Group UK London}\n}\n", 1333 | "@article{eastman2024spice,\n title={Nutmeg and {SPICE}: Models and Data for Biomolecular Machine Learning},\n author={Eastman, Peter and Pritchard, Benjamin P and Chodera, John D and Markland, Thomas E},\n journal={arXiv preprint arXiv:2406.13112v1},\n year={2024}\n}\n" 1334 | ] 1335 | }, 1336 | { 1337 | "name": "Transition1x", 1338 | "full_name": "", 1339 | "description": " 9.6 million data points, each meticulously generated using DFT calculations with forces and energies for a staggering 10,000 organic reactions ", 1340 | "methods":{ 1341 | "geometry_optimization":{ 1342 | "ωB97X/6-31G*":"ORCA 5.0.2" 1343 | }, 1344 | "energy":{ 1345 | "ωB97X/6-31G*":"ORCA 5.0.2" 1346 | } 1347 | }, 1348 | "data_size": { 1349 | "number_of_structures": 9.6e6 1350 | }, 1351 | "data_access": { 1352 | "Figshare": "https://doi.org/10.6084/m9.figshare.19614657.v4", 1353 | "GitLab": "https://gitlab.com/matschreiner/T1x" 1354 | }, 1355 | "chemical_elements": ["H","C","N","O"], 1356 | "number_of_heavy_atoms": ["checking required","checking required", 7], 1357 | "initial_source": [ 1358 | "GDB-7" 1359 | ], 1360 | "non-equilibrium structures": "True", 1361 | "charges": ["checking required"], 1362 | "multiplicities": ["checking required"], 1363 | "excited_states": "False", 1364 | "solvent": ["gas_phase"], 1365 | "temperature": "False", 1366 | "properties": [ 1367 | "total_energy", 1368 | "force" 1369 | ], 1370 | "doi": "10.1038/s41597-022-01870-w", 1371 | "reference": [ 1372 | "@article{schreiner2022transition1x,\n title={Transition1x-a dataset for building generalizable reactive machine learning potentials},\n author={Schreiner, Mathias and Bhowmik, Arghya and Vegge, Tejs and Busk, Jonas and Winther, Ole},\n journal={Scientific Data},\n volume={9},\n number={1},\n pages={779},\n year={2022},\n publisher={Nature Publishing Group UK London}\n}\n" 1373 | ] 1374 | }, 1375 | { 1376 | "name": 
"VIB5", 1377 | "full_name": "", 1378 | "description": " Ab initio quantum chemical data for five small polyatomic molecules with significant astrophysical relevance ", 1379 | "methods":{ 1380 | "geometry_optimization":{ 1381 | "False":"False" 1382 | }, 1383 | "energy":{ 1384 | "TBE":"MOLPRO 2012", 1385 | "MP2/cc-pVTZ":"CFOUR 1.0, 2.1", 1386 | "CCSD(T)/cc-pVQZ":"CFOUR 1.0, 2.1", 1387 | "HF/cc-pVTZ":"CFOUR 1.0, 2.1", 1388 | "HF/cc-pVQZ":"CFOUR 1.0, 2.1" 1389 | } 1390 | }, 1391 | "data_size": { 1392 | "number_of_structures": 324592 1393 | }, 1394 | "data_access": { 1395 | "Figshare": "https://doi.org/10.6084/m9.figshare.16903288" 1396 | }, 1397 | "chemical_elements": ["H","C","O","F","Na","Si","Cl"], 1398 | "number_of_heavy_atoms": [1,"checking required",2], 1399 | "initial_source": [], 1400 | "non-equilibrium structures": "True", 1401 | "charges": [0], 1402 | "multiplicities": [1], 1403 | "excited_states": "False", 1404 | "solvent": ["gas_phase"], 1405 | "temperature": "False", 1406 | "properties": [ 1407 | "total_energy", 1408 | "force" 1409 | ], 1410 | "doi": "10.1038/s41597-022-01185-w", 1411 | "reference": [ 1412 | "@article{zhang2022vib5,\n title={VIB5 database with accurate ab initio quantum chemical molecular potential energy surfaces},\n author={Zhang, Lina and Zhang, Shuang and Owens, Alec and Yurchenko, Sergei N and Dral, Pavlo O},\n journal={Scientific Data},\n volume={9},\n number={1},\n pages={84},\n year={2022},\n publisher={Nature Publishing Group UK London}\n}\n" 1413 | ] 1414 | }, 1415 | { 1416 | "name": "VQM24", 1417 | "full_name": "Vector-QM24", 1418 | "description": " Providing QM properties of 258,242 unique constitutional isomers and 577,705 conformers of varying stoichiometries, focusing on molecules composed of up to five heavy atoms from C, N, O, F, Si, P, S, Cl, Br ", 1419 | "methods":{ 1420 | "geometry_optimization":{ 1421 | "ωB97X-D3/cc-pVDZ":"Psi4 1.7" 1422 | }, 1423 | "energy":{ 1424 | "DMC@PBE0/ccECP/cc-pVQZ":"QMCPACK", 1425 | 
"PBE0/ccECP/cc-pVQZ":"PySCF" 1426 | } 1427 | }, 1428 | "data_size": { 1429 | "number_of_structures": 8.36e5 1430 | }, 1431 | "data_access": { 1432 | "Zenodo": "https://doi.org/10.5281/zenodo.11164951" 1433 | }, 1434 | "chemical_elements": ["H","C","N","O","F","Si","P","S","Cl","Br"], 1435 | "number_of_heavy_atoms": ["checking required","checking required",5], 1436 | "initial_source": [], 1437 | "non-equilibrium structures": "False", 1438 | "charges": [0], 1439 | "multiplicities": [1], 1440 | "excited_states": "False", 1441 | "solvent": ["gas_phase"], 1442 | "temperature": 298.15, 1443 | "properties": [ 1444 | "total_energy", 1445 | "HOMO/LUMO_energy", 1446 | "dipole_moment", 1447 | "harmonic_frequency", 1448 | "zero_point_energy", 1449 | "enthalpy" 1450 | ], 1451 | "doi": "10.48550/arXiv.2405.05961", 1452 | "reference": [ 1453 | "@article{khan2024towards,\n title={Towards comprehensive coverage of chemical space: Quantum mechanical properties of 836k constitutional and conformational closed shell neutral isomers consisting of HCNOFSiPSClBr},\n author={Khan, Danish and Benali, Anouar and Kim, Scott YH and von Rudorff, Guido Falk and von Lilienfeld, O Anatole},\n journal={arXiv preprint arXiv:2405.05961},\n year={2024}\n}\n" 1454 | ] 1455 | }, 1456 | { 1457 | "name": "WS22 Database", 1458 | "full_name": "", 1459 | "description": " 1.18 million molecular geometries (equilibrium and non-equilibrium) for ten organic molecules containing up to 22 atoms ", 1460 | "methods":{ 1461 | "geometry_optimization":{ 1462 | "PBE0/6-311G(d)":"Gaussian 09" 1463 | }, 1464 | "energy":{ 1465 | "PBE0/6-311G(d)": "Gaussian 09", 1466 | "TD-PBE0/6-311G(d)": "Gaussian 09" 1467 | } 1468 | }, 1469 | "data_size": { 1470 | "number_of_structures": 1.18e6 1471 | }, 1472 | "data_access": { 1473 | "Zenodo": "https://doi.org/10.5281/zenodo.7032334" 1474 | }, 1475 | "chemical_elements": ["H","C","N","O"], 1476 | "number_of_heavy_atoms": [4,"checking required",14], 1477 | "initial_source": [], 1478 | 
"non-equilibrium structures": "True", 1479 | "charges": [0], 1480 | "multiplicities": ["checking required"], 1481 | "excited_states": "True", 1482 | "solvent": ["gas_phase"], 1483 | "temperature": 298.15, 1484 | "properties": [ 1485 | "total_energy", 1486 | "force", 1487 | "dipole_moment", 1488 | "polarizability", 1489 | "HOMO/LUMO_energy" 1490 | ], 1491 | "doi": "10.1038/s41597-023-01998-3", 1492 | "reference": [ 1493 | "@article{pinheiro2023ws22,\n title={WS22 database, Wigner Sampling and geometry interpolation for configurationally diverse molecular datasets},\n author={Pinheiro Jr, Max and Zhang, Shuang and Dral, Pavlo O and Barbatti, Mario},\n journal={Scientific Data},\n volume={10},\n number={1},\n pages={95},\n year={2023},\n publisher={Nature Publishing Group UK London}\n}\n" 1494 | ] 1495 | }, 1496 | { 1497 | "name": "xxMD", 1498 | "full_name": "", 1499 | "description": " Excited-state molecular dynamics data set for four molecular systems chosen for their photochemical activity: azobenzene, malonaldehyde, stilbene, and dithiophene ", 1500 | "methods":{ 1501 | "geometry_optimization":{ 1502 | "False":"False" 1503 | }, 1504 | "energy":{ 1505 | "SA-CASSCF":"OpenMolcas 22.06", 1506 | "M06/6-31G":"Psi4" 1507 | } 1508 | }, 1509 | "data_size": { 1510 | "number_of_structures": 86716 1511 | }, 1512 | "data_access": { 1513 | "GitHub": "https://github.com/zpengmei/xxMD", 1514 | "Zenodo": "https://doi.org/10.5281/zenodo.10393859" 1515 | }, 1516 | "chemical_elements": ["H","C","N","O","S"], 1517 | "number_of_heavy_atoms": [5,"checking required",18], 1518 | "initial_source": [], 1519 | "non-equilibrium structures": "True", 1520 | "charges": [0], 1521 | "multiplicities": [1], 1522 | "excited_states": "True", 1523 | "solvent": ["gas_phase"], 1524 | "temperature": "False", 1525 | "properties": [ 1526 | "total_energy", 1527 | "force" 1528 | ], 1529 | "doi": "10.1038/s41597-024-03019-3", 1530 | "reference": [ 1531 | "@article{pengmei2024beyond,\n title={Beyond {MD17}: the 
reactive {xxMD} dataset},\n author={Pengmei, Zihan and Liu, Junyu and Shu, Yinan},\n journal={Scientific Data},\n volume={11},\n number={1},\n pages={222},\n year={2024},\n publisher={Nature Publishing Group UK London}\n}\n" 1532 | ] 1533 | }, 1534 | { 1535 | "name": "CREMP", 1536 | "full_name": "Conformer-Rotamer Ensembles of Macrocyclic Peptides", 1537 | "description": "Macrocyclic peptides data set containing 4-, 5-, and 6-mer homodetic cyclic peptides.", 1538 | "methods":{ 1539 | "geometry_optimization":{ 1540 | "GFN2-xTB": "xTB" 1541 | }, 1542 | "energy":{ 1543 | "GFN2-xTB": "xTB" 1544 | } 1545 | }, 1546 | "data_size": { 1547 | "number_of_structures": 3.13e7 1548 | }, 1549 | "data_access": { 1550 | "GitHub": "https://github.com/Genentech/cremp", 1551 | "Zenodo": "https://doi.org/10.5281/zenodo.7931444" 1552 | }, 1553 | "chemical_elements": ["H","C","N","O","S"], 1554 | "number_of_heavy_atoms": [16,38,74], 1555 | "initial_source": ["literature"], 1556 | "non-equilibrium structures": "False", 1557 | "charges": [0], 1558 | "multiplicities": [1], 1559 | "excited_states": "False", 1560 | "solvent": ["chloroform"], 1561 | "properties": [ 1562 | "total_energy" 1563 | ], 1564 | "doi":"10.1038/s41597-024-03698-y", 1565 | "reference":[ 1566 | "@article{grambow2024cremp,\n title={CREMP: Conformer-rotamer ensembles of macrocyclic peptides for machine learning},\n author={Grambow, Colin A and Weir, Hayley and Cunningham, Christian N and Biancalani, Tommaso and Chuang, Kangway V},\n journal={Scientific Data},\n volume={11},\n number={1},\n pages={859},\n year={2024},\n publisher={Nature Publishing Group UK London}\n}\n" 1567 | ] 1568 | }, 1569 | { 1570 | "name": "CREMP-CycPeptMPDB", 1571 | "full_name": "Conformer-Rotamer Ensembles of Macrocyclic Peptides-CycPeptMPDB", 1572 | "description": "Macrocyclic peptides data set containing 6-, 7-, and 10-mer cyclic peptides.", 1573 | "methods":{ 1574 | "geometry_optimization":{ 1575 | "GFN2-xTB": "xTB" 1576 | }, 1577 | "energy":{ 
1578 | "GFN2-xTB": "xTB" 1579 | } 1580 | }, 1581 | "data_size": { 1582 | "number_of_structures": 3.7e6 1583 | }, 1584 | "data_access": { 1585 | "Zenodo": "https://doi.org/10.5281/zenodo.10798261" 1586 | }, 1587 | "chemical_elements": ["H","C","N","O","S","F","Cl"], 1588 | "number_of_heavy_atoms": [32,56,86], 1589 | "initial_source": ["CycPeptMPDB"], 1590 | "non-equilibrium structures": "False", 1591 | "charges": [0], 1592 | "multiplicities": ["checking required"], 1593 | "excited_states": "False", 1594 | "solvent": ["chloroform"], 1595 | "properties": [ 1596 | "total_energy" 1597 | ], 1598 | "doi":"10.1038/s41597-024-03698-y", 1599 | "reference":[ 1600 | "@article{grambow2024cremp,\n title={CREMP: Conformer-rotamer ensembles of macrocyclic peptides for machine learning},\n author={Grambow, Colin A and Weir, Hayley and Cunningham, Christian N and Biancalani, Tommaso and Chuang, Kangway V},\n journal={Scientific Data},\n volume={11},\n number={1},\n pages={859},\n year={2024},\n publisher={Nature Publishing Group UK London}\n}\n" 1601 | ] 1602 | }, 1603 | { 1604 | "name": "GEOM", 1605 | "full_name": "Geometric Ensemble Of Molecules", 1606 | "description": "Conformers for mid-sized organic molecules from QM9 and experimental datasets related to physical chemistry, biophysics, and physiology.", 1607 | "methods":{ 1608 | "geometry_optimization":{ 1609 | "GFN2-xTB": "xTB", 1610 | "r2scan-3c/mTZVPP": "ORCA 5.0.1" 1611 | }, 1612 | "energy":{ 1613 | "GFN2-xTB": "xTB", 1614 | "r2scan-3c/mTZVPP": "ORCA 5.0.1" 1615 | } 1616 | }, 1617 | "data_size": { 1618 | "number_of_structures": 3.7e7 1619 | }, 1620 | "data_access": { 1621 | "GitHub": "https://github.com/learningmatter-mit/geom" 1622 | }, 1623 | "chemical_elements": ["checking required"], 1624 | "number_of_heavy_atoms": ["checking required", 25, 91], 1625 | "initial_source": [ 1626 | "AICures", 1627 | "MoleculeNet", 1628 | "QM9" 1629 | ], 1630 | "non-equilibrium structures": "False", 1631 | "charges": ["checking required"], 
1632 | "multiplicities": ["checking required"], 1633 | "excited_states": "checking required", 1634 | "solvent": ["gas_phase", "water"], 1635 | "properties": [ 1636 | "total_energy" 1637 | ], 1638 | "doi":"10.1038/s41597-022-01288-4", 1639 | "reference":[ 1640 | "@article{axelrod2022geom,\n title={GEOM, energy-annotated molecular conformations for property prediction and molecular generation},\n author={Axelrod, Simon and Gomez-Bombarelli, Rafael},\n journal={Scientific Data},\n volume={9},\n number={1},\n pages={185},\n year={2022},\n publisher={Nature Publishing Group UK London}\n}\n" 1641 | ] 1642 | }, 1643 | { 1644 | "name": "tmQM", 1645 | "full_name": "transition metal quantum mechanics dataset", 1646 | "description": "Data set of mononuclear transition metal-organic compound complexes covering 30 transition metals (3d, 4d, and 5d) from groups 3 to 12", 1647 | "methods":{ 1648 | "geometry_optimization":{ 1649 | "GFN2-xTB": "xTB" 1650 | }, 1651 | "energy":{ 1652 | "TPSSh-D3BJ/def2-SVP": "Gaussian 16" 1653 | } 1654 | }, 1655 | "data_size": { 1656 | "number_of_structures": 86699 1657 | }, 1658 | "data_access": { 1659 | "GitHub": "https://github.com/bbskjelstad/tmqm", 1660 | "Link": "http://quantum-machine.org/datasets/" 1661 | }, 1662 | "chemical_elements": ["checking required"], 1663 | "number_of_heavy_atoms": ["checking required"], 1664 | "initial_source": [ 1665 | "Cambridge Structural Database" 1666 | ], 1667 | "non-equilibrium structures": "False", 1668 | "charges": [-1,0,1], 1669 | "multiplicities": [1], 1670 | "excited_states": "False", 1671 | "solvent": ["gas_phase"], 1672 | "properties": [ 1673 | "total_energy", 1674 | "dispersion_energy", 1675 | "metal_center_natural_charge", 1676 | "HOMO/LUMO_energy", 1677 | "HOMO/LUMO_gap", 1678 | "dipole_moment", 1679 | "polarizability" 1680 | ], 1681 | "doi":"10.1021/acs.jcim.0c01041", 1682 | "reference":[ 1683 | "@article{balcells2020tmqm,\n title={tmQM dataset—quantum geometries and properties of 86k transition
metal complexes},\n author={Balcells, David and Skjelstad, Bastian Bjerkem},\n journal={Journal of chemical information and modeling},\n volume={60},\n number={12},\n pages={6135--6146},\n year={2020},\n publisher={ACS Publications}\n}\n" 1684 | ] 1685 | }, 1686 | { 1687 | "name": "COMPAS-1", 1688 | "full_name": "COMputational database of Polycyclic Aromatic Systems-1", 1689 | "description": "Part of the COMPAS project containing cata-condensed polybenzenoid hydrocarbons and various QM properties", 1690 | "methods":{ 1691 | "geometry_optimization":{ 1692 | "GFN2-xTB": "xTB", 1693 | "B3LYP-D3(BJ)/def2-SVP": "ORCA 4.2.0" 1694 | }, 1695 | "energy":{ 1696 | "GFN2-xTB": "xTB", 1697 | "B3LYP-D3(BJ)/def2-SVP": "ORCA 4.2.0" 1698 | } 1699 | }, 1700 | "data_size": { 1701 | "number_of_structures": 34072 1702 | }, 1703 | "data_access": { 1704 | "GitLab": "https://gitlab.com/porannegroup/compas" 1705 | }, 1706 | "chemical_elements": ["H","C"], 1707 | "number_of_heavy_atoms": [6,45,46], 1708 | "initial_source": [], 1709 | "non-equilibrium structures": "False", 1710 | "charges": [-1,0,1], 1711 | "multiplicities": [1,2], 1712 | "excited_states": "False", 1713 | "solvent": ["gas_phase"], 1714 | "properties": [ 1715 | "total_energy", 1716 | "dispersion_energy", 1717 | "HOMO/LUMO_energy", 1718 | "HOMO/LUMO_gap", 1719 | "dipole_moment", 1720 | "zero_point_energy", 1721 | "adiabatic_ionization_potential", 1722 | "adiabatic_electron_affinity" 1723 | ], 1724 | "doi":"10.1021/acs.jcim.2c00503", 1725 | "reference":[ 1726 | "@article{wahab2022compas,\n title={The compas project: A computational database of polycyclic aromatic systems. 
phase 1: cata-condensed polybenzenoid hydrocarbons},\n author={Wahab, Alexandra and Pfuderer, Lara and Paenurk, Eno and Gershoni-Poranne, Renana},\n journal={Journal of Chemical Information and Modeling},\n volume={62},\n number={16},\n pages={3704--3713},\n year={2022},\n publisher={ACS Publications}\n}\n" 1727 | ] 1728 | }, 1729 | { 1730 | "name": "COMPAS-2", 1731 | "full_name": "COMputational database of Polycyclic Aromatic Systems-2", 1732 | "description": "Part of the COMPAS project containing cata-condensed poly(hetero)cyclic aromatic molecules and various QM properties", 1733 | "methods":{ 1734 | "geometry_optimization":{ 1735 | "GFN1-xTB": "xTB", 1736 | "CAM-B3LYP-D3(BJ)/def2-SVP": "ORCA 4.2.0" 1737 | }, 1738 | "energy":{ 1739 | "GFN1-xTB": "xTB", 1740 | "CAM-B3LYP-D3(BJ)/def2-SVP": "ORCA 4.2.0" 1741 | } 1742 | }, 1743 | "data_size": { 1744 | "number_of_structures": 524392 1745 | }, 1746 | "data_access": { 1747 | "GitLab": "https://gitlab.com/porannegroup/compas", 1748 | "Figshare": "https://doi.org/10.6084/m9.figshare.24347152", 1749 | "Link": "https://compas.net.technion.ac.il/" 1750 | }, 1751 | "chemical_elements": ["H","B","C","N","O","S"], 1752 | "number_of_heavy_atoms": ["checking required"], 1753 | "initial_source": [], 1754 | "non-equilibrium structures": "False", 1755 | "charges": [-1,0,1], 1756 | "multiplicities": [1,2], 1757 | "excited_states": "False", 1758 | "solvent": ["gas_phase"], 1759 | "properties": [ 1760 | "total_energy", 1761 | "dispersion_energy", 1762 | "HOMO/LUMO_energy", 1763 | "HOMO/LUMO_gap", 1764 | "zero_point_energy", 1765 | "adiabatic_ionization_potential", 1766 | "adiabatic_electron_affinity" 1767 | ], 1768 | "doi":"10.1038/s41597-024-02927-8", 1769 | "reference":[ 1770 | "@article{mayo2024compas,\n title={COMPAS-2: a dataset of cata-condensed hetero-polycyclic aromatic systems},\n author={Mayo Yanes, Eduardo and Chakraborty, Sabyasachi and Gershoni-Poranne, Renana},\n journal={Scientific Data},\n volume={11},\n number={1},\n 
pages={97},\n year={2024},\n publisher={Nature Publishing Group UK London}\n}\n" 1771 | ] 1772 | }, 1773 | { 1774 | "name": "COMPAS-3", 1775 | "full_name": "COMputational database of Polycyclic Aromatic Systems-3", 1776 | "description": "Part of the COMPAS project containing peri-condensed polybenzenoid hydrocarbons", 1777 | "methods":{ 1778 | "geometry_optimization":{ 1779 | "GFN2-xTB": "xTB", 1780 | "CAM-B3LYP-D3BJ/def2-SVP": "ORCA 4.2.0" 1781 | }, 1782 | "energy":{ 1783 | "GFN2-xTB": "xTB", 1784 | "CAM-B3LYP-D3BJ/aug-cc-pVDZ": "ORCA 4.2.0" 1785 | } 1786 | }, 1787 | "data_size": { 1788 | "number_of_structures": 39482 1789 | }, 1790 | "data_access": { 1791 | "GitLab": "https://gitlab.com/porannegroup/compas" 1792 | }, 1793 | "chemical_elements": ["H","C"], 1794 | "number_of_heavy_atoms": [16,42,44], 1795 | "initial_source": [], 1796 | "non-equilibrium structures": "False", 1797 | "charges": [-1,0,1], 1798 | "multiplicities": [1,2], 1799 | "excited_states": "False", 1800 | "solvent": ["gas_phase"], 1801 | "properties": [ 1802 | "total_energy", 1803 | "dispersion_energy", 1804 | "HOMO/LUMO_energy", 1805 | "HOMO/LUMO_gap", 1806 | "dipole_moment", 1807 | "zero_point_energy", 1808 | "adiabatic_ionization_potential", 1809 | "adiabatic_electron_affinity" 1810 | ], 1811 | "doi":"10.1039/D4CP01027B", 1812 | "reference":[ 1813 | "@article{wahab2024compas,\n title={COMPAS-3: a dataset of peri-condensed polybenzenoid hydrocarbons},\n author={Wahab, Alexandra and Gershoni-Poranne, Renana},\n journal={Physical Chemistry Chemical Physics},\n volume={26},\n number={21},\n pages={15344--15357},\n year={2024},\n publisher={Royal Society of Chemistry}\n}\n" 1814 | ] 1815 | }, 1816 | { 1817 | "name": "OFF-ON dataset", 1818 | "full_name": "organic fragments from organocatalysts that are non-modular", 1819 | "description": "organic fragments from organocatalysts that are non-modular", 1820 | "methods":{ 1821 | "geometry_optimization":{ 1822 | "DFTB":"DFTB+" 1823 | }, 1824 | "energy":{ 
1825 | "PBE0-D3/def2-SVP":"TeraChem" 1826 | } 1827 | }, 1828 | "data_size": { 1829 | "number_of_structures": 75326 1830 | }, 1831 | "data_access": { 1832 | "checking required":"https://archive.materialscloud.org/record/2023.189" 1833 | }, 1834 | "chemical_elements": ["H","C","N","O","F","S"], 1835 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 1836 | "initial_source": [ 1837 | "OSCAR", 1838 | "CSD", 1839 | "PubChem", 1840 | "NCI Atlas" 1841 | ], 1842 | "non-equilibrium structures": "True", 1843 | "charges": ["checking required"], 1844 | "multiplicities": ["checking required"], 1845 | "excited_states": "False", 1846 | "solvent": ["gas_phase"], 1847 | "temperature": "False", 1848 | "properties": [ 1849 | "total_energy" 1850 | ], 1851 | "doi": "10.1021/acs.jcim.3c01953", 1852 | "reference":[ 1853 | "@article{celerse2024offon,\n author = {Célerse, Frédéric and Wodrich, Matthew D. and Vela, Sergi and Gallarati, Simone and Fabregat, Raimon and Juraskova, Veronika and Corminboeuf, Clémence},\n title = {From Organic Fragments to Photoswitchable Catalysts: The {OFF-ON} Structural Repository for Transferable Kernel-Based Potentials},\n journal = {J. Chem. Inf. 
Model.},\n volume = {64},\n number = {4},\n pages = {1201--1212},\n year = {2024},\n publisher = {ACS Publications}\n}\n" 1854 | ] 1855 | }, 1856 | { 1857 | "name": "revQM9", 1858 | "full_name": "", 1859 | "description": "", 1860 | "methods":{ 1861 | "geometry_optimization":{ 1862 | "B3LYP/6-31G(2df,p)": "Gaussian 09" 1863 | }, 1864 | "energy":{ 1865 | "aPBE0/cc-pVTZ":"PySCF 2.4.0" 1866 | } 1867 | }, 1868 | "data_size": { 1869 | "number_of_structures": 1.3e5 1870 | }, 1871 | "data_access": { 1872 | "checking required":"https://zenodo.org/doi/10.5281/zenodo.10689883" 1873 | }, 1874 | "chemical_elements": ["H","C","N","O","F"], 1875 | "number_of_heavy_atoms": ["checking required","checking required",9], 1876 | "initial_source": [ 1877 | "QM9" 1878 | ], 1879 | "non-equilibrium structures": "False", 1880 | "charges": [0], 1881 | "multiplicities": [1], 1882 | "excited_states": "False", 1883 | "solvent": ["gas_phase"], 1884 | "temperature": "False", 1885 | "properties": [ 1886 | "total_energy", 1887 | "HOMO/LUMO_energy", 1888 | "dipole_moment", 1889 | "atomization_energy" 1890 | ], 1891 | "doi": "10.48550/arXiv.2402.14793", 1892 | "reference":[ 1893 | "@article{khan2024revqm9,\n title={Adaptive hybrid density functionals},\n author={Danish Khan and Alastair James Arthur Price and Maximilian L. Ach and O. 
Anatole von Lilienfeld},\n year={2024},\n journal={arXiv preprint arXiv:2402.14793 [physics.chem-ph]},\n}\n" 1894 | ] 1895 | }, 1896 | { 1897 | "name": "SPICE-2", 1898 | "full_name": "", 1899 | "description": "Extension of SPICE-1 with an additional 20,000 molecules and 2 new elements (boron and silicon)", 1900 | "methods":{ 1901 | "geometry_optimization":{ 1902 | "False":"False" 1903 | }, 1904 | "energy":{ 1905 | "ωB97M-D3(BJ)/def2-TZVPPD":"Psi4" 1906 | } 1907 | }, 1908 | "data_size": { 1909 | "number_of_structures": 2008628 1910 | }, 1911 | "data_access": { 1912 | "checking required":"https://doi.org/10.5281/zenodo.10975225" 1913 | }, 1914 | "chemical_elements": ["H","Li","B","C","N","O","F","Na","Mg","Si","P","S","Cl","K","Ca","Br","I"], 1915 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 1916 | "number_of_atoms": [2,"checking required",110], 1917 | "initial_source": [ 1918 | "PubChem" 1919 | ], 1920 | "non-equilibrium structures": "True", 1921 | "charges": ["checking required"], 1922 | "multiplicities": ["checking required"], 1923 | "excited_states": "False", 1924 | "solvent": ["gas_phase","water"], 1925 | "temperature": "False", 1926 | "properties": [ 1927 | "total_energy", 1928 | "force", 1929 | "dipole_moment", 1930 | "formation_energy", 1931 | "mbis_charges", 1932 | "mbis_dipoles" 1933 | ], 1934 | "doi": "10.1021/acs.jctc.4c00794", 1935 | "reference":[ 1936 | "@article{peter2024spice2,\nauthor = {Eastman, Peter and Pritchard, Benjamin P. and Chodera, John D.
and Markland, Thomas E.},\ntitle = {Nutmeg and SPICE: Models and Data for Biomolecular Machine Learning},\njournal = {Journal of Chemical Theory and Computation},\nvolume = {20},\nnumber = {19},\npages = {8583-8593},\nyear = {2024},\npublisher = {ACS Publications}\n}\n" 1937 | ] 1938 | }, 1939 | { 1940 | "name": "FMO-SCOP", 1941 | "full_name": "", 1942 | "description": "Protein folds from SCOP2 with FMO calculations", 1943 | "methods":{ 1944 | "geometry_optimization":{ 1945 | "Amber10:ETH":"MOE" 1946 | }, 1947 | "energy":{ 1948 | "FMO-MP2/6-31G*":"ABINIT-MP 1.23 BINDS", 1949 | "FMO-MP2/6-31G**":"ABINIT-MP 1.23 BINDS", 1950 | "FMO-MP2/cc-pVDZ":"ABINIT-MP 1.23 BINDS" 1951 | } 1952 | }, 1953 | "data_size": { 1954 | "number_of_structures": 2e5 1955 | }, 1956 | "data_access": { 1957 | "checking required":"https://doi.org/10.6084/m9.figshare.25980112.v2" 1958 | }, 1959 | "chemical_elements": ["checking required"], 1960 | "number_of_heavy_atoms": ["checking required","checking required","checking required"], 1961 | "initial_source": [ 1962 | "SCOP2" 1963 | ], 1964 | "non-equilibrium structures": "False", 1965 | "charges": ["checking required"], 1966 | "multiplicities": ["checking required"], 1967 | "excited_states": "False", 1968 | "solvent": ["gas_phase"], 1969 | "temperature": "False", 1970 | "properties": [ 1971 | "total_energy", 1972 | "atomic_charge", 1973 | "pair_interaction_energy" 1974 | ], 1975 | "doi": "10.1038/s41597-024-03999-2", 1976 | "reference":[ 1977 | "@article{daisuke2024fmoscop,\n title = {Quantum chemical calculation dataset for representative protein folds by the fragment molecular orbital method},\n author = {Takaya, Daisuke and Ohno, Shu and Miyagishi, Toma and Tanaka, Sota and Okuwaki, Koji and Watanabe, Chiduru and Kato, Koichiro and Tian, Yu-Shi and Fukuzawa, Kaori},\n journal = {Scientific Data},\n volume = {11},\n number = {1},\n pages = {1164},\n year = {2024},\n publisher = {Nature Publishing Group UK London}\n}\n" 1978 | ] 1979 | } 1980 | ]
1981 | } -------------------------------------------------------------------------------- /dataset_selection.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tutorial on dataset selection\n", 8 | "\n", 9 | "Below we show how to use the dataset information stored in `dataset_overview.json`.\n", 10 | "\n", 11 | "We take QM9 as an example to show how the information is presented for each dataset in our JSON file. If a value is not stated in the paper and requires a closer check of the dataset itself, `\"checking required\"` is used as a placeholder: " 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "{\n", 21 | "\"name\": # Abbreviation of the dataset name \n", 22 | " \"QM9\", \n", 23 | "\n", 24 | "\"full_name\": # Full name of the dataset\n", 25 | " \"QM9\", \n", 26 | "\n", 27 | "\"description\": # Brief description of the dataset\n", 28 | " \" A collection of molecular structures and properties for 134,000 small organic molecules \",\n", 29 | "\n", 30 | "\"methods\": { # Calculation methods and software used to construct the dataset. \n", 31 | " # Currently only methods for geometry optimization and single \n", 32 | " # point calculations are considered.\n", 33 | " \"geometry_optimization\":{\n", 34 | " \"B3LYP/6-31G(2df,p)\": \"Gaussian 09\" # method : software\n", 35 | " },\n", 36 | " \"energy\":{\n", 37 | " \"B3LYP/6-31G(2df,p)\": \"Gaussian 09\",\n", 38 | " \"G4MP2\": \"Gaussian 09\"\n", 39 | " }\n", 40 | " },\n", 41 | "\n", 42 | "\"data_size\": { # Number of molecules in the dataset. \n", 43 | " # Currently only the number of conformations \n", 44 | " # is considered. 
\n", 45 | " \"number_of_structures\": 133885\n", 46 | " }, \n", 47 | "\n", 48 | "\"data_access\": { # Link(s) to the repository where the dataset is stored. \n", 49 | " \"Figshare\": \"https://doi.org/10.6084/m9.figshare.978904\"\n", 50 | " },\n", 51 | "\n", 52 | "\"chemical_elements\": # The chemical elements the dataset covers \n", 53 | " [\"H\",\"C\",\"N\",\"O\",\"F\"],\n", 54 | "\n", 55 | "\"number_of_heavy_atoms\": [ # The number of non-hydrogen atoms \n", 56 | " \"checking required\", # minimum\n", 57 | " \"checking required\", # mean\n", 58 | " 9 # maximum\n", 59 | " ], \n", 60 | "\n", 61 | "\"initial_source\": [ # The original dataset(s) on which the current dataset is built\n", 62 | " \"GDB-17\"\n", 63 | " ], \n", 64 | "\n", 65 | "\"non-equilibrium structures\": # whether the dataset contains non-equilibrium structures\n", 66 | " \"False\",\n", 67 | "\n", 68 | "\"charges\": [ # charges of the molecules in the dataset\n", 69 | " 0\n", 70 | " ],\n", 71 | "\n", 72 | "\"multiplicities\": [ # multiplicities of the molecules in the dataset\n", 73 | " 1\n", 74 | " ],\n", 75 | "\n", 76 | "\"excited_states\": # whether the dataset contains molecules in excited states\n", 77 | " \"False\",\n", 78 | "\n", 79 | "\"solvent\": [ # the solvents used in the calculations\n", 80 | " \"gas_phase\"\n", 81 | " ],\n", 82 | "\n", 83 | "\"temperature\": # the temperature (in K) used for thermochemical calculations or dynamics\n", 84 | " 298.15,\n", 85 | "\n", 86 | "\"properties\": [ # atomic and molecular properties stored in the dataset\n", 87 | " \"total_energy\",\n", 88 | " \"enthalpy\",\n", 89 | " \"...\"\n", 90 | "],\n", 91 | "\"doi\": \"10.1038/sdata.2014.22\",\n", 92 | "\"reference\": [ # reference in BibTeX format\n", 93 | " \"@article{ramakrishnan2014quantum,\\n title={Quantum chemistry structures and properties of 134 kilo molecules},\\n author={Ramakrishnan, Raghunathan and Dral, Pavlo O and Rupp, Matthias and Von Lilienfeld, O Anatole},\\n journal={Scientific 
data},\\n volume={1},\\n number={1},\\n 
 pages={1--7},\\n year={2014},\\n publisher={Nature Publishing Group}\\n}\\n\\n\"\n", 94 | "]\n", 95 | "}," 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "More entries will be added later on user request, and entries for new datasets can be created easily from the template in `template.json`. \n", 103 | "\n", 104 | "## Filter datasets\n", 105 | "\n", 106 | "Users can write their own scripts to filter the datasets by the properties presented above. We also provide a `filter_dataset` function to help users get started. Three example usages are shown below." 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "import json\n", 116 | "\n", 117 | "with open('dataset_overview.json','r') as d:\n", 118 | " datasets = json.load(d)\n", 119 | "datasets = datasets['dataset_overview'] # all the datasets are stored under 'dataset_overview' as a list\n", 120 | "\n", 121 | "def lower_list(ll): # lowercase every string; return ll unchanged if any item is not a str\n", 122 | " ll_updates = []\n", 123 | " for l in ll:\n", 124 | " if isinstance(l, str):\n", 125 | " ll_updates.append(l.lower())\n", 126 | " else:\n", 127 | " return ll\n", 128 | " return ll_updates\n", 129 | "\n", 130 | "def filter_dataset( # only list and str type values are supported\n", 131 | " datasets, # the datasets to filter\n", 132 | " entry, # the property (key) to filter on\n", 133 | " value # the corresponding value requested by the user\n", 134 | " ): \n", 135 | " datasets_to_select = []\n", 136 | " for dataset in datasets:\n", 137 | " if isinstance(dataset[entry], list):\n", 138 | " if isinstance(value, list): # keep datasets that contain every requested item\n", 139 | " if set(lower_list(value)) <= set(lower_list(dataset[entry])):\n", 140 | " datasets_to_select.append(dataset)\n", 141 | " elif isinstance(value, str): # keep datasets that contain the requested item\n", 142 | " if value.lower() in lower_list(dataset[entry]):\n", 143 | " datasets_to_select.append(dataset)\n", 144 | " else:\n", 145 | " print('Not 
supported type for value')\n", 146 | " elif isinstance(dataset[entry], str):\n", 147 | " if isinstance(value, str) and dataset[entry].lower() == value.lower():\n", 148 | " datasets_to_select.append(dataset)\n", 149 | " else:\n", 150 | " print('filter_dataset only supports list and str values. For other value types, users can write their own filtering scripts.')\n", 151 | " return datasets_to_select\n" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "#### Case 1: Select datasets with PubChem as the initial source " 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "datasets_to_select = filter_dataset(\n", 168 | " datasets=datasets,\n", 169 | " entry='initial_source',\n", 170 | " value='pubchem')\n", 171 | "\n", 172 | "# print information\n", 173 | "for d in datasets_to_select:\n", 174 | " print(f\"Name of the dataset: {d['name']}\")\n", 175 | " print(f\"Description of the dataset: {d['description']}\")" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "#### Case 2: Select datasets with excited states available" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "datasets_to_select = filter_dataset(\n", 192 | " datasets=datasets,\n", 193 | " entry='excited_states',\n", 194 | " value='True')\n", 195 | "\n", 196 | "# print information\n", 197 | "for d in datasets_to_select:\n", 198 | " print(f\"Name of the dataset: {d['name']}\")\n", 199 | " print(f\"Description of the dataset: {d['description']}\")" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "#### Case 3: Select datasets containing HCNOS\n" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | 
"datasets_to_select = filter_dataset(\n", 216 | " datasets=datasets,\n", 217 | " entry='chemical_elements',\n", 218 | " value=['H','C','N','O','S'])\n", 219 | " \n", 220 | "# print information\n", 221 | "for d in datasets_to_select:\n", 222 | " print(f\"Name of the dataset: {d['name']}\")\n", 223 | " print(f\"Description of the dataset: {d['description']}\")" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "## Generate .csv table from the latest .json file" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "For better visualization of the statistics, we provide a `json2csv` function that converts the JSON file into CSV format, which can be easily opened in Excel or other common table visualization tools. We also provide an example in which the table is generated with `pandas`, a Python library for data analysis." 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "import json\n", 247 | "import pandas as pd\n", 248 | "\n", 249 | "def get_value_for_entry(entry, datasets):\n", 250 | " vals = [dataset[entry] for dataset in datasets]\n", 251 | " def dict2str(d):\n", 252 | " ss_list = [str(vv) for vv in d.values()]\n", 253 | " return ','.join(ss_list)\n", 254 | " if isinstance(vals[0], str):\n", 255 | " return vals \n", 256 | " elif isinstance(vals[0], list):\n", 257 | " return [','.join([str(vv) for vv in val]) for val in vals]\n", 258 | " elif isinstance(vals[0], dict):\n", 259 | " if entry == 'methods':\n", 260 | " print('Only the methods used for the energy calculations will be presented')\n", 261 | " vals = [vv['energy'] for vv in vals]\n", 262 | " return [','.join([*d]) for d in vals] # join the method names (dict keys)\n", 263 | "\n", 264 | " else:\n", 265 | " return [dict2str(d) for d in vals]\n", 266 | "\n", 267 | "def json2csv(\n", 268 | " datasets, # list of dataset dicts\n", 269 | " entries, # (list) the properties user would like to 
show in the table\n", 270 | " output_file='dataset_overview.csv' # the name of the output csv file\n", 271 | "):\n", 272 | "\n", 273 | " entries_updated = ['name','description','doi']\n", 274 | " entries_updated += [i for i in entries if i not in entries_updated]\n", 275 | " \n", 276 | " df = pd.DataFrame(columns=entries_updated)\n", 277 | " for entry in entries_updated:\n", 278 | " values = get_value_for_entry(entry, datasets)\n", 279 | " df[entry] = values\n", 280 | "\n", 281 | " df.to_csv(output_file, index=False)\n", 282 | " return df \n" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "with open('dataset_overview.json','r') as d:\n", 292 | " datasets = json.load(d)\n", 293 | "datasets = datasets['dataset_overview'] # all the datasets are stored under 'dataset_overview' as a list\n", 294 | "\n", 295 | "json2csv(\n", 296 | " datasets=datasets[:5],\n", 297 | " entries=['chemical_elements','initial_source'],\n", 298 | " output_file='dataset_overview.csv'\n", 299 | ")" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "## Putting it together: Generate a table for selected datasets" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "Below is an example that selects the datasets containing the elements F and S and generates a table of the requested information." 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "import json\n", 323 | "import pandas as pd\n", 324 | "\n", 325 | "# load all datasets\n", 326 | "with open('dataset_overview.json','r') as d:\n", 327 | " datasets = json.load(d)\n", 328 | "datasets = datasets['dataset_overview'] # all the datasets are stored under 'dataset_overview' as a list\n", 329 | "\n", 330 | "datasets_to_select = filter_dataset(\n", 331 | 
" datasets=datasets,\n", 332 | " entry='chemical_elements',\n", 333 | " value=['F', 'S'])\n", 334 | "\n", 335 | "# generate table:\n", 336 | "json2csv(\n", 337 | " datasets=datasets_to_select,\n", 338 | " entries=['properties'],\n", 339 | " output_file='dataset_overview.csv')" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [] 348 | } 349 | ], 350 | "metadata": { 351 | "kernelspec": { 352 | "display_name": "mlatom", 353 | "language": "python", 354 | "name": "python3" 355 | }, 356 | "language_info": { 357 | "codemirror_mode": { 358 | "name": "ipython", 359 | "version": 3 360 | }, 361 | "file_extension": ".py", 362 | "mimetype": "text/x-python", 363 | "name": "python", 364 | "nbconvert_exporter": "python", 365 | "pygments_lexer": "ipython3", 366 | "version": "3.9.17" 367 | } 368 | }, 369 | "nbformat": 4, 370 | "nbformat_minor": 2 371 | } 372 | --------------------------------------------------------------------------------