├── .gitignore ├── AI4Science101_template.zip ├── Makefile ├── README.md ├── make.bat ├── requirements.txt └── source ├── chapters ├── AI_for_scientific_discovery │ ├── index.rst │ ├── manifesto.md │ ├── mind_steps.md │ ├── opportunities.md │ ├── real_world_challenge.md │ ├── references.md │ ├── roadmap.md │ ├── success_of_AI.md │ └── why_different.md ├── announcement │ └── announcement.md ├── knowledge_base │ ├── biology.md │ ├── chemistry.md │ ├── index.rst │ ├── pharmacy.md │ ├── physics.md │ └── references.md ├── molecular_dynamics │ ├── AI_in_MD.md │ ├── MD_definition.md │ ├── advanced_example.md │ ├── enhanced_sampling.md │ ├── index.rst │ ├── preperation.md │ ├── references.md │ └── simple_example.md └── scientific_discovery_in_the_era_of_AI │ ├── AI_bring.md │ ├── AI_roadmap.md │ ├── artificial_intelligence.md │ ├── how_ai_work.md │ ├── index.rst │ ├── manifesto.md │ ├── mindsets_for_AI.md │ ├── news_AI.md │ └── references.md ├── conf.py └── index.rst /.gitignore: -------------------------------------------------------------------------------- 1 | build/ 2 | -------------------------------------------------------------------------------- /AI4Science101_template.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/deepmodeling/AI4Science101/74e64a4e0c86d86e812a89afd95c1d661828a093/AI4Science101_template.zip -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line, and also 5 | # from the environment for the first two. 6 | SPHINXOPTS ?= 7 | SPHINXBUILD ?= sphinx-build 8 | SOURCEDIR = source 9 | BUILDDIR = build 10 | 11 | # Put it first so that "make" without argument is like "make help". 
12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AI4Science101 2 | 3 | With the rapid development of AI, people have started to apply AI methods to almost every field, from natural language processing to computer vision. Recent breakthroughs have demonstrated the power of AI in solving grand challenges in the scientific community. Notable examples include predicting highly accurate protein structures with AlphaFold2, simulating systems of 100 million particles with DPMD, and imaging the first-ever picture of a black hole. Nevertheless, many researchers in both AI and scientific fields are not able to approach AI for Science research because of many gaps, from limited domain knowledge to misunderstanding of AI's capabilities. In addition, the educational materials for AI for Science are scattered and poorly organized. We announce this initiative (a series of documents) to bring people who are interested in AI for Science to the forefront of the field with knowledge collected at different levels, from motivational overviews of the field and lecture-style tutorials on specific topics to a knowledge base of common terminology. 4 | 5 | In this first post, we would like to introduce people from both AI and scientific fields to this emerging, fast-growing and impactful field, AI for Science, from the views of both AI and scientific researchers: scientific discovery in the era of AI and AI for scientific discovery. 
We have also prepared our first lecture-style tutorial, focusing on molecular dynamics, one of the most fundamental tools in computational chemistry, along with the first release of our knowledge base covering basic concepts from physics, chemistry, biology, and pharmacy. 6 | 7 | * [Scientific Discovery in the era of AI](https://ai4science101.deepmodeling.com/en/latest/chapters/scientific_discovery_in_the_era_of_AI/index.html) - AI for Science from the view of AI 8 | * [AI for Scientific Discovery](https://ai4science101.deepmodeling.com/en/latest/chapters/AI_for_scientific_discovery/index.html) - AI for Science from the view of Science 9 | * [Molecular Dynamics](https://ai4science101.deepmodeling.com/en/latest/chapters/molecular_dynamics/index.html) - Lecture-style tutorial 10 | * [Knowledge Base](https://ai4science101.deepmodeling.com/en/latest/chapters/knowledge_base/index.html) - Basic concepts 11 | 12 | ## Acknowledgement 13 | The project is a part of the DeepModeling community, an open-source community that aims to define the future of scientific computing together. 14 | This effort is primarily led by Yuanqi Du (Cornell), Yingze Wang (UCB), Yanze Wang (PKU), Yibo Wang (DP) and contributors Jiayue Wang (DP), Jiameng Huang (PKU), Arian Jamasb (Cambridge), Jihao Long (Princeton), Guiyu Cao (PKU), Zhenfeng Deng (PKU), Xi Chen (DP), Siyuan Zhou (BFSU), Yinkai Wang (Tufts). We would also like to express our gratitude to Weinan E (Princeton & PKU), Linfeng Zhang (DP), Ping Tuo (DP), Zheng Cheng (AISI), Han Wen (DP), Dongdong Wang (DP), Xinming Tu (UW), Nilay Shah (UCLA), Hannes Stark (MIT), Chaitanya Joshi (Cambridge), Ryan-Rhys Griffiths (Cambridge), Sang Truong (Stanford), Junhan Chang (PKU), Chenbing Wang (PKU), Ziming Liu (MIT), Weiliang Luo (PKU), Zhen Wang (DP), Yucheng Zhang (UTokyo), Ferry Hooft (UvA), Ziyao Li (PKU) for providing expertise, feedback and support. 
15 | 16 | ## Feedback/comment or Join us 17 | Please reach out to us at [ai4science101@deepmodeling.com](mailto:ai4science101@deepmodeling.com) or join our [Slack channel](https://join.slack.com/t/aiforscience/shared_invite/zt-1bdof1jmf-YtIjkUVA5DquXguEiOXGPQ) if you have any feedback or comments. As this is a community effort, we welcome anyone interested to join us. Any kind of volunteer work is welcome, including writing tutorials, drawing illustrations, etc. Do not hesitate to let us know! 18 | 19 | ## Contribution Guidelines 20 | 21 | We are looking for contributors/experts for specific areas related to AI for Science. The expected contributions include a three-level write-up: a one-paragraph introduction and learning material in section 2 or 3 (depending on whether the topic is in AI or Science), common terminology and short explanations in section 5, and a specialized chapter similar to section 4. Each specialized chapter is expected to include (1) target audience and motivations, (2) a brief review of literature/history, (3) current advances and future promises, (4) takeaways, and (5) a running sample/demo (optional). You can download our LaTeX template [here](https://github.com/deepmodeling/AI4Science101/blob/devel/AI4Science101_template.zip) and find a detailed GitHub PR guideline [here](https://github.com/Chengqian-Zhang/AI4Science101/blob/main/contribution_guideline.md). 22 | 23 | Notice: we focus on both the **breadth** and **depth** of each topic/chapter. Specifically, **breadth** refers to drawing a whole picture of the topic, and **depth** refers to the foundations of why AI methods work or why AI changes the game in the field. 
24 | -------------------------------------------------------------------------------- /make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=source 11 | set BUILDDIR=build 12 | 13 | if "%1" == "" goto help 14 | 15 | %SPHINXBUILD% >NUL 2>NUL 16 | if errorlevel 9009 ( 17 | echo. 18 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx 19 | echo.installed, then set the SPHINXBUILD environment variable to point 20 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 21 | echo.may add the Sphinx directory to PATH. 22 | echo. 23 | echo.If you don't have Sphinx installed, grab it from 24 | echo.https://www.sphinx-doc.org/ 25 | exit /b 1 26 | ) 27 | 28 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 29 | goto end 30 | 31 | :help 32 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 33 | 34 | :end 35 | popd 36 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | myst_parser 2 | deepmodeling_sphinx 3 | sphinx_rtd_theme 4 | -------------------------------------------------------------------------------- /source/chapters/AI_for_scientific_discovery/index.rst: -------------------------------------------------------------------------------- 1 | AI for Scientific Discovery 2 | ============================= 3 | 4 | .. 
toctree:: 5 | :maxdepth: 2 6 | :caption: Contents: 7 | 8 | manifesto.md 9 | success_of_AI.md 10 | why_different.md 11 | real_world_challenge.md 12 | opportunities.md 13 | mind_steps.md 14 | roadmap.md 15 | references.md 16 | 17 | -------------------------------------------------------------------------------- /source/chapters/AI_for_scientific_discovery/manifesto.md: -------------------------------------------------------------------------------- 1 | ## Manifesto 2 | Ever since the time of Isaac Newton, there have been two different paradigms for scientific research: the **Keplerian** paradigm and the **Newtonian** paradigm. 3 | 4 |
5 |
6 | 7 |
8 |
Figure 1: Portrait of Johannes Kepler and Isaac Newton, from Wikipedia
9 |
10 | 11 | The Keplerian paradigm, often referred to as the "data-driven" approach, aims to extract new physical rules or trends through data analysis and to use these rules to solve practical problems. The discovery of Kepler's laws of planetary motion is the canonical example of this paradigm. Nowadays, many successful examples in bioinformatics and cheminformatics have demonstrated the effectiveness of this paradigm in areas ranging from multi-scale modeling and protein structure prediction to drug discovery and disease treatment. 12 | 13 | The Newtonian paradigm is based on working from first principles, with the aim of figuring out the fundamental physical rules that govern the world as we know it. Based on these principles, scientists are able to explain most experimentally observed phenomena. One of the most successful theories is quantum mechanics, because it provides almost all the laws needed for much of engineering and the natural sciences. However, as pointed out by Dirac, "the exact application of these laws leads to equations much too complicated to be solved". The central difficulty is called "the curse of dimensionality", i.e., the problems we encounter are too high-dimensional to be solved efficiently. For a long time, natural scientists have only had limited ability to handle these equations, with at most thousands of variables. 14 | 15 | **
BUT things are going to change!
** 16 | 17 | Machine learning techniques, especially deep learning (or AI more generally), have emerged as effective tools for approximating high-dimensional functions, as illustrated by their unprecedented success in computer vision (CV) and natural language processing (NLP). In the Newtonian paradigm, AI methods have been applied to incorporate physical laws in order to solve problems and simulate systems far more complicated than toy examples. In the Keplerian paradigm, AI can be directly applied to analyze and learn from data in an end-to-end manner. With the promise of AI in solving real and challenging scientific problems, **"AI for Science" (AI4Science)** has become established as a new term and has spread through both AI and scientific research communities. In the past few years, successful applications of AI methods have opened up a wide research avenue for both communities, from AlphaFold2 [1], which solves the 50-year-old protein structure prediction puzzle, and DeePMD [2], which extends *ab initio* simulation to unprecedentedly large scales, to the control of tokamak plasmas in nuclear fusion reactors with AI agents [3]. A new paradigm of scientific research empowered by AI has formed, and the aforementioned examples have paved the way for it. However, as scientific discovery has a very broad scope spanning many different disciplines, many grand challenges that are critical to our lives still remain unsolved. Despite the early success, we have to acknowledge that AI for Science is still nascent and requires joint efforts from both AI and scientific communities. 18 | 19 |
20 |
21 |
Figure 2: Paul A. M. Dirac (1902 – 1984), from Wikipedia
22 |
23 | 24 | We are living in an era with the opportunity and means to tackle grand challenges in scientific discovery. To facilitate this emergent field and bridge gaps between AI and scientific communities, this blog aims to equip researchers in the AI community with some basic scientific knowledge and an overview of new challenges in scientific discovery, which may appear significantly different from common AI application areas such as computer vision and speech recognition. 25 | -------------------------------------------------------------------------------- /source/chapters/AI_for_scientific_discovery/mind_steps.md: -------------------------------------------------------------------------------- 1 | ## Mind your steps 2 | - **Be careful with data.** Datasets in scientific problems have many pitfalls: they may be highly skewed, with 99\% positive cases and only 1\% negative cases, because researchers tend not to publish their negative results; they may be very small, because much data is hard to generate and collect; they may be very dirty, for example, because some experimental results are noisy and unreliable. 3 | - **Understand the problems.** Scientific concepts are not as easy to understand as classifying cats and dogs in computer vision. Take a humble and respectful attitude toward scientific problems and learn more of the scientific background (physics, chemistry, biology, etc.) of the problem of interest. Understanding the reason for solving the problem and the practical application of the research is the key to success. 4 | - **Be patient.** "Rome was not built in a day." Scientific problems are often challenging and take years to solve. But don't be afraid if you miss any of the deadlines for NeurIPS/ICML/ICLR; good work will eventually be recognized, published, and become impactful. 5 | - **Enjoy interdisciplinary collaborations.** Good collaborations between AI and Science communities are key to making impactful work both in terms of real-world challenges and methodologies. 
Be open-minded while talking to people from the other community. 6 | -------------------------------------------------------------------------------- /source/chapters/AI_for_scientific_discovery/opportunities.md: -------------------------------------------------------------------------------- 1 | ## See the opportunities - Why AI for Science? 2 | - **Challenging scenarios.** For AI algorithms, scientific applications are usually much more challenging than common applications in images, text, or audio, where the "rules" are defined by humans. Science is about discovering and understanding nature, whose rules are not set by humans. 3 | - **Real-world impacts.** Scientific discovery works for the good of human beings. For example, boosting drug design will reduce the price of drugs and save more people's lives. 4 | - **New discovery.** Curiosity is part of human nature and motivates the development of science. Just as Kepler derived the laws of planetary motion hundreds of years ago, we are in an era in which AI may help us to discover new science systematically. -------------------------------------------------------------------------------- /source/chapters/AI_for_scientific_discovery/real_world_challenge.md: -------------------------------------------------------------------------------- 1 | ## Real-world Challenges in AI for Science 2 | ### Next Steps in Protein Structure Prediction 3 | A variety of AI-based protein structure prediction models, represented by AlphaFold2 [1] and others [13,14], have successfully addressed the protein structure prediction problem, but this is only the first step towards understanding protein structures and functions. Many challenges remain to be solved: 4 | 5 |
6 |
7 |
Figure 5: An illustration of protein-ligand binding
8 |
9 | 10 | **Protein Multimers** 11 | Current predictive models of protein structure can only provide reliable results for monomers (single peptide chains). But in reality, peptide chains can interact with each other and form complexes (multimers). In many scenarios, only by doing so can the proteins perform their biological functions correctly. In structural biology, this arrangement is defined as quaternary protein structure. 12 | 13 | 14 | **Protein-ligand Complex** 15 | Protein-ligand interactions and induced-fit models are key to understanding a drug's potency. Small organic molecules often interact with a certain area (referred to as a pocket) in target proteins and may cause the protein structure to change significantly. Traditional computational methods, such as molecular docking, model the protein-ligand binding free energies with physics-based scoring functions, which are parametrized in an empirical and error-prone way. AI models will be a breakthrough in this area if they can accurately predict the ligand binding pose and/or the protein structural changes during the ligand binding process. 16 | 17 | **Protein Conformation Ensembles** 18 | Most of the recent successful models are based on multiple sequence alignment (MSA), which can be viewed as an augmented version of "homology modeling". The scientific logic behind this is that proteins follow the rules of evolution, so any naturally occurring protein is likely to share some structural similarities with proteins in other organisms, some of which have been studied before. However, for the many proteins that are designed de novo (manually designed) or lack an MSA, current models fail to provide reliable predictions. Thus one promising direction of AI-based protein structure prediction may be the development of MSA-free models. 
19 | 20 | ### Quantum Mechanics 21 | One of the central goals in quantum mechanics is to find accurate solutions (wave functions and energies) to the Schrödinger equation for real systems. This is hindered by the many-body problem: the dimensionality of the equation is $3N$, where $N$ is the number of electrons, and a real system can easily have hundreds of electrons, so the many-body Schrödinger equation cannot be solved exactly. As a compromise, researchers have come up with many approximation methods, such as DFT (density functional theory), which make the computational cost acceptable by sacrificing some accuracy. These methods have achieved great success in many areas such as materials science, but in cases where the DFT results are not accurate enough, researchers have to rely on more accurate but more time-consuming methods (such as CCSD(T), with a computational complexity of $O\left(N^7\right)$). Recent methods such as DeePKS [15] and DM21 [16] have been proposed to tackle this issue with AI models, but they are still far from perfect. One particular challenge is how to represent, with a neural network, a function that is antisymmetric under permutation (a property that wave functions of electrons have). 22 | 23 | 
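To make the antisymmetry requirement concrete, here is a minimal sketch of the classical way to build such a function: a Slater determinant over single-particle orbitals. It is not a neural-network ansatz and is not taken from DeePKS or DM21. The one-dimensional orbitals `orbital(k, x)` and the helper names are hypothetical choices made only for illustration.

```python
import math
from itertools import permutations


def permutation_sign(perm):
    """Return +1 for an even permutation and -1 for an odd one."""
    perm = list(perm)
    sign = 1
    for i in range(len(perm)):
        # Put the correct value at position i, one swap at a time.
        while perm[i] != i:
            j = perm[i]
            perm[i], perm[j] = perm[j], perm[i]
            sign = -sign
    return sign


def orbital(k, x):
    """Hypothetical 1D orbital used for illustration: x^k * exp(-x^2 / 2)."""
    return (x ** k) * math.exp(-0.5 * x * x)


def slater_wavefunction(positions):
    """Antisymmetric wave function built as a determinant (Leibniz formula).

    Entry (i, k) of the underlying matrix is orbital k evaluated at the
    position of electron i. The cost grows as N!, so this is only a toy.
    """
    n = len(positions)
    total = 0.0
    for perm in permutations(range(n)):
        term = float(permutation_sign(perm))
        for electron, k in enumerate(perm):
            term *= orbital(k, positions[electron])
        total += term
    return total / math.sqrt(math.factorial(n))


original = [0.3, -1.1, 0.8]
swapped = [-1.1, 0.3, 0.8]       # electrons 0 and 1 exchanged
coincident = [0.5, 0.5, 1.0]     # two electrons at the same position

psi_original = slater_wavefunction(original)
psi_swapped = slater_wavefunction(swapped)
psi_coincident = slater_wavefunction(coincident)

print("psi(original)   =", psi_original)
print("psi(swapped)    =", psi_swapped)
print("psi(coincident) =", psi_coincident)
```

Exchanging two electrons flips the sign of the value, and placing two electrons at the same position makes it vanish. A neural-network wave function has to reproduce exactly this behavior, which is why it cannot be an arbitrary unconstrained network.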
24 |
25 |
Figure 6: Illustration of coarse-grained models
26 |
27 | 28 | ### Molecular Dynamics 29 | Molecular dynamics (MD) is a computer simulation method for analyzing the physical movements of atoms and molecules. The atoms and molecules are allowed to interact for a fixed period of time, providing a view of the dynamic "evolution" of the system. The trajectory can be considered as a sample from the Boltzmann distribution of a given system and temperature. Thus, many thermodynamic properties such as density and free energy can be calculated by MD. 30 | 31 | **General neural-network-based force field** 32 | Although deep learning methods have already shown their capabilities in accelerating AIMD, a neural network potential that generalizes to different systems and simulation settings would be of high practical value. This could be achieved by pre-training, which avoids repeated work: users would no longer need to build a model from scratch, but could instead fine-tune a pre-trained model for a specific system. For example, a model describing arbitrary organic molecules at a very accurate quantum mechanics level would be useful in drug design, and a model describing any components of alloys/materials would be valuable in materials science. In addition, the demand for higher transferability requires more generalizable representations of atomic configurations, which in turn calls for architectural improvements. 33 | 34 | **Coarse-grained models** 35 | Simulation of extremely large and complicated systems, such as a whole virus, needs coarse-grained force fields that treat several atoms as one "bead". The interactions between these beads are then expected to reflect certain properties of interest, such as free energy or conformation distribution. However, it is nontrivial to find optimal forms and parameters to describe such interactions, and currently there are no general protocols comparable to those for empirical atomistic force fields. 
AI models may be an effective tool here, just as they are between DFT/AIMD and classical MD, but more research needs to be conducted to answer questions such as what targets to fit and how to generate training data efficiently. 36 | 37 | **Enhanced sampling** 38 | Enhanced sampling helps overcome free energy barriers in a molecular dynamics simulation. If the free energy landscape of a given system is not smooth, the simulation will get stuck in one local minimum and ergodicity will not be satisfied. This phenomenon manifests as inadequate sampling over the whole landscape, especially over transition states or other local minima, and occurs frequently in simulations of biological systems. Computational chemists have employed bias-potential-based techniques (such as metadynamics [17] and umbrella sampling [18]) to enhance sampling efficiency. But these methods require well-defined collective variables (CVs) and fail to handle situations where the number of CVs is large. The key challenge is how to learn an accurate representation of the free energy surface (FES) with high-dimensional CVs. AI models have recently been introduced, e.g., NN-VES [18], Reinforced Dynamics [19], and NN-based CV selections [20,21]. The main challenges lie in building better models with generalizability and more effective workflows that take training data generation into consideration [22]. 39 | 40 | ### Partial differential equations 41 | High dimensional partial differential equations (PDEs) arise in many scientific problems. Notable examples include high dimensional nonlinear Black-Scholes equations in finance, many-electron Schrödinger equations in quantum mechanics, and high dimensional Hamilton-Jacobi-Bellman equations in control theory. However, traditional numerical algorithms like finite difference or finite element methods suffer from the curse of dimensionality and are unable to deal with PDEs beyond 10 dimensions. 
The practical success of deep-learning-based PDE solvers such as physics-informed neural networks and the deep BSDE method shows the ability of deep neural networks to efficiently approximate the solutions of high dimensional PDEs. Hence, once we can reformulate the PDE as a variational problem, deep learning techniques can be applied to the variational problem and the original PDE can be solved. Successful examples in this direction include the deep Ritz method [23], 42 | the deep BSDE method [1], 43 | and physics-informed neural networks [24]. 44 | 45 | - **Variational problem:** Find the maxima or minima of a functional, which maps functions to scalars, over a given domain. 46 | 47 | - **Finite difference method:** A class of numerical algorithms for solving differential equations. It approximates the derivatives or partial derivatives by finite differences and solves the resulting linear or nonlinear systems. 48 | 49 | - **Finite element method:** A class of numerical algorithms for solving differential equations. It converts the differential equations to a variational problem, uses a finite-dimensional linear space to approximate the domain of the variational problem, and solves the variational problem over the finite-dimensional linear space. 50 | 51 | ### Control theory 52 | Control algorithms are widely used in engineering and industry. They aim to govern the application of system inputs so as to drive a dynamical system to satisfy specific conditions. Since the time of Bellman [25] 53 | , a long-standing problem in control theory has been to solve high dimensional closed-loop control problems, which aim to find the policy function: the input as a function of the state. Indeed, the term "curse of dimensionality" was originally coined by Bellman in order to highlight these difficulties. 
The practical success of deep learning shows that deep neural networks can approximate high dimensional functions, raising the hope of solving high dimensional closed-loop control problems. Although this field is still immature and faces many challenges such as the stability and robustness of the policy function, pioneering works [24,26] show the potential of this field. Another related field is reinforcement learning. Roughly speaking, control algorithms and reinforcement learning algorithms address the same problems. However, in contrast to control algorithms, which make heavy use of the underlying models, reinforcement learning algorithms make minimal use of the model. Comparing and combining control algorithms and reinforcement learning algorithms are interesting topics and are helpful if one wants to deal with complex practical problems. 54 | 55 | - **Reinforcement learning:** Reinforcement learning concerns how an agent takes actions to maximize the long-term reward when faced with an unknown environment. One feature of reinforcement learning algorithms is that they do not require the exact form of the underlying model. 56 | 57 | 
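As a small illustration of what a closed-loop policy function looks like, the sketch below solves a one-dimensional linear-quadratic regulator problem, a classical model-based control problem. It is not one of the deep-learning methods cited above. The system `x_next = a * x + b * u`, the cost weights `q` and `r`, and all function names are hypothetical choices for illustration. Note how the solver uses the model parameters `a` and `b` directly, which is the "heavy use of the underlying model" that distinguishes control algorithms from reinforcement learning.

```python
def solve_scalar_lqr(a, b, q, r, iterations=500):
    """Iterate the scalar discrete-time Riccati equation and return the gain.

    Model (assumed known): x_next = a * x + b * u
    Cost to minimize:      sum over time of q * x**2 + r * u**2
    """
    p = q
    for _ in range(iterations):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)


def policy(state, gain):
    """Closed-loop policy: the input is a function of the current state."""
    return -gain * state


def simulate(a, b, gain, initial_state, steps):
    """Roll the system forward while applying the policy at every step."""
    state = initial_state
    trajectory = [state]
    for _ in range(steps):
        control = policy(state, gain)
        state = a * state + b * control
        trajectory.append(state)
    return trajectory


# The open-loop system is unstable because |A| > 1.
A, B, Q, R = 1.2, 1.0, 1.0, 1.0
GAIN = solve_scalar_lqr(A, B, Q, R)
TRAJECTORY = simulate(A, B, GAIN, initial_state=5.0, steps=20)

print("feedback gain:", GAIN)
print("closed-loop multiplier:", A - B * GAIN)
print("final state:", TRAJECTORY[-1])
```

In one dimension the policy is a single number, the gain. In high-dimensional nonlinear problems the policy is a function over the whole state space, and that is the object a deep neural network would be asked to approximate.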
58 |
59 |
60 |
Figure 7: Magnitude of vorticity in compressible turbulent mixing layer (left), and hypersonic reentry vehicle in rarefied regime (right).
61 |
62 | 63 | ### Fluid Mechanics 64 | Fluid mechanics studies systems with fluids (liquids, gases, and plasmas) at rest and in motion [27,28,29]. Many scientific and engineering disciplines involve fluid mechanics (as shown in Figure 7), including astrophysics, oceanography, meteorology, aerospace engineering, the chip industry, and physics-based animation. Overall, fluid flows can be roughly divided into inviscid vs. viscous flows, laminar vs. turbulent flows, incompressible vs. compressible flows, continuum vs. rarefied flows, single-phase vs. multiphase flows, Newtonian vs. non-Newtonian flows, etc. 65 | 66 | Mathematical analysis, experimental studies, and numerical simulations are the three major approaches to exploring fluid mechanics. Fundamentally, a fluid system is assumed to be governed by mathematical equations expressing the conservation of mass, momentum, and energy. At different physical modeling scales, the governing equations take different forms [30]: Newtonian dynamics, the Boltzmann equation, the Euler or Navier-Stokes equations (NSE), and coarse-grained turbulence models. In this hierarchy of governing equations, the hyperbolic Euler equations for inviscid flows are usually used to validate the accuracy, efficiency, and robustness of numerical schemes. Additionally, the NSE is widely used in continuum viscous fluid mechanics, while the Boltzmann equation works well in rarefied gas dynamics. With the rapid growth of high-performance computing, numerical simulation, known as computational fluid dynamics (CFD), has not only gradually become an indispensable tool for validating key mathematical conclusions and experimental observations in fluid dynamics, but also provides richer and more practical flow information (macroscopic velocities, pressure and temperature distributions, drag and lift forces, heat load, noise level) for engineering applications. 
With the aid of AI methods, research on numerical and experimental fluid mechanics may be improved in several directions: 67 | 68 | - **Design data-driven turbulence models**, such as models of high-Reynolds-number wall-bounded turbulent flows and complex separated turbulent flows [31] (e.g., for the simulation and design of advanced aircraft). 69 | - **Conduct data assimilation in flow fields**, which combines sparse measured data with numerical solutions to provide more complete and accurate flow fields (e.g., for the prediction of ocean circulation, weather forecasting, and city environment simulations). 70 | - **Refresh multiphase and multiscale fluid models**, which modifies the ad-hoc models of turbulent combustion, multiphase flows, and rarefied gas dynamics (e.g., efficient moment closure models for simulating rarefied flows). 71 | -------------------------------------------------------------------------------- /source/chapters/AI_for_scientific_discovery/references.md: -------------------------------------------------------------------------------- 1 | ## References 2 | 3 | [1] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andrew J Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, August 2021. 4 | 5 | [2] Linfeng Zhang, Jiequn Han, Han Wang, Roberto Car, and Weinan E. Deep potential molecular dynamics: A scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett., 120(14), April 2018. 
6 | 7 | [3] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de las Casas, Craig Donner, Leslie Fritz, Cristian Galperti, Andrea Huber, James Keeling, Maria Tsimpoukelli, Jackie Kay, Antoine Merle, Jean-Marc Moret, Seb Noury, Federico Pesamosca, David Pfau, Olivier Sauter, Cristian Sommariva, Stefano Coda, Basil Duval, Ambrogio Fasoli, Pushmeet Kohli, Koray Kavukcuoglu, Demis Hassabis, and Martin Riedmiller. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414–419, 8 | February 2022. 9 | 10 | [4] CASP competitions 2022. https://predictioncenter.org/ 11 | 12 | [5] Justin S Smith, Olexandr Isayev, and Adrian E Roitberg. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science, 8(4):3192–3203, 2017. 13 | 14 | [6] Linfeng Zhang, Jiequn Han, Han Wang, Wissam Saidi, Roberto Car, and Weinan E. End-to-end symmetry preserving inter-atomic potential energy model for finite and extended systems. Advances in Neural Information Processing Systems, 31, 2018. 15 | 16 | [7] Oliver T Unke and Markus Meuwly. PhysNet: A neural network for predicting energies, forces, dipole moments, and partial charges. Journal of Chemical Theory and Computation, 15(6):3678–3693, 2019. 17 | 18 | [8] Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E. DP-GEN: A concurrent learning platform for the generation of reliable deep learning based potential energy models. Comput. Phys. Commun., 253(107206):107206, August 2020. 19 | 20 | [9] Tongqi Wen, Linfeng Zhang, Han Wang, E Weinan, and David J Srolovitz. Deep potentials for materials science. Materials Futures, 2022. 21 | 22 | [10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. 
In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009. 23 | 24 | [11] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 25 | 26 | [12] Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000. 27 | 28 | [13] Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, 2021. 29 | 30 | [14] Jian Peng and Jinbo Xu. RaptorX: exploiting structure information for protein alignment by statistical inference. Proteins: Structure, Function, and Bioinformatics, 79(S10):161–171, 2011. 31 | 32 | [15] Yixiao Chen, Linfeng Zhang, Han Wang, and Weinan E. DeePKS: A comprehensive data-driven approach toward chemically accurate density functional theory. Journal of Chemical Theory and Computation, 17(1):170–181, 2021. PMID: 33296197. 33 | 34 | [16] James Kirkpatrick, Brendan McMorrow, David H. P. Turban, Alexander L. Gaunt, James S. Spencer, Alexander G. D. G. Matthews, Annette Obika, Louis Thiry, Meire Fortunato, David Pfau, Lara Román Castellanos, Stig Petersen, Alexander W. R. Nelson, Pushmeet Kohli, Paula Mori-Sánchez, Demis Hassabis, and Aron J. Cohen. Pushing the frontiers of density functionals by solving the fractional electron problem. Science, 374(6573):1385–1389, 2021. 35 | 36 | [17] Alessandro Laio and Michele Parrinello. Escaping free-energy minima. Proceedings of the National Academy of Sciences, 99(20):12562–12566, 2002. 37 | 38 | [18] Glenn M Torrie and John P Valleau. Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics, 23(2):187–199, 1977.
39 | 40 | [19] Luigi Bonati, Yue-Yu Zhang, and Michele Parrinello. Neural networks-based variationally enhanced sampling. Proceedings of the National Academy of Sciences, 116:201907975, 08 2019. 41 | 42 | [20] Linfeng Zhang, Han Wang, and Weinan E. Reinforced dynamics for enhanced sampling in large atomic and molecular systems. The Journal of Chemical Physics, 148(12):124113, 2018. 43 | 44 | [21] Dongdong Wang, Yanze Wang, Junhan Chang, Linfeng Zhang, Han Wang, et al. Efficient sampling of high-dimensional free energy landscapes using adaptive reinforced dynamics. Nature Computational Science, 2(1):20–29, 2022. 45 | 46 | [22] Luigi Bonati, Valerio Rizzi, and Michele Parrinello. Data-driven collective variables for enhanced sampling. The Journal of Physical Chemistry Letters, 11:2998–3004, 04 2020. 47 | 48 | [23] Andrew W Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W R Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T Jones, David Silver, Koray Kavukcuoglu, and Demis Hassabis. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706–710, January 2020. 49 | 50 | [24] Richard Evans, Michael O’Neill, Alexander Pritzel, Natasha Antropova, Andrew Senior, Tim Green, Augustin Žídek, Russ Bates, Sam Blackwell, Jason Yim, Olaf Ronneberger, Sebastian Bodenstein, Michal Zielinski, Alex Bridgland, Anna Potapenko, Andrew Cowie, Kathryn Tunyasuvunakool, Rishub Jain, Ellen 51 | Clancy, Pushmeet Kohli, John Jumper, and Demis Hassabis. Protein complex prediction with AlphaFold-Multimer. October 2021. 52 | 53 | [25] Weinan E. The dawning of a new era in applied mathematics. Notices of the American Mathematical Society, volume 68, 2021. 54 | 55 | [26] Linfeng Zhang, Jiequn Han, Han Wang, Wissam A Saidi, Roberto Car, and Weinan E.
End-to-end symmetry preserving inter-atomic potential energy model for finite and extended systems. May 2018. 56 | 57 | [27] Pijush K Kundu, Ira M Cohen, and Howard H Hu. Fluid Mechanics. Academic Press Inc. (London), London, England, 3 edition, December 2004. 58 | 59 | [28] Frank M White. Viscous fluid flow (int’l ed). McGraw-Hill Professional, New York, NY, 3 edition, April 2005. 60 | 61 | [29] David J Acheson. Elementary fluid dynamics. Clarendon Press, 2009. 62 | 63 | [30] Kun Xu. Direct modeling for computational fluid dynamics: Construction and application of unified gas-kinetic schemes. Advances In Computational Fluid Dynamics. World Scientific Publishing, Singapore, Singapore, March 2015. 64 | 65 | [31] Karthik Duraisamy, Gianluca Iaccarino, and Heng Xiao. Turbulence modeling in the age of data. March 2018. 66 | -------------------------------------------------------------------------------- /source/chapters/AI_for_scientific_discovery/roadmap.md: -------------------------------------------------------------------------------- 1 | ## A Roadmap of Basic Scientific Knowledge 2 | 3 | **Classical Mechanics**: 4 | - Kibble, Tom, and Frank H. Berkshire. Classical Mechanics. World Scientific Publishing Company, 2004. 5 | 6 | **Statistical Mechanics**: 7 | - Tuckerman, Mark E. Statistical Mechanics: Theory and Molecular Simulation. Oxford University Press, 2010. 8 | - Pathria, R. K., and Paul D. Beale. Statistical Mechanics. Elsevier, 2011. 9 | 10 | **Quantum Mechanics**: 11 | - Szabo, A., and N. S. Ostlund. Modern Quantum Chemistry. Dover, New York, 1996. 12 | - Piela, Lucjan. Ideas of Quantum Chemistry, 2nd Ed. Elsevier, 2014. 13 | - Sholl, David S., and Janice A. Steckel. Density Functional Theory: A Practical Introduction. John Wiley & Sons, 2009. 14 | 15 | **Solid State Physics**: 16 | - Kittel, Charles.
Introduction to Solid State Physics. 2004. 17 | 18 | **Multi-scale Modeling**: 19 | - E, Weinan. Principles of Multiscale Modeling. Cambridge University Press, 2011. 20 | 21 | **Control Theory**: 22 | - Evans, Lawrence C. An Introduction to Mathematical Optimal Control Theory, version 0.2. Lecture notes available at http://math.berkeley.edu/~evans/control.course.pdf, 1983. 23 | 24 | **Partial Differential Equations**: 25 | - Evans, Lawrence C. Partial Differential Equations. Vol. 19. American Mathematical Society, 2010. 26 | 27 | **Fluid Dynamics**: 28 | - Kundu, Pijush K., Ira M. Cohen, and David R. Dowling. Fluid Mechanics. Academic Press, 2015. -------------------------------------------------------------------------------- /source/chapters/AI_for_scientific_discovery/success_of_AI.md: -------------------------------------------------------------------------------- 1 | ## Success of AI in Scientific Discovery 2 | ### Protein Structure Prediction 3 | 4 | When it comes to AI for Science, arguably the most famous and successful example of AI-advanced scientific discovery is AlphaFold2, which addresses the problem of accurately predicting protein 3D structures from their sequences, one of the "holy grail" problems in structural biology. The structures of proteins are essential to their biological functions, and accurate 3D modeling at the atomistic level is significant for a variety of fields, from drug discovery to synthetic biology. However, resolving structures experimentally is highly expensive and time-consuming, so computational methods to predict accurate protein structures have long been studied, as represented by the series of CASP competitions [4].
Anfinsen's hypothesis, which states that for most proteins the native structure in a standard physiological environment is determined solely by the protein's amino acid sequence, grounds the study of computational methods for the protein structure prediction problem. Traditionally, structural biologists use "homology modeling" to make such predictions. These methods first find multiple proteins whose sequences are similar to the query and whose structures have already been resolved experimentally; these structures are then assembled to produce the predicted result. The accuracy of homology modeling depends heavily on sequence identity and sometimes fails to reach a satisfactory level. Deep learning methods, however, have a stronger ability to discover correlations between sequences and structures. With carefully designed attention-based neural networks and multiple-sequence-alignment (MSA) information, in 2020 the AlphaFold2 (AF2) model achieved an astonishing median backbone RMSD of 0.96 angstroms on the test cases in the prestigious CASP14 competition [1]. Such accuracy is comparable to experimental error, and AF2 was considered a breakthrough in solving this 50-year-old puzzle in structural biology. 5 | 6 | 7 |
8 | 9 |
10 |
Figure 3: An illustration of AI predicted protein structure
11 |
12 | 13 | ### Multi-scale Modeling 14 | As mentioned in the manifesto, the principles of our physical world are almost completely known, but the mathematical equations are too complicated to be solved accurately. Therefore, for problems at different time and space scales, scientists have to develop different computational methods with the necessary approximations to reduce the computational complexity, which is called multi-scale modeling in computational science. In this area, *ab initio* molecular dynamics (AIMD) and classical molecular dynamics (classical MD) are two widely used techniques. However, there has been a longstanding trade-off between the accuracy and efficiency of these methods. Specifically, in AIMD, the energy and forces of a given system are calculated with quantum mechanics (usually density functional theory, or DFT), so it is more accurate but more time-consuming. The computational complexity is $O\left(N^3\right)$, where $N$ is the number of electrons in the system. In classical MD, the potential energy surface is given by fixed mathematical forms with manually curated parameters (named "force fields"), so it is much faster ($O\left(N\right)$) but less accurate. Recently, many AI methods have been developed to bridge this gap [5,6,7]. Specifically, researchers designed neural networks that preserve the necessary physical symmetries and generate a neural representation of atomic configurations, which can be used to fit DFT-level potential energy surfaces. However, training such models still requires large-scale datasets. Concurrent learning [8] workflows come to the rescue by heuristically exploring the configuration space to collect as little data as possible. In this manner, the construction of neural network potentials becomes more automatic, which further enables a variety of applications to complex systems in condensed matter physics and materials science [9]. 15 | 16 |
17 |
18 |
Figure 4: An illustration of protein-multimers
19 |
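To make the accuracy/efficiency trade-off in the Multi-scale Modeling section more concrete, the following is a minimal sketch of how the cost of an $O(N^3)$ method and an $O(N)$ method grows as the system size increases. It assumes only that cost is proportional to $N^3$ and $N$ respectively; the function names `cost_growth` and `compare_scaling` are hypothetical and chosen for illustration, and real AIMD or classical MD timings also depend on prefactors, hardware, and implementation.

```python
def cost_growth(size_factor: float, exponent: int) -> float:
    """Factor by which cost grows when the system size N is multiplied by
    size_factor, assuming cost is proportional to N**exponent."""
    return size_factor ** exponent


def compare_scaling(size_factors=(2, 10, 100)):
    """Return (size_factor, cubic_growth, linear_growth) tuples."""
    return [(f, cost_growth(f, 3), cost_growth(f, 1)) for f in size_factors]


if __name__ == "__main__":
    for factor, cubic, linear in compare_scaling():
        print(f"N x {factor}: O(N^3) cost x {cubic}, O(N) cost x {linear}")
```

Under this idealized assumption, a tenfold larger system costs a thousand times more with cubic scaling but only ten times more with linear scaling, which is the gap that neural network potentials aim to bridge.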
20 | -------------------------------------------------------------------------------- /source/chapters/AI_for_scientific_discovery/why_different.md: -------------------------------------------------------------------------------- 1 | ## Why is AI for Science different? 2 | 3 | ### Data is Important 4 | Undoubtedly, AI methods rely on datasets of both high quality and large size to achieve excellent performance, as demonstrated by ImageNet [10] 5 | , CIFAR-10 [11], etc. This is also true for scientific problems, as exemplified by the RCSB Protein Data Bank (PDB) [12]. This database, containing approximately 200,000 entries and maintained by researchers all over the world for over 40 years, is one of the best databases of experimentally resolved biomolecular structures. Deep learning methods would never have reached such success without the efforts of the PDB maintainers. In particular, scientific problems usually involve real and challenging scenarios for many topics of interest to AI researchers, e.g., out-of-distribution generalization and low-data regime learning. In many areas, however, no high-quality dataset is available for building effective deep learning models. Therefore, AI researchers are encouraged to pay more attention to data collection and structuring, which require domain expertise and joint efforts. 6 | % Some examples include learning from structures studied several years ago to predict structures of researchers' interest today (out-of-distribution generalization), learning from highly expensive quantum property data to predict properties for new data (low-data regime), etc. 7 | 8 | ### Problem Formulation 9 | Many scientific discovery problems are much more complex than simple classification or regression tasks.
For AI researchers, scientific problems have to be decomposed and formulated well enough that inputs, outputs, and objective functions (which often need to be differentiable) can be clearly defined. For example, "drug design" is a huge pipeline consisting of a sequence of steps and is obviously not a well-formulated problem in itself. Instead, we can decompose it into pieces that are AI-solvable: molecular property prediction for virtual screening of molecular databases, molecular generation for proposing better drug candidates, etc. Problems that fail to meet this standard are often referred to as "dirty" ones and are unlikely to be addressed solely with AI methods. 10 | -------------------------------------------------------------------------------- /source/chapters/announcement/announcement.md: -------------------------------------------------------------------------------- 1 | # Announcing AI for Science Blog Series 2 | 3 | ## Background 4 | With the rapid development of AI, people have started to apply AI methods to almost every field, from natural language processing to computer vision. Recent breakthroughs have demonstrated the power of AI in solving grand challenges in the scientific community. Particular examples include predicting highly accurate protein structures with AlphaFold2, simulating 100-million-particle systems with DPMD, imaging the first-ever picture of a black hole, etc. Nevertheless, many researchers in both AI and scientific fields are not able to approach AI for Science research due to many gaps, from limited domain knowledge to misunderstanding of AI capabilities. In addition, the educational materials for AI for Science are scattered and poorly organized.
We announce this initiative (a blog series) to bring people who are interested in AI for Science to the forefront of the field, with knowledge collected at different levels: motivational overviews of the field, lecture-style tutorials on specific topics, and a knowledge base of common terminologies. 5 | 6 | ## Aim and Scope 7 | We are a group of students, researchers, and practitioners who are interested in AI for Science and devoted to advancing it as a new field and community. We write blogs to promote AI for Science research at different levels, from motivations for new researchers to resources for interdisciplinary researchers. As we announce this AI for Science blog series, we release two main documents titled *AI for Scientific Discovery* and *Scientific Discovery in the era of AI*, which present views on AI for Science from the AI and scientific communities, respectively. In addition, we compile a list of common terminologies in different disciplines as a *knowledge base*. As our first *lecture-style tutorial*, we highlight a study of molecular dynamics, one of the most commonly used tools in computational chemistry. 8 | 9 | ## Acknowledgement 10 | The project is a part of the DeepModeling community, an open-source community that aims to define the future of scientific computing together. 11 | This effort is primarily led by Yuanqi Du (Cornell), Yingze Wang (UCB), Yanze Wang (PKU), Yibo Wang (DP) and contributors Jiayue Wang (DP), Jiameng Huang (PKU), Arian Jamasb (Cambridge), Jihao Long (Princeton), Guiyu Cao (PKU), Zhenfeng Deng (PKU), Xi Chen (DP), Siyuan Zhou (BFSU), Yinkai Wang (Tufts).
We would also like to express our gratitude to Weinan E (Princeton \& PKU), Linfeng Zhang (DP), Ping Tuo (DP), Zheng Cheng (AISI), Han Wen (DP), Dongdong Wang (DP), Xinming Tu (UW), Nilay Shah (UCLA), Hannes Stark (MIT), Chaitanya Joshi (Cambridge), Ryan-Rhys Griffiths (Cambridge), Sang Truong (Stanford), Junhan Chang (PKU), Chenbing Wang (PKU), Ziming Liu (MIT), Weiliang Luo (PKU), Zhen Wang (DP), Yucheng Zhang (UTokyo), Ferry Hooft (UvA), Ziyao Li (PKU) for providing expertise, feedback and support. 12 | 13 | ## Feedback/comment or Join us 14 | Please reach out to us at [ai4science101@deepmodeling.com](mailto:ai4science101@deepmodeling.com) or join our [slack channel](https://join.slack.com/t/aiforscience/shared_invite/zt-1bdof1jmf-YtIjkUVA5DquXguEiOXGPQ) if you have any feedback or comments. 15 | As this is a community effort, we welcome anyone interested to join us. Any kind of volunteer work is welcome, including writing tutorials, drawing illustrations, etc. Do not hesitate to let us know! 16 | 17 | ## Contribution Guidelines 18 | We are looking for contributors/experts in specific areas related to AI for Science. The expected contributions include a three-level write-up: a one-paragraph introduction and learning material in section 2 or 3 (depending on whether the topic is in AI or Science), common terminologies and short explanations in section 5, and a specialized chapter similar to section 4. For each specialized chapter, we expect to include (1) target audience and motivations, (2) brief review of literature/history, (3) current advances and future promises, (4) takeaways, and (5) a running sample/demo (optional). 19 | #### How to get involved 20 | - GitHub discussion 21 | 22 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/1.PNG) 23 | 24 | Everyone is welcome to participate in the discussion about AI4Science in the discussion module of our GitHub.
The discussion page is [here](https://github.com/deepmodeling/AI4Science101/discussions) 25 | 26 | - Email 27 | 28 | Our email address is [ai-for-science101@googlegroups.com](mailto:ai-for-science101@googlegroups.com). If you are interested in sharing your knowledge about any particular aspect of AI for Science (e.g. a common AI tool, practical guidance, an overview of a scientific topic, etc.), we encourage you to send us an email before you start preparing the material. 29 | 30 | - Slack group 31 | 32 | In addition to reading our documents, you can also join the AI4Science101 [Slack channel](https://aiforscience.slack.com/join/shared_invite/zt-1bdof1jmf-YtIjkUVA5DquXguEiOXGPQ#/shared-invite/email) to introduce yourself, drop comments/feedback, discuss related material, network with peers, and contribute new material. 33 | 34 | #### How to make a new request 35 | - Make a new issue 36 | 37 | If you have suggestions for any of the documents, requests for new material, or an interest in contributing your knowledge and expertise to this initiative, we encourage you to participate in the AI4Science101 project. 38 | 39 | In order to increase the visibility of all requests and comments, and to facilitate the organization of this project, we recommend that you submit a new issue, following the examples shown below. 40 | 41 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/2.PNG) 42 | 43 | Then you can click the button indicated by the red arrow to open an issue. 44 | 45 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/3.PNG) 46 | 47 | After this, you can write your issue in the section inside the red box.
48 | 49 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/4.PNG) 50 | 51 | To keep requests and comments organized, please help us classify your request by starting the title with one of [error report|question|new material request|new material contribution|others], e.g., [new material request] Protein Structure Prediction Tutorial. 52 | 53 | #### How to make corrections to the docs/pull request 54 | 55 | If you find any inaccurate expressions, typos, or grammatical errors in our documents, or you would like to add more relevant content, you are welcome to submit a pull request on our GitHub. The community's volunteers will merge your pull request after reviewing your submission. 56 | 57 | There are two ways for you to modify the document: 58 | 59 | First, if you just want to modify a sentence or a few sentences, you can do it directly in the document, as shown below. 60 | 61 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/5.PNG) 62 | 63 | Then find the place you want to modify and modify it. 64 | 65 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/6.PNG) 66 | 67 | You can then describe your changes at the bottom of the page and submit them. 68 | 69 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/7.PNG) 70 | 71 | The system will automatically generate a new branch, and you can click the button in the green box to create a new pull request. 72 | 73 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/8.PNG) 74 | (Note: please submit pull requests to the **devel branch**) 75 | 76 | Second, if you want to modify many places, we recommend that you fork the repository, make your changes there, and then create a pull request once all the changes are done.
77 | 78 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/9.PNG) 79 | 80 | After making changes in your repository, you can create a pull request. 81 | 82 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/10.PNG) 83 | 84 | Then you can click the button indicated by the red arrow to open a new pull request. 85 | 86 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/11.PNG) 87 | 88 | It is worth noting that you must switch to the devel branch before submitting the pull request. After collecting and sorting out a certain number of changes, we will merge them into the main branch. So when submitting a pull request, please change the comparison branch to the devel branch, as shown below. 89 | 90 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/pictures/12.PNG) 91 | 92 | In addition, due to Markdown formatting issues, some formulas in the documents may appear garbled on GitHub. Please refer to the content on the project [website](ai4science101.deepmodeling.com) 93 | -------------------------------------------------------------------------------- /source/chapters/knowledge_base/biology.md: -------------------------------------------------------------------------------- 1 | ## Biology 2 | 3 | #### Cell Biology 4 | 5 | 6 | A branch of biology that studies the structure, function, and behaviour of the components within a cell. 7 | 8 | 9 | #### Biochemistry 10 | 11 | The study of biological problems using chemical perspectives and techniques. It focuses on intracellular entities, treating them as chemical building blocks and studying their functionality in order to map out how life works. 12 | 13 | #### Molecular Biology 14 | 15 | Molecular biology studies the composition, structure, function and behaviour of bio-active and/or bio-significant molecules, such as nucleic acids and proteins.
16 | 17 | #### Genetics 18 | 19 | The study of heredity from the perspective of the elemental building blocks of DNA and their temporal/spatial distribution and variation in organisms. The origins of diseases (abnormalities) and the driving forces of evolution can be derived from a thorough understanding of genetics. 20 | 21 | #### X-omics 22 | 23 | A source of large-scale, comprehensive biological data assembled from genomics, transcriptomics, proteomics, metabonomics, and microbiomics. 24 | 25 | #### Systems biology 26 | 27 | Analysis and modeling of complex biological systems based on data acquired by X-omics. 28 | 29 | #### Synthetic biology 30 | 31 | Design of new devices and circuits based on biological components. 32 | 33 | #### Epidemiology 34 | 35 | The study of the distribution and determinants of disease in populations. 36 | 37 | #### Enzyme 38 | 39 | An enzyme is a biological catalyst capable of accelerating a specific chemical reaction in cells. The enzyme is not destroyed during the reaction and can be used again and again (as long as suitable conditions are sustained). In most cases, enzymes are proteins. 40 | 41 | ![Schematic illustration of enzyme catalyzing reactions](https://dp-public.oss-cn-beijing.aliyuncs.com/community/knowledge_base/schematic_enzyme.jpg) 42 | 43 | #### Phase I/II biotransformation 44 | 45 | The metabolism of a drug can be divided into two phases. Phase I mainly involves breakdown (mainly by hydrolysis and oxidation). Phase II mainly involves the conjugation of chemical groups (polar in most cases) to make the drug more soluble and suitable for excretion. 46 | 47 | #### Cytochrome P450 48 | 49 | A family of key enzymes that contain heme as a cofactor and function as mono-oxygenases. They are the typical phase I drug-metabolizing enzymes and are involved in the metabolism of many components of drugs and food. They can be easily induced or inhibited by their substrates and thus play an outstanding role in the study of drug-drug interactions (DDI), e.g.
Patients who are taking Alvastatin are not allowed to eat grapefruit. 50 | 51 | #### Drug targets 52 | 53 | Molecules that are intrinsically associated with particular diseases and can be specifically addressed by a drug. Most known drug targets are proteins. 54 | 55 | #### Active site 56 | 57 | The catalytic center of an enzyme, which binds substrate(s) and initiates reactions. For enzymes that are proteins, the side chains of the key amino acids forming the active site shape it into a specific size with specific chemical behavior. 58 | 59 | #### Cofactor/Prosthetic group/Coenzyme 60 | 61 | Cofactors are non-peptide components required for enzymes to function properly. Cofactors can be either inorganic metal ions or organic molecules. Cofactors assist enzyme function by binding to the inactive form of the enzyme (apo-enzyme) to produce the catalytically active form (holo-enzyme). A prosthetic group is a type of cofactor that binds tightly to the enzyme and is not easily removed. A coenzyme is a type of cofactor that is a small organic molecule. 62 | 63 | #### Michaelis-Menten equation 64 | 65 | Michaelis–Menten kinetics describes the typical kinetic behaviour of enzymes. It is named after the German biochemist Leonor Michaelis and the Canadian physician Maud Menten. The Michaelis–Menten kinetics model describes the rate of enzymatic reactions $v$ in the form of the Michaelis–Menten equation shown below: 66 | 67 |
68 | \begin{aligned} 69 | v=\frac{d[P]}{dt}=V_{max}\frac{[S]}{K_{M}+[S]} \\ 70 | \end{aligned} 71 |
72 | 73 | Here, the enzyme reaction rate $v$, i.e., the rate of formation of product $[P]$, is related to the substrate concentration $[S]$. $V_{max}$ is the maximum reaction rate achieved by the system, which is reached when the substrate concentration is saturating at a given enzyme concentration. The Michaelis constant $K_{M}$ is numerically equal to the substrate concentration at which half of $V_{max}$ is reached. The kinetics of most enzyme-catalyzed single-substrate reactions are assumed to fit the Michaelis–Menten equation, regardless of further assumptions. 74 | 75 | ![Michaelis Menten model curve](https://dp-public.oss-cn-beijing.aliyuncs.com/community/knowledge_base/Michaelis_Menten_curve.png) 76 | 77 | #### Kinase 78 | 79 | A type of enzyme responsible for substrate phosphorylation. 80 | 81 | #### Receptor tyrosine kinase 82 | 83 | A tyrosine kinase is a type of kinase that phosphorylates tyrosine. It functions as an “on” or “off” switch in many cellular signalling processes. Receptor tyrosine kinases are a subclass of tyrosine kinases that serve as cell surface receptors with high affinity for many polypeptide growth factors, cytokines, and hormones. 84 | 85 | #### G protein coupled receptors (GPCR) 86 | 87 | A large group of evolutionarily related proteins that serve as cell surface receptors, activating cellular responses upon signals from outside the cell. The transmembrane domain of a GPCR passes through the cell membrane seven times (the typical structural characteristic of GPCRs). Ligands can bind either at the extracellular N-terminus and loops or within the transmembrane helices of the GPCR. Effective binding of a ligand causes a conformational change. Subsequent dissociation of the $\alpha$ subunit from the coupled G protein further facilitates intracellular signal processing.
88 | 89 | ![Structural and conformational scheme of GPCR[1].](https://dp-public.oss-cn-beijing.aliyuncs.com/community/knowledge_base/schematic_GPCR.jpg) 90 | 91 | #### Catalytic receptor 92 | 93 | A type of cell surface protein with the ligand binding site located at the extracellular surface of the plasma membrane and a functional region possessing catalytic activity on the intracellular face of the plasma membrane. The two parts are linked by a single transmembrane-spanning domain consisting of 20–25 hydrophobic amino acids. It commonly exists and functions as a dimer. Endogenous ligands for catalytic receptors are often peptides or proteins. 94 | 95 | #### Transport protein 96 | 97 | A transmembrane protein that allows selective passage of specific molecules from the external environment and is able to translocate ions, small molecules, or macromolecules. Transport proteins may be divided into two subgroups: channels and carriers. 98 | 99 | 100 | #### Carrier protein 101 | 102 | Active carrier proteins consume energy and are able to translocate substances against their concentration gradient. Passive carrier proteins assist the passage of substances by facilitated diffusion. 103 | 104 | #### Ion channel 105 | 106 | An ion channel is a type of transmembrane protein that mediates the passage of ions through the membrane. The major differences between ion channels and ion carriers are: (1) high efficiency, usually $10^6$ ions per second (or higher); (2) translocation of ions down their electrochemical gradient, without consuming metabolic energy. 107 | 108 | #### Nuclear hormone receptors (NHR) 109 | 110 | A class of transcription factors that regulate gene expression in response to the ligands they bind. The ligand binding domain (LBD) recognizes specific ligands, which stimulate a conformational change (dimerization) of the NHR. The DNA binding domain (DBD) directs the receptor to its hormone response elements (HRE).
The DBD functions as a dimer, with each monomer recognizing a six-base-pair sequence of the targeted DNA. 111 | 112 | #### Ubiquitination 113 | 114 | A biological process of intracellular protein degradation. The protein is first labelled with ubiquitin, a 76-amino-acid protein, through a three-step process with the help of a ubiquitin-activating enzyme (E1), a ubiquitin-conjugating enzyme (E2), and a ubiquitin-protein ligase (E3), resulting in mono-ubiquitination. The attached ubiquitin can be extended into a chain by adding more ubiquitin, resulting in polyubiquitination. The 26S proteasome recognizes polyubiquitination as a signal to initiate proteolysis and process the protein for degradation. 115 | 116 | #### Proteasome 117 | 118 | A huge protein complex that breaks the peptide bonds of unneeded or damaged proteins. 119 | 120 | #### Heat shock proteins (HSP) 121 | 122 | Molecular chaperones (proteins) that assist protein functioning in response to stressful conditions (e.g., exposure to cold and/or UV light, wound healing, etc.). HSPs are named according to their molecular weight. HSP90 refers to HSPs which are 90 kilodaltons in size. Ubiquitin (8 kilodaltons) also possesses heat shock protein features. 123 | 124 | #### Tubulins 125 | 126 | The structural units of the skeletal system of living cells. Tubulins are proteins that can be polymerized into long chains or filaments, which assemble into microtubules - hollow fibers that serve as the cell's skeletal system. 127 | 128 | #### Binding Site Detection for Receptors 129 | 130 | Not all functional components in our body can be drug targets. However, this doesn’t mean they cannot be modulated. Sometimes they are just too hard to access accurately because of their distribution in tissue or a structural factor, while in other cases inhibiting these components cannot trigger the expected downstream reaction because of intrinsic homeostasis or because their mechanism is not understood.
In most cases, orthosteric binding sites (the pockets that bind the endogenous ligand) can be easily determined by sequence / structure alignment. These sites may lack selectivity, which has led to growing interest in detecting allosteric sites (sites that do not directly bind the endogenous ligand, but modulate its binding behavior). Traditional methods for allosteric site detection rely on MD simulation. See: Investigating Cryptic Binding Sites by Molecular Dynamics Simulations 131 | 132 | - **Orthosteric/Allosteric Regulation** A protein can have endogenous ligands and protein-protein binding partners. If a drug binds the protein in areas directly involved in endogenous binding, its effect on the protein is called orthosteric regulation. If the drug binds other areas (far away) but can affect the behavior in this area, its effect on the protein is called allosteric regulation. Orthosteric regulation is easier to study: such binding can at least compete with the endogenous partner, affecting target behavior. Allosteric regulation is much harder to study, because it requires dynamic insight to determine the relationship between the orthosteric site and the potential allosteric site. 133 | 134 | - **Covalent Regulation** Traditionally, a drug molecule binds to the target without reacting with it. It can bind and dissociate, resulting in a chemical equilibrium. However, some novel types of drugs form a chemical bond with the target, binding to it permanently. Given the obvious prolonged effect (the drug effect can persist long after the drug's blood concentration becomes low), this kind of regulation can be both effective and risky. 135 | 136 | #### Immunology 137 | 138 | A branch of physiology that has attracted huge interest recently; it studies the immune system of the human body. 139 | 140 | - **Lymphocyte** A type of white blood cell that plays a vital role in immune responses. There are two main types of lymphocyte: B-cells and T-cells.
141 | 142 | - **B-cells and T-cells** B-cells are a type of lymphocyte that is able to produce antibodies. T-cells are involved in cell killing (directly killing virus-infected cells), immune response amplification (via cytokines, signal proteins secreted from T-cells) and cell memory, which enables an organism to respond to the same infection more quickly and efficiently if the infection happens again. 143 | 144 | - **Antigen and antibody** The term antigen originally referred to a substance that may trigger an immune response and serves as an antibody generator. Antibodies (or immunoglobulins) are large, Y-shaped proteins secreted from B-cells to recognize and neutralize antigens. 145 | 146 | - **Complement system** The complement system functions via a cascade involving distinct plasma proteins that react with one another to opsonize pathogens and induce a series of inflammatory responses to fight infection. It enhances and/or complements the effects of antibody activity and first evolved as part of the innate immune system. 147 | 148 | - **Cluster of differentiation antigen (CD)** Surface proteins on leukocytes that reflect the differentiation stage or activation state of the cell and can be recognized by specific monoclonal antibodies. 149 | 150 | - **Epitope** The epitope is the antigenic determinant on an antigen that stimulates immune responses. Binding and subsequent reaction of immune cells and antibodies with antigens is initiated via recognition of the epitope. 151 | 152 | - **Antigen-presenting cell (APC), Major histocompatibility complex 153 | (MHC) and Human leukocyte antigen (HLA)** APCs are cells possessing the ability to present an antigen for T-cell recognition. The heterogeneous group (protein complex) on the APC surface for antigen presentation is called the major histocompatibility complex (MHC). There are two types of MHC, class I and class II, which differ in structure and in the cell types that express them.
MHC in humans is also called human leukocyte antigen (HLA). There is significant work aiming to understand how MHC recognizes the antigens it presents. AI models have achieved fairly high accuracy in predicting whether an antigen (mainly a short peptide sequence) can be presented by MHC (and is thus likely to stimulate an immune reaction from T-cells), which helps in designing more efficient immune regulators (neoantigens). 154 | 155 | - **Cytokines** Cytokines are messenger proteins released from immune cells to regulate immune responses. Abnormal cytokine activity can induce a "cytokine storm", which can be lethal. 156 | 157 | #### Antigenicity and immunogenicity 158 | 159 | When a foreign material (antigen) enters, an organism initiates a barrier system to fight against and eventually eliminate this intruder. Antigenicity describes the ability of an antigen to bind to, or interact with, the products of the final cell-mediated response (such as B-cell or T-cell receptors). Immunogenicity measures the ability of the antigen to activate the immune response (including the innate immune response and the subsequent adaptive (acquired) immune response). Immunogens possess antigenicity, while antigens may not always have immunogenicity. For example, haptens such as metal ions are antigens, but do not trigger an immune response on their own. 160 | 161 | #### Monoclonal antibody, vaccine and neoantigen 162 | 163 | Monoclonal antibodies are engineered antibodies that typically recognize the same epitope, and thus possess high specificity towards the targeted antigen. Vaccines are biological preparations containing an agent that initiates immune responses to form a barrier and thus protect the body from a certain infectious disease. The agent of a vaccine resembles the disease-causing microorganism and is often made from weakened or killed forms of the microbe, its toxins, or its surface proteins.
Neoantigens are the translation product (protein) of mutated DNA in cancer cells. They are different from the original protein under physiological condition and may thus play a significant role in stimulating immune response against cancer cells. 164 | 165 | #### Prediction and design of protein-protein interaction 166 | 167 | Protein-protein interaction (PPI) is the basis for many biological processes to function properly. Specific recognition between the interacting proteins is established on the basis of physical contacts. The forces driving stable/favourable interaction come from electrostatic interaction, hydrogen bonding and/or the hydrophobic effects etc. Based on the forces performed by atom/atom groups, there exist recognition patterns in the aspects of protein sequence as conserved region formed by amino acids that possess similar physicochemical properties have been observed in certain type of PPI. With the understanding of the interaction forces and their corresponding protein sequences, recognition/interaction patterns of PPIs should be reasonably summarized in relation with their biological outcomes. These summarized patterns in forms of models could be further applied for biological effect prediction with the protein sequences as input. Further, one could design functional protein sequences to achieve the desired bio-activity. 168 | 169 | #### Yield, Solubility, Stability of therapeutic Macromolecules 170 | 171 | Therapeutic macromolecules are compounds with large molecular weight possessing therapeutics effects and are typically derived from biological processes. The commonly applied therapeutic macromolecules (macromolecular drugs) include peptides, proteins, antibodies, polysaccharides and nucleic acids. Procedures to collect therapeutic macromolecules include bio-synthesis, recombinant protein expression, conjugation and modification etc. 
In comparison with small-molecule drugs, stable pipeline construction to yield therapeutic macromolecules requires much more effort. It is also worth noting that, as therapeutic macromolecules are typically derived from biological process in organism, some of them possess favourable intrinsic properties such as being lipophilic/hydrophobic and many of them would be easily degraded when recognized by cell metabolizing systems. Thus, enhancing the solubility of therapeutic macromolecules to facilitate desirable distribution properties as well as to sustain the intact entities thus gain stability to occupy therapeutic window wide enough to take action are another important issues for future discovery. 172 | 173 | #### Immuno-therapy 174 | 175 | Immuno-therapy functions by activating/mediating/enhancing immune responses of the patients to fight against diseases (cancer). 176 | 177 | #### Cell therapy 178 | 179 | Engineered cells (isolated from patients) with therapeutic effects are injected/grafted/implanted (back) into the patient’s body to treat the disease. The most well known cell therapy is CAR-T, where chimeric antigen receptor T cells are genetically engineered to produce an artificial T cell receptor to take action in the way of immunotherapy. 180 | -------------------------------------------------------------------------------- /source/chapters/knowledge_base/chemistry.md: -------------------------------------------------------------------------------- 1 | ## Chemistry 2 | 3 | #### Atomic Orbitals 4 | 5 | In Quantum Mechanics, Atomic Orbitals are mathematical functions that describe the wave-like behavior of electrons in atoms. This function can be used to calculate the probability of electrons appearing around the nucleus, and the meaning of “orbital” refers to the probability of electrons appearing in a specific area. According to the “shape” of the track, it can be classified into s, p, d, f, etc. 
6 | 7 | #### Electronegativity 8 | 9 | Electronegativity describes the ability of atoms of an element to attract electrons in a compound. The greater the electronegativity of an element, the stronger the ability of its atoms to attract electrons in the compound. In a period of the periodic table, the electronegativity of the element atom increases from left to right; and it decreases from top to bottom in a group. Therefore, the elements at the upper right of the periodic table (O, N, F, Cl, etc.) have higher electronegativity values. The element with the greatest electronegativity is fluorine. 10 | 11 | #### Chemical Bond 12 | 13 | A chemical bond refers to the strong interaction between atoms, ions, and other particles. Through chemical bonds, particles can form polyatomic compounds (such as organic molecules, inorganic molecules, ionic compounds, etc.). Simply put, for a polyatomic system, the most stable configuration between positively charged nuclei and negatively charged electrons is that when electrons are located between nuclei, electrons are attracted between different nuclei, and using this force the nuclei are “attracted” together, forming a chemical bond. 14 | 15 | - **Ionic Bond**: A chemical bond formed by electrostatic interaction between oppositely charged anions and cations, without directionality, such as sodium chloride (salt), calcium carbonate. 16 | 17 | - **Covalent Bond**: A chemical bond formed by sharing electron pairs between atoms. Two atoms with similar electronegativity are equally attracted to electrons, so they mainly form chemical bonds by sharing each other’s outer valence electrons. Covalent bonds are directional, resulting in complex molecular structures. For example, in the methane molecule, carbon atoms and hydrogen atoms are connected by covalent bonds to form a regular tetrahedron, the carbon atom is located at the center of the tetrahedron, and the hydrogen atom is located at the vertex of the tetrahedron. 
According to the number of shared electron pairs, covalent bonds can also be classified into single, double, and triple bonds. 18 | 19 | - **Hydrogen Bond**: When a hydrogen atom forms a covalent bond with a highly electronegative atom X (usually O, N, F) and another highly electronegative atom Y (usually also O, N, F) is close, a special interaction of the form X-H···Y arises, with hydrogen as the medium between X and Y. This interaction is known as a hydrogen bond. Hydrogen bonds exist widely in water and in biological macromolecules such as proteins and DNA. They play a crucial role in stabilizing the conformation of biological macromolecules. 20 | 21 | #### Functional Group 22 | 23 | Functional groups are atoms or groups of atoms that determine the properties of organic compounds. Common functional groups include hydroxyl (-OH), carboxyl (-COOH), ether bond (C-O-C), carbonyl (C=O), halogen atom (-F, -Cl, -Br, -I), etc. 24 | 25 | #### Aromatic 26 | 27 | Aromaticity is a chemical property of cyclic planar molecules containing $\pi$ bonds composed of delocalized electrons, which gives the molecules a stability that cannot be explained by conjugation alone. The number of electrons in the delocalized $\pi$ system of an aromatic molecule needs to satisfy the Hückel rule (also called the "4n+2" rule). Molecules with aromaticity are called aromatic compounds, and molecules without aromaticity are called aliphatic compounds. Aromatic compounds can be roughly classified into simple aromatic compounds (such as benzene), polycyclic aromatic compounds (such as naphthalene and anthracene), and heterocyclic compounds (such as pyridine and pyrrole). 28 | 29 | ![Examples of aromatic compounds](https://dp-public.oss-cn-beijing.aliyuncs.com/community/knowledge_base/aromatic.png) 30 | 31 | #### Conformation 32 | 33 | Conformation usually refers to the three-dimensional structure that a molecule adopts in space; a molecule in one particular conformation is called a conformer.
For organic molecules, conformations cannot be generated at random, because the directionality of covalent bonds constrains them. 34 | 35 | #### Isomers 36 | 37 | In Organic Chemistry, substances with the same chemical composition (molecular formula) but different structures are called isomers of each other. For example, the compositions of ethanol and dimethyl ether are both $\mathrm{C_2H_6O}$, but their structures are different: 38 | 39 | ![General form of isomers](https://dp-public.oss-cn-beijing.aliyuncs.com/community/knowledge_base/isomers.png) 40 | 41 | #### Stereoisomerism 42 | 43 | Stereoisomers are molecules in which the atoms are topologically connected in the same way but their spatial arrangement is different. For example, a molecule is likely to have stereoisomers when it contains a carbon atom to which four different groups are bonded. Such an atom is called a chiral atom, and the labels R/S are usually used to distinguish its two configurations. For biomolecules such as peptides, amino acids and sugars, L/D are frequently used to denote the different types of stereoisomers. The two amino acid configurations shown in the figure below are stereoisomers of each other. Natural chiral amino acids are almost exclusively in the L configuration, and their $\alpha$-carbon atoms are in the S configuration (cysteine, whose $\alpha$-carbon is R, is the exception). 44 | 45 | ![Forms of stereoisomerism](https://dp-public.oss-cn-beijing.aliyuncs.com/community/knowledge_base/chiral_isomerism.png) 46 | 47 | #### Cis-trans Isomerism 48 | 49 | Cis-trans isomerism refers to isomerism that arises from hindered rotation within the molecule, and is commonly found in compounds with double bonds or rings. 50 | 51 | ![Forms of isomerism](https://dp-public.oss-cn-beijing.aliyuncs.com/community/knowledge_base/cis_trans_isomerism.png) 52 | 53 | #### Tautomerism 54 | 55 | Tautomerism means that the structure of some organic compounds interconverts between two functional isomers.
Most tautomerisms involve the transfer of hydrogen atoms or protons, and the conversion of single bonds to double bonds. The distribution of tautomers in equilibrium depends on specific factors, including temperature, solvent, and pH, etc. The diagram below shows the keto (left) and enol (right) tautomers present in carbonyl compounds, with the keto structure predominating in the usual case. 56 | 57 | ![Forms of tautomerism](https://dp-public.oss-cn-beijing.aliyuncs.com/community/knowledge_base/tautomerism.png) 58 | 59 | #### Amino Acids 60 | 61 | Amino acids are biologically important organic compounds consisting of amino (-NH2) and carboxyl (-COOH) functional groups and side chains attached to each amino acid. Amino acids are the basic units that make up a protein. In nature, there are 20 genetically encoded amino acids. 62 | 63 | ![Bound and stick shown of amino acids](https://dp-public.oss-cn-beijing.aliyuncs.com/community/knowledge_base/amino_acids.png) 64 | 65 | #### Protein Structure 66 | 67 | Protein structure refers to the spatial structure of a protein biomolecule, which can be divided into four levels to describe different aspects. 68 | 69 | - **Primary structure**: the linear amino acid sequence that makes up the polypeptide chain of a protein. 70 | 71 | - **Secondary structure**: a stable structure formed by hydrogen bonds between C=O and N-H groups between different amino acids, mainly $\alpha$-helix and β-sheet. 72 | 73 | - **Tertiary structure**: the three-dimensional structure of a protein molecule is formed by the arrangement of multiple secondary structural elements in three-dimensional space. 74 | 75 | - **Quaternary structure**: used to describe the interaction of different polypeptide chains (subunits) to form functional protein molecules. 76 | 77 | #### Ligand 78 | 79 | In biochemistry or pharmacology, a ligand refers to a compound that can bind to a receptor and then lead to some physiological effect. 
In medicinal chemistry, ligands are usually small organic molecules or short peptides composed of several amino acids. The forces between ligands and receptors are usually non-covalent interactions, such as hydrogen bonds, electrostatic interactions, van der Waals interactions, etc. 80 | 81 | #### Receptor 82 | 83 | Signal transduction carries a chemical or physical signal from outside the cell into the cell through a series of molecular events (such as protein phosphorylation). Receptors play the central role in this process: they receive signals from outside the cell and produce specific effects within it. A receptor is usually a biological macromolecule such as a protein. After the receptor binds to a specific stimulus, its structure changes to a certain extent, and the corresponding effect is induced in the cell. In medicinal chemistry, receptors usually refer to target proteins able to bind with ligands. 84 | 85 | #### Lock and Key Model 86 | 87 | The lock-and-key model is a theory proposed by E. Fischer in 1894 to explain the specific binding between enzymes and substrates (or between ligands and receptors). The model holds that the structures of enzymes and substrates at their binding sites should be strictly matched and highly complementary, just like a lock and its original key. The disadvantage of this model is that it treats the enzyme and the substrate as rigid structures, which is inconsistent with the fact that the conformations of the enzyme and the substrate change during the catalytic reaction. 88 | 89 | #### Induced Fit Model 90 | 91 | The Induced-Fit Model is a model proposed by Koshland in 1958 to describe the enzyme-substrate (ligand-receptor) binding interaction.
This model believes that in the process of binding the enzyme to the substrate, the substrate can induce a certain change in the structure of the enzyme, and finally form an active conformation that can bind to the substrate. 92 | 93 | ![Schematic of the lock and key model](https://dp-public.oss-cn-beijing.aliyuncs.com/community/knowledge_base/lock_key_model.png) 94 | 95 | #### Molecular Docking 96 | 97 | Molecular Docking is a technique that simulates the interaction between ligands and receptors. The technology predicts ligand binding modes and ligand-receptor binding forces by physically modeling intermolecular interactions and applying optimization algorithms such as the Monte Carlo method. 98 | 99 | #### Reversible Reaction 100 | 101 | A reversible reaction is a chemical reaction that can proceed in both the forward and reverse directions under the same conditions. When the degree of the reverse reaction direction is much smaller than that of the forward reaction direction, the reaction can be considered irreversible. Most of the reactions are reversible, such as the dissociation of weak acid/base, ligand-receptor binding, etc. 102 | 103 | #### Chemical Equilibrium 104 | 105 | Chemical Equilibrium refers to a state in which the forward and reverse reaction rates of a chemical reaction are equal in a reversible reaction with certain macroscopic conditions, and the concentrations of the reactants and the components of the products do not change. Take the following reaction as an example: 106 | 107 |
108 | \begin{aligned} 109 | \mathrm{aA+bB\rightleftharpoons cC} 110 | \end{aligned} 111 |
112 | 113 | When equilibrium is reached and the concentrations of $\mathrm{A}$, $\mathrm{B}$, $\mathrm{C}$ are $[\mathrm{A}]$, $[\mathrm{B}]$, $[\mathrm{C}]$ respectively, the equilibrium constant $K$ is defined as: 114 | 115 |
116 | \begin{aligned} 117 | K&=\frac{[\mathrm{C}]^c}{[\mathrm{A}]^a[\mathrm{B}]^b} \\ 118 | \end{aligned} 119 |
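120 | 121 | Under given reaction conditions (in particular, a fixed temperature), the equilibrium constant of a reaction with fixed stoichiometry is a constant. It is related to the standard free energy change of the reaction as follows, where $R$ is the gas constant and $T$ is the absolute temperature: 122 | 123 |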
124 | \begin{aligned} 125 | \Delta G^\circ=-RT\ln K \\ 126 | \end{aligned} 127 |
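As a small numerical illustration of the two relations above, the following Python sketch evaluates the equilibrium constant from concentrations and converts between $K$ and the standard free energy change. The function names and the default temperature of 298.15 K are assumptions made for this example only; they do not come from any library.

```python
import math

R = 8.314462618  # gas constant in J/(mol*K)


def equilibrium_expression(conc_a, conc_b, conc_c, a, b, c):
    """K = [C]^c / ([A]^a [B]^b) for the reaction aA + bB <=> cC."""
    return conc_c**c / (conc_a**a * conc_b**b)


def equilibrium_constant(delta_g_standard, temperature=298.15):
    """K from the standard free energy change (J/mol) at temperature (K)."""
    return math.exp(-delta_g_standard / (R * temperature))


def standard_free_energy(k_eq, temperature=298.15):
    """Standard free energy change (J/mol) from K, i.e. -RT ln K."""
    return -R * temperature * math.log(k_eq)


if __name__ == "__main__":
    # Hypothetical equilibrium concentrations in mol/L for A + B <=> C.
    k = equilibrium_expression(0.5, 0.5, 2.0, a=1, b=1, c=1)
    print("K =", k)
    print("standard free energy change (J/mol) =", standard_free_energy(k))
```

At room temperature, each change of about 5.7 kJ/mol in the standard free energy changes $K$ by roughly a factor of ten, which is why small free energy differences in ligand-receptor binding translate into large differences in affinity.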
128 | 129 | #### van der Waals force 130 | 131 | van der Waals (vdW) force refers to the non-directional, unsaturated, weak interaction force between atoms. Van der Waals interactions are much weaker than chemical bonds, but they significantly affect the melting point, boiling point, and many other properties. Van der Waals interactions have 3 major contributions: 132 | 133 | - **Attractive or repulsive electrostatic interactions** between permanent charges, dipoles, quadrupoles, etc. 134 | 135 | - **Induction** (also known as polarization), which is the attractive interaction between a permanent multipole on one molecule and an induced multipole on another. This interaction is sometimes called the Debye force. 136 | 137 | - **Dispersion** (usually named London dispersion interactions after Fritz London), which is the attractive interaction between any pair of molecules, including non-polar atoms, arising from the interactions of instantaneous multipoles. 138 | 139 | In molecular simulations, van der Waals forces are usually described in terms of the Lennard-Jones potential function, which has the following form: 140 | 141 |
142 | \begin{aligned} 143 | V(r)=\frac{C_{12}}{r^{12}}-\frac{C_{6}}{r^{6}} \\ 144 | \end{aligned} 145 |
146 | 147 | Where $r$ is the distance between two atoms and the coefficients $C$ are parameters, usually obtained by fitting physical quantities such as density and the enthalpy of vaporization. 148 | 149 | #### Hydrophobic interaction 150 | 151 | Hydrophobic interaction, also known as the hydrophobic effect, is a chemical phenomenon in which hydrophobic groups in an aqueous solution (such as non-polar alkyl groups) come close to each other to reduce their contact area with water. Hydrophobic interactions are the main driver of protein folding. 152 | 153 | #### Thermodynamics 154 | 155 | Chemical thermodynamics studies the interplay of heat and work with chemical reactions and changes of state, under the laws of thermodynamics. Generally speaking, problems that concern equilibrium states and do not involve the course of a chemical reaction belong to chemical thermodynamics; examples include phase transitions and the balance of sodium and potassium ions on the two sides of the cell membrane. 156 | 157 | #### Kinetics 158 | 159 | Kinetics, also known as reaction kinetics and chemical reaction kinetics, is a branch of physical chemistry that studies the rate and mechanism of chemical reactions. Chemical kinetics is different from chemical thermodynamics. It is not concerned with the equilibrium state, but studies the chemical reaction dynamically: the time required for the transformation of the reaction system, as well as the microscopic processes involved. 160 | -------------------------------------------------------------------------------- /source/chapters/knowledge_base/index.rst: -------------------------------------------------------------------------------- 1 | Knowledge Base 2 | ======================== 3 | 4 | ..
toctree:: 5 | :maxdepth: 2 6 | :caption: Contents: 7 | 8 | physics.md 9 | chemistry.md 10 | biology.md 11 | pharmacy.md 12 | references.md 13 | 14 | -------------------------------------------------------------------------------- /source/chapters/knowledge_base/pharmacy.md: -------------------------------------------------------------------------------- 1 | ## Pharmacy 2 | 3 | #### Pharmaceutics 4 | 5 | How chemical entities are turned into medications. 6 | 7 | #### Pharmacokinetics 8 | 9 | Pharmacokinetics studies how organisms process drugs. 10 | 11 | - **Absorption (A)**: how drugs get into the bloodstream. 12 | 13 | - **Distribution (D)**: how drugs are reversibly transferred from one location to another within the body. Some drugs tend to concentrate in part of the body, such as adipose tissue, raising potential risks for clinical usage. 14 | 15 | - **Metabolism (M)**: how drugs are broken down and modified inside the body. 16 | 17 | - **Excretion (E)**: how drugs and their metabolites (metabolised forms of drugs) are removed from the body. 18 | 19 | - **Toxicity (T)**: a pharmacodynamic property of drugs. Since its assessment protocol is similar to that of ADME, the five are often referred to together as ADMET. 20 | 21 | #### Pharmacodynamics 22 | 23 | Pharmacodynamics studies what a drug does to the body. Pharmacodynamics focuses on the molecular, biochemical and physiological effects or actions of the studied drug. 24 | 25 | #### Medicinal Chemistry 26 | 27 | Medicinal chemistry refers to designing and synthesizing small molecules as pharmaceutical agents. 28 | 29 | - **Hit-lead-candidate**: A hierarchical description of the potential precursors of a registered medication entity. 30 | 31 | - **Hit**: Promising candidates from preliminary screening, typically with micromolar EC50 (half maximal effective concentration) values, or top scores from virtual screening. 32 | 33 | - **Lead**: Candidates demonstrating further potential to become drugs.
Lead compounds normally require extensive modification and assessment before becoming a drug candidate. 34 | 35 | - **Drug candidate**: A well-studied compound, showing sufficient evidence of potency, selectivity, safety, and other drug-like properties. Drug candidates become registered drugs after thorough clinical trials (usually involving thousands of volunteers, many years, and investment on the order of billions of dollars). 36 | 37 | #### Synthetic Route Design 38 | 39 | Given a promising drug candidate, the synthetic route is planned and designed to find more efficient processes, with suitable starting materials, that finally yield the product. 40 | 41 | #### Retrosynthesis 42 | 43 | Retrosynthesis is a recursive method to design a synthesis pathway for target organic molecules. The target is split into simpler precursors, and each precursor is split in the same manner until all the building blocks are commercially available. 44 | 45 | #### Bioisostere 46 | 47 | Chemical substituents or groups with similar physical or chemical properties that produce broadly similar biological effects. Bioisosteric replacement is mainly applied when the parent compound otherwise possesses ideal drug-like characteristics but is unsuitable in terms of safety, bioavailability, etc. 48 | 49 | #### Reaction Output Prediction 50 | 51 | Predict the product(s) of given reactants. There are reaction rules and preferred reaction sites to be followed and studied in order to achieve this goal. Organic reactions are rarely clean: a system can react in different manners (e.g. substitution and elimination reactions share very similar reagents and reaction conditions) and at different sites (e.g. there can be two hydroxyl groups in one reagent, and changing the reaction conditions can influence which one reacts preferentially).
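Retrosynthesis, described above, runs in the opposite direction to reaction output prediction, and its recursive structure can be made concrete with a toy sketch. The molecule names, the `disconnections` table and the `purchasable` set below are purely hypothetical placeholders; a real planner would propose disconnections from reaction rules or a learned model and would also need to handle multiple competing routes and cyclic proposals, which this sketch does not.

```python
def retrosynthesize(target, disconnections, purchasable):
    """Recursively split `target` until every leaf is purchasable.

    Returns a nested dict describing the synthesis tree. Assumes the
    disconnection table is acyclic and offers one split per molecule.
    """
    if target in purchasable:
        return {"molecule": target, "precursors": []}
    if target not in disconnections:
        raise ValueError(f"no known route to {target!r}")
    return {
        "molecule": target,
        "precursors": [
            retrosynthesize(p, disconnections, purchasable)
            for p in disconnections[target]
        ],
    }


def building_blocks(tree):
    """Collect the purchasable leaves of a synthesis tree, left to right."""
    if not tree["precursors"]:
        return [tree["molecule"]]
    leaves = []
    for child in tree["precursors"]:
        leaves.extend(building_blocks(child))
    return leaves


# Hypothetical example data: names are placeholders, not real chemistry.
disconnections = {
    "target": ["intermediate", "block_c"],
    "intermediate": ["block_a", "block_b"],
}
purchasable = {"block_a", "block_b", "block_c"}

route = retrosynthesize("target", disconnections, purchasable)
print(building_blocks(route))
```

The recursion stops in exactly the situation the definition names: when a precursor is commercially available. Raising an error for a molecule that can be neither bought nor split makes dead ends explicit instead of silently returning a partial route.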
52 | 53 | #### Patent recognition/Literature information extraction 54 | 55 | Collections of patent filings or literature contain a plethora of information that can guide drug development. On the other hand, novel compounds need to avoid violating existing intellectual property. Chemical patents are quite intricate, requiring sophisticated training and significant time to understand. Thus, automated information extraction via computer vision or natural language processing technology is highly desirable, although evaluating it is a challenge. 56 | 57 | #### Property Prediction for Ligands 58 | 59 | A drug-like ligand requires favourable properties with respect to pharmacokinetics (meaning it can reach the target properly) and pharmacodynamics (meaning it can act on the target properly). These properties will be checked closely and separately below. However, medicinal chemists have come up with a few straightforward and system-independent guidelines for drug design, to filter out risky molecules before downstream development. Below are some classic examples: 60 | 61 | - Lipinski's rule of five (RO5) / beyond rule of five (bRO5): The most classic drug-likeness guideline, although these rules are continually being questioned. See Rule of five in 2015 and beyond: Target and ligand structural limitations, ligand chemistry structure and drug discovery project decisions 62 | 63 | - Pan-assay interference compounds (PAINS): Some molecules always show a positive signal in high-throughput screening, but so far no one has successfully turned them into registered drugs. Such molecules share some intricate similarity, but how to describe them precisely remains a problem. 64 | 65 | - dosage form: the form in which a drug is marketed for use, e.g. capsule, syrup, injection, etc. 66 | 67 | - excipient: an inactive component in a drug product. The purpose of excipients may be to: make the drug stable, soluble, absorbable; change the absorption behavior (e.g.
controlled-release technique); some new excipients are suggested to change the distribution behavior of the drug (like liposomes). 68 | 69 | - pharmaceutical formulation design: choose the proper dosage form, find the suitable combination of excipients and their ratio, and decide the manufacturing protocol. 70 | 71 | - crystal structure of drugs: the arrangement of the molecules in a crystal. It can affect both the physical and (rarely) the chemical properties of the drug product. 72 | 73 | #### Bioavailability 74 | 75 | Bioavailability refers to the fraction (%) of an administered drug that reaches systemic circulation. 76 | 77 | #### Druggability 78 | 79 | Whether a protein is suitable to be a drug target. Studies on druggability do not focus on "determination of a good target", but rather on "how to filter out the unsuitable/difficult targets". There are two "tangible" sub-questions in this topic: whether a target can be modulated by small molecules (i.e. contains a suitable pocket / covalent modification site / etc. for molecule binding); and whether the inhibition / activation of a target can cause downstream (at least cellular-level) changes (rather than being antagonised and eliminated by intracellular homeostasis). 80 | 81 | #### Druggability Prediction for Receptors 82 | 83 | Determine whether a certain kind of receptor could serve as a drug target that a drug can specifically address to treat a certain disease. It is also common to assess whether newly identified receptors could be targeted by existing drugs for new therapeutic purposes (drug re-purposing/repositioning). 84 | 85 | #### Pharmacophore 86 | 87 | (according to IUPAC) an ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target and to trigger (or block) its biological response.
In short: when we perform a QSAR study, we consider a part of a molecule as a group, and describe it by its chemical properties (charge, hydrophobicity, aromaticity, steric hindrance, etc). Such a group will be ’scored’ and replaced as a whole. 88 | 89 | #### Prediction of structure-activity relationship 90 | 91 | The structure-activity relationship (SAR) is the relationship between the structure of a compound and its biological activity. It tries to answer two questions: 1. which parts of a bioactive compound, or which combinations of these parts, matter (a certain pharmacophore in a certain topology / geometry); 2. how to modify a molecule according to the information gained above (infer a stronger pharmacophore / scaffold to replace the old one) 92 | 93 | #### Molecule Generation 94 | 95 | Designing a new chemical entity that satisfies all the demands above (has ideal properties, can be synthesised easily, has not been patented) is considered the holy grail of drug discovery. Since the inference of the above properties is still underdeveloped, there is still a long way to go for this ambition. However, today’s generative chemistry models can already serve a practical role in settings like library generation (generating molecules that are at least novel and ’drug-like’) and conditional design (generating molecules that satisfy certain explicit constraints). For more information see Generative Models for De Novo Drug Design. 96 | 97 | #### Formulation Design 98 | 99 | Design of the optimal form of drugs based on the effective compound.
For further reading, see: 1. State-of-the-Art Review of Artificial Neural Networks to Predict, Characterize and Optimize Pharmaceutical Formulation; 2. Crystal structures of drugs: advances in determination, prediction and engineering 100 | 101 | #### Regenerative medicine 102 | 103 | Regenerative medicine seeks ways to replace tissues or organs damaged by disease, trauma, or congenital issues, in contrast to traditional clinical approaches that focus only on alleviating or treating the symptoms. -------------------------------------------------------------------------------- /source/chapters/knowledge_base/physics.md: -------------------------------------------------------------------------------- 1 | ## Physics 2 | 3 | #### Molecular Dynamics 4 | 5 | Molecular Dynamics (MD) is a type of molecular simulation method, which aims to study the dynamic evolution of physical systems through computer simulations of atoms and molecules. Based on MD simulations and statistical mechanics, many macroscopic thermodynamic properties, for instance, free energy or density, can be evaluated. Typically, trajectories of atoms in a simulation are generated by solving Newton’s equations of motion, where the potential energy function $V$ comes either from force fields or Quantum Mechanics (QM) ab-initio calculations: 6 | 7 |
8 | \begin{aligned} 9 | \dot{p}&=m\ddot{x} = -\frac{\partial V}{\partial x}\\ 10 | \end{aligned} 11 |
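The equation of motion above is usually integrated numerically in small time steps. The following is a minimal sketch, not the method of any particular MD package: it applies the velocity Verlet scheme to a single particle in one dimension. The harmonic potential $V(x)=\frac{1}{2}kx^2$, the unit mass, and the names `harmonic_force`, `velocity_verlet` and `total_energy` are illustrative assumptions.

```python
import math

def harmonic_force(x, k=1.0):
    """Force -dV/dx for the illustrative potential V(x) = 0.5 * k * x**2."""
    return -k * x

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate m * d2x/dt2 = force(x); return the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # first half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory

def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the illustrative oscillator."""
    return 0.5 * mass * v ** 2 + 0.5 * k * x ** 2

# Start at x = 1 with zero velocity and integrate up to t = 10 (arbitrary units).
traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                       mass=1.0, dt=0.01, n_steps=1000)
e0 = total_energy(*traj[0])
e1 = total_energy(*traj[-1])
print(f"energy drift: {abs(e1 - e0):.2e}")
print(f"x(t=10) numeric: {traj[-1][0]:.4f}, exact: {math.cos(10.0):.4f}")
```

For this toy system the exact solution is $x(t)=\cos t$, so the trajectory and the energy drift give a quick check that the time step is small enough.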
12 | 13 | Depending on the smallest indivisible unit during a simulation, MD simulations can be roughly divided into two major categories: All-atom Molecular Dynamics and Coarse-grained Molecular Dynamics (CGMD): 14 | 15 | - **All-atom Molecular Dynamics**: each individual atom is treated as the smallest indivisible unit for motion and force calculations 16 | 17 | - **Coarse-grained Molecular Dynamics**: a set of adjacent atoms (such as an amino acid residue or a water molecule) is treated as a coarse-grained unit, usually referred to as a “bead”. Only interactions between beads are considered, while all intra-bead interactions are neglected during a CGMD simulation. This treatment lets CGMD reach longer time scales and larger physical systems at a reduced computational cost, at the price of reduced accuracy. 18 | 19 | Depending on the source of the potential energy function used during a simulation, MD simulations can be divided into three categories: Classical Molecular Dynamics (Classical MD, cMD), Ab-initio Molecular Dynamics (AIMD) and Machine Learning Molecular Dynamics (MLMD): 20 | 21 | - **Classical Molecular Dynamics**: potential energy functions of the physical system come from a force field; 22 | 23 | - **Ab-initio Molecular Dynamics**: potential energy functions of the physical system come from ab-initio calculations; 24 | 25 | - **Machine Learning Molecular Dynamics**: potential energy functions of the physical system come from a machine learning force field. 26 | 27 | #### Potential Energy Function 28 | 29 | Potential Energy Function, usually shortened to “Potential”, refers to the function that describes the energy of interaction within a physical system. In an all-atom MD simulation the potential is a function of the atom types and atomic coordinates within the given physical system, and it can be given by quantum mechanics (QM), molecular mechanics (MM) force fields, or machine learning (ML) force fields.
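To make the idea of a potential as a function of atom types and coordinates concrete, here is a small sketch that sums a pairwise Lennard-Jones potential over all atom pairs. The Lennard-Jones form is chosen only for illustration; the parameters `epsilon` and `sigma` (standing in for the atom-type dependence), the reduced units, and the function names are assumptions, not values from any specific force field.

```python
import math

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Pair energy 4*eps*[(sigma/r)**12 - (sigma/r)**6] at distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def total_potential(coords, epsilon=1.0, sigma=1.0):
    """Sum the pair energy over all unique pairs of atoms.

    coords is a list of (x, y, z) tuples; all atoms share one
    hypothetical atom type, so a single (epsilon, sigma) pair is used.
    """
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            energy += lennard_jones(r, epsilon, sigma)
    return energy

# Three atoms in reduced units; positions are arbitrary illustrative values.
atoms = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.2, 0.0)]
print(f"total potential energy: {total_potential(atoms):.4f}")
```

A real force field adds bonded terms and electrostatics and uses type-dependent parameters, but the structure is the same: coordinates in, one scalar energy out.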
30 | 31 | #### Force Field 32 | 33 | Force Field, conventionally called Molecular Mechanics (MM) Force Field, refers to a collection of empirical functions with fixed mathematical forms that describe the potential energy of the physical system. Parameters for these empirical functions are determined by fitting against experimental data or QM-derived data. Compared to ab-initio methods, MM force fields are less accurate but much faster (usually by several orders of magnitude). 34 | 35 | #### Hamiltonian 36 | 37 | In the context of classical mechanics, the Hamiltonian refers to the total energy of a physical system, which is the sum of the potential energy and the kinetic energy of all particles within the given system. 38 | 39 |
40 | \begin{aligned} 41 | H&=\sum_i H_i =\sum_i [\frac{p_i^2}{2m}+V(x_i)]\\ 42 | \end{aligned} 43 |
44 | 45 | In quantum mechanics, the Hamiltonian becomes an operator: 46 | 47 |
48 | \begin{aligned} 49 | \hat{H}=\sum_{i} \frac{\hat{p}_i^2}{2m_i} + \hat{V} \\ 50 | \end{aligned} 51 |
52 | 53 | #### Statistical Mechanics 54 | 55 | In physics, statistical mechanics is a sub-discipline which applies statistical methods and probability theory to describe large assemblies of microscopic particles so that macroscopic behavior of the physical system (for instance, temperature, pressure) can be related to the behavior of microscopic particles. 56 | 57 | #### State Function 58 | 59 | A State Function is a physical quantity that describes a macroscopic property of a physical system. State functions have fixed values for a physical system in a given thermodynamic equilibrium and depend only on the current equilibrium state of the system, rather than the path by which the system reached equilibrium. Examples of State Functions include internal energy, enthalpy, entropy, free energy, etc. 60 | 61 | #### Ensemble 62 | Ensemble is a concept in statistical mechanics, which refers to a collection of a large number of independent systems with identical properties and structures, in various microscopic states, under the same macroscopic conditions. 63 | 64 | #### Free Energy 65 | 66 | The thermodynamic free energy refers to the energy of a thermodynamic system that can be used to do external work. It can be used as a criterion for whether a thermodynamic process can proceed spontaneously. Under given constraints, the system always tends to transition to a state with lower free energy. For example, the process of protein folding is the spontaneous transition from an unfolded state with higher free energy to a folded state with lower free energy. Depending on which variables are held constant, it is called the Helmholtz free energy (common notation $F$, constant temperature and volume) or the Gibbs free energy (common notation $G$, constant temperature and pressure). Note: free energy is different from potential energy, although the two are often confused.
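For reference, the two free energies named above are conventionally defined as follows. These are the standard textbook definitions rather than formulas taken from this document; $U$ is the internal energy, $H$ the enthalpy, $S$ the entropy, $T$ the temperature, $p$ the pressure and $V$ the volume.

$$
\begin{aligned}
F &= U - TS \\
G &= H - TS = U + pV - TS
\end{aligned}
$$

A process at constant temperature and volume can proceed spontaneously only if $\Delta F \le 0$; at constant temperature and pressure the corresponding condition is $\Delta G \le 0$.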
67 | 68 | #### Boltzmann Distribution 69 | 70 | In statistical mechanics, the Boltzmann distribution describes the probability of finding a particle of a system in each of its possible microscopic quantum states, and has the following form: 72 | 73 |
74 | \begin{aligned} 75 | p_i\propto\exp\left(-\frac{\varepsilon_i}{kT}\right) \\ 76 | \end{aligned} 77 |
78 | 79 | where $k$ is the Boltzmann constant, $T$ is the temperature, $p_i$ is the probability that the particle is in the $i$-th quantum state, and $\varepsilon_i$ is the energy of the $i$-th quantum state. 80 | 81 | #### Collective Variables (Reaction Coordinates) 82 | 83 | The representative parameters that can quantitatively describe how a system changes during a process are called Collective Variables (CV) or Reaction Coordinates (RC). For example, in the chemical reaction shown in the figure below, the distance between O and C $d(\mathrm{C-O})$ can be regarded as the reaction coordinate, and the distance between C and Br $d(\mathrm{Br-C})$ can also be regarded as the reaction coordinate. 84 | 85 | Given that the reaction coordinates are well defined, methods such as umbrella sampling can be used to estimate the free energy difference between different values of the reaction coordinates through molecular simulation. The free energy change along the reaction coordinates during the transformation can then be described, which is the basis of kinetic and thermodynamic research. 86 | 87 | #### Slow Degrees of Freedom 88 | 89 | During a dynamics simulation, some degrees of freedom change rapidly with time (such as bond lengths and bond angles, usually on the order of fs or ps), while others change slowly with time (such as dihedral angles, usually on the order of ns, $\mu$s, or even ms). The latter are called slow degrees of freedom. 90 | 91 | #### Enhanced Sampling 92 | 93 | Enhanced sampling refers to techniques that accelerate the sampling of slow degrees of freedom during a simulation. They are classified as collective variable-based (e.g. umbrella sampling) or collective variable-free (e.g. replica exchange). 94 | 95 | #### Quantum Mechanics 96 | 97 | Quantum Mechanics is a branch of physics that studies microscopic systems.
By describing the motion and interaction of microscopic particles (such as electrons, protons, etc.), quantum mechanics can explain many experimental phenomena that cannot be explained under the framework of classical mechanics, including blackbody radiation and the spectrum of the hydrogen atom. 98 | 99 | #### Operator 100 | 101 | Generally, an operator acts on the state space of a physical system, making the physical system transform from one state to another. Within the context of quantum mechanics, the state of a system can be described by a state vector. Physical observables (such as position, momentum, Hamiltonian, etc.) all correspond to a (Hermitian) operator. 102 | 103 | #### Schrödinger Equation 104 | 105 | In quantum mechanics, the Schrödinger equation is a partial differential equation that describes the time evolution of the quantum state of a physical system and is the fundamental equation of quantum mechanics. The Schrödinger equation can be divided into two types: the “time-dependent Schrödinger equation” 106 | 107 |
108 | \begin{aligned} 109 | \hat{H}\Psi=i\hbar\frac{\partial}{\partial t}\Psi \\ 110 | \end{aligned} 111 |
112 | 113 | and the “time-independent Schrödinger equation” (also known as the steady-state Schrödinger equation) 114 | 115 |
116 | \begin{aligned} 117 | \hat{H}\Psi&=E\Psi \\ 118 | \end{aligned} 119 |
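120 | where $\hat{H}$ is the Hamiltonian operator, $E$ is the energy of the stationary state, and $\Psi$ is the wave function of the system. For a single particle of mass $m$ in a potential $V$, the Hamiltonian operator takes the form: 121 | 122 |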
123 | \begin{aligned} 124 | \hat{H}&=-\frac{\hbar^2}{2m}\nabla^2+V \\ 125 | \end{aligned} 126 |
127 | 128 | The time-dependent Schrödinger equation describes how the wave function of a quantum system evolves over time, while the time-independent Schrödinger equation describes the physical properties of a stationary quantum system. 129 | 130 | 131 | #### First Principle 132 | 133 | First Principle, also called ab initio, refers to derivation and calculation based on the basic laws of physics without additional assumptions and empirical fitting. An example is the use of the Schrödinger equation to solve for electronic structure. 134 | 135 | #### Wave Function 136 | 137 | In quantum mechanics, the state of a quantum system can be described by a wave function. The wave function $\Psi(\mathbf{r},t)$ is a complex-valued function. According to Born’s statistical interpretation, $|\Psi|^2$ is the probability density of finding a particle at position $\mathbf{r}$ and time $t$. 138 | 139 | #### Born-Oppenheimer Approximation 140 | 141 | The Born-Oppenheimer approximation refers to the approximate separation of the nuclear coordinates from the electronic coordinates when solving quantum mechanical equations for a system of nuclei and electrons. The wave function of the whole system is decomposed into a nuclear wave function and an electronic wave function, which can be solved separately as two relatively simple problems. The basis of this approximation is that the mass of a nucleus is 3 to 4 orders of magnitude larger than that of an electron, and the speed of the nucleus is much smaller than that of the electron. The electrons can therefore be regarded as moving in the potential field formed by stationary nuclei, while the nuclei are not affected by the instantaneous positions of the electrons but only by the average force the electrons exert.
142 | 143 | #### Density Functional Theory 144 | 145 | Density functional theory (DFT) is a quantum mechanical method to study the electronic structure of multi-electron systems, and it is one of the most commonly used methods in the fields of condensed matter physics and computational chemistry. Traditional wave-function methods of electronic structure theory need to solve for the multi-electron wave function, which is high-dimensional ($3N$ spatial variables for a system containing $N$ electrons). The basic idea of density functional theory is to use the electron density, a function of only three spatial variables, instead of the wave function as the basic quantity, thereby reducing the computational complexity. In practice, density functional theory is most commonly implemented with the Kohn-Sham method. 146 | -------------------------------------------------------------------------------- /source/chapters/knowledge_base/references.md: -------------------------------------------------------------------------------- 1 | ## References 2 | 3 | [1] Jakob Schneider, Ksenia Korshunova, Francesco Musiani, Mercedes Alfonso-Prieto, Alejandro Giorgetti, and Paolo Carloni. Predicting ligand binding poses for low-resolution membrane protein models: Perspectives from multiscale simulations. Biochemical and Biophysical Research Communications, 498(2):366–374, 2018. Multiscale Modeling. -------------------------------------------------------------------------------- /source/chapters/molecular_dynamics/AI_in_MD.md: -------------------------------------------------------------------------------- 1 | ## AI in MD 2 | 3 | The current difficulties restricting MD simulation are as follows: 4 | 5 | - Simulation accuracy 6 | 7 | - Errors are introduced by the classical force fields that approximate quantum mechanics. 8 | 9 | - Sampling efficiency 10 | 11 | - This is the most essential problem of MD. It is partly solved by enhanced sampling techniques, but traditional enhanced sampling methods cannot handle high-dimensional problems.
Excessive heating or an aggressively increased bias potential can lead to denaturation of the system. 12 | 13 | Researchers have therefore drawn inspiration from deep learning to try to solve these problems. 14 | 15 | - DeePMD[10] 16 | 17 | - By fitting quantum chemical data with neural networks, a potential energy surface with quantitative accuracy is obtained. Evaluating a neural network is far cheaper than solving the Schrödinger equation, so DeePMD can achieve high-precision molecular dynamics simulations. 18 | 19 | - Neural-network-based enhanced sampling 20 | 21 | - These methods obtain a representation of the high-dimensional free energy surface by fitting mean-force data with neural networks. A bias potential is applied to the system to accelerate the simulation, and the neural networks alleviate the high-dimensional problem caused by multiple CVs. These works include NN-VES, Reinforced Dynamics, etc. 22 | 23 | - CV Discovery 24 | 25 | - Reduce the dimensionality of the data through machine learning methods (SVM, VAE, GAN, etc.) to find the most essential collective variables. TorchCV[11] is a good example of this. 26 | 27 | **Difficulties:** 28 | 29 | - How to generate MD data for training ML models? 30 | 31 | - Possible solutions: active learning, concurrent learning 32 | 33 | - It is currently hard to do end-to-end simulation with efficiency similar to that of well-developed MD software. 34 | -------------------------------------------------------------------------------- /source/chapters/molecular_dynamics/MD_definition.md: -------------------------------------------------------------------------------- 1 | ## What is Molecular Dynamics? 2 | 3 | Molecular Dynamics (MD) studies how atomic coordinates evolve under given conditions. It relies on the framework of **classical mechanics** (also known as **Newtonian mechanics**) and simulates the motion of molecular systems numerically.
For instance, you can imagine the motion of two rigid balls connected by a spring. 4 | 5 | 6 | ### Experiments on a computer 7 | 8 | MD is a computational method, a chemical experiment performed on computers. Let's imagine a chemical experiment in the real world first. One may have to prepare experimental instruments and reagents, set instrument parameters, conduct experiments, wait for chemical reactions to proceed, obtain experimental results, and then analyze experimental results. MD is quite similar; we use a table to compare MD with traditional chemical experiments: 9 | 10 | | Chemical experiment | Molecular dynamics simulation | 11 | |----------------------------------|----------------------------------| 12 | | Prepare experimental instruments | Prepare computing hardware (CPU, GPU, etc.)<br>Prepare calculation software (Gromacs, Lammps, etc.) | 13 | | Prepare experimental reagents | Prepare a file describing the molecular structure as an initial conformation for the simulation process | 14 | | Set instrument parameters | Set simulation parameters (simulation temperature, simulation time, etc.)<br>Set force field parameters (parameters analogous to spring coefficients) | 15 | | Conduct chemical experiments | Run MD simulations | 16 | | Get experimental results | Get the simulated trajectory | 17 | | Analyze experimental results | Analyze physico-chemical properties from the trajectories obtained from simulations (calculation of statistics) | 18 | 19 | 20 |
Table 1: Molecular Dynamics Procedure
21 |
22 | 23 | ### Why do we run MD? 24 | 25 | Molecular dynamics simulations allow us to simulate chemical and physical processes on computers, obtain kinetic information at the microscopic scale, provide theoretical support for experiments and guide chemical experiments. Moreover, computational simulations can help reduce the cost of manually conducting experiments. MD can also be performed under special (usually extreme) conditions (ultra-high pressure, ultra-high temperature, strong electric field, magnetic field, etc.) 26 | 27 | However, its basis in classical mechanics means that MD is only effective for physical processes at the molecular scale (e.g. lengths on the order of $nm$). Classical MD does not work for simulations in which the electronic structure must be treated explicitly. 28 | 29 | - Systems that can be simulated with classical molecular dynamics: 30 | 31 | - Protein systems (protein folding problems, etc.) 32 | 33 | - RNA systems 34 | 35 | - Atomic-scale material systems 36 | 37 | - Free energy calculation 38 | 39 | - Catalytic reactions that do not involve electron transfer 40 | 41 | - Problems that classical MD *cannot* solve (or cannot solve on its own): 42 | 43 | - Magnetic and electrical properties of materials (requires quantum chemistry tools) 44 | 45 | - Accurate energies of molecular conformations (requires quantum chemistry methods; current software includes Gaussian and VASP) 46 | 47 | - Protease-catalyzed reactions involving electron transfer (requires QM/MM methods) 48 | 49 | - Chemical reactions involving electron transfer (requires QM/MM methods) 50 | 51 | ### The inputs and outputs of MD 52 | 53 | The input to a molecular dynamics simulation is an initial conformation of the molecules. It should contain the coordinates of the atoms, the chemical bonds among atoms (also called topology information), etc.
54 | 55 | The outputs of molecular dynamics are molecular trajectories (a trajectory is a continuous molecular motion in coordinate space, i.e. Cartesian space) simulated from the initial conformations. -------------------------------------------------------------------------------- /source/chapters/molecular_dynamics/advanced_example.md: -------------------------------------------------------------------------------- 1 | ## An advanced example 2 | 3 | In this part, we will use an example of a **protein** to illustrate how to conduct MD simulations in real-world research. 4 | 5 | ### Building Model 6 | 7 | Again, we use the Born-Oppenheimer (BO) approximation to construct models for proteins. These models include information on bond types, atom types, and force field parameters for various interactions. We usually need the following files: 8 | 9 | - topology (`topol.top` for example): contains molecular bonding information, molecular type information, and atomic type information 10 | 11 | - force field (`forcefield.itp` for example): contains chemical bond equilibrium positions, chemical bond force constants, non-bonded interaction parameters, etc. 12 | 13 | - Commonly used force fields are as follows: 14 | 15 | - Amber 16 | 17 | - Charmm 18 | 19 | - Gromos 20 | 21 | - OPLS 22 | 23 | - So far there is no unified database or shared maintenance process for force fields; each force field is maintained by its own company or organization. Developing a force field is a difficult and nuanced process. 24 | 25 | ### Solvent 26 | 27 | Unlike systems in vacuum, proteins generally exist in solvents. We therefore need to account for and model the solvent. We can do this in two ways: 28 | 29 | - **Explicit solvent model.** It is often used in all-atom simulations, that is, directly introducing solvent molecules into the system. The force field parameters for solvent molecules are usually included in the force field file. Commonly used models for water are SPC, TIP3P, TIP4P, etc.
30 | 31 | - Advantages: Close to the actual physical process, so the results are more accurate. It can describe solvent-involved processes (e.g. protein-ligand binding). It can also explicitly describe solvent effects such as hydrogen bonding. 32 | 33 | - Disadvantages: high computational complexity. A system usually contains hundreds or thousands of solvent molecules. 34 | 35 | - **Implicit solvent model.** The effect of the solvent on the solute is described by a continuum model. The Generalized Born model is a commonly used example. 36 | 37 | - Advantages: low computational complexity, no need to introduce additional solvent molecules. 38 | 39 | - Disadvantages: less precise; solvent-involved reactions cannot be described. 40 | 41 | 42 | 43 | 1. **Periodic Boundary Conditions**: 44 | 45 | Due to limited computational resources it is impossible for us to simulate an infinitely large system, or to simulate an infinite number of steps. For a protein system, tens of thousands of atoms already require a lot of computation (the required calculation time is measured in days), but this is still far fewer than the number of particles in a macroscopic sample, which is on the order of Avogadro's constant ($10^{23}$). 46 | 47 | However, a small simulated system is strongly affected by its interfaces and cannot reflect the properties of the bulk phase. 48 | 49 | To solve this problem, we often use **periodic boundary conditions**. We confine the system of interest in a box and assume the properties of the actual system can be approximated by a virtual infinite system of identical boxes repeated side by side. If a molecule passes through the box boundary, it re-enters the box from the opposite boundary, forming a periodic space. 50 | 51 | 52 |
53 | 54 |
55 |
Figure 1: Periodic system
56 |
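The re-entry rule described for periodic boundary conditions can be sketched in a few lines. This is only an illustration under stated assumptions: the box is orthorhombic (rectangular) with its origin at a corner, and the function names `wrap` and `minimum_image` are hypothetical, not taken from any MD package.

```python
def wrap(coord, box):
    """Map a coordinate back into [0, L) along each axis of the box."""
    return tuple(c % length for c, length in zip(coord, box))

def minimum_image(ri, rj, box):
    """Displacement from ri to the nearest periodic image of rj."""
    disp = []
    for a, b, length in zip(ri, rj, box):
        d = b - a
        d -= length * round(d / length)  # shift by whole box lengths
        disp.append(d)
    return tuple(disp)

box = (10.0, 10.0, 10.0)  # illustrative box edge lengths

# An atom that left the box through one face re-enters from the opposite face.
print(wrap((10.5, -0.5, 3.0), box))

# Two atoms near opposite faces are actually close neighbours.
print(minimum_image((0.5, 0.0, 0.0), (9.5, 0.0, 0.0), box))
```

The second function matters for force calculations: distances between atoms are measured to the nearest periodic image, so atoms on opposite sides of the box still interact across the boundary.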
57 | 58 | 2. **Preparation for simulation** 59 | 60 | Next, you need to prepare the conformation files of the protein and water molecules. Typically, protein structure files are obtained in PDB format and then converted into a format that can be read by molecular dynamics software. 61 | 62 | We also need to prepare simulation parameter files, in which you need to set: 63 | 64 | - temperature 65 | 66 | - time step and duration of the simulation 67 | 68 | - the integrator/numerical algorithm, the ensemble for simulations 69 | 70 | - the temperature/pressure controller 71 | 72 | - the output frequency and content of output files 73 | 74 | 3. **Simulation** Commonly used MD simulation software in current research includes: 75 | 76 | - Amber [1] 77 | 78 | - Commercial software. There is a free version and a high-performance optimized commercial version. The code is closed source and highly engineered. It supports most MD functions and plays an important role in the simulation of protein systems. Amber is often used in simulations of biological systems. 79 | 80 | - Gromacs [2] 81 | 82 | - Open-source MD simulation software. At the time of writing, the latest release is Gromacs 2022. It has efficient GPU optimization and a good developer community. It is mostly used for biological system simulation. 83 | 84 | - OpenMM [3] 85 | 86 | - Molecular dynamics simulation software with Python interfaces. Simulations are set up by calling functions from Python, which makes it possible to combine them directly with deep learning frameworks such as PyTorch. However, its compatibility has not been fully developed. Mostly used for biological system simulation. 87 | 88 | - Lammps [4] 89 | 90 | - MD software for material simulations. 91 | 92 | Most MD software is heavily optimized for GPU devices, which can provide greater efficiency than CPU devices. A related tool, CHARMM-GUI, is commonly used for preparing structures and input files for MD rather than for running the simulations themselves.
93 | 94 | In addition, most software is compatible with force fields in other formats; for instance, Gromacs is compatible with Amber or Charmm force fields. -------------------------------------------------------------------------------- /source/chapters/molecular_dynamics/enhanced_sampling.md: -------------------------------------------------------------------------------- 1 | ## Rare Events and Enhanced Sampling 2 | 3 | - Molecular dynamics relies on time reversibility and the ergodic hypothesis. We assume that all states of a molecule have a probability to be explored (or traversed/sampled) after a sufficiently long simulation. These states include ground states, meta-stable states, and some high-energy states (unstable states). When the simulation reaches equilibrium, the distribution of molecular conformations in the system satisfies **the Boltzmann distribution** under its **ensemble**. 4 | 5 | - Different states have different probabilities to be sampled. In most cases, the molecules are stuck in a local minimum on the energy surface, and it is difficult to jump over the energy barriers. Therefore, in finite-time simulations, the probability of sampling a high-energy state or another meta-stable state separated by an energy barrier is very low. These are **rare events**. 6 | 7 | - This can be compared with a Monte Carlo (MC) search on a high-dimensional function, which starts from a random point and explores the function to find its global minimum. This can take a lot of time and computational resources. 8 | 9 | - To increase the probability of rare events occurring in MD simulations, the simulation process can be modified using various methods: 10 | 11 | - Enhanced sampling based on temperature: 12 | 13 | - Raise the temperature so that energy barriers are crossed more easily, which increases the probability of rare events. 14 | 15 | - Replica-Exchange Molecular Dynamics (REMD)[5], selective integrated tempering sampling (SITS)[6], etc.
16 | 17 | - Enhanced sampling based on bias potential: 18 | 19 | - Collective variables (CVs) are functions of the system coordinates. The free energy is defined on the collective variables. (Please refer to the difference between free energy and potential energy) 20 | 21 | - Add a bias potential on the given CVs during the simulation, which can push the trajectory out of the local minima on the energy surface to explore other states. 22 | 23 | - Metadynamics[7], VES[8], RiD[9], etc. 24 | 25 | - Traditional enhanced sampling methods based on bias potential suffer from **the curse of dimensionality.** -------------------------------------------------------------------------------- /source/chapters/molecular_dynamics/index.rst: -------------------------------------------------------------------------------- 1 | Molecular Dynamics 2 | ============================= 3 | 4 | This tutorial aims to equip you with knowledge about molecular dynamics and answer the following questions: (1) what molecular dynamics is, (2) why we run molecular dynamics, (3) how molecular dynamics works and (4) how to run a real molecular dynamics process. 5 | 6 | ..
toctree:: 7 | :maxdepth: 2 8 | :caption: Contents: 9 | 10 | preperation.md 11 | MD_definition.md 12 | simple_example.md 13 | advanced_example.md 14 | enhanced_sampling.md 15 | AI_in_MD.md 16 | references.md 17 | 18 | -------------------------------------------------------------------------------- /source/chapters/molecular_dynamics/preperation.md: -------------------------------------------------------------------------------- 1 | ## Before you start 2 | 3 | - This tutorial assumes that you have already learned: 4 | 5 | - Basic physics and chemistry knowledge 6 | 7 | - Calculus 8 | 9 | - If you want to deeply understand the principles and roles of molecular dynamics, you need to master the following: 10 | 11 | - Mechanics, theoretical mechanics (analytical mechanics) 12 | 13 | - Statistical Mechanics 14 | 15 | - Physical Chemistry 16 | 17 | - Numerical Methods / Computational Physics / Computational 18 | Methods 19 | 20 | - Further topics: 21 | 22 | - Quantum Chemistry 23 | 24 | - Biochemistry or Solid State Physics, depending on the 25 | specific application -------------------------------------------------------------------------------- /source/chapters/molecular_dynamics/references.md: -------------------------------------------------------------------------------- 1 | ## References 2 | 3 | [1] Romelia salomon ferrer, David Case, and Ross Walker. An overview of the amber biomolecular simulation package. Wiley Interdisciplinary Reviews: Computational Molecular Science, 3, 03 2013. 4 | 5 | [2] Henk Bekker, Herman Berendsen, E.J. Dijkstra, S. Achterop, Rudi Drunen, David van der Spoel, A. Sijbers, H. Keegstra, B. Reitsma, and M.K.R. Renardus. Gromacs: A parallel computer for molecular dynamics simulations. Physics Computing, 92:252–256, 01 1993. 6 | 7 | [3] Peter Eastman, Jason Swails, John Chodera, Robert Mcgibbon, Yutong Zhao, Kyle Beauchamp, Lee-Ping Wang, Andrew Simmonett, Matthew Harrigan, Chaya Stern, Rafal Wiewiora, Bernard Brooks, and Vijay Pande. 
Openmm 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Comput Biol, 06 2017. 8 | 9 | [4] Aidan Thompson, H. Metin Aktulga, Richard Berger, Dan Bolintineanu, W. Brown, Paul Crozier, Pieter in ’t Veld, Axel Kohlmeyer, Stan Moore, Trung Nguyen, Ray Shan, Mark Stevens, J. Tranchida, Christian Trott, and Steven Plimpton. Lammps - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Computer Physics communications, 271:108171, 09 2021. 10 | 11 | [5] Zhou Ruhong. Replica exchange molecular dynamics method for protein folding simulation. Methods in molecular biology (Clifton, N.J.), 350:205–23, 02 2007. 12 | 13 | [6] Lijiang Yang and Yi Gao. A selective integrated tempering method. The Journal of chemical physics, 131:214109, 12 2009. 14 | 15 | [7] Alessandro Barducci and Massimiliano Bonomi. Metadynamics. WIREs Comput. Mol. Sci., 1:826, 09 2011. 16 | 17 | [8] Luigi Bonati, Yue-Yu Zhang, and Michele Parrinello. Neural networks-based variationally enhanced sampling. Proceedings of the National Academy of Sciences, 116:201907975, 08 2019. 18 | 19 | [9] Dongdong Wang, Yanze Wang, Junhan Chang, Linfeng Zhang, Han Wang, et al. Efficient sampling of high-dimensional free energy landscapes using adaptive reinforced dynamics. Nature Computational Science, 2(1):20–29, 2022. 20 | 21 | [10] Linfeng Zhang, Jiequn Han, Han Wang, Roberto Car, and Weinan E. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. July 2017. 22 | 23 | [11] Luigi Bonati, Valerio Rizzi, and Michele Parrinello. Data-driven collective variables for enhanced sampling. The Journal of Physical Chemistry Letters, 11:2998–3004, 04 2020. 
24 | -------------------------------------------------------------------------------- /source/chapters/molecular_dynamics/simple_example.md: -------------------------------------------------------------------------------- 1 | ## Starting with a simple example 2 | 3 | Here we consider a simple example: a molecular dynamics simulation of a single hydrogen molecule ($H-H$) in vacuum. 4 | 5 | ### Basic knowledge 6 | 7 | Let’s go over some basic knowledge before simulating: 8 | 9 | - Molecules have an associated energy, which can be divided into potential energy (denoted by $V$) and kinetic energy (denoted by $T$). Potential energy depends on the coordinates of the atoms in the molecule, while kinetic energy depends on the velocities (or momenta) of the atoms. 10 | 11 |
12 | \begin{aligned} 13 | E &= T + V \\ 14 | \end{aligned} 15 |
16 | 17 |
18 | \begin{aligned} 19 | E &= \frac{1}{2} m v^2+V(\vec{q}) \text{ or } E = \frac{p^2}{2m} + V(\vec{q}) \\ 20 | \end{aligned} 21 |
22 | 23 | where $\vec{q}$ represents the coordinates of the atoms and $p$ represents their momenta. 24 | 25 | - Force is the negative derivative of the energy with respect to the coordinates. Since kinetic energy does not depend on positions (coordinates), the force is also equal to the negative derivative of the potential energy with respect to the coordinates. 26 | 27 |
28 | \begin{aligned} 29 | F_i &= -\frac{\partial E}{\partial q_i} = - \frac{\partial V}{\partial q_i} \\ 30 | \end{aligned} 31 |
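As a minimal sketch of the relation above, the snippet below compares a numerical derivative with the closed-form force for an assumed toy potential. The harmonic form $V(x) = \frac{1}{2}k(x-x_0)^2$, the values `k = 0.1` and `x0 = 0.5`, and all function names are illustrative assumptions, not part of any MD package.

```python
def harmonic_potential(x, k=0.1, x0=0.5):
    """Assumed toy potential: V(x) = 1/2 * k * (x - x0)^2."""
    return 0.5 * k * (x - x0) ** 2


def numerical_force(potential, x, h=1e-5):
    """Force as the negative derivative of the potential (central difference)."""
    return -(potential(x + h) - potential(x - h)) / (2.0 * h)


def analytic_force(x, k=0.1, x0=0.5):
    """Closed-form force for the harmonic potential: F = -k * (x - x0)."""
    return -k * (x - x0)


# Evaluate both at a coordinate slightly beyond the minimum of the potential.
force_numeric = numerical_force(harmonic_potential, 0.7)
force_exact = analytic_force(0.7)
print(force_numeric, force_exact)
```

Both values agree up to rounding error, and the negative sign shows that the force points back towards the minimum of the potential.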
32 | 33 | ### The construction of theoretical models - generating force fields 34 | 35 | #### **Why do we need force fields?** 36 | 37 | To simulate the motion of a hydrogen molecule, we need to know the forces exerted on the hydrogen atoms at different positions. These forces can be used to calculate the changes in velocity and coordinates. This is equivalent to knowing the potential energy surface of the $H_2$ molecule. 38 | 39 | How do we get the potential energy surface? The most straightforward way is to calculate the energy by solving the Schrödinger equation, and then calculate the forces from its derivatives. Then our simulation process becomes: 40 | 41 | ![Force field flow chart.](https://dp-public.oss-cn-beijing.aliyuncs.com/community/molecular_dynamics/force_field.jpg) 42 | 43 | Every time we reach a new position in coordinate space, we need to solve the Schrödinger equation again. This is very time-consuming and laborious for complex systems. However, to calculate the motion of the hydrogen atoms, we don't need detailed information about the electrons in each hydrogen atom, only the positions of the nuclei. This is where **the Born-Oppenheimer approximation** comes in. 44 | 45 | **The Born-Oppenheimer approximation** is a quantum chemical approximation that separates the motion of nuclei and electrons. Since nuclei are much heavier than electrons, they move much more slowly. We can therefore assume that the motion of the nuclei is only affected by the mean field of the electrons. 46 | 47 | Simply speaking, we only need to know the positions of, and the forces on, the hydrogen nuclei. The influence of the electrons can be represented approximately by a few force field parameters. For example, we can approximate the hydrogen molecule with a spring model, and regard the interaction mediated by the electrons (the chemical bond) as a spring. The force generated by the chemical bond is described by the spring coefficient. In this way, we transform the problem of solving the Schrödinger equation into calculating the force of a spring: 48 | 49 |
50 | \begin{aligned} 51 | F &= -k(x-x_0) \\ 52 | \end{aligned} 53 |
54 | 55 | where $k$ represents the spring coefficient, $x$ represents the current bond length, and $x_0$ represents the equilibrium bond length, so that $x - x_0$ is the offset from the equilibrium position. As a result, calculations are greatly simplified. Here, **$k$ (spring coefficient, or force constant)** and **$x_0$ (equilibrium position)** are simplified **force field parameters**, also known as the **force field**. 56 | 57 | Note that such a simplification can significantly reduce the accuracy of MD simulations: a harmonic model only reflects the behaviour near equilibrium states. For macroscopic statistics (e.g. chemical shifts, protein contacts), however, it is usually a sufficient description. 58 | 59 | #### **The actual force field parameters** 60 | 61 | In real MD simulations, almost all inter-atomic and inter-molecular forces are approximated by analytical expressions from classical mechanics. The **empirical parameters are obtained by fitting to experimental data or to data from quantum calculations** (for example, by gradient descent optimization). Usually, a force field is written as a sum of terms such as: 62 | 63 |
64 | \begin{aligned} 65 | U &=\sum_{\text {bonds }} \frac{1}{2} k_{b}\left(r-r_{0}\right)^{2}+\sum_{\text {angles }} \frac{1}{2} k_{a}\left(\theta-\theta_{0}\right)^{2}+\sum_{\text {torsions }} \frac{V_{n}}{2}[1+\cos (n \phi-\delta)] + \\ 66 | \end{aligned} 67 |
68 | 69 |
70 | \begin{aligned} 71 | \sum_{\text {improper }} V_{i m p}+\sum_{\mathrm{LJ}} 4 \epsilon_{i j}\left(\frac{\sigma_{i j}^{12}}{r_{i j}^{12}}-\frac{\sigma_{i j}^{6}}{r_{i j}^{6}}\right)+\sum_{\text {elec }} \frac{q_{i} q_{j}}{r_{i j}} \\ 72 | \end{aligned} 73 |
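Three of the terms in the expression above (harmonic bond, Lennard-Jones and electrostatics) can be sketched for a single pair of atoms as follows. This is only an illustration: the function names and all parameter values are hypothetical, and the Coulomb prefactor is taken as 1 to match the formula as written.

```python
def bond_energy(r, k_b, r0):
    """Harmonic bond term: 1/2 * k_b * (r - r0)^2."""
    return 0.5 * k_b * (r - r0) ** 2


def lennard_jones_energy(r, epsilon, sigma):
    """Lennard-Jones term: 4 * epsilon * ((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)


def coulomb_energy(r, q_i, q_j):
    """Electrostatic term q_i * q_j / r (prefactor absorbed into the units)."""
    return q_i * q_j / r


# Hypothetical parameters for one bonded pair and one nonbonded pair.
u_bond = bond_energy(r=0.7, k_b=0.1, r0=0.5)
u_nonbonded = lennard_jones_energy(r=3.5, epsilon=0.2, sigma=3.0) + coulomb_energy(
    r=3.5, q_i=0.4, q_j=-0.4
)
print(u_bond, u_nonbonded)
```

In a full force field these functions would be summed over all bonds and all nonbonded pairs, together with the angle, torsion and improper terms that are omitted here.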
74 | 75 | These interactions include: 76 | 77 | - Short-range interactions: 78 | 79 | - bonds 80 | 81 | - angles 82 | 83 | - torsion (dihedral angles) 84 | 85 | - Long-range interactions: 86 | 87 | - Electrostatic interactions (Coulomb interactions, elec) 88 | 89 | - Van der Waals 90 | 91 | ### Preparation for simulation 92 | 93 | Now, we have modeled a hydrogen molecule. For simplicity, let the force constant of the hydrogen molecule be $0.1 kJ/\mathring A^2$, let the equilibrium bond length be $0.5\mathring A$, and assume the mass of the hydrogen atom to be $1$ (just for demonstration). 94 | 95 | - To start the simulation, we also need the initial conditions (initial conformation and initial velocity). Also for simplicity, we assume that the coordinates of the two atoms are [0.7, 0, 0], [0, 0, 0] (Unit: angstrom), and the velocities are [-0.1, 0, 0], [0.1, 0, 0] (Unit: angstrom/fs). 96 | 97 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/molecular_dynamics/example1.jpg) 98 | 99 | - Next, let's set the simulation parameters: since we solve the differential equation numerically, we need to choose the simulation time step, which is generally set to 2 fs (this is the grid size along the time dimension in the finite difference method). 100 | 101 | - Set the simulation to 5 steps, so the total simulation time is 10 fs. 102 | 103 | - Set the simulated temperature to 300 K (this affects the velocity initialization and force field parameters, so it should be set before velocity initialization). 104 | 105 | Ready? Let’s simulate! 106 | 107 | ### Simulation 108 | 109 | (Please be aware that units are omitted for simplicity) 110 | 111 | 1. Calculate the force on the hydrogen atoms from the initial positions: 112 | 113 |
114 | \begin{aligned} 115 | |F_x|&=k\,|x-x_0|=0.1\times(0.7-0.5)=0.02 \\ 116 | \end{aligned} 117 |
 118 | 119 | 2. Calculate the accelerations (the bond is stretched, so the force pulls the two atoms towards each other; the values below follow the atom at the origin, which is accelerated along $+x$): 120 | 121 |
 122 | \begin{aligned} 123 | a_x &= \frac{|F_x|}{m}=\frac{0.02}{1}=0.02 \\ 124 | \end{aligned} 125 |
 126 | 127 | 3. Update velocities: 128 | 129 |
 130 | \begin{aligned} 131 | \Delta v_x=a_x\times \Delta t=0.02 * 2=0.04 \\ 132 | \end{aligned} 133 |
134 | 135 |
136 | \begin{aligned} 137 | \Delta x = v_x \times \Delta t=0.1 * 2=0.2 \\ 138 | \end{aligned} 139 |
140 | 141 | 4. Update the velocities and coordinates with the increments calculated above to obtain the new velocities and coordinates. 142 | 143 | 5. (Repeat) Calculate the new force 144 | 145 | 6. (Repeat) Calculate the new acceleration 146 | 147 | ![image](https://dp-public.oss-cn-beijing.aliyuncs.com/community/molecular_dynamics/example2.jpg) 148 | 149 | ### Results 150 | 151 | Finally, we obtain the positions of the H atoms at different times, and these “continuous” motions compose the MD trajectory. 152 | 153 | Through long MD simulations, we can calculate **physical and chemical** properties from the trajectories (such as RMSD changes, conformation transitions, ensemble averages of physical quantities, etc.). 154 | 155 | **Notes**: Here, the equations of Newtonian mechanics are used for the purposes of demonstration. MD simulation is **usually more complicated**. Depending on the required **ensemble**, the equations of motion derived from the corresponding Hamiltonian are integrated numerically. 156 | 157 | We will not introduce the concept of the ensemble in detail here. If you are interested in it, please read the related content about **statistical mechanics**. 158 | -------------------------------------------------------------------------------- /source/chapters/scientific_discovery_in_the_era_of_AI/AI_bring.md: -------------------------------------------------------------------------------- 1 | ## What does AI bring to the scientific community? 2 | 3 |
4 | 5 |
6 |
Figure 3: Scientific discovery process
7 |
8 | 9 | #### Traditional Discovery Process in Science (a day of a scientist) 10 | 11 | - Goal - Drug design (AI can help in different phases of drug design) 12 | 13 | - Hypothesis - Hand-crafted design (AI can help search the vast chemical space and propose drug candidates) 14 | 15 | - Simulation - Computer simulation (AI can accelerate and improve the accuracy of computer simulation) 16 | 17 | - Experiment - Lab experiment (AI can guide lab experiment design) 18 | 19 | - Result analysis - Statistical analysis (AI can analyze high-dimensional data) 20 | 21 | - Repeat 22 | 23 | - Conclusion & Finding 24 | 25 | Goals | Traditional Methods | AI-Based Methods 26 | ------------------|---------------------------------------------------------------------------------|--------------------------------------------------------------------------- 27 | Drug Design | - Error-prone and intuition-based workflows | - AI-aided rational drug design 28 | Virtual Screening | - Docking (based on physical scoring functions)<br>- 2D/3D/pharmacophore-based similarity search | - Neural network-based scoring functions<br>- Generative models 29 | Lead Optimization | - MD-based free energy calculation with empirical force field<br>- Wet-lab experiment | - AI-predicted binding affinity<br>- Simulation with more accurate force fields developed with neural networks 30 | Drug Synthesis | - Error-prone experiments to find optimal synthesis route and reaction conditions | - Retro-synthesis analysis by AI models<br>- AI-predicted reaction outcomes<br>- Automated wet-lab experiment 31 | ADMET | - Experiments | - AI-based prediction models 32 | 33 |
Table 1: Examples of AI methods in assisting the drug discovery process
34 |
35 | 36 | #### Even decades ago, AI was widely used in the scientific community 37 | 38 | - Example: principal component analysis (PCA), linear regression, Kalman filter, clustering algorithms, etc. 39 | 40 | #### Data analysis empowered by modern AI 41 | 42 | - **Why has AI become so popular recently? (What changed the game?)** 43 | 44 | - **Accumulated Big Data** 45 | 46 | - **Advanced Algorithms (especially the Rise of Deep Learning)** 47 | 48 | - **Improved Computing Power and Storage** 49 | 50 | - **What is data?** Data are individual facts, statistics, or items of information, often numeric. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects. 51 | 52 | - **What is learning?** Learning refers to machine learning, which is the study of computer algorithms that improve automatically through experience. 53 | 54 | - **Why do we learn from data?** Learning from accumulated data enables us to analyze data and execute certain tasks. 55 | 56 | - **What are common tasks that could be solved by AI?** 57 | 58 | - **Predictive tasks** refer to predicting the value or status of something of interest. 59 | 60 | - Example: predict whether an image shows a cat or a dog. 61 | 62 | - **Generative tasks** refer to generating new data by learning from existing data. 63 | 64 | - Example: generate new drug-like molecules. 65 | 66 | - **Decision-making tasks** refer to making decisions based on the information provided. 67 | 68 | - Example: decide on trading strategies for the stock market. 69 | 70 | - **How do we learn from data?** Two essential components of learning from data are (1) data and (2) learning. 71 | 72 | - **Data** has two main components: data points and labels. Data points are the single instances of facts, statistics, or items of information, while labels are meaningful and informative tags for the data points.
(Note that labels are often expensive to obtain, so most data are unlabelled.) 73 | 74 | - Data Points: $X$ can be in any format: text, image, graph, etc. 75 | 76 | - Example: images of cats and dogs 77 | 78 | - Labels: $Y$ are informative tags 79 | 80 | - Example: cat or dog 81 | 82 | - **Learning** essentially involves three broad categories: **Supervised Learning**, **Unsupervised Learning**, and **Reinforcement Learning**. We include an additional learning paradigm, **Active Learning**, which is commonly used in scientific discovery. 83 | 84 | - **Supervised Learning** learns with labeled data, usually for predictive tasks, since labels allow us to directly assess the predictive performance of a machine learning model. A special case is Semi-supervised Learning, where only part of the data is labeled. 85 | 86 | - Example: predicting molecular properties from molecular structures 87 | 88 | - **Unsupervised Learning** learns with unlabeled data and discovers patterns in the data, usually for clustering, dimensionality reduction, and visualization tasks. Another rising topic, Self-supervised Learning, also requires no labeled data: it trains models to predict “missing” or masked parts of the input, and it focuses more on predictive tasks, similar to supervised learning. 89 | 90 | - Example 1: clustering galaxy images with similar patterns 91 | 92 | - Example 2: clustering molecular conformations with similar patterns, which reduces the workload of MD analysis 93 | 94 | - **Reinforcement Learning** concerns how intelligent agents ought to take actions in an environment in order to maximize a notion of cumulative reward. 95 | 96 | - Example: AI agent for nuclear fusion reactor control 97 | 98 | - **Active Learning** is a learning paradigm in which the algorithm can request the labels (or propose the experiments) that would be most useful for improving its predictive performance. It is also referred to as optimal experimental design.
99 | 100 | \- Example: uncertainty estimation to guide the experiment/data collection 101 | 102 | - **How do we collect or represent data?** 103 | 104 | - **Representation Learning** is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task. 105 | 106 | - **Tabular Representation (columns are features)** 107 | 108 | \- Example: weather data stored in tables 109 | 110 | - **Grid Representation (stored in grids)** 111 | 112 | \- Example: cell images 113 | 114 | - **Sequence Representation (stored in sequences)** 115 | 116 | \- Example: genes 117 | 118 | - **Graph Representation (stored in nodes and edges)** 119 | 120 | \- Example: molecular graphs 121 | 122 | - **Geometry Representation (stored in 3D geometries)** 123 | 124 | \- Example: protein structures 125 | 126 | - **Could we generate new data?** 127 | 128 | - **Generative Modeling** learns the distributions of observed samples and generates unseen samples. 
129 | 130 | - Example: goal-oriented molecule generation, protein conformations sampling 131 | -------------------------------------------------------------------------------- /source/chapters/scientific_discovery_in_the_era_of_AI/AI_roadmap.md: -------------------------------------------------------------------------------- 1 | ## AI Systematic Learning Roadmap 2 | 3 | - [AI for Everyone](https://www.coursera.org/learn/ai-for-everyone) 4 | 5 | - [Python (Programming)](https://cs50.harvard.edu/x/2022/) 6 | 7 | - [Calculus](https://ocw.mit.edu/courses/mathematics/18-01sc-single-variable-calculus-fall-2010/syllabus/) 8 | 9 | - [Linear Algebra](https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/syllabus/) 10 | 11 | - [Discrete Math](https://www.eecs70.org/) 12 | 13 | - [Probabilistic Statistics](https://inst.eecs.berkeley.edu/~ee126/fa20/content.html) 14 | 15 | - [Machine Learning](https://www.coursera.org/learn/machine-learning) 16 | 17 | - [Deep Learning](https://www.coursera.org/specializations/deep-learning) 18 | 19 | - Specialized/Advances Topics 20 | 21 | - [Machine Learning with Graphs](http://web.stanford.edu/class/cs224w/) 22 | 23 | - [Computer Vision](http://cs231n.stanford.edu/) 24 | 25 | - [Reinforcement Learning](http://rail.eecs.berkeley.edu/deeprlcourse/) 26 | 27 | - More... -------------------------------------------------------------------------------- /source/chapters/scientific_discovery_in_the_era_of_AI/artificial_intelligence.md: -------------------------------------------------------------------------------- 1 | ## What is Artificial Intelligence? 2 | 3 | Many words describe areas closely related to AI, sometimes they could be called AI generally, but we illustrate their relationships below. (examples shown in each level are disjoint examples from the overlapping subjects): 4 | 5 |
6 | 7 |
8 |
Figure 2: Relationship of words referring to AI
9 |
10 | 11 | - **Artificial Intelligence** is a generic term for intelligence demonstrated by machines. It covers a broad set of methods, from traditional reasoning and planning methods to modern machine learning approaches. 12 | 13 | - **Machine Learning** is the study of computer algorithms that can improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. 14 | 15 | - **Deep Learning** is a type of machine learning method that leverages artificial neural networks trained with backpropagation for representation learning. 16 | 17 | - **Statistics** is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. It overlaps heavily with AI in terms of methodology. 18 | 19 | - **Data Mining** is the process of extracting and discovering patterns in large data sets, involving methods at the intersection of AI, statistics, and database systems (access and storage of data). -------------------------------------------------------------------------------- /source/chapters/scientific_discovery_in_the_era_of_AI/how_ai_work.md: -------------------------------------------------------------------------------- 1 | ## How does AI work? 2 | 3 | 4 |
5 | 6 |
7 |
Figure 4: AI pipeline
8 |
9 | 10 | #### Typical pipeline 11 | 12 | - **Problem Formulation** is an essential step to formulate a problem in a “machine learning language” 13 | 14 | - What is the input? What is the output? What is the task? 15 | 16 | - Input: images of cats and dogs; 17 | 18 | - Output: whether the images are cats or dogs; 19 | 20 | - Task: Predictive or Classification task. 21 | 22 | - What is the input representation? What is the output representation? What is the objective function? 23 | 24 | - Input representation: Images. 25 | 26 | - Output representation: Scalars. 27 | 28 | - Objective function: measurement between the predicted labels and ground truth labels (cross-entropy loss is very common for classification tasks) 29 | 30 | - **Data Preparation/Processing** is conducted after problem formulation. The data could be accumulated, or specifically curated for the formulated problem. It is utilized to collect and manipulate the data to produce meaningful information under the formulation of the problem. 31 | 32 | - **Data Representation** is another important step to represent data in a machine-readable (or numeric) format. The type of representation is also critical to model choice and which type of specific information it aims to capture. 33 | 34 | - **Model Choice** is another important step in the pipeline which determines the key model used to learn to fulfill the task. It mainly includes traditional machine learning models and deep learning models. In addition to the model choice itself, for a model to learn from data efficiently, we often need to design a measurement or objective function such that the model can be aware of how good or bad it is performing, then an optimizer is needed to adjust the model accordingly. 35 | 36 | - **Traditional ML Models** (Modeling data w/ limited structured inductive bias (i.e. data is not always assumed to be in certain structure like graph)) 37 | 38 | - **Random Forests, Support Vector Machine, Gradient Boosting,** etc. 
39 | 40 | - **Deep Learning Models** (Modeling data with structured inductive bias) 41 | 42 | - **Multi-layer Perceptron (MLP)** models all types of data (without structured inductive bias) 43 | 44 | - **Convolutional Neural Network (CNN)** models grid data 45 | 46 | - **Recurrent Neural Network (RNN)** models sequence data 47 | 48 | - **Graph Neural Network (GNN)** models graph data 49 | 50 | - **Transformer** was originally designed to model sequence data, but was later adapted to model all types of data 51 | 52 | - **Objective/Loss Function** 53 | 54 | - Mean-squared error loss for regression tasks 55 | 56 | - Cross-entropy loss for classification tasks 57 | 58 | - More to read: Common loss functions in machine learning [5] 59 | 60 | - **Optimizer** is the algorithm used to minimize the objective/loss function by updating the parameters of the machine learning model. More to read: [6] 61 | 62 | - Common optimizers include SGD, Adam, RMSProp, etc. 63 | 64 | - **Evaluation/Result Analysis** is conducted to evaluate the performance of the model and provide feedback to improve the whole pipeline. 65 | 66 | - Training/Validation/Testing Set Evaluation (Common procedure: fit the model on the training set, select the settings that perform best on the validation set, and report the result on the testing set, which mimics the real-world scenario in which unseen/new data arrive) 67 | 68 | - Evaluation Metrics measure the performance of the model. 69 | 70 |
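The objective function and optimizer steps of the pipeline above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the function names, the one-parameter model `y = w * x`, the data, and the learning rate are all hypothetical, and full-batch gradient descent is used as the simplest relative of SGD.

```python
import math


def mean_squared_error(predictions, targets):
    """Mean-squared error loss, commonly used for regression tasks."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def cross_entropy(probabilities, true_class):
    """Cross-entropy for one example: -log(probability of the true class)."""
    return -math.log(probabilities[true_class])


def gradient_descent_fit_slope(xs, ys, learning_rate=0.1, steps=200):
    """Fit y = w * x by full-batch gradient descent on the MSE loss."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the MSE loss with respect to the single parameter w.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w


# Hypothetical toy data generated from y = 2 * x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]
w = gradient_descent_fit_slope(xs, ys)
final_loss = mean_squared_error([w * x for x in xs], ys)
print(w, final_loss)
```

In practice these pieces come from a deep learning framework, and the fitted model would be selected on a validation set and reported on a held-out testing set, as described in the evaluation step.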
71 | 72 |
73 |
Figure 5: How AI predicts protein structures by AlphaFold2
74 |
75 | 76 | #### Real-world Example (Protein Structure Prediction - Alphafold2) 77 | 78 | - Problem Formulation 79 | 80 | - Input: protein sequence (a sequence of N amino acids) 81 | 82 | - Output: protein structure (coordinates of amino acids in 3D space -\> \(N \times 3\)) 83 | 84 | - Task: predictive or regression task 85 | 86 | - Data Preparation 87 | 88 | - Accumulated protein structures from Protein Data Bank (PDB) (sequence, structure pairs) 89 | 90 | - Searching Multiple Sequence Alignment (MSA) for each protein sequence (demonstrated to help with learning coevolutionary information) 91 | 92 | - Accumulated protein templates (existing templates for some proteins) 93 | 94 | - Model Choice - Deep Learning Models 95 | 96 | - Transformers in modeling MSA embeddings and producing pairwise and single-sequence features 97 | 98 | - Transformers in modeling pairwise and single-sequence features and output structures 99 | 100 | - Objective Function 101 | 102 | - Cross-entropy loss, mean-squared-error loss, etc. 103 | 104 | - Optimizer - Adam Optimizer 105 | 106 | - Evaluation/Result Analysis 107 | 108 | - TMScore, lDDT (measurement between two structures) 109 | 110 | - plDDT, pTMScore (predicted lDDT/TMScore for uncertainty estimation) 111 | -------------------------------------------------------------------------------- /source/chapters/scientific_discovery_in_the_era_of_AI/index.rst: -------------------------------------------------------------------------------- 1 | ========================================= 2 | Scientific Discovery in the era of AI 3 | ========================================= 4 | 5 | 6 | 7 | .. 
toctree:: 8 | :maxdepth: 2 9 | :caption: Contents: 10 | 11 | manifesto.md 12 | news_AI.md 13 | artificial_intelligence.md 14 | AI_bring.md 15 | mindsets_for_AI.md 16 | how_ai_work.md 17 | AI_roadmap.md 18 | references.md 19 | -------------------------------------------------------------------------------- /source/chapters/scientific_discovery_in_the_era_of_AI/manifesto.md: -------------------------------------------------------------------------------- 1 | ## Manifesto 2 | 3 | In recent years, AI has almost been everywhere, from our daily life to life-critical systems. AI for science is a new terminology representing a growing community that attracts more and more people from both AI and science communities to work on scientific discovery with AI. You may wonder what AI really is? How AI for science works? How is it related to my daily work? In this blog, we introduce AI to people who are interested in AI for science (especially from the view of the scientific community) and answer the above questions, including what AI brings to the scientific community, successful examples of AI in scientific applications, AI mindsets in tackling different types of scientific problems. -------------------------------------------------------------------------------- /source/chapters/scientific_discovery_in_the_era_of_AI/mindsets_for_AI.md: -------------------------------------------------------------------------------- 1 | ## Mindsets for AI 2 | 3 | #### Key abilities of AI 4 | 5 | - **Automated feature learning.** Instead of traditionally manual design of features for various tasks, AI takes data in raw formats and automatically learns the features while optimizing the task objectives. 6 | 7 | - **Learning from big data**. AI can learn from the accumulated “big” data in many domains that traditional methods are not capable of. 8 | 9 | - **Inductive bias** (e.g. symmetry preservation). AI models are flexible and can be designed to respect natural laws such as symmetries. 
10 | 11 | - **Generalizability** (to unseen data). After training, an AI model is expected to generalize to new scenarios and unseen data. In some scenarios, it is also expected to generalize to a new dataset or a similar task after training on one general dataset or task. 12 | 13 | - **Fitting high-dimensional functions**. AI models can fit complex functions, such as free energy surfaces, solutions of the Schrödinger equation, etc. 14 | 15 | - **Differentiable programming**. AI brings a new wave of differentiable programming and pushes forward the development of many tools for automatic differentiation, such as PyTorch, TensorFlow, Theano, etc. 16 | 17 | #### Limitations of AI 18 | 19 | - **Overfitting**. AI models sometimes overfit to the given dataset, which hinders their ability to generalize to other datasets. 20 | 21 | - **Data requirement.** There is no free lunch: AI models usually rely on large-scale datasets. 22 | 23 | - **Computational cost.** AI models usually consume plenty of computational resources, especially as model and data sizes grow. 24 | 25 | - **Explainability.** AI models usually have poor explainability and are thus considered “black boxes”, though this is an active area of research. -------------------------------------------------------------------------------- /source/chapters/scientific_discovery_in_the_era_of_AI/news_AI.md: -------------------------------------------------------------------------------- 1 | ## News you may hear about AI 2 | 3 | - [Deep Blue](https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)) [1] was the first computer chess player to win a game, and the first to win a match, against a reigning world champion under regular time controls. Deep Blue’s victory was considered a milestone in the history of artificial intelligence.
4 | 5 | - [AlphaGo](https://deepmind.com/research/case-studies/alphago-the-story-so-far) [2] is the first computer Go program to defeat a professional human Go player, and the first to defeat a Go world champion. 6 | 7 | - [AlphaFold2](https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology) [3] provides a solution to a 50-year-old grand challenge in biology: determining the structure of a protein from its sequence. 8 | 9 |
10 | 11 |
12 |
Figure 1: DALL·E 2 synthesized image from text description
13 |
14 | 15 | 16 | - [DALL·E 2](https://openai.com/dall-e-2/) (2022) [4] is one of the largest AI systems that can create realistic images and art from a description in human-readable language. 17 | -------------------------------------------------------------------------------- /source/chapters/scientific_discovery_in_the_era_of_AI/references.md: -------------------------------------------------------------------------------- 1 | ## References 2 | 3 | [1] Wikipedia Contributors. Deep learning, 05 2019. 4 | 5 | [2] DeepMind. Alphago: the story so far, 2016. 6 | 7 | [3] The AlphaFold Team. Alphafold: a solution to a 50-year-old grand challenge in biology, 11 2020. 8 | 9 | [4] OpenAI. Dall·e 2, Apr 2022 10 | 11 | [5] Ravindra Parmar. Common loss functions in machine learning, 09 2018. 12 | 13 | [6] Sebastian Ruder. An overview of gradient descent optimization algorithms, 01 2016. -------------------------------------------------------------------------------- /source/conf.py: -------------------------------------------------------------------------------- 1 | # Configuration file for the Sphinx documentation builder. 2 | # 3 | # This file only contains a selection of the most common options. For a full 4 | # list see the documentation: 5 | # https://www.sphinx-doc.org/en/master/usage/configuration.html 6 | 7 | # -- Path setup -------------------------------------------------------------- 8 | 9 | # If extensions (or modules to document with autodoc) are in another directory, 10 | # add these directories to sys.path here. If the directory is relative to the 11 | # documentation root, use os.path.abspath to make it absolute, like shown here. 
12 | # 13 | # import os 14 | # import sys 15 | # sys.path.insert(0, os.path.abspath('.')) 16 | 17 | 18 | # -- Project information ----------------------------------------------------- 19 | 20 | project = 'AI4Science101' 21 | copyright = '2022, DeepModeling Community' 22 | author = 'DeepModeling Community' 23 | 24 | 25 | # -- General configuration --------------------------------------------------- 26 | 27 | # Add any Sphinx extension module names here, as strings. They can be 28 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 29 | # ones. 30 | extensions = [ 31 | 'myst_parser', 32 | 'deepmodeling_sphinx', 33 | ] 34 | myst_enable_extensions = [ 35 | 'dollarmath', 36 | ] 37 | 38 | # Add any paths that contain templates here, relative to this directory. 39 | templates_path = ['_templates'] 40 | 41 | # List of patterns, relative to source directory, that match files and 42 | # directories to ignore when looking for source files. 43 | # This pattern also affects html_static_path and html_extra_path. 44 | exclude_patterns = [] 45 | 46 | 47 | # -- Options for HTML output ------------------------------------------------- 48 | 49 | # The theme to use for HTML and HTML Help pages. See the documentation for 50 | # a list of builtin themes. 51 | # 52 | html_theme = 'sphinx_rtd_theme' 53 | 54 | # Add any paths that contain custom static files (such as style sheets) here, 55 | # relative to this directory. They are copied after the builtin static files, 56 | # so a file named "default.css" will overwrite the builtin "default.css". 57 | html_static_path = ['_static'] 58 | 59 | latex_engine = 'xelatex' 60 | mathjax_path = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/3.2.0/es5/tex-mml-chtml.min.js' 61 | -------------------------------------------------------------------------------- /source/index.rst: -------------------------------------------------------------------------------- 1 | .. 
AI4Science101 documentation master file, created by 2 | sphinx-quickstart on Mon Jun 20 12:21:16 2022. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | ========================================= 7 | Welcome to AI4Science101's documentation! 8 | ========================================= 9 | 10 | 11 | First Edition 12 | ============== 13 | 14 | .. toctree:: 15 | :maxdepth: 2 16 | :caption: Contents: 17 | 18 | chapters/announcement/announcement.md 19 | chapters/AI_for_scientific_discovery/index 20 | chapters/scientific_discovery_in_the_era_of_AI/index 21 | chapters/molecular_dynamics/index 22 | chapters/knowledge_base/index 23 | 24 | 25 | 26 | Indices and tables 27 | ================== 28 | 29 | * :ref:`genindex` 30 | * :ref:`modindex` 31 | * :ref:`search` 32 | --------------------------------------------------------------------------------