The Use of Artificial Intelligence and Machine Learning in Chemical Simulations

Technology

Machine learning can be used to solve problems, especially in the area of structure-activity relationships, whose complexity makes them impossible or immensely laborious to solve with standard methods. At Quantistry, we therefore use this technology to make our analyses even faster and more reliable.

Artificial Intelligence (AI) and Machine Learning (ML) are flourishing in many areas of science thanks to open access to cost-efficient computing resources and big data. In ML, flexible scientific models are built from (usually experimental) data to reproduce and predict chemical and physical phenomena and laws. The various ML methods are suited to scientific problems characterized by high combinatorial complexity or non-linear processes. Such problems are usually not solvable by conventional methods, or can only be handled with immense computational effort. One of these highly complex problems is the prediction of chemical reactions (1). Currently, the design and description of reaction routes rely primarily on chemical intuition, i.e., the experience, expertise, and mechanistic understanding of a scientist. However, due to the complicated structure-(re)activity relationship of the molecules involved (reactants, reagents), accurately predicting the outcome and mechanism of a chemical reaction remains a major challenge, both for experienced chemists and for computational quantum chemical models.

Analysis of structure-activity relationships

Therefore, current chemical research favors ML methods for elucidating these structure-activity relationships. For this purpose, predictive models are built from large data sets to relate the chemical structure of a molecule to its chemical, biochemical, and physical properties. Research targets include chemical and physical properties of a compound, such as its reactivity, acid strength, or optical spectrum (UV/Vis, IR), as well as physiological properties, such as its odor or toxicological effect.

ML models usually learn to predict structure-activity relationships from experimental laboratory data. These data typically require extensive processing and cleanup to identify and correct missing or erroneous entries and thereby ensure the validity of the trained ML models. Even so, the quality of laboratory data can vary widely, because completeness, reproducibility, and proper error propagation are not always assured. The representation of the raw data is another important aspect of building ML models. Although cleaned scientific data are usually numerical, the format in which they are presented can strongly influence how well a predictive model learns: the more appropriate the representation of the input data, the more accurately an algorithm can map it to the output data. Selecting the optimal representation may require insight into both the underlying scientific problem and how the learning algorithm works.
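
The cleanup step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular pipeline: the record layout, the function name `clean`, the plausibility range, and all numbers are hypothetical placeholders.

```python
import math

# Hypothetical raw records: (sample id, measured value). None and NaN mark
# missing entries; the numbers are placeholders, not real measurements.
raw_records = [
    ("A", 4.5),
    ("A", 5.0),            # replicate measurement of sample A
    ("B", None),
    ("C", float("nan")),
    ("D", 250.0),          # outside the plausible range assumed below
    ("E", 9.9),
]

def clean(records, lower, upper):
    """Drop missing or out-of-range values, then average replicates per sample."""
    grouped = {}
    for sample_id, value in records:
        if value is None or math.isnan(value):
            continue  # missing measurement
        if not (lower <= value <= upper):
            continue  # implausible value, treated as erroneous
        grouped.setdefault(sample_id, []).append(value)
    return {sid: sum(vals) / len(vals) for sid, vals in grouped.items()}

# The bounds 0.0 and 14.0 are an assumed plausibility window for this example.
cleaned = clean(raw_records, lower=0.0, upper=14.0)
print(cleaned)
```

Only samples A and E survive here, with the two replicates of A averaged. In practice the rules for what counts as erroneous depend on the measurement and should be chosen with knowledge of the experiment.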

The relevance of molecular fingerprints

To this end, current chemical research uses so-called molecular fingerprints to represent and identify the structure of any given molecule in a highly simplified format. These molecular descriptors usually take the form of a bit-based chemical structure code (e.g., Morgan fingerprints) (2) or an ASCII character string (SMILES, the Simplified Molecular Input Line Entry System) (3, 4, 5). The strong simplification of the molecular structure is both the main advantage and the main disadvantage of these descriptors. On the one hand, their compact and handy format allows ML models to learn effectively. On the other hand, many elementary structural properties (e.g., enantiomerism) cannot be properly represented without additional annotation.
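
To make the idea of a bit-based structure code concrete, the following Python sketch builds a toy circular fingerprint. It is written in the spirit of extended-connectivity fingerprints (2) but is deliberately simplified and is not the published algorithm: bond orders, hydrogens, and stereochemistry are ignored, and the helper names (`stable_hash`, `circular_fingerprint`, `tanimoto`), the 64-bit length, and the hand-written molecular graphs are all assumptions made for illustration.

```python
import zlib

def stable_hash(obj):
    """Deterministic 32-bit hash of a printable Python object."""
    return zlib.crc32(repr(obj).encode("utf-8"))

def circular_fingerprint(elements, bonds, radius=2, n_bits=64):
    """Return a bit list encoding atom environments up to `radius` bonds."""
    neighbors = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Radius 0: each atom is identified by its element and number of neighbors.
    ids = [stable_hash((el, len(neighbors[i]))) for i, el in enumerate(elements)]
    bits = [0] * n_bits
    for ident in ids:
        bits[ident % n_bits] = 1

    # Each iteration widens every atom's environment by one bond.
    for _ in range(radius):
        ids = [
            stable_hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
            for i in range(len(elements))
        ]
        for ident in ids:
            bits[ident % n_bits] = 1
    return bits

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: shared on-bits divided by on-bits in either vector."""
    both = sum(1 for x, y in zip(fp_a, fp_b) if x and y)
    either = sum(1 for x, y in zip(fp_a, fp_b) if x or y)
    return both / either if either else 1.0

# Heavy-atom graphs (hydrogens omitted): ethanol C-C-O and dimethyl ether C-O-C.
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_renumbered = circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
dimethyl_ether = circular_fingerprint(["C", "O", "C"], [(0, 1), (1, 2)])

print(ethanol == ethanol_renumbered)      # same molecule, different atom order
print(tanimoto(ethanol, dimethyl_ether))  # similarity between the two isomers
```

The bit vector does not depend on how the atoms are numbered, which is what makes such descriptors convenient as ML input. Because only connectivity enters the hash in this sketch, two enantiomers would receive identical bit vectors, which illustrates the loss of structural information mentioned above.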

Graph Neural Networks and Supervised Learning

An alternative to string- or fingerprint-based approaches is the use of Graph Neural Networks (GNNs) (6). Here, molecular structures are translated into graphs in which atoms correspond to nodes and the bonds between neighboring atoms correspond to edges. Additional atom or bond information, such as atom types or bond orders, can be encoded in the node or edge attributes. A graph convolution is then used to embed the attributes of each node's environment, i.e., the attributes of its neighboring edges and nodes, into the node itself.
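
A single convolution step on a molecular graph can be sketched as follows. This is a minimal illustration, assuming mean aggregation over each atom and its bonded neighbors followed by a linear map and a ReLU; it is one common variant rather than the specific model of reference (6). Edge attributes such as bond orders are left out for brevity, and the one-hot atom encoding, the function name `graph_convolution`, and the identity weight matrix are illustrative assumptions.

```python
# Ethanol heavy-atom graph (hydrogens omitted): C0 - C1 - O2.
# Node attributes are a one-hot atom type [is_carbon, is_oxygen];
# this encoding is an illustrative assumption.
node_features = [
    [1.0, 0.0],  # C0
    [1.0, 0.0],  # C1
    [0.0, 1.0],  # O2
]
edges = [(0, 1), (1, 2)]  # undirected bonds

def graph_convolution(features, edges, weight):
    """One mean-aggregation convolution step followed by a ReLU."""
    n = len(features)
    # Every node sees itself plus its bonded neighbors.
    neighborhood = {i: [i] for i in range(n)}
    for a, b in edges:
        neighborhood[a].append(b)
        neighborhood[b].append(a)

    updated = []
    for i in range(n):
        members = neighborhood[i]
        dim_in = len(features[i])
        # Average the attribute vectors over the neighborhood.
        mean = [sum(features[j][k] for j in members) / len(members)
                for k in range(dim_in)]
        # Linear map with the (normally learned) weight matrix, then ReLU.
        out = [max(0.0, sum(mean[k] * weight[k][m] for k in range(dim_in)))
               for m in range(len(weight[0]))]
        updated.append(out)
    return updated

# Identity weights keep the numbers easy to follow; a trained model learns them.
identity = [[1.0, 0.0], [0.0, 1.0]]
embedded = graph_convolution(node_features, edges, identity)
for atom, vector in zip(["C0", "C1", "O2"], embedded):
    print(atom, vector)
```

After one step the oxygen vector is [0.5, 0.5], so it already carries information about its carbon neighbor, while the terminal carbon, whose only neighbor is another carbon, stays at [1.0, 0.0]. Applying the step repeatedly lets each node absorb information from atoms further away.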

Recently, ML models based on supervised learning have been increasingly used in chemical research. These models can predict a non-linear relationship between input data such as molecular structures and output data such as acid strengths with acceptable accuracy. As part of this, a number of different classification algorithms (e.g., Naive Bayes classifiers (7), k-nearest neighbors (8), decision trees (9), kernel methods (10), and artificial neural networks (11)) are used, each of which needs to be systematically tested and validated for the problem at hand. Unsupervised learning, on the other hand, has so far found only a small field of application in chemistry, e.g., in the prediction of repulsive potentials in quantum chemical approximation methods such as DFTB.
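
As an illustration of supervised learning together with systematic validation, the sketch below implements a k-nearest-neighbors classifier (8) on bit-vector descriptors and checks it by leave-one-out cross-validation. It is a toy example under stated assumptions: the bit vectors and the labels `class_a` and `class_b` are invented, the Hamming distance is only one possible choice of metric, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical training set: (bit-vector descriptor, class label).
# Both the bits and the labels are invented for illustration.
dataset = [
    ([1, 1, 0, 0, 0, 0], "class_a"),
    ([1, 1, 1, 0, 0, 0], "class_a"),
    ([1, 0, 1, 0, 0, 0], "class_a"),
    ([0, 0, 0, 1, 1, 0], "class_b"),
    ([0, 0, 0, 1, 1, 1], "class_b"),
    ([0, 0, 0, 0, 1, 1], "class_b"),
]

def hamming(u, v):
    """Number of positions in which two bit vectors differ."""
    return sum(1 for x, y in zip(u, v) if x != y)

def knn_predict(train, query, k=3):
    """Majority vote among the k training examples closest to the query."""
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(data, k=3):
    """Hold out each example once, predict it from the rest, and score the hits."""
    hits = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if knn_predict(train, features, k) == label:
            hits += 1
    return hits / len(data)

prediction = knn_predict(dataset, [1, 1, 0, 0, 0, 1])
accuracy = leave_one_out_accuracy(dataset)
print(prediction, accuracy)
```

On this cleanly separated toy set every held-out example is recovered, which says nothing about real chemical data. The point is the procedure: each candidate algorithm is scored on examples it has not seen during training before its predictions are trusted.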

The widespread availability of large data sets and sufficient hardware resources, especially powerful GPUs, makes it possible to use the described approaches effectively to solve chemical problems that would not be solvable with standard methods, or only with disproportionately greater effort. For this reason, we selectively complement our methodological spectrum with the described approaches and integrate their results seamlessly into our end-to-end software solution.

Literature

(1) B. Maryasin, P. Marquetand, and N. Maulide, “Machine learning for organic synthesis: Are robots replacing chemists?”, Angew. Chem. Int. Ed. 57, 6978 (2018).

(2) D. Rogers and M. Hahn, “Extended-connectivity fingerprints”, J. Chem. Inf. Model. 50, 742 (2010).

(3) D. Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules”, J. Chem. Inf. Comput. Sci. 28, 31 (1988).

(4) D. Weininger, A. Weininger, and J. L. Weininger, “SMILES. 2. Algorithm for generation of unique SMILES notation”, J. Chem. Inf. Comput. Sci. 29, 97 (1989).

(5) D. Weininger, “SMILES. 3. DEPICT. Graphical depiction of chemical structures”, J. Chem. Inf. Comput. Sci. 30, 237 (1990).

(6) F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The Graph Neural Network Model”, IEEE Trans. Neural Netw. 20, 61 (2009).

(7) D. J. Hand and K. Yu, “Idiot’s Bayes? Not so stupid after all?”, Int. Stat. Rev. 69, 385 (2001).

(8) G. Shakhnarovich, T. Darrell, and P. Indyk, “Nearest-Neighbor Methods in Learning and Vision: Theory and Practice” (MIT Press, Boston, 2005).

(9) L. Rokach and O. Maimon, in “Data Mining and Knowledge Discovery Handbook”, edited by O. Maimon and L. Rokach (Springer, New York, 2010), pp. 149–174.

(10) J. Shawe-Taylor and N. Cristianini, “Kernel Methods for Pattern Analysis” (Cambridge Univ. Press, Cambridge, 2004).

(11) J. Schmidhuber, “Deep learning in neural networks: An overview”, Neural Netw. 61, 85 (2015).