Geometric Deep Learning Autonomously Learns Chemical Features That Outperform Those Engineered by Domain Experts
Published in Molecular Pharmaceutics 2018, Volume 15, Issue 10, Pages 4371-4377
Abstract
Artificial Intelligence has advanced at an unprecedented pace, backing recent breakthroughs in natural language processing, speech recognition, and computer vision: domains where the data is euclidean in nature. More recently, considerable progress has been made in engineering deep-learning architectures that can accept non-Euclidean data such as graphs and manifolds: geometric deep learning. This progress is of considerable interest to the drug discovery community, as molecules can naturally be represented as graphs, where atoms are nodes and bonds are edges. In this work, we explore the performance of geometric deep-learning methods in the context of drug discovery, comparing machine learned features against the domain expert engineered features that are mainstream in the pharmaceutical industry.
Introduction
Deep Learning
Deep neural networks (DNNs) are not an entirely new concept as they have existed for ∼20 years, (1) only recently entering the spotlight due to an abundance of storage and compute as well as optimization advances. Today, deep-learning backs the core technology in many applications, such as self-driving cars, (2) speech synthesis, (3) and machine translation. (4) Perhaps the most important property of DNNs is their ability to automatically learn embeddings (features) tabula rasa from the underlying data, aided by vast amounts of compute and more data than any one human domain expert can understand.
Naturally, there is interest in expanding the domain of applicability of these methods to non-euclidian data such as graphs or manifolds, (5) which arise in domains such as 3D models in computer graphics, represented as riemannian manifolds, or graphs in molecular machine learning. Understanding data of this structure has been elusive for classical architectures because of a lack of a well-defined coordinate system and vector space structure in non-euclidian domains. Even operations as simple as addition often cannot find natural constructions, for example, the sum of two atoms or two molecules has no meaning.
Geometric Deep Learning aims to solve this by defining primitives that can operate on these unwieldy data structures, primarily by constructing spatial and spectral interpretations of existing architectures (6) such as convolutional neural networks (CNNs). Recasting CNNs into this domain is of particular interest in drug discovery, as like nearby pixels, nearby atoms are highly related and interact with each other whereas distant atoms usually do not.
Drug Discovery
Development of novel therapeutics for a human disease is a process that can easily consume a decade of research and development, as well as billions of dollars in capital. (7) Long before anything reaches the clinic for validation, a potential disease-modulating biological target is discovered and characterized. Then, the search process for the right therapeutic compound is kicked off, a process akin to finding the perfect chemical key for a tough to crack biological lock, which is conducted through a vast chemical space containing more molecules than atoms in the universe. Even restricting the search to molecules with a molecular weight of ≤500 Da yields a search space of at least 1050 molecules, virtually all of which have never been synthesized before.
To make it to the clinic, drug discovery practitioners need to optimize for a wide range of molecular properties, ranging from physical properties, such as aqueous solubility to complex biochemical properties, such as blood-brain barrier penetration. This long, laborious search has historically been guided by the intuition of skilled medicinal chemists and biologists, but over the past few decades, heuristics and machine learning have played an increasingly important role in guiding the process.
The first widespread use of a heuristic is Lipinski’s rule of five (RO5), invented at Pfizer in 1997. (8) RO5 places limits on the number of hydrogen bond donors, acceptors, molecular weight, and lipophilicity measures and has been shown to filter out compounds that are likely to exhibit poor ADME properties. In practice, even today RO5 is often still used to evaluate emerging preclinical molecules.
Over the past two decades, machine learning models have begun to emerge in the industry as a more advanced filter or virtual screen. Researchers in industry have shown that expert-engineered features and support vector machines can be used to predict stability in human liver microsomes (9,10) effectively, among other end points. Multitask, fully connected neural networks on these same inputs, has been shown to on average outperform more traditional models, (11,12) including XGBoost, (13) with performance scaling monotonically with the number of tasks into the thousands. (14) Progress in learning from small amounts of data has been achieved using variants of matching networks. (15) More recently, use of 3D convolutional neural networks has shown considerable promise in predicting protein – ligand binding energy (16) (drug potency), and ranking models have made considerable progress is drug repurposing. (17)
More recently, progress has been made in generating novel molecules in silico, unlocking the possibility of screening molecules that have been designed by machines instead of humans. This may allow exploration of far-reaching regions of chemical space that are beyond those covered by existing human-engineered screens in industry. Success with these approaches was first demonstrated using Adversarial Autoencoders, which were shown to be able to hallucinate (generate) chemical fingerprints that matched with a variety of patented anticancer assets. (18) Variational Autoencoders have also been used in this area (19) and have been shown to be able to hallucinate molecules that have exceptional solubility and low similarity to the training set. Segler (20) et al. relied on LSTMs trained on chemical language representations to achieve a similar result for potency end points. Recently, more progress has been made using Generative Adversarial Neural Network models, first on 2D representations (18) and later on 3D representations. (21)
As these prediction systems improve, the average quality of molecules selected for synthesis in drugs programs improves significantly, (22) resulting in programs that get to the clinic faster and with lower capital requirements, which is significant in light of pipeline attrition rates. For drugs in phase I, excluding portfolio rebalancing, ∼40% fail due to toxicity and ∼15% fail due to poor pharmacokinetics, both of which have the potential to be captured by these prediction systems long before the clinic. (23)
In this work, the state of the art of drug discovery feature engineering is compared against the state of the art of geometric deep learning in a rigorous manner. We will show that geometric deep learning can autonomously learn representations that outperform those designed by domain experts on four out of five of the data sets tested.
Chemical Embeddings
The first challenge in machine learning is selecting a numerical representation that correctly captures the underlying dynamics of the training data, also known as features or an embedding, terms we will use interchangeably in this work. A fixed-shape representation is typically required simply because the mathematics of learning algorithms require that their inputs be the same shape. Selecting an embedding that respects the underlying structure of the data cannot be overlooked because certain mathematical assumptions that apply to some data sets need not apply to others, reversing an English sentence destroys its meaning, whereas reversing an image generally would not. In natural language processing, a pernicious problem is that sentences need not be the same length and that locality must be respected because words are highly related to their neighbors. Bag of words embeddings resolves this by mapping sentences into bit vectors that indicate the presence or absence of adjacent words in the document [Figure 1]. This convenient, fixed length bit vector can later be used to train a classifier for any natural language processing task.
In molecular machine learning, engineering good embedding/features is a considerable challenge because molecules are unwieldy, undirected multigraphs with atoms being nodes and bonds being edges. A good chemical embedding would be able to model graphs with a differing number of nodes and edge configurations while preserving locality because it is understood that atoms that are close to each other generally exhibit more pronounced interactions than atoms that are distant.
More formally, for a molecule represented with an adjacency matrix A ∈ {0, 1}n×n and atom-feature matrix X ∈ Rnxd, we want to construct some function f with (optionally) learnable parameters Ɵ ∈ Rw s.t
ʄ : (A, X; Ɵ) → xd
where x is a fixed-shape representation that captures the essence of the underlying graph. This vector is then passed to a learning algorithm of a scientist’s choice, such as random forests or a fully connected neural network.
Naïve Embeddings
A standard chemistry embedding solution is the extended-connectivity fingerprints (ECFP4). (24) These fingerprints generate features for every radius r ≤ 4, which is the maximum distance explored on the graph from the starting vertex. For a specific r and specific vertex in the graph, ECFP4 takes the neighborhood features from the previous radius, concatenates them, and applies a hash function, the range of which corresponds to indices on a hash table. After iterating over all vertices and radius values, this bag of fragments approach to graph embedding results in task agnostic representation that can be easily passed onto a learning algorithm.
Expert Embeddings
Cheminformatics experts in drug discovery have, over decades, engineered numerous domain-specific, physiologically relevant features, also known as descriptors. For example, polar surface area (PSA) is a feature that is calculated as the sum of the surface area contributions of the polar atoms in a molecule, a feature well-known in industry to negatively correlate with membrane permeability. There are 101 of these publicly available, expert-engineered features [Table 3] that are easily available in the open-source RDKit package.
Learnable Embeddings
One criticism of the naïve embeddings is that they are not optimized for the task at hand. The ideal features to predict drug solubility are likely to be considerably different than the features used to predict photovoltaic efficiency. The solution is to allow the model to engineer its own problem specific, optimized embedding for the problem at hand, in essence by combining the learner with the embedding. This is achieved by allowing gradients to flow back from the learner into the embedding function, allowing the embedding to be optimized in tandem with the learner. Neural Fingerprints (25) demonstrated that ECFP could be considerably improved in this manner by introducing learnable weights, and Weaves (26) demonstrated further improvements by mixing bond and atom features. Later, it was shown both of these graph embedding methods are special cases of message passing algorithms. (27)
Methods
Learning Algorithms
Random Forests
Random Forests are a common ensemble learning algorithm used in industry due to their training speed, high-performance, and ease of use. In this work, random forest models (sklearn’s RandomForestRegressor) are trained on the concatenation of 101 RDKit descriptors and the 384 bit wide ECFP4 fingerprints using 30 trees and a maximum tree depth of 4. This particular hyperparameter configuration is the result of tuning on the validation set by hand with the aim of maximizing absolute performance while minimizing the spread of performance. A maximum tree depth of 10 was used on the lipophilicity dataset due to its size.
FC-DNN
Fully-Connected Neural Networks operate on a fixed shape input by passing information through multiple nonlinear transformations, i.e., layers. FC-DNN models were implemented in PyTorch (28) and are trained on the same inputs as the random forest models with an added normalization preprocessing stage. After extensive hyperparameter tuning on the validation set, a neural network with two hidden layers of size 48 and 32 was found to perform well. ReLU activations and batch normalization were used on both hidden layers. Optimization was performed using the ADAM optimizer. (29)
For hyperparameter, a static learning rate of 5e–4 and l2 weight decay of 8e–3 was used. All FC-DNN models were trained through training epoch 11, after which the models would begin overfitting.
GC-DNN
Graph Convolutional networks are a geometric deep-learning method that is distinct from the previous methods in that they are trained exclusively from the molecular graph, an unwieldy input that can vary in the number of vertices as well as connectivity. This graph is initialized using a variety of atom features ranging from atomic number to covalent radius.
The DeepChem Tensorflow (30) implementation of the graph convolution, graph-pooling, and graph-gather primitives was used to construct single-task networks. This implementation is unique in that it reserves a parameter matrix for each node degree, unlike other approaches. (6) For these experiments, a 3‑layer network was used using ReLU activations, batch normalization, and a static learning rate of 1e–3 with no weight decay. Once again, optimization was performed using the ADAM optimizer. A formal mathematical construction of the graph convolutional primitives is presented in the Appendix.
Data Preparation
Setting up valid machine learning experiments in molecular machine learning is considerably more challenging than other domains. Dataset are autocorrelated because they are not collected by sampling from chemical space uniformly at random. Rather, dataset are comprised of many chemical series of interest with each series consisting of molecules that differ by only subtle topology changes.
This underlying structure can be visualized using t‑SNE, (31) a nonlinear embedding algorithm that excels at accurately visualizing high-dimensional data, such as molecules. In essence, t‑SNE aims to produce a 2D embedding such that points that are close together in high dimensions remain close together in the 2D embedding. Likewise, it aims to keep points that are far apart in high dimensions far apart in the 2D embedding. The resulting t‑SNE scatterplot [Figure 2] for the lipophilicity data set reveals this clear clustering.
It follows from this structure that randomly splitting data sets of this style results in significant redundancies between the training set and validation sets. It can be shown that benchmarks of this style significantly reward solutions that overfit rather than solutions that can generalize to molecules that are significantly different than the training sets. (32) To control for this, we split the dataset into Murcko clusters (33) and place the largest clusters in the training set and the smallest ones in the validation set, targeting 80% of the data being placed in the training set, 10% in the validation set, and 10% in the test set. This method results in the majority of the chemical diversity being held outside of the training set, not unlike the data the system will encounter when deployed.
Both split and unsplit dataset have been open sourced into a repository under the Numerate GitHub organization.
Capturing Uncertainty
Small dataset, along with algorithms that rely on randomness during training, introduce considerable noise into the performance results. This makes it difficult to tease apart genuine advancements from luck [28]. Moreover, the performance of molecular machine learning systems is highly dependent on the choice of training set, making it difficult to assess how the system would perform on significantly novel chemical matter.
Because there is no closed-form solution for uncertainty estimates for the metric that we are interested in, R2, bootstrapping with replacement of the training set is used to capture uncertainty [Figure3]. Models are trained on 25 bootstrap resamples, and 25 R2 values are recorded [Table1]. The result is not a single score but rather a distribution of scores defined by a sample mean and sample variance. Variations in mean performance among learning algorithms can then be tested for statistical significance using the Welch t test, an adaptation of the t test that is more reliable for two samples that have unequal variances [Table2].
Table 1. Test Set Performance R2
data set | model | mean | std | range |
---|---|---|---|---|
pKa-A1 | RF | 0.319 | 0.179 | [−0.260, 0.673] |
pKa-A1 | FC-DNN | 0.191 | 0.072 | [0.091, 0.377] |
pKa-A1 | GC-DNN | 0.437 | 0.105 | [0.204, 0.689] |
Clearance | RF | 0.155 | 0.047 | [0.054, 0.253] |
Clearance | FC-DNN | 0.136 | 0.025 | [0.088, 0.192] |
Clearance | GC-DNN | 0.217 | 0.048 | [0.117, 0.333] |
HPPB | RF | 0.287 | 0.029 | [0.215, 0.342] |
HPPB | FC-DNN | 0.203 | 0.024 | [0.158, 0.265] |
HPPB | GC-DNN | 0.208 | 0.039 | [0.126, 0.309] |
ThermoSol | RF | 0.187 | 0.021 | [0.137, 0.224] |
ThermoSol | FC-DNN | 0.256 | 0.039 | [0.224, 0.377] |
ThermoSol | GC-DNN | 0.294 | 0.043 | [0.215, 0.377] |
Lipophilicity | RF | 0.424 | 0.022 | [0.371, 0.473] |
Lipophilicity | FC-DNN | 0.345 | 0.025 | [0.302, 0.402] |
Lipophilicity | GC-DNN | 0.484 | 0.023 | [0.436, 0.515] |
Table 2. A/B Test for Random Forests and Graph Convolutions Using Welch t-test
data set | p-value |
---|---|
pKa-A1 | 7.2e–3 |
Clearance | 3.2e–5 |
HPPB | 3.7e10 |
Thermosol | 4.6e–13 |
Lipo | 1.6e–12 |
Experiments
Regression models are tested against a variety of physiochemical and ADME end points that are of interest to the pharmaceutical industry. We restrict our choice of data sets to the ones released by AstraZeneca into ChEMBL, a publicly available database, (34) with the expectation that they were subject to their strict, internal quality control standards, contain considerable chemical diversity, and are representative of data sets held internally in industry.
Data Sets
pKa-A1
This is the acid – base dissociation constant for the most acidic proton, which is an important factor in understanding the ionizability of a potential drug and has a strong influence over multiple different properties of interest, including permeability, partitioning, binding, and so forth. (35) This is this smallest data set of the five with only 204 examples.
Human Intrinsic Clearance
This is the rate at which the human body removes circulating, unbound drug from the blood. This is one of the key in vitro parameters used to predict drug residency time in the patient. (36) In drug discovery, this property is assessed by measuring the metabolic stability of drugs in either human liver microsomes or hepatocytes. This data set includes 1102 examples of intrinsic clearance measured in human liver microsomes (muL min–1 mg–1 protein) following incubation at 37 °C.
Human Plasma Protein Binding
This assay measures the proportion of drug that is bound reversibly to proteins such as albumin and α‑acid glycoprotein in the plasma. Knowing the amount that is unbound is critical because it is that amount that can diffuse into tissue or be cleared by the liver; (36) 1640 compounds are measured, and regression targets are transformed using log(1 – bound), a more representative measure for scientists.
Thermodynamic Solubility
This measures the solubility of a solid starting material in pH 7.4 buffer. Solubility influences a wide range of properties for drugs, especially ones that are administered orally. This data set contains 1763 examples.
Lipophilicity
This is a compound’s affinity for a lipophilic solvent vs a polar solvent. More formally, we use logD (pH 7.4), which is captured experimentally using the octanol/buffer distribution coefficient measured by the shake flask method. This is an important measure for potential drugs, as lipophilicity is a key contributor to membrane permeability. (36) Alternatively, highly lipophilic compounds are usually encumbered by low solubility, high clearance, high plasma protein binding, and so forth. Indeed, most drug discovery projects have a target range for lipophilicity. (36) This data set is the largest of the three at 4200 compounds.
Results
Graph Convolutional Neural Networks lead the three learning algorithms on four out of five data sets with the exception being human plasma protein binding. All five differences between GC-DNNs and the industry-standard RFs were found to be statistically significant using the Welch t-test A/B test. Fully Connected Neural Networks generally underperformed their counterparts despite requiring considerably more hyperparameter tuning.
Discussion
In part due to its autonomously learned features, graph convolutional neural networks outperformed methods trained on expert engineered features on four out of five data sets with the exception being plasma-protein binding [Table 3]. This is a surprising result given that GC-DNNs are blind to the domain of drug discovery and could trivially be repurposed to solve orthogonal problems such as detecting fraud in banking transaction networks. Geometric deep learning approaches like this unlock the possibility of learning from non-euclidian graphs (molecules) and manifolds, providing the pharmaceutical industry with the ability to learn from and exploit knowledge from their historical successes and failures, resulting in significantly improved quality of research candidates and accelerated timelines.
Table 3. Expert Engineered Features
feature | mean | variance | feature | mean | variance |
---|---|---|---|---|---|
MaxAbsPartialCharge | 0.43 | 0.07 | PEOE-VSA10 | 9.64 | 7.99 |
MinPartialCharge | –0.42 | 0.07 | PEOE-VSA11 | 4.26 | 6.43 |
MinAbsPartialCharge | 0.0 | 0.0 | PEOE-VSA12 | 3.61 | 5.10 |
HeavyAtomMolWt | 0.26 | 0.08 | PEOE-VSA13 | 3.19 | 4.38 |
MaxAbsEStateIndex | 0.16 | 0.19 | PEOE-VSA14 | 2.38 | 4.05 |
NumRadicalElectrons | 0.0 | 0.0 | PEOE-VSA2 | 7.14 | 6.03 |
NumValenceElectrons | 141.25 | 40.29 | PEOE-VSA3 | 6.97 | 6.48 |
MinAbsEStateIndex | 0.16 | 0.19 | PEOE-VSA4 | 2.89 | 5.13 |
MaxEStateIndex | 11.65 | 2.53 | PEOE-VSA5 | 1.75 | 4.23 |
MaxPartialCharge | 0.27 | 0.08 | PEOE-VSA6 | 24.64 | 18.86 |
MinEStateIndex | –1.11 | 1.59 | PEOE-VSA7 | 40.05 | 19.59 |
ExactMolWt | 382.69 | 106.85 | PEOE-VSA8 | 24.58 | 15.41 |
BalabanJ | 1.80 | 0.43 | PEOE-VSA9 | 14.78 | 10.27 |
BertzCT | 944.81 | 330.45 | SMR-VSA1 | 13.01 | 8.90 |
Chi0 | 19.20 | 5.32 | SMR-VSA10 | 23.76 | 12.57 |
Chi0n | 15.16 | 4.36 | SMR-VSA2 | 0.45 | 1.51 |
Chi0v | 15.47 | 4.46 | SMR-VSA3 | 12.30 | 7.77 |
Chi1 | 13.01 | 3.58 | SMR-VSA4 | 2.74 | 5.10 |
Chi1n | 8.87 | 2.66 | SMR-VSA5 | 24.53 | 19.55 |
Chi1v | 9.47 | 2.85 | SMR-VSA6 | 19.45 | 16.61 |
Chi2n | 6.69 | 2.18 | SMR-VSA7 | 55.98 | 21.91 |
Chi2v | 7.38 | 2.45 | SMR-VSA8 | 0.0 | 0.0 |
Chi3n | 4.71 | 1.68 | SMR-VSA9 | 7.67 | 7.71 |
Chi3v | 5.26 | 1.88 | SlogP-VSA1 | 9.88 | 6.52 |
Chi4n | 3.27 | 1.30 | SlogP-VSA10 | 7.400 | 7.78 |
Chi4v | 3.71 | 1.44 | SlogP-VSA11 | 3.62 | 5.400 |
HallKierAlpha | –2.67 | 0.89 | SlogP-VSA12 | 6.36 | 8.89 |
Ipc | 0.0 | 0.0 | SlogP-VSA2 | 37.98 | 20.79 |
Kappa1 | 18.54 | 5.75 | SlogP-VSA3 | 9.19 | 8.33 |
Kappa2 | 7.84 | 2.77 | SlogP-VSA4 | 5.73 | 7.08 |
Kappa3 | 4.13 | 1.79 | SlogP-VSA5 | 24.76 | 17.81 |
LabuteASA | 159.92 | 43.61 | SlogP-VSA6 | 44.86 | 18.86 |
PEOE-VSA1 | 13.85 | 7.86 | SlogP-VSA7 | 1.52 | 3.03 |
VSA-EState1 | 0.0 | 0.0 | SlogP-VSA8 | 8.62 | 9.03 |
VSA-EState10 | 1.55 | 3.95 | SlogP-VSA9 | 0.0 | 0.0 |
VSA-EState2 | 0.0 | 0.0 | TPSA | 78.84 | 32.07 |
VSA-EState3 | 0.0 | 0.0 | EState-VSA1 | 8.73 | 12.23 |
VSA-EState4 | 0.0 | 0.0 | EState-VSA10 | 10.28 | 8.04 |
VSA-EState5 | 0.0 | 0.0 | EState-VSA11 | 0.02 | 0.35 |
VSA-EState6 | 0.0 | 0.0 | EState-VSA2 | 13.47 | 10.48 |
VSA-EState7 | 0.0 | 0.0 | EState-VSA3 | 20.41 | 14.42 |
VSA-EState8 | 10.71 | 16.66 | EState-VSA4 | 25.33 | 18.15 |
VSA-EState9 | 52.19 | 17.21 | EState-VSA5 | 12.42 | 12.41 |
FractionCSP3 | 0.30 | 0.18 | EState-VSA6 | 16.85 | 14.04 |
HeavyAtomCount | 27.04 | 7.46 | EState-VSA7 | 23.08 | 18.96 |
NOCount | 6.03 | 2.37 | EState-VSA8 | 19.82 | 15.80 |
NumHAcceptors | 5.14 | 2.16 | EState-VSA9 | 9.51 | 8.59 |
NumHeteroatoms | 7.17 | 2.79 | MolLogP | 3.28 | 1.32 |
NumRotatableBonds | 5.22 | 2.91 | MolMR | 104.00 | 28.25 |
NHOHCount | 1.84 | 1.37 | - | - | - |
However, for these applications to take off in industry, there needs to be significant certainty that the system will remain performant under novel chemical matter. As part of this work, our analysis of uncertainty has revealed concerns in the methodology of learning algorithm comparisons in this field. PKa-A1 in particular exhibits so much uncertainty that individual trials have little to no meaning. Although it is clear from the p-values that GC-DNNs do indeed outperform, the width of the uncertainty intervals indicates that it is completely unclear whether or not the resulting predictor will turn out to to be useful. Even the random forests trained on the 1102 example clearance dataset exhibits significant variability in performance, ranging from almost zero correlation to a high enough correlation to be useful and everything in between. This is alarming considering that 1102 examples is considered a large dataset in this field and could easily have cost in excess of a half million dollars to generate.
Beyond this, there is still a significant amount of progress to be made. The publicly available approaches tested in this work still significantly lag the accuracy of the underlying assays they are trying to model. Thermodynamic solubility, in particular, has an assay limit upward of 0.8 R2, whereas all of the presented models are under 0.3 R2, a gap that more data is unlikely to cover. What’s missing?
Our internal research shows that in most cases the answer is 3D representations. Medicinal molecules interact with the human body in three dimensions while in solution. These molecular structures are not static and can take the form of a wide range of conformers. Building machine learning systems that are more aware of the true underlying physics can result in significantly more performant models, which will be the focus of our upcoming follow-up paper.