Patrick Hop, Brandon Allgood, and Jessen Yu

Geometric Deep Learning Autonomously Learns Chemical Features That Outperform Those Engineered by Domain Experts

Published in Molecular Pharmaceutics 2018, Volume 15, Issue 10, Pages 4371-4377

Abstract

Artificial Intelligence has advanced at an unprecedented pace, backing recent breakthroughs in natural language processing, speech recognition, and computer vision: domains where the data is Euclidean in nature. More recently, considerable progress has been made in engineering deep-learning architectures that can accept non-Euclidean data such as graphs and manifolds: geometric deep learning. This progress is of considerable interest to the drug discovery community, as molecules can naturally be represented as graphs, where atoms are nodes and bonds are edges. In this work, we explore the performance of geometric deep-learning methods in the context of drug discovery, comparing machine-learned features against the domain-expert engineered features that are mainstream in the pharmaceutical industry.

Introduction

Deep Learning

Deep neural networks (DNNs) are not an entirely new concept, as they have existed for ∼20 years, (1) only recently entering the spotlight due to an abundance of storage and compute as well as optimization advances. Today, deep learning backs the core technology in many applications, such as self-driving cars, (2) speech synthesis, (3) and machine translation. (4) Perhaps the most important property of DNNs is their ability to automatically learn embeddings (features) tabula rasa from the underlying data, aided by vast amounts of compute and more data than any one human domain expert can understand.

Naturally, there is interest in expanding the domain of applicability of these methods to non-Euclidean data such as graphs or manifolds, (5) which arise in domains such as 3D models in computer graphics, represented as Riemannian manifolds, or graphs in molecular machine learning. Understanding data of this structure has been elusive for classical architectures because of the lack of a well-defined coordinate system and vector space structure in non-Euclidean domains. Even operations as simple as addition often have no natural construction; for example, the sum of two atoms or two molecules has no meaning.

Geometric Deep Learning aims to solve this by defining primitives that can operate on these unwieldy data structures, primarily by constructing spatial and spectral interpretations of existing architectures (6) such as convolutional neural networks (CNNs). Recasting CNNs into this domain is of particular interest in drug discovery because, like nearby pixels, nearby atoms are highly related and interact with each other, whereas distant atoms usually do not.

Drug Discovery

Development of novel therapeutics for a human disease is a process that can easily consume a decade of research and development, as well as billions of dollars in capital. (7) Long before anything reaches the clinic for validation, a potential disease-modulating biological target is discovered and characterized. Then, the search for the right therapeutic compound begins, a process akin to finding the perfect chemical key for a tough-to-crack biological lock, conducted through a vast chemical space containing more molecules than there are atoms in the universe. Even restricting the search to molecules with a molecular weight of ≤500 Da yields a search space of at least 10^50 molecules, virtually all of which have never been synthesized before.

To make it to the clinic, drug discovery practitioners need to optimize for a wide range of molecular properties, ranging from physical properties, such as aqueous solubility, to complex biochemical properties, such as blood-brain barrier penetration. This long, laborious search has historically been guided by the intuition of skilled medicinal chemists and biologists, but over the past few decades, heuristics and machine learning have played an increasingly important role in guiding the process.

The first widespread heuristic was Lipinski's rule of five (RO5), invented at Pfizer in 1997. (8) RO5 places limits on the number of hydrogen bond donors and acceptors, on molecular weight, and on lipophilicity measures and has been shown to filter out compounds that are likely to exhibit poor ADME properties. In practice, RO5 is often still used today to evaluate emerging preclinical molecules.

Over the past two decades, machine learning models have begun to emerge in the industry as a more advanced filter or virtual screen. Researchers in industry have shown that expert-engineered features and support vector machines can be used to predict stability in human liver microsomes (9,10) effectively, among other end points. Multitask, fully connected neural networks trained on these same inputs have been shown, on average, to outperform more traditional models, (11,12) including XGBoost, (13) with performance scaling monotonically with the number of tasks into the thousands. (14) Progress in learning from small amounts of data has been achieved using variants of matching networks. (15) More recently, the use of 3D convolutional neural networks has shown considerable promise in predicting protein–ligand binding energy (16) (drug potency), and ranking models have made considerable progress in drug repurposing. (17)

More recently, progress has been made in generating novel molecules in silico, unlocking the possibility of screening molecules that have been designed by machines instead of humans. This may allow exploration of far-reaching regions of chemical space beyond those covered by existing human-engineered screens in industry. Success with these approaches was first demonstrated using Adversarial Autoencoders, which were shown to be able to hallucinate (generate) chemical fingerprints that matched a variety of patented anticancer assets. (18) Variational Autoencoders have also been used in this area (19) and have been shown to be able to hallucinate molecules that have exceptional solubility and low similarity to the training set. Segler et al. (20) relied on LSTMs trained on chemical language representations to achieve a similar result for potency end points. Recently, more progress has been made using Generative Adversarial Neural Network models, first on 2D representations (18) and later on 3D representations. (21)

As these prediction systems improve, the average quality of molecules selected for synthesis in drug programs improves significantly, (22) resulting in programs that get to the clinic faster and with lower capital requirements, which is significant in light of pipeline attrition rates. For drugs in phase I, excluding portfolio rebalancing, ∼40% fail due to toxicity and ∼15% fail due to poor pharmacokinetics, both of which have the potential to be captured by these prediction systems long before the clinic. (23)

In this work, the state of the art of drug discovery feature engineering is compared against the state of the art of geometric deep learning in a rigorous manner. We will show that geometric deep learning can autonomously learn representations that outperform those designed by domain experts on four out of five of the data sets tested.

Chemical Embeddings

The first challenge in machine learning is selecting a numerical representation that correctly captures the underlying dynamics of the training data, also known as features or an embedding, terms we will use interchangeably in this work. A fixed-shape representation is typically required simply because the mathematics of learning algorithms require that their inputs be the same shape. Selecting an embedding that respects the underlying structure of the data cannot be overlooked because certain mathematical assumptions that apply to some data sets need not apply to others: reversing an English sentence destroys its meaning, whereas reversing an image generally would not. In natural language processing, a pernicious problem is that sentences need not be the same length and that locality must be respected because words are highly related to their neighbors. Bag-of-words embeddings resolve this by mapping sentences into bit vectors that indicate the presence or absence of words in the document [Figure 1]. This convenient, fixed-length bit vector can later be used to train a classifier for any natural language processing task.

Figure 1. Bag of fragments (left); bag of words (right).
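As a concrete illustration of the bag-of-words idea (our own sketch, not part of the original study), scikit-learn's CountVectorizer with binary=True produces exactly this kind of fixed-length presence/absence bit vector:

```python
# Minimal bag-of-words sketch (illustrative only).
# binary=True yields presence/absence bit vectors rather than word counts.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the drug was well tolerated",
    "tolerated well was the drug",  # reversed word order, identical bag of words
]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # vocabulary (older scikit-learn uses get_feature_names)
print(X.toarray())                         # both rows are the same fixed-length bit vector
```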

In molecular machine learning, engineering good embeddings/features is a considerable challenge because molecules are unwieldy, undirected multigraphs with atoms being nodes and bonds being edges. A good chemical embedding would be able to model graphs with a differing number of nodes and edge configurations while preserving locality because it is understood that atoms that are close to each other generally exhibit more pronounced interactions than atoms that are distant.

More formally, for a molecule represented by an adjacency matrix A ∈ {0, 1}^(n×n) and an atom-feature matrix X ∈ ℝ^(n×d), we want to construct some function f with (optionally) learnable parameters Θ ∈ ℝ^w such that

f : (A, X; Θ) ↦ x ∈ ℝ^d

where x is a fixed-shape representation that captures the essence of the underlying graph. This vector is then passed to a learning algorithm of the scientist's choice, such as random forests or a fully connected neural network.
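As a toy illustration of such a function f (our own parameter-free sketch, not the embedding used in this work), one can aggregate neighbor features over the adjacency matrix and then sum over atoms, which yields a fixed-length, permutation-invariant vector regardless of molecule size:

```python
# Toy, parameter-free embedding f(A, X) -> x (illustrative sketch only).
# Summing over atoms gives a fixed-shape, permutation-invariant vector.
import numpy as np

def naive_graph_embedding(A: np.ndarray, X: np.ndarray) -> np.ndarray:
    """A: (n, n) adjacency matrix, X: (n, d) atom-feature matrix."""
    A_hat = A + np.eye(A.shape[0])   # self-loops so each atom keeps its own features
    H = A_hat @ X                    # each atom aggregates its neighbors' features
    return H.sum(axis=0)             # sum-pool over atoms -> fixed-length vector in R^d

# Toy 3-atom graph with 2 features per atom (e.g., atomic number and valence).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[6, 4], [6, 4], [8, 2]], dtype=float)
print(naive_graph_embedding(A, X))  # shape (2,), independent of the number of atoms
```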

Naïve Embeddings

A standard chemistry embedding solution is the extended-connectivity fingerprint (ECFP4). (24) These fingerprints generate features for every radius r ≤ 4, where the radius is the maximum distance explored on the graph from the starting vertex. For a specific r and a specific vertex in the graph, ECFP4 takes the neighborhood features from the previous radius, concatenates them, and applies a hash function, the range of which corresponds to indices on a hash table. After iterating over all vertices and radius values, this bag-of-fragments approach to graph embedding results in a task-agnostic representation that can be easily passed to a learning algorithm.
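A minimal sketch of computing such circular fingerprints with the open-source RDKit (illustrative only; the 384-bit width matches the baseline inputs described later, and radius 2 is the conventional RDKit setting for ECFP4-style fingerprints):

```python
# ECFP-style circular fingerprint sketch using RDKit (illustrative).
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as an example molecule
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=384)

bits = np.array(fp)            # fixed-length {0, 1} vector, ready for any learner
print(bits.shape, int(bits.sum()))
```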

Expert Embeddings

Cheminformatics experts in drug discovery have, over decades, engineered numerous domain-specific, physiologically relevant features, also known as descriptors. For example, polar surface area (PSA) is a feature calculated as the sum of the surface area contributions of the polar atoms in a molecule, a feature well known in industry to negatively correlate with membrane permeability. There are 101 of these expert-engineered features [Table 3] that are readily available in the open-source RDKit package.
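These descriptors can be computed directly with RDKit; a brief sketch (illustrative only; the full list of descriptors used in this work is given in Table 3):

```python
# Computing expert-engineered descriptors with RDKit (illustrative sketch).
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

print(Descriptors.TPSA(mol))     # topological polar surface area
print(Descriptors.MolLogP(mol))  # Crippen logP

# Descriptors.descList enumerates (name, function) pairs for the available descriptors.
values = {name: fn(mol) for name, fn in Descriptors.descList}
print(len(values))
```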

Learnable Embeddings

One criticism of the naïve embeddings is that they are not optimized for the task at hand. The ideal features to predict drug solubility are likely to be considerably different from the features used to predict photovoltaic efficiency. The solution is to allow the model to engineer its own problem-specific, optimized embedding, in essence combining the learner with the embedding. This is achieved by allowing gradients to flow back from the learner into the embedding function, allowing the embedding to be optimized in tandem with the learner. Neural Fingerprints (25) demonstrated that ECFP could be considerably improved in this manner by introducing learnable weights, and Weaves (26) demonstrated further improvements by mixing bond and atom features. Later, it was shown that both of these graph embedding methods are special cases of message passing algorithms. (27)
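The general pattern can be sketched in a few lines of PyTorch (our own minimal sketch of a learnable graph embedding feeding a regressor; it is not a reproduction of Neural Fingerprints, Weave, or the models benchmarked below):

```python
# Minimal learnable graph-embedding sketch in PyTorch (illustrative only).
# Gradients from the downstream regressor flow back into the embedding weights.
import torch
import torch.nn as nn

class TinyGraphEmbedder(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.message = nn.Linear(d_in, d_hidden)   # learnable "fingerprint" weights
        self.readout = nn.Linear(d_hidden, 1)      # downstream regressor

    def forward(self, A: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        A_hat = A + torch.eye(A.shape[0])          # self-loops
        H = torch.relu(self.message(A_hat @ X))    # one round of neighbor aggregation
        x = H.sum(dim=0)                           # sum-pool to a fixed-length embedding
        return self.readout(x)

model = TinyGraphEmbedder(d_in=2, d_hidden=8)
A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = torch.tensor([[6., 4.], [6., 4.], [8., 2.]])
loss = (model(A, X) - torch.tensor([1.0])).pow(2).mean()
loss.backward()  # embedding weights receive gradients along with the regressor
```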

Methods

Learning Algorithms

Random Forests

Random Forests are a common ensemble learning algorithm used in industry due to their training speed, high performance, and ease of use. In this work, random forest models (sklearn's RandomForestRegressor) are trained on the concatenation of the 101 RDKit descriptors and the 384-bit-wide ECFP4 fingerprints using 30 trees and a maximum tree depth of 4. This particular hyperparameter configuration is the result of tuning on the validation set by hand with the aim of maximizing absolute performance while minimizing the spread of performance. A maximum tree depth of 10 was used on the lipophilicity dataset due to its size.
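A sketch of this baseline configuration (illustrative; X_train and y_train are placeholders standing in for the 101 descriptors concatenated with the 384-bit ECFP4 fingerprints and the measured end point):

```python
# Random Forest baseline sketch mirroring the described configuration (illustrative).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder featurized data: (n_molecules, 101 + 384) inputs and measured values.
X_train = np.random.rand(100, 485)
y_train = np.random.rand(100)

rf = RandomForestRegressor(n_estimators=30, max_depth=4, random_state=0)
rf.fit(X_train, y_train)
print(rf.predict(X_train[:5]))
```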

FC-DNN

Fully Connected Neural Networks operate on a fixed-shape input by passing information through multiple nonlinear transformations, i.e., layers. FC-DNN models were implemented in PyTorch (28) and are trained on the same inputs as the random forest models with an added normalization preprocessing stage. After extensive hyperparameter tuning on the validation set, a neural network with two hidden layers of size 48 and 32 was found to perform well. ReLU activations and batch normalization were used on both hidden layers. Optimization was performed using the ADAM optimizer. (29)

For hyperparameters, a static learning rate of 5e–4 and L2 weight decay of 8e–3 were used. All FC-DNN models were trained through training epoch 11, after which the models would begin overfitting.
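A sketch of this configuration (illustrative; the input width of 485 assumes the 101 descriptors plus 384 fingerprint bits, and the data are placeholders):

```python
# FC-DNN sketch matching the described architecture (illustrative).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(485, 48), nn.BatchNorm1d(48), nn.ReLU(),
    nn.Linear(48, 32), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=8e-3)

X = torch.rand(64, 485)   # placeholder normalized inputs
y = torch.rand(64, 1)
for epoch in range(11):   # trained through epoch 11, as described
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    optimizer.step()
```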

GC-DNN

Graph Convolutional networks are a geometric deep-learning method that is distinct from the previous methods in that they are trained exclusively from the molecular graph, an unwieldy input that can vary in the number of vertices as well as connectivity. This graph is initialized using a variety of atom features ranging from atomic number to covalent radius.

The DeepChem TensorFlow (30) implementation of the graph-convolution, graph-pooling, and graph-gather primitives was used to construct single-task networks. This implementation is unique in that it reserves a parameter matrix for each node degree, unlike other approaches. (6) For these experiments, a 3-layer network was used with ReLU activations, batch normalization, and a static learning rate of 1e–3 with no weight decay. Once again, optimization was performed using the ADAM optimizer. A formal mathematical construction of the graph convolutional primitives is presented in the Appendix.
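A sketch of how such a single-task graph-convolution model can be assembled with DeepChem (illustrative only; argument names vary somewhat between DeepChem releases, and the data here are placeholders):

```python
# Graph-convolution sketch using DeepChem's implementation (illustrative;
# exact constructor arguments differ between DeepChem versions).
import deepchem as dc

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
labels = [0.1, 0.5, 0.3]                   # placeholder regression targets

featurizer = dc.feat.ConvMolFeaturizer()   # builds per-atom features from the molecular graph
dataset = dc.data.NumpyDataset(featurizer.featurize(smiles), labels)

model = dc.models.GraphConvModel(
    n_tasks=1,
    mode="regression",
    graph_conv_layers=[64, 64, 64],        # 3 graph-convolution layers, as described
    learning_rate=1e-3,
)
model.fit(dataset, nb_epoch=10)
print(model.predict(dataset))
```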

Data Preparation

Setting up valid machine learning experiments in molecular machine learning is considerably more challenging than in other domains. Datasets are autocorrelated because they are not collected by sampling from chemical space uniformly at random. Rather, datasets are composed of many chemical series of interest, with each series consisting of molecules that differ by only subtle topology changes.

This underlying structure can be visualized using t-SNE, (31) a nonlinear embedding algorithm that excels at accurately visualizing high-dimensional data, such as molecules. In essence, t-SNE aims to produce a 2D embedding such that points that are close together in high dimensions remain close together in the 2D embedding. Likewise, it aims to keep points that are far apart in high dimensions far apart in the 2D embedding. The resulting t-SNE scatterplot [Figure 2] for the lipophilicity data set reveals this clear clustering.

Figure 2. 2D embedding of the 4200 molecule lipophilicity dataset using t-SNE. Notice the heavy clustering that is characteristic of a drug-discovery dataset.
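A sketch of how such a plot can be produced with scikit-learn's t-SNE (illustrative; X is a placeholder fingerprint matrix rather than the actual lipophilicity data):

```python
# t-SNE visualization sketch (illustrative); X stands in for an (n_molecules, n_bits) matrix.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X = np.random.rand(500, 384)   # placeholder fingerprints for the sketch
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("2D t-SNE embedding of the dataset")
plt.show()
```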

It follows from this structure that randomly splitting data sets of this style results in significant redundancies between the training and validation sets. It can be shown that benchmarks of this style significantly reward solutions that overfit rather than solutions that can generalize to molecules that are significantly different from the training sets. (32) To control for this, we split the dataset into Murcko clusters (33) and place the largest clusters in the training set and the smallest ones in the validation set, targeting 80% of the data being placed in the training set, 10% in the validation set, and 10% in the test set. This method results in the majority of the chemical diversity being held outside of the training set, not unlike the data the system will encounter when deployed.
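A sketch of this scaffold-based splitting procedure using RDKit's Bemis-Murcko scaffolds (our own illustrative implementation; the function name and fractions are placeholders):

```python
# Scaffold-split sketch (illustrative): group molecules by Bemis-Murcko scaffold,
# put the largest clusters in train and the smaller ones in validation/test.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    clusters = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        clusters[scaffold].append(idx)

    train, valid, test = [], [], []
    # Largest clusters go to the training set; smaller ones are held out.
    for cluster in sorted(clusters.values(), key=len, reverse=True):
        if len(train) < frac_train * len(smiles_list):
            train.extend(cluster)
        elif len(valid) < frac_valid * len(smiles_list):
            valid.extend(cluster)
        else:
            test.extend(cluster)
    return train, valid, test
```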

Both the split and unsplit datasets have been open-sourced into a repository under the Numerate GitHub organization.

Capturing Uncertainty

Small datasets, along with algorithms that rely on randomness during training, introduce considerable noise into the performance results. This makes it difficult to tease apart genuine advancements from luck. (28) Moreover, the performance of molecular machine learning systems is highly dependent on the choice of training set, making it difficult to assess how the system would perform on significantly novel chemical matter.

Because there is no closed-form solution for uncertainty estimates for the metric that we are interested in, R², bootstrapping with replacement of the training set is used to capture uncertainty [Figure 3]. Models are trained on 25 bootstrap resamples, and 25 R² values are recorded [Table 1]. The result is not a single score but rather a distribution of scores defined by a sample mean and sample variance. Variations in mean performance among learning algorithms can then be tested for statistical significance using the Welch t-test, an adaptation of the t-test that is more reliable for two samples that have unequal variances [Table 2].

Figure 3. Bootstrapped performance histograms and kernel density estimates for Random Forests, Graph Convolutional Neural Networks, and Fully Connected Neural Networks over five data sets.
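A sketch of the bootstrap-and-compare procedure (illustrative; train_and_score is a stand-in for fitting any of the models above on a resampled training set and returning its test-set R²):

```python
# Bootstrap + Welch t-test sketch (illustrative). `train_and_score` is a stand-in for
# fitting a model on a resampled training set and returning its test-set R^2.
import numpy as np
from scipy.stats import ttest_ind

def bootstrap_scores(train_and_score, X_train, y_train, n_boot=25, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_train), size=len(y_train))  # resample with replacement
        scores.append(train_and_score(X_train[idx], y_train[idx]))
    return np.array(scores)

# rf_scores, gc_scores = bootstrap_scores(...), bootstrap_scores(...)
# Welch's t-test (equal_var=False) then compares the two R^2 distributions:
# t_stat, p_value = ttest_ind(rf_scores, gc_scores, equal_var=False)
```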

Table 1. Test Set Performance (R²)

data set       model    mean    std     range
pKa-A1         RF       0.319   0.179   [−0.260, 0.673]
pKa-A1         FC-DNN   0.191   0.072   [0.091, 0.377]
pKa-A1         GC-DNN   0.437   0.105   [0.204, 0.689]
Clearance      RF       0.155   0.047   [0.054, 0.253]
Clearance      FC-DNN   0.136   0.025   [0.088, 0.192]
Clearance      GC-DNN   0.217   0.048   [0.117, 0.333]
HPPB           RF       0.287   0.029   [0.215, 0.342]
HPPB           FC-DNN   0.203   0.024   [0.158, 0.265]
HPPB           GC-DNN   0.208   0.039   [0.126, 0.309]
ThermoSol      RF       0.187   0.021   [0.137, 0.224]
ThermoSol      FC-DNN   0.256   0.039   [0.224, 0.377]
ThermoSol      GC-DNN   0.294   0.043   [0.215, 0.377]
Lipophilicity  RF       0.424   0.022   [0.371, 0.473]
Lipophilicity  FC-DNN   0.345   0.025   [0.302, 0.402]
Lipophilicity  GC-DNN   0.484   0.023   [0.436, 0.515]

Table 2. A/B Test for Random Forests and Graph Convolutions Using the Welch t-test

data set    p-value
pKa-A1      7.2e–3
Clearance   3.2e–5
HPPB        3.7e–10
Thermosol   4.6e–13
Lipo        1.6e–12

Experiments

Regression models are tested against a variety of physicochemical and ADME end points that are of interest to the pharmaceutical industry. We restrict our choice of data sets to the ones released by AstraZeneca into ChEMBL, a publicly available database, (34) with the expectation that they were subject to their strict internal quality control standards, contain considerable chemical diversity, and are representative of data sets held internally in industry.

Data Sets

pKa-A1

This is the acid–base dissociation constant for the most acidic proton, which is an important factor in understanding the ionizability of a potential drug and has a strong influence over multiple different properties of interest, including permeability, partitioning, binding, and so forth. (35) This is the smallest data set of the five with only 204 examples.

Human Intrinsic Clearance

This is the rate at which the human body removes circulating, unbound drug from the blood. It is one of the key in vitro parameters used to predict drug residency time in the patient. (36) In drug discovery, this property is assessed by measuring the metabolic stability of drugs in either human liver microsomes or hepatocytes. This data set includes 1102 examples of intrinsic clearance measured in human liver microsomes (μL min⁻¹ mg⁻¹ protein) following incubation at 37 °C.

Human Plasma Protein Binding

This assay measures the proportion of drug that is bound reversibly to proteins such as albumin and α-acid glycoprotein in the plasma. Knowing the amount that is unbound is critical because it is that amount that can diffuse into tissue or be cleared by the liver; (36) 1640 compounds were measured, and regression targets are transformed using log(1 – bound), a more representative measure for scientists.

Thermodynamic Solubility

This measures the solubility of a solid starting material in pH 7.4 buffer. Solubility influences a wide range of properties for drugs, especially ones that are administered orally. This data set contains 1763 examples.

Lipophilicity

This is a compound's affinity for a lipophilic solvent vs a polar solvent. More formally, we use logD (pH 7.4), which is captured experimentally using the octanol/buffer distribution coefficient measured by the shake-flask method. This is an important measure for potential drugs, as lipophilicity is a key contributor to membrane permeability. (36) Conversely, highly lipophilic compounds are usually encumbered by low solubility, high clearance, high plasma protein binding, and so forth. Indeed, most drug discovery projects have a target range for lipophilicity. (36) This data set is the largest at 4200 compounds.

Results

Graph Convolutional Neural Networks lead the three learning algorithms on four out of five data sets, with the exception being human plasma protein binding. All five differences between GC-DNNs and the industry-standard RFs were found to be statistically significant using the Welch t-test A/B test. Fully Connected Neural Networks generally underperformed their counterparts despite requiring considerably more hyperparameter tuning.

Discussion

In part due to their autonomously learned features, graph convolutional neural networks outperformed methods trained on expert-engineered features on four out of five data sets, with the exception being plasma-protein binding [Table 3]. This is a surprising result given that GC-DNNs are blind to the domain of drug discovery and could trivially be repurposed to solve orthogonal problems such as detecting fraud in banking transaction networks. Geometric deep learning approaches like this unlock the possibility of learning from non-Euclidean graphs (molecules) and manifolds, providing the pharmaceutical industry with the ability to learn from and exploit knowledge from its historical successes and failures, resulting in significantly improved quality of research candidates and accelerated timelines.

Table 3. Expert Engineered Features

feature              mean     variance   feature        mean    variance
MaxAbsPartialCharge  0.43     0.07       PEOE-VSA10     9.64    7.99
MinPartialCharge     −0.42    0.07       PEOE-VSA11     4.26    6.43
MinAbsPartialCharge  0.0      0.0        PEOE-VSA12     3.61    5.10
HeavyAtomMolWt       0.26     0.08       PEOE-VSA13     3.19    4.38
MaxAbsEStateIndex    0.16     0.19       PEOE-VSA14     2.38    4.05
NumRadicalElectrons  0.0      0.0        PEOE-VSA2      7.14    6.03
NumValenceElectrons  141.25   40.29      PEOE-VSA3      6.97    6.48
MinAbsEStateIndex    0.16     0.19       PEOE-VSA4      2.89    5.13
MaxEStateIndex       11.65    2.53       PEOE-VSA5      1.75    4.23
MaxPartialCharge     0.27     0.08       PEOE-VSA6      24.64   18.86
MinEStateIndex       −1.11    1.59       PEOE-VSA7      40.05   19.59
ExactMolWt           382.69   106.85     PEOE-VSA8      24.58   15.41
BalabanJ             1.80     0.43       PEOE-VSA9      14.78   10.27
BertzCT              944.81   330.45     SMR-VSA1       13.01   8.90
Chi0                 19.20    5.32       SMR-VSA10      23.76   12.57
Chi0n                15.16    4.36       SMR-VSA2       0.45    1.51
Chi0v                15.47    4.46       SMR-VSA3       12.30   7.77
Chi1                 13.01    3.58       SMR-VSA4       2.74    5.10
Chi1n                8.87     2.66       SMR-VSA5       24.53   19.55
Chi1v                9.47     2.85       SMR-VSA6       19.45   16.61
Chi2n                6.69     2.18       SMR-VSA7       55.98   21.91
Chi2v                7.38     2.45       SMR-VSA8       0.0     0.0
Chi3n                4.71     1.68       SMR-VSA9       7.67    7.71
Chi3v                5.26     1.88       SlogP-VSA1     9.88    6.52
Chi4n                3.27     1.30       SlogP-VSA10    7.40    7.78
Chi4v                3.71     1.44       SlogP-VSA11    3.62    5.40
HallKierAlpha        −2.67    0.89       SlogP-VSA12    6.36    8.89
Ipc                  0.0      0.0        SlogP-VSA2     37.98   20.79
Kappa1               18.54    5.75       SlogP-VSA3     9.19    8.33
Kappa2               7.84     2.77       SlogP-VSA4     5.73    7.08
Kappa3               4.13     1.79       SlogP-VSA5     24.76   17.81
LabuteASA            159.92   43.61      SlogP-VSA6     44.86   18.86
PEOE-VSA1            13.85    7.86       SlogP-VSA7     1.52    3.03
VSA-EState1          0.0      0.0        SlogP-VSA8     8.62    9.03
VSA-EState10         1.55     3.95       SlogP-VSA9     0.0     0.0
VSA-EState2          0.0      0.0        TPSA           78.84   32.07
VSA-EState3          0.0      0.0        EState-VSA1    8.73    12.23
VSA-EState4          0.0      0.0        EState-VSA10   10.28   8.04
VSA-EState5          0.0      0.0        EState-VSA11   0.02    0.35
VSA-EState6          0.0      0.0        EState-VSA2    13.47   10.48
VSA-EState7          0.0      0.0        EState-VSA3    20.41   14.42
VSA-EState8          10.71    16.66      EState-VSA4    25.33   18.15
VSA-EState9          52.19    17.21      EState-VSA5    12.42   12.41
FractionCSP3         0.30     0.18       EState-VSA6    16.85   14.04
HeavyAtomCount       27.04    7.46       EState-VSA7    23.08   18.96
NOCount              6.03     2.37       EState-VSA8    19.82   15.80
NumHAcceptors        5.14     2.16       EState-VSA9    9.51    8.59
NumHeteroatoms       7.17     2.79       MolLogP        3.28    1.32
NumRotatableBonds    5.22     2.91       MolMR          104.00  28.25
NHOHCount            1.84     1.37

However, for these applications to take off in industry, there needs to be significant certainty that the system will remain performant on novel chemical matter. As part of this work, our analysis of uncertainty has revealed concerns about the methodology of learning-algorithm comparisons in this field. pKa-A1 in particular exhibits so much uncertainty that individual trials have little to no meaning. Although it is clear from the p-values that GC-DNNs do indeed outperform, the width of the uncertainty intervals indicates that it is completely unclear whether or not the resulting predictor will turn out to be useful. Even the random forests trained on the 1102-example clearance dataset exhibit significant variability in performance, ranging from almost zero correlation to a high enough correlation to be useful and everything in between. This is alarming considering that 1102 examples is considered a large dataset in this field and could easily have cost in excess of half a million dollars to generate.

Beyond this, there is still a significant amount of progress to be made. The publicly available approaches tested in this work still significantly lag the accuracy of the underlying assays they are trying to model. Thermodynamic solubility, in particular, has an assay limit upward of 0.8 R², whereas all of the presented models are under 0.3 R², a gap that more data alone is unlikely to close. What's missing?

Our internal research shows that in most cases the answer is 3D representations. Medicinal molecules interact with the human body in three dimensions while in solution. These molecular structures are not static and can take the form of a wide range of conformers. Building machine learning systems that are more aware of the true underlying physics can result in significantly more performant models, which will be the focus of our upcoming follow-up paper.