
Abstract

Effective molecular representation learning is of great importance to facilitate molecular property prediction. Recent advances in molecular representation learning have shown great promise in applying graph neural networks to model molecules. Moreover, a few recent studies design self-supervised learning methods for molecular representation to address the insufficiency of labelled molecules; however, these self-supervised frameworks treat the molecules as topological graphs without fully utilizing the molecular geometry information. The molecular geometry, also known as the three-dimensional spatial structure of a molecule, is critical for determining molecular properties. To this end, we propose a novel geometry-enhanced molecular representation learning method (GEM). The proposed GEM has a specially designed geometry-based graph neural network architecture as well as several dedicated geometry-level self-supervised learning strategies to learn the molecular geometry knowledge. We compare GEM with various state-of-the-art baselines on different benchmarks and show that it can considerably outperform them all, demonstrating the superiority of the proposed method.

Main

Molecular property prediction has been widely considered one of the most critical tasks in computational drug and materials discovery, as many methods rely on predicted molecular properties to evaluate, select and generate molecules1,2. With the development of deep neural networks (DNNs), molecular representation learning exhibits a great advantage over feature engineering-based methods, and has attracted increasing research attention to tackle the molecular property prediction problem.

Graph neural networks (GNNs) for molecular representation learning have recently become an emerging research area; they regard the topology of atoms and bonds as a graph, and propagate messages from each element to its neighbours3,4,5,6. However, one major obstacle that hinders the successful application of GNNs (and DNNs) to molecular property prediction is the scarcity of labelled data, which is also a common research challenge in the natural language processing7,8 and computer vision9,10 communities. Inspired by the success of self-supervised learning, recent studies4,11 have started to use large-scale unlabelled molecules in a self-supervised manner to pre-train the molecular representation and then use a small number of labelled molecules to fine-tune the models, achieving substantial improvements.

Existing self-supervised learning techniques for GNNs4,11 only consider the topology information of the molecules, neglecting the molecular geometry, that is, the three-dimensional spatial structure of a molecule. These works conduct self-supervised learning by masking and predicting nodes, edges or contexts in the topology4,11. Yet these tasks only enable the model to learn the laws of molecular graphs, such as which atom/group can be connected to a double bond, and lack the ability to learn molecular geometry knowledge, which plays an important role in determining molecules' physical, chemical and biological activities. For example, the water solubility (a critical metric of drug-likeness) of the two molecules illustrated in Fig. 1 is different due to their differing geometries, even though they have the same topology. Cis-platin and trans-platin are another example of molecules with the same topology but different geometries: cis-platin is a popular chemotherapy drug used to treat a number of cancers, whereas trans-platin has no cytotoxic activity12.

Fig. 1: Comparison between two stereoisomers with the same topology but different geometries.

The two chlorine atoms are on different sides in trans-1,2-dichloroethene (left) but the same side in cis-1,2-dichloroethene (right).


Although incorporating geometric information into graph architectures to benefit some molecular property estimation tasks has attracted research attention in recent years13,14,15,16,17, there is still a need to utilize the molecular geometry information to develop a self-supervised learning paradigm for property prediction. We argue that adopting self-supervised learning to estimate the geometry can contribute to facilitating the model's capacity for predicting various properties. Self-supervised learning can take advantage of large-scale unlabelled molecules with coarse three-dimensional spatial structures to better learn the molecular representation, where the coarse three-dimensional spatial structures can be efficiently calculated by cheminformatics tools such as RDKit (https://www.rdkit.org/). Through geometry-level self-supervised learning, the pre-trained model is capable of inferring the molecular geometry by itself.

To this end, we propose a novel geometry-enhanced molecular representation learning method (GEM). First, to make the message passing sensitive to geometries, we model the effects of atoms, bonds and bond angles simultaneously by designing a geometry-based GNN architecture (GeoGNN). The architecture consists of two graphs: the first graph regards the atoms as nodes and the bonds as edges, whereas the second graph regards the bonds as nodes and the bond angles as edges. Second, we pre-train the GeoGNN to learn the chemical laws and the geometries from large-scale molecules with coarse three-dimensional spatial structures, designing various geometry-level self-supervised learning tasks. To verify the effectiveness of the proposed GEM, we compared it with several state-of-the-art (SOTA) baselines on 15 molecular property prediction benchmarks, among which GEM achieves 14 SOTA results.

Our contributions can be summarized as follows:

  • We propose a novel geometry-based GNN to encode both the topology and geometry information of molecules.

  • We design multiple geometry-level self-supervised learning tasks to learn the molecular spatial knowledge from large-scale molecules with coarse spatial structures.

  • We evaluated GEM thoroughly on various molecular property prediction datasets. Experimental results demonstrate that GEM considerably outperforms competitive baselines on multiple benchmarks.

Preliminaries

Graph-based molecular representation

A molecule consists of atoms, and neighbouring atoms are connected by chemical bonds; it can be represented by a graph \(G=({{{\mathcal{V}}}},{{{\mathcal{E}}}})\), where \({{{\mathcal{V}}}}\) is a node set and \({{{\mathcal{E}}}}\) is an edge set. An atom in the molecule is regarded as a node \(v\in {{{\mathcal{V}}}}\) and a chemical bond in the molecule is regarded as an edge \((u,v)\in {{{\mathcal{E}}}}\) connecting atoms u and v.

Graph neural networks are message-passing neural networks18, making them useful for predicting molecular properties. Following the definitions of previous GNNs19, the features of a node v are represented by x_v and the features of an edge (u,v) are represented by x_uv. Taking node features, edge features and the graph structure as inputs, a GNN learns the representation vectors of the nodes, where the representation vector of a node v is denoted by h_v. A GNN iteratively updates a node's representation vector by aggregating the messages from the node's neighbours. Finally, the representation vector h_G of the entire graph can be obtained by pooling over the representation vectors {h_v} of all the nodes at the last iteration. The representation vector h_G of the graph is utilized to estimate the molecular properties.

Pre-training methods for GNNs

In the molecular representation learning community, several recent works4,11,20 have explored the ability of self-supervised learning to improve the generalization of GNN models on downstream tasks. They mainly focus on two kinds of self-supervised learning tasks: node-level (edge-level) tasks and graph-level tasks.

The node-level self-supervised learning tasks are devised to capture local domain knowledge. For example, some studies randomly mask a portion of nodes or sub-graphs and then predict their properties from the node/edge representations. The graph-level self-supervised learning tasks are used to capture global information, such as predicting graph properties from the graph representation. Usually, the graph properties are domain-specific knowledge, such as experimental results from biochemical assays or the existence of molecular functional groups.

The GEM framework

This section introduces the details of our proposed geometry-enhanced molecular representation learning method (GEM), which includes two parts: a novel geometry-based GNN and various geometry-level self-supervised learning tasks.

GeoGNN

We propose GeoGNN, which encodes molecular geometries by modelling the atom–bond–angle relations, distinguishing it from traditional GNNs, which only consider the relationship between atoms and bonds.

For a molecule, we denote the atom set as \({{{\mathcal{V}}}}\), the bond set as \({{{\mathcal{E}}}}\), and the bond angle set as \({{{\mathcal{A}}}}\). We introduce an atom–bond graph G and a bond–angle graph H for each molecule, as illustrated in Fig. 2a. The atom–bond graph is defined as \(G=({{{\mathcal{V}}}},{{{\mathcal{E}}}})\), where atom \(u\in {{{\mathcal{V}}}}\) is regarded as a node of G and bond \((u,v)\in {{{\mathcal{E}}}}\) as an edge of G, connecting atoms u and v. Similarly, the bond–angle graph is defined as \(H=({{{\mathcal{E}}}},{{{\mathcal{A}}}})\), where bond \((u,v)\in {{{\mathcal{E}}}}\) is regarded as a node of H and bond angle \((u,v,w)\in {{{\mathcal{A}}}}\) as an edge of H, connecting bonds (u,v) and (v,w). We use x_u as the initial features of atom u, x_uv as the initial features of bond (u,v), and x_uvw as the initial features of bond angle (u,v,w). The atom–bond graph G and the bond–angle graph H—as well as the atom features, bond features and bond angle features—are taken as the inputs of GeoGNN.
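As a concrete illustration of the two-graph construction, the sketch below (our own code, not the authors' implementation) derives the edges of the bond–angle graph H from the bond list of the atom–bond graph G: two bonds are connected in H exactly when they share an atom.

```python
# Sketch (our notation): derive the bond-angle graph H from the atom-bond
# graph G. Nodes of H are the bonds of G; two bonds are connected in H when
# they share an atom, i.e. when they form a bond angle.
from itertools import combinations

def bond_angle_graph(bonds):
    """bonds: list of (u, v) atom-index pairs (edges of G).
    Returns the edges of H as (bond_i, bond_j, shared_atom) triples,
    where bond_i/bond_j index into `bonds`."""
    incident = {}  # atom index -> indices of bonds touching it
    for i, (u, v) in enumerate(bonds):
        incident.setdefault(u, []).append(i)
        incident.setdefault(v, []).append(i)
    angles = []
    for atom, bond_ids in incident.items():
        for i, j in combinations(bond_ids, 2):
            angles.append((i, j, atom))  # angle centred on the shared atom
    return angles

# A three-atom chain 0-1-2: two bonds sharing atom 1 give one bond angle.
print(bond_angle_graph([(0, 1), (1, 2)]))  # [(0, 1, 1)]
```

In a real pipeline the bond list would come from an RDKit molecule; here it is passed in directly to keep the sketch self-contained.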

Fig. 2: Overall architecture of GEM.

a, In the atom–bond graph G, the chemical bonds are regarded as edges, connecting the atoms. In the bond–angle graph H, the bond angles are regarded as edges, and a bond angle connects two chemical bonds and three atoms. The double-dash arcs indicate the correspondence between the elements in the two graphs. b, Demonstration of geometry-level self-supervised learning tasks. The black circle represents the selected atom, whereas the grey circles in graph G represent the neighbouring masked atoms, the grey lines in graph G and the grey ovals in graph H represent the neighbouring masked bonds, and the grey lines in graph H represent the neighbouring masked bond angles.


GeoGNN learns the representation vectors of atoms and bonds iteratively. For the kth iteration, the representation vectors of atom u and bond (u,v) are denoted by h_u and h_uv, respectively. To connect the atom–bond graph G and the bond–angle graph H, the representation vectors of the bonds are taken as the communication links between G and H. In the first step, the bonds' representation vectors are learned by aggregating messages from the neighbouring bonds and corresponding bond angles in the bond–angle graph H. In the second step, the atoms' representation vectors are learned by aggregating messages from the neighbouring atoms and the corresponding bonds in the atom–bond graph G. Finally, the molecular representation h_G is obtained by pooling over the atoms' representations. See the Methods for details on the GeoGNN architecture.

Geometry-level self-supervised learning tasks

To further boost the generalization ability of GeoGNN, we propose three geometry-level self-supervised learning tasks to pre-train GeoGNN: (1) bond length prediction; (2) bond angle prediction; (3) atomic distance matrix prediction. The bond lengths and bond angles describe the local spatial structures, whereas the atomic distance matrices describe the global spatial structures.

Local spatial structures

Bond lengths and angles are the most important molecular geometrical parameters: the former is the distance between two bonded atoms in a molecule, reflecting the bond strength between the atoms, whereas the latter is the angle between two consecutive bonds, involving three atoms, describing the local spatial structure of a molecule.

To learn the local spatial structures, we construct self-supervised learning tasks that predict bond lengths and angles. First, for a molecule, we randomly select 15% of its atoms. For each selected atom, we extract the one-hop neighbourhood of this atom, including the adjacent atoms and bonds, as well as the bond angles formed by that selected atom. Second, we mask the features of these atoms, bonds and bond angles in the one-hop neighbourhood. The representation vectors of the extracted atoms and bonds at the final iteration of GeoGNN are used to predict the extracted bond lengths and bond angles. Self-supervised learning tasks based on bond lengths and bond angles are shown on the left and middle of Fig. 2b. We design a regression loss function that penalizes the error between the predicted bond lengths/angles and the labels; its details can be found in the Methods. The task of predicting the local spatial structures can be seen as a node-level self-supervised learning task.
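The masking-and-regression recipe above can be sketched as follows. The function names and the fixed random seed are our illustrative choices; the squared-error loss mirrors the regression objective described in the Methods.

```python
# Hedged sketch of the local-geometry pre-training targets: select 15% of
# atoms at random, then score bond-length predictions with a mean squared
# error against the labelled lengths.
import random

def select_masked_atoms(num_atoms, ratio=0.15, seed=0):
    """Randomly pick the atoms whose one-hop neighbourhood will be masked."""
    rng = random.Random(seed)
    k = max(1, int(num_atoms * ratio))  # mask at least one atom
    return rng.sample(range(num_atoms), k)

def bond_length_loss(pred_lengths, true_lengths):
    """Mean squared error between predicted and labelled bond lengths."""
    n = len(true_lengths)
    return sum((p - t) ** 2 for p, t in zip(pred_lengths, true_lengths)) / n

masked = select_masked_atoms(10)
loss = bond_length_loss([1.5, 1.1], [1.5, 1.0])
```

The bond-angle loss has the same shape, with triples of atom representations predicting an angle instead of pairs predicting a length.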

Global spatial structures

In addition to the tasks for learning local spatial structures, we also design the atomic distance matrix prediction task for learning the global molecular geometry. We construct the atomic distance matrix for each molecule based on the three-dimensional coordinates of the atoms. We then predict the elements in the distance matrix, as shown on the right of Fig. 2b.

Note that for two molecules with the same topological structures, the spatial distances between the corresponding atoms could vary greatly; thus, for a molecule, rather than taking atomic distance matrix prediction as a regression problem, we treat it as a multi-class classification problem by projecting the atomic distances into 30 bins of equal stride. Details on the designed loss function can be found in the Methods. The task of predicting the bond lengths can be seen as a special case of the task of predicting the atomic distances. The former focuses more on the local spatial structures, whereas the latter focuses more on the distribution of the global spatial structures.
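A minimal sketch of the binning step, assuming an illustrative maximum distance of 20 Å (the text specifies only that distances are projected into 30 bins with equal stride):

```python
# Cast atomic-distance prediction as 30-way classification: map each pairwise
# distance to one of 30 equal-width bins. The d_max cut-off is our assumption.
def distance_to_bin(d, d_max=20.0, num_bins=30):
    """Map a distance in angstroms to a class index in [0, num_bins)."""
    width = d_max / num_bins
    return min(int(d / width), num_bins - 1)  # clamp distances beyond d_max

print(distance_to_bin(0.3), distance_to_bin(1.4), distance_to_bin(25.0))
```

Each bin index then serves as the class label for a standard cross-entropy loss.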

To pre-train GeoGNN, we consider both the local spatial structures and the global spatial structures for each molecule by summing up the corresponding loss functions.

Experiments

To thoroughly evaluate the performance of GEM, we compare it with multiple SOTA methods on multiple benchmark datasets from MoleculeNet21 with diverse molecular property prediction tasks, spanning physical, chemical and biophysical properties.

Pre-training settings

Datasets

We use 20 million unlabelled molecules sampled from ZINC1522, a public-access database that contains purchasable drug-like compounds, to pre-train GeoGNN. We randomly sample 90% of the molecules for training and use the remainder for evaluation.

Self-supervised learning task settings

We utilize geometry- and graph-level tasks to pre-train GeoGNN. For the former, we use the Merck molecular force field (MMFF94)23 function in RDKit to obtain the simulated three-dimensional coordinates of the atoms in the molecules. The geometric features of the molecule—including bond lengths, bond angles and atomic distance matrices—are calculated from the simulated three-dimensional coordinates. For the graph-level tasks, we predict the molecular fingerprints. The graph-level tasks can be formulated as a set of binary classification problems, where each bit of the fingerprints corresponds to one binary classification problem. Two kinds of fingerprints are used: (1) the molecular access system (MACCS) keys24 and (2) the extended-connectivity fingerprint (ECFP)25.
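Since each fingerprint bit defines one binary classification problem, the graph-level objective amounts to a mean binary cross-entropy over the bits. The sketch below illustrates this formulation in plain Python; in practice the bit labels would come from RDKit's MACCS/ECFP routines, which we do not depend on here.

```python
# Sketch (our formulation): treat each fingerprint bit as an independent
# binary classification and average the binary cross-entropy over all bits.
import math

def fingerprint_bce(probs, bits):
    """probs: predicted probability per bit; bits: 0/1 fingerprint labels."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, b in zip(probs, bits):
        total += -(b * math.log(p + eps) + (1 - b) * math.log(1 - p + eps))
    return total / len(bits)

loss = fingerprint_bce([0.9, 0.2, 0.8], [1, 0, 1])
```

A confident, correct prediction on every bit drives the loss towards zero.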

Molecular property prediction settings

Datasets and splitting method

We conduct experiments on multiple molecular benchmarks from MoleculeNet21, including both classification and regression tasks26,27,28,29,30,31. Following the previous work11, we split all the datasets with scaffold split32, which splits molecules according to their scaffolds (molecular substructures). Scaffold split is a more challenging splitting method and can better evaluate the generalization ability of the models on out-of-distribution data samples.
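A simplified sketch of a scaffold split, assuming the scaffold string of each molecule has already been computed (in practice via RDKit's MurckoScaffold utilities); molecules sharing a scaffold always land in the same fold, so test scaffolds are unseen during training.

```python
# Sketch of a scaffold split: group molecules by scaffold, then fill the
# training fold with the largest scaffold groups first (a common heuristic;
# the exact ordering policy is our simplification).
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8):
    """scaffolds: list of scaffold strings, one per molecule.
    Returns (train_idx, test_idx) with whole scaffold groups per fold."""
    groups = defaultdict(list)
    for i, s in enumerate(scaffolds):
        groups[s].append(i)
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train = int(frac_train * len(scaffolds))
    train, test = [], []
    for g in ordered:
        (train if len(train) + len(g) <= n_train else test).extend(g)
    return train, test

train_idx, test_idx = scaffold_split(["A", "A", "B", "B", "C"], frac_train=0.8)
```

Because group "C" does not fit in the 80% training budget after "A" and "B", its molecule ends up in the test fold with a scaffold never seen in training.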

GNN architecture

We use the AGGREGATE and COMBINE functions defined in the graph isomorphism network (GIN)19. Residual connections33, layer normalization34 and graph normalization35 are incorporated into GIN to further improve the performance. We also use average pooling as the READOUT function to obtain the graph representation.

Evaluation metrics

As suggested by MoleculeNet21, we apply the average ROC-AUC36 as the evaluation metric for the classification datasets. ROC-AUC (area under the receiver operating characteristic curve) is used to evaluate the performance of binary classification tasks, for which higher is better. With respect to the regression datasets, we use root mean square error (RMSE) for FreeSolv37, ESOL38 and Lipo39, whereas we use mean absolute error (MAE) for QM740, QM841 and QM942. We execute four independent runs for each method and report the mean and the standard deviation of the metrics.
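For reference, the two regression metrics in plain Python (equivalent routines exist in sklearn.metrics):

```python
# RMSE and MAE as used for the regression benchmarks; lower is better.
import math

def rmse(pred, true):
    """Root mean square error: penalizes large deviations quadratically."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def mae(pred, true):
    """Mean absolute error: the average magnitude of the residuals."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

print(rmse([1.0, 2.0], [1.0, 4.0]), mae([1.0, 2.0], [1.0, 4.0]))
```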

Baselines

We compare the proposed method with various competitive baselines. D-MPNN43, AttentiveFP44, SGCN16, DimeNet17 and HMGNN6 are GNNs without pre-training, among which SGCN, DimeNet and HMGNN incorporate three-dimensional geometry information; N-Gram45, PretrainGNN11 and GROVER4 are methods with pre-training. N-Gram assembles the node embeddings from short walks in the graph and then leverages Random Forest or XGBoost to predict the molecular properties. PretrainGNN implements several types of self-supervised learning tasks, among which we report the best result. GROVER integrates GNN into Transformer with two self-supervised tasks, and we report the results of GROVERbase and GROVERlarge with different network capacities.

Experimental results

Overall operation

The overall performance of GEM along with the other methods is summarized in Table 1. We have the following observations: (1) GEM achieves SOTA results on 14 out of 15 datasets. On the regression tasks, GEM achieves an overall relative improvement of 8.8% on average compared with the previous SOTA results on each dataset. On the classification tasks, GEM achieves an overall relative improvement of 4.7% on the average ROC-AUC compared with the previous SOTA result from D-MPNN. (2) GEM achieves more substantial improvements on the regression datasets than on the classification datasets. We conjecture that this is because the regression datasets focus on predicting quantum chemical properties, which are highly correlated with molecular geometries.

Table 1 Overall performance for regression tasks and classification tasks


Contribution of GeoGNN

We investigate the effect of GeoGNN without pre-training on the regression datasets, which involve properties of quantum mechanics and physical chemistry that are highly correlated with molecular geometries. GeoGNN is compared with multiple GNN architectures, including: (1) the commonly used GNN architectures GIN19, GAT46 and GCN47; (2) recent works incorporating three-dimensional molecular geometry, SGCN16, DimeNet17 and HMGNN6; (3) architectures specially designed for molecular representation, D-MPNN43, AttentiveFP44 and GTransformer4. From Table 2, we can conclude that GeoGNN considerably outperforms the other GNN architectures on all the regression datasets, since GeoGNN incorporates geometrical parameters even though the three-dimensional coordinates of the atoms are simulated. The overall relative improvement is 7.9% compared with the best results of the previous methods.

Table 2 Performance of different GNN architectures for regression tasks


Contribution of geometry-level tasks

To study the effect of the proposed geometry-level self-supervised learning tasks, we apply different types of self-supervised learning tasks to pre-train GeoGNN on the regression datasets. In Table 3, 'Without pre-train' denotes the GeoGNN network without pre-training, 'Geometry' denotes our proposed geometry-level tasks, 'Graph' denotes the graph-level task that predicts the molecular fingerprints and 'Context'4 denotes a node-level task that predicts the atomic context. In general, the methods with geometry-level tasks are better than those without. Furthermore, 'Geometry' performs better than 'Geometry + Graph' on the regression tasks, which may be due to the weak connection between molecular fingerprints and the regression tasks.

Table 3 Performance of GeoGNN with different pre-training strategies for regression tasks


Pre-trained representations visualization

To intuitively observe the representations that the self-supervised tasks (without downstream fine-tuning) have learned, we visualize the representations by mapping them to two-dimensional space with the t-SNE algorithm48; details can be found in the Supplementary Information. The Davies–Bouldin index49 is calculated to measure the separation of the clusters: the lower the Davies–Bouldin index, the better the separation. Here we test whether the pre-training methods are able to distinguish molecules with valid geometry (generated from RDKit) from molecules with invalid geometry (randomly generated). We randomly select 1,000 molecules from ZINC. For each molecule, we generate a valid and an invalid geometry. As shown in Fig. 3a, both the graph-level and geometry-level pre-training methods can better distinguish the valid geometry from the invalid geometry compared with no pre-training. Moreover, the geometry-level pre-training further decreases the Davies–Bouldin index to 2.63, compared with 7.88 for the graph-level.
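To make the cluster-separation measure concrete, here is a minimal Davies–Bouldin index for the two-cluster, one-dimensional case (our toy reduction of the general definition; with two clusters the index collapses to a single scatter-to-separation ratio):

```python
# Minimal Davies-Bouldin index for two 1-D clusters: average within-cluster
# scatter divided by the distance between centroids. Lower means better
# separated clusters, matching the usage in the text.
def davies_bouldin_2(c1, c2):
    m1 = sum(c1) / len(c1)
    m2 = sum(c2) / len(c2)
    s1 = sum(abs(x - m1) for x in c1) / len(c1)  # scatter of cluster 1
    s2 = sum(abs(x - m2) for x in c2) / len(c2)  # scatter of cluster 2
    return (s1 + s2) / abs(m1 - m2)  # same ratio for both clusters when k=2

print(davies_bouldin_2([0.0, 1.0], [10.0, 11.0]))
```

For higher-dimensional embeddings and more clusters, sklearn.metrics.davies_bouldin_score implements the full definition.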

Fig. 3: Visualizations and ablation studies.

a, A pre-trained representation visualization comparing different self-supervised methods. The valid geometry cluster contains molecules with geometry generated by RDKit, whereas the invalid geometry cluster contains those with randomly generated geometry. b, A representation visualization comparing different qualities of geometries. The high cluster contains molecules with high HOMO–LUMO gaps, whereas the low cluster contains those with low HOMO–LUMO gaps. c, MAE difference (the lower the better) on QM9 between the baseline GeoGNN and other GeoGNN variants.


Impact of the quality of geometry

To investigate the impact of the quality of geometry, we first compare GeoGNN, which adopts the default force field MMFF, with GeoGNN (UFF), which adopts the universal force field (UFF)50, on the QM9 dataset. GeoGNN and GeoGNN (UFF) achieve similar performance, as shown in Fig. 3c. The impact of the more precise three-dimensional coordinates provided by the QM9 dataset (calculated by DFT51) is also investigated. GeoGNN (precise 3D) achieves a great improvement of about 12% compared with the baseline GeoGNN.

Furthermore, Fig. 3b shows the representation visualizations for different qualities of molecular geometry. GeoGNN (without 3D) is a variant of GeoGNN that masks all the geometry features with zeros, GeoGNN is the baseline that utilizes coarse three-dimensional coordinates, and GeoGNN (precise 3D) utilizes precise three-dimensional coordinates generated by DFT. We equally divide 2,000 molecules from QM9 into two clusters, one with high HOMO–LUMO gaps and the other with low HOMO–LUMO gaps. We test the ability of the different models to distinguish these two groups of molecules. Visually, we observe that GeoGNN can better separate the clusters than GeoGNN (without 3D), whereas GeoGNN (precise 3D) works better than GeoGNN. The differences in the Davies–Bouldin index support these observations.

Contributions of atom–bond and bond–angle graphs

We evaluate the contributions of the atom–bond and bond–angle graphs in GeoGNN on the QM9 dataset, as shown in Fig. 3c. The atom–bond variant utilizes the atom–bond graph only and pools over the representations of the atoms to estimate the properties, whereas the bond–angle variant utilizes the bond–angle graph only and pools over the representations of the bonds. GeoGNN, which consists of both the atom–bond and bond–angle graphs, performs better than the above two variants, indicating that both graphs contribute to the performance.

Related work

Molecular representation

Current molecular representations can be categorized into three types: molecular fingerprints, sequence-based representations and graph-based representations.

Molecular fingerprints

Molecular fingerprints such as ECFP25 and MACCS24 are molecular descriptors. Fingerprints are handcrafted representations—widely used by traditional machine learning methods3,52,53,54—that encode a molecule into a sequence of bits according to the molecule's topological substructures. Although fingerprints can represent the presence of substructures in molecules, they suffer from bit collisions and vector sparsity, limiting their representation power.

Sequence-based representations

Some studies3,55 take SMILES strings56, which describe molecules as strings, as inputs, and leverage sequence-based models such as recurrent neural networks and Transformer57,58 to learn the molecular representations; however, it is difficult for sequence-based methods to capture the syntax of SMILES. For example, two adjacent atoms may be far apart in the text sequence. Also, a small change in a SMILES string can lead to a large change in the molecular structure.

Graph-based representations

Many works3,4,5,6,18 have showcased the great potential of graph neural networks for modelling molecules by taking each atom as a node and each chemical bond as an edge. For example, AttentiveFP44 proposes to extend the graph attention mechanism to learn aggregation weights. Meanwhile, a group of studies have tried to incorporate three-dimensional geometry information: (1) some works13,14,15 take partial geometry information, such as atomic distances, as features; (2) SGCN16 proposed a spatial graph convolution that uses relative position vectors between atoms as input features; (3) DimeNet17 proposed a message passing scheme based on bonds that transforms messages according to angles.

Pre-training for GNNs

Self-supervised learning7,8,9,10,59 has achieved great success in natural language processing, computer vision and other domains; it trains on unlabelled samples in a supervised way to alleviate the overfitting issue and improve data utilization efficiency. Some studies4,11 recently applied self-supervised learning methods to GNNs for molecular property prediction to overcome the insufficiency of labelled samples. These works learn the molecular representation vectors by exploiting node- and graph-level tasks, where the node-level tasks learn local domain knowledge by predicting node properties and the graph-level tasks learn global domain knowledge by predicting biological activities. Although existing self-supervised learning methods can boost the generalization ability, they neglect the spatial knowledge that is strongly related to the molecular properties.

Conclusion

Efficient molecular representation learning is crucial for molecular property prediction. Existing works that use pre-training methods for molecular property prediction fail to utilize the molecular geometries described by bond lengths, bond angles and other geometrical parameters. To this end, we design a geometry-based GNN and multiple geometry-level self-supervised learning methods to capture the molecular spatial knowledge. Extensive experiments were conducted to verify the effectiveness of GEM, comparing it with multiple competitive baselines. GEM considerably outperforms the other methods on multiple benchmarks. In the future we will try to adapt the proposed framework to more molecular tasks, particularly the protein–ligand affinity prediction task, which requires many three-dimensional samplings.

Methods

Preliminaries for GNNs

A graph neural network is a message-passing neural network. More concretely, given a node v, its representation vector \({\mathbf{h}}_{v}^{(k)}\) at the kth iteration is formalized by

$$\begin{array}{lll}{\mathbf{a}}_{v}^{(k)}&=&{\mathrm{AGGREGATE}}^{(k)}\left(\{({\mathbf{h}}_{v}^{(k-1)},{\mathbf{h}}_{u}^{(k-1)},{\mathbf{x}}_{uv})| u\in {{{\mathcal{N}}}}(v)\}\right),\\ {\mathbf{h}}_{v}^{(k)}&=&{\mathrm{COMBINE}}^{(k)}({\mathbf{h}}_{v}^{(k-1)},{\mathbf{a}}_{v}^{(k)}).\end{array}$$

(1)

where \({{{\mathcal{N}}}}(v)\) is the set of neighbours of node v, AGGREGATE(k) is the aggregation function for accumulating messages from a node's neighbourhood, and COMBINE(k) is the update function for updating the node representation. We initialize \({\mathbf{h}}_{v}^{(0)}\) with the feature vector of node v, that is, \({\mathbf{h}}_{v}^{(0)}={\mathbf{x}}_{v}\).

A READOUT function is introduced to integrate the nodes' representation vectors at the final iteration so as to obtain the graph's representation vector h_G, which is formalized as

$${\mathbf{h}}_{G}={\mathrm{READOUT}}({\mathbf{h}}_{v}^{(K)}| v\in {{{\mathcal{V}}}}),$$

(2)

where K is the number of iterations. In most cases, READOUT is a permutation-invariant pooling function, such as summation or maximization. The graph's representation vector h_G can then be used for downstream task predictions.
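Equations (1) and (2) can be instantiated minimally as follows, using sum aggregation, additive combine and sum readout over scalar node features as toy stand-ins for the learned AGGREGATE/COMBINE functions:

```python
# A framework-free toy instance of the message-passing scheme: scalar node
# features, AGGREGATE = sum over neighbours, COMBINE = addition,
# READOUT = sum pooling over all nodes.
def gnn_forward(node_feats, edges, num_iters=2):
    """node_feats: {node: float}; edges: list of (u, v) undirected pairs."""
    nbrs = {v: [] for v in node_feats}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    h = dict(node_feats)  # h^(0) = x
    for _ in range(num_iters):
        a = {v: sum(h[u] for u in nbrs[v]) for v in h}  # AGGREGATE
        h = {v: h[v] + a[v] for v in h}                  # COMBINE
    return sum(h.values())                               # READOUT

print(gnn_forward({0: 1.0, 1: 2.0, 2: 3.0}, [(0, 1), (1, 2)]))  # 34.0
```

Real GNNs replace the scalar sums with learned, vector-valued functions, but the control flow is the same.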

GeoGNN

The GeoGNN architecture encodes the molecular geometries by modelling two graphs, the atom–bond graph and the bond–angle graph, under which the representation vectors of atoms and bonds are learned iteratively. More concretely, the representation vectors of atom u and bond (u,v) at the kth iteration are denoted by h_u and h_uv, respectively. We initialize \({\mathbf{h}}_{u}^{(0)}={\mathbf{x}}_{u}\) and \({\mathbf{h}}_{uv}^{(0)}={\mathbf{x}}_{uv}\).

Given a bond (u,v), its representation vector \({\mathbf{h}}_{uv}^{(k)}\) at the kth iteration is formalized by

$$\begin{array}{ll}{\mathbf{a}}_{uv}^{(k)}=&{\mathrm{AGGREGATE}}_{\mathrm{bond-angle}}^{(k)}\left(\{({\mathbf{h}}_{uv}^{(k-1)},{\mathbf{h}}_{uw}^{(k-1)},{\mathbf{x}}_{wuv}):w\in {{{\mathcal{N}}}}(u)\}\right.\\ &\hspace{1em}\qquad \qquad \qquad \qquad \quad \left.\cup \{({\mathbf{h}}_{uv}^{(k-1)},{\mathbf{h}}_{vw}^{(k-1)},{\mathbf{x}}_{uvw}):w\in {{{\mathcal{N}}}}(v)\}\right),\\ {\mathbf{h}}_{uv}^{(k)}=&{\mathrm{COMBINE}}_{\mathrm{bond-angle}}^{(k)}({\mathbf{h}}_{uv}^{(k-1)},{\mathbf{a}}_{uv}^{(k)}).\end{array}$$

(3)

Here, \({{{\mathcal{N}}}}(u)\) and \({{{\mathcal{N}}}}(v)\) denote the neighbouring atoms of u and v, respectively; \(\{(u,w):w\in {{{\mathcal{N}}}}(u)\}\cup \{(v,w):w\in {{{\mathcal{N}}}}(v)\}\) are the neighbouring bonds of (u,v). AGGREGATEbond−angle is the message aggregation function and COMBINEbond−angle is the update function for the bond–angle graph H. In this manner, the information from the neighbouring bonds and the corresponding bond angles is aggregated into \({\mathbf{a}}_{uv}^{(k)}\). The representation vector of bond (u,v) is then updated according to the aggregated information. With the learned representation vectors of the bonds from the bond–angle graph H, given an atom u, its representation vector \({\mathbf{h}}_{u}^{(k)}\) at the kth iteration can be formalized as

$$\begin{array}{lll}{\mathbf{a}}_{u}^{(k)}&=&{\mathrm{AGGREGATE}}_{\mathrm{atom}-{\mathrm{bond}}}^{(k)}(\{({\mathbf{h}}_{u}^{(k-1)},{\mathbf{h}}_{v}^{(k-1)},{\mathbf{h}}_{uv}^{(k-1)}):v\in {{{\mathcal{N}}}}(u)\}),\\ {\mathbf{h}}_{u}^{(k)}&=&{\mathrm{COMBINE}}_{\mathrm{{atom}-{bond}}}^{(k)}({\mathbf{h}}_{u}^{(k-1)},{\mathbf{a}}_{u}^{(k)}).\end{array}$$

(4)

Similarly, \({{{\mathcal{N}}}}(u)\) denotes the neighbouring atoms of atom u, AGGREGATEatom−bond is the message aggregation function for atom–bond graph G, and COMBINEatom−bond is the update function. For atom u, messages are aggregated from the neighbouring atoms and the corresponding bonds. Note that the messages of the bonds are learned from the bond–angle graph H. The aggregated messages then update the representation vector of atom u.
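To make the two-stage scheme concrete, the following Python sketch runs one iteration of equations (3) and (4) on a toy three-atom molecule. The sum aggregation, tanh transform and residual COMBINE used here are placeholder assumptions for illustration; GeoGNN's actual AGGREGATE and COMBINE functions are learned networks.

```python
import numpy as np

# One GeoGNN iteration on a toy molecule (equations (3) and (4)).
# Sum aggregation + tanh + residual COMBINE are illustrative assumptions.
rng = np.random.default_rng(0)
dim = 8

atoms = [0, 1, 2]                      # three atoms
bonds = [(0, 1), (1, 2)]               # two bonds sharing atom 1
angle_feat = {((0, 1), (1, 2)): rng.normal(size=dim)}  # bond-angle feature x

h_atom = {u: rng.normal(size=dim) for u in atoms}      # h_u^(k-1)
h_bond = {b: rng.normal(size=dim) for b in bonds}      # h_uv^(k-1)
W_bond = rng.normal(size=(3 * dim, dim)) / np.sqrt(3 * dim)
W_atom = rng.normal(size=(3 * dim, dim)) / np.sqrt(3 * dim)

def angle_neighbours(bond):
    # Bonds sharing an endpoint with `bond`, with the angle feature between them.
    for other in bonds:
        if other != bond and set(bond) & set(other):
            key = (bond, other) if (bond, other) in angle_feat else (other, bond)
            yield other, angle_feat[key]

# Equation (3): update bond vectors on the bond-angle graph H.
new_bond = {}
for b in bonds:
    agg = np.zeros(dim)
    for nb, x_angle in angle_neighbours(b):
        agg += np.tanh(np.concatenate([h_bond[b], h_bond[nb], x_angle]) @ W_bond)
    new_bond[b] = h_bond[b] + agg

# Equation (4): update atom vectors on the atom-bond graph G,
# consuming the bond messages just produced from H.
new_atom = {}
for u in atoms:
    agg = np.zeros(dim)
    for b in bonds:
        if u in b:
            v = b[1] if b[0] == u else b[0]
            agg += np.tanh(np.concatenate([h_atom[u], h_atom[v], new_bond[b]]) @ W_atom)
    new_atom[u] = h_atom[u] + agg

print({u: new_atom[u].shape for u in atoms})
```

Note that the bond update runs first, so the atom update in the second stage already receives geometry-aware bond messages, mirroring the order of equations (3) and (4).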

The representation vectors of the atoms at the final iteration are integrated to obtain the molecular representation vector h G by the READOUT function, which is formalized as

$${\mathbf{h}}_{G}={\mathrm{READOUT}}({\mathbf{h}}_{u}^{(K)}| u\in {{{\mathcal{V}}}}),$$

(5)

where K is the number of iterations. The molecule's representation vector h G is used to predict the molecular properties.
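With summation chosen as the READOUT function, for example, equation (5) is simply a column-wise sum over the final-iteration atom vectors (toy values below; mean or max pooling would be applied in the same way):

```python
import numpy as np

# Equation (5) with a summation READOUT: pool the final atom vectors
# h_u^(K) (rows) into the molecule-level vector h_G.
h_atoms = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])   # toy h_u^(K) for three atoms
h_G = h_atoms.sum(axis=0)          # permutation invariant over atoms
print(h_G)                         # [ 9. 12.]
```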

Geometry-level self-supervised learning tasks

Local spatial structures

The self-supervised tasks for local spatial information are designed to learn two important molecular geometrical parameters, the bond lengths and the bond angles. The loss functions of the self-supervised tasks are defined as follows:

$$\begin{array}{lll}{L}_{\mathrm{length}}({{{\mathcal{E}}}})&=&\frac{1}{| {{{\mathcal{E}}}}| }\mathop{\sum}\limits_{(u,v)\in {{{\mathcal{E}}}}}{({f}_{\mathrm{length}}({\mathbf{h}}_{u}^{(K)},{\mathbf{h}}_{v}^{(K)})-{l}_{uv})}^{2};\\ {L}_{\mathrm{angle}}({{{\mathcal{A}}}})&=&\frac{1}{| {{{\mathcal{A}}}}| }\mathop{\sum}\limits_{(u,v,w)\in {{{\mathcal{A}}}}}{({f}_{\mathrm{angle}}({\mathbf{h}}_{u}^{(K)},{\mathbf{h}}_{v}^{(K)},{\mathbf{h}}_{w}^{(K)})-{\phi }_{uvw})}^{2}.\end{array}$$

(6)

Here, \({L}_{\mathrm{length}}({{{\mathcal{E}}}})\) is the loss function for bond lengths, with \({{{\mathcal{E}}}}\) as the set of bonds; \({L}_{\mathrm{angle}}({{{\mathcal{A}}}})\) is the loss function for bond angles, with \({{{\mathcal{A}}}}\) as the set of angles; K is the number of iterations for GeoGNN; f length() is the network predicting the bond lengths; f angle() is the network predicting the bond angles; l u v denotes the length of the bond connecting atoms u and v; and ϕ u v w denotes the degree of the bond angle connecting bonds (u,v) and (v,w).
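As a sketch of how these two mean-squared-error terms are computed, the snippet below uses stand-in dot-product heads for f length and f angle (the real heads are learned networks) together with toy representation vectors and toy geometry labels:

```python
import numpy as np

# Equation (6): squared-error losses over bond lengths and bond angles.
# f_length / f_angle below are hypothetical stand-ins for the learned heads.
def f_length(h_u, h_v):
    return float(np.dot(h_u, h_v))

def f_angle(h_u, h_v, h_w):
    return float(np.dot(h_u + h_w, h_v))

h = {0: np.array([0.5, 0.5]),             # toy h_u^(K) vectors
     1: np.array([1.0, 0.0]),
     2: np.array([0.0, 1.0])}
bond_len = {(0, 1): 1.54, (1, 2): 1.09}   # l_uv in angstroms (toy labels)
bond_ang = {(0, 1, 2): 1.91}              # phi_uvw in radians (toy label)

# Mean squared error over all bonds and over all angles, as in equation (6).
L_length = np.mean([(f_length(h[u], h[v]) - l) ** 2
                    for (u, v), l in bond_len.items()])
L_angle = np.mean([(f_angle(h[u], h[v], h[w]) - p) ** 2
                   for (u, v, w), p in bond_ang.items()])
print(L_length, L_angle)
```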

Global spatial structures

The self-supervised tasks for global spatial information are designed to learn the atomic distance matrices between all atom pairs. Each element of the distance matrices is the three-dimensional distance between two atoms. We use d u v to denote the distance between two atoms u and v in the molecule. For the atomic distance prediction task, we clip the distance to the range from 0 Å to 20 Å and project it into 30 bins with equal stride. The loss function of the self-supervised tasks is defined as follows:

$${L}_{\mathrm{distance}}({{{\mathcal{V}}}})=\frac{1}{| {{{\mathcal{V}}}}{| }^{2}}\mathop{\sum}\limits_{u,v\in {{{\mathcal{V}}}}}-{\mathrm{bin}}^{T}({d}_{uv})\cdot {\mathrm{log}}({f}_{\mathrm{distance}}({\mathbf{h}}_{u}^{(K)},{\mathbf{h}}_{v}^{(K)})),$$

(7)

where \({{{\mathcal{V}}}}\) is the set of atoms, f distance() is the network predicting the distribution of atomic distances, the bin() function is used to discretize the atomic distance d u v into a one-hot vector and \({\mathrm{log}}(\cdot )\) is the logarithmic function.
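The discretization and the cross-entropy term of equation (7) can be sketched as follows for a single atom pair; the uniform distribution stands in for the output of the learned f distance network:

```python
import numpy as np

# Equation (7): clip distances to [0 A, 20 A], discretize into 30 equal
# bins, and score a predicted bin distribution against the one-hot target.
N_BINS, D_MAX = 30, 20.0

def bin_onehot(d):
    # bin(): map a distance to a one-hot vector over 30 equal-width bins.
    idx = min(int(np.clip(d, 0.0, D_MAX) / (D_MAX / N_BINS)), N_BINS - 1)
    onehot = np.zeros(N_BINS)
    onehot[idx] = 1.0
    return onehot

def cross_entropy(target_onehot, pred_dist):
    # -bin^T(d_uv) . log(f_distance(...)) for one atom pair.
    return float(-np.dot(target_onehot, np.log(pred_dist)))

d_uv = 3.2                               # toy atomic distance in angstroms
pred = np.full(N_BINS, 1.0 / N_BINS)     # stand-in for f_distance output
loss = cross_entropy(bin_onehot(d_uv), pred)
print(loss)                              # log(30), about 3.40, for a uniform prediction
```

Averaging this quantity over all atom pairs in \({{{\mathcal{V}}}}\times{{{\mathcal{V}}}}\) gives the full loss of equation (7).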

Data availability

The self-supervised data used in our study are publicly available in ZINC (https://zinc.docking.org/tranches/dwelling/), whereas the downstream benchmarks tin be downloaded from MoleculeNet (https://moleculenet.org/datasets-1).

Code availability

The source code of this study, providing the geometry-based GNN and several geometry-level self-supervised learning methods, is freely available at GitHub (https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/pretrained_compound/ChemRL/GEM) to allow replication of the results. The version used for this publication is available at https://doi.org/10.5281/zenodo.5781821.

References

  1. Shen, J. & Nicolaou, C. A. Molecular property prediction: recent trends in the era of artificial intelligence. Drug Discov. Today Technol. 32–33, 29–36 (2020).

    Google Scholar

  2. Wieder, O. et al. A compact review of molecular property prediction with graph neural networks. Drug Discov. Today Technol. 37, 1–12 (2020).

    Article  Google Scholar

  3. Huang, K. et al. DeepPurpose: a deep learning library for drug-target interaction prediction. Bioinformatics 36, 5545–5547 (2020).

    Article  Google Scholar

  4. Rong, Y. et al. Self-supervised graph transformer on large-scale molecular data. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 (eds Larochelle, H. et al.) 12559–12571 (NeurIPS, 2020).

  5. Shindo, H. & Matsumoto, Y. Gated graph recursive neural networks for molecular property prediction. Preprint at https://arxiv.org/abs/1909.00259 (2019).

  6. Shui, Z. & Karypis, G. Heterogeneous molecular graph neural networks for predicting molecule properties. In 20th IEEE International Conference on Data Mining (eds Plant, C. et al.) 492–500 (IEEE, 2020).

  7. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).

  8. He, P., Liu, X., Gao, J. & Chen, W. DeBERTa: decoding-enhanced BERT with disentangled attention. In 9th International Conference on Learning Representations (ICLR, 2021).

  9. Doersch, C., Gupta, A. & Efros, A. A. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision (IEEE Computer Society, 2015).

  10. Gidaris, S., Singh, P. & Komodakis, N. Unsupervised representation learning by predicting image rotations. In 6th International Conference on Learning Representations (ICLR, 2018).

  11. Hu, W. et al. Strategies for pre-training graph neural networks. In 8th International Conference on Learning Representations (ICLR, 2020).

  12. Peleg-Shulman, T., Najajreh, Y. & Gibson, D. Interactions of cisplatin and transplatin with proteins: comparison of binding kinetics, binding sites and reactivity of the Pt–protein adducts of cisplatin and transplatin towards biological nucleophiles. J. Inorg. Biochem. 91, 306–311 (2002).

    Article  Google Scholar

  13. Schütt, K. et al. SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017 (eds Guyon, I. et al.) 991–1001 (NeurIPS, 2017).

  14. Li, J., Xu, K., Chen, L., Zheng, Z. & Liu, X. GraphGallery: a platform for fast benchmarking and easy development of graph neural networks based intelligent software. In 43rd IEEE/ACM International Conference on Software Engineering: Companion Proceedings 13–16 (IEEE, 2021).

  15. Maziarka, L. et al. Molecule attention transformer. Preprint at https://arxiv.org/abs/2002.08264 (2020).

  16. Danel, T. et al. Spatial graph convolutional networks. In Neural Information Processing – 27th International Conference, ICONIP 2020 Vol. 1333 (eds Yang, H. et al.) 668–675 (Springer, 2020).

  17. Klicpera, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. In 8th International Conference on Learning Representations (ICLR, 2020).

  18. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In Proc. 34th International Conference on Machine Learning Vol. 70 (eds Precup, D. & Teh, Y. W.) 1263–1272 (PMLR, 2017).

  19. Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In 7th International Conference on Learning Representations (ICLR, 2019).

  20. Sun, F.-Y., Hoffmann, J., Verma, V. & Tang, J. InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In 8th International Conference on Learning Representations (ICLR, 2020).

  21. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).

    Article  Google Scholar

  22. Sterling, T. & Irwin, J. J. ZINC 15—ligand discovery for everyone. J. Chem. Inf. Model. 55, 2324–2337 (2015).

    Article  Google Scholar

  23. Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 17, 490–519 (1996).

    Article  Google Scholar

  24. Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).

    Article  Google Scholar

  25. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

    Article  Google Scholar

  26. Subramanian, G., Ramsundar, B., Pande, V. & Denny, R. A. Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. J. Chem. Inf. Model. 56, 1936–1949 (2016).

    Article  Google Scholar

  27. Martins, I. F., Teixeira, A. L., Pinheiro, L. & Falcão, A. O. A Bayesian approach to in silico blood–brain barrier penetration modeling. J. Chem. Inf. Model. 52, 1686–1697 (2012).

    Article  Google Scholar

  28. Richard, A. M. et al. ToxCast chemical landscape: paving the road to 21st century toxicology. Chem. Res. Toxicol. 29, 1225–1251 (2016).

    Article  Google Scholar

  29. Gayvert, K. M., Madhukar, N. S. & Elemento, O. A data-driven approach to predicting successes and failures of clinical trials. Cell Chem. Biol. 23, 1294–1301 (2016).

    Article  Google Scholar

  30. Huang, R. et al. Editorial: Tox21 challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental toxicants and drugs. Front. Environ. Sci. 3, 85 (2017).

    Google Scholar

  31. Kuhn, M., Letunic, I., Jensen, L. J. & Bork, P. The SIDER database of drugs and side effects. Nucl. Acids Res. 44, 1075–1079 (2016).

    Article  Google Scholar

  32. Ramsundar, B., Eastman, P., Walters, P. & Pande, V. Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More (O'Reilly Media, 2019).

  33. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  34. Ba, L. J., Kiros, J. R. & Hinton, G. E. Layer normalization. In NIPS 2016 Deep Learning Symposium (NIPS, 2016).

  35. Chen, Y., Tang, X., Qi, X., Li, C.-G. & Xiao, R. Learning graph normalization for graph neural networks. Preprint at https://arxiv.org/abs/2009.11746 (2020).

  36. Bradley, A. P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30, 1145–1159 (1997).

    Article  Google Scholar

  37. Mobley, D. L. & Guthrie, J. P. FreeSolv: a database of experimental and calculated hydration free energies, with input files. J. Comput. Aided Mol. Des. 28, 711–720 (2014).

    Article  Google Scholar

  38. Delaney, J. S. ESOL: estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Model. 44, 1000–1005 (2004).

    Google Scholar

  39. Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucl. Acids Res. 40, 1100–1107 (2012).

    Article  Google Scholar

  40. Blum, L. C. & Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131, 8732–8733 (2009).

    Article  Google Scholar

  41. Ramakrishnan, R., Hartmann, M., Tapavicza, E. & von Lilienfeld, O. A. Electronic spectra from TDDFT and machine learning in chemical space. J. Chem. Phys. 143, 084111 (2015).

    Article  Google Scholar

  42. Ruddigkeit, L., van Deursen, R., Blum, L. C. & Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 52, 2864–2875 (2012).

    Article  Google Scholar

  43. Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).

    Article  Google Scholar

  44. Xiong, Z. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 63, 8749–8760 (2020).

    Article  Google Scholar

  45. Liu, S., Demirel, M. F. & Liang, Y. N-gram graph: simple unsupervised representation for graphs, with applications to molecules. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019 (eds Wallach, H. M. et al.) 8464–8476 (NeurIPS, 2019).

  46. Velickovic, P. et al. Graph attention networks. In 5th International Conference on Learning Representations (ICLR, 2017).

  47. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR, 2017).

  48. van der Maaten, L. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).

    MathSciNet  MATH  Google Scholar

  49. Davies, D. L. & Bouldin, D. W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1, 224–227 (1979).

    Article  Google Scholar

  50. Rappé, A. K., Casewit, C. J., Colwell, K. S., Goddard, W. A. III & Skiff, W. M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J. Am. Chem. Soc. 114, 10024–10035 (1992).

    Article  Google Scholar

  51. Gross, E. K. U. & Dreizler, R. M. Density Functional Theory Vol. 337 (Springer, 2013).

  52. Cereto-Massagué, A. et al. Molecular fingerprint similarity search in virtual screening. Methods 71, 58–63 (2015).

    Article  Google Scholar

  53. Coley, C. W., Barzilay, R., Green, W. H., Jaakkola, T. S. & Jensen, K. F. Convolutional embedding of attributed molecular graphs for physical property prediction. J. Chem. Inf. Model. 57, 1757–1772 (2017).

    Article  Google Scholar

  54. Duvenaud, D. et al. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (eds Cortes, C. et al.) 2224–2232 (NeurIPS, 2015).

  55. Goh, G. B., Hodas, N. O., Siegel, C. & Vishnu, A. SMILES2Vec: an interpretable general-purpose deep neural network for predicting chemical properties. Preprint at https://arxiv.org/abs/1712.02034 (2018).

  56. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).

    Article  Google Scholar

  57. Zaremba, W., Sutskever, I. & Vinyals, O. Recurrent neural network regularization. Preprint at https://arxiv.org/abs/1409.2329 (2014).

  58. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017 5998–6008 (NeurIPS, 2017).

  59. Li, P. et al. Learn molecular representations from large-scale unlabeled molecules for drug discovery. Preprint at https://arxiv.org/abs/2012.11175 (2020).


Acknowledgements

This work is supported by the National Engineering Research Center of Deep Learning Technology and Applications.

Author information

Affiliations

Contributions

X.F., F.W., H. Wu and H. Wang led the research. L.L., X.F. and F.W. contributed technical ideas. L.L., J.L., D.H., S.Z. and X.F. developed the proposed method. X.F., L.L., S.Z. and J.Z. developed analytics. X.F., L.L., F.W., J.L., D.H., S.Z. and J.Z. wrote the paper.

Corresponding authors

Correspondence to Fan Wang, Hua Wu or Haifeng Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and Permissions

About this article


Cite this article

Fang, X., Liu, L., Lei, J. et al. Geometry-enhanced molecular representation learning for property prediction. Nat Mach Intell 4, 127–134 (2022). https://doi.org/10.1038/s42256-021-00438-4


  • Received:

  • Accustomed:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s42256-021-00438-4


