An earlier version of the work was published in NeurIPS 2017 as "Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network" with some slight difference in modeling.
This work proposes a template-free approach for reaction prediction with 2 stages:
- Identify reaction center (pairs of atoms that will lose a bond or form a bond)
- Enumerate the possible combinations of bond changes and rank the corresponding candidate products
We provide a jupyter notebook for walking through a demonstration with our pre-trained models. You can
download it with wget https://data.dgl.ai/dgllife/reaction_prediction_pretrained.ipynb
and you need to put it
in this directory. Below we visualize a reaction prediction by the model:
The example by default works with reactions from USPTO (United States Patent and Trademark) granted patents, collected by Lowe [1]. After removing duplicates and erroneous reactions, the authors obtain a set of 480K reactions. The dataset is divided into 400K, 40K, and 40K for training, validation and test.
Reaction centers refer to the pairs of atoms that lose/form a bond in the reactions. A Graph neural network (Weisfeiler-Lehman Network in this case) is trained to update the representations of all atoms. Then we combine pairs of atom representations to predict the likelihood for the corresponding atoms to lose a bond or form a particular bond.
In particular, we represent each candidate bond change as a 3-tuple a-b-c, where a and b are atom mapping numbers of two atoms in the reactants and c is the possible change, which can be losing a bond, forming a single bond, forming a double bond, forming a triple bond, and forming an aromatic bond.
For evaluation, we select candidate bond changes with top-k scores for each reaction and compute the proportion of reactions whose bond changes have all been included.
We use GPU whenever possible. To train the model with default options, simply do
python find_reaction_center_train.py
Once the training process starts, the progress will be printed in the terminal as follows:
Epoch 1/50, iter 8150/20452 | loss 8.4788 | grad norm 12.9927
Epoch 1/50, iter 8200/20452 | loss 8.6722 | grad norm 14.0833
Everytime the learning rate is decayed (specified as 'decay_every'
in configure.py
's reaction_center_config
), we save a checkpoint of
the model and evaluate the model on the validation set. The evaluation result is formatted as follows, where total samples x
means
the we have trained the model on x
samples and acc@k
means top-k accuracy:
total samples 800000, (epoch 2/35, iter 2443/2557) | acc@12 0.9278 | acc@16 0.9419 | acc@20 0.9496 | acc@40 0.9596 | acc@80 0.9596 |
All model check points and evaluation results can be found under center_results
. model_x.pkl
stores a model
checkpoint after seeing x
training samples. val_eval.txt
stores all
evaluation results on the validation set.
You may want to terminate the training process when the validation performance no longer improves for some time.
By default we use one GPU only. We also allow multi-gpu training. To use GPUs with ids id1,id2,...
, do
python find_reaction_center_train.py --gpus id1,id2,...
A summary of the training speedup with the DGL implementation is presented below.
Item | Training time (s/epoch) | Speedup |
---|---|---|
Authors' implementation | 11657 | 1x |
DGL with 1 gpu | 858 | 13.6x |
DGL with 2 gpus | 443 | 26.3x |
DGL with 4 gpus | 243 | 48.0x |
DGL with 8 gpus | 134 | 87.0x |
python find_reaction_center_eval.py --model-path X
For example, you can evaluate the model trained for 800000 samples by setting X
to be
center_results/model_800000.pkl
. The evaluation results will be stored at center_results/test_eval.txt
.
For model evaluation, we can choose whether to exclude reactants not contributing heavy atoms to the product (e.g. reagents and solvents) in top-k atom pair selection, which will make the task easier. For the easier evaluation, do
python find_reaction_center_eval.py --easy
A summary of the model performance of various settings is as follows:
Item | Top 6 accuracy | Top 8 accuracy | Top 10 accuracy |
---|---|---|---|
Paper | 89.8 | 92.0 | 93.3 |
Hard evaluation from authors' code | 87.7 | 90.6 | 92.1 |
Easy evaluation from authors' code | 90.0 | 92.8 | 94.2 |
Hard evaluation | 88.9 | 91.7 | 93.1 |
Easy evaluation | 91.2 | 93.8 | 95.0 |
Hard evaluation for model trained on 8 gpus | 88.1 | 91.0 | 92.5 |
Easy evaluation for model trained on 8 gpus | 90.3 | 93.3 | 94.6 |
- We are able to match the results reported from authors' code for both single-gpu and multi-gpu training
- While multi-gpu training provides a great speedup, the performance with the default hyperparameters drops slightly.
By default we use 32 processes for data pre-processing. If you encounter an error with
BrokenPipeError: [Errno 32] Broken pipe
, you can specify a smaller number of processes with
python find_reaction_center_train.py -np X
python find_reaction_center_eval.py -np X
where X
is the number of processes that you would like to use.
We provide a pre-trained model so users do not need to train from scratch. To evaluate the pre-trained model, simply do
python find_reaction_center_eval.py
New datasets should be processed such that each line corresponds to the SMILES for a reaction (rxn smiles) like below:
[CH3:14][NH2:15].[N+:1](=[O:2])([O-:3])[c:4]1[cH:5][c:6]([C:7](=[O:8])[OH:9])[cH:10][cH:11][c:12]1[Cl:13].[OH2:16]>>[N+:1](=[O:2])([O-:3])[c:4]1[cH:5][c:6]([C:7](=[O:8])[OH:9])[cH:10][cH:11][c:12]1[NH:15][CH3:14]
The reactants are placed before >>
and the product is placed after >>
. The reactants are separated by .
.
In addition, atom mapping information is provided.
Notice
The atom mapping numbers in the rxn should be consecutive integers starting from 1, or it will report the molAtomMapNumber issue. To avoid the problem, you could convert the raw rxn smiles with explicit hydrgogen atoms to the rxn smiles without hydrogen atoms by RDKit befor adding map ids for the rxn smiles.
For example, the raw rxn smiles is "[H]C([H])([H])Oc1ccc(CCNC=O)cc1OC([H])([H])[H]>>[H]C([H])([H])Oc1cc2c(cc1OC([H])([H])[H])CCN=C2", if you directly add map for the rxn by RDT software. the mapped rxn smiles is
[CH2:1]([CH2:2][NH:10][CH:11]=[O:21])[c:8]1[cH:7][cH:6][c:5]([O:4][CH3:3])[c:12]([cH:9]1)[O:13][CH3:14]>>[CH:11]1=[N:10][CH2:2][CH2:1][c:8]2[cH:9][c:12]([O:13][CH3:14])[c:5]([cH:6][c:7]12)[O:4][CH3:3]
the oxygen atom will be labelled as 21.
First, convert the raw rxn smiles to rxn smiles without hydrogen atoms.
#!python
from rdkit import Chem
def canonicalizatonsmi(smi):
newsmi = Chem.MolToSmiles(Chem.MolFromSmiles(smi))
return newsmi
def canon_reaction(rxnstring):
#print("rxnstring:",rxnstring)
r,p =rxnstring.split('>>')
rs = r.split('.')
#print("rs",rs)
ps = p.split('.')
#print("ps",p)
rscans=[]
pscans=[]
for reactant in rs:
temp=canonicalizatonsmi(reactant)
#print(reactant,temp)
rscans.append(temp)
for product in ps:
pscans.append(canonicalizatonsmi(product))
rscan='.'.join(sorted(rscans))
pscan='.'.join(pscans)
newrxnsring='%s>>%s'%(rscan,pscan)
return newrxnsring
from rdkit import Chem
rxnstring= '[H]C([H])([H])Oc1ccc(CCNC=O)cc1OC([H])([H])[H]>>[H]C([H])([H])Oc1cc2c(cc1OC([H])([H])[H])CCN=C2'
canon_reaction(rxnstring)
Then, add map for the rxn smiles.
[O:15]=[CH:1][NH:2][CH2:3][CH2:4][c:5]1[cH:6][cH:7][c:8]([O:9][CH3:10])[c:11]([O:12][CH3:13])[cH:14]1>>[CH:1]1=[N:2][CH2:3][CH2:4][c:5]2[cH:14][c:11]([O:12][CH3:13])[c:8]([O:9][CH3:10])[cH:7][c:6]12
the oxygen atom will be labelled as 15.
Lastly, if you use the scripts for multiple different datasets, you need to delete local cached files with . clean.sh
before working on a new dataset.
You can then train a model on new datasets with
python find_reaction_center_train.py --train-path X --val-path Y
where X
, Y
are paths to the new training/validation dataset as described above.
For evaluation,
python find_reaction_center_eval.py --test-path Z
where Z
is the path to the new test set as described above.
Before training/evaluating the model, the script will first check if the model can properly handle the reactions, resulting in the following 6 files:
- train_valid_reactions.proc: Reactions in the training file that can be handled by the model
- train_invalid_reactions.proc: Reactions in the training file that cannot be handled by the model
- val_valid_reactions.proc: Reactions in the validation file that can be handled by the model
- val_invalid_reactions.proc: Reactions in the validation file that cannot be handled by the model
- test_valid_reactions.proc: Reactions in the test file that can be handled by the model
- test_invalid_reactions.proc: Reactions in the test file that cannot be handled by the model
In addition to RDKit, MolVS is an alternative for comparing whether two molecules are the same after sanitization.
For candidate ranking, we assume that a model has been trained for reaction center prediction first. The pipeline for predicting candidate products given a reaction proceeds as follows:
- Select top-k bond changes for atom pairs in the reactants, ranked by the model for reaction center prediction. By default, we use k=80 and exclude reactants not contributing heavy atoms to the ground truth product in selecting top-k bond changes as in the paper.
- Filter out candidate bond changes for bonds that are already in the reactants
- Enumerate possible combinations of atom pairs with up to C pairs, which reflects the number of bond changes (losing or forming a bond) in reactions. A statistical analysis in USPTO suggests that setting it to 5 is enough.
- Filter out invalid combinations where 1) atoms in candidate bond changes are not connected or 2) an atom pair is predicted to have different types of bond changes (e.g. two atoms are predicted simultaneously to form a single and double bond) or 3) valence constraints are violated.
- Apply the candidate bond changes for each valid combination and get the corresponding candidate products.
- Construct molecular graphs for the reactants and candidate products, featurize their atoms and bonds.
- Apply a Weisfeiler-Lehman Network to the molecular graphs for reactants and candidate products and score them
We use GPU whenever possible. To train the model with default options, simply do
python candidate_ranking_train.py -cmp X
where X
is the path to a trained model for reaction center prediction. You can use our
pre-trained model by not specifying -cmp
.
Once the training process starts, the progress will be printed in the terminal as follows:
Epoch 6/6, iter 16439/20061 | time 1.1124 | accuracy 0.8500 | grad norm 5.3218
Epoch 6/6, iter 16440/20061 | time 1.1124 | accuracy 0.9500 | grad norm 2.1163
Everytime the learning rate is decayed (specified as 'decay_every'
in configure.py
's candidate_ranking_config
),
we save a checkpoint of the model and evaluate the model on the validation set. The evaluation result is formatted
as follows, where total samples x
means that we have trained the model for x
samples, acc@k
means top-k accuracy,
gfound
means the proportion of reactions where the ground truth product can be recovered by the ground truth bond changes.
We perform the evaluation based on RDKit-sanitized molecule equivalence (marked with [strict]
) and MOLVS-sanitized
molecule equivalence (marked with [molvs]
).
total samples 100000, (epoch 1/20, iter 5000/20061)
[strict] acc@1: 0.7732 acc@2: 0.8466 acc@3: 0.8763 acc@5: 0.8987 gfound 0.9864
[molvs] acc@1: 0.7779 acc@2: 0.8523 acc@3: 0.8826 acc@5: 0.9057 gfound 0.9953
All model check points and evaluation results can be found under candidate_results
. model_x.pkl
stores a model
checkpoint after seeing x
training samples in total. val_eval.txt
stores all
evaluation results on the validation set.
You may want to terminate the training process when the validation performance no longer improves for some time.
python candidate_ranking_eval.py --model-path X -cmp Y
where X
is the path to a trained model for candidate ranking and Y
is the path to a trained model
for reaction center prediction. For example, you can evaluate the model trained for 800000 samples by setting X
to be
candidate_results/model_800000.pkl
. The evaluation results will be stored at candidate_results/test_eval.txt
. As
in training, you can use our pre-trained model by not specifying -cmp
.
A summary of the model performance of various settings is as follows:
Item | Top 1 accuracy | Top 2 accuracy | Top 3 accuracy | Top 5 accuracy |
---|---|---|---|---|
Authors' strict evaluation | 85.6 | 90.5 | 92.8 | 93.4 |
DGL's strict evaluation | 85.6 | 90.0 | 91.7 | 92.9 |
Authors' molvs evaluation | 86.2 | 91.2 | 92.8 | 94.2 |
DGL's molvs evaluation | 86.1 | 90.6 | 92.4 | 93.6 |
We provide a pre-trained model so users do not need to train from scratch. To evaluate the pre-trained model, simply do
python candidate_ranking_eval.py
You can train a model on new datasets with
python candidate_ranking_train.py --train-path X --val-path Y -cmp Z
where X
, Y
are paths to the new training/validation dataset as in reaction center prediction. Z
is
the path to a trained model for reaction center prediction. You can use our pre-trained model by not specifying -cmp
.
For evaluation,
python candidate_ranking_eval.py --model-path X -cmp Y --test-path Z
where X
is the path to a trained model for candidate ranking, Y
is the path to a trained model
for reaction center prediction, and Z
is the path to the new test dataset as in reaction center prediction.
You can use the pre-trained model for reaction center prediction by not specifying -cmp
and use the pre-trained
model for candidate ranking by not specifying --model-path
.
By default we use a large number of processes for constructing mini-batches so that we can accelerate the training process. If you encounter the following issue
RuntimeError: received 0 items of ancdata
try using a smaller number of processes with -nw A
. By default we set A
to 100 for
candidate_ranking_train.py
and 32 for candidate_ranking_eval.py
.
[1] D. M.Lowe, Patent reaction extraction: downloads, https://bitbucket.org/dan2097/patent-reaction-extraction/downloads, 2014.