FSscore

Focused synthesizability score augmented with human knowledge and intuition.

The Focused Synthesizability score (FSscore) learns to rank structures based on binary preferences using a graph attention network. First, a baseline is trained on an extensive set of reactant-product pairs; this baseline is subsequently fine-tuned on a chemical space of interest using binary preferences from human experts or carefully designed labels.

Installation

git clone https://github.com/schwallergroup/fsscore.git
cd fsscore
conda create -n fsscore python=3.10
conda activate fsscore
pip install -e .

This method was developed on CUDA-enabled GPUs (Linux) and further tested on macOS.

Data

Trained models can be downloaded from figshare (https://figshare.com/s/2db88a98f73e22af6868). Please download the models folder and place it in the root directory of the repository. The best graph-based model (GGLGGL) is already included in this folder.

The code for data processing is available in this repository (data_processing). Please adjust the paths in the bash script accordingly: the directory where you want to store the data (the default is recommended) and the paths to the additional data files. Subsequently, run the following command:

cd data_processing
./process_data.sh

This will create the necessary data files for pre-training.

Usage

Scoring molecules

The following code shows how to score molecules in Python. This example uses the graph-based implementation of the FSscore.

from fsscore.score import Scorer
from fsscore.models.ranknet import LitRankNet
from fsscore.utils.paths import PRETRAIN_MODEL_PATH

# 1) load pre-trained model or choose path to own model
model = LitRankNet.load_from_checkpoint(PRETRAIN_MODEL_PATH)

# 2) initialize scorer
scorer = Scorer(model=model)

# 3) predict scores given a list of SMILES
smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]  # example molecules
scores = scorer.score(smiles)
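
To keep the results, one simple option (a sketch, assuming scorer.score returns one value per input SMILES; file and column names are arbitrary) is to write them out with pandas:

import pandas as pd

# save each SMILES next to its predicted score (illustrative file/column names)
df = pd.DataFrame({"smiles": smiles, "fsscore": list(scores)})
df.to_csv("fsscores.csv", index=False)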

To score molecules using the command line, use the score.py script. The script takes a CSV file of SMILES as input and writes a CSV file with the scores (an example input file is shown after the argument list below). The script can be run as follows:

python score.py --model_path <path/to/model/file> --data_path <path/to/csv/file> --compound_cols <SMILES_column> --save_filepath <path/to/save/file> --featurizer graph_2D --batch_size 128

The following arguments are used:

  • --model_path: Path to the model file. If no model path is provided, the pre-trained graph-based model is used by default.
  • --data_path: Path to the CSV file with SMILES to score.
  • --compound_cols: Name of the column containing the SMILES.
  • --save_filepath: Path to save the CSV file with the scores.
  • --featurizer: Featurization method to use. The default is graph_2D.
  • --batch_size: Batch size to use for scoring. The default is 128.
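
For illustration, a minimal input file for score.py could look as follows (the header must match --compound_cols; the SMILES are arbitrary examples):

smiles
CCO
c1ccccc1O
CC(=O)Nc1ccc(O)cc1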

Fine-tuning

To fine-tune a model, use the finetuning.py script. The script takes a CSV file with paired SMILES and binary preference labels as input and outputs a fine-tuned model. The script can be run as follows:

python finetuning.py --data_path <path/to/finetuning/data> --featurizer graph_2D --compound_cols smiles_i smiles_j --rating_col target --save_dir <path/to/save/dir> --batch_size 4 --val_size 5 --n_epochs 20 --lr 0.0001 --datapoints 50 --track_improvement --track_pretest --earlystopping

The following arguments are used:

  • --model_path: Path to the model file. If no model path is provided, the pre-trained graph-based model (GGLGGL) is used by default.
  • --data_path: Path to the CSV file with the fine-tuning data in two columns of SMILES and a column of binary preference labels.
  • --featurizer: Featurization method to use. The default is graph_2D.
  • --compound_cols: Name of the columns containing the SMILES of the opposite pairs.
  • --rating_col: Name of the column containing the binary preference. 0 indicates that the molecule in the first column is harder to synthesize, while 1 indicates that the molecule in the second column is harder (see the example file after this list).
  • --save_dir: Directory to save the model in.
  • --batch_size: Batch size to use for training.
  • --val_size: Number (int) or fraction (float) of validation samples to use.
  • --n_epochs: Number of epochs to train for. Default is 20.
  • --lr: Learning rate to use. Default is 0.0001.
  • --datapoints: Number of data points to use for fine-tuning (leave out for production). Default is None, which results in the use of the whole dataset.
  • --track_improvement: Whether to track the improvement on the validation set. Defaults to True.
  • --track_pretest: Whether to track the performance on the pre-training test set. Defaults to True.
  • --earlystopping: Whether to use early stopping. Defaults to True.
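
As an illustration, a fine-tuning file matching the column names in the example command above could look like this (toy values, not data from the paper); target=1 means the molecule in smiles_j is harder to synthesize, target=0 means the molecule in smiles_i is harder:

smiles_i,smiles_j,target
CCO,CC(=O)Nc1ccc(O)cc1,1
O=C(OCC)c1ccccc1,CCCCCC,0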

Training a baseline model

To train a baseline model, use the train.py script. By default the script uses the pre-training data from our manuscript and outputs a trained model (see below for providing your own data). It can be run as follows:

python train.py --save_dir <path/to/save/dir> --featurizer graph_2D --n_epochs 250 --val_size 0.01 --batch_size 128 --arrange_layers GGLGGL --graph_encoder GNN --reload_interval 10

The following arguments are used (the same as described in the paper):

  • --save_dir: Directory to save the model in.
  • --featurizer: Featurization method to use. The default is graph_2D.
  • --n_epochs: Number of epochs to train for.
  • --val_size: Fraction (float) of validation samples to use. Set to 0 to not use a validation set.
  • --batch_size: Batch size to use for training.
  • --arrange_layers: Arrangement of the graph attention layers. The default is GGLGGL.
  • --graph_encoder: Graph encoder to use. The default is GNN.
  • --reload_interval: Interval at which to save the model.

This command uses the training data from our manuscript. To use your own data, provide the path via --data_path and specify the column names for the SMILES (--compound_cols) and the binary preference labels (--rating_col); see the example below.
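
For example, a possible invocation with custom data (the file and column names are placeholders, not files shipped with the repository) would be:

python train.py --data_path <path/to/own/pairs.csv> --compound_cols smiles_i smiles_j --rating_col target --save_dir <path/to/save/dir> --featurizer graph_2D --n_epochs 250 --val_size 0.01 --batch_size 128 --arrange_layers GGLGGL --graph_encoder GNN --reload_interval 10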

If you want to train a model with a fingerprint representation, do the following (an example command is shown after this list):

  • --featurizer: Select from morgan, morgan_count, morgan_chiral or morgan_chiral_count
  • --use_fp: Set to True
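
For example (the exact flag syntax is an assumption based on the description above; --use_fp may also work as a bare switch depending on the argument parser):

python train.py --save_dir <path/to/save/dir> --featurizer morgan_count --use_fp True --n_epochs 250 --val_size 0.01 --batch_size 128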

Pairing molecules

Fine-tuning a model requires paired and labeled molecules. Pairing is done based on clustering so that dissimilar molecules end up paired (a conceptual sketch is shown after the argument list below). The pairs are subsequently ranked by the prediction uncertainty of a pre-trained model, which allows prioritizing which pairs to label. Please pair and rank molecules as follows:

python pair_molecules.py --filepath <path/to/csvfile> --outdir <path/to/out/directory> --smi_col smiles --ckpt_path <path/to/pretrained/model> --batch_size 32 --featurizer graph_2D

The following arguments are used:

  • --filepath: Path to the CSV file with the SMILES to pair.
  • --outdir: Directory to save the paired data in.
  • --smi_col: Name of the column containing the SMILES.
  • --ckpt_path: Path to the pre-trained checkpoint. The default is the best model from the manuscript.
  • --featurizer: Featurization method to use. The default is graph_2D.
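
For intuition only, the sketch below illustrates the clustering idea; it is not the package's actual implementation and omits the uncertainty ranking. Molecules are clustered by Tanimoto similarity with RDKit's Butina algorithm and then paired across clusters so that dissimilar structures end up in the same pair.

# conceptual sketch, not the pair_molecules.py implementation
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCCCCC"]  # toy examples
fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048) for s in smiles]

# condensed distance matrix (1 - Tanimoto) in the order expected by Butina.ClusterData
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1 - s for s in sims)

clusters = Butina.ClusterData(dists, len(fps), 0.6, isDistData=True)

# pair cluster centroids across different clusters -> dissimilar pairs
pairs = [(smiles[clusters[a][0]], smiles[clusters[b][0]])
         for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
print(pairs)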

Subsequently, one proceeds to labeling the pairs. We recommend doing this with the Streamlit app described below.

App: FSscore

This repository contains a Streamlit app that can be run locally. To run the app, use the following command:

streamlit run streamlit_app/run.py

This will open a browser window with the app. The app should be run locally, as files are written and saved. For deployment, please refer to the Streamlit documentation. The following functionalities are currently offered:

  • Pair molecules and rank them by uncertainty based on the default pre-trained model
  • Label pairs of molecules
  • Fine-tune a model
  • Score molecules
