Commit 75ac6b2 (0 parents)
Showing 55 changed files with 25,139 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: b6df8a037bac08be38d09481774b3045
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file.
@@ -0,0 +1,65 @@
rxnutils documentation
============================

rxnutils is a collection of routines for working with reactions, reaction templates and template extraction

Introduction
------------

The package is currently divided into four sub-packages:

* `chem` - chemistry routines like template extraction or reaction cleaning
* `data` - routines for manipulating various reaction data sources
* `pipeline` - routines for building and executing simple pipelines for modifying and analyzing reactions
* `routes` - routines for handling synthesis routes

Auto-generated API documentation is available, as well as guides for common tasks. See the menu to the left.

Installation
------------

For most users it is as simple as

.. code-block::

    pip install reaction-utils

`For developers`, first clone the repository using Git.

Then execute the following commands in the root of the repository

.. code-block::

    conda env create -f env-dev.yml
    conda activate rxn-env
    poetry install

The `rxnutils` package is now installed in editable mode.

Lastly, make sure to install the pre-commit hooks that are run on every commit

.. code-block::

    pre-commit install

Limitations
-----------

* Some old RDKit wheels on PyPI did not include the `Contrib` folder, preventing the usage of the `rdkit_RxnRoleAssignment` action
* The pipeline for the Open reaction database requires some additional dependencies, see the documentation for this pipeline
* Using the data pipelines for the USPTO and Open reaction database requires you to set up a second Python environment
* The RInChI capabilities are not supported on macOS


.. toctree::
    :hidden:

    templates
    uspto
    ord
    pipeline
    routes
    rxnutils

@@ -0,0 +1,7 @@
rxnutils
========

.. toctree::
   :maxdepth: 4

   rxnutils

@@ -0,0 +1,106 @@
Open reaction database
=======================

``rxnutils`` contains two pipelines that together import and prepare the reaction data from the `Open reaction database <https://open-reaction-database.org/>`_ so that it can be used for modelling.

Together they form a complete end-to-end pipeline that is designed to be transparent and reproducible.

Pre-requisites
--------------

The pipeline is divided into two blocks because the dependencies of the atom-mapper package (``rxnmapper``) are incompatible with
the dependencies of the ``rxnutils`` package. Therefore, to be able to use the full pipeline, you need to set up two Python environments.

1. Install ``rxnutils`` according to the instructions in the `README`-file

2. Install the ``ord-schema`` package in the ``rxnutils`` environment

   .. code-block::

       conda activate rxn-env
       python -m pip install ord-schema

3. Download/Clone the ``ord-data`` repository according to the instructions here: https://github.com/Open-Reaction-Database/ord-data

   .. code-block::

       git clone https://github.com/open-reaction-database/ord-data.git .

   Note down the path to the repository as this needs to be given to the preparation pipeline

4. Install ``rxnmapper`` according to the instructions in the repo: https://github.com/rxn4chemistry/rxnmapper

   .. code-block::

       conda create -n rxnmapper python=3.6 -y
       conda activate rxnmapper
       conda install -c rdkit rdkit=2020.03.3.0
       python -m pip install rxnmapper

5. Install ``Metaflow`` and ``rxnutils`` in the new environment

   .. code-block::

       python -m pip install metaflow
       python -m pip install --no-deps --ignore-requires-python .

Usage
-----

Create a folder for the ORD data and in that folder execute this command in the ``rxnutils`` environment

.. code-block::

    conda activate rxn-env
    python -m rxnutils.data.ord.preparation_pipeline run --nbatches 200 --max-workers 8 --max-num-splits 200 --ord-data ORD_DATA_REPO_PATH

and then in the environment with ``rxnmapper`` run

.. code-block::

    conda activate rxnmapper
    python -m rxnutils.data.mapping_pipeline run --data-prefix ord --nbatches 200 --max-workers 8 --max-num-splits 200

The ``--max-workers`` flag should be set to the number of CPUs available.

On 8 CPUs and 1 GPU the pipeline takes a couple of hours.

Artifacts
---------

The pipelines create a number of `tab-separated` CSV files:

* `ord_data.csv` is the imported ORD data
* `ord_data_cleaned.csv` is the cleaned and filtered data
* `ord_data_mapped.csv` is the atom-mapped, modelling-ready data

The cleaning is done so that the reactions can be atom-mapped, and it performs the following tasks:

* Ignore extended SMILES information in the SMILES strings
* Remove molecules not sanitizable by RDKit
* Remove reactions without any reactants or products
* Move all reagents to reactants
* Remove the existing atom-mapping
* Remove reactions with more than 200 atoms when summing reactants and products

(the last is a requisite for ``rxnmapper`` that was trained on a maximum token size roughly corresponding to 200 atoms)

The ``ord_data_mapped.csv`` file will have the following columns:

* ID - unique ID from the original database
* Dataset - the name of the dataset from which this reaction is taken
* Date - the date of the experiment as given in the database
* ReactionSmiles - the original reaction SMILES
* Yield - the yield of the first product of the first outcome, if provided
* ReactionSmilesClean - the reaction SMILES after cleaning
* BadMolecules - molecules not sanitizable by RDKit
* ReactantSize - number of atoms in reactants
* ProductSize - number of atoms in products
* mapped_rxn - the mapped reaction SMILES
* confidence - the confidence of the mapping as provided by ``rxnmapper``
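
Because the files are tab-separated, they need an explicit delimiter when read. The sketch below uses only the standard library and a small in-memory sample with a subset of the columns listed above. The sample rows, the helper ``read_confident_rows`` and the 0.5 threshold are illustrative assumptions, not part of ``rxnutils``.

```python
import csv
import io

# Two made-up rows with a subset of the documented columns, tab-separated
# like the pipeline output. All values are invented for illustration only.
SAMPLE = (
    "ID\tReactionSmilesClean\tmapped_rxn\tconfidence\n"
    "rxn-1\tCCO.CC(=O)O>>CC(=O)OCC\t(mapped SMILES)\t0.95\n"
    "rxn-2\tCCN.CC(=O)Cl>>CC(=O)NCC\t(mapped SMILES)\t0.20\n"
)


def read_confident_rows(handle, min_confidence):
    """Return rows whose mapping confidence is at least min_confidence."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["confidence"]) >= min_confidence]


# The threshold is arbitrary; pick one that suits the modelling task.
rows = read_confident_rows(io.StringIO(SAMPLE), min_confidence=0.5)
print([row["ID"] for row in rows])
```

For a real run, pass an open file handle for ``ord_data_mapped.csv`` instead of the in-memory sample.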
@@ -0,0 +1,140 @@
Pipeline
========

``rxnutils`` provides a simple pipeline to perform simple tasks on reaction SMILES and templates in a CSV-file.

The pipeline works on `tab-separated` CSV files (TSV files)

Usage
-----

To exemplify the pipeline capabilities, we will have a look at the pipeline used to clean the USPTO data.

The input to the pipeline is a simple YAML-file that specifies each action to take. The actions will be executed
sequentially, one after the other, and each action takes a number of input arguments.

This is the YAML-file used to clean the USPTO data:

.. code-block:: yaml

    trim_rxn_smiles:
        in_column: ReactionSmiles
        out_column: ReactionSmilesClean
    remove_unsanitizable:
        in_column: ReactionSmilesClean
        out_column: ReactionSmilesClean
    reagents2reactants:
        in_column: ReactionSmilesClean
        out_column: ReactionSmilesClean
    remove_atom_mapping:
        in_column: ReactionSmilesClean
        out_column: ReactionSmilesClean
    reactantsize:
        in_column: ReactionSmilesClean
    productsize:
        in_column: ReactionSmilesClean
    query_dataframe1:
        query: "ReactantSize>0"
    query_dataframe2:
        query: "ProductSize>0"
    query_dataframe3:
        query: "ReactantSize+ProductSize<200"

The first action is called ``trim_rxn_smiles`` and two arguments are given: ``in_column`` specifying which column to use as input and ``out_column`` specifying which column
to use as output.

The following actions ``remove_unsanitizable``, ``reagents2reactants``, ``remove_atom_mapping``, ``reactantsize``, ``productsize`` work the same way, but might use other columns for output.

The last three actions are actually the same action but executed with different arguments. They therefore have to be postfixed with 1, 2 and 3.
The action ``query_dataframe`` takes a ``query`` argument and removes the rows not matching the query.
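
The effect of the three ``query_dataframe`` steps can be illustrated on a few rows in plain Python. This is only a sketch: the rows are made up, each query is written as a Python predicate instead of a query string, and stripping a trailing number to recover the action name is an assumption about how the postfix could be resolved, not a description of the ``rxnutils`` implementation.

```python
import re

# Hypothetical rows; in the real pipeline these are rows of a pandas dataframe.
rows = [
    {"ReactantSize": 12, "ProductSize": 10},
    {"ReactantSize": 0, "ProductSize": 8},
    {"ReactantSize": 150, "ProductSize": 90},
]

# The three postfixed steps from the YAML file, with each query written
# as a plain Python predicate instead of a query string.
steps = {
    "query_dataframe1": lambda r: r["ReactantSize"] > 0,
    "query_dataframe2": lambda r: r["ProductSize"] > 0,
    "query_dataframe3": lambda r: r["ReactantSize"] + r["ProductSize"] < 200,
}


def base_action_name(step_name):
    """Strip a trailing numeric postfix, e.g. 'query_dataframe2' -> 'query_dataframe'."""
    return re.sub(r"\d+$", "", step_name)


# Apply the steps one after the other, keeping only the matching rows
for step_name, keep in steps.items():
    rows = [row for row in rows if keep(row)]
    print(f"{step_name} ({base_action_name(step_name)}): {len(rows)} rows left")
```

The postfix is needed because the keys of a YAML mapping must be unique, so the same action cannot appear twice under the same name.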

If we save this to ``clean_pipeline.yml`` and given that we have a tab-separated file with USPTO data called ``uspto_data.csv`` we can run the following command

.. code-block::

    python -m rxnutils.pipeline.runner --pipeline clean_pipeline.yml --data uspto_data.csv --output uspto_cleaned.csv

or we can alternatively run it from Python like this

.. code-block::

    from rxnutils.pipeline.runner import main as validation_runner

    validation_runner(
        [
            "--pipeline",
            "clean_pipeline.yml",
            "--data",
            "uspto_data.csv",
            "--output",
            "uspto_cleaned.csv",
        ]
    )

Actions
-------

To find out what actions are available, you can type

.. code-block::

    python -m rxnutils.pipeline.runner --list

Development
-----------

New actions can easily be added to the pipeline framework. All of the actions are implemented in one of four modules

* ``rxnutils.pipeline.actions.dataframe_mod`` - actions that modify the dataframe, e.g., removing rows or columns
* ``rxnutils.pipeline.actions.reaction_mod`` - actions that modify reaction SMILES
* ``rxnutils.pipeline.actions.dataframe_props`` - actions that compute properties from reaction SMILES
* ``rxnutils.pipeline.actions.templates`` - actions that process reaction templates

To exemplify, let's have a look at the ``productsize`` action

.. code-block:: python

    @action
    @dataclass
    class ProductSize:
        """Action for counting product size"""

        pretty_name: ClassVar[str] = "productsize"
        in_column: str
        out_column: str = "ProductSize"

        def __call__(self, data: pd.DataFrame) -> pd.DataFrame:
            smiles_col = global_apply(data, self._row_action, axis=1)
            return data.assign(**{self.out_column: smiles_col})

        def __str__(self) -> str:
            return f"{self.pretty_name} (number of heavy atoms in product)"

        def _row_action(self, row: pd.Series) -> int:
            _, _, products = row[self.in_column].split(">")
            products_mol = Chem.MolFromSmiles(products)
            if products_mol:
                product_atom_count = products_mol.GetNumHeavyAtoms()
            else:
                product_atom_count = 0
            return product_atom_count

The action is defined as a class ``ProductSize`` that has two class decorators.
The first, ``@action``, registers the action in a global action list, and the second, ``@dataclass``, is the dataclass decorator from the standard library.
The ``pretty_name`` class variable is used to identify the action in the pipeline; it is the name you specify in the YAML-file.
The other two, ``in_column`` and ``out_column``, are the arguments you can specify in the YAML file for executing the action. They can have default
values in case they don't need to be specified in the YAML file.

When the action is executed by the pipeline, the ``__call__`` method is invoked with the current Pandas dataframe as the only argument. This method
should return the modified dataframe.

Lastly, it is nice to implement a ``__str__`` method, which is used by the pipeline to print useful information about the action that is executed.
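
The registration pattern described above can be sketched without ``rxnutils``, pandas or RDKit. Everything in this example is a stand-in: the ``ACTIONS`` dictionary and the ``action`` decorator are toy versions written for illustration, ``ProductCount`` is a hypothetical action, and rows are plain dictionaries rather than a dataframe.

```python
from dataclasses import dataclass
from typing import ClassVar, Dict, List

# Toy registry standing in for the global action list; not the rxnutils one.
ACTIONS: Dict[str, type] = {}


def action(cls):
    """Register a class under its pretty_name and return it unchanged."""
    ACTIONS[cls.pretty_name] = cls
    return cls


@action
@dataclass
class ProductCount:
    """Hypothetical action counting '.'-separated product molecules."""

    pretty_name: ClassVar[str] = "productcount"
    in_column: str
    out_column: str = "ProductCount"

    def __call__(self, data: List[dict]) -> List[dict]:
        # Return new rows rather than modifying the input in place
        return [{**row, self.out_column: self._row_action(row)} for row in data]

    def __str__(self) -> str:
        return f"{self.pretty_name} (number of product molecules)"

    def _row_action(self, row: dict) -> int:
        # A reaction SMILES has the form reactants>reagents>products
        _, _, products = row[self.in_column].split(">")
        return len(products.split(".")) if products else 0


# What a YAML entry "productcount" with an "in_column" argument could turn into
step = ACTIONS["productcount"](in_column="ReactionSmilesClean")
result = step([{"ReactionSmilesClean": "CCO.CC(=O)O>>CC(=O)OCC.O"}])
print(step, result[0]["ProductCount"])
```

Looking the class up by ``pretty_name`` and passing the YAML arguments as keyword arguments is what makes the dataclass fields double as the action's configuration.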
@@ -0,0 +1,68 @@
Routes
======

``rxnutils`` contains routines to analyse synthesis routes. There are a number of readers that can be used to read routes from a number of
formats, and there are routines to score the different routes.

Reading
-------

The simplest route format supported is a text file, where each reaction is written as a reaction SMILES on a line of its own.
Routes are separated by an empty line.

For instance:

.. code-block::

    CC(C)N.Clc1cccc(Nc2ccoc2)n1>>CC(C)Nc1cccc(Nc2ccoc2)n1
    Brc1ccoc1.Nc1cccc(Cl)n1>>Clc1cccc(Nc2ccoc2)n1

    Nc1cccc(NC(C)C)n1.Brc1ccoc1>>CC(C)Nc1cccc(Nc2ccoc2)n1
    CC(C)N.Nc1cccc(Cl)n1>>Nc1cccc(NC(C)C)n1

If this is saved to ``routes.txt``, these can be read into route objects with

.. code-block::

    from rxnutils.routes.readers import read_reaction_lists

    routes = read_reaction_lists("routes.txt")

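
To make the text format concrete, the sketch below splits such a file into routes using only the standard library. It assumes that an empty line separates two routes, and the helper ``split_routes`` is hypothetical; it illustrates the file layout and is not how ``read_reaction_lists`` is implemented.

```python
from typing import List

# Same reactions as in the example file, with the two routes separated by an
# empty line (an assumption about the file layout).
TEXT = """\
CC(C)N.Clc1cccc(Nc2ccoc2)n1>>CC(C)Nc1cccc(Nc2ccoc2)n1
Brc1ccoc1.Nc1cccc(Cl)n1>>Clc1cccc(Nc2ccoc2)n1

Nc1cccc(NC(C)C)n1.Brc1ccoc1>>CC(C)Nc1cccc(Nc2ccoc2)n1
CC(C)N.Nc1cccc(Cl)n1>>Nc1cccc(NC(C)C)n1
"""


def split_routes(text: str) -> List[List[str]]:
    """Group reaction SMILES lines into routes, starting a new route at each empty line."""
    routes: List[List[str]] = [[]]
    for line in text.splitlines():
        line = line.strip()
        if line:
            routes[-1].append(line)
        elif routes[-1]:
            routes.append([])
    return [route for route in routes if route]


for number, route in enumerate(split_routes(TEXT), start=1):
    # The product of a reaction SMILES is the part after the last ">"
    products = [reaction.split(">")[-1] for reaction in route]
    print(f"route {number}: {len(route)} reactions, first product {products[0]}")
```

Both routes in the example end in the same target molecule, which is why the first reaction of each route has the same product.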
If you have an environment with ``rxnmapper`` installed and the NextMove software ``namerxn`` in your PATH then you can
add atom-mapping and reaction classes to these routes with

.. code-block::

    # This can be set on the command-line as well
    import os
    os.environ["RXNMAPPER_ENV_PATH"] = "/home/username/miniconda/envs/rxnmapper/"

    for route in routes:
        route.assign_atom_mapping(only_rxnmapper=True)
    routes[1].remap(routes[0])

The last line of code also makes sure that the second route shares its mapping with the first route.

Other readers are available

* ``read_aizynthcli_dataframe`` - for reading routes from aizynthcli output dataframe
* ``read_reactions_dataframe`` - for reading routes stored as reactions in a dataframe

For instance, to read routes from a dataframe with reactions, you can do something like the following.
The dataframe has a column ``reaction_smiles`` that holds the reaction SMILES, and the individual routes
are identified by a ``target_smiles`` and a ``route_id`` column. The dataframe also has a column ``classification``,
holding the NextMove classification. The dataframe is called ``data``.

.. code-block::

    from rxnutils.routes.readers import read_reactions_dataframe

    routes = read_reactions_dataframe(
        data,
        "reaction_smiles",
        group_by=["target_smiles", "route_id"],
        metadata_columns=["classification"]
    )

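
The grouping idea behind ``group_by`` can be shown on plain dictionaries: rows that share the same ``target_smiles`` and ``route_id`` belong to the same route. This sketch only illustrates the grouping and is not how ``read_reactions_dataframe`` works internally. The rows, the placeholder SMILES and the classification labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows standing in for the dataframe ``data``; the SMILES and
# classification values are placeholders, not real chemistry.
data = [
    {"target_smiles": "T1", "route_id": 0, "reaction_smiles": "A.B>>T1", "classification": "class-a"},
    {"target_smiles": "T1", "route_id": 0, "reaction_smiles": "C.D>>A", "classification": "class-b"},
    {"target_smiles": "T1", "route_id": 1, "reaction_smiles": "E.F>>T1", "classification": "class-c"},
]

# Collect the reactions of each (target_smiles, route_id) pair into one group
groups = defaultdict(list)
for row in data:
    key = (row["target_smiles"], row["route_id"])
    groups[key].append((row["reaction_smiles"], row["classification"]))

for key, reactions in groups.items():
    print(key, len(reactions))
```

Each group holds the reactions of one route together with the metadata that travels with them.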