deploy: 7a20456
SGenheden committed Mar 14, 2024
Showing 55 changed files with 25,139 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: b6df8a037bac08be38d09481774b3045
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
Binary file added _images/sample_reaction.PNG
65 changes: 65 additions & 0 deletions _sources/index.rst.txt
rxnutils documentation
============================

rxnutils is a collection of routines for working with reactions, reaction templates and template extraction.

Introduction
------------

The package is currently divided into four sub-packages:

* `chem` - chemistry routines like template extraction or reaction cleaning
* `data` - routines for manipulating various reaction data sources
* `pipeline` - routines for building and executing simple pipelines for modifying and analyzing reactions
* `routes` - routines for handling synthesis routes

Auto-generated API documentation is available, as well as guides for common tasks. See the menu to the left.

Installation
------------

For most users it is as simple as

.. code-block::

    pip install reaction-utils

For developers, first clone the repository using Git.

Then execute the following commands in the root of the repository

.. code-block::

    conda env create -f env-dev.yml
    conda activate rxn-env
    poetry install

The ``rxnutils`` package is now installed in editable mode.

Lastly, make sure to install the pre-commit hooks that run on every commit

.. code-block::

    pre-commit install

Limitations
-----------

* Some old RDKit wheels on PyPI did not include the `Contrib` folder, preventing the use of the `rdkit_RxnRoleAssignment` action
* The pipeline for the Open Reaction Database requires some additional dependencies; see the documentation for this pipeline
* Using the data pipelines for the USPTO and the Open Reaction Database requires you to set up a second Python environment
* The RInChI capabilities are not supported on macOS


.. toctree::
    :hidden:

    templates
    uspto
    ord
    pipeline
    routes
    rxnutils
7 changes: 7 additions & 0 deletions _sources/modules.rst.txt
rxnutils
========

.. toctree::
    :maxdepth: 4

    rxnutils
106 changes: 106 additions & 0 deletions _sources/ord.rst.txt
Open reaction database
=======================

``rxnutils`` contains two pipelines that together import and prepare the reaction data from the `Open reaction database <https://open-reaction-database.org/>`_ so that it can be used for modelling.

It is a complete end-to-end pipeline that is designed to be transparent and reproducible.

Pre-requisites
--------------

The reason the pipeline is divided into two blocks is that the dependencies of the atom-mapper package (``rxnmapper``) are incompatible with
the dependencies of the ``rxnutils`` package. Therefore, to be able to use the full pipeline, you need to set up two Python environments.

1. Install ``rxnutils`` according to the instructions in the `README` file

2. Install the ``ord-schema`` package in the ``rxnutils`` environment

.. code-block::

    conda activate rxn-env
    python -m pip install ord-schema

3. Download/Clone the ``ord-data`` repository according to the instructions here: https://github.com/Open-Reaction-Database/ord-data

.. code-block::

    git clone https://github.com/open-reaction-database/ord-data.git .

Note down the path to the repository, as it needs to be given to the preparation pipeline.

4. Install ``rxnmapper`` according to the instructions in the repo: https://github.com/rxn4chemistry/rxnmapper


.. code-block::

    conda create -n rxnmapper python=3.6 -y
    conda activate rxnmapper
    conda install -c rdkit rdkit=2020.03.3.0
    python -m pip install rxnmapper

5. Install ``Metaflow`` and ``rxnutils`` in the new environment


.. code-block::

    python -m pip install metaflow
    python -m pip install --no-deps --ignore-requires-python .

Usage
-----

Create a folder for the ORD data and in that folder execute this command in the ``rxnutils`` environment


.. code-block::

    conda activate rxn-env
    python -m rxnutils.data.ord.preparation_pipeline run --nbatches 200 --max-workers 8 --max-num-splits 200 --ord-data ORD_DATA_REPO_PATH

and then, in the environment with ``rxnmapper``, run


.. code-block::

    conda activate rxnmapper
    python -m rxnutils.data.mapping_pipeline run --data-prefix ord --nbatches 200 --max-workers 8 --max-num-splits 200

The ``--max-workers`` flag should be set to the number of CPUs available.

On 8 CPUs and 1 GPU the pipeline takes a couple of hours.


Artifacts
---------

The pipelines create a number of `tab-separated` CSV files:

* `ord_data.csv` is the imported ORD data
* `ord_data_cleaned.csv` is the cleaned and filtered data
* `ord_data_mapped.csv` is the atom-mapped, modelling-ready data


The cleaning is done so that the reactions can be atom-mapped, and performs the following tasks:

* Ignore extended SMILES information in the SMILES strings
* Remove molecules not sanitizable by RDKit
* Remove reactions without any reactants or products
* Move all reagents to reactants
* Remove the existing atom-mapping
* Remove reactions with more than 200 atoms when summing reactants and products

(the last is a requirement of ``rxnmapper``, which was trained with a maximum token size roughly corresponding to 200 atoms)
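Two of these steps, moving the reagents to the reactants and removing the existing atom-mapping, can be sketched at the string level. This is only an illustration of the transformations; the actual pipeline actions in ``rxnutils`` are implemented differently:

.. code-block:: python

    import re

    def reagents_to_reactants(rxn_smiles: str) -> str:
        """Move the reagents (middle part of a reaction SMILES) over to the reactants."""
        reactants, reagents, products = rxn_smiles.split(">")
        if reagents:
            reactants = f"{reactants}.{reagents}"
        return f"{reactants}>>{products}"

    def remove_atom_mapping(rxn_smiles: str) -> str:
        """Strip atom-map numbers, e.g. [CH3:1] -> [CH3], with a simple regex."""
        return re.sub(r":\d+\]", "]", rxn_smiles)

    rxn = "[CH3:1][OH:2]>O=S(=O)(O)O>[CH3:1][O:2]C"
    print(remove_atom_mapping(reagents_to_reactants(rxn)))
    # [CH3][OH].O=S(=O)(O)O>>[CH3][O]C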


The ``ord_data_mapped.csv`` file will have the following columns:

* ID - unique ID from the original database
* Dataset - the name of the dataset from which this reaction is taken
* Date - the date of the experiment as given in the database
* ReactionSmiles - the original reaction SMILES
* Yield - the yield of the first product of the first outcome, if provided
* ReactionSmilesClean - the reaction SMILES after cleaning
* BadMolecules - molecules not sanitizable by RDKit
* ReactantSize - number of atoms in reactants
* ProductSize - number of atoms in products
* mapped_rxn - the mapped reaction SMILES
* confidence - the confidence of the mapping as provided by ``rxnmapper``
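As a small usage sketch, the mapped file can be read with nothing more than the Python standard library. The column subset and the 0.5 confidence threshold below are arbitrary choices for illustration:

.. code-block:: python

    import csv
    import io

    # A two-row stand-in for ord_data_mapped.csv (tab-separated, as described above)
    sample = (
        "ID\tmapped_rxn\tconfidence\n"
        "rxn-1\t[CH3:1]O>>[CH3:1]N\t0.95\n"
        "rxn-2\tCCO>>CC=O\t0.42\n"
    )

    reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
    confident = [row["ID"] for row in reader if float(row["confidence"]) > 0.5]
    print(confident)  # ['rxn-1']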
140 changes: 140 additions & 0 deletions _sources/pipeline.rst.txt
Pipeline
========

``rxnutils`` provides a simple pipeline for performing tasks on reaction SMILES and templates in a CSV file.


The pipeline works on `tab-separated` CSV files (TSV files).


Usage
-----

To exemplify the pipeline capabilities, we will have a look at the pipeline used to clean the USPTO data.

The input to the pipeline is a simple YAML-file that specifies each action to take. The actions will be executed
sequentially, one after the other and each action takes a number of input arguments.

This is the YAML-file used to clean the USPTO data:

.. code-block:: yaml

    trim_rxn_smiles:
        in_column: ReactionSmiles
        out_column: ReactionSmilesClean
    remove_unsanitizable:
        in_column: ReactionSmilesClean
        out_column: ReactionSmilesClean
    reagents2reactants:
        in_column: ReactionSmilesClean
        out_column: ReactionSmilesClean
    remove_atom_mapping:
        in_column: ReactionSmilesClean
        out_column: ReactionSmilesClean
    reactantsize:
        in_column: ReactionSmilesClean
    productsize:
        in_column: ReactionSmilesClean
    query_dataframe1:
        query: "ReactantSize>0"
    query_dataframe2:
        query: "ProductSize>0"
    query_dataframe3:
        query: "ReactantSize+ProductSize<200"

The first action is called ``trim_rxn_smiles`` and two arguments are given: ``in_column``, specifying which column to use as input, and ``out_column``, specifying which column
to use as output.

The following actions, ``remove_unsanitizable``, ``reagents2reactants``, ``remove_atom_mapping``, ``reactantsize`` and ``productsize``, work the same way, but might use other columns for input or output.

The last three actions are actually the same action, executed with different arguments; they therefore have to be suffixed with 1, 2 and 3.
The ``query_dataframe`` action takes a ``query`` argument and removes the rows not matching the query.
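The effect of the three ``query_dataframe`` actions can be illustrated with plain pandas, which provides the ``DataFrame.query`` method for this kind of filtering (a sketch with made-up data, not the pipeline's actual implementation):

.. code-block:: python

    import pandas as pd

    # Made-up reactions: only the first has both reactants and products
    data = pd.DataFrame(
        {
            "ReactionSmilesClean": ["CCO>>CC=O", "CC>>", ">>CCO"],
            "ReactantSize": [3, 2, 0],
            "ProductSize": [3, 0, 3],
        }
    )

    # The same filters as the three query_dataframe actions in the YAML file
    for query in ("ReactantSize>0", "ProductSize>0", "ReactantSize+ProductSize<200"):
        data = data.query(query)

    print(data["ReactionSmilesClean"].tolist())  # ['CCO>>CC=O']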

If we save this to ``clean_pipeline.yml`` and given that we have a tab-separated file with USPTO data called ``uspto_data.csv`` we can run the following command

.. code-block::

    python -m rxnutils.pipeline.runner --pipeline clean_pipeline.yml --data uspto_data.csv --output uspto_cleaned.csv

Alternatively, we can run it from a Python method like this

.. code-block::

    from rxnutils.pipeline.runner import main as validation_runner

    validation_runner(
        [
            "--pipeline",
            "clean_pipeline.yml",
            "--data",
            "uspto_data.csv",
            "--output",
            "uspto_cleaned.csv",
        ]
    )

Actions
-------

To find out what actions are available, you can type

.. code-block::

    python -m rxnutils.pipeline.runner --list

Development
-----------

New actions can easily be added to the pipeline framework. All of the actions are implemented in one of four modules:


* ``rxnutils.pipeline.actions.dataframe_mod`` - actions that modify the dataframe, e.g., removing rows or columns
* ``rxnutils.pipeline.actions.reaction_mod`` - actions that modify reaction SMILES
* ``rxnutils.pipeline.actions.dataframe_props`` - actions that compute properties from reaction SMILES
* ``rxnutils.pipeline.actions.templates`` - actions that process reaction templates


To exemplify, let's have a look at the ``productsize`` action


.. code-block:: python

    @action
    @dataclass
    class ProductSize:
        """Action for counting product size"""

        pretty_name: ClassVar[str] = "productsize"
        in_column: str
        out_column: str = "ProductSize"

        def __call__(self, data: pd.DataFrame) -> pd.DataFrame:
            smiles_col = global_apply(data, self._row_action, axis=1)
            return data.assign(**{self.out_column: smiles_col})

        def __str__(self) -> str:
            return f"{self.pretty_name} (number of heavy atoms in product)"

        def _row_action(self, row: pd.Series) -> str:
            _, _, products = row[self.in_column].split(">")
            products_mol = Chem.MolFromSmiles(products)
            if products_mol:
                product_atom_count = products_mol.GetNumHeavyAtoms()
            else:
                product_atom_count = 0
            return product_atom_count

The action is defined as a class ``ProductSize`` that has two class decorators.
The first, ``@action``, registers the action in a global action list, and the second, ``@dataclass``, is the dataclass decorator from the standard library.
The ``pretty_name`` class variable is used to identify the action in the pipeline; it is what you specify in the YAML file.
The other two, ``in_column`` and ``out_column``, are the arguments you can specify in the YAML file when executing the action; they can have default
values in case they don't need to be specified in the YAML file.

When the action is executed by the pipeline the ``__call__`` method is invoked with the current Pandas dataframe as the only argument. This method
should return the modified dataframe.

Lastly, it is nice to implement a ``__str__`` method which is used by the pipeline to print useful information about the action that is executed.
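To make this concrete, here is a sketch of what a hypothetical new action, ``reagentcount``, could look like. The ``@action`` decorator and ``global_apply`` are left out since they belong to the ``rxnutils`` framework; plain ``Series.apply`` stands in for the latter:

.. code-block:: python

    from dataclasses import dataclass
    from typing import ClassVar

    import pandas as pd

    @dataclass
    class ReagentCount:
        """Hypothetical action counting reagent molecules in a reaction SMILES"""

        pretty_name: ClassVar[str] = "reagentcount"
        in_column: str
        out_column: str = "ReagentCount"

        def __call__(self, data: pd.DataFrame) -> pd.DataFrame:
            counts = data[self.in_column].apply(self._row_action)
            return data.assign(**{self.out_column: counts})

        def __str__(self) -> str:
            return f"{self.pretty_name} (number of reagent molecules)"

        def _row_action(self, rxn_smiles: str) -> int:
            # The middle part of reactants>reagents>products holds the reagents
            _, reagents, _ = rxn_smiles.split(">")
            return len(reagents.split(".")) if reagents else 0

    data = pd.DataFrame({"ReactionSmiles": ["CCO>O.[Na+]>CC=O", "CC>>CCl"]})
    action = ReagentCount(in_column="ReactionSmiles")
    print(action(data)[action.out_column].tolist())  # [2, 0]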
68 changes: 68 additions & 0 deletions _sources/routes.rst.txt
Routes
======

``rxnutils`` contains routines for analysing synthesis routes. There are a number of readers that can be used to read routes from several
formats, and there are routines to score the different routes.

Reading
-------

The simplest supported route format is a text file, where each reaction is written as a reaction SMILES on its own line.
Routes are separated by a new line.

For instance:

.. code-block::

    CC(C)N.Clc1cccc(Nc2ccoc2)n1>>CC(C)Nc1cccc(Nc2ccoc2)n1
    Brc1ccoc1.Nc1cccc(Cl)n1>>Clc1cccc(Nc2ccoc2)n1
    Nc1cccc(NC(C)C)n1.Brc1ccoc1>>CC(C)Nc1cccc(Nc2ccoc2)n1
    CC(C)N.Nc1cccc(Cl)n1>>Nc1cccc(NC(C)C)n1

If this is saved to ``routes.txt``, the routes can be read into route objects with

.. code-block::

    from rxnutils.routes.readers import read_reaction_lists

    routes = read_reaction_lists("routes.txt")

If you have an environment with ``rxnmapper`` installed and the NextMove software ``namerxn`` in your PATH, then you can
add atom-mapping and reaction classes to these routes with

.. code-block::

    # This can be set on the command-line as well
    import os
    os.environ["RXNMAPPER_ENV_PATH"] = "/home/username/miniconda/envs/rxnmapper/"

    for route in routes:
        route.assign_atom_mapping(only_rxnmapper=True)

    routes[1].remap(routes[0])

The last line of code also makes sure that the second route shares atom-mapping with the first route.


Other readers are available:

* ``read_aizynthcli_dataframe`` - for reading routes from an aizynthcli output dataframe
* ``read_reactions_dataframe`` - for reading routes stored as reactions in a dataframe


For instance, to read routes stored as reactions in a dataframe, you can do something like the following.
Assume the dataframe is called ``data`` and has a column ``reaction_smiles`` that holds the reaction SMILES, that the individual routes
are identified by the ``target_smiles`` and ``route_id`` columns, and that a ``classification`` column holds the NextMove classification.

.. code-block::

    from rxnutils.routes.readers import read_reactions_dataframe

    routes = read_reactions_dataframe(
        data,
        "reaction_smiles",
        group_by=["target_smiles", "route_id"],
        metadata_columns=["classification"],
    )
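For completeness, a minimal stand-in for such a ``data`` dataframe could be built like this (the reactions are taken from the route example earlier; the route IDs and classification codes are made up for illustration):

.. code-block:: python

    import pandas as pd

    data = pd.DataFrame(
        {
            "reaction_smiles": [
                "CC(C)N.Nc1cccc(Cl)n1>>Nc1cccc(NC(C)C)n1",
                "Nc1cccc(NC(C)C)n1.Brc1ccoc1>>CC(C)Nc1cccc(Nc2ccoc2)n1",
            ],
            "target_smiles": ["CC(C)Nc1cccc(Nc2ccoc2)n1"] * 2,
            "route_id": [0, 0],
            "classification": ["1.6.2", "10.1.1"],
        }
    )

    # Each (target_smiles, route_id) group becomes one route when read
    print(data.groupby(["target_smiles", "route_id"]).size().tolist())  # [2]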