Merge pull request #16 from datamol-io/fix/canonical-safe

Fix bug in canonical due to for loop on set
datamol-io · Nov 8, 2023 · e1e2cbd
2 parents 3d6ccdb + f0d8607
commit e1e2cbd
Showing 3 changed files with 29 additions and 25 deletions.
diff --git a/README.md b/README.md
@@ -44,7 +44,7 @@
 
 ## Overview of SAFE
 
-SAFE _is the_ deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as contiguous sequence of connected fragment. SAFE strings are valid SMILES string, and thus are able to preserve the same amount of information. The intuitive representation of molecules as unordered sequence of connected fragments gretly simplify the following tasks often encoutered in molecular design:
+SAFE _is the_  deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings, and thus are able to preserve the same amount of information. The intuitive representation of molecules as an ordered sequence of connected fragments greatly simplifies the following tasks often encountered in molecular design:
 
 - _de novo_ design
 - superstructure generation
@@ -53,7 +53,7 @@ SAFE _is the_ deep learning molecular representation. It's an encoding leveragin
 - linker generation
 - scaffold morphing.
 
-The construction of a SAFE strings requires definition a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrate the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).
+The construction of a SAFE strings requires defining a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).
 
 </br>
 <div align="center">
@@ -76,25 +76,25 @@ mamba install -c conda-forge safe-mol
 
 ### Datasets and Models
 
-| Type    | Name                                                                           | Infos      | Size  | Comment              |
-| ------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
-| Model   | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt)              | 87M params | 350M  | Default model        |
-| Dataset | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt)     | 1.1B rows  | 250GB | Training dataset     |
-| Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows    | 20 kB | Benchmarking dataset |
+| Type                   | Name                                                                           | Infos      | Size  | Comment              |
+| ---------------------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
+| Model                  | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt)              | 87M params | 350M  | Default model        |
+| Training Dataset       | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt)     | 1.1B rows  | 250GB | Training dataset     |
+| Drug Benchmark Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows    | 20 kB | Benchmarking dataset |
 
 ## Usage
 
-Please refer to the [documentation](https://safe-docs.datamol.io/), which contains tutorials for getting started with `safe` and detailed descriptions of the functions provided.
+Please refer to the [documentation](https://safe-docs.datamol.io/), which contains tutorials for getting started with `safe` and detailed descriptions of the functions provided, as well as an example of how to get started with SAFE-GPT.
 
 ### API
 
 We summarize some key functions provided by the `safe` package below.
 
-| Function      | Description                                                                                                                                                                                             |
-| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `safe.encode` | Translates a SMILES string into its corresponding SAFE string.                                                                                                                                          |
-| `safe.decode` | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder just augment RDKit's `Chem.MolFromSmiles` with an optional correction argument to take care of missing hydrogens bonds. |
-| `safe.split`  | Tokenizes a SAFE string to build a generative model.                                                                                                                                                    |
+| Function      | Description                                                                                                                                                                                            |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `safe.encode` | Translates a SMILES string into its corresponding SAFE string.                                                                                                                                         |
+| `safe.decode` | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder just augment RDKit's `Chem.MolFromSmiles` with an optional correction argument to take care of missing hydrogen bonds. |
+| `safe.split`  | Tokenizes a SAFE string to build a generative model.                                                                                                                                                   |
 
 ### Examples
 
@@ -117,9 +117,9 @@ except safe.DecoderError:
 ibuprofen_tokens = list(safe.split(ibuprofen_sf))
 ```
 
-### Training a new models
+### Training/Finetuning a (new) model
 
-A command line interface is available to train a new model, please run `safe-train --help`
+A command line interface is available to train a new model, please run `safe-train --help`. You can also provide an existing checkpoint to continue training or finetune on you own dataset.
 
 For example:
 

diff --git a/docs/index.md b/docs/index.md
@@ -44,7 +44,7 @@
 
 ## Overview of SAFE
 
-SAFE _is the_ deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as contiguous sequence of connected fragment. SAFE strings are valid SMILES string, and thus are able to preserve the same amount of information. The intuitive representation of molecules as unordered sequence of connected fragments gretly simplify the following tasks often encoutered in molecular design:
+SAFE _is the_  deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings, and thus are able to preserve the same amount of information. The intuitive representation of molecules as an ordered sequence of connected fragments greatly simplifies the following tasks often encountered in molecular design:
 
 - _de novo_ design
 - superstructure generation
@@ -53,7 +53,7 @@ SAFE _is the_ deep learning molecular representation. It's an encoding leveragin
 - linker generation
 - scaffold morphing.
 
-The construction of a SAFE strings requires definition a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrate the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).
+The construction of a SAFE strings requires defining a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).
 
 </br>
 <div align="center">
@@ -76,15 +76,16 @@ mamba install -c conda-forge safe-mol
 
 ### Datasets and Models
 
-| Type    | Name                                                                           | Infos      | Size  | Comment              |
-| ------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
-| Model   | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt)              | 87M params | 350M  | Default model        |
-| Dataset | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt)     | 1.1B rows  | 250GB | Training dataset     |
-| Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows    | 20 kB | Benchmarking dataset |
+| Type                   | Name                                                                           | Infos      | Size  | Comment              |
+| ---------------------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
+| Model                  | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt)              | 87M params | 350M  | Default model        |
+| Training Dataset       | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt)     | 1.1B rows  | 250GB | Training dataset     |
+| Drug Benchmark Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows    | 20 kB | Benchmarking dataset |
 
 ## Usage
 
-Please refer to the [documentation](https://safe-docs.datamol.io/), which contains tutorials for getting started with `safe` and detailed descriptions of the functions provided.
+
+The tutorials in the [documentation](https://safe-docs.datamol.io/) can help you get started with `safe` and `SAFE-GPT`.
 
 ### API
 
@@ -117,9 +118,9 @@ except safe.DecoderError:
 ibuprofen_tokens = list(safe.split(ibuprofen_sf))
 ```
 
-### Training a new models
+### Training/Finetuning a (new) model
 
-A command line interface is available to train a new model, please run `safe-train --help`
+A command line interface is available to train a new model, please run `safe-train --help`. You can also provide an existing checkpoint to continue training or finetune on you own dataset.
 
 For example:
 
@@ -138,6 +139,7 @@ safe-train --config <path to config> \
     --max_steps 5
 ```
 
+
 ## References
 
 If you use this repository, please cite the following related [paper](https://arxiv.org/abs/2310.10773#):

diff --git a/safe/converter.py b/safe/converter.py
@@ -318,6 +318,8 @@ def encoder(
 
         scaffold_str = ".".join(frags_str)
         attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str))
+        if canonical:
+            attach_pos = sorted(attach_pos)
         starting_num = 1 if len(branch_numbers) == 0 else max(branch_numbers) + 1
         for attach in attach_pos:
             val = str(starting_num) if starting_num < 10 else f"%{starting_num}"
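
Review note: the `safe/converter.py` hunk above sorts the attachment points before numbering them whenever `canonical` is set. Iterating directly over a `set` of strings gives no ordering guarantee, and string hash randomization can change the order between Python processes, so the ring-closure numbers handed out in the loop could differ from run to run. The sketch below isolates that numbering step under stated assumptions: `number_attachments` and the returned mapping are hypothetical names for illustration, not part of the `safe` API, and only the regex, the `sorted` call, and the `val` expression are taken from the diff.

```python
import re

# Same pattern as in the diff: matches attachment points such as [1*] or [*:1].
ATTACH_PATTERN = re.compile(r"(\[\d+\*\]|\[[^:]*:\d+\])")


def number_attachments(scaffold_str, canonical=True, starting_num=1):
    """Assign a ring-closure label to each distinct attachment point.

    Hypothetical helper: it mirrors only the numbering loop shown in the
    diff and returns a plain dict instead of rewriting the string.
    """
    attach_pos = set(ATTACH_PATTERN.findall(scaffold_str))
    if canonical:
        # Sorting makes the iteration order independent of set internals.
        attach_pos = sorted(attach_pos)
    mapping = {}
    for attach in attach_pos:
        # Ring-closure labels of 10 and above need the "%" prefix in SMILES.
        val = str(starting_num) if starting_num < 10 else f"%{starting_num}"
        mapping[attach] = val
        starting_num += 1
    return mapping


# The same fragments written in a different order get the same numbering.
first = number_attachments("c1ccccc1[1*].CC[2*].[1*]N[2*]")
second = number_attachments("[2*]N[1*].CC[2*].c1ccccc1[1*]")
```

Note that `sorted` compares the attachment strings lexicographically, so `[10*]` sorts before `[2*]`. That is enough for determinism, which is what the fix needs, even though it is not a numeric ordering.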