Merge pull request #16 from datamol-io/fix/canonical-safe
Fix bug in canonical due to for loop on set
hadim authored Nov 8, 2023
2 parents 3d6ccdb + f0d8607 commit e1e2cbd
Showing 3 changed files with 29 additions and 25 deletions.
30 changes: 15 additions & 15 deletions README.md
@@ -44,7 +44,7 @@

## Overview of SAFE

-SAFE _is the_ deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as contiguous sequence of connected fragment. SAFE strings are valid SMILES string, and thus are able to preserve the same amount of information. The intuitive representation of molecules as unordered sequence of connected fragments gretly simplify the following tasks often encoutered in molecular design:
+SAFE _is the_ deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings, and thus are able to preserve the same amount of information. The intuitive representation of molecules as an ordered sequence of connected fragments greatly simplifies the following tasks often encountered in molecular design:

- _de novo_ design
- superstructure generation
@@ -53,7 +53,7 @@ SAFE _is the_ deep learning molecular representation. It's an encoding leveragin
- linker generation
- scaffold morphing.

-The construction of a SAFE strings requires definition a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrate the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).
+The construction of a SAFE strings requires defining a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).

</br>
<div align="center">
@@ -76,25 +76,25 @@ mamba install -c conda-forge safe-mol

### Datasets and Models

-| Type | Name | Infos | Size | Comment |
-| ------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
-| Model | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt) | 87M params | 350M | Default model |
-| Dataset | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt) | 1.1B rows | 250GB | Training dataset |
-| Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows | 20 kB | Benchmarking dataset |
+| Type | Name | Infos | Size | Comment |
+| ---------------------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
+| Model | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt) | 87M params | 350M | Default model |
+| Training Dataset | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt) | 1.1B rows | 250GB | Training dataset |
+| Drug Benchmark Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows | 20 kB | Benchmarking dataset |

## Usage

-Please refer to the [documentation](https://safe-docs.datamol.io/), which contains tutorials for getting started with `safe` and detailed descriptions of the functions provided.
+Please refer to the [documentation](https://safe-docs.datamol.io/), which contains tutorials for getting started with `safe` and detailed descriptions of the functions provided, as well as an example of how to get started with SAFE-GPT.

### API

We summarize some key functions provided by the `safe` package below.

-| Function | Description |
-| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `safe.encode` | Translates a SMILES string into its corresponding SAFE string. |
-| `safe.decode` | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder just augment RDKit's `Chem.MolFromSmiles` with an optional correction argument to take care of missing hydrogens bonds. |
-| `safe.split` | Tokenizes a SAFE string to build a generative model. |
+| Function | Description |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `safe.encode` | Translates a SMILES string into its corresponding SAFE string. |
+| `safe.decode` | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder just augment RDKit's `Chem.MolFromSmiles` with an optional correction argument to take care of missing hydrogen bonds. |
+| `safe.split` | Tokenizes a SAFE string to build a generative model. |

### Examples

@@ -117,9 +117,9 @@ except safe.DecoderError:
ibuprofen_tokens = list(safe.split(ibuprofen_sf))
```

-### Training a new models
+### Training/Finetuning a (new) model

-A command line interface is available to train a new model, please run `safe-train --help`
+A command line interface is available to train a new model, please run `safe-train --help`. You can also provide an existing checkpoint to continue training or finetune on you own dataset.

For example:

22 changes: 12 additions & 10 deletions docs/index.md
@@ -44,7 +44,7 @@

## Overview of SAFE

-SAFE _is the_ deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as contiguous sequence of connected fragment. SAFE strings are valid SMILES string, and thus are able to preserve the same amount of information. The intuitive representation of molecules as unordered sequence of connected fragments gretly simplify the following tasks often encoutered in molecular design:
+SAFE _is the_ deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings, and thus are able to preserve the same amount of information. The intuitive representation of molecules as an ordered sequence of connected fragments greatly simplifies the following tasks often encountered in molecular design:

- _de novo_ design
- superstructure generation
@@ -53,7 +53,7 @@ SAFE _is the_ deep learning molecular representation. It's an encoding leveragin
- linker generation
- scaffold morphing.

-The construction of a SAFE strings requires definition a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrate the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).
+The construction of a SAFE strings requires defining a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).

</br>
<div align="center">
@@ -76,15 +76,16 @@ mamba install -c conda-forge safe-mol

### Datasets and Models

-| Type | Name | Infos | Size | Comment |
-| ------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
-| Model | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt) | 87M params | 350M | Default model |
-| Dataset | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt) | 1.1B rows | 250GB | Training dataset |
-| Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows | 20 kB | Benchmarking dataset |
+| Type | Name | Infos | Size | Comment |
+| ---------------------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
+| Model | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt) | 87M params | 350M | Default model |
+| Training Dataset | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt) | 1.1B rows | 250GB | Training dataset |
+| Drug Benchmark Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows | 20 kB | Benchmarking dataset |

## Usage

-Please refer to the [documentation](https://safe-docs.datamol.io/), which contains tutorials for getting started with `safe` and detailed descriptions of the functions provided.
+
+The tutorials in the [documentation](https://safe-docs.datamol.io/) can help you get started with `safe` and `SAFE-GPT`.

### API

@@ -117,9 +118,9 @@ except safe.DecoderError:
ibuprofen_tokens = list(safe.split(ibuprofen_sf))
```

-### Training a new models
+### Training/Finetuning a (new) model

-A command line interface is available to train a new model, please run `safe-train --help`
+A command line interface is available to train a new model, please run `safe-train --help`. You can also provide an existing checkpoint to continue training or finetune on you own dataset.

For example:

@@ -138,6 +139,7 @@ safe-train --config <path to config> \
--max_steps 5
```


## References

If you use this repository, please cite the following related [paper](https://arxiv.org/abs/2310.10773#):
2 changes: 2 additions & 0 deletions safe/converter.py
@@ -318,6 +318,8 @@ def encoder(

     scaffold_str = ".".join(frags_str)
     attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str))
+    if canonical:
+        attach_pos = sorted(attach_pos)
     starting_num = 1 if len(branch_numbers) == 0 else max(branch_numbers) + 1
     for attach in attach_pos:
         val = str(starting_num) if starting_num < 10 else f"%{starting_num}"
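
The effect of this hunk can be reproduced in isolation. The sketch below is not the library's `encoder`: `number_attachments` is a hypothetical helper that reuses only the regular expression and the numbering rule visible in the hunk. It assumes the bug came from iterating over a `set` of strings, whose iteration order is not guaranteed and can differ between interpreter runs when string hash randomization is active. Sorting first makes the attachment-to-number assignment deterministic.

```python
import re

# Pattern copied from the hunk: matches attachment points such as "[1*]" or "[*:1]".
ATTACH_PATTERN = re.compile(r"(\[\d+\*\]|\[[^:]*:\d+\])")


def number_attachments(scaffold_str, canonical=True, starting_num=1):
    """Hypothetical helper: map each distinct attachment point to a ring-closure label.

    With canonical=False the labels depend on set iteration order, which may
    change from one interpreter run to the next. With canonical=True the
    attachment points are sorted first, so the mapping is reproducible.
    """
    attach_pos = set(ATTACH_PATTERN.findall(scaffold_str))
    if canonical:
        # sorted() returns a list in lexicographic order, fixing the iteration order.
        attach_pos = sorted(attach_pos)
    mapping = {}
    for attach in attach_pos:
        # Same rule as in the hunk: single digits as-is, larger numbers as "%NN".
        mapping[attach] = str(starting_num) if starting_num < 10 else f"%{starting_num}"
        starting_num += 1
    return mapping


if __name__ == "__main__":
    print(number_attachments("c1ccccc1[2*].[2*]CC[1*].[1*]O"))
```

Note that `sorted` on strings is lexicographic, so `"[10*]"` would sort before `"[2*]"`. The order is deterministic, which is what a canonical output needs, but it is not numeric.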