Fix bug in canonical due to for loop on set #16

Merged: 1 commit, Nov 8, 2023
30 changes: 15 additions & 15 deletions README.md
@@ -44,7 +44,7 @@

## Overview of SAFE

SAFE _is the_ deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as contiguous sequence of connected fragment. SAFE strings are valid SMILES string, and thus are able to preserve the same amount of information. The intuitive representation of molecules as unordered sequence of connected fragments gretly simplify the following tasks often encoutered in molecular design:
SAFE _is the_ deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings, and thus are able to preserve the same amount of information. The intuitive representation of molecules as an ordered sequence of connected fragments greatly simplifies the following tasks often encountered in molecular design:

- _de novo_ design
- superstructure generation
@@ -53,7 +53,7 @@ SAFE _is the_ deep learning molecular representation. It's an encoding leveragin
- linker generation
- scaffold morphing.

The construction of a SAFE strings requires definition a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrate the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).
The construction of a SAFE strings requires defining a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).

</br>
<div align="center">
@@ -76,25 +76,25 @@ mamba install -c conda-forge safe-mol

### Datasets and Models

| Type | Name | Infos | Size | Comment |
| ------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
| Model | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt) | 87M params | 350M | Default model |
| Dataset | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt) | 1.1B rows | 250GB | Training dataset |
| Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows | 20 kB | Benchmarking dataset |
| Type | Name | Infos | Size | Comment |
| ---------------------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
| Model | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt) | 87M params | 350M | Default model |
| Training Dataset | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt) | 1.1B rows | 250GB | Training dataset |
| Drug Benchmark Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows | 20 kB | Benchmarking dataset |

## Usage

Please refer to the [documentation](https://safe-docs.datamol.io/), which contains tutorials for getting started with `safe` and detailed descriptions of the functions provided.
Please refer to the [documentation](https://safe-docs.datamol.io/), which contains tutorials for getting started with `safe` and detailed descriptions of the functions provided, as well as an example of how to get started with SAFE-GPT.

### API

We summarize some key functions provided by the `safe` package below.

| Function | Description |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `safe.encode` | Translates a SMILES string into its corresponding SAFE string. |
| `safe.decode` | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder just augment RDKit's `Chem.MolFromSmiles` with an optional correction argument to take care of missing hydrogens bonds. |
| `safe.split` | Tokenizes a SAFE string to build a generative model. |
| Function | Description |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `safe.encode` | Translates a SMILES string into its corresponding SAFE string. |
| `safe.decode` | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder just augment RDKit's `Chem.MolFromSmiles` with an optional correction argument to take care of missing hydrogen bonds. |
| `safe.split` | Tokenizes a SAFE string to build a generative model. |

### Examples

@@ -117,9 +117,9 @@ except safe.DecoderError:
ibuprofen_tokens = list(safe.split(ibuprofen_sf))
```

### Training a new models
### Training/Finetuning a (new) model

A command line interface is available to train a new model, please run `safe-train --help`
A command line interface is available to train a new model, please run `safe-train --help`. You can also provide an existing checkpoint to continue training or finetune on you own dataset.

For example:

22 changes: 12 additions & 10 deletions docs/index.md
@@ -44,7 +44,7 @@

## Overview of SAFE

SAFE _is the_ deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as contiguous sequence of connected fragment. SAFE strings are valid SMILES string, and thus are able to preserve the same amount of information. The intuitive representation of molecules as unordered sequence of connected fragments gretly simplify the following tasks often encoutered in molecular design:
SAFE _is the_ deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings, and thus are able to preserve the same amount of information. The intuitive representation of molecules as an ordered sequence of connected fragments greatly simplifies the following tasks often encountered in molecular design:

- _de novo_ design
- superstructure generation
@@ -53,7 +53,7 @@ SAFE _is the_ deep learning molecular representation. It's an encoding leveragin
- linker generation
- scaffold morphing.

The construction of a SAFE strings requires definition a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrate the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).
The construction of a SAFE strings requires defining a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).

</br>
<div align="center">
@@ -76,15 +76,16 @@ mamba install -c conda-forge safe-mol

### Datasets and Models

| Type | Name | Infos | Size | Comment |
| ------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
| Model | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt) | 87M params | 350M | Default model |
| Dataset | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt) | 1.1B rows | 250GB | Training dataset |
| Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows | 20 kB | Benchmarking dataset |
| Type | Name | Infos | Size | Comment |
| ---------------------- | ------------------------------------------------------------------------------ | ---------- | ----- | -------------------- |
| Model | [datamol-io/safe-gpt](https://huggingface.co/datamol-io/safe-gpt) | 87M params | 350M | Default model |
| Training Dataset | [datamol-io/safe-gpt](https://huggingface.co/datasets/datamol-io/safe-gpt) | 1.1B rows | 250GB | Training dataset |
| Drug Benchmark Dataset | [datamol-io/safe-drugs](https://huggingface.co/datasets/datamol-io/safe-drugs) | 26 rows | 20 kB | Benchmarking dataset |

## Usage

Please refer to the [documentation](https://safe-docs.datamol.io/), which contains tutorials for getting started with `safe` and detailed descriptions of the functions provided.

The tutorials in the [documentation](https://safe-docs.datamol.io/) can help you get started with `safe` and `SAFE-GPT`.

### API

@@ -117,9 +118,9 @@ except safe.DecoderError:
ibuprofen_tokens = list(safe.split(ibuprofen_sf))
```

### Training a new models
### Training/Finetuning a (new) model

A command line interface is available to train a new model, please run `safe-train --help`
A command line interface is available to train a new model, please run `safe-train --help`. You can also provide an existing checkpoint to continue training or finetune on you own dataset.

For example:

@@ -138,6 +139,7 @@ safe-train --config <path to config> \
--max_steps 5
```


## References

If you use this repository, please cite the following related [paper](https://arxiv.org/abs/2310.10773#):
2 changes: 2 additions & 0 deletions safe/converter.py
@@ -318,6 +318,8 @@ def encoder(

scaffold_str = ".".join(frags_str)
attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str))
if canonical:
attach_pos = sorted(attach_pos)
starting_num = 1 if len(branch_numbers) == 0 else max(branch_numbers) + 1
for attach in attach_pos:
val = str(starting_num) if starting_num < 10 else f"%{starting_num}"
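
To illustrate why this hunk matters: Python sets of strings have no guaranteed iteration order across interpreter runs (string hashing is randomized by default), so looping over `attach_pos` directly can assign ring-closure numbers to attachment points in a different order each time, which breaks canonical output. The sketch below is a standalone approximation, not the code in `safe/converter.py`. The helper name `assign_ring_labels`, the example scaffold string, and the step that stores each label and increments `starting_num` are assumptions made for illustration, since the rest of the loop body is not shown in the diff.

```python
import re

# Same pattern as in the hunk: matches "[1*]"-style and "[*:1]"-style attachment points.
ATTACH_PATTERN = re.compile(r"(\[\d+\*\]|\[[^:]*:\d+\])")


def assign_ring_labels(scaffold_str, starting_num=1, canonical=True):
    """Map each attachment point to a SMILES ring-closure label.

    Hypothetical helper: only the set construction, the sort, and the
    label formatting mirror the diff; the rest is assumed.
    """
    attach_pos = set(ATTACH_PATTERN.findall(scaffold_str))
    if canonical:
        # Sorting turns the set into a list with a reproducible order.
        attach_pos = sorted(attach_pos)
    labels = {}
    for attach in attach_pos:
        # Ring closures above 9 need the two-digit "%nn" form in SMILES.
        labels[attach] = str(starting_num) if starting_num < 10 else f"%{starting_num}"
        starting_num += 1  # assumed: each attachment point gets the next number
    return labels


# Hypothetical fragment string with three distinct attachment points.
scaffold = "c1ccc([2*])cc1.[1*]C(=O)O.[3*]CC[1*]"
labels = assign_ring_labels(scaffold)
labels_high = assign_ring_labels(scaffold, starting_num=9)
```

Note that `sorted` orders the matches lexicographically as strings, so `[10*]` would sort before `[2*]`. That is still sufficient here, because the goal is a reproducible order rather than a numeric one.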