
Commit

WIP cleaning up
hadim committed Oct 27, 2023
1 parent 77a88be commit 2fcacd1
Showing 15 changed files with 119 additions and 215 deletions.
Binary file removed .DS_Store
Binary file not shown.
8 changes: 0 additions & 8 deletions .authors.yml

This file was deleted.

12 changes: 0 additions & 12 deletions .mailmap

This file was deleted.

4 changes: 0 additions & 4 deletions AUTHORS.rst

This file was deleted.

14 changes: 0 additions & 14 deletions CHANGELOG.rst

This file was deleted.

127 changes: 60 additions & 67 deletions README.md
@@ -1,4 +1,3 @@

<h1 align="center"> :safety_vest: SAFE </h1>
<h4 align="center"><b>S</b>equential <b>A</b>ttachment-based <b>F</b>ragment <b>E</b>mbedding (SAFE) is a novel molecular line notation that represents molecules as an unordered sequence of fragment blocks to improve molecule design using generative models.</h4>

@@ -9,13 +8,13 @@
</br>

<p align="center">
<a href="" target="_blank">
<a href="https://arxiv.org/pdf/2310.10773.pdf" target="_blank">
Paper
</a> |
<a href="https://maclandrol.github.io/safe/" target="_blank">
<a href="https://safe-docs.datamol.io/" target="_blank">
Docs
</a> |
<a href="#" target="_blank">
<a href="https://huggingface.co/datamol-io/safe" target="_blank">
🤗 Model
</a>
</p>
@@ -24,28 +23,30 @@

</br>

[![PyPI](https://img.shields.io/pypi/v/safe)](https://pypi.org/project/safe/)
[![Version](https://img.shields.io/pypi/pyversions/safe)](https://pypi.org/project/safe/)
[![Code license](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/maclandrol/safe/blob/main/LICENSE)
[![Data License](https://img.shields.io/badge/Data%20License-CC%20BY%204.0-red.svg)](https://github.com/maclandrol/safe/blob/main/DATA_LICENSE)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-blue.svg)](https://github.com/maclandrol/safe/graphs/commit-activity)
[![arXiv](https://img.shields.io/badge/arXiv-1234.56789-b31b1b.svg)](https://arxiv.org/abs/1234.56789)
[![test](https://github.com/maclandrol/safe/actions/workflows/test.yml/badge.svg)](https://github.com/maclandrol/safe/actions/workflows/test.yml)

## 🆕 News
- \[**August 2023**\] We've released xxx

[![PyPI](https://img.shields.io/pypi/v/safe-mol)](https://pypi.org/project/safe-mol/)
[![Conda](https://img.shields.io/conda/v/conda-forge/safe-mol?label=conda&color=success)](https://anaconda.org/conda-forge/safe-mol)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/safe-mol)](https://pypi.org/project/safe-mol/)
[![Conda](https://img.shields.io/conda/dn/conda-forge/safe-mol)](https://anaconda.org/conda-forge/safe-mol)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/safe-mol)](https://pypi.org/project/safe-mol/)
[![Code license](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/datamol-io/safe/blob/main/LICENSE)
[![Data License](https://img.shields.io/badge/Data%20License-CC%20BY%204.0-red.svg)](https://github.com/datamol-io/safe/blob/main/DATA_LICENSE)[![GitHub Repo stars](https://img.shields.io/github/stars/datamol-io/safe)](https://github.com/datamol-io/safe/stargazers)
[![GitHub Repo stars](https://img.shields.io/github/forks/datamol-io/safe)](https://github.com/datamol-io/safe/network/members)
[![test](https://github.com/datamol-io/safe/actions/workflows/test.yml/badge.svg)](https://github.com/datamol-io/safe/actions/workflows/test.yml)
[![release](https://github.com/datamol-io/safe/actions/workflows/release.yml/badge.svg)](https://github.com/datamol-io/safe/actions/workflows/release.yml)
[![code-check](https://github.com/datamol-io/safe/actions/workflows/code-check.yml/badge.svg)](https://github.com/datamol-io/safe/actions/workflows/code-check.yml)
[![doc](https://github.com/datamol-io/safe/actions/workflows/doc.yml/badge.svg)](https://github.com/datamol-io/safe/actions/workflows/doc.yml)
[![arXiv](https://img.shields.io/badge/arXiv-2310.10773-b31b1b.svg)](https://arxiv.org/pdf/2310.10773.pdf)

## Overview of SAFE

SAFE *is the* deep learning molecular representation. It is an encoding that leverages a peculiarity in the decoding scheme of SMILES to represent molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings and thus preserve the same amount of information. The intuitive representation of molecules as an unordered sequence of connected fragments greatly simplifies the following tasks often encountered in molecular design:
SAFE _is the_ deep learning molecular representation. It is an encoding that leverages a peculiarity in the decoding scheme of SMILES to represent molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings and thus preserve the same amount of information. The intuitive representation of molecules as an unordered sequence of connected fragments greatly simplifies the following tasks often encountered in molecular design:

- *de novo* design
- _de novo_ design
- superstructure generation
- scaffold decoration
- motif extension
- linker generation
- scaffold morphing.
- scaffold morphing.

The construction of a SAFE string requires defining a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).

@@ -54,63 +55,41 @@ The construction of a SAFE strings requires definition a molecular fragmentation
<img src="docs/assets/safe-construction.svg" width="100%">
</div>
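
As a rough illustration of the SMILES peculiarity mentioned above: a ring-closure label that appears in two dot-separated components bonds those components together, so `C1.C1O` reads as ethanol. The sketch below is a toy, standard-library-only helper that lists which ring-closure labels are shared between fragment blocks. The names `ring_labels` and `inter_fragment_bonds` are hypothetical and not part of the `safe` API, and the code assumes each label is used only once in the string, which real SMILES does not guarantee.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches a bracket atom (skipped), a two-digit closure such as "%12",
# or a single-digit closure such as "1".
_TOKEN = re.compile(r"\[[^\]]*\]|%(\d{2})|(\d)")


def ring_labels(fragment: str) -> List[str]:
    """Return the ring-closure labels found in one fragment, in order."""
    labels = []
    for match in _TOKEN.finditer(fragment):
        label = match.group(1) or match.group(2)
        if label is not None:  # None means a bracket atom was matched
            labels.append(label)
    return labels


def inter_fragment_bonds(safe_like: str) -> Dict[str, List[int]]:
    """Map each label shared across fragments to the fragment indices using it.

    Toy assumption: a label is never reused within the string.
    """
    owners = defaultdict(set)
    for index, fragment in enumerate(safe_like.split(".")):
        for label in ring_labels(fragment):
            owners[label].add(index)
    return {label: sorted(ix) for label, ix in owners.items() if len(ix) > 1}


# Benzyl alcohol (c1ccccc1CO) written as two fragments joined through label 2.
print(inter_fragment_bonds("c1ccccc1C2.O2"))  # {'2': [0, 1]}
```

Label `1` closes the aromatic ring inside the first fragment, so only label `2` is reported as an attachment point between fragments.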



### Installation

You can install `safe` using pip, when the package is public
You can install `safe` using pip:

```bash
pip install safe-mol
pip install safe-mol
```


You can use conda/mamba. Ask @maclandrol for credentials to the conda forge or for a token

```bash
mamba install -c invivoai safe
```


Alternatively clone this repo, install the dependencies, install `safe` locally and you are good to go:


```bash
git clone https://github.com/maclandrol/safe.git
cd safe
mamba env create -f env.yml -n "safe-space" # :)
pip install -e .
mamba install -c conda-forge safe-mol
```

`safe` mostly depends on [transformers](https://huggingface.co/docs/transformers/index) and [datasets](https://huggingface.co/docs/datasets/index). Please see the [env.yml](./env.yml) file for a complete list of dependencies.
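
To confirm that the install worked, one option is to query the installed distribution metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and it assumes the distribution is named `safe-mol` as in the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "safe-mol") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"safe-mol {found}" if found else "safe-mol is not installed")
```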


### Datasets and Models

We provide a pretrained GPT2 model (XX M parameters) that uses the SAFE molecular representation and was trained on 1.1 billion molecules from UniChem (0.1B) and ZINC (1B):

- *Safe-XXM* [maclandrol/safe-XXM]()
We provide a pretrained GPT2 model (XX M parameters) that uses the SAFE molecular representation and was trained on 1.1 billion molecules from UniChem (0.1B) and ZINC (1B):

- _Safe-XXM_ TODO

## Usage

Please refer to the [documentation](), which contains a thorough tutorial for getting started with ``safe`` and detailed descriptions of the functions provided.

In particular, see the following tutorials:
- xxx
- xxx

Please refer to the [documentation](https://safe-docs.datamol.io/), which contains tutorials for getting started with `safe` and detailed descriptions of the functions provided.

### API

We summarize some key functions provided by the `safe` package below.

| Function | Description |
| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ``safe.encode`` | Translates a SMILES string into its corresponding SAFE string. |
| ``safe.decode`` | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder just augments RDKit's `Chem.MolFromSmiles` with an optional correction argument to take care of missing hydrogens bonds. |
| ``safe.split`` | Tokenizes a SAFE string to build a generative model. |

| Function | Description |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `safe.encode` | Translates a SMILES string into its corresponding SAFE string. |
| `safe.decode` | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder just augments RDKit's `Chem.MolFromSmiles` with an optional correction argument to take care of missing hydrogens bonds. |
| `safe.split` | Tokenizes a SAFE string to build a generative model. |

### Examples

@@ -126,7 +105,7 @@ try:
    ibuprofen_sf = safe.encode(ibuprofen) # [C][=C][C][=C][C][=C][Ring1][=Branch1]
    ibuprofen_smi = safe.decode(ibuprofen_sf, canonical=True) # CC(Cc1ccc(cc1)C(C(=O)O)C)C
except safe.EncoderError:
    pass
    pass
except safe.DecoderError:
    pass

@@ -136,7 +115,7 @@ ibuprofen_tokens = list(safe.split(ibuprofen_sf))

### Training a new model

A command line interface is available to train a new model. Run ```safe-train --help``` to list its options.
A command line interface is available to train a new model. Run `safe-train --help` to list its options.

For example:

@@ -155,28 +134,42 @@ safe-train --config <path to config> \
--max_steps 5
```
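
When launching training from a script, the same command can be assembled in Python. The sketch below covers only the two flags visible in this excerpt (`--config` and `--max_steps`); the other options of the example are elided here and would need to be added for a real run. The helper name `build_safe_train_command` and the path `config.json` are hypothetical.

```python
import shlex
from typing import List, Optional


def build_safe_train_command(config_path: str, max_steps: Optional[int] = None) -> List[str]:
    """Build an argv list for `safe-train` from the flags shown in the excerpt."""
    cmd = ["safe-train", "--config", config_path]
    if max_steps is not None:
        cmd += ["--max_steps", str(max_steps)]
    return cmd


cmd = build_safe_train_command("config.json", max_steps=5)
print(shlex.join(cmd))  # safe-train --config config.json --max_steps 5
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `safe-train` is on the `PATH`.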


## Changelog
See the latest changelogs at [CHANGELOG.rst](./CHANGELOG.rst).

## References
If you use this repository, please cite the following related paper:

```
@article{,
title={Gotta be SAFE: a new framework for molecular design.},
author={},
journal={},
year={2023}
If you use this repository, please cite the following related [paper](https://arxiv.org/abs/2310.10773#):

```bib
@misc{noutahi2023gotta,
title={Gotta be SAFE: A New Framework for Molecular Design},
author={Emmanuel Noutahi and Cristian Gabellini and Michael Craig and Jonathan S. C Lim and Prudencio Tossou},
year={2023},
eprint={2310.10773},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```

## License

Please note that all data and model weights of **SAFE** are exclusively licensed for research purposes. The accompanying dataset is licensed under CC BY 4.0, which permits solely non-commercial usage. See [DATA_LICENSE](DATA_LICENSE) for details.
Note that all data and model weights of **SAFE** are exclusively licensed for research purposes. The accompanying dataset is licensed under CC BY 4.0, which permits solely non-commercial usage. See [DATA_LICENSE](DATA_LICENSE) for details.

This code base is licensed under the Apache-2.0 license. See [LICENSE](LICENSE) for details.

## Maintainers
## Development lifecycle

- @maclandrol
### Setup dev environment

```bash
mamba create -n safe -f env.yml
mamba activate safe

pip install --no-deps -e .
```

### Tests

You can run tests locally with:

```bash
pytest
```
Binary file removed docs/.DS_Store
Binary file not shown.
8 changes: 4 additions & 4 deletions docs/index.md
@@ -10,7 +10,7 @@
<a href="" target="_blank">
Paper
</a> |
<a href="https://github.com/valence-labs/safe/" target="_blank">
<a href="https://github.com/datamol-io/safe/" target="_blank">
Github
</a> |
<a href="#" target="_blank">
@@ -34,7 +34,7 @@ SAFE *is the* deep learning molecular representation. It's an encoding leveragin
- scaffold decoration
- motif extension
- linker generation
- scaffold morphing.
- scaffold morphing.

The construction of a SAFE string requires defining a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by [datamol](https://github.com/datamol-io/datamol) or [RDKit](https://github.com/rdkit/rdkit).

@@ -67,14 +67,14 @@ pip install -e .

### Datasets and Models

We provide a pretrained GPT2 model (XXM parameters) that uses the SAFE molecular representation and was trained on 1.1 billion molecules from UniChem (0.1B) and ZINC (1B):
We provide a pretrained GPT2 model (XXM parameters) that uses the SAFE molecular representation and was trained on 1.1 billion molecules from UniChem (0.1B) and ZINC (1B):

- *Safe-XXM* [maclandrol/safe-XXM]()


### Usage

To get started with SAFE, please see the tutorials:
To get started with SAFE, please see the tutorials:
- xxx
- xxx

21 changes: 8 additions & 13 deletions env.yml
@@ -1,27 +1,29 @@
channels:
- conda-forge

dependencies:
- python >=3.8
- python >=3.9
- pip
- tqdm
- joblib
- loguru
- typer

# Scientific
- datamol
- numpy
- pytorch >=2.0
- transformers
- optimum
- datasets
- typer
- tokenizers
- sentencepiece
- accelerate
- evaluate
- wandb
- universal_pathlib
- huggingface_hub
- deepspeed

# Optional
- deepspeed

# dev
- black >=23
- ruff
@@ -31,13 +33,6 @@ dependencies:
- nbconvert
- ipywidgets

# Releasing tools
- twine
- build
- setuptools-scm
- rever >=0.4.5
- conda-smithy

# Doc
- mkdocs
- mkdocs-material >=7.1.1
Empty file removed expts/README.md
Empty file.
10 changes: 5 additions & 5 deletions mkdocs.yml
@@ -1,8 +1,8 @@
site_name: "SAFE"
site_description: "Gotta be SAFE: a new framework for molecular design"
site_url: "https://github.com/valence-labs/safe"
repo_url: "https://github.com/valence-labs/safe"
repo_name: "valence-labs/safe"
site_url: "https://github.com/datamol-io/safe"
repo_url: "https://github.com/datamol-io/safe"
repo_name: "datamol-io/safe"
copyright: Copyright 2023 Valence Labs

remote_branch: "gh-pages"
@@ -87,8 +87,8 @@ extra:

social:
- icon: fontawesome/brands/github
link: https://github.com/valence-labs
link: https://github.com/datamol-io
- icon: fontawesome/brands/twitter
link: https://twitter.com/ENoutahi
link: https://twitter.com/datamol_io
- icon: fontawesome/brands/python
link: https://pypi.org/project/safe-mol/
23 changes: 0 additions & 23 deletions news/TEMPLATE.rst

This file was deleted.
