update README
albertbou92 committed May 18, 2024
1 parent 541f5fe commit e6b05f9
Showing 17 changed files with 241 additions and 111 deletions.
126 changes: 72 additions & 54 deletions README.md
@@ -20,7 +20,7 @@ The full paper can be found [here](https://arxiv.org/abs/2405.04657).
## Features

- **Multiple Generative Modes:** AceGen facilitates the generation of chemical libraries with different modes: de novo generation, scaffold decoration, and fragment linking.
- **RL Algorithms:** AceGen offers task optimization with various reinforcement learning algorithms such as Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C), Reinforce, Reinvent, and Augmented Hill-Climb (AHC).
- **RL Algorithms:** AceGen offers task optimization with various reinforcement learning algorithms such as [Proximal Policy Optimization (PPO)][1], [Advantage Actor-Critic (A2C)][2], [Reinforce][3], [Reinvent][4], and [Augmented Hill-Climb (AHC)][5].
- **Pre-trained Models:** The toolkit offers pre-trained models including Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and GPT-2.
- **Scoring Functions:** AceGen relies on MolScore, a comprehensive scoring function suite for generative chemistry, to evaluate the quality of the generated molecules.
- **Customization Support:** AceGen provides tutorials for integrating custom models and custom scoring functions, ensuring flexibility for advanced users.
@@ -80,7 +80,17 @@ To learn how to configure constrained molecule generation with AceGen and prompts

---

## Running training scripts
## Generating libraries of molecules

AceGen provides multiple RL algorithms, each in its own directory under `acegen-open/scripts`. Each RL algorithm supports three generative modes of execution: de novo generation, scaffold decoration, and fragment linking.
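Taken together, the scripts follow a regular layout that can be sketched in a few lines. This is an illustration only; the directory names are inferred from the algorithm list above and the commands shown below:

```python
# Illustrative sketch of how the training scripts are laid out: one
# directory per RL algorithm, one config file per generative mode.
# Directory names are assumptions inferred from the commands shown below.
algorithms = ["reinvent", "ahc", "a2c", "ppo"]
modes = ["denovo", "scaffold", "linking"]

def launch_command(algorithm: str, mode: str) -> str:
    """Build the command used to run one algorithm in one generative mode."""
    return f"python scripts/{algorithm}/{algorithm}.py --config-name config_{mode}"

for algo in algorithms:
    for mode in modes:
        print(launch_command(algo, mode))
```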

Each mode of execution has its own configuration file in YAML format, located right next to the script. To modify training parameters for any mode, edit the corresponding YAML file. For a breakdown of the general structure of our configuration files, refer to this [tutorial](tutorials/breaking_down_configuration_files.md).
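For a sense of the knobs these YAML files expose, here is a minimal sketch mirroring them as a Python dict. The key names are copied from the A2C de novo config shown later in this commit; the `override` helper is purely illustrative, not AceGen code:

```python
# Sketch: config fields mirrored as a Python dict. Key names are taken
# from scripts/a2c/config_denovo.yaml in this commit; the helper below
# is an illustration, not part of AceGen.
config = {
    "experiment_name": "acegen",
    "agent_name": "a2c",
    "log_dir": "results",    # directory to save the results
    "num_envs": 16,          # number of SMILES to generate in parallel
    "total_smiles": 10_000,  # total number of SMILES to generate
    "molscore": "MolOpt",    # scoring function / benchmark
    "model": "gru",          # gru, lstm, or gpt2
}

def override(base: dict, **changes) -> dict:
    """Return a copy of the config with selected fields replaced."""
    unknown = set(changes) - set(base)
    if unknown:
        raise KeyError(f"unknown config fields: {sorted(unknown)}")
    return {**base, **changes}

# A quick smoke-test run: fewer molecules, fewer parallel environments.
smoke_test = override(config, total_smiles=1_000, num_envs=4)
```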

The default values in the configuration files are sensible, and a default scoring function and model architecture are defined, so the scripts work out of the box. In practice, however, users will often want to customize the model architecture or the scoring function.

To customize the model architecture, refer to the [Changing the model architecture](#changing-the-model-architecture) section. To customize the scoring function, refer to the [Changing the scoring function](#changing-the-scoring-function) section.

### Running training scripts to generate compound libraries

To run the training scripts for de novo generation, use the following commands:

@@ -106,8 +116,6 @@ To run the training scripts for fragment linking, run the following commands (re
python scripts/reinvent/reinvent.py --config-name config_linking
python scripts/ahc/ahc.py --config-name config_linking

To modify training parameters, edit the corresponding YAML file in each example's directory.

#### Advanced usage

Scripts are also available as executables after installation, but both the path and name of the config must be specified. For example,
@@ -120,82 +128,92 @@ YAML config parameters can also be specified on the command line. For example,
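In effect, each `key=value` pair on the command line replaces the corresponding YAML value for that run. A rough sketch of that merge logic (illustrative only; the scripts' actual parsing is handled by their configuration framework):

```python
# Illustrative sketch of `key=value` command-line pairs being merged into
# a YAML-derived config dict. This mimics Hydra-style behaviour and is
# not AceGen's actual parsing code.
def parse_value(raw: str):
    """Interpret a raw override string as int, float, or plain string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw

def apply_overrides(config: dict, args: list[str]) -> dict:
    out = dict(config)
    for arg in args:
        key, sep, raw = arg.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {arg!r}")
        out[key] = parse_value(raw)
    return out

base = {"model": "gru", "total_smiles": 10_000}
run_cfg = apply_overrides(base, ["model=gpt2", "total_smiles=5000"])
```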

---

# Available models
## Changing the scoring function

To change the scoring function, adjust the `molscore` parameter in any configuration file. Set it to point to a valid
MolScore configuration file (e.g. `../MolScore/molscore/configs/GuacaMol/Albuterol_similarity.json`).
Alternatively, set the `molscore` parameter to the name of a valid MolScore benchmark
(such as MolOpt or GuacaMol) to automatically run every task in the benchmark. For further details on MolScore,
please refer to the [MolScore](https://github.com/MorganCThomas/MolScore) repository.

Users can also define their own custom scoring functions and use them in the AceGen scripts by following the
instructions in this [tutorial](tutorials/adding_custom_scoring_function.md).
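For a sense of the shape such a function takes, a toy scoring function might map a batch of SMILES strings to one score per molecule. This is purely illustrative; the exact interface AceGen expects is covered in the tutorial above:

```python
# Toy custom scoring function: one score per input SMILES. It rewards
# molecules containing a nitrogen atom -- purely illustrative. Real scoring
# functions would use a chemistry toolkit (e.g. RDKit) or MolScore tasks.
def toy_scoring_function(smiles_list: list[str]) -> list[float]:
    scores = []
    for smi in smiles_list:
        has_nitrogen = "N" in smi or "n" in smi
        scores.append(1.0 if has_nitrogen else 0.0)
    return scores

print(toy_scoring_function(["c1ccccc1", "c1ccncc1"]))  # benzene, pyridine -> [0.0, 1.0]
```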

---

## Changing the model architecture

We provide a variety of example priors that can be selected in the configuration file. These include:
### Available models

We provide a variety of default priors that can be selected in the configuration file. These include:

- A Gated Recurrent Unit (GRU) model
  - pre-training dataset 1 (default): [ChEMBL](https://www.ebi.ac.uk/chembl/)
  - pre-training dataset 2: [ZINC250k](https://github.com/wenhao-gao/mol_opt/blob/main/data/zinc.txt.gz)
  - number of parameters: 4,363,045
  - to select, set the field `model` to `gru` in any configuration file

- A Long Short-Term Memory (LSTM) model
  - pre-training dataset: [ChEMBL](https://www.ebi.ac.uk/chembl/)
  - number of parameters: 5,807,909
  - to select, set the field `model` to `lstm` in any configuration file

- A GPT-2 model (requires installation of HuggingFace's `transformers` library)
  - pre-training dataset: [REAL 350/3 lead-like, 613.86M cpds, CXSMILES](https://enamine.net/compound-collections/real-compounds/real-database-subsets)
  - number of parameters: 5,030,400
  - to select, set the field `model` to `gpt2` in any configuration file
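One way to picture how the `model` field selects a prior is a simple registry mapping names to model factories. This is a sketch of the pattern, not AceGen's actual implementation; the factories are stand-ins, though the parameter counts match the list above:

```python
# Sketch of a name -> factory registry, the pattern behind a config field
# like `model: gru`. The factories are stand-ins, not AceGen's real models.
def make_gru():  return {"arch": "gru",  "params": 4_363_045}
def make_lstm(): return {"arch": "lstm", "params": 5_807_909}
def make_gpt2(): return {"arch": "gpt2", "params": 5_030_400}

MODEL_REGISTRY = {"gru": make_gru, "lstm": make_lstm, "gpt2": make_gpt2}

def create_prior(name: str):
    """Instantiate the prior named by the config's `model` field."""
    try:
        return MODEL_REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown model {name!r}; choose from {sorted(MODEL_REGISTRY)}")
```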

---


# Integration of custom models

We encourage users to integrate their own models into AceGen.
### Integration of custom models

`/acegen/models/gru.py` and `/acegen/models/lstm.py` offer methods to create RNNs of varying sizes, which can be used
to load custom models.
We also encourage users to integrate their own models into AceGen.

Similarly, `/acegen/models/gpt2.py` can serve as a template for integrating HuggingFace models. A detailed guide
on integrating custom models can be found in this [tutorial](tutorials/adding_custom_model.md).
A detailed guide on integrating custom models can be found in this [tutorial](tutorials/adding_custom_model.md).

---

# Results on the [MolOpt](https://arxiv.org/pdf/2206.12411.pdf) benchmark
## Results on the [MolOpt](https://arxiv.org/pdf/2206.12411.pdf) benchmark

Algorithm comparison for the Area Under the Curve (AUC) of the top 100 molecules on the MolOpt benchmark scoring functions.
Each algorithm was run 5 times with different seeds, and the results were averaged.
We used the default configuration for each algorithm, including the GRU model for the prior.
For Reinvent, we also tested the configuration proposed in the MolOpt paper.
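Concretely, the top-100 AUC can be read as: track the mean of the best 100 scores seen so far as generation proceeds, then average that running value over the whole budget. The sketch below is a simplified reading of the metric; see the MolOpt paper for the exact protocol:

```python
import heapq

def top_k_auc(scores: list[float], k: int = 100) -> float:
    """Mean over steps of (mean of the top-k scores seen so far).

    Simplified sketch of a MolOpt-style top-k AUC: `scores` holds one
    value per generated molecule, in generation order, assumed non-empty.
    """
    best: list[float] = []  # min-heap holding the best <= k scores so far
    curve = []
    for s in scores:
        if len(best) < k:
            heapq.heappush(best, s)
        elif s > best[0]:
            heapq.heapreplace(best, s)  # drop current minimum, insert s
        curve.append(sum(best) / len(best))
    return sum(curve) / len(curve)
```

Higher values mean good molecules were found early and kept improving, which is why the metric rewards sample efficiency and not just the final best score.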

| Task | Reinvent | Reinvent MolOpt | AHC | A2C | PPO | PPOD |
|-------------------------------|----------|-----------------|-------|-------|-------|-------|
| Albuterol_similarity | 0.569 | 0.865 | 0.640 | 0.760 | 0.911 | **0.919** |
| Amlodipine_MPO | 0.506 | 0.626 | 0.505 | 0.511 | 0.553 | **0.656** |
| C7H8N2O2 | 0.615 | 0.871 | 0.563 | 0.737 | 0.864 | **0.875** |
| C9H10N2O2PF2Cl | 0.556 | 0.721 | 0.553 | 0.610 | 0.625 | **0.756** |
| Celecoxxib_rediscovery | 0.566 | 0.812 | 0.590 | 0.700 | 0.647 | **0.888** |
| Deco_hop | 0.602 | **0.657** | 0.616 | 0.605 | 0.601 | 0.646 |
| Fexofenadine_MPO | 0.668 | **0.765** | 0.680 | 0.663 | 0.687 | 0.747 |
| Median_molecules_1 | 0.199 | 0.348 | 0.197 | 0.321 | 0.362 | **0.363** |
| Median_molecules_2 | 0.195 | 0.270 | 0.208 | 0.224 | 0.236 | **0.285** |
| Mestranol_similarity | 0.454 | 0.821 | 0.514 | 0.645 | 0.728 | **0.870** |
| Osimertinib_MPO | 0.782 | **0.837** | 0.791 | 0.780 | 0.798 | 0.815 |
| Perindopril_MPO | 0.430 | **0.516** | 0.431 | 0.444 | 0.477 | 0.506 |
| QED | 0.922 | 0.931 | 0.925 | 0.927 | **0.933** | **0.933** |
| Ranolazine_MPO | 0.626 | **0.721** | 0.635 | 0.681 | 0.681 | 0.706 |
| Scaffold_hop | 0.758 | **0.834** | 0.772 | 0.764 | 0.761 | 0.808 |
| Sitagliptin_MPO | 0.226 | 0.356 | 0.219 | 0.272 | 0.295 | **0.372** |
| Thiothixene_rediscovery | 0.350 | 0.539 | 0.385 | 0.446 | 0.473 | **0.570** |
| Troglitazone_rediscovery | 0.256 | 0.447 | 0.282 | 0.305 | 0.449 | 0.511 |
| Valsartan_smarts | 0.012 | 0.014 | 0.011 | 0.010 | **0.022** | **0.022** |
| Zaleplon_MPO | 0.408 | **0.496** | 0.412 | 0.415 | 0.469 | 0.490 |
| DRD2 | 0.907 | **0.963** | 0.906 | 0.942 | **0.967** | 0.963 |
| GSK3B | 0.738 | 0.890 | 0.719 | 0.781 | 0.863 | **0.891** |
| JNK3 | 0.640 | 0.817 | 0.649 | 0.660 | 0.770 | **0.842** |
| **Total** | **11.985** | **15.118** | **12.205** | **13.203** | **14.170** | **15.434** |
| Task | [REINFORCE][3] | [REINVENT][4] | [REINVENT MolOpt][6] | [AHC][5] | [A2C][2] | [PPO][1] | [PPOD][7] |
|-------------------------------|----------------|---------------|----------------------|----------|----------|----------|-----------|
| Albuterol_similarity | 0.68 ± 0.03 | 0.69 ± 0.02 | 0.90 ± 0.01 | 0.77 ± 0.02 | 0.82 ± 0.04 | 0.93 ± 0.02 | **0.94 ± 0.00** |
| Amlodipine_MPO | 0.55 ± 0.01 | 0.56 ± 0.01 | 0.65 ± 0.06 | 0.56 ± 0.01 | 0.55 ± 0.01 | 0.58 ± 0.03 | **0.68 ± 0.02** |
| C7H8N2O2 | 0.83 ± 0.01 | 0.82 ± 0.03 | **0.90 ± 0.02** | 0.76 ± 0.04 | 0.84 ± 0.04 | 0.89 ± 0.01 | 0.89 ± 0.03 |
| C9H10N2O2PF2Cl | 0.70 ± 0.02 | 0.70 ± 0.02 | 0.76 ± 0.03 | 0.68 ± 0.02 | 0.69 ± 0.03 | 0.66 ± 0.02 | **0.79 ± 0.02** |
| Celecoxxib_rediscovery | 0.63 ± 0.02 | 0.64 ± 0.03 | 0.77 ± 0.02 | 0.72 ± 0.02 | 0.73 ± 0.06 | 0.65 ± 0.12 | **0.82 ± 0.03** |
| DRD2 | 0.98 ± 0.00 | 0.97 ± 0.00 | **0.99 ± 0.00** | 0.98 ± 0.01 | 0.98 ± 0.01 | **0.99 ± 0.00** | **0.99 ± 0.00** |
| Deco_hop | 0.63 ± 0.00 | 0.63 ± 0.01 | **0.67 ± 0.01** | 0.64 ± 0.01 | 0.62 ± 0.00 | 0.62 ± 0.01 | 0.66 ± 0.02 |
| Fexofenadine_MPO | 0.71 ± 0.01 | 0.71 ± 0.00 | **0.80 ± 0.03** | 0.72 ± 0.00 | 0.71 ± 0.00 | 0.73 ± 0.00 | 0.78 ± 0.01 |
| GSK3B | 0.84 ± 0.01 | 0.84 ± 0.02 | **0.92 ± 0.02** | 0.82 ± 0.01 | 0.85 ± 0.02 | 0.90 ± 0.02 | **0.92 ± 0.02**|
| JNK3 | 0.75 ± 0.03 | 0.75 ± 0.02 | 0.85 ± 0.04 | 0.75 ± 0.01 | 0.74 ± 0.06 | 0.80 ± 0.04 | **0.87 ± 0.02**|
| Median_molecules_1 | 0.26 ± 0.00 | 0.24 ± 0.00 | **0.36 ± 0.02** | 0.24 ± 0.00 | 0.31 ± 0.01 | 0.33 ± 0.02 | 0.35 ± 0.02 |
| Median_molecules_2 | 0.22 ± 0.00 | 0.22 ± 0.00 | 0.28 ± 0.01 | 0.24 ± 0.00 | 0.25 ± 0.01 | 0.25 ± 0.02 | **0.29 ± 0.01**|
| Mestranol_similarity | 0.60 ± 0.03 | 0.55 ± 0.04 | 0.85 ± 0.07 | 0.66 ± 0.04 | 0.69 ± 0.07 | 0.75 ± 0.15 | **0.89 ± 0.05**|
| Osimertinib_MPO | 0.82 ± 0.01 | 0.82 ± 0.00 | **0.86 ± 0.01** | 0.83 ± 0.00 | 0.81 ± 0.01 | 0.82 ± 0.01 | 0.84 ± 0.00 |
| Perindopril_MPO | 0.48 ± 0.01 | 0.47 ± 0.00 | **0.54 ± 0.01** | 0.47 ± 0.01 | 0.48 ± 0.00 | 0.50 ± 0.01 | 0.53 ± 0.00 |
| QED | **0.94 ± 0.00**| **0.94 ± 0.00**| **0.94 ± 0.00** | **0.94 ± 0.00**| **0.94 ± 0.00**| **0.94 ± 0.00**| **0.94 ± 0.00**|
| Ranolazine_MPO | 0.70 ± 0.01 | 0.69 ± 0.00 | **0.76 ± 0.01** | 0.70 ± 0.00 | 0.74 ± 0.01 | 0.73 ± 0.01 | 0.75 ± 0.00 |
| Scaffold_hop | 0.80 ± 0.00 | 0.79 ± 0.00 | **0.86 ± 0.02** | 0.80 ± 0.01 | 0.80 ± 0.00 | 0.80 ± 0.02 | 0.84 ± 0.03 |
| Sitagliptin_MPO | 0.34 ± 0.02 | 0.33 ± 0.01 | 0.38 ± 0.03 | 0.33 ± 0.02 | **0.39 ± 0.02**| 0.32 ± 0.02 | **0.39 ± 0.02**|
| Thiothixene_rediscovery | 0.41 ± 0.01 | 0.41 ± 0.00 | 0.56 ± 0.04 | 0.45 ± 0.02 | 0.48 ± 0.04 | 0.48 ± 0.06 | **0.58 ± 0.09**|
| Troglitazone_rediscovery | 0.31 ± 0.02 | 0.31 ± 0.02 | 0.47 ± 0.05 | 0.34 ± 0.01 | 0.35 ± 0.02 | 0.46 ± 0.07 | **0.52 ± 0.06**|
| Valsartan_smarts | **0.03 ± 0.00**| 0.02 ± 0.00 | 0.02 ± 0.00 | 0.02 ± 0.00 | 0.02 ± 0.00 | **0.03 ± 0.00**| **0.03 ± 0.00**|
| Zaleplon_MPO | 0.47 ± 0.01 | 0.47 ± 0.01 | **0.52 ± 0.01** | 0.48 ± 0.01 | 0.47 ± 0.01 | 0.50 ± 0.02 | **0.52 ± 0.01**|
| **Total** | **13.67** | **13.60** | **15.65** | **13.91** | **14.27** | **14.65** | **15.80** |


[1]: https://arxiv.org/abs/1707.06347
[2]: https://arxiv.org/abs/1602.01783
[3]: https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf
[4]: https://arxiv.org/abs/1704.07555
[5]: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00646-z
[6]: https://arxiv.org/pdf/2206.12411.pdf
[7]: https://arxiv.org/abs/2007.03328
12 changes: 6 additions & 6 deletions scripts/a2c/config_denovo.yaml
@@ -1,23 +1,23 @@
# Logging configuration
experiment_name: acegen
agent_name: a2c
log_dir: results
log_dir: results # Directory to save the results
logger_backend: null # wandb, tensorboard, or null
seed: 101 # multiple seeds can be provided as a list to multiple experiments sequentially e.g. [101, 102, 103]

# Environment configuration
num_envs: 16 # Number of smiles to generate in parallel
total_smiles: 10_000
total_smiles: 10_000 # Total number of smiles to generate

# Scoring function
molscore: MolOpt
molscore_include: ["Albuterol_similarity"]
custom_task: null # Requires molscore to be set to null

# Fix the beginning of the generated molecules
prompt: null # e.g. c1ccccc
# Promptsmiles configuration
prompt: null # e.g. c1ccccc # Fix the beginning of the generated molecules

# Architecture configuration
# Model architecture
shared_nets: False
model: gru # gru, lstm, or gpt2
# The default prior varies for each model. Refer to the README file in the root directory for more information.
7 changes: 3 additions & 4 deletions scripts/a2c/config_fragment.yaml
@@ -1,13 +1,13 @@
# Logging configuration
experiment_name: acegen
agent_name: a2c
log_dir: results
log_dir: results # Directory to save the results
logger_backend: null # wandb, tensorboard, or null
seed: 101 # multiple seeds can be provided as a list to multiple experiments sequentially e.g. [101, 102, 103]

# Environment configuration
num_envs: 16 # Number of smiles to generate in parallel
total_smiles: 10_000
total_smiles: 10_000 # Total number of smiles to generate

# Scoring function
molscore: MolOpt
@@ -20,8 +20,7 @@ promptsmiles_optimize: True
promptsmiles_shuffle: True
promptsmiles_multi: False


# Architecture configuration
# Model architecture
shared_nets: False
model: gru # gru, lstm, or gpt2
# The default prior varies for each model. Refer to the README file in the root directory for more information.
6 changes: 3 additions & 3 deletions scripts/a2c/config_scaffold.yaml
@@ -1,13 +1,13 @@
# Logging configuration
experiment_name: acegen
agent_name: a2c
log_dir: results
log_dir: results # Directory to save the results
logger_backend: null # wandb, tensorboard, or null
seed: 101 # multiple seeds can be provided as a list to multiple experiments sequentially e.g. [101, 102, 103]

# Environment configuration
num_envs: 16 # Number of smiles to generate in parallel
total_smiles: 10_000
total_smiles: 10_000 # Total number of smiles to generate

# Scoring function
molscore: LibINVENT_Exp1
@@ -20,7 +20,7 @@ promptsmiles_optimize: True
promptsmiles_shuffle: True
promptsmiles_multi: False

# Architecture configuration
# Model architecture
shared_nets: False
model: gru # gru, lstm, or gpt2
# The default prior varies for each model. Refer to the README file in the root directory for more information.
10 changes: 5 additions & 5 deletions scripts/ahc/config_denovo.yaml
@@ -1,23 +1,23 @@
# Logging configuration
experiment_name: acegen
agent_name: ahc
log_dir: results
log_dir: results # Directory to save the results
logger_backend: null # wandb, tensorboard, or null
seed: 101 # multiple seeds can be provided as a list to multiple experiments sequentially e.g. [101, 102, 103]

# Environment configuration
num_envs: 128 # Number of smiles to generate in parallel
total_smiles: 10_000
total_smiles: 10_000 # Total number of smiles to generate

# Scoring function
molscore: MolOpt
molscore_include: ["Albuterol_similarity"]
custom_task: null # Requires molscore to be set to null

# Fix the beginning of the generated molecules
prompt: null # e.g. c1ccccc
# Promptsmiles configuration
prompt: null # e.g. c1ccccc # Fix the beginning of the generated molecules

# Architecture
# Model architecture
model: gru # gru, lstm, or gpt2
# The default prior varies for each model. Refer to the README file in the root directory for more information.
# The default vocabulary varies for each prior. Refer to the README file in the root directory for more information.
6 changes: 3 additions & 3 deletions scripts/ahc/config_fragment.yaml
@@ -1,13 +1,13 @@
# Logging configuration
experiment_name: acegen
agent_name: ahc
log_dir: results
log_dir: results # Directory to save the results
logger_backend: null # wandb, tensorboard, or null
seed: 101 # multiple seeds can be provided as a list to multiple experiments sequentially e.g. [101, 102, 103]

# Environment configuration
num_envs: 128 # Number of smiles to generate in parallel
total_smiles: 10_000
total_smiles: 10_000 # Total number of smiles to generate

# Scoring function
molscore: MolOpt
@@ -20,7 +20,7 @@ promptsmiles_optimize: True
promptsmiles_shuffle: True
promptsmiles_multi: True

# Architecture
# Model architecture
model: gru # gru, lstm, or gpt2
# The default prior varies for each model. Refer to the README file in the root directory for more information.
# The default vocabulary varies for each prior. Refer to the README file in the root directory for more information.
6 changes: 3 additions & 3 deletions scripts/ahc/config_scaffold.yaml
@@ -1,13 +1,13 @@
# Logging configuration
experiment_name: acegen
agent_name: ahc
log_dir: results
log_dir: results # Directory to save the results
logger_backend: null # wandb, tensorboard, or null
seed: 101 # multiple seeds can be provided as a list to multiple experiments sequentially e.g. [101, 102, 103]

# Environment configuration
num_envs: 128 # Number of smiles to generate in parallel
total_smiles: 10_000
total_smiles: 10_000 # Total number of smiles to generate

# Scoring function
molscore: LibINVENT_Exp1
@@ -20,7 +20,7 @@ promptsmiles_optimize: True
promptsmiles_shuffle: True
promptsmiles_multi: False

# Architecture
# Model architecture
model: gru # gru, lstm, or gpt2
# The default prior varies for each model. Refer to the README file in the root directory for more information.
# The default vocabulary varies for each prior. Refer to the README file in the root directory for more information.
