Update README.md
Improve docs
lewismervin1 authored Jul 9, 2024
1 parent 8eb70a2 commit b52363f
and optimization is free to pair any specified descriptor with any of the algorithms.

When we have our data and our configuration, it is time to start the optimization.

## Run from Python/Jupyter Notebook

Create a conda environment with Jupyter and install QSARtuna into it:
```shell
module purge
module load Miniconda3
conda create --name my_env_with_qsartuna python=3.10.10 jupyter pip
conda activate my_env_with_qsartuna
module purge # Just in case.
which python # Check. Should output path that contains "my_env_with_qsartuna".
python -m pip install https://github.com/MolecularAI/QSARtuna/releases/download/3.1.1/qsartuna-3.1.1.tar.gz
```

Then you can use QSARtuna inside your Notebook:
```python
from qsartuna.three_step_opt_build_merge import (
optimize,
buildconfig_best,
build_best,
build_merged,
)
from qsartuna.config import ModelMode, OptimizationDirection
from qsartuna.config.optconfig import (
OptimizationConfig,
SVR,
RandomForest,
Ridge,
Lasso,
PLS,
XGBregressor,
)
from qsartuna.datareader import Dataset
from qsartuna.descriptors import ECFP, MACCS_keys, ECFP_counts

##
# Prepare hyperparameter optimization configuration.
config = OptimizationConfig(
data=Dataset(
input_column="canonical",
response_column="molwt",
training_dataset_file="tests/data/DRD2/subset-50/train.csv",
),
descriptors=[ECFP.new(), ECFP_counts.new(), MACCS_keys.new()],
algorithms=[
SVR.new(),
RandomForest.new(),
Ridge.new(),
Lasso.new(),
PLS.new(),
XGBregressor.new(),
],
settings=OptimizationConfig.Settings(
mode=ModelMode.REGRESSION,
cross_validation=3,
n_trials=100,
direction=OptimizationDirection.MAXIMIZATION,
),
)

##
# Run Optuna Study.
study = optimize(config, study_name="my_study")

##
# Get the best Trial from the Study and make a Build (Training) configuration for it.
buildconfig = buildconfig_best(study)
# Optional: write out JSON of the best configuration.
import json
print(json.dumps(buildconfig.json(), indent=2))

##
# Build (re-Train) and save the best model.
build_best(buildconfig, "target/best.pkl")

##
# Build (Train) and save the model on the merged train+test data.
build_merged(buildconfig, "target/merged.pkl")
```
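The files written by `build_best` and `build_merged` are standard Python pickles. Purely as a sketch of that persistence mechanism (`ToyModel` below is a hypothetical stand-in, not a QSARtuna class), serializing and restoring a model object works like this:

```python
import io
import pickle

class ToyModel:
    """Hypothetical stand-in for a trained model object."""
    def predict(self, smiles_list):
        # Dummy "prediction": the length of each SMILES string.
        return [len(s) for s in smiles_list]

# Serialize the model to bytes, analogous to what the build steps write to disk.
buf = io.BytesIO()
pickle.dump(ToyModel(), buf)

# Later (or in another process): restore and predict.
buf.seek(0)
model = pickle.load(buf)
print(model.predict(["CCO", "c1ccccc1"]))  # → [3, 8]
```

Because the pickle stores the full model object, the QSARtuna version used to load it should match the one used to train it.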

## Running via CLI

QSARtuna can be deployed directly from the CLI.

To run commands, QSARtuna uses the following syntax:
```shell
qsartuna-<optimize|build|predict|schemagen> <command>
```

We can run the three-step process from the command line with the following command:

```shell
qsartuna-optimize \
--config examples/optimization/regression_drd2_50.json \
--best-buildconfig-outpath ~/qsartuna-target/best.json \
--best-model-outpath ~/qsartuna-target/best.pkl \
--merged-model-outpath ~/qsartuna-target/merged.pkl
```
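The referenced JSON file mirrors the Python configuration shown earlier. Purely as an illustration (the field names here are assumptions based on the Python API; consult `examples/optimization/regression_drd2_50.json` in the repository for the actual schema), such a file might look like:

```json
{
  "data": {
    "input_column": "canonical",
    "response_column": "molwt",
    "training_dataset_file": "tests/data/DRD2/subset-50/train.csv"
  },
  "descriptors": [{"name": "ECFP"}],
  "algorithms": [{"name": "RandomForest"}],
  "settings": {
    "mode": "regression",
    "cross_validation": 3,
    "n_trials": 100,
    "direction": "maximize"
  }
}
```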

Optimization accepts the following command line arguments:

```shell
qsartuna-optimize -h
usage: qsartuna-optimize [-h] --config CONFIG [--best-buildconfig-outpath BEST_BUILDCONFIG_OUTPATH] [--best-model-outpath BEST_MODEL_OUTPATH] [--merged-model-outpath MERGED_MODEL_OUTPATH] [--no-cache]
optbuild: Optimize hyper-parameters and build (train) the best model.
options:
-h, --help show this help message and exit
--best-buildconfig-outpath BEST_BUILDCONFIG_OUTPATH
Path where to write Json of the best build configuration.
--best-model-outpath BEST_MODEL_OUTPATH
Path where to write (persist) the best model.
--merged-model-outpath MERGED_MODEL_OUTPATH
Path where to write (persist) the model trained on merged train+test data.
--no-cache Turn off descriptor generation caching
required named arguments:
--config CONFIG Path to input configuration file (JSON): either Optimization configuration, or Build (training) configuration.
```

Since optimization can be a long process,
we should avoid running it on the login node,
and we should submit it to the SLURM queue instead.
We can submit our script to the queue by giving `sbatch` the following script:

```shell
# This script illustrates how to run one configuration from QSARtuna examples.
# The example we use is in examples/optimization/regression_drd2_50.json.

module load Miniconda3
conda activate my_env_with_qsartuna

# The example we chose uses relative paths to data files, change directory.
cd /{project_folder}/

/<your-project-dir>/qsartuna-optimize \
  --config {project_folder}/examples/optimization/regression_drd2_50.json \
  --best-buildconfig-outpath ~/qsartuna-target/best.json \
  --best-model-outpath ~/qsartuna-target/best.pkl \
  --merged-model-outpath ~/qsartuna-target/merged.pkl
```

When the script is complete, it will create pickled model files inside your home directory.
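The `#SBATCH` resource directives themselves are not shown above; a minimal illustrative header (all values are assumptions, adjust them for your cluster and job size) could look like:

```shell
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=12:00:00
```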

When the model is built, run inference:
```shell
qsartuna-predict \
--model-file target/merged.pkl \
--input-smiles-csv-file tests/data/DRD2/subset-50/test.csv \
--input-smiles-csv-column "canonical" \
--output-prediction-csv-file target/prediction.csv
```
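The file passed via `--input-smiles-csv-file` is a plain CSV whose SMILES column name must match `--input-smiles-csv-column`. A minimal sketch of producing such a file (the molecules and `molwt` values are illustrative only):

```python
import csv

# Toy rows mimicking the layout of tests/data/DRD2/subset-50/test.csv:
# a "canonical" SMILES column plus a response column.
rows = [
    {"canonical": "CCO", "molwt": 46.07},
    {"canonical": "c1ccccc1", "molwt": 78.11},
]

with open("toy_input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["canonical", "molwt"])
    writer.writeheader()
    writer.writerows(rows)
```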

Note that prediction accepts a variety of command line arguments:
```shell
qsartuna-predict -h
usage: qsartuna-predict [-h] --model-file MODEL_FILE [--input-smiles-csv-file INPUT_SMILES_CSV_FILE] [--input-smiles-csv-column INPUT_SMILES_CSV_COLUMN] [--input-aux-column INPUT_AUX_COLUMN]
[--input-precomputed-file INPUT_PRECOMPUTED_FILE] [--input-precomputed-input-column INPUT_PRECOMPUTED_INPUT_COLUMN]
[--input-precomputed-response-column INPUT_PRECOMPUTED_RESPONSE_COLUMN] [--output-prediction-csv-column OUTPUT_PREDICTION_CSV_COLUMN]
[--output-prediction-csv-file OUTPUT_PREDICTION_CSV_FILE] [--predict-uncertainty] [--predict-explain] [--uncertainty_quantile UNCERTAINTY_QUANTILE]

Predict responses for a given OptunaAZ model

options:
-h, --help show this help message and exit
--input-smiles-csv-file INPUT_SMILES_CSV_FILE
Name of input CSV file with Input SMILES
--input-smiles-csv-column INPUT_SMILES_CSV_COLUMN
Column name of SMILES column in input CSV file
--input-aux-column INPUT_AUX_COLUMN
Column name of auxiliary descriptors in input CSV file
--input-precomputed-file INPUT_PRECOMPUTED_FILE
Filename of precomputed descriptors input CSV file
--input-precomputed-input-column INPUT_PRECOMPUTED_INPUT_COLUMN
Column name of precomputed descriptors identifier
--input-precomputed-response-column INPUT_PRECOMPUTED_RESPONSE_COLUMN
Column name of precomputed descriptors response column
--output-prediction-csv-column OUTPUT_PREDICTION_CSV_COLUMN
Column name of prediction column in output CSV file
--output-prediction-csv-file OUTPUT_PREDICTION_CSV_FILE
Name of output CSV file
--predict-uncertainty
Predict with uncertainties (model must provide this functionality)
--predict-explain Predict with SHAP or ChemProp explainability
--uncertainty_quantile UNCERTAINTY_QUANTILE
Apply uncertainty threshold to predictions

required named arguments:
--model-file MODEL_FILE
Model file name
```
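For example, to also request uncertainty estimates (assuming the trained model supports this, per the `--predict-uncertainty` description above), the earlier prediction command can be extended:

```shell
qsartuna-predict \
  --model-file target/merged.pkl \
  --input-smiles-csv-file tests/data/DRD2/subset-50/test.csv \
  --input-smiles-csv-column "canonical" \
  --output-prediction-csv-file target/prediction.csv \
  --predict-uncertainty
```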

## Optional: inspect
To inspect the performance of the different models tried during optimization,
use the [MLFlow Tracking UI](https://www.mlflow.org/docs/latest/tracking.html):
```bash
mlflow ui
```

You can get more details by clicking individual runs.
There you can access run/trial build (training) configuration.



## Adding descriptors to QSARtuna

