diff --git a/README.md b/README.md
index a0e2148..169e005 100644
--- a/README.md
+++ b/README.md
@@ -135,26 +135,129 @@ and optimization is free to pair any specified descriptor with any of the algori
 When we have our data and our configuration, it is time to start the optimization.
 
-### Running via singulartity
-QSARtuna can be deployed using [Singularity](https://sylabs.io/guides/3.7/user-guide/index.html) container.
+## Run from Python/Jupyter Notebook
+
+Create a conda environment with Jupyter and install QSARtuna there:
+```shell
+module purge
+module load Miniconda3
+conda create --name my_env_with_qsartuna python=3.10.10 jupyter pip
+conda activate my_env_with_qsartuna
+module purge  # Just in case.
+which python  # Check. Should output path that contains "my_env_with_qsartuna".
+python -m pip install https://github.com/MolecularAI/QSARtuna/releases/download/3.1.1/qsartuna-3.1.1.tar.gz
+```
+
+Then you can use QSARtuna inside your Notebook:
+```python
+from qsartuna.three_step_opt_build_merge import (
+    optimize,
+    buildconfig_best,
+    build_best,
+    build_merged,
+)
+from qsartuna.config import ModelMode, OptimizationDirection
+from qsartuna.config.optconfig import (
+    OptimizationConfig,
+    SVR,
+    RandomForest,
+    Ridge,
+    Lasso,
+    PLS,
+    XGBregressor,
+)
+from qsartuna.datareader import Dataset
+from qsartuna.descriptors import ECFP, MACCS_keys, ECFP_counts
+
+##
+# Prepare hyperparameter optimization configuration.
+config = OptimizationConfig(
+    data=Dataset(
+        input_column="canonical",
+        response_column="molwt",
+        training_dataset_file="tests/data/DRD2/subset-50/train.csv",
+    ),
+    descriptors=[ECFP.new(), ECFP_counts.new(), MACCS_keys.new()],
+    algorithms=[
+        SVR.new(),
+        RandomForest.new(),
+        Ridge.new(),
+        Lasso.new(),
+        PLS.new(),
+        XGBregressor.new(),
+    ],
+    settings=OptimizationConfig.Settings(
+        mode=ModelMode.REGRESSION,
+        cross_validation=3,
+        n_trials=100,
+        direction=OptimizationDirection.MAXIMIZATION,
+    ),
+)
+
+##
+# Run Optuna Study.
+study = optimize(config, study_name="my_study")
+
+##
+# Get the best Trial from the Study and make a Build (Training) configuration for it.
+buildconfig = buildconfig_best(study)
+# Optional: write out JSON of the best configuration.
+import json
+print(json.dumps(buildconfig.json(), indent=2))
+
+##
+# Build (re-Train) and save the best model.
+build_best(buildconfig, "target/best.pkl")
+
+##
+# Build (Train) and save the model on the merged train+test data.
+build_merged(buildconfig, "target/merged.pkl")
+```
+
+## Running via CLI
+
+QSARtuna can be deployed directly from the CLI.
+
-To run commands inside the container, Singularity uses the following syntax:
+To run commands, QSARtuna uses the following syntax:
 
 ```shell
-singularity exec <container.sif> <command_inside_container>
+qsartuna-<command>
 ```
 
 We can run the three-step process from the command line with the following command:
 
 ```shell
-singularity exec /projects/cc/mai/containers/QSARtuna_latest.sif \
-  /opt/qsartuna/.venv/bin/qsartuna-optimize \
+qsartuna-optimize \
   --config examples/optimization/regression_drd2_50.json \
   --best-buildconfig-outpath ~/qsartuna-target/best.json \
   --best-model-outpath ~/qsartuna-target/best.pkl \
   --merged-model-outpath ~/qsartuna-target/merged.pkl
 ```
 
+Optimization accepts the following command line arguments:
+
+```shell
+qsartuna-optimize -h
+usage: qsartuna-optimize [-h] --config CONFIG [--best-buildconfig-outpath BEST_BUILDCONFIG_OUTPATH] [--best-model-outpath BEST_MODEL_OUTPATH] [--merged-model-outpath MERGED_MODEL_OUTPATH] [--no-cache]
+
+optbuild: Optimize hyper-parameters and build (train) the best model.
+
+options:
+  -h, --help            show this help message and exit
+  --best-buildconfig-outpath BEST_BUILDCONFIG_OUTPATH
+                        Path where to write Json of the best build configuration.
+  --best-model-outpath BEST_MODEL_OUTPATH
+                        Path where to write (persist) the best model.
+  --merged-model-outpath MERGED_MODEL_OUTPATH
+                        Path where to write (persist) the model trained on merged train+test data.
+  --no-cache            Turn off descriptor generation caching
+
+required named arguments:
+  --config CONFIG       Path to input configuration file (JSON): either Optimization configuration, or Build (training) configuration.
+
+```
+
 Since optimization can be a long process,
 we should avoid running it on the login node,
 and we should submit it to the SLURM queue instead.
@@ -176,13 +279,14 @@ We can submit our script to the queue by giving `sbatch` the following script:
 
 # This script illustrates how to run one configuration from QSARtuna examples.
 # The example we use is in examples/optimization/regression_drd2_50.json.
 
+module load Miniconda3
+conda activate my_env_with_qsartuna
+
 # The example we chose uses relative paths to data files, change directory.
-cd /{project_folder}/OptunaAZ-versions/OptunaAZ_latest
+cd /{project_folder}/
 
-singularity exec \
-  /{project_folder}/containers/QSARtuna_latest.sif \
-  /opt/qsartuna/.venv/bin/qsartuna-optimize \
-  --config{project_folder}/examples/optimization/regression_drd2_50.json \
+qsartuna-optimize \
+  --config {project_folder}/examples/optimization/regression_drd2_50.json \
   --best-buildconfig-outpath ~/qsartuna-target/best.json \
   --best-model-outpath ~/qsartuna-target/best.pkl \
   --merged-model-outpath ~/qsartuna-target/merged.pkl
@@ -195,33 +299,54 @@ When the script is complete, it will create pickled model files inside your home
 
 When the model is built, run inference:
 
 ```shell
-singularity exec /{project_folder}/containers/QSARtuna_latest.sif \
-  /opt/qsartuna/.venv/bin/qsartuna-predict \
+qsartuna-predict \
   --model-file target/merged.pkl \
   --input-smiles-csv-file tests/data/DRD2/subset-50/test.csv \
   --input-smiles-csv-column "canonical" \
   --output-prediction-csv-file target/prediction.csv
 ```
 
-Note that QSARtuna_latest.sif points to the most recent version of QSARtuna.
-
-Legacy models require the inference with the same QSARtuna version used to train the model.
-This can be specified by modifying the above command and supplying -`/projects/cc/mai/containers/QSARtuna_.sif` (replace with the version of QSARtuna). - -E.g: +Note that prediction accepts a variety of command line arguments: ```shell -singularity exec /{project_folder}/containers/QSARtuna_2.5.1.sif \ - /opt/qsartuna/.venv/bin/qsartuna-predict \ - --model-file 2.5.1_model.pkl \ - --input-smiles-csv-file tests/data/DRD2/subset-50/test.csv \ - --input-smiles-csv-column "canonical" \ - --output-prediction-csv-file target/prediction.csv + qsartuna-predict -h +usage: qsartuna-predict [-h] --model-file MODEL_FILE [--input-smiles-csv-file INPUT_SMILES_CSV_FILE] [--input-smiles-csv-column INPUT_SMILES_CSV_COLUMN] [--input-aux-column INPUT_AUX_COLUMN] + [--input-precomputed-file INPUT_PRECOMPUTED_FILE] [--input-precomputed-input-column INPUT_PRECOMPUTED_INPUT_COLUMN] + [--input-precomputed-response-column INPUT_PRECOMPUTED_RESPONSE_COLUMN] [--output-prediction-csv-column OUTPUT_PREDICTION_CSV_COLUMN] + [--output-prediction-csv-file OUTPUT_PREDICTION_CSV_FILE] [--predict-uncertainty] [--predict-explain] [--uncertainty_quantile UNCERTAINTY_QUANTILE] + +Predict responses for a given OptunaAZ model + +options: + -h, --help show this help message and exit + --input-smiles-csv-file INPUT_SMILES_CSV_FILE + Name of input CSV file with Input SMILES + --input-smiles-csv-column INPUT_SMILES_CSV_COLUMN + Column name of SMILES column in input CSV file + --input-aux-column INPUT_AUX_COLUMN + Column name of auxiliary descriptors in input CSV file + --input-precomputed-file INPUT_PRECOMPUTED_FILE + Filename of precomputed descriptors input CSV file + --input-precomputed-input-column INPUT_PRECOMPUTED_INPUT_COLUMN + Column name of precomputed descriptors identifier + --input-precomputed-response-column INPUT_PRECOMPUTED_RESPONSE_COLUMN + Column name of precomputed descriptors response column + --output-prediction-csv-column OUTPUT_PREDICTION_CSV_COLUMN + Column name of prediction column 
in output CSV file + --output-prediction-csv-file OUTPUT_PREDICTION_CSV_FILE + Name of output CSV file + --predict-uncertainty + Predict with uncertainties (model must provide this functionality) + --predict-explain Predict with SHAP or ChemProp explainability + --uncertainty_quantile UNCERTAINTY_QUANTILE + Apply uncertainty threshold to predictions + +required named arguments: + --model-file MODEL_FILE + Model file name ``` -would generate predictions for a model trained with QSARtuna 2.5.1. -### Optional: inspect +## Optional: inspect To inspect performance of different models tried during optimization, use [MLFlow Tracking UI](https://www.mlflow.org/docs/latest/tracking.html): ```bash @@ -258,86 +383,6 @@ You can get more details by clicking individual runs. There you can access run/trial build (training) configuration. -## Run from Python/Jupyter Notebook - -Create conda environment with Jupyter and Install QSARtuna there: -```shell -module purge -module load Miniconda3 -conda create --name my_env_with_qsartuna python=3.10.10 jupyter pip -conda activate my_env_with_qsartuna -module purge # Just in case. -which python # Check. Should output path that contains "my_env_with_qsartuna". -python -m pip install https://github.com/MolecularAI/QSARtuna/releases/download/3.1.1/qsartuna-3.1.1.tar.gz -``` - -Then you can use QSARtuna inside your Notebook: -```python -from qsartuna.three_step_opt_build_merge import ( - optimize, - buildconfig_best, - build_best, - build_merged, -) -from qsartuna.config import ModelMode, OptimizationDirection -from qsartuna.config.optconfig import ( - OptimizationConfig, - SVR, - RandomForest, - Ridge, - Lasso, - PLS, - XGBregressor, -) -from qsartuna.datareader import Dataset -from qsartuna.descriptors import ECFP, MACCS_keys, ECFP_counts - -## -# Prepare hyperparameter optimization configuration. 
-config = OptimizationConfig( - data=Dataset( - input_column="canonical", - response_column="molwt", - training_dataset_file="tests/data/DRD2/subset-50/train.csv", - ), - descriptors=[ECFP.new(), ECFP_counts.new(), MACCS_keys.new()], - algorithms=[ - SVR.new(), - RandomForest.new(), - Ridge.new(), - Lasso.new(), - PLS.new(), - XGBregressor.new(), - ], - settings=OptimizationConfig.Settings( - mode=ModelMode.REGRESSION, - cross_validation=3, - n_trials=100, - direction=OptimizationDirection.MAXIMIZATION, - ), -) - -## -# Run Optuna Study. -study = optimize(config, study_name="my_study") - -## -# Get the best Trial from the Study and make a Build (Training) configuration for it. -buildconfig = buildconfig_best(study) -# Optional: write out JSON of the best configuration. -import json -print(json.dumps(buildconfig.json(), indent=2)) - -## -# Build (re-Train) and save the best model. -build_best(buildconfig, "target/best.pkl") - -## -# Build (Train) and save the model on the merged train+test data. -build_merged(buildconfig, "target/merged.pkl") -``` - - ## Adding descriptors to QSARtuna
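
The `qsartuna-predict` help text added in this diff documents `--predict-uncertainty` and `--uncertainty_quantile`, but no example invocation uses them. A hedged sketch, reusing the model and input paths from the inference example above; it assumes the model in `target/merged.pkl` was trained with an algorithm that provides uncertainty estimation, as the help text requires:

```shell
# Sketch only: --predict-uncertainty needs a model whose algorithm
# supports uncertainty estimation; paths follow the examples above.
qsartuna-predict \
  --model-file target/merged.pkl \
  --input-smiles-csv-file tests/data/DRD2/subset-50/test.csv \
  --input-smiles-csv-column "canonical" \
  --output-prediction-csv-file target/prediction.csv \
  --predict-uncertainty
```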