Make scope pip-installable (#514)
* Initial commit: restructure, add _instantiate.py, modify pyproject.toml

* Add new config-specified args

* Update _instantiate.py with more functions

* Refactor methods, add argument parsers

* List all arguments in utils

* Update docstrings

* Install/update scope.py scripts

* Update workflows

* Update imports

* Rename scope.py to scope_class.py, update other code

* Change imports

* Update more imports in scope_class

* Update more imports

* More updated imports

* Relative imports for scope code

* Use poetry instead of pip to install scope

* Remove py311 from black

* Use poetry run

* Change package name, include config defaults

* Add initialization script, golden dataset mapper

* Install scope-download-classification, allow user-specified config

* Update docs with science user install, new usage

* Add more useful data files to package

* Include more useful files

* Refactor fritz tools with new config code

* Refactor feature code

* Add missing args to generate_features_slurm.py

* Refactor training, inference scripts

* Refactor remaining scripts

* Refactor scope_class, utils to update config reading/checking

* Standardize argument format to use hyphens, update docs

* Update gcn_cronjob.py

* Move example notebooks out of tools directory

* Delete old example notebooks

* Update pre-commit config

* Fix new linting issues, move initialization function

* Fix isinstance changes

* Fix feature generation bug

* Update package metadata, readme

* Fix typo in readme

* Separate dev requirements from others

* Reorganize requirements, update docs

* Update author list, version

* Enable --doGPU flag in scope-test

* Account for path_to_features in scope-test

* More path_to_features debugging in scope-test

* Debug GPU testing

* Fix period_suffix bug

* Debug generate-features/get-quad-ids paths

* Change logs path in training/inference slurm scripts

* Fix typos in tool.poetry.scripts

* Update readme with new repo name

* Restrict tensorflow requirements
bfhealy authored Feb 23, 2024
1 parent 8f6684e commit 66cf46f
Showing 40 changed files with 2,583 additions and 1,812 deletions.
6 changes: 3 additions & 3 deletions .pre-commit-config.yaml
@@ -1,20 +1,20 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v3.3.0
rev: v4.5.0
hooks:
- id: check-yaml
- id: end-of-file-fixer
exclude: .ipynb_checkpoints|data/Gaia_hp8_densitymap.fits|tools/classification_stats.ipynb
- id: trailing-whitespace
exclude: .ipynb_checkpoints|data/Gaia_hp8_densitymap.fits
- repo: https://github.com/python/black
rev: 22.3.0
rev: 24.2.0
hooks:
- id: black
pass_filenames: true
exclude: .ipynb_checkpoints|data|^.fits
- repo: https://github.com/pycqa/flake8
rev: 3.8.4
rev: 7.0.0
hooks:
- id: flake8
pass_filenames: true
9 changes: 0 additions & 9 deletions .requirements/dev.txt

This file was deleted.

23 changes: 0 additions & 23 deletions .requirements/doc.txt

This file was deleted.

9 changes: 5 additions & 4 deletions README.md
@@ -1,13 +1,14 @@
# SCoPe: ZTF source classification project
# SCoPe: ZTF Source Classification Project

[![arXiv](https://img.shields.io/badge/arXiv-2102.11304-brightgreen)](https://arxiv.org/abs/2102.11304)
[![arXiv](https://img.shields.io/badge/arXiv-2009.14071-brightgreen)](https://arxiv.org/abs/2009.14071)
[![arXiv](https://img.shields.io/badge/arXiv-2312.00143-brightgreen)](https://arxiv.org/abs/2312.00143)

The documentation is hosted at [https://zwickytransientfacility.github.io/scope-docs/](https://zwickytransientfacility.github.io/scope-docs/). To generate HTML files of the documentation locally, run `./scope.py doc`
`scope-ml` uses machine learning to classify light curves from the Zwicky Transient Facility ([ZTF](https://www.ztf.caltech.edu)). The documentation is hosted at [https://zwickytransientfacility.github.io/scope-docs/](https://zwickytransientfacility.github.io/scope-docs/). To generate HTML files of the documentation locally, clone the repository and run `scope-doc` after installing.

## Funding
We gratefully acknowledge previous and current support from the U.S. National Science Foundation (NSF) Harnessing the Data Revolution (HDR) Institute for <a href="https://a3d3.ai">Accelerated AI Algorithms for Data-Driven Discovery (A3D3)</a> under Cooperative Agreement No. <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2117997">PHY-2117997</a>.

<p align="center">
<img src="https://github.com/ZwickyTransientFacility/scope/blob/main/assets/a3d3.png" alt="A3D3" width="200"/>
<img src="https://github.com/ZwickyTransientFacility/scope/blob/main/assets/nsf.png" alt="NSF" width="200"/>
<img src="https://github.com/ZwickyTransientFacility/scope/raw/main/assets/a3d3.png" alt="A3D3" width="200"/>
<img src="https://github.com/ZwickyTransientFacility/scope/raw/main/assets/nsf.png" alt="NSF" width="200"/>
File renamed without changes.
9 changes: 9 additions & 0 deletions config.defaults.yaml
@@ -1731,6 +1731,15 @@ training:
eval_metric: 'auc'
early_stopping_rounds: 10
num_boost_round: 999
plot_params:
cm_include_count: False
cm_include_percent: True
annotate_scores: False
dnn:
dense_branch: True
conv_branch: True
loss: 'binary_crossentropy'
optimizer: 'adam'
classes:
# phenomenological classes
vnv:
5 changes: 5 additions & 0 deletions dev-requirements.txt
@@ -0,0 +1,5 @@
pytest>=6.1.2
pre-commit>=3.5.0
sphinx>=4.2
sphinx_press_theme>=0.8.0
poetry>=1.7.1
62 changes: 40 additions & 22 deletions doc/developer.md
@@ -1,6 +1,23 @@
# Installation/Developer Guidelines

## Initial steps
## Science users
- Create and activate a virtual/conda environment with Python 3.11, e.g.:
```shell script
conda create -n scope-env python=3.11
conda activate scope-env
```
- Install the latest release of `scope-ml` from PyPI:
```shell script
pip install scope-ml
```
- In the directory of your choice, run the initialization script. This will create the required directories and copy the necessary files to run the code:
```shell script
scope-initialize
```
- Change directories to `scope` and modify `config.yaml` to finish the initialization process. This config file is used by default when running all scripts. You can also specify another config file using the `--config-path` argument.
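
For example, a first session might look like the following sketch. The directory names and the final command are illustrative, not prescribed by the package:

```shell script
mkdir my-scope-project && cd my-scope-project
scope-initialize
cd scope
# edit config.yaml, then run any installed script;
# pass --config-path to use a config other than ./config.yaml
scope-train --tag vnv --algorithm xgb --config-path /path/to/alt_config.yaml
```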


## Developers/contributors

- Create your own fork of the [scope repository](https://github.com/ZwickyTransientFacility/scope) by clicking the "fork" button. Then, decide whether you would like to use HTTPS (easier for beginners) or SSH.
- Following one set of instructions below, clone (download) your copy of the repository, and set up a remote called `upstream` that points to the main `scope` repository.
@@ -21,9 +38,9 @@ git clone git@github.com:<yourname>/scope.git && cd scope
git remote add upstream git@github.com:ZwickyTransientFacility/scope.git
```

## Setting up your environment (Windows/Linux/macOS)
### Setting up your environment (Windows/Linux/macOS)

### Use a package manager for installation
#### Use a package manager for installation

We currently recommend running `scope` with Python 3.11. You may want to begin your installation by creating/activating a virtual environment, for example using conda. We specifically recommend installing miniforge3 (https://github.com/conda-forge/miniforge).

@@ -34,23 +51,30 @@ conda create -n scope-env -c conda-forge python=3.11
conda activate scope-env
```

### Update your `PYTHONPATH`
#### (Optional): Update your `PYTHONPATH`

Ensure that Python can import from `scope` by modifying the `PYTHONPATH` environment variable. Use a simple text editor like `nano` to modify the appropriate file (depending on which shell you are using). For example, if using bash, run `nano ~/.bash_profile` and add the following line:
If you plan to import from `scope`, ensure that Python can import from `scope` by modifying the `PYTHONPATH` environment variable. Use a simple text editor like `nano` to modify the appropriate file (depending on which shell you are using). For example, if using bash, run `nano ~/.bash_profile` and add the following line:

```bash
export PYTHONPATH="$PYTHONPATH:$HOME/scope"
```

Save the updated file (`Ctrl+O` in `nano`) and close/reopen your terminal for this change to be recognized. Then `cd` back into scope and activate your `scope-env` again.

### Install pre-commit
### Install required packages

Ensure you are in the `scope` directory that contains `pyproject.toml`. Then, install the required python packages by running:
```bash
pip install .
```

#### Install dev requirements, pre-commit hook

We use `black` to format the code and `flake8` to verify that code complies with [PEP8](https://www.python.org/dev/peps/pep-0008/).
Please install our pre-commit hook as follows:
Please install our dev requirements and pre-commit hook as follows:

```shell script
pip install pre-commit
pip install -r dev-requirements.txt
pre-commit install
```

@@ -60,14 +84,7 @@ code.

The pre-commit hook will lint *changes* made to the source.
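
To lint the whole tree rather than only staged changes, the standard `pre-commit` invocation (generic `pre-commit` usage, not scope-specific) is:

```shell script
pre-commit run --all-files
```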

## Install required packages

Install the required python packages by running:
```bash
pip install -r requirements.txt
```

### Create and modify config.yaml
#### Create and modify config.yaml

From the included config.defaults.yaml, make a copy called config.yaml:

@@ -77,14 +94,15 @@ cp config.defaults.yaml config.yaml

Edit config.yaml to include Kowalski instance and Fritz tokens in the associated empty `token:` fields.
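
As a rough sketch, the token entries look something like the following. This layout is illustrative only; the exact nesting is defined by `config.defaults.yaml`:

```yaml
# illustrative only -- follow the structure in config.defaults.yaml
kowalski:
  token: <your-kowalski-token>
fritz:
  token: <your-fritz-token>
```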

### Testing
Run `./scope.py test` to test your installation. Note that for the test to pass, you will need access to the Kowalski database. If you do not have Kowalski access, you can run `./scope.py test_limited` to run a more limited (but still useful) set of tests.
#### Testing
Run `scope-test` to test your installation. Note that for the test to pass, you will need access to the Kowalski database. If you do not have Kowalski access, you can run `scope-test-limited` to run a more limited (but still useful) set of tests.
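
For example:

```shell script
scope-test
# scope-test also accepts a --doGPU flag (enabled in this commit)
# or, without Kowalski access:
scope-test-limited
```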

### Troubleshooting
Upon encountering installation/testing errors, manually install the package in question using `conda install xxx`, and remove it from `.requirements/dev.txt`. After that, re-run `pip install -r requirements.txt` to continue.

### Known issues
- Across all platforms, we are currently aware of `scope` dependency issues with Python 3.11.
#### Known issues
- If using GPU-accelerated period-finding algorithms for feature generation, you will need to install [periodfind](https://github.com/ZwickyTransientFacility/periodfind) separately from source.
- Across all platforms, we are currently aware of `scope` dependency issues with Python 3.12.
- Anaconda continues to cause problems with environment setup.
- Using `pip` to install `healpy` on an arm64 Mac can raise an error upon import. We recommend including `h5py` as a requirement during the creation of your `conda` environment.
- On Windows machines, `healpy` and `cesium` raise errors upon installation.
@@ -93,7 +111,7 @@ Upon encountering installation/testing errors, manually install the package in q

If the installation continues to raise errors, update the conda environment and try again.

## How to contribute
### How to contribute

Contributions to `scope` are made through [GitHub Pull Requests](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests), a set of proposed commits (or patches):

@@ -144,7 +162,7 @@ Developers may merge `main` into their branch as many times as they want to.

1. Once the pull request has been reviewed and approved by at least one team member, it will be merged into `scope`.

## Contributing Field Guide sections
### Contributing Field Guide sections

If you would like to contribute a Field Guide section, please follow the steps below.

32 changes: 17 additions & 15 deletions doc/quickstart.md
@@ -1,16 +1,18 @@
# Quick Start Guide

This guide is intended to facilitate quick interactions with SCoPe code after you have completed the **Installation/Developer Guidelines** section. More detailed usage info can be found in the **Usage** section. **All of the following examples assume that SCoPe is installed in your home directory. If the `scope` directory is located elsewhere, adjust the example code as necessary.**
This guide is intended to facilitate quick interactions with SCoPe code after you have completed the **Installation/Developer Guidelines** section. More detailed usage info can be found in the **Usage** section.

## Modify `config.yaml`
To start out, provide SCoPe with your training set's filepath using the `training:` `dataset:` field in `config.yaml`. The path should be relative to the `scope` directory. For example, if your training set `trainingSet.parquet` is within the `tools` directory (which is itself within `scope`), provide `tools/trainingSet.parquet` in the `dataset:` field.
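
A minimal sketch of the corresponding config entry, using the example path above:

```
training:
  dataset: tools/trainingSet.parquet
```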

When running scripts, `scope` will by default use the `config.yaml` file in your current directory. You can specify a different config file by providing its path to any installed script using the `--config-path` argument.

## Training

Train an XGBoost binary classifier using the following code:

```
./scope.py train --tag=vnv --algorithm=xgb --group=ss23 --period_suffix=ELS_ECE_EAOV --epochs=30 --verbose --save --plot --skip_cv
scope-train --tag vnv --algorithm xgb --group ss23 --period-suffix ELS_ECE_EAOV --epochs 30 --verbose --save --plot --skip-cv
```

### Arguments:
@@ -20,34 +22,34 @@ Train an XGBoost binary classifier using the following code:

`--group`: if `--save` is passed, training results are saved to the group/directory named here.

`--period_suffix`: SCoPe determines light curve periods using GPU-accelerated algorithms. These algorithms include a Lomb-Scargle approach (ELS), Conditional Entropy (ECE), Analysis of Variance (AOV), and an approach nesting all three (ELS_ECE_EAOV). Periodic features are stored with the suffix specified here.
`--period-suffix`: SCoPe determines light curve periods using GPU-accelerated algorithms. These algorithms include a Lomb-Scargle approach (ELS), Conditional Entropy (ECE), Analysis of Variance (AOV), and an approach nesting all three (ELS_ECE_EAOV). Periodic features are stored with the suffix specified here.

`--min_count`: requires at least min_count positive examples to run training.
`--min-count`: requires at least min_count positive examples to run training.

`--epochs`: the number of epochs for neural network training (set to 30 here).

***Notes:***
- *The above training runs the XGB algorithm by default and skips cross-validation in the interest of time. For a full run, you can remove the `--skip_cv` argument to run a cross-validated grid search of XGB hyperparameters during training.*
- *The above training runs the XGB algorithm by default and skips cross-validation in the interest of time. For a full run, you can remove the `--skip-cv` argument to run a cross-validated grid search of XGB hyperparameters during training.*

- *DNN hyperparameters are optimized using a different approach - Weights and Biases Sweeps (https://docs.wandb.ai/guides/sweeps). The results of these sweeps are the default hyperparameters in the config file. To run another round of sweeps for DNN, create a WandB account and set the `--run_sweeps` keyword in the call to `scope.py train`.*
- *DNN hyperparameters are optimized using a different approach - Weights and Biases Sweeps (https://docs.wandb.ai/guides/sweeps). The results of these sweeps are the default hyperparameters in the config file. To run another round of sweeps for DNN, create a WandB account and set the `--run-sweeps` keyword in the call to `scope-train`.*

- *SCoPe DNN training does not provide feature importance information (due to the hidden layers of the network). Feature importance is possible to estimate for neural networks, but it is more computationally expensive compared to this "free" information from XGB.*

### Train multiple classifiers with one script

Create a shell script that contains multiple calls to `scope.py train`:
Create a shell script that contains multiple calls to `scope-train`:
```
./scope.py create_training_script --filename=train_xgb.sh --min_count=1000 --algorithm=xgb --period_suffix=ELS_ECE_EAOV --add_keywords="--save --plot --group=ss23 --epochs=30 --skip_cv"
create-training-script --filename train_xgb.sh --min-count 1000 --algorithm xgb --period-suffix ELS_ECE_EAOV --add-keywords "--save --plot --group ss23 --epochs 30 --skip-cv"
```

Modify the permissions of this script by running `chmod +x train_xgb.sh`. Run the generated training script in a terminal window (using e.g. `./train_xgb.sh`) to train multiple label sequentially.
Modify the permissions of this script by running `chmod +x train_xgb.sh`. Run the generated training script in a terminal window (using e.g. `./train_xgb.sh`) to train multiple classifiers sequentially.
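
The generated script is plain shell, with one `scope-train` call per label. A sketch of what `train_xgb.sh` might contain (the `vnv` tag comes from the example above; the second tag and the exact argument order are illustrative):

```
#!/bin/bash
scope-train --tag vnv --algorithm xgb --period-suffix ELS_ECE_EAOV --save --plot --group ss23 --epochs 30 --skip-cv
scope-train --tag pnp --algorithm xgb --period-suffix ELS_ECE_EAOV --save --plot --group ss23 --epochs 30 --skip-cv
```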

***Note:***
- *The code will throw an error if the training script filename already exists.*
- *The code will raise an error if the training script filename already exists.*

### Running training on HPC resources

`train_algorithm_slurm.py` and `train_algorithm_job_submission.py` can be used to generate and submit `slurm` scripts to train all classifiers in parallel using HPC resources.
`train-algorithm-slurm` and `train-algorithm-job-submission` can be used to generate and submit `slurm` scripts to train all classifiers in parallel using HPC resources.

## Plotting Classifier Performance
SCoPe saves diagnostic plots and json files to report each classifier's performance. The below code shows the location of the validation set results for one classifier.
@@ -82,10 +84,10 @@ This code may also be placed in a loop over multiple labels to compare each clas

## Inference

Use `tools/inference.py` to run inference on a field (297) of features (within a directory called `generated_features`). The classifiers used for this inference are within the `ss23` directory/group specified during training.
Use `run-inference` to run inference on a field (297) of features (in this example, located in a directory called `generated_features`). The classifiers used for this inference are within the `ss23` directory/group specified during training.

```
./scope.py create_inference_script --filename=get_all_preds_xgb.sh --group_name=ss23 --algorithm=xgb --period_suffix=ELS_ECE_EAOV --feature_directory=generated_features
create-inference-script --filename get_all_preds_xgb.sh --group-name ss23 --algorithm xgb --period-suffix ELS_ECE_EAOV --feature-directory generated_features
```

Modify the permissions of this script using `chmod +x get_all_preds_xgb.sh`, then run on the desired field:
@@ -94,12 +96,12 @@ Modify the permissions of this script using `chmod +x get_all_preds_xgb.sh`, the
```
./get_all_preds_xgb.sh 297
```

***Notes:***
- *`scope.py create_inference_script` will throw an error if the inference script filename already exists.*
- *`create-inference-script` will raise an error if the inference script filename already exists.*
- *Inference begins by imputing missing features using the strategies specified in the `features:` section of the config file.*

### Running inference on HPC resources

`run_inference_slurm.py` and `run_inference_job_submission.py` can be used to generate and submit `slurm` scripts to run inference for all classifiers in parallel using HPC resources.
`run-inference-slurm` and `run-inference-job-submission` can be used to generate and submit `slurm` scripts to run inference for all classifiers in parallel using HPC resources.

## Examining predictions
