Make scope pip-installable (#514)
* Initial commit: restructure, add _instantiate.py, modify pyproject.toml

* Add new config-specified args

* Update _instantiate.py with more functions

* Refactor methods, add argument parsers

* List all arguments in utils

* Update docstrings

* Install/update scope.py scripts

* Update workflows

* Update imports

* Rename scope.py to scope_class.py, update other code

* Change imports

* Update more imports in scope_class

* Update more imports

* More updated imports

* Relative imports for scope code

* Use poetry instead of pip to install scope

* Remove py311 from black

* Use poetry run

* Change package name, include config defaults

* Add initialization script, golden dataset mapper

* Install scope-download-classification, allow user-specified config

* Update docs with science user install, new usage

* Add more useful data files to package

* Include more useful files

* Refactor fritz tools with new config code

* Refactor feature code

* Add missing args to generate_features_slurm.py

* Refactor training, inference scripts

* Refactor remaining scripts

* Refactor scope_class, utils to update config reading/checking

* Standardize argument format to use hyphens, update docs

* Update gcn_cronjob.py

* Move example notebooks out of tools directory

* Delete old example notebooks

* Update pre-commit config

* Fix new linting issues, move initialization function

* Fix isinstance changes

* Fix feature generation bug

* Update package metadata, readme

* Fix typo in readme

* Separate dev requirements from others

* Reorganize requirements, update docs

* Update author list, version

* Enable --doGPU flag in scope-test

* Account for path_to_features in scope-test

* More path_to_features debugging in scope-test

* Debug GPU testing

* Fix period_suffix bug

* Debug generate-features/get-quad-ids paths

* Change logs path in training/inference slurm scripts

* Fix typos in tool.poetry.scripts

* Update readme with new repo name

* Restrict tensorflow requirements
bfhealy authored Feb 23, 2024
1 parent 8f6684e commit 66cf46f
Showing 40 changed files with 2,583 additions and 1,812 deletions.
6 changes: 3 additions & 3 deletions .pre-commit-config.yaml
@@ -1,20 +1,20 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v3.3.0
rev: v4.5.0
hooks:
- id: check-yaml
- id: end-of-file-fixer
exclude: .ipynb_checkpoints|data/Gaia_hp8_densitymap.fits|tools/classification_stats.ipynb
- id: trailing-whitespace
exclude: .ipynb_checkpoints|data/Gaia_hp8_densitymap.fits
- repo: https://github.com/python/black
rev: 22.3.0
rev: 24.2.0
hooks:
- id: black
pass_filenames: true
exclude: .ipynb_checkpoints|data|^.fits
- repo: https://github.com/pycqa/flake8
rev: 3.8.4
rev: 7.0.0
hooks:
- id: flake8
pass_filenames: true
9 changes: 0 additions & 9 deletions .requirements/dev.txt

This file was deleted.

23 changes: 0 additions & 23 deletions .requirements/doc.txt

This file was deleted.

9 changes: 5 additions & 4 deletions README.md
@@ -1,13 +1,14 @@
# SCoPe: ZTF source classification project
# SCoPe: ZTF Source Classification Project

[![arXiv](https://img.shields.io/badge/arXiv-2102.11304-brightgreen)](https://arxiv.org/abs/2102.11304)
[![arXiv](https://img.shields.io/badge/arXiv-2009.14071-brightgreen)](https://arxiv.org/abs/2009.14071)
[![arXiv](https://img.shields.io/badge/arXiv-2312.00143-brightgreen)](https://arxiv.org/abs/2312.00143)

The documentation is hosted at [https://zwickytransientfacility.github.io/scope-docs/](https://zwickytransientfacility.github.io/scope-docs/). To generate HTML files of the documentation locally, run `./scope.py doc`
`scope-ml` uses machine learning to classify light curves from the Zwicky Transient Facility ([ZTF](https://www.ztf.caltech.edu)). The documentation is hosted at [https://zwickytransientfacility.github.io/scope-docs/](https://zwickytransientfacility.github.io/scope-docs/). To generate HTML files of the documentation locally, clone the repository and run `scope-doc` after installing.

## Funding
We gratefully acknowledge previous and current support from the U.S. National Science Foundation (NSF) Harnessing the Data Revolution (HDR) Institute for <a href="https://a3d3.ai">Accelerated AI Algorithms for Data-Driven Discovery (A3D3)</a> under Cooperative Agreement No. <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2117997">PHY-2117997</a>.

<p align="center">
<img src="https://github.com/ZwickyTransientFacility/scope/blob/main/assets/a3d3.png" alt="A3D3" width="200"/>
<img src="https://github.com/ZwickyTransientFacility/scope/blob/main/assets/nsf.png" alt="NSF" width="200"/>
<img src="https://github.com/ZwickyTransientFacility/scope/raw/main/assets/a3d3.png" alt="A3D3" width="200"/>
<img src="https://github.com/ZwickyTransientFacility/scope/raw/main/assets/nsf.png" alt="NSF" width="200"/>
File renamed without changes.
9 changes: 9 additions & 0 deletions config.defaults.yaml
@@ -1731,6 +1731,15 @@ training:
eval_metric: 'auc'
early_stopping_rounds: 10
num_boost_round: 999
plot_params:
cm_include_count: False
cm_include_percent: True
annotate_scores: False
dnn:
dense_branch: True
conv_branch: True
loss: 'binary_crossentropy'
optimizer: 'adam'
classes:
# phenomenological classes
vnv:
5 changes: 5 additions & 0 deletions dev-requirements.txt
@@ -0,0 +1,5 @@
pytest>=6.1.2
pre-commit>=3.5.0
sphinx>=4.2
sphinx_press_theme>=0.8.0
poetry>=1.7.1
62 changes: 40 additions & 22 deletions doc/developer.md
@@ -1,6 +1,23 @@
# Installation/Developer Guidelines

## Initial steps
## Science users
- Create and activate a virtual/conda environment with Python 3.11, e.g.:
```shell script
conda create -n scope-env python=3.11
conda activate scope-env
```
- Install the latest release of `scope-ml` from PyPI:
```shell script
pip install scope-ml
```
- In the directory of your choice, run the initialization script. This will create the required directories and copy the necessary files to run the code:
```shell script
scope-initialize
```
- Change directories to `scope` and modify `config.yaml` to finish the initialization process. This config file is used by default when running all scripts. You can also specify another config file using the `--config-path` argument.
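
For example, a first session might look like the following sketch. The directory names and the final command are illustrative, not prescribed by the package:

```shell script
mkdir my-scope-project && cd my-scope-project
scope-initialize
cd scope
# edit config.yaml, then run any installed script;
# pass --config-path to use a config other than ./config.yaml
scope-train --tag vnv --algorithm xgb --config-path /path/to/alt_config.yaml
```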


## Developers/contributors

- Create your own fork of the [scope repository](https://github.com/ZwickyTransientFacility/scope) by clicking the "fork" button. Then, decide whether you would like to use HTTPS (easier for beginners) or SSH.
- Following one set of instructions below, clone (download) your copy of the repository, and set up a remote called `upstream` that points to the main `scope` repository.
@@ -21,9 +38,9 @@ git clone git@github.com:<yourname>/scope.git && cd scope
git remote add upstream git@github.com:ZwickyTransientFacility/scope.git
```

## Setting up your environment (Windows/Linux/macOS)
### Setting up your environment (Windows/Linux/macOS)

### Use a package manager for installation
#### Use a package manager for installation

We currently recommend running `scope` with Python 3.11. You may want to begin your installation by creating/activating a virtual environment, for example using conda. We specifically recommend installing miniforge3 (https://github.com/conda-forge/miniforge).

@@ -34,23 +51,30 @@ conda create -n scope-env -c conda-forge python=3.11
conda activate scope-env
```

### Update your `PYTHONPATH`
#### (Optional): Update your `PYTHONPATH`

Ensure that Python can import from `scope` by modifying the `PYTHONPATH` environment variable. Use a simple text editor like `nano` to modify the appropriate file (depending on which shell you are using). For example, if using bash, run `nano ~/.bash_profile` and add the following line:
If you plan to import from `scope`, ensure that Python can import from `scope` by modifying the `PYTHONPATH` environment variable. Use a simple text editor like `nano` to modify the appropriate file (depending on which shell you are using). For example, if using bash, run `nano ~/.bash_profile` and add the following line:

```bash
export PYTHONPATH="$PYTHONPATH:$HOME/scope"
```

Save the updated file (`Ctrl+O` in `nano`) and close/reopen your terminal for this change to be recognized. Then `cd` back into scope and activate your `scope-env` again.

### Install pre-commit
### Install required packages

Ensure you are in the `scope` directory that contains `pyproject.toml`. Then, install the required python packages by running:
```bash
pip install .
```

#### Install dev requirements, pre-commit hook

We use `black` to format the code and `flake8` to verify that code complies with [PEP8](https://www.python.org/dev/peps/pep-0008/).
Please install our pre-commit hook as follows:
Please install our dev requirements and pre-commit hook as follows:

```shell script
pip install pre-commit
pip install -r dev-requirements.txt
pre-commit install
```

@@ -60,14 +84,7 @@ code.

The pre-commit hook will lint *changes* made to the source.
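
To lint the whole tree rather than only staged changes, the standard `pre-commit` invocation (generic `pre-commit` usage, not scope-specific) is:

```shell script
pre-commit run --all-files
```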

## Install required packages

Install the required python packages by running:
```bash
pip install -r requirements.txt
```

### Create and modify config.yaml
#### Create and modify config.yaml

From the included config.defaults.yaml, make a copy called config.yaml:

@@ -77,14 +94,15 @@ cp config.defaults.yaml config.yaml

Edit config.yaml to include Kowalski instance and Fritz tokens in the associated empty `token:` fields.
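
As a rough sketch, the token entries look something like the following. This layout is illustrative only; the exact nesting is defined by `config.defaults.yaml`:

```yaml
# illustrative only -- follow the structure in config.defaults.yaml
kowalski:
  token: <your-kowalski-token>
fritz:
  token: <your-fritz-token>
```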

### Testing
Run `./scope.py test` to test your installation. Note that for the test to pass, you will need access to the Kowalski database. If you do not have Kowalski access, you can run `./scope.py test_limited` to run a more limited (but still useful) set of tests.
#### Testing
Run `scope-test` to test your installation. Note that for the test to pass, you will need access to the Kowalski database. If you do not have Kowalski access, you can run `scope-test-limited` to run a more limited (but still useful) set of tests.
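
For example:

```shell script
scope-test
# scope-test also accepts a --doGPU flag (enabled in this commit)
# or, without Kowalski access:
scope-test-limited
```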

### Troubleshooting
Upon encountering installation/testing errors, manually install the package in question using `conda install xxx`, and remove it from `.requirements/dev.txt`. After that, re-run `pip install -r requirements.txt` to continue.

### Known issues
- Across all platforms, we are currently aware of `scope` dependency issues with Python 3.11.
#### Known issues
- If using GPU-accelerated period-finding algorithms for feature generation, you will need to install [periodfind](https://github.com/ZwickyTransientFacility/periodfind) separately from source.
- Across all platforms, we are currently aware of `scope` dependency issues with Python 3.12.
- Anaconda continues to cause problems with environment setup.
- Using `pip` to install `healpy` on an arm64 Mac can raise an error upon import. We recommend including `h5py` as a requirement during the creation of your `conda` environment.
- On Windows machines, `healpy` and `cesium` raise errors upon installation.
@@ -93,7 +111,7 @@ Upon encountering installation/testing errors, manually install the package in q

If the installation continues to raise errors, update the conda environment and try again.

## How to contribute
### How to contribute

Contributions to `scope` are made through [GitHub Pull Requests](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests), a set of proposed commits (or patches):

@@ -144,7 +162,7 @@ Developers may merge `main` into their branch as many times as they want to.

1. Once the pull request has been reviewed and approved by at least one team member, it will be merged into `scope`.

## Contributing Field Guide sections
### Contributing Field Guide sections

If you would like to contribute a Field Guide section, please follow the steps below.

32 changes: 17 additions & 15 deletions doc/quickstart.md
@@ -1,16 +1,18 @@
# Quick Start Guide

This guide is intended to facilitate quick interactions with SCoPe code after you have completed the **Installation/Developer Guidelines** section. More detailed usage info can be found in the **Usage** section. **All of the following examples assume that SCoPe is installed in your home directory. If the `scope` directory is located elsewhere, adjust the example code as necessary.**
This guide is intended to facilitate quick interactions with SCoPe code after you have completed the **Installation/Developer Guidelines** section. More detailed usage info can be found in the **Usage** section.

## Modify `config.yaml`
To start out, provide SCoPe with your training set's filepath using the `training:` `dataset:` field in `config.yaml`. The path should be relative to the `scope` directory. For example, if your training set `trainingSet.parquet` is within the `tools` directory (which is itself within `scope`), provide `tools/trainingSet.parquet` in the `dataset:` field.
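
A minimal sketch of the corresponding config entry, using the example path above:

```
training:
  dataset: tools/trainingSet.parquet
```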

When running scripts, `scope` will by default use the `config.yaml` file in your current directory. You can specify a different config file by providing its path to any installed script using the `--config-path` argument.

## Training

Train an XGBoost binary classifier using the following code:

```
./scope.py train --tag=vnv --algorithm=xgb --group=ss23 --period_suffix=ELS_ECE_EAOV --epochs=30 --verbose --save --plot --skip_cv
scope-train --tag vnv --algorithm xgb --group ss23 --period-suffix ELS_ECE_EAOV --epochs 30 --verbose --save --plot --skip-cv
```

### Arguments:
@@ -20,34 +22,34 @@ Train an XGBoost binary classifier using the following code:

`--group`: if `--save` is passed, training results are saved to the group/directory named here.

`--period_suffix`: SCoPe determines light curve periods using GPU-accelerated algorithms. These algorithms include a Lomb-Scargle approach (ELS), Conditional Entropy (ECE), Analysis of Variance (AOV), and an approach nesting all three (ELS_ECE_EAOV). Periodic features are stored with the suffix specified here.
`--period-suffix`: SCoPe determines light curve periods using GPU-accelerated algorithms. These algorithms include a Lomb-Scargle approach (ELS), Conditional Entropy (ECE), Analysis of Variance (AOV), and an approach nesting all three (ELS_ECE_EAOV). Periodic features are stored with the suffix specified here.

`--min_count`: requires at least min_count positive examples to run training.
`--min-count`: requires at least min_count positive examples to run training.

`--epochs`: the number of epochs for neural network training (set to 30 here).

***Notes:***
- *The above training runs the XGB algorithm by default and skips cross-validation in the interest of time. For a full run, you can remove the `--skip_cv` argument to run a cross-validated grid search of XGB hyperparameters during training.*
- *The above training runs the XGB algorithm by default and skips cross-validation in the interest of time. For a full run, you can remove the `--skip-cv` argument to run a cross-validated grid search of XGB hyperparameters during training.*

- *DNN hyperparameters are optimized using a different approach - Weights and Biases Sweeps (https://docs.wandb.ai/guides/sweeps). The results of these sweeps are the default hyperparameters in the config file. To run another round of sweeps for DNN, create a WandB account and set the `--run_sweeps` keyword in the call to `scope.py train`.*
- *DNN hyperparameters are optimized using a different approach - Weights and Biases Sweeps (https://docs.wandb.ai/guides/sweeps). The results of these sweeps are the default hyperparameters in the config file. To run another round of sweeps for DNN, create a WandB account and set the `--run-sweeps` keyword in the call to `scope-train`.*

- *SCoPe DNN training does not provide feature importance information (due to the hidden layers of the network). Feature importance is possible to estimate for neural networks, but it is more computationally expensive compared to this "free" information from XGB.*

### Train multiple classifiers with one script

Create a shell script that contains multiple calls to `scope.py train`:
Create a shell script that contains multiple calls to `scope-train`:
```
./scope.py create_training_script --filename=train_xgb.sh --min_count=1000 --algorithm=xgb --period_suffix=ELS_ECE_EAOV --add_keywords="--save --plot --group=ss23 --epochs=30 --skip_cv"
create-training-script --filename train_xgb.sh --min-count 1000 --algorithm xgb --period-suffix ELS_ECE_EAOV --add-keywords "--save --plot --group ss23 --epochs 30 --skip-cv"
```

Modify the permissions of this script by running `chmod +x train_xgb.sh`. Run the generated training script in a terminal window (using e.g. `./train_xgb.sh`) to train multiple label sequentially.
Modify the permissions of this script by running `chmod +x train_xgb.sh`. Run the generated training script in a terminal window (using e.g. `./train_xgb.sh`) to train multiple classifiers sequentially.
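
The generated script is plain shell, with one `scope-train` call per label. A sketch of what `train_xgb.sh` might contain (the `vnv` tag comes from the example above; the second tag and the exact argument order are illustrative):

```
#!/bin/bash
scope-train --tag vnv --algorithm xgb --period-suffix ELS_ECE_EAOV --save --plot --group ss23 --epochs 30 --skip-cv
scope-train --tag pnp --algorithm xgb --period-suffix ELS_ECE_EAOV --save --plot --group ss23 --epochs 30 --skip-cv
```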

***Note:***
- *The code will throw an error if the training script filename already exists.*
- *The code will raise an error if the training script filename already exists.*

### Running training on HPC resources

`train_algorithm_slurm.py` and `train_algorithm_job_submission.py` can be used to generate and submit `slurm` scripts to train all classifiers in parallel using HPC resources.
`train-algorithm-slurm` and `train-algorithm-job-submission` can be used to generate and submit `slurm` scripts to train all classifiers in parallel using HPC resources.

## Plotting Classifier Performance
SCoPe saves diagnostic plots and json files to report each classifier's performance. The below code shows the location of the validation set results for one classifier.
@@ -82,10 +84,10 @@ This code may also be placed in a loop over multiple labels to compare each clas

## Inference

Use `tools/inference.py` to run inference on a field (297) of features (within a directory called `generated_features`). The classifiers used for this inference are within the `ss23` directory/group specified during training.
Use `run-inference` to run inference on a field (297) of features (in this example, located in a directory called `generated_features`). The classifiers used for this inference are within the `ss23` directory/group specified during training.

```
./scope.py create_inference_script --filename=get_all_preds_xgb.sh --group_name=ss23 --algorithm=xgb --period_suffix=ELS_ECE_EAOV --feature_directory=generated_features
create-inference-script --filename get_all_preds_xgb.sh --group-name ss23 --algorithm xgb --period-suffix ELS_ECE_EAOV --feature-directory generated_features
```

Modify the permissions of this script using `chmod +x get_all_preds_xgb.sh`, then run on the desired field:
@@ -94,12 +96,12 @@ Modify the permissions of this script using `chmod +x get_all_preds_xgb.sh`, the
```
./get_all_preds_xgb.sh 297
```

***Notes:***
- *`scope.py create_inference_script` will throw an error if the inference script filename already exists.*
- *`create-inference-script` will raise an error if the inference script filename already exists.*
- *Inference begins by imputing missing features using the strategies specified in the `features:` section of the config file.*

### Running inference on HPC resources

`run_inference_slurm.py` and `run_inference_job_submission.py` can be used to generate and submit `slurm` scripts to run inference for all classifiers in parallel using HPC resources.
`run-inference-slurm` and `run-inference-job-submission` can be used to generate and submit `slurm` scripts to run inference for all classifiers in parallel using HPC resources.

## Examining predictions
