Graphium 3.0 #519

Draft · wants to merge 189 commits into base: main

Commits (189)
5ffe261
Added and integrated C++ graphium_cpp library, a Python module implem…
ndickson-nvidia Apr 13, 2024
8286383
Small changes to support not needing label data during data loading
ndickson-nvidia Apr 17, 2024
dca9b2b
Removed FakeDataset, FakeDataModule, and SingleTaskDataset. SingleTa…
ndickson-nvidia Apr 17, 2024
8304210
Removed option to featurize using Python, (but didn't delete everythi…
ndickson-nvidia Apr 17, 2024
4ee35d4
Removed newly deprecated options from yaml files
ndickson-nvidia Apr 18, 2024
cf23e37
Added support for limiting the number of threads used by prepare_and_…
ndickson-nvidia Apr 18, 2024
5db0e2a
Fixed compiler warning about signed vs. unsigned comparison
ndickson-nvidia Apr 18, 2024
c75a452
Fixed Python syntax issues
ndickson-nvidia Apr 18, 2024
4aa1f85
Changed asymmetric inverse normalization type to be implemented using…
ndickson-nvidia Apr 18, 2024
c53451a
Fixed compile errors
ndickson-nvidia Apr 18, 2024
268e245
Some simplification in collate.py
ndickson-nvidia Apr 19, 2024
e032e8e
Deleting most of the Python featurization code
ndickson-nvidia Apr 19, 2024
bdefe89
Implemented conformer generation in get_conformer_features, trying to…
ndickson-nvidia Apr 23, 2024
5298444
Deleted deprecated properties.py
ndickson-nvidia Apr 23, 2024
c38aa06
Handle case of no label data in prepare_and_save_data. Also added con…
ndickson-nvidia Apr 25, 2024
86abf21
Changed prepare_data to support having no label data
ndickson-nvidia Apr 25, 2024
bd59098
Removed ipu metrics, since not compatible with latest torchmetrics
DomInvivo Apr 26, 2024
734ba55
Updated `MetricWrapper` to work with `update` and `compute`, compatib…
DomInvivo Apr 26, 2024
b6c578f
Changed requirements for torchmetrics
DomInvivo Apr 26, 2024
4f6e816
fixed the loss by adding `MetricToTorchMetrics`, and added a few comm…
DomInvivo Apr 26, 2024
80276da
Updated license passed to setup call in setup.py
ndickson-nvidia May 2, 2024
7933ae5
Major updates to `predictor_summaries.py`
DomInvivo May 3, 2024
5849927
Improved the predictor summaries. Added GradientNormMetric
DomInvivo May 3, 2024
9492e62
Changes to get test_dataset.py and test_multitask_datamodule.py passing
ndickson-nvidia May 6, 2024
d94097c
Removed load_type option from test_training.py, because it's no longe…
ndickson-nvidia May 6, 2024
11e6935
Updated comment in setup.py about how to build graphium_cpp package
ndickson-nvidia May 14, 2024
ff93c2d
Rewrote test_featurizer.py. Fixed bug in mask_nans C++ function, and …
ndickson-nvidia May 14, 2024
a892068
Removed deprecation warnings and deprecated parameters from datamodul…
ndickson-nvidia May 23, 2024
38a5510
Recommended tweaks to extract_labels in multilevel_utils.py
ndickson-nvidia May 23, 2024
f7771b3
Fixed "else if"->"elif"
ndickson-nvidia May 23, 2024
4256839
Rewrote test_pe_nodepair.py to use graphium_cpp
ndickson-nvidia May 24, 2024
91c37a3
Rewrote test_pe_rw.py to use graphium_cpp. Comment update in test_pe_…
ndickson-nvidia May 24, 2024
f347a0d
Rewrote test_pe_spectral.py to use graphium_cpp
ndickson-nvidia May 24, 2024
26b5531
Removed tests/test_positional_encodings.py, because it's a duplicate …
ndickson-nvidia May 24, 2024
1ded38b
Fixed handling of disconnected components vs. single component for la…
ndickson-nvidia May 28, 2024
314d636
Fixed compile warnings in one_hot.cpp
ndickson-nvidia May 28, 2024
e49b4da
Rewrote test_positional_encoders.py, though it's still failing the te…
ndickson-nvidia May 28, 2024
f001464
Removed commented out lines from setup.py
ndickson-nvidia Jun 4, 2024
2782fbc
Ran linting on Python files
ndickson-nvidia Jun 4, 2024
77d27b5
Hopefully explicitly installing graphium_cpp fixes the automated test…
ndickson-nvidia Jun 5, 2024
cb1df19
Test fix
ndickson-nvidia Jun 5, 2024
f3f6a0d
Another test fix
ndickson-nvidia Jun 5, 2024
c5c0085
Another test fix
ndickson-nvidia Jun 5, 2024
6dd827f
Make sure RDKit can find Boost headers
ndickson-nvidia Jun 5, 2024
59c84a2
Reimplemented test_pos_transfer_funcs.py to test all supported conver…
ndickson-nvidia Jun 12, 2024
7bc8ade
Linting fixes
ndickson-nvidia Jun 12, 2024
6903243
Fixed collections.abs.Callable to typing.Callable for type hint
ndickson-nvidia Jun 12, 2024
f355eed
Improved the task summaries and started to fix the training logging.
DomInvivo Jun 13, 2024
9f38afb
Removed file_opener and its test
ndickson-nvidia Jun 17, 2024
5ab9ca9
Fixed the issue with boolean masking, introduced by `F._canonical_mas…
DomInvivo Jul 9, 2024
9c7504f
Fixed the float vs double issue in laplacian pos encoding
DomInvivo Jul 9, 2024
f8358f3
Added comment
DomInvivo Jul 9, 2024
692decc
Fixed the ipu tests by making sure that `IPUStrategy` is not imported…
DomInvivo Jul 9, 2024
8891e66
Update test.yml to only test python 3.10
DomInvivo Jul 9, 2024
c2d3c87
Removed positional encodings from the docs
DomInvivo Jul 9, 2024
d3d19d7
Merge remote-tracking branch 'origin/dom_unittest' into dom_unittest
DomInvivo Jul 9, 2024
0a1696f
Upgraded python versions in the tests
DomInvivo Jul 9, 2024
50265df
Removed reference to old files now in C++
DomInvivo Jul 9, 2024
58fc2aa
Downgraded python version
DomInvivo Jul 9, 2024
5852467
Fixed other docs broken references
DomInvivo Jul 9, 2024
ea9a775
Merge pull request #1 from ndickson-nvidia/dom_unittest
ndickson-nvidia Jul 9, 2024
7f933b7
Merge pull request #510 from ndickson-nvidia/graphium_cpp
DomInvivo Jul 9, 2024
4372ace
Fixed test_metrics. Moved lots of `spaces.py` imports to inner functi…
DomInvivo Jul 10, 2024
7b89998
duplicated some unit-test fixes from graphium_3.0 branch
DomInvivo Jul 11, 2024
ab88952
Fixed the loading of a previous dummy model using older metrics by re…
DomInvivo Jul 11, 2024
6c58733
Minor documentation
DomInvivo Jul 11, 2024
a9a8810
Removed the loss from `predictor_summaries`
DomInvivo Jul 11, 2024
2185697
Removed epochs from task summaries
DomInvivo Jul 11, 2024
d37d818
Draft implementing the update/compute logic in the predictor.
DomInvivo Jul 11, 2024
b4524f9
Fix the std metric. Still needs testing.
DomInvivo Jul 11, 2024
5040c47
fixed all errors arising in `test_finetuning.py`
DomInvivo Jul 11, 2024
e761e08
Fixed the `test_training.py` unit test
DomInvivo Jul 12, 2024
5d60fbf
Standardized the test names
DomInvivo Jul 12, 2024
b59428a
Fixed some unit-tests that were broken by previous changes
DomInvivo Jul 12, 2024
632d4dc
Added `pytdc` to the tests
DomInvivo Jul 12, 2024
0fa2d86
Changed mamba install tdc to pip install, in the `test.yml` file
DomInvivo Jul 12, 2024
2441f43
Added '--no-deps' to TDC installation in `test.yml`
DomInvivo Jul 12, 2024
326b6e7
Woops
DomInvivo Jul 12, 2024
641fa37
Fixed issue with building docs
DomInvivo Jul 12, 2024
2b85dce
Removed old file from breaking docs building
DomInvivo Jul 12, 2024
0c93a0f
Changed to micromamba to install pytdc
DomInvivo Jul 12, 2024
ec235fc
Added tests for the `STDMetric` and `GradientNormMetric` and fixed th…
DomInvivo Jul 12, 2024
38d03e1
Implemented test of MultiTaskSummaries. Only an error left for the me…
DomInvivo Jul 12, 2024
d6f62a4
Fixed the `preds` and `targets` that were inverted in `TaskSummary`
DomInvivo Jul 13, 2024
3673884
Tried to add grad_norm to the metrics, but won't work because it's no…
DomInvivo Jul 13, 2024
29598a2
Moved the gradient metric directly to the `Predictor`
DomInvivo Jul 13, 2024
6260fa1
Removed file_opener and read_file
DomInvivo Jul 13, 2024
10a1017
Fixed predictor grad_norm
DomInvivo Jul 13, 2024
8aa0f2b
Merge branch 'graphium_3.0' into torchmetrics
DomInvivo Jul 13, 2024
90c0ca4
Fixed the progress bar logging to newest version. Fixed minor issues …
DomInvivo Jul 15, 2024
be99d94
Merge remote-tracking branch 'origin/torchmetrics' into torchmetrics
DomInvivo Jul 15, 2024
44b66b5
fixed some issue with older version of torchmetrics
DomInvivo Jul 15, 2024
5c421a6
Fixed reversed preds/targets. Fixed random sampling to take in the DF…
DomInvivo Jul 16, 2024
f15cd9a
fixed missing metrics computation on `on_train_batch_end`
DomInvivo Jul 16, 2024
2142313
Added toymix training to the unit-tests. Also useful to run in debug …
DomInvivo Jul 16, 2024
99e0cd6
Adding `_global/` to some metrics logging into wandb
DomInvivo Jul 16, 2024
045ea53
Added better handling of metrics failure with `logger.warn`
DomInvivo Jul 16, 2024
d8ba606
Fixed metric issues on gpu by casting to the right device prior to `.…
DomInvivo Jul 16, 2024
1bf2734
Added losses to the metrics, such that they are computed on val and t…
DomInvivo Jul 17, 2024
68b9361
Restricting the numpy version due to issues with wandb
DomInvivo Jul 17, 2024
911dfe9
detaching preds
DomInvivo Jul 17, 2024
d34ac60
Removed cuda version restriction
DomInvivo Jul 17, 2024
b1f2e86
Removed unnecessary detach, that broke the loss
DomInvivo Jul 17, 2024
62b385a
Updating dep versions for bh2 install
Andrewq11 Jul 30, 2024
41490a0
Merge pull request #522 from datamol-io/package/bh2-install
DomInvivo Jul 31, 2024
7f9112a
Fix lightning backend issue; add predict_step for inference
WenkelF Aug 8, 2024
47b7d1c
Fixing device issue in metrics calculation
WenkelF Aug 9, 2024
9dbd021
Minor gitignore
DomInvivo Aug 15, 2024
5a77cbe
Fixed the error due to time metrics on CPU `No backend type associate…
DomInvivo Aug 16, 2024
7fba29d
Added val epoch time
DomInvivo Aug 16, 2024
b59dc36
Added logic to avoid crashing when resetting unused metrics
DomInvivo Aug 17, 2024
da3e3a1
Added `MetricWrapper.device`
DomInvivo Aug 19, 2024
8bf0d41
Disable caching model checkpoint through WandbLogger
Aug 19, 2024
1ec4969
Disabled caching model checkpoint through WandbLogger
AnujaSomthankar Aug 19, 2024
9ba5a16
Drafting unit test for node ordering
WenkelF Aug 19, 2024
6f35ea9
Improved the testing of the metrics reset, update, compute
DomInvivo Aug 21, 2024
d2f84f2
Reverted wrong change in `train_finetune_test.py
DomInvivo Aug 21, 2024
e9be441
Improved __len__ in MultitaskDataModule
DomInvivo Aug 21, 2024
eaf9077
Added a new logic to allow saving all preds and targets more efficien…
DomInvivo Aug 22, 2024
5432531
Fixed the concatenation to work with and without DDP. Moved to CPU fo…
DomInvivo Aug 22, 2024
8c75d77
Fixed the issue with memory leaks and devices.
DomInvivo Aug 22, 2024
5abd769
Fixed the CPU syncing of `MetricToConcatenatedTorchMetrics` and GPU f…
DomInvivo Aug 22, 2024
fac3052
Fixed the training metrics, and grouped all epoch-time and tput metrics
DomInvivo Aug 22, 2024
6603014
Fixing unit tests
WenkelF Aug 22, 2024
d0ed816
Fixed epoch_time tracking (because train ends after val)
DomInvivo Aug 22, 2024
9b7063f
Using the `torchmetrics.Metric.sync` instead of torch_distributed
DomInvivo Aug 23, 2024
136b8b0
Fixed issue that NaNs are always removed with `mean-per-label`
DomInvivo Aug 29, 2024
2724b4c
Changed the name of logging variables
DomInvivo Aug 29, 2024
141f48b
Removed some IPU logic
DomInvivo Aug 29, 2024
62f2224
Fixed the syncing of `MetricToConcatenatedTorchMetrics`
DomInvivo Aug 29, 2024
2b58fed
Fixed classification metric calculation when multitask_handling=flatten
AnujaSomthankar Aug 29, 2024
c23dc02
Partial fix of node label ordering
WenkelF Sep 5, 2024
2fb7f4b
Fixed all unit-test, except those for IPU
DomInvivo Sep 7, 2024
607e71b
First pass at removing IPU
DomInvivo Sep 7, 2024
49e9984
More removal of ipu
DomInvivo Sep 7, 2024
d8786e9
More removal of ipus
DomInvivo Sep 7, 2024
6ed6bb8
More removal of ipus
DomInvivo Sep 7, 2024
495f3f6
Remove packing
DomInvivo Sep 7, 2024
90c7af2
Fixing most unit-tests
DomInvivo Sep 7, 2024
29229ff
Updated env file
DomInvivo Sep 7, 2024
019be26
Fixed the dummy model, toymix run, and most unit-tests
DomInvivo Sep 7, 2024
21a63a8
Minor
DomInvivo Sep 7, 2024
4bbc1f9
Fixed all remaining unit-tests - mostly the attention layers
DomInvivo Sep 7, 2024
be17a1f
minor changes to env
DomInvivo Sep 7, 2024
318694f
Added comments to env file
DomInvivo Sep 7, 2024
ce4f94d
Merge branch 'graphium_3.0' into torchmetrics
AnujaSomthankar Sep 10, 2024
f723632
Merge pull request #517 from datamol-io/torchmetrics
AnujaSomthankar Sep 10, 2024
c32be78
Merge branch 'torchmetrics' into remove_ipu
WenkelF Sep 11, 2024
5097e2a
Forcing `gcc_linux_64` in the env file
DomInvivo Sep 11, 2024
29044f5
Update test.yml
DomInvivo Sep 16, 2024
1771b78
Update test.yml
DomInvivo Sep 16, 2024
f87ee26
Update test.yml
DomInvivo Sep 16, 2024
24412f6
Changed persistent_workers=True to False
AnujaSomthankar Sep 19, 2024
1d7bfeb
Reorder atoms in node-level and nodepair-level label data, when the s…
ndickson-nvidia Jul 22, 2024
b7d9fe7
Merging of equivalent molecules is now optional, but still defaults t…
ndickson-nvidia Jul 24, 2024
5932fd8
Fixed bug with recent change in smiles_to_brief_data
ndickson-nvidia Jul 25, 2024
020b08a
Fix graphium_cpp.prepare_and_save_data call in test_dataset.py to inc…
ndickson-nvidia Jul 25, 2024
6cd5e26
In MultitaskFromSmilesDataModule.get_data_hash, include options used …
ndickson-nvidia Jul 25, 2024
d1cad44
Linter fixes in python files already modified in this branch
ndickson-nvidia Jul 25, 2024
508abbd
Split prepare_and_save_data into get_task_data, get_indices_and_strin…
ndickson-nvidia Aug 1, 2024
044cd47
Added support for reordering edge label data if there are multiple ta…
ndickson-nvidia Aug 1, 2024
50adb1b
Changed parse_mol in graphium_cpp.cpp to order based only on explicit…
ndickson-nvidia Aug 10, 2024
c870af4
The datasets use 0-based indexing for explicit ordering via atom clas…
ndickson-nvidia Aug 12, 2024
90f2403
Started adding doxygen comments to the C++ code. Also changed comput…
ndickson-nvidia Sep 10, 2024
bafdfe8
Adding unit test for node ordering
WenkelF Sep 5, 2024
7fd40e7
Added doxygen comments for functions and enums related to one-hot fea…
ndickson-nvidia Sep 11, 2024
06b12b2
Added more doxygen comments
ndickson-nvidia Sep 11, 2024
5d798a5
Added and updated more comments
ndickson-nvidia Sep 17, 2024
7261279
Added comments to each function in features.cpp
ndickson-nvidia Sep 23, 2024
b82f582
Investigating failing unit tests
WenkelF Sep 23, 2024
4a152b2
Added more comments to labels.cpp
ndickson-nvidia Sep 23, 2024
618bbb1
Merge branch 'atom_order' of ssh://github.com/ndickson-nvidia/graphiu…
ndickson-nvidia Sep 23, 2024
92ab751
Build fix in features.cpp
ndickson-nvidia Sep 23, 2024
f123565
Skipping test_training.py for now
WenkelF Sep 23, 2024
e887176
Merge pull request #521 from ndickson-nvidia/atom_order
DomInvivo Sep 23, 2024
1c4aa3b
Updated documentation
AnujaSomthankar Sep 27, 2024
75df01a
Updated documentation
AnujaSomthankar Sep 27, 2024
8436111
Wrapping up finetuning updates
WenkelF Oct 31, 2024
9cfabb7
Update readme
WenkelF Nov 1, 2024
bf504a8
Fixed finetuning unit test
WenkelF Nov 1, 2024
42c9dbc
Fixing docs
WenkelF Nov 1, 2024
cbf9dc7
Merge pull request #530 from datamol-io/remove_ipu
WenkelF Nov 4, 2024
807913f
Reducing size of example finetuning dataset
WenkelF Nov 4, 2024
697e3d1
Cleaning up configs and naming conventions
WenkelF Nov 5, 2024
d9aa407
Minor changes and documentation
WenkelF Nov 5, 2024
7759266
Minor change
WenkelF Nov 5, 2024
7cfa89a
Merge pull request #529 from datamol-io/upgrade-finetuning
WenkelF Nov 5, 2024
8a65d84
Added C++ file description comments
ndickson-nvidia Nov 14, 2024
5ffc2f6
Merge pull request #532 from ndickson-nvidia/cpp_file_descriptions
DomInvivo Nov 16, 2024
16 changes: 12 additions & 4 deletions .github/workflows/test.yml
@@ -16,8 +16,13 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.8", "3.9", "3.10"]
pytorch-version: ["2.0"]
include:
- python-version: "3.10"
pytorch-version: "2.0"
- python-version: "3.11"
pytorch-version: "2.0"
- python-version: "3.12"
pytorch-version: "2.3"

runs-on: "ubuntu-latest"
timeout-minutes: 30
@@ -49,8 +54,11 @@ jobs:
- name: Install library
run: python -m pip install --no-deps -e . # `-e` required for correct `coverage` run.

- name: Run tests
run: pytest -m 'not ipu'
- name: Install test dependencies
run: micromamba install -c conda-forge pytdc # Required to run the `test_finetuning.py`

- name: Install C++ library
run: cd graphium/graphium_cpp && git clone https://github.com/pybind/pybind11.git && export PYTHONPATH=$PYTHONPATH:./pybind11 && python -m pip install . && cd ../..

- name: Test CLI
run: graphium --help
69 changes: 0 additions & 69 deletions .github/workflows/test_ipu.yml

This file was deleted.

11 changes: 2 additions & 9 deletions .gitignore
@@ -29,6 +29,7 @@ draft/
scripts-expts/
sweeps/
mup/
loc-*

# Data and predictions
graphium/data/ZINC_bench_gnn/
@@ -38,6 +39,7 @@ graphium/data/cache/
graphium/data/b3lyp/
graphium/data/PCQM4Mv2/
graphium/data/PCQM4M/
graphium/data/largemix/
graphium/data/neurips2023/small-dataset/
graphium/data/neurips2023/large-dataset/
graphium/data/neurips2023/dummy-dataset/
@@ -53,15 +55,6 @@ debug/
change_commits.sh
graphium/features/test_new_pes.ipynb

# IPU related ignores and profiler outputs
*.a
*.cbor
*.capnp
*.pop
*.popart
*.pop_cache
*.popef
*.pvti*

############ END graphium Custom GitIgnore ##############

1 change: 1 addition & 0 deletions LICENSE
@@ -189,6 +189,7 @@
Copyright 2023 Valence Labs
Copyright 2023 Recursion Pharmaceuticals
Copyright 2023 Graphcore Limited
Copyright 2024 NVIDIA CORPORATION & AFFILIATES

Various Academic groups have also contributed to this software under
the given license. These include, but are not limited, to the following
105 changes: 78 additions & 27 deletions README.md
@@ -13,7 +13,6 @@
[![GitHub Repo stars](https://img.shields.io/github/stars/datamol-io/graphium)](https://github.com/datamol-io/graphium/stargazers)
[![GitHub Repo stars](https://img.shields.io/github/forks/datamol-io/graphium)](https://github.com/datamol-io/graphium/network/members)
[![test](https://github.com/datamol-io/graphium/actions/workflows/test.yml/badge.svg)](https://github.com/datamol-io/graphium/actions/workflows/test.yml)
[![test-ipu](https://github.com/datamol-io/graphium/actions/workflows/test_ipu.yml/badge.svg)](https://github.com/datamol-io/graphium/actions/workflows/test_ipu.yml)
[![release](https://github.com/datamol-io/graphium/actions/workflows/release.yml/badge.svg)](https://github.com/datamol-io/graphium/actions/workflows/release.yml)
[![code-check](https://github.com/datamol-io/graphium/actions/workflows/code-check.yml/badge.svg)](https://github.com/datamol-io/graphium/actions/workflows/code-check.yml)
[![doc](https://github.com/datamol-io/graphium/actions/workflows/doc.yml/badge.svg)](https://github.com/datamol-io/graphium/actions/workflows/doc.yml)
@@ -35,8 +34,6 @@ Visit https://graphium-docs.datamol.io/.

## Installation for developers

### For CPU and GPU developers

Use [`mamba`](https://github.com/mamba-org/mamba), a faster and better alternative to `conda`.

If you are using a GPU, we recommend enforcing the CUDA version that you need with `CONDA_OVERRIDE_CUDA=XX.X`.
@@ -53,25 +50,67 @@ mamba activate graphium
pip install --no-deps -e .
```
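
Graphium 3.0 also introduces the compiled `graphium_cpp` featurization module, which currently has to be built from source as well. Below is a minimal sketch mirroring the CI step added to `.github/workflows/test.yml` in this PR; it assumes a working C++ toolchain and the RDKit/Boost headers from the environment above are available:

```bash
# Build and install the graphium_cpp extension (mirrors the CI step in test.yml)
cd graphium/graphium_cpp
git clone https://github.com/pybind/pybind11.git   # header-only build dependency
export PYTHONPATH=$PYTHONPATH:./pybind11
python -m pip install .
cd ../..
```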

### For IPU developers
## Training a model

To learn how to train a model, we invite you to look at the documentation or the Jupyter notebooks available [here](https://github.com/datamol-io/graphium/tree/master/docs/tutorials/model_training).

If you are not familiar with [PyTorch](https://pytorch.org/docs) or [PyTorch-Lightning](https://pytorch-lightning.readthedocs.io/en/latest/), we highly recommend going through their tutorial first.

## Running an experiment

### Datasets

Graphium provides configs for two datasets: `toymix` and `largemix`.
`Toymix` uses three datasets, which are referenced in the datamodule [here](https://github.com/datamol-io/graphium/blob/d12df7e06828fa7d7f8792141d058a60b2b2d258/expts/hydra-configs/tasks/loss_metrics_datamodule/toymix.yaml#L59-L102). The datasets and their split files can be downloaded as follows:

```bash
# Install Graphcore's SDK and Graphium dependencies in a new environment called `.graphium_ipu`
./install_ipu.sh .graphium_ipu
# Create or change to the directory where the dataset should be downloaded
cd expts/data/neurips2023/small-dataset

# QM9
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/qm9.csv.gz
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/qm9_random_splits.pt

# Tox21
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/Tox21-7k-12-labels.csv.gz
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/Tox21_random_splits.pt

# Zinc
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/ZINC12k.csv.gz
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/ZINC12k_random_splits.pt
```

The above step needs to be done once. After that, enable the SDK and the environment as follows:
`Largemix` uses the datasets referenced in the datamodule [here](https://github.com/datamol-io/graphium/blob/e887176f71ee95c3b82f8f6b56c706eaa9765bf1/expts/hydra-configs/tasks/loss_metrics_datamodule/largemix.yaml#L82C1-L155C37). The datasets and their split files can be downloaded as follows:


```bash
source enable_ipu.sh .graphium_ipu
```
# Create or change to the directory where the dataset should be downloaded
cd ../data/graphium/large-dataset/

## Training a model
# L1000_VCAP
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/LINCS_L1000_VCAP_0-4.csv.gz
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/l1000_vcap_random_splits.pt

To learn how to train a model, we invite you to look at the documentation, or the jupyter notebooks available [here](https://github.com/datamol-io/graphium/tree/master/docs/tutorials/model_training).
# L1000_MCF7
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/LINCS_L1000_MCF7_0-4.csv.gz
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/l1000_mcf7_random_splits.pt

If you are not familiar with [PyTorch](https://pytorch.org/docs) or [PyTorch-Lightning](https://pytorch-lightning.readthedocs.io/en/latest/), we highly recommend going through their tutorial first.
# PCBA_1328
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/PCBA_1328_1564k.parquet
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/pcba_1328_random_splits.pt

# PCQM4M_G25
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/PCQM4M_G25_N4.parquet
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/pcqm4m_g25_n4_random_splits.pt

# PCQM4M_N4
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/PCQM4M_G25_N4.parquet
wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/pcqm4m_g25_n4_random_splits.pt
```
These datasets can then be used for pretraining.

### Pretraining

## Running an experiment
We have set up Graphium with `hydra` for managing config files. To run an experiment, go to the `expts/` folder. For example, to benchmark a GCN on the ToyMix dataset run
```bash
graphium-train architecture=toymix tasks=toymix training=toymix model=gcn
@@ -86,34 +125,46 @@ Integrating `hydra` also allows you to quickly switch between accelerators. E.g.
graphium-train architecture=toymix tasks=toymix training=toymix model=gcn accelerator=gpu
```
automatically selects the correct configs to run the experiment on GPU.
Finally, you can also run a fine-tuning loop:
```bash
graphium-train +finetuning=admet
```
To use the Largemix dataset instead, replace `toymix` with `largemix` in the above commands.
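For example, the GCN benchmark above becomes (assuming matching `largemix` config groups exist for each override):
```bash
graphium-train architecture=largemix tasks=largemix training=largemix model=gcn
```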

To use a config file you built from scratch you can run
```bash
graphium-train --config-path [PATH] --config-name [CONFIG]
```
Thanks to the modular nature of `hydra`, you can reuse many of our config settings for your own experiments with Graphium.

## Preparing the data in advance
The data preparation including the featurization (e.g., of molecules from smiles to pyg-compatible format) is embedded in the pipeline and will be performed when executing `graphium-train [...]`.
### Finetuning

However, when working with larger datasets, it is recommended to perform data preparation in advance using a machine with sufficient allocated memory (e.g., ~400GB in the case of `LargeMix`). Preparing data in advance is also beneficial when running lots of concurrent jobs with identical molecular featurization, so that resources aren't wasted and processes don't conflict reading/writing in the same directory.
After pretraining a model and saving a model checkpoint, the model can be finetuned to a new task.

The following command-line will prepare the data and cache it, then use it to train a model.
```bash
# First prepare the data and cache it in `path_to_cached_data`
graphium data prepare ++datamodule.args.processed_graph_data_path=[path_to_cached_data]
graphium-train +finetuning [example-custom OR example-tdc] finetuning.pretrained_model=[model_identifier]
```

# Then train the model on the prepared data
graphium-train [...] datamodule.args.processed_graph_data_path=[path_to_cached_data]
The `[model_identifier]` selects the pretrained model among those maintained in `GRAPHIUM_PRETRAINED_MODELS_DICT` in `graphium/utils/spaces.py`, which maps each identifier to the location of that model's checkpoint.
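
To see which identifiers are available in your installation, one option is to inspect that dictionary directly (a quick sketch, assuming `GRAPHIUM_PRETRAINED_MODELS_DICT` is a plain dict importable from `graphium.utils.spaces`):

```bash
# List the available pretrained-model identifiers, one per line
python -c "from graphium.utils.spaces import GRAPHIUM_PRETRAINED_MODELS_DICT as m; print('\n'.join(m))"
```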

We have provided two example yaml configs under `expts/hydra-configs/finetuning` for finetuning on a custom dataset (`example-custom.yaml`) or for a task from the TDC benchmark collection (`example-tdc.yaml`).

When using `example-custom.yaml`, to finetune on a custom dataset, we need to provide the location of the data (`constants.data_path=[path_to_data]`) and the type of task (`constants.task_type=[cls OR reg]`).

When using `example-tdc.yaml`, to finetune on a TDC task, we only need to provide the task name (`constants.task=[task_name]`); the task type is inferred automatically.

Custom datasets to finetune from consist of two files, `raw.csv` and `split.csv`. The `raw.csv` file contains two columns: `smiles` with the SMILES strings and `target` with the corresponding targets. In `split.csv`, the three columns `train`, `val`, and `test` contain the indices of the rows in `raw.csv`. Examples can be found under `expts/data/finetuning_example-reg` (regression) and `expts/data/finetuning_example-cls` (binary classification), and a minimal sketch of the layout is shown below.
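
A minimal sketch of this layout (hypothetical toy values; padding the shorter `val`/`test` columns with empty cells is an assumption based on the csv format described above):

```bash
mkdir -p expts/data/finetuning_example-toy
# raw.csv: one smiles column and one target column
cat > expts/data/finetuning_example-toy/raw.csv <<'EOF'
smiles,target
CCO,0
c1ccccc1,1
CC(=O)O,0
CCN,1
CCCC,0
EOF
# split.csv: each column holds row indices of raw.csv; shorter columns stay empty
cat > expts/data/finetuning_example-toy/split.csv <<'EOF'
train,val,test
0,3,4
1,,
2,,
EOF
```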

### Fingerprinting

Alternatively, we can obtain molecular embeddings (fingerprints) from a pretrained model:
```bash
graphium fps create [example-custom OR example-tdc] pretrained.model=[model_identifier] pretrained.layers=[layer_identifiers]
```

**Note** that `datamodule.args.processed_graph_data_path` can also be specified at `expts/hydra_configs/`.
We have provided two example yaml configs under `expts/hydra-configs/fingerprinting` for extracting fingerprints for a custom dataset (`example-custom.yaml`) or for a dataset from the TDC benchmark collection (`example-tdc.yaml`).

After specifying the `[model_identifier]`, we need to provide a list of layers from that model where we want to read out embeddings via `[layer_identifiers]` (which requires knowledge of the architecture of the pretrained model).

When using `example-custom.yaml`, the location of the SMILES strings to be embedded needs to be passed via `datamodule.df_path=[path_to_data]`. The data can be passed as a csv/parquet file with a column `smiles`, similar to `expts/data/finetuning_example-reg/raw.csv`.

**Note** that, every time the configs of `datamodule.args.featurization` change, you will need to run a new data preparation, which will automatically be saved in a separate directory that uses a hash unique to the configs.
When extracting fingerprints for a TDC task using `example-tdc.yaml`, we need to specify `datamodule.benchmark` and `datamodule.task` instead of `datamodule.df_path`.
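
Putting the pieces together, two sketched invocations of the command above (the model identifier, layer names, benchmark, and task are placeholders, not verified values):

```bash
# Custom dataset: embed the smiles from a local csv file
graphium fps create example-custom \
    pretrained.model=my_pretrained_model \
    'pretrained.layers=[my_layer_1,my_layer_2]' \
    datamodule.df_path=expts/data/finetuning_example-reg/raw.csv

# TDC benchmark: point at a benchmark/task instead of a file
graphium fps create example-tdc \
    pretrained.model=my_pretrained_model \
    'pretrained.layers=[my_layer_1,my_layer_2]' \
    datamodule.benchmark=my_benchmark \
    datamodule.task=my_task
```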

## License

5 changes: 0 additions & 5 deletions codecov.yml
@@ -16,8 +16,3 @@ component_management:
target: auto
branches:
- "!main"
individual_components:
- component_id: ipu # this is an identifier that should not be changed
name: ipu # this is a display name, and can be changed freely
paths:
- graphium/ipu/**
29 changes: 0 additions & 29 deletions docs/api/graphium.features.md
@@ -5,37 +5,8 @@ Feature extraction and manipulation
=== "Contents"

* [Featurizer](#featurizer)
* [Positional Encoding](#positional-encoding)
* [Properties](#properties)
* [Spectral PE](#spectral-pe)
* [Random Walk PE](#random-walk-pe)
* [NMP](#nmp)

## Featurizer
------------
::: graphium.features.featurizer


## Positional Encoding
------------
::: graphium.features.positional_encoding


## Properties
------------
::: graphium.features.properties


## Spectral PE
------------
::: graphium.features.spectral


## Random Walk PE
------------
::: graphium.features.rw


## NMP
------------
::: graphium.features.nmp
2 changes: 1 addition & 1 deletion docs/api/graphium.finetuning.md
@@ -10,4 +10,4 @@ Module for finetuning models and doing linear probing (fingerprinting).

::: graphium.finetuning.finetuning_architecture.FinetuningHead

::: graphium.finetuning.fingerprinting.Fingerprinter
::: graphium.fingerprinting.fingerprinter.Fingerprinter
48 changes: 0 additions & 48 deletions docs/api/graphium.ipu.md

This file was deleted.
