feat(training,rollout)!: Rollout Schedulers #46

Draft · wants to merge 14 commits into develop

Conversation

Member

@HCookie HCookie commented Dec 20, 2024

Closes #14

Rollout Schedulers

Expand the ways to describe rollout, and provide an interface to schedule updates

New default rollout config

# length of the "rollout" window (see Keisler's paper)
rollout:
  _target_: anemoi.training.schedulers.rollout.stepped.EpochStepped
  minimum: 1
  maximum: 12
  # increase rollout every n epochs
  every_n_epochs: 1
  # Control the incrementing of the rollout window
  increment:
    step:
      0: 0
      200000: 1 # After 200k steps, increment by 1 every 1 epoch

The scheduler can step by epoch or by step, and the increment itself can be controlled based on either the step or the epoch.

Additionally, this formally adds random steppers.
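
For illustration, a minimal sketch of the stepping behaviour the config above describes. The class name, method name, and increment semantics here are assumptions inferred from the config comments, not the actual EpochStepped implementation in this PR.

class SteppedRolloutSketch:
    """Illustrative only: mirrors the example config, not the real scheduler."""

    def __init__(self, minimum: int = 1, maximum: int = 12, every_n_epochs: int = 1,
                 increment: dict[int, int] | None = None) -> None:
        self.minimum, self.maximum = minimum, maximum
        self.every_n_epochs = every_n_epochs
        # Map of global-step thresholds to the increment applied per stepping event.
        self.increment = increment or {0: 0, 200_000: 1}
        self.rollout = minimum

    def on_epoch(self, epoch: int, step: int) -> int:
        # Every `every_n_epochs` epochs, add the increment associated with the
        # highest step threshold already passed, clamped to [minimum, maximum].
        if epoch % self.every_n_epochs == 0:
            threshold = max(t for t in self.increment if t <= step)
            self.rollout = min(self.maximum, max(self.minimum, self.rollout + self.increment[threshold]))
        return self.rollout

With the defaults above, the rollout stays at 1 while the global step is below 200000 (increment 0) and then grows by 1 per epoch until it is capped at 12.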

Todo

  • Integrate with the data loader
  • Ensure that the randomness is seeded appropriately
  • Randomness broadcast
  • Ensure restartability
  • Ability to change config

@HCookie HCookie self-assigned this Dec 20, 2024
@HCookie HCookie changed the title feat(rollout)!: Rollout Schedulers feat(training,rollout)!: Rollout Schedulers Dec 20, 2024
@HCookie HCookie added the documentation and enhancement labels Dec 20, 2024
Contributor

@anaprietonem anaprietonem left a comment

Started to go through the PR and left some comments! I still need to understand some of the functionality better, so I hope the questions make sense. Thanks for this Harrison!

@@ -405,6 +419,7 @@ def train(self) -> None:
            use_distributed_sampler=False,
            profiler=self.profiler,
            enable_progress_bar=self.config.diagnostics.enable_progress_bar,
            reload_dataloaders_every_n_epochs=self._need_to_reload_dataloaders,
Contributor

Is there a reason why the type of reload_dataloaders_every_n_epochs has been changed from int to bool? Wondering because, looking at the PTL docs, the flag's type is int (https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/trainer/trainer.html#Trainer.__init__) and that's what is used by the data_connector (https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/trainer/connectors/data_connector.py#L51).

Member Author

Due to Python's duck typing, a True will evaluate to 1 when used as an int, so it works. But if it is clearer, I can change it to an int.

Contributor

I see now, thanks for the clarification! All good then.
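
As a side note to the exchange above, bool is a subclass of int in Python, so True behaves as 1 wherever Lightning expects an integer; a quick check:

# bool is a subclass of int, so True is usable anywhere an int is expected.
assert isinstance(True, int)
assert int(True) == 1 and int(False) == 0
assert True * 3 == 3  # arithmetic treats True exactly like 1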

"""
self._update_rollout(trainer, pl_module, epoch=checkpoint["epoch"], step=checkpoint["global_step"])

def on_validation_epoch_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule, *_) -> None:
Contributor

If someone sets the limit_batches for validation to 0 to skip validation, this hook wouldn't be triggered, right?

Member Author

Good point, I'll need to take a look.

@@ -451,7 +451,7 @@ def rollout_step(
        )
        assert batch.shape[1] >= rollout + self.multi_step, msg

-        for rollout_step in range(rollout or self.rollout):
+        for rollout_step in range(rollout or int(self.rollout)):
Contributor

Do we still need to pass rollout as a value to this function if you are already making it a self variable of the GraphForecaster?

Member Author

Passing it as an arg allows for an override, which I think is used in some of the callbacks.
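
To illustrate the override pattern, a stripped-down stand-in (not the real GraphForecaster; only the `rollout or int(self.rollout)` fallback comes from the diff above):

class GraphForecasterSketch:
    """Illustrative only: shows the rollout-override pattern."""

    def __init__(self, scheduled_rollout: int) -> None:
        self.rollout = scheduled_rollout  # value driven by the rollout scheduler

    def rollout_step(self, batch, rollout: int | None = None):
        # An explicit `rollout` argument (e.g. from a callback) overrides the
        # scheduled value; otherwise fall back to the scheduler-driven one.
        for step in range(rollout or int(self.rollout)):
            yield step, batch  # stand-in for one autoregressive forecast step

A callback can then call rollout_step(batch, rollout=12) to run a longer rollout without disturbing the scheduler state.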


def on_train_epoch_start(self) -> None:
    # Sync the rollout at the start of each epoch
    # Cannot use stepping due to inconsistent behaviour with PyTorch Lightning
Contributor

Just to understand it better, what do you mean here by the inconsistent behaviour with PTL?

Member Author

If I use the step function, it gets triggered during sanity checking and in other places where I don't want it.
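
A minimal sketch of the epoch-start sync being described (the rollout_scheduler attribute and its rollout_at method are hypothetical names; only the hook and the sync-instead-of-step idea come from the comments above):

import pytorch_lightning as pl


class RolloutSyncSketch(pl.LightningModule):
    def __init__(self, rollout_scheduler) -> None:
        super().__init__()
        self.rollout_scheduler = rollout_scheduler  # hypothetical scheduler object

    def on_train_epoch_start(self) -> None:
        # Recompute the rollout from the trainer's current epoch/step rather than
        # incrementing on every hook call, so sanity-check and validation runs
        # cannot advance the schedule by accident.
        self.rollout = self.rollout_scheduler.rollout_at(
            epoch=self.current_epoch,
            step=self.global_step,
        )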

@FussyDuck commented Jan 9, 2025

CLA assistant check
All committers have signed the CLA.

increment:
  step:
    0: 0
    200000: 1 # After 200k steps, increment by 1 every 1 epoch
Contributor

@anaprietonem anaprietonem Jan 10, 2025

I am probably just being slow, but how does this interact with limit_batches? What would be the difference between doing the above and the 'old configuration' with a limit_batches of 200000?

Member Author

The limit_batches ends the training; this will continue on and then begin updating the rollout.

@@ -0,0 +1,8 @@
# (C) Copyright 2024 Anemoi contributors.
Contributor

Minor thing, but just a reminder that all of these headers would need to be updated to 2025 before merging.

Member Author

Even the code I wrote in 2024? I honestly have no idea.

@@ -377,6 +377,20 @@ def strategy(self) -> DDPGroupStrategy:
            static_graph=not self.config.training.accum_grad_batches > 1,
        )

    @cached_property
    def _need_to_reload_dataloaders(self) -> bool:
Contributor

Have you seen any difference in terms of runtime from using this?

Member Author

Unfortunately, yes. I still need to quantify it, but it does slow down the transition between epochs.
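
For context, a sketch of how such a property might gate the Trainer flag; the rollout_scheduler attribute and its adjusts_dataloader flag are assumptions, only the property name and the reload_dataloaders_every_n_epochs argument appear in the diffs above:

from functools import cached_property


class AnemoiTrainerSketch:
    """Illustrative only: not the real trainer class."""

    def __init__(self, rollout_scheduler) -> None:
        self.rollout_scheduler = rollout_scheduler  # hypothetical scheduler object

    @cached_property
    def _need_to_reload_dataloaders(self) -> int:
        # Reload the dataloaders every epoch only when the rollout schedule can
        # change the required sequence length; return 0 otherwise to avoid the
        # epoch-transition slowdown mentioned above.
        return 1 if getattr(self.rollout_scheduler, "adjusts_dataloader", False) else 0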

Labels
documentation, enhancement, training
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rollout Scheduling
3 participants