Add regression tests for training #730

2015aroras · 2024-10-07T21:59:52Z

Issue: If we make a backward-incompatible change or introduce a regression, we don't have a mechanism to catch it. Also, if we start running jobs on a new platform, we don't have an easy way to tell if the training will be identical to existing platforms.

Fix: Add training regression tests that run 2 steps of training and compare the model activations against an already-prepared set of model activations from Beaker. The saved model activations can be updated by changing the flags passed to the tests.
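A minimal sketch of the comparison idea described above, not the code in this PR. It assumes activations are stored as an ordered mapping from module name to a flattened list of floats; the name `find_activation_mismatches` and the tolerance defaults are hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical representation: module name -> flattened output values.
Activations = Dict[str, List[float]]


def find_activation_mismatches(
    expected: Activations,
    actual: Activations,
    rel_tol: float = 1e-5,
    abs_tol: float = 1e-8,
) -> List[str]:
    """Return a list of human-readable mismatches (empty if everything matches)."""
    problems: List[str] = []
    # Dicts preserve insertion order, so this also checks that the modules
    # produced their outputs in the same order as in the saved run.
    if list(expected) != list(actual):
        problems.append("module order or names differ")
        return problems
    for name, want in expected.items():
        got = actual[name]
        if len(want) != len(got):
            problems.append(f"{name}: length {len(got)} != {len(want)}")
            continue
        for i, (w, g) in enumerate(zip(want, got)):
            if not math.isclose(w, g, rel_tol=rel_tol, abs_tol=abs_tol):
                # Report only the first differing element per module.
                problems.append(f"{name}[{i}]: got {g}, expected {w}")
                break
    return problems
```

A test would load the saved activations, run the two training steps, and assert that the returned list is empty. Loosening `rel_tol` and `abs_tol` is one way to make such a check less strict.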

The added tests also run on CPU, but this required making minor changes to OLMo training code. Autocast works differently on CPU, so the model activations are different for CPU compared to GPU.

The saved model activations are about 26 MB in total, which is a reasonable increase to repo size...

epwalsh · 2024-10-08T20:27:10Z

I'm in favor of adding trainer tests but I have a couple concerns about this approach, the foremost being that I think comparing activations is way too brittle and strict. It would be enough to just check that loss is decreasing over a few steps.
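A minimal sketch of the looser check suggested here, assuming a training step can be wrapped as a callable that returns a scalar loss. The names `run_steps`, `loss_decreased`, and `make_toy_step` are hypothetical, and the toy step is plain gradient descent on a quadratic, standing in for a real trainer.

```python
from typing import Callable, List


def run_steps(step_fn: Callable[[], float], num_steps: int) -> List[float]:
    """Call step_fn num_steps times and collect the loss from each step."""
    return [step_fn() for _ in range(num_steps)]


def loss_decreased(losses: List[float]) -> bool:
    """True if the final loss is strictly lower than the first one."""
    if len(losses) < 2:
        raise ValueError("need at least two losses to compare")
    return losses[-1] < losses[0]


def make_toy_step(lr: float = 0.1) -> Callable[[], float]:
    """Stand-in for a trainer step: gradient descent on (w - 3)^2."""
    state = {"w": 0.0}

    def step() -> float:
        loss = (state["w"] - 3.0) ** 2
        grad = 2.0 * (state["w"] - 3.0)
        state["w"] -= lr * grad  # update after computing this step's loss
        return loss

    return step


if __name__ == "__main__":
    losses = run_steps(make_toy_step(), num_steps=5)
    assert loss_decreased(losses)
```

Comparing only the first and last loss tolerates noise in intermediate steps, at the cost of missing regressions that change the numbers without stopping the loss from going down.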

> we start running jobs on a new platform we don't have an easy way to tell if the training will be identical to existing platforms.

I wouldn't assume training would or even should be identical across platforms.

> The saved model activations are about 26 MB in total, which is a reasonable increase to repo size...

I think this is a lot, actually, especially considering that new versions of these may have to be committed over time.

2015aroras added 5 commits October 7, 2024 14:43

Make OLMo training runnable on CPU

c634c65

When logging submodule outputs, keep track of order of modules

7abc27f

Add fixtures for CPU and CUDA tests

80a52a8

Add regression test for training

09235de

Fix typing error

efe5826

2015aroras changed the title ~~Shanea/add training regression tests~~ Add regression tests for training Oct 7, 2024

2015aroras marked this pull request as ready for review October 7, 2024 22:09

2015aroras requested review from dirkgr and epwalsh October 7, 2024 22:09
