How to use Grokfast with FP16 mixed precision training? #10

Open
peterjc123 opened this issue Jul 3, 2024 · 2 comments

Comments

@peterjc123

Hi, I'm trying out Grokfast in an LLM scenario. Mixed-precision training is a commonly used technique to reduce GPU memory usage and speed up training. The following code is a typical example of FP16 training with a gradient scaler.

import torch
from torch.amp import GradScaler, autocast

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

The question is: where should I put grads = gradfilter_ema(model, grads)? I tried placing it between the scale (backward) and unscale calls, but it doesn't work; the loss scale just explodes.
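
For concreteness, here is roughly what I tried (a sketch of the inner loop only; gradfilter_ema is imported from grokfast.py as in the README, and grads starts out as None):

from grokfast import gradfilter_ema

grads = None  # EMA filter state, carried across steps

for input, target in data:
    optimizer.zero_grad()
    with autocast(device_type='cuda', dtype=torch.float16):
        output = model(input)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()

    # Filter applied here, i.e. between backward() and unscale_().
    # At this point p.grad is still multiplied by the current loss scale,
    # so the EMA state mixes gradients taken at different scale factors,
    # which I suspect is part of the problem.
    grads = gradfilter_ema(model, grads=grads)

    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)
    scaler.update()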

@damian0815

Similar issue here: when I put grads = gradfilter_ema(model, grads) after the call to scaler.unscale_(optimizer), the scale goes to 0 and I get NaNs for the step loss.
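
For reference, the relevant part of my loop (a sketch only; grads starts out as None as in the README):

scaler.scale(loss).backward()

# Gradients are unscaled first, then filtered.
scaler.unscale_(optimizer)
grads = gradfilter_ema(model, grads=grads)

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# scaler.step() skips the update whenever infs/NaNs were found during
# unscale_(), and scaler.update() then shrinks the scale; repeated skips
# are what drive the scale toward 0.
scaler.step(optimizer)
scaler.update()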

@ironjr
Owner

ironjr commented Jul 10, 2024

Thank you for the valuable report! This is most likely caused by the increase in gradient norm from the added low-pass-filtered gradient.

The code here is primarily a proof-of-concept demonstration of accelerated grokking in previously known scenarios. For larger models, I suspect more sophisticated control of the step size of the gradient updates is needed, especially with the mixed-precision training you mention. I will revise the code in the next version to make it more compatible with training larger models.
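
In the meantime, a possible direction to experiment with (an untested sketch, not part of the released code) is to apply the filter only when the unscaled gradients are finite, and to clip again after filtering so that the norm increase from the low-pass term is bounded before scaler.step():

scaler.scale(loss).backward()
scaler.unscale_(optimizer)

# Skip the filter on overflow steps, so inf/NaN gradients (which GradScaler
# discards anyway) never enter the EMA state.
if all(torch.isfinite(p.grad).all() for p in model.parameters() if p.grad is not None):
    grads = gradfilter_ema(model, grads=grads)

# Clip after filtering, so the amplified gradient norm is bounded again.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

scaler.step(optimizer)
scaler.update()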
