How to use Grokfast with FP16 mixed precision training? #10

Open
peterjc123 opened this issue Jul 3, 2024 · 2 comments

Comments

@peterjc123

Hi, I'm trying out Grokfast in an LLM scenario. Mixed-precision training is a commonly used technique to reduce GPU memory usage and speed up training. The following code is a typical example of FP16 training with a gradient scaler.

import torch
from torch.amp import GradScaler, autocast

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

The question is: where should I put grads = gradfilter_ema(model, grads)? I tried placing it between the scale (backward) and unscale calls, but it doesn't work; the loss scale just explodes.
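
For concreteness, here is roughly what I tried (a sketch of the inner loop only; gradfilter_ema is imported from grokfast.py as in the README, and grads starts out as None):

from grokfast import gradfilter_ema

grads = None  # EMA filter state, carried across steps

for input, target in data:
    optimizer.zero_grad()
    with autocast(device_type='cuda', dtype=torch.float16):
        output = model(input)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()

    # Filter applied here, i.e. between backward() and unscale_().
    # At this point p.grad is still multiplied by the current loss scale,
    # so the EMA state mixes gradients taken at different scale factors,
    # which I suspect is part of the problem.
    grads = gradfilter_ema(model, grads=grads)

    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)
    scaler.update()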

@damian0815

Similar issue here: when I put grads = gradfilter_ema(model, grads) after the call to scaler.unscale_(optimizer), the scale goes to 0 and I get NaNs for the step loss.
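
For reference, the relevant part of my loop (a sketch only; grads starts out as None as in the README):

scaler.scale(loss).backward()

# Gradients are unscaled first, then filtered.
scaler.unscale_(optimizer)
grads = gradfilter_ema(model, grads=grads)

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# scaler.step() skips the update whenever infs/NaNs were found during
# unscale_(), and scaler.update() then shrinks the scale; repeated skips
# are what drive the scale toward 0.
scaler.step(optimizer)
scaler.update()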

@ironjr
Owner

ironjr commented Jul 10, 2024

Thank you for the valuable report! This is most likely caused by the increase in gradient norm from the added low-pass-filtered gradient.

The code here is primarily a proof-of-concept demonstration of accelerated grokking in previously known scenarios. For larger models, I suspect more sophisticated control of the step size of the gradient updates is needed, especially with the mixed-precision training you mention. I will revise the code in the next version to make it more compatible with training larger models.
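
In the meantime, a possible direction to experiment with (an untested sketch, not part of the released code) is to apply the filter only when the unscaled gradients are finite, and to clip again after filtering so that the norm increase from the low-pass term is bounded before scaler.step():

scaler.scale(loss).backward()
scaler.unscale_(optimizer)

# Skip the filter on overflow steps, so inf/NaN gradients (which GradScaler
# discards anyway) never enter the EMA state.
if all(torch.isfinite(p.grad).all() for p in model.parameters() if p.grad is not None):
    grads = gradfilter_ema(model, grads=grads)

# Clip after filtering, so the amplified gradient norm is bounded again.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

scaler.step(optimizer)
scaler.update()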
