Llama: device/type-invariant RoPE sin/cos computation, eager attention matches original implementation #28837
Conversation
```diff
@@ -505,6 +506,120 @@ def test_eager_matches_sdpa_generate(self):
        res_sdpa = model_sdpa.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        self.assertTrue(torch.allclose(res_eager, res_sdpa))

+    @require_torch_gpu
+    def test_rope_cast_strategy_invariant(self):
```
This test fails on main, because inv_freq was being cast with .to().
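(The test body is not shown in this excerpt. As an illustration only — hypothetical checkpoint name and inputs, not the actual test from the diff — the kind of cast-strategy invariance being checked can be sketched as follows.)

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint, used only for illustration
model_id = "meta-llama/Llama-2-7b-hf"

# Strategy 1: cast while loading
model_a = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
# Strategy 2: load in float32, then cast the whole module
# (this path used to downcast the inv_freq buffer as well)
model_b = AutoModelForCausalLM.from_pretrained(model_id).to("cuda", dtype=torch.float16)

input_ids = torch.tensor([[1, 2, 3, 4, 5]], device="cuda")
with torch.no_grad():
    logits_a = model_a(input_ids).logits
    logits_b = model_b(input_ids).logits

# Before this PR the two casting strategies could diverge; after it they should match
torch.testing.assert_close(logits_a, logits_b)
```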
```diff
        )

+    @require_torch_gpu
+    def test_rope_initialization_invariant(self):
```
This test fails on main, as initialization is device-dependent there.
What does this PR do?
This PR fixes the following problems, all related to RoPE:

1. Casting the model with .from_pretrained(..., torch_dtype=...) or with .to(dtype=...) would produce different sin/cos tensors at recomputation time. The underlying cause was inv_freq being a buffer, which means it was subject to buffer manipulation (like a .to() operation in the wrapping module). Note that the original repo assumed it was always a torch.float32 tensor. In some models, there was a visible performance degradation when doing inference with seq_len > max_position_embeddings (see here);
2. The inv_freq tensor was being loaded from the state dict, due to a previous version of the code where it was a persistent buffer.

Overall, these changes result in:

a. Smaller modeling performance differences across devices, as CPUs are ubiquitous (as opposed to accelerators, which may change);
b. Prevention of loss spikes at train time, possibly due to the more accurate sin/cos computation (see this comment and the whole issue);
c. Slightly slower throughput when recomputing the sin/cos tensors, i.e. when going beyond self.max_seq_len_cached.

See additional data and experiments below for the impact of this PR. Most of the diff in this PR is tests, to ensure we don't regress 🤗
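(As an aside, not code from this PR: the buffer behavior behind item 1 above is easy to reproduce — a float buffer registered on a module is downcast by a module-level .to(dtype=...) call, so a cached fp32 inv_freq silently becomes fp16 when the wrapping model is cast.)

```python
import torch
import torch.nn as nn

class ToyRope(nn.Module):
    def __init__(self, dim: int = 8, base: float = 10000.0):
        super().__init__()
        # inv_freq stored as a (non-persistent) buffer, mirroring the situation described above
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

module = ToyRope()
print(module.inv_freq.dtype)  # torch.float32

# Casting the wrapping module also casts the buffer...
module.to(dtype=torch.float16)
print(module.inv_freq.dtype)  # torch.float16

# ...so any sin/cos recomputed from it happens in half precision,
# while the original Llama code assumed float32 throughout.
```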
Suggested review order:
(Other RoPE models will follow in a future PR)
Related GH issues
Fixes #28685
Fixes #25681
Fixes #28596
Fixes #27179
Should fix/help microsoft/DeepSpeed#4932
Additional data and experiments
Perplexity, memory, and latency results before/after this PR
NOTE: using the .to() casting method. The torch_dtype casting method sees no differences, as inv_freq is not cast.

Llama 2 -- very little ppl differences
Dtype: bfloat16 (ignore the vram -- the latest commit has the same GPU memory footprint as main)

Dtype: float16 (ignore the vram -- the latest commit has the same GPU memory footprint as main)

TinyLlama -- visible ppl upgrade

Dtype: bfloat16 (ignore the vram -- the latest commit has the same GPU memory footprint as main)

Dtype: float16 (ignore the vram -- the latest commit has the same GPU memory footprint as main)

How sensitive is the sin/cos creation to the device placement?
Consider the following script:
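(The original script is collapsed in this view; the snippet below is a minimal stand-in that probes the same question by computing the sin/cos with the Llama-style recipe directly, rather than through the library internals.)

```python
import torch

TEST_DTYPE = torch.float32  # also try torch.float16 / torch.float64

def sincos(device, dtype, dim=128, seq_len=4096, base=10000.0):
    # Same recipe as the Llama rotary embedding: inv_freq -> outer product -> cos/sin
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=dtype) / dim))
    t = torch.arange(seq_len, device=device, dtype=dtype)
    freqs = torch.outer(t, inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()

cos_cpu, sin_cpu = sincos("cpu", TEST_DTYPE)
cos_gpu, sin_gpu = sincos("cuda", TEST_DTYPE)

print("max |Δcos|:", (cos_cpu - cos_gpu.cpu()).abs().max().item())
print("max |Δsin|:", (sin_cpu - sin_gpu.cpu()).abs().max().item())
```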
On main, before this PR, we can see differences as large as ~1e-3, regardless of TEST_DTYPE (even in torch.float64!). After this PR, the difference is 0.0.

Original Llama codebase vs our codebase after this PR?
Key takeaways:
👉 sin/cos are created on the available device (and not on CPU)
👉 sin/cos are not only kept in FP32, but also applied in FP32!
Consider the following script, which compares Hugging Face's implementation against Meta's repo:
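(That script is not reproduced here; the sketch below only illustrates the idea: an HF-style sin/cos computed in the model dtype on GPU, checked against a float32 complex-exponential reference in the style of Meta's repo, re-implemented inline rather than imported, so the numbers are indicative only.)

```python
import torch

dim, seq_len, base = 128, 2048, 10000.0
device, dtype = "cuda", torch.float16

# Hugging Face-style sin/cos, computed in the model dtype on the device (as on main)
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=dtype) / dim))
t = torch.arange(seq_len, device=device, dtype=dtype)
freqs_hf = torch.outer(t, inv_freq)
cos_hf, sin_hf = freqs_hf.cos(), freqs_hf.sin()

# Meta-style reference: everything in float32, via complex exponentials (polar form)
inv_freq_ref = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
t_ref = torch.arange(seq_len, device=device, dtype=torch.float32)
freqs_ref = torch.outer(t_ref, inv_freq_ref)
freqs_cis = torch.polar(torch.ones_like(freqs_ref), freqs_ref)  # cos + i*sin

print("max |Δcos|:", (cos_hf.float() - freqs_cis.real).abs().max().item())
print("max |Δsin|:", (sin_hf.float() - freqs_cis.imag).abs().max().item())
```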
On main + GPU + FP16, before this PR, we can see sin/cos and logits differences as large as 2e-4 and 6e-2 (respectively). After this PR, the difference is 0.0.