Add Rotary Positional Embeddings (RoPE) - part 2 of parallel attention blocks #450
base: main
Conversation
Codecov Report
Additional details and impacted files:
@@ Coverage Diff @@
## main #450 +/- ##
==========================================
+ Coverage 69.11% 69.24% +0.13%
==========================================
Files 170 170
Lines 11524 11580 +56
==========================================
+ Hits 7965 8019 +54
- Misses 3559 3561 +2
☔ View full report in Codecov by Sentry.
Looks good! Just a few minor things, mainly around testing and comments
@@ -112,3 +115,38 @@ def test_forward(self, data, emb):
        actual = emb(data)
        expected = torch.Size([3, 5])
        assert_expected(actual.shape, expected)


    def test_rotary_embeddings_math():
Can we put these unit tests into a class? (Similar to the other tests in this file)
Yes, will do.
        return cur_freqs.view(*shape, 2)

    def forward(
        self, q: torch.Tensor, k: torch.Tensor, start_pos: Union[int, float]
Do you think it makes sense to have start_pos default to 0? (My assumption is that this would at least be the starting point for most users)
        Maximum expected sequence length for the model; if exceeded, the cached freqs will be recomputed
    ratio: int
        The ratio for the geometric progression used to compute the rotation angles
    """
It'd be nice for the docstring to spell out the exact details of these embeddings, e.g. at least the [[cos, -sin], [sin, cos]] rotation matrix, and maybe even a small example (like the simple 2D one you wrote for the unit test).
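For context, the rotation the reviewer is referring to could be illustrated with a sketch like the following (the helper name `rotate_2d` is hypothetical, not part of the PR): each consecutive pair of embedding dimensions is rotated by an angle via the standard 2D rotation matrix.

```python
import math
import torch

def rotate_2d(x: torch.Tensor, theta: float) -> torch.Tensor:
    """Apply the 2D rotation [[cos, -sin], [sin, cos]] to a pair of features.

    This is the building block of RoPE: each (even, odd) feature pair is
    rotated by a position-dependent angle theta.
    """
    cos, sin = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[cos, -sin], [sin, cos]])
    return x @ rot.T

# Rotating the unit vector [1, 0] by pi/2 should give [0, 1].
v = torch.tensor([1.0, 0.0])
out = rotate_2d(v, math.pi / 2)
```

A short example along these lines in the docstring would make the [[cos, -sin], [sin, cos]] structure immediately visible to readers.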
    assert_expected(qr[0, :, 1], qr2[1, :, 0])

    assert_expected(kr[0], kr2[0])
    assert_expected(kr[0, :, 1], kr2[1, :, 0])
Can we also add a test for updating the cached frequencies? (As far as I can tell this second test is not hitting that block in L262-268, lmk if I'm misunderstanding)
Yes, that's a good idea.
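A test for the cache-update path could follow a pattern like the sketch below. Note this uses a minimal stand-in class (`FreqCache`, a hypothetical name) rather than the PR's actual `RotaryEmbeddings`, purely to show the shape of such a test: request a sequence longer than the cached range and assert the cache grew.

```python
import torch

class FreqCache:
    """Minimal sketch of a lazily-grown RoPE frequency cache (not the PR's class)."""

    def __init__(self, dim: int, max_seq_len: int = 16, ratio: float = 10000.0):
        self.dim = dim
        self.ratio = ratio
        self.max_seq_len_cached = 0
        self._recompute(max_seq_len)

    def _recompute(self, seq_len: int) -> None:
        # Geometric progression of inverse frequencies, one per feature pair.
        inv_freq = 1.0 / (self.ratio ** (torch.arange(0, self.dim, 2).float() / self.dim))
        t = torch.arange(seq_len).float()
        self.freqs = torch.outer(t, inv_freq)  # (seq_len, dim // 2)
        self.max_seq_len_cached = seq_len

    def get(self, start_pos: int, seq_len: int) -> torch.Tensor:
        # Rebuild the cache when the requested range exceeds what was cached.
        if start_pos + seq_len > self.max_seq_len_cached:
            self._recompute(start_pos + seq_len)
        return self.freqs[start_pos : start_pos + seq_len]

cache = FreqCache(dim=8, max_seq_len=16)
_ = cache.get(start_pos=0, seq_len=32)  # exceeds the cache, forcing a rebuild
```

A unit test against the real module would do the same: call `forward` with a sequence past `max_seq_len` and assert the cached buffer was recomputed.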
        k_ = k.float().reshape(*k.shape[:-1], -1, 2)  # B H L D/2 2

        if isinstance(start_pos, int):
            if start_pos + seq_len > self.max_seq_len_cached:
Some comments here about when the frequencies need to be recomputed might be helpful
Sounds good. Offhand, the recompute triggers should be: a change of dtype, a change of device, and requesting a sequence length greater than max_seq_len.
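Those three triggers could be captured in a single predicate, sketched below. The function name `needs_recompute` and its parameters are hypothetical, just illustrating the conditions listed above.

```python
import torch

def needs_recompute(cached: torch.Tensor, x: torch.Tensor,
                    start_pos: int, seq_len: int,
                    max_seq_len_cached: int) -> bool:
    """Return True if the cached frequencies must be rebuilt for this input."""
    return (
        start_pos + seq_len > max_seq_len_cached  # sequence exceeds the cache
        or cached.dtype != x.dtype                # input arrived in a new dtype
        or cached.device != x.device              # input moved to a new device
    )

cached = torch.zeros(16, 4)                       # float32 cache on CPU
x = torch.zeros(2, 8, 4, dtype=torch.float16)     # dtype mismatch triggers a rebuild
flag = needs_recompute(cached, x, start_pos=0, seq_len=8, max_seq_len_cached=16)
```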
        )
        self.compute_freqs_cis(max_position_embeddings)

    def compute_freqs_cis(
Random q: what does cis mean here?
It's short for the rotation transform: technically we're computing e^(i*alpha) = cos(alpha) + i*sin(alpha), which shortens to cos + i*sin = "cis".
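The identity is easy to check numerically; here is a quick illustration of cis(alpha) = e^(i*alpha) using only the standard library:

```python
import cmath
import math

# "cis(alpha)" is shorthand for cos(alpha) + i*sin(alpha), which by
# Euler's formula equals e^(i*alpha).
alpha = math.pi / 3
via_exp = cmath.exp(1j * alpha)
via_cis = complex(math.cos(alpha), math.sin(alpha))
```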
We should probably add that to the docstring, actually; otherwise it's too cryptic.
@rohan-varma has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
High-level comment: let's maybe create a modules/layers/embeddings folder in the future, as we might have multiple embedding layers.
Summary:
Adds Rotary Positional Embeddings (RoPE)
Test plan:
Two unit tests: one for the math, one for padding.