Add loss generating token counts #1610
Conversation
will you hold the PR here until release?
@mvpatel2000 yeah, CI won't pass until release
@@ -1310,9 +1321,11 @@ def build_from_hf(
        raise NotImplementedError()

    batch_collated = dl.dataloader.collate_fn(batch_tokenized)  # type: ignore
-   actual_token_count = dl.get_num_tokens_in_batch(batch_collated)
+   actual_total_token_count = dl.get_num_tokens_in_batch(batch_collated, token_type='total')
I might be missing something, but how can we pass in token_type
here when it's not in the function definition? https://github.com/mosaicml/llm-foundry/pull/1610/files#diff-9568d89aed75ca69416abe2a592c6bb9732129049a62c34e4e9263c18495a236R99
The function being called here is actually defined on the DataSpec
class in Composer (https://github.com/mosaicml/composer/blob/28756dd52e96371689b764cb72c336406460ad35/composer/core/data_spec.py#L301). The DataSpec
takes in a function from the user and uses it.
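For context, a minimal sketch of how a user-supplied counting function plugs into Composer's DataSpec. The train_dataloader and the batch keys are placeholders, and passing token_type through assumes the new Composer behavior this PR depends on (see the release note below):

    import torch
    from composer.core import DataSpec

    CROSS_ENTROPY_IGNORE_INDEX = -100  # assumed ignore index, for illustration


    def get_num_tokens_in_batch(batch: dict, token_type: str = 'total') -> int:
        # Illustrative counter: 'total' counts every input token, anything else
        # counts only the labels that actually contribute to the loss.
        if token_type == 'total':
            return int(batch['input_ids'].numel())
        return int(torch.sum(batch['labels'] != CROSS_ENTROPY_IGNORE_INDEX).item())


    # DataSpec stores the callable and invokes it whenever Composer needs a
    # token count for the batch. `train_dataloader` is a hypothetical DataLoader.
    data_spec = DataSpec(
        dataloader=train_dataloader,
        get_num_tokens_in_batch=get_num_tokens_in_batch,
    )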
Part of the reason for doing it this way was to maintain backwards compatibility with any existing user-defined get_num_tokens_in_batch
functions out there.
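Purely to illustrate that compatibility concern (this is not Composer's actual implementation), a caller could inspect the user's function and only pass token_type when the parameter is declared:

    import inspect
    from typing import Any, Callable


    def call_token_counter(fn: Callable, batch: Any, token_type: str = 'total') -> int:
        # Pass `token_type` only if the user-defined function declares it; older
        # counters that predate the argument are called the old way.
        if 'token_type' in inspect.signature(fn).parameters:
            return fn(batch, token_type=token_type)
        return fn(batch)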
    torch.sum(batch['labels'] != CROSS_ENTROPY_IGNORE_INDEX).item(),
)

# Subtract one for each example in the batch that starts with a non -100,
@dakinggg I don't think this subtraction is necessary. Instead you can just do this:
loss_generating_tokens = int(
    torch.sum(batch['labels'][..., 1:] != CROSS_ENTROPY_IGNORE_INDEX).item(),
)
*I just came across this PR while looking into how Mosaic's libraries handle the gradient accumulation bug recently discussed on x.com
ah yeah, that should work too :)
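As a quick sanity check (not code from the PR), the two forms count the same thing: the label at position 0 never generates loss after the causal shift, so dropping it up front is equivalent to subtracting one for each example whose first label is not -100:

    import torch

    CROSS_ENTROPY_IGNORE_INDEX = -100

    labels = torch.tensor([
        [5, 7, -100, 9],     # starts with a non -100 label
        [-100, -100, 3, 4],  # starts with -100
    ])

    sliced = int(torch.sum(labels[..., 1:] != CROSS_ENTROPY_IGNORE_INDEX).item())
    subtracted = int(torch.sum(labels != CROSS_ENTROPY_IGNORE_INDEX).item()) - int(
        torch.sum(labels[:, 0] != CROSS_ENTROPY_IGNORE_INDEX).item(),
    )
    assert sliced == subtracted == 4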
Takes advantage of the new functionality in Composer to weight microbatches by loss generating tokens, and not just total tokens. See the Composer PR (mosaicml/composer#3677) for more details and manual testing.
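For anyone skimming, a toy example (not Composer's code) of why the weighting matters with gradient accumulation: averaging per-microbatch mean losses weights every microbatch equally, while normalizing by the total number of loss-generating tokens weights each microbatch by how much it actually contributes:

    import torch

    # Hypothetical per-microbatch summed losses and loss-generating token counts.
    microbatch_loss_sums = torch.tensor([12.0, 3.0])
    microbatch_token_counts = torch.tensor([40.0, 5.0])

    # Mean of per-microbatch means: both microbatches get equal weight.
    naive = torch.mean(microbatch_loss_sums / microbatch_token_counts)

    # Token-weighted mean: total loss over total loss-generating tokens.
    weighted = microbatch_loss_sums.sum() / microbatch_token_counts.sum()

    print(naive.item(), weighted.item())  # 0.45 vs. ~0.333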
Note: this needs a Composer release and bump (and CI won't pass until that happens)