
Enhance DataCollatorForLanguageModeling with Configurable Token Replacement Probabilities #35251

Open • wants to merge 7 commits into main

Conversation

@mahdibaghbanzadeh (Author)

This pull request introduces enhancements to the DataCollatorForLanguageModeling class, providing greater flexibility for token replacement during masked language modeling (MLM). The key changes include:

  1. Configurable Replacement Probabilities:

    • mask_replace_prob: Specifies the probability of replacing masked tokens with the [MASK] token (default: 80%).
    • random_replace_prob: Specifies the probability of replacing masked tokens with random tokens from the vocabulary (default: 10%).
    • The remaining masked tokens are left unchanged (default: 10%).
  2. Edge Case Handling:

    • Scales random_replace_prob over the probability mass remaining after mask_replace_prob is applied, so the configured proportions hold across all masked tokens (see the sketch after this list).
    • Includes validation to ensure the sum of mask_replace_prob and random_replace_prob does not exceed 1.
  3. Backward Compatibility:

    • Default behavior mimics the traditional 80-10-10 rule for MLM token replacement.
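For reference, here is a minimal sketch of how the rescaling in point 2 can work, modeled on the library's existing torch_mask_tokens logic. It is illustrative only; the function name and exact structure are assumptions, not the PR's actual code.

```python
import torch

def mask_tokens_sketch(inputs, tokenizer, mlm_probability=0.15,
                       mask_replace_prob=0.8, random_replace_prob=0.1):
    # Validation mirroring the PR's description: the two probabilities
    # must not sum past 1 (the remainder is "leave unchanged").
    if mask_replace_prob + random_replace_prob > 1:
        raise ValueError("mask_replace_prob + random_replace_prob must be <= 1")

    labels = inputs.clone()
    # Choose which tokens participate in MLM at all.
    masked_indices = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked_indices] = -100  # loss is computed only on masked tokens

    # mask_replace_prob of the masked tokens become [MASK].
    indices_replaced = torch.bernoulli(
        torch.full(labels.shape, mask_replace_prob)
    ).bool() & masked_indices
    inputs[indices_replaced] = tokenizer.mask_token_id

    # Rescale random_replace_prob over the tokens not already replaced:
    # with the 80/10 defaults this is 0.1 / (1 - 0.8) = 0.5, recovering
    # the classic 80-10-10 split.
    if mask_replace_prob < 1.0:
        scaled_prob = random_replace_prob / (1.0 - mask_replace_prob)
        indices_random = (
            torch.bernoulli(torch.full(labels.shape, scaled_prob)).bool()
            & masked_indices
            & ~indices_replaced
        )
        random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

    # The remaining masked tokens are left unchanged.
    return inputs, labels
```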

Examples of New Functionality

  • Default Behavior:
    Replace 80% of masked tokens with [MASK], 10% with random tokens, and leave 10% unchanged.
  • Custom Configurations:
    • Replace all masked tokens with [MASK]:
      mask_replace_prob=1.0, random_replace_prob=0.0
    • Replace all masked tokens with random tokens:
      mask_replace_prob=0.0, random_replace_prob=1.0
    • Balanced replacement:
      mask_replace_prob=0.5, random_replace_prob=0.4
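Assuming the parameter names land as described above, usage might look like this (the checkpoint is a placeholder):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Default behavior: 80% [MASK], 10% random, 10% unchanged.
default_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

# Replace all masked tokens with [MASK].
all_mask_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mask_replace_prob=1.0, random_replace_prob=0.0
)

# Balanced replacement: 50% [MASK], 40% random, 10% unchanged.
balanced_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mask_replace_prob=0.5, random_replace_prob=0.4
)
```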

Additional Notes

  • Updated docstrings to reflect the new configuration options.
  • Added validations for probability values and enhanced edge case handling for robust training workflows.

This enhancement gives users greater control over MLM training configurations, catering to various pretraining and fine-tuning use cases.


@Rocketknight1 (Member) left a comment:


I like this addition to the class! Some suggestions before we can merge it, though:

  • You'll need to run pip install transformers[quality] followed by make style to fix the code style issues
  • We'll need some tests to cover these new options! They should go in tests/trainer/test_data_collator.py.

Because the collator uses random sampling, though, please don't write tests that check the number of masked tokens is close to the expected value - these are very flaky and tend to randomly fail 1% of the time, which is very annoying in our CI. Instead, I suggest setting values to 0 or 1 and confirming that you get the expected behaviour - e.g. set mask_replace_prob=1 and confirm that every token is either the original token or [MASK]. You can also set illegal values and confirm that an error is raised.
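A deterministic test along these lines might look like the following sketch (names and details are illustrative, not the PR's final tests):

```python
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

def test_all_mask_replacement():
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=True,
        mask_replace_prob=1.0,
        random_replace_prob=0.0,
        return_tensors="pt",
    )
    original = list(range(1000, 1010))  # arbitrary in-vocabulary token ids
    batch = collator([{"input_ids": original}, {"input_ids": original}])

    # With mask_replace_prob=1.0 there is no randomness in *how* a masked
    # token is replaced: every position must be the original id or [MASK].
    expected = torch.tensor([original, original])
    ok = (batch["input_ids"] == expected) | (batch["input_ids"] == tokenizer.mask_token_id)
    assert bool(ok.all())
```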

(Resolved inline review comment on src/transformers/data/data_collator.py.)
@mahdibaghbanzadeh (Author)

Thanks for the feedback!
I updated the docstring and added the following tests:

  1. test_probability_sum_error: Ensures an error is raised if mask_replace_prob + random_replace_prob is not within [0, 1].
  2. test_all_mask_replacement: Verifies functionality when mask_replace_prob=1, ensuring all tokens are either the original token or [MASK].
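A sketch of what the first test might look like (pytest-style; the exact exception type and whether validation happens at construction time are assumptions):

```python
import pytest
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

def test_probability_sum_error():
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # 0.8 + 0.5 = 1.3 > 1, so the collator should reject this configuration.
    with pytest.raises(ValueError):
        DataCollatorForLanguageModeling(
            tokenizer=tokenizer,
            mlm=True,
            mask_replace_prob=0.8,
            random_replace_prob=0.5,
        )
```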

@Rocketknight1 (Member)

@mahdibaghbanzadeh this looks good now! Let me know whenever you're ready for final review and I'll ping a core maintainer.

@mahdibaghbanzadeh (Author)

@Rocketknight1 Thanks! Please let them know it's ready for final review.

@Rocketknight1 (Member)

cc @ArthurZucker for core maintainer review!
