
Added test cases for rembert refering to albert and reformer test_tok… #27637

Merged: 13 commits merged into huggingface:main on Dec 4, 2023

Conversation

@nileshkokane01 (Contributor) commented Nov 21, 2023

What does this PR do?

It addresses #16627 by adding test cases for RemBERT.


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @SaulLu

@nileshkokane01 (Contributor, Author)

I do not know why tests_tf, tests_torch, run_tests, and tests_jax are failing.
Can anyone let me know? I can work on resolving it.

@ArthurZucker (Collaborator) left a comment

Hey! Thanks for contributing! The failing tests are the ones you added 😉, meaning tokenization_rembert probably needs to be fixed! Do you need help with this?

  • make sure the sp_model_kwargs attribute is passed
  • make sure the mask_token is initialized the same way as an added token for both the fast and the slow tokenizers (see the sketch after this list)
  • read the test errors, as they should indicate what's wrong! 🤗
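
A minimal sketch of what the second bullet could look like in a slow tokenizer's `__init__`, assuming a typical SentencePiece-based slow tokenizer; the class name and exact signature here are illustrative, not the actual RemBERT source:

```python
from transformers import AddedToken, PreTrainedTokenizer

class MySlowTokenizer(PreTrainedTokenizer):  # hypothetical stand-in for RemBertTokenizer
    def __init__(self, vocab_file, mask_token="[MASK]", sp_model_kwargs=None, **kwargs):
        # Keep sp_model_kwargs as an attribute so it survives save/load round-trips.
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        # Initialize the mask token the same way the fast tokenizer does:
        # it behaves like a normal word, i.e. includes the space before it.
        mask_token = (
            AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
            if isinstance(mask_token, str)
            else mask_token
        )
        super().__init__(mask_token=mask_token, sp_model_kwargs=self.sp_model_kwargs, **kwargs)
```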

@nileshkokane01 (Contributor, Author) commented Nov 22, 2023

> make sure the mask_token is initialized the same way as an added token for both the fast and the slow tokenizers

You are right; this is absent in the slow tokenizer for RemBERT.

If I comment out that line, no test cases fail.

How should we tackle test cases for which the mask token is initialized differently? Should the constructor be passed an appropriate mask_token?

```
self.assertDictEqual(EXPECTED_ADDED_TOKENS_DECODER, tokenizer_fast.added_tokens_decoder)
E AssertionError: {0: A[454 chars]trip=False, single_word=False, normalized=Fals[123 chars]rue)} != {0: A[454 chars]trip=True, single_word=False, normalized=False[122 chars]rue)}
E Diff is 870 characters long. Set self.maxDiff to None to see it.
```

@nileshkokane01 (Contributor, Author) commented Nov 25, 2023

I overrode test_added_tokens_serialization to clear the following error. I don't know if that's the right way.

```
AssertionError: {0: A[454 chars]trip=False, single_word=False, normalized=Fals[123 chars]rue)} != {0: A[454 chars]trip=True, single_word=False, normalized=False[122 chars]rue)}
E Diff is 870 characters long. Set self.maxDiff to None to see it.

tests\test_tokenization_common.py:4128: AssertionError
```
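
Side note: the `[454 chars]` ellipses come from unittest truncating long assertion diffs; as the message itself says, setting `maxDiff` to `None` shows the whole thing. A tiny illustration (class and test names are made up):

```python
import unittest

class DiffDemo(unittest.TestCase):
    # None disables unittest's diff truncation, revealing the full
    # AddedToken repr mismatch instead of "[454 chars]" placeholders.
    maxDiff = None

    def test_full_diff_is_shown(self):
        self.assertDictEqual({0: "lstrip=False"}, {0: "lstrip=True"})
```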

@ArthurZucker (Collaborator) left a comment

I think we should rather try to update the slow tokenizer with the line you mentioned instead of modifying the test (which is pretty robust 😉). If that does not work, we need to make sure the reason why we are skipping the test is valid, and add it as a comment!
Feel free to ping me for help 🤗

@ArthurZucker (Collaborator)

It's usually a bit tricky!

@nileshkokane01 (Contributor, Author) commented Nov 27, 2023

Hi @ArthurZucker,

Thanks for your suggestion.

I dug in deeper and found that this line in the slow tokenizer creates the AddedToken for [MASK] with lstrip=False.

However, the fast tokenizer instance ends up with lstrip=True, since the RemBertTokenizerFast constructor sets lstrip=True explicitly.

The issue can be fixed, and all tests pass, by setting lstrip=False there, but that requires a change in a file outside test_tokenization_rembert.py. Moreover, the comment above that line gives the reason why we have lstrip=True:

> Mask token behave like a normal word, i.e. include the space before it

Do let me know; I can send the changes accordingly.

@ArthurZucker (Collaborator)

Hey! No, I'm pretty sure this should be fixed by setting the mask_token using:

```python
# Mask token behave like a normal word, i.e. include the space before it
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
```

in the slow tokenizer. Did you try that?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@nileshkokane01 (Contributor, Author) commented Nov 27, 2023

That isn't fixing the issue, because it requires mask_token to be a string.

When you call from_pretrained, everything in added_tokens_decoder gets converted to AddedToken instances here, based on the config file.

Therefore the slow tokenizer will still have lstrip=False despite the added line, since the line only applies when mask_token is a string.
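
To make this concrete, here is a small self-contained sketch (not the actual transformers code path) showing how the `isinstance(..., str)` guard is bypassed once the token has already been deserialized into an `AddedToken`:

```python
from transformers import AddedToken

def init_mask_token(mask_token):
    # The guard under discussion: only a *string* is re-wrapped with
    # lstrip=True; an existing AddedToken passes through unchanged.
    if isinstance(mask_token, str):
        return AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
    return mask_token

fresh = init_mask_token("[MASK]")                             # wrapped: lstrip=True
loaded = init_mask_token(AddedToken("[MASK]", lstrip=False))  # kept: lstrip=False
print(fresh.lstrip, loaded.lstrip)  # True False -> the slow/fast mismatch
```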

@ArthurZucker (Collaborator)

No, mask_token does not need to be a string; it has to be an AddedToken, otherwise it's cast to an AddedToken when it's added to the vocab (if it's not already part of the vocab). Anyway, what I am saying is that this test works for the ~100 or so other tokenizers; to skip it we need a good reason, like a very edgy case, otherwise it means that loading the tokenizers from slow to fast, from the hub, etc. does not produce correct results.

@ArthurZucker (Collaborator)

🤗

@nileshkokane01 (Contributor, Author) commented Nov 28, 2023

Hi,

> No. mask_token does not need to be a string it has to be an AddedToken

If that is the case, the following line fixes the issue; I just verified it:

```python
# Mask token behave like a normal word, i.e. include the space before it
mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
```

If the above change to the slow tokenizer is okay, I'll send a patch; otherwise we have to rely on finding a strong case to skip the slow→fast test.

@ArthurZucker (Collaborator)

Yeah, the above change should be alright, provided all tests pass!

@ArthurZucker (Collaborator) left a comment
Thanks!

```diff
@@ -111,6 +111,9 @@ def __init__(
         mask_token="[MASK]",
         **kwargs,
     ):
+        # Mask token behave like a normal word, i.e. include the space before it
+        mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
```
@ArthurZucker (Collaborator)

Suggested change:

```diff
-        mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
```

@nileshkokane01 (Contributor, Author) commented Nov 29, 2023

That change doesn't work: the initial RemBertTokenizer instance is created from line 45 of this file, which by default makes mask_token="[MASK]" (a string), but the subsequent call to the constructor from this line passes a mask_token that is already an AddedToken instance, with lstrip=False (which will eventually fail the test case).

So by the time it reaches line 115 of this file, mask_token is an AddedToken instance and the suggestion amounts to:

```python
mask_token = AddedToken(AddedToken(...), lstrip=True, rstrip=False, normalized=False)
```

leading to an error.

@ArthurZucker (Collaborator)

You have to put:

Suggested change:

```diff
-        mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False) if isinstance(mask_token, str) else mask_token
```

@nileshkokane01 (Contributor, Author)

I can do that, but I can't change the value of lstrip, which is private to the class and defaults to False.

@ArthurZucker (Collaborator)

Ok, sorry for the confusion. If the issue is that the way the slow tokenizer was saved makes the stripping for the token wrong, then you can copy-paste the test we are talking about and make sure you pass the mask token in the test (to override this), or just push a RemBERT tokenizer to the hub and use that one for the test! 🤗
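
A minimal sketch of the first option, overriding the serialized mask token at load time; the checkpoint name is illustrative and the real test harness loads tokenizers differently:

```python
from transformers import AddedToken, RemBertTokenizer, RemBertTokenizerFast

checkpoint = "google/rembert"  # illustrative checkpoint
# Passing mask_token explicitly is intended to override the stripping
# flags that were serialized with the checkpoint, so that the slow and
# fast tokenizers agree on lstrip=True.
mask = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
slow = RemBertTokenizer.from_pretrained(checkpoint, mask_token=mask)
fast = RemBertTokenizerFast.from_pretrained(checkpoint, mask_token=mask)
assert str(slow.mask_token) == str(fast.mask_token) == "[MASK]"
```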

@nileshkokane01 (Contributor, Author)

This is much better. I'll make the changes accordingly. Thanks!

Comment on lines 141 to 117:

```python
ids = tokenizer._tokenize(text)
text_decode = tokenizer.convert_tokens_to_string(ids)
self.assertEquals(text_decode, text)

text = "In the sky up above"
encode_text = tokenizer._tokenize(text)
decode_text = tokenizer.convert_tokens_to_string(encode_text)
self.assertEqual(text, decode_text)
```

@ArthurZucker (Collaborator)

Let's try to write the expected raw IDs. Since RemBERT is a sentencepiece-based model, we could also add some edge cases of what we are expecting.

@nileshkokane01 (Contributor, Author)

Agreed. Can you please give me a hint as to how to get the expected IDs and tokens for fixtures/test_sentencepiece.model? If I have this information, I can add more cases to the function.

@ArthurZucker (Collaborator)

You can just run the test and print the outputs of tokenizer._tokenize(text), which are the text tokens, and tokenizer.encode(text), which are the input ids.
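
For instance, a throwaway snippet along these lines prints the values to paste into the assertions (the fixture path follows the transformers test layout; adjust it to your checkout):

```python
from transformers import RemBertTokenizer

# SentencePiece fixture shipped with the transformers test suite.
SAMPLE_VOCAB = "tests/fixtures/test_sentencepiece.model"

tokenizer = RemBertTokenizer(SAMPLE_VOCAB)
text = "In the sky up above"
print(tokenizer._tokenize(text))  # the expected tokens, e.g. ["▁In", "▁the", ...]
print(tokenizer.encode(text))     # the expected input ids, with special tokens added
```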

```python
        output_text = "this is a test"
        return input_text, output_text

    def test_convert_token_and_id(self):
```
@ArthurZucker (Collaborator)

not sure this one is needed!

@nileshkokane01 (Contributor, Author)

If you think it's trivial, we can remove it. Let me know.

@nileshkokane01 (Contributor, Author) commented Dec 1, 2023

I have made the necessary changes. Can you have a look?

@ArthurZucker (Collaborator) left a comment

Thanks, this looks a lot better!
Let's also add edge cases with special tokens and spaces, which are the most important, but otherwise LGTM.

Comment on lines +32 to +35:

```python
@require_sentencepiece
@require_tokenizers
class RemBertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = RemBertTokenizer
```
@ArthurZucker (Collaborator)

Anything that is copied from somewhere else needs the # Copied from mention; see here for doc on that! 🤗
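
For reference, the convention is a structured comment that the repo-consistency scripts check; a sketch of what it might look like here (the exact source path and the Albert->RemBert replacement pattern are assumptions about which test was copied):

```python
# Copied from tests.models.albert.test_tokenization_albert.AlbertTokenizationTest.test_full_tokenizer with Albert->RemBert
def test_full_tokenizer(self):
    ...
```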

"▁",
"🌈",
],
)
@ArthurZucker (Collaborator)

fmt: skip

```python
        0,
        1001,
    ],
)
```
@ArthurZucker (Collaborator)

fmt: skip


text = "Invoice #12345, dated 2023-12-01, is due on 2024-01-15."
tokens = tokenizer.tokenize(text)
self.assertListEqual(
@ArthurZucker (Collaborator)

fmt: skip

```python
        4,
        1001,
    ],
)
```
@ArthurZucker (Collaborator)

Use `# fmt: skip` as well.
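
For context, `# fmt: skip` tells black to leave a statement's hand-formatting alone; in the transformers tests it is appended after the closing parenthesis of the multi-line call. A hedged illustration with made-up ids:

```python
# The trailing comment keeps black from reflowing the hand-aligned id list.
self.assertListEqual(
    ids,
    [1000, 4, 1001],  # hypothetical expected ids for this fixture
)  # fmt: skip
```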

@nileshkokane01 (Contributor, Author) commented Dec 1, 2023

Can you please check the latest commit?

@nileshkokane01 force-pushed the my-working-branch branch 2 times, most recently from 96e31d0 to 9ce98ff on December 1, 2023

Kokane and others added 8 commits on December 3, 2023:

  • …ASK] for slow and fast, Therefore it required to make the initialization for [MASK] token uniform between fast and slow token
  • …ed the slow token (mask_token) to have AddedToken instance with lstrip=True
  • …fied slow tokenizer of rembert to have mask_token as AddedToken with lstrip = True
  • …style + added comments to indicate from the copied test cases
@ArthurZucker (Collaborator) left a comment

Thanks a lot for the effort! LGTM

@ArthurZucker merged commit 4d4febb into huggingface:main on Dec 4, 2023 — 18 checks passed
@nileshkokane01 (Contributor, Author)

@ArthurZucker

Thanks for your support!

@nileshkokane01 deleted the my-working-branch branch on February 12, 2024