Added test cases for rembert refering to albert and reformer test_tok… #27637
Conversation
Force-pushed from 9eda5a2 to d025a1c
I do not know why tests_tf, tests_torch, run_tests, and tests_jax are failing.
Hey! Thanks for contributing! The failing tests are the ones you added 😉 meaning tokenization_rembert probably needs to be fixed! Do you need help with this?
- make sure the `sp_model_kwargs` attribute is passed
- make sure the `mask_token` is initialized the same way as an added token for both the fast and the slow tokenizers
- read the test errors, as they should indicate what's wrong! 🤗
Force-pushed from d025a1c to f86c282
You are right. This is absent in the slow tokenizer for RemBert. After commenting out that line, I am not getting any failing cases. How should we handle test cases in which the mask token is initialized differently?
Force-pushed from f86c282 to c3d8e9f
I overrode test_added_tokens_serialization to clear the following error: ` tests\test_tokenization_common.py:4128: AssertionError `
I think we should rather try to update the slow tokenizer with the line you mentioned instead of modifying the test (which is pretty robust 😉). If that does not work, we need to make sure the reason why we are skipping the test is valid, and add it as a comment!
Feel free to ping me for help 🤗 It's usually a bit tricky!
Hi @ArthurZucker, thanks for your suggestion. I dug deeper and found that this line for the slow tokenizer creates an AddedToken for [MASK] with lstrip=False. However, the fast tokenizer instance ends up with lstrip=True, because the RemBertTokenizerFast constructor sets lstrip=True explicitly. The issue can be fixed (and all tests pass) by setting lstrip=False, but that requires a change in a file outside test_tokenization_rembert.py; moreover, the comment above that line explains why we have lstrip=True.
Do let me know, and I can send the changes accordingly.
Hey! No, I'm pretty sure this should be fixed by adding the following to the slow tokenizer: `# Mask token behaves like a normal word, i.e. includes the space before it` followed by `mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token`. Did you try that?
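The `lstrip=True` flag in that line is what makes the mask token "behave like a normal word": during matching, the token absorbs whitespace on its left. A minimal sketch of that semantics, using a toy stand-in class (the real `AddedToken` and the actual matching live in the tokenizers library, so everything below is illustrative only):

```python
from dataclasses import dataclass


# Toy stand-in for huggingface's AddedToken; only the fields relevant to
# this discussion are modeled.
@dataclass
class AddedToken:
    content: str
    lstrip: bool = False
    rstrip: bool = False


def match_span(text: str, token: AddedToken) -> str:
    """Return the span of `text` the token would consume around its match."""
    start = text.index(token.content)
    end = start + len(token.content)
    if token.lstrip:  # absorb whitespace to the left of the token
        while start > 0 and text[start - 1] == " ":
            start -= 1
    if token.rstrip:  # absorb whitespace to the right of the token
        while end < len(text) and text[end] == " ":
            end += 1
    return text[start:end]


mask = AddedToken("[MASK]", lstrip=True, rstrip=False)
print(match_span("Paris is the [MASK] of France", mask))  # " [MASK]" (leading space absorbed)
```

With `lstrip=False` the leading space would be left outside the token, which is exactly the mismatch between the slow and fast tokenizers that the failing test detects.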
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
That isn't fixing the issue, because it requires mask_token to be a string. When you call from_pretrained, everything in 'added_tokens_decoder' gets converted to AddedToken instances here, based on the config file. Therefore the slow tokenizer will still have lstrip as False despite the added line, since that line only applies when the instance is a string.
No. mask_token does not need to be a string; it has to be an AddedToken, otherwise it's cast to AddedToken when adding it to the vocab if it's not already part of the vocab. Anyway, what I am saying is that this test works for the ~100 or so other tokenizers; to skip it we need a good reason, like a very edgy case, otherwise it means that loading the tokenizers from slow to fast, from the hub, etc. does not produce correct results. 🤗
Hi,
If this is the case, the following line will fix the issue. I just verified it.
If the above change to the slow tokenizer is okay, I'll send a patch; otherwise, we have to rely on finding a strong case to skip the slow→fast test.
Yeah the above change should be alright mostly if all tests pass! |
Force-pushed from c3d8e9f to 913c5a1
Thanks!
@@ -111,6 +111,9 @@ def __init__(
    mask_token="[MASK]",
    **kwargs,
):
    # Mask token behaves like a normal word, i.e. includes the space before it
    mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
- mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
+ mask_token = AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
That change doesn't work. The initial RembertTokenizer instance is created from line 45 of this file, which by default makes mask_token="[MASK]" (a string), but the subsequent call to the constructor from this line passes a mask_token that is already an AddedToken instance, with lstrip=False (which will eventually fail the test case).
So by the time it reaches line 115 of this file, mask_token is already an AddedToken instance, and it becomes:
mask_token = AddedToken(AddedToken(...), lstrip=True, rstrip=False, normalized=False)
leading to an error.
You have to put:
- mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
+ mask_token = AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False) if isinstance(mask_token, str) else mask_token
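The `isinstance` guard in that suggestion is what prevents the double-wrapping described above: a plain string gets wrapped with the desired stripping flags, while a pre-built AddedToken (e.g. one restored from a config) passes through untouched. A minimal runnable sketch, again with a toy stand-in for the real AddedToken class:

```python
from dataclasses import dataclass


# Stand-in for AddedToken (the real one lives in the tokenizers library).
@dataclass
class AddedToken:
    content: str
    lstrip: bool = False
    rstrip: bool = False
    normalized: bool = True


def init_mask_token(mask_token):
    # Only wrap plain strings; a pre-built AddedToken (e.g. restored by
    # from_pretrained) passes through unchanged, which avoids ever
    # constructing AddedToken(AddedToken(...), ...).
    if isinstance(mask_token, str):
        return AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
    return mask_token


wrapped = init_mask_token("[MASK]")
restored = AddedToken("[MASK]", lstrip=False)
print(wrapped.lstrip)                         # True
print(init_mask_token(restored) is restored)  # True: not re-wrapped
```

Note the remaining tension discussed in this thread: when an AddedToken with `lstrip=False` arrives from a saved config, the guard deliberately leaves it alone, so the flag is not corrected at this point.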
I can do that, but I can't change the value of lstrip, which is private to the class and defaults to False.
Ok, sorry for the confusion. If the issue is that the way the slow tokenizer was saved makes the stripping for the token wrong, then you can copy-paste the test we are talking about and make sure you pass the mask token in the test (to overwrite this), or just push a Rembert tokenizer to the hub and use that one for the test! 🤗
This is much better. I'll do the needful. Thanks
ids = tokenizer._tokenize(text)
text_decode = tokenizer.convert_tokens_to_string(ids)
self.assertEqual(text_decode, text)

text = "In the sky up above"
encode_text = tokenizer._tokenize(text)
decode_text = tokenizer.convert_tokens_to_string(encode_text)
self.assertEqual(text, decode_text)
Let's try to write the expected raw IDs. Since RemBert is a sentencepiece-based model, we could also add some edge cases of what we are expecting.
Agreed. Can you please give me a hint as to how we get the expected IDs and tokens for fixtures/test_sentencepiece.model?
If I have this information, I can add more cases to the function.
You can just run the test and print the outputs of tokenizer._tokenize(text), which is the tokens, and tokenizer.encode(text), which is the input ids.
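This "run it once and copy the printed values into the fixture" pattern can be sketched as below. A toy tokenizer stands in for the real RemBertTokenizer here, so the snippet runs without the sentencepiece fixture; with the real tokenizer you would call the same two methods and paste their printed output into the test's expected lists.

```python
# Toy stand-in tokenizer: splits on whitespace and looks ids up in a small
# vocab (0 for unknown). The vocab contents are made up for illustration.
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def _tokenize(self, text):
        # The real method would run the sentencepiece model.
        return text.split()

    def encode(self, text):
        # The real method would also add special tokens like [CLS]/[SEP].
        return [self.vocab.get(tok, 0) for tok in self._tokenize(text)]


tokenizer = ToyTokenizer({"this": 5, "is": 6, "a": 7, "test": 8})
text = "this is a test"
print(tokenizer._tokenize(text))  # the expected tokens to hard-code
print(tokenizer.encode(text))     # the expected input ids to hard-code
```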
output_text = "this is a test"
return input_text, output_text

def test_convert_token_and_id(self):
not sure this one is needed!
If you think it's trivial, we can remove it. Let me know.
Force-pushed from 913c5a1 to 8a1c9ea
I have made the necessary changes. Can you have a look?
Thanks, looks a lot better!
Let's rather add edge cases with special tokens and spaces, which are the most important, but otherwise LGTM.
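The kind of edge cases the reviewer asks for could look like the sketch below. The input strings are hypothetical candidates (not taken from the PR), and a trivial split/join pair stands in for the real tokenize/detokenize round trip; with RemBertTokenizer, each input would get a hard-coded expected-token fixture instead.

```python
# Hypothetical edge-case inputs exercising special tokens and whitespace.
edge_cases = [
    "Paris is the [MASK] of France.",  # special token mid-sentence
    "[MASK] leads the sentence",       # special token at the start
    "double  spaces  inside",          # repeated whitespace
    " leading and trailing ",          # surrounding whitespace
]


def round_trip(tokenize, detokenize, text):
    # Generic round-trip check: tokenizing then detokenizing should
    # recover the (whitespace-normalized) input.
    return detokenize(tokenize(text))


# With a whitespace stand-in, single-spaced input survives the round trip.
recovered = round_trip(str.split, " ".join, "Paris is the [MASK] of France.")
print(recovered)
```

The interesting cases are precisely the ones where the round trip is lossy (repeated or surrounding whitespace), since those reveal how the tokenizer's stripping behavior differs between slow and fast implementations.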
@require_sentencepiece
@require_tokenizers
class RemBertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = RemBertTokenizer
Anything that is copied from somewhere else needs the # Copied from mention; see here for the docs on that! 🤗
    "▁",
    "🌈",
],
)
fmt: skip
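The reviewer is asking for Black's `# fmt: skip` pragma, which tells the formatter to leave one statement exactly as written, so a hand-laid-out expected-tokens list keeps its layout. A minimal example (the list contents are illustrative, not RemBERT's actual output):

```python
# Without the pragma, Black would collapse or re-wrap this list to fit its
# own line-length rules; `# fmt: skip` preserves the manual layout of this
# single statement.
expected_tokens = ["▁", "🌈"]  # fmt: skip

print(expected_tokens)
```

This is narrower than the `# fmt: off` / `# fmt: on` pair, which disables formatting for a whole region rather than one statement.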
    0,
    1001,
],
)
fmt: skip
text = "Invoice #12345, dated 2023-12-01, is due on 2024-01-15."
tokens = tokenizer.tokenize(text)
self.assertListEqual(
fmt: skip
    4,
    1001,
],
)
use # fmt: skip as well
Force-pushed from 4699fb4 to 754f686
Can you please check the latest commit?
Force-pushed from 96e31d0 to 9ce98ff
…ASK] for slow and fast, Therefore it required to make the initialization for [MASK] token uniform between fast and slow token
…ed the slow token (mask_token) to have AddedToken instance with lstrip=True
…fied slow tokenizer of rembert to have mask_token as AddedToken with lstrip = True
…style + added comments to indicate from the copied test cases
…e stripping the text
Force-pushed from 409c9b6 to b45da2f
Force-pushed from b45da2f to 745eedf
Thanks a lot for the effort! LGTM
Thanks for your support!
What does this PR do?
It addresses #16627 to create test cases for RemBert.
Who can review?
@ArthurZucker @SaulLu