Added test cases for rembert refering to albert and reformer test_tok… #27637
Conversation
Force-pushed from 9eda5a2 to d025a1c
I do not know why tests_tf, tests_torch, run_tests, and tests_jax are failing.
Hey! Thanks for contributing! The failing tests are the ones you added 😉 meaning tokenization_rembert probably needs to be fixed! Do you need help with this?
- make sure the `sp_model_kwargs` attribute is passed
- make sure the `mask_token` is initialized the same way as an added token for both the fast and the slow tokenizers
- read the test errors, as they should indicate what's wrong! 🤗
Force-pushed from d025a1c to f86c282
You are right. This is absent in the slow tokenizer for RemBert. After commenting out that line, I am not getting any failing cases. How should we handle test cases in which the mask token is initialized differently?
Force-pushed from f86c282 to c3d8e9f
I overrode test_added_tokens_serialization to clear the following error: ` tests\test_tokenization_common.py:4128: AssertionError `
I think we should rather try to update the slow tokenizer with the line you mentioned instead of modifying the test (which is pretty robust 😉). If that does not work, we need to make sure the reason why we are skipping the test is valid, and add it as a comment!
Feel free to ping me for help 🤗 It's usually a bit tricky!
Hi @ArthurZucker, thanks for your suggestion. I dug deeper and found that this line for the slow tokenizer creates an AddedToken for [MASK] with lstrip=False. However, the fast tokenizer instance ends up with lstrip=True, because the RemBertTokenizerFast constructor sets lstrip=True explicitly. The issue can be fixed (and all tests pass) by setting lstrip=False, but that requires a change in a file outside test_tokenization_rembert.py; moreover, the comment above that line explains why we have lstrip=True.
Do let me know, and I can send the changes accordingly.
Hey! No, I'm pretty sure this should be fixed by adding the following to the slow tokenizer: `# Mask token behaves like a normal word, i.e. includes the space before it` followed by `mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token`. Did you try that?
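The `lstrip=True` flag in that line is what makes the mask token "behave like a normal word": during matching, the token absorbs whitespace on its left. A minimal sketch of that semantics, using a toy stand-in class (the real `AddedToken` and the actual matching live in the tokenizers library, so everything below is illustrative only):

```python
from dataclasses import dataclass


# Toy stand-in for huggingface's AddedToken; only the fields relevant to
# this discussion are modeled.
@dataclass
class AddedToken:
    content: str
    lstrip: bool = False
    rstrip: bool = False


def match_span(text: str, token: AddedToken) -> str:
    """Return the span of `text` the token would consume around its match."""
    start = text.index(token.content)
    end = start + len(token.content)
    if token.lstrip:  # absorb whitespace to the left of the token
        while start > 0 and text[start - 1] == " ":
            start -= 1
    if token.rstrip:  # absorb whitespace to the right of the token
        while end < len(text) and text[end] == " ":
            end += 1
    return text[start:end]


mask = AddedToken("[MASK]", lstrip=True, rstrip=False)
print(match_span("Paris is the [MASK] of France", mask))  # " [MASK]" (leading space absorbed)
```

With `lstrip=False` the leading space would be left outside the token, which is exactly the mismatch between the slow and fast tokenizers that the failing test detects.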
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
That isn't fixing the issue, because it requires mask_token to be a string. When you call from_pretrained, everything in 'added_tokens_decoder' gets converted to AddedToken instances here, based on the config file. Therefore the slow tokenizer will still have lstrip as False despite the added line, since that line only applies when the instance is a string.
No. mask_token does not need to be a string; it has to be an AddedToken, otherwise it's cast to AddedToken when adding it to the vocab if it's not already part of the vocab. Anyway, what I am saying is that this test works for the ~100 or so other tokenizers; to skip it we need a good reason, like a very edgy case, otherwise it means that loading the tokenizers from slow to fast, from the hub, etc. does not produce correct results. 🤗
Hi,
If this is the case, the following line will fix the issue. I just verified it.
If the above change to the slow tokenizer is okay, I'll send a patch; otherwise, we have to rely on finding a strong case to skip the slow→fast test.
Yeah the above change should be alright mostly if all tests pass! |
Force-pushed from c3d8e9f to 913c5a1
Thanks!
@@ -111,6 +111,9 @@ def __init__(
    mask_token="[MASK]",
    **kwargs,
):
    # Mask token behaves like a normal word, i.e. includes the space before it
    mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
- mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
+ mask_token = AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
That change doesn't work. The initial RembertTokenizer instance is created from line 45 of this file, which by default makes mask_token="[MASK]" (a string), but the subsequent call to the constructor from this line passes a mask_token that is already an AddedToken instance, with lstrip=False (which will eventually fail the test case).
So by the time it reaches line 115 of this file, mask_token is already an AddedToken instance, and it becomes:
mask_token = AddedToken(AddedToken(...), lstrip=True, rstrip=False, normalized=False)
leading to an error.
You have to put:
- mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
+ mask_token = AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False) if isinstance(mask_token, str) else mask_token
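The `isinstance` guard in that suggestion is what prevents the double-wrapping described above: a plain string gets wrapped with the desired stripping flags, while a pre-built AddedToken (e.g. one restored from a config) passes through untouched. A minimal runnable sketch, again with a toy stand-in for the real AddedToken class:

```python
from dataclasses import dataclass


# Stand-in for AddedToken (the real one lives in the tokenizers library).
@dataclass
class AddedToken:
    content: str
    lstrip: bool = False
    rstrip: bool = False
    normalized: bool = True


def init_mask_token(mask_token):
    # Only wrap plain strings; a pre-built AddedToken (e.g. restored by
    # from_pretrained) passes through unchanged, which avoids ever
    # constructing AddedToken(AddedToken(...), ...).
    if isinstance(mask_token, str):
        return AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False)
    return mask_token


wrapped = init_mask_token("[MASK]")
restored = AddedToken("[MASK]", lstrip=False)
print(wrapped.lstrip)                         # True
print(init_mask_token(restored) is restored)  # True: not re-wrapped
```

Note the remaining tension discussed in this thread: when an AddedToken with `lstrip=False` arrives from a saved config, the guard deliberately leaves it alone, so the flag is not corrected at this point.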
I can do that, but I can't change the value of lstrip, which is private to the class and defaults to False.
Ok, sorry for the confusion. If the issue is that the way the slow tokenizer was saved makes the stripping for the token wrong, then you can copy-paste the test we are talking about and make sure you pass the mask token in the test (to overwrite this), or just push a Rembert tokenizer to the hub and use that one for the test! 🤗
This is much better. I'll do the needful. Thanks
ids = tokenizer._tokenize(text)
text_decode = tokenizer.convert_tokens_to_string(ids)
self.assertEqual(text_decode, text)

text = "In the sky up above"
encode_text = tokenizer._tokenize(text)
decode_text = tokenizer.convert_tokens_to_string(encode_text)
self.assertEqual(text, decode_text)
Let's try to write the expected raw IDs. Since RemBert is a sentencepiece-based model, we could also add some edge cases of what we are expecting.
Agreed. Can you please give me a hint as to how we get the expected IDs and tokens for fixtures/test_sentencepiece.model?
If I have this information, I can add more cases to the function.
You can just run the test and print the outputs of tokenizer._tokenize(text), which is the tokens, and tokenizer.encode(text), which is the input ids.
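This "run it once and copy the printed values into the fixture" pattern can be sketched as below. A toy tokenizer stands in for the real RemBertTokenizer here, so the snippet runs without the sentencepiece fixture; with the real tokenizer you would call the same two methods and paste their printed output into the test's expected lists.

```python
# Toy stand-in tokenizer: splits on whitespace and looks ids up in a small
# vocab (0 for unknown). The vocab contents are made up for illustration.
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def _tokenize(self, text):
        # The real method would run the sentencepiece model.
        return text.split()

    def encode(self, text):
        # The real method would also add special tokens like [CLS]/[SEP].
        return [self.vocab.get(tok, 0) for tok in self._tokenize(text)]


tokenizer = ToyTokenizer({"this": 5, "is": 6, "a": 7, "test": 8})
text = "this is a test"
print(tokenizer._tokenize(text))  # the expected tokens to hard-code
print(tokenizer.encode(text))     # the expected input ids to hard-code
```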
output_text = "this is a test"
return input_text, output_text

def test_convert_token_and_id(self):
not sure this one is needed!
If you think it's trivial, we can remove it. Let me know.
Force-pushed from 913c5a1 to 8a1c9ea
I have made the necessary changes. Can you have a look?
Thanks, looks a lot better!
Let's rather add edge cases with special tokens and spaces, which are the most important, but otherwise LGTM.
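The kind of edge cases the reviewer asks for could look like the sketch below. The input strings are hypothetical candidates (not taken from the PR), and a trivial split/join pair stands in for the real tokenize/detokenize round trip; with RemBertTokenizer, each input would get a hard-coded expected-token fixture instead.

```python
# Hypothetical edge-case inputs exercising special tokens and whitespace.
edge_cases = [
    "Paris is the [MASK] of France.",  # special token mid-sentence
    "[MASK] leads the sentence",       # special token at the start
    "double  spaces  inside",          # repeated whitespace
    " leading and trailing ",          # surrounding whitespace
]


def round_trip(tokenize, detokenize, text):
    # Generic round-trip check: tokenizing then detokenizing should
    # recover the (whitespace-normalized) input.
    return detokenize(tokenize(text))


# With a whitespace stand-in, single-spaced input survives the round trip.
recovered = round_trip(str.split, " ".join, "Paris is the [MASK] of France.")
print(recovered)
```

The interesting cases are precisely the ones where the round trip is lossy (repeated or surrounding whitespace), since those reveal how the tokenizer's stripping behavior differs between slow and fast implementations.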
@require_sentencepiece
@require_tokenizers
class RemBertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = RemBertTokenizer
Anything that is copied from somewhere else needs the # Copied from mention; see here for the docs on that! 🤗
    "▁",
    "🌈",
],
)
fmt: skip
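The reviewer is asking for Black's `# fmt: skip` pragma, which tells the formatter to leave one statement exactly as written, so a hand-laid-out expected-tokens list keeps its layout. A minimal example (the list contents are illustrative, not RemBERT's actual output):

```python
# Without the pragma, Black would collapse or re-wrap this list to fit its
# own line-length rules; `# fmt: skip` preserves the manual layout of this
# single statement.
expected_tokens = ["▁", "🌈"]  # fmt: skip

print(expected_tokens)
```

This is narrower than the `# fmt: off` / `# fmt: on` pair, which disables formatting for a whole region rather than one statement.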
    0,
    1001,
],
)
fmt: skip
text = "Invoice #12345, dated 2023-12-01, is due on 2024-01-15."
tokens = tokenizer.tokenize(text)
self.assertListEqual(
fmt: skip
    4,
    1001,
],
)
use # fmt: skip as well
Force-pushed from 4699fb4 to 754f686
Can you please check the latest commit?
Force-pushed from 96e31d0 to 9ce98ff
…ASK] for slow and fast, Therefore it required to make the initialization for [MASK] token uniform between fast and slow token
…ed the slow token (mask_token) to have AddedToken instance with lstrip=True
…fied slow tokenizer of rembert to have mask_token as AddedToken with lstrip = True
…style + added comments to indicate from the copied test cases
…e stripping the text
Force-pushed from 409c9b6 to b45da2f
Force-pushed from b45da2f to 745eedf
Thanks a lot for the effort! LGTM
Thanks for your support!
What does this PR do?
It addresses #16627 to create test cases for RemBert.
Who can review?
@ArthurZucker @SaulLu