Deserializing BPE tokenizer failure #1541

Closed · mcognetta opened this issue May 25, 2024 · 4 comments

mcognetta (Contributor) commented May 25, 2024:

I am trying to serialize and deserialize a tokenizer and am getting an error:

Exception: data did not match any variant of untagged enum ModelWrapper at ...

First, I want to clarify that I have seen the other issues related to this (e.g., #1342, #566, #909, #1297), and none of the fixes in them apply here (I will detail that below). I have tested this on two different corpora and across three different versions of tokenizers (v0.12, v0.13, and v0.19), with the same result each time.

To rule out the causes from those issues, here is my exact setup.

My model training code is:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import BpeTrainer

def build_bpe_tokenizer(path, vocab_size):
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = WhitespaceSplit()
    special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        continuing_subword_prefix="##",
        show_progress=False,
        special_tokens=special_tokens,
    )
    tokenizer.train([path], trainer)
    tokenizer.save("tokenizer.json")
    return tokenizer

which was taken basically verbatim from the documentation.

EDIT: I actually had tokenizer.model.dropout = 0.0 in my code, which was the cause of the failure (see the closing comment).

Immediately reloading Tokenizer.from_file("tokenizer.json") fails with the above error.
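
For reference, here is roughly what my script does end to end, using the function above (the corpus path and vocabulary size are placeholders; the dropout assignment is the one mentioned in the edit above):

tokenizer = build_bpe_tokenizer("corpus.txt", vocab_size=200)  # placeholder path and size

# The assignment from the edit above; this is what breaks the round trip.
tokenizer.model.dropout = 0.0
tokenizer.save("tokenizer.json")

# Raises: Exception: data did not match any variant of untagged enum ModelWrapper
reloaded = Tokenizer.from_file("tokenizer.json")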

Any ideas on how to work around this?


Below is an example JSON output from the training code above. Note that I set the vocabulary size so that only one merge was learned, and I did not modify anything in the JSON file. Please try to load it and let me know if it works; it fails for me.

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 0,
      "content": "[UNK]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 1,
      "content": "[PAD]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 2,
      "content": "[CLS]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 3,
      "content": "[SEP]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 4,
      "content": "[MASK]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
  ],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "WhitespaceSplit"
  },
  "post_processor": null,
  "decoder": null,
  "model": {
    "type": "BPE",
    "dropout": 0.0,
    "unk_token": "[UNK]",
    "continuing_subword_prefix": "##",
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "byte_fallback": false,
    "ignore_merges": false,
    "vocab": {
      "[UNK]": 0,
      "[PAD]": 1,
      "[CLS]": 2,
      "[SEP]": 3,
      "[MASK]": 4,
      "!": 5,
      "#": 6,
      "$": 7,
      "%": 8,
      "&": 9,
      "(": 10,
      ")": 11,
      "*": 12,
      "+": 13,
      ",": 14,
      "-": 15,
      ".": 16,
      "/": 17,
      "0": 18,
      "1": 19,
      "2": 20,
      "3": 21,
      "4": 22,
      "5": 23,
      "6": 24,
      "7": 25,
      "8": 26,
      "9": 27,
      ":": 28,
      ";": 29,
      "=": 30,
      "?": 31,
      "@": 32,
      "\\": 33,
      "^": 34,
      "_": 35,
      "a": 36,
      "b": 37,
      "c": 38,
      "d": 39,
      "e": 40,
      "f": 41,
      "g": 42,
      "h": 43,
      "i": 44,
      "j": 45,
      "k": 46,
      "l": 47,
      "m": 48,
      "n": 49,
      "o": 50,
      "p": 51,
      "q": 52,
      "r": 53,
      "s": 54,
      "t": 55,
      "u": 56,
      "v": 57,
      "w": 58,
      "x": 59,
      "y": 60,
      "z": 61,
      "한": 62,
      "국": 63,
      "£": 64,
      "²": 65,
      "à": 66,
      "á": 67,
      "â": 68,
      "ã": 69,
      "ä": 70,
      "ç": 71,
      "è": 72,
      "é": 73,
      "ê": 74,
      "ë": 75,
      "í": 76,
      "ï": 77,
      "ñ": 78,
      "ó": 79,
      "ô": 80,
      "ö": 81,
      "ø": 82,
      "ú": 83,
      "ü": 84,
      "ā": 85,
      "ă": 86,
      "ć": 87,
      "ē": 88,
      "ť": 89,
      "ย": 90,
      "ร": 91,
      "อ": 92,
      "่": 93,
      "–": 94,
      "—": 95,
      "…": 96,
      "€": 97,
      "你": 98,
      "葱": 99,
      "送": 100,
      "##r": 101,
      "##o": 102,
      "##a": 103,
      "##d": 104,
      "##-": 105,
      "##s": 106,
      "##c": 107,
      "##l": 108,
      "##e": 109,
      "##i": 110,
      "##n": 111,
      "##t": 112,
      "##g": 113,
      "##y": 114,
      "##m": 115,
      "##u": 116,
      "##p": 117,
      "##h": 118,
      "##b": 119,
      "##w": 120,
      "##k": 121,
      "##f": 122,
      "##z": 123,
      "##5": 124,
      "##v": 125,
      "##x": 126,
      "##9": 127,
      "##2": 128,
      "##0": 129,
      "##1": 130,
      "##7": 131,
      "##.": 132,
      "##j": 133,
      "##4": 134,
      "##,": 135,
      "##8": 136,
      "##3": 137,
      "##6": 138,
      "##q": 139,
      "##;": 140,
      "##é": 141,
      "##ñ": 142,
      "##ø": 143,
      "##à": 144,
      "##í": 145,
      "##ô": 146,
      "##ö": 147,
      "##ê": 148,
      "##ó": 149,
      "##ē": 150,
      "##è": 151,
      "###": 152,
      "##á": 153,
      "##ä": 154,
      "##ú": 155,
      "##ย": 156,
      "##你": 157,
      "##葱": 158,
      "##ć": 159,
      "##ï": 160,
      "##ร": 161,
      "##ã": 162,
      "##&": 163,
      "##ç": 164,
      "##ă": 165,
      "##ť": 166,
      "##ü": 167,
      "##â": 168,
      "##ë": 169,
      "##ā": 170,
      "th": 171
    },
    "merges": [
      "t ##h"
    ]
  }
}
mcognetta (Author) commented May 25, 2024:

As part of my search, I found that I was setting dropout to 0.0 in my code.

A JSON file with this change loads correctly:

<     "dropout": 0.0,
---
>     "dropout": null,

Thus, I think this is where the bug is. I need to check the Rust code to see what the expected type of dropout is, but it looks like any value outside (0.0, 1.0] is rejected. Rejecting invalid values makes sense to me (dropout is undefined outside that range), but 0.0 should probably be accepted, and the allowed range should be clearly documented.

If someone confirms that this is the root cause, I will close this issue and open another about the documentation + including 0.0.
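
In the meantime, here is a quick workaround sketch for anyone who already has a tokenizer.json with "dropout": 0.0 and would rather patch the file than retrain (it simply rewrites the field as null before loading):

import json

from tokenizers import Tokenizer

# Rewrite "dropout": 0.0 as null so that the file deserializes again.
with open("tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

if config["model"].get("dropout") == 0.0:
    config["model"]["dropout"] = None  # serialized back as null

with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)

tokenizer = Tokenizer.from_file("tokenizer.json")  # now loads without the error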

mcognetta (Author) commented:

Having looked into the Rust code (see below), my previous comment does seem to be correct: the BPE builder only accepts dropout values in (0, 1], and the deserializer presumably runs the same validation, which is why a serialized 0.0 fails to load and surfaces as the generic untagged-enum error. Calling it a bug is thus not correct, so I will close this issue and open one about allowing dropout = 0.0 to be treated the same as None.

if let Some(p) = self.config.dropout {
    if p <= 0.0 || p > 1.0 {
        return Err(Error::InvalidDropout.into());
    }
}

mcognetta closed this as not planned on May 27, 2024.
ArthurZucker (Collaborator) commented:

Thanks for the detailed issue, and yeah, it makes sense to have dropout 0 be equivalent to None.

mcognetta (Author) commented:

Thanks for the comment. I opened #1550 to address this.
