Deserializing BPE tokenizer failure #1541

Closed · mcognetta opened this issue May 25, 2024 · 4 comments

mcognetta (Contributor) commented May 25, 2024:

I am trying to serialize and deserialize a tokenizer and am getting an error:

Exception: data did not match any variant of untagged enum ModelWrapper at ...

First, I want to clarify that I have seen the other issues related to this (e.g., #1342, #566, #909, #1297), and none of the fixes in them apply here (I will detail that below). I have tested this on two different corpora and across three different versions of tokenizers (v0.12, v0.13, and v0.19), with the same result each time.

To rule out the causes from those issues, here is my exact setup.

My model training code is:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import BpeTrainer

def build_bpe_tokenizer(path, vocab_size):
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = WhitespaceSplit()
    special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        continuing_subword_prefix="##",
        show_progress=False,
        special_tokens=special_tokens,
    )
    tokenizer.train([path], trainer)
    tokenizer.save("tokenizer.json")
    return tokenizer

which was taken basically verbatim from the documentation.

EDIT: I actually had tokenizer.model.dropout = 0.0 in my code, which was the cause of the failure (see the closing comment).

Immediately reloading Tokenizer.from_file("tokenizer.json") fails with the above error.
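
For reference, here is roughly what my script does end to end, using the function above (the corpus path and vocabulary size are placeholders; the dropout assignment is the one mentioned in the edit above):

tokenizer = build_bpe_tokenizer("corpus.txt", vocab_size=200)  # placeholder path and size

# The assignment from the edit above; this is what breaks the round trip.
tokenizer.model.dropout = 0.0
tokenizer.save("tokenizer.json")

# Raises: Exception: data did not match any variant of untagged enum ModelWrapper
reloaded = Tokenizer.from_file("tokenizer.json")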

Any ideas on how to work around this?


Below is an example JSON output from the training code above. Note that I set the vocabulary size so that only one merge was learned, and I did not modify anything in the JSON file. Please try to load it and let me know if it works; it fails for me.

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 0,
      "content": "[UNK]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 1,
      "content": "[PAD]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 2,
      "content": "[CLS]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 3,
      "content": "[SEP]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 4,
      "content": "[MASK]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
  ],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "WhitespaceSplit"
  },
  "post_processor": null,
  "decoder": null,
  "model": {
    "type": "BPE",
    "dropout": 0.0,
    "unk_token": "[UNK]",
    "continuing_subword_prefix": "##",
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "byte_fallback": false,
    "ignore_merges": false,
    "vocab": {
      "[UNK]": 0,
      "[PAD]": 1,
      "[CLS]": 2,
      "[SEP]": 3,
      "[MASK]": 4,
      "!": 5,
      "#": 6,
      "$": 7,
      "%": 8,
      "&": 9,
      "(": 10,
      ")": 11,
      "*": 12,
      "+": 13,
      ",": 14,
      "-": 15,
      ".": 16,
      "/": 17,
      "0": 18,
      "1": 19,
      "2": 20,
      "3": 21,
      "4": 22,
      "5": 23,
      "6": 24,
      "7": 25,
      "8": 26,
      "9": 27,
      ":": 28,
      ";": 29,
      "=": 30,
      "?": 31,
      "@": 32,
      "\\": 33,
      "^": 34,
      "_": 35,
      "a": 36,
      "b": 37,
      "c": 38,
      "d": 39,
      "e": 40,
      "f": 41,
      "g": 42,
      "h": 43,
      "i": 44,
      "j": 45,
      "k": 46,
      "l": 47,
      "m": 48,
      "n": 49,
      "o": 50,
      "p": 51,
      "q": 52,
      "r": 53,
      "s": 54,
      "t": 55,
      "u": 56,
      "v": 57,
      "w": 58,
      "x": 59,
      "y": 60,
      "z": 61,
      "한": 62,
      "국": 63,
      "£": 64,
      "²": 65,
      "à": 66,
      "á": 67,
      "â": 68,
      "ã": 69,
      "ä": 70,
      "ç": 71,
      "è": 72,
      "é": 73,
      "ê": 74,
      "ë": 75,
      "í": 76,
      "ï": 77,
      "ñ": 78,
      "ó": 79,
      "ô": 80,
      "ö": 81,
      "ø": 82,
      "ú": 83,
      "ü": 84,
      "ā": 85,
      "ă": 86,
      "ć": 87,
      "ē": 88,
      "ť": 89,
      "ย": 90,
      "ร": 91,
      "อ": 92,
      "่": 93,
      "–": 94,
      "—": 95,
      "…": 96,
      "€": 97,
      "你": 98,
      "葱": 99,
      "送": 100,
      "##r": 101,
      "##o": 102,
      "##a": 103,
      "##d": 104,
      "##-": 105,
      "##s": 106,
      "##c": 107,
      "##l": 108,
      "##e": 109,
      "##i": 110,
      "##n": 111,
      "##t": 112,
      "##g": 113,
      "##y": 114,
      "##m": 115,
      "##u": 116,
      "##p": 117,
      "##h": 118,
      "##b": 119,
      "##w": 120,
      "##k": 121,
      "##f": 122,
      "##z": 123,
      "##5": 124,
      "##v": 125,
      "##x": 126,
      "##9": 127,
      "##2": 128,
      "##0": 129,
      "##1": 130,
      "##7": 131,
      "##.": 132,
      "##j": 133,
      "##4": 134,
      "##,": 135,
      "##8": 136,
      "##3": 137,
      "##6": 138,
      "##q": 139,
      "##;": 140,
      "##é": 141,
      "##ñ": 142,
      "##ø": 143,
      "##à": 144,
      "##í": 145,
      "##ô": 146,
      "##ö": 147,
      "##ê": 148,
      "##ó": 149,
      "##ē": 150,
      "##è": 151,
      "###": 152,
      "##á": 153,
      "##ä": 154,
      "##ú": 155,
      "##ย": 156,
      "##你": 157,
      "##葱": 158,
      "##ć": 159,
      "##ï": 160,
      "##ร": 161,
      "##ã": 162,
      "##&": 163,
      "##ç": 164,
      "##ă": 165,
      "##ť": 166,
      "##ü": 167,
      "##â": 168,
      "##ë": 169,
      "##ā": 170,
      "th": 171
    },
    "merges": [
      "t ##h"
    ]
  }
}
mcognetta (Author) commented May 25, 2024:

As part of my search, I found that I was setting dropout to 0.0 in my code.

A JSON file with this change loads correctly:

<     "dropout": 0.0,
---
>     "dropout": null,

Thus, I think this is where the bug is. I need to check the Rust code to see what the expected type of dropout is, but it looks like any value outside (0.0, 1.0] is rejected. Rejecting invalid values makes sense to me (dropout is undefined outside that range), but 0.0 should probably be accepted, and the allowed range should be clearly documented.

If someone confirms that this is the root cause, I will close this issue and open another about the documentation + including 0.0.
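
In the meantime, here is a quick workaround sketch for anyone who already has a tokenizer.json with "dropout": 0.0 and would rather patch the file than retrain (it simply rewrites the field as null before loading):

import json

from tokenizers import Tokenizer

# Rewrite "dropout": 0.0 as null so that the file deserializes again.
with open("tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

if config["model"].get("dropout") == 0.0:
    config["model"]["dropout"] = None  # serialized back as null

with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)

tokenizer = Tokenizer.from_file("tokenizer.json")  # now loads without the error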

mcognetta (Author) commented:

Having looked into the Rust code (see below), my previous comment does seem to be correct: the BPE builder only accepts dropout values in (0, 1], and the deserializer presumably runs the same validation, which is why a serialized 0.0 fails to load and surfaces as the generic untagged-enum error. Calling it a bug is thus not correct, so I will close this issue and open one about allowing dropout = 0.0 to be treated the same as None.

if let Some(p) = self.config.dropout {
    if p <= 0.0 || p > 1.0 {
        return Err(Error::InvalidDropout.into());
    }
}

mcognetta closed this as not planned on May 27, 2024.
ArthurZucker (Collaborator) commented:

Thanks for the detailed issue, and yeah, it makes sense to have dropout 0 be equivalent to None.

mcognetta (Author) commented:

Thanks for the comment. I opened #1550 to address this.
