XLM-R tokenizer, return correct unk id for corrupted input

This should never happen, but the worst-case fallback (where tokenization does not return any pieces) pushed the wrong unknown-piece identifier.
tensordot · Oct 19, 2023 · b123802
1 parent d9a8333
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/syntaxdot-tokenizers/src/xlm_roberta.rs b/syntaxdot-tokenizers/src/xlm_roberta.rs
@@ -76,7 +76,7 @@ impl Tokenize for XlmRobertaTokenizer {
                 // tokens. However, the input may be corrupt and use
                 // some form of non-tab whitespace as a form, for which
                 // sentencepiece does not return any identifier.
-                pieces.push(self.spp.unk_id() as i64 + FAIRSEQ_OFFSET);
+                pieces.push(FAIRSEQ_UNK);
             }
         }
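
The one-line change above is easier to judge with the numbers written out. The following is a minimal sketch, not the crate's code: it assumes the usual fairseq XLM-R vocabulary layout (`<s>` = 0, `<pad>` = 1, `</s>` = 2, `<unk>` = 3, with every other sentencepiece id shifted by 1) and a sentencepiece model whose unknown piece has id 0. `FAIRSEQ_OFFSET` and `FAIRSEQ_UNK` are named as in the diff, but their values here are assumptions; `FAIRSEQ_PAD`, `SPP_UNK_ID`, and both functions are hypothetical names introduced only for illustration.

```rust
// Assumed values following the usual fairseq XLM-R layout; the real
// constants live in xlm_roberta.rs and are not shown in this diff.
const FAIRSEQ_OFFSET: i64 = 1;
const FAIRSEQ_UNK: i64 = 3;

// Hypothetical: id of `<pad>` in the fairseq vocabulary (assumed to be 1).
const FAIRSEQ_PAD: i64 = 1;

// Hypothetical stand-in for `self.spp.unk_id()`, assuming the
// sentencepiece model places `<unk>` at id 0.
const SPP_UNK_ID: i64 = 0;

/// Fallback id as computed before this commit: the sentencepiece
/// unknown id shifted by the fairseq offset.
fn old_fallback_id() -> i64 {
    SPP_UNK_ID + FAIRSEQ_OFFSET
}

/// Fallback id as computed after this commit: the fairseq unknown id.
fn new_fallback_id() -> i64 {
    FAIRSEQ_UNK
}
```

Under these assumptions `old_fallback_id()` evaluates to 1, which coincides with the assumed `<pad>` id rather than `<unk>`, while `new_fallback_id()` evaluates to 3. The offset only applies to regular pieces; the special tokens are ordered differently in the two vocabularies, so shifting the sentencepiece unknown id does not land on the fairseq unknown id.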