XLM-R tokenizer, return correct unk id for corrupted input
This should never happen, but we returned the incorrect unknown piece
identifier in the worst-case fallback (where tokenization doesn't return
any pieces).
danieldk committed Oct 19, 2023
1 parent d9a8333 commit b123802
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion syntaxdot-tokenizers/src/xlm_roberta.rs
@@ -76,7 +76,7 @@ impl Tokenize for XlmRobertaTokenizer {
// tokens. However, the input may be corrupt and use
// some form of non-tab whitespace as a form, for which
// sentencepiece does not return any identifier.
- pieces.push(self.spp.unk_id() as i64 + FAIRSEQ_OFFSET);
+ pieces.push(FAIRSEQ_UNK);
}
}
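
To illustrate why the old fallback was wrong, here is a minimal sketch. It assumes the usual fairseq id layout for XLM-R vocabularies (`<s>` = 0, `<pad>` = 1, `</s>` = 2, `<unk>` = 3, with regular sentencepiece ids shifted by one) and sentencepiece's default unknown id of 0. None of the constant values appear in the diff, so treat them as assumptions; `FAIRSEQ_PAD`, `SPP_UNK`, and both helper functions are hypothetical names used only for this example.

```rust
// Assumed values following the fairseq convention for XLM-R vocabularies.
// They are NOT taken from the diff above.
const FAIRSEQ_OFFSET: i64 = 1; // regular sentencepiece ids are shifted by one
const FAIRSEQ_PAD: i64 = 1; // hypothetical name, for illustration only
const FAIRSEQ_UNK: i64 = 3;

// Assumed: sentencepiece's default unknown-piece id.
// Stands in for `self.spp.unk_id()` in the real code.
const SPP_UNK: i64 = 0;

/// Old fallback: shift the sentencepiece unknown id by the fairseq offset.
fn old_fallback_id() -> i64 {
    SPP_UNK + FAIRSEQ_OFFSET
}

/// New fallback: use the fairseq unknown id directly.
fn new_fallback_id() -> i64 {
    FAIRSEQ_UNK
}
```

Under these assumptions, `old_fallback_id()` yields 1, which collides with the padding id, while `new_fallback_id()` yields the intended unknown id. The offset only applies to ordinary pieces; special pieces such as the unknown piece have fixed fairseq ids and cannot be derived by shifting.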
