Implement max character check for WordPiece tokenizer (#398)
* Implement max character check per token

* Update maxInputCharsPerWord to max_input_chars_per_word

Co-authored-by: Joshua Lochner <[email protected]>

* Update to ??

Co-authored-by: Joshua Lochner <[email protected]>

---------

Co-authored-by: Joshua Lochner <[email protected]>
samlhuillier and xenova authored Nov 17, 2023
1 parent 4e4148c commit c8bbdd4
Showing 1 changed file with 11 additions and 4 deletions.
15 changes: 11 additions & 4 deletions src/tokenizers.js
@@ -270,6 +270,7 @@ class WordPieceTokenizer extends TokenizerModel {
* @param {Object} config.vocab A mapping of tokens to ids.
* @param {string} config.unk_token The unknown token string.
* @param {string} config.continuing_subword_prefix The prefix to use for continuing subwords.
+ * @param {number} [config.max_input_chars_per_word=100] The maximum number of characters per word.
*/
constructor(config) {
super(config);
@@ -291,6 +292,12 @@ class WordPieceTokenizer extends TokenizerModel {
*/
this.unk_token = config.unk_token;

+/**
+ * The maximum number of characters allowed per word.
+ * @type {number}
+ */
+this.max_input_chars_per_word = config.max_input_chars_per_word ?? 100;
+
/**
* An array of tokens.
* @type {string[]}
@@ -310,10 +317,10 @@ class WordPieceTokenizer extends TokenizerModel {
let outputTokens = [];
for (let token of tokens) {
let chars = [...token];
-// TODO add
-// if len(chars) > self.max_input_chars_per_word:
-// output_tokens.append(self.unk_token)
-// continue
+if (chars.length > this.max_input_chars_per_word) {
+    outputTokens.push(this.unk_token);
+    continue;
+}

let isUnknown = false;
let start = 0;
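
For readers who want to see the new check in context, the following is a minimal standalone sketch of greedy longest-match WordPiece encoding with the per-word length limit. It is not the library's code: the function name `wordPieceEncode`, the use of a `Map` for `config.vocab`, and the toy vocabulary are assumptions made for illustration. Only the config field names, the default of 100, and the length check itself come from the diff above.

```javascript
// Hypothetical helper for illustration; not part of src/tokenizers.js.
function wordPieceEncode(tokens, config) {
    const vocab = config.vocab; // assumed to be a Map of token -> id
    const unkToken = config.unk_token;
    const prefix = config.continuing_subword_prefix;
    // `??` falls back to 100 only when the value is null or undefined.
    const maxChars = config.max_input_chars_per_word ?? 100;

    const outputTokens = [];
    for (const token of tokens) {
        const chars = [...token]; // split by code point, not UTF-16 unit

        // The check added in this commit: overly long words map to unk_token.
        if (chars.length > maxChars) {
            outputTokens.push(unkToken);
            continue;
        }

        let isUnknown = false;
        let start = 0;
        const subTokens = [];
        while (start < chars.length) {
            // Try the longest remaining substring first, then shrink it.
            let end = chars.length;
            let match = null;
            while (start < end) {
                let piece = chars.slice(start, end).join('');
                if (start > 0) piece = prefix + piece;
                if (vocab.has(piece)) {
                    match = piece;
                    break;
                }
                --end;
            }
            if (match === null) {
                isUnknown = true;
                break;
            }
            subTokens.push(match);
            start = end;
        }

        if (isUnknown) {
            outputTokens.push(unkToken);
        } else {
            outputTokens.push(...subTokens);
        }
    }
    return outputTokens;
}

// Toy vocabulary (assumed for this example only).
const config = {
    vocab: new Map([['[UNK]', 0], ['un', 1], ['##aff', 2], ['##able', 3]]),
    unk_token: '[UNK]',
    continuing_subword_prefix: '##',
};

console.log(wordPieceEncode(['unaffable'], config));       // → ['un', '##aff', '##able']
console.log(wordPieceEncode(['a'.repeat(101)], config));   // → ['[UNK]']
```

The choice of `??` over `||` matters only for falsy limits: with `??`, an explicit `max_input_chars_per_word: 0` is kept, so every non-empty word becomes the unknown token, whereas `||` would silently replace 0 with 100.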