Fix: Inconsistent results in byte-level tokenization when using pre_tokenizer.sequence

Description:
When using pre_tokenizer.sequence, byte-level tokenization produced different results depending on the order of the byte_level and digits pre-tokenizers. This is resolved by removing from the byte_level printable set the characters that are classified as digits (in Latin-1, the superscript digits and vulgar fractions), so both orderings yield the same output.

Changes:
- Modified the byte-level tokenization process to ensure consistent results when pre_tokenizer.sequence is employed.
- Characters identified as digits in the byte_level set are now properly excluded, addressing the order-dependent discrepancy.
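The bytes this commit drops from the printable set decode, as Latin-1, to superscript digits and vulgar fractions, which Unicode classifies as numeric. A quick check (a standalone sketch, not code from the commit) confirms this, and that a byte kept in the set, 0xB1 ('±'), is not numeric:

```rust
fn main() {
    // Bytes newly excluded from byte_level's printable set by this commit:
    // ² ³ ¹ ¼ ½ ¾ in Latin-1. All are Unicode numeric characters, so a
    // digits pre-tokenizer would also match them.
    let excluded: [u8; 6] = [0xB2, 0xB3, 0xB9, 0xBC, 0xBD, 0xBE];
    for b in excluded {
        let c = char::from_u32(b as u32).unwrap();
        assert!(c.is_numeric());
        println!("0x{b:02X} -> {c}");
    }
    // A byte the commit keeps: 0xB1 is '±', which is not numeric.
    assert!(!char::from_u32(0xB1).unwrap().is_numeric());
}
```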
jun.4 committed Nov 16, 2023
1 parent e3bcef2 commit a609178
tokenizers/src/pre_tokenizers/byte_level.rs (4 additions, 1 deletion):

@@ -13,7 +13,10 @@ fn bytes_char() -> HashMap<u8, char> {
     let mut bs: Vec<u8> = vec![];
     bs.extend(b'!'..=b'~');
     bs.extend(b'\xA1'..=b'\xAC');
-    bs.extend(b'\xAE'..=b'\xFF');
+    bs.extend(b'\xAE'..=b'\xB1');
+    bs.extend(b'\xB4'..=b'\xB8');
+    bs.extend(b'\xBA'..=b'\xBB');
+    bs.extend(b'\xBF'..=b'\xFF');

     let mut cs: Vec<u32> = bs.iter().map(|i| *i as u32).collect();
     let mut n = 0;
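For context, the patched ranges feed the GPT-2-style byte-to-char map: bytes in the printable set map to themselves, and every other byte is remapped to U+0100 + n. A minimal reconstruction of that function (a sketch continuing from the diff context, with the post-commit ranges) shows the effect, the digit-like byte 0xB2 no longer maps to itself:

```rust
use std::collections::HashMap;

// Sketch of bytes_char() with the post-commit printable ranges.
fn bytes_char() -> HashMap<u8, char> {
    let mut bs: Vec<u8> = vec![];
    bs.extend(b'!'..=b'~');
    bs.extend(b'\xA1'..=b'\xAC');
    bs.extend(b'\xAE'..=b'\xB1');
    bs.extend(b'\xB4'..=b'\xB8');
    bs.extend(b'\xBA'..=b'\xBB');
    bs.extend(b'\xBF'..=b'\xFF');

    let mut cs: Vec<u32> = bs.iter().map(|&b| b as u32).collect();
    let mut n = 0u32;
    // Every byte not in the printable set is remapped to U+0100 + n.
    for b in 0..=255u8 {
        if !bs.contains(&b) {
            bs.push(b);
            cs.push(0x100 + n);
            n += 1;
        }
    }
    bs.into_iter()
        .zip(cs)
        .map(|(b, c)| (b, char::from_u32(c).unwrap()))
        .collect()
}

fn main() {
    let map = bytes_char();
    // Printable bytes still map to themselves.
    assert_eq!(map[&b'A'], 'A');
    // 0xB2 ('²') is now remapped to a placeholder above U+0100, so the
    // byte-level output no longer contains a character that a digits
    // pre-tokenizer would split, regardless of pre-tokenizer order.
    assert_ne!(map[&0xB2] as u32, 0xB2);
}
```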
