Commit
Fix: Inconsistent results in byte-level tokenization when using pre_tokenizer.sequence

Description:
When using pre_tokenizer.sequence, the results of byte-level tokenization depended on the order of byte_level and digits. This has been resolved by removing the characters identified as digits from those used by byte_level.

Changes:
- Modified the byte-level tokenization process so that results are consistent when pre_tokenizer.sequence is used.
- Characters in the byte_level set that are identified as digits are now excluded, which addresses the order-dependent discrepancy.
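
The order dependence described above can be reproduced in a small pure-Python sketch. This is not the code touched by this commit. It assumes a GPT-2-style byte-to-character table for byte_level and a digit test based on the Unicode numeric categories (Nd, Nl, No). The helper names `bytes_to_unicode`, `is_numeric`, `byte_level` and `digits` are hypothetical stand-ins for illustration only.

```python
import itertools
import unicodedata


def bytes_to_unicode():
    """Build a GPT-2-style table mapping each byte value to a visible character."""
    # Bytes that are kept as-is: '!'..'~', U+00A1..U+00AC, U+00AE..U+00FF.
    keep = set(range(0x21, 0x7F)) | set(range(0xA1, 0xAD)) | set(range(0xAE, 0x100))
    table, shift = {}, 0
    for b in range(256):
        if b in keep:
            table[b] = chr(b)
        else:
            # Remaining bytes are shifted to code points starting at U+0100.
            table[b] = chr(256 + shift)
            shift += 1
    return table


def is_numeric(ch):
    # Assumption: the digit test accepts any character in category Nd, Nl or No.
    return unicodedata.category(ch) in ("Nd", "Nl", "No")


def byte_level(pieces, table):
    # Replace every UTF-8 byte of each piece with its table character.
    return ["".join(table[b] for b in p.encode("utf-8")) for p in pieces]


def digits(pieces):
    # Split each piece into runs of numeric and non-numeric characters.
    out = []
    for p in pieces:
        out.extend("".join(g) for _, g in itertools.groupby(p, key=is_numeric))
    return out


table = bytes_to_unicode()

# Non-ASCII table characters that the digit test treats as numeric.
numeric_aliases = sorted(c for c in table.values() if is_numeric(c) and not c.isascii())

text = "\u00f2"  # 'ò', UTF-8 bytes C3 B2; byte B2 maps to '²' in the table
digits_first = byte_level(digits([text]), table)      # ['Ã²']
byte_level_first = digits(byte_level([text], table))  # ['Ã', '²']
```

Under these assumptions, a plain letter such as 'ò' stays in one piece when digits runs first, but is split when byte_level runs first, because one of its bytes is rendered as a character that the digit test accepts. Excluding such characters from the byte_level set, as the description states, would remove that difference.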