
Adding pretty print of tokenizer #1540

Closed · wants to merge 3 commits
Conversation

haixuanTao

@haixuanTao haixuanTao commented May 23, 2024

Adds a default implementation of __str__ and __repr__ for Tokenizer.

Test it out

Before

>>> from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors
>>> from tokenizers.implementations import BaseTokenizer
>>> toki = Tokenizer(models.BPE())
>>> print(toki)
<tokenizers.Tokenizer object at 0x7d687d32bc30>

After

>>> from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors
>>> from tokenizers.implementations import BaseTokenizer
>>> toki = Tokenizer(models.BPE())
>>> print(toki)
TokenizerImpl {
    normalizer: None,
    pre_tokenizer: None,
    model: PyModel {
        model: RwLock {
            data: BPE(
                BPE {
                    dropout: None,
                    unk_token: None,
                    continuing_subword_prefix: None,
                    end_of_word_suffix: None,
                    fuse_unk: false,
                    byte_fallback: false,
                    vocab: 0,
                    merges: 0,
                    ignore_merges: false,
                },
            ),
            poisoned: false,
            ..
        },
    },
    post_processor: None,
    decoder: None,
    added_vocabulary: AddedVocabulary {
        added_tokens_map: {},
        added_tokens_map_r: {},
        added_tokens: [],
        special_tokens: [],
        special_tokens_set: {},
        split_trie: (
            AhoCorasick(
                dfa::DFA(
                D 000000: \x00 => 0
                F 000001:
                 >000002: \x00 => 2
                  000003: \x00 => 0
                match kind: LeftmostLongest
                prefilter: false
                state length: 4
                pattern length: 0
                shortest pattern length: 18446744073709551615
                longest pattern length: 0
                alphabet length: 1
                stride: 1
                byte classes: ByteClasses(0 => [0-255])
                memory usage: 16
                )
                ,
            ),
            [],
        ),
        split_normalized_trie: (
            AhoCorasick(
                dfa::DFA(
                D 000000: \x00 => 0
                F 000001:
                 >000002: \x00 => 2
                  000003: \x00 => 0
                match kind: LeftmostLongest
                prefilter: false
                state length: 4
                pattern length: 0
                shortest pattern length: 18446744073709551615
                longest pattern length: 0
                alphabet length: 1
                stride: 1
                byte classes: ByteClasses(0 => [0-255])
                memory usage: 16
                )
                ,
            ),
            [],
        ),
        encode_special_tokens: false,
    },
    truncation: None,
    padding: None,
}

Hope this helps :)

Open to any criticism of the representation or the implementation.

Inspired by dora-rs/dora#503
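To illustrate the pattern this PR introduces, here is a minimal pure-Python sketch. The real change lives in the Rust bindings (where the output above comes from Rust's Debug formatting); the Tokenizer and BPE classes below are hypothetical stand-ins, not the actual tokenizers API.

```python
class BPE:
    """Stand-in for tokenizers.models.BPE (hypothetical, for illustration)."""

    def __repr__(self):
        return "BPE { dropout: None, unk_token: None }"


class Tokenizer:
    """Stand-in for tokenizers.Tokenizer (hypothetical, for illustration)."""

    def __init__(self, model):
        self.model = model

    def __repr__(self):
        # Delegate to the inner model's representation, mirroring how the
        # Rust side would derive its output from the wrapped struct's
        # Debug formatting instead of the default "<object at 0x...>" form.
        return f"Tokenizer {{ model: {self.model!r} }}"

    # print() uses __str__; defaulting it to __repr__ gives the same
    # readable output for both str(toki) and repr(toki).
    __str__ = __repr__


toki = Tokenizer(BPE())
print(toki)
# → Tokenizer { model: BPE { dropout: None, unk_token: None } }
```

The key point is that defining __repr__ once (and aliasing __str__ to it) is enough to replace the opaque default representation with one that exposes the tokenizer's configuration.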

@ArthurZucker (Collaborator) left a comment

Yep looks good to me!

@ArthurZucker (Collaborator) left a comment

For now this does the job! Long term, I'd want to print the content etc. with classes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@haixuanTao haixuanTao changed the title Adding json serialization of tokenizer when printed Adding pretty print of tokenizer May 23, 2024
@ArthurZucker (Collaborator)

Closing in favor of #1542!


3 participants