
Adding pretty print of tokenizer #1540

Closed · wants to merge 3 commits
Conversation

haixuanTao

@haixuanTao haixuanTao commented May 23, 2024

Adds a default implementation of __str__ and __repr__ for Tokenizer.

Test it out

Before

>>> from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors
>>> from tokenizers.implementations import BaseTokenizer
>>> toki = Tokenizer(models.BPE())
>>> print(toki)
<tokenizers.Tokenizer object at 0x7d687d32bc30>

After

>>> from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors
>>> from tokenizers.implementations import BaseTokenizer
>>> toki = Tokenizer(models.BPE())
>>> print(toki)
TokenizerImpl {
    normalizer: None,
    pre_tokenizer: None,
    model: PyModel {
        model: RwLock {
            data: BPE(
                BPE {
                    dropout: None,
                    unk_token: None,
                    continuing_subword_prefix: None,
                    end_of_word_suffix: None,
                    fuse_unk: false,
                    byte_fallback: false,
                    vocab: 0,
                    merges: 0,
                    ignore_merges: false,
                },
            ),
            poisoned: false,
            ..
        },
    },
    post_processor: None,
    decoder: None,
    added_vocabulary: AddedVocabulary {
        added_tokens_map: {},
        added_tokens_map_r: {},
        added_tokens: [],
        special_tokens: [],
        special_tokens_set: {},
        split_trie: (
            AhoCorasick(
                dfa::DFA(
                D 000000: \x00 => 0
                F 000001:
                 >000002: \x00 => 2
                  000003: \x00 => 0
                match kind: LeftmostLongest
                prefilter: false
                state length: 4
                pattern length: 0
                shortest pattern length: 18446744073709551615
                longest pattern length: 0
                alphabet length: 1
                stride: 1
                byte classes: ByteClasses(0 => [0-255])
                memory usage: 16
                )
                ,
            ),
            [],
        ),
        split_normalized_trie: (
            AhoCorasick(
                dfa::DFA(
                D 000000: \x00 => 0
                F 000001:
                 >000002: \x00 => 2
                  000003: \x00 => 0
                match kind: LeftmostLongest
                prefilter: false
                state length: 4
                pattern length: 0
                shortest pattern length: 18446744073709551615
                longest pattern length: 0
                alphabet length: 1
                stride: 1
                byte classes: ByteClasses(0 => [0-255])
                memory usage: 16
                )
                ,
            ),
            [],
        ),
        encode_special_tokens: false,
    },
    truncation: None,
    padding: None,
}

Hope this helps :)

Open to any criticism of the representation or the implementation.

Inspired by dora-rs/dora#503
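To illustrate the pattern this PR introduces, here is a minimal pure-Python sketch. The real change lives in the Rust bindings (where the output above comes from Rust's Debug formatting); the Tokenizer and BPE classes below are hypothetical stand-ins, not the actual tokenizers API.

```python
class BPE:
    """Stand-in for tokenizers.models.BPE (hypothetical, for illustration)."""

    def __repr__(self):
        return "BPE { dropout: None, unk_token: None }"


class Tokenizer:
    """Stand-in for tokenizers.Tokenizer (hypothetical, for illustration)."""

    def __init__(self, model):
        self.model = model

    def __repr__(self):
        # Delegate to the inner model's representation, mirroring how the
        # Rust side would derive its output from the wrapped struct's
        # Debug formatting instead of the default "<object at 0x...>" form.
        return f"Tokenizer {{ model: {self.model!r} }}"

    # print() uses __str__; defaulting it to __repr__ gives the same
    # readable output for both str(toki) and repr(toki).
    __str__ = __repr__


toki = Tokenizer(BPE())
print(toki)
# → Tokenizer { model: BPE { dropout: None, unk_token: None } }
```

The key point is that defining __repr__ once (and aliasing __str__ to it) is enough to replace the opaque default representation with one that exposes the tokenizer's configuration.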

@ArthurZucker (Collaborator) left a comment

Yep looks good to me!

@ArthurZucker (Collaborator) left a comment

For now this does the job! Long term, I'd want to print the content etc. with classes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@haixuanTao haixuanTao changed the title Adding json serialization of tokenizer when printed Adding pretty print of tokenizer May 23, 2024
@ArthurZucker (Collaborator)

Closing in favor of #1542!


3 participants