[Proposal] Claude Tokenizer #15

Open
BarberAlec opened this issue Sep 16, 2023 · 1 comment

@BarberAlec

The tiktoken_ruby gem currently supports four encoders:

  • r50k_base
  • p50k_base
  • p50k_edit
  • cl100k_base

Claude appears to use tiktoken parameters outlined here and implemented here.

The BPE rankings are stored in an alternate format, but after some reverse engineering of the JavaScript tiktoken implementation here, I was able to use the following code to create a tiktoken encoder for Claude in Python. Note that claude.json was sourced from the referenced JavaScript tiktoken library, which is part of the official Anthropic account.

import tiktoken
import json
import base64


def decode_claude_bpe(claude_configs):
    _, offset, *tokens = claude_configs['bpe_ranks'].split(" ")
    offset = int(offset)

    # Ranks start at the offset stored in the data (5 here) rather than 0;
    # this mirrors what the original JS code does.
    rank_map = {base64.b64decode(token): offset + idx for idx, token in enumerate(tokens)}

    return rank_map

if __name__ == "__main__":
    with open("claude.json") as f:
        claude_configs = json.load(f)
        bpe_ranks = decode_claude_bpe(claude_configs)

    enc = tiktoken.Encoding(
        name="claude_tokenizer",
        pat_str=claude_configs['pat_str'],
        mergeable_ranks=bpe_ranks,
        special_tokens=claude_configs['special_tokens'],
    )
    print(enc.encode("hello world"))
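
To make the rank format easier to see without needing claude.json or tiktoken installed, here is a minimal sketch of the same decoding step run on a toy string. The toy string and the function name `decode_bpe_ranks` are made up for illustration; only the layout (a leading marker field, an integer offset, then space-separated base64 tokens) is taken from the code above.

```python
import base64
from typing import Dict


def decode_bpe_ranks(bpe_ranks: str) -> Dict[bytes, int]:
    """Decode a 'marker offset tok1 tok2 ...' string into a token -> rank map."""
    # The first field is ignored, the second is the starting rank,
    # and every remaining field is a base64-encoded token.
    _marker, offset, *tokens = bpe_ranks.split(" ")
    start = int(offset)
    return {base64.b64decode(tok): start + i for i, tok in enumerate(tokens)}


# Hypothetical toy data: base64 of b"a", b"b", and b"ab", with ranks starting at 5.
toy = "! 5 YQ== Yg== YWI="
ranks = decode_bpe_ranks(toy)
print(ranks)  # {b'a': 5, b'b': 6, b'ab': 7}
```

The resulting dictionary has the shape that `mergeable_ranks` expects in the `tiktoken.Encoding` call above: byte strings mapped to integer ranks.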

Alternatively, an option to create a tiktoken encoder from custom BPE ranks, a pattern string, and special tokens, as the Python library allows, would be a more general solution.

@IAPark
Owner

IAPark commented Sep 21, 2023

I do prefer the idea of creating a general solution. I think adding explicit Claude support moves away from the idea of a wrapper.
