Claude appears to use tiktoken parameters outlined here and implemented here.
The BPE rankings are stored in an alternate format, but after some reverse engineering of the JavaScript tiktoken implementation here, I was able to use the following code to create a tiktoken encoder for Claude in Python. Note that claude.json was sourced from the referenced JavaScript tiktoken library, which is part of the official Anthropic account.
import tiktoken
import json
import base64


def decode_claude_bpe(claude_configs):
    # bpe_ranks is one space-separated string: a leading field (ignored),
    # the starting offset, then the base64-encoded tokens in rank order.
    _, offset, *tokens = claude_configs['bpe_ranks'].split(" ")
    offset = int(offset)
    # This starts at 5 (offset) for some reason, this is what the original JS code does
    rank_map = {base64.b64decode(token): offset + idx for idx, token in enumerate(tokens)}
    return rank_map


if __name__ == "__main__":
    with open("claude.json") as f:
        claude_configs = json.load(f)
    bpe_ranks = decode_claude_bpe(claude_configs)
    enc = tiktoken.Encoding(
        name="claude_tokenizer",
        pat_str=claude_configs['pat_str'],
        mergeable_ranks=bpe_ranks,
        special_tokens=claude_configs['special_tokens'],
    )
    print(enc.encode("hello world"))
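To make the rank format easier to follow, the decoding step can be tried in isolation, without claude.json or tiktoken. The sketch below is only an illustration: the toy input string, the `!` placeholder in the first field, and the name `decode_bpe_ranks` are made up for this example and are not taken from claude.json or any library.

```python
import base64


def decode_bpe_ranks(bpe_ranks: str) -> dict[bytes, int]:
    """Decode '<ignored> <offset> <b64 token> ...' into {token_bytes: rank}."""
    _, offset, *tokens = bpe_ranks.split(" ")
    start = int(offset)
    # Each token's rank is its position in the list plus the starting offset.
    return {base64.b64decode(tok): start + i for i, tok in enumerate(tokens)}


# Hypothetical toy input (not real Claude data): three tokens, offset 5.
toy_tokens = [base64.b64encode(t).decode("ascii") for t in (b"a", b"b", b"ab")]
toy = " ".join(["!", "5"] + toy_tokens)

ranks = decode_bpe_ranks(toy)
print(ranks)  # {b'a': 5, b'b': 6, b'ab': 7}
```

The resulting dictionary has the shape that `tiktoken.Encoding` expects for `mergeable_ranks`: byte strings mapped to integer ranks.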
Alternatively, an option to create a tiktoken encoder from custom BPE ranks and related parameters, as the Python library allows, would be a more general solution.
tiktoken_ruby gem currently supports 4 encoders: