
Bug: llamafiler /tokenize endpoint with add_special does not add special tokens #643

Open
k8si opened this issue Nov 26, 2024 · 1 comment

k8si (Collaborator) commented Nov 26, 2024

Contact Details

[email protected]

What happened?

Summary: The llamafiler /tokenize endpoint does not appear to add special tokens when the add_special flag is set to true, at least for llama-3.1-8b-instruct.

Model/system info:

  • llamafiler built against commit e5c0921
  • model: meta-llama-3.1-8b-instruct.Q5_K_S.gguf
  • macOS 14.2.1 (M2 Pro)

Command used to start llamafiler:

#!/bin/bash

LLAMAFILER="./bin/llamafiler"
GGUF="meta-llama-3.1-8b-instruct.Q5_K_S.gguf"
"${LLAMAFILER}" --model "${GGUF}"

curl command to reproduce the issue:

curl \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The quick brown fox jumped over the lazy dog.", "add_special": true, "parse_special": false}' \
  "http://localhost:8080/tokenize"

Output:

{
  "add_special": true,
  "parse_special": false,
  "tokens": [
    "The",
    " quick",
    " brown",
    " fox",
    " jumped",
    " over",
    " the",
    " lazy",
    " dog",
    "."
  ]
}
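For context, here is what I would expect the endpoint to return, assuming add_special is meant to behave like the Hugging Face add_special_tokens flag shown below (the BOS token prepended, everything else unchanged):

{
  "add_special": true,
  "parse_special": false,
  "tokens": [
    "<|begin_of_text|>",
    "The",
    " quick",
    " brown",
    " fox",
    " jumped",
    " over",
    " the",
    " lazy",
    " dog",
    "."
  ]
}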

For comparison, here is a script that does the same thing in Python using the Hugging Face transformers library directly:

from transformers import AutoTokenizer, PreTrainedTokenizer


def main():
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(model_name)

    encoded = tokenizer(
        "The quick brown fox jumped over the lazy dog.",
        add_special_tokens=True
    )
    input_ids = encoded["input_ids"]

    print(tokenizer.convert_ids_to_tokens(input_ids))
    # ['<|begin_of_text|>', 'The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumped', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '.']


if __name__ == '__main__':
    main()

Output:

['<|begin_of_text|>', 'The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumped', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '.']

Version

llamafile v0.8.16
llamafiler v0.8.16
(but actually I built from source at commit e5c0921)

What operating system are you seeing the problem on?

Mac

Relevant log output

No response

jart (Collaborator) commented Nov 26, 2024

I can't reproduce this. Could you try passing https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q5_K_M.llamafile as the --model flag? It may be an issue with your GGUF file metadata.
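One way to check that locally is to dump the tokenizer metadata from the GGUF file with the gguf Python package. A minimal sketch, assuming pip install gguf; the exact field-decoding details may differ between gguf versions:

from gguf import GGUFReader

# Rough sketch: print the GGUF tokenizer metadata that controls BOS insertion.
reader = GGUFReader("meta-llama-3.1-8b-instruct.Q5_K_S.gguf")

for name in ("tokenizer.ggml.add_bos_token", "tokenizer.ggml.bos_token_id"):
    field = reader.fields.get(name)
    if field is None:
        print(f"{name}: <missing>")
    else:
        # For scalar fields, the value lives in the part indexed by data[0].
        print(f"{name}: {field.parts[field.data[0]][0]}")

If tokenizer.ggml.add_bos_token is missing or false in the file, that would explain why no BOS token is prepended even with add_special set to true.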
