Eval bug: Output token sequence cannot match with AutoTokenizer #11054

Closed
RunningLeon opened this issue Jan 3, 2025 · 3 comments · Fixed by #11058

Comments

@RunningLeon
Contributor

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
version: 4354 (0e70ba6)
built with cc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) for x86_64-redhat-linux

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA A100-SXM4-80GB

Models

Meta-Llama-3-8B-Instruct

Problem description & steps to reproduce

The token sequences produced by llama-tokenize and by AutoTokenizer do not match exactly for models such as Meta-Llama-3-8B-Instruct and internlm2_5-7b-chat.

reproduce

  1. convert model to gguf
python3 convert_hf_to_gguf.py \
$model_path \
--outfile $gguf_path
  2. run llama-tokenize
prompt="<|im_start|>user\nhello who are you?<|im_end|>\n<|im_start|>assistant\n"
./build/bin/llama-tokenize -m \
./Meta-Llama-3-8B-Instruct.gguf \
-p "$prompt" \
--ids
  3. run with AutoTokenizer from transformers
from transformers import AutoTokenizer
model_path = './Meta-Llama-3-8B-Instruct'
# model_path = './internlm2_5-7b-chat'

tk = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompts = "<|im_start|>user\nhello who are you?<|im_end|>\n<|im_start|>assistant\n"
print(tk.encode(prompts))

results

Meta-Llama-3-8B-Instruct

llama-tokenize
[27, 91, 318, 5011, 91, 29, 882, 1734, 15339, 889, 527, 499, 76514, 91, 318, 6345, 91, 8616, 77, 27, 91, 318, 5011, 91, 29, 78191, 1734]
AutoTokenizer
[27, 91, 318, 5011, 91, 29, 882, 198, 15339, 889, 527, 499, 76514, 91, 318, 6345, 91, 397, 27, 91, 318, 5011, 91, 29, 78191, 198]
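To see exactly where the two Meta-Llama-3-8B-Instruct sequences diverge, the ID lists above can be aligned with Python's standard difflib. This is only a sketch for inspecting the reported output; the variable names are illustrative and nothing here calls llama.cpp or transformers.

```python
from difflib import SequenceMatcher

# Token IDs copied from the results reported above.
llama_tokenize_ids = [27, 91, 318, 5011, 91, 29, 882, 1734, 15339, 889, 527, 499,
                      76514, 91, 318, 6345, 91, 8616, 77, 27, 91, 318, 5011, 91,
                      29, 78191, 1734]
auto_tokenizer_ids = [27, 91, 318, 5011, 91, 29, 882, 198, 15339, 889, 527, 499,
                      76514, 91, 318, 6345, 91, 397, 27, 91, 318, 5011, 91, 29,
                      78191, 198]

# Align the two sequences and keep only the spans that differ.
matcher = SequenceMatcher(None, llama_tokenize_ids, auto_tokenizer_ids, autojunk=False)
diffs = [
    (llama_tokenize_ids[a0:a1], auto_tokenizer_ids[b0:b1])
    for tag, a0, a1, b0, b1 in matcher.get_opcodes()
    if tag != "equal"
]

for left, right in diffs:
    print(f"llama-tokenize {left} vs AutoTokenizer {right}")
```

The sequences differ in three places, and each one lines up with a `\n` in the prompt. That suggests the two tools received different characters at those positions rather than disagreeing on the vocabulary.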

internlm2_5-7b-chat

llama-tokenize
[1, 92543, 1008, 1849, 15115, 1015, 657, 629, 345, 92542, 1849, 92543, 525, 11353, 1849]
AutoTokenizer
[1, 92543, 1008, 364, 15115, 1015, 657, 629, 345, 92542, 364, 92543, 525, 11353, 364]

First Bad Commit

No response

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
llama_load_model_from_file: using device CUDA0 (NVIDIA A100-SXM4-80GB) - 10133 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA A100-SXM4-80GB) - 80614 MiB free
llama_load_model_from_file: using device CUDA2 (NVIDIA A100-SXM4-80GB) - 11791 MiB free
llama_load_model_from_file: using device CUDA3 (NVIDIA A100-SXM4-80GB) - 80614 MiB free
llama_load_model_from_file: using device CUDA4 (NVIDIA A100-SXM4-80GB) - 80614 MiB free
llama_load_model_from_file: using device CUDA5 (NVIDIA A100-SXM4-80GB) - 80614 MiB free
llama_load_model_from_file: using device CUDA6 (NVIDIA A100-SXM4-80GB) - 80614 MiB free
llama_load_model_from_file: using device CUDA7 (NVIDIA A100-SXM4-80GB) - 80614 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 291 tensors from Meta-Llama-3-8B-Instruct.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = llama3
llama_model_loader: - kv   8:                       general.license.link str              = LICENSE
llama_model_loader: - kv   9:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  11:                          llama.block_count u32              = 32
llama_model_loader: - kv  12:                       llama.context_length u32              = 8192
llama_model_loader: - kv  13:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  14:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  15:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  16:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  17:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  18:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                          general.file_type u32              = 1
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = smaug-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: control token: 128255 '<|reserved_special_token_250|>' is not marked as EOG
....
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128001 '<|end_of_text|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 512
llama_new_context_with_model: n_ctx_per_seq = 512
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 0.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_pre_seq (512) > n_ctx_train (0) -- possible training context overflow
[27, 91, 318, 5011, 91, 29, 882, 1734, 15339, 889, 527, 499, 76514, 91, 318, 6345, 91, 8616, 77, 27, 91, 318, 5011, 91, 29, 78191, 1734]
@ggerganov
Owner

#11058 should fix this

@RunningLeon
Contributor Author

RunningLeon commented Jan 6, 2025

Hi, thanks for your quick response. It works for me. I just wonder whether llama-cli and llama-server need the same fix?

@ggerganov
Owner

They should already have the string escape enabled. Let me know if you spot something that is off.
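For readers unfamiliar with what "string escape" means here: in the reproduction steps, the bash double-quoted prompt keeps `\n` as two characters (a backslash and an `n`), while the Python string literal already contains a real newline. The sketch below shows the kind of conversion an escape-processing step performs. `process_escapes` is a hypothetical helper written for illustration and is not llama.cpp's implementation; the set of escapes it handles is an assumption.

```python
# Hypothetical helper for illustration only; not llama.cpp's actual code.
# The escapes handled here are an assumed, minimal set.
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", '"': '"', "'": "'"}


def process_escapes(text: str) -> str:
    """Replace backslash escape sequences with the characters they denote."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in _ESCAPES:
            out.append(_ESCAPES[text[i + 1]])
            i += 2  # consume the backslash and the escape letter
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# What the shell passes to the CLI: backslash + 'n', written as a raw string.
shell_prompt = r"<|im_start|>user\nhello who are you?<|im_end|>\n<|im_start|>assistant\n"
# What the Python script tokenizes: "\n" is already a newline character.
python_prompt = "<|im_start|>user\nhello who are you?<|im_end|>\n<|im_start|>assistant\n"

print(shell_prompt == python_prompt)                    # the raw inputs differ
print(process_escapes(shell_prompt) == python_prompt)   # equal after unescaping
```

Once the escapes are processed, both tools see the same characters, so any remaining difference in token IDs would point to a genuine tokenizer mismatch.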
