Merge pull request ztxz16#417 from TylunasLi/tokenizer
Fully align the tokenizer; support InternLM-7B and XVERSE-7B
ztxz16 authored Feb 22, 2024
2 parents 792e5eb + 8ee4958 commit 4af2b7d
Showing 12 changed files with 510 additions and 176 deletions.
48 changes: 43 additions & 5 deletions docs/llama_cookbook.md
@@ -8,7 +8,13 @@ LLaMA-family models share essentially the same architecture, but weights and prompt construction differ.

The following configurations were compiled from each model's source code; inference results are not guaranteed to match the original exactly.

-## Modify the script and convert
+## Modification methods

Currently, both the conversion script and the two-line acceleration method can be used for llama-family models. Whichever method you choose, reserve enough memory (swap space can be used).

In float16 mode, conversion needs roughly 4 × (number of parameters) + 1 GB of free memory; for example, a 7B model needs about 4 × 7 + 1 ≈ 29 GB.

### Conversion script

This section uses inference support for base models with the various Llama architectures as an example of how to apply this document.

@@ -40,17 +46,36 @@ LLaMA-family models share essentially the same architecture, but weights and prompt construction differ.

To add a token ID rather than a string (as with the baichuan-chat model), use the format "<FLM_FIX_TOKEN_{ID}>" (see the sketch after the script example below).

* Run the script

```shell
python3 tools/alpaca2flm.py [output filename] [precision] [original model name or path]
```
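
As an illustration of the "<FLM_FIX_TOKEN_{ID}>" format mentioned above, the sketch below passes a fixed token ID in the prompt configuration; the ID `2` and the role strings are placeholders, not taken from any particular model.

```python
# Hypothetical example: separate conversation rounds with token ID 2 instead of a string.
torch2flm.tofile(exportPath, model, tokenizer, pre_prompt = "",
                 user_role = "Human: ", bot_role = "\nAssistant: ",
                 history_sep = "<FLM_FIX_TOKEN_2>", dtype = dtype)
```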

### Two-line acceleration

```python
conf = model.config.__dict__
conf["model_type"] = "llama"
llm.from_hf(model, tokenizer, pre_prompt = "",
user_role = "", bot_role = "", history_sep = "",
dtype = dtype)
```
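
For context, here is a minimal sketch of the setup these two lines assume; the model path is a placeholder, and the `fastllm_pytools` import follows fastllm's commonly documented Python bindings, so adjust to your installation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from fastllm_pytools import llm  # fastllm Python bindings (assumed installed)

MODEL_PATH = "path/to/llama-style-model"  # placeholder
dtype = "float16"                         # or "int8" / "int4"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True)
```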

## Alignment

To make a fastllm model behave essentially the same as the original transformers model, the most important step is aligning the tokenizer.
If the model uses Hugging Face's accelerated Tokenizers (i.e. the model directory contains `tokenizer.json`, which is then used preferentially), the current conversion script **can align the tokenizer only when converting from local files**.

Check whether the result returned by the original tokenizer's `encode()` method has a space prepended. If the original tokenizer does not add a space, set:

```python
conf["tokenizer_add_dummy_prefix"] = False
```
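
One way to check, as a sketch using the standard transformers tokenizer API (the sample text is arbitrary):

```python
ids = tokenizer.encode("hello", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(ids))
# A leading "▁hello" (SentencePiece prefix) means encode() prepends a space, so keep the default;
# a plain "hello" means no space is added and tokenizer_add_dummy_prefix should be set to False.
```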

## Base Model

-See "[Modification methods](#修改方案)" above.
+See "[Modification methods](#修改方式)" above.

Some models need bos_token_id to be specified. Assuming bos_token_id is 1, it can be configured as follows:
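
A minimal sketch of such a configuration, assuming the key name `bos_token_id` in the exported config dict:

```python
conf = model.config.__dict__
conf["model_type"] = "llama"
conf["bos_token_id"] = 1  # assumed key; use the model's actual BOS token ID
```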

@@ -66,6 +91,8 @@ LLaMA-family models share essentially the same architecture, but weights and prompt construction differ.

### InternLM(书生)

* internlm/[internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b)
* internlm/[internlm-chat-7b v1.1](https://huggingface.co/internlm/internlm-chat-7b-v1_1)
* internlm/[internlm-chat-20b](https://huggingface.co/internlm/internlm-chat-20b)

```python
@@ -76,6 +103,15 @@ LLaMA-family models share essentially the same architecture, but weights and prompt construction differ.
history_sep = "<eoa>\n<s>", dtype = dtype)
```

You can convert directly with the `internlm2flm.py` script:

```sh
cd build
python3 tools/internlm2flm.py internlm-7b-fp16.flm float16 # export a float16 model
python3 tools/internlm2flm.py internlm-7b-int8.flm int8 # export an int8 model
python3 tools/internlm2flm.py internlm-7b-int4.flm int4 # export an int4 model
python3 tools/internlm2flm.py internlm-chat-7b-fp16.flm float16 internlm/internlm-chat-7b # export a float16 model of internlm-chat-7b
```
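
After conversion, the exported `.flm` file can be loaded from Python, for example (a sketch assuming fastllm's `fastllm_pytools` bindings):

```python
from fastllm_pytools import llm

model = llm.model("internlm-7b-int8.flm")  # one of the files exported above
print(model.response("你好"))
```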

### XVERSE

@@ -85,10 +121,12 @@ LLaMA-family models share essentially the same architecture, but weights and prompt construction differ.
```python
conf = model.config.__dict__
conf["model_type"] = "llama"
conf["tokenizer_add_dummy_prefix"] = False
torch2flm.tofile(exportPath, model, tokenizer, pre_prompt = "",
user_role = "Human: ", bot_role = "\n\nAssistant: ",
history_sep = "<FLM_FIX_TOKEN_3>", dtype = dtype)
```

XVERSE-13B-Chat V1 requires NFKC normalization of its input, which fastllm does not yet support, so the original tokenizer has to be used for that version.
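
For reference, NFKC normalization itself is available in Python's standard library, so input can be normalized before calling the original tokenizer:

```python
import unicodedata

def nfkc(text: str) -> str:
    """Apply the NFKC normalization expected by XVERSE-13B-Chat V1."""
    return unicodedata.normalize("NFKC", text)
```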

### Other llama1-family models

@@ -163,7 +201,7 @@ LLaMA-family models share essentially the same architecture, but weights and prompt construction differ.
```python
torch2flm.tofile(exportPath, model, tokenizer,
pre_prompt="The following is a conversation between a human and an AI assistant namely YuLan, developed by GSAI, Renmin University of China. " \
"The AI assistant gives helpful, detailed, and polite answers to the user's questions.\n"
"The AI assistant gives helpful, detailed, and polite answers to the user's questions.\n",
user_role="[|Human|]:", bot_role="\n[|AI|]:", history_sep="\n", dtype=dtype)
```

@@ -174,7 +212,7 @@ LLaMA-family models share essentially the same architecture, but weights and prompt construction differ.

```python
torch2flm.tofile(exportPath, model, tokenizer,
pre_prompt="Below is an instruction that describes a task. "
"Write a response that appropriately completes the request.\n\n"
pre_prompt="Below is an instruction that describes a task. " \
"Write a response that appropriately completes the request.\n\n",
user_role="### Instruction:\n", bot_role="\n\n### Response:", history_sep="\n", dtype=dtype)
```
19 changes: 18 additions & 1 deletion include/fastllm.h
@@ -17,6 +17,8 @@
#include <iostream>
#include <functional>
#include <memory>
#include <locale>
#include <codecvt>
#include "devices/cpu/cputhreadpool.h"

#ifdef USE_SENTENCEPIECE
@@ -43,7 +45,7 @@ namespace fastllm {
float top_p = 1.0; // top_p sampling
float temperature = 1.0; // temperature, usually between 0.1 and 1.0; larger values give more diverse results
bool output_logits = false; // whether to return logits
bool enable_hash_id = false; // attach a hash id to the session
std::multiset <int> stop_token_ids;

bool IsSimpleGreedy() const {
@@ -359,11 +361,22 @@

TrieNode *root;

TrieNode *specialRoot = nullptr;

TokenizerType type = TokenizerType::BPE;

bool addDummyPrefix = true; // whether to prepend a space (dummy prefix) before the text
bool removeExtraWhitespaces = true; // whether to merge consecutive spaces into one
bool byteAsChar = false; // whether to represent raw bytes as printable characters

std::unordered_map <int, std::string> tokenToStringDict;
std::unordered_map <int, float> tokenToScoreDict;
std::unordered_map <std::string, int> stringToTokenDict;
std::vector <std::string> specialTokens;

std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
std::unordered_map <wchar_t, wchar_t> byteCharDict;
std::unordered_map <wchar_t, wchar_t> charByteDict;
#ifdef USE_SENTENCEPIECE
std::unique_ptr<sentencepiece::SentencePieceProcessor> spProcessor;
#endif
@@ -380,6 +393,10 @@

void Insert(const std::string &s, int tokenId, float score = 1.0f); // insert a token

void SetSpecialTokens(const std::map <std::string, int> &specialTokens); // set special tokens that are matched with priority

std::string Normalize(const std::string &ori); // character normalization

Data Encode(const std::string &s); // encode

std::string Decode(const Data &data); // decode
