-
Notifications
You must be signed in to change notification settings - Fork 302
分词器配置
Wenyi Duan edited this page Nov 2, 2022
·
1 revision
分词器的配置包括analyzer配置、tokenizer配置以及繁简转换表配置三个部分,一个完整的分词器配置文件analyzer.json如下所示:
{
"analyzers":
{
"simple_analyzer":
{
"tokenizer_name" : "simple_tab_tokenzier",
"stopwords" : [],
"normalize_options" :
{
"case_sensitive" : true,
"traditional_sensitive" : true,
"width_sensitive" : true
}
},
"singlews_analyzer":
{
"tokenizer_name" : "singlews_tokenzier",
"stopwords" : [],
"normalize_options" :
{
"case_sensitive" : true,
"traditional_sensitive" : true,
"width_sensitive" : true
}
}
},
"tokenizer_config" : {
"modules" : [
{
"module_name": "TokenizerModule1",
"module_path": "libTokenizerModule1.so",
"parameters": {
}
}
],
"tokenizers" : [
{
"tokenizer_name" : "simple_tab_tokenzier",
"tokenizer_type" : "simple",
"module_name" : "",
"parameters": {
"delimiter" : "\t"
}
},
{
"tokenizer_name" : "singlews_tokenzier",
"tokenizer_type" : "aliws",
"module_name" : "",
"parameters": {
}
}
]
},
"traditional_tables": [
{
"table_name" : "table_name",
"transform_table" : {
"0x4E7E" : "0x4E7E" #设置繁体字"乾"不会被转换成简体字
}
}
]
}
- stopword,可以是任何字符串,也可以为空
- case_sensitive,表示大小写是否敏感,如果敏感,则保留原始文本,否则全转成小写;
- traditional_sensitive,表示简繁体是否敏感,如果敏感,则保留原始文本,否则全转成简体;
- width_sensitive,表示全半角是否敏感,如果敏感,则保留原始文本,否则全转成半角;
- traditional_table_name,表示简繁体转换表,当traditional_sensitive为false时,默认会全转成简体,可以通过设置table_name对应的traditional_table使得特殊的繁体字不会被转成简体字。