Docs update for kagome_ja tokenizer (#2732)

* Prep kagome_ja docs
* Update formatting of tokenized text
* Copyedits
* Update versions
weaviate · Dec 10, 2024 · 3852110
1 parent 1f9471f
commit 3852110

Showing 4 changed files with 31 additions and 5 deletions.
diff --git a/_includes/tokenization_definition.mdx b/_includes/tokenization_definition.mdx
@@ -6,4 +6,5 @@
 | `field`             | Index the whole field after trimming whitespace characters.                  | `Hello, (beautiful) world`       |
 | `trigram`           | Split the property as rolling trigrams.                                      | `Hel`, `ell`, `llo`, `lo,`, ...   |
 | `gse`               | Use the `gse` tokenizer to split the property.                               | [See `gse` docs](https://pkg.go.dev/github.com/go-ego/gse#section-readme) |
-| `kagome_kr`         | Use the `Kagome` tokenizer with a Korean dictionary to split the property.   | [See `kagome` docs](https://github.com/ikawaha/kagome) and the [Korean dictionary](https://github.com/ikawaha/kagome-dict-ko) |
+| `kagome_ja`         | Use the `Kagome` tokenizer with a Japanese (IPA) dictionary to split the property.   | [See `kagome` docs](https://github.com/ikawaha/kagome) and the [dictionary](https://github.com/ikawaha/kagome-dict/). |
+| `kagome_kr`         | Use the `Kagome` tokenizer with a Korean dictionary to split the property.   | [See `kagome` docs](https://github.com/ikawaha/kagome) and the [Korean dictionary](https://github.com/ikawaha/kagome-dict-ko). |
diff --git a/developers/academy/py/tokenization/200_options.mdx b/developers/academy/py/tokenization/200_options.mdx
@@ -131,7 +131,9 @@ Weaviate provides `gse` and `trigram` (from `v1.24`) and `kagome_kr` (from `v1.2
 
 `gse` implements the "Jieba" algorithm, which is a popular Chinese text segmentation algorithm. `trigram` splits text into all possible trigrams, which can be useful for languages like Japanese.
 
-`kagome_kr` uses the [`Kagome` tokenizer](https://github.com/ikawaha/kagome?tab=readme-ov-file) with a Korean MeCab ([mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/)) dictionary to split the property text. This is useful for Korean text.
+`kagome_ja` uses the [`Kagome` tokenizer](https://github.com/ikawaha/kagome?tab=readme-ov-file) with a Japanese [MeCab IPA](https://github.com/ikawaha/kagome-dict/) dictionary to split Japanese property text.
+
+`kagome_kr` uses the [`Kagome` tokenizer](https://github.com/ikawaha/kagome?tab=readme-ov-file) with a Korean MeCab ([mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/)) dictionary to split Korean property text.
 
 ## Questions and feedback
 

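Note on the `trigram` description in the hunk above: the rolling-trigram idea can be illustrated with a short Python sketch. This is only an illustration of a sliding three-character window; the helper name `rolling_trigrams` is hypothetical, and Weaviate's own `trigram` tokenizer may normalize the text before splitting, so its output may differ.

```python
def rolling_trigrams(text: str) -> list[str]:
    """Return every run of three consecutive characters in `text`."""
    # A sliding window of width 3; inputs shorter than three
    # characters yield an empty list in this sketch.
    return [text[i:i + 3] for i in range(len(text) - 2)]


print(rolling_trigrams("Hello,"))
# ['Hel', 'ell', 'llo', 'lo,']
```

Because the window ignores word boundaries, the same approach applies to text without spaces, which is why trigrams are sometimes used as a dictionary-free fallback for Japanese.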
diff --git a/developers/weaviate/config-refs/env-vars.md b/developers/weaviate/config-refs/env-vars.md
@@ -29,6 +29,7 @@ default hostname has changed and a single node cluster believes there are suppos
 | `ENABLE_API_BASED_MODULES` | Enable all API-based modules. (Experimental as of `v1.26.0`) | `boolean` | `true` |
 | `ENABLE_MODULES` | Specify Weaviate modules to enable | `string - comma separated names` | `text2vec-openai,generative-openai` |
 | `ENABLE_TOKENIZER_GSE` | Enable the [`GSE` tokenizer](../config-refs/schema/index.md#gse-and-trigram-tokenization-methods) for use | `boolean` | `true` |
+| `ENABLE_TOKENIZER_KAGOME_JA` | Enable the [`Kagome` tokenizer for Japanese](../config-refs/schema/index.md#kagome_ja-tokenization-method) for use (Experimental as of `v1.28.0`) | `boolean` | `true` |
 | `ENABLE_TOKENIZER_KAGOME_KR` | Enable the [`Kagome` tokenizer for Korean](../config-refs/schema/index.md#kagome_kr-tokenization-method) for use (Experimental as of `v1.25.7`) | `boolean` | `true` |
 | `GODEBUG` | Controls debugging variables within the runtime. [See official Go docs](https://pkg.go.dev/runtime). | `string - comma-separated list of name=val pairs` | `gctrace=1` |
 | `GOMAXPROCS` | Set the maximum number of threads that can be executing simultaneously. If this value is set, it will be respected by `LIMIT_RESOURCES`. | `string - number` | `NUMBER_OF_CPU_CORES` |

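Note on `ENABLE_TOKENIZER_KAGOME_JA`: the variable only makes the tokenizer available on the instance; a property still has to select it through its `tokenization` setting. The sketch below builds a collection definition as plain JSON, assuming the standard schema shape (`class`, `properties`, `dataType`, `tokenization`). The collection name `JapaneseArticle`, the property name `body`, and both helper functions are hypothetical, and sending the payload to a Weaviate instance is deliberately left out.

```python
import json


def text_property(name: str, tokenization: str) -> dict:
    """Build a `text` property definition that uses the given tokenization."""
    return {"name": name, "dataType": ["text"], "tokenization": tokenization}


def collection_definition(class_name: str, properties: list[dict]) -> dict:
    """Wrap property definitions in a minimal collection definition."""
    return {"class": class_name, "properties": properties}


# Hypothetical collection and property names, for illustration only.
definition = collection_definition(
    "JapaneseArticle",
    [text_property("body", "kagome_ja")],
)

# ensure_ascii=False keeps any Japanese text readable if it appears in the payload.
print(json.dumps(definition, ensure_ascii=False, indent=2))
```

If the instance was started without `ENABLE_TOKENIZER_KAGOME_JA` set to `true`, a definition like this would presumably be rejected or fail at indexing time, so the two settings should be changed together.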
diff --git a/developers/weaviate/config-refs/schema/index.md b/developers/weaviate/config-refs/schema/index.md
@@ -546,6 +546,25 @@ The `gse` tokenizer is not loaded by default to save resources. To use it, set t
 - `"素早い茶色の狐が怠けた犬を飛び越えた"`: `["素早", "素早い", "早い", "茶色", "の", "狐", "が", "怠け", "けた", "犬", "を", "飛び", "飛び越え", "越え", "た", "素早い茶色の狐が怠けた犬を飛び越えた"]`
 - `"すばやいちゃいろのきつねがなまけたいぬをとびこえた"`: `["すばや", "すばやい", "やい", "いち", "ちゃ", "ちゃい", "ちゃいろ", "いろ", "のき", "きつ", "きつね", "つね", "ねが", "がな", "なま", "なまけ", "まけ", "けた", "けたい", "たい", "いぬ", "を", "とび", "とびこえ", "こえ", "た", "すばやいちゃいろのきつねがなまけたいぬをとびこえた"]`
 
+### `kagome_ja` tokenization method
+
+:::caution Experimental feature
+Available starting in `v1.28.0`. This is an experimental feature. Use with caution.
+:::
+
+For Japanese text, the `kagome_ja` tokenization method is also available. This uses the [`Kagome` tokenizer](https://github.com/ikawaha/kagome?tab=readme-ov-file) with a Japanese [MeCab IPA](https://github.com/ikawaha/kagome-dict/) dictionary to split the property text.
+
+The `kagome_ja` tokenizer is not loaded by default to save resources. To use it, set the environment variable `ENABLE_TOKENIZER_KAGOME_JA` to `true` on the Weaviate instance.
+
+`kagome_ja` tokenization examples:
+
+- `"春の夜の夢はうつつよりもかなしき 夏の夜の夢はうつつに似たり 秋の夜の夢はうつつを超え 冬の夜の夢は心に響く 山のあなたに小さな村が見える 川の音が静かに耳に届く 風が木々を通り抜ける音 星空の下、すべてが平和である"`:
+  - `["春", "の", "夜", "の", "夢", "は", "うつつ", "より", "も", "かなしき", "\n\t", "夏", "の", "夜", "の", "夢", "は", "うつつ", "に", "似", "たり", "\n\t", "秋", "の", "夜", "の", "夢", "は", "うつつ", "を", "超え", "\n\t", "冬", "の", "夜", "の", "夢", "は", "心", "に", "響く", "\n\n\t", "山", "の", "あなた", "に", "小さな", "村", "が", "見える", "\n\t", "川", "の", "音", "が", "静か", "に", "耳", "に", "届く", "\n\t", "風", "が", "木々", "を", "通り抜ける", "音", "\n\t", "星空", "の", "下", "、", "すべて", "が", "平和", "で", "ある"]`
+- `"素早い茶色の狐が怠けた犬を飛び越えた"`:
+  - `["素早い", "茶色", "の", "狐", "が", "怠け", "た", "犬", "を", "飛び越え", "た"]`
+- `"すばやいちゃいろのきつねがなまけたいぬをとびこえた"`:
+  - `["すばやい", "ちゃ", "いろ", "の", "きつね", "が", "なまけ", "た", "いぬ", "を", "とびこえ", "た"]`
+
 ### `kagome_kr` tokenization method
 
 :::caution Experimental feature
@@ -558,9 +577,12 @@ The `kagome_kr` tokenizer is not loaded by default to save resources. To use it,
 
 `kagome_kr` tokenization examples:
 
-- `"아버지가방에들어가신다"`: `["아버지", "가", "방", "에", "들어가", "신다"]`
-- `"아버지가 방에 들어가신다"`: `["아버지", "가", "방", "에", "들어가", "신다"]`
-- `"결정하겠다"`: `["결정", "하", "겠", "다"]`
+- `"아버지가방에들어가신다"`:
+  - `["아버지", "가", "방", "에", "들어가", "신다"]`
+- `"아버지가 방에 들어가신다"`:
+  - `["아버지", "가", "방", "에", "들어가", "신다"]`
+- `"결정하겠다"`:
+  - `["결정", "하", "겠", "다"]`
 
 ### Inverted index types