Docs update for kagome_ja tokenizer (#2732)
* Prep kagome_ja docs

* Update formatting of tokenized text

* Copyedits

* Update versions
databyjp authored Dec 10, 2024
1 parent 1f9471f commit 3852110
Showing 4 changed files with 31 additions and 5 deletions.
3 changes: 2 additions & 1 deletion _includes/tokenization_definition.mdx
@@ -6,4 +6,5 @@
| `field` | Index the whole field after trimming whitespace characters. | `Hello, (beautiful) world` |
| `trigram` | Split the property as rolling trigrams. | `Hel`, `ell`, `llo`, `lo,`, ... |
| `gse` | Use the `gse` tokenizer to split the property. | [See `gse` docs](https://pkg.go.dev/github.com/go-ego/gse#section-readme) |
| `kagome_kr` | Use the `Kagome` tokenizer with a Korean dictionary to split the property. | [See `kagome` docs](https://github.com/ikawaha/kagome) and the [Korean dictionary](https://github.com/ikawaha/kagome-dict-ko) |
| `kagome_ja` | Use the `Kagome` tokenizer with a Japanese (IPA) dictionary to split the property. | [See `kagome` docs](https://github.com/ikawaha/kagome) and the [dictionary](https://github.com/ikawaha/kagome-dict/). |
| `kagome_kr` | Use the `Kagome` tokenizer with a Korean dictionary to split the property. | [See `kagome` docs](https://github.com/ikawaha/kagome) and the [Korean dictionary](https://github.com/ikawaha/kagome-dict-ko). |
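
The `trigram` row above describes rolling trigrams. As a minimal sketch of that idea only (the function name `rolling_trigrams` is hypothetical, and Weaviate's own tokenizer may normalize case, whitespace, or punctuation differently), a sliding window of three characters can be written as:

```python
def rolling_trigrams(text: str) -> list[str]:
    """Return every run of three consecutive characters in `text`."""
    # A string shorter than three characters yields an empty list.
    return [text[i:i + 3] for i in range(len(text) - 2)]


tokens = rolling_trigrams("Hello, (beautiful) world")
print(tokens[:4])  # ['Hel', 'ell', 'llo', 'lo,']
```

The first four tokens match the example column in the table; each later token shifts the window one character to the right.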
4 changes: 3 additions & 1 deletion developers/academy/py/tokenization/200_options.mdx
@@ -131,7 +131,9 @@ Weaviate provides `gse` and `trigram` (from `v1.24`) and `kagome_kr` (from `v1.2

`gse` implements the "Jieba" algorithm, which is a popular Chinese text segmentation algorithm. `trigram` splits text into all possible trigrams, which can be useful for languages like Japanese.

`kagome_kr` uses the [`Kagome` tokenizer](https://github.com/ikawaha/kagome?tab=readme-ov-file) with a Korean MeCab ([mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/)) dictionary to split the property text. This is useful for Korean text.
`kagome_ja` uses the [`Kagome` tokenizer](https://github.com/ikawaha/kagome?tab=readme-ov-file) with a Japanese [MeCab IPA](https://github.com/ikawaha/kagome-dict/) dictionary to split Japanese property text.

`kagome_kr` uses the [`Kagome` tokenizer](https://github.com/ikawaha/kagome?tab=readme-ov-file) with a Korean MeCab ([mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/)) dictionary to split Korean property text.

## Questions and feedback

1 change: 1 addition & 0 deletions developers/weaviate/config-refs/env-vars.md
@@ -29,6 +29,7 @@ default hostname has changed and a single node cluster believes there are suppos
| `ENABLE_API_BASED_MODULES` | Enable all API-based modules. (Experimental as of `v1.26.0`) | `boolean` | `true` |
| `ENABLE_MODULES` | Specify Weaviate modules to enable | `string - comma separated names` | `text2vec-openai,generative-openai` |
| `ENABLE_TOKENIZER_GSE` | Enable the [`GSE` tokenizer](../config-refs/schema/index.md#gse-and-trigram-tokenization-methods) for use | `boolean` | `true` |
| `ENABLE_TOKENIZER_KAGOME_JA` | Enable the [`Kagome` tokenizer for Japanese](../config-refs/schema/index.md#kagome_ja-tokenization-method) for use (Experimental as of `v1.28.0`) | `boolean` | `true` |
| `ENABLE_TOKENIZER_KAGOME_KR` | Enable the [`Kagome` tokenizer for Korean](../config-refs/schema/index.md#kagome_kr-tokenization-method) for use (Experimental as of `v1.25.7`) | `boolean` | `true` |
| `GODEBUG` | Controls debugging variables within the runtime. [See official Go docs](https://pkg.go.dev/runtime). | `string - comma-separated list of name=val pairs` | `gctrace=1` |
| `GOMAXPROCS` | Set the maximum number of threads that can be executing simultaneously. If this value is set, it will be respected by `LIMIT_RESOURCES`. | `string - number` | `NUMBER_OF_CPU_CORES` |
28 changes: 25 additions & 3 deletions developers/weaviate/config-refs/schema/index.md
@@ -546,6 +546,25 @@ The `gse` tokenizer is not loaded by default to save resources. To use it, set t
- `"素早い茶色の狐が怠けた犬を飛び越えた"`: `["素早", "素早い", "早い", "茶色", "の", "狐", "が", "怠け", "けた", "犬", "を", "飛び", "飛び越え", "越え", "た", "素早い茶色の狐が怠けた犬を飛び越えた"]`
- `"すばやいちゃいろのきつねがなまけたいぬをとびこえた"`: `["すばや", "すばやい", "やい", "いち", "ちゃ", "ちゃい", "ちゃいろ", "いろ", "のき", "きつ", "きつね", "つね", "ねが", "がな", "なま", "なまけ", "まけ", "けた", "けたい", "たい", "いぬ", "を", "とび", "とびこえ", "こえ", "た", "すばやいちゃいろのきつねがなまけたいぬをとびこえた"]`

### `kagome_ja` tokenization method

:::caution Experimental feature
Available starting in `v1.28.0`. This is an experimental feature. Use with caution.
:::

For Japanese text, the `kagome_ja` tokenization method is also available. It uses the [`Kagome` tokenizer](https://github.com/ikawaha/kagome?tab=readme-ov-file) with a Japanese [MeCab IPA](https://github.com/ikawaha/kagome-dict/) dictionary to split the property text.

The `kagome_ja` tokenizer is not loaded by default to save resources. To use it, set the environment variable `ENABLE_TOKENIZER_KAGOME_JA` to `true` on the Weaviate instance.

`kagome_ja` tokenization examples:

- `"春の夜の夢はうつつよりもかなしき 夏の夜の夢はうつつに似たり 秋の夜の夢はうつつを超え 冬の夜の夢は心に響く 山のあなたに小さな村が見える 川の音が静かに耳に届く 風が木々を通り抜ける音 星空の下、すべてが平和である"`:
  - `["春", "の", "夜", "の", "夢", "は", "うつつ", "より", "も", "かなしき", "\n\t", "夏", "の", "夜", "の", "夢", "は", "うつつ", "に", "似", "たり", "\n\t", "秋", "の", "夜", "の", "夢", "は", "うつつ", "を", "超え", "\n\t", "冬", "の", "夜", "の", "夢", "は", "心", "に", "響く", "\n\n\t", "山", "の", "あなた", "に", "小さな", "村", "が", "見える", "\n\t", "川", "の", "音", "が", "静か", "に", "耳", "に", "届く", "\n\t", "風", "が", "木々", "を", "通り抜ける", "音", "\n\t", "星空", "の", "下", "、", "すべて", "が", "平和", "で", "ある"]`
- `"素早い茶色の狐が怠けた犬を飛び越えた"`:
  - `["素早い", "茶色", "の", "狐", "が", "怠け", "た", "犬", "を", "飛び越え", "た"]`
- `"すばやいちゃいろのきつねがなまけたいぬをとびこえた"`:
  - `["すばやい", "ちゃ", "いろ", "の", "きつね", "が", "なまけ", "た", "いぬ", "を", "とびこえ", "た"]`
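
To use this tokenizer, a `text` property selects it through its `tokenization` setting. The following is a minimal sketch that only builds a schema-style payload and prints it; nothing is sent to a server. The collection name `JapaneseArticle`, the property name `title`, and the helper `text_property` are hypothetical, and the Weaviate instance is assumed to run with `ENABLE_TOKENIZER_KAGOME_JA` set to `true`.

```python
import json


def text_property(name: str, tokenization: str) -> dict:
    """Build a property definition for a text field with a given tokenizer."""
    return {
        "name": name,
        "dataType": ["text"],
        "tokenization": tokenization,
    }


# Hypothetical collection definition using the kagome_ja tokenizer.
collection_definition = {
    "class": "JapaneseArticle",
    "properties": [text_property("title", "kagome_ja")],
}

print(json.dumps(collection_definition, ensure_ascii=False, indent=2))
```

A client library or a direct schema request could carry the same `tokenization` value; check the API of the client version in use before relying on a specific helper.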

### `kagome_kr` tokenization method

:::caution Experimental feature
@@ -558,9 +577,12 @@ The `kagome_kr` tokenizer is not loaded by default to save resources. To use it,

`kagome_kr` tokenization examples:

- `"아버지가방에들어가신다"`: `["아버지", "가", "방", "에", "들어가", "신다"]`
- `"아버지가 방에 들어가신다"`: `["아버지", "가", "방", "에", "들어가", "신다"]`
- `"결정하겠다"`: `["결정", "하", "겠", "다"]`
- `"아버지가방에들어가신다"`:
  - `["아버지", "가", "방", "에", "들어가", "신다"]`
- `"아버지가 방에 들어가신다"`:
  - `["아버지", "가", "방", "에", "들어가", "신다"]`
- `"결정하겠다"`:
  - `["결정", "하", "겠", "다"]`

### Inverted index types

