Vespa Linguistics with Kuromoji Tokenizer

Overview

This package provides a Japanese tokenizer for Vespa based on Kuromoji. Kuromoji is a well-known Japanese tokenizer implemented in Java and used by various services such as Solr and Elasticsearch. For more details, see the official Kuromoji website.

Kuromoji

Create Package

Requirement

JDK (>= 11) and Maven are required to build the package.

Build

Run the mvn command below to build the package. The result is written to target/kuromoji-linguistics-${VERSION}-deploy.jar.

$ mvn package -Dvespa.version='7.594.36'     # You can specify 7.594.36 or later.

Use Package

Deploy

Put the built package in the components directory of your service. If the components directory does not exist, create it. For example, with sampleapps the structure looks like this:

sampleapps/search/music/
- services.xml
- components/
  - kuromoji-linguistics-${VERSION}-deploy.jar
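
The copy step can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name install_component is hypothetical and not part of this package, and the paths in the usage comment follow the sampleapps layout shown above.

```shell
# Sketch: copy a built deploy jar into an application's components directory.
# install_component is a hypothetical helper name, not part of this package.
install_component() {
    app_dir="$1"   # application directory, e.g. sampleapps/search/music
    jar="$2"       # path to the built *-deploy.jar
    # Create components/ if it does not exist yet, then copy the jar into it.
    mkdir -p "$app_dir/components" || return 1
    cp "$jar" "$app_dir/components/"
}

# Usage (assumes VERSION is set to the version you built):
#   install_component sampleapps/search/music "target/kuromoji-linguistics-${VERSION}-deploy.jar"
```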

Configuration

Because the package is used by both searchers and indexers, it is recommended to define <component> in every <container> section (called <jdisc> in older Vespa versions) of services.xml.

<container id="container" version="1.0">
    <component id="kuromoji" class="jp.co.yahoo.vespa.language.lib.kuromoji.KuromojiLinguistics" bundle="kuromoji-linguistics">
        <config name="language.lib.kuromoji.kuromoji">
            <mode>search</mode>
            <ignore_case>true</ignore_case>
        </config>
    </component>
</container>

You can optionally configure the package with <config name="language.lib.kuromoji.kuromoji">. The parameters and their default values are listed below.

parameter	type	default	description
mode	string	search	Kuromoji mode (normal, search, or extended)
kanji.length_threshold	int	2	length threshold above which kanji tokens are penalized during the Viterbi search (expert feature)
kanji.penalty	int	3000	additional cost for kanji tokens longer than the kanji length threshold (expert feature)
other.length_threshold	int	7	length threshold above which non-kanji tokens are penalized during the Viterbi search (expert feature)
other.penalty	int	1700	additional cost for non-kanji tokens longer than the non-kanji length threshold (expert feature)
nakaguro_split	bool	false	whether to split unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT)
user_dict	string	-	path to the user dictionary
tokenlist_name	string	default	name of the target specialtokens list
all_language	bool	false	whether to apply the Kuromoji tokenizer to all languages (true) or only to Japanese (false)
ignore_case	bool	true	whether to ignore differences between upper and lower case

Activate

Use the deploy commands to activate the package. For example, with sampleapps the commands look like this:

$ vespa-deploy prepare sampleapps/search/music/
$ vespa-deploy activate

Now you can use the tokenizer by specifying the "language=ja" option.
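
As a rough illustration, a query that passes language=ja could be sent with curl as sketched below. The helper name search_ja is hypothetical, and the endpoint (http://localhost:8080, with the /search/ path) and the query text are assumptions about a local setup, not something this package defines.

```shell
# Sketch: send a query with language=ja to a Vespa container.
# search_ja is a hypothetical helper; the endpoint and query are assumptions.
search_ja() {
    endpoint="$1"   # e.g. http://localhost:8080 (assumed container address)
    query="$2"      # query text, e.g. 東京
    # -G appends the URL-encoded parameters to the URL as a query string.
    curl -s -G "$endpoint/search/" \
        --data-urlencode "query=$query" \
        --data-urlencode "language=ja"
}

# Usage:
#   search_ja http://localhost:8080 '東京'
```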

License

Code licensed under the Apache 2.0 license. See LICENSE for terms.

Contributor License Agreement

This project requires contributors to agree to a Contributor License Agreement (CLA).

Note that, only for contributions to the vespa-kuromoji-linguistics repository on GitHub (https://github.com/yahoojapan/vespa-kuromoji-linguistics), contributors shall be deemed to have agreed to the CLA without individual written agreements.
