Header-only Japanese text normalizer in C++11

Simple Japanese text normalizer written in portable C++11, based on neologdn. https://github.com/ikegami-yukino/neologdn

It is well suited to embedding Japanese normalization in LLM (Large Language Model) apps (e.g. llama.cpp https://github.com/ggerganov/llama.cpp ).

A simple library that normalizes Japanese text using only the C++ STL (and no std::regex). Japanese README: README-ja.md

Rules

https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja

Requirements

UTF-8 text
- UTF-16 and UTF-32 are not supported at the moment.
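
Since only UTF-8 input is supported, it may help to check text of unknown origin before normalizing it. The following is a minimal sketch of such a check; `is_valid_utf8` is a hypothetical helper and not part of jp_normalizer.hh.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of jp_normalizer.hh).
// Returns true if `s` is well-formed UTF-8: valid lead/continuation bytes,
// no truncated or overlong sequences, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string &s) {
  static const unsigned long kMinCodePoint[] = {0, 0, 0x80, 0x800, 0x10000};
  const size_t n = s.size();
  size_t i = 0;
  while (i < n) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    size_t len = 0;
    unsigned long cp = 0;
    if (b0 < 0x80) {
      len = 1; cp = b0;
    } else if ((b0 & 0xE0) == 0xC0) {
      len = 2; cp = b0 & 0x1F;
    } else if ((b0 & 0xF0) == 0xE0) {
      len = 3; cp = b0 & 0x0F;
    } else if ((b0 & 0xF8) == 0xF0) {
      len = 4; cp = b0 & 0x07;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (i + len > n) return false;  // truncated sequence
    for (size_t k = 1; k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(s[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < kMinCodePoint[len]) return false;       // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                 // outside Unicode
    i += len;
  }
  return true;
}
```

A caller could reject or transcode the input when this returns false, for example when a file turns out to be UTF-16LE.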

Usage

// Define this in exactly one .cc file
#define JP_NORMALIZER_IMPLEMENTATION
#include "jp_normalizer.hh"

std::string text = "ﾊﾝｶｸｶﾅ";
std::string normalized_text = jpnormalizer::normalize(text);
// => "ハンカクカナ"

// Control normalization using NormalizationOptions
jpnormalizer::NormalizationOptions options;

options.repeat = 5;

// Reuse the variable declared above instead of redeclaring it
normalized_text = jpnormalizer::normalize(text, options);
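
To illustrate the kind of rewrite shown in the usage example (half-width katakana to full-width), here is a minimal standalone sketch. It is not the code in jp_normalizer.hh: the function name `halfwidth_kana_to_fullwidth` is hypothetical, and it covers only U+FF71 to U+FF9D (ｱ to ﾝ), ignoring voiced-sound marks, small kana, and punctuation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical illustration only (not the library's implementation).
// Maps half-width katakana U+FF71..U+FF9D to their full-width forms and
// copies every other byte through unchanged. Assumes UTF-8 input.
std::string halfwidth_kana_to_fullwidth(const std::string &in) {
  // Full-width code points for U+FF71 (ｱ) .. U+FF9D (ﾝ), in order.
  static const unsigned short kFull[45] = {
      0x30A2, 0x30A4, 0x30A6, 0x30A8, 0x30AA,  // ア イ ウ エ オ
      0x30AB, 0x30AD, 0x30AF, 0x30B1, 0x30B3,  // カ キ ク ケ コ
      0x30B5, 0x30B7, 0x30B9, 0x30BB, 0x30BD,  // サ シ ス セ ソ
      0x30BF, 0x30C1, 0x30C4, 0x30C6, 0x30C8,  // タ チ ツ テ ト
      0x30CA, 0x30CB, 0x30CC, 0x30CD, 0x30CE,  // ナ ニ ヌ ネ ノ
      0x30CF, 0x30D2, 0x30D5, 0x30D8, 0x30DB,  // ハ ヒ フ ヘ ホ
      0x30DE, 0x30DF, 0x30E0, 0x30E1, 0x30E2,  // マ ミ ム メ モ
      0x30E4, 0x30E6, 0x30E8,                  // ヤ ユ ヨ
      0x30E9, 0x30EA, 0x30EB, 0x30EC, 0x30ED,  // ラ リ ル レ ロ
      0x30EF, 0x30F3};                         // ワ ン
  std::string out;
  const size_t n = in.size();
  size_t i = 0;
  while (i < n) {
    const unsigned char b0 = static_cast<unsigned char>(in[i]);
    // Half-width katakana are 3-byte sequences starting with 0xEF.
    if (b0 == 0xEF && i + 2 < n) {
      const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
      const unsigned char b2 = static_cast<unsigned char>(in[i + 2]);
      if ((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80) {
        const unsigned long cp = (static_cast<unsigned long>(b0 & 0x0F) << 12) |
                                 (static_cast<unsigned long>(b1 & 0x3F) << 6) |
                                 (b2 & 0x3F);
        if (cp >= 0xFF71 && cp <= 0xFF9D) {
          const unsigned long full = kFull[cp - 0xFF71];
          // Re-encode the full-width code point as 3-byte UTF-8.
          out.push_back(static_cast<char>(0xE0 | (full >> 12)));
          out.push_back(static_cast<char>(0x80 | ((full >> 6) & 0x3F)));
          out.push_back(static_cast<char>(0x80 | (full & 0x3F)));
          i += 3;
          continue;
        }
      }
    }
    out.push_back(in[i]);  // anything else is copied as-is
    ++i;
  }
  return out;
}
```

The library's normalizer applies many more rules than this; the sketch only shows the shape of one table-driven substitution over UTF-8 bytes.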

Limitation

By default, input is limited to 1G tokens (roughly 3 GB of UTF-8 Japanese text, since most Japanese characters take 3 bytes in UTF-8). You can change this limit in NormalizationOptions.
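
The relation between token count and byte size can be checked with a small counter. This is a sketch under the assumption that one "token" corresponds to one Unicode code point, which the text above does not state explicitly; `count_utf8_code_points` and `within_limit` are hypothetical helpers, not part of jp_normalizer.hh.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in UTF-8 text by counting every
// byte that is not a continuation byte (10xxxxxx). Assumes valid UTF-8.
size_t count_utf8_code_points(const std::string &s) {
  size_t count = 0;
  for (size_t i = 0; i < s.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if ((c & 0xC0) != 0x80) ++count;
  }
  return count;
}

// Hypothetical guard: true if the text has at most `max_code_points`
// code points, so a caller can reject oversized input up front.
bool within_limit(const std::string &s, size_t max_code_points) {
  return count_utf8_code_points(s) <= max_code_points;
}
```

For typical Japanese text each kana or kanji occupies 3 bytes, so `s.size()` is close to three times the value returned by `count_utf8_code_points(s)`.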

Security

jp_normalizer.hh has been tested with the LLVM fuzzer. No security issues (segfault, OOM) have been observed so far.

TODO

- Implement the shorten-repeat feature.
- Support more Enclosed CJK Letters and Months.
- wstring (WideChar) support on Windows.
- UTF-16 text? (e.g. Unicode UTF-16LE text on Windows)

License

Apache 2.0 license

Third party licenses

neologdn: Apache 2.0 License: https://github.com/ikegami-yukino/neologdn
