Tokenizers used for each language:

as, ca, cs, co, de, el, en, es, fi, fr, ga, hu, is, it, lt, lv, mni, nl, pl, pt, ro, ru, sk, sl, sv, ta: sacremoses (MosesTokenizer) (https://github.com/hplt-project/sacremoses)
da, et, no, tr: NLTK word tokenizer (https://www.nltk.org/api/nltk.tokenize.html)
sr, mk, bg, hr: Reldi (https://github.com/clarinsi/reldi-tokeniser)
bs: Reldi, with fallback to sr
ab, ba: sacremoses (MosesTokenizer), with fallback to ru
af: sacremoses (MosesTokenizer), with fallback to nl
ast,eu, gl: sacremoses (MosesTokenizer) with fallback to es
br: sacremoses (MosesTokenizer), with fallback to fr
co: sacremoses (MosesTokenizer), with fallback to it
cy, eo, ga, la, lt, mt, so, sq, sw, tl: sacremoses (MosesTokenizer) with fallback to en
lb: sacremoses (MosesTokenizer) with fallback to de
lvs: sacremoses (MosesTokenizer) with fallback to lv
ja: MeCab (https://github.com/SamuraiT/mecab-python3)
ko: MeCab-ko (https://github.com/NoUnique/pymecab-ko)
my, shn: pyidaungsu (https://github.com/kaunghtetsan275/pyidaungsu)
si: sinling (https://github.com/ysenarath/sinling)
uz, uzn: fitrat (https://github.com/tahrirchi/fitrat)
mr: MarathiNLP (https://github.com/l3cube-pune/MarathiNLP)
bn: BNLP (https://github.com/sagorbrur/bnlp)
th: Thai Segmenter (https://github.com/Querela/thai-segmenter)
awa, bho, gu, hi, hne, kn, ks, mag, mai, ml, mr, ne, npi, pa, sa, sat, te, ur: Indic NLP Library (https://github.com/anoopkunchukuttan/indic_nlp_library)
zh, zh-Hant: pkuseg (https://github.com/explosion/spacy-pkuseg)
he, iw, ydd: Hebrew Tokenizer (https://github.com/YontiLevin/Hebrew-Tokenizer)
id: Kumparan's NLP Services (https://github.com/kumparan/nlp-id)
~~bo: Botok (https://github.com/OpenPecha/Botok)~~
~~ckb, kmr: KLPT (https://github.com/sinaahmadi/KurdishTokenization)~~
yue: pyCantonese (https://github.com/jacksonllee/pycantonese)
lo: LaoNLP (https://github.com/wannaphong/LaoNLP)
ory: Open Odia (https://github.com/soumendrak/openodia)
ig: Igbo Text (https://github.com/goodyduru/igbo-text)
am, ti: ETNLTK (https://github.com/robeleq/etnltk/)
nb, nn: NTLK word tokenizer, with fallback to no
others (any not in the list above): NTLK word punct tokenizer (https://www.nltk.org/api/nltk.tokenize.punkt.html)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tokenizers-info.md

tokenizers-info.md

Files

tokenizers-info.md

Latest commit

History

tokenizers-info.md

File metadata and controls