Simple word-level language ID using Viterbi based on unigram frequencies and character n-grams.

eginhard/word-level-language-id

Word-level language ID

Simple word-level language identification using the Viterbi algorithm based on unigram frequencies and character n-grams.
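As a rough illustration of the idea, here is a minimal Viterbi sketch that tags each word with a language. It is not the code in identify.py: the toy `LEXICON` probabilities, the `UNKNOWN` floor, the `switch_prob` parameter, and the function names are all assumptions made up for this example, and the emission score uses unigram frequencies only (no character n-grams).

```python
import math

# Toy unigram probabilities, invented for illustration only.
LEXICON = {
    "en": {"the": 0.05, "is": 0.03, "and": 0.04, "weather": 0.001, "nice": 0.001},
    "ga": {"tá": 0.04, "an": 0.05, "agus": 0.04, "aimsir": 0.001, "deas": 0.001},
}
UNKNOWN = 1e-6  # assumed floor probability for words missing from a lexicon


def emission_logprob(word, lang):
    """Log-probability of `word` under the toy unigram lexicon of `lang`."""
    return math.log(LEXICON[lang].get(word.lower(), UNKNOWN))


def viterbi(words, langs, emission, switch_prob=0.1):
    """Return the most likely language for each word.

    Transitions favour staying in the same language: probability
    1 - switch_prob to stay, switch_prob shared among the other languages.
    """
    if not words:
        return []
    stay = math.log(1.0 - switch_prob)
    switch = math.log(switch_prob / max(len(langs) - 1, 1))

    # Scores for the first word, with a uniform prior over languages.
    scores = {l: -math.log(len(langs)) + emission(words[0], l) for l in langs}
    backpointers = []

    for word in words[1:]:
        new_scores, pointers = {}, {}
        for l in langs:
            # Best previous language for ending up in language l.
            prev = max(langs, key=lambda p: scores[p] + (stay if p == l else switch))
            trans = stay if prev == l else switch
            new_scores[l] = scores[prev] + trans + emission(word, l)
            pointers[l] = prev
        scores = new_scores
        backpointers.append(pointers)

    # Follow the backpointers from the best final state.
    path = [max(langs, key=lambda l: scores[l])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


print(viterbi(["Tá", "an", "aimsir", "nice"], ["en", "ga"], emission_logprob))
# ['ga', 'ga', 'ga', 'en']
```

The transition penalty is what lets context help: a word unknown to both lexicons simply inherits the language of its neighbours.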

Usage

I recommend using Python 3 for better Unicode support.

To quickly try out the system, corpora and language models for British English and Irish are already included. See "Train new language models" below for how to add other languages. You might want to post-process the lexicons, because they are not clean: the Irish one, for example, contains some English words, and vice versa.
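One possible way to do that post-processing is to drop words that are far more typical of the other language. The sketch below assumes each lexicon has been loaded as a word-to-count dictionary; the function names and the `ratio` threshold are hypothetical and not part of this repository.

```python
def relative_freqs(counts):
    """Convert raw counts to relative frequencies."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def remove_likely_foreign(lexicon, other, ratio=10.0):
    """Drop words whose relative frequency in `other` is at least
    `ratio` times their relative frequency in `lexicon`."""
    own_rel = relative_freqs(lexicon)
    other_rel = relative_freqs(other)
    return {
        word: count
        for word, count in lexicon.items()
        if other_rel.get(word, 0.0) < ratio * own_rel[word]
    }


# Toy counts, invented for illustration.
irish = {"agus": 900, "an": 95, "the": 5}
english = {"the": 800, "and": 150, "an": 50}

print(remove_likely_foreign(irish, english))
# {'agus': 900, 'an': 95}
```

Words shared by both languages with comparable frequency, such as "an" here, are kept in both lexicons; only clear intruders are removed.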

Run word-level language ID on some example sentences:

python word-level-language-id/identify.py

Train new language models

Create or download a unigram frequency lexicon, e.g. from the Crúbadán Project, which provides them for over 2000 languages.

For example, download and unzip British English and Irish:

wget http://crubadan.org/files/en-GB.zip
wget http://crubadan.org/files/ga.zip

unzip '*.zip' -d word-level-language-id/corpora
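Before training, it can be useful to inspect a downloaded lexicon. The sketch below assumes a plain-text file with one `word count` pair per line, separated by whitespace; check the unzipped files to confirm their actual layout. `load_lexicon` is a hypothetical helper, not a function from this repository.

```python
from collections import Counter
from pathlib import Path


def load_lexicon(path, encoding="utf-8"):
    """Read `word count` lines into a Counter, skipping malformed lines."""
    counts = Counter()
    for line in Path(path).read_text(encoding=encoding).splitlines():
        parts = line.split()
        # Expect exactly two fields, the second a non-negative integer.
        if len(parts) != 2 or not parts[1].isdecimal():
            continue
        counts[parts[0]] += int(parts[1])
    return counts
```

For example, `load_lexicon(path).most_common(20)` lists the twenty most frequent words, which is a quick way to spot entries from the wrong language.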

Then train the language models:

python word-level-language-id/train.py
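How train.py builds its models is not described here, so the following is only a sketch of one way a character n-gram score could be estimated from a unigram lexicon. It counts frequency-weighted character trigrams and scores a word with add-alpha smoothing. The function names, the `^` and `$` padding symbols, and the `alpha` parameter are assumptions for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Character n-grams of `word`, padded with start (^) and end ($) markers."""
    padded = "^" * (n - 1) + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_char_ngrams(lexicon, n=3):
    """Count character n-grams over a word-to-count lexicon,
    weighting each n-gram by the frequency of the word it came from."""
    counts = Counter()
    for word, freq in lexicon.items():
        for gram in char_ngrams(word, n):
            counts[gram] += freq
    return counts


def word_logprob(word, counts, n=3, alpha=1.0):
    """Sum of add-alpha smoothed log relative frequencies of the word's n-grams.

    This is a simple n-gram frequency score, not a conditional language model.
    """
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra bucket for unseen n-grams
    return sum(
        math.log((counts[gram] + alpha) / (total + alpha * vocab))
        for gram in char_ngrams(word, n)
    )


# Toy lexicons, invented for illustration.
en_model = train_char_ngrams({"the": 10, "this": 5, "that": 5})
ga_model = train_char_ngrams({"agus": 10, "bhfuil": 5})

# "then" is in neither lexicon, but its trigrams look more English than Irish.
print(word_logprob("then", en_model) > word_logprob("then", ga_model))
# True
```

A score like this can back off for words that are missing from the unigram lexicon, which is where character n-grams are most useful.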
