LanguageTool can make use of large n-gram data sets to detect errors with words that are often confused, like their and there. The n-gram data set is huge and thus not part of the LT download. To make use of it, you have two choices:
- Use the editor on https://languagetool.org, which always has the latest and best ngram data.
- Set up your own LT server with the n-gram data.
To use the data locally:
- Make sure you have a fast disk, i.e. an SSD. Without an SSD, using this data can make LanguageTool much slower.
- Download the data (~8GB) from http://languagetool.org/download/ngram-data/ (note: data is currently only available for English, German, French, and Spanish). Use the `ngrams-xx-2015*` files for LanguageTool <= 6.5 and the `ngrams-xx-2024*` files for LanguageTool >= 6.6.
- Unzip it and put it in its own directory named `en`, `de`, `fr`, or `es`, depending on the language. The path you need to set in the next step is the directory that the `en` etc. directory is in, not that directory itself.
- Then, depending on how you use LanguageTool:
  - Command line: start with the `--languagemodel` option pointing to the ngram-index directory.
  - Server mode: start with the `--languageModel` option. Alternatively, you can start with the `--config file` option. This properties file needs to have a `languageModel=...` entry pointing to the ngram-index directory. Using a properties file also gives you access to some advanced configuration options. Call `java -jar languagetool-server.jar` to get a list of all options.
- Test with these sentences. These are examples of errors that can only be detected using the n-gram rule, as of September 2020:
  - English:
    - Don't forget to put on the breaks.
  - German:
    - In den christlichen Traditionen gibt es unterschiedliche Anleitungen zur Mediation und Kontemplation.
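A common setup mistake is pointing the language model option at the `en` directory itself rather than at its parent. The following is a minimal sketch of a check you could run before starting LanguageTool. `NgramDirCheck` and `resolveLanguageModelDir` are hypothetical names made up for this illustration; they are not part of LanguageTool.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of LanguageTool: verifies that a path is the
// parent directory that holds the per-language n-gram directory ("en", "de", ...).
class NgramDirCheck {

    // Returns the directory that should be used as the language model path.
    // Accepts the parent directory or, as a convenience, the language
    // directory itself (in which case its parent is returned).
    static Path resolveLanguageModelDir(Path candidate, String langCode) {
        Path abs = candidate.toAbsolutePath().normalize();
        if (Files.isDirectory(abs.resolve(langCode))) {
            return abs; // correct: candidate contains e.g. "en"
        }
        Path name = abs.getFileName();
        if (name != null && name.toString().equals(langCode)
                && Files.isDirectory(abs) && abs.getParent() != null) {
            return abs.getParent(); // candidate was the "en" directory itself
        }
        throw new IllegalArgumentException(
            "No '" + langCode + "' directory found in " + abs);
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway layout <tmp>/en to demonstrate both cases.
        Path root = Files.createTempDirectory("ngram-index");
        Files.createDirectory(root.resolve("en"));
        System.out.println(resolveLanguageModelDir(root, "en"));
        System.out.println(resolveLanguageModelDir(root.resolve("en"), "en"));
    }
}
```

Both calls in `main` print the same parent directory, which is the value to use for the language model option.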
An n-gram is a contiguous sequence of n items from a text, like `a girl` (2-gram) or `a tall girl` (3-gram). Once you have a large number of these n-grams with their number of occurrences, you can use this to detect errors in texts. For example, in `This is there last chance to escape.`, LanguageTool will look at the context of `there`, considering up to three words:

`This is there`, `is there last`, `there last chance`

The probabilities of these n-grams are then compared to the probabilities of:

`This is their`, `is their last`, `their last chance`

If the probability of the n-grams with `their` is higher than that of those with `there`, LanguageTool assumes there's an error in the input sentence.
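The comparison described above can be sketched in a few lines of Java. This is a toy illustration only: the occurrence counts and corpus size are invented, and the scoring (a product of add-one smoothed 3-gram frequencies) is an assumption chosen for brevity, not LanguageTool's actual implementation. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy illustration: invented counts and a simplified score,
// not LanguageTool's real scoring.
class NgramConfusionSketch {

    // Made-up 3-gram occurrence counts standing in for the real data set.
    static final Map<String, Long> COUNTS = Map.of(
        "This is there", 120L,
        "is there last", 3L,
        "there last chance", 1L,
        "This is their", 900L,
        "is their last", 450L,
        "their last chance", 700L
    );
    static final double TOTAL = 1_000_000.0; // made-up corpus size

    // Returns every 3-gram window of the sentence that contains position idx.
    static List<String> contexts(String[] tokens, int idx) {
        List<String> result = new ArrayList<>();
        for (int start = idx - 2; start <= idx; start++) {
            if (start >= 0 && start + 3 <= tokens.length) {
                result.add(String.join(" ",
                    tokens[start], tokens[start + 1], tokens[start + 2]));
            }
        }
        return result;
    }

    // Multiplies the add-one smoothed relative frequencies of all context 3-grams.
    static double score(String[] tokens, int idx) {
        double p = 1.0;
        for (String gram : contexts(tokens, idx)) {
            p *= (COUNTS.getOrDefault(gram, 0L) + 1) / TOTAL;
        }
        return p;
    }

    // True if replacing tokens[idx] with the alternative yields a higher score.
    static boolean alternativeIsMoreLikely(String[] tokens, int idx, String alternative) {
        String[] swapped = tokens.clone();
        swapped[idx] = alternative;
        return score(swapped, idx) > score(tokens, idx);
    }

    public static void main(String[] args) {
        String[] tokens = "This is there last chance to escape".split(" ");
        System.out.println(contexts(tokens, 2));
        System.out.println(alternativeIsMoreLikely(tokens, 2, "their"));
    }
}
```

With these invented counts, the three windows around `there` score far lower than the same windows with `their`, so the sketch flags the word as a likely confusion.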
We use this data set from Google, which is very similar to what Google uses for its n-gram viewer.
Here you can see the confusion pairs we support so far:
How to add word pairs is documented at Adding ngram Data Rules.
Technical background information can be found on Finding errors using Big Data.
To use ngrams via the Java API, use `JLanguageTool.activateLanguageModelRules(File)`.