Improve similar word searching by changing lexicon data structure #10

Open
foxik opened this issue Apr 29, 2015 · 0 comments

Comments


foxik commented Apr 29, 2015

Currently the similar word searching takes quite a lot of time. Considering korektor-czech-130202 with viterbi_beam_size=15 and viterbi_stage_pruning=2.5, similar word searching takes:

  • 47.6% with diacritics_h2mor.conf (for comparison, LM lookup takes 38.3%)
  • 44.2% with spellchecking_h2mor.conf (for comparison, LM lookup takes 30.6%)

Nevertheless, in both these cases similar word searching should be able to process hundreds of kilowords per second, which would make korektor nearly twice as fast as it is now. The reason for the large search time is the lexicon data structure, which is currently very flexible, but also quite slow.

New lexicon data structure ideas:

  • use several hash tables instead of the current tree structure
  • one hash table for all words in the lexicon, mapping each word to its ID
  • another hash table indexing each word by its variant without diacritical marks; this is needed for diacritics_h2mor.conf
  • for 1 edit distance searches, another hash table containing for each word all its variants with one character replaced by a wildcard character (e.g., for "boy" there would be "?oy", "b?y" and "bo?"). The size of such a hash table is approximately the size of the dictionary multiplied by the average word length. Luckily, this hash table can be represented using only a Bloom filter, which makes the size of the representation reasonable (see the sketch after this list)
  • if we require searches with more than 1 edit distance, we need to chain them with the previous hash. For 2 edit distance searches, we would have another hash table containing, for each entry in the previous hash (i.e., a word with one wildcard), all its variants with one character before the wildcard replaced by a wildcard (and the original wildcard removed). In other words, this hash contains strings with one wildcard that can be turned into a valid word by placing a second wildcard (I may not have covered deletions here, so those would have to be added too)
  • we could continue with 3 edit distance, but note that the size of the n edit distance hash grows by a factor of the average word length with each additional edit, so 2 edit distance is probably a reasonable maximum
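
To make the proposal concrete, here is a minimal sketch (not korektor code) of the exact-word hash table plus the one-wildcard Bloom filter. It assumes byte-level (ASCII) handling and a toy Bloom filter; the names BloomFilter, Lexicon and kWildcard are illustrative only, and the diacritics-stripped table is only hinted at in a comment.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal Bloom filter: k probes into a shared bit array, no deletions.
class BloomFilter {
 public:
  explicit BloomFilter(size_t bits) : bits_(bits) {}

  void insert(const std::string& s) {
    for (size_t i = 0; i < kHashes; i++) bits_[index(s, i)] = true;
  }

  // May return a false positive, never a false negative.
  bool maybe_contains(const std::string& s) const {
    for (size_t i = 0; i < kHashes; i++)
      if (!bits_[index(s, i)]) return false;
    return true;
  }

 private:
  static const size_t kHashes = 3;

  size_t index(const std::string& s, size_t seed) const {
    // Cheap seeding by appending the probe number; good enough for a sketch.
    return std::hash<std::string>()(s + char('0' + seed)) % bits_.size();
  }

  std::vector<bool> bits_;
};

class Lexicon {
 public:
  static const char kWildcard = '?';

  void add(const std::string& word, uint32_t id) {
    word_to_id_[word] = id;
    // A second hash table keyed by the diacritics-stripped form would be
    // built analogously (omitted here to avoid Unicode handling).
    // "boy" contributes "?oy", "b?y", "bo?" to the Bloom filter; membership
    // is approximate, since hits are verified against word_to_id_ anyway.
    for (const std::string& v : one_wildcard_variants(word))
      one_wildcard_.insert(v);
  }

  // All variants of `word` with exactly one character replaced by a wildcard.
  static std::vector<std::string> one_wildcard_variants(const std::string& word) {
    std::vector<std::string> variants;
    for (size_t i = 0; i < word.size(); i++) {
      std::string v = word;
      v[i] = kWildcard;
      variants.push_back(v);
    }
    return variants;
  }

  std::unordered_map<std::string, uint32_t> word_to_id_;
  BloomFilter one_wildcard_{1 << 22};  // sized for the whole lexicon
};
```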

Such a structure would have the following complexity:

  • for the diacritics search, we need to make only 1 lookup
  • for the 1 edit distance searches, we would need a linear number of lookups. Nevertheless, all the hashes can be computed in the same linear time if we use a suitable hash function (see the sketch after this list)
  • for the 2 edit distance searches, the number of lookups can be limited to O(N + N * number of found words), where N is the length of the original word. The reason is that when we try placing a wildcard in the word, we only continue with the second wildcard if there is at least one way to place the second wildcard so that the result is a valid word, which gives us result-sensitive complexity
  • for 3, 4, ... edit distance searches we can obtain the same complexity as for the 2 edit distance searches, but the open question is the size of the required hash
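
Below is a minimal sketch of the result-sensitive lookup over the structures from the previous listing, restricted to substitutions at edit distance 1 and a lowercase ASCII alphabet; similar_words_1edit and kAlphabet are illustrative assumptions, not korektor's API. It shows the idea that carries over to 2 edit distance: probe the cheap, approximate wildcard hash first and expand only the positions that can still lead to a valid word.

```cpp
#include <cstdint>
#include <string>
#include <vector>

std::vector<uint32_t> similar_words_1edit(const Lexicon& lexicon,
                                          const std::string& word) {
  static const std::string kAlphabet = "abcdefghijklmnopqrstuvwxyz";
  std::vector<uint32_t> result;

  for (size_t i = 0; i < word.size(); i++) {
    std::string pattern = word;
    pattern[i] = Lexicon::kWildcard;

    // O(N) probes in total; with a suitable (e.g. rolling) hash function the
    // fingerprints of all N patterns can be computed in O(N) time overall.
    if (!lexicon.one_wildcard_.maybe_contains(pattern)) continue;

    // Expand the wildcard and verify: Bloom filter false positives are
    // filtered out by the exact word_to_id_ lookup.
    for (char c : kAlphabet) {
      if (c == word[i]) continue;
      pattern[i] = c;
      auto it = lexicon.word_to_id_.find(pattern);
      if (it != lexicon.word_to_id_.end()) result.push_back(it->second);
    }
  }
  return result;
}
```

Because the Bloom filter admits false positives but no false negatives, the final exact lookup keeps the results correct; the filter only prunes the expansion, which is where the result-sensitive bound comes from.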

Better ideas welcome :-)
