Improve similar word searching by changing lexicon data structure #10

Open
foxik opened this issue Apr 29, 2015 · 0 comments

Comments


foxik commented Apr 29, 2015

Currently the similar word searching takes quite a lot of time. Considering korektor-czech-130202 with viterbi_beam_size=15 and viterbi_stage_pruning=2.5, similar word searching takes:

  • 47.6% with diacritics_h2mor.conf (for comparison, LM lookup takes 38.3%)
  • 44.2% with spellchecking_h2mor.conf (for comparison, LM lookup takes 30.6%)

Nevertheless, in both these cases similar word searching should be able to process hundreds of kilowords per second, which would make korektor nearly twice as fast as it is now. The reason for the large search time is the lexicon data structure, which is currently very flexible, but also quite slow.

New lexicon data structure ideas:

  • use several hash tables instead of the current tree structure
  • one hash table for all words in the lexicon, mapping each word to its ID
  • another hash table indexing each word by its variant without diacritical marks; this is needed for diacritics_h2mor.conf
  • for 1 edit distance searches, another hash table containing for each word all its variants with one character replaced by a wildcard character (e.g., for "boy" there would be "?oy", "b?y" and "bo?"). The size of such a hash table is approximately the size of the dictionary multiplied by the average word length. Luckily, this hash table can be represented using only a Bloom filter, which makes the size of the representation reasonable (see the sketch after this list)
  • if we require searches with more than 1 edit distance, we need to chain them with the previous hash. For 2 edit distance searches, we would have another hash table containing, for each entry in the previous hash (i.e., a word with one wildcard), all its variants with one character before the wildcard replaced by a wildcard (and the original wildcard removed). In other words, this hash contains strings with one wildcard that can be turned into a valid word by placing a second wildcard (I may not have covered deletions here, so those would have to be added too)
  • we could continue with 3 edit distance, but note that the size of the n edit distance hash grows by a factor of the average word length with each additional edit, so 2 edit distance is probably a reasonable maximum
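
To make the proposal concrete, here is a minimal sketch (not korektor code) of the exact-word hash table plus the one-wildcard Bloom filter. It assumes byte-level (ASCII) handling and a toy Bloom filter; the names BloomFilter, Lexicon and kWildcard are illustrative only, and the diacritics-stripped table is only hinted at in a comment.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal Bloom filter: k probes into a shared bit array, no deletions.
class BloomFilter {
 public:
  explicit BloomFilter(size_t bits) : bits_(bits) {}

  void insert(const std::string& s) {
    for (size_t i = 0; i < kHashes; i++) bits_[index(s, i)] = true;
  }

  // May return a false positive, never a false negative.
  bool maybe_contains(const std::string& s) const {
    for (size_t i = 0; i < kHashes; i++)
      if (!bits_[index(s, i)]) return false;
    return true;
  }

 private:
  static const size_t kHashes = 3;

  size_t index(const std::string& s, size_t seed) const {
    // Cheap seeding by appending the probe number; good enough for a sketch.
    return std::hash<std::string>()(s + char('0' + seed)) % bits_.size();
  }

  std::vector<bool> bits_;
};

class Lexicon {
 public:
  static const char kWildcard = '?';

  void add(const std::string& word, uint32_t id) {
    word_to_id_[word] = id;
    // A second hash table keyed by the diacritics-stripped form would be
    // built analogously (omitted here to avoid Unicode handling).
    // "boy" contributes "?oy", "b?y", "bo?" to the Bloom filter; membership
    // is approximate, since hits are verified against word_to_id_ anyway.
    for (const std::string& v : one_wildcard_variants(word))
      one_wildcard_.insert(v);
  }

  // All variants of `word` with exactly one character replaced by a wildcard.
  static std::vector<std::string> one_wildcard_variants(const std::string& word) {
    std::vector<std::string> variants;
    for (size_t i = 0; i < word.size(); i++) {
      std::string v = word;
      v[i] = kWildcard;
      variants.push_back(v);
    }
    return variants;
  }

  std::unordered_map<std::string, uint32_t> word_to_id_;
  BloomFilter one_wildcard_{1 << 22};  // sized for the whole lexicon
};
```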

Such a structure would have the following complexity:

  • for the diacritics search, we need to make only 1 lookup
  • for the 1 edit distance searches, we would need a linear number of lookups. Nevertheless, all the hashes can be computed in the same linear time if we use a suitable hash function (see the sketch after this list)
  • for the 2 edit distance searches, the number of lookups can be limited to O(N + N * number of found words), where N is the length of the original word. The reason is that when we try placing a wildcard in the word, we only continue with the second wildcard if there is at least one way to place the second wildcard so that the result is a valid word, which gives us result-sensitive complexity
  • for 3, 4, ... edit distance searches we can obtain the same complexity as for the 2 edit distance searches, but the open question is the size of the required hash
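
Below is a minimal sketch of the result-sensitive lookup over the structures from the previous listing, restricted to substitutions at edit distance 1 and a lowercase ASCII alphabet; similar_words_1edit and kAlphabet are illustrative assumptions, not korektor's API. It shows the idea that carries over to 2 edit distance: probe the cheap, approximate wildcard hash first and expand only the positions that can still lead to a valid word.

```cpp
#include <cstdint>
#include <string>
#include <vector>

std::vector<uint32_t> similar_words_1edit(const Lexicon& lexicon,
                                          const std::string& word) {
  static const std::string kAlphabet = "abcdefghijklmnopqrstuvwxyz";
  std::vector<uint32_t> result;

  for (size_t i = 0; i < word.size(); i++) {
    std::string pattern = word;
    pattern[i] = Lexicon::kWildcard;

    // O(N) probes in total; with a suitable (e.g. rolling) hash function the
    // fingerprints of all N patterns can be computed in O(N) time overall.
    if (!lexicon.one_wildcard_.maybe_contains(pattern)) continue;

    // Expand the wildcard and verify: Bloom filter false positives are
    // filtered out by the exact word_to_id_ lookup.
    for (char c : kAlphabet) {
      if (c == word[i]) continue;
      pattern[i] = c;
      auto it = lexicon.word_to_id_.find(pattern);
      if (it != lexicon.word_to_id_.end()) result.push_back(it->second);
    }
  }
  return result;
}
```

Because the Bloom filter admits false positives but no false negatives, the final exact lookup keeps the results correct; the filter only prunes the expansion, which is where the result-sensitive bound comes from.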

Better ideas welcome :-)
