Investigate non-exact / fuzzy matches #11
From johann-petrak/gateplugin-StringAnnotation#13
Hello Johann, I'm using your Extended Gazetteer with several million entries and it works very well, thanks for that! To contribute to the ideas about non-exact matches, I can share my method:
This method makes it possible to choose the best normalization for each gazetteer, and it is easy to change because all gazetteer types are always created by the Java script and the Token features by the JAPE grammar.
Thanks Thomas! Yes, this is a good method when most or all of the alternatives one wants to match can be generated automatically. However, sometimes users want to match in a way that cannot be predicted, e.g. based on Levenshtein distance, phonetic similarity, or similar. If there is a well-defined distance metric between the strings, this can be implemented as an extension to the trie matching algorithm, but doing so is not easy.
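To make the idea of extending trie matching with a distance metric more concrete, here is a minimal Python sketch of a trie lookup bounded by Levenshtein distance. It is not code from the plugin: `TrieNode`, `build_trie` and `fuzzy_lookup` are hypothetical names, and the sketch only handles whole-string lookup, not matching inside running text. It computes one row of the edit-distance table per trie node and abandons a branch as soon as every cell in the row exceeds the allowed distance.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """Hypothetical trie node; not the plugin's own data structure."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.entry: Optional[str] = None  # set when a gazetteer entry ends here


def build_trie(entries: List[str]) -> TrieNode:
    """Insert every (non-empty) entry into a character trie."""
    root = TrieNode()
    for entry in entries:
        node = root
        for ch in entry:
            node = node.children.setdefault(ch, TrieNode())
        node.entry = entry
    return root


def fuzzy_lookup(root: TrieNode, query: str, max_dist: int) -> List[Tuple[str, int]]:
    """Return (entry, distance) pairs whose Levenshtein distance is <= max_dist."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: turning "" into query[:i] costs i insertions.
    first_row = list(range(len(query) + 1))

    def walk(node: TrieNode, ch: str, prev_row: List[int]) -> None:
        # Extend the edit-distance table by one row for the trie character ch.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1,            # insertion
                           prev_row[i] + 1,           # deletion
                           prev_row[i - 1] + cost))   # substitution or match
        if node.entry is not None and row[-1] <= max_dist:
            results.append((node.entry, row[-1]))
        # Prune: if no cell is within the bound, no descendant can be either.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(results, key=lambda pair: (pair[1], pair[0]))
```

For example, with the entries `london`, `londen`, `lisbon` and `paris`, `fuzzy_lookup(trie, "londn", 1)` returns both `londen` and `london` at distance 1. The pruning step is what keeps the search from visiting the whole trie, but the cost still grows quickly with `max_dist`.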
I also plan to use Double Metaphone followed by Levenshtein distance for phonetic matching.
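The two-stage idea of a phonetic key followed by an edit-distance check could look roughly like the Python sketch below. Note the substitution: Double Metaphone is too long to reproduce here, so a simplified Soundex stands in as the phonetic key. All function names (`soundex`, `levenshtein`, `build_phonetic_index`, `phonetic_candidates`) are hypothetical and not part of the plugin.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Soundex digit groups. Soundex is used only as a stand-in for Double Metaphone.
_CODES: Dict[str, str] = {}
for _group, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _group:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Simplified American Soundex: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            key += code
        if c not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (key + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_phonetic_index(entries: List[str]) -> Dict[str, List[str]]:
    """Group gazetteer entries by their phonetic key."""
    index: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        index[soundex(entry)].append(entry)
    return index


def phonetic_candidates(index: Dict[str, List[str]], query: str,
                        max_dist: int) -> List[Tuple[str, int]]:
    """Stage 1: same phonetic key. Stage 2: keep entries within max_dist edits."""
    scored = [(e, levenshtein(query.lower(), e.lower()))
              for e in index.get(soundex(query), [])]
    return sorted((p for p in scored if p[1] <= max_dist),
                  key=lambda p: (p[1], p[0]))
```

The phonetic key keeps the candidate set small, so the more expensive edit-distance computation only runs on entries that already sound alike. A real Double Metaphone implementation yields up to two keys per word, so an index built on it would store each entry under both.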
From johann-petrak/gateplugin-StringAnnotation#10
Try to at least support matches where certain characters, e.g. hyphens, are treated as equivalent to embedded white space.
This could perhaps be implemented as part of our own trie implementation, but see the issue about using jaspell for a possible alternative.
Also, see if we can use gateplugin-ft-distance (GateNLP private so far).
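One way to picture the hyphen/white-space equivalence is to normalize both the gazetteer entries and the text before matching, while keeping a map back to the original offsets. The Python sketch below does this with a naive substring scan rather than a trie, purely for illustration. The names `normalize_with_offsets` and `find_entries` and the set of hyphen characters are assumptions, and word boundaries are not checked.

```python
from typing import List, Tuple

# Characters treated as equivalent to white space (an assumed, adjustable set).
HYPHENS = {"-", "\u2010", "\u2011"}


def normalize_with_offsets(text: str) -> Tuple[str, List[int]]:
    """Map each run of white space / hyphens to one space.

    Returns the normalized string and, for every normalized character,
    the offset of the original character it came from.
    """
    chars: List[str] = []
    offsets: List[int] = []
    in_separator = False
    for i, ch in enumerate(text):
        if ch.isspace() or ch in HYPHENS:
            if not in_separator:
                chars.append(" ")
                offsets.append(i)
            in_separator = True
        else:
            chars.append(ch)
            offsets.append(i)
            in_separator = False
    return "".join(chars), offsets


def find_entries(text: str, entries: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, entry) spans in original-text offsets."""
    norm_text, offsets = normalize_with_offsets(text)
    matches: List[Tuple[int, int, str]] = []
    for entry in entries:
        key, _ = normalize_with_offsets(entry)
        if not key:
            continue
        pos = norm_text.find(key)
        while pos != -1:
            start = offsets[pos]
            end = offsets[pos + len(key) - 1] + 1  # end is exclusive
            matches.append((start, end, entry))
            pos = norm_text.find(key, pos + 1)
    return sorted(matches)
```

With the entry `state of the art`, this finds both `state-of-the-art` and `state of the art` in a text and reports spans in the original offsets, so annotations would still line up with the unmodified document. Inside a trie matcher the same effect could be obtained by mapping each input character to its equivalence class before following an edge.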