Investigate non-exact / fuzzy matches #11

johann-petrak · 2018-06-18T15:22:22Z

From johann-petrak/gateplugin-StringAnnotation#10
Try to at least support matches where certain characters, e.g. hyphens, are treated as equivalent to embedded whitespace.
This could maybe get implemented as part of our own trie implementation, but see the issue about using jaspell for a possible alternative.

Also, see if we can use gateplugin-ft-distance (private to GateNLP so far).
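
To make the hyphen case concrete, here is a minimal Python sketch of the simplest variant: map hyphens and whitespace runs to a single space before matching. The function name `normalize_separators` and the exact set of characters treated as separators are assumptions for illustration, not part of the plugin.

```python
import re

# Hyphen-like characters and whitespace are treated as one separator class.
# Which characters belong in this class is an assumption; adjust as needed.
_SEPARATORS = re.compile(r"[\s\-\u2010-\u2015]+")


def normalize_separators(text):
    """Collapse every run of hyphens/whitespace into a single space."""
    return _SEPARATORS.sub(" ", text).strip()


# Gazetteer entry and document text are normalized the same way,
# so "anti-inflammatory" and "anti inflammatory" compare equal.
entry = normalize_separators("anti-inflammatory")
text = normalize_separators("anti  inflammatory")
print(entry == text)
```

Doing this inside the trie instead would avoid rewriting the text, but the effect on which strings match should be the same.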

johann-petrak · 2018-06-18T15:23:02Z

From johann-petrak/gateplugin-StringAnnotation#13
There are various possible approaches, several of them based on tries.
See also http://dbgroup.cs.tsinghua.edu.cn/dd/projects/taste/index.html

thomas-heitz · 2018-08-26T12:20:19Z

Hello Johann,

I'm using your Extended Gazetteer with several millions of entries and it works very well, thanks for that!

To contribute to the ideas about non-exact matches, I can share my method:

1. Create different versions of the gazetteer directory from a source gazetteer directory with a Java script, for example one with normalized punctuation, then accents, stemming, Metaphone, etc.
2. Point to the needed gazetteer in the .def file: for entities it will be punctuation and/or accent, for concepts it will be stemmed.
3. Add a JAPE grammar, using the same code as the Java script above, before the Extended Gazetteer to create as many Token features as there are gazetteer types, for example: normalized, punctuation, stemmed, etc.
4. Set the feature in the parameters of the Extended Gazetteer.

This method makes it possible to choose the best normalization for each gazetteer, and it is easy to change because all the gazetteer types are always created by the Java script and the Token features by the JAPE grammar.
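
The core of this approach is that the same normalization functions are applied to the gazetteer entries and to the Token strings. A minimal sketch in Python (not the Java script or JAPE grammar described above; all names are hypothetical) covering only the punctuation and accent variants, since stemming and Metaphone need external libraries:

```python
import string
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_punctuation(text):
    """Replace ASCII punctuation with spaces and collapse whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return " ".join(text.translate(table).split())


# One named normalizer per gazetteer type / Token feature (assumed names).
NORMALIZERS = {
    "punctuation": normalize_punctuation,
    "accent": strip_accents,
}


def make_variants(entry):
    """Return every normalized form of one gazetteer entry or Token string."""
    return {name: fn(entry) for name, fn in NORMALIZERS.items()}


print(make_variants("Saint-\u00c9tienne"))
```

Each key of the returned dictionary would correspond to one generated gazetteer directory on the list side and to one Token feature on the document side.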

johann-petrak · 2018-08-26T15:10:27Z

Thanks Thomas! Yes, this is a good method, when it is possible to generate most or all of the alternatives one wants to match automatically.

However, sometimes users want to match in a way that cannot be predicted, e.g. based on Levenshtein distance, phonetic similarity or some such. If there is a well-defined distance metric between the strings, this can be implemented as an extension to the trie matching algorithm, but doing so is not easy.
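
For Levenshtein distance, one common way to extend trie matching is to carry one row of the edit-distance table down each trie path and prune branches whose row minimum already exceeds the limit. The following Python sketch illustrates that idea only; it is not the plugin's trie, and `TrieNode`, `build_trie` and `fuzzy_search` are hypothetical names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that end a gazetteer entry


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def fuzzy_search(root, query, max_dist):
    """Return (entry, distance) pairs with Levenshtein distance <= max_dist."""
    results = []

    def walk(node, ch, prev_row):
        # Extend the edit-distance table by one row for trie character ch.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            row.append(min(
                row[i - 1] + 1,                          # skip a query char
                prev_row[i] + 1,                         # skip a trie char
                prev_row[i - 1] + (query[i - 1] != ch),  # match or substitute
            ))
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # Row values never drop below the row minimum further down,
        # so the whole subtree can be skipped once the minimum is too large.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(results)


trie = build_trie(["london", "londres", "paris"])
print(fuzzy_search(trie, "londno", 2))
```

The hard part in practice is less this lookup than combining it with longest-match annotation over running text, where every text position is a potential start of a fuzzy match.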

thomas-heitz · 2018-08-30T08:34:46Z

I also plan to use Double Metaphone followed by Levenshtein distance for phonetic matching:

1. Convert the gazetteer into Double Metaphone values.
2. Keep the original value of the gazetteer entry as a feature "string".
3. Add a feature Token.metaphone for Tokens in the text that are not in my dictionary.
4. Apply the Metaphone gazetteer to these Token.metaphone values in the text.
5. This yields several Lookup.string values for each Token.
6. Use Levenshtein distance to choose the best candidate for each Token by comparing Lookup.string and Token.string.
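
The steps above can be sketched in Python as follows. Double Metaphone is not in the standard library, so `phonetic_key` below is a deliberately crude stand-in (it is NOT Double Metaphone) that only shows where the real encoder would plug in; all function names are hypothetical.

```python
from collections import defaultdict


def phonetic_key(word):
    """Crude placeholder for a phonetic encoder, NOT Double Metaphone."""
    key = []
    for ch in word.lower():
        if not ch.isalpha():
            continue
        if key and ch in "aeiouyhw":   # drop vowel-like letters after the first
            continue
        if key and key[-1] == ch:      # collapse repeats among kept letters
            continue
        key.append(ch)
    return "".join(key).upper()


def levenshtein(a, b):
    """Plain two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(entries):
    """Map phonetic key -> original entries (the 'string' feature)."""
    index = defaultdict(list)
    for entry in entries:
        index[phonetic_key(entry)].append(entry)
    return index


def best_candidate(token, index):
    """Pick the phonetically matching entry closest to the token string."""
    candidates = index.get(phonetic_key(token), [])
    if not candidates:
        return None
    return min(candidates, key=lambda c: levenshtein(token.lower(), c.lower()))


index = build_index(["Meyer", "Mayor", "Miller"])
print(best_candidate("Meier", index))
```

Ties between equally distant candidates are resolved here by gazetteer order; a real pipeline would probably want an explicit tie-breaking rule or a maximum distance threshold.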
