Question for finding string similarity #84

Shellcat-Zero · 2019-01-17T00:46:17Z

Hi,

I was hoping to leverage NearPy to find similarities between strings, but it's not clear to me how to query the engine with a string vector (if that's possible). My use case is that I have ~30 million names to store in the engine, and I have around 1.5 million names to submit as queries to find a best match from the engine. I was going to use your Redis storage adapter so that all of the queries could be submitted asynchronously. Please let me know if that is not a good use case for NearPy.

Thanks.

pixelogik · 2019-04-03T19:56:56Z

@Shellcat-Zero sorry for the long silence.

NearPy is very modular and allows users to customize the pipeline they are using.

It is, however, based on numerical vectors, so you would need to convert your strings to numerical vectors. I bet there are a couple of methods for this out there. The most straightforward way I can think of is to first lowercase the name and then map the string to an array of numbers based on the character values. Depending on which encoding you are using (UTF-8/UTF-16), this might result in values between 0 and 255, or much larger ones, for each character position.

Another aspect you would need to consider is the maximum name length in characters, because this determines the dimension of your vector space.

Let's consider this example, where you have these names to store:

Pauline
Georgie
Peter
Sebastian

The maximum name length is 9 (Sebastian) so your vector space should be of (at least) dimension 9.

You would then turn those names into numerical vectors of size 9 each (one number per character, with shorter names padded, for example with zeros) and use the pipeline as usual.
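A minimal sketch of the encoding described above, in plain Python. The helper name `name_to_vector` and the choice of zero as the padding value are assumptions made for illustration; they are not part of NearPy. It uses Unicode code points via `ord`, so values can exceed 255 for non-ASCII characters.

```python
def name_to_vector(name, dimension, pad_value=0.0):
    """Map a name to a fixed-length list of character code points.

    Hypothetical helper: lowercases the name, converts each character
    to its Unicode code point, and pads the result with pad_value.
    """
    codes = [float(ord(ch)) for ch in name.lower()]
    if len(codes) > dimension:
        raise ValueError("name is longer than the vector dimension")
    # Pad shorter names so that every vector has the same length.
    return codes + [pad_value] * (dimension - len(codes))


names = ["Pauline", "Georgie", "Peter", "Sebastian"]

# The longest (lowercased) name determines the dimension of the vector space.
dimension = max(len(n.lower()) for n in names)

vectors = {n: name_to_vector(n, dimension) for n in names}

for name, vector in vectors.items():
    print(name, vector)
```

Each list would still need to be converted to a NumPy array before being stored in a NearPy engine. Note that with this encoding, distance between vectors reflects differences in character codes at each position, which may not match an intuitive notion of name similarity.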

However, it might be that NearPy is NOT the right framework for your project. There are many really good Python frameworks out there for language and string processing, and some of them may be a better pick:

https://spacy.io/
https://radimrehurek.com/gensim/
http://www.nltk.org/

More "learning" focused, but might be useful as well:

https://scikit-learn.org/stable/

I hope I am not too late with my response. Good luck with your project!
