I was hoping to leverage NearPy to find similarities between strings, but it's not clear to me how to query the engine with a string vector (if that's possible). My use case is that I have ~30 million names to store in the engine, and I have around 1.5 million names to submit as queries to find a best match from the engine. I was going to use your Redis storage adapter so that all of the queries could be submitted asynchronously. Please let me know if that is not a good use case for NearPy.
Thanks.
NearPy is very modular and allows users to customize the pipeline they are using.
It is, however, based on numerical vectors, so you would need to convert your strings to numerical vectors. I bet there are a couple of methods for this out there. The most straightforward way I can think of is to first lowercase the name and then map the string to an array of numbers based on the character values. The range of those values depends on the encoding: with UTF-8 bytes each value is between 0 and 255, while with UTF-16 code units or Unicode code points the values can be much larger.
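A minimal sketch of that mapping, assuming Unicode code points (via Python's built-in `ord`) rather than encoded bytes. The function name `name_to_codes` is only illustrative and is not part of NearPy:

```python
def name_to_codes(name):
    """Lowercase a name and map each character to its Unicode code point."""
    return [ord(ch) for ch in name.lower()]


# "Peter" is lowercased to "peter" before mapping.
codes = name_to_codes("Peter")
print(codes)  # [112, 101, 116, 101, 114]
```

If you would rather have values limited to 0..255, `list(name.lower().encode("utf-8"))` gives the UTF-8 bytes instead, at the cost of non-ASCII characters taking up more than one position.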
Another aspect you would need to consider is the maximum name length in characters, because this determines the dimension of your vector space.
Let's consider this example, where you have these names to store:
Pauline
Georgie
Peter
Sebastian
The maximum name length is 9 (Sebastian) so your vector space should be of (at least) dimension 9.
You would then turn each of those names into a numerical vector of size 9 (one number per character, with shorter names padded to the full length, for example with zeros) and use the pipeline as usual.
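The example above could be sketched like this. Zero-padding for short names and truncation for names longer than the chosen dimension are assumptions on my side, and `name_to_vector` and `pad_value` are hypothetical names, not NearPy API:

```python
NAMES = ["Pauline", "Georgie", "Peter", "Sebastian"]


def name_to_vector(name, dimension, pad_value=0):
    """Map a name to a fixed-length list of character codes.

    Names shorter than `dimension` are right-padded with `pad_value`;
    longer names are truncated to `dimension` characters.
    """
    codes = [ord(ch) for ch in name.lower()][:dimension]
    return codes + [pad_value] * (dimension - len(codes))


# The longest name determines the dimension of the vector space.
dimension = max(len(name) for name in NAMES)

vectors = {name: name_to_vector(name, dimension) for name in NAMES}

for name, vector in vectors.items():
    print(name, vector)
```

Each list can then be converted to a numpy array and stored with `engine.store_vector(vector, name)` on an `Engine` created with the same dimension, and a query name converted the same way can be passed to `engine.neighbours(query)`. Keep in mind that this encoding compares names position by position, so two names that differ by one inserted character can end up far apart.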
However, it might be that NearPy is NOT the right framework for your project. There are many really good Python frameworks out there for language and string processing, and one of them may be a better pick.