Jaccard distance generalization #24
Comments
Could you replace the |
Thanks so much for the reply! But I got the following error: File "run_algorithm.py", line 3, in
I understand your concern, but Jaccard distance is a metric. Its triangle inequality was proved in https://www.sciencedirect.com/science/article/pii/S0167865518309188 :) |
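The metric claim can be spot-checked by brute force. The sketch below is not a proof (the linked paper provides that); it only verifies the triangle inequality exhaustively for all 4-bit vectors. The function names are illustrative, and treating the distance between two all-zero vectors as 0 is an assumed convention.

```python
from fractions import Fraction
from itertools import product


def jaccard_distance(a: int, b: int) -> Fraction:
    """Exact Jaccard distance between two bit vectors stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        # Assumed convention: two empty sets are at distance 0.
        return Fraction(0)
    return 1 - Fraction(bin(a & b).count("1"), union)


def triangle_inequality_holds(bits: int) -> bool:
    """Check d(x, z) <= d(x, y) + d(y, z) for every triple of bit vectors."""
    values = range(1 << bits)
    return all(
        jaccard_distance(x, z) <= jaccard_distance(x, y) + jaccard_distance(y, z)
        for x, y, z in product(values, repeat=3)
    )


print(triangle_inequality_holds(4))
```

Exact fractions are used instead of floats so that rounding cannot produce a spurious violation.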
Thank you for your information.
|
Yes, the distances using linearScan() are all correct. Thank you for your efforts!
486: ONNG-NGT(500, 30, 10, -2, 1.000) 1.000 334.322
But when I switched back to scan(), the recall dropped again... :(
486: ONNG-NGT(300, 30, 30, -2, 1.000) 0.027 7200.361 |
I assume that the dimensionality of your dataset is 1024 * 8. That means your data space is so sparse that NGT might not work well for it. In any case, you have to increase the epsilon used for NGT construction. For example, when you use 1.8 as the epsilon for search, you might want to use 0.8 (= 1.8 - 1.0) for construction instead of [0.0, 0.1]. Only the search epsilon has 1.0 added to it, because the ann benchmarks cannot accept negative values. |
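The offset arithmetic described above can be sketched as follows. The helper name and the constant are hypothetical and are not part of NGT or the ann benchmarks; the only fact taken from the reply is that the search epsilon passed through the benchmark is shifted up by 1.0.

```python
# Offset added to the search epsilon in the benchmark configuration,
# as described in the reply above (the constant name is hypothetical).
BENCHMARK_EPSILON_OFFSET = 1.0


def construction_epsilon(benchmark_search_epsilon: float) -> float:
    """Map a benchmark search epsilon to the matching construction epsilon."""
    # Rounding only hides floating-point noise in the subtraction.
    return round(benchmark_search_epsilon - BENCHMARK_EPSILON_OFFSET, 6)


for search_eps in [1.0, 1.4, 1.8, 2.0]:
    print(search_eps, "->", construction_epsilon(search_eps))
```

With this mapping, the original construction range [0.0, 0.1] corresponds to benchmark search values of only 1.0 to 1.1, well below the 1.8 mentioned in the reply.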
Thank you for the reply and suggestion! |
I found bugs in Graph.h and Graph.cpp, shown below, where I should have added the corresponding code for Jaccard. It seems that the comparator had not been invoked before. Now it works well. |
The number of dimensions of your data under your Jaccard distance is 1024 * 8, because the Jaccard distance treats each bit as one dimension. Although a SIFT vector is 128 * 32 bits long, its number of dimensions under Euclidean distance is just 128, because Euclidean distance treats each single-precision float (32 bits) as one dimension. Your data under Jaccard distance is therefore expected to be sparser than SIFT under Euclidean distance. |
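The dimension counting above can be illustrated with a small sketch. It assumes, as the reply does, that each binary object occupies 1024 bytes; the helper names are made up for illustration.

```python
import struct


def bit_dimensions(packed: bytes) -> int:
    """Dimensions when every bit is its own coordinate (Hamming, Jaccard)."""
    return len(packed) * 8


def float32_dimensions(packed: bytes) -> int:
    """Dimensions when every 32-bit float is one coordinate (Euclidean)."""
    return len(packed) // struct.calcsize("<f")


# One binary object of 1024 bytes (assumed size, following the reply above).
binary_object = bytes(1024)
# One SIFT-like object: 128 single-precision floats.
sift_object = struct.pack("<128f", *([0.0] * 128))

print(bit_dimensions(binary_object))    # 8192 dimensions under Jaccard
print(len(sift_object) * 8)             # 4096 bits of storage
print(float32_dimensions(sift_object))  # 128 dimensions under Euclidean
```

The storage sizes are within a factor of two of each other, but the dimensionality differs by a factor of 64, which is the point the reply makes about sparsity.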
@chunjiangzhu |
Hi, thanks again for the prompt responses to the previous question #23.
We are trying to generalize your code to Jaccard distance and test it under the ann benchmarks. We assume that the input is the same as the input for Hamming distance, but the distance function is changed to Jaccard. E.g., for two bit vectors "A=10111" and "B=10011", their Hamming distance is 1 but their Jaccard distance is 1 - popcount(A&B)/popcount(A|B) = 1 - 3/4 = 0.25.
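The worked example above can be reproduced with a minimal sketch. This is plain Python for illustration, not the NGT comparator code linked in this issue; the function names and the handling of two all-zero vectors (distance 0) are assumptions.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def hamming_distance(a: int, b: int) -> int:
    """Count the positions where the two bit vectors differ."""
    return popcount(a ^ b)


def jaccard_distance(a: int, b: int) -> float:
    """1 - |A and B| / |A or B| over the set bits of the two vectors."""
    union = popcount(a | b)
    if union == 0:
        # Assumed convention: two all-zero vectors are at distance 0.
        return 0.0
    return 1.0 - popcount(a & b) / union


A = int("10111", 2)
B = int("10011", 2)
print(hamming_distance(A, B))  # 1
print(jaccard_distance(A, B))  # 0.25
```

Unlike Hamming distance, the Jaccard distance needs a division per comparison and a rule for the empty-union case, both of which a comparator for packed bytes has to handle.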
What we did was, for every piece of code that handles Hamming, to add corresponding code for Jaccard. For example, in our repository:
https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/PrimitiveComparator.h#L287-L303
https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/ObjectSpaceRepository.h#L97-L114
https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/Index.h#L113
https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/Command.cpp#L155-L157
https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/ObjectSpace.h#L162
https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/src/ngtpy.cpp#L67-L68
https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/ngt/base.py#L131
https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/ngt/base.py#L254-L256
Our input data is an N*1024 numpy array of data type int (or bool). We have tested the code using parameters epsilon=[0.0,0.1], edge_size=[100, 200, 300, 500, 1000], outdegree=[10, 30, 50, 70, 100], indegree=[10, 30, 50, 70, 120], query epsilon=[0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] and object_type=Byte. But the resulting recall is consistently lower than 20%. We believe that something is wrong. We noticed the 16 boundary you mentioned in #21. Since our data dimension is always a multiple of 16, e.g. 1024, the error should not be there.
Could you please give suggestions on the generalization? We hope that a successful generalization may be helpful to support more distance metrics, before you make a custom distance function. If you need more information, e.g. a dataset, please let me know. Thank you so much!