-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
linear interpolation leads to some crazy estimates #14
Comments
That sounds very reasonable. I'll work on a fix. Sent from my iPhone
|
Hey Andrew, I finally got around to a fix for this, would love to get your review on it. The branch is: tdial-interpolation-fix I am working on certification tests next. |
Actually, hold off. There's an issue with the data in the tables. |
I wrote a fix for it in my adaptation of libcount. I think I had to sort the data that you used because it wasn’t sorted in a few cases. You can see what I did in my project https://github.com/astolman/HyperHeadTail <https://github.com/astolman/HyperHeadTail> in the shll folder there. I needed a HyperLogLog implementation that supported a sliding window so I heavily rewrote the hll.cc files, but the rest of it is changed as little as possible. It’s been a while, but I think the stuff in empirical_data.cc contains the now appropriately sorted values.
… On Mar 6, 2017, at 1:29 AM, Tom Dial ***@***.***> wrote:
Actually, hold off. There's an issue with the data in the tables.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub <#14 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AS0tCs4PoHovc7kwdHUFUWj7I9ZUOSWWks5ri9IBgaJpZM4Jl3qJ>.
|
I think these cases work out so that your cases are log of mine. I think I did that as part of fixing the problem with the first bug report I submitted.
As far as my use for libcount, I did a summer project at Sandia National Labs where I used it for a streaming algorithm. I’m still (slowly) writing a paper for it that should be up on ArXiv soon.
… On Mar 6, 2017, at 2:13 PM, Tom Dial ***@***.***> wrote:
P.S.
Unless I am mistaken, in your version of EMP_alpha(), I think you have the
first three switch cases(16, 32, 64) wrong.
Or am I missing something?
TD
On Mon, Mar 6, 2017 at 4:26 PM, astolman ***@***.***> wrote:
> I wrote a fix for it in my adaptation of libcount. I think I had to sort
> the data that you used because it wasn’t sorted in a few cases. You can see
> what I did in my project https://github.com/astolman/HyperHeadTail <
> https://github.com/astolman/HyperHeadTail> in the shll folder there. I
> needed a HyperLogLog implementation that supported a sliding window so I
> heavily rewrote the hll.cc files, but the rest of it is changed as little
> as possible. It’s been a while, but I think the stuff in empirical_data.cc
> contains the now appropriately sorted values.
> > On Mar 6, 2017, at 1:29 AM, Tom Dial ***@***.***> wrote:
> >
> > Actually, hold off. There's an issue with the data in the tables.
> >
> > —
> > You are receiving this because you authored the thread.
> > Reply to this email directly, view it on GitHub <
> #14 (comment)>, or
> mute the thread <https://github.com/notifications/unsubscribe-auth/
> AS0tCs4PoHovc7kwdHUFUWj7I9ZUOSWWks5ri9IBgaJpZM4Jl3qJ>.
> >
>
> —
> You are receiving this because you commented.
> Reply to this email directly, view it on GitHub
> <#14 (comment)>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AH5tLKp-pvOtxKkR98tcuw1Sf6y5186Mks5rjHoUgaJpZM4Jl3qJ>
> .
>
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub <#14 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AS0tCpHeAL8_G6MIzeJcj6u7_vGPc3iRks5rjITtgaJpZM4Jl3qJ>.
|
Linear interpolation fix completed. Note that the empirical data is now sorted properly. Also added a simple certification test. Just run: make certify |
The linear interpolation you wrote for estimating the biases seems like a reasonable approach, but I found a case where it leads to absurd estimates. I had a counter of precision 5 with a raw estimate of 128 point 6 something and the two closest raw estimates libcount found in the table were like 128.3464 and 128.3462, and their respective biases are about 1 apart. The linear interpolation said that it's bias should be in the thousands since the straight line connecting those two closest points is so steep. As a result, I got back a negative cardinality estimation, which is definitely not right. The spec in the paper just says "interpolate" without specifying how, so I would instead of just looking for the two nearest neighbors, look for the interval which contains the raw estimate and find where the estimate lies on that line. That seems like another reasonable interpolation and avoids the bias estimate for a given value to be wildly larger than that of its nearest neighbors.
The text was updated successfully, but these errors were encountered: