linear interpolation leads to some crazy estimates #14

astolman · 2016-08-16T21:28:06Z

The linear interpolation you wrote for estimating the biases seems like a reasonable approach, but I found a case where it leads to absurd estimates. I had a counter of precision 5 with a raw estimate of 128 point 6 something and the two closest raw estimates libcount found in the table were like 128.3464 and 128.3462, and their respective biases are about 1 apart. The linear interpolation said that it's bias should be in the thousands since the straight line connecting those two closest points is so steep. As a result, I got back a negative cardinality estimation, which is definitely not right. The spec in the paper just says "interpolate" without specifying how, so I would instead of just looking for the two nearest neighbors, look for the interval which contains the raw estimate and find where the estimate lies on that line. That seems like another reasonable interpolation and avoids the bias estimate for a given value to be wildly larger than that of its nearest neighbors.

dialtr · 2016-08-16T22:06:18Z

That sounds very reasonable. I'll work on a fix.

Sent from my iPhone

On Aug 16, 2016, at 5:28 PM, astolman [email protected] wrote:

The linear interpolation you wrote for estimating the biases seems like a reasonable approach, but I found a case where it leads to absurd estimates. I had a counter of precision 5 with a raw estimate of 128 point 6 something and the two closest raw estimates libcount found in the table were like 128.3464 and 128.3462, and their respective biases are about 1 apart. The linear interpolation said that it's bias should be in the thousands since the straight line connecting those two closest points is so steep. As a result, I got back a negative cardinality estimation, which is definitely not right. The spec in the paper just says "interpolate" without specifying how, so I would instead of just looking for the two nearest neighbors, look for the interval which contains the raw estimate and find where the estimate lies on that line. That seems like another reasonable interpolation and avoids the bias estimate for a given value to be wildly larger than that of its nearest neighbors.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.

dialtr · 2017-03-06T07:56:23Z

Hey Andrew, I finally got around to a fix for this, would love to get your review on it. The branch is:

tdial-interpolation-fix

I am working on certification tests next.

dialtr · 2017-03-06T09:29:37Z

Actually, hold off. There's an issue with the data in the tables.

astolman · 2017-03-06T21:26:44Z

I wrote a fix for it in my adaptation of libcount. I think I had to sort the data that you used because it wasn’t sorted in a few cases. You can see what I did in my project https://github.com/astolman/HyperHeadTail <https://github.com/astolman/HyperHeadTail> in the shll folder there. I needed a HyperLogLog implementation that supported a sliding window so I heavily rewrote the hll.cc files, but the rest of it is changed as little as possible. It’s been a while, but I think the stuff in empirical_data.cc contains the now appropriately sorted values.

…

On Mar 6, 2017, at 1:29 AM, Tom Dial ***@***.***> wrote: Actually, hold off. There's an issue with the data in the tables. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#14 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AS0tCs4PoHovc7kwdHUFUWj7I9ZUOSWWks5ri9IBgaJpZM4Jl3qJ>.

astolman · 2017-03-07T00:16:18Z

I think these cases work out so that your cases are log of mine. I think I did that as part of fixing the problem with the first bug report I submitted. As far as my use for libcount, I did a summer project at Sandia National Labs where I used it for a streaming algorithm. I’m still (slowly) writing a paper for it that should be up on ArXiv soon.

…

On Mar 6, 2017, at 2:13 PM, Tom Dial ***@***.***> wrote: P.S. Unless I am mistaken, in your version of EMP_alpha(), I think you have the first three switch cases(16, 32, 64) wrong. Or am I missing something? TD On Mon, Mar 6, 2017 at 4:26 PM, astolman ***@***.***> wrote: > I wrote a fix for it in my adaptation of libcount. I think I had to sort > the data that you used because it wasn’t sorted in a few cases. You can see > what I did in my project https://github.com/astolman/HyperHeadTail < > https://github.com/astolman/HyperHeadTail> in the shll folder there. I > needed a HyperLogLog implementation that supported a sliding window so I > heavily rewrote the hll.cc files, but the rest of it is changed as little > as possible. It’s been a while, but I think the stuff in empirical_data.cc > contains the now appropriately sorted values. > > On Mar 6, 2017, at 1:29 AM, Tom Dial ***@***.***> wrote: > > > > Actually, hold off. There's an issue with the data in the tables. > > > > — > > You are receiving this because you authored the thread. > > Reply to this email directly, view it on GitHub < > #14 (comment)>, or > mute the thread <https://github.com/notifications/unsubscribe-auth/ > AS0tCs4PoHovc7kwdHUFUWj7I9ZUOSWWks5ri9IBgaJpZM4Jl3qJ>. > > > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <#14 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AH5tLKp-pvOtxKkR98tcuw1Sf6y5186Mks5rjHoUgaJpZM4Jl3qJ> > . > — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#14 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AS0tCpHeAL8_G6MIzeJcj6u7_vGPc3iRks5rjITtgaJpZM4Jl3qJ>.

dialtr · 2017-03-07T06:19:41Z

Linear interpolation fix completed. Note that the empirical data is now sorted properly. Also added a simple certification test. Just run:

make certify

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

linear interpolation leads to some crazy estimates #14

linear interpolation leads to some crazy estimates #14

astolman commented Aug 16, 2016

dialtr commented Aug 16, 2016

dialtr commented Mar 6, 2017

dialtr commented Mar 6, 2017

astolman commented Mar 6, 2017 via email

astolman commented Mar 7, 2017 via email

dialtr commented Mar 7, 2017

linear interpolation leads to some crazy estimates #14

linear interpolation leads to some crazy estimates #14

Comments

astolman commented Aug 16, 2016

dialtr commented Aug 16, 2016

dialtr commented Mar 6, 2017

dialtr commented Mar 6, 2017

astolman commented Mar 6, 2017 via email

astolman commented Mar 7, 2017 via email

dialtr commented Mar 7, 2017