Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram histogram division #302

Closed
henryiii opened this issue Jan 15, 2020 · 24 comments · Fixed by #393
Closed

Histogram histogram division #302

henryiii opened this issue Jan 15, 2020 · 24 comments · Fixed by #393
Milestone

Comments

@henryiii
Copy link
Member

Dividing histograms by histograms behaves oddly. It should either not be supported, or work, but not misbehave.

h / h
# TypeError: Only axes supported in histogram **constructor**

This is triggering the wrong constructor

h /= h
# Does nothing at all???

And, finally, here is the current workaround, mentioned on Gitter:

# If an array output is needed
arr = h1.view() / h2.view()

# If a histogram output is required
h3 = h1.copy()
h3[...] = h1.view(flow=True) / h2.view(flow=True)
@henryiii henryiii changed the title Histogram division Histogram histogram division Jan 15, 2020
@nsmith-
Copy link

nsmith- commented Jan 20, 2020

I think it should definitely require at least the workaround, because a ratio histogram is not a histogram: two ratios cannot be added. Unless we want to implement a container that stores numerator and denominator (my preferred technique is just to add a category axis to store each) and operates like a ratio for certain methods like getting bin content. That's probably something best left to hist.

@henryiii
Copy link
Member Author

I expect you are correct; the same thing comes up for density, which still should have axis information, but really is not a histogram. However, the current error message and silent failure needs to be improved.

@HDembinski
Copy link
Member

tl;dr: The histogram object is not necessarily semantically a histogram. You can assign arbitrary values to a histogram object, so the library cannot offer a guarantee that the values in the histogram actually represent counts.

Holding a density inside a histogram object is acceptable, although not the typical use case. My original design restricted the interface so that you cannot set cell values at all, so that a histogram can only be filled with the .fill method. This would guarantee that the values in histogram cells are really counts. But we gave up on that for practical reasons. You can assign arbitrary values to histogram cells, hence the semantic meaning of these values is arbitrary. In Boost.Histogram (C++), you can divide two histograms and that should work in boost-histogram as well. Semantically, the result of this operation is not a histogram. Neither is multiplying histograms.

@henryiii
Copy link
Member Author

Okay, sounds good. I think the best way to handle this is to enable operations on storages, then to use those for h / h2 rather than calling the boost-histogram divide, as that will avoid increasing the compile-time and will allow different storage types to be divided.

@henryiii
Copy link
Member Author

PS: I'm on vacation and will be back next week; development from my side has been minimal this week.

@HDembinski
Copy link
Member

I think the best way to handle this is to enable operations on storages, then to use those for h / h2 rather than calling the boost-histogram divide

I am against this. boost-histogram should use the boost.histogram facilities as much as possible, unless there is functionality missing in boost.histogram. The extra indirection for calling division on histograms instead of storages is minimal. Just make a ticket in boost.histogram if dividing histograms with different storages does not work.

@henryiii
Copy link
Member Author

The problem is that you have to compile every possible combination for each of these for each storage, giving N^2 methods for N storages for each operation.

@HDembinski
Copy link
Member

The N^2 problem is not solved by doing that directly on the storages.

@HDembinski
Copy link
Member

HDembinski commented Jan 31, 2020

You can try it out both ways, but I think you don't safe much in compile-time, however, you are reducing code-reuse between boost-histogram and boost.histogram.

@HDembinski
Copy link
Member

I looked into the code, boost.histogram already supports dividing histograms with arbitrary storages.

@henryiii
Copy link
Member Author

It somewhat is, since you don't compile many of the operations. You still have to make sure they work, but numpy provides most of them for free (int / double, int / int, etc). The problem is really just 4^2 then, since you have to provide double + the three accumulator storages. At that point, hopefully you could reuse Boost.Histogram's operations. I assume Boost.Histogram knows how to multiply/divide accumulators?

@HDembinski
Copy link
Member

It does if the accumulators support it.

@HDembinski
Copy link
Member

Boost.Histogram also has rules to select which storage the resulting histogram should have. If you divide int by double, it uses double for the storage type of the result, for instance. If you don't use the boost.histogram code, you have to reimplement this.

@henryiii
Copy link
Member Author

I'm stuck in writing now, but when I'm done, I want to work on #276, once that's in we can see if this is solvable through that or if we need to implement the N^2 methods. We can also try writing the mp_11 code for one of the N^2. And as long as it works, users don't even need to know if the backend changes.

I looked into the code, boost.histogram already supports dividing histograms with arbitrary storages.

Yes, I know. :)

@HDembinski
Copy link
Member

HDembinski commented Jan 31, 2020

Ok, fine, you can probably solve this with a numpy-array view for some storages, then you rely on the pre-compiled numpy code to handle all the code paths. If this is easier to get the feature working, then I am not against. Like you said, implementation can be changed at any time if needed.

That being said, I am proud of the operator implementation in C++, which should handle everything you can conceivably throw at it. This was not easy.

@henryiii
Copy link
Member Author

henryiii commented Jan 31, 2020

I am by no means convinced this is the right way to go in the end, it's just the path I think will be easiest to start with, and I want #276 working anyway so I'm hoping to piggyback on that work, much like we used the view setting/accessing to work around at-that-time missing features in Boost.Histogram. And I'm hoping we can keep compiling under control.

I am also not at all discrediting the operator implementation Boost.Histogram; as I've said I've looked at it before and have been quite impressed with it. :)

@henryiii henryiii added this to the 1.0.0 milestone May 29, 2020
@HDembinski
Copy link
Member

HDembinski commented May 29, 2020

Coming back to this, I think this should be implemented at least superficially, so that h / h does not give an error. Something that can be done on the pure Python level (demo-code only, did not check if works):

# method on histogram
def __rtruediv__(self, rhs):
    if isinstance(rhs, <some number>):
         v /= rhs
    elif isinstance(rhs, <a histogram>) and self.axes == rhs.axes:
         v = self.view(True)
         v /= rhs.view(True)
    else:
         raise TypeError
    return self

The divide operator can make a copy and use rtruediv. This is essentially what the C++ code does and there is no additional magic in the C++ code, so we might as well implement division like this at zero additional compile-time cost. I give up my earlier position in this case that boost-histogram should always use the underlying C++ code. For such simple stuff it does not have to.

@henryiii
Copy link
Member Author

Agreed, that's a good point. I'll move it to a 0.8.0 milestone, but with the caveat that it might only be the simple workaround for now, possibly to be improved later.

@henryiii henryiii modified the milestones: 1.0.0, 0.8.0 May 29, 2020
@HDembinski
Copy link
Member

This pushes the responsibility to implement division properly into the view, which is ok, I think because we said elsewhere that we want the view of a storage to support the same operators as the histogram.

@henryiii
Copy link
Member Author

In #276, I believe.

@LovelyBuggies
Copy link
Collaborator

LovelyBuggies commented Jul 15, 2020

@henryiii I think this is misbehavior of division if we fill with a 10-element array and divide it by 1000...

h.fill(np.random.normal(size=10))
h = h/1_000_000

Then the counts will be double typed. For example, h[0] will be a small float number if not zero. This seems doesn't make sense.

@henryiii
Copy link
Member Author

It's still a float:

>>> h = bh.Histogram(bh.axis.Regular(10,-3,3))
>>> h.fill(.5)
>>> h2 = h/1_000_000
>>> h2[5]
1e-06
>>> h2[4]
0.0
>>> type(h2[4])
<class 'float'>
>>> type(h2[5])
<class 'float'>

These are all floats. Unless I misunderstood your comment.

Mixing types here is actually quite hard to do.

@LovelyBuggies
Copy link
Collaborator

LovelyBuggies commented Jul 15, 2020

Yep, but what's the meaning of float counts? Counts should only be integers, shouldn't they? My brain is screwed up, I just realized we have weight argument 😅

@henryiii
Copy link
Member Author

Weighted fills need ... yes. :)

You can set the Int64 storage if you truly want ints.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants