-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Histogram histogram division #302
Comments
I think it should definitely require at least the workaround, because a ratio histogram is not a histogram: two ratios cannot be added. Unless we want to implement a container that stores numerator and denominator (my preferred technique is just to add a category axis to store each) and operates like a ratio for certain methods like getting bin content. That's probably something best left to |
I expect you are correct; the same thing comes up for density, which still should have axis information, but really is not a histogram. However, the current error message and silent failure needs to be improved. |
tl;dr: The histogram object is not necessarily semantically a histogram. You can assign arbitrary values to a histogram object, so the library cannot offer a guarantee that the values in the histogram actually represent counts. Holding a density inside a histogram object is acceptable, although not the typical use case. My original design restricted the interface so that you cannot set cell values at all, so that a histogram can only be filled with the .fill method. This would guarantee that the values in histogram cells are really counts. But we gave up on that for practical reasons. You can assign arbitrary values to histogram cells, hence the semantic meaning of these values is arbitrary. In Boost.Histogram (C++), you can divide two histograms and that should work in boost-histogram as well. Semantically, the result of this operation is not a histogram. Neither is multiplying histograms. |
Okay, sounds good. I think the best way to handle this is to enable operations on storages, then to use those for |
PS: I'm on vacation and will be back next week; development from my side has been minimal this week. |
I am against this. boost-histogram should use the boost.histogram facilities as much as possible, unless there is functionality missing in boost.histogram. The extra indirection for calling division on histograms instead of storages is minimal. Just make a ticket in boost.histogram if dividing histograms with different storages does not work. |
The problem is that you have to compile every possible combination for each of these for each storage, giving N^2 methods for N storages for each operation. |
The N^2 problem is not solved by doing that directly on the storages. |
You can try it out both ways, but I think you don't safe much in compile-time, however, you are reducing code-reuse between boost-histogram and boost.histogram. |
I looked into the code, boost.histogram already supports dividing histograms with arbitrary storages. |
It somewhat is, since you don't compile many of the operations. You still have to make sure they work, but numpy provides most of them for free (int / double, int / int, etc). The problem is really just 4^2 then, since you have to provide double + the three accumulator storages. At that point, hopefully you could reuse Boost.Histogram's operations. I assume Boost.Histogram knows how to multiply/divide accumulators? |
It does if the accumulators support it. |
Boost.Histogram also has rules to select which storage the resulting histogram should have. If you divide int by double, it uses double for the storage type of the result, for instance. If you don't use the boost.histogram code, you have to reimplement this. |
I'm stuck in writing now, but when I'm done, I want to work on #276, once that's in we can see if this is solvable through that or if we need to implement the N^2 methods. We can also try writing the mp_11 code for one of the N^2. And as long as it works, users don't even need to know if the backend changes.
Yes, I know. :) |
Ok, fine, you can probably solve this with a numpy-array view for some storages, then you rely on the pre-compiled numpy code to handle all the code paths. If this is easier to get the feature working, then I am not against. Like you said, implementation can be changed at any time if needed. That being said, I am proud of the operator implementation in C++, which should handle everything you can conceivably throw at it. This was not easy. |
I am by no means convinced this is the right way to go in the end, it's just the path I think will be easiest to start with, and I want #276 working anyway so I'm hoping to piggyback on that work, much like we used the view setting/accessing to work around at-that-time missing features in Boost.Histogram. And I'm hoping we can keep compiling under control. I am also not at all discrediting the operator implementation Boost.Histogram; as I've said I've looked at it before and have been quite impressed with it. :) |
Coming back to this, I think this should be implemented at least superficially, so that # method on histogram
def __rtruediv__(self, rhs):
if isinstance(rhs, <some number>):
v /= rhs
elif isinstance(rhs, <a histogram>) and self.axes == rhs.axes:
v = self.view(True)
v /= rhs.view(True)
else:
raise TypeError
return self The divide operator can make a copy and use rtruediv. This is essentially what the C++ code does and there is no additional magic in the C++ code, so we might as well implement division like this at zero additional compile-time cost. I give up my earlier position in this case that boost-histogram should always use the underlying C++ code. For such simple stuff it does not have to. |
Agreed, that's a good point. I'll move it to a 0.8.0 milestone, but with the caveat that it might only be the simple workaround for now, possibly to be improved later. |
This pushes the responsibility to implement division properly into the view, which is ok, I think because we said elsewhere that we want the view of a storage to support the same operators as the histogram. |
In #276, I believe. |
@henryiii I think this is misbehavior of division if we fill with a 10-element array and divide it by 1000... h.fill(np.random.normal(size=10))
h = h/1_000_000 Then the counts will be double typed. For example, |
It's still a float: >>> h = bh.Histogram(bh.axis.Regular(10,-3,3))
>>> h.fill(.5)
>>> h2 = h/1_000_000
>>> h2[5]
1e-06
>>> h2[4]
0.0
>>> type(h2[4])
<class 'float'>
>>> type(h2[5])
<class 'float'> These are all floats. Unless I misunderstood your comment. Mixing types here is actually quite hard to do. |
|
Weighted fills need ... yes. :) You can set the Int64 storage if you truly want ints. |
Dividing histograms by histograms behaves oddly. It should either not be supported, or work, but not misbehave.
This is triggering the wrong constructor
And, finally, here is the current workaround, mentioned on Gitter:
The text was updated successfully, but these errors were encountered: