Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: slicing projection with one-sided slices #479

Merged
merged 3 commits into from
Dec 19, 2020
Merged

Conversation

HDembinski
Copy link
Member

@HDembinski HDembinski commented Dec 7, 2020

Fixes #478

  • updated boostorg/histogram to latest develop with bug-fix
  • added test
  • Updated the rest of boost to 1.75

The core bug is fixed in boostorg/histogram, but there still seems to be a bug when bh.loc is used and the value is past the upper end of the axis. The computed index then should be size + 1 to make the crop work correctly.

@github-actions github-actions bot added the needs changelog Might need a changelog entry label Dec 7, 2020
@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

Just for clarification, it should be capped at size + 1? Should it be capped at -1 too? And should the capping happen in the Python layer, or in the C++ wrapper layer? In the C++ wrapper would keep the various UHI locators much simpler.

@HDembinski
Copy link
Member Author

Indices do not have to be capped, they are capped automatically in reduce.

@HDembinski
Copy link
Member Author

HDembinski commented Dec 7, 2020

Ideally, the value to index conversion should happen in C++, so that it is guaranteed to be consistent. I think the current implementation always converts the value to an index before passing it to reduce?

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

Yes, it does - so (this) problem is that .index(past_end) returns size instead of size + 1?

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

Actually, no, this is ideal, I think. loc(past_end) should not include the overflow bin. The overflow bin is infinite. Only leaving this off (None) should include the overflow bin.

h[0::sum] includes the overflow bin, h[0:anything_finite:sum] does not, that's in the UHI spec.

@HDembinski
Copy link
Member Author

Remembering more about the details: The problem is not what .index returns. The rules for what happens when doing slice over a value range is not perfectly compatible with what .index returns. There is special code in reduce to handle this.

@HDembinski
Copy link
Member Author

HDembinski commented Dec 7, 2020

Actually, no, this is ideal, I think. loc(past_end) should not include the overflow bin. The overflow bin is infinite. Only leaving this off (None) should include the overflow bin.

h[0::sum] includes the overflow bin, h[0:anything_finite:sum] does not, that's in the UHI spec.

Then it is a) inconsistent with how Boost Histogram works and b) it is inconsistent, because the current implementation handles the underflow cases differently to the overflow case.

Sorry, I realize that you are talking about indices here, while I was talking about values.

h[0::sum] includes the overflow bin

Ok

h[0:anything_finite:sum] does not

That's apparently not how it is currently implemented, because I get the behavior I expected when I pass indices directly. I only see an inconsistency when I pass values instead of indices.

Nevertheless, the spec should be updated then.

@HDembinski
Copy link
Member Author

HDembinski commented Dec 7, 2020

We said at some point that a value range specified should select all bins which overlap with that value range. That should include the flow bins.

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

It's not as clear as I thought, since it only refers to standard indexing:

https://boost-histogram.readthedocs.io/en/latest/usage/indexing.html#slicing

But triggering an infinitely wide bin by accident seems quite dangerous, especially when you can't do this accidentally by integer indexing (-1 is invalid), but you can via loc? Floating point operations could include a bin you were not intending, and an infinite bin you might not even know about would be pretty bad. I think it would be much better to be explicit here if you want to include a flow bin.

This is only an issue for slicing (or selecting), since all other operations just bring flow bins along for the ride.

Either way, the current inconsistent behavior is wrong. :)

@HDembinski
Copy link
Member Author

HDembinski commented Dec 7, 2020

The rule is: if the bin overlaps with the value range specified, it should be included. That also goes for bins which are infinitely wide and I don't think that this is anything to worry about. You can make sure that the overflow bin is excluded by picking the upper value inside the last bin of the axis up to the upper edge of that bin.

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

Think about it carefully, making the best decision for Python users, then let me know. I worry if someone selects an upper edge of bh.loc(6.0000000001) on a 0-6 axis and suddenly get an overflow bin in a sum. I could see an easy mistake being "I know my histogram is has an upper limit of about 10, so I'll just use bh.loc(20) as an upper limit to be safe". Or maybe the limit is a variable. I can't easily imagine a case where you would have a variable upper or lower edge but would prefer to have it suddenly include infinity. Most cases are "I want a range" OR "I want all above/below x". Overflow bins in integrations only being included if explicitly requested nicely fits under "explicit is better than implicit", I would think.

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

Everywhere else in boost-histogram the flow bins are either hidden (carried along), natural (sum over None endpoints includes everything, like ::sum), or explicit (bh.underflow/bh.overflow, flow=true).

@HDembinski
Copy link
Member Author

HDembinski commented Dec 7, 2020

Think about it carefully, making the best decision for Python users, then let me know. I worry if someone selects an upper edge of bh.loc(6.0000000001) on a 0-6 axis and suddenly get an overflow bin in a sum. I could see an easy mistake being "I know my histogram is has an upper limit of about 10, so I'll just use bh.loc(20) as an upper limit to be safe". Or maybe the limit is a variable. I can't easily imagine a case where you would have a variable upper or lower edge but would prefer to have it suddenly include infinity. Most cases are "I want a range" OR "I want all above/below x". Overflow bins in integrations only being included if explicitly requested nicely fits under "explicit is better than implicit", I would think.

It has to be documented properly, of course, but it is important to have simple rules that people can remember. The flow bins are special in that you normally don't see them, but they should be handled just like other bins as much as possible (from a logical point of view and also from a programming point of view).

Note that what you worry about is only a problem when there are actually entries in the flow bins. It is hard to say whether the user wants to keep those or not when he/she uses an upper value beyond the axis range. Maybe keeping the overflow is exactly the intention?

My mind is set on applying rules consistently that are as simple as possible, so behavior remains easy to remember.

@HDembinski
Copy link
Member Author

HDembinski commented Dec 7, 2020

Note: I need to move fast with the patch in boostorg/histogram to get this still included in the upcoming release in two days, so there is some pressure.

But since the feature was broken so far anyway, we actually have the opportunity to select a different behavior if we want. So you would prefer including the flow bins when values are passed only if the value passed is infinite, right?

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

This is perfectly fine with Boost.Histogram, I'm just looking at the UHI slicing syntax when reducing. Let's pick the include syntax for now, but revisit it to see if it's more consistent/user friendly to make overflow/underflow always explicit when reducing via UHI reducer before a boost-histogram release.

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

So you would prefer including the flow bins when values are passed only if the value passed is infinite, right?

Yes, so these would be the only ways to apply a reducer to an overflow bin:

h[start:None:sum] # More usually written as h[start::sum]
h[start:bh.loc(infty):sum] # Hadn't include this originally, but it could also make sense?

At least that is what I think the original intention was, and I think it would be more explicit?

@HDembinski
Copy link
Member Author

I like the simple consistent rule, but I do see your point and I worry about one thing: I don't remember whether the current implementation of regular axis is consistent so that axis.index(axis.value(i)) always returns i. If we pick my approach and axis.value(len(axis)) happens to be too large by a tiny amount, the overflow bin would be included even though the intention was probably to exclude it.

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

For rules, h[start:stop] isn't affected, since it always brings along the underflow/overflow bins, since it returns a new histogram. So we are left with h[select], but that's a different construct anyway; and if you write h[bh.loc(too_big)], the choices are error or return the overflow bin; since you asked for a bin, it is quite natural to return a bin, even the overflow bin. You also are not asking for a range - including infinity when you request finite range is much more worrisome because infinity makes the width of the range infinite. You should already be checking the properties of h.axes[0][bh.loc(too_big)] to see what you are looking at. So I think this is not inconsistent?

Also, this, and especially generalized reducers, h[a:b:ufunc] when implemented, don't have an exact parallel to Boost.Histogram, so while we should consider how slices work in Boost.Histogram, but the behavior doesn't have to be a perfect match. And Boost.Histogram explicitly assumes users are knowledgeable enough to handle knowing about flow.

@HDembinski
Copy link
Member Author

HDembinski commented Dec 7, 2020

For rules, h[start:stop] isn't affected, since it always brings along the underflow/overflow bins, since it returns a new histogram. So we are left with h[select], but that's a different construct anyway; if you write h[bh.loc(too_big)], the choices are error or return the overflow bin; since you asked for a bin, it is quite natural to return a bin, even the overflow bin. You also are not asking for a range - including infinity when you request finite range is much more worrisome because infinity makes the width of the range infinite. You should already be checking the properties of h.axes[0][bh.loc(too_big)] to see what you are looking at. So I think this is not inconsistent?

Also, this, and especially generalized reducers, h[a:b:ufunc] when implemented, don't have an exact parallel to Boost.Histogram, so while we should consider how slices work in Boost.Histogram, but the behavior doesn't have to be a perfect match. And Boost.Histogram explicitly assumes users are knowledgeable enough to handle knowing about flow.

I cannot say much about this, because it seems to address other points. Can we stick to the topic how h[bh.loc(a):bh.loc(b):sum] and h[bh.loc(a):bh.loc(b)] should work? This is independent to what happens with h[select]. The logically consistent thing is that h[bh.loc(too_big)] gives you the overflow bin just like h[bh.loc(too_small)] should give you the underflow bin.

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

My mind is set on applying rules consistently that are as simple as possible, so behavior remains easy to remember.

I was just comparing to the existing rules to see if there was an inconstancy if we apply this explicit endpoint rule for reducers. I don't think so.

Also this does include h[bh.loc(a):bh.loc(b)] - this returns a new histogram, so the flow bins are always implicitly included and summed if needed no matter what you do here. h[bh.loc(a):bh.loc(b)][::sum] will always sum over the entirety of h regardless of what you select in the first step.

So the only question is for h[bh.loc(a):bh.loc(b):sum] (which has been broken since 0.8, though I think in 0.7 it had the behavior I described because that was my intention, though it might have happened to work differently unintentionally).

@HDembinski
Copy link
Member Author

HDembinski commented Dec 7, 2020

Naturally, h[bh.loc(a):bh.loc(b)] and h[bh.loc(a):bh.loc(b):sum] should select the same bin ranges. The only difference between the two is what happens with the overflow bins. When you do h[bh.loc(a):bh.loc(b)], the content of the removed bins are added to the respective flow bins, while in case of h[bh.loc(a):bh.loc(b):sum] we decided to crop the content of the removed bins. But both commands should select the same bin range for any values of a and b.

With h[bh.loc(a):bh.loc(b)][::sum] you cannot do a one-sided crop, I don't see how that relates to the discussion (also no one should do that, it looks awkward).

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

And they currently do. In fact, there is no difference in h[bh.loc(a):bh.loc(b)] if you pick a b inside the final bin, or if you pick a b past the final bin. You want to introduce a difference in the one with sum based on this distinction, however.

This affects two sided crops too, by the way, if a<min and b>max.

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

My point was h[bh.loc(a):bh.loc(b)][::sum] does not crop at all. The first step makes a smaller histogram, but then the second step sums over the flow bins too, so it is identical to the same action before the cropping. So the only time you truly "crop" is in the reducer expression. (It might be nice to have a way to return new histograms with flow bins turned off for an axis, but not something that currently exists, and not something that would likely ever be part of UHI, which describes conceptual manipulations, not changes in axis type specifics, and removing information should only be done at the last possible point it could be useful).

Edit on last comment: You want to introduce a difference for the upper boundary. I want to remove the distinction for the lower boundary. The current situation is of course an inconstant mess. I am assuming that whatever you decide, it will be applied to both endpoints consistently, and I just sometimes select one endpoint to discuss for clarity.

@HDembinski
Copy link
Member Author

And they currently do. In fact, there is no difference in h[bh.loc(a):bh.loc(b)] if you pick a b inside the final bin, or if you pick a b past the final bin. You want to introduce a difference in the one with sum based on this distinction, however.

Not sure if you are referring to my original statement that slices select bin ranges that overlap with (a,b) or something else. I am currently listening to some talks, but can we discuss this later via Zoom?

@HDembinski
Copy link
Member Author

As far as I can see, either way nothing changes for h[bh.loc(a):bh.loc(b)].

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

Collecting the discussion, I think the only thing we are considering changing is the following behavior. (This was broken before, and currently inconsistent, so "changing" might be the wrong term, "defining" might be better).

h = bh.Histogram(bh.axis.Regular(1,1,2))
h.view(flow=True)[...] = 1

# Explicit endpoints
3 == h[::sum]
2 == h[:bh.loc(2.00001):sum]
2 == h[bh.loc(.99999)::sum]
1 == h[bh.loc(.99999):bh.loc(2.00001):sum]
1 == h[0:len:sum]

h[at(-1):]
h[at(-1)::sum]

h[:at(1):sum]
h[:at(2):sum]



# Possible benefit:
# h[a:b:sum] is finite width ^
# h[0:bh.loc(x):sum] never "jumps" to flow ^
# If edges are not exact, could be an issue?

h[loc(-1)]

# Flow inclusive endpoints
3 == h[::sum]
3 == h[:bh.loc(2.00001):sum]
3 == h[bh.loc(.99999)::sum]
3 == h[bh.loc(.99999):bh.loc(2.00001):sum]
1 == h[0:len:sum]
  • [2] -> overflow
  • [0:2] -> 0, 1.
  • [0:2.001] -> 0, 1, 2
  • [0:2.001] -> 0, 1, overlflow

You can replace an empty endpoint with None (syntactically equivalent) or possibly bh.loc(float("infty")).

Note that if you were to do this in two steps, doing the slice first before the reduction, there is no difference if including an a point in the edge bin vs. in the flow bin in the "normal" slicing.

1 == h[bh.loc(.99999):bh.loc(2.00001)][0:len:sum]
1 == h[bh.loc(1):bh.loc(2)][0:len:sum]

3 == h[bh.loc(.99999):bh.loc(2.00001)][::sum]
3 == h[bh.loc(1):bh.loc(2)][::sum]

A few thoughts: Currently, UHI locators are simple callables that take an axes and return a boost-histogram-style index (-1 to len(axis)). The system was designed around direct indexing, where the distinction between adding an endpoint (like 0) and leaving it off (None) was important, and could not be replaced with Python indexing directly (-1 has a distinct and well defined meaning).

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

Outcome of meeting:

  • UHI locators need to be able to detect if they are in a slice to enable advanced tricks. They only receive the axis as an argument, so my thought would be to include a _slice_ object on the axis (proxy) that they receive - this would even allow them to detect what the action was. You would also need to indicate which item in the slice was being called, _slice_index_?
  • The default bh.loc would avoid including overflow/underflow bins by default, following the standard indexing rule that including an endpoint does not include flow. Writing a custom loc function that would include these would be easy.
  • Boost.Histogram should follow the current design and include the underflow/overflow bin if included in the crop command. We don't actually use any boost-histogram indexing commands that data coordinates, because we always convert to indices first using UHI.

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

  • We can (finally) fix bh.overflow to behave nicely in reductions with this! Currently h[:bh.overflow:sum] slices to one-before-the-bin, which is the final bin, not the overflow bin. For example, for a 10 bin histogram, h[5:bh.overflow:sum] simply reduces to h[5:10:sum], and Python slicing does not include the endpoint - so no overflow bin. We can fix this if we want to now.

Does that sound useful?

PS: In the current implementation, you just aren't supposed to use any non-None indexer for a reduction slice endpoint if you want flow, we could also adhere to that for all built-in indexers. But it would be nice to either be able to support this or throw an error, rather than doing the rather unintuitive thing and working but not including overflow. Also, bh.underflow, since Python does include the lower bound, just "works" here currently.

@HDembinski
Copy link
Member Author

HDembinski commented Dec 7, 2020

* We can (finally) fix `bh.overflow` to behave nicely in reductions with this! Currently `h[:bh.overflow:sum]` slices to one-before-the-bin, which is the final bin, not the overflow bin. For example, for a 10 bin histogram, `h[5:bh.overflow:sum]` simply reduces to `h[5:10:sum]`, and Python slicing does not include the endpoint - so no overflow bin. We can fix this if we want to now.

Does that sound useful?

Yes and also yes to the thoughts in the PS.

@HDembinski
Copy link
Member Author

HDembinski commented Dec 7, 2020

How do we go about with this PR? @henryiii could you finish the work by committing to the branch that I opened? I think you know best how to implement the changes to loc.

@henryiii
Copy link
Member

henryiii commented Dec 7, 2020

Sure, I can do that.

Fun fact: we need the extra information as described above even if we wanted to include endpoints by default or remove them by default - the reason that currently h[:bh.loc(1e9):sum] and h[bh.loc(-1e9)::sum] differ is due to Python not including the endpoint in the slice.

@henryiii henryiii added this to the 1.0.0 milestone Dec 7, 2020
@henryiii
Copy link
Member

henryiii commented Dec 8, 2020

Also, this affects h[:bh.loc(.5)], where this is a point inside the histogram. Currently, bh.loc(.5) returns the index of the bin that contains .5, and then Python slicing syntax takes over - and goes up to but not including this bin. Before the above planned change, we had no way to do anything different - returning one more than the index containing the bin would break h[bh.loc(.5)]. And h[bh.loc(.5):], too, for that matter. Do we want to change this now, so that bh.loc when not on a bin edge and when the stop of a slice will return the bin + 1?

I remember describing the first behavior in PyHEP 2019, and then I think @HDembinski decided on the second behavior for Boost.Histogram, but since this part is controlled by normal indexes (since UHI converts all values to indexes, and then boost-histogram uniformly passes around indices behind the scenes), this behavior inherits the natural Python index behavior and never changed (technically from Boost.Histogram bin based indexing, which is also one past the end, like Python). It was unable to change - you have to detect which part of the slice you are in in order to add 1 to the value of loc. This would be a major, breaking change, all code using bh.loc as an endpoint to a slice would change. And might require some tricky behavior based on being exactly on a floating point edge.

To me, this is significantly more natural in Python (as evidenced by the fact it couldn't even be done before); slicing is very well defined in Python (unlike C++, unless you count valarray's std::slice, which is also like Python) - but there really isn't a clear definition for what happens for floating point slices, which don't exist in Python, so it's only a generalization. On the other hand, including the bin is harder to make by hand with a locator, so if we only provide one, providing the harder to make one might be better. Also, from a practical standpoint, (more?) often you do want the bin that contains the endpoint - in PyHEP 2019, I recommended using bh.loc(.5) + 1 for this purpose. So I'm really fine with either way, I just know that changing is going to be a challenge that needs to be considered. Maybe have a way to opt-in to the old behavior for the transition? @jpivarski might have ideas, but first let's get verification from @HDembinski that he wants the change and I'm not mistaken.

@HDembinski
Copy link
Member Author

Slicing with indices works in C++ in the same way as in Python. In a range a:b, b is not included. In Boost.Histogram that does not matter at all, because I don't use a construct like bh.loc. You pass a slice with the slice object, which then has normal slicing behavior. A slice can only use indices. To use values you need the crop or shrink object, which only accept values. The two concepts are well separated and there is no problem that one includes the bin that contains the value and the other does not.

The design in Pyhton has become awkward because you force both values and indices to go through a Python slice. The behavior of the slice simply does not generalize to this case.

@HDembinski
Copy link
Member Author

Note that Jim's rule was born out of what the user most likely needs, we should not forget that. I agree that for consistency, bh.loc should not automatically add +1 when it is on the upper edge, because that is awkward. Instead, we should offer an alternative object that you can use instead of the slice, which keeps the bins that overlap.

@henryiii
Copy link
Member

henryiii commented Dec 9, 2020

This is part of UHI, the locators control the conversion to integers, giving this power to users very powerful. We also can't directly use Boost.Histogram's implementation anyway, because it doesn't support mixing endpoint types; there is no way to do h[0:bh.loc(.5):sum], or h[bh.loc(.5):len:sum].

The question here is do we keep the current simpler behavior or convert to the more like Boost.Histogram's behavior?

@henryiii
Copy link
Member

henryiii commented Dec 9, 2020

Okay, that's much less disruptive. And I do think it tends to be a common need, it's often mentally simpler to request a range that is inclusive when using data coords instead of bins, hence the immediate request for a +1. What would you like to call it?

@henryiii
Copy link
Member

henryiii commented Dec 9, 2020

bh.incloc, for inclusive loc, perhaps? How about bh.contains? It's a little long but I like that it is clear, h[bh.contains(.5):bh.contains(1.5)] is very obvious to someone unfamiliar with our library.

@HDembinski
Copy link
Member Author

HDembinski commented Dec 10, 2020

My proposal is to have another object for doing a slice in values. I don't care that you cannot mix values and integers then, because I don't believe that's what people want. For this PR, it does not matter how we solve that in future. Let's focus on this PR. For this PR, we do as you suggest, bh.loc gives you the index that corresponds to the value. If you want to include the upper bin you do bh.loc(value) + 1.

@HDembinski
Copy link
Member Author

HDembinski commented Dec 10, 2020

In other words, I was happy with the result of our Zoom call, let's just implement that and worry about other ways of slicing later.

The only thing important to me is that [:bh.loc(value):sum] includes the underflow bin and [bh.loc(value)::sum] includes the overflow bin, but that is still the case, as far as I can see.

@HDembinski
Copy link
Member Author

I wrote up what I had in mind for a value_slice here #482, so that we can focus on this PR, which is strictly about the behavior of bh.loc.

@henryiii
Copy link
Member

henryiii commented Dec 10, 2020

I was happy with the result of our Zoom call

I just was afraid that you thought the behavior of loc was different in the Zoom call (I didn't really think of it then, either).

No problem, was not going to block or make too many changes in one PR anyway. I'll get to this late (the actual point of the PR, the one we talked about) today or tomorrow.

@henryiii
Copy link
Member

henryiii commented Dec 19, 2020

This looks good, will merge. I'll make a followup PR to add the smarter UHI support, which will allow locators to detect which part of a slice they are in, allowing much more powerful locators to be written by users.

@henryiii henryiii merged commit 41f56ef into develop Dec 19, 2020
@henryiii henryiii deleted the fix_cropping_bug branch December 19, 2020 23:02
@henryiii henryiii removed the needs changelog Might need a changelog entry label Jan 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in h[start::bh.sum] from v0.8 onwards
2 participants