-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reductions: fix text for associative binary ops #12
reductions: fix text for associative binary ops #12
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor typo but otherwise look good.
Co-authored-by: Muhammad Awad <[email protected]>
Co-authored-by: Muhammad Awad <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have as much familiarity with our stance on associativity. I think this text change is basically saying the ops are not necessarily associative because they really can't be with FP. I think we want to make sure run-to-run results are the same though right? even if those results are maybe technically ""wrong"", at least they are the same. Maybe we should put explicit text to that effect if that's the case?
Otherwise minor an binary -> a binary in two places.
I think this change is saying that results of FP reductions might have small rounding errors so they may never be guaranteed to be exactly the same run-to-run. But as an example, users could verify/validate FP reduction results by comparing the difference and if greater than some relevant threshold (say something like a 1e-6 for Looping in @jdinan and @BKP since they were interested and might have thoughts on how to explain it better. |
FWIW: The C++ Standard text equivalent to this is:
|
I think you can get repeatable results as long as you do something to guarantee the operations are always done in the same order. Maybe that's not possible to enforce in shmem reductions though, hence the language here. I'm certainly fine with whatever the group thoughts are, I just know we got bit by this at IBM long ago (we had a broadcast-based allreduce algorithm that was really fast, but it failed Livermore verification checks because you'd do the operations in the random order you got packets. Results would differ by something like 10^-15 but the standard allreduce that enforced a given order would give the exact same results to whatever precision they were looking for every time) |
So is that saying that accumulate is repeatable then? |
I like that - I didn't really say anything like "[reductions apply binary ops] in an unspecified order" which is also applicable to OpenSHMEM I think. Fun side question: Can we re-use this text? The ISO C++ standard doesn't have a clear re-use/derived-works allowed license to my eye, so I'm guessing not? |
yes, it is like what you said in your previous comment. |
Huh, accumulate defines a specific ordering: "first to last" whereas reduce does not. I don't think OpenSHMEM can get away with a well-defined ordering for reductions myself... but maybe?
My hunch is 1e-15 is pretty "large" for an epsilon - 64 bit machine epsilon is ~2.2e-16. I don't have enough experience to know what "worst case" is though, say with wildly varying element magnitudes and orderings... Maybe we should look into this before committing to the correction. I'm curious to what extent this affects your use-cases @kwaters4? |
The biggest issue that we have run across is how to deal with Today we have a higher level API that aborts a job whenever I should have followed up on this issue earlier, but was unable to. I would like to push it into the upcoming specification, but need some help articulating the behavior. |
b6cfcc6
to
3d53054
Compare
identical to the \dest{} arrays produced by all other \acp{PE} in the team. | ||
|
||
The result of any given \openshmem reduction is reproducible across job | ||
execution environments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we improve/clarify this? "job execution environments"
Closing this in favor of a PR targeting RC2 |
Summary of changes
This attempts to correct the language around "associativity" of binary operations in OpenSHMEM reductions.
The proposed changelog entry is here:
kwaters4#8
I believe this is an errata, so that's included there as well.
Proposal Checklist