
reductions: fix text for associative binary ops #12

Closed

Conversation

@davidozog (Owner) commented Aug 30, 2024

Summary of changes

This attempts to correct the language around "associativity" of binary operations in OpenSHMEM reductions.

The proposed changelog entry is here:
kwaters4#8

I believe this is an erratum, so that's included there as well.

Proposal Checklist

  • Link to issue(s)
  • Changelog entry
  • Reviewed for changes to front matter
  • Reviewed for changes to back matter

@maawad (Collaborator) left a comment

Minor typo, but otherwise looks good.

content/shmem_reductions.tex: 2 review comments (outdated, resolved)
davidozog and others added 2 commits August 30, 2024 15:35
@isubsmith (Collaborator) left a comment

I don't have as much familiarity with our stance on associativity. I think this text change is basically saying the ops are not necessarily associative, because they really can't be with floating point. I think we want to make sure run-to-run results are the same, though, right? Even if those results are technically "wrong", at least they are the same. Maybe we should add explicit text to that effect if that's the case.

Otherwise, one minor fix: "an binary" -> "a binary" in two places.

@davidozog (Owner, Author) commented

> I don't have as much familiarity with our stance on associativity. I think this text change is basically saying the ops are not necessarily associative, because they really can't be with floating point. I think we want to make sure run-to-run results are the same, though, right? Even if those results are technically "wrong", at least they are the same. Maybe we should add explicit text to that effect if that's the case.

I think this change is saying that results of FP reductions might have small rounding errors, so they may never be guaranteed to be exactly the same run-to-run. But as an example, users could verify/validate FP reduction results by checking whether the difference exceeds some relevant threshold (say, something like 1e-6 for float, or an even tighter "machine epsilon"). Maybe we need more clarity regarding this...

Looping in @jdinan and @BKP since they were interested and might have thoughts on how to explain it better.

@maawad (Collaborator) commented Aug 30, 2024

> I don't have as much familiarity with our stance on associativity. I think this text change is basically saying the ops are not necessarily associative, because they really can't be with floating point. I think we want to make sure run-to-run results are the same, though, right? Even if those results are technically "wrong", at least they are the same. Maybe we should add explicit text to that effect if that's the case.
>
> Otherwise, one minor fix: "an binary" -> "a binary" in two places.

FWIW: The C++ Standard text equivalent to this is:

[Note 1: The difference between reduce and accumulate is that reduce applies binary_op in an unspecified order, which yields a nondeterministic result for non-associative or non-commutative binary_op such as floating-point addition. — end note]

https://eel.is/c++draft/reduce#9

@isubsmith (Collaborator) commented

> I think this change is saying that results of FP reductions might have small rounding errors, so they may never be guaranteed to be exactly the same run-to-run. But as an example, users could verify/validate FP reduction results by checking whether the difference exceeds some relevant threshold (say, something like 1e-6 for float, or an even tighter "machine epsilon"). Maybe we need more clarity regarding this...

I think you can get repeatable results as long as you do something to guarantee the operations are always done in the same order. Maybe that's not possible to enforce in shmem reductions, though, hence the language here. I'm certainly fine with whatever the group's thoughts are; I just know we got bit by this at IBM long ago. We had a broadcast-based allreduce algorithm that was really fast, but it failed Livermore verification checks because the operations were done in the random order the packets arrived. Results would differ by something like 10^-15, but the standard allreduce that enforced a given order would give the exact same results, to whatever precision they were looking for, every time.

@isubsmith (Collaborator) commented

> FWIW: The C++ Standard text equivalent to this is:
>
> [Note 1: The difference between reduce and accumulate is that reduce applies binary_op in an unspecified order, which yields a nondeterministic result for non-associative or non-commutative binary_op such as floating-point addition. — end note]
>
> https://eel.is/c++draft/reduce#9

So is that saying that accumulate is repeatable then?

@davidozog (Owner, Author) commented

> FWIW: The C++ Standard text equivalent to this is:
>
> [Note 1: The difference between reduce and accumulate is that reduce applies binary_op in an unspecified order, which yields a nondeterministic result for non-associative or non-commutative binary_op such as floating-point addition. — end note]
>
> https://eel.is/c++draft/reduce#9

I like that - I didn't really say anything like "[reductions apply binary ops] in an unspecified order", which I think is also applicable to OpenSHMEM.

Fun side question: Can we re-use this text? The ISO C++ standard doesn't have a clear re-use/derived-works allowed license to my eye, so I'm guessing not?

@maawad (Collaborator) commented Aug 30, 2024

> > FWIW: The C++ Standard text equivalent to this is:
> >
> > [Note 1: The difference between reduce and accumulate is that reduce applies binary_op in an unspecified order, which yields a nondeterministic result for non-associative or non-commutative binary_op such as floating-point addition. — end note]
> >
> > eel.is/c++draft/reduce#9
>
> So is that saying that accumulate is repeatable then?

Yes, it's like what you said in your previous comment:
https://eel.is/c++draft/accumulate#2

@davidozog (Owner, Author) commented

> > > FWIW: The C++ Standard text equivalent to this is:
> > >
> > > [Note 1: The difference between reduce and accumulate is that reduce applies binary_op in an unspecified order, which yields a nondeterministic result for non-associative or non-commutative binary_op such as floating-point addition. — end note]
> > >
> > > eel.is/c++draft/reduce#9
> >
> > So is that saying that accumulate is repeatable then?
>
> Yes, it's like what you said in your previous comment. https://eel.is/c++draft/accumulate#2

Huh - accumulate defines a specific ordering ("first to last"), whereas reduce does not. I don't think OpenSHMEM can get away with a well-defined ordering for reductions myself... but maybe?

> Results would differ by something like 10^-15

My hunch is that 1e-15 is pretty "large" for an epsilon - 64-bit machine epsilon is ~2.2e-16. I don't have enough experience to know what the "worst case" is, though, say with wildly varying element magnitudes and orderings... Maybe we should look into this before committing to the correction.

I'm curious to what extent this affects your use-cases @kwaters4?

@kwaters4 (Collaborator) commented Aug 30, 2024

The biggest issue that we have run across is how to deal with NaNs, similar to issue #467. Currently, some implementations do not handle this uniformly: some PEs will get NaN while others do not. Other implementations handle it well, so all PEs get the same value at the end of the operation. Defining the epsilon is not much of a concern; NaN is the only factor. If this were defined in the specification, it would force implementers to define the behavior around NaNs.

Today we have a higher-level API that aborts a job whenever NaN is detected on any PE following a reduction operation. This is not the best scenario, but it is the pragmatic approach we take today.

I should have followed up on this issue earlier, but was unable to. I would like to push it into the upcoming specification, but need some help articulating the behavior.

content/shmem_reductions.tex: review comment (outdated, resolved)
@davidozog davidozog force-pushed the pr/associativity_of_reductions branch from b6cfcc6 to 3d53054 Compare September 17, 2024 15:04
> identical to the \dest{} arrays produced by all other \acp{PE} in the team.
>
> The result of any given \openshmem reduction is reproducible across job
> execution environments.
@davidozog (Owner, Author) commented on the diff

Can we improve/clarify this? "job execution environments"

@davidozog (Owner, Author) commented

Closing this in favor of a PR targeting RC2

@davidozog davidozog closed this Oct 3, 2024
@davidozog davidozog mentioned this pull request Oct 3, 2024
6 participants