-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Eager fixes for distributed execution #558
base: future
Are you sure you want to change the base?
Conversation
tammam1998
commented
May 20, 2022
- Add eager after remote_read commands
- Fix an issue causing conflicting node ids on the worker. DFGNode.next_id starts from 0 in the worker which causes us to override some nodes when adding eagers. A better long-term solution is to make node id generation local to the graph.
Signed-off-by: Tammam Mustafa <[email protected]>
fd53591
to
494a436
Compare
OS:ubuntu-20.04 |
OS:ubuntu-20.04 |
OS:ubuntu-18.04 |
OS:ubuntu-18.04 |
OS = Debian 10
|
Signed-off-by: Tammam Mustafa <[email protected]>
OS = Debian 10
|
OS:ubuntu-18.04 |
OS:ubuntu-20.04 |
OS = Debian 10
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little concerned about the changes in the graph traversal. Where they necessary or simple improvements?
# Set DFGNode next id to not clash with already existing ids | ||
# TODO: ideally we should get the next_id from the graph object | ||
# to avoid conflicts across parallel processes | ||
DFGNode.next_id = max(DFGNode.next_id , max(graph.nodes.keys()) + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we do that in general? Outside of this if
statement?
@@ -261,6 +265,9 @@ def assign_workers_to_subgraphs(subgraphs:List[IR], file_id_gen: FileIdGen, inpu | |||
# sometimes a command can have both a file resource and an ephemeral resources (example: spell oneliner) | |||
continue | |||
|
|||
# for worker, graph in worker_subgraph_pairs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would delete this
@@ -721,14 +722,20 @@ def add_eager_nodes(graph, use_dgsh_tee): | |||
intermediateFileIdGen = FileIdGen(0, runtime_config['eager_intermediate_prefix']) | |||
|
|||
## Get the next nodes | |||
workset = [node for source_node_id in source_node_ids for node in graph.get_next_nodes(source_node_id)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be tempted to think that there was a reason why this was here. Maybe we don't want to start from all source_node_ids
? Is there a particular reason you deleted this? The reason why I am asking is because regressions here cannot be caught by any test (the compiler won't fail most likely but maybe it won't optimize that much).
if (not curr_id in visited): | ||
visited.add(curr_id) | ||
next_node_ids = graph.get_next_nodes(curr_id) | ||
|
||
# Skip if this is the last node | ||
if not next_node_ids: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, is that necessary or an optimization?