Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eager fixes for distributed execution #558

Open
wants to merge 2 commits into
base: future
Choose a base branch
from
Open

Conversation

tammam1998
Copy link
Collaborator

  • Add eager after remote_read commands
  • Fix an issue causing conflicting node ids on the worker. DFGNode.next_id starts from 0 in the worker which causes us to override some nodes when adding eagers. A better long-term solution is to make node id generation local to the graph.

@tammam1998 tammam1998 force-pushed the dspash-eagers-fix branch from fd53591 to 494a436 Compare May 20, 2022 14:46
@tammam1998 tammam1998 requested a review from angelhof May 20, 2022 14:47
@github-actions
Copy link

OS:ubuntu-20.04
Fri May 20 14:53:18 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

@github-actions
Copy link

OS:ubuntu-20.04
Fri May 20 14:54:26 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

@github-actions
Copy link

OS:ubuntu-18.04
Fri May 20 14:54:27 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

@github-actions
Copy link

OS:ubuntu-18.04
Fri May 20 14:54:38 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

@github-actions
Copy link

OS = Debian 10
CPU = Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Ram = 15752
Hash = fd53591
Kernel= Linux 4.15.0-167-generic x86_64

benchmark tests passed failed untested unresolved unsupported not_in_use other_status
posix 494 375 41 31 6 40 1 0
intro 2 2 0 0 0 0 0 0
interface 35 35 0 0 0 0 0 0
compiler 54 54 0 0 0 0 0 0
aggregator 109 109 0 0 0 0 0 0

Signed-off-by: Tammam Mustafa <[email protected]>
@github-actions
Copy link

OS = Debian 10
CPU = Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Ram = 15752
Hash = 494a436
Kernel= Linux 4.15.0-167-generic x86_64

benchmark tests passed failed untested unresolved unsupported not_in_use other_status
posix 494 375 41 31 6 40 1 0
intro 2 2 0 0 0 0 0 0
interface 35 35 0 0 0 0 0 0
compiler 54 54 0 0 0 0 0 0
aggregator 109 109 0 0 0 0 0 0

@github-actions
Copy link

OS:ubuntu-18.04
Fri May 20 15:14:01 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

@github-actions
Copy link

OS:ubuntu-20.04
Fri May 20 15:14:17 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

@github-actions
Copy link

OS = Debian 10
CPU = Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Ram = 15752
Hash = 3334b76
Kernel= Linux 4.15.0-167-generic x86_64

benchmark tests passed failed untested unresolved unsupported not_in_use other_status
posix 494 375 41 31 6 40 1 0
intro 2 2 0 0 0 0 0 0
interface 35 35 0 0 0 0 0 0
compiler 54 54 0 0 0 0 0 0
aggregator 109 109 0 0 0 0 0 0

Copy link
Member

@angelhof angelhof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little concerned about the changes in the graph traversal. Where they necessary or simple improvements?

# Set DFGNode next id to not clash with already existing ids
# TODO: ideally we should get the next_id from the graph object
# to avoid conflicts across parallel processes
DFGNode.next_id = max(DFGNode.next_id , max(graph.nodes.keys()) + 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we do that in general? Outside of this if statement?

@@ -261,6 +265,9 @@ def assign_workers_to_subgraphs(subgraphs:List[IR], file_id_gen: FileIdGen, inpu
# sometimes a command can have both a file resource and an ephemeral resources (example: spell oneliner)
continue

# for worker, graph in worker_subgraph_pairs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would delete this

@@ -721,14 +722,20 @@ def add_eager_nodes(graph, use_dgsh_tee):
intermediateFileIdGen = FileIdGen(0, runtime_config['eager_intermediate_prefix'])

## Get the next nodes
workset = [node for source_node_id in source_node_ids for node in graph.get_next_nodes(source_node_id)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be tempted to think that there was a reason why this was here. Maybe we don't want to start from all source_node_ids? Is there a particular reason you deleted this? The reason why I am asking is because regressions here cannot be caught by any test (the compiler won't fail most likely but maybe it won't optimize that much).

if (not curr_id in visited):
visited.add(curr_id)
next_node_ids = graph.get_next_nodes(curr_id)

# Skip if this is the last node
if not next_node_ids:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, is that necessary or an optimization?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants