
WQ: Further mainloop optimizations #3380

Merged: 7 commits, Jun 21, 2023

Conversation

colinthomas-z80
Contributor

No description provided.

@benclifford

I tried this PR (commit fa92953) on my laptop with parsl-perf --time 300, and I still see only 42 tasks/second near the start of a run. This is with around 75000 tasks in the queue at the start. I think that's not a noticeable change for this particular measurement compared to the current cctools master commit. Here's a plot of TASK RUNNING events for that run.

(as I've mentioned before, ignore spikes near the beginning, as that is parsl-perf calibrating itself)

[plot: TASK RUNNING events over time for this run]

@colinthomas-z80
Contributor Author

@benclifford This change improved my performance in the range of 10-20k tasks in the queue, but yes, it still does not hold up at larger queue sizes. I believe the remaining issue is in the way we retrieve tasks.

@colinthomas-z80
Contributor Author

Task retrieval has shaped up a bit; now the suspect seems to be logging overhead, as seen in the profile in #3284.

@dthain
Member

dthain commented Jun 20, 2023

Looks like this is making progress but maybe not ready to merge yet.
Let me know when you think it's ready.

@colinthomas-z80
Contributor Author

@dthain Before this is done, I wanted to hear your thoughts on adding some sort of "high throughput" mode that disables the performance log.

@dthain
Member

dthain commented Jun 21, 2023

Some quick thoughts -

  • Ideally, the performance log shouldn't impose an enormous overhead. Check whether we are doing something unnecessarily expensive when writing to the performance log, such as calling fflush/fsync on every event (a sketch follows this list).
  • The performance log should only be enabled if specifically requested by vine_enable_perf_log. Perhaps the parsl-taskvine executor has that on by default.
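
For reference, a minimal sketch of the buffering pattern the first bullet alludes to: write each performance-log line through stdio buffering and flush only on a coarse interval, so an individual event never pays for an fflush or fsync. The names perf_log_write and last_flush and the 15-second threshold are illustrative, not taken from work_queue.c.

  #include <stdio.h>
  #include <time.h>

  static FILE  *perf_log   = NULL;  /* opened only if logging was requested */
  static time_t last_flush = 0;

  /* Append one line to the performance log without forcing it to disk. */
  static void perf_log_write(const char *line)
  {
      if(!perf_log) return;            /* log disabled: do nothing */

      fputs(line, perf_log);
      fputc('\n', perf_log);

      time_t now = time(NULL);
      if(now - last_flush >= 15) {     /* flush at most every 15 seconds */
          fflush(perf_log);
          last_flush = now;
      }
  }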

@colinthomas-z80
Contributor Author

@dthain It is not necessarily writing the performance log that is expensive, but rather collecting the information about waiting/ready tasks when the queue is large. We do an unnecessarily expensive operation collecting that information, which could perhaps be reduced to a single full pass over the task list; even that will still be a problem as the queue size grows.

This pertains to Work Queue specifically at the moment. Even if the performance log is not enabled, we still collect the information before we break out of log_queue_stats.
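
As an illustration of reducing that collection to one traversal, here is a self-contained sketch that tallies every task state in a single pass over a task list instead of scanning the list once per state. The struct and enum below are simplifications for the example, not the actual work_queue.c data structures.

  #include <stdint.h>

  enum { STATE_READY, STATE_RUNNING, STATE_WAITING_RETRIEVAL, STATE_MAX };

  struct task {
      int          state;
      struct task *next;
  };

  /* Count the tasks in every state with one walk of the list. */
  static void count_task_states(const struct task *head, uint64_t counts[STATE_MAX])
  {
      for(int i = 0; i < STATE_MAX; i++) counts[i] = 0;

      for(const struct task *t = head; t; t = t->next) {
          counts[t->state]++;
      }
  }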

@btovar
Member

btovar commented Jun 21, 2023

@colinthomas-z80 An alternative is to increase the interval at which stats are written (e.g., in work_queue.c change it to 15*ONE_SECOND) and remove the forced write at line 7090 (change it to log_queue_stats(q, 0) and remove the enclosing if).
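
A rough, self-contained sketch of that rate-limiting idea: skip the expensive stats collection entirely unless the interval has elapsed or the caller forces a write. The names queue_stats_state and now_usecs and the 15-second constant are illustrative; the real work_queue.c keeps this state inside the queue structure.

  #include <stdint.h>
  #include <sys/time.h>

  #define ONE_SECOND 1000000  /* timestamps in microseconds */

  struct queue_stats_state {
      uint64_t time_last_log;  /* when the last stats line was written */
  };

  static uint64_t now_usecs(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return (uint64_t) tv.tv_sec * ONE_SECOND + tv.tv_usec;
  }

  static void log_queue_stats(struct queue_stats_state *s, int force)
  {
      uint64_t now = now_usecs();

      /* Unless forced, do nothing if the logging interval has not elapsed,
         so the expensive stats collection is skipped entirely. */
      if(!force && now - s->time_last_log < 15 * (uint64_t) ONE_SECOND)
          return;

      /* ... collect queue statistics and write one log line here ... */

      s->time_last_log = now;
  }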

@dthain
Member

dthain commented Jun 21, 2023

Ok, so our guiding principles here are:

  • Make minimal changes to WQ to meet needs and keep it working.
  • Put our main efforts into TaskVine going forward.

So, if the underlying problem is that we are inefficiently iterating over the entire ready queue in order to count task states when writing the performance log, then I think we should do the following:

1 - Make a few safe and easy modifications to WQ: just traverse the list once, and do it less frequently.
2 - Modify TaskVine to properly track the count of tasks in each state. Since all task state changes (wisely) go through change_task_state, it ought to be straightforward to update some integer counters there, as long as we are disciplined to ensure that there is no other way to change the state (a sketch follows this list).
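
A minimal sketch of point 2: funnel every state transition through one chokepoint and update per-state counters there, so the performance log can be written without touching the task list. The enum values and names below follow the discussion but are illustrative, not the actual TaskVine code, which keeps such counters in the manager structure rather than in globals.

  typedef enum {
      TASK_READY = 0,
      TASK_RUNNING,
      TASK_WAITING_RETRIEVAL,
      TASK_DONE,
      TASK_STATE_MAX
  } task_state_t;

  struct task {
      task_state_t state;
      /* ... */
  };

  static int state_counts[TASK_STATE_MAX];

  /* The only function allowed to modify t->state; the counters stay exact
     as long as no other code path changes the state. */
  static void change_task_state(struct task *t, task_state_t new_state)
  {
      state_counts[t->state]--;
      state_counts[new_state]++;
      t->state = new_state;
  }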

@btovar
Member

btovar commented Jun 21, 2023

Regarding point 2, I think that's how it was in WQ 1, and why we changed it to the current implementation in WQ 2. As the code evolved, it became really hard to maintain the state twice (i.e., the real state and the stats state). I think point 1 (iterating the list once and doing it less frequently) may go a long way.

@dthain
Member

dthain commented Jun 21, 2023

Let's proceed with doing the easy thing for WQ and see how far it gets us.

I'm willing to try the more complex thing for TV if it is necessary to meet our performance goals.

@colinthomas-z80
Contributor Author

Ready to merge if there are no further concerns.

(resolved review thread on work_queue/src/work_queue.c)
@dthain
Member

dthain commented Jun 21, 2023

What's the performance effect of this change?

@dthain
Member

dthain commented Jun 21, 2023

Ok, how much does this help performance?

@colinthomas-z80
Contributor Author

@dthain Although it is subjective depending on the hardware, this gives me a peak of around 500 tasks/sec with 12k tasks in the queue. At queue sizes up to 60k-70k I see around 200 tasks/sec. The most I have tested is 110k tasks, which resulted in an average of 160 tasks/sec. So the drop-off is much less severe than before.

Before, we were seeing roughly 300 tasks/sec with 10k in the queue, and less than 100 tasks/sec with 20k+ in the queue.

Once I start working on TaskVine I hope to have some more visual depictions of the performance.

(two outdated, resolved review threads on work_queue/src/work_queue.c)
@colinthomas-z80
Contributor Author

colinthomas-z80 commented Jun 21, 2023

This PR:

  ==== Iteration 1 ====
  Will run 10 tasks to target 120 seconds runtime
  Submitting tasks / invoking apps
  warning: using plain-text when communicating with workers.
  warning: use encryption with a key and cert when creating the manager.
  All 10 tasks submitted ... waiting for completion
  Submission took 0.008 seconds = 1235.690 tasks/second
  Runtime: actual 4.432s vs target 120s
  Tasks per second: 2.256
  ==== Iteration 2 ====
  Will run 270 tasks to target 120 seconds runtime
  Submitting tasks / invoking apps
  All 270 tasks submitted ... waiting for completion
  Submission took 0.105 seconds = 2569.648 tasks/second
  Runtime: actual 2.376s vs target 120s
  Tasks per second: 113.648
  ==== Iteration 3 ====
  Will run 13637 tasks to target 120 seconds runtime
  Submitting tasks / invoking apps
  All 13637 tasks submitted ... waiting for completion
  Submission took 5.136 seconds = 2654.964 tasks/second
  Runtime: actual 26.247s vs target 120s
  Tasks per second: 519.573
  ==== Iteration 4 ====
  Will run 62348 tasks to target 120 seconds runtime
  Submitting tasks / invoking apps
  All 62348 tasks submitted ... waiting for completion
  Submission took 25.965 seconds = 2401.229 tasks/second
  Runtime: actual 325.878s vs target 120s
  Tasks per second: 191.323
  Cleaning up DFK
  The end

@colinthomas-z80
Contributor Author

Compared to the current cctools master:

  ==== Iteration 1 ====
  Will run 10 tasks to target 120 seconds runtime
  Submitting tasks / invoking apps
  warning: using plain-text when communicating with workers.
  warning: use encryption with a key and cert when creating the manager.
  All 10 tasks submitted ... waiting for completion
  Submission took 0.008 seconds = 1261.672 tasks/second
  Runtime: actual 4.412s vs target 120s
  Tasks per second: 2.266
  ==== Iteration 2 ====
  Will run 271 tasks to target 120 seconds runtime
  Submitting tasks / invoking apps
  All 271 tasks submitted ... waiting for completion
  Submission took 0.088 seconds = 3077.867 tasks/second
  Runtime: actual 2.372s vs target 120s
  Tasks per second: 114.231
  ==== Iteration 3 ====
  Will run 13707 tasks to target 120 seconds runtime
  Submitting tasks / invoking apps
  All 13707 tasks submitted ... waiting for completion
  Submission took 4.890 seconds = 2802.854 tasks/second
  Runtime: actual 91.802s vs target 120s
  Tasks per second: 149.310
  Cleaning up DFK
  The end

@dthain
Member

dthain commented Jun 21, 2023

Looking good! Ready to merge now?

@colinthomas-z80
Contributor Author

Yes, I believe so.

@dthain merged commit c2c386d into cooperative-computing-lab:master on Jun 21, 2023