
Provide a way to measure in the simulator the impact of worker-pool size on provisioning. #11

Closed
djmitche opened this issue Aug 7, 2020 · 6 comments


djmitche commented Aug 7, 2020

@tomprince commented on Tue Aug 04 2020

https://bugzilla.mozilla.org/show_bug.cgi?id=1637216 for context

There is potentially a trade-off between total cost and end-to-end time for a graph or graphs when changing the worker-pool size. If there were zero over-provisioning and zero overhead there would be no trade-off, but neither of those is currently the case.


@djmitche commented on Tue Aug 04 2020

There is potentially a trade-off between total cost, and end-to-end time for a graph or graphs, of changing the worker-pool size. If there was overhead over-provisioning, and zero overhead, but neither of those are currently the case.

Can you rephrase this? I'm not sure what "overhead over-provisioning" means.


@tomprince commented on Tue Aug 04 2020

*zero over-provisioning


@djmitche commented on Tue Aug 04 2020

that makes a lot more sense, thanks :)


@djmitche commented on Wed Aug 05 2020

I think that this won't require any special functionality from the simulator; it can be represented by a simulation run with a particular set of parameters, perhaps with a provisioner that just retains a specific number N of running workers.

I expect we'll see that

  • for some large N, all tasks start immediately and workers are often idle
  • for some small N, E2E time grows without bound as there's just more work to do than capacity to do it
  • for some intermediate N, tasks do not start immediately and E2E time increases from the minimum, but is bounded

Within that intermediate N, the total work provided by the workers (integral of capacity over time) is greater than the total work required (integral of task duration over time). Different N's in that range balance spare capacity to handle spikes of task load against the cost of that capacity. In the visualization, we would see that lower N's in that range will have pending counts that fall more slowly, and at higher N's pending will fall more quickly.
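As a rough illustration (this is not the real simulator, and the binomial arrival model is a made-up stand-in for real load), a minimal discrete-time sketch of the fixed-N experiment shows the same three regimes:

```python
import random

def simulate(n_workers, minutes=24 * 60, workload=6.0, seed=1):
    """Pending-task count per minute for a fixed pool of n_workers.

    Each minute roughly `workload` one-minute tasks arrive, and each
    worker retires one minute of queued work.
    """
    rng = random.Random(seed)
    pending = 0
    history = []
    for _ in range(minutes):
        # Binomial arrivals with mean `workload` stand in for bursty load.
        arrivals = sum(1 for _ in range(int(workload * 2)) if rng.random() < 0.5)
        pending = max(0, pending + arrivals - n_workers)
        history.append(pending)
    return history

# Large N: pending stays at zero (idle capacity); small N: pending grows
# without bound; intermediate N: pending is nonzero but bounded.
```

Sweeping `n_workers` over that intermediate range and integrating `history` would give the spare-capacity-versus-cost balance described above.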

Anyway, this remains a good hypothesis to test out when the simulator is ready.

@djmitche
Copy link
Contributor Author

#3375 will go a long way toward measuring this. Here's a situation with a workload of 6 (that is, 6 minutes of work being injected per minute), and 4 workers:

image

and here's the same with 6 workers:

image

A workload of 6.1:

image

OK, it looks the same, but mousing over shows that there are about 60 pending tasks at the end of the 24-hour sample period.
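A back-of-envelope check (my arithmetic, not simulator output) of why the 6.1 case slowly diverges: assuming the 6 workers from the previous run, they retire 6 worker-minutes of work per minute, so a workload of 6.1 leaves a deficit that accumulates linearly:

```python
capacity = 6.0        # worker-minutes retired per minute (6 workers)
workload = 6.1        # minutes of work injected per minute
deficit = workload - capacity
backlog = deficit * 24 * 60   # minutes of queued work after the 24 h sample
# ~144 minutes of queued work; how many pending *tasks* that corresponds to
# depends on the sampled task durations, hence the ~60 seen when mousing over.
```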

@djmitche

With some experimentation with the firefox-ci gecko-t/t-linux-large pool, I was able to show that 315 workers are more or less able to hold the pending count steady at 10,000 tasks (!)

image

Note: ignore the diagonal lines at the end of the graph -- these graphs are showing the ramp-down phase, where no load is injected. We should be hiding that from the output.

Here's 350 workers:

image

and 400:

image

Anyway, 400 m5.large's in us-east-1 would cost about $330/day. We don't have data on what the wait-times would be for such an arrangement (that's #10), but it looks like it's probably still pretty bad. Still, that provides a baseline, at least.

I think that comparing a fixed pool size to total wait time over the range from never-a-pending-task to falling-far-behind would wrap this issue up.

djmitche self-assigned this Aug 10, 2020
@djmitche

(in particular, it's interesting that 315 workers seemed to be steady long-term, but with a VERY high pending, and that it took quite a few more workers to get that pending down)

@tomprince

I think load in a lot of pools tends to be much more bursty than the load you are generating here. Looking at android-components (linked in the original bug), there are between 10 and 120 tasks that take 2-8 minutes, and 11 tasks that take 20-30 minutes, all created at once. And the arrival of these bursts varies, often with enough time for the pool to go idle in between, but likely also occasionally with overlap between them.
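For concreteness, a sketch of that burst shape (the task counts and duration ranges are the android-components numbers above; the gap between bursts is my assumption, chosen long enough for the pool to idle):

```python
import random

def burst(rng):
    """One graph's worth of tasks: 10-120 short tasks (2-8 min) plus
    11 long tasks (20-30 min), all created at the same instant."""
    short = [rng.uniform(2, 8) for _ in range(rng.randint(10, 120))]
    long_tasks = [rng.uniform(20, 30) for _ in range(11)]
    return short + long_tasks

def arrivals(hours=24, mean_gap=45, seed=1):
    """(creation time in minutes, list of task durations) for a day of
    bursts; the 45-minute mean gap between bursts is a guess."""
    rng = random.Random(seed)
    t = 0.0
    out = []
    while t < hours * 60:
        out.append((t, burst(rng)))
        t += rng.expovariate(1.0 / mean_gap)
    return out
```

Feeding something like this into the fixed-pool model would let us compare steady load against bursty load directly.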

@djmitche

These models are using the "bursty" creation timing and variation of durations as sampled in reality. If there are other pools we should also sample, I'm open to suggestions.

@djmitche

OK, a few datapoints:

| Workers | Overprovisioned workers | Overprovisioned worker time | Minimum wait | Maximum wait | Mean wait | Median wait |
|---|---|---|---|---|---|---|
| 300 | 24,073 | 8 months (21665700000 ms) | ~5 hours (16219272 ms) | ~21 hours (74301666 ms) | ~13 hours (46046170 ms) | ~13 hours (46046352.5 ms) |
| 350 | 46,860 | over 1 year (42174000000 ms) | none | ~7 hours (26184234 ms) | ~5 hours (16708393 ms) | ~5 hours (16753458 ms) |
| 400 | 61,559 | almost 2 years (55403100000 ms) | none | ~4 hours (14271387 ms) | ~1 hour (5081847 ms) | ~1 hour (4715043 ms) |
| 450 | 76,809 | about 2 years (69128100000 ms) | none | ~3 hours (9374780 ms) | 33 minutes (1951899 ms) | 18 minutes (1090774 ms) |
| 500 | 90,186 | over 2 years (81167400000 ms) | none | ~2 hours (6786750 ms) | 15 minutes (910757 ms) | 5 minutes (320395.5 ms) |

(The nonzero minimum wait at 300 workers makes sense, since a big backlog of tasks was generated during the ramp-up period.)

If we agree that wait times of 5-15 minutes are OK, then a maximum pool size of 500 might be reasonable. The number of over-provisioned workers here is obviously untenable, but even a very basic provisioning algorithm (like the simple estimate we have now) could bring that down quite a bit.

500 workers would cost about $410/day.

500 workers for 5 days is about 7 compute-years; assuming we could pretty easily trim out 2 of those compute-years, that cuts the cost by two sevenths, to about $293/day. I don't know what our current spending on this pool is, but perhaps just setting maxCapacity=500 would reduce it?
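Redoing that arithmetic (the dollar figures are the ones quoted above, not recomputed from current EC2 pricing):

```python
workers = 500
days = 5
compute_years = workers * days / 365   # 2500 worker-days, ~6.8 compute-years
daily_cost = 410.0                     # quoted $/day for 500 workers
trimmed = daily_cost * (1 - 2 / 7)     # trim ~2 of ~7 compute-years
# trimmed comes out to about $293/day
```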

Anyway, I think that gives a pretty good answer to the question in this issue, so I'll call it finished.
