
Provide a way to measure in the simulator the impact of worker-pool size on provisioning. #11

Closed
djmitche opened this issue Aug 7, 2020 · 6 comments


djmitche commented Aug 7, 2020

@tomprince commented on Tue Aug 04 2020

https://bugzilla.mozilla.org/show_bug.cgi?id=1637216 for context

There is potentially a trade-off between total cost and end-to-end time for a graph or graphs when changing the worker-pool size. If there were zero over-provisioning and zero overhead there would be no trade-off, but neither of those is currently the case.


@djmitche commented on Tue Aug 04 2020

There is potentially a trade-off between total cost, and end-to-end time for a graph or graphs, of changing the worker-pool size. If there was overhead over-provisioning, and zero overhead, but neither of those are currently the case.

Can you rephrase this? I'm not sure what "overhead over-provisioning" means.


@tomprince commented on Tue Aug 04 2020

*zero over-provisioning


@djmitche commented on Tue Aug 04 2020

that makes a lot more sense, thanks :)


@djmitche commented on Wed Aug 05 2020

I think that this won't require any special functionality from the simulator; it can be represented by a simulation run with a particular set of parameters, perhaps with a provisioner that just retains a specific number N of running workers.

I expect we'll see that

  • for some large N, all tasks start immediately and workers are often idle
  • for some small N, E2E time grows without bound as there's just more work to do than capacity to do it
  • for some intermediate N, tasks do not start immediately and E2E time increases from the minimum, but is bounded

Within that intermediate N, the total work provided by the workers (integral of capacity over time) is greater than the total work required (integral of task duration over time). Different N's in that range balance spare capacity to handle spikes of task load against the cost of that capacity. In the visualization, we would see that lower N's in that range will have pending counts that fall more slowly, and at higher N's pending will fall more quickly.
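As a rough illustration (this is not the real simulator, and the binomial arrival model is a made-up stand-in for real load), a minimal discrete-time sketch of the fixed-N experiment shows the same three regimes:

```python
import random

def simulate(n_workers, minutes=24 * 60, workload=6.0, seed=1):
    """Pending-task count per minute for a fixed pool of n_workers.

    Each minute roughly `workload` one-minute tasks arrive, and each
    worker retires one minute of queued work.
    """
    rng = random.Random(seed)
    pending = 0
    history = []
    for _ in range(minutes):
        # Binomial arrivals with mean `workload` stand in for bursty load.
        arrivals = sum(1 for _ in range(int(workload * 2)) if rng.random() < 0.5)
        pending = max(0, pending + arrivals - n_workers)
        history.append(pending)
    return history

# Large N: pending stays at zero (idle capacity); small N: pending grows
# without bound; intermediate N: pending is nonzero but bounded.
```

Sweeping `n_workers` over that intermediate range and integrating `history` would give the spare-capacity-versus-cost balance described above.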

Anyway, this remains a good hypothesis to test out when the simulator is ready.

@djmitche
Copy link
Contributor Author

#3375 will go a long way toward measuring this. Here's a situation with a workload of 6 (that is, 6 minutes of work being injected per minute), and 4 workers:

image

and here's the same with 6 workers:

image

A workload of 6.1:

image

OK, it looks the same, but mousing over shows that there are about 60 pending tasks at the end of the 24-hour sample period.
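A back-of-envelope check (my arithmetic, not simulator output) of why the 6.1 case slowly diverges: assuming the 6 workers from the previous run, they retire 6 worker-minutes of work per minute, so a workload of 6.1 leaves a deficit that accumulates linearly:

```python
capacity = 6.0        # worker-minutes retired per minute (6 workers)
workload = 6.1        # minutes of work injected per minute
deficit = workload - capacity
backlog = deficit * 24 * 60   # minutes of queued work after the 24 h sample
# ~144 minutes of queued work; how many pending *tasks* that corresponds to
# depends on the sampled task durations, hence the ~60 seen when mousing over.
```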

@djmitche

With some experimentation with the firefox-ci gecko-t/t-linux-large pool, I was able to show that 315 workers are more or less able to hold the pending count steady at 10,000 tasks (!)

image

Note: ignore the diagonal lines at the end of the graph -- these graphs are showing the ramp-down phase, where no load is injected. We should be hiding that from the output.

Here's 350 workers:

image

and 400:

image

Anyway, 400 m5.large's in us-east-1 would cost about $330/day. We don't have data on what the wait-times would be for such an arrangement (that's #10), but it looks like it's probably still pretty bad. Still, that provides a baseline, at least.

I think that comparing a fixed pool size to total wait time over the range from never-a-pending-task to falling-far-behind would wrap this issue up.

djmitche self-assigned this Aug 10, 2020
@djmitche

(in particular, it's interesting that 315 workers seemed to be steady long-term, but with a VERY high pending, and that it took quite a few more workers to get that pending down)

@tomprince

I think load in a lot of pools tends to be much more bursty than the load you are generating here. Looking at android-components (linked in the original bug), there are between 10 and 120 tasks that take 2-8 minutes, and 11 tasks that take 20-30 minutes, all created at once. And the arrival of these bursts varies, often with enough time for the pool to go idle in between, but likely also occasionally with overlap between them.
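For concreteness, a sketch of that burst shape (the task counts and duration ranges are the android-components numbers above; the gap between bursts is my assumption, chosen long enough for the pool to idle):

```python
import random

def burst(rng):
    """One graph's worth of tasks: 10-120 short tasks (2-8 min) plus
    11 long tasks (20-30 min), all created at the same instant."""
    short = [rng.uniform(2, 8) for _ in range(rng.randint(10, 120))]
    long_tasks = [rng.uniform(20, 30) for _ in range(11)]
    return short + long_tasks

def arrivals(hours=24, mean_gap=45, seed=1):
    """(creation time in minutes, list of task durations) for a day of
    bursts; the 45-minute mean gap between bursts is a guess."""
    rng = random.Random(seed)
    t = 0.0
    out = []
    while t < hours * 60:
        out.append((t, burst(rng)))
        t += rng.expovariate(1.0 / mean_gap)
    return out
```

Feeding something like this into the fixed-pool model would let us compare steady load against bursty load directly.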

@djmitche

These models are using the "bursty" creation timing and variation of durations as sampled in reality. If there are other pools we should also sample, I'm open to suggestions.

@djmitche

OK, a few datapoints:

| Workers | Overprovisioned workers | Overprovisioned worker time | Minimum wait | Maximum wait | Mean wait | Median wait |
|---|---|---|---|---|---|---|
| 300 | 24,073 | 8 months (21665700000 ms) | ~5 hours (16219272 ms) | ~21 hours (74301666 ms) | ~13 hours (46046170 ms) | ~13 hours (46046352.5 ms) |
| 350 | 46,860 | over 1 year (42174000000 ms) | none | ~7 hours (26184234 ms) | ~5 hours (16708393 ms) | ~5 hours (16753458 ms) |
| 400 | 61,559 | almost 2 years (55403100000 ms) | none | ~4 hours (14271387 ms) | ~1 hour (5081847 ms) | ~1 hour (4715043 ms) |
| 450 | 76,809 | about 2 years (69128100000 ms) | none | ~3 hours (9374780 ms) | 33 minutes (1951899 ms) | 18 minutes (1090774 ms) |
| 500 | 90,186 | over 2 years (81167400000 ms) | none | ~2 hours (6786750 ms) | 15 minutes (910757 ms) | 5 minutes (320395.5 ms) |

(The nonzero minimum wait at 300 workers makes sense, since a big backlog of tasks was generated during the ramp-up period.)

If we agree that wait times of 5-15 minutes are OK, then a maximum pool size of 500 might be reasonable. The number of over-provisioned workers here is obviously untenable, but even a very basic provisioning algorithm (like the simple estimate we have now) could bring that down quite a bit.

500 workers would cost about $410/day.

500 workers for 5 days is about 7 compute-years; assuming we could pretty easily trim out 2 of those compute-years, that cuts the cost by two sevenths, to about $293/day. I don't know what our current spending on this pool is, but perhaps just setting maxCapacity=500 would reduce it?
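Redoing that arithmetic (the dollar figures are the ones quoted above, not recomputed from current EC2 pricing):

```python
workers = 500
days = 5
compute_years = workers * days / 365   # 2500 worker-days, ~6.8 compute-years
daily_cost = 410.0                     # quoted $/day for 500 workers
trimmed = daily_cost * (1 - 2 / 7)     # trim ~2 of ~7 compute-years
# trimmed comes out to about $293/day
```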

Anyway, I think that gives a pretty good answer to the question in this issue, so I'll call it finished.
