Provide a way to measure in the simulator the impact of work-pool size on provisioning. #11
#3375 will go a long way toward measuring this. Here's a situation with a workload of 6 (that is, 6 minutes of work being injected per minute) and 4 workers: and here's the same with 6 workers: A workload of 6.1: OK, it looks the same, but mousing over shows that there are about 60 pending tasks at the end of the 24-hour sample period.
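The "workload of 6" idea above can be illustrated with a minimal sketch (this is not the simulator's actual code, just a deterministic minute-by-minute model; real task durations are stochastic, which is why the simulator shows ~60 pending tasks rather than the raw backlog in work-minutes):

```python
# Hypothetical sketch: each minute, `workload` minutes of work are injected,
# and each of `workers` workers completes one minute of work.
def pending_after(workload, workers, minutes=24 * 60):
    """Return the backlog (in work-minutes) after a 24-hour sample period."""
    pending = 0.0
    for _ in range(minutes):
        pending += workload                      # work injected this minute
        pending = max(0.0, pending - workers)    # work completed this minute
    return pending
```

With workload 6 and 6 workers the backlog stays at zero; at workload 6.1 the extra 0.1 work-minutes per minute accumulate steadily over the 24-hour run, which matches the "looks the same, but pending is nonzero" observation above.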
With some experimentation with the firefox-ci gecko-t/t-linux-large pool, I was able to show that 315 workers are more-or-less able to hold the pending steady at 10,000 pending tasks (!) Note: ignore the diagonal lines at the end of the graph -- these graphs are showing the ramp-down phase where no load is injected. We should be hiding that from the output. Here's 350 workers: and 400: Anyway, 400 m5.large's in us-east-1 would cost about $330/day. We don't have data on what the wait-times would be for such an arrangement (that's #10), but it looks like it's probably still pretty bad. Still, that provides a baseline, at least. I think that comparing a fixed pool size to total wait time over the range from never-a-pending-task to falling-far-behind would wrap this issue up.
(in particular, it's interesting that 315 workers seemed to be steady long-term, but with a VERY high pending, and that it took quite a few more workers to get that pending down)
I think load in a lot of pools tends to be much more bursty than the load you are generating here. Looking at android-components (linked in the original bug), there are between 10-120 tasks that take 2-8m and 11 tasks that take 20-30m all at once. And the arrival of these tasks varies, often with enough time for the pool to idle between, but likely also occasionally with overlap between them.
These models are using the "bursty" creation timing and variation of durations as sampled in reality. If there are other pools we should also sample, I'm open to suggestions. |
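The "sampled in reality" approach mentioned above could look something like the following sketch (assumed, not the simulator's actual code; `generate_load` and the `observed` trace are hypothetical names): bursts are reproduced by resampling observed (inter-arrival, duration) pairs from a real pool's history rather than drawing from a smooth distribution.

```python
import random

def generate_load(observed, n_tasks, rng=None):
    """Resample an empirical trace of (inter_arrival_minutes, duration_minutes)
    pairs into a list of (creation_time, duration) tasks."""
    rng = rng or random.Random()
    t = 0.0
    tasks = []
    for _ in range(n_tasks):
        gap, duration = rng.choice(observed)
        t += gap                      # bursts appear as runs of zero gaps
        tasks.append((t, duration))
    return tasks

# e.g. a bursty trace: many short tasks arriving together, then a lull
observed = [(0.0, 5.0)] * 20 + [(30.0, 25.0)]
tasks = generate_load(observed, 100, random.Random(42))
```

Because the pairs are drawn from the real trace, long idle gaps and simultaneous arrivals show up in the generated load with roughly their observed frequencies.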
OK, a few datapoints: 300 workers:
(that minimum makes sense since a big backlog of tasks was generated during the ramp-up period) 350 workers:
400 workers:
450 workers:
500 workers:
if we agree that wait times of 5-15 minutes are OK, then a maximum pool size of 500 might be reasonable. The number of over-provisioned workers here is obviously untenable, but even a very basic provisioning algorithm (like the simple estimate we have now) could bring that down quite a bit. 500 workers would cost about $410/day. 500 workers for 5 days is about 7 compute-years, and assuming we could pretty easily trim out 2 compute-years, cutting that down by two sevenths to about $293/day. I don't know what our current spending on this pool is, but perhaps just setting maxCapacity=500 would reduce it? Anyway, I think that gives a pretty good answer to the question in this issue, so I'll call it finished.
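The back-of-envelope arithmetic above works out as follows (all figures are rounded estimates from the discussion, not actual billing data):

```python
# 500 workers running continuously for a 5-day simulated week
workers = 500
days = 5
worker_days = workers * days              # 2500 worker-days
compute_years = worker_days / 365         # ~6.8, i.e. "about 7 compute-years"

daily_cost = 410                          # ~$410/day for 500 m5.large (estimate)
# trimming ~2 of the ~7 compute-years cuts cost by two sevenths:
trimmed_daily = daily_cost * (1 - 2 / 7)  # ~$293/day
```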
@tomprince commented on Tue Aug 04 2020
https://bugzilla.mozilla.org/show_bug.cgi?id=1637216 for context
There is potentially a trade-off between total cost and end-to-end time for a graph (or graphs) when changing the worker-pool size. There would be no trade-off if there were zero over-provisioning and zero overhead, but neither of those is currently the case.
@djmitche commented on Tue Aug 04 2020
Can you rephrase this? I'm not sure what "overhead over-provisioning" means.
@tomprince commented on Tue Aug 04 2020
*zero over-provisioning
@djmitche commented on Tue Aug 04 2020
that makes a lot more sense, thanks :)
@djmitche commented on Wed Aug 05 2020
I think that this won't require any special functionality from the simulator; it can be represented by a simulation run with a particular set of parameters, maybe with a provisioner that simply maintains a specific number N of running workers.
I expect we'll see that there is an intermediate range of N: with too few workers pending grows without bound, and with too many, capacity sits idle.
Within that intermediate range, the total work provided by the workers (integral of capacity over time) is greater than the total work required (integral of task duration over time). Different N's in that range balance spare capacity to handle spikes of task load against the cost of that capacity. In the visualization, we would see that lower N's in that range will have pending counts that fall more slowly, and at higher N's pending will fall more quickly.
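The balance condition described above (integral of capacity vs. integral of task duration) can be sketched as a simple inequality; this is an illustrative check, not the simulator's implementation:

```python
def can_keep_up(n_workers, task_durations, run_minutes):
    """A fixed pool of n_workers can eventually drain the queue iff its
    total capacity over the run covers the total work-minutes injected."""
    total_capacity = n_workers * run_minutes   # integral of capacity over time
    total_work = sum(task_durations)           # integral of task duration
    return total_capacity >= total_work

# e.g. 24 hours of load at 6 work-minutes per minute:
durations = [6.0] * (24 * 60)
```

How much `total_capacity` exceeds `total_work` is exactly the "spare capacity" being traded against cost: just above the threshold, pending drains slowly; well above it, pending drains quickly but more workers sit idle.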
Anyway, this remains a good hypothesis to test out when the simulator is ready.