Improve matching simulator isolation group metrics #6505
Merged
Record isolation group information for additional events and use it to calculate the median, mean, and max latency of events per task list and isolation group. Additionally, record the percentage of tasks that are dispatched to a poller in the same isolation group, per task list and isolation group. With the current implementation, no scenario leaks tasks to another isolation group.
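As a rough illustration of the aggregation (not the simulator's actual code; the types and field names below are made up), latencies are bucketed per (task list, isolation group) pair and summarized, and the same-group ratio is the share of tasks whose poller's isolation group matches the task's group:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// bucket identifies one (task list, isolation group) pair. Hypothetical type;
// the simulator's real event structures differ.
type bucket struct {
	taskList       string
	isolationGroup string
}

type stats struct {
	latencies []time.Duration // task latencies observed for this bucket
	tasks     int             // total tasks dispatched in this bucket
	sameGroup int             // tasks dispatched to a poller in the same isolation group
}

// summarize prints median/mean/max latency and the same-group dispatch
// percentage per (task list, isolation group).
func summarize(byBucket map[bucket]*stats) {
	for b, s := range byBucket {
		if len(s.latencies) == 0 || s.tasks == 0 {
			continue
		}
		sort.Slice(s.latencies, func(i, j int) bool { return s.latencies[i] < s.latencies[j] })
		var sum time.Duration
		for _, l := range s.latencies {
			sum += l
		}
		median := s.latencies[len(s.latencies)/2]
		mean := sum / time.Duration(len(s.latencies))
		max := s.latencies[len(s.latencies)-1]
		samePct := 100 * float64(s.sameGroup) / float64(s.tasks)
		fmt.Printf("%s/%s: median=%v mean=%v max=%v same-group=%.1f%%\n",
			b.taskList, b.isolationGroup, median, mean, max, samePct)
	}
}

func main() {
	byBucket := map[bucket]*stats{
		{taskList: "tl", isolationGroup: "a"}: {
			latencies: []time.Duration{10 * time.Millisecond, 20 * time.Millisecond, 40 * time.Millisecond},
			tasks:     3,
			sameGroup: 3,
		},
	}
	summarize(byBucket)
}
```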
Additionally, provide a definition of getAllIsolationGroups so that the matching simulator doesn't deadlock when task list manager initialization panics.
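Conceptually, the fix just gives the simulator a function that returns the full set of isolation groups used by the scenarios, so task list manager initialization has something valid to work with. A minimal sketch (the signature and wiring are assumptions, not the actual Cadence code):

```go
// Sketch only: the real function lives in the matching simulator setup and its
// signature/wiring may differ. The point is simply that it must return the
// full, non-nil set of isolation groups used by the scenarios, so task list
// manager initialization does not panic and leave the simulator deadlocked.
func getAllIsolationGroups() []string {
	return []string{"a", "b", "c", "d"}
}
```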
Create 6 new scenarios for zonal isolation. The first three (few_pollers, many_pollers, and single_partition) test a scenario where the total task throughput is easily manageable with any number of pollers, but the number of pollers/partitions significantly impacts performance. The next two (zonal_isolation and zonal_isolation_skew) show a higher-throughput scenario that should still be manageable by the specified pollers for each isolation group. The latter skews the tasks to the maximum that pollers from a single group should be able to process (64/12/12/12 vs 25/25/25/25). The final scenario, zonal_isolation_skew_extreme, skews the tasks heavily (90/3/3/3), beyond what a single group can handle. The skew ratios act as weights when assigning isolation groups to generated tasks, as in the sketch below.
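A small illustration of that weighted assignment (purely illustrative, with made-up names; the scenario files express this differently):

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickIsolationGroup chooses a group with probability proportional to its
// weight, e.g. weights 64/12/12/12 for zonal_isolation_skew or 90/3/3/3 for
// zonal_isolation_skew_extreme.
func pickIsolationGroup(r *rand.Rand, groups []string, weights []int) string {
	total := 0
	for _, w := range weights {
		total += w
	}
	n := r.Intn(total)
	for i, w := range weights {
		if n < w {
			return groups[i]
		}
		n -= w
	}
	return groups[len(groups)-1] // unreachable with positive weights
}

func main() {
	r := rand.New(rand.NewSource(42))
	groups := []string{"a", "b", "c", "d"}
	weights := []int{64, 12, 12, 12}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickIsolationGroup(r, groups, weights)]++
	}
	fmt.Println(counts) // roughly 6400/1200/1200/1200
}
```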
I struggled to get table headers working correctly, so the output from the script is a little rough (the columns are median, avg, max).
With 4 partitions, 4 isolation groups (a-d), 8 pollers per group (25ms/task, so 40 tasks/sec per poller or 320 tasks/sec per group), and 500 tasks/sec we see the following (capacity math after the results):
non-skewed (25/25/25/25):
zonal_skew (64/12/12/12):
zonal_skew_extreme (90/3/3/3):
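For reference, the back-of-envelope capacity math behind these scenarios, assuming tasks in a group are only processed by that group's pollers:

```
per poller:   1 task / 25ms            = 40 tasks/sec
per group:    8 pollers x 40 tasks/sec = 320 tasks/sec
25/25/25/25:  0.25 x 500 = 125 tasks/sec per group   (well under 320)
64/12/12/12:  0.64 x 500 = 320 tasks/sec to group a  (exactly at capacity)
90/3/3/3:     0.90 x 500 = 450 tasks/sec to group a  (over capacity)
```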
What changed?
Why?
How did you test it?
Potential risks
Release notes
Documentation Changes