Improve matching simulator isolation group metrics #6505

Merged
merged 1 commit into from
Nov 21, 2024

natemort
Member

Record isolation group information for additional events and use it to calculate the median, mean, and max latency of events per task list and isolation group. Also record, per task list and isolation group, the percentage of tasks that are dispatched to a poller in the same isolation group. With the current implementation, no scenario leaks tasks to another isolation group.
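
As a rough illustration of the aggregation described above, here is a minimal sketch in Python. It is not the simulator's actual code: the event shape and the field names (`task_list`, `task_group`, `poller_group`, `latency_ms`) are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median


def latency_stats(events):
    """Summarize latencies per (task list, isolation group).

    `events` is an iterable of dicts with hypothetical keys:
    task_list, task_group, poller_group, latency_ms.
    """
    latencies = defaultdict(list)
    same_group = defaultdict(int)
    for e in events:
        key = (e["task_list"], e["task_group"])
        latencies[key].append(e["latency_ms"])
        # Count tasks that were picked up by a poller from their own group.
        if e["poller_group"] == e["task_group"]:
            same_group[key] += 1

    stats = {}
    for key, values in latencies.items():
        stats[key] = {
            "median": median(values),
            "mean": mean(values),
            "max": max(values),
            "same_group_pct": 100.0 * same_group[key] / len(values),
        }
    return stats


# Made-up sample events, only to exercise the function.
events = [
    {"task_list": "my-tasklist", "task_group": "a", "poller_group": "a", "latency_ms": 1},
    {"task_list": "my-tasklist", "task_group": "a", "poller_group": "a", "latency_ms": 3},
    {"task_list": "my-tasklist", "task_group": "a", "poller_group": "b", "latency_ms": 8},
    {"task_list": "my-tasklist", "task_group": "b", "poller_group": "b", "latency_ms": 2},
]
for (task_list, group), s in sorted(latency_stats(events).items()):
    print(task_list, group, s["median"], s["mean"], s["max"], s["same_group_pct"])
```

A `same_group_pct` below 100 for any key would indicate tasks leaking to pollers in another isolation group.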

Additionally provide a definition of getAllIsolationGroups so that the matching simulator doesn't deadlock due to panics in task list manager initialization.

Create 6 new scenarios for zonal isolation. The first three (few_pollers, many_pollers, and single_partition) test a case where the total task throughput is easily manageable with any number of pollers, but the number of pollers and partitions significantly affects performance. The next two (zonal_isolation, zonal_isolation_skew) show a higher-throughput case that should still be manageable by the pollers specified for each isolation group. zonal_isolation distributes tasks evenly (25/25/25/25), while zonal_isolation_skew skews them (64/12/12/12) up to the maximum that the pollers from a single group should be able to process.

The final scenario, zonal_isolation_skew_extreme, has the tasks heavily skewed (90/3/3/3) beyond what a single group can handle.
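
To make the skew ratios concrete, the following sketch shows one way a task generator could assign isolation groups by weight. It is an illustration only: `assign_groups`, the fixed seed, the task count, and the choice of which group carries the large share are all assumptions, not taken from the simulator.

```python
import random


def assign_groups(weights, n_tasks, seed=0):
    """Pick an isolation group for each task, proportionally to `weights`.

    `weights` maps group name to relative weight, e.g. {"a": 64, "b": 12, ...}.
    A fixed seed keeps the sketch reproducible.
    """
    rng = random.Random(seed)
    groups = list(weights)
    return rng.choices(groups, weights=[weights[g] for g in groups], k=n_tasks)


# 64/12/12/12 skew; assigning the 64 share to group "a" is arbitrary here.
skew = {"a": 64, "b": 12, "c": 12, "d": 12}
tasks = assign_groups(skew, 5000)
counts = {g: tasks.count(g) for g in skew}
shares = {g: round(100 * c / len(tasks), 1) for g, c in counts.items()}
print(shares)
```

Swapping in `{"a": 25, "b": 25, "c": 25, "d": 25}` or `{"a": 90, "b": 3, "c": 3, "d": 3}` gives the other two distributions mentioned in this description.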

I struggled to get table headers working correctly, so the output of the script is a little rough. In each per-group row the three numeric columns are median, avg, and max.

With 4 partitions, 4 isolation groups (a-d), 8 pollers per group (25ms/task, so 40 tasks/sec per poller or 320 tasks/sec per group), and 500 tasks/sec we see:

zonal_isolation, no skew (25/25/25/25):

Avg Task latency (ms): 150.791
P50 Task latency (ms): 14
P75 Task latency (ms): 227
P95 Task latency (ms): 563
P99 Task latency (ms): 898
Max Task latency (ms): 6887
Latency per isolation group:
     a  6   150.5264  6887
     b  10  127.7648  2479
     c  58  180.6496  2437
     d  60  144.2216  692
Latency per isolation group and task list:
     /__cadence_sys/my-tasklist/1  a  58   93.47636363636363   410
     /__cadence_sys/my-tasklist/1  b  164  207.4746376811594   633
     /__cadence_sys/my-tasklist/1  c  105  188.95289855072463  2437
     /__cadence_sys/my-tasklist/1  d  138  153.3804713804714   436
     /__cadence_sys/my-tasklist/2  a  89   124.59352517985612  516
     /__cadence_sys/my-tasklist/2  b  115  135.1615120274914   504
     /__cadence_sys/my-tasklist/2  c  374  392.2730375426621   1031
     /__cadence_sys/my-tasklist/2  d  329  292.938566552901    692
     /__cadence_sys/my-tasklist/3  a  221  493.00387596899225  6887
     /__cadence_sys/my-tasklist/3  b  80   221.56028368794327  2479
     /__cadence_sys/my-tasklist/3  c  195  206.1950354609929   592
     /__cadence_sys/my-tasklist/3  d  102  170.52650176678446  659
     my-tasklist                   a  1    1.4123006833712983  7
     my-tasklist                   b  1    1.57356608478803    20
     my-tasklist                   c  1    1.4486215538847118  29
     my-tasklist                   d  1    1.6790450928381964  34

zonal_isolation_skew (64/12/12/12):

Avg Task latency (ms): 498.544
P50 Task latency (ms): 281
P75 Task latency (ms): 838
P95 Task latency (ms): 1750
P99 Task latency (ms): 2344
Max Task latency (ms): 2493
Latency per isolation group:
     a  2    173.72666666666666  2217
     b  38   295.635             1434
     c  10   258.0233333333333   2391
     d  592  642.5903125         2493
Latency per isolation group and task list:
     /__cadence_sys/my-tasklist/1  a  235  312.9685039370079   2217
     /__cadence_sys/my-tasklist/1  b  136  321.4307692307692   1161
     /__cadence_sys/my-tasklist/1  c  479  475.09285714285716  2391
     /__cadence_sys/my-tasklist/1  d  817  721.9948119325551   1684
     /__cadence_sys/my-tasklist/2  a  385  391.1940298507463   1115
     /__cadence_sys/my-tasklist/2  b  179  338.3071428571429   1068
     /__cadence_sys/my-tasklist/2  c  502  470.87301587301585  1314
     /__cadence_sys/my-tasklist/2  d  491  412.90970350404314  1044
     /__cadence_sys/my-tasklist/3  a  20   95.82113821138212   450
     /__cadence_sys/my-tasklist/3  b  457  582.7615894039735   1434
     /__cadence_sys/my-tasklist/3  c  144  212.62962962962962  2255
     /__cadence_sys/my-tasklist/3  d  310  412.041095890411    1094
     my-tasklist                   a  1    1.3101851851851851  4
     my-tasklist                   b  1    1.312849162011173   3
     my-tasklist                   c  1    1.3366834170854272  3
     my-tasklist                   d  793  932.5621734587252   2493

zonal_isolation_skew_extreme (90/3/3/3):

Avg Task latency (ms): 2390.04
P50 Task latency (ms): 1929
P75 Task latency (ms): 4521
P95 Task latency (ms): 6084
P99 Task latency (ms): 6518
Max Task latency (ms): 12804
Latency per isolation group:
     a  76    856.4518072289156   8593
     b  2     327.05389221556885  6182
     c  2     783.0538922155689   5568
     d  2376  2582.8126666666667  12804
Latency per isolation group and task list:
     /__cadence_sys/my-tasklist/1  a  523   747.7428571428571   5037
     /__cadence_sys/my-tasklist/1  b  2     391.77777777777777  5552
     /__cadence_sys/my-tasklist/1  c  1     1.4444444444444444  3
     /__cadence_sys/my-tasklist/1  d  2722  2812.715667311412   6508
     /__cadence_sys/my-tasklist/2  a  978   1576.3103448275863  8593
     /__cadence_sys/my-tasklist/2  b  227   577.7878787878788   6182
     /__cadence_sys/my-tasklist/2  c  2482  2333.975            5568
     /__cadence_sys/my-tasklist/2  d  1735  2047.7808080808081  4825
     /__cadence_sys/my-tasklist/3  a  1097  1712                7237
     /__cadence_sys/my-tasklist/3  b  136   594.9               6031
     /__cadence_sys/my-tasklist/3  c  775   1007.5945945945946  2687
     /__cadence_sys/my-tasklist/3  d  3506  3209.5471512770137  12804
     my-tasklist                   a  1     1.5573770491803278  3
     my-tasklist                   b  1     1.2542372881355932  2
     my-tasklist                   c  1     1.4444444444444444  3
     my-tasklist                   d  1785  2345.4643347050755  6111
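
The capacity arithmetic behind these three runs can be checked with a short sketch. It uses only the figures quoted in this description (8 pollers per group, 25ms/task, 500 tasks/sec). The helper name `group_load` is hypothetical, and the model ignores partitioning and forwarding overhead.

```python
def group_load(total_tasks_per_sec, share_pct, pollers, task_ms):
    """Return (demand, capacity, utilization) for the busiest isolation group.

    capacity: tasks/sec the group's pollers can process
    demand:   tasks/sec routed to that group
    """
    capacity = pollers * (1000 / task_ms)
    demand = total_tasks_per_sec * share_pct / 100
    return demand, capacity, demand / capacity


for name, share in [("25/25/25/25", 25), ("64/12/12/12", 64), ("90/3/3/3", 90)]:
    demand, capacity, utilization = group_load(500, share, pollers=8, task_ms=25)
    print(f"{name}: demand={demand:.0f}/s capacity={capacity:.0f}/s "
          f"utilization={utilization:.2f}")
```

Under these assumptions the busiest group sees 125, 320, and 450 tasks/sec against a capacity of 320 tasks/sec. The 64/12/12/12 skew therefore sits exactly at the limit, and the 90/3/3/3 skew exceeds it by roughly 40%.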

What changed?

  • Fix the matching simulator for zonal isolation
  • Include isolation group info in additional events
  • Add new zonal isolation scenarios

Why?

  • Enable better benchmarking and comparison for subsequent changes to matching and zonal isolation.

How did you test it?

  • Running the matching simulator.

Potential risks

Release notes

Documentation Changes

tasks:
  - numtaskgenerators: 2
    taskspersecond: 80
    maxtasktogenerate: 3000
Member

Let's set max tasks to generate to 5k on all scenarios, and please share the total duration.

Member Author

Done

@natemort natemort force-pushed the isolation_sim branch 2 times, most recently from 489352a to 9a35d53 Compare November 21, 2024 18:02
@natemort natemort merged commit a0f83db into cadence-workflow:master Nov 21, 2024
17 checks passed
3 participants