Improve matching simulator isolation group metrics #6505
Merged
Record isolation group information for additional events and use it to calculate the median, mean, and max latency of events per task list and isolation group. Additionally, record the percentage of tasks that are dispatched to a poller in the same isolation group, per task list and isolation group. With the current implementation, no scenario leaks tasks to another isolation group.
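As a rough illustration of the aggregation (not the simulator's actual code; the types and field names below are made up), latencies are bucketed per (task list, isolation group) pair and summarized, and the same-group ratio is the share of tasks whose poller's isolation group matches the task's group:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// bucket identifies one (task list, isolation group) pair. Hypothetical type;
// the simulator's real event structures differ.
type bucket struct {
	taskList       string
	isolationGroup string
}

type stats struct {
	latencies []time.Duration // task latencies observed for this bucket
	tasks     int             // total tasks dispatched in this bucket
	sameGroup int             // tasks dispatched to a poller in the same isolation group
}

// summarize prints median/mean/max latency and the same-group dispatch
// percentage per (task list, isolation group).
func summarize(byBucket map[bucket]*stats) {
	for b, s := range byBucket {
		if len(s.latencies) == 0 || s.tasks == 0 {
			continue
		}
		sort.Slice(s.latencies, func(i, j int) bool { return s.latencies[i] < s.latencies[j] })
		var sum time.Duration
		for _, l := range s.latencies {
			sum += l
		}
		median := s.latencies[len(s.latencies)/2]
		mean := sum / time.Duration(len(s.latencies))
		max := s.latencies[len(s.latencies)-1]
		samePct := 100 * float64(s.sameGroup) / float64(s.tasks)
		fmt.Printf("%s/%s: median=%v mean=%v max=%v same-group=%.1f%%\n",
			b.taskList, b.isolationGroup, median, mean, max, samePct)
	}
}

func main() {
	byBucket := map[bucket]*stats{
		{taskList: "tl", isolationGroup: "a"}: {
			latencies: []time.Duration{10 * time.Millisecond, 20 * time.Millisecond, 40 * time.Millisecond},
			tasks:     3,
			sameGroup: 3,
		},
	}
	summarize(byBucket)
}
```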
Additionally, provide a definition of getAllIsolationGroups so that the matching simulator doesn't deadlock when task list manager initialization panics.
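Conceptually, the fix just gives the simulator a function that returns the full set of isolation groups used by the scenarios, so task list manager initialization has something valid to work with. A minimal sketch (the signature and wiring are assumptions, not the actual Cadence code):

```go
// Sketch only: the real function lives in the matching simulator setup and its
// signature/wiring may differ. The point is simply that it must return the
// full, non-nil set of isolation groups used by the scenarios, so task list
// manager initialization does not panic and leave the simulator deadlocked.
func getAllIsolationGroups() []string {
	return []string{"a", "b", "c", "d"}
}
```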
Create 6 new scenarios for zonal isolation. The first three (few_pollers, many_pollers, and single_partition) test a scenario where the total task throughput is easily manageable with any number of pollers, but the number of pollers/partitions significantly impacts performance. The next two (zonal_isolation and zonal_isolation_skew) show a higher-throughput scenario that should still be manageable by the specified pollers for each isolation group. The latter skews the tasks to the maximum that pollers from a single group should be able to process (64/12/12/12 vs 25/25/25/25). The final scenario, zonal_isolation_skew_extreme, skews the tasks heavily (90/3/3/3), beyond what a single group can handle. The skew ratios act as weights when assigning isolation groups to generated tasks, as in the sketch below.
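A small illustration of that weighted assignment (purely illustrative, with made-up names; the scenario files express this differently):

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickIsolationGroup chooses a group with probability proportional to its
// weight, e.g. weights 64/12/12/12 for zonal_isolation_skew or 90/3/3/3 for
// zonal_isolation_skew_extreme.
func pickIsolationGroup(r *rand.Rand, groups []string, weights []int) string {
	total := 0
	for _, w := range weights {
		total += w
	}
	n := r.Intn(total)
	for i, w := range weights {
		if n < w {
			return groups[i]
		}
		n -= w
	}
	return groups[len(groups)-1] // unreachable with positive weights
}

func main() {
	r := rand.New(rand.NewSource(42))
	groups := []string{"a", "b", "c", "d"}
	weights := []int{64, 12, 12, 12}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickIsolationGroup(r, groups, weights)]++
	}
	fmt.Println(counts) // roughly 6400/1200/1200/1200
}
```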
I struggled to get table headers working correctly, so the output from the script is a little rough (the columns are median, avg, max).
With 4 partitions, 4 isolation groups (a-d), 8 pollers per group (25ms/task, so 40 tasks/sec per poller or 320 tasks/sec per group), and 500 tasks/sec we see the following (capacity math after the results):
non-skewed (25/25/25/25):
zonal_skew (64/12/12/12):
zonal_skew_extreme (90/3/3/3):
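For reference, the back-of-envelope capacity math behind these scenarios, assuming tasks in a group are only processed by that group's pollers:

```
per poller:   1 task / 25ms            = 40 tasks/sec
per group:    8 pollers x 40 tasks/sec = 320 tasks/sec
25/25/25/25:  0.25 x 500 = 125 tasks/sec per group   (well under 320)
64/12/12/12:  0.64 x 500 = 320 tasks/sec to group a  (exactly at capacity)
90/3/3/3:     0.90 x 500 = 450 tasks/sec to group a  (over capacity)
```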
What changed?
Why?
How did you test it?
Potential risks
Release notes
Documentation Changes