[Performance Analysis] DPM/ACA gRPC Performance Report #384
base: master
Changes from 10 commits
76a66fc
effc445
5e5e7a1
051c7a7
9fd3615
8a43224
85cdfc7
8bc0ec3
c0d9ea8
d759317
7fb47ed
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,215 @@ | ||
= ALCOR CONTROL AGENT-ALCOR DATAPLANE MANAGER Test Report | ||
Xiaodong Zhang <[email protected]> | ||
v0.1, 2020-09-15 | ||
:toc: right | ||
:imagesdir: ../../images | ||
|
||
*ALCOR CONTROL AGENT-ALCOR DATAPLANE MANAGER Test Report* | ||
|
||
**Abstract:** This document presents the gRPC performance test results and analysis for Alcor Dataplane Manager (DPM) and Alcor Control Agent (ACA). | ||
|
||
[arabic] | ||
. *Environment description* | ||
|
||
[cols=",",options="header",] | ||
|=== | ||
|*Host* |*IP address* | ||
|*A* |*10.213.43.162* | ||
|*B* |*10.213.43.163* | ||
|*C* |*10.213.43.164* | ||
|*D* |*10.213.43.166* | ||
|*E* |*10.213.43.187* | ||
|*F* |*10.213.43.188* | ||
|=== | ||
|
||
[cols=",,",options="header",] | ||
|=== | ||
|*Hardware Configuration:* | | | ||
| |*A, B, C, D, E running ALCOR CONTROL AGENT* |*F running as ALCOR DATAPLANE MANAGER* | ||
|*CPU* |40 cores |56 cores | ||
|*Model Name* |Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz |Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz | ||
|*cpu MHz* |2231.772 |2599.079 | ||
|*Memory* |192GB |386GB | ||
|*Network* |NetXtreme BCM5719 Gigabit Ethernet PCIe (GB network) |82599ES 10-Gigabit SFI/SFP+ Network Connection | ||
Review comment: Check the network bandwidth. The results show that the DPM client is network bound, so we need to revisit this configuration. |
||
|*Storage* |LSI raid (no ssd) |AVAGO (no ssd) | ||
Review comment: I think the DPM machine (.188) has 6 x 1600 GB SSDs. Could you confirm? |
||
|*Software Configuration:* | | | ||
|*System* |Ubuntu 18.04 LTS | | ||
|*Server* |ALCOR CONTROL AGENT on A, B, C, D, E | | ||
|*Client* |ALCOR DATAPLANE MANAGER on F | | ||
|=== | ||
|
||
[arabic, start=2] | ||
. *Test step:* | ||
|
||
After an initial warm-up, F sends goal state messages to A-E concurrently and then waits for the responses. Each payload carries a different goal state message. | ||
Review comment: Can you upload the test scripts or code that generate the payload to https://github.com/futurewei-cloud/alcor-int/tree/master/tools? This can be done in a separate PR.
Review comment: Some important points of test set up include
Readers would get confused looking at the images if they don't understand the setup. |
||
|
||
Each of A-E runs 2600 ACA instances. The ACA code was modified to skip the ovsdb and mq operations. | ||
Review comment: 2,600 or 2,000? I thought 2,000 is the stable setup. Need to update the image accordingly. |
||
|
||
image::p1.png["Test Deployment",width=488,height=302] | ||
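
The test step above can be sketched roughly as follows. This is a minimal illustration, not the actual test driver: `send_goal_state`, `timed_send`, `run_test`, and the `warmup_messages` parameter are hypothetical names, and the real gRPC call is replaced by a short sleep so the example is self-contained.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_goal_state(target, payload):
    """Hypothetical stand-in for the real gRPC call to one ACA instance."""
    time.sleep(0.001)  # simulate a short network round trip
    return True  # True means the ACA acknowledged the message


def timed_send(job):
    """Send one message and return (success, round-trip latency in ms)."""
    target, payload = job
    start = time.perf_counter()
    ok = send_goal_state(target, payload)
    return ok, (time.perf_counter() - start) * 1000.0


def run_test(targets, threads, warmup_messages=3):
    """Warm up, then send one distinct payload to every target concurrently."""
    # Warm-up: a few untimed messages so first-call overhead is excluded.
    for i in range(warmup_messages):
        send_goal_state(targets[0], f"warmup-{i}")

    # Each target gets its own payload, as in the test description.
    jobs = [(target, f"goal-state-{i}") for i, target in enumerate(targets)]

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(timed_send, jobs))
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    # Keep only the latencies of successful round trips.
    latencies = [ms for ok, ms in results if ok]
    return latencies, elapsed_ms
```

Calling `run_test(targets, threads=32)` with a list of ACA endpoints would return the per-connection latencies and the overall elapsed time, which correspond to the two kinds of time cost reported later in this document.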
|
||
[arabic, start=3] | ||
. *Test results* | ||
Review comment: We would like to describe the latency in the following table as the latency for a single connection. |
||
|
||
[cols=",,,,,,,,",options="header",] | ||
|=== | ||
|*Log Time Cost Summary (ms)* | | | | | | | | | ||
|*Thread* |*1-1^st^* |*1-2^nd^* |*32-1^st^* |*32-2^nd^* |*56-1^st^* |*56-2^nd^* |*128-1^st^* |*128-2^nd^* | ||
Review comment: Good time to explain what "1st" and "2nd" mean here. I know you talked about it briefly before. We can explain what the difference is between the two. |
||
|*Average* |11.14 |10.37 |16.50 |14.40 |26.54 |25.60 |70.64 |61.54 | ||
|*MIN* |9 |9 |10 |9 |10 |9 |10 |9 | ||
|*MAX* |48 |18 |223 |116 |364 |160 |603 |352 | ||
|*MEDIAN* |11 |10 |11 |11 |11 |11 |11 |11 | ||
|*>AVERAGE* |28.88% |40.35% |21.48% |20.09% |21.80% |20.09% |21.03% |20.00% | ||
|*<AVERAGE* |71.12% |59.65% |78.52% |79.91% |78.20% |79.91% |78.97% |80.00% | ||
|*99th percentile* |13 |12 |67 |43 |107.04 |116.03 |331 |312 | ||
|*95th percentile* |12 |12 |36 |31 |83 |89 |305 |277 | ||
|*90th percentile* |12 |11 |32 |28 |78 |84 |292 |262 | ||
|=== | ||
Review comment: The rows and columns of this table are transposed relative to the next one. Could we make them consistent? |
||
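
For reference, the rows of the summary table can be derived from a list of per-connection latencies along the following lines. This is a sketch under assumptions: `summarize` and `percentile` are hypothetical helpers, and the nearest-rank percentile definition is chosen for simplicity. The fractional values in the table (for example 107.04) suggest the original tool may have used an interpolating definition instead.

```python
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile for an integer pct (assumed definition)."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)  # integer ceiling of pct * n / 100
    return sorted_values[rank - 1]


def summarize(latencies_ms):
    """Compute the summary rows used in the table from raw latencies (ms)."""
    values = sorted(latencies_ms)
    avg = statistics.mean(values)
    above = sum(1 for v in values if v > avg)
    return {
        "average": avg,
        "min": values[0],
        "max": values[-1],
        "median": statistics.median(values),
        # Share of samples strictly above the average, in percent.
        "pct_above_average": 100.0 * above / len(values),
        # Share of samples at or below the average, in percent.
        "pct_not_above_average": 100.0 * (len(values) - above) / len(values),
        "p99": percentile(values, 99),
        "p95": percentile(values, 95),
        "p90": percentile(values, 90),
    }
```

A large gap between the median and the average, as in the 128-thread columns, indicates a long tail: most connections stay near the median while a minority take far longer.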
|
||
|
||
[arabic, start=4] | ||
. *Test results analysis* | ||
[loweralpha] | ||
.. {blank} | ||
+ | ||
____ | ||
Review comment: What is "____" for? I think this creates some formatting issues if you check https://github.com/haboy52581/alcor/blob/feature/add-adoc/docs/modules/ROOT/pages/performance_analysis/dpmAcaGrpcTest.adoc#L4 |
||
*System Resource Usage on F-- ALCOR DATAPLANE MANAGER (context switch)* | ||
Review comment: We need to describe what "context switch" refers to here and the approach (for example, commands) used to retrieve this metric.
Review comment: Also, for each image, give a brief description including the X- and Y-axis info and what observation we can draw from it. |
||
____ | ||
|
||
____ | ||
image::p2.png["context switch Alcor DataPlane Manager",width=553,height=302] | ||
Review comment: The image is in a low resolution. Maybe we need a new way to restore the resolution of the images. |
||
____ | ||
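
This report does not state how the context switch metric was collected. One common approach on Linux, shown here only as an assumption, is to sample the cumulative `ctxt` counter in `/proc/stat` twice and divide the difference by the sampling interval (the `cs` column of `vmstat` reports a comparable rate). The helper names below are hypothetical.

```python
import time


def parse_ctxt(proc_stat_text):
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def context_switch_rate(sample_before, sample_after, interval_s):
    """Context switches per second between two /proc/stat snapshots."""
    delta = parse_ctxt(sample_after) - parse_ctxt(sample_before)
    return delta / interval_s


def sample_rate(interval_s=1.0, path="/proc/stat"):
    """Take two snapshots interval_s apart on a Linux host (Linux only)."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.read()
    return context_switch_rate(before, after, interval_s)
```

Sampling this rate on F once per second during a run would give a system-wide time series of context switches per second.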
|
||
[loweralpha, start=2] | ||
. {blank} | ||
+ | ||
____ | ||
*Time Cost on F -- ALCOR DATAPLANE MANAGER round trip* | ||
____ | ||
|
||
Review comment: Please explain that this is the e2e elapsed time to complete all 10K messages. |
||
____ | ||
image::omax.png["thread number of Alcor DataPlane Manager",width=553,height=302] | ||
Review comment: We could delete the first two data points, as they are much larger than the others, and instead put the results (114,494 and 104,536) in the text. |
||
____ | ||
|
||
____ | ||
*Time Cost on single execution max -- ALCOR DATAPLANE MANAGER round trip* | ||
____ | ||
|
||
____ | ||
image::jmax.png["thread number of single execution max time Alcor DataPlane Manager",width=553,height=302] | ||
____ | ||
|
||
[loweralpha, start=3] | ||
. {blank} | ||
+ | ||
____ | ||
*Time Cost Charts for round trip when thread number change on F* | ||
____ | ||
|
||
single thread | ||
____ | ||
image::1-1.png["1 thread",width=276,height=165] | ||
____ | ||
|
||
single thread -- 2nd time | ||
____ | ||
image::1-2.png["1 thread 2nd time",width=276,height=165] | ||
____ | ||
|
||
32 threads | ||
____ | ||
image::32-1.png["32 thread",width=262,height=156] | ||
____ | ||
|
||
32 threads -- 2nd time | ||
____ | ||
image::32-2.png["32 thread 2nd time",width=262,height=156] | ||
____ | ||
|
||
56 threads | ||
____ | ||
image::56-1.png["56 thread",width=262,height=156] | ||
____ | ||
|
||
56 threads -- 2nd time | ||
____ | ||
image::56-2.png["56 thread 2nd time",width=262,height=156] | ||
____ | ||
|
||
128 threads | ||
____ | ||
image::128-1.png["128 thread",width=262,height=156] | ||
____ | ||
|
||
128 threads -- 2nd time | ||
____ | ||
image::128-2.png["128 thread 2nd time",width=262,height=156] | ||
____ | ||
Comment on lines +130 to +168: Can we use the same width and height (e.g. width=553,height=302) as the first three images? The images are a bit small. |
||
|
||
For 256 threads and below, the success rate is 100%. | ||
Review comment: Can we add one more data point for 256 threads? People will be interested in seeing the limit.
Review comment: Also, can we add resource utilization diagrams including CPU, RAM, disk I/O, and network I/O for this extreme case? This would help. |
||
|
||
|
||
For 512 threads and above, the success rate is below 100%. | ||
|
||
____ | ||
*10k neighbors: overall time cost for different numbers of concurrent threads* | ||
____ | ||
|
||
____ | ||
image::1w-ov.png["10k neighbor, overall time cost for different concurrent thread number",width=553,height=302] | ||
____ | ||
|
||
____ | ||
*10k neighbors: per-connection time cost for different numbers of concurrent threads* | ||
Review comment: Please explain the x-axis: what do those numbers represent? For example, the first number is the number of threads and the second is the number of successful runs out of a total of 10K runs.
Review comment: Also, as discussed, we need to verify the extremely large value (5,594,098) and rerun the test. |
||
____ | ||
|
||
____ | ||
image::1w-ov-jc.png["10k neighbor, every connection time cost for different concurrent thread number",width=553,height=302] | ||
____ | ||
|
||
|
||
____ | ||
*Varying neighbor count: per-connection and overall time cost for different numbers of concurrent threads* | ||
Review comment: This image is important. Let us work to collect more data based on two dimensions (concurrent thread # and neighbor numbers), fix one and adjust the other. |
||
____ | ||
|
||
____ | ||
image::other-ov-jc.png[" when neighbor number changed, every connection time cost and overall time cost for different concurrent thread number",width=553,height=302] | ||
____ | ||
|
||
____ | ||
*Varying neighbor count: overall time cost for different numbers of concurrent threads* | ||
Review comment: Same comment as for image other-ov-jc.png: "Let us work to collect more data based on two dimensions (concurrent thread # and neighbor numbers), fix one and adjust the other." We can take out the data point for "1t-1w" and explain it in the text. |
||
____ | ||
|
||
____ | ||
image::other-ov.png["when neighbor number changed, overall time cost for different concurrent thread number",width=553,height=302] | ||
____ | ||
|
||
____ | ||
*Details of the time cost of every execution on each test box* | ||
____ | ||
|
||
____ | ||
image::detail-jc.png["here is details on each test box about every execution time cost",width=1024,height=800] | ||
____ | ||
|
||
[arabic, start=5] | ||
. *Test Conclusion* | ||
|
||
[loweralpha] | ||
. *Alcor DataPlane Manager can support more than 10k concurrent ACA gRPC requests* | ||
. *Alcor DataPlane Manager performs well with 32 to 256 threads* | ||
. *The A-E hardware configuration can run 2000 ACA instances stably on each box* | ||
|
||
[arabic, start=6] | ||
. *Problems for now* | ||
|
||
[arabic] | ||
|
||
[arabic, start=1] | ||
. *ALCOR CONTROL AGENT crashes after several heavy tests* | ||
|
||
Syslog on the ALCOR CONTROL AGENT hosts shows no errors | ||
|
||
[arabic, start=2] | ||
. *io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason* | ||
|
||
After reducing the load on A-E, this issue no longer occurs |
Suggested to change to "Alcor gRPC Performance Test Report"