This repository has been archived by the owner on Mar 31, 2023. It is now read-only.

[Performance Analysis] DPM/ACA gRPC Performance Report #384

Open
wants to merge 11 commits into
base: master
Binary file added docs/modules/ROOT/images/1-1.png
Binary file added docs/modules/ROOT/images/1-2.png
Binary file added docs/modules/ROOT/images/128-1.png
Binary file added docs/modules/ROOT/images/128-2.png
Binary file added docs/modules/ROOT/images/1w-ov-jc.png
Binary file added docs/modules/ROOT/images/1w-ov.png
Binary file added docs/modules/ROOT/images/32-1.png
Binary file added docs/modules/ROOT/images/32-2.png
Binary file added docs/modules/ROOT/images/56-1.png
Binary file added docs/modules/ROOT/images/56-2.png
Binary file added docs/modules/ROOT/images/detail-jc.png
Binary file added docs/modules/ROOT/images/jmax.png
Binary file added docs/modules/ROOT/images/omax.png
Binary file added docs/modules/ROOT/images/other-ov-jc.png
Binary file added docs/modules/ROOT/images/other-ov.png
Binary file added docs/modules/ROOT/images/p1.png
Binary file added docs/modules/ROOT/images/p2.png
Binary file added docs/modules/ROOT/images/p4.png
Binary file added docs/modules/ROOT/images/p5.png
215 changes: 215 additions & 0 deletions docs/modules/ROOT/pages/performance_analysis/dpmAcaGrpcTest.adoc
@@ -0,0 +1,215 @@
= ALCOR CONTROL AGENT-ALCOR DATAPLANE MANAGER Test Report
Review comment (Contributor):

Suggest changing the title to "Alcor gRPC Performance Test Report".

Xiaodong Zhang <[email protected]>
v0.1, 2020-09-15
:toc: right
:imagesdir: ../../images

*ALCOR CONTROL AGENT-ALCOR DATAPLANE MANAGER Test Report*

**Abstract:** This document presents the gRPC performance test results and analysis for the Alcor Dataplane Manager (DPM) and the Alcor Control Agent (ACA).

[arabic]
. *Environment description*

[cols=",",options="header",]
|===
|*Host* |*IP address*
|*A* |*10.213.43.162*
|*B* |*10.213.43.163*
|*C* |*10.213.43.164*
|*D* |*10.213.43.166*
|*E* |*10.213.43.187*
|*F* |*10.213.43.188*
|===

[cols=",,",options="header",]
|===
|*Hardware Configuration:* | |
| |*A, B, C, D, E running ALCOR CONTROL AGENT* |*F running as ALCOR DATAPLANE MANAGER*
|*CPU* |40 cores |56 cores
|*Model Name* |Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz |Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
|*CPU MHz* |2231.772 |2599.079
|*Memory* |192GB |386GB
|*Network* |NetXtreme BCM5719 Gigabit Ethernet PCIe (GB network) |82599ES 10-Gigabit SFI/SFP+ Network Connection
Review comment (Contributor):

Check the network bandwidth. The results show that the DPM client is network-bound, so we need to revisit this configuration.

|*Storage* |LSI RAID (no SSD) |AVAGO (no SSD)
Review comment (Contributor):

I think the DPM machine (.188) has 6 x 1600GB SSDs. Could you confirm?

|*Software Configuration:* | |
|*System* |Ubuntu 18.04 LTS |
|*Server* |ALCOR CONTROL AGENT on A, B, C, D, E |
|*Client* |ALCOR DATAPLANE MANAGER on F |
|===

[arabic, start=2]
. *Test step:*

After an initial warm-up, F sends goal state messages to A-E concurrently and then waits for the responses. Each message carries a different goal state payload.
Review comment (Contributor):

Can you upload the test scripts or code that generate the payload to https://github.com/futurewei-cloud/alcor-int/tree/master/tools? This can be done in a separate PR.

Review comment (Contributor):

Some important points of the test setup include:

  • number of long-running connections established
  • DPM thread pool setup and number of concurrent threads used for message delivery
  • message size (depending on the number of neighbor ports)

Readers will find the images confusing if they do not understand the setup.


Each of A-E runs 2600 ACA instances. The ACA code was modified to skip the OVSDB and MQ operations.
Review comment (Contributor):

2,600 or 2,000? I thought 2,000 is the stable setup. Need to update the image accordingly.


image::p1.png["Test Deployment",width=488,height=302]
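
The test step above can be sketched roughly as follows. This is a minimal illustration, not the actual test harness: `send_goal_state` is a hypothetical stand-in for the real gRPC call from the DPM to an ACA, and the message count and thread count are example values. It shows how a thread pool can deliver distinct payloads concurrently while recording both the per-message round-trip time and the overall elapsed time.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def send_goal_state(payload: str) -> str:
    """Hypothetical stand-in for the gRPC call from the DPM to one ACA."""
    time.sleep(0.001)  # simulate a short round trip
    return "ack:" + payload


def timed_send(payload: str) -> float:
    """Send one message and return its round-trip time in milliseconds."""
    start = time.perf_counter()
    send_goal_state(payload)
    return (time.perf_counter() - start) * 1000.0


def run_test(num_messages: int, num_threads: int) -> Tuple[List[float], float]:
    """Send num_messages distinct payloads using num_threads worker threads."""
    payloads = ["goal-state-%d" % i for i in range(num_messages)]
    overall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        latencies = list(pool.map(timed_send, payloads))
    overall_ms = (time.perf_counter() - overall_start) * 1000.0
    return latencies, overall_ms


# Example values only; the real test parameters may differ.
latencies, overall_ms = run_test(num_messages=200, num_threads=32)
```

Varying `num_threads` (for example 1, 32, 56, 128) would mirror the thread counts reported in the results.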

[arabic, start=3]
. *Test results*
Review comment (Contributor):

We should describe the latency in the following table as the latency of a single connection.


[cols=",,,,,,,,",options="header",]
|===
|*Log Time Cost Summary (ms)* | | | | | | | |
|*Threads* |*1-1^st^* |*1-2^nd^* |*32-1^st^* |*32-2^nd^* |*56-1^st^* |*56-2^nd^* |*128-1^st^* |*128-2^nd^*
Review comment (Contributor):

This is a good place to explain what "1st" and "2nd" mean. I know you covered it briefly before. We can explain how the two differ.

|*Average* |11.14 |10.37 |16.50 |14.40 |26.54 |25.60 |70.64 |61.54
|*MIN* |9 |9 |10 |9 |10 |9 |10 |9
|*MAX* |48 |18 |223 |116 |364 |160 |603 |352
|*MEDIAN* |11 |10 |11 |11 |11 |11 |11 |11
|*>AVERAGE* |28.88% |40.35% |21.48% |20.09% |21.80% |20.09% |21.03% |20.00%
|*<AVERAGE* |71.12% |59.65% |78.52% |79.91% |78.20% |79.91% |78.97% |80.00%
|*99th percentile* |13 |12 |67 |43 |107.04 |116.03 |331 |312
|*95th percentile* |12 |12 |36 |31 |83 |89 |305 |277
|*90th percentile* |12 |11 |32 |28 |78 |84 |292 |262
|===
Review comment (Contributor):

The rows and columns of this table are transposed relative to the next one. Could we make them consistent?
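
For readers who want to reproduce the summary rows from raw per-request timings, the following is one possible way to compute them. It is a sketch under assumptions: the report does not state which percentile method was used, so a nearest-rank percentile is assumed here, and the sample values are made up for illustration rather than taken from the measurements.

```python
import math
import statistics
from typing import Dict, List


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile (assumed method, not confirmed by the report)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Compute the summary rows for one column of the table."""
    values = sorted(latencies_ms)
    average = statistics.mean(values)
    above = sum(1 for v in values if v > average)
    return {
        "average": average,
        "min": values[0],
        "max": values[-1],
        "median": statistics.median(values),
        # Share of requests strictly slower than the average
        "above_average_pct": 100.0 * above / len(values),
        "p99": percentile(values, 99),
        "p95": percentile(values, 95),
        "p90": percentile(values, 90),
    }


# Made-up sample timings in milliseconds, for illustration only
sample = [9, 10, 10, 11, 11, 11, 12, 12, 13, 48]
summary = summarize(sample)
```

A single slow request (48 ms in the sample) pulls the average well above the median, which is the same pattern the table shows at higher thread counts.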



[arabic, start=4]
. *Test results analysis*
[loweralpha]
.. {blank}
+
____
*System Resource Usage on F -- ALCOR DATAPLANE MANAGER (context switch)*
Review comment (Contributor):

We need to describe exactly what context switch refers to here and how (for example, with which commands) this metric was retrieved.

Review comment (Contributor):

Also, for each image, give a brief description that covers the X- and Y-axis and the observation we can draw from it.

____

____
image::p2.png["context switch Alcor DataPlane Manager",width=553,height=302]
Review comment (Contributor):

The image has low resolution. We may need another way to preserve the resolution of the images.

____
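
The report does not say how the context switch metric was collected. As one possible approach on Linux, the cumulative system-wide count is exposed on the `ctxt` line of `/proc/stat`, and a rate can be derived from two samples. The sketch below parses text shaped like `/proc/stat`; the sample contents and the 5-second interval are made-up values for illustration.

```python
def parse_ctxt(proc_stat_text: str) -> int:
    """Return the cumulative context-switch count from /proc/stat content."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no ctxt line found")


def switches_per_second(first: int, second: int, interval_s: float) -> float:
    """Average context switches per second between two samples."""
    return (second - first) / interval_s


# Illustrative snippets shaped like /proc/stat; the numbers are made up
before = "cpu  100 0 50 1000 0 0 0 0 0 0\nctxt 1000000\nbtime 1600000000\n"
after = "cpu  120 0 60 1100 0 0 0 0 0 0\nctxt 1250000\nbtime 1600000000\n"

rate = switches_per_second(parse_ctxt(before), parse_ctxt(after), interval_s=5.0)
```

On a live host the two samples would come from reading `/proc/stat` at the start and end of the interval. The `cs` column of `vmstat` reports a comparable per-second figure.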

[loweralpha, start=2]
. {blank}
+
____
*Time Cost on F -- ALCOR DATAPLANE MANAGER round trip*
____

Review comment (Contributor):

Please explain that this is the end-to-end (e2e) elapsed time to complete all 10K messages.

____
image::omax.png["thread number of Alcor DataPlane Manager",width=553,height=302]
Review comment (Contributor):

We could drop the first two data points, as they are much larger than the others, and instead state the results (114,494 and 104,536) in the text.

____

____
*Maximum time cost of a single execution -- ALCOR DATAPLANE MANAGER round trip*
____

____
image::jmax.png["thread number of single execution max time Alcor DataPlane Manager",width=553,height=302]
____

[loweralpha, start=3]
. {blank}
+
____
*Round-trip time cost charts for different thread counts on F*
____

single thread
____
image::1-1.png["1 thread",width=276,height=165]
____

single thread -- 2nd time
____
image::1-2.png["1 thread 2nd time",width=276,height=165]
____

32 threads
____
image::32-1.png["32 thread",width=262,height=156]
____

32 threads -- 2nd time
____
image::32-2.png["32 thread 2nd time",width=262,height=156]
____

56 threads
____
image::56-1.png["56 thread",width=262,height=156]
____

56 threads -- 2nd time
____
image::56-2.png["56 thread 2nd time",width=262,height=156]
____

128 threads
____
image::128-1.png["128 thread",width=262,height=156]
____

128 threads -- 2nd time
____
image::128-2.png["128 thread 2nd time",width=262,height=156]
____
Comment on lines +130 to +168

Review comment (Contributor):

Can we use the same width and height (e.g. width=553,height=302) as the first three images? The images are a bit small.


For 256 threads and below, the success rate is 100%.
Review comment (Contributor):

Can we add one more data point of 256 threads? People will be interested in seeing the limit.

Review comment (Contributor):

Also, can we add resource utilization diagrams (CPU, RAM, disk I/O, and network I/O) for this extreme case? This would help.



For 512 threads and above, the success rate is below 100%.
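
As a small illustration of how the success rate could be computed from per-request outcomes, consider the sketch below. It assumes 10,000 requests per run, and the failure count in the second example is hypothetical because the report does not give the actual number of failed requests.

```python
from typing import Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Percentage of requests that completed successfully."""
    results = list(outcomes)
    if not results:
        raise ValueError("no outcomes recorded")
    return 100.0 * sum(results) / len(results)


# Every request succeeded
all_ok = [True] * 10000
# Hypothetical run with 10 failed requests out of 10,000
some_failed = [True] * 9990 + [False] * 10

rate_all_ok = success_rate(all_ok)
rate_some_failed = success_rate(some_failed)
```

Reporting the raw success and failure counts alongside the percentage would make the exact limit easier to see.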

____
*10k neighbors: overall time cost for different numbers of concurrent threads*
____

____
image::1w-ov.png["10k neighbor, overall time cost for different concurrent thread number",width=553,height=302]
____

____
*10k neighbors: per-connection time cost for different numbers of concurrent threads*
Review comment (Contributor):

Please explain the x-axis: what do those numbers represent? For example, the first number is the number of threads and the second is the number of successful runs out of a total of 10K runs.

Review comment (Contributor):

Also, as discussed, we need to verify the extremely large value (5,594,098) and rerun the test.

____

____
image::1w-ov-jc.png["10k neighbor, every connection time cost for different concurrent thread number",width=553,height=302]
____


____
*Per-connection and overall time cost for different numbers of concurrent threads as the neighbor count changes*
Review comment (Contributor):

This image is important. Let us work to collect more data based on two dimensions (concurrent thread # and neighbor numbers), fix one and adjust the other.

____

____
image::other-ov-jc.png[" when neighbor number changed, every connection time cost and overall time cost for different concurrent thread number",width=553,height=302]
____

____
*Overall time cost for different numbers of concurrent threads as the neighbor count changes*
Review comment (Contributor):

Same comment as for image other-ov-jc.png.

"Let us work to collect more data based on two dimensions (concurrent thread # and neighbor numbers), fix one and adjust the other."

We can take out the data point for "1t-1w" and explain it in the text.

____

____
image::other-ov.png["when neighbor number changed, overall time cost for different concurrent thread number",width=553,height=302]
____

____
*Per-execution time cost details for each test box*
____

____
image::detail-jc.png["here is details on each test box about every execution time cost",width=1024,height=800]
____

[arabic, start=5]
. *Test Conclusion*

[loweralpha]
. *Alcor DataPlane Manager can support more than 10k concurrent ACA gRPC requests*
. *Alcor DataPlane Manager performs well with 32 to 256 threads*
. *The A-E hardware configuration can run 2000 stable ACA instances on each box*

[arabic, start=6]
. *Problems for now*

[arabic]

[arabic, start=1]
. *ALCOR CONTROL AGENT crashes after several heavy tests*

Syslog shows no error for ALCOR CONTROL AGENT.

[arabic, start=2]
. *io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason*

After the load on A-E was reduced, this issue no longer occurred.