This repository has been archived by the owner on Mar 31, 2023. It is now read-only.

[Performance Analysis] DPM/ACA gRPC Performance Report #384

Open. haboy52581 wants to merge 11 commits into master.

Conversation

haboy52581

No description provided.

@xieus changed the title from "add adoc for dpm aca grpc analysis" to "[Performance Analysis] DPM/ACA gRPC Performance Report" on Sep 24, 2020
@xieus added the "perf testing" (Performance Testing) label on Sep 24, 2020
@xieus added this to the Version 0.9.2020.09.30 milestone on Sep 24, 2020

@xieus left a comment

@haboy52581 Some initial comments. Thanks for setting up tests and collecting the data points.

@@ -0,0 +1,215 @@
= ALCOR CONTROL AGENT-ALCOR DATAPLANE MANAGER Test Report

Suggest changing the title to "Alcor gRPC Performance Test Report".

|*cpu MHz* |2231.772 |2599.079
|*Memory* |192GB |386GB
|*Network* |NetXtreme BCM5719 Gigabit Ethernet PCIe (GB network) |82599ES 10-Gigabit SFI/SFP+ Network Connection
|*Storage* |LSI raid (no ssd) |AVAGO (no ssd)

I think the DPM machine (.188) has 6 x 1600GB SSDs. Could you confirm?

|*Model Name* |Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz |Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
|*cpu MHz* |2231.772 |2599.079
|*Memory* |192GB |386GB
|*Network* |NetXtreme BCM5719 Gigabit Ethernet PCIe (GB network) |82599ES 10-Gigabit SFI/SFP+ Network Connection

Check the network bandwidth. The results show that the DPM client is network-bound, so we need to revisit this configuration.
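
As a rough sanity check on the bandwidth concern, here is a minimal back-of-envelope sketch. It assumes the roughly 2 MB payload the report gives for a 10,000-neighbor goal state and the nominal rates of the two NICs in the hardware table; the function name and the message counts are illustrative only, and protocol overhead and latency are ignored, so real numbers will be worse.

```python
def min_transfer_seconds(payload_bytes, messages, link_bits_per_second):
    """Lower bound on time on the wire; ignores gRPC/TCP overhead and latency."""
    return payload_bytes * 8 * messages / link_bits_per_second


# ~2 MB per 10,000-neighbor goal state, as stated in the report.
PAYLOAD_BYTES = 2 * 10**6
# Nominal link rates of the two NICs in the hardware table.
LINKS = {"1 GbE": 10**9, "10 GbE": 10**10}

for name, rate in LINKS.items():
    # Message counts are illustrative assumptions, not measured values.
    for messages in (1, 128, 256):
        seconds = min_transfer_seconds(PAYLOAD_BYTES, messages, rate)
        print(f"{name}: {messages} x 2 MB -> at least {seconds:.3f} s")
```

Under these assumptions, 128 messages of 2 MB in flight at once would need at least about 2 seconds on the 1 Gbit/s link, which is why the NIC choice deserves a second look.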

[arabic, start=2]
. *Test step:*

F send goal state message to A-E at the same time concurrently after first warming up then wait for the response, goal state message is different in each payload

Can you upload the test scripts or code that generate the payload to https://github.com/futurewei-cloud/alcor-int/tree/master/tools? This can be done in a separate PR.


F send goal state message to A-E at the same time concurrently after first warming up then wait for the response, goal state message is different in each payload

On A-E there are 2600 ACA running on each box, ACA code has been revised to cut off the ovsdb and mq operations

2,600 or 2,000? I thought 2,000 is the stable setup. Need to update the image accordingly.

image::128-2.png["128 thread 2nd time",width=262,height=156]
____

for 256 threads and below, the success rate is 100%

Can we add one more data point of 256 threads? People will be interested in seeing the limit.

Also, can we add resource utilization diagrams (CPU, RAM, disk I/O, and network I/O) for this extreme case? This would help.

____

____
* 10k neighbor, every connection time cost for different concurrent thread number*

Please explain the x-axis: what do those numbers represent? For example, the first number is the number of threads and the second is the number of successful runs out of a total of 10K runs.

____

____
* 10k neighbor, every connection time cost for different concurrent thread number*

Also, as discussed, we need to verify the extremely large value (5,594,098) and rerun the test.



____
* when neighbor number changed, every connection time cost and overall time cost for different concurrent thread number*

This image is important. Let us work to collect more data based on two dimensions (concurrent thread # and neighbor numbers), fix one and adjust the other.
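
To make the "fix one and adjust the other" idea concrete, here is a minimal sketch of how the two sweeps could be enumerated. Only the 128 and 256 thread counts and the 1 to 10,000 neighbor range come from the report; the intermediate values and the `sweep` helper are assumptions for illustration.

```python
# Placeholder sweep values: only 128/256 threads and the 1 to 10,000 neighbor
# range appear in the report; the intermediate points are assumptions.
THREAD_COUNTS = [1, 16, 32, 64, 128, 256]
NEIGHBOR_COUNTS = [1, 10, 100, 1000, 10000]


def sweep(fixed_name, fixed_value, varied_name, varied_values):
    """Return one test case per varied value, holding the other dimension fixed."""
    return [{fixed_name: fixed_value, varied_name: value} for value in varied_values]


# Fix the payload size and vary concurrency, then do the reverse.
by_threads = sweep("neighbors", 10000, "threads", THREAD_COUNTS)
by_neighbors = sweep("threads", 128, "neighbors", NEIGHBOR_COUNTS)

for case in by_threads + by_neighbors:
    print(case)
```

Each list maps to one chart: one series of time cost against thread count at a fixed payload, and one against neighbor count at a fixed thread count.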

____

____
* when neighbor number changed, overall time cost for different concurrent thread number*

Same comment as for image other-ov-jc.png:

"Let us work to collect more data based on two dimensions (concurrent thread # and neighbor numbers), fix one and adjust the other."

We can take out the data point for "1t-1w" and explain it in the text.

@@ -65,6 +65,28 @@ image::p1.png["Test Deployment",width=488,height=302]
|*90% TILE* |12 |11 |32 |28 |78 |84 |292 |262
|===

different payload sizes vary from 1 neighbor to 10000 neighbor(2MB) each

*1WR+other OV-MAX+average*

Could you elaborate on what this means?

@@ -65,6 +65,28 @@ image::p1.png["Test Deployment",width=488,height=302]
|*90% TILE* |12 |11 |32 |28 |78 |84 |292 |262
|===

The rows and columns of this table are transposed relative to the next one. Could we make them consistent?
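
If it helps, flipping the orientation of one table can be scripted. The sketch below is only an illustration: it handles simple single-line AsciiDoc rows of the form `|a |b |c`, the header row is a hypothetical placeholder (the real column labels are not visible in this hunk), and the helper names are made up.

```python
def parse_row(line):
    """Split an AsciiDoc table row such as '|a |b |c' into stripped cells."""
    return [cell.strip() for cell in line.split("|")[1:]]


def transpose(rows):
    """Swap rows and columns; all rows must have the same number of cells."""
    return [list(column) for column in zip(*rows)]


def format_row(cells):
    """Render cells back into a single-line AsciiDoc table row."""
    return "|" + " |".join(cells)


# First row is a hypothetical header; the second is the row quoted in the diff.
rows = [
    parse_row("|*Case* |c1 |c2 |c3 |c4 |c5 |c6 |c7 |c8"),
    parse_row("|*90% TILE* |12 |11 |32 |28 |78 |84 |292 |262"),
]

for cells in transpose(rows):
    print(format_row(cells))
```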
