Scaling the Waku Protocol: A Performance Analysis with Wakurtosis #123
Conversation
Would it be worth adding some key figures to the post?
For example:
The decreased efficiency in data transmission and reception suggests that Discv5’s overhead significantly affects the protocol's performance.
On a 1000-node network with X messages per second of size K, the activation of Discv5 increased average memory consumption per node from X to Y.
Thank you 👍 SemBr (semantic line breaks) would be appreciated 🙏
Agree with the comments of other reviewers and added some of my own.
@jm-clius @fryorcraken thank you so much for your review! I pushed some changes addressing your comments, and I also finished the Discv5 vs non-Discv5 discussion with the latest data. Apologies if I forgot something. @kaiserd it would be nice if you could have a look.
I'll let the final review and approval be handled by the Vac team :)
Approving for addressing previous comments. Would leave it to Vac to review for framing and other context. Not sure of the intended audience and what we assume they already know, but I would suggest that some concepts need more contextualisation before being introduced. For example, the basic idea of what Discv5 is and does would help readers intuitively understand the results a bit better. Currently the first thing you read about it is "non-Discv5 simulations", which I wouldn't understand if I didn't already have the necessary background.
When examining the effects on bandwidth utilization, we see varying results.
For reception rates, with lower message rates we observe that Discv5 consistently utilizes more bandwidth than the corresponding non-Discv5 case.
However, those differences become less pronounced with higher message rates.
For transmission rates, both cases show similar performance, and both exhibit substantial improvements in transmission efficiency at higher message loads.
One more note, to reiterate: we wouldn't expect to see an efficiency that's higher than around 1/6. Where this is the case, we should assume there might be losses due to infrastructure (or other) limits.
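To make that bound concrete: with a GossipSub mesh degree of D, each node relays every message to roughly D mesh peers and therefore also receives roughly D copies of it, only one of which is useful payload. A minimal sketch, assuming D = 6 (the libp2p default; the post excerpt does not state the configured degree):

```python
MESH_DEGREE = 6  # assumed GossipSub mesh degree D (libp2p default)

def max_expected_efficiency(mesh_degree: int = MESH_DEGREE) -> float:
    """Each node receives ~D copies of every message, only one of which
    is useful payload, so efficiency tops out around 1/D."""
    return 1 / mesh_degree

def efficiency_plausible(payload_bytes: float, rx_bytes: float) -> bool:
    """A measured efficiency above ~1/6 suggests messages were lost
    (e.g. to infrastructure limits), not delivered more cheaply."""
    return payload_bytes / rx_bytes <= max_expected_efficiency()
```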
Thank you 🙏 for this post! Feedback inline.
It would be good to also add the 0 Msgs run, suggested by @jm-clius in Discord, to this rlog post.
For instance, running several nodes per container can alter propagation times, as nodes grouped within the same container may exhibit different messaging behaviour compared to a true node-to-node topology.
Additionally, employing a multi-node approach may result in losing node-level sampling granularity, depending on the metrics infrastructure used (e.g. cAdvisor).
Nevertheless, Wakurtosis offers the flexibility to choose between this and a 1-to-1 simulation, catering to the specific needs of each test scenario.
The results presented here are 1-to-1 simulations.
I would say this earlier in the paragraph. Without this info, this paragraph reads as if the results presented here are based on unreliable runs.
Just pushed some more changes addressing the comments that were pending. Still waiting for the last round of simulations with a rate of 0 msg/s to expand on that. I will ask for re-review once I've added those.
Thank you :). Left some comments in-line, will give feedback on the rest later.
To evaluate Waku under varied conditions, we conducted simulations across a range of network sizes, topologies, and message rates. Each simulation lasted 3 hours to reach a steady state.
The network sizes explored included 75, 150, 300, and 600 nodes. For non-Discv5 simulations, we used static topologies with an average node degree of K=50. In simulations with Discv5, we set the max_peers parameter to 50 to approximate a similar average degree.
We should introduce why we test non-discv5, and explain that this is an "artificial scenario" here when we first mention non-discv5.
There is an explanation in the subsection Clarification on Non-Discv5 Scenarios below. Also, Discv5 is briefly explained in the intro.
I'd still write a sentence explaining why we look into non-discv5 here.
Alternatively, you can move the "Clarification on Non-Discv5 Scenarios" up here, and adjust it accordingly so that it fits here.
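For context on the setup discussed in this thread: a static topology with an average node degree of K=50 could be generated along the following lines. This is a sketch using networkx, not the actual Wakurtosis topology generator, whose implementation may differ:

```python
import networkx as nx

NUM_NODES = 600  # one of the simulated sizes: 75, 150, 300, 600
DEGREE_K = 50    # target average node degree from the post

# A random K-regular graph gives every node exactly K peers — one simple
# way to realise a static topology with an average degree of 50.
topology = nx.random_regular_graph(d=DEGREE_K, n=NUM_NODES, seed=42)

# Dump an adjacency list, e.g. as input for a simulation config.
for node, neighbours in topology.adjacency():
    print(node, sorted(neighbours))
```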
To stress test message throughput, we simulated message rates of 1 and 10 messages per second. We initially ran simulations at up to 100 msg/s, but found the results unreliable due to simulation hardware limitations, and therefore decided not to include them in this analysis.
We also included simulation batches with no load (i.e. 0 msg/s) to provide a clearer picture of Waku's baseline resource demands and the inherent overhead of core protocol operations.
What is part of this baseline?
I'd definitely mention testing discv5 in isolation here.
Not sure if I follow, can you please elaborate a bit?
What are "Waku's baseline resource demands" comprised of?
Just pushed the remaining changes. Also included memory vs nodes plots for clarity in the discussion of the results, to better showcase the different scaling of memory. @kaiserd please let me know if there is something I've missed.
Here is the rest of my feedback.
Generally I'd remove the runs where we see reduced bandwidth due to hardware limitations preventing messages from being injected.
Seeing bandwidth consumption drop with larger network sizes is misleading.
That is not a real-world scenario but just a testing environment artifact.
The analysis reveals interesting trends in bandwidth and memory usage across different network sizes and loads.
Transmission bandwidth does not consistently increase with more nodes when messages are sent, while reception bandwidth exhibits even more variable behavior depending on scale and load.
Given the message rate is independent of the number of nodes,
it is expected that the bandwidth consumption is close to constant, too.
The following statement goes against this expectation:
bandwidth does not consistently increase
since it indicates an expected increase in message rate.
Do you expect such an increase, and if yes, why?
Is there something more at play than bandwidth dropping because the testing hardware is overloaded?
(Imo, this strange behaviour that only stems from artifacts of the testing environment and does not happen in a real scenario should not be part of the actual analysis and only be mentioned as additional info in a subsequent paragraph; it might confuse readers.)
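To spell out the expectation behind this comment: with a fixed message rate and a fixed mesh degree, per-node bandwidth should be roughly independent of network size. A back-of-envelope sketch (the message size and mesh degree are assumed values, not taken from the post):

```python
def expected_rx_rate(msg_rate_hz: float, msg_size_bytes: int,
                     mesh_degree: int = 6) -> float:
    """Approximate per-node reception rate in bytes/s under GossipSub-style
    relaying: each message arrives up to once per mesh peer, so the rate
    depends on message rate and mesh degree, not on total node count N."""
    return msg_rate_hz * msg_size_bytes * mesh_degree

# The same figure applies at N = 75 and N = 600, e.g. at 1 msg/s of 1 KiB:
print(expected_rx_rate(1, 1024))  # ~6 KiB/s per node, regardless of N
```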
I agree with Daniel here that these kinds of "pure observations" without interpretation can be confusing, especially in the opening of the paragraph. As a reader, I'm interested in questions such as "does Waku bandwidth grow in any particular pattern with network growth or message rate increase?", "does this pattern tell me anything fundamental about the protocol?", "is there anything in the results that surprises us or deviates from what we would expect mathematically?". Any particular observations about simulation anomalies can be mentioned (with some interpretation) separately, but the paragraph should open with and focus on interpretative questions like these.
The reduced bandwidth at high message rates stems from simulation infrastructure limits rather than protocol inefficiencies.
The no-load measurements provide insights into baseline protocol overhead costs that grow with network size.
Transmission overhead appears to increase linearly, while reception overhead accelerates sharply beyond 300 nodes.
What is the explanation for this? Is this linked to the overloading hardware, too? If yes, we should explicitly mention that here.
Does the transmission overhead increase linearly in the number of nodes or in the number of messages?
This should be explicit here.
Assuming you mean number of nodes, why does it increase linearly even though the message rates are fixed?
(Especially since this scenario is without discv5.)
The first sentence here seems to me to be the important takeaway in this paragraph. "Bandwidth increases linearly with...". It should not be mixed in with an uninterpreted observation about simulation effects in reception bandwidth.
With the discovery mechanism, we again see interesting trends in resource usage across different network sizes and traffic loads.
Transmission bandwidth remained fairly stable with more nodes.
We again observe reduced bandwidth at the highest rates and node counts, stemming from simulation infrastructure constraints rather than inherent protocol limitations.
Again, I'd put this in a separate section, because this is confusing to someone quickly reading over the doc.
This is only a test specific artifact.
Reception bandwidth grows much faster with number of nodes compared to the baseline, especially the large spike from 300 to 600 nodes with discovery enabled.
This reflects the substantial additional overhead of neighbor discovery and tracking.
Is this because message rates / payload transmission is comparatively low?
Somebody reading this would expect that discovery traffic is lower.
A bit more explanation would be helpful here.
Thanks for this! I have added some comments, most of which is related to making sure this post is more oriented towards conclusions/interpretation of "Scaling the Waku Protocol" rather than stating simulation observations. In other words, do the tests tell us something about the scalability of the Waku Protocol (e.g. what is our interpretation for each scenario about how Waku Protocols respond to changes in network size and message rates?). These are mentioned in the concluding section, but I think this should be the focus throughout.
With the discovery mechanism, we again see interesting trends in resource usage across different network sizes and traffic loads.
Transmission bandwidth remained fairly stable with more nodes.
What does this mean? Does this tell us something about Waku messaging protocol, our chosen discovery technique and network size?
Co-authored-by: Hanno Cornelius <[email protected]>
What is the status here? What is blocking?
@Daimakaimura @AlbertoSoutullo What is the status of the simulations?
I finished them on Thursday around 13:00 and notified Jordi. He will probably answer about the analysis part when he reads this.
@Daimakaimura could you ping me and @jm-clius once you have updated the post with the latest data please?
Of course!
@fryorcraken @kaiserd @jm-clius just pushed the last changes with the results of the last round of simulations. Going to ask for re-review.
### Scalability
To overcome scalability challenges, Wakurtosis employs a multi-node approach, running several nodes within one container.
This method supports simulations with over 1,000 nodes on a single machine.
The biggest network size mentioned below is 600 nodes. Why not use the full capacity of Wakurtosis?
Because for that size we would need to employ more than one node per container, and we are only showing simulation results for one node per container. Mostly for clarity, as multiple nodes per container can skew the results and make them harder to compare with the rest of the batches.
After this initial decrease, the bandwidth plateaus without significant growth even as rates rise. This contrasts sharply with the 600-node setup, where even without messaging, substantial bandwidth is consumed, comparable to active messaging in smaller networks.
![Total mean bandwidth usage without discovery mechanism](/static/img/wakurtosis_waku/non_discv5_bandwidth.png)
I am not sure I understand the scale. The graph indicates MB for the bandwidth.
Is that the total bandwidth used over the 3 hours of message injection?
Would be good to provide some basis. For example, what is the baseline data volume we are injecting? 3 hours * 1 msg per second * msg size.
Also, the graph has 5 lines but we stated 4 message rates (0.25, 0.5, 0.75, 1 messages per second).
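The baseline being asked for here is straightforward to compute. A sketch, with the message size as a placeholder since it is not stated in this excerpt:

```python
# Raw payload injected over one run: duration * message rate * message size.
DURATION_S = 3 * 3600     # 3-hour simulation
MSG_RATE_HZ = 1.0         # 1 msg/s, the highest rate in this plot
MSG_SIZE_BYTES = 1024     # hypothetical 1 KiB payload

injected_mb = DURATION_S * MSG_RATE_HZ * MSG_SIZE_BYTES / 1e6
print(f"{injected_mb:.1f} MB of raw payload injected")  # ~11.1 MB
```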
Yes, this is the average total bandwidth usage per node over the 3-hour simulation. I will add the pure payload lines to the plot to show how much data we are actually injecting into the network.
@fryorcraken it's 5 series: 0.0, 0.25, 0.5, 0.75, and 1.0 msg/s. I just realised there is a missing legend. I'll fix that too.
#### Results with discovery mechanism (Discv5)
With the discovery mechanism discv5, distinct resource usage patterns emerge. Transmission bandwidth remains relatively stable despite an eight-fold node increase, evidencing the protocol's transmission efficiency under discovery. However, reception bandwidth shows a different trend. From 75 to 300 nodes, a significant increase occurs, especially at 0 msg/s (95.83 to 254.7 Mb/s). The 300 to 600 node change is even more dramatic, with bandwidth surging to 497.47 Mb/s without messaging.
with bandwidth surging to 497.47 Mb/s without messaging.
This seems worrying. Cc @jm-clius @alrevuelta
497.47 Mb/s with discv5 and without messaging is indeed worrying, but I have never seen this with https://github.com/waku-org/waku-simulator. How can we reproduce it?
497.47 Mb/s with discv5 and without messaging is indeed worrying, but I have never seen this with https://github.com/waku-org/waku-simulator. How can we reproduce it?
If you want, we can pick a day to do a quick call and reproduce this. It was done with Wakurtosis; you can do it with ~3 commands, but the script to analyse the data is owned by @Daimakaimura. Also @Daimakaimura, are those 497.47 Mb/s constant over the entire simulation? Did you check when this happened? Because if it is a "constant" behaviour, we can test it in a ~5-minute run and it will be easier to check.
@AlbertoSoutullo those ~500 MB are the average total bandwidth usage per node over the whole simulation (i.e. 3h). @alrevuelta we were also surprised by that result, so we re-ran the simulation and got similar results, plot attached. We should indeed arrange a meeting to discuss this, because it is indeed weird. Also, keep in mind this also happens without Discv5.
analysis_hw.pdf
Co-authored-by: fryorcraken <[email protected]>
@jm-clius @alrevuelta @fryorcraken The simulations were done with an older Waku version, and it seems that the ~500 MB bandwidth without sending payload messages might be caused by an anomaly/bug in this Waku version. The experiments will be re-run with an up-to-date Waku version to confirm this. Also, this anomaly only appears at higher node counts. @alrevuelta What was the highest node count you tested in waku-simulator? How many peer connections did nodes have in your experiments?
With waku-simulator, around 500 nodes; with Shadow, 1000. Around 50 connections per node.
With the discovery mechanism discv5, distinct resource usage patterns emerge. Transmission bandwidth remains relatively stable despite an eight-fold node increase, evidencing the protocol's transmission efficiency under discovery. However, reception bandwidth shows a different trend. From 75 to 300 nodes, a significant increase occurs, especially at 0 msg/s (95.83 to 254.7 Mb/s). The 300 to 600 node change is even more dramatic, with bandwidth surging to 497.47 Mb/s without messaging.
The observation of a similar anomaly in simulations without the discovery mechanism suggests that the issue might lie with the protocol implementation itself, rather than being merely a simulation artifact.
The phrasing above suggests the anomaly is connected to discv5, while here it is stated it is not dependent on discv5.
With the discovery mechanism discv5, distinct resource usage patterns emerge.
This sentence should be removed; the anomaly should be discussed in a section independent of discv5.
@kaiserd not sure what the plan with this PR is? Doing a bit of GitHub cleanup. :)
Research blog post on Wakurtosis and scaling Waku.