
Scaling the Waku Protocol: A Performance Analysis with Wakurtosis #123

Open
wants to merge 31 commits into base: develop

Conversation


@Daimakaimura Daimakaimura commented Aug 7, 2023

Research blog post on Wakurtosis and scaling Waku.

  • Updated the body of the article, expanding some parts
  • Added bandwidth-multiplier heat-map figures
  • Commented out the Discv5 section, as we are still waiting for some final data
  • Added a reference to the Tech Report addendum
  • Added results of the last simulations for the Discv5 vs non-Discv5 comparison
  • Added results of simulations with a 0 message rate to the discussion
  • Updated calculations and plots
  • Updated the results section and plots with the latest results

@Daimakaimura Daimakaimura self-assigned this Aug 7, 2023
Contributor

@fryorcraken fryorcraken left a comment

Would it be worth adding some key figures to the post?

For example:

The decreased efficiency in data transmission and reception suggests that Discv5’s overhead significantly affects the protocol's performance.

On a 1,000-node network with X messages per second of size K, the activation of Discv5 increased average memory consumption per node from X to Y.

Contributor

kaiserd commented Aug 14, 2023

Thank you 👍
What is blocking progress here?

sembr would be appreciated 🙏

@kaiserd kaiserd self-requested a review August 21, 2023 08:11
Contributor

@jm-clius jm-clius left a comment

Agree with the comments of other reviewers and added some of my own.

Author

Daimakaimura commented Aug 22, 2023

@jm-clius @fryorcraken thank you so much for your review! I pushed some changes, including ones addressing your comments. I also finished the Discv5 vs Non-Discv5 discussion with the latest data. Apologies if I forgot something. @kaiserd it would be nice if you could have a look.

@fryorcraken
Contributor

I'll let final review and approval be handled by the Vac team :)

Contributor

@jm-clius jm-clius left a comment

Approving for addressing previous comments. Would leave it to Vac to review for framing and other context. Not sure of the intended audience and what we assume they already know, but I would suggest that some concepts need more contextualisation before being introduced. For example, the basic idea of what discv5 is and does would help readers intuitively understand the results a bit better. Currently, the first thing you read about it is "non-Discv5 simulations", which I wouldn't understand if I didn't already have the necessary background.

When examining the effects on bandwidth utilization, we see varying results.
For reception rates, with lower message rates we observe that Discv5 consistently utilizes more bandwidth than the corresponding non-Discv5 case.
However, those differences become less pronounced with higher message rates.
For transmission rates, both cases show similar performance and both exhibit huge improvements in transmission efficiency at higher message loads.
Contributor

One more note, to reiterate: we wouldn't expect to see an efficiency that's higher than around 1/6. Where this is the case, we should assume there might be losses due to infrastructure (or other) limits.
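As a rough illustration of where that ceiling comes from, a minimal sketch assuming Waku Relay's GossipSub mesh degree D is around 6 (the degree value is an assumption, not taken from the simulations):

```python
# Sketch: efficiency ceiling of a gossip mesh with degree D ~ 6.
# Each node forwards every message to roughly D mesh peers, so useful
# payload can be at most about 1/D of the bytes actually transmitted.
def efficiency_ceiling(mesh_degree: int = 6) -> float:
    """Upper bound on payload bytes / transmitted bytes for the given degree."""
    return 1.0 / mesh_degree

print(efficiency_ceiling())  # ~0.167, i.e. roughly 1/6
```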

Contributor

@kaiserd kaiserd left a comment

Thank you 🙏 for this post! Feedback inline.
It would be good to also add the 0 msg/s run, suggested by @jm-clius in Discord, to this rlog post.

For instance, running several nodes per container can alter propagation times, as nodes grouped within the same container may exhibit different messaging behavior compared to a true node-to-node topology.
Additionally, employing a multi-node approach may result in losing node-level sampling granularity, depending on the metrics infrastructure used, e.g. cAdvisor.
Nevertheless, Wakurtosis offers the flexibility to choose between this and a 1-to-1 simulation, catering to the specific needs of each test scenario.
The results presented here are from 1-to-1 simulations.
Contributor

I would say this earlier in the paragraph. Without this info, this paragraph reads as if the results presented here are based on unreliable runs.

@Daimakaimura
Author

Just pushed some more changes addressing the comments that were pending. Still waiting for the last round of simulations with a rate of 0 msg/s to expand on that. I will ask for re-review once I have added those.

Contributor

@kaiserd kaiserd left a comment

Thank you :). Left some comments in-line, will give feedback on the rest later.


To evaluate Waku under varied conditions, we conducted simulations across a range of network sizes, topologies, and message rates. Each simulation lasted 3 hours to reach a steady state.

The network sizes explored included 75, 150, 300, and 600 nodes. For Non-Discv5 simulations, we used static topologies with average node degrees of K=50. In simulations with Discv5, we set the max_peers parameter to 50 to approximate similar average degrees.
Contributor

We should introduce why we test non-discv5, and explain that this is an "artificial scenario" here when we first mention non-discv5.

Author

There is an explanation in the subsection Clarification on Non-Discv5 Scenarios below. Also, Discv5 is briefly explained in the intro.

Contributor

I'd still write a sentence explaining why we look into non-discv5 here.
Alternatively, you can move the "Clarification on Non-Discv5 Scenarios" up here, and adjust it accordingly so that it fits here.


To stress test message throughput, we simulated message rates of 1 and 10 messages per second. We initially ran simulations at up to 100 msg/s, but we found the results were unreliable due to simulation hardware limitations and therefore decided not to include them in this analysis.

We also included simulation batches with no load (i.e. 0 msg/s) to provide a clearer picture of Waku's baseline resource demands and inherent overhead costs stemming from core protocol operations.
Contributor

What is part of this baseline?
I'd definitely mention testing discv5 in isolation here.

Author

Not sure if I follow, can you please elaborate a bit?

Contributor

What are "Waku's baseline resource demands" comprised of?

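For reference, a minimal sketch of the simulation grid described in the setup excerpts above (75 to 600 nodes, degree around 50, 0/1/10 msg/s, 3-hour runs). The variable names are illustrative only and are not Wakurtosis configuration keys:

```python
from itertools import product

# Illustrative enumeration of the simulation batches described above.
network_sizes = [75, 150, 300, 600]   # nodes, 1 node per container
message_rates = [0, 1, 10]            # msg/s; 0 is the no-load baseline
target_degree = 50                    # static K, or Discv5 max_peers
duration_s = 3 * 60 * 60              # each run lasts 3 hours

for use_discv5, nodes, rate in product([False, True], network_sizes, message_rates):
    print(f"discv5={use_discv5} nodes={nodes} rate={rate} msg/s "
          f"degree~{target_degree} duration={duration_s}s")
```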
@Daimakaimura
Author

Just pushed the remaining changes. Also included memory-vs-nodes plots for clarity in the discussion of the results and to better showcase the different scaling of memory. @kaiserd please let me know if there is something I've missed.

Contributor

@kaiserd kaiserd left a comment

Here is the rest of my feedback.

Generally I'd remove the runs where we see reduced bandwidth use due to hardware limitations preventing messages from being injected.
Seeing bandwidth consumption drop with larger network sizes is misleading.
That is not a real-world scenario but just a testing-environment artifact.


The analysis reveals interesting trends in bandwidth and memory usage across different network sizes and loads.

Transmission bandwidth does not consistently increase with more nodes when messages are sent, while reception bandwidth exhibits even more variable behavior depending on scale and load.
Contributor

Given the message rate is independent of the number of nodes,
it is expected that the bandwidth consumption is close to constant, too.

The following statement goes against this expectation:

bandwidth does not consistently increase

since it indicates an expected increase in message rate.
Do you expect such an increase, and if yes, why?

Is there something more going on than bandwidth dropping because the testing hardware is overloaded?
(Imo, this strange behaviour, which stems only from artifacts of the testing environment and does not happen in a real scenario, should not be part of the actual analysis and should only be mentioned as additional info in a subsequent paragraph; it might confuse readers.)

Contributor

I agree with Daniel here that this kind of "pure observation" without interpretation can be confusing, especially in the opening of the paragraph. As a reader, I'm interested in questions such as "does Waku bandwidth grow in any particular pattern with network growth or message rate increase?", "does this pattern tell me anything fundamental about the protocol?", "is there anything in the results that surprises us or deviates from what we would expect mathematically?". Any particular observations about simulation anomalies can be mentioned (with some interpretation) separately, but the paragraph should open with and focus on interpretative questions like these.

The reduced bandwidth at high message rates stems from simulation infrastructure limits rather than protocol inefficiencies.
The no-load measurements provide insights into baseline protocol overhead costs that grow with network size.

Transmission overhead appears to increase linearly, while reception overhead accelerates sharply beyond 300 nodes.
Contributor

What is the explanation for this? Is this linked to the overloaded hardware, too? If yes, we should explicitly mention that here.

Contributor

Does the transmission overhead increase linearly in the number of nodes or in the number of messages?
This should be explicit here.

Assuming you mean number of nodes, why does it increase linearly even though the message rates are fixed?
(Especially since this scenario is without discv5.)

Contributor

The first sentence here seems to me to be the important takeaway in this paragraph. "Bandwidth increases linearly with...". It should not be mixed in with an uninterpreted observation about simulation effects in reception bandwidth.

With the discovery mechanism, we again see interesting trends in resource usage across different network sizes and traffic loads.

Transmission bandwidth remained fairly stable with more nodes.
We again observe reduced bandwidth at the highest rates and node counts, stemming from simulation infrastructure constraints rather than inherent protocol limitations.
Contributor

Again, I'd put this in a separate section, because this is confusing to someone quickly reading over the doc.
This is only a test-specific artifact.


Reception bandwidth grows much faster with number of nodes compared to the baseline, especially the large spike from 300 to 600 nodes with discovery enabled.
This reflects the substantial additional overhead of neighbor discovery and tracking.
Contributor

Is this because message rates / payload transmission is comparatively low?
Somebody reading this would expect that discovery traffic is lower.
A bit more explanation would be helpful here.

Contributor

@jm-clius jm-clius left a comment

Thanks for this! I have added some comments, most of which are related to making sure this post is more oriented towards conclusions/interpretation of "Scaling the Waku Protocol" rather than stating simulation observations. In other words, do the tests tell us something about the scalability of the Waku Protocol (e.g. what is our interpretation, for each scenario, of how Waku protocols respond to changes in network size and message rates)? These are mentioned in the concluding section, but I think this should be the focus throughout.


With the discovery mechanism, we again see interesting trends in resource usage across different network sizes and traffic loads.

Transmission bandwidth remained fairly stable with more nodes.
Contributor

What does this mean? Does this tell us something about the Waku messaging protocol, our chosen discovery technique, and network size?

Contributor

kaiserd commented Oct 9, 2023

What is the status here? What is blocking?
edit: Clarified in meeting; blocked on simulations.
-> prio on this
Simulations are expected to finish towards the end of this week.

Contributor

kaiserd commented Oct 15, 2023

@Daimakaimura @AlbertoSoutullo What is the status of the simulations?

@AlbertoSoutullo

@Daimakaimura @AlbertoSoutullo What is the status of the simulations?

I finished them on Thursday around 13:00 and notified Jordi. He will probably answer about the analysis part when he reads this.

@fryorcraken
Contributor

@Daimakaimura could you ping me and @jm-clius once you have updated the post with the latest data please?

@Daimakaimura
Author

@Daimakaimura could you ping me and @jm-clius once you have updated the post with the latest data please?

Of course!

@Daimakaimura
Author

@fryorcraken @kaiserd @jm-clius just pushed the last changes with the results of the last round of simulations. Going to ask for re-review.

### Scalability

To overcome scalability challenges, Wakurtosis employs a multi-node approach, running several nodes within one container.
This method supports simulations with over 1,000 nodes on a single machine.
Contributor

The biggest network size mentioned below is 600 nodes. Why not use the full capacity of Wakurtosis?

Author

Because for that size we would need to employ more than 1 node per container. We are only showing simulation results for 1 node per container, mostly for clarity, as multi-node runs can skew the results and would be harder to compare with the rest of the batches.


After this initial decrease, the bandwidth plateaus without significant growth even as rates rise. This contrasts sharply with the 600-node setup, where even without messaging, substantial bandwidth is consumed, comparable to active messaging in smaller networks.

![Total mean bandwidth usage without discovery mechanism](/static/img/wakurtosis_waku/non_discv5_bandwidth.png)
Contributor

I am not sure I understand the scale. The graph indicates MB for the bandwidth.

Is that the total bandwidth used over the 3 hours of message injection?

Would be good to provide some basis. For example, what is the baseline data volume we are injecting? 3 hours * 1 msg per second * msg size.

Also the graph has 5 lines but we stated 4 message rates (0.25, 0.5, 0.75, 1 messages per second).

Author

Yes, this is the average total bandwidth usage per node over the 3-hour simulation. I will add the pure payload lines into the plot to show how much data we are actually injecting into the network.

Author

@fryorcraken it's 5 series: 0.0, 0.25, 0.5, 0.75, and 1.0 msg/s. I just realised there is a missing legend. I'll fix that too.
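
As a rough basis for reading the plot scale discussed in this thread, a minimal sketch of the injected payload volume per run; the message size is a hypothetical value, since the excerpt does not state it:

```python
# Back-of-the-envelope payload injected over a 3-hour run.
# message_size_bytes is hypothetical; the post excerpt does not give it.
def injected_payload_mb(rate_msgs_per_s: float,
                        message_size_bytes: int = 1_000,
                        duration_s: int = 3 * 60 * 60) -> float:
    """Total payload injected into the network, in megabytes."""
    return rate_msgs_per_s * message_size_bytes * duration_s / 1e6

for rate in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{rate} msg/s -> {injected_payload_mb(rate):.1f} MB over 3 hours")
```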


#### Results with discovery mechanism (Discv5)

With the discovery mechanism discv5, distinct resource usage patterns emerge. Transmission bandwidth remains relatively stable despite an eight-fold node increase, evidencing the protocol's transmission efficiency under discovery. However, reception bandwidth shows a different trend. From 75 to 300 nodes, a significant increase occurs, especially at 0 msg/s (95.83 to 254.7 Mb/s). The 300 to 600 node change is even more dramatic, with bandwidth surging to 497.47 Mb/s without messaging.
Contributor

with bandwidth surging to 497.47 Mb/s without messaging.

This seems worrying. Cc @jm-clius @alrevuelta

Contributor

497.47 Mb/s with discv5 and without messaging is indeed worrying, but I have never seen this with https://github.com/waku-org/waku-simulator. How can we reproduce it?


497.47 Mb/s with discv5 and without messaging is indeed worrying, but I have never seen this with https://github.com/waku-org/waku-simulator. How can we reproduce it?

If you want, we can pick a day to do a quick call and reproduce this. It was done with Wakurtosis; you can do it with ~3 commands, but the script to analyze the data is owned by @Daimakaimura. Also @Daimakaimura, are those 497.47 Mb/s constant over the entire simulation? Did you check when this happened? Because if it is a "constant" behaviour, we can test it in a ~5 minute run and it will be easier to check.

Author

@AlbertoSoutullo those ~500 MB are the average total bandwidth usage per node over the whole simulation (i.e. 3 h). @alrevuelta we were also surprised by that result, so we re-ran the simulation and got similar results (plot attached). We should indeed arrange a meeting to discuss this, because it is indeed weird. Also, keep in mind this also happens without Discv5.
analysis_hw.pdf

Contributor

kaiserd commented Oct 27, 2023

@jm-clius @alrevuelta @fryorcraken

The simulations were done with an older Waku version, and it seems that the ~500 MB bandwidth without sending payload messages might be caused by an anomaly/bug in that Waku version. The experiments will be re-run with an up-to-date Waku version to confirm this.

Also, this anomaly only appears with higher node counts. @alrevuelta What was the highest node count you tested in waku-simulator? How many peer connections did nodes have in your experiments?

@alrevuelta
Contributor

@alrevuelta What was the highest node count you tested in waku-simulator? How many peer connections did nodes have in your experiments?

With waku-simulator, around 500 nodes; with Shadow, 1000. Around 50 connections per node.


With the discovery mechanism discv5, distinct resource usage patterns emerge. Transmission bandwidth remains relatively stable despite an eight-fold node increase, evidencing the protocol's transmission efficiency under discovery. However, reception bandwidth shows a different trend. From 75 to 300 nodes, a significant increase occurs, especially at 0 msg/s (95.83 to 254.7 Mb/s). The 300 to 600 node change is even more dramatic, with bandwidth surging to 497.47 Mb/s without messaging.

The observation of a similar anomaly in simulations without the discovery mechanism suggests that the issue might lie with the protocol implementation itself, rather than being merely a simulation artifact.
Contributor

The phrasing above suggests the anomaly is connected to discv5, while here it is stated it is not dependent on discv5.

With the discovery mechanism discv5, distinct resource usage patterns emerge.

This sentence should be removed; the anomaly should be discussed in a section independent of discv5.

Contributor

@jm-clius jm-clius left a comment

@kaiserd not sure what the plan with this PR is? Doing a bit of GitHub cleanup. :)
