
Large peer count performance #2854

Open
dawn-minion opened this issue Nov 6, 2024 · 1 comment
@dawn-minion (Contributor)

At scales of 1000+ peers on a single GoBGP server, we've noticed that the time to bring up additional peers grows in proportion to both the number of peers and the number of routes each peer advertises to the server.

As an example, with 2500 iBGP peers, one route advertised per peer, and the server configured as a route reflector, it takes nearly 5 minutes for all peers to reach the established state. If the number of routes is increased, this time grows proportionally. During this bring-up, GoBGP API performance also suffers: it can take tens of seconds to run even a simple gobgp neighbor command.

From what we can see, the bottleneck is that all events are funneled through a single goroutine here: https://github.com/osrg/gobgp/blob/master/pkg/server/server.go#L487-L505 - so every new route received and every API request passes through it, and every time a peer wants to transition from one state to another, it exits and waits for this goroutine to handle the event and start a new peer goroutine.

While slower processing of routes and API requests is tolerable, the peer state transitions handled via this routine are not. Under the load described above, the routine develops a significant backlog, and with short hold timers it can take so long to process a transition from, say, OPENSENT to ESTABLISHED that the remote end's hold timer expires.

There also seems to be a non-insignificant amount of load caused by the use of reflection on each iteration to process these events: we see upwards of 15-20% of CPU time spent there alone under this type of load. Replacing it with a single shared channel to the server (i.e. all peers writing to the same peer -> server channel) provided a noticeable improvement, but of course at the cost of no longer randomizing fairly across all peers.

It seems possible to let the peer FSM advance from state to state without exiting and being restarted by the server each time, eliminating the bottleneck entirely. Some rework would likely be needed to ensure all the functionality in handleFSMMessage works when the peer runs independently, but as far as we can see it would not be a breaking API change.

Would you be open to a PR that implements this?

@fujita (Member) commented Nov 13, 2024

I'm not sure Go is the right choice for handling that many peers. As you said, there is room for improvement in GoBGP at the cost of added complexity, but I'm not sure GoBGP can compete with C or Rust BGP implementations.
