Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigate blocking cometbft reactors impact on p2p connectivity #10742

Closed
mhofman opened this issue Dec 18, 2024 · 1 comment · Fixed by #10770
Closed

Mitigate blocking cometbft reactors impact on p2p connectivity #10742

mhofman opened this issue Dec 18, 2024 · 1 comment · Fixed by #10770
Assignees
Labels
agoric-cosmos enhancement New feature or request

Comments

@mhofman
Copy link
Member

mhofman commented Dec 18, 2024

What is the Problem Being Solved?

While investigating the impact of long blocks on the p2p layer (#10075), we realized that cometbft assumed that channels and their reactors would process messages quickly / in a non blocking manner (cometbft/cometbft#2685). However they do not (cometbft/cometbft#3250), and because of that the basic processing ping / pong messages over the connection is blocked behind these reactors calls completing (cometbft/cometbft#2533).

The outcome is that when the ABCI app (cosmos)'s EndBlock & CommitBlock take a while, which currently happens because of performance issues in SwingSet, the p2p connectivity suffers, compounding the recovery of the network.

Description of the Design

We don't want to change the ordering of messages received over the peer connection, even though the cosmos seems mostly capable of dealing with a relaxation in the current strict ordering that the in-process tendermint offers (with the relaxation created by our committing client). Instead we want to unblock ping/pong message processing from channel messages.

To that end we can add a goroutine processing the messages that have been read from the connection. The number and aggregate size of pending messages should be bounded.

Security Considerations

A remote peer should remain incapable of consuming large amount of memory or CPU on the local peer. It would be optimal for a remote peer to become disconnected if they do not faithfully participate in the p2p network.

Scaling Considerations

The amount of work to maintain a peer connection should not grow based on the number of messages pending over that connection

Test Plan

Manual testing on a patched follower node

Unit testing of edge conditions when bounds on the queue are reached, and when the connection is closed at particular times (e.g. with pending messages in the queue_

Upgrade Considerations

Requires a chain software upgrade but does not affect consensus

@mhofman
Copy link
Member Author

mhofman commented Dec 18, 2024

Draft PR for fix in our cometbft fork: agoric-labs/cometbft#13

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
agoric-cosmos enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants