This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Possible Bug in Consensus #184

Open
chaegleyOnclave opened this issue Jun 20, 2022 · 0 comments

I've been noticing an issue when using PBFT consensus and hoped I could find some help here.
When two different nodes attempt to write different blocks at nearly the same time, one node wins and the other fails to write its block. That is expected, but the node that lost then reports the following "failed to cancel block" error:

INFO | pbft_engine::node:47 | Failed to cancel block when becoming secondary: InvalidState("Cannot cancel block in current state")

This leads to the PBFT engine behaving erratically and eventually crashing.

When the PBFT engine finally crashes, it reports a ZMQ error stating that the socket was dropped, with little other context.

Before the crash, it is unable to use consensus properly.

Other nodes may also crash once this failure state is reached, even though they did not fail to write a block.

Stopping and rebuilding the Docker containers usually fixes the issue, but it is concerning that it occurs at all, and that nodes which never entered this error state also fail.

I realized that my nodes are running PBFT engine version 1.0.2, and I plan to upgrade to the latest version. However, I'm not sure that will prevent this issue from happening again. So far I have not been able to reproduce it consistently, because it is a timing problem that only occurs when two nodes attempt to write a block at nearly the same time. Still, I have seen it occur multiple times and am concerned about PBFT's stability.
