Load Balanced RPC redirecting to out of sync nodes #70

Closed
yohanelly95 opened this issue Jul 18, 2024 · 2 comments

@yohanelly95

Validators using the load-balanced RPC were sometimes redirected to a faulty/out-of-sync RPC, which consistently returned a stale block number, leading to validator downtime. Restarting the validator node did not fix it. The out-of-sync RPC was found to be https://skale-node2.01node.com:10136/ , but it could change over time.

load-balanced RPC: https://mainnet.skalenodes.com/v1/turbulent-unique-scheat

@dmytrotkk
Contributor

Thanks for opening this issue, @yohanelly95. We’ll look into it.

For context: we cannot check if a node is synced on each call, as it would add significant overhead to the Nginx proxy. Instead, we check block timestamps every three hours and remove out-of-sync endpoints from the rotation.

Here’s how it works: we check the latest block timestamp on every endpoint, take the highest one as the reference, and compare the others against it. The maximum allowed lag is 300 seconds (5 minutes). Given an average block time of 10.5 seconds (for the Razor chain), an endpoint can fall roughly 28 blocks behind (300 / 10.5 ≈ 28.6) and still stay in the rotation.

if is_node_out_of_sync(node['block_ts'], max_ts):

ALLOWED_TIMESTAMP_DIFF = 300
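
For readers following along, the check described above could look roughly like the sketch below. Only the name `is_node_out_of_sync`, its two arguments, the `block_ts` key, and `ALLOWED_TIMESTAMP_DIFF = 300` come from the snippets quoted here; the function body, `filter_synced_nodes`, `max_block_lag`, and `AVG_BLOCK_TIME` are assumptions made for illustration and may not match the actual proxy code.

```python
ALLOWED_TIMESTAMP_DIFF = 300  # seconds; value quoted in the comment above
AVG_BLOCK_TIME = 10.5  # seconds per block; figure quoted above for the Razor chain


def is_node_out_of_sync(block_ts, max_ts, allowed_diff=ALLOWED_TIMESTAMP_DIFF):
    """Assumed semantics: a node is out of sync when its latest block
    timestamp trails the highest observed timestamp by more than allowed_diff."""
    return max_ts - block_ts > allowed_diff


def filter_synced_nodes(nodes):
    """Hypothetical helper: keep only nodes within the allowed lag.

    Each node is a dict with a 'block_ts' key (Unix seconds).
    """
    if not nodes:
        return []
    # The freshest endpoint serves as the reference point.
    max_ts = max(node['block_ts'] for node in nodes)
    return [
        node for node in nodes
        if not is_node_out_of_sync(node['block_ts'], max_ts)
    ]


def max_block_lag(allowed_diff=ALLOWED_TIMESTAMP_DIFF, block_time=AVG_BLOCK_TIME):
    """Hypothetical helper: whole blocks a node can trail while still passing."""
    return int(allowed_diff // block_time)
```

Under these assumptions, an endpoint that is exactly 300 seconds behind still passes, and anything older is dropped until the next periodic check.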

We will come back to you after additional checks on our side.

@yohanelly95
Author

Hey @dmytrotkk! Are there any updates on how we can handle this?
