Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the events service connection #257

Merged
merged 16 commits into from
Dec 3, 2024

Conversation

dailinsubjam
Copy link
Contributor

@dailinsubjam dailinsubjam commented Nov 25, 2024

Closes #255

Pasted here what @Ayiga mentioned regarding the connection issue:

  • I noticed during the running of the inscriptions service, after Mainnet went through it’s degraded behavior and recovered, that the Block stream never recovered.

    • There’s already code in both the Builder and in the Inscriptions service to automatically reconnect when we detect that the stream is “closed”. There may be some underlying issue with the source rather than the consumer.
    • We may still want to address this on the consumer side, however, and the best way to address that, I think, is to keep track of the time since the last object received.
    • Since both the Block Stream, and the Event Stream should be receiving events fairly often, it seems like a good safe guard to determine whether we are in a state that doesn’t seem to be progressing.
    • Once this state is detected, we can just trigger a reconnect in hopes that it will recover and resolve the issue.

    NOTE: With this approach, we may end up triggering (at least with the block stream) unnecessarily due to potential latency / timeouts in the consensus protocol itself.

This PR:

  • Adds a time period check for disconnecting and reconnecting when no events are received for too long ( now set to 1 second).
  • Introduces a timestamp to track the last received event time in EventServiceStream.

This PR does not:

Key places to review:

@dailinsubjam dailinsubjam marked this pull request as draft November 25, 2024 15:01
}
},
}

// Disconnect and reconnect if no event has been received within quite a long time (set to 1s by default)
Copy link
Contributor

@QuentinI QuentinI Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code will be executed after we received the event, so what we'll detect here is a long period without events that has just ended. What we should do instead is use tokio::timeout with connection.next() and handle timeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, just want to make sure I understand correctly: here we'll wait pretty long before hit this line, but with tokio::timeout we'll only wait for the period I set, is that correct?

@coveralls
Copy link

coveralls commented Nov 26, 2024

Pull Request Test Coverage Report for Build 12118062892

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 61 of 69 (88.41%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 89.681%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/shared/src/utils.rs 61 69 88.41%
Totals Coverage Status
Change from base Build 12022052903: 0.01%
Covered Lines: 6944
Relevant Lines: 7743

💛 - Coveralls

@dailinsubjam dailinsubjam marked this pull request as ready for review November 26, 2024 16:39

// Simulate idle timeout by stopping the server and waiting
app_handle.abort();
tokio::time::sleep(IDLE_TIMEOUT + Duration::from_millis(500)).await; // Wait longer than idle timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't this just be IDLE_TIMEOUT? Or can we increase IDLE_TIMEOUT to 2 seconds if we want it to be longer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these consts are tied (maybe not?) with the setting in non-test code, e.g., the connect function, can we move them to the non-test module near RETRY_PERIOD and replace hard-coded values with them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

500ms is to make sure we wait at least IDLE_TIMEOUT. It's also fine to do a smaller add-on. Updated to RETRY_PERIOD in 2a9d3ab.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused about the purpose of setting the sleep period at least RETRY_PERIOD if we create a new app immediately afterward.

IIUC when the period exceeds RETRY_PERIOD, the connection.next() call will enter the Err(_) case, whereas before this PR, it would get stuck. However, if we create a new_app_handle immediately after this sleep period, it looks like we are essentially testing what's already covered in test_event_stream_wrapper.

I think we should verify whether .next() returns an error as expected after the sleep, i.e., whether we do try to reconnect.

crates/shared/src/utils.rs Outdated Show resolved Hide resolved
crates/shared/src/utils.rs Outdated Show resolved Hide resolved
Copy link
Member

@shenkeyao shenkeyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As replied to a previous discussion, I think the test needs additional assertion to make sure the reconnection works.

@dailinsubjam dailinsubjam merged commit db39cfe into main Dec 3, 2024
7 of 8 checks passed
@dailinsubjam dailinsubjam deleted the sishan/events_service_connection branch December 3, 2024 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Mainnet Issue] - Fix the events service connection
4 participants