Fix the events service connection #257

dailinsubjam · 2024-11-25T15:00:31Z

Closes #255

Pasted here what @Ayiga mentioned regarding the connection issue:

While running the inscriptions service, I noticed that after Mainnet went through its degraded behavior and recovered, the Block stream never recovered.
- There’s already code in both the Builder and the Inscriptions service to automatically reconnect when we detect that the stream is “closed”. There may be some underlying issue with the source rather than the consumer.
- We may still want to address this on the consumer side, however, and I think the best way to do that is to keep track of the time since the last object was received.
- Since both the Block Stream and the Event Stream should be receiving events fairly often, this seems like a good safeguard for detecting a state that doesn’t appear to be progressing.
- Once this state is detected, we can just trigger a reconnect in the hope that it will recover and resolve the issue.
NOTE: With this approach, we may end up triggering reconnects unnecessarily (at least with the block stream) due to potential latency / timeouts in the consensus protocol itself.
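The "time since the last object received" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `IdleTracker`, `record_event`, and `is_stalled` are hypothetical names and are not taken from the Builder or the Inscriptions service. The current time is passed in explicitly so the logic can be exercised without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the actual codebase): remembers when the
/// last item arrived and reports whether the stream looks stalled.
pub struct IdleTracker {
    last_received: Instant,
    idle_timeout: Duration,
}

impl IdleTracker {
    /// Start tracking from `now`; `idle_timeout` is the longest gap we accept.
    pub fn new(now: Instant, idle_timeout: Duration) -> Self {
        Self { last_received: now, idle_timeout }
    }

    /// Call this every time an event or block is received.
    pub fn record_event(&mut self, now: Instant) {
        self.last_received = now;
    }

    /// True if more than `idle_timeout` has passed since the last event.
    pub fn is_stalled(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_received) > self.idle_timeout
    }
}
```

A caller would invoke `record_event` on every received item and evaluate `is_stalled` from something that runs even when nothing arrives, such as a periodic timer. If the check only runs right after an event is received, it can only notice a gap that has already ended.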

This PR:

Adds an idle-period check that disconnects and reconnects when no events have been received for too long (currently set to 1 second).
Introduces a timestamp to track the last received event time in EventServiceStream.

This PR does not:

Key places to review:

QuentinI · 2024-11-25T17:25:29Z

crates/shared/src/utils.rs

                        }
                    },
                }
+
+                // Disconnect and reconnect if no event has been received within quite a long time (set to 1s by default)


This code runs only after an event has been received, so what it detects is a long period without events that has just ended. Instead, we should wrap connection.next() in tokio::time::timeout and handle the timeout case.

Makes sense. Just want to make sure I understand correctly: with the current code we could wait a long time before hitting this line, but with tokio::time::timeout we'll wait at most the period I set. Is that correct?
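To make the difference concrete, here is a minimal sketch of a bounded wait. It deliberately swaps in the standard library's `Receiver::recv_timeout` as a stand-in for wrapping `connection.next()` in `tokio::time::timeout`, so it runs without any async runtime; the real stream type and reconnect logic in `utils.rs` are not shown in this thread and are not reproduced here. `Step` and `next_or_reconnect` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical result of one wait: either an event or a signal to reconnect.
#[derive(Debug, PartialEq)]
pub enum Step<T> {
    Event(T),
    Reconnect,
}

/// Wait for the next item, but give up once `idle_timeout` has elapsed.
pub fn next_or_reconnect<T>(rx: &Receiver<T>, idle_timeout: Duration) -> Step<T> {
    match rx.recv_timeout(idle_timeout) {
        // An item arrived within the window.
        Ok(event) => Step::Event(event),
        // Nothing arrived in time: the wait itself ends, so the caller can
        // react while the stall is still ongoing.
        Err(RecvTimeoutError::Timeout) => Step::Reconnect,
        // The sending side is gone, i.e. the stream is closed.
        Err(RecvTimeoutError::Disconnected) => Step::Reconnect,
    }
}
```

In the async version, `tokio::time::timeout(period, connection.next())` plays the same role: it resolves to an `Err` if the inner future has not completed within `period`, so the consumer is never blocked on a silent stream for much longer than that period.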

coveralls · 2024-11-26T16:27:20Z

Pull Request Test Coverage Report for Build 12118062892

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

61 of 69 (88.41%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.01%) to 89.681%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
crates/shared/src/utils.rs	61	69	88.41%

Totals
Change from base Build 12022052903:	0.01%
Covered Lines:	6944
Relevant Lines:	7743

💛 - Coveralls

shenkeyao · 2024-11-27T06:12:01Z

crates/shared/src/utils.rs

+
+        // Simulate idle timeout by stopping the server and waiting
+        app_handle.abort();
+        tokio::time::sleep(IDLE_TIMEOUT + Duration::from_millis(500)).await; // Wait longer than idle timeout


Why can't this just be IDLE_TIMEOUT? Or can we increase IDLE_TIMEOUT to 2 seconds if we want it to be longer?

If these consts are tied (maybe not?) with the setting in non-test code, e.g., the connect function, can we move them to the non-test module near RETRY_PERIOD and replace hard-coded values with them?

500ms is to make sure we wait at least IDLE_TIMEOUT. It's also fine to do a smaller add-on. Updated to RETRY_PERIOD in 2a9d3ab.

I'm still confused about the purpose of setting the sleep period to at least RETRY_PERIOD if we create a new app immediately afterward.

IIUC when the period exceeds RETRY_PERIOD, the connection.next() call will enter the Err(_) case, whereas before this PR, it would get stuck. However, if we create a new_app_handle immediately after this sleep period, it looks like we are essentially testing what's already covered in test_event_stream_wrapper.

I think we should verify whether .next() returns an error as expected after the sleep, i.e., whether we do try to reconnect.
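One possible shape for such a check, sketched with std channels instead of the real event service so it stays self-contained: the wrapper surfaces the timeout as an error and counts reconnects, which lets a test assert on both before any new server is started. `ReconnectingStream` and its `reconnects` field are hypothetical and are not the types used in this PR.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical wrapper: reconnects via `connect` whenever the current
/// receiver is idle for longer than `idle_timeout` or has been closed.
pub struct ReconnectingStream<T, F>
where
    F: FnMut() -> Receiver<T>,
{
    connect: F,
    rx: Receiver<T>,
    idle_timeout: Duration,
    /// Number of reconnects performed so far; exposed so tests can assert on it.
    pub reconnects: usize,
}

impl<T, F> ReconnectingStream<T, F>
where
    F: FnMut() -> Receiver<T>,
{
    pub fn new(mut connect: F, idle_timeout: Duration) -> Self {
        let rx = connect();
        Self { connect, rx, idle_timeout, reconnects: 0 }
    }

    /// Returns the next event, or the error that triggered a reconnect.
    pub fn next(&mut self) -> Result<T, RecvTimeoutError> {
        match self.rx.recv_timeout(self.idle_timeout) {
            Ok(event) => Ok(event),
            Err(err) => {
                // Idle or closed: replace the connection and report why.
                self.rx = (self.connect)();
                self.reconnects += 1;
                Err(err)
            }
        }
    }
}
```

A test built on this can first assert that `next()` returns the timeout error and that `reconnects` went up by one, and only then send a new event and assert that it is delivered. That separates "the idle timeout fired and we reconnected" from "a fresh connection works".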

crates/shared/src/utils.rs

shenkeyao

As I replied in the earlier discussion, I think the test needs an additional assertion to make sure the reconnection works.

add event period checking

05405f0

dailinsubjam marked this pull request as draft November 25, 2024 15:01

dailinsubjam requested review from QuentinI, shenkeyao and Ayiga November 25, 2024 15:01

dailinsubjam added 3 commits November 25, 2024 23:08

upd comment

d86817c

try tests with CI

c8c3f95

upd comment

8bc5a7c

QuentinI reviewed Nov 25, 2024

upd to use tokio::timeout

34fa4e9

dailinsubjam added 4 commits November 27, 2024 00:27

upd without tests

02ebb27

clean up comment

e73731b

upd tests

0ca2f07

Merge branch 'main' into sishan/events_service_connection

9c0b19c

dailinsubjam marked this pull request as ready for review November 26, 2024 16:39

dailinsubjam requested a review from QuentinI November 26, 2024 16:39

shenkeyao reviewed Nov 27, 2024

dailinsubjam added 3 commits November 27, 2024 17:43

use RETRY_PERIOD as IDLE_TIMEOUT

2a9d3ab

add test_

9561de9

remove last_event_time

bca48d1

dailinsubjam requested a review from shenkeyao November 27, 2024 09:49

shenkeyao reviewed Nov 27, 2024

dailinsubjam added 4 commits December 2, 2024 18:38

add a line to check whether stream connection returns an error

ee22e9b

test test

0f89818

clippy

b10751d

restore to correct test

79b8202

dailinsubjam requested a review from shenkeyao December 2, 2024 11:09

shenkeyao approved these changes Dec 2, 2024

dailinsubjam merged commit db39cfe into main Dec 3, 2024
7 of 8 checks passed

dailinsubjam deleted the sishan/events_service_connection branch December 3, 2024 06:57
