
rmb calls timeout with some twins on devnet #200

Open
Omarabdul3ziz opened this issue Oct 23, 2024 · 6 comments
Labels
type_bug Something isn't working

Comments

@Omarabdul3ziz

Some twins on devnet (e.g., twin 29) are hitting random timeouts when making calls to devnet nodes. The issue is inconsistent: some calls succeed while others fail.

Here are some findings:

  • The issue is not persistent; a node may occasionally respond (about 1 out of 5 calls).
  • The same node can still respond to other twins.
  • Both the TS and Go rmb clients are affected.
  • Calls only succeed when the caller twin is connected to a relay that is not listed in the node twin's relays (if the node twin's relays are relay1 and relay2, a caller on relay3 works fine, but calls fail if the caller is on relay1 or relay2).
@Omarabdul3ziz
Author

After debugging with @AhmedHanafy725, this is what we found:

The relay cache was not updating properly because the chain event listener had failed, leaving outdated relay info in the cache.

  • Normally, the cache on a relay is updated in two cases:

    • a twin sends a request to this relay, so the relay updates only its own cache, or
    • a twin updates its relays on the chain, and an event listener on each relay updates the cache based on the new relays on the chain.
  • We noticed that when a twin on relay1 sends to a twin registered on relay1/relay2, the response is delivered successfully if it comes back through relay1.
    But if it goes through relay2 and relay2 has an outdated cache entry for the destination twin, it neither federates the message nor succeeds in delivering the response.

  • The relay's chain listener stopped without any notification, so the cached twin relays were never invalidated, and some relays are stuck sending to twins that are not connected to them. This likely happened during the recent chain node update: the downtime broke the connection and the listener couldn't reconnect.

  • Restarting the relays makes the cache mechanism work fine again.

  • We should monitor the chain listener's health in the relay, or create a separate service on the stack that gets rebooted with any chain update (a watchdog sketch follows).
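A minimal sketch of what that health monitoring could look like, assuming the listener can call a `beat()` hook on every chain event it processes; the struct and method names here are hypothetical, not the relay's actual API:

```rust
// Hypothetical watchdog: the chain listener reports a heartbeat on every event,
// and a separate task exits the process if the listener goes silent, so Docker
// (or whatever supervises the relay) restarts it with a fresh listener.
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

#[derive(Clone)]
struct ListenerHealth {
    last_event: Arc<Mutex<Instant>>,
}

impl ListenerHealth {
    fn new() -> Self {
        Self { last_event: Arc::new(Mutex::new(Instant::now())) }
    }

    // Called by the chain event listener each time it receives an event/block.
    async fn beat(&self) {
        *self.last_event.lock().await = Instant::now();
    }

    // Watchdog loop: if no heartbeat arrives within `max_silence`, assume the
    // listener is dead and exit non-zero so the container gets restarted.
    async fn watch(self, max_silence: Duration) {
        loop {
            tokio::time::sleep(max_silence / 2).await;
            if self.last_event.lock().await.elapsed() > max_silence {
                eprintln!("chain listener silent for more than {max_silence:?}, exiting");
                std::process::exit(1);
            }
        }
    }
}
```

Exiting (rather than restarting the listener in-process) keeps the fix simple and leans on the container restart policy, which also clears any other state the dead listener may have left behind.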

@AhmedHanafy725

As a suggestion, we can use the GraphQL processor to update the relay's Redis cache.
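A rough sketch of that idea, run periodically as a fallback for missed chain events. The endpoint, the `twins { twinID relay }` query shape, and the `twin.<id>` Redis key format are all assumptions here, not the indexer's or relay's real schema:

```rust
// Sketch of a periodic cache refresh from the GraphQL indexer. All names below
// (query fields, Redis key layout) are assumptions for illustration only.
use redis::AsyncCommands;
use serde::Deserialize;

#[derive(Deserialize)]
struct Twin {
    #[serde(rename = "twinID")]
    twin_id: u32,
    relay: Option<String>,
}

#[derive(Deserialize)]
struct Data {
    twins: Vec<Twin>,
}

#[derive(Deserialize)]
struct Resp {
    data: Data,
}

async fn refresh_twin_cache(graphql_url: &str, redis_url: &str) -> anyhow::Result<()> {
    let body = r#"{"query": "{ twins { twinID relay } }"}"#;
    let resp: Resp = reqwest::Client::new()
        .post(graphql_url)
        .header("Content-Type", "application/json")
        .body(body)
        .send()
        .await?
        .json()
        .await?;

    let mut conn = redis::Client::open(redis_url)?
        .get_multiplexed_async_connection()
        .await?;
    for twin in resp.data.twins {
        if let Some(relay) = twin.relay {
            // Overwrite whatever the event listener cached; stale entries are the bug here.
            let _: () = conn.set(format!("twin.{}", twin.twin_id), relay).await?;
        }
    }
    Ok(())
}
```

Running something like this on a timer would bound how long a stale relay entry can survive, even if the chain listener silently dies again.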

@SalmaElsoly SalmaElsoly self-assigned this Nov 26, 2024
@Nabil-Salah
Contributor

26/11/2024

We investigated the rmb code and possible ways of solving this issue.

@Nabil-Salah
Contributor

27/11/2024

Worked on fixing the event listener and adding a backoff strategy (a reconnect sketch follows).
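A minimal sketch of that backoff, assuming the listener loop can be wrapped in a function that returns an error when the subscription drops; `listen_to_chain_events` is a placeholder, not the relay's actual function:

```rust
// Reconnect-with-backoff sketch: instead of letting the listener die silently,
// retry the subscription with an exponentially growing delay, capped at a max.
use std::time::Duration;

async fn run_listener_with_backoff() {
    let mut delay = Duration::from_secs(1);
    let max_delay = Duration::from_secs(60);

    loop {
        match listen_to_chain_events().await {
            // The subscription ended cleanly; reset the backoff and resubscribe.
            Ok(()) => delay = Duration::from_secs(1),
            Err(e) => {
                eprintln!("chain listener failed: {e}, retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}

// Placeholder for the real subscription loop in the relay; it should only
// return when the subscription ends or errors out.
async fn listen_to_chain_events() -> Result<(), Box<dyn std::error::Error>> {
    unimplemented!()
}
```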

@Nabil-Salah Nabil-Salah mentioned this issue Nov 28, 2024
@Omarabdul3ziz
Author

After a discussion, we suspect that the chain restarted without notifying the relay, resulting in a half-open connection. The process doesn't crash, nor does it trigger a reconnection, since the connection still appears valid from the relay's point of view.

To validate this guess, we could set up a local relay and chain and observe the connection/cache behavior when the chain is restarted while the relay stays up.

If this is the case, we could:

  • implement a timeout mechanism on the relay side to detect idle or stale connections and reconnect (see the sketch below)
  • force a relay restart whenever the chain process restarts
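For the first option, a minimal sketch assuming the relay can bound each read from the chain subscription with a timeout; `next_event` and `handle_event` are hypothetical stand-ins for the relay's real subscription API:

```rust
// Idle-timeout sketch: a half-open connection shows up as "no events for a
// while", so bound each read and let the caller tear down and reconnect.
use std::time::Duration;
use tokio::time::timeout;

struct ChainEvent; // placeholder for a decoded chain event

async fn next_event() -> Result<ChainEvent, String> {
    unimplemented!() // placeholder: pull the next event from the subscription
}

fn handle_event(_event: ChainEvent) {
    // placeholder: invalidate/update the cached twin relays
}

async fn read_events_with_idle_timeout(max_idle: Duration) -> Result<(), String> {
    loop {
        match timeout(max_idle, next_event()).await {
            Ok(Ok(event)) => handle_event(event),
            Ok(Err(e)) => return Err(format!("subscription error: {e}")),
            // No event within `max_idle`: treat the connection as stale so the
            // caller reconnects instead of waiting on a dead socket forever.
            Err(_) => return Err("no chain events within idle window".into()),
        }
    }
}
```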

@Nabil-Salah
Contributor

After discussing with Omar, we determined that the issue is not related to a half-open connection problem.

The actual problem occurs when the thread running the listener panics.

In this scenario:

  • The application remains alive but continues running without the listener.
  • Since the thread does not restart, no retry mechanism is triggered to recover the listener.
  • Additionally, the application itself does not terminate, so Docker does not restart the container to restore functionality.

If this is the case, we could:

  • Exit the whole application by allowing the thread's termination to bring down the app (a sketch of this option follows).
  • Implement a retry mechanism with an interval.
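A small sketch of the first option, assuming the listener runs as a spawned tokio task; `run_chain_listener` is a placeholder for the real loop:

```rust
// Fail-fast sketch: spawn the listener and await its handle; a panic inside the
// task surfaces as a JoinError here, and exiting non-zero lets Docker restart
// the container instead of leaving the relay running without a listener.
async fn supervise_listener() {
    let handle = tokio::spawn(async {
        run_chain_listener().await;
    });

    if let Err(e) = handle.await {
        eprintln!("chain listener task ended unexpectedly: {e}");
        std::process::exit(1);
    }
}

// Placeholder for the relay's actual chain event listener loop.
async fn run_chain_listener() {
    unimplemented!()
}
```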
