
rmb calls timeout with some twins on devnet #200

Open
Omarabdul3ziz opened this issue Oct 23, 2024 · 6 comments
Labels
type_bug Something isn't working

Comments

@Omarabdul3ziz

Some twins on devnet (e.g., twin 29) are hitting random timeouts when making calls to devnet nodes. The issue is inconsistent: some calls succeed while others fail.

Here are some findings:

  • The issue is not persistent; a node may occasionally respond (about 1 out of 5 calls).
  • The same node can still respond to other twins.
  • Both the TS and Go rmb clients are affected.
  • Calls only succeed when the caller twin is connected to a relay that is not listed in the node twin's relays (if the node twin's relays are relay1 and relay2, a caller on relay3 works fine, but calls fail if the caller is on relay1 or relay2).
@Omarabdul3ziz
Author

After debugging with @AhmedHanafy725, this is what we found:

The relay cache was not updating properly because the chain event listener had failed, leaving outdated relay info in the cache.

  • Normally, the cache on a relay is updated in two cases:

    • a twin sends a request to this relay, so the relay updates only its own cache, or
    • a twin updates its relays on the chain, and an event listener on each relay updates the cache based on the new relays on the chain.
  • We noticed that when a twin on relay1 sends to a twin registered on relay1/relay2, the response is delivered successfully if it comes back through relay1.
    But if it goes through relay2 and relay2 has an outdated cache entry for the destination twin, it neither federates the message nor succeeds in delivering the response.

  • The relay's chain listener stopped without any notification, so the cached twin relays were never invalidated, and some relays are stuck sending to twins that are not connected to them. This likely happened during the recent chain node update: the downtime broke the connection and the listener couldn't reconnect.

  • Restarting the relays makes the cache mechanism work fine again.

  • We should monitor the chain listener's health in the relay, or create a separate service on the stack that gets rebooted with any chain update (a watchdog sketch follows).
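A minimal sketch of what that health monitoring could look like, assuming the listener can call a `beat()` hook on every chain event it processes; the struct and method names here are hypothetical, not the relay's actual API:

```rust
// Hypothetical watchdog: the chain listener reports a heartbeat on every event,
// and a separate task exits the process if the listener goes silent, so Docker
// (or whatever supervises the relay) restarts it with a fresh listener.
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

#[derive(Clone)]
struct ListenerHealth {
    last_event: Arc<Mutex<Instant>>,
}

impl ListenerHealth {
    fn new() -> Self {
        Self { last_event: Arc::new(Mutex::new(Instant::now())) }
    }

    // Called by the chain event listener each time it receives an event/block.
    async fn beat(&self) {
        *self.last_event.lock().await = Instant::now();
    }

    // Watchdog loop: if no heartbeat arrives within `max_silence`, assume the
    // listener is dead and exit non-zero so the container gets restarted.
    async fn watch(self, max_silence: Duration) {
        loop {
            tokio::time::sleep(max_silence / 2).await;
            if self.last_event.lock().await.elapsed() > max_silence {
                eprintln!("chain listener silent for more than {max_silence:?}, exiting");
                std::process::exit(1);
            }
        }
    }
}
```

Exiting (rather than restarting the listener in-process) keeps the fix simple and leans on the container restart policy, which also clears any other state the dead listener may have left behind.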

@AhmedHanafy725

As a suggestion, we can use the GraphQL processor to update the relay's Redis cache.
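A rough sketch of that idea, run periodically as a fallback for missed chain events. The endpoint, the `twins { twinID relay }` query shape, and the `twin.<id>` Redis key format are all assumptions here, not the indexer's or relay's real schema:

```rust
// Sketch of a periodic cache refresh from the GraphQL indexer. All names below
// (query fields, Redis key layout) are assumptions for illustration only.
use redis::AsyncCommands;
use serde::Deserialize;

#[derive(Deserialize)]
struct Twin {
    #[serde(rename = "twinID")]
    twin_id: u32,
    relay: Option<String>,
}

#[derive(Deserialize)]
struct Data {
    twins: Vec<Twin>,
}

#[derive(Deserialize)]
struct Resp {
    data: Data,
}

async fn refresh_twin_cache(graphql_url: &str, redis_url: &str) -> anyhow::Result<()> {
    let body = r#"{"query": "{ twins { twinID relay } }"}"#;
    let resp: Resp = reqwest::Client::new()
        .post(graphql_url)
        .header("Content-Type", "application/json")
        .body(body)
        .send()
        .await?
        .json()
        .await?;

    let mut conn = redis::Client::open(redis_url)?
        .get_multiplexed_async_connection()
        .await?;
    for twin in resp.data.twins {
        if let Some(relay) = twin.relay {
            // Overwrite whatever the event listener cached; stale entries are the bug here.
            let _: () = conn.set(format!("twin.{}", twin.twin_id), relay).await?;
        }
    }
    Ok(())
}
```

Running something like this on a timer would bound how long a stale relay entry can survive, even if the chain listener silently dies again.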

@SalmaElsoly SalmaElsoly self-assigned this Nov 26, 2024
@Nabil-Salah
Contributor

26/11/2024

We investigated the rmb code and possible ways of solving this issue.

@Nabil-Salah
Contributor

27/11/2024

Worked on fixing the event listener and adding a backoff strategy (a reconnect sketch follows).
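A minimal sketch of that backoff, assuming the listener loop can be wrapped in a function that returns an error when the subscription drops; `listen_to_chain_events` is a placeholder, not the relay's actual function:

```rust
// Reconnect-with-backoff sketch: instead of letting the listener die silently,
// retry the subscription with an exponentially growing delay, capped at a max.
use std::time::Duration;

async fn run_listener_with_backoff() {
    let mut delay = Duration::from_secs(1);
    let max_delay = Duration::from_secs(60);

    loop {
        match listen_to_chain_events().await {
            // The subscription ended cleanly; reset the backoff and resubscribe.
            Ok(()) => delay = Duration::from_secs(1),
            Err(e) => {
                eprintln!("chain listener failed: {e}, retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}

// Placeholder for the real subscription loop in the relay; it should only
// return when the subscription ends or errors out.
async fn listen_to_chain_events() -> Result<(), Box<dyn std::error::Error>> {
    unimplemented!()
}
```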

@Nabil-Salah Nabil-Salah mentioned this issue Nov 28, 2024
@Omarabdul3ziz
Author

After a discussion, we suspect that the chain restarted without notifying the relay, resulting in a half-open connection. The process doesn't crash, nor does it trigger a reconnection, since the connection still appears valid from the relay's point of view.

To validate this guess, we could set up a local relay and chain and observe the connection/cache behavior when the chain is restarted while the relay stays up.

If this is the case, we could:

  • implement a timeout mechanism on the relay side to detect idle or stale connections and reconnect (see the sketch below)
  • force a relay restart whenever the chain process restarts
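For the first option, a minimal sketch assuming the relay can bound each read from the chain subscription with a timeout; `next_event` and `handle_event` are hypothetical stand-ins for the relay's real subscription API:

```rust
// Idle-timeout sketch: a half-open connection shows up as "no events for a
// while", so bound each read and let the caller tear down and reconnect.
use std::time::Duration;
use tokio::time::timeout;

struct ChainEvent; // placeholder for a decoded chain event

async fn next_event() -> Result<ChainEvent, String> {
    unimplemented!() // placeholder: pull the next event from the subscription
}

fn handle_event(_event: ChainEvent) {
    // placeholder: invalidate/update the cached twin relays
}

async fn read_events_with_idle_timeout(max_idle: Duration) -> Result<(), String> {
    loop {
        match timeout(max_idle, next_event()).await {
            Ok(Ok(event)) => handle_event(event),
            Ok(Err(e)) => return Err(format!("subscription error: {e}")),
            // No event within `max_idle`: treat the connection as stale so the
            // caller reconnects instead of waiting on a dead socket forever.
            Err(_) => return Err("no chain events within idle window".into()),
        }
    }
}
```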

@Nabil-Salah
Contributor

After discussing with Omar, we determined that the issue is not related to a half-open connection problem.

The actual problem occurs when the thread running the listener panics.

In this scenario:

  • The application remains alive but continues running without the listener.
  • Since the thread does not restart, no retry mechanism is triggered to recover the listener.
  • Additionally, the application itself does not terminate, so Docker does not restart the container to restore functionality.

If this is the case, we could:

  • Exit the whole application by allowing the thread's termination to bring down the app (a sketch of this option follows).
  • Implement a retry mechanism with an interval.
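A small sketch of the first option, assuming the listener runs as a spawned tokio task; `run_chain_listener` is a placeholder for the real loop:

```rust
// Fail-fast sketch: spawn the listener and await its handle; a panic inside the
// task surfaces as a JoinError here, and exiting non-zero lets Docker restart
// the container instead of leaving the relay running without a listener.
async fn supervise_listener() {
    let handle = tokio::spawn(async {
        run_chain_listener().await;
    });

    if let Err(e) = handle.await {
        eprintln!("chain listener task ended unexpectedly: {e}");
        std::process::exit(1);
    }
}

// Placeholder for the relay's actual chain event listener loop.
async fn run_chain_listener() {
    unimplemented!()
}
```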
