
Redis failure causes client disconnect/connect status on Atlas to become inaccurate #1

Open
marcdoan opened this issue Dec 17, 2015 · 0 comments

Comments

@marcdoan
Contributor

This particular incident occurred on Dec. 16 with the following symptoms:

  • All drivers went offline on Atlas
  • Despite all drivers being shown as offline, Atlas was still receiving location updates from them through Node

Explanation:
All distributed Node instances use Redis to coordinate with each other. Each instance emits a "heartbeat" to a particular Redis pub/sub channel. By listening to this channel, every instance can deduce which instances are online and which are offline. Additionally, the Node instances use a number of other Redis pub/sub channels to share information, such as which clients are connected to which Node instance.
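The liveness deduction described above can be sketched roughly as follows. This is only an illustration: `HeartbeatTracker`, its method names, and the instance IDs are hypothetical and are not taken from the actual inter-node module, and the Redis subscription itself is left out so the logic stays self-contained. The only assumption about the real system is that an instance is treated as offline once no heartbeat from it has been seen within `activeInterval`.

```typescript
// Hypothetical sketch of heartbeat-based liveness tracking.
// Names are illustrative; this is not the real inter-node module.
class HeartbeatTracker {
  // Last time (ms) a heartbeat was received from each instance.
  private lastSeen = new Map<string, number>();
  private activeIntervalMs: number;

  constructor(activeIntervalMs: number) {
    this.activeIntervalMs = activeIntervalMs;
  }

  // Call this for every heartbeat message received on the pub/sub channel.
  record(instanceId: string, receivedAtMs: number): void {
    this.lastSeen.set(instanceId, receivedAtMs);
  }

  // An instance counts as online only if its last heartbeat is recent enough.
  isOnline(instanceId: string, nowMs: number): boolean {
    const seen = this.lastSeen.get(instanceId);
    return seen !== undefined && nowMs - seen <= this.activeIntervalMs;
  }

  // Instances that have been heard from before but are now past the threshold.
  offlineInstances(nowMs: number): string[] {
    return Array.from(this.lastSeen.keys()).filter(
      (id) => !this.isOnline(id, nowMs)
    );
  }
}
```

Note that a tracker like this cannot tell a late heartbeat apart from a dead instance: anything that delays delivery past the threshold looks like a disconnect.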

At 6:01 PM, an unknown cause (possibly network latency) most likely introduced a short delay (~1s) on the heartbeat channel, so each Node instance incorrectly deduced that all other servers had gone offline. When this happens, a push notification is sent to Atlas stating that all drivers have disconnected. Since Node is in fact still running, Atlas continues to receive location updates.

The temporary solution is to increase the activeInterval value in Node's inter-node module from 1s to 3s.
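To see why raising the threshold helps, here is a small arithmetic sketch. It assumes, hypothetically, that heartbeats are emitted once per second and that `activeInterval` is the longest silence tolerated before a peer is considered offline; the function name and the emit period are illustrative and not confirmed by the module itself.

```typescript
// Hypothetical model: would a delayed heartbeat cause a false "offline"?
// emitEveryMs      - assumed period between heartbeats (not confirmed)
// extraDelayMs     - additional delivery delay (e.g. network latency)
// activeIntervalMs - longest silence tolerated before marking a peer offline
function wouldBeMarkedOffline(
  emitEveryMs: number,
  extraDelayMs: number,
  activeIntervalMs: number
): boolean {
  // Silence observed by a listener between two consecutive heartbeats.
  const silenceMs = emitEveryMs + extraDelayMs;
  return silenceMs > activeIntervalMs;
}

// A ~1s delay with the old 1s threshold: false disconnect.
console.log(wouldBeMarkedOffline(1000, 1000, 1000)); // true
// The same delay with the new 3s threshold: tolerated.
console.log(wouldBeMarkedOffline(1000, 1000, 3000)); // false
```

Under these assumptions the 3s setting only absorbs delays of up to about 2s, which is consistent with treating it as a temporary measure.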

Note: This behavior may also explain why some drivers periodically appear to "disconnect" from Node. Node is single-threaded, so if it hits a section of CPU-intensive code that runs longer than the heartbeat interval, it will not emit a heartbeat on time. Consequently, all other Node instances will wrongly deduce that the server has temporarily disconnected and will emit an incorrect "disconnect" push notification to Atlas.
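The effect of a blocked event loop on heartbeat timing can be illustrated with a simplified model. This is a sketch under stated assumptions, not a measurement of the real system: it assumes a heartbeat is due every `intervalMs`, and that a heartbeat falling due while the thread is busy goes out only when the busy period ends. `BusyPeriod`, `heartbeatEmitTimes`, and `longestSilenceMs` are hypothetical names.

```typescript
// Hypothetical model of heartbeat emission on a single-threaded event loop.
interface BusyPeriod {
  startMs: number; // CPU-bound work begins
  endMs: number;   // event loop is free again
}

// Times at which heartbeats actually go out, given periods of blocking work.
function heartbeatEmitTimes(
  intervalMs: number,
  untilMs: number,
  busy: BusyPeriod[]
): number[] {
  const times: number[] = [];
  for (let due = intervalMs; due <= untilMs; due += intervalMs) {
    // A heartbeat due during blocking work is delayed until the work ends.
    const blocking = busy.find((b) => due >= b.startMs && due < b.endMs);
    const actual = blocking ? blocking.endMs : due;
    // Several heartbeats delayed to the same moment collapse into one.
    if (times.length === 0 || times[times.length - 1] !== actual) {
      times.push(actual);
    }
  }
  return times;
}

// Longest gap between consecutive heartbeats, as seen by other instances.
function longestSilenceMs(times: number[]): number {
  let longest = 0;
  for (let i = 1; i < times.length; i++) {
    longest = Math.max(longest, times[i] - times[i - 1]);
  }
  return longest;
}
```

For example, with a 1s interval and CPU-bound work running from 1.5s to 3.7s, the model gives a 2.7s silence: long enough to trip a 1s threshold, but within a 3s one.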

@vcardillo
