Redis failure causes client disconnect/connect status on Atlas to become inaccurate #1

marcdoan · 2015-12-17T18:24:26Z

This particular incident occurred on Dec. 16 with the following symptoms:

All drivers appeared offline on Atlas
Although every driver was shown as offline, Atlas was still receiving location updates from the drivers through Node

Explanation:
All distributed Node instances use Redis to coordinate with each other. Each instance emits a "heartbeat" to a particular Redis pub/sub channel. By listening to this channel, every instance can deduce which instances are online and which are offline. Additionally, the Node instances use a number of other Redis pub/sub channels to share information, such as which clients are connected to which Node instance.
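As a rough illustration of the heartbeat bookkeeping described above, here is a minimal sketch. The class and method names (`PeerTracker`, `recordHeartbeat`, `offlinePeers`) are hypothetical and not taken from the actual inter-node module, and the Redis subscription itself is left out: the sketch assumes something else calls `recordHeartbeat` for every message received on the heartbeat channel.

```javascript
// Hypothetical sketch: names and structure are illustrative only,
// not the real inter-node module.
class PeerTracker {
  constructor(activeIntervalMs) {
    // Maximum age of a heartbeat before its sender is treated as offline.
    this.activeIntervalMs = activeIntervalMs;
    // instanceId -> timestamp (ms) of the last heartbeat received.
    this.lastSeen = new Map();
  }

  // Call this for every message received on the heartbeat channel.
  recordHeartbeat(instanceId, nowMs) {
    this.lastSeen.set(instanceId, nowMs);
  }

  // An instance counts as online if its last heartbeat is recent enough.
  isOnline(instanceId, nowMs) {
    const seenAt = this.lastSeen.get(instanceId);
    return seenAt !== undefined && nowMs - seenAt <= this.activeIntervalMs;
  }

  // All known instances whose last heartbeat is older than activeIntervalMs.
  offlinePeers(nowMs) {
    return [...this.lastSeen.keys()].filter((id) => !this.isOnline(id, nowMs));
  }
}

const tracker = new PeerTracker(1000);
tracker.recordHeartbeat('node-a', 0);
tracker.recordHeartbeat('node-b', 900);
// At t = 1500 ms, node-a's heartbeat is 1500 ms old and node-b's is 600 ms old.
console.log(tracker.offlinePeers(1500)); // [ 'node-a' ]
```

Timestamps are passed in explicitly rather than read from `Date.now()` so the behavior is easy to reason about; a real implementation would presumably use the clock directly.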

At 6:01 PM, an unknown cause (possibly network latency) most likely introduced a small delay (~1s) in the heartbeat channel, so each Node instance incorrectly deduced that all other servers had gone offline. When this happens, a push notification is sent to Atlas stating that all drivers have disconnected. Because Node is in fact still running, Atlas continues to receive location updates.

The temporary solution is to increase the activeInterval value in Node's inter-node module from 1s to 3s.
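To show why the larger value helps, here is a small simulation. It is a sketch under assumptions: the function name `falseOfflineChecks` is hypothetical, and the 1000 ms heartbeat period and the exact arrival times are made up for illustration (the real period is not stated in this report).

```javascript
// Hypothetical helper: returns the check times at which an observer would
// flag a peer as offline, given when that peer's heartbeats actually arrived.
function falseOfflineChecks(arrivalsMs, checkTimesMs, activeIntervalMs) {
  return checkTimesMs.filter((now) => {
    const seen = arrivalsMs.filter((t) => t <= now);
    if (seen.length === 0) return false; // nothing received yet; skip this check
    const lastSeen = Math.max(...seen);
    return now - lastSeen > activeIntervalMs;
  });
}

// Assumed timeline: heartbeats are sent every 1000 ms, but the one due at
// 2000 ms is delayed by about 1 s and only arrives at 3000 ms.
const arrivals = [0, 1000, 3000, 4000];
const checks = [500, 1500, 2500, 3500, 4500];

console.log(falseOfflineChecks(arrivals, checks, 1000)); // [ 2500 ]
console.log(falseOfflineChecks(arrivals, checks, 3000)); // []
```

With a 1s threshold, the single late heartbeat is enough to mark a healthy peer as offline; with 3s it is absorbed. The trade-off is that a genuinely dead instance also takes longer to detect.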

Note: This behavior may also explain why some drivers periodically appear to "disconnect" from Node. Node is single-threaded, so if it hits a section of code that needs intensive CPU work and takes longer than the heartbeat interval, it will not emit its heartbeat on time. Consequently, all other Node instances will incorrectly deduce that the server has temporarily disconnected and emit a "disconnect" push notification to Atlas.
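The effect in the note above can be reproduced in isolation. The following is a minimal demonstration, not code from the Node service: `measureHeartbeatDelay` is a hypothetical name, the timer stands in for the heartbeat emitter, and a busy-wait loop stands in for the CPU-intensive work.

```javascript
// Hypothetical demo: a timer due in `intervalMs` cannot fire while
// synchronous work is holding the single-threaded event loop.
function measureHeartbeatDelay(intervalMs, blockMs) {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    // Stand-in for the heartbeat: resolves with how long it actually took to fire.
    setTimeout(() => resolve(Date.now() - scheduledAt), intervalMs);

    // Stand-in for CPU-intensive synchronous work.
    const blockStart = Date.now();
    while (Date.now() - blockStart < blockMs) {
      // busy wait
    }
  });
}

measureHeartbeatDelay(10, 50).then((elapsedMs) => {
  // The timer was due after 10 ms but could not fire until the 50 ms of
  // synchronous work had finished.
  console.log(`heartbeat fired after ${elapsedMs} ms`);
});
```

The timer callback only runs once the synchronous loop returns control to the event loop, so the measured delay is at least as long as the blocking work, regardless of the requested interval.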

@vcardillo
