This particular incident occurred on Dec. 16 with the following symptoms:
All drivers went offline on Atlas
Despite all drivers being shown as offline, Atlas was still receiving location updates from them through Node
Explanation:
All distributed Node instances use Redis to coordinate with each other. Each instance emits a "heartbeat" to a particular Redis pub/sub channel. By listening to this channel, every instance can deduce which instances are online and which are offline. Additionally, there are a number of other Redis pub/sub channels that the Node instances use to share information, such as which clients are connected to which Node instance.
At 6:01 PM, an unknown cause (network latency? etc.) most likely introduced a small delay (~1s) in the heartbeat channel, which led each Node instance to wrongly deduce that all other servers had gone offline. When this happens, a push notification is sent to Atlas stating that all drivers have disconnected. Because Node is in fact still running, Atlas continues to receive location updates.
The temporary solution is to increase the activeInterval value in Node's inter-node module from 1s to 3s.
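To illustrate why the larger value helps, here is a minimal sketch of heartbeat-based liveness tracking. It is not the actual inter-node module: `PeerTracker`, `simulateDelayedHeartbeats`, and the instance names are hypothetical. It also assumes that heartbeats are emitted once per second and that `activeInterval` acts as the staleness threshold after which a peer counts as offline.

```javascript
// Hypothetical sketch; names are illustrative, not from the inter-node module.
class PeerTracker {
  constructor(activeIntervalMs) {
    // Assumption: a peer is considered offline once its last heartbeat
    // is older than activeIntervalMs.
    this.activeIntervalMs = activeIntervalMs;
    this.lastSeen = new Map(); // peer id -> timestamp (ms) of last heartbeat
  }

  // Called whenever a heartbeat message arrives on the pub/sub channel.
  recordHeartbeat(peerId, nowMs) {
    this.lastSeen.set(peerId, nowMs);
  }

  // Returns the ids of peers whose last heartbeat is too old.
  offlinePeers(nowMs) {
    const offline = [];
    for (const [peerId, seenAt] of this.lastSeen) {
      if (nowMs - seenAt > this.activeIntervalMs) {
        offline.push(peerId);
      }
    }
    return offline;
  }
}

// All peers send a heartbeat at t=0. The next one is due at t=1000 ms but is
// delayed by delayMs. We check liveness just before the delayed heartbeat lands.
function simulateDelayedHeartbeats(activeIntervalMs, delayMs) {
  const tracker = new PeerTracker(activeIntervalMs);
  for (const peerId of ['node-a', 'node-b', 'node-c']) {
    tracker.recordHeartbeat(peerId, 0);
  }
  return tracker.offlinePeers(1000 + delayMs - 1);
}

console.log(simulateDelayedHeartbeats(1000, 1000)); // all three peers reported offline
console.log(simulateDelayedHeartbeats(3000, 1000)); // no peers reported offline
```

Under these assumptions, a 1s threshold leaves no slack for a ~1s delay, while a 3s threshold tolerates it. The trade-off is that a genuinely dead instance takes longer to detect.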
Note: This behavior may also explain why some drivers periodically appear to "disconnect" from Node. Node is single-threaded and so if it encounters a portion of code that requires intensive CPU work which takes longer than the heartbeat interval, it will not emit a heartbeat on time. Consequently, all other Node instances will inaccurately deduce a temporary disconnection from that server and incorrectly emit a "disconnect" push notification to Atlas.
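The effect described in this note can be reproduced in isolation. The sketch below is only a demonstration under assumed names (`busyWait` and `measureHeartbeatDelay` are hypothetical, not part of the real code): it schedules a timer the way a heartbeat would be scheduled, blocks the event loop with synchronous CPU work, and reports how late the timer actually fired.

```javascript
// Hypothetical sketch; function names are illustrative.

// Blocks the event loop for roughly `ms` milliseconds with synchronous work.
function busyWait(ms) {
  const start = Date.now();
  while (Date.now() - start < ms) {
    // spin: nothing else, including timers, can run during this loop
  }
  return Date.now() - start;
}

// Schedules a "heartbeat" after intervalMs, then blocks for blockMs.
// Resolves with how many milliseconds late the heartbeat fired.
function measureHeartbeatDelay(intervalMs, blockMs) {
  return new Promise((resolve) => {
    const dueAt = Date.now() + intervalMs;
    setTimeout(() => resolve(Date.now() - dueAt), intervalMs);
    busyWait(blockMs); // CPU-heavy work that outlasts the heartbeat interval
  });
}

measureHeartbeatDelay(10, 50).then((lateByMs) => {
  console.log(`heartbeat fired ${lateByMs} ms late`);
});
```

Because the timer callback cannot run until the synchronous work finishes, the heartbeat is late by roughly the blocking time minus the interval. Peers listening on the channel would see this as a missed heartbeat.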
@vcardillo