I have a 3-node k8s setup with a Mongo StatefulSet, which is configured to run 3 pods.
In case of a node failure, the pod that was previously running on that node gets stuck in the 'Terminating' state. That is the expected behavior of a Kubernetes StatefulSet.
My problem is that the terminating pod stays in the replica set, and if that pod is the primary, the database becomes unreachable.
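(For what it's worth, until the sidecar handles this, the stuck pod can be cleared by hand, the usual manual workaround for pods stuck in Terminating after a node failure; the StatefulSet then recreates it. This just restores the replica set for the moment, it doesn't fix the loop described below:)

```sh
kubectl delete pod mongo-0 --grace-period=0 --force
```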
Checking the sidecar logs, it turned out that adding and removing this pod alternates and never ends:

```
Addresses to add:    []
Addresses to remove: [ 'mongo-0.mongo.default.svc.cluster.local:27017' ]
replSetReconfig {
  _id: 'rs0', version: 12800119, protocolVersion: 1,
  members: [
    { _id: 1, host: 'mongo-1.mongo.default.svc.cluster.local:27017',
      arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1,
      tags: {}, slaveDelay: 0, votes: 1 },
    { _id: 2, host: 'mongo-2.mongo.default.svc.cluster.local:27017',
      arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1,
      tags: {}, slaveDelay: 0, votes: 1 } ],
  settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000,
    heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000,
    catchUpTimeoutMillis: -1, catchUpTakeoverDelayMillis: 30000,
    getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 },
    replicaSetId: 5c6ed152340d2b5e9e466284 } }

Addresses to add:    [ 'mongo-0.mongo.default.svc.cluster.local:27017' ]
Addresses to remove: []
replSetReconfig {
  _id: 'rs0', version: 12800120, protocolVersion: 1,
  members: [
    { _id: 1, host: 'mongo-1.mongo.default.svc.cluster.local:27017',
      arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1,
      tags: {}, slaveDelay: 0, votes: 1 },
    { _id: 2, host: 'mongo-2.mongo.default.svc.cluster.local:27017',
      arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1,
      tags: {}, slaveDelay: 0, votes: 1 },
    { _id: 0, host: 'mongo-0.mongo.default.svc.cluster.local:27017' } ],
  settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000,
    heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000,
    catchUpTimeoutMillis: -1, catchUpTakeoverDelayMillis: 30000,
    getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 },
    replicaSetId: 5c6ed152340d2b5e9e466284 } }
```
I checked the source code of mongo-k8s-sidecar, and in worker.js this section looks suspicious to me:

```js
//Lets remove any pods that aren't running or haven't been assigned an IP address yet
for (var i = pods.length - 1; i >= 0; i--) {
  var pod = pods[i];
  if (pod.status.phase !== 'Running' || !pod.status.podIP) {
    pods.splice(i, 1);
  }
}
```
So you check pod.status.phase, but k8s still reports 'Running' for the failed pod. The way to determine that the pod is in the terminating state is the value of the deletionTimestamp property:

```
hiuser@node3:~$ kubectl get pod mongo-0 -o=yaml | grep phase
  phase: Running
hiuser@node3:~$ kubectl get pod mongo-0 -o=yaml | grep dele
  deletionGracePeriodSeconds: 10
  deletionTimestamp: "2019-04-24T11:39:58Z"
```
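(Side note: the same field can be read directly with jsonpath instead of grepping the YAML; empty output means the pod is not being deleted:)

```sh
kubectl get pod mongo-0 -o jsonpath='{.metadata.deletionTimestamp}'
```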
I think worker.js should be refactored to also check deletionTimestamp when deciding whether a pod is healthy.
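A minimal sketch of what that could look like (my assumption about the fix, not an actual patch; pod objects come from the Kubernetes API, so deletionTimestamp lives under metadata):

```js
// Remove pods that aren't running, have no IP yet, or are already
// marked for deletion. The API server sets metadata.deletionTimestamp
// as soon as a pod starts terminating, even while status.phase still
// reports 'Running'.
for (var i = pods.length - 1; i >= 0; i--) {
  var pod = pods[i];
  var isTerminating = !!(pod.metadata && pod.metadata.deletionTimestamp);
  if (pod.status.phase !== 'Running' || !pod.status.podIP || isTerminating) {
    pods.splice(i, 1);
  }
}
```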
danielloczi added a commit to danielloczi/mongo-k8s-sidecar that referenced this issue on Apr 24, 2019.