
podtailer does not handle unclean shutdown of watcher #416

Closed
lizthegrey opened this issue May 29, 2024 · 3 comments

@lizthegrey
Member

lizthegrey commented May 29, 2024

Versions

v2.7.2

Steps to reproduce

  1. Run honeycomb-kubernetes-agent on a long-lived node
  2. Interrupt the connection between the agent and the apiserver. Warnings are emitted from reflector.go in the k8s client-go library (see the log excerpt below).
  3. Logs are no longer emitted once the watched pods rotate their log files. Newly started pods are unaffected; only the watcher that tracks an existing pod's list of log files dies.
W0426 13:34:57.930975       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.930997       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.931002       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.930975       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.930975       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.931047       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.931065       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

Additional context

It appears we retry failures to create a new watcher when a pod is instantiated, but we do not retry watching if an existing watcher stops.
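For illustration, the problematic shape looks roughly like this (a hypothetical Go sketch, not the agent's actual code; the package, function, and handler names are made up). Creating the watch is retried, but nothing restarts it once its event channel closes:

```go
package podwatch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// handlePodEvent stands in for the agent's real event handling.
func handlePodEvent(ev watch.Event) { fmt.Println(ev.Type) }

// tailPods illustrates the failure mode: creation of the watch is retried,
// but once the watch is running, nothing re-creates it when the stream ends.
func tailPods(ctx context.Context, client kubernetes.Interface, ns string) {
	var w watch.Interface
	var err error
	for { // retries cover only the initial creation
		if w, err = client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{}); err == nil {
			break
		}
		time.Sleep(time.Second)
	}
	for ev := range w.ResultChan() {
		handlePodEvent(ev)
	}
	// ResultChan closed (e.g. "client connection lost"): we fall through and
	// the pod list is never watched again, so rotated log files are missed.
}
```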

@lizthegrey
Member Author

Probably something wrong in the https://github.com/honeycombio/honeycomb-kubernetes-agent/blob/main/k8sagent/watcher.go logic, but I'm not 100% sure what yet.

@lizthegrey
Member Author

lizthegrey commented May 31, 2024

The problem is at https://github.com/kubernetes/client-go/blob/v0.26.3/tools/cache/reflector.go#L347: "client connection lost" is not marked as retryable.

Ah, but if the apiserver has gone away, we can't simply retry the watch: the new apiserver won't have a record of it, so we have to set it up again from the start. Now I understand.
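In other words, recovery has to start over with a fresh list rather than resuming the old watch. A minimal sketch of what "set it up again from the start" looks like with the raw client-go API (hypothetical helper, in the same made-up package as the sketch above):

```go
package podwatch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// freshWatch sets the watch up again "from the start": re-list to obtain a
// current ResourceVersion from whichever apiserver we are now talking to,
// then open a new watch from that version.
func freshWatch(ctx context.Context, client kubernetes.Interface, ns string) (watch.Interface, error) {
	list, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	return client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		ResourceVersion: list.ResourceVersion,
	})
}
```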

@lizthegrey
Member Author

lizthegrey commented May 31, 2024

https://github.com/honeycombio/honeycomb-kubernetes-agent/blob/main/k8sagent/watcher.go#L172 assumes the watcher will never stop and never needs restarting on error, but my investigation shows that it does terminate on fatal errors (e.g. apiserver shutdown); we therefore need to re-initialise it and start it back up if it terminates.
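A minimal sketch of the shape of the fix: wrap the watch in a supervision loop that re-initialises it whenever it terminates (again with hypothetical names; it reuses the freshWatch helper sketched above rather than the agent's real watcher.go types):

```go
package podwatch

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/kubernetes"
)

// runPodWatcher does not assume the watch runs forever: whenever the event
// stream terminates, it re-initialises the watch and starts it again.
func runPodWatcher(ctx context.Context, client kubernetes.Interface, ns string) {
	for ctx.Err() == nil {
		w, err := freshWatch(ctx, client, ns) // re-list + new watch, as above
		if err != nil {
			log.Printf("failed to (re)start pod watch: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}
		for ev := range w.ResultChan() {
			handlePodEvent(ev) // dispatch to the pod tailer
		}
		w.Stop()
		log.Print("pod watch terminated; re-initialising")
	}
}
```

With a fresh list before each new watch, a lost apiserver connection should only cost a brief gap instead of silently ending log collection for existing pods.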
