
podtailer does not handle unclean shutdown of watcher #416

Closed
lizthegrey opened this issue May 29, 2024 · 3 comments

@lizthegrey
Member

lizthegrey commented May 29, 2024

Versions

v2.7.2

Steps to reproduce

  1. Run honeycomb-kubernetes-agent on a long-lived node
  2. Interrupt the connection between the agent and the apiserver. Warnings are emitted from reflector.go in the k8s client-go library (see the log excerpt below).
  3. Logs are no longer emitted once the watched pods rotate their log files. Newly started pods are unaffected; only the watcher that tracks an existing pod's list of log files dies.
W0426 13:34:57.930975       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.930997       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.931002       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.930975       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.930975       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.931047       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0426 13:34:57.931065       1 reflector.go:347] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

Additional context

It appears we retry failures to create a new watcher when a pod is instantiated, but we do not retry watching if an existing watcher stops.
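For illustration, the problematic shape looks roughly like this (a hypothetical Go sketch, not the agent's actual code; the package, function, and handler names are made up). Creating the watch is retried, but nothing restarts it once its event channel closes:

```go
package podwatch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// handlePodEvent stands in for the agent's real event handling.
func handlePodEvent(ev watch.Event) { fmt.Println(ev.Type) }

// tailPods illustrates the failure mode: creation of the watch is retried,
// but once the watch is running, nothing re-creates it when the stream ends.
func tailPods(ctx context.Context, client kubernetes.Interface, ns string) {
	var w watch.Interface
	var err error
	for { // retries cover only the initial creation
		if w, err = client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{}); err == nil {
			break
		}
		time.Sleep(time.Second)
	}
	for ev := range w.ResultChan() {
		handlePodEvent(ev)
	}
	// ResultChan closed (e.g. "client connection lost"): we fall through and
	// the pod list is never watched again, so rotated log files are missed.
}
```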

@lizthegrey
Member Author

Probably something wrong in the https://github.com/honeycombio/honeycomb-kubernetes-agent/blob/main/k8sagent/watcher.go logic, but I'm not 100% sure what yet.

@lizthegrey
Member Author

lizthegrey commented May 31, 2024

The problem is at https://github.com/kubernetes/client-go/blob/v0.26.3/tools/cache/reflector.go#L347: "client connection lost" is not marked as retryable.

Ah, but if the apiserver has gone away, we can't simply retry the watch: the new apiserver won't have a record of it, so we have to set it up again from the start. Now I understand.
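In other words, recovery has to start over with a fresh list rather than resuming the old watch. A minimal sketch of what "set it up again from the start" looks like with the raw client-go API (hypothetical helper, in the same made-up package as the sketch above):

```go
package podwatch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// freshWatch sets the watch up again "from the start": re-list to obtain a
// current ResourceVersion from whichever apiserver we are now talking to,
// then open a new watch from that version.
func freshWatch(ctx context.Context, client kubernetes.Interface, ns string) (watch.Interface, error) {
	list, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	return client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		ResourceVersion: list.ResourceVersion,
	})
}
```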

@lizthegrey
Member Author

lizthegrey commented May 31, 2024

https://github.com/honeycombio/honeycomb-kubernetes-agent/blob/main/k8sagent/watcher.go#L172 assumes the watcher will never stop and never needs restarting on error, but my investigation shows that it does terminate on fatal errors (e.g. apiserver shutdown); we therefore need to re-initialise it and start it back up if it terminates.
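A minimal sketch of the shape of the fix: wrap the watch in a supervision loop that re-initialises it whenever it terminates (again with hypothetical names; it reuses the freshWatch helper sketched above rather than the agent's real watcher.go types):

```go
package podwatch

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/kubernetes"
)

// runPodWatcher does not assume the watch runs forever: whenever the event
// stream terminates, it re-initialises the watch and starts it again.
func runPodWatcher(ctx context.Context, client kubernetes.Interface, ns string) {
	for ctx.Err() == nil {
		w, err := freshWatch(ctx, client, ns) // re-list + new watch, as above
		if err != nil {
			log.Printf("failed to (re)start pod watch: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}
		for ev := range w.ResultChan() {
			handlePodEvent(ev) // dispatch to the pod tailer
		}
		w.Stop()
		log.Print("pod watch terminated; re-initialising")
	}
}
```

With a fresh list before each new watch, a lost apiserver connection should only cost a brief gap instead of silently ending log collection for existing pods.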
