
fix: set timeout on sendmsg to avoid memory leak #324

Closed

Conversation

mihivagyok
Contributor

- this helps to avoid socket leaks when the channel is full
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 10, 2022
@k8s-ci-robot
Contributor

Hi @mihivagyok. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 10, 2022
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 10, 2022
Member

@andrewsykim andrewsykim left a comment


/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2022
@andrewsykim
Member

andrewsykim commented Feb 10, 2022

@mihivagyok are you able to share steps you used to reproduce the memory leak and test your fix? I'd like to test/validate your patch against one of my development clusters

@mihivagyok
Contributor Author

@andrewsykim Will share the steps, but let me re-validate the scenario first. I'll let you know once I'm ready!

Thanks,
Adam

@mihivagyok mihivagyok changed the title fix: set timeout on sendmsg to avoid memory link fix: set timeout on sendmsg to avoid memory leak Feb 14, 2022
@mihivagyok
Contributor Author

@andrewsykim

Configuration:

  • konnectivity-server is configured with --mode=http-connect and --proxy-strategies=destHost,default
  • there is one konnectivity-agent with --agent-identifiers=ipv4=<NODE_IP>
  • konnectivity-server has a memory limit of 1 GB

Test:

  • using kubectl cp to copy huge files to a test pod (for kubectl cp, the Konnectivity server picks the agent based on the destHost strategy - the request matches the agent identifier)
  • eventually the kubectl cp fails to finish
  • after that, kubectl logs starts failing (for kubectl logs, the Konnectivity server also picks the agent based on the destHost strategy - the request matches the agent identifier)
  • alongside kubectl cp/logs, there is metrics-server traffic from the API server towards the cluster, but for that the Konnectivity server picks a random agent (default strategy)
  • to trigger the leak, run thousands of kubectl logs commands - they will fail, but they will increase the memory/socket usage heavily
  • after some time, it reaches the memory limit and Kubernetes restarts the konnectivity-server

#255 (comment)

off topic:
I think we have some idea why kubectl cp eventually fails: the issue arises when we have multiple proxy strategies. The backend (agent) connections are guarded by a mutex. But if there are multiple strategies, the same backend with the same connection is registered in multiple backend managers. To illustrate this:

BackendManager - DestHost
    - Backend1 (agent identifier, conn, mutex)
BackendManager - Default
    - Backend1* (conn, mutex)

We have two backend instances (one for the DestHost backend manager, one for the Default one), but with the same conn. In this case, different goroutines may write to the same connection concurrently, as there is no protection across the instances (the mutex only guards within an instance), so errors can happen. I could submit an issue about this and discuss this theory there.

Thanks,
Adam
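
A minimal Go sketch of the shared-connection problem illustrated above (all type and field names are assumptions chosen for illustration, not the project's exact code):

// Sketch of the double-registration problem described above; all names
// are illustrative assumptions, not the project's exact code.
package main

import (
	"sync"
	"time"
)

type conn struct{} // stands in for the gRPC stream to the agent

func (c *conn) Send(msg string) {} // not safe for concurrent use

type backend struct {
	mu   sync.Mutex
	conn *conn
}

func (b *backend) Send(msg string) {
	b.mu.Lock() // guards only this backend instance
	defer b.mu.Unlock()
	b.conn.Send(msg)
}

func main() {
	shared := &conn{}
	// Each backend manager wraps the same conn in its own backend value,
	// so each wrapper carries its own, independent mutex:
	destHostBackend := &backend{conn: shared} // registered by DestHost
	defaultBackend := &backend{conn: shared}  // registered by Default

	// Both goroutines can be inside shared.Send at the same time, since
	// neither mutex knows about the other -- the unguarded concurrent
	// write described in the comment above.
	go destHostBackend.Send("a")
	go defaultBackend.Send("b")
	time.Sleep(100 * time.Millisecond) // let the goroutines run
}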

@andrewsykim
Member

@mihivagyok I think I was able to reproduce this issue, will try to test your patch to see if it resolves the issue

@andrewsykim
Member

note: this change does not solve the issue that makes the tunnel hang

Are you able to check the konnectivity_network_proxy_server_grpc_connections metric on the proxy server with this patch?

Member

@andrewsykim andrewsykim left a comment


I'm not able to reproduce the fix described in #261 (comment), but I also was not able to test this using multiple backend strategies. I think the patch makes sense though given grpc/grpc-go#1229.

Overall LGTM, left some minor comments

pkg/server/backend_manager.go - 3 review threads (resolved; 1 outdated)
- simplify the anonymous func()
@andrewsykim
Member

andrewsykim commented Feb 24, 2022

@mihivagyok I think I'm able to reproduce this issue now, and it was possible while just using the default proxy strategy. From the goroutine stacktrace, this was the semaphore lock that was blocking goroutines:

goroutine 192765 [semacquire, 125 minutes]:
sync.runtime_SemacquireMutex(0xc00038c094, 0x1428700, 0x1)
        /usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00038c090)
        /usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*backend).Send(0xc00038c090, 0xc003b6fb00, 0x0, 0x0)
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/backend_manager.go:86 +0xb9
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*ProxyServer).serveRecvFrontend(0xc000283b00, 0x190d458, 0xc003a5d930, 0xc0072ddc20)
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/server.go:432 +0xacc
created by sigs.k8s.io/apiserver-network-proxy/pkg/server.(*ProxyServer).Proxy
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/server.go:367 +0x318

My steps to reproduce were to mimic some level of backend unavailability.
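
For context, a minimal sketch of the Send path implicated in this trace, with the shape inferred from the stack frames above (field names and types are assumptions; the real code may differ):

// backend.Send at backend_manager.go:86 as implied by the stack trace
// (reconstructed for illustration; the real code may differ).
import (
	"sync"

	"sigs.k8s.io/apiserver-network-proxy/konnectivity-client/proto/client"
	"sigs.k8s.io/apiserver-network-proxy/proto/agent"
)

type backend struct {
	mu   sync.Mutex
	conn agent.AgentService_ConnectServer // stream shared by all frontends
}

// Send serializes writes to the shared stream. If one conn.Send blocks
// (e.g. the stream's write quota is exhausted), every other goroutine
// calling Send parks in mu.Lock() -- the [semacquire] state in the trace.
func (b *backend) Send(p *client.Packet) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.conn.Send(p)
}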

@mihivagyok
Contributor Author

@andrewsykim That's great news.
I think the semaphore is working as designed: one send blocks, another request comes in, and the mutex guarding the connection blocks that goroutine as well.

I think my concern regarding multiple proxy strategies / backend managers is still valid here.

To fix this, I think a single backend manager is needed which could be configured with multiple strategies - then the connection would be used in only one manager. That one backend manager would select from its agents based on the strategies. I think the code could be changed easily to achieve this.

Do you think it is feasible?
Thank you!
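
A hypothetical sketch of this single-manager idea (every identifier below is invented to illustrate the proposal; none of it is the project's actual API):

// Hypothetical sketch of the combined-manager proposal above.
package main

import "sync"

type backend struct{} // stands in for one agent connection and its mutex

// A strategy picks a backend for a destination host, or returns nil
// if it cannot (e.g. no agent identifier matches).
type proxyStrategy func(backends map[string]*backend, destHost string) *backend

type combinedBackendManager struct {
	mu         sync.RWMutex
	backends   map[string]*backend // each agent connection registered once
	strategies []proxyStrategy     // e.g. destHost first, then default
}

// Backend tries each configured strategy in order. Because all
// strategies select from the same backends map, a given agent
// connection (and its mutex) exists in exactly one place.
func (m *combinedBackendManager) Backend(destHost string) *backend {
	m.mu.RLock()
	defer m.mu.RUnlock()
	for _, pick := range m.strategies {
		if b := pick(m.backends, destHost); b != nil {
			return b
		}
	}
	return nil
}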

@andrewsykim
Member

To fix this, I think a single backend manager is needed which could be configured with multiple strategies - then the connection would be used in only one manager. That one backend manager would select from its agents based on the strategies. I think the code could be changed easily to achieve this.

Do you think it is feasible?

I'm still getting familiar with the codebase, so I'm not 100% sure yet. But if you're willing to open the PR, it would help me understand your proposal better.

@andrewsykim
Member

Btw -- it seems like the goroutine leak due to the backend connection mutex can happen for multiple reasons. In my specific test, it was due to the write quota on the stream; I created a separate issue for that here: #335

@andrewsykim
Member

/cc @cheftako

// wrap a timer around SendMsg to avoid a blocking grpc call
// (e.g. stream is full)
errChan := make(chan error, 1)
go func() {
Member


Is it possible for this goroutine to start leaking if b.conn.Send can block forever?

Member


My thinking is no, because when the stream eventually closes, Send will return with an io.EOF. But I'm not 100% confident about it.

Contributor Author


As we use a buffered channel, I think you are right. When the stream closes, it should free up the goroutine and the channel. At least this is how I understand what I'm reading in the docs and on the internet:
https://www.ardanlabs.com/blog/2018/11/goroutine-leaks-the-forgotten-sender.html

Thanks!

Contributor Author


Although I kinda agree that this change just masks the real problem. Thanks!

Member


As we use a buffered channel, I think you are right. When the stream closes, it should free up the goroutine and the channel.

The buffered channel makes the write into the channel non-blocking once b.conn.Send() returns a value, but b.conn.Send() itself can still block, right?

Contributor Author


Yes.
I mention the buffered channel only because it is needed to be able to free up the goroutine and the channel. Thanks!
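
Putting the thread together, a minimal self-contained sketch of the timeout-wrapped send pattern under discussion (the helper name and the timeout value are illustrative assumptions; see the diff for the actual change):

// Sketch of wrapping a blocking send with a timeout via a buffered channel.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("timeout writing to stream")

func sendWithTimeout(send func() error, timeout time.Duration) error {
	// The buffer of 1 is what lets the goroutine exit after a timeout:
	// its write into errChan never blocks, so once the blocked send
	// eventually returns (e.g. io.EOF when the stream closes), both the
	// goroutine and the channel become collectable.
	errChan := make(chan error, 1)
	go func() {
		errChan <- send()
	}()
	select {
	case err := <-errChan:
		return err
	case <-time.After(timeout):
		// The goroutine may still be parked inside send() at this point;
		// the timeout only unblocks the caller, which is why the change
		// masks, rather than fixes, a permanently blocked send.
		return errTimeout
	}
}

func main() {
	slowSend := func() error { time.Sleep(2 * time.Second); return nil }
	fmt.Println(sendWithTimeout(slowSend, 100*time.Millisecond)) // errTimeout
}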

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 5, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 5, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cheftako
Contributor

cheftako commented Sep 9, 2022

/remove-lifecycle rotten

@cheftako cheftako reopened this Sep 9, 2022
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 9, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mihivagyok
Once this PR has been reviewed and has the lgtm label, please assign anfernee for approval by writing /assign @anfernee in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 8, 2022
@k8s-ci-robot
Contributor

@mihivagyok: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 8, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.)

Successfully merging this pull request may close these issues.

Konnectivity server leaks memory and free sockets
5 participants