
Add KeepAliveParameters to agent client #4157

Draft · wants to merge 2 commits into master
Conversation

pingsutw (Member) commented Oct 3, 2023

Tracking issue

#3936

Describe your changes

Agents use headless services to balance client load. However, if HPA is used, propeller needs to be restarted to pick up a new set of IP addresses.

We could add KeepAliveParameters to the gRPC client so that propeller can get a new set of IPs, every 10s by default.
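
For illustration only (this is not necessarily how the PR wires things up): such a setting typically maps onto grpc-go's keepalive.ClientParameters and is passed as a dial option. The endpoint below is the existing default from the agent config; everything else is a sketch.

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Mirrors the defaults proposed in this PR: ping every 10s, wait 5s for the
	// ack, and keep pinging even when no RPCs are in flight.
	kp := keepalive.ClientParameters{
		Time:                10 * time.Second,
		Timeout:             5 * time.Second,
		PermitWithoutStream: true,
	}

	conn, err := grpc.Dial(
		"dns:///flyteagent.flyte.svc.cluster.local:80", // default agent endpoint from the config
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(kp),
	)
	if err != nil {
		log.Fatalf("failed to dial agent: %v", err)
	}
	defer conn.Close()
}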

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Screenshots

Note to reviewers

pingsutw requested a review from honnix on October 3, 2023 03:37
pingsutw (Member, Author) commented Oct 3, 2023

cc @honnix, mind taking a look?

codecov bot commented Oct 3, 2023

Codecov Report

Attention: Patch coverage is 67.50000% with 52 lines in your changes missing coverage. Please review.

Project coverage is 59.32%. Comparing base (b35cc95) to head (6c43ad3).
Report is 964 commits behind head on master.

Current head 6c43ad3 differs from pull request most recent head 2a55000

Please upload reports for the commit 2a55000 to get more accurate results.

Files | Patch % | Lines
...er/pkg/controller/nodes/task/k8s/plugin_manager.go | 59.15% | 24 Missing and 5 partials ⚠️
...ler/pkg/controller/nodes/task/k8s/event_watcher.go | 77.77% | 12 Missing ⚠️
...propeller/pkg/controller/nodes/task/transformer.go | 61.11% | 6 Missing and 1 partial ⚠️
flytepropeller/pkg/controller/controller.go | 0.00% | 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4157      +/-   ##
==========================================
+ Coverage   58.98%   59.32%   +0.34%     
==========================================
  Files         618      550      -68     
  Lines       52708    39699   -13009     
==========================================
- Hits        31088    23551    -7537     
+ Misses      19140    13828    -5312     
+ Partials     2480     2320     -160     
Flag Coverage Δ
unittests ?


Signed-off-by: Kevin Su <[email protected]>
honnix (Member) commented Oct 3, 2023

KeepAliveParameters: &KeepAliveParameters{
Time: config.Duration{Duration: 10 * time.Second},
Timeout: config.Duration{Duration: 5 * time.Second},
PermitWithoutStream: true,

What is the reason we set this to true? It defaults to false I think.
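
For reference, false is indeed the zero value of PermitWithoutStream in grpc-go: with false, the client only sends keepalive pings while at least one RPC is active, so setting it to true keeps an otherwise idle client pinging (the PR doesn't say whether that was the intent). A tiny illustrative check:

package main

import (
	"fmt"

	"google.golang.org/grpc/keepalive"
)

func main() {
	// The zero-value ClientParameters leaves PermitWithoutStream at false:
	// keepalive pings are only sent while there is an active stream.
	var kp keepalive.ClientParameters
	fmt.Println(kp.PermitWithoutStream) // prints: false
}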

@@ -43,6 +43,11 @@ var (
Endpoint: "dns:///flyteagent.flyte.svc.cluster.local:80",
Insecure: true,
DefaultTimeout: config.Duration{Duration: 10 * time.Second},
KeepAliveParameters: &KeepAliveParameters{

What do you think about keeping the existing defaults defined by https://pkg.go.dev/google.golang.org/grpc/keepalive#ClientParameters unchanged, and only giving a value for Time? Also, 10s seems a bit aggressive. In a large setup, the overhead of DNS resolution is not negligible, plus 10s might even be smaller than the DNS cache timeout, so in many cases the refresh won't give any new IPs. In our backend production setup we default this to 5 minutes, just for reference.
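
As a sketch of that suggestion: set only Time and leave the other fields at their zero values, which grpc-go then replaces with its own defaults (Timeout 20s, PermitWithoutStream false). The 5-minute value and the package/function names are taken from the reviewer's reference setup and are illustrative, not values from this PR.

package agentclient

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepAliveDialOption sets only Time; Timeout and PermitWithoutStream are left
// as zero values so grpc-go falls back to its defaults (20s and false).
func keepAliveDialOption() grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time: 5 * time.Minute, // reviewer's reference cadence; could be exposed via config
	})
}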

@@ -43,6 +43,11 @@ var (
Endpoint: "dns:///flyteagent.flyte.svc.cluster.local:80",
Insecure: true,
DefaultTimeout: config.Duration{Duration: 10 * time.Second},
KeepAliveParameters: &KeepAliveParameters{
Time: config.Duration{Duration: 10 * time.Second},
Timeout: config.Duration{Duration: 5 * time.Second},

Suggested change:
- Timeout: config.Duration{Duration: 5 * time.Second},
+ Timeout: config.Duration{Duration: 20 * time.Second},

KeepAliveParameters: &KeepAliveParameters{
Time: config.Duration{Duration: 10 * time.Second},
Timeout: config.Duration{Duration: 5 * time.Second},
PermitWithoutStream: true,

Suggested change:
- PermitWithoutStream: true,
+ PermitWithoutStream: false,

io "github.com/flyteorg/flyte/flyteplugins/go/tasks/pluginmachinery/io"
core "github.com/flyteorg/flyteidl/gen/pb-go/flyteidl/core"

Is this an intended change? It looks wrong in terms of ordering.

honnix (Member) commented Oct 3, 2023

After reading the doc more in depth, I'm wondering whether this keepalive is intended for IP refreshing. It seems not, to me. I think the problem might be more related to the name resolver; the ping itself may not trigger name resolution refreshing.

There is more info:

That being said, I think this configuration still makes sense. It's just that I'm not sure it solves the problem. Internally we have a custom name resolver that goes together with client-side load balancing, so we have not seen this problem.
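
Not the reviewer's internal setup, but for context on the name-resolver point: grpc-go's built-in DNS resolver (selected by the dns:/// target) can be paired with client-side round_robin balancing via the default service config, which spreads RPCs across whatever pod IPs the headless service currently resolves to; re-resolution is generally driven by connection failures rather than by keepalive pings. A minimal sketch:

package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The dns:/// scheme selects grpc-go's DNS resolver; round_robin then opens
	// a subchannel per resolved address instead of pinning to a single pod.
	conn, err := grpc.Dial(
		"dns:///flyteagent.flyte.svc.cluster.local:80",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}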

pingsutw (Member, Author) commented Oct 3, 2023

It did solve the issue after adding these configs. Do you use HPA internally? The problem is that I need to restart propeller if I scale up the agent; otherwise, propeller won't send requests to the new pod.


honnix (Member) commented Oct 3, 2023

It did solve the issue after adding these configs. Do you use HPA internally? The problem is that I need to restart propeller if I scale up the agent; otherwise, propeller won't send requests to the new pod.

I actually can't explain how this regular ping would trigger name resolution, or why it would be different from normal traffic when it comes to name resolution. Without this configuration, if there is no normal traffic there is no resolution, and when traffic comes back it should work the same as sending a ping, triggering resolution if the DNS cache has timed out. Did you find any doc stating that the ping explicitly forces resolution?

Yes, we use HPA.

github-actions bot added the stale label on Jun 30, 2024
pingsutw marked this pull request as draft on August 16, 2024 20:17