-
Notifications
You must be signed in to change notification settings - Fork 661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add KeepAliveParameters to agent client #4157
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Kevin Su <[email protected]>
cc @honnix mind taking a look |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #4157 +/- ##
==========================================
+ Coverage 58.98% 59.32% +0.34%
==========================================
Files 618 550 -68
Lines 52708 39699 -13009
==========================================
- Hits 31088 23551 -7537
+ Misses 19140 13828 -5312
+ Partials 2480 2320 -160
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: Kevin Su <[email protected]>
Nice finding! I did not know there are no reasonable defaults for these. I guess we need the same thing for https://github.com/flyteorg/flytepropeller/blob/0fcc1da879333af1282ee1142651e191eb1d6bb4/pkg/controller/nodes/catalog/config.go#L44, and https://github.com/flyteorg/flyteidl/blob/85bfc8ff1c36b2d37a4fcc68a2be1911a9fc3940/clients/go/admin/client.go#L173 ? |
KeepAliveParameters: &KeepAliveParameters{ | ||
Time: config.Duration{Duration: 10 * time.Second}, | ||
Timeout: config.Duration{Duration: 5 * time.Second}, | ||
PermitWithoutStream: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason we set this to true
? It defaults to false
I think.
@@ -43,6 +43,11 @@ var ( | |||
Endpoint: "dns:///flyteagent.flyte.svc.cluster.local:80", | |||
Insecure: true, | |||
DefaultTimeout: config.Duration{Duration: 10 * time.Second}, | |||
KeepAliveParameters: &KeepAliveParameters{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think we keep existing defaults defined by https://pkg.go.dev/google.golang.org/grpc/keepalive#ClientParameters unchanged, and only give a value for Time
? Also 10s seems a bit aggressive. In a large setup, the overhead of DNS resolution is not negligible, plus 10s might be even smaller than DNS cache timeout so in many cases the refresh won't give any new IPs. In our backend production setup, we default this to 5 minutes, just for reference.
@@ -43,6 +43,11 @@ var ( | |||
Endpoint: "dns:///flyteagent.flyte.svc.cluster.local:80", | |||
Insecure: true, | |||
DefaultTimeout: config.Duration{Duration: 10 * time.Second}, | |||
KeepAliveParameters: &KeepAliveParameters{ | |||
Time: config.Duration{Duration: 10 * time.Second}, | |||
Timeout: config.Duration{Duration: 5 * time.Second}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timeout: config.Duration{Duration: 5 * time.Second}, | |
Timeout: config.Duration{Duration: 20 * time.Second}, |
KeepAliveParameters: &KeepAliveParameters{ | ||
Time: config.Duration{Duration: 10 * time.Second}, | ||
Timeout: config.Duration{Duration: 5 * time.Second}, | ||
PermitWithoutStream: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PermitWithoutStream: true, | |
PermitWithoutStream: false, |
io "github.com/flyteorg/flyte/flyteplugins/go/tasks/pluginmachinery/io" | ||
core "github.com/flyteorg/flyteidl/gen/pb-go/flyteidl/core" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intended change? It looks wrong in term of ordering.
After reading the doc more in-depth, I'm wondering whether this keepalive is intended for IP refreshing. It seems not to me. I think the problem might be more related to name resolver. The ping itself may not trigger name resolution refreshing. There are more info:
That being said, I think this configuration still makes sense. It's just I'm not sure it solves the problem. Internally we have a custom name resolver goes together with client-side load balancing so we have not seen this problem. |
It did solve the issue after adding these config. do you use HPA internally? The problem is that I need to restart the propeller if I scale up the agent. otherwise, propeller won't send the request to the new pod. |
|
I actually can't explain how this regular ping would trigger name resolution and why it would be different than normal traffic when coming to name resolution. Without these configuration, if there is no normal traffic there is no resolution, and when traffic comes back it should work the same as sending a ping, triggering resolution if DNS cache timed out. Do you find any doc stating that Yes we use HPA. |
Tracking issue
#3936
Describe your changes
Agents use headless services to balance client load. However, If HPA is used, propeller needs to be restarted to get a new set of IP addresses.
We could add
KeepAliveParameters
to gRPC client, so that propeller can get a new set of IPs every 10s by default.Check all the applicable boxes
Screenshots
Note to reviewers