
Add KeepAliveParameters to agent client #4157

Draft · wants to merge 2 commits into master
Conversation

pingsutw (Member) commented Oct 3, 2023

Tracking issue

#3936

Describe your changes

Agents use headless services to balance client load. However, if HPA is used, propeller needs to be restarted to pick up a new set of IP addresses.

We could add KeepAliveParameters to the gRPC client so that propeller can get a new set of IPs, every 10s by default.
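
For illustration only (this is not necessarily how the PR wires things up): such a setting typically maps onto grpc-go's keepalive.ClientParameters and is passed as a dial option. The endpoint below is the existing default from the agent config; everything else is a sketch.

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Mirrors the defaults proposed in this PR: ping every 10s, wait 5s for the
	// ack, and keep pinging even when no RPCs are in flight.
	kp := keepalive.ClientParameters{
		Time:                10 * time.Second,
		Timeout:             5 * time.Second,
		PermitWithoutStream: true,
	}

	conn, err := grpc.Dial(
		"dns:///flyteagent.flyte.svc.cluster.local:80", // default agent endpoint from the config
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(kp),
	)
	if err != nil {
		log.Fatalf("failed to dial agent: %v", err)
	}
	defer conn.Close()
}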

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Screenshots

Note to reviewers

pingsutw requested a review from honnix on October 3, 2023 03:37
pingsutw (Member, Author) commented Oct 3, 2023

cc @honnix, mind taking a look?

codecov bot commented Oct 3, 2023

Codecov Report

Attention: Patch coverage is 67.50000% with 52 lines in your changes missing coverage. Please review.

Project coverage is 59.32%. Comparing base (b35cc95) to head (6c43ad3).
Report is 964 commits behind head on master.

Current head 6c43ad3 differs from pull request most recent head 2a55000

Please upload reports for the commit 2a55000 to get more accurate results.

Files | Patch % | Lines
...er/pkg/controller/nodes/task/k8s/plugin_manager.go | 59.15% | 24 Missing and 5 partials ⚠️
...ler/pkg/controller/nodes/task/k8s/event_watcher.go | 77.77% | 12 Missing ⚠️
...propeller/pkg/controller/nodes/task/transformer.go | 61.11% | 6 Missing and 1 partial ⚠️
flytepropeller/pkg/controller/controller.go | 0.00% | 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4157      +/-   ##
==========================================
+ Coverage   58.98%   59.32%   +0.34%     
==========================================
  Files         618      550      -68     
  Lines       52708    39699   -13009     
==========================================
- Hits        31088    23551    -7537     
+ Misses      19140    13828    -5312     
+ Partials     2480     2320     -160     
Flag Coverage Δ
unittests ?


Signed-off-by: Kevin Su <[email protected]>
honnix (Member) commented Oct 3, 2023

KeepAliveParameters: &KeepAliveParameters{
Time: config.Duration{Duration: 10 * time.Second},
Timeout: config.Duration{Duration: 5 * time.Second},
PermitWithoutStream: true,

What is the reason we set this to true? It defaults to false I think.
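
For reference, false is indeed the zero value of PermitWithoutStream in grpc-go: with false, the client only sends keepalive pings while at least one RPC is active, so setting it to true keeps an otherwise idle client pinging (the PR doesn't say whether that was the intent). A tiny illustrative check:

package main

import (
	"fmt"

	"google.golang.org/grpc/keepalive"
)

func main() {
	// The zero-value ClientParameters leaves PermitWithoutStream at false:
	// keepalive pings are only sent while there is an active stream.
	var kp keepalive.ClientParameters
	fmt.Println(kp.PermitWithoutStream) // prints: false
}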

@@ -43,6 +43,11 @@ var (
Endpoint: "dns:///flyteagent.flyte.svc.cluster.local:80",
Insecure: true,
DefaultTimeout: config.Duration{Duration: 10 * time.Second},
KeepAliveParameters: &KeepAliveParameters{

What do you think about keeping the existing defaults defined by https://pkg.go.dev/google.golang.org/grpc/keepalive#ClientParameters unchanged, and only giving a value for Time? Also, 10s seems a bit aggressive. In a large setup, the overhead of DNS resolution is not negligible, plus 10s might even be smaller than the DNS cache timeout, so in many cases the refresh won't give any new IPs. In our backend production setup we default this to 5 minutes, just for reference.
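
As a sketch of that suggestion: set only Time and leave the other fields at their zero values, which grpc-go then replaces with its own defaults (Timeout 20s, PermitWithoutStream false). The 5-minute value and the package/function names are taken from the reviewer's reference setup and are illustrative, not values from this PR.

package agentclient

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepAliveDialOption sets only Time; Timeout and PermitWithoutStream are left
// as zero values so grpc-go falls back to its defaults (20s and false).
func keepAliveDialOption() grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time: 5 * time.Minute, // reviewer's reference cadence; could be exposed via config
	})
}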

@@ -43,6 +43,11 @@ var (
Endpoint: "dns:///flyteagent.flyte.svc.cluster.local:80",
Insecure: true,
DefaultTimeout: config.Duration{Duration: 10 * time.Second},
KeepAliveParameters: &KeepAliveParameters{
Time: config.Duration{Duration: 10 * time.Second},
Timeout: config.Duration{Duration: 5 * time.Second},

Suggested change:
- Timeout: config.Duration{Duration: 5 * time.Second},
+ Timeout: config.Duration{Duration: 20 * time.Second},

KeepAliveParameters: &KeepAliveParameters{
Time: config.Duration{Duration: 10 * time.Second},
Timeout: config.Duration{Duration: 5 * time.Second},
PermitWithoutStream: true,

Suggested change:
- PermitWithoutStream: true,
+ PermitWithoutStream: false,

io "github.com/flyteorg/flyte/flyteplugins/go/tasks/pluginmachinery/io"
core "github.com/flyteorg/flyteidl/gen/pb-go/flyteidl/core"

Is this an intended change? It looks wrong in terms of ordering.

honnix (Member) commented Oct 3, 2023

After reading the doc more in depth, I'm wondering whether this keepalive is intended for IP refreshing. It seems not, to me. I think the problem might be more related to the name resolver; the ping itself may not trigger name resolution refreshing.

There is more info:

That being said, I think this configuration still makes sense. It's just that I'm not sure it solves the problem. Internally we have a custom name resolver that goes together with client-side load balancing, so we have not seen this problem.
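
Not the reviewer's internal setup, but for context on the name-resolver point: grpc-go's built-in DNS resolver (selected by the dns:/// target) can be paired with client-side round_robin balancing via the default service config, which spreads RPCs across whatever pod IPs the headless service currently resolves to; re-resolution is generally driven by connection failures rather than by keepalive pings. A minimal sketch:

package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The dns:/// scheme selects grpc-go's DNS resolver; round_robin then opens
	// a subchannel per resolved address instead of pinning to a single pod.
	conn, err := grpc.Dial(
		"dns:///flyteagent.flyte.svc.cluster.local:80",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}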

pingsutw (Member, Author) commented Oct 3, 2023

It did solve the issue after adding these configs. Do you use HPA internally? The problem is that I need to restart propeller if I scale up the agent; otherwise, propeller won't send requests to the new pod.


honnix (Member) commented Oct 3, 2023

It did solve the issue after adding these configs. Do you use HPA internally? The problem is that I need to restart propeller if I scale up the agent; otherwise, propeller won't send requests to the new pod.

I actually can't explain how this regular ping would trigger name resolution, or why it would be different from normal traffic when it comes to name resolution. Without this configuration, if there is no normal traffic there is no resolution, and when traffic comes back it should work the same as sending a ping, triggering resolution if the DNS cache has timed out. Did you find any doc stating that the ping explicitly forces resolution?

Yes, we use HPA.

github-actions bot added the stale label on Jun 30, 2024
pingsutw marked this pull request as draft on August 16, 2024 20:17