Upstream contributions from Union.ai #5769

andrewwdye · 2024-09-23T17:11:38Z

Why are the changes needed?

This change upstreams a series of contributions from Union.ai.

What changes were proposed in this pull request?

Fix cluster pool assignment validation (@iaroslav-ciupin )
Don't send inputURI for start-node (@iaroslav-ciupin )
Unexpectedly deleted pod metrics (@iaroslav-ciupin )
CreateDownloadLink: Head before signing (@iaroslav-ciupin )
Fix metrics scale division in timer (@iaroslav-ciupin )
Add histogram stopwatch to stow storage (@andrewwdye )
Override ArrayNode log links with map plugin (@hamersaw )
Fix k3d local setup prefix (@andrewwdye )
adjust Dask LogName to (Dask Runner Logs) (@fiedlerNr9 )
Dask dashboard should have a separate log config (@EngHabu )
Log and monitor failures to validate access tokens (@katrogan )
Fix type assertion when an event is missed while connection to apiser… (@EngHabu )
Add read replica host config and connection (@squiishyy )
added lock to memstore make threadsafe (@hamersaw )
Move storage cache settings to correct location (@mbarrien )
Add config for grpc MaxMessageSizeBytes (@andrewwdye )
Add org to CreateUploadLocation (@katrogan )
Histogram Bucket Options (@squiishyy )
Add client-go metrics (@andrewwdye )
Enqueue owner on launchplan terminal state (@hamersaw )
Add configuration for launchplan cache resync duration (@hamersaw )
Overlap fetching input and output data (@andrewwdye )
Fix async notifications tests (@andrewwdye )
Overlap FutureFileReader blob store writes/reads (@andrewwdye )
Overlap create execution blob store reads/writes (@andrewwdye )

How was this patch tested?

TBD

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

N/A

Docs link

N/A

codecov · 2024-09-23T17:15:39Z

Codecov Report

Attention: Patch coverage is 52.07280% with 474 lines in your changes missing coverage. Please review.

Project coverage is 36.81%. Comparing base (56b6d6d) to head (12de9a6).
Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
flyteadmin/pkg/manager/mocks/resource_interface.go	0.00%	275 Missing ⚠️
...eller/pkg/controller/nodes/array/event_recorder.go	67.74%	23 Missing and 7 partials ⚠️
flytestdlib/promutils/scope.go	59.67%	23 Missing and 2 partials ⚠️
...eplugins/go/tasks/plugins/k8s/dask/config_flags.go	40.00%	21 Missing ⚠️
...er/pkg/controller/nodes/task/future_file_reader.go	0.00%	20 Missing ⚠️
flytestdlib/database/db.go	0.00%	18 Missing ⚠️
flytepropeller/pkg/controller/controller.go	0.00%	15 Missing ⚠️
flyteadmin/auth/authzserver/provider.go	42.10%	10 Missing and 1 partial ⚠️
...ytestdlib/promutils/labeled/histogram_stopwatch.go	81.63%	5 Missing and 4 partials ⚠️
flyteadmin/auth/handlers.go	0.00%	7 Missing ⚠️
... and 11 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5769      +/-   ##
==========================================
+ Coverage   36.71%   36.81%   +0.09%     
==========================================
  Files        1304     1309       +5     
  Lines      130081   130895     +814     
==========================================
+ Hits        47764    48189     +425     
- Misses      78147    78524     +377     
- Partials     4170     4182      +12

Flag	Coverage Δ
unittests-datacatalog	`51.58% <ø> (ø)`
unittests-flyteadmin	`54.03% <34.88%> (-0.39%)`	⬇️
unittests-flytecopilot	`11.73% <ø> (ø)`
unittests-flytectl	`62.45% <ø> (+0.04%)`	⬆️
unittests-flyteidl	`6.92% <87.50%> (+0.03%)`	⬆️
unittests-flyteplugins	`53.59% <48.88%> (-0.03%)`	⬇️
unittests-flytepropeller	`43.00% <61.83%> (+0.15%)`	⬆️
unittests-flytestdlib	`55.41% <74.58%> (+0.62%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This change modifies launch paths stemming from `launchExecutionAndPrepareModel` to overlap blob store write and read calls, which dominate end-to-end latency (as seen in the traces below). Signed-off-by: Andrew Dye <[email protected]>

This change updates `FutureFileReader.Cache` and `FutureFileReader.RetrieveCache` to use overlapped write and reads, respectively, to reduce end-to-end latency. The read path is a common operation on each iteration of the propeller `Handle` loop for dynamic nodes. Signed-off-by: Andrew Dye <[email protected]>

I didn't chase down why assumptions changed here and why these tests broke, but fixing them with more explicit checks. Signed-off-by: Andrew Dye <[email protected]>

This change updates `GetExecutionData`, `GetNodeExecutionData`, and `GetTaskExecutionData` to use overlapped reads when fetching input and output data. Signed-off-by: Andrew Dye <[email protected]>

Currently, the launchplan cache resync duration uses the DownstreamEval duration configuration which is also used for the sync period on the k8s client. This means if we want to configure a more aggressive launchplan cache resync, we would also incur overhead in syncing all k8s resources (ex. Pods from `PodPlugin`). By adding a separate configuration value we can update these independently. Signed-off-by: Andrew Dye <[email protected]>

This PR enqueues the owner workflow for evaluation when the launchplan auto refresh cache detects a launchplan in a terminal state. Signed-off-by: Andrew Dye <[email protected]>

Register a few metric callbacks with the client-go metrics interface so that we can monitor request latencies and rate limiting of kubeclient. ``` ❯ curl http://localhost:10254/metrics | rg k8s_client k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.005"} 84 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.01"} 87 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.025"} 89 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.05"} 99 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.1"} 114 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.25"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.5"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="1"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="2.5"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="5"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="10"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="+Inf"} 117 k8s_client_rate_limiter_latency_sum{verb="GET"} 1.9358371670000003 k8s_client_rate_limiter_latency_count{verb="GET"} 117 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.005"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.01"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.025"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.05"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.1"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.25"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.5"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="1"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="2.5"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="5"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="10"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="+Inf"} 6 k8s_client_rate_limiter_latency_sum{verb="POST"} 1.0542e-05 k8s_client_rate_limiter_latency_count{verb="POST"} 6 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.005"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.01"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.025"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.05"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.1"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.25"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.5"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="1"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="2.5"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="5"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="10"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="+Inf"} 1 k8s_client_rate_limiter_latency_sum{verb="PUT"} 5e-07 k8s_client_rate_limiter_latency_count{verb="PUT"} 1 k8s_client_request_latency_bucket{verb="GET",le="0.005"} 84 k8s_client_request_latency_bucket{verb="GET",le="0.01"} 86 k8s_client_request_latency_bucket{verb="GET",le="0.025"} 89 k8s_client_request_latency_bucket{verb="GET",le="0.05"} 99 k8s_client_request_latency_bucket{verb="GET",le="0.1"} 112 k8s_client_request_latency_bucket{verb="GET",le="0.25"} 117 k8s_client_request_latency_bucket{verb="GET",le="0.5"} 117 k8s_client_request_latency_bucket{verb="GET",le="1"} 117 k8s_client_request_latency_bucket{verb="GET",le="2.5"} 117 k8s_client_request_latency_bucket{verb="GET",le="5"} 117 k8s_client_request_latency_bucket{verb="GET",le="10"} 117 k8s_client_request_latency_bucket{verb="GET",le="+Inf"} 117 k8s_client_request_latency_sum{verb="GET"} 2.1254330859999997 k8s_client_request_latency_count{verb="GET"} 117 k8s_client_request_latency_bucket{verb="POST",le="0.005"} 5 k8s_client_request_latency_bucket{verb="POST",le="0.01"} 5 k8s_client_request_latency_bucket{verb="POST",le="0.025"} 5 k8s_client_request_latency_bucket{verb="POST",le="0.05"} 6 k8s_client_request_latency_bucket{verb="POST",le="0.1"} 6 k8s_client_request_latency_bucket{verb="POST",le="0.25"} 6 k8s_client_request_latency_bucket{verb="POST",le="0.5"} 6 k8s_client_request_latency_bucket{verb="POST",le="1"} 6 k8s_client_request_latency_bucket{verb="POST",le="2.5"} 6 k8s_client_request_latency_bucket{verb="POST",le="5"} 6 k8s_client_request_latency_bucket{verb="POST",le="10"} 6 k8s_client_request_latency_bucket{verb="POST",le="+Inf"} 6 k8s_client_request_latency_sum{verb="POST"} 0.048558582 k8s_client_request_latency_count{verb="POST"} 6 k8s_client_request_latency_bucket{verb="PUT",le="0.005"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.01"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.025"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.05"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.1"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.25"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.5"} 1 k8s_client_request_latency_bucket{verb="PUT",le="1"} 1 k8s_client_request_latency_bucket{verb="PUT",le="2.5"} 1 k8s_client_request_latency_bucket{verb="PUT",le="5"} 1 k8s_client_request_latency_bucket{verb="PUT",le="10"} 1 k8s_client_request_latency_bucket{verb="PUT",le="+Inf"} 1 k8s_client_request_latency_sum{verb="PUT"} 0.002381375 k8s_client_request_latency_count{verb="PUT"} 1 k8s_client_request_total{code="200",method="GET"} 120 k8s_client_request_total{code="200",method="PUT"} 1 k8s_client_request_total{code="409",method="POST"} 6 ``` Signed-off-by: Andrew Dye <[email protected]>

Add abstraction to be able to pass buckets custom defined to histogram vectors. Signed-off-by: Andrew Dye <[email protected]>

Signed-off-by: Andrew Dye <[email protected]>

We need to make the grpc max recv message size in propeller's admin client configurable to match the server-side configuration we support in admin. Signed-off-by: Andrew Dye <[email protected]>

Signed-off-by: Andrew Dye <[email protected]>

- Add a new field to the postgres db config struct, `readReplicaHost`. - Add a new endpoint in the `database` package to enable establishing a connection with a db without creating it if it doesn't exist Signed-off-by: Andrew Dye <[email protected]>

…ver was severed Signed-off-by: Andrew Dye <[email protected]>

Signed-off-by: Andrew Dye <[email protected]>

I was trying to use `setup_local_dev.sh`, and it wasn't working out of the box. Looks like it expects `k3d-` prefix for the kubecontext Ran `setup_local_dev.sh` Signed-off-by: Andrew Dye <[email protected]>

This PR adds a configuration option to override ArrayNode log links with those defined in the map plugin. The map plugin contains it's own configuration for log links, which may differ from those defined on the PodPlugin. ArrayNode, executing subNodes as regular tasks (ie. using the PodPlugin) means that it uses the default PodPlugin log templates. Signed-off-by: Andrew Dye <[email protected]>

This change * Adds a new `HistogramStopWatch` to promutils. This [allows for aggregating latencies](https://prometheus.io/docs/practices/histograms/#quantiles) across pods and computing quantiles at query time * Adds `HistogramStopWatch` latency metrics for stow so that we can reason about storage latencies in aggregate. Existing latency metrics remain. - [x] Added unittests Signed-off-by: Andrew Dye <[email protected]>

* Fix metrics scale division in timer Signed-off-by: Iaroslav Ciupin <[email protected]> Signed-off-by: Andrew Dye <[email protected]>

Signed-off-by: Andrew Dye <[email protected]>

* Count when we see unexpectedly terminated pods Signed-off-by: Andrew Dye <[email protected]>

* send empty `inputUri` for `start-node` in node execution event to flyteadmin and therefore, GetNodeExecutionData will not attempt to download non-existing inputUri as it was doing before this change. * add DB migration to clear `input_uri` in existing `node_executions` table for start nodes. Signed-off-by: Andrew Dye <[email protected]>

Signed-off-by: Andrew Dye <[email protected]>

Signed-off-by: Eduardo Apolinario <[email protected]>

* Overlap create execution blob store reads/writes This change modifies launch paths stemming from `launchExecutionAndPrepareModel` to overlap blob store write and read calls, which dominate end-to-end latency (as seen in the traces below). Signed-off-by: Andrew Dye <[email protected]> * Overlap FutureFileReader blob store writes/reads This change updates `FutureFileReader.Cache` and `FutureFileReader.RetrieveCache` to use overlapped write and reads, respectively, to reduce end-to-end latency. The read path is a common operation on each iteration of the propeller `Handle` loop for dynamic nodes. Signed-off-by: Andrew Dye <[email protected]> * Fix async notifications tests I didn't chase down why assumptions changed here and why these tests broke, but fixing them with more explicit checks. Signed-off-by: Andrew Dye <[email protected]> * Overlap fetching input and output data This change updates `GetExecutionData`, `GetNodeExecutionData`, and `GetTaskExecutionData` to use overlapped reads when fetching input and output data. Signed-off-by: Andrew Dye <[email protected]> * Add configuration for launchplan cache resync duration Currently, the launchplan cache resync duration uses the DownstreamEval duration configuration which is also used for the sync period on the k8s client. This means if we want to configure a more aggressive launchplan cache resync, we would also incur overhead in syncing all k8s resources (ex. Pods from `PodPlugin`). By adding a separate configuration value we can update these independently. Signed-off-by: Andrew Dye <[email protected]> * Enqueue owner on launchplan terminal state This PR enqueues the owner workflow for evaluation when the launchplan auto refresh cache detects a launchplan in a terminal state. Signed-off-by: Andrew Dye <[email protected]> * Add client-go metrics Register a few metric callbacks with the client-go metrics interface so that we can monitor request latencies and rate limiting of kubeclient. ``` ❯ curl http://localhost:10254/metrics | rg k8s_client k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.005"} 84 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.01"} 87 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.025"} 89 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.05"} 99 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.1"} 114 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.25"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.5"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="1"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="2.5"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="5"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="10"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="+Inf"} 117 k8s_client_rate_limiter_latency_sum{verb="GET"} 1.9358371670000003 k8s_client_rate_limiter_latency_count{verb="GET"} 117 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.005"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.01"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.025"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.05"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.1"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.25"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.5"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="1"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="2.5"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="5"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="10"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="+Inf"} 6 k8s_client_rate_limiter_latency_sum{verb="POST"} 1.0542e-05 k8s_client_rate_limiter_latency_count{verb="POST"} 6 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.005"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.01"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.025"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.05"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.1"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.25"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.5"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="1"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="2.5"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="5"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="10"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="+Inf"} 1 k8s_client_rate_limiter_latency_sum{verb="PUT"} 5e-07 k8s_client_rate_limiter_latency_count{verb="PUT"} 1 k8s_client_request_latency_bucket{verb="GET",le="0.005"} 84 k8s_client_request_latency_bucket{verb="GET",le="0.01"} 86 k8s_client_request_latency_bucket{verb="GET",le="0.025"} 89 k8s_client_request_latency_bucket{verb="GET",le="0.05"} 99 k8s_client_request_latency_bucket{verb="GET",le="0.1"} 112 k8s_client_request_latency_bucket{verb="GET",le="0.25"} 117 k8s_client_request_latency_bucket{verb="GET",le="0.5"} 117 k8s_client_request_latency_bucket{verb="GET",le="1"} 117 k8s_client_request_latency_bucket{verb="GET",le="2.5"} 117 k8s_client_request_latency_bucket{verb="GET",le="5"} 117 k8s_client_request_latency_bucket{verb="GET",le="10"} 117 k8s_client_request_latency_bucket{verb="GET",le="+Inf"} 117 k8s_client_request_latency_sum{verb="GET"} 2.1254330859999997 k8s_client_request_latency_count{verb="GET"} 117 k8s_client_request_latency_bucket{verb="POST",le="0.005"} 5 k8s_client_request_latency_bucket{verb="POST",le="0.01"} 5 k8s_client_request_latency_bucket{verb="POST",le="0.025"} 5 k8s_client_request_latency_bucket{verb="POST",le="0.05"} 6 k8s_client_request_latency_bucket{verb="POST",le="0.1"} 6 k8s_client_request_latency_bucket{verb="POST",le="0.25"} 6 k8s_client_request_latency_bucket{verb="POST",le="0.5"} 6 k8s_client_request_latency_bucket{verb="POST",le="1"} 6 k8s_client_request_latency_bucket{verb="POST",le="2.5"} 6 k8s_client_request_latency_bucket{verb="POST",le="5"} 6 k8s_client_request_latency_bucket{verb="POST",le="10"} 6 k8s_client_request_latency_bucket{verb="POST",le="+Inf"} 6 k8s_client_request_latency_sum{verb="POST"} 0.048558582 k8s_client_request_latency_count{verb="POST"} 6 k8s_client_request_latency_bucket{verb="PUT",le="0.005"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.01"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.025"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.05"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.1"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.25"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.5"} 1 k8s_client_request_latency_bucket{verb="PUT",le="1"} 1 k8s_client_request_latency_bucket{verb="PUT",le="2.5"} 1 k8s_client_request_latency_bucket{verb="PUT",le="5"} 1 k8s_client_request_latency_bucket{verb="PUT",le="10"} 1 k8s_client_request_latency_bucket{verb="PUT",le="+Inf"} 1 k8s_client_request_latency_sum{verb="PUT"} 0.002381375 k8s_client_request_latency_count{verb="PUT"} 1 k8s_client_request_total{code="200",method="GET"} 120 k8s_client_request_total{code="200",method="PUT"} 1 k8s_client_request_total{code="409",method="POST"} 6 ``` Signed-off-by: Andrew Dye <[email protected]> * Histogram Bucket Options Add abstraction to be able to pass buckets custom defined to histogram vectors. Signed-off-by: Andrew Dye <[email protected]> * Add org to CreateUploadLocation Signed-off-by: Andrew Dye <[email protected]> * Add config for grpc MaxMessageSizeBytes We need to make the grpc max recv message size in propeller's admin client configurable to match the server-side configuration we support in admin. Signed-off-by: Andrew Dye <[email protected]> * Move storage cache settings to correct location Signed-off-by: Andrew Dye <[email protected]> * added lock to memstore make threadsafe Signed-off-by: Andrew Dye <[email protected]> * Add read replica host config and connection - Add a new field to the postgres db config struct, `readReplicaHost`. - Add a new endpoint in the `database` package to enable establishing a connection with a db without creating it if it doesn't exist Signed-off-by: Andrew Dye <[email protected]> * Fix type assertion when an event is missed while connection to apiser… …ver was severed Signed-off-by: Andrew Dye <[email protected]> * Log and monitor failures to validate access tokens Signed-off-by: Andrew Dye <[email protected]> * Dask dashboard should have a separate log config Signed-off-by: Andrew Dye <[email protected]> * adjust Dask LogName to (Dask Runner Logs) Signed-off-by: Andrew Dye <[email protected]> * Fix k3d local setup prefix I was trying to use `setup_local_dev.sh`, and it wasn't working out of the box. Looks like it expects `k3d-` prefix for the kubecontext Ran `setup_local_dev.sh` Signed-off-by: Andrew Dye <[email protected]> * Override ArrayNode log links with map plugin This PR adds a configuration option to override ArrayNode log links with those defined in the map plugin. The map plugin contains it's own configuration for log links, which may differ from those defined on the PodPlugin. ArrayNode, executing subNodes as regular tasks (ie. using the PodPlugin) means that it uses the default PodPlugin log templates. Signed-off-by: Andrew Dye <[email protected]> * Add histogram stopwatch to stow storage This change * Adds a new `HistogramStopWatch` to promutils. This [allows for aggregating latencies](https://prometheus.io/docs/practices/histograms/#quantiles) across pods and computing quantiles at query time * Adds `HistogramStopWatch` latency metrics for stow so that we can reason about storage latencies in aggregate. Existing latency metrics remain. - [x] Added unittests Signed-off-by: Andrew Dye <[email protected]> * Fix metrics scale division in timer * Fix metrics scale division in timer Signed-off-by: Iaroslav Ciupin <[email protected]> Signed-off-by: Andrew Dye <[email protected]> * CreateDownloadLink: Head before signing Signed-off-by: Andrew Dye <[email protected]> * Unexpectedly deleted pod metrics * Count when we see unexpectedly terminated pods Signed-off-by: Andrew Dye <[email protected]> * Don't send inputURI for start-node * send empty `inputUri` for `start-node` in node execution event to flyteadmin and therefore, GetNodeExecutionData will not attempt to download non-existing inputUri as it was doing before this change. * add DB migration to clear `input_uri` in existing `node_executions` table for start nodes. Signed-off-by: Andrew Dye <[email protected]> * Fix cluster pool assignment validation Signed-off-by: Andrew Dye <[email protected]> --------- Signed-off-by: Andrew Dye <[email protected]> Signed-off-by: Eduardo Apolinario <[email protected]> Co-authored-by: Dan Rammer <[email protected]> Co-authored-by: Joe Eschen <[email protected]> Co-authored-by: Katrina Rogan <[email protected]> Co-authored-by: Michael Barrientos <[email protected]> Co-authored-by: Haytham Abuelfutuh <[email protected]> Co-authored-by: Jan Fiedler <[email protected]> Co-authored-by: Iaroslav Ciupin <[email protected]> Co-authored-by: Eduardo Apolinario <[email protected]>

* Add finalizer to avoid premature CRD Deletion (#5788) Signed-off-by: Rafael Raposo <[email protected]> * Handle CORS in secure connections (#5855) * Upstream contributions from Union.ai (#5769) * Overlap create execution blob store reads/writes This change modifies launch paths stemming from `launchExecutionAndPrepareModel` to overlap blob store write and read calls, which dominate end-to-end latency (as seen in the traces below). Signed-off-by: Andrew Dye <[email protected]> * Overlap FutureFileReader blob store writes/reads This change updates `FutureFileReader.Cache` and `FutureFileReader.RetrieveCache` to use overlapped write and reads, respectively, to reduce end-to-end latency. The read path is a common operation on each iteration of the propeller `Handle` loop for dynamic nodes. Signed-off-by: Andrew Dye <[email protected]> * Fix async notifications tests I didn't chase down why assumptions changed here and why these tests broke, but fixing them with more explicit checks. Signed-off-by: Andrew Dye <[email protected]> * Overlap fetching input and output data This change updates `GetExecutionData`, `GetNodeExecutionData`, and `GetTaskExecutionData` to use overlapped reads when fetching input and output data. Signed-off-by: Andrew Dye <[email protected]> * Add configuration for launchplan cache resync duration Currently, the launchplan cache resync duration uses the DownstreamEval duration configuration which is also used for the sync period on the k8s client. This means if we want to configure a more aggressive launchplan cache resync, we would also incur overhead in syncing all k8s resources (ex. Pods from `PodPlugin`). By adding a separate configuration value we can update these independently. Signed-off-by: Andrew Dye <[email protected]> * Enqueue owner on launchplan terminal state This PR enqueues the owner workflow for evaluation when the launchplan auto refresh cache detects a launchplan in a terminal state. Signed-off-by: Andrew Dye <[email protected]> * Add client-go metrics Register a few metric callbacks with the client-go metrics interface so that we can monitor request latencies and rate limiting of kubeclient. ``` ❯ curl http://localhost:10254/metrics | rg k8s_client k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.005"} 84 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.01"} 87 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.025"} 89 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.05"} 99 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.1"} 114 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.25"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.5"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="1"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="2.5"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="5"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="10"} 117 k8s_client_rate_limiter_latency_bucket{verb="GET",le="+Inf"} 117 k8s_client_rate_limiter_latency_sum{verb="GET"} 1.9358371670000003 k8s_client_rate_limiter_latency_count{verb="GET"} 117 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.005"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.01"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.025"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.05"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.1"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.25"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.5"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="1"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="2.5"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="5"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="10"} 6 k8s_client_rate_limiter_latency_bucket{verb="POST",le="+Inf"} 6 k8s_client_rate_limiter_latency_sum{verb="POST"} 1.0542e-05 k8s_client_rate_limiter_latency_count{verb="POST"} 6 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.005"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.01"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.025"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.05"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.1"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.25"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.5"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="1"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="2.5"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="5"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="10"} 1 k8s_client_rate_limiter_latency_bucket{verb="PUT",le="+Inf"} 1 k8s_client_rate_limiter_latency_sum{verb="PUT"} 5e-07 k8s_client_rate_limiter_latency_count{verb="PUT"} 1 k8s_client_request_latency_bucket{verb="GET",le="0.005"} 84 k8s_client_request_latency_bucket{verb="GET",le="0.01"} 86 k8s_client_request_latency_bucket{verb="GET",le="0.025"} 89 k8s_client_request_latency_bucket{verb="GET",le="0.05"} 99 k8s_client_request_latency_bucket{verb="GET",le="0.1"} 112 k8s_client_request_latency_bucket{verb="GET",le="0.25"} 117 k8s_client_request_latency_bucket{verb="GET",le="0.5"} 117 k8s_client_request_latency_bucket{verb="GET",le="1"} 117 k8s_client_request_latency_bucket{verb="GET",le="2.5"} 117 k8s_client_request_latency_bucket{verb="GET",le="5"} 117 k8s_client_request_latency_bucket{verb="GET",le="10"} 117 k8s_client_request_latency_bucket{verb="GET",le="+Inf"} 117 k8s_client_request_latency_sum{verb="GET"} 2.1254330859999997 k8s_client_request_latency_count{verb="GET"} 117 k8s_client_request_latency_bucket{verb="POST",le="0.005"} 5 k8s_client_request_latency_bucket{verb="POST",le="0.01"} 5 k8s_client_request_latency_bucket{verb="POST",le="0.025"} 5 k8s_client_request_latency_bucket{verb="POST",le="0.05"} 6 k8s_client_request_latency_bucket{verb="POST",le="0.1"} 6 k8s_client_request_latency_bucket{verb="POST",le="0.25"} 6 k8s_client_request_latency_bucket{verb="POST",le="0.5"} 6 k8s_client_request_latency_bucket{verb="POST",le="1"} 6 k8s_client_request_latency_bucket{verb="POST",le="2.5"} 6 k8s_client_request_latency_bucket{verb="POST",le="5"} 6 k8s_client_request_latency_bucket{verb="POST",le="10"} 6 k8s_client_request_latency_bucket{verb="POST",le="+Inf"} 6 k8s_client_request_latency_sum{verb="POST"} 0.048558582 k8s_client_request_latency_count{verb="POST"} 6 k8s_client_request_latency_bucket{verb="PUT",le="0.005"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.01"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.025"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.05"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.1"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.25"} 1 k8s_client_request_latency_bucket{verb="PUT",le="0.5"} 1 k8s_client_request_latency_bucket{verb="PUT",le="1"} 1 k8s_client_request_latency_bucket{verb="PUT",le="2.5"} 1 k8s_client_request_latency_bucket{verb="PUT",le="5"} 1 k8s_client_request_latency_bucket{verb="PUT",le="10"} 1 k8s_client_request_latency_bucket{verb="PUT",le="+Inf"} 1 k8s_client_request_latency_sum{verb="PUT"} 0.002381375 k8s_client_request_latency_count{verb="PUT"} 1 k8s_client_request_total{code="200",method="GET"} 120 k8s_client_request_total{code="200",method="PUT"} 1 k8s_client_request_total{code="409",method="POST"} 6 ``` Signed-off-by: Andrew Dye <[email protected]> * Histogram Bucket Options Add abstraction to be able to pass buckets custom defined to histogram vectors. Signed-off-by: Andrew Dye <[email protected]> * Add org to CreateUploadLocation Signed-off-by: Andrew Dye <[email protected]> * Add config for grpc MaxMessageSizeBytes We need to make the grpc max recv message size in propeller's admin client configurable to match the server-side configuration we support in admin. Signed-off-by: Andrew Dye <[email protected]> * Move storage cache settings to correct location Signed-off-by: Andrew Dye <[email protected]> * added lock to memstore make threadsafe Signed-off-by: Andrew Dye <[email protected]> * Add read replica host config and connection - Add a new field to the postgres db config struct, `readReplicaHost`. - Add a new endpoint in the `database` package to enable establishing a connection with a db without creating it if it doesn't exist Signed-off-by: Andrew Dye <[email protected]> * Fix type assertion when an event is missed while connection to apiser… …ver was severed Signed-off-by: Andrew Dye <[email protected]> * Log and monitor failures to validate access tokens Signed-off-by: Andrew Dye <[email protected]> * Dask dashboard should have a separate log config Signed-off-by: Andrew Dye <[email protected]> * adjust Dask LogName to (Dask Runner Logs) Signed-off-by: Andrew Dye <[email protected]> * Fix k3d local setup prefix I was trying to use `setup_local_dev.sh`, and it wasn't working out of the box. Looks like it expects `k3d-` prefix for the kubecontext Ran `setup_local_dev.sh` Signed-off-by: Andrew Dye <[email protected]> * Override ArrayNode log links with map plugin This PR adds a configuration option to override ArrayNode log links with those defined in the map plugin. The map plugin contains it's own configuration for log links, which may differ from those defined on the PodPlugin. ArrayNode, executing subNodes as regular tasks (ie. using the PodPlugin) means that it uses the default PodPlugin log templates. Signed-off-by: Andrew Dye <[email protected]> * Add histogram stopwatch to stow storage This change * Adds a new `HistogramStopWatch` to promutils. This [allows for aggregating latencies](https://prometheus.io/docs/practices/histograms/#quantiles) across pods and computing quantiles at query time * Adds `HistogramStopWatch` latency metrics for stow so that we can reason about storage latencies in aggregate. Existing latency metrics remain. - [x] Added unittests Signed-off-by: Andrew Dye <[email protected]> * Fix metrics scale division in timer * Fix metrics scale division in timer Signed-off-by: Iaroslav Ciupin <[email protected]> Signed-off-by: Andrew Dye <[email protected]> * CreateDownloadLink: Head before signing Signed-off-by: Andrew Dye <[email protected]> * Unexpectedly deleted pod metrics * Count when we see unexpectedly terminated pods Signed-off-by: Andrew Dye <[email protected]> * Don't send inputURI for start-node * send empty `inputUri` for `start-node` in node execution event to flyteadmin and therefore, GetNodeExecutionData will not attempt to download non-existing inputUri as it was doing before this change. * add DB migration to clear `input_uri` in existing `node_executions` table for start nodes. Signed-off-by: Andrew Dye <[email protected]> * Fix cluster pool assignment validation Signed-off-by: Andrew Dye <[email protected]> --------- Signed-off-by: Andrew Dye <[email protected]> Signed-off-by: Eduardo Apolinario <[email protected]> Co-authored-by: Dan Rammer <[email protected]> Co-authored-by: Joe Eschen <[email protected]> Co-authored-by: Katrina Rogan <[email protected]> Co-authored-by: Michael Barrientos <[email protected]> Co-authored-by: Haytham Abuelfutuh <[email protected]> Co-authored-by: Jan Fiedler <[email protected]> Co-authored-by: Iaroslav Ciupin <[email protected]> Co-authored-by: Eduardo Apolinario <[email protected]> --------- Signed-off-by: Rafael Raposo <[email protected]> Signed-off-by: Andrew Dye <[email protected]> Signed-off-by: Eduardo Apolinario <[email protected]> Co-authored-by: Rafael Raposo <[email protected]> Co-authored-by: Andrew Dye <[email protected]> Co-authored-by: Dan Rammer <[email protected]> Co-authored-by: Joe Eschen <[email protected]> Co-authored-by: Katrina Rogan <[email protected]> Co-authored-by: Michael Barrientos <[email protected]> Co-authored-by: Haytham Abuelfutuh <[email protected]> Co-authored-by: Jan Fiedler <[email protected]> Co-authored-by: Iaroslav Ciupin <[email protected]> Co-authored-by: Eduardo Apolinario <[email protected]>

andrewwdye force-pushed the union/upstream branch 19 times, most recently from 69991e1 to 0d1583e Compare September 25, 2024 20:23

andrewwdye marked this pull request as ready for review September 25, 2024 21:46

andrewwdye requested review from eapolinario and wild-endeavor September 25, 2024 21:47

andrewwdye and others added 7 commits September 30, 2024 16:25

Overlap create execution blob store reads/writes

8437865

This change modifies launch paths stemming from `launchExecutionAndPrepareModel` to overlap blob store write and read calls, which dominate end-to-end latency (as seen in the traces below). Signed-off-by: Andrew Dye <[email protected]>

Fix async notifications tests

ea56a96

I didn't chase down why assumptions changed here and why these tests broke, but fixing them with more explicit checks. Signed-off-by: Andrew Dye <[email protected]>

Overlap fetching input and output data

3e1f249

This change updates `GetExecutionData`, `GetNodeExecutionData`, and `GetTaskExecutionData` to use overlapped reads when fetching input and output data. Signed-off-by: Andrew Dye <[email protected]>

Enqueue owner on launchplan terminal state

85ac321

This PR enqueues the owner workflow for evaluation when the launchplan auto refresh cache detects a launchplan in a terminal state. Signed-off-by: Andrew Dye <[email protected]>

squiishyy and others added 18 commits September 30, 2024 16:25

Histogram Bucket Options

39b249f

Add abstraction to be able to pass buckets custom defined to histogram vectors. Signed-off-by: Andrew Dye <[email protected]>

Add org to CreateUploadLocation

4752c1e

Signed-off-by: Andrew Dye <[email protected]>

Add config for grpc MaxMessageSizeBytes

4996328

We need to make the grpc max recv message size in propeller's admin client configurable to match the server-side configuration we support in admin. Signed-off-by: Andrew Dye <[email protected]>

Move storage cache settings to correct location

0b935a6

Signed-off-by: Andrew Dye <[email protected]>

added lock to memstore make threadsafe

69f47dd

Signed-off-by: Andrew Dye <[email protected]>

Fix type assertion when an event is missed while connection to apiser…

05dfd9d

…ver was severed Signed-off-by: Andrew Dye <[email protected]>

Log and monitor failures to validate access tokens

7ce2ca8

Signed-off-by: Andrew Dye <[email protected]>

Dask dashboard should have a separate log config

c68d2db

Signed-off-by: Andrew Dye <[email protected]>

adjust Dask LogName to (Dask Runner Logs)

d00a159

Signed-off-by: Andrew Dye <[email protected]>

Fix k3d local setup prefix

6db5458

I was trying to use `setup_local_dev.sh`, and it wasn't working out of the box. Looks like it expects `k3d-` prefix for the kubecontext Ran `setup_local_dev.sh` Signed-off-by: Andrew Dye <[email protected]>

Fix metrics scale division in timer

b699642

* Fix metrics scale division in timer Signed-off-by: Iaroslav Ciupin <[email protected]> Signed-off-by: Andrew Dye <[email protected]>

CreateDownloadLink: Head before signing

04c4a04

Signed-off-by: Andrew Dye <[email protected]>

Unexpectedly deleted pod metrics

765ce2e

* Count when we see unexpectedly terminated pods Signed-off-by: Andrew Dye <[email protected]>

Fix cluster pool assignment validation

895344d

Signed-off-by: Andrew Dye <[email protected]>

andrewwdye force-pushed the union/upstream branch from 03986bc to 895344d Compare September 30, 2024 23:27

Merge remote-tracking branch 'origin' into union/upstream

12de9a6

Signed-off-by: Eduardo Apolinario <[email protected]>

eapolinario approved these changes Oct 22, 2024

View reviewed changes

eapolinario merged commit e4d29f8 into master Oct 22, 2024
65 checks passed

eapolinario deleted the union/upstream branch October 22, 2024 22:46

eapolinario mentioned this pull request Oct 22, 2024

Backport #5788, #5855, and #5769 to release v1.13 #5890

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upstream contributions from Union.ai #5769

Upstream contributions from Union.ai #5769

andrewwdye commented Sep 23, 2024 •

edited

Loading

codecov bot commented Sep 23, 2024 •

edited

Loading

Upstream contributions from Union.ai #5769

Upstream contributions from Union.ai #5769

Conversation

andrewwdye commented Sep 23, 2024 • edited Loading

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

Check all the applicable boxes

Related PRs

Docs link

codecov bot commented Sep 23, 2024 • edited Loading

Codecov Report

andrewwdye commented Sep 23, 2024 •

edited

Loading

codecov bot commented Sep 23, 2024 •

edited

Loading