[Newbie] Supporting Yunikorn and Kueue #5915

0yukali0 · 2024-10-24T23:27:22Z

Tracking issue

Related to #5575

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

Setup process

Screenshots

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

Docs link

Signed-off-by: yuteng <[email protected]>

kumare3 · 2024-11-02T17:07:20Z

My preference is to use kueue or anyone that has min changes to codebase and does not need too many modifications to the pods

Signed-off-by: yuteng <[email protected]>

fg91 · 2024-11-05T18:06:45Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+
+## 1 Executive Summary
+
+Providing kubernetes (k8s) resource management, gang scheduling and preemption for flyte applications by third-party software, including Apache Yunikorn and Kueue.


Could you please explain what preemption means here compared to what preemption means in the context of spot instances on e.g. AWS or GCP?

fg91 · 2024-11-05T18:08:37Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+Flyte support multi-tenancy and various k8s plugins.
+
+Kueue and Yunikorn support gang scheduling and preemption.
+Gang scheduling guarantees the availability of certain K8s crd services, such as Spark, Ray, with sufficient resource and preemption make sure high priority task execute immediately.


guarantees the availability of certain K8s crd services, such as Spark

I would rather say that gang scheduling guarantees that all worker pods derived from a CRD are scheduled at the same time. Would add that this is important to prevent waste of resources when jobs can partially start without being able to do any meaningful work.

fg91 · 2024-11-05T18:09:35Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+      gangscheduling: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=hard"
+    - type: "spark"
+      gangscheduling: "placeholderTimeoutInSeconds=30 gangSchedulingStyle=hard"
+```


Is this list complete or an example? I.e. will this also work for plugins like kubeflow pytorch, tf, mpi or dask, ...?

This is an example.
Admin can set default configuration about gang scheduling for CRD in flyte k8s plugins.

fg91 · 2024-11-05T18:11:34Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+      gangscheduling: "placeholderTimeoutInSeconds=30 gangSchedulingStyle=hard"
+```
+
+Mentioned configuration indicates what queues exist for an org.


Could you please explain what an org is in this context? It's not the same as this org right?

fg91 · 2024-11-05T18:17:26Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+Hierarchical queues will be structured as follows.
+root.org1.ray、root.org1.spark and root.org1.default".
+
+ResourceFlavor allocates resource based on labels which indicates that category-based resource allocation by organization label is available.


Could you please explain how the the reseource flavor will be determined? Is there a way to automatically derive this from the task decorator args @task(resources=..., accelerator=...)?

It would be really nice if tasks that need e.g. an A100 GPU were automatically not in the same queue as tasks that need 2 x T4 GPUs. We're using kubeflow pytorch jobs with scheduler plugins' gang scheduling and have observed jobs being starved that the cluster would have had resources for because other jobs which were trying to get different GPU types couldn't be scheduled but which had a higher priority.

No, need to create Kueue CRDs first.
A cluster queue defines resource quota which property is defined by resource flavors.
I think creating resource flavors to categorizing resources under a cluster queue is available solution.

apiVersion: kueue.x-k8s.io/v1beta1 kind: ResourceFlavor metadata: name: "spot-t4" spec: nodeLabels: cloud.google.com/gke-accelerator: nvidia-tesla-t4 nodeTaints: - effect: NoSchedule key: cloud.google.com/gke-accelerator: nvidia-tesla-t4 value: "true" tolerations: - key: "spot-taint" operator: "Exists" effect: "NoSchedule" apiVersion: kueue.x-k8s.io/v1beta1 kind: ClusterQueue metadata: name: "cluster-queue" spec: namespaceSelector: {} # match all. resourceGroups: - coveredResources: ["cpu", "memory", "nvidia.com/gpu"] flavors: - name: "spot-t4" resources: - name: "cpu" nominalQuota: 9 - name: "memory" nominalQuota: 36Gi - name: "nvidia.com/gpuu" nominalQuota: 50 - name: "spot-a100" resources: - name: "cpu" nominalQuota: 18 - name: "memory" nominalQuota: 72Gi - name: "nvidia.com/gpu" nominalQuota: 100

In the other hand, kueue preemption requires Kueue WorkloadPriorityClass and patching job with label.
The plugin received the preemption label and then it should patch it to pods belonging same job

fg91 · 2024-11-05T18:19:23Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+root.org1.ray、root.org1.spark and root.org1.default".
+
+ResourceFlavor allocates resource based on labels which indicates that category-based resource allocation by organization label is available.
+Thus, a clusterQueue including multiple resources represents the total acessaible resource for an organization.  


I don't understand this sentence tbh, could you please explain/expand?

fg91 · 2024-11-05T18:21:06Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+A tenant can submit organization-specific tasks to queues such as org.ray, org.spark and org.default to track which job types are submittable. 
+
+
+A SchedulerConfigManager maintains config from mentioned yaml.


SchedulerConfigManager would be a go struct or are you suggesting a new backend service?

I don't see this in any of the code snippets below.

fg91 · 2024-11-05T18:22:10Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+A SchedulerConfigManager maintains config from mentioned yaml.
+It patches labels or annotations on k8s resources after they pass rules specified in the configuration.
+
+```go


Why not have a single interface and two implementations of the same interface for yunikorn and kueue?

I would prefer in func (e *PluginManager) launchResource( to not call queue or yunikorn specific code, see snippet below, but just a general interface whose implementation depends on the propeller config.

Agree, i updated the document..

fg91 · 2024-11-05T18:23:25Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+  // Add label is the specific label doesn't exist
+}
+```
+


Do any additional k8s resources have to be created for the queues or does a queue exist as soon as a pod has an annotation with a new queue name?

Yes, Kueue CRDs describe the quota a queue when adopting Kueue.

In the other hand, queues are configured by setting [Yunikorn configuration] (https://yunikorn.apache.org/docs/user_guide/queue_config) if adopting Yunikorn.

fg91 · 2024-11-05T18:26:25Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+in the other hand, a Kueue scheduler plugin constructs labels including localQueueName, preemption.
+
+```go
+func (e *PluginManager) launchResource(ctx context.Context, tCtx pluginsCore.TaskExecutionContext) (pluginsCore.Transition, error) {


Maybe addObjectMetadata which is called by launchResource would be a better place to inject the required metadata. Or do we need to inject something other than labels/annotations?

fg91 · 2024-11-05T18:32:33Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+	if err != nil {
+		return pluginsCore.UnknownTransition, err
+	}
+  if o, err = e.SchedulerPlugin.MutateResourceForKueue(o); err == nil {


We should only mutate the resource if the plugin manager manages a plugin which the user configured a queue for, right? How will this matching be done? Just comparing this type string

queueconfig: scheduler: yunikorn jobs: - type: "ray"

to the name of the plugin?

fg91 · 2024-11-05T18:33:50Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+  o, err := e.plugin.BuildResource(ctx, k8sTaskCtx)
+	if err != nil {
+		return pluginsCore.UnknownTransition, err
+	}


Would be nice to not have Kueue specific code here but a general interface, see this comment.

fg91 · 2024-11-05T18:34:39Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+}
+```
+When batchscheduler in flyte is yunikorn, some examples are like following.
+For example, this appoarch submits a Ray job owned by user1 in org1 to "root.org1.ray".


Where does flytepropeller know the user from? Or does the user not matter as the label "root.org1.ray" suggests?

fg91 · 2024-11-05T18:37:17Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+| Spark | v | x |
+| Ray | v | v |
+| Kubeflow | x | v |
+


From what I understand, one only needs to add labels/annotations on the worker pods. Can't we do this purely from flyte by modifying the pod template spec of the respective CRD? What do the operators have to do in addition to that?

Yes, current progress fetch the pod templates from CRDs and patch label on them.
If operators implement the mechanism to patch group label for their CRD to support gang scheduling in the future, we can start to remove the code for generating group labels to reduce the maintaining overhead.

fg91 · 2024-11-05T18:38:21Z

rfc/system/5575-supporting-yunikorn-and-kueue.md

+
+Yunikorn and Kueue support gang scheduling allowing all necassary pods to run sumultaneously when required resource are available.
+Yunikorn provides preemption calculating the priority of applications based on thier priority class and priority score of the queue where they are submitted, in order to trigger high-prioirty or emergency application immediately. 
+Yunikorn's hierarchical queue includes grarateed resources settings and ACLs.


Nit: Could you please run the doc through a spelling checker? Thank you 🙇

yes, i ran the make spellcheck in the latest commit :)

Signed-off-by: yuteng <[email protected]>

codecov · 2024-11-07T19:13:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 33.15%. Comparing base (bdaf79f) to head (87ee532).
Report is 81 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5915      +/-   ##
==========================================
- Coverage   36.71%   33.15%   -3.57%     
==========================================
  Files        1304     1013     -291     
  Lines      130081   107571   -22510     
==========================================
- Hits        47764    35667   -12097     
+ Misses      78147    68744    -9403     
+ Partials     4170     3160    -1010

Flag	Coverage Δ
unittests-datacatalog	`51.58% <ø> (ø)`
unittests-flyteadmin	`?`
unittests-flytecopilot	`22.23% <ø> (+10.50%)`	⬆️
unittests-flytectl	`62.39% <ø> (-0.02%)`	⬇️
unittests-flyteidl	`6.95% <ø> (+0.06%)`	⬆️
unittests-flyteplugins	`53.83% <ø> (+0.20%)`	⬆️
unittests-flytepropeller	`43.10% <ø> (+0.26%)`	⬆️
unittests-flytestdlib	`55.31% <ø> (+0.53%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests

Signed-off-by: yuteng <[email protected]>

draft of idea

6bd5fb6

Signed-off-by: yuteng <[email protected]>

0yukali0 changed the title ~~draft of idea~~ [Newbie] Supporting Yunikorn and Kueue Oct 24, 2024

0yukali0 added 7 commits October 28, 2024 07:25

Mapping between kueue crd and heirachy queue

7641f32

Signed-off-by: yuteng <[email protected]>

update potential impact

02e35df

Signed-off-by: yuteng <[email protected]>

yunikorn

590d64a

Signed-off-by: yuteng <[email protected]>

update

20d5672

Signed-off-by: yuteng <[email protected]>

update logic flow

bb8a0ec

Signed-off-by: yuteng <[email protected]>

update

2c12357

Signed-off-by: yuteng <[email protected]>

fin

9a732c7

Signed-off-by: yuteng <[email protected]>

0yukali0 marked this pull request as ready for review November 1, 2024 15:35

0yukali0 added 2 commits November 1, 2024 23:45

update

ae9f6b9

Signed-off-by: yuteng <[email protected]>

update

9ed6b05

Signed-off-by: yuteng <[email protected]>

Discard priority class

552a800

Signed-off-by: yuteng <[email protected]>

fg91 reviewed Nov 5, 2024

View reviewed changes

org -> organization

b71b749

Signed-off-by: yuteng <[email protected]>

0yukali0 added 2 commits November 6, 2024 23:49

Use same interface and move to the AddObjectMeta

3fc572f

Signed-off-by: yuteng <[email protected]>

Set scheduler name when launchResource

e64beb0

Signed-off-by: yuteng <[email protected]>

yuteng added 4 commits November 11, 2024 06:51

update and pass spellcheck

539eff6

Signed-off-by: yuteng <[email protected]>

describe the progress when a new job comes

9d511e3

Signed-off-by: yuteng <[email protected]>

update

d6a420a

Signed-off-by: yuteng <[email protected]>

update

e9c401a

Signed-off-by: yuteng <[email protected]>

0yukali0 marked this pull request as draft November 11, 2024 22:53

The k8s plugin implements k8s.plugin and schedulePlugins

4387d58

Signed-off-by: yuteng <[email protected]>

0yukali0 marked this pull request as ready for review November 12, 2024 14:33

update

87ee532

Signed-off-by: yuteng <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Newbie] Supporting Yunikorn and Kueue #5915

[Newbie] Supporting Yunikorn and Kueue #5915

0yukali0 commented Oct 24, 2024

kumare3 commented Nov 2, 2024

fg91 Nov 5, 2024

fg91 Nov 5, 2024

fg91 Nov 5, 2024

0yukali0 Nov 6, 2024

fg91 Nov 5, 2024

fg91 Nov 5, 2024 •

edited

Loading

0yukali0 Nov 13, 2024

fg91 Nov 5, 2024

fg91 Nov 5, 2024

fg91 Nov 5, 2024

fg91 Nov 5, 2024

fg91 Nov 5, 2024

0yukali0 Nov 13, 2024

fg91 Nov 5, 2024

0yukali0 Nov 6, 2024 •

edited

Loading

fg91 Nov 5, 2024 •

edited

Loading

fg91 Nov 5, 2024 •

edited

Loading

fg91 Nov 5, 2024

fg91 Nov 5, 2024 •

edited

Loading

fg91 Nov 5, 2024

0yukali0 Nov 13, 2024

fg91 Nov 5, 2024 •

edited

Loading

0yukali0 Nov 13, 2024

codecov bot commented Nov 7, 2024 •

edited

Loading


		## 1 Executive Summary

		Providing kubernetes (k8s) resource management, gang scheduling and preemption for flyte applications by third-party software, including Apache Yunikorn and Kueue.

		A tenant can submit organization-specific tasks to queues such as org.ray, org.spark and org.default to track which job types are submittable.


		A SchedulerConfigManager maintains config from mentioned yaml.

[Newbie] Supporting Yunikorn and Kueue #5915

Are you sure you want to change the base?

[Newbie] Supporting Yunikorn and Kueue #5915

Conversation

0yukali0 commented Oct 24, 2024

Tracking issue

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

Setup process

Screenshots

Check all the applicable boxes

Related PRs

Docs link

kumare3 commented Nov 2, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fg91 Nov 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

0yukali0 Nov 6, 2024 • edited Loading

Choose a reason for hiding this comment

fg91 Nov 5, 2024 • edited Loading

Choose a reason for hiding this comment

fg91 Nov 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fg91 Nov 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fg91 Nov 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Nov 7, 2024 • edited Loading

Codecov Report

fg91 Nov 5, 2024 •

edited

Loading

0yukali0 Nov 6, 2024 •

edited

Loading

fg91 Nov 5, 2024 •

edited

Loading

fg91 Nov 5, 2024 •

edited

Loading

fg91 Nov 5, 2024 •

edited

Loading

fg91 Nov 5, 2024 •

edited

Loading

codecov bot commented Nov 7, 2024 •

edited

Loading