scheduling-rules #909
Merged · 1 commit · Aug 5, 2024
4 changes: 2 additions & 2 deletions docs/Researcher/scheduling/the-runai-scheduler.md
@@ -104,7 +104,7 @@ The Run:ai scheduler wakes up periodically to perform allocation tasks on pending
A *Node Pool* is a set of nodes grouped by an Administrator into a distinct group of resources from which resources can be allocated to Projects and Departments.
By default, any node pool created in the system is automatically associated with all Projects and Departments using zero quota resource (GPUs, CPUs, Memory) allocation. This allows any Project and Department to use any node pool with Over-Quota (for Preemptible workloads), thus maximizing the system resource utilization.

- * An Administrator can allocate resources from a specific node pool to chosen Projects and Departments. See [Project Setup](../../admin/admin-ui-setup/project-setup.md#limit-jobs-to-run-on-specific-node-groups)
+ * An Administrator can allocate resources from a specific node pool to chosen Projects and Departments. See [Project Scheduling Rules](../../admin/aiinitiatives/org/scheduling-rules.md)
* The Researcher can use node pools in two ways. The first one is where a Project has guaranteed resources on node pools - The Researcher can then submit a workload and specify a single node pool or a prioritized list of node pools to use and receive guaranteed resources.
The second is by using node-pool(s) with no guaranteed resource for that Project (zero allocated resources), and in practice using Over-Quota resources of node-pools. This means a Workload must be Preemptible as it uses resources out of the Project or node pool quota. The same scenario occurs if a Researcher uses more resources than allocated to a specific node pool and goes Over-Quota.
* By default, if a Researcher doesn't specify a node-pool to use by a workload, the scheduler assigns the workload to run using the Project's 'Default node-pool list'.
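
As a concrete illustration of the prioritized node-pool list described above — a minimal sketch only, where the project name, pool names, and the `--node-pools` flag syntax are all assumptions for illustration:

```shell
# Submit a training workload that should run on pool-a first, falling
# back to pool-b. Project, pool names, and the flag are assumptions;
# a workload using a pool's over-quota resources must be preemptible.
runai submit train1 --project team-a --gpu 1 \
  --node-pools "pool-a pool-b"
```
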
@@ -113,7 +113,7 @@ The second is by using node-pool(s) with no guaranteed resource for that Project

Both the Administrator and the Researcher can provide limitations as to which nodes can be selected for the Job. Limits are managed via [Kubernetes labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/){target=_blank}:

- * The Administrator can set limits at the Project level. Example: Project `team-a` can only run `interactive` Jobs on machines with a label of `v-100` or `a-100`. See [Project Setup](../../admin/admin-ui-setup/project-setup.md#limit-jobs-to-run-on-specific-node-groups) for more information.
+ * The Administrator can set limits at the Project level. Example: Project `team-a` can only run `interactive` Jobs on machines with a label of `v-100` or `a-100`. See [Project Scheduling Rules](../../admin/aiinitiatives/org/scheduling-rules.md) for more information.
* The Researcher can set a limit at the Job level, by using the command-line interface flag `--node-type`. The flag acts as a subset to the Project setting.
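
For example — a hedged sketch, assuming nodes are grouped with a `run.ai/type` label (the label key and node name here are assumptions; see the Group Nodes page for the exact procedure):

```shell
# Label a node so it can be targeted as a node type (label key is an assumption).
kubectl label node node-1 run.ai/type=v-100

# Submit an interactive job restricted to that node type using the
# documented --node-type flag (job and project names are placeholders).
runai submit build1 --project team-a --gpu 1 --interactive \
  --node-type v-100
```
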

Node affinity constraints are used during the *Allocation* phase to filter out candidate nodes for running the Job. For more information on how nodes are filtered see the `Filtering` section under [Node selection in kube-scheduler](https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/#kube-scheduler-implementation){target=_blank}. The Run:ai scheduler works similarly.
2 changes: 1 addition & 1 deletion docs/Researcher/user-interface/workspaces/overview.md
@@ -30,7 +30,7 @@ When the workspace is active it exposes the connections to the tools (for example
![](img/2-connecting-to-tools.png)


- An active workspace is a Run:ai [interactive workload](../../../admin/workloads/workload-overview-admin.md). The interactive workload starts when the workspace is started and stopped when the workspace is stopped.
+ An active workspace is a Run:ai [interactive workload](../../../admin/workloads/submitting-workloads.md). The interactive workload starts when the workspace is started and stops when the workspace is stopped.


Workspaces can be used via the user interface or programmatically via the Run:ai [Admin API](../../../developer/admin-rest-api/overview.md). Workspaces are not supported via the command line interface. You can still run an interactive workload via the command line.
2 changes: 1 addition & 1 deletion docs/Researcher/user-interface/workspaces/statuses.md
@@ -25,7 +25,7 @@ The *Initializing* status indicates that the workspace has been scheduled and is
The *Active* status indicates that the workspace is ready to be used and allows the researcher to connect to its tools. At this status, the workspace is consuming resources and affecting the project’s quota. The workspace will turn to active status once the `Active` button is pressed, the activation process ends up successfully and relevant resources are available and vacant.

## Stopped workspace
- The *Stopped* status indicates that the workspace is currently unused and does not consume any resources. A workspace can be stopped either manually, or automatically if triggered by idleness criteria set by the admin (see [Limit duration of interactive Jobs](../../../admin/admin-ui-setup/project-setup.md#limit-duration-of-interactive-and-training-jobs)).
+ The *Stopped* status indicates that the workspace is currently unused and does not consume any resources. A workspace can be stopped either manually, or automatically if triggered by idleness criteria set by the admin (see [Limit duration of interactive Jobs](../../../admin/aiinitiatives/org/scheduling-rules.md)).

## Failed workspace

48 changes: 48 additions & 0 deletions docs/admin/aiinitiatives/org/scheduling-rules.md
@@ -0,0 +1,48 @@
This article explains how to configure and manage scheduling rules. Scheduling rules are restrictions applied to workloads: they constrain either the resources (nodes) on which a workload can run or the duration of its run time. Scheduling rules are set per Project and apply to a specific workload type. Once scheduling rules are set for a project, every matching workload submitted to that project is subject to the restrictions in effect at submission time. New scheduling rules added to a project are not applied retroactively to workloads already created in that project.

There are three types of rules:

* **Workload time limit** - This rule limits the duration of a workload's run time. Run time is calculated as the total time the workload has spent in the “Running“ status.
* **Idle GPU time limit** - This rule limits the total time a workload's GPUs may remain idle. Idle time is counted from the first time the workload reached the “Running“ status with an idle GPU.
For fractional workloads, workloads running on a MIG slice, and multi-GPU or multi-node workloads, each GPU idle second is calculated as follows: __<requires explanation about how it is calculated>__
* **Node type (Affinity)** - This rule limits a workload to run on specific node types. A node type is a node affinity applied to the node: Run:ai labels the nodes with the appropriate affinity and indicates to the scheduler where the workload is allowed to be scheduled.
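
To make the three rule types concrete, the sketch below shows how they might be expressed in a project payload. Every field name here is an assumption for illustration only; the Projects API reference linked in the Using the API section below is authoritative:

```shell
# Hedged sketch of a scheduling-rules payload. All field names are
# hypothetical placeholders, not the real schema.
cat > scheduling-rules.json <<'EOF'
{
  "schedulingRules": {
    "trainingJobTimeLimitSeconds": 86400,
    "interactiveJobMaxIdleDurationSeconds": 3600,
    "nodeAffinity": {
      "train": { "selectedTypes": ["a-100", "v-100"] }
    }
  }
}
EOF
```
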

## Adding a scheduling rule to a project

To add a scheduling rule:

1. Select the project you want to add a scheduling rule for
2. Click **EDIT**
3. In the **Scheduling rules** section click **\+RULE**
4. Select the **rule type**
5. Select the **workload type** and **time limitation period**
6. For Node type, choose one or more labels for the desired nodes.
7. Click **SAVE**

!!! Note
    You can review the defined rules in the Projects table in the relevant column.

## Editing the project’s scheduling rule

To edit a scheduling rule:

1. Select the project whose scheduling rule you want to edit
2. Click **EDIT**
3. Find the scheduling rule you would like to edit
4. Edit the rule
5. Click **SAVE**

## Deleting the project’s scheduling rule

To delete a scheduling rule:

1. Select the project you want to delete a scheduling rule from
2. Click **EDIT**
3. Find the scheduling rule you would like to delete
4. Click the x icon
5. Click **SAVE**

## Using the API

Go to the [Projects](https://app.run.ai/api/docs#tag/Projects/operation/create_project) API reference to view the available actions.
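
As a hedged sketch of applying rules programmatically, reusing the payload file from the sketch in the rule-types section — the endpoint path, project ID, and token handling are assumptions, and the API reference above is authoritative:

```shell
# Update a project's scheduling rules. URL, project ID, and token are
# placeholders; consult the Projects API reference for the real endpoint.
curl -X PUT "https://<company>.run.ai/api/v1/org-unit/projects/<project-id>" \
  -H "Authorization: Bearer $RUNAI_TOKEN" \
  -H "Content-Type: application/json" \
  -d @scheduling-rules.json
```
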

2 changes: 1 addition & 1 deletion docs/home/whats-new-2-13.md
@@ -88,7 +88,7 @@ The association between workspaces and node pools is done using *Compute resources*

**Time limit duration**

- * Improved the behavior of any workload time limit (for example, *Idle time limit*) so that the time limit will affect existing workloads that were created before the time limit was configured. This is an optional feature which provides help in handling situations where researchers leave sessions open even when they do not need to access the resources. For more information, see [Limit duration of interactive training jobs](../admin/admin-ui-setup/project-setup.md#limit-duration-of-interactive-and-training-jobs).
+ * Improved the behavior of any workload time limit (for example, *Idle time limit*) so that the time limit will affect existing workloads that were created before the time limit was configured. This is an optional feature which provides help in handling situations where researchers leave sessions open even when they do not need to access the resources. For more information, see [Limit duration of interactive training jobs](#).

* Improved workspaces time limits. Workspaces that reach a time limit will now transition to a state of `stopped` so that they can be reactivated later.

2 changes: 1 addition & 1 deletion docs/home/whats-new-2-15.md
@@ -34,7 +34,7 @@ date: 2023-Dec-3
* Improved filters and search
* More information

- Use the toggle at the top of the *Jobs* page to switch to the *Workloads* view. For more information, see [Workloads](../admin/workloads/workload-overview-admin.md#workloads-view).
+ Use the toggle at the top of the *Jobs* page to switch to the *Workloads* view.

* <!-- RUN-10639/RUN-11389 - Researcher Service Refactoring RUN-12505/RUN-12506 - Support Kubeflow notebooks for scheduling/orchestration -->Improved support for Kubeflow Notebooks. Run:ai now supports the scheduling of Kubeflow notebooks with fractional GPUs. Kubeflow notebooks are identified automatically and appear with a dedicated icon in the *Jobs* UI.
* <!-- RUN-11292/RUN-11592 General changes in favor of any asset based workload \(WS, training, DT\)-->Improved the *Trainings* and *Workspaces* forms. Now the runtime field for *Command* and *Arguments* can be edited directly in the new *Workspace* or *Training* creation form.
2 changes: 1 addition & 1 deletion graveyard/whats-new-2022.md
@@ -21,7 +21,7 @@
The command-line interface utility for version 2.3 is not compatible with a cluster version of 2.5 or later. If you upgrade the cluster, you must also upgrade the command-line interface.
* __Inference__. Run:ai inference offering has been overhauled with the ability to submit deployments via the user interface and a new and consistent API. For more information see [Inference overview](../admin/workloads/inference-overview.md). To enable the new inference module, call Run:ai customer support.
* __CPU and CPU memory quotas__ can now be configured for projects and departments. These are hard quotas which means that the total amount of the requested resource for all workloads associated with a project/department cannot exceed the set limit. To enable this feature please call Run:ai customer support.
- * __Workloads__. We have revamped the way Run:ai submits Jobs. Run:ai now submits [Workloads](../admin/workloads/workload-overview-admin.md). The change includes:
+ * __Workloads__. We have revamped the way Run:ai submits Jobs. Run:ai now submits [Workloads](../admin/workloads/submitting-workloads.md). The change includes:
* New [Cluster API](../developer/cluster-api/workload-overview-dev.md). The older [API](../developer/deprecated/researcher-rest-api/overview.md) has been deprecated and remains for backward compatibility. The API creates all the resources required for the run, including volumes, services, and the like. It also deletes all resources when the workload itself is deleted.
* Administrative templates have been replaced with [Policies](../admin/workloads/policies.md). Policies apply across all ways to submit jobs: command-line, API, and user interface.
* `runai delete` has been changed in favor of `runai delete job`
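
For example, under the new syntax (the job name is a placeholder):

```shell
# Previously: runai delete my-job
runai delete job my-job
```
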
21 changes: 11 additions & 10 deletions mkdocs.yml
@@ -202,21 +202,12 @@ nav:
- 'Setup cluster wide PVC' : 'admin/researcher-setup/cluster-wide-pvc.md'
- 'Group Nodes' : 'admin/researcher-setup/limit-to-node-group.md'
# - 'Messaging setup' : 'admin/researcher-setup/email-messaging.md'
-  - 'Workloads' :
-    - 'admin/workloads/README.md'
-    - 'Policies' :
-      - 'admin/workloads/policies/README.md'
-      - 'Former Policies' : 'admin/workloads/policies/policies.md'
-      - 'Training Policy' : 'admin/workloads/policies/training-policy.md'
-      - 'Workspaces Policy' : 'admin/workloads/policies/workspaces-policy.md'
-    - 'Secrets' : 'admin/workloads/secrets.md'
-    - 'Inference' : 'admin/workloads/inference-overview.md'
-    - 'Submitting Workloads' : 'admin/workloads/submitting-workloads.md'
  - 'Managing AI Initiatives' :
    - 'Overview' : 'admin/aiinitiatives/overview.md'
    - 'Managing your Organization' :
      - 'Projects' : 'admin/aiinitiatives/org/projects.md'
      - 'Departments' : 'admin/aiinitiatives/org/departments.md'
+      - 'Scheduling Rules' : 'admin/aiinitiatives/org/scheduling-rules.md'
# - 'Managing your resources' :
# - 'Nodes' : 'admin/aiinitiatives/resources/nodes.md'
# - 'Node Pools' : 'admin/aiinitiatives/resources/node-pools.md'
@@ -229,6 +220,16 @@
- 'Jobs' : 'admin/admin-ui-setup/jobs.md'
- 'Credentials' : 'admin/admin-ui-setup/credentials-setup.md'
- 'Templates': 'admin/admin-ui-setup/templates.md'
+  - 'Workloads' :
+    - 'admin/workloads/README.md'
+    - 'Policies' :
+      - 'admin/workloads/policies/README.md'
+      - 'Former Policies' : 'admin/workloads/policies/policies.md'
+      - 'Training Policy' : 'admin/workloads/policies/training-policy.md'
+      - 'Workspaces Policy' : 'admin/workloads/policies/workspaces-policy.md'
+    - 'Secrets' : 'admin/workloads/secrets.md'
+    - 'Inference' : 'admin/workloads/inference-overview.md'
+    - 'Submitting Workloads' : 'admin/workloads/submitting-workloads.md'
- 'Troubleshooting' :
- 'Cluster Health' : 'admin/troubleshooting/cluster-health-check.md'
- 'Troubleshooting' : 'admin/troubleshooting/troubleshooting.md'