Merge pull request #1193 from run-ai/data-sources-219
Merge pull request #1192 from run-ai/data-sources
yarongol authored Oct 28, 2024
2 parents 5e919cc + 1606bfe commit 1530a5a
Showing 17 changed files with 284 additions and 152 deletions.
2 changes: 1 addition & 1 deletion docs/Researcher/scheduling/gpu-memory-swap.md
@@ -115,4 +115,4 @@ If you prefer your workloads not to be swapped into CPU memory, you can specify

CPU memory is limited, and since a single CPU serves multiple GPUs on a node, this number is usually between 2 and 8. For example, when using 80GB of GPU memory, each swapped workload consumes up to 80GB (but may use less), assuming each GPU is shared between 2-4 workloads. In this example, you can see how the swap memory can become very large. Therefore, we give administrators a way to limit the size of the CPU memory reserved for swapped GPU memory on each swap-enabled node.

Limiting the CPU reserved memory means that there may be scenarios where the GPU memory cannot be swapped out to the CPU reserved RAM. Whenever the CPU reserved memory for swapped GPU memory is exhausted, the workloads currently running will not be swapped out to the CPU reserved RAM, instead, *Node Level Scheduler* and *Dynamic Fractions* logic takes over and provides GPU resource optimization.see [Dynamic Fractions](fractions.md#dynamic-mig) and [Node Level Scheduler](node-level-scheduler.md#how-to-configure-node-level-scheduler).
Limiting the CPU reserved memory means that there may be scenarios where the GPU memory cannot be swapped out to the CPU reserved RAM. Whenever the CPU reserved memory for swapped GPU memory is exhausted, the workloads currently running are not swapped out to the CPU reserved RAM; instead, *Node Level Scheduler* logic takes over and provides GPU resource optimization. See [Node Level Scheduler](node-level-scheduler.md#how-to-configure-node-level-scheduler).
4 changes: 2 additions & 2 deletions docs/Researcher/workloads/trainings.md
@@ -55,7 +55,7 @@ To add a training:
5. Enter the *Container path* for the volume target location.
6. Select a *Volume persistency*.

9. (Optional) In the *Data sources* pane, press *add a new data source*. For more information, see [Creating a new data source](../workloads/assets/datasources.md#create-a-new-data-source) When complete press, *Create Data Source*.
9. (Optional) In the *Data sources* pane, press *add a new data source*. For more information, see [Creating a new data source](../workloads/assets/datasources.md#adding-a-new-data-source). When complete, press *Create Data Source*.
10. (Optional) In the *General* pane, add special settings for your training:

1. Press *Auto-deletion* to delete the training automatically when it either completes or fails. You can configure the timeframe in days, hours, minutes, and seconds. If the timeframe is set to 0, the training will be deleted immediately after it completes or fails. (default = 30 days)
@@ -77,7 +77,7 @@ To add a training:
5. Enter the *Container path* for the volume target location.
6. Select a *Volume persistency*.

4. (Optional) In the *Data sources* pane, press *add a new data source*. For more information, see [Creating a new data source](../workloads/assets/datasources.md#create-a-new-data-source) When complete press, *Create Data Source*.
4. (Optional) In the *Data sources* pane, press *add a new data source*. For more information, see [Creating a new data source](../workloads/assets/datasources.md#adding-a-new-data-source). When complete, press *Create Data Source*.
5. (Optional) In the *General* pane, add special settings for your training:

1. Press *Auto-deletion* to delete the training automatically when it either completes or fails. You can configure the timeframe in days, hours, minutes, and seconds. If the timeframe is set to 0, the training will be deleted immediately after it completes or fails. (default = 30 days)
48 changes: 48 additions & 0 deletions docs/admin/config/create-k8s-assets-in-advance.md
@@ -0,0 +1,48 @@
# Creating Kubernetes Assets in Advance

This article describes how to mark Kubernetes assets for use by Run:ai.

## Creating PVCs in advance

Add PVCs in advance to be used when creating a PVC-type data source via the Run:ai UI.

Follow the steps below for each required scope:


### Cluster scope

1. Create the PVC in the Run:ai namespace (`runai`).
2. To authorize Run:ai to use the PVC, label it: `run.ai/cluster-wide: "true"`.
The PVC is now displayed for that scope in the list of existing PVCs.
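For illustration, a minimal sketch of these two steps with `kubectl`; the PVC name `shared-data-pvc` and its manifest are placeholders:

```shell
# 1. Create the PVC in the Run:ai namespace (any valid PVC manifest works)
kubectl apply -n runai -f shared-data-pvc.yaml

# 2. Label the PVC so Run:ai is authorized to use it cluster-wide
kubectl label pvc shared-data-pvc -n runai "run.ai/cluster-wide=true"
```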

### Department scope

1. Create the PVC in the Run:ai namespace (`runai`).
2. To authorize Run:ai to use the PVC, label it: `run.ai/department: "<name of department>"`.
The PVC is now displayed for that scope in the list of existing PVCs.
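The same flow for the department scope, assuming a department named `team-a` (a placeholder):

```shell
kubectl label pvc shared-data-pvc -n runai "run.ai/department=team-a"
```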

### Project scope

1. Create the PVC in the project’s namespace.
The PVC is now displayed for that scope in the list of existing PVCs.

## Creating ConfigMaps in advance

Add ConfigMaps in advance to be used when creating a ConfigMap-type data source via the Run:ai UI.

### Cluster scope

1. Create the ConfigMap in the Run:ai namespace (`runai`).
2. To authorize Run:ai to use the ConfigMap, label it: `run.ai/cluster-wide: "true"`.
The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.
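A minimal sketch, assuming a ConfigMap named `training-config` with one illustrative key (both placeholders):

```shell
# 1. Create the ConfigMap in the Run:ai namespace
kubectl create configmap training-config -n runai --from-literal=MODE=prod

# 2. Label the ConfigMap so Run:ai is authorized to use it cluster-wide
kubectl label configmap training-config -n runai "run.ai/cluster-wide=true"
```

For the department and project scopes below, only the label and namespace change, mirroring the PVC flow above.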

### Department scope

1. Create the ConfigMap in the Run:ai namespace (`runai`).
2. To authorize Run:ai to use the ConfigMap, label it: `run.ai/department: "<name of department>"`.
The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.

### Project scope

1. Create the ConfigMap in the project’s namespace.
The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.
2 changes: 1 addition & 1 deletion docs/admin/config/large-clusters.md
@@ -112,4 +112,4 @@ queueConfig:

This [article](https://last9.io/blog/how-to-scale-prometheus-remote-write/){target=_blank} provides additional details and insight.

Also, note that this configuration enlarges the Prometheus queues and thus increases the required memory. It is hence suggested to reduce the metrics retention period as described [here](../runai-setup/cluster-setup/customize-cluster-install.md#configurations)
Also, note that this configuration enlarges the Prometheus queues and thus increases the required memory. It is therefore recommended to reduce the metrics retention period as described [here](./advanced-cluster-config.md).
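For orientation, the equivalent knobs in plain Prometheus `remote_write` terms; the values below are illustrative only, not recommended settings, and larger queues consume more memory:

```yaml
remote_write:
  - url: https://metrics.example.com/api/v1/write  # placeholder endpoint
    queue_config:
      capacity: 10000             # samples buffered per shard; scales memory use
      max_shards: 100             # upper bound on parallel senders
      max_samples_per_send: 2000  # larger batches reduce request overhead
      batch_send_deadline: 5s     # flush a partially full batch after this delay
```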
2 changes: 1 addition & 1 deletion docs/admin/config/node-roles.md
@@ -31,7 +31,7 @@ runai-adm remove node-role --runai-system-worker <node-name>


!!! Important
To enable this feature, you must set the cluster configuration flag `global.nodeAffinity.restrictScheduling` to `true`. For more information see [customize cluster](../runai-setup/cluster-setup/customize-cluster-install.md#configurations).
    To enable this feature, you must set the cluster configuration flag `global.nodeAffinity.restrictScheduling` to `true`, as sketched below. For more information, see [customize cluster](./advanced-cluster-config.md).
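One way to apply the flag, sketched under the assumption of a Helm-based cluster installation with release name `runai-cluster` (adjust the release and chart names to your installation):

```shell
helm upgrade runai-cluster runai/runai-cluster -n runai \
  --reuse-values \
  --set global.nodeAffinity.restrictScheduling=true
```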

Separate nodes into those that:

2 changes: 1 addition & 1 deletion docs/admin/config/shared-storage.md
@@ -18,7 +18,7 @@ Run:ai [Data Sources](../../platform-admin/workloads/assets/datasources.md) supp

Storage classes in Kubernetes define how storage is provisioned and managed. This allows you to select storage types optimized for AI workloads. For example, you can choose storage with high IOPS (Input/Output Operations Per Second) for rapid data access during intensive training sessions, or tiered storage options to balance cost and performance based on your organization’s requirements. This approach supports dynamic provisioning, enabling storage to be allocated on demand as required by your applications.
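As a concrete sketch, here is a `StorageClass` a cluster administrator might define for high-throughput training data; the provisioner and parameters are placeholders for your storage vendor's CSI driver:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-training-data
provisioner: csi.example.com              # placeholder: your vendor's CSI driver
parameters:
  tier: high-iops                         # placeholder vendor-specific parameter
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer   # bind when a pod first uses the volume
allowVolumeExpansion: true
```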

Run:ai data sources such as [Persistent Volume Claims (PVC)](../../platform-admin/workloads/assets/existing-PVC.md) and [Data Volumes](../../platform-admin/workloads/assets/data-volumes.md) leverage storage class to manage and allocate storage efficiently. This ensures that the most suitable storage option is always accessible, contributing to the efficiency and performance of AI workloads.
Run:ai data sources such as [Persistent Volume Claims (PVC)](../../platform-admin/workloads/assets/datasources.md#pvc) and [Data Volumes](../../platform-admin/workloads/assets/data-volumes.md) leverage storage classes to manage and allocate storage efficiently. This ensures that the most suitable storage option is always accessible, contributing to the efficiency and performance of AI workloads.

!!! Note
    Run:ai lists all available storage classes in the Kubernetes cluster, making it easy for users to select the appropriate storage. Additionally, policies can be set to restrict or enforce the use of specific storage classes, to help maintain compliance with organizational standards and optimize resource utilization.
2 changes: 1 addition & 1 deletion docs/home/whats-new-2-13.md
@@ -105,7 +105,7 @@ The association between workspaces and node pools is done using *Compute resourc

**PVC data sources**
<!-- RUN-9826/10186 Support PVC from block storage -->
* Added support for PVC block storage in the *New data source* form. In the *New data source* form for a new PVC data source, in the *Volume mode* field, select from *Filesystem* or *Block*. For more information, see [Create a PVC data source](../Researcher/workloads/assets/datasources.md#create-a-pvc-data-source).
* Added support for PVC block storage in the *New data source* form. When creating a new PVC data source, in the *Volume mode* field, select either *Filesystem* or *Block*. For more information, see [Create a PVC data source](../Researcher/workloads/assets/datasources.md#pvc).

**Credentials**

2 changes: 1 addition & 1 deletion docs/home/whats-new-2-17.md
@@ -41,7 +41,7 @@ date: 2024-Apr-14

#### Assets

* <!-- RUN14616/RUN-14759/RUN-14758/RUN14761/RUN-14772/RUN-14773 - Add configmap as data source, control by policy, CLI -->Added the capability to use a ConfigMap as a data source. The ability to use a ConfigMap as a data source can be configured in the *Data sources* UI, the CLI, and as part of a policy. For more information, see [Setup a ConfigMap as a data source](../Researcher/workloads/assets/datasources.md#create-a-configmap-data-source), [Setup a ConfigMap as a volume using the CLI](../Researcher/cli-reference/runai-submit.md#-configmap-volume-namepath).
* <!-- RUN14616/RUN-14759/RUN-14758/RUN14761/RUN-14772/RUN-14773 - Add configmap as data source, control by policy, CLI -->Added the capability to use a ConfigMap as a data source. This can be configured in the *Data sources* UI, the CLI, and as part of a policy. For more information, see [Setup a ConfigMap as a data source](../Researcher/workloads/assets/datasources.md#configmap) and [Setup a ConfigMap as a volume using the CLI](../Researcher/cli-reference/runai-submit.md#-configmap-volume-namepath).

* <!-- RUN-16242/RUN-16243/RUN-14596/RUN-14742/RUN-14577/RUN-14743/RUN-16427/RUN-16428 PVC status Add status table for credentials, ConfigMap-DS, PVC-ds -->Added a *Status* column to the *Credentials* table, and the *Data sources* table. The *Status* column displays the state of the resource and provides troubleshooting information about that asset. For more information, see the [Credentials table](../platform-admin/workloads/assets/credentials.md#credentials-table) and the [Data sources table](../Researcher/workloads/assets/datasources.md#data-sources-table).

2 changes: 1 addition & 1 deletion docs/home/whats-new-2-18.md
@@ -77,7 +77,7 @@ date: 2024-June-14

For more information, see [Data Volumes](../platform-admin/workloads/assets/data-volumes.md). (Requires minimum cluster version v2.18).

* <!-- TODO fix doc link RUN-16917/RUN-19363 Expose secrets in workload submission -->Added new data source of type *Secret*. Run:ai now allows you to configure a *Credential* as a data source. A *Data source* of type *Secret* is best used in workloads so that access to 3rd party interfaces and storage used in containers, keep access credentials hidden. For more information, see [Secrets as a data source](../Researcher/workloads/assets/datasources.md#create-a-secret-as-data-source).
* <!-- TODO fix doc link RUN-16917/RUN-19363 Expose secrets in workload submission -->Added a new data source of type *Secret*. Run:ai now allows you to configure a *Credential* as a data source. A *Data source* of type *Secret* is best used in workloads that access third-party interfaces and storage from containers, keeping access credentials hidden. For more information, see [Secrets as a data source](../Researcher/workloads/assets/datasources.md#secret).

* Updated the logic of the data source initializing state, which keeps the workload in “initializing” status until S3 data is fully mapped. For more information, see the [Sidecar containers documentation](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/).

4 changes: 2 additions & 2 deletions docs/platform-admin/workloads/assets/credentials.md
@@ -165,13 +165,13 @@ You can use credentials (secrets) in various ways within the system

### Access private data sources

To access private data sources, attach credentials to data sources of the following types: [Git](./datasources.md#create-a-git-data-source), [S3 Bucket](./datasources.md#create-an-s3-data-source)
To access private data sources, attach credentials to data sources of the following types: [Git](./datasources.md#git) and [S3 Bucket](./datasources.md#s3-bucket).

### Use directly within the container

To use the secret directly from within the container, you can choose between the following options:

1. Get the secret mounted to the file system by using the [Generic secret](./datasources.md#create-a-secret-as-data-source) data source
1. Get the secret mounted to the file system by using the [Generic secret](./datasources.md#secret) data source
2. Get the secret as an environment variable injected into the container. There are two equivalent ways to inject the environment variable (see the sketch after this list):
a. By adding it to the Environment asset.
b. By adding it ad-hoc as part of the workload.
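In plain Kubernetes terms, both options resolve to an environment variable sourced from the secret. Below is a sketch with placeholder names, shown as an assumption about the resulting pod spec rather than Run:ai's literal output:

```yaml
env:
  - name: S3_ACCESS_KEY             # placeholder variable name
    valueFrom:
      secretKeyRef:
        name: my-credential-secret  # placeholder secret backing the credential
        key: accessKey              # placeholder key within the secret
```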