
Commit

Merge pull request #887 from run-ai/reference-fixes
Reference fixes
yarongol authored Jul 31, 2024
2 parents e2fc422 + c9ce1f1 commit 457a13b
Showing 8 changed files with 6 additions and 29 deletions.
1 change: 0 additions & 1 deletion docs/Researcher/Walkthroughs/quickstart-overview.md
@@ -7,7 +7,6 @@ Follow the Quickstart documents below to learn more:
* [Interactive build sessions with externalized services](walkthrough-build-ports.md)
* [Using GPU Fractions](walkthrough-fractions.md)
* [Distributed Training](walkthrough-distributed-training.md)
* [Hyperparameter Optimization](walkthrough-hpo.md)
* [Over-Quota, Basic Fairness & Bin Packing](walkthrough-overquota.md)
* [Fairness](walkthrough-queue-fairness.md)
* [Inference](quickstart-inference.md)
7 changes: 0 additions & 7 deletions docs/Researcher/best-practices/env-variables.md
@@ -13,13 +13,6 @@ Run:ai provides the following environment variables:
Note that the Job can be deleted and then recreated with the same name. The Job UUID will be different even if the Job names are the same.


## Identifying a Pod

With [Hyperparameter Optimization](../Walkthroughs/walkthrough-hpo.md), experiments are run as _Pods_ within the Job. Run:ai provides the following environment variables to identify the Pod.

* ``POD_INDEX`` - An index number (0, 1, 2, 3, ...) for a specific Pod within the Job. This is useful for Hyperparameter Optimization, allowing easy mapping to individual experiments. The Pod index remains the same if the Pod is restarted (due to a failure or preemption), so the Researcher can use it to identify experiments.
* ``POD_UUID`` - A unique identifier for the Pod. If the Pod is restarted, the Pod UUID will change.
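As an illustrative sketch (the variable names come from the bullets above; the hyperparameter mapping itself is a made-up example, not part of Run:ai), a training script could use ``POD_INDEX`` to pick its experiment:

```python
import os

# POD_INDEX is injected by Run:ai into each Pod of the Job.
# Defaulting to 0 when unset is an assumption for running outside Run:ai.
pod_index = int(os.environ.get("POD_INDEX", "0"))

# Hypothetical example: map the pod index to one hyperparameter value.
learning_rates = [0.1, 0.01, 0.001, 0.0001]
lr = learning_rates[pod_index % len(learning_rates)]
print(f"Pod {pod_index} trains with learning rate {lr}")
```

Because the index survives restarts, a preempted Pod resumes with the same hyperparameter assignment.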

## GPU Allocation

Run:ai provides an environment variable, visible inside the container, to help identify the number of GPUs allocated for the container. Use `RUNAI_NUM_OF_GPUS`.
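A minimal sketch of reading this variable inside the container (the fallback value and fractional parsing are assumptions for illustration, not documented behavior):

```python
import os

# RUNAI_NUM_OF_GPUS is set by Run:ai inside the container.
# Parsing as float is an assumption to tolerate fractional GPU
# allocations; defaulting to 0 covers running outside Run:ai.
num_gpus = float(os.environ.get("RUNAI_NUM_OF_GPUS", "0"))
print(f"GPUs allocated to this container: {num_gpus}")
```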
8 changes: 0 additions & 8 deletions docs/Researcher/cli-reference/runai-submit.md
@@ -50,14 +50,6 @@ runai submit --name frac05 -i gcr.io/run-ai-demo/quickstart -g 0.5

(see: [GPU fractions Quickstart](../Walkthroughs/walkthrough-fractions.md)).

Hyperparameter Optimization

```console
runai submit --name hpo1 -i gcr.io/run-ai-demo/quickstart-hpo -g 1 \
--parallelism 3 --completions 12 -v /nfs/john/hpo:/hpo
```

(see: [hyperparameter optimization Quickstart](../Walkthroughs/walkthrough-hpo.md)).

Submit a Job without a name (automatically generates a name)

2 changes: 0 additions & 2 deletions docs/Researcher/scheduling/the-runai-scheduler.md
@@ -226,5 +226,3 @@ To search for good hyperparameters, Researchers typically start a series of smal

With HPO, the Researcher provides a single script that is run with multiple, varying parameters. Each run is a *pod* (see definition above). Unlike Gang Scheduling, with HPO, pods are **independent**: they are scheduled, started, and ended independently, and if one is preempted, the other pods are unaffected. The scheduling behavior for individual pods is exactly as described in the Scheduler Details section above for Jobs.
If node pools are enabled and the HPO workload has been assigned more than one node pool, the different pods might end up running on different node pools.

For more information on Hyperparameter Optimization in Run:ai see [here](../Walkthroughs/walkthrough-hpo.md)
2 changes: 1 addition & 1 deletion docs/admin/troubleshooting/cluster-health-check.md
@@ -186,7 +186,7 @@ kubectl get cm runai-public -oyaml

### Resources not deployed / System unavailable / Reconciliation failed

1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
1. Run the [Preinstall diagnostic script](../runai-setup/cluster-setup/cluster-prerequisites.md#pre-install-script) and check for issues.
2. Run

```
4 changes: 2 additions & 2 deletions docs/admin/workloads/README.md
@@ -121,8 +121,8 @@ To get the full experience of Run:ai’s environment and platform use the follow

* [Workspaces](../../Researcher/user-interface/workspaces/overview.md#getting-familiar-with-workspaces)
* [Trainings](../../Researcher/user-interface/trainings.md#trainings) (Only available when using the *Jobs* view)
* [Distributed trainings](../../Researcher/user-interface/trainings.md#trainings)
* [Deployment](../admin-ui-setup/deployments.md#viewing-and-submitting-deployments)
* [Distributed training](../../Researcher/user-interface/trainings.md#trainings)
* Deployments.

## Workload-related Integrations

7 changes: 3 additions & 4 deletions docs/admin/workloads/inference-overview.md
@@ -30,13 +30,12 @@ Run:ai provides *Inference* services as an equal part together with the other tw

* Multiple replicas will appear in Run:ai as a single *Inference* workload. The workload will appear in all Run:ai dashboards and views as well as the Command-line interface.

* Inference workloads can be submitted via Run:ai [user interface](../admin-ui-setup/deployments.md) as well as [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is an end-point to which clients can connect.
* Inference workloads can be submitted via the Run:ai user interface as well as the [Run:ai API](../../developer/cluster-api/workload-overview-dev.md). Internally, spawning an Inference workload also creates a Kubernetes *Service*. The service is an endpoint to which clients can connect.

## Autoscaling

To meet SLAs, *Inference* workloads are typically configured with *autoscaling*. Autoscaling is the ability to add more computing power (Kubernetes pods) when the load increases and shrink allocated resources when the system is idle.

There are a number of ways to trigger autoscaling. Run:ai supports the following:
There are several ways to trigger autoscaling. Run:ai supports the following:

| Metric | Units | Run:ai name |
|-----------------|--------------|-----------------|
@@ -45,7 +44,7 @@ There are a number of ways to trigger autoscaling. Run:ai supports the following

The Minimum and Maximum number of replicas can be configured as part of the autoscaling configuration.

Autoscaling also supports a scale to zero policy with *Throughput* and *Concurrency* metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.
Autoscaling also supports a scale-to-zero policy with *Throughput* and *Concurrency* metrics, meaning that given enough time under the target threshold, the number of replicas will be scaled down to 0.

This has the benefit of conserving resources at the risk of a delay from "cold starting" the model when traffic resumes.
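The replica-count decision can be sketched with illustrative logic (this is not Run:ai's implementation; the function, metric, and thresholds are assumptions made for the example):

```python
import math

def desired_replicas(concurrency: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Illustrative autoscaling rule: provision enough replicas so each
    handles at most `target_per_replica` concurrent requests, clamped to
    the configured bounds. With min_replicas=0 this yields scale-to-zero
    when the workload is idle."""
    needed = math.ceil(concurrency / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(0, 10, 0, 5))   # idle traffic -> scales to zero
print(desired_replicas(35, 10, 0, 5))  # 35 concurrent requests -> 4 replicas
```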

4 changes: 0 additions & 4 deletions mkdocs.yml
@@ -113,9 +113,6 @@ plugins:
'admin/runai-setup/cluster-setup/researcher-authentication.md' : 'admin/runai-setup/authentication/sso.md'
'admin/researcher-setup/cli-troubleshooting.md' : 'admin/troubleshooting/troubleshooting.md'
'developer/deprecated/inference/submit-via-yaml.md' : 'developer/cluster-api/other-resources.md'
'Researcher/researcher-library/rl-hpo-support.md' : 'Researcher/scheduling/hpo.md'
'Researcher/researcher-library/researcher-library-overview.md' : 'Researcher/scheduling/hpo.md'

nav:
- Home:
- 'Overview': 'index.md'
@@ -217,7 +214,6 @@
- 'Dashboard Analysis' : 'admin/admin-ui-setup/dashboard-analysis.md'
- 'Jobs' : 'admin/admin-ui-setup/jobs.md'
- 'Credentials' : 'admin/admin-ui-setup/credentials-setup.md'
- 'Deployments' : 'admin/admin-ui-setup/deployments.md'
- 'Templates': 'admin/admin-ui-setup/templates.md'
- 'Troubleshooting' :
- 'Cluster Health' : 'admin/troubleshooting/cluster-health-check.md'
