more detail in README.md
ssiegel95 committed Dec 9, 2024
1 parent 1257b88 commit 59fee29
Showing 1 changed file, services/finetuning/README.md, with 112 additions and 29 deletions.
… a finetuning job on a kubernetes-based system.

## Prerequisites:

- GNU make
- git
- git-lfs (available in many system package managers such as apt, dnf, and brew)
- python >=3.10, <3.13
- poetry (`pip install poetry`)
- zsh or bash
- docker or podman (to run examples, we have not tested well with podman)
- kubectl for deploying a local test cluster
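
The prerequisites above can be sanity-checked with a short loop. This is only a convenience sketch; the tool names (for example `python3` vs `python`) may differ on your platform:

```shell
# report which of the required tools are on PATH
for tool in make git git-lfs python3 poetry kubectl docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool"
  else
    echo "missing: $tool"
  fi
done
```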

## Installation

```zsh
pip install poetry && poetry install --with dev
```
The following command runs the basic unit tests. Run them and confirm they pass before
proceeding to the kubernetes-based tests and examples.


```zsh
make test_local
```

### Building an image

You must have either docker or podman installed on your system for this to
work, as well as permissions to build images. We assume you have a working `docker` command, which can be docker itself
or `podman` aliased as `docker` (for example via the podman-docker package, which does this for you).

```zsh
make image
```

Note that by default we build an image **without** GPU support. This makes the default image much smaller
than a fully nvidia-enabled image. GPU enablement is coming soon and will be available via an environment
prefix to the `make image` command.

After a successful build you should have a local image named
`tsfmfinetuning:latest`

```zsh
docker images

REPOSITORY       TAG
tsfmfinetuning   latest
```

For this example we'll use [kind](https://kind.sigs.k8s.io/docs/user/quick-start/),
a lightweight way of running a local kubernetes cluster using docker. We will
use the kubeflow training operator's custom resource to start
and monitor an asynchronous finetuning job.

### Create a local cluster

First:

- [Install kubectl](https://kubernetes.io/docs/tasks/tools/)
- [Install helm](https://helm.sh/docs/intro/install/)
- If you are using podman, you will need to enable the use of an insecure (using http instead of https)
local container registry by creating a file called `/etc/containers/registries.conf.d/localhost.conf`
with the following content:

```
[[registry]]
location = "localhost:5001"
insecure = true
```

- If you're using podman, you may run into issues running the kserve container due to
open file (nofile) limits. If so,
see https://github.com/containers/common/blob/main/docs/containers.conf.5.md
for instructions on how to increase the default limits.
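
For reference, the relevant knob in `/etc/containers/containers.conf` is `default_ulimits`; a sketch with illustrative values (pick limits appropriate for your system):

```toml
[containers]
default_ulimits = [
  "nofile=65535:65535",
]
```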

Now install a kind control plane with a local docker registry:

```zsh
curl -s https://kind.sigs.k8s.io/examples/kind-with-registry.sh | bash

Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.29.2) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind
```

Confirm that the cluster's pods are up:

```zsh
kubectl get pods -A

local-path-storage   local-path-provisioner-57c5987fd4-ts26j   1/1   Running
```

Note that your names will look similar but not necessarily identical to the above.

### Set up Rancher storage provisioning (necessary only when using a kind local cluster)

```zsh
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.30/deploy/local-path-storage.yaml
```

### Install the kubeflow training operator (KFTO)

Install the operator following the [kubeflow training operator](https://github.com/kubeflow/training-operator) documentation, then confirm that its pod is running:

```zsh
kubectl get pods -A

kubeflow   training-operator-7f8bfd56f-lrpm2   1/1   Running
```

As before, your output should be similar but not necessarily identical to the above.


Check that the custom resource definitions have been created:

```zsh
kubectl get crds

...
xgboostjobs.kubeflow.org   2024-12-07T18:21:06Z
```

### Push the tsfm service image to the kind local registry:

```zsh
# don't forget to run "make image" first
docker tag tsfmfinetuning:latest localhost:5001/tsfmfinetuning:latest
docker push localhost:5001/tsfmfinetuning:latest
```

### Create your local storage

Define a persistent volume claim using Rancher's local-path storage:

```sh
kubectl apply -f examples/local_pvc.yaml
```
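
If you are curious what such a claim looks like, here is a minimal sketch of a PVC backed by the `local-path` storage class. It is illustrative only; the claim name and size below are made up, and `examples/local_pvc.yaml` is the real manifest:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-path-pvc   # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi   # illustrative size
```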

Create an alpine pod bound to this PVC to make it easier to copy files to the local storage location:

```sh
kubectl apply -f examples/alpine.yaml
```
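
Such a pod is essentially a long-running container with the PVC mounted, giving `kubectl cp` and `kubectl exec` a target. A hedged sketch (see `examples/alpine.yaml` for the real manifest; the claim name below is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  containers:
    - name: alpine
      image: alpine:latest
      command: ["sleep", "infinity"]   # keep the pod alive for kubectl cp/exec
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: local-path-pvc   # hypothetical claim name
```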

Fetch the models and example data locally (they are copied to the PVC in the next step):

```sh
make clone_models && make fetchdata
```

Copy the data and payload parameters to the storage location. **Remember, this is
just a local development example; in a real deployment you would not do things like
giving rwX permission to everyone (the last line in the code snippet below)!**

```sh
kubectl cp mytest-tsfm alpine:/data \
&& kubectl cp --no-preserve=true data/ETTh1.csv alpine:/data \
&& tf=$(mktemp) \
&& cat data/ftpayload.json | awk '{gsub("file://./", "file:///")}1' >> $tf \
&& kubectl cp --no-preserve=true $tf alpine:/data/ftpayload.json \
&& cat tsfmfinetuning/default_config.yml | awk '{gsub("/tmp", "/data")}1' > $tf \
&& kubectl cp --no-preserve=true $tf alpine:/data/default_config.yml \
&& kubectl exec alpine -- chmod -R go+rwX /data
```
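
The `awk` calls above only rewrite paths so they resolve inside the pod: relative `file://./` URIs become absolute `file:///` URIs, and `/tmp` becomes `/data`. For example, the first substitution behaves like this (the filename is just an illustration):

```shell
# the same substitution the snippet above applies to ftpayload.json
echo 'file://./ETTh1.csv' | awk '{gsub("file://./", "file:///")}1'
# prints: file:///ETTh1.csv
```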

Create a finetuning job and monitor its output:

```sh
kubectl apply -f examples/kfto_job.yaml

pytorchjob.kubeflow.org/tsfmfinetuning-job created
```
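
The job is a kubeflow `PyTorchJob` custom resource. A minimal sketch of what a manifest like `examples/kfto_job.yaml` might contain (illustrative only; the args and claim name here are assumptions, not the repo's actual manifest):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: tsfmfinetuning-job
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: pytorch   # the training operator expects this container name
              image: localhost:5001/tsfmfinetuning:latest
              args: ["--model_name", "finetuned_from_kfto"]   # hypothetical args
              volumeMounts:
                - name: data
                  mountPath: /data
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: local-path-pvc   # hypothetical claim name
```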

```sh
kubectl logs -f tsfmfinetuning-job-master-0

/finetuning/.venv/lib/python3.12/site-packages/pydantic/_internal/_fields.py:192: UserWarning: Field name "schema" in "ForecastingInferenceInput" shadows an attribute in parent "BaseInferenceInput"
warnings.warn(
/finetuning/.venv/lib/python3.12/site-packages/pydantic/_internal/_fields.py:192: UserWarning: Field name "schema" in "ForecastingTuneInput" shadows an attribute in parent "BaseTuneInput"
warnings.warn(
INFO:p-1:t-139742116783936:finetuning.py:__init__:registered tinytimemixer
INFO:p-1:t-139742116783936:finetuning.py:_finetuning_common:in _forecasting_tuning_workflow
INFO:p-1:t-139742116783936:finetuning.py:load:No preprocessor found
INFO:p-1:t-139742116783936:hfutil.py:load_model:Found model class: TinyTimeMixerForPrediction
INFO:p-1:t-139742116783936:finetuning.py:load:Successfully loaded model
WARNING:p-1:t-139742116783936:other.py:check_os_kernel:Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
INFO:p-1:t-139742116783936:finetuning.py:_finetuning_common:calling trainer.train
{'loss': 7.3424, 'grad_norm': 8.357532501220703, 'learning_rate': 0.0, 'epoch': 1.0}

100%|██████████| 3/3 [00:17<00:00,  3.51s/it]
{'eval_runtime': 6.2719, 'eval_samples_per_second': 119.581, 'eval_steps_per_second': 3.827, 'epoch': 1.0}
{'train_samples_per_second': 42.739, 'train_steps_per_second': 0.171, 'train_loss': 7.342405319213867, 'epoch': 1.0}
INFO:p-1:t-139742116783936:finetuning.py:_finetuning_common:done with training
```

Confirm that a new finetuned model has been produced:

```sh
# 'finetuned_from_kfto' comes from the value set for the
# --model_name argument in examples/kfto_job.yaml
kubectl exec alpine -- ls -lR /data/finetuned_from_kfto

/data/finetuned_from_kfto:
total 3188
-rw-r--r-- 1 1001 root 1573 Dec 9 15:28 config.json
-rw-r--r-- 1 1001 root 69 Dec 9 15:28 generation_config.json
-rw-r--r-- 1 1001 root 3240592 Dec 9 15:28 model.safetensors
-rw-r--r-- 1 1001 root 857 Dec 9 15:28 preprocessor_config.json
-rw-r--r-- 1 1001 root 5304 Dec 9 15:28 training_args.bin

```
