diff --git a/services/finetuning/README.md b/services/finetuning/README.md
index b6534699..21baf9ab 100644
--- a/services/finetuning/README.md
+++ b/services/finetuning/README.md
@@ -9,14 +9,14 @@ a finetuning job on a kubernetes-based system.
 
 ## Prerequisites:
 
-* GNU make
-* git
-* git-lfs (available in many system package managers such as apt, dnf, and brew)
-* python >=3.10, <3.13
-* poetry (`pip install poetry`)
-* zsh or bash
-* docker or podman (to run examples, we have not tested well with podman)
-* kubectl for deploying a local test cluster
+- GNU make
+- git
+- git-lfs (available in many system package managers such as apt, dnf, and brew)
+- python >=3.10, <3.13
+- poetry (`pip install poetry`)
+- zsh or bash
+- docker or podman (to run the examples; we have not tested extensively with podman)
+- kubectl for deploying a local test cluster
 
 ## Installation
 
@@ -29,7 +29,6 @@ pip install poetry && poetry install --with dev
 
 This will run basic unit tests. You should run them and confirm they pass
 before proceeding to kubernetes-based tests and examples.
-
 ```zsh
 make test_local
 ```
@@ -37,7 +36,7 @@
 ### Building an image
 
 You must have either docker or podman installed on your system for this to
-work. You must also have proper permissions on your system to build images. We assume you have a working docker command which can be docker itself
+work. You must also have proper permissions on your system to build images. We assume you have a working `docker` command, which can be docker itself or `podman` aliased as `docker` (either manually or via the podman-docker package, which sets up the alias for you).
 
 ```zsh
@@ -48,7 +47,7 @@ Note that by default we build an image **without** GPU support. This makes the d
 than a fully nvidia-enabled image. GPU enablement is coming soon and will be
 available via an environment prefix to the `make image` command.
 
-After a successful build you should have a local image named 
+After a successful build you should have a local image named
 `tsfmfinetuning:latest`
 
 ```zsh
@@ -61,28 +60,29 @@ tsfmfinetuning latest
 
 For this example we'll use
 [kind](https://kind.sigs.k8s.io/docs/user/quick-start/), a lightweight way of
 running a local kubernetes cluster using docker. We will
-use the kubeflow training operator's custom resource to start 
+use the kubeflow training operator's custom resource to start
 and monitor an asynchronous finetuning job.
 
 ### Create a local cluster
 
 First:
 
-* [Install kubectl](https://kubernetes.io/docs/tasks/tools/)
-* [Install helm](https://helm.sh/docs/intro/install/)
-* If you are using podman, you will need to enable the use of an insecure (using http instead of https)
-local container registry by creating a file called `/etc/containers/registries.conf.d/localhost.conf`
-with the following content:
+- [Install kubectl](https://kubernetes.io/docs/tasks/tools/)
+- [Install helm](https://helm.sh/docs/intro/install/)
+- If you are using podman, you will need to enable the use of an insecure (using http instead of https)
+  local container registry by creating a file called `/etc/containers/registries.conf.d/localhost.conf`
+  with the following content:
 
 ```
 [[registry]]
 location = "localhost:5001"
 insecure = true
 ```
 
-* If you're using podman, you may run into issues running the kserve container due to
-open file (nofile) limits. If so,
-see https://github.com/containers/common/blob/main/docs/containers.conf.5.md
-for instructions on how to increase the default limits. 
+
+- If you're using podman, you may run into issues running the kserve container due to
+  open file (nofile) limits. If so,
+  see https://github.com/containers/common/blob/main/docs/containers.conf.5.md
+  for instructions on how to increase the default limits (a minimal example follows).
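+  This is only a sketch, assuming the `default_ulimits` setting described in that
+  document; the drop-in file name and the limit values below are illustrative, not
+  prescriptive:
+
+```
+# illustrative drop-in file, e.g. /etc/containers/containers.conf.d/10-nofile.conf
+[containers]
+default_ulimits = [
+  "nofile=65535:65535",
+]
+```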
 
 Now install a kind control plane with a local docker registry:
 
@@ -91,11 +91,11 @@ curl -s https://kind.sigs.k8s.io/examples/kind-with-registry.sh | bash
 Creating cluster "kind" ...
  ✓ Ensuring node image (kindest/node:v1.29.2) đŸ–ŧ
- ✓ Preparing nodes đŸ“Ļ 
- ✓ Writing configuration 📜 
- ✓ Starting control-plane 🕹ī¸ 
- ✓ Installing CNI 🔌 
- ✓ Installing StorageClass 💾 
+ ✓ Preparing nodes đŸ“Ļ
+ ✓ Writing configuration 📜
+ ✓ Starting control-plane 🕹ī¸
+ ✓ Installing CNI 🔌
+ ✓ Installing StorageClass 💾
 Set kubectl context to "kind-kind"
 You can now use your cluster with:
@@ -129,6 +129,11 @@ local-path-storage local-path-provisioner-57c5987fd4-ts26j 1/1 Runnin
 Note that your names will look similar but not necessarily identical to the above.
 
+### Set up Rancher storage provisioning (only necessary when using a kind local cluster)
+
+```zsh
+kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.30/deploy/local-path-storage.yaml
+```
 
 ### Install the kubeflow training operator (KFTO)
 
@@ -148,7 +153,6 @@ kubeflow training-operator-7f8bfd56f-lrpm2 1/1 Runnin
 As before, your output should be similar but not necessarily identical to the
 above.
-
 Check that the custom resource definitions have been created:
 
 ```zsh
@@ -164,7 +168,7 @@ xgboostjobs.kubeflow.org 2024-12-07T18:21:06Z
 ```
 
-### Upload the tsfm service image to the kind local registry:
+### Push the tsfm service image to the kind local registry:
 
 ```zsh
 # don't forget to run "make image" first
@@ -172,5 +176,84 @@
 docker tag tsfmfinetuning:latest localhost:5001/tsfmfinetuning:latest
 docker push localhost:5001/tsfmfinetuning:latest
 ```
 
+### Create your local storage
+
+Define a persistent volume claim using Rancher's local-path storage:
+
+```sh
+kubectl apply -f examples/local_pvc.yaml
+```
+
+Create an alpine instance bound to this PVC to make it easier to copy files to the local storage location:
+
+```sh
+kubectl apply -f examples/alpine.yaml
+```
+
+Clone the models and fetch the example data that will be copied to the PVC:
+
+```sh
+make clone_models && make fetchdata
+```
+
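+As a quick sanity check, confirm that the paths the copy step below expects are
+present locally (this assumes `make clone_models` and `make fetchdata` produce the
+`mytest-tsfm` directory and `data/ETTh1.csv` file referenced by the next command):
+
+```sh
+# optional check: these paths are used by the kubectl cp commands below
+ls -d mytest-tsfm && ls -l data/ETTh1.csv
+```
+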
+Copy the data and payload parameters to the storage location. **Remember, this is
+just a local development example; you would not give rwX permission to everyone
+(the last line in the code snippet below) in a real deployment!**
+
+```sh
+kubectl cp mytest-tsfm alpine:/data \
+&& kubectl cp --no-preserve=true data/ETTh1.csv alpine:/data \
+&& tf=$(mktemp) \
+&& cat data/ftpayload.json | awk '{gsub("file://./", "file:///")}1' >> $tf \
+&& kubectl cp --no-preserve=true $tf alpine:/data/ftpayload.json \
+&& cat tsfmfinetuning/default_config.yml | awk '{gsub("/tmp", "/data")}1' > $tf \
+&& kubectl cp --no-preserve=true $tf alpine:/data/default_config.yml \
+&& kubectl exec alpine -- chmod -R go+rwX /data
+```
-
-mkdir -p /tmp/kind-local-storage
+
+Create a finetuning job and monitor its output:
+
+```sh
+kubectl apply -f examples/kfto_job.yaml
+
+pytorchjob.kubeflow.org/tsfmfinetuning-job created
+```
+
+```sh
+kubectl logs -f tsfmfinetuning-job-master-0
+
+/finetuning/.venv/lib/python3.12/site-packages/pydantic/_internal/_fields.py:192: UserWarning: Field name "schema" in "ForecastingInferenceInput" shadows an attribute in parent "BaseInferenceInput"
+  warnings.warn(
+/finetuning/.venv/lib/python3.12/site-packages/pydantic/_internal/_fields.py:192: UserWarning: Field name "schema" in "ForecastingTuneInput" shadows an attribute in parent "BaseTuneInput"
+  warnings.warn(
+INFO:p-1:t-139742116783936:finetuning.py:__init__:registered tinytimemixer
+INFO:p-1:t-139742116783936:finetuning.py:_finetuning_common:in _forecasting_tuning_workflow
+INFO:p-1:t-139742116783936:finetuning.py:load:No preprocessor found
+INFO:p-1:t-139742116783936:hfutil.py:load_model:Found model class: TinyTimeMixerForPrediction
+INFO:p-1:t-139742116783936:finetuning.py:load:Successfully loaded model
+WARNING:p-1:t-139742116783936:other.py:check_os_kernel:Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
+INFO:p-1:t-139742116783936:finetuning.py:_finetuning_common:calling trainer.train
+{'loss': 7.3424, 'grad_norm': 8.357532501220703, 'learning_rate': 0.0, 'epoch': 1.0}
+
+100%|██████████| 3/3 [00:17<00:00, 3.51s/it]e': 6.2719, 'eval_samples_per_second': 119.581, 'eval_steps_per_second': 3.827, 'epoch': 1.0}
+100%|██████████| 3/3 [00:17<00:00, 5.85s/it]_second': 42.739, 'train_steps_per_second': 0.171, 'train_loss': 7.342405319213867, 'epoch': 1.0}
+INFO:p-1:t-139742116783936:finetuning.py:_finetuning_common:done with training
+```
+
+Confirm that a new finetuned model has been produced:
+
+```sh
+# 'finetuned_from_kfto' comes from the value set for the
+# --model_name argument in examples/kfto_job.yaml
+kubectl exec alpine -- ls -lR /data/finetuned_from_kfto
+
+/data/finetuned_from_kfto:
+total 3188
+-rw-r--r-- 1 1001 root 1573 Dec 9 15:28 config.json
+-rw-r--r-- 1 1001 root 69 Dec 9 15:28 generation_config.json
+-rw-r--r-- 1 1001 root 3240592 Dec 9 15:28 model.safetensors
+-rw-r--r-- 1 1001 root 857 Dec 9 15:28 preprocessor_config.json
+-rw-r--r-- 1 1001 root 5304 Dec 9 15:28 training_args.bin
+
+```
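+
+When you are done, you can optionally copy the finetuned model back to your
+workstation and clean up the example resources. This is only a sketch: it reuses
+the example manifests applied above, `kubectl cp` out of the pod relies on `tar`
+being available in the alpine image, and `kind delete cluster` removes the entire
+local cluster.
+
+```sh
+# copy the finetuned model out of the PVC via the alpine helper pod
+kubectl cp alpine:/data/finetuned_from_kfto ./finetuned_from_kfto
+
+# remove the example resources created above
+kubectl delete -f examples/kfto_job.yaml
+kubectl delete -f examples/alpine.yaml
+kubectl delete -f examples/local_pvc.yaml
+
+# or simply delete the whole kind cluster
+kind delete cluster
+```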