This document covers deploying HuggingFace T5 models to GKE, whether taken directly from HuggingFace or fine-tuned through the LLM Pipeline in this repo.
Running the commands described in this document produces a provisioned GKE cluster within your GCP project hosting the model you provide. The model will have an endpoint accessible from within your Virtual Private Cloud network, and externally from the internet if your project's firewall is configured for it.
Support has been tested for the following T5 families available on HuggingFace:
- t5 (small, base, large, 11b)
- google/t5-v1_1 (small, base, large, xl, xxl)
- google/flan-t5 (small, base, large, xl, xxl)
These commands can be run from any terminal configured for gcloud, or through Cloud Shell.
Start by enabling the required APIs within your GCP project.
gcloud services enable container.googleapis.com storage.googleapis.com run.googleapis.com cloudresourcemanager.googleapis.com notebooks.googleapis.com
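To confirm the APIs are active, you can list the enabled services. This is an optional check, and the grep pattern below is just a convenience:

```
# Optional check: the services enabled above should appear in this list.
gcloud services list --enabled | grep -E 'container|storage|run|cloudresourcemanager|notebooks'
```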
The default compute service account in your project must also have several roles granted: Editor, Project IAM Admin, and Service Account Admin.
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects list \
--filter="${PROJECT_ID}" \
--format="value(PROJECT_NUMBER)")
SERVICE_ACCOUNT=${PROJECT_NUMBER}[email protected]
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member=serviceAccount:${SERVICE_ACCOUNT} \
--role=roles/editor
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member=serviceAccount:${SERVICE_ACCOUNT} \
--role=roles/resourcemanager.projectIamAdmin
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member=serviceAccount:${SERVICE_ACCOUNT} \
--role=roles/iam.serviceAccountAdmin
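To confirm the bindings took effect, you can inspect the project's IAM policy filtered to the compute service account:

```
# List the roles currently granted to the default compute service account.
gcloud projects get-iam-policy ${PROJECT_ID} \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:${SERVICE_ACCOUNT}" \
  --format="value(bindings.role)"
```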
Start by ensuring the Pre-Requisites have been met. Afterwards, run these commands in Cloud Shell or a gcloud-configured terminal to get started.
These commands create a single-node GKE cluster with a2-highgpu-1g VMs and 1 nvidia-tesla-a100 GPU on the node.
They deploy the google/flan-t5-base model onto the cluster and expose the model on an HTTP endpoint in the specified project's default VPC.
- Follow the pre-requisites section to enable the necessary services and IAM policies.
- Clone this repository:
git clone https://github.com/gcp-llm-platform/llm-pipeline.git
cd llm-pipeline
- Run these commands:
PROJECT_ID=$(gcloud config get-value project)
NAME_PREFIX=my-test-cluster
JOB_NAME=my-job
REGION=us-central1
gcloud run jobs create $JOB_NAME --project=$PROJECT_ID --region=$REGION --env-vars-file src/gke/cluster_config.yml --image=gcr.io/llm-containers/gke-provision-deploy:release --args=--project=$PROJECT_ID,--name-prefix=$NAME_PREFIX --execute-now --wait
- Follow the instructions in Consuming the deployed model.
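Provisioning the cluster and deploying the model can take a while. A quick way to check on the job started above, using the same variables:

```
# Check the status of the Cloud Run job execution created by the command above.
gcloud run jobs executions list --job=${JOB_NAME} --region=${REGION} --project=${PROJECT_ID}
```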
An environment variable file containing the configuration for the GKE cluster and the model needs to be created. The full specification for the cluster configuration can be found here. A sample configuration is available in the repository at llm-pipeline-examples/src/gke/sample_environment_config.yml.
Using the sample configuration will create a single-node GKE cluster with a2-highgpu-1g VMs and 1 nvidia-tesla-a100 GPU on the node. The logs from provisioning the cluster will be uploaded to a newly created Cloud Storage bucket named aiinfra-terraform-<project_id>.
There are several variables that need to be set for the Model Deployment.
Note: The PROJECT_ID and CONVERTED_MODEL_UPLOAD_PATH values must be changed, or provided as runtime arguments.
Environment Variable Name | Required | Description | Example Value |
--- | --- | --- | --- |
GPU_COUNT_PER_MODEL | Y | Number of GPUs exposed to the pod, also used to set the parallelism when using FasterTransformer | 4 |
MODEL_SOURCE_PATH | Y | GCS path or HuggingFace repo pointing to the directory of the model to deploy. Note: For a model fine-tuned using the pipeline, look at the Model Artifact after the training step and use the URL property. | gs://my-bucket/pipeline_runs/237939871711/llm-pipeline-20230328153111/train_5373485673388965888/Model/ or google/flan-t5-xxl |
NAME_PREFIX | N* | Prefix to use when naming the GKE cluster that will be provisioned. Full name will be `$NAME_PREFIX-gke` | my-cluster |
EXISTING_CLUSTER_ID | N* | Name of an existing cluster (in the corresponding Region and Project) to use instead of provisioning a new cluster. | my-gke |
KSA_NAME | N | Name of the Kubernetes Service Account configured with access to the given GCS path. By default one will be provisioned as 'aiinfra-gke-sa'. | my-other-ksa |
MODEL_NAME | N | Friendly name for the model, used in constructing the Kubernetes Resource names | t5-flan |
INFERENCING_IMAGE_TAG | N | Image tag for the inference image. Default is 'release'. | latest |
USE_FASTER_TRANSFORMER | N | Boolean to set when the FasterTransformer / Triton path should be enabled. This controls whether a Conversion job is scheduled, and the inference image that will be deployed. | true |
CONVERTED_MODEL_UPLOAD_PATH | Y** | Only required when USE_FASTER_TRANSFORMER is set. A GCS path to upload the model after it is converted for FasterTransformer. | gs://my-bucket/converted_t5/1/Model |
POD_MEMORY_LIMIT | N | Sets the memory limit of pods for GKE in Kubernetes Memory resource format. Defaults to "16Gi". | 50Gi |
* One of NAME_PREFIX or EXISTING_CLUSTER_ID must be provided.
** Must be provided when setting USE_FASTER_TRANSFORMER
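For illustration only, an env-vars file for `--env-vars-file` could be assembled as below. The variable names come from the table above, every value is a placeholder, and the authoritative sample is src/gke/sample_environment_config.yml:

```
# Illustrative sketch only; see src/gke/sample_environment_config.yml for the real sample.
# PROJECT_ID and CONVERTED_MODEL_UPLOAD_PATH may instead be supplied as runtime arguments.
cat > my_cluster_config.yml <<'EOF'
PROJECT_ID: "my-project"
GPU_COUNT_PER_MODEL: "1"
MODEL_SOURCE_PATH: "google/flan-t5-base"
NAME_PREFIX: "my-test-cluster"
MODEL_NAME: "t5-flan"
POD_MEMORY_LIMIT: "16Gi"
# Optional FasterTransformer / Triton path:
# USE_FASTER_TRANSFORMER: "true"
# CONVERTED_MODEL_UPLOAD_PATH: "gs://my-bucket/converted_t5/1/Model"
EOF
```

Pass the resulting file to the deployment job with `--env-vars-file my_cluster_config.yml`.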
The Cluster Provisioning + Deployment image is available at gcr.io/llm-containers/gke-provision-deploy.
Several flags are available as arguments to be passed to the image.
A known payload and response can be used to test the image by using the -v, -i, and -o flags.
Options
-h|--help) Display this menu.
-v|--verify) Setting this flag will use the -i and -o flags to validate the expected inferencing behavior of the deployed model.
-i|--verify-input-payload=) Path to a file containing the inferencing input for verification. This will route to the Flask endpoint on the image.
-o|--verify-output-payload=) Path to a file containing the expected inferencing output for verification.
-p|--project=) ID of the project to use. Defaults to environment variable $PROJECT_ID.
--converted-model-upload-path=) Only required when USE_FASTER_TRANSFORMER is set. A GCS path to upload the model after it is converted for FasterTransformer.
--name-prefix=) Name prefix for the cluster to create. The cluster will be named <name-prefix>-gke. Defaults to environment variable $NAME_PREFIX.
--cleanup) Deletes the model and cluster at the end of the run. Used for testing.
Run the image using gcloud run jobs, or through any docker executor.
export JOB_NAME=my-job
export REGION=us-central1
export PROJECT_ID=$(gcloud config get project)
gcloud run jobs create $JOB_NAME --project=$PROJECT_ID --region=$REGION --env-vars-file src/gke/cluster_config.yml --image=gcr.io/llm-containers/gke-provision-deploy:release --args=--project=$PROJECT_ID --execute-now --wait
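As a sketch of the "any docker executor" option, the same image can be launched locally. This is only a sketch and makes two assumptions: env.list holds the cluster_config.yml variables as KEY=VALUE pairs, and the image reads Application Default Credentials mounted at /root/.config/gcloud:

```
# Sketch: run the provision/deploy image through a local Docker executor.
# Assumes env.list (KEY=VALUE) mirrors cluster_config.yml and that ADC is mounted as below.
docker run --rm \
  --env-file env.list \
  -v "$HOME/.config/gcloud:/root/.config/gcloud" \
  gcr.io/llm-containers/gke-provision-deploy:release \
  --project="${PROJECT_ID}" \
  --name-prefix="${NAME_PREFIX}"
```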
After the image finishes provisioning the cluster, the model will be converted (if necessary) and deployed to the cluster. The image will then terminate.
A NodePort service on the cluster is automatically created during deployment. This NodePort allows a user to consume the model from any network that has access to the GKE node.
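As an alternative to reading the values from the image logs (described next), you can look the service up directly with kubectl. A minimal sketch, assuming the cluster was provisioned by this image with NAME_PREFIX as a regional cluster:

```
# Fetch credentials for the provisioned cluster, then locate the NodePort and a node IP.
gcloud container clusters get-credentials ${NAME_PREFIX}-gke --region=${REGION} --project=${PROJECT_ID}
kubectl get services -o wide   # the NodePort service created for the model, with its port
kubectl get nodes -o wide      # the INTERNAL-IP column gives a reachable node address
```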
The image will log these values at the end of a run. If the image was run through gcloud run, you will need to retrieve the logs from the job execution to see this output. You can retrieve the logs using this command.
gcloud logging read "resource.type=\"cloud_run_job\" resource.labels.job_name=\"${JOB_NAME}\" resource.labels.location=\"${REGION}\" severity>=DEFAULT" --project=${PROJECT_ID} --format=json | jq -r '.[].textPayload' | tac | tail -n 20
Sample output:
From a machine on the same VPC as this cluster you can call http://10.128.0.29:32754/infer
***********
To deploy a sample notebook for experimenting with this deployed model, paste the following link into your browser:
https://console.cloud.google.com/vertex-ai/workbench/user-managed/deploy?download_url=https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fllm-pipeline-examples%2Fmain%2Fexamples%2Ft5-gke-sample-notebook.ipynb&project=$PROJECT_ID
Set the following parameters in the variables cell of the notebook:
host = '10.128.0.29'
flask_node_port = '32754'
triton_node_port = '31246'
payload = """{"instances":["Sandwiched between a second-hand bookstore..."]}"""
***********
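For example, the /infer endpoint from the sample output above can be exercised with curl. The host, port, and prompt below are illustrative; substitute the values logged for your deployment:

```
# Illustrative call to the Flask /infer route using the sample host and NodePort shown above.
curl -s -X POST "http://10.128.0.29:32754/infer" \
  -H "Content-Type: application/json" \
  -d '{"instances": ["Translate English to German: The house is wonderful."]}'
```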
Basic health endpoint serving as a Kubernetes Liveness probe.
Returns a basic UI for prompt engineering
Takes and returns the string version of an inference payload. Configured for the Vertex API, so payloads should be provided in the format of:
{ "instances": ["payload1", "payload2", ... ] }
Responses will be returned in Vertex format:
{ "predictions": ["prediction1", "prediction2", ... ], "metrics": [ {"metric1": "value1"}, {"units": "unit_measurement"} ] }
Examples of payloads can be seen in predict_payload.json and predict_result.json
Only available on FasterTransformer image. A raw endpoint that directly communicates with Triton, taking the Triton tensor payload.
These limitations are accurate as of June 1, 2023.
- FasterTransformer image only supports the T5 model family (t5, t5-v1_1, flan-t5). All sizes are supported.