Update llama notebook with inference container stuff (#2820)
* add notebook changes

* move max_concurrent_requests var

* fix formatting

* update to use mii-v2

* address comments

* fix to use mii instead of mii-v2

* change env vars

* add section to readme

* fix black formatting

* update notes on env vars

* fix black formatting

* update readme, add a100/h100 requirement for mii

* add bullets for deepspeed fastgen

* add space

* typo

---------

Co-authored-by: svaruag <[email protected]>
cassieesvelt and svaruag authored Nov 14, 2023
1 parent 46181cf commit e09d30c
Showing 2 changed files with 79 additions and 24 deletions.
29 changes: 29 additions & 0 deletions sdk/python/foundation-models/system/inference/README.md
@@ -0,0 +1,29 @@
# Foundation Model Inferencing
The __foundation-model-inference__ container is a curated solution for inferencing foundation models on Azure ML. It incorporates the best inferencing frameworks to ensure optimal request throughput and latency. The container is user-friendly and integrates seamlessly with Azure ML. It utilizes:

## vLLM
vLLM is a high-performance inferencing server that offers several features, making it a top choice for inferencing systems. vLLM provides:
- Top-tier serving throughput
- Efficient handling of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Support for tensor parallelism for distributed inference
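
As a rough, hedged illustration of the engine the container wraps, the sketch below uses vLLM's offline `LLM` API directly (outside Azure ML); the model name and sampling settings are placeholders and the exact API surface may differ between vLLM versions.

```python
# Minimal vLLM sketch (assumes `pip install vllm` and access to a Llama-2 checkpoint).
from vllm import LLM, SamplingParams

# tensor_parallel_size plays the same role as the container's TENSOR_PARALLEL setting.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1)

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling)

for output in outputs:
    print(output.outputs[0].text)
```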

## DeepSpeed FastGen
DeepSpeed FastGen, a recent release from DeepSpeed, offers up to 2.3x faster throughput than vLLM, which already outperforms similar frameworks like Huggingface TGI. DeepSpeed-FastGen combines DeepSpeed-MII and DeepSpeed-Inference to deliver a fast and user-friendly serving system.
DeepSpeed FastGen features include:
- Up to 2.3x faster throughput than vLLM
- Optimized memory handling with a blocked KV cache
- Continuous batching of incoming requests
- Optimized CUDA kernels
- Tensor parallelism support
- New Dynamic SplitFuse technique to increase overall performance and provide better throughput consistency

DeepSpeed FastGen achieves superior performance by using a new technique called Dynamic SplitFuse. This technique enhances responsiveness, efficiency, and result consistency. For more information, visit the DeepSpeed FastGen [GitHub page](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-fastgen/README.md).
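
As a minimal, hedged sketch of what the FastGen path looks like outside the container (the model name and generation settings are placeholders, and the persistent-deployment API differs):

```python
# Minimal DeepSpeed-MII (FastGen) sketch (assumes `pip install deepspeed-mii` and an A100/H100 GPU).
import mii

# Non-persistent pipeline; to shard the model across GPUs, launch the script with the
# deepspeed launcher, e.g. `deepspeed --num_gpus 2 this_script.py`.
pipe = mii.pipeline("meta-llama/Llama-2-7b-hf")

responses = pipe(["Explain Dynamic SplitFuse in one sentence."], max_new_tokens=64)
print(responses)  # each response carries the generated text
```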

## Supported Tasks by the Container
- Text Generation
> More tasks will be supported soon.

For additional information on this container and its use with foundation models, refer to section 3.4 of the [text-generation example](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/inference/text-generation/llama-safe-online-deployment.ipynb).
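
For orientation, the hedged sketch below shows one way to call a text-generation deployment backed by this container from the Azure ML Python SDK. The workspace, endpoint, and deployment names are placeholders, and the request payload shape is only illustrative; follow the linked notebook for the exact schema.

```python
# Hypothetical invocation of an online endpoint backed by this container.
import json

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

# Illustrative text-generation payload; check the linked notebook for the exact format.
sample = {
    "input_data": {
        "input_string": ["What is the capital of France?"],
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    }
}
with open("sample_request.json", "w") as f:
    json.dump(sample, f)

response = ml_client.online_endpoints.invoke(
    endpoint_name="<ENDPOINT_NAME>",
    deployment_name="<DEPLOYMENT_NAME>",
    request_file="sample_request.json",
)
print(response)
```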
@@ -403,28 +403,25 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 3.4 Deploy Llama 2 model\n",
"This step may take a few minutes."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create deployment"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Initialize deployment parameters"
"#### 3.4 Setup Deployment Parameters\n",
"\n",
"We utilize an optimized __foundation-model-inference__ container for model scoring. This container is designed to deliver high throughput and low latency. In this section, we introduce several environment variables that can be adjusted to customize a deployment for either high throughput or low latency scenarios.\n",
"\n",
"- __ENGINE_NAME__: Selects the inferencing framework used by the scoring script. For Llama-2 models, setting ENGINE_NAME = 'mii' makes the container run inference with the new [DeepSpeed-FastGen](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen), while ENGINE_NAME = 'vllm' (the default) runs inference with [vLLM](https://vllm.readthedocs.io/en/latest/).\n",
"- __WORKER_COUNT__: The number of workers to use for inferencing. This is used as a proxy for the number of concurrent requests that the server should handle.\n",
"- __TENSOR_PARALLEL__: The number of GPUs to use for tensor parallelism.\n",
"- __NUM_REPLICAS__: The number of model instances to load for the deployment. This is used to increase throughput by loading multiple models on multiple GPUs, if the model is small enough to fit.\n",
"\n",
"`NUM_REPLICAS` and `TENSOR_PARALLEL` work together to find the configuration that maximizes deployment throughput without degrading latency too much. The total number of GPUs used for inference is `NUM_REPLICAS` * `TENSOR_PARALLEL`. For example, if `NUM_REPLICAS` = 2 and `TENSOR_PARALLEL` = 2, then 4 GPUs are used for inference.\n",
"\n",
"Ensure that the model you are deploying is small enough to fit on the number of GPUs you are using, specified by `TENSOR_PARALLEL`. For instance, if there are 4 GPUs available, and `TENSOR_PARALLEL` = 2, then the model must be small enough to fit on 2 GPUs. If the model is too large, then the deployment will fail. \n",
"\n",
"__NOTE__: \n",
"- `NUM_REPLICAS` is currently only supported by the vLLM engine.\n",
"- DeepSpeed MII Engine is only supported on A100 / H100 GPUs.\n"
]
},
{
@@ -433,17 +430,43 @@
"metadata": {},
"outputs": [],
"source": [
"REQUEST_TIMEOUT_MS = 90000\n",
"REQUEST_TIMEOUT_MS = 90000 # the timeout for each request in milliseconds\n",
"MAX_CONCURRENT_REQUESTS = (\n",
" 128 # the maximum number of concurrent requests supported by the endpoint\n",
")\n",
"\n",
"deployment_env_vars = {\n",
"acs_env_vars = {\n",
" \"CONTENT_SAFETY_ACCOUNT_NAME\": aacs_name,\n",
" \"CONTENT_SAFETY_ENDPOINT\": aacs_endpoint,\n",
" \"CONTENT_SAFETY_KEY\": aacs_access_key if uai_client_id == \"\" else None,\n",
" \"CONTENT_SAFETY_THRESHOLD\": content_severity_threshold,\n",
" \"SUBSCRIPTION_ID\": subscription_id,\n",
" \"RESOURCE_GROUP_NAME\": resource_group,\n",
" \"UAI_CLIENT_ID\": uai_client_id,\n",
"}"
"}\n",
"\n",
"fm_container_default_env_vars = {\n",
" \"WORKER_COUNT\": MAX_CONCURRENT_REQUESTS,\n",
" \"TENSOR_PARALLEL\": 2,\n",
" \"NUM_REPLICAS\": 2,\n",
"}\n",
"\n",
"deployment_env_vars = {**fm_container_default_env_vars, **acs_env_vars}\n",
"\n",
"# Uncomment the following lines to use DeepSpeed FastGen engine (experimental)\n",
"# mii_fastgen_env_vars = {\n",
"# \"ENGINE_NAME\": \"mii\",\n",
"# \"WORKER_COUNT\": MAX_CONCURRENT_REQUESTS,\n",
"# }\n",
"# deployment_env_vars = {**mii_fastgen_env_vars, **acs_env_vars}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 3.5 Deploy Llama 2 model\n",
"This step may take a few minutes."
]
},
{
Expand All @@ -459,7 +482,7 @@
" ProbeSettings,\n",
")\n",
"\n",
"# For inference environments HF TGI and DS MII, the scoring script is baked into the container\n",
"# For inference environments vLLM and DS MII, the scoring script is baked into the container\n",
"code_configuration = (\n",
" CodeConfiguration(code=\"./llama-files/score/default/\", scoring_script=\"score.py\")\n",
" if not inference_envs_exist\n",
Expand All @@ -474,7 +497,10 @@
" instance_count=1,\n",
" code_configuration=code_configuration,\n",
" environment_variables=deployment_env_vars,\n",
" request_settings=OnlineRequestSettings(request_timeout_ms=REQUEST_TIMEOUT_MS),\n",
" request_settings=OnlineRequestSettings(\n",
" request_timeout_ms=REQUEST_TIMEOUT_MS,\n",
" max_concurrent_requests_per_instance=MAX_CONCURRENT_REQUESTS,\n",
" ),\n",
" liveness_probe=ProbeSettings(\n",
" failure_threshold=30,\n",
" success_threshold=1,\n",
