Commit

Merge pull request #62 from georgian-io/automated_deployment

Automated deployment

mariia-georgian authored Dec 18, 2023
2 parents 830d314 + f1a3e0b commit 6a6582d
Showing 131 changed files with 734 additions and 15,239 deletions.
171 changes: 164 additions & 7 deletions inference/README.md
@@ -1,16 +1,171 @@
# Deployment

In this section you can find the instructions on how to deploy your model using FastApi and Text Generation Inference.
In this section you can find the instructions on how to deploy your models using different inference servers.

## Prerequisites

### General

To follow these instructions you need:

- Docker installed
- Path of the folder with model weights
- HuggingFace account
- HuggingFace repository with a merged model (follow steps 1-4 from [How to merge the model](#how-to-merge-the-model))

Note: To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

## FastApi
### Load testing

- [Vegeta](https://github.com/tsenart/vegeta) installed (follow [this guide](https://geshan.com.np/blog/2020/09/vegeta-load-testing-primer-with-examples/) for installation)



## Automated deployment and benchmark

With automated deployment you can easily deploy Llama-2, RedPajama, Falcon, or Flan models and load test them with different numbers of requests.

Go to the <code>automated_deployment</code> folder.

```
cd automated_deployment
```

### Deployment

Before running inference, you will need to fill in the <code>config.json</code> file, which has the following default structure:

```
{
"server": "tgi",
"huggingface_repo": "NousResearch/Llama-2-7b-hf",
"huggingface_token": "",
"model_type": "llama",
"max_tokens": 20
}
```

#### server

Mappings for the possible servers you can deploy on:

| Server | Parameter name |
|-----------------|-----------------|
| vLLM | ```vllm``` |
| Text Generation Inference | ```tgi``` |
| Ray | ```ray``` |
| Triton Inference Server with vLLM backend | ```triton_vllm``` |


#### huggingface_token

Read/Write token for your HuggingFace account.

#### huggingface_repo

The HuggingFace model repository that stores the model files. Pass it in the format ```username/repo_name```.

#### max_tokens

Maximum number of tokens your model should generate (must be an integer).

#### model_type

Mappings for different model types.
| Model | Type |
|------------|---------|
| Flan-T5 | flan |
| Falcon-7B | falcon |
| RedPajama | red_pajama |
| Llama-2 | llama |
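
If you want to sanity-check the file before starting a server, a minimal sketch such as the one below can catch typos early. It assumes you run it from the `automated_deployment` folder (so that `config.json` and `enum_types.py` are available) and that the model-type names are exactly the ones listed above:

```
import json

from enum_types import Server  # allowed servers: tgi, vllm, ray, triton_vllm

# Assumption: these are the only model types the deployment scripts accept.
KNOWN_MODEL_TYPES = {"flan", "falcon", "red_pajama", "llama"}

with open("config.json") as f:
    config = json.load(f)

# Enum lookup by value raises ValueError for an unknown server name.
Server(config["server"])

if config["model_type"] not in KNOWN_MODEL_TYPES:
    raise ValueError(f"Unknown model_type: {config['model_type']}")

if not config["huggingface_token"]:
    print("Warning: huggingface_token is empty.")
```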


After modifying the fields according to your preferences, run the following command to start the server:

```
python run_inference.py
```



### Send a request to the server

Once the server has started, you can send a request.

1. Run the following command:

```
python send_post_request.py inference
```
2. You will then be asked to provide the input.

For example:

```
Input: Classify the following sentence that is delimited with triple backticks. ### Sentence:I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. ### Class:
```
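
Under the hood, the script sends your input as an HTTP POST to the running server. Purely as an illustration (the route and payload shape differ per server; the sketch below assumes a TGI-style server reachable on `localhost:8080` and the `requests` package being installed), such a request could look like:

```
import requests

# Assumptions: TGI-style /generate endpoint on port 8080; other servers
# (vLLM, Ray, Triton) expect different routes and payload formats.
url = "http://localhost:8080/generate"
payload = {
    "inputs": "Classify the following sentence ... ### Class:",
    "parameters": {"max_new_tokens": 20},
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```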

### Benchmark

If you want to find out what latency and throughput each server provides, you can run a benchmark using the [Vegeta](https://github.com/tsenart/vegeta) load-testing tool.

We currently support benchmarking for classification and summarization tasks.

Before running the benchmark, you will have to add a few more fields to `config.json`:
```
{
...
"task": "classification",
"model_name": "llama_7b_class",
"duration": "10s",
"rate": "10"
}
```
#### task

Specify the task your model was trained for: either ```classification``` or ```summarization```.

#### model_name

Text identifier of the model used in the summary table (can be anything).

#### duration and rate

Duration of the benchmark test. Each second, a fixed number of requests (the rate value) is sent. If the duration is `10s` and the rate is `20`, a total of `200` requests will be sent.

Usually, with a longer duration you will be able to send fewer requests per second without the server crashing.
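
For reference, the total number of requests is just the duration in seconds multiplied by the rate. A tiny sketch of the arithmetic, assuming the duration is always given in plain seconds such as `10s`:

```
import json

with open("config.json") as f:
    config = json.load(f)

# Assumes a "<N>s" duration; Vegeta itself also accepts other units.
duration_seconds = int(config["duration"].rstrip("s"))
rate = int(config["rate"])

print(f"Total requests: {duration_seconds * rate}")
```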

Once the server is started, run the benchmark command in a separate window:

```
python run_benchmark.py
```

The test runs twice for fairer results, and at the end all metrics are reported with their standard deviation.

<b> Raw data (Vegeta output for 1 test) </b>

```
Requests [total, rate, throughput] 100, 10.10, 9.87
Duration [total, attack, wait] 10.137s, 9.9s, 236.754ms
Latencies [min, mean, 50, 90, 95, 99, max] 227.567ms, 347.64ms, 325.601ms, 421.165ms, 424.789ms, 426.472ms, 426.884ms
Bytes In [total, mean] 3200, 32.00
Bytes Out [total, mean] 36900, 369.00
Success [ratio] 100.00%
Status Codes [code:count] 200:100
Error Set:
```

<b> Processed data (summary of results for 2 tests)</b>
| model | server | rps | latency_with_deviation | throughput_with_deviation | duration_with_deviation |
|----------------|--------|-----|-----------------------|---------------------------|-------------------------|
| llama_7b_class | tgi | 10.1| 0.465±0.315 | 7.200±3.600 | 10.207±0.228 |
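
The processed summary is written as a CSV under `benchmark_results/processed/` with exactly these column names, so it is easy to load programmatically. A short sketch, assuming the `model_name` of `llama_7b_class` from the example above:

```
import csv

# Path follows the run_benchmark.py convention: benchmark_results/processed/<model_name>.csv
path = "benchmark_results/processed/llama_7b_class.csv"

with open(path, newline="") as f:
    for row in csv.DictReader(f):
        print(row["model"], row["server"], row["rps"], row["latency_with_deviation"])
```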




## Manual deployment

### FastApi

For building FastApi application, do the following:

@@ -53,7 +208,9 @@ For building FastApi application, do the following:
python client.py --url http://localhost:8080/predict --prompt "Your custom prompt here"
```

## [Text Generation Inference](https://github.com/huggingface/text-generation-inference)
### [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)

#### How to merge the model

1. Install HuggingFace library:

@@ -75,7 +232,7 @@ For building FastApi application, do the following:
```
python merge_script.py --model_path /my/path --model_type causal --repo_id johndoe/new_model
```
5. Serve the model:
#### Serve the model with TGI:

```
model=meta-llama/Llama-2-7b-chat-hf
@@ -87,7 +244,7 @@ For building FastApi application, do the following:
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model
```

## [vLLm](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
### [vLLM](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)

1. Install the package:

7 changes: 7 additions & 0 deletions inference/automated_deployment/config.json
@@ -0,0 +1,7 @@
{
"server": "tgi",
"huggingface_repo": "NousResearch/Llama-2-7b-hf",
"huggingface_token": "",
"model_type": "llama",
"task": "classification"
}
17 changes: 17 additions & 0 deletions inference/automated_deployment/constants.py
@@ -0,0 +1,17 @@
BASE_DIR = "./benchmark_results"
PROCESSED_DIR = f"{BASE_DIR}/processed"
PLOTS_DIR = f"{BASE_DIR}/plots"
RAW_DIR = f"{BASE_DIR}/raw"
CONFIG_FILE_PATH = './config.json'

# Number of trailing characters to strip from a token: the unit suffix plus 1 for the trailing comma, e.g. "34.5ms," or "10m0s,"
MILLISECONDS_LENGTH = 3
MICROSECONDS_LENGTH = 3
SECONDS_LENGTH = 2
MINUTES_LENGTH = 4

NUMBER_OF_MS_IN_SECOND = 1000
NUMBER_OF_MICROSEC_IN_SECOND = 1000000
NUMBER_OF_SECONDS_IN_MINUTE = 60

TOO_MANY_REQUEST_ERROR = 429
11 changes: 11 additions & 0 deletions inference/automated_deployment/enum_types.py
@@ -0,0 +1,11 @@
from enum import Enum

class Server(Enum):
TGI = "tgi"
VLLM = "vllm"
RAY = "ray"
TRITON_VLLM = "triton_vllm"

class Task(Enum):
CLASSIFICATION = "classification"
SUMMARIZATION = "summarization"
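
# Example: Server("tgi") is Server.TGI, while an unknown value raises ValueError,
# which makes these enums a convenient way to validate fields read from config.json.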
File renamed without changes.
89 changes: 89 additions & 0 deletions inference/automated_deployment/process_benchmark_data.py
@@ -0,0 +1,89 @@
import csv
import os
import numpy as np
from enum_types import Server, Task
from constants import MICROSECONDS_LENGTH, MILLISECONDS_LENGTH, SECONDS_LENGTH, MINUTES_LENGTH
from constants import NUMBER_OF_MICROSEC_IN_SECOND, NUMBER_OF_MS_IN_SECOND, NUMBER_OF_SECONDS_IN_MINUTE
from constants import TOO_MANY_REQUEST_ERROR
import typer
from utils import load_json
from constants import CONFIG_FILE_PATH

def save_data_for_final_table(csv_file_path, data):
headers = ["model", "server", "rps", "latency_with_deviation", "throughput_with_deviation", "duration_with_deviation"]

write_header = not os.path.exists(csv_file_path) or os.path.getsize(csv_file_path) == 0

with open(csv_file_path, mode='a', newline='') as file:
writer = csv.writer(file)
if write_header:
writer.writerow(headers)
writer.writerow(data)
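
# Example conversions for convert_to_seconds below (tokens come from whitespace-split
# Vegeta output; the trailing comma is covered by the *_LENGTH constants):
#   "236.754ms," -> 0.236754    "10.137s," -> 10.137    "10m0s," -> 600.0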

def convert_to_seconds(time):
if 'ms' in time:
return float(time[:-MILLISECONDS_LENGTH]) / NUMBER_OF_MS_IN_SECOND
elif 'µs' in time:
return float(time[:-MICROSECONDS_LENGTH]) / NUMBER_OF_MICROSEC_IN_SECOND
elif 'm' in time and 's' in time:
return float(time[:-MINUTES_LENGTH]) * NUMBER_OF_SECONDS_IN_MINUTE
else:
return float(time[:-SECONDS_LENGTH])
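
# get_metrics scans the raw Vegeta report line by line; the pos_of_* indices below
# refer to tokens of whitespace-split summary lines such as:
#   Requests [total, rate, throughput] 100, 10.10, 9.87
#   Duration [total, attack, wait] 10.137s, 9.9s, 236.754ms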

def get_metrics(raw_results_path: str, processed_results_path: str):
rate = 0
config = load_json(CONFIG_FILE_PATH)
server = config["server"]
model = config["model_name"]
with open(raw_results_path) as f:
benchmark_logs = f.readlines()
result_dict = {}
max_total_request = 0
raws = [i.split() for i in benchmark_logs]
for raw in raws:
if len(raw) > 0:
            if raw[0] == str(TOO_MANY_REQUEST_ERROR):  # split tokens are strings, so compare against "429"
break
if raw[0] == 'Requests':
pos_of_total_request_value = 4
pos_of_throughput_value = 6
pos_of_rate_value = 5

total_request = int(raw[pos_of_total_request_value][:-1])
if result_dict.get(total_request) is None:
result_dict[total_request] = {'latency': [],
'throughput': [convert_to_seconds(raw[pos_of_throughput_value])],
'count': 1,
'duration': []}
else:
result_dict[total_request]['count'] += 1
result_dict[total_request]['throughput'].append(convert_to_seconds(raw[pos_of_throughput_value]))
max_total_request = total_request
rate = float(raw[pos_of_rate_value][:-1])
if raw[0] == 'Duration':
pos_of_duration_value = 4
result_dict[max_total_request]['duration'].append(convert_to_seconds(raw[pos_of_duration_value]))
if raw[0] == 'Latencies':
pos_of_latency_value = 11
result_dict[max_total_request]['latency'].append(convert_to_seconds(raw[pos_of_latency_value]))

keys_to_modify = ['latency', 'duration', 'throughput']

for num_req in result_dict.keys():
for key in keys_to_modify:
mean_value = np.mean(result_dict[num_req][key])
std_deviation = np.std(result_dict[num_req][key])

formatted_mean = "{:.3f}".format(mean_value)
formatted_std_dev = "{:.3f}".format(std_deviation)

result_dict[num_req][f"{key}_with_deviation"] = f"{formatted_mean}±{formatted_std_dev}"
result_dict[num_req][key] = mean_value

save_data_for_final_table(processed_results_path, [model, server, rate, result_dict[max_total_request]['latency_with_deviation'],
result_dict[max_total_request]['throughput_with_deviation'],
result_dict[max_total_request]['duration_with_deviation']])

if __name__ == '__main__':

typer.run(get_metrics)
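
# Typer turns the two get_metrics parameters into required positional CLI arguments,
# so this script is invoked (presumably from script_benchmark.sh) as:
#   python process_benchmark_data.py <raw_results_path> <processed_results_path>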
18 changes: 18 additions & 0 deletions inference/automated_deployment/requirements.txt
@@ -0,0 +1,18 @@
accelerate==0.21.0
bitsandbytes==0.39.0
datasets==2.12.0
evaluate==0.4.0
fastapi==0.104.1
numpy==1.24.3
pydantic==1.10.13
ray==2.8.0
starlette==0.27.0
tokenizers==0.14.1
torch==2.0.1
transformers==4.35.0
triton==2.0.0
tritonclient==2.39.0
typer==0.9.0
typing_extensions==4.8.0
uvicorn==0.23.2
vllm==0.2.1.post1
37 changes: 37 additions & 0 deletions inference/automated_deployment/run_benchmark.py
@@ -0,0 +1,37 @@
import subprocess
from pathlib import Path
import typer
import json
from utils import load_json
from constants import PROCESSED_DIR, RAW_DIR, CONFIG_FILE_PATH
from validation import validate_benchmark_config
from validation import ValidationError

def main():

config = load_json(CONFIG_FILE_PATH)
try:
validate_benchmark_config(config)
except ValidationError as e:
print(f"An error occurred: {e}")
else:

Path(PROCESSED_DIR).mkdir(parents=True, exist_ok=True)
Path(RAW_DIR).mkdir(parents=True, exist_ok=True)

server = config["server"]
model_name = config["model_name"]
raw_results_path = f"{RAW_DIR}/{model_name}_{server}.txt"
processed_results_path = f"{PROCESSED_DIR}/{model_name}.csv"

        # Stop any running containers first; shell=True is needed so that
        # $(docker ps -q) is expanded by the shell rather than passed literally.
        subprocess.run("docker stop $(docker ps -q)", shell=True)
subprocess.run(["chmod", "+x", f"./script_benchmark.sh"])
print("Running benchmark...")
subprocess.run([f"./script_benchmark.sh", raw_results_path, processed_results_path,
config['duration'], config['rate']])
print("Benchmark is finished.")
print(f"Raw results are saved at: {raw_results_path}")
print(f"Processed results are saved at: {processed_results_path}")

if __name__ == "__main__":
main()
