Commit

Merge pull request #62 from georgian-io/automated_deployment

Automated deployment

mariia-georgian authored Dec 18, 2023
2 parents 830d314 + f1a3e0b commit 6a6582d
Showing 131 changed files with 734 additions and 15,239 deletions.
171 changes: 164 additions & 7 deletions inference/README.md
@@ -1,16 +1,171 @@
# Deployment

In this section you can find the instructions on how to deploy your model using FastApi and Text Generation Inference.
In this section you can find the instructions on how to deploy your models using different inference servers.

## Prerequisites

### General

To follow these instructions you need:

- Docker installed
- Path of the folder with model weights
- HuggingFace account
- HuggingFace repository with a merged model (follow steps 1-4 from [How to merge the model](#how-to-merge-the-model))

Note: To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

## FastApi
### Load testing

- [Vegeta](https://github.com/tsenart/vegeta) installed (follow [this guide](https://geshan.com.np/blog/2020/09/vegeta-load-testing-primer-with-examples/) for installation)



## Automated deployment and benchmark

With automated deployment you can easily deploy Llama-2, RedPajama, Falcon, or Flan models and load test them with different numbers of requests.

Go to the <code>automated_deployment</code> folder.

```
cd automated_deployment
```

### Deployment

Before running inference, you will need to fill in the <code>config.json</code> file, which has the following default structure:

```
{
"server": "tgi",
"huggingface_repo": "NousResearch/Llama-2-7b-hf",
"huggingface_token": "",
"model_type": "llama",
"max_tokens": 20
}
```

#### server

Mappings for the possible servers you can deploy on:

| Server | Parameter name |
|-----------------|-----------------|
| vLLM | ```vllm``` |
| Text Generation Inference | ```tgi``` |
| Ray | ```ray``` |
| Triton Inference Server with vLLM backend | ```triton_vllm``` |


#### huggingface_token

Read/Write token for your HuggingFace account.

#### huggingface_repo

The HuggingFace model repository that stores the model files. Pass it in the format ```username/repo_name```.

#### max_tokens

Maximum number of tokens your model should generate (must be an integer).

#### model_type

Mappings for different model types.
| Model | Type |
|------------|---------|
| Flan-T5 | flan |
| Falcon-7B | falcon |
| RedPajama | red_pajama |
| Llama-2 | llama |
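
If you want to sanity-check the file before starting a server, a minimal sketch such as the one below can catch typos early. It assumes you run it from the `automated_deployment` folder (so that `config.json` and `enum_types.py` are available) and that the model-type names are exactly the ones listed above:

```
import json

from enum_types import Server  # allowed servers: tgi, vllm, ray, triton_vllm

# Assumption: these are the only model types the deployment scripts accept.
KNOWN_MODEL_TYPES = {"flan", "falcon", "red_pajama", "llama"}

with open("config.json") as f:
    config = json.load(f)

# Enum lookup by value raises ValueError for an unknown server name.
Server(config["server"])

if config["model_type"] not in KNOWN_MODEL_TYPES:
    raise ValueError(f"Unknown model_type: {config['model_type']}")

if not config["huggingface_token"]:
    print("Warning: huggingface_token is empty.")
```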


After modifying the fields according to your preferences, run the following command to start the server:

```
python run_inference.py
```



### Send a request to the server

Once the server has started, you can send a request.

1. Run the following command:

```
python send_post_request.py inference
```
2. You will then be asked to provide the input.

For example:

```
Input: Classify the following sentence that is delimited with triple backticks. ### Sentence:I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. ### Class:
```
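
Under the hood, the script sends your input as an HTTP POST to the running server. Purely as an illustration (the route and payload shape differ per server; the sketch below assumes a TGI-style server reachable on `localhost:8080` and the `requests` package being installed), such a request could look like:

```
import requests

# Assumptions: TGI-style /generate endpoint on port 8080; other servers
# (vLLM, Ray, Triton) expect different routes and payload formats.
url = "http://localhost:8080/generate"
payload = {
    "inputs": "Classify the following sentence ... ### Class:",
    "parameters": {"max_new_tokens": 20},
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```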

### Benchmark

If you want to find out what latency and throughput each server provides, you can run a benchmark using the [Vegeta](https://github.com/tsenart/vegeta) load-testing tool.

We currently support benchmarking for classification and summarization tasks.

Before running the benchmark, you will have to add a few more fields to `config.json`:
```
{
...
"task": "classification",
"model_name": "llama_7b_class",
"duration": "10s",
"rate": "10"
}
```
#### task

Specify the task your model was trained for: either ```classification``` or ```summarization```.

#### model_name

Text identifier of the model used in the summary table (can be anything).

#### duration and rate

Duration of the benchmark test. Each second, a fixed number of requests (the rate value) is sent. If the duration is `10s` and the rate is `20`, a total of `200` requests will be sent.

Usually, with a longer duration you will be able to send fewer requests per second without the server crashing.
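
For reference, the total number of requests is just the duration in seconds multiplied by the rate. A tiny sketch of the arithmetic, assuming the duration is always given in plain seconds such as `10s`:

```
import json

with open("config.json") as f:
    config = json.load(f)

# Assumes a "<N>s" duration; Vegeta itself also accepts other units.
duration_seconds = int(config["duration"].rstrip("s"))
rate = int(config["rate"])

print(f"Total requests: {duration_seconds * rate}")
```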

Once the server is started, run the benchmark command in a separate window:

```
python run_benchmark.py
```

The test runs twice for fairer results, and at the end all metrics are reported with their standard deviation.

<b> Raw data (Vegeta output for 1 test) </b>

```
Requests [total, rate, throughput] 100, 10.10, 9.87
Duration [total, attack, wait] 10.137s, 9.9s, 236.754ms
Latencies [min, mean, 50, 90, 95, 99, max] 227.567ms, 347.64ms, 325.601ms, 421.165ms, 424.789ms, 426.472ms, 426.884ms
Bytes In [total, mean] 3200, 32.00
Bytes Out [total, mean] 36900, 369.00
Success [ratio] 100.00%
Status Codes [code:count] 200:100
Error Set:
```

<b> Processed data (summary of results for 2 tests)</b>
| model | server | rps | latency_with_deviation | throughput_with_deviation | duration_with_deviation |
|----------------|--------|-----|-----------------------|---------------------------|-------------------------|
| llama_7b_class | tgi | 10.1| 0.465±0.315 | 7.200±3.600 | 10.207±0.228 |
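
The processed summary is written as a CSV under `benchmark_results/processed/` with exactly these column names, so it is easy to load programmatically. A short sketch, assuming the `model_name` of `llama_7b_class` from the example above:

```
import csv

# Path follows the run_benchmark.py convention: benchmark_results/processed/<model_name>.csv
path = "benchmark_results/processed/llama_7b_class.csv"

with open(path, newline="") as f:
    for row in csv.DictReader(f):
        print(row["model"], row["server"], row["rps"], row["latency_with_deviation"])
```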




## Manual deployment

### FastApi

For building FastApi application, do the following:

@@ -53,7 +208,9 @@ For building FastApi application, do the following:
python client.py --url http://localhost:8080/predict --prompt "Your custom prompt here"
```

## [Text Generation Inference](https://github.com/huggingface/text-generation-inference)
### [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)

#### How to merge the model

1. Install HuggingFace library:

@@ -75,7 +232,7 @@ For building FastApi application, do the following:
```
python merge_script.py --model_path /my/path --model_type causal --repo_id johndoe/new_model
```
5. Serve the model:
#### Serve the model with TGI:

```
model=meta-llama/Llama-2-7b-chat-hf
@@ -87,7 +244,7 @@ For building FastApi application, do the following:
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model
```

## [vLLm](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
### [vLLM](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)

1. Install the package:

7 changes: 7 additions & 0 deletions inference/automated_deployment/config.json
@@ -0,0 +1,7 @@
{
"server": "tgi",
"huggingface_repo": "NousResearch/Llama-2-7b-hf",
"huggingface_token": "",
"model_type": "llama",
"task": "classification"
}
17 changes: 17 additions & 0 deletions inference/automated_deployment/constants.py
@@ -0,0 +1,17 @@
BASE_DIR = "./benchmark_results"
PROCESSED_DIR = f"{BASE_DIR}/processed"
PLOTS_DIR = f"{BASE_DIR}/plots"
RAW_DIR = f"{BASE_DIR}/raw"
CONFIG_FILE_PATH = './config.json'

# Number of trailing characters to strip from a token: the unit suffix plus 1 for the trailing comma, e.g. "34.5ms," or "10m0s,"
MILLISECONDS_LENGTH = 3
MICROSECONDS_LENGTH = 3
SECONDS_LENGTH = 2
MINUTES_LENGTH = 4

NUMBER_OF_MS_IN_SECOND = 1000
NUMBER_OF_MICROSEC_IN_SECOND = 1000000
NUMBER_OF_SECONDS_IN_MINUTE = 60

TOO_MANY_REQUEST_ERROR = 429
11 changes: 11 additions & 0 deletions inference/automated_deployment/enum_types.py
@@ -0,0 +1,11 @@
from enum import Enum

class Server(Enum):
TGI = "tgi"
VLLM = "vllm"
RAY = "ray"
TRITON_VLLM = "triton_vllm"

class Task(Enum):
CLASSIFICATION = "classification"
SUMMARIZATION = "summarization"
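
# Example: Server("tgi") is Server.TGI, while an unknown value raises ValueError,
# which makes these enums a convenient way to validate fields read from config.json.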
File renamed without changes.
89 changes: 89 additions & 0 deletions inference/automated_deployment/process_benchmark_data.py
@@ -0,0 +1,89 @@
import csv
import os
import numpy as np
from enum_types import Server, Task
from constants import MICROSECONDS_LENGTH, MILLISECONDS_LENGTH, SECONDS_LENGTH, MINUTES_LENGTH
from constants import NUMBER_OF_MICROSEC_IN_SECOND, NUMBER_OF_MS_IN_SECOND, NUMBER_OF_SECONDS_IN_MINUTE
from constants import TOO_MANY_REQUEST_ERROR
import typer
from utils import load_json
from constants import CONFIG_FILE_PATH

def save_data_for_final_table(csv_file_path, data):
headers = ["model", "server", "rps", "latency_with_deviation", "throughput_with_deviation", "duration_with_deviation"]

write_header = not os.path.exists(csv_file_path) or os.path.getsize(csv_file_path) == 0

with open(csv_file_path, mode='a', newline='') as file:
writer = csv.writer(file)
if write_header:
writer.writerow(headers)
writer.writerow(data)
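
# Example conversions for convert_to_seconds below (tokens come from whitespace-split
# Vegeta output; the trailing comma is covered by the *_LENGTH constants):
#   "236.754ms," -> 0.236754    "10.137s," -> 10.137    "10m0s," -> 600.0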

def convert_to_seconds(time):
if 'ms' in time:
return float(time[:-MILLISECONDS_LENGTH]) / NUMBER_OF_MS_IN_SECOND
elif 'µs' in time:
return float(time[:-MICROSECONDS_LENGTH]) / NUMBER_OF_MICROSEC_IN_SECOND
elif 'm' in time and 's' in time:
return float(time[:-MINUTES_LENGTH]) * NUMBER_OF_SECONDS_IN_MINUTE
else:
return float(time[:-SECONDS_LENGTH])
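
# get_metrics scans the raw Vegeta report line by line; the pos_of_* indices below
# refer to tokens of whitespace-split summary lines such as:
#   Requests [total, rate, throughput] 100, 10.10, 9.87
#   Duration [total, attack, wait] 10.137s, 9.9s, 236.754ms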

def get_metrics(raw_results_path: str, processed_results_path: str):
rate = 0
config = load_json(CONFIG_FILE_PATH)
server = config["server"]
model = config["model_name"]
with open(raw_results_path) as f:
benchmark_logs = f.readlines()
result_dict = {}
max_total_request = 0
raws = [i.split() for i in benchmark_logs]
for raw in raws:
if len(raw) > 0:
            if raw[0] == str(TOO_MANY_REQUEST_ERROR):  # split tokens are strings, so compare against "429"
break
if raw[0] == 'Requests':
pos_of_total_request_value = 4
pos_of_throughput_value = 6
pos_of_rate_value = 5

total_request = int(raw[pos_of_total_request_value][:-1])
if result_dict.get(total_request) is None:
result_dict[total_request] = {'latency': [],
'throughput': [convert_to_seconds(raw[pos_of_throughput_value])],
'count': 1,
'duration': []}
else:
result_dict[total_request]['count'] += 1
result_dict[total_request]['throughput'].append(convert_to_seconds(raw[pos_of_throughput_value]))
max_total_request = total_request
rate = float(raw[pos_of_rate_value][:-1])
if raw[0] == 'Duration':
pos_of_duration_value = 4
result_dict[max_total_request]['duration'].append(convert_to_seconds(raw[pos_of_duration_value]))
if raw[0] == 'Latencies':
pos_of_latency_value = 11
result_dict[max_total_request]['latency'].append(convert_to_seconds(raw[pos_of_latency_value]))

keys_to_modify = ['latency', 'duration', 'throughput']

for num_req in result_dict.keys():
for key in keys_to_modify:
mean_value = np.mean(result_dict[num_req][key])
std_deviation = np.std(result_dict[num_req][key])

formatted_mean = "{:.3f}".format(mean_value)
formatted_std_dev = "{:.3f}".format(std_deviation)

result_dict[num_req][f"{key}_with_deviation"] = f"{formatted_mean}±{formatted_std_dev}"
result_dict[num_req][key] = mean_value

save_data_for_final_table(processed_results_path, [model, server, rate, result_dict[max_total_request]['latency_with_deviation'],
result_dict[max_total_request]['throughput_with_deviation'],
result_dict[max_total_request]['duration_with_deviation']])

if __name__ == '__main__':

typer.run(get_metrics)
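
# Typer turns the two get_metrics parameters into required positional CLI arguments,
# so this script is invoked (presumably from script_benchmark.sh) as:
#   python process_benchmark_data.py <raw_results_path> <processed_results_path>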
18 changes: 18 additions & 0 deletions inference/automated_deployment/requirements.txt
@@ -0,0 +1,18 @@
accelerate==0.21.0
bitsandbytes==0.39.0
datasets==2.12.0
evaluate==0.4.0
fastapi==0.104.1
numpy==1.24.3
pydantic==1.10.13
ray==2.8.0
starlette==0.27.0
tokenizers==0.14.1
torch==2.0.1
transformers==4.35.0
triton==2.0.0
tritonclient==2.39.0
typer==0.9.0
typing_extensions==4.8.0
uvicorn==0.23.2
vllm==0.2.1.post1
37 changes: 37 additions & 0 deletions inference/automated_deployment/run_benchmark.py
@@ -0,0 +1,37 @@
import subprocess
from pathlib import Path
import typer
import json
from utils import load_json
from constants import PROCESSED_DIR, RAW_DIR, CONFIG_FILE_PATH
from validation import validate_benchmark_config
from validation import ValidationError

def main():

config = load_json(CONFIG_FILE_PATH)
try:
validate_benchmark_config(config)
except ValidationError as e:
print(f"An error occurred: {e}")
else:

Path(PROCESSED_DIR).mkdir(parents=True, exist_ok=True)
Path(RAW_DIR).mkdir(parents=True, exist_ok=True)

server = config["server"]
model_name = config["model_name"]
raw_results_path = f"{RAW_DIR}/{model_name}_{server}.txt"
processed_results_path = f"{PROCESSED_DIR}/{model_name}.csv"

        # Stop any running containers first; shell=True is needed so that
        # $(docker ps -q) is expanded by the shell rather than passed literally.
        subprocess.run("docker stop $(docker ps -q)", shell=True)
subprocess.run(["chmod", "+x", f"./script_benchmark.sh"])
print("Running benchmark...")
subprocess.run([f"./script_benchmark.sh", raw_results_path, processed_results_path,
config['duration'], config['rate']])
print("Benchmark is finished.")
print(f"Raw results are saved at: {raw_results_path}")
print(f"Processed results are saved at: {processed_results_path}")

if __name__ == "__main__":
main()
