Skip to content

Commit

Permalink
Add performance benchmark scripts for 4 use cases. (#1052)
Browse files Browse the repository at this point in the history
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Arthur Leung <[email protected]>
  • Loading branch information
2 people authored and Arthur Leung committed Nov 4, 2024
1 parent 4181cb2 commit 478ed6d
Show file tree
Hide file tree
Showing 12 changed files with 892 additions and 0 deletions.
77 changes: 77 additions & 0 deletions CodeGen/benchmark/performance/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
# CodeGen Benchmarking

This folder contains a collection of scripts to enable inference benchmarking by leveraging a comprehensive benchmarking tool, [GenAIEval](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/README.md), that enables throughput analysis to assess inference performance.

By following this guide, you can run benchmarks on your deployment and share the results with the OPEA community.

## Purpose

We aim to run these benchmarks and share them with the OPEA community for three primary reasons:

- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading llms, serving frameworks etc.

## Metrics

The benchmark will report the below metrics, including:

- Number of Concurrent Requests
- End-to-End Latency: P50, P90, P99 (in milliseconds)
- End-to-End First Token Latency: P50, P90, P99 (in milliseconds)
- Average Next Token Latency (in milliseconds)
- Average Token Latency (in milliseconds)
- Requests Per Second (RPS)
- Output Tokens Per Second
- Input Tokens Per Second

Results will be displayed in the terminal and saved as CSV file named `1_testspec.yaml`.

## Getting Started

We recommend using Kubernetes to deploy the CodeGen service, as it offers benefits such as load balancing and improved scalability. However, you can also deploy the service using Docker if that better suits your needs.

### Prerequisites

- Install Kubernetes by following [this guide](https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubespray.md).

- Every node has direct internet access
- Set up kubectl on the master node with access to the Kubernetes cluster.
- Install Python 3.8+ on the master node for running GenAIEval.
- Ensure all nodes have a local /mnt/models folder, which will be mounted by the pods.
- Ensure that the container's ulimit can meet the the number of requests.

```bash
# The way to modify the containered ulimit:
sudo systemctl edit containerd
# Add two lines:
[Service]
LimitNOFILE=65536:1048576

sudo systemctl daemon-reload; sudo systemctl restart containerd
```

### Test Steps

Please deploy CodeGen service before benchmarking.

##### Run Benchmark Test

Before the benchmark, we can configure the number of test queries and test output directory by:

```bash
export USER_QUERIES="[128, 128, 128, 128]"
export TEST_OUTPUT_DIR="/tmp/benchmark_output"
```

And then run the benchmark by:

```bash
bash benchmark.sh -n <node_count>
```

The argument `-n` refers to the number of test nodes.

##### 4. Data collection

All the test results will come to this folder `/tmp/benchmark_output` configured by the environment variable `TEST_OUTPUT_DIR` in previous steps.
99 changes: 99 additions & 0 deletions CodeGen/benchmark/performance/benchmark.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
#!/bin/bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

deployment_type="k8s"
node_number=1
service_port=7778
query_per_node=128

benchmark_tool_path="$(pwd)/GenAIEval"

usage() {
echo "Usage: $0 [-d deployment_type] [-n node_number] [-i service_ip] [-p service_port]"
echo " -d deployment_type deployment type, select between k8s and docker (default: ${deployment_type})"
echo " -n node_number Test node number, required only for k8s deployment_type, (default: ${node_number})"
echo " -i service_ip service ip, required only for docker deployment_type"
echo " -p service_port service port, required only for docker deployment_type, (default: ${service_port})"
exit 1
}

while getopts ":d:n:i:p:" opt; do
case ${opt} in
d )
deployment_type=$OPTARG
;;
n )
node_number=$OPTARG
;;
i )
service_ip=$OPTARG
;;
p )
service_port=$OPTARG
;;
\? )
echo "Invalid option: -$OPTARG" 1>&2
usage
;;
: )
echo "Invalid option: -$OPTARG requires an argument" 1>&2
usage
;;
esac
done

if [[ "$deployment_type" == "docker" && -z "$service_ip" ]]; then
echo "Error: service_ip is required for docker deployment_type" 1>&2
usage
fi

if [[ "$deployment_type" == "k8s" && ( -n "$service_ip" || -n "$service_port" ) ]]; then
echo "Warning: service_ip and service_port are ignored for k8s deployment_type" 1>&2
fi

function main() {
if [[ ! -d ${benchmark_tool_path} ]]; then
echo "Benchmark tool not found, setting up..."
setup_env
fi
run_benchmark
}

function setup_env() {
git clone https://github.com/opea-project/GenAIEval.git
pushd ${benchmark_tool_path}
python3 -m venv stress_venv
source stress_venv/bin/activate
pip install -r requirements.txt
popd
}

function run_benchmark() {
source ${benchmark_tool_path}/stress_venv/bin/activate
export DEPLOYMENT_TYPE=${deployment_type}
export SERVICE_IP=${service_ip:-"None"}
export SERVICE_PORT=${service_port:-"None"}
if [[ -z $USER_QUERIES ]]; then
user_query=$((query_per_node*node_number))
export USER_QUERIES="[${user_query}, ${user_query}, ${user_query}, ${user_query}]"
echo "USER_QUERIES not configured, setting to: ${USER_QUERIES}."
fi
export WARMUP=$(echo $USER_QUERIES | sed -e 's/[][]//g' -e 's/,.*//')
if [[ -z $WARMUP ]]; then export WARMUP=0; fi
if [[ -z $TEST_OUTPUT_DIR ]]; then
if [[ $DEPLOYMENT_TYPE == "k8s" ]]; then
export TEST_OUTPUT_DIR="${benchmark_tool_path}/evals/benchmark/benchmark_output/node_${node_number}"
else
export TEST_OUTPUT_DIR="${benchmark_tool_path}/evals/benchmark/benchmark_output/docker"
fi
echo "TEST_OUTPUT_DIR not configured, setting to: ${TEST_OUTPUT_DIR}."
fi

envsubst < ./benchmark.yaml > ${benchmark_tool_path}/evals/benchmark/benchmark.yaml
cd ${benchmark_tool_path}/evals/benchmark
python benchmark.py
}

main
47 changes: 47 additions & 0 deletions CodeGen/benchmark/performance/benchmark.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

test_suite_config: # Overall configuration settings for the test suite
examples: ["chatqna"] # The specific test cases being tested, e.g., chatqna, codegen, codetrans, faqgen, audioqna, visualqna
deployment_type: "k8s" # Default is "k8s", can also be "docker"
service_ip: None # Leave as None for k8s, specify for Docker
service_port: None # Leave as None for k8s, specify for Docker
warm_ups: 0 # Number of test requests for warm-up
run_time: 60m # The max total run time for the test suite
seed: # The seed for all RNGs
user_queries: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048] # Number of test requests at each concurrency level
query_timeout: 120 # Number of seconds to wait for a simulated user to complete any executing task before exiting. 120 sec by defeult.
random_prompt: false # Use random prompts if true, fixed prompts if false
collect_service_metric: false # Collect service metrics if true, do not collect service metrics if false
data_visualization: false # Generate data visualization if true, do not generate data visualization if false
llm_model: "Qwen/CodeQwen1.5-7B-Chat" # The LLM model used for the test
test_output_dir: "/tmp/benchmark_output" # The directory to store the test output
load_shape: # Tenant concurrency pattern
name: constant # poisson or constant(locust default load shape)
params: # Loadshape-specific parameters
constant: # Constant load shape specific parameters, activate only if load_shape.name is constant
concurrent_level: 4 # If user_queries is specified, concurrent_level is target number of requests per user. If not, it is the number of simulated users
# arrival_rate: 1.0 # Request arrival rate. If set, concurrent_level will be overridden, constant load will be generated based on arrival-rate
poisson: # Poisson load shape specific parameters, activate only if load_shape.name is poisson
arrival_rate: 1.0 # Request arrival rate
namespace: "" # Fill the user-defined namespace. Otherwise, it will be default.

test_cases:
codegen:
llm:
run_test: true
service_name: "llm-dependency-svc" # Replace with your service name
parameters:
model_name: "Qwen/CodeQwen1.5-7B-Chat"
max_new_tokens: 128
temperature: 0.01
top_k: 10
top_p: 0.95
repetition_penalty: 1.03
streaming: true
llmserve:
run_test: true
service_name: "llm-svc" # Replace with your service name
e2e:
run_test: true
service_name: "codegen-backend-svc" # Replace with your service name
77 changes: 77 additions & 0 deletions CodeTrans/benchmark/performance/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
# CodeTrans Benchmarking

This folder contains a collection of scripts to enable inference benchmarking by leveraging a comprehensive benchmarking tool, [GenAIEval](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/README.md), that enables throughput analysis to assess inference performance.

By following this guide, you can run benchmarks on your deployment and share the results with the OPEA community.

## Purpose

We aim to run these benchmarks and share them with the OPEA community for three primary reasons:

- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading llms, serving frameworks etc.

## Metrics

The benchmark will report the below metrics, including:

- Number of Concurrent Requests
- End-to-End Latency: P50, P90, P99 (in milliseconds)
- End-to-End First Token Latency: P50, P90, P99 (in milliseconds)
- Average Next Token Latency (in milliseconds)
- Average Token Latency (in milliseconds)
- Requests Per Second (RPS)
- Output Tokens Per Second
- Input Tokens Per Second

Results will be displayed in the terminal and saved as CSV file named `1_testspec.yaml`.

## Getting Started

We recommend using Kubernetes to deploy the CodeTrans service, as it offers benefits such as load balancing and improved scalability. However, you can also deploy the service using Docker if that better suits your needs.

### Prerequisites

- Install Kubernetes by following [this guide](https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubespray.md).

- Every node has direct internet access
- Set up kubectl on the master node with access to the Kubernetes cluster.
- Install Python 3.8+ on the master node for running GenAIEval.
- Ensure all nodes have a local /mnt/models folder, which will be mounted by the pods.
- Ensure that the container's ulimit can meet the the number of requests.

```bash
# The way to modify the containered ulimit:
sudo systemctl edit containerd
# Add two lines:
[Service]
LimitNOFILE=65536:1048576

sudo systemctl daemon-reload; sudo systemctl restart containerd
```

### Test Steps

Please deploy CodeTrans service before benchmarking.

##### Run Benchmark Test

Before the benchmark, we can configure the number of test queries and test output directory by:

```bash
export USER_QUERIES="[1, 1, 1, 1]"
export TEST_OUTPUT_DIR="/tmp/benchmark_output"
```

And then run the benchmark by:

```bash
bash benchmark.sh -n <node_count>
```

The argument `-n` refers to the number of test nodes.

##### 4. Data collection

All the test results will come to this folder `/tmp/benchmark_output` configured by the environment variable `TEST_OUTPUT_DIR` in previous steps.
Loading

0 comments on commit 478ed6d

Please sign in to comment.