Unify benchmark tool based on stresscli library (#66)
* Unify benchmark tool based on stresscli
* update code
* add locust template files
* fix issue
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix stresscli config path issue
* update code
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* remove hardcode in aistress.py
* add readme
* update document
* Support streaming response for getting correct first token latency and input/output token counts
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* add input token number in output format
* update config.ini with input and output token number

Signed-off-by: lvliang-intel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent ebee50c · commit 71637c0 · 26 changed files with 1,474 additions and 1,340 deletions.
@@ -1,71 +1,96 @@
-# Stress Test Script
+# OPEA Benchmark Tool

+## Introduction
+This tool provides a microservices benchmarking framework that uses YAML configurations to define test cases for different services. It executes these tests using `stresscli`, built on top of [locust](https://github.com/locustio/locust), a performance/load testing tool for HTTP and other protocols, and logs the results for performance analysis and data visualization.

-This script is a load testing tool designed to simulate high-concurrency scenarios for a given server. It supports multiple task types and models, allowing users to evaluate the performance and stability of different configurations under heavy load.
+## Features

-## Prerequisites
+- **Services load testing**: Simulates high concurrency levels to test services like LLM, reranking, ASR, E2E, and more.
+- **YAML-based configuration**: Define test cases, service endpoints, and testing parameters in YAML.
+- **Service metrics collection**: Optionally collect service metrics for detailed performance analysis.
+- **Flexible testing**: Supports various test cases like chatqna, codegen, codetrans, faqgen, audioqna, and visualqna.
+- **Data analysis and visualization**: After tests are executed, results can be analyzed and visualized to gain insights into the performance and behavior of each service. Performance trends, bottlenecks, and other key metrics are highlighted for decision-making.

-- Python 3.8+
-- Required Python packages:
-  - argparse
-  - requests
-  - transformers
+## Table of Contents

-## Installation

-1. Clone the repository or download the script to your local machine.
-2. Install the required Python packages using `pip`:

-```sh
-pip install argparse requests transformers
-```
+- [Installation](#installation)
+- [Usage](#usage)
+- [Configuration](#configuration)
+  - [Test Suite Configuration](#test-suite-configuration)
+  - [Test Cases](#test-cases)

-## Usage

-The script can be executed with various command-line arguments to customize the test. Here is a breakdown of the available options:
+## Installation

-- `-f`: The file path containing the list of questions to be used for the test. If not provided, a default question will be used.
-- `-s`: The server address in the format `host:port`. Default is `localhost:8080`.
-- `-c`: The number of concurrent workers. Default is 20.
-- `-d`: The duration for which the test should run. This can be specified in seconds (e.g., `30s`), minutes (e.g., `10m`), or hours (e.g., `1h`). Default is `1h`.
-- `-u`: The delay time before each worker starts, specified in seconds (e.g., `2s`). Default is `1s`.
-- `-t`: The task type to be tested. Options are `chatqna`, `openai`, `tgi`, `llm`, `tei_embedding`, `embedding`, `retrieval`, `tei_rerank` or `reranking`. Default is `chatqna`.
-- `-m`: The model to be used. Default is `Intel/neural-chat-7b-v3-3`.
-- `-z`: The maximum number of tokens for the model. Default is 1024.
+### Prerequisites

-### Example Commands
+- Python 3.x
+- Install the required Python packages:

 ```bash
-python stress_benchmark.py -f data.txt -s localhost:8888 -c 50 -d 30m -t chatqna
+pip install -r ../../requirements.txt
 ```

-### Running the Test
+## Usage

-To start the test, execute the script with the desired options. The script will:
+1. Define the test cases and configurations in the `benchmark.yaml` file.

-1. Initialize the question pool from the provided file or use the default question.
-2. Start a specified number of worker threads.
-3. Each worker will repeatedly send requests to the server and collect response data.
-4. Results will be written to a CSV file.
+2. Run the benchmark script:

-### Output
+```bash
+python benchmark.py
+```
-The results will be stored in a CSV file with the following columns:
+The results will be stored in the directory specified by `test_output_dir` in the configuration.

-- `question_len`: The length of the input question in tokens.
-- `answer_len`: The length of the response in tokens.
-- `first_chunk`: The time taken to receive the first chunk of the response.
-- `overall`: The total time taken for the request to complete.
-- `err`: Any error that occurred during the request.
-- `code`: The HTTP status code of the response.

-## Notes
+## Configuration

-- Ensure the server address is correctly specified and accessible.
-- Adjust the concurrency level (`-c`) and duration (`-d`) based on the capacity of your server and the goals of your test.
-- Monitor the server's performance and logs to identify any potential issues during the load test.
+The `benchmark.yaml` file defines the test suite and individual test cases. Below are the primary sections:

-## Logging
+### Test Suite Configuration

-The script logs detailed information about each request and any errors encountered. The logs can be useful for diagnosing issues and understanding the behavior of the server under load.
+```yaml
+test_suite_config:
+  examples: ["chatqna"] # Test cases to be run (e.g., chatqna, codegen)
+  concurrent_level: 4 # The concurrency level
+  user_queries: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048] # Number of test requests
+  random_prompt: false # Use random prompts if true, fixed prompts if false
+  run_time: 60m # Total runtime for the test suite
+  collect_service_metric: false # Enable service metrics collection
+  data_visualization: false # Enable data visualization
+  test_output_dir: "/home/sdp/benchmark_output" # Directory for test outputs
+```
+
+### Test Cases
+
+Each test case includes multiple services, each of which can be toggled on/off using the `run_test` flag. You can also change specific parameters for each service for performance tuning.

+Example test case configuration for `chatqna`:
+```yaml
+test_cases:
+  chatqna:
+    embedding:
+      run_test: false
+      service_name: "embedding-svc"
+    retriever:
+      run_test: false
+      service_name: "retriever-svc"
+      parameters:
+        search_type: "similarity"
+        k: 4
+        fetch_k: 20
+        lambda_mult: 0.5
+        score_threshold: 0.2
+    llm:
+      run_test: false
+      service_name: "llm-svc"
+      parameters:
+        model_name: "Intel/neural-chat-7b-v3-3"
+        max_new_tokens: 128
+        temperature: 0.01
+        streaming: true
+    e2e:
+      run_test: true
+      service_name: "chatqna-backend-server-svc"
+```
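
The benchmark driver added by this commit (the `benchmark.py` script that the Usage section runs, shown in the next hunk) turns every entry of `user_queries` into one Locust run and derives the concurrent user count for that run from `concurrent_level`. A minimal sketch of that relationship, using the example values from the configuration above (illustrative only, not part of the committed files):

```python
# Sketch: how each user_queries entry becomes a Locust run
# (mirrors create_and_save_run_yaml in the driver script below).
concurrent_level = 4
user_queries = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]

for queries in user_queries:
    users = max(1, queries // concurrent_level)  # concurrent Locust users for this run
    print(f"queries={queries:5d} -> users={users:4d}, max-request={queries}")
```

For example, the 2048-query entry above runs 512 concurrent users until 2048 requests have been issued.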
@@ -0,0 +1,175 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os
from datetime import datetime

import yaml
from stresscli.commands.load_test import locust_runtests
from utils import get_service_cluster_ip, load_yaml

service_endpoints = {
    "chatqna": {
        "embedding": "/v1/embeddings",
        "embedding_serving": "/v1/embeddings",
        "retriever": "/v1/retrieval",
        "reranking": "/v1/reranking",
        "reranking_serving": "/rerank",
        "llm": "/v1/chat/completions",
        "llm_serving": "/v1/chat/completions",
        "e2e": "/v1/chatqna",
    },
    "codegen": {"llm": "/v1/chat/completions", "llm_serving": "/v1/chat/completions", "e2e": "/v1/codegen"},
    "codetrans": {"llm": "/v1/chat/completions", "llm_serving": "/v1/chat/completions", "e2e": "/v1/codetrans"},
    "faqgen": {"llm": "/v1/chat/completions", "llm_serving": "/v1/chat/completions", "e2e": "/v1/faqgen"},
    "audioqna": {
        "asr": "/v1/audio/transcriptions",
        "llm": "/v1/chat/completions",
        "llm_serving": "/v1/chat/completions",
        "tts": "/v1/audio/speech",
        "e2e": "/v1/audioqna",
    },
    "visualqna": {"lvm": "/v1/chat/completions", "lvm_serving": "/v1/chat/completions", "e2e": "/v1/visualqna"},
}
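# Example of how these endpoints are used (see run_service_test below): for the
# "chatqna" e2e case, a service resolved to a hypothetical cluster IP 10.96.0.15
# and port 8888 is benchmarked against http://10.96.0.15:8888/v1/chatqna.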


def extract_test_case_data(content):
    """Extract relevant data from the YAML based on the specified test cases."""
    # Extract test suite configuration
    test_suite_config = content.get("test_suite_config", {})

    return {
        "examples": test_suite_config.get("examples", []),
        "concurrent_level": test_suite_config.get("concurrent_level"),
        "user_queries": test_suite_config.get("user_queries", []),
        "random_prompt": test_suite_config.get("random_prompt"),
        "test_output_dir": test_suite_config.get("test_output_dir"),
        "run_time": test_suite_config.get("run_time"),
        "collect_service_metric": test_suite_config.get("collect_service_metric"),
        "llm_model": test_suite_config.get("llm_model"),
        "all_case_data": {
            example: content["test_cases"].get(example, {}) for example in test_suite_config.get("examples", [])
        },
    }
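

# For the benchmark.yaml shown in the README of this commit, extract_test_case_data
# returns roughly:
#   {"examples": ["chatqna"], "concurrent_level": 4,
#    "user_queries": [1, 2, 4, ..., 2048], "random_prompt": False,
#    "test_output_dir": "/home/sdp/benchmark_output", "run_time": "60m",
#    "collect_service_metric": False, "llm_model": None,
#    "all_case_data": {"chatqna": {...}}}
# (llm_model is None because the sample config does not set it.)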


def create_run_yaml_content(service_name, base_url, bench_target, concurrency, user_queries, test_suite_config):
    """Create content for the run.yaml file."""
    return {
        "profile": {
            "storage": {"hostpath": test_suite_config["test_output_dir"]},
            "global-settings": {
                "tool": "locust",
                "locustfile": os.path.join(os.getcwd(), "stresscli/locust/aistress.py"),
                "host": base_url,
                "stop-timeout": 120,
                "processes": 2,
                "namespace": "default",
                "bench-target": bench_target,
                "run-time": test_suite_config["run_time"],
                "service-metric-collect": test_suite_config["collect_service_metric"],
                "llm-model": test_suite_config["llm_model"],
            },
            "runs": [{"name": "benchmark", "users": concurrency, "max-request": user_queries}],
        }
    }
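

# yaml.dump of the structure above yields a stresscli profile roughly like:
#   profile:
#     global-settings:
#       bench-target: chatqnafixed
#       host: http://<svc-ip>:<port>
#       locustfile: <cwd>/stresscli/locust/aistress.py
#       run-time: 60m
#       ...
#     runs:
#     - max-request: 128
#       name: benchmark
#       users: 32
#     storage:
#       hostpath: /home/sdp/benchmark_output
# (keys are alphabetized by yaml.dump; the values shown are illustrative.)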


def create_and_save_run_yaml(example, service_type, service_name, base_url, test_suite_config, index):
    """Create and save the run.yaml file for the service being tested."""
    os.makedirs(test_suite_config["test_output_dir"], exist_ok=True)

    run_yaml_paths = []
    for user_queries in test_suite_config["user_queries"]:
        concurrency = max(1, user_queries // test_suite_config["concurrent_level"])

        bench_target = (
            f"{example}{'bench' if service_type == 'e2e' and test_suite_config['random_prompt'] else 'fixed'}"
        )
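        # bench_target selects the stresscli/locust task set: for example "chatqna",
        # it is "chatqnabench" only for an e2e test with random_prompt enabled, and
        # "chatqnafixed" for every other combination (sub-services or fixed prompts).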
        run_yaml_content = create_run_yaml_content(
            service_name, base_url, bench_target, concurrency, user_queries, test_suite_config
        )

        run_yaml_path = os.path.join(
            test_suite_config["test_output_dir"], f"run_{service_name}_{index}_users_{user_queries}.yaml"
        )
        with open(run_yaml_path, "w") as yaml_file:
            yaml.dump(run_yaml_content, yaml_file)

        run_yaml_paths.append(run_yaml_path)

    return run_yaml_paths


def run_service_test(example, service_type, service_name, parameters, test_suite_config):
    svc_ip, port = get_service_cluster_ip(service_name)
    base_url = f"http://{svc_ip}:{port}"
    endpoint = service_endpoints[example][service_type]
    url = f"{base_url}{endpoint}"
    print(f"[OPEA BENCHMARK] 🚀 Running test for {service_name} at {url}")

    # Generate a unique index based on the current time
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Create the run.yaml for the service
    run_yaml_paths = create_and_save_run_yaml(
        example, service_type, service_name, base_url, test_suite_config, timestamp
    )

    # Run the test using the locust_runtests function
    for index, run_yaml_path in enumerate(run_yaml_paths, start=1):
        print(f"[OPEA BENCHMARK] 🚀 Running test {index} with run yaml: {run_yaml_path}...")
        locust_runtests(None, run_yaml_path)

    print(f"[OPEA BENCHMARK] 🚀 Test completed for {service_name} at {url}")


def process_service(example, service_name, case_data, test_suite_config):
    service = case_data.get(service_name)
    if service and service.get("run_test"):
        print(f"[OPEA BENCHMARK] 🚀 Example: {example} Service: {service.get('service_name')}, Running test...")
        run_service_test(
            example, service_name, service.get("service_name"), service.get("parameters", {}), test_suite_config
        )

if __name__ == "__main__":
    # Load test suite configuration
    yaml_content = load_yaml("./benchmark.yaml")
    # Extract data
    parsed_data = extract_test_case_data(yaml_content)
    test_suite_config = {
        "concurrent_level": parsed_data["concurrent_level"],
        "user_queries": parsed_data["user_queries"],
        "random_prompt": parsed_data["random_prompt"],
        "run_time": parsed_data["run_time"],
        "collect_service_metric": parsed_data["collect_service_metric"],
        "llm_model": parsed_data["llm_model"],
        "test_output_dir": parsed_data["test_output_dir"],
    }

    # Mapping of example names to service types
    example_service_map = {
        "chatqna": [
            "embedding",
            "embedding_serving",
            "retriever",
            "reranking",
            "reranking_serving",
            "llm",
            "llm_serving",
            "e2e",
        ],
        "codegen": ["llm", "llm_serving", "e2e"],
        "codetrans": ["llm", "llm_serving", "e2e"],
        "faqgen": ["llm", "llm_serving", "e2e"],
        "audioqna": ["asr", "llm", "llm_serving", "tts", "e2e"],
        "visualqna": ["lvm", "lvm_serving", "e2e"],
    }

    # Process each example's services
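    # For instance, with examples: ["chatqna"] the loops below visit all eight
    # chatqna service types and call process_service for each; only entries whose
    # run_test flag is true in benchmark.yaml actually trigger a benchmark run.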
    for example in parsed_data["examples"]:
        case_data = parsed_data["all_case_data"].get(example, {})
        service_types = example_service_map.get(example, [])
        for service_type in service_types:
            process_service(example, service_type, case_data, test_suite_config)
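
The script imports `get_service_cluster_ip` and `load_yaml` from a local `utils` module that is not shown in this commit view. Purely as an assumption to make the listing self-contained, they plausibly look something like the sketch below; the committed `utils.py` may resolve the service address differently (for example via the Kubernetes Python client):

```python
# Hypothetical sketch of the helpers benchmark.py imports from utils.py.
# These are assumptions for illustration; the committed utils.py may differ.
import subprocess

import yaml


def load_yaml(file_path):
    """Load a YAML file into a Python dict."""
    with open(file_path) as f:
        return yaml.safe_load(f)


def get_service_cluster_ip(service_name, namespace="default"):
    """Resolve a Kubernetes Service's cluster IP and first port via kubectl."""
    ip = subprocess.check_output(
        ["kubectl", "get", "svc", service_name, "-n", namespace,
         "-o", "jsonpath={.spec.clusterIP}"],
        text=True,
    ).strip()
    port = subprocess.check_output(
        ["kubectl", "get", "svc", service_name, "-n", namespace,
         "-o", "jsonpath={.spec.ports[0].port}"],
        text=True,
    ).strip()
    return ip, int(port)
```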