Commit

Benchmark (#13)
* import iceberg

* server struct

* rename crates to libs

* init db

* add table response

* add first four table responses, and response.rs for struct

* init db

* init error

* init api framework

* error handling

* error handling

* namespace w/o delete and add new properties

* return JsonResultGeneric<T>

* fix function signature for table.rs

* fix conflict

* general not found message

* general error handler

* create namespace

* fix return type in table.rs

* remove/update namespace && empty result

* modify table implementation

* fix conflict

* fix ok_empty()

* fix ok_empty()

* unit test

* add structs for Schema

* fmt

* reorg

* db config && move catches to server/catches.rs

* complete all table functions

* fix import

* added namespace init

* 1 test

* fix Namespace Param

* fmt

* add ? for Result<>

* add structs for Schema

* Update design doc with benchmarking

* Update design doc for benchmarking new

* fix get table by namespace

* fix delete table implementation

* add Schema/TableMetadata to Table struct

* fix create and get Table uuid

* add atomic increase table uuid

* add get_table_by_namespace unit test

* fix unit test json; but with possible conflict error

* create tempdir for a new db in unittest, having bugs

* change hardcoded root_dir in new DBconnection

* modularized mock_client_creation

* add some table unit tests

* add unit tests for namespace

* Add table unit tests

* clean up (#12)

* clean dead import, dead result

* mod test

* modify benchmark.py

* benchmark with vegeta

* plot

* add schema in createTableRequest

* Modify benchmark request rate

* comment out schema in createTableRequest

* add 3 endpoints to benchmark

* minor name fix

* add random endpoints

* minor bug fixed

* change duration to 60 secs and # of random tables

* rm lib iceberg

* merge main

---------

Co-authored-by: Angela-CMU <[email protected]>
Co-authored-by: Yen-Ju Wu <[email protected]>
3 people authored May 2, 2024
1 parent 1283c30 commit 7a3f212
Showing 25 changed files with 2,212 additions and 80 deletions.
6 changes: 5 additions & 1 deletion .gitignore
@@ -16,4 +16,8 @@ Cargo.lock
 # macOS resource forks and .DS_Store files
 .DS_Store
 
-.vscode
+.vscode
+
+database
+
+test/
10 changes: 5 additions & 5 deletions Cargo.toml
@@ -23,11 +23,11 @@ license = "Apache-2.0"
 repository = "https://github.com/cmu-db/15721-s24-catalog2"
 rust-version = "1.75.0"
-
-
-
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
 [dependencies]
 rocket = { version = "0.5.0", features = ["json", "http2"] }
-iceberg = { src = "./libs/iceberg" }
 dotenv = "0.15.0"
-pickledb = "^0.5.0"
+pickledb = "^0.5.0"
+derive_builder = "0.20.0"
+serde_json = "1.0.79"
+clap = { version = "4.5.4", features = ["derive"] }
+tempfile = "3.10.1"
19 changes: 12 additions & 7 deletions doc/design_doc.md
@@ -10,6 +10,7 @@ The goal of this project is to design and implement a **Catalog Service** for an
## Architectural Design
We follow the logical model described below. The input of our service comes from the execution engine and the I/O service, and we provide metadata to the planner and scheduler. We use [pickleDB](https://docs.rs/pickledb/latest/pickledb/) as the key-value store, persisting (namespace, tables) and (table_name, metadata) as two kinds of (key, value) pairs in local db files.
We use [Rocket](https://rocket.rs) as the web framework to handle incoming API traffic.

![system architecture](./assets/system-architecture.png)
### Data Model
We adhere to the Iceberg data model, arranging tables based on namespaces, with each table uniquely identified by its name.
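
For illustration, here is a minimal sketch of how these two (key, value) shapes could look in pickleDB. This is illustrative only, not code from this commit; the db path, keys, and metadata fields are made-up examples:

```rust
use pickledb::{PickleDb, PickleDbDumpPolicy, SerializationMethod};

fn main() {
    // One local db file; AutoDump persists every write to disk.
    let mut db = PickleDb::new(
        "database/catalog.db", // hypothetical path
        PickleDbDumpPolicy::AutoDump,
        SerializationMethod::Json,
    );

    // (namespace, tables): a namespace key maps to the list of its table names.
    db.set("ns1", &vec!["orders".to_string(), "lineitem".to_string()])
        .unwrap();

    // (table_name, metadata): a table key maps to its metadata document.
    db.set(
        "ns1/orders",
        &serde_json::json!({ "table-uuid": "placeholder", "location": "placeholder" }),
    )
    .unwrap();

    // get() deserializes back into the requested type.
    let tables: Option<Vec<String>> = db.get("ns1");
    assert_eq!(tables.unwrap().len(), 2);
}
```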
@@ -19,9 +20,10 @@ The parameters for request and response can be referenced from [REST API](https:

### Use Cases
#### Namespace
create/delete/rename/list namespace
#### Table
create/delete/rename/list table

#### Query Table’s Metadata (including statistics, version, table-uuid, location, last-column-id, schema, and partition-spec)
get metadata by {namespace}/{table}
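
As a rough sketch of this endpoint in Rocket 0.5 — the handler name, route shape, and stubbed response body are illustrative assumptions, not the committed implementation:

```rust
use rocket::serde::json::{json, Json, Value};
use rocket::{get, launch, routes};

// Hypothetical handler: the real service would look up the
// (table_name, metadata) pair in the kv store instead of stubbing it.
#[get("/namespaces/<namespace>/tables/<table>")]
fn get_table_metadata(namespace: &str, table: &str) -> Json<Value> {
    Json(json!({
        "namespace": namespace,
        "table": table,
        "metadata": { "table-uuid": null, "location": null }
    }))
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/v1", routes![get_table_metadata])
}
```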

@@ -35,6 +37,7 @@ get metadata by {namespace}/{table}
* Centralized metadata management achieved by separating data and metadata, reducing complexity and facilitating consistent metadata handling.
* Code modularity and clear interfaces facilitate easier updates and improvements.
* We adopt the existing kvstore ([pickleDB](https://docs.rs/pickledb/latest/pickledb/)) and server ([Rocket](https://github.com/rwf2/Rocket)) to mitigate the engineering complexity.

* Testing:
* Comprehensive testing plans cover correctness through unit tests and performance through long-running regression tests. Unit tests focus on individual components of the catalog service, while regression tests evaluate system-wide performance and stability.
* Other Implementations:
@@ -46,14 +49,16 @@ To ensure the quality and the performance of the catalog implemented, a comprehe
* Functional testing
* API tests: For functional testing, we rely on unit tests. We will test each API endpoint implemented in our project to ensure correct behavior, exercising various input parameters and validating that the response format and status code are as expected. We will also mimic possible edge cases and errors to ensure the implementation is robust and performs suitable error handling. By doing so, we can ensure the API works as expected and returns correct results to clients (a sketch of this test style follows this list).
* Metadata tests: We will focus on verifying the correct storage and retrieval of metadata. Tests will cover a range of scenarios, including edge cases. [Quickcheck](https://github.com/BurntSushi/quickcheck) is one option for performing such testing.
* [Documentation tests](https://doc.rust-lang.org/rustdoc/write-documentation/documentation-tests.html#documentation-tests): Execute document examples
* Non-functional testing
* Microbenchmarking for performance evaluation: We can use [Criterion.rs](https://github.com/bheisler/criterion.rs?tab=readme-ov-file#features) and [bencher](https://github.com/bluss/bencher) to collect statistics to enable statistics-driven optimizations. In addition, we can set up a performance baseline to compare the performance with our implementation. We can measure different metrics, for example, response time, throughput, etc.
* Scalability test: We will try to test our implementation under increased load and ensure the correctness and efficiency at the same time.
* Benchmark testing
* Key performance metrics: Latency and Requests Per Second (RPS) will be used as the key metrics.
* Workload: Since we are working on an OLAP database, we expect read-heavy, write-occasional workloads that include complex joins and predicates, analytical queries, periodic updates on catalog data, and some metadata updates. Based on this assumption, we plan to evaluate three read-to-write ratios: 1000:1, 100:1, and 10:1.
* Performance evaluation: We can use [ali](https://github.com/nakabonne/ali) to create HTTP traffic and visualize the outcomes in real-time for performance evaluation.
* Performance optimization: We can use [Criterion.rs](https://github.com/bheisler/criterion.rs?tab=readme-ov-file#features) and [bencher](https://github.com/bluss/bencher) to collect statistics to enable statistics-driven optimizations. In addition, we can set up a performance baseline to compare the performance with our implementation. We can measure different metrics, for example, response time, throughput, etc.
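
The sketch promised above: one possible shape for such API unit tests, using Rocket's blocking test client. It assumes the crate exposes a `rocket()` builder like the one sketched earlier and a `/v1/namespaces` route; both names are assumptions, not facts taken from this diff:

```rust
#[cfg(test)]
mod tests {
    use rocket::http::Status;
    use rocket::local::blocking::Client;

    #[test]
    fn list_namespaces_returns_ok() {
        // `crate::rocket()` is the assumed Rocket builder.
        let client = Client::tracked(crate::rocket()).expect("valid rocket instance");
        let resp = client.get("/v1/namespaces").dispatch();
        assert_eq!(resp.status(), Status::Ok);
    }

    #[test]
    fn missing_table_returns_not_found() {
        // Edge case: a table that was never created should not resolve.
        let client = Client::tracked(crate::rocket()).expect("valid rocket instance");
        let resp = client
            .get("/v1/namespaces/no_such_ns/tables/no_such_table")
            .dispatch();
        assert_eq!(resp.status(), Status::NotFound);
    }
}
```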


## Trade-offs and Potential Problems
* Balancing between metadata retrieval speed and storage efficiency.
* Balancing between query performance and engineering complexity/maintainability (such as adding bloom filters).

## Glossary (Optional)
>If you are introducing new concepts or giving unintuitive names to components, write them down here.
Empty file added plot.html
Empty file.
164 changes: 164 additions & 0 deletions scripts/bench
@@ -0,0 +1,164 @@
#!/usr/bin/env python3
# This script is used to benchmark the catalog server.
# It will start the catalog server, seed the catalog with some namespaces and tables, and use vegeta to stress test the server.
# vegeta: https://github.com/tsenart/vegeta
# Install on mac: brew install vegeta

import subprocess as sp
import time
import signal
import sys
import requests
import argparse
import string
import random


def get_random_str(length=8):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for _ in range(length))


def run(cmd, note, bg=False, out=None):
    print(f"{note.ljust(48)}...", end=" ", flush=True)
    try:
        res = None
        if out:
            with open(out, "a") as f:
                if bg:
                    res = sp.Popen(cmd, shell=True, stdout=f, stderr=f)
                else:
                    sp.run(cmd, shell=True, check=True, stdout=f, stderr=f)
        else:
            if bg:
                res = sp.Popen(cmd, shell=True, stdout=sp.DEVNULL, stderr=sp.DEVNULL)
            else:
                sp.run(cmd, shell=True, check=True, stdout=sp.DEVNULL, stderr=sp.DEVNULL)
        print("DONE!")
        return res
    except sp.CalledProcessError as e:
        print("FAIL!")
        print("Error:", e)


TEST_ROOT_DIR = "test"
DEFAULT_BINARY_NAME = "catalog2"
DEFAULT_DB_ROOT_DIR = f"{TEST_ROOT_DIR}/db"
DEFAULT_BASE_URL = "http://127.0.0.1:8000/v1"
DEFAULT_NAMESPACE_NUM = 1
DEFAULT_TABLE_NUM = 1
DEFAULT_RATE = 8

parser = argparse.ArgumentParser(description="Benchmark.")
parser.add_argument("-b", "--binary_name", type=str,
default=DEFAULT_BINARY_NAME, help="Name of the catalog binary.")
parser.add_argument("-d", "--db_root", type=str,
default=DEFAULT_DB_ROOT_DIR, help="Root directory for the database.")
parser.add_argument("-u", "--base_url", type=str,
default=DEFAULT_BASE_URL, help="Base URL for catalog server.")
parser.add_argument("-n", "--namespace_num", type=int,
default=DEFAULT_NAMESPACE_NUM, help="The number of namespace to seed in catalog.")
parser.add_argument("-t", "--table_num", type=int,
default=DEFAULT_TABLE_NUM, help="The number of table to seed in catalog.")
parser.add_argument("-r", "--rate", type=int,
default=DEFAULT_RATE, help="Request rate.")
parser.add_argument("-p", "--plot", action="store_true",
default=False, help="Generate a plot of this benchmark.")
args = parser.parse_args()


CATALOG_LOG = f"{TEST_ROOT_DIR}/catalog.log"

# build catalog in release mode
run(f"rm -rf {TEST_ROOT_DIR} && mkdir {TEST_ROOT_DIR}",
note="initializing test dir")
run(f"cargo build --release && cp target/release/{args.binary_name} {TEST_ROOT_DIR}/{args.binary_name}",
note="building catalog in release mode")
catalog_server = run(f"{TEST_ROOT_DIR}/{args.binary_name} --db-root {args.db_root}",
note="starting catalog server", bg=True, out=CATALOG_LOG)
print("Waiting for catalog server to start...")
time.sleep(1)

# seeding the catalog, uniformly distribute tables to namespaces
print(f"Seeding namespaces and tables...")
NAMESPACE_ENDPOINT = "namespaces"
TABLE_ENDPOINT = "tables"
namespaces = []
table_per_namespace = args.table_num // args.namespace_num
for i in range(args.namespace_num):
    namespace = get_random_str(32)
    tables = []
    for j in range(table_per_namespace):
        tables.append(get_random_str(32))
    namespaces.append({'name': namespace, 'tables': tables})
    # create namespace
    response = requests.post(f"{args.base_url}/{NAMESPACE_ENDPOINT}",
                             json={'namespace': [namespace]})
    assert response.status_code == 200, f"Failed to create namespace {namespace}"

    # create tables
    for table in tables:
        response = requests.post(
            f"{args.base_url}/{NAMESPACE_ENDPOINT}/{namespace}/{TABLE_ENDPOINT}",
            json={'name': table}
        )
        assert response.status_code == 200, f"Failed to create table {table}"

print(f"Seeded {len(namespaces)} namespaces and {len(namespaces) * table_per_namespace} tables.")

# test begins
# 1. single endpoint stress test
namespace = namespaces[0]
table = namespace['tables'][0]
targets = {
"get_table": f"{args.base_url}/{NAMESPACE_ENDPOINT}/{namespace['name']}/{TABLE_ENDPOINT}/{table}",
"list_table": f"{args.base_url}/{NAMESPACE_ENDPOINT}/{namespace['name']}/{TABLE_ENDPOINT}",
"get_namespace": f"{args.base_url}/{NAMESPACE_ENDPOINT}/{namespace['name']}",
"list_namespace": f"{args.base_url}/{NAMESPACE_ENDPOINT}"
}

for name, target in targets.items():
    STATISTIC_FILE = f"{TEST_ROOT_DIR}/results_{name}.bin"
    attack = f"echo 'GET {target}' | vegeta attack -rate={args.rate} -duration=60s | tee {STATISTIC_FILE} | vegeta report"
    run(attack, note=f"single endpoint stress test for {name}",
        out=f"{TEST_ROOT_DIR}/vegeta_{name}.log")
    if args.plot:
        PLOT_FILE = f"{TEST_ROOT_DIR}/plot_{name}.html"
        run(f"cat {STATISTIC_FILE} | vegeta plot > {PLOT_FILE}",
            note="generating plot")
# ... more?


# 2. random endpoint stress test
# Define the file path
PATH_TARGET_FILE = f"{TEST_ROOT_DIR}/requests_get_table.txt"

# Write the URLs to the file
with open(PATH_TARGET_FILE, "w") as file:
    for i in range(len(namespaces)):
        random_namespace = random.choice(namespaces)
        random_table = random.choice(random_namespace['tables'])

        # Generate request URL
        target = f"{args.base_url}/{NAMESPACE_ENDPOINT}/{random_namespace['name']}/{TABLE_ENDPOINT}/{random_table}"
        request_url = f"GET {target}"

        file.write(request_url + "\n")

print("URLs have been written to", PATH_TARGET_FILE)


STATISTIC_FILE = f"{TEST_ROOT_DIR}/results_random.bin"
attack = f"vegeta attack -targets={PATH_TARGET_FILE} -rate={args.rate} -duration=60s | tee {STATISTIC_FILE} | vegeta report"
run(attack, note="random endpoints stress test",
out=f"{TEST_ROOT_DIR}/veneta_random.log")
if args.plot:
PLOT_FILE = f"{TEST_ROOT_DIR}/plot_random.html"
run(f"cat {STATISTIC_FILE} | vegeta plot > {PLOT_FILE}",
note="generating plot")


# clean up
catalog_server.send_signal(signal.SIGINT)
2 changes: 2 additions & 0 deletions src/catalog/mod.rs
@@ -0,0 +1,2 @@
pub mod namespace;
pub mod table;