[CHORE]: tpc-ds datagen #3103
CodSpeed Performance Report: merging #3103 will improve performance by 11.01%.
I'm not sure I'm the best person to be reviewing this, but I'm definitely going to look at it because I think it might relate in some ways to what I'm doing with tests.
@universalmind303 I think it makes sense to check in the SQL queries and fixtures, but it would be better to place the answers in a public S3 bucket since they are artifacts rather than code.
That makes sense. Will update!
Looks good to me. As Sammy said, make sure to upload the dbgen outputs to S3, maybe for a chosen set of scale factors (e.g. 1, 10, 100, 10000).
benchmarking/tpcds/datagen.py
Outdated
parser.add_argument(
    "--tpch-gen-folder",
    default="data/tpch-dbgen",
    help="Path to the folder containing the TPCH dbgen tool and generated data",
)
parser.add_argument("--scale-factor", default=0.01, help="Scale factor to run on in GB", type=float)

args = parser.parse_args()
num_parts = args.scale_factor

logger.info(
    "Generating data at %s with: scale_factor=%s num_parts=%s generate_sqlite_db=%s generate_parquet=%s",
    args.tpch_gen_folder,
    args.scale_factor,
    num_parts,
)

gen_tpcds(basedir=args.tpch_gen_folder, scale_factor=args.scale_factor)
Rename TPCH to TPCDS in these lines
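For illustration, the rename could look roughly like the sketch below. The flag name `--tpcds-gen-folder` comes from the suggested Makefile change in this review; the `data/tpc-ds` default is borrowed from the PR description and is an assumption for the script itself. The quoted `logger.info` call has five `%s` placeholders but only three arguments, so the sketch also trims the format string to match. `parse_args` and `main` are hypothetical helper names, and the PR's `gen_tpcds` call is left as a comment because its definition is not shown here.

```python
import argparse
import logging

logger = logging.getLogger(__name__)


def parse_args(argv=None):
    """Parse CLI options using TPC-DS naming instead of the copied TPC-H names."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--tpcds-gen-folder",
        default="data/tpc-ds",  # assumed default, taken from the PR description
        help="Path to the folder that will hold the generated TPC-DS data",
    )
    parser.add_argument("--scale-factor", default=0.01, type=float, help="Scale factor to run on in GB")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # One placeholder per argument, so the log call formats cleanly.
    logger.info(
        "Generating data at %s with: scale_factor=%s",
        args.tpcds_gen_folder,
        args.scale_factor,
    )
    # The PR's generator would be called here, e.g.:
    # gen_tpcds(basedir=args.tpcds_gen_folder, scale_factor=args.scale_factor)
    return args
```

Calling `main(["--scale-factor", "1"])` parses the options and logs a single line; argparse exposes the renamed flag as `args.tpcds_gen_folder`.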
Makefile
Outdated
@@ -56,6 +61,10 @@ build-release: check-toolchain .venv ## Compile and install a faster Daft binar
test: .venv build ## Run tests
	HYPOTHESIS_MAX_EXAMPLES=$(HYPOTHESIS_MAX_EXAMPLES) $(VENV_BIN)/pytest --hypothesis-seed=$(HYPOTHESIS_SEED)

.PHONY: dsdgen
dsdgen: .venv ## Generate TPC-DS data
	$(VENV_BIN)/python benchmarking/tpcds/datagen.py --scale-factor=$(SCALE_FACTOR) --tpch-gen-folder=$(OUTPUT_DIR)
Suggested change:
- $(VENV_BIN)/python benchmarking/tpcds/datagen.py --scale-factor=$(SCALE_FACTOR) --tpch-gen-folder=$(OUTPUT_DIR)
+ $(VENV_BIN)/python benchmarking/tpcds/datagen.py --scale-factor=$(SCALE_FACTOR) --tpcds-gen-folder=$(OUTPUT_DIR)
A whole bunch of boilerplate for TPC-DS benchmarking and testing.

I wanted to keep this separate from others as there's not much functionality here: it just adds a `dsdgen` command to the Makefile to generate TPC-DS datasets. I called it `dsdgen` because that's what DuckDB calls it, and this uses the DuckDB implementation to generate all of the datasets.

The answers were copied from [duckdb/duckdb/extension/tpcds/dsdgen/answers](https://github.com/duckdb/duckdb/tree/10c42435f1805ee4415faa5d6da4943e8c98fa55/extension/tpcds/dsdgen/answers).

Usage:

```sh
# defaults to sf=1 and dir=data/tpc-ds
make dsdgen
make dsdgen SCALE_FACTOR=<scale_factor> OUTPUT_DIR=<output_dir>
```

## Notes for reviewer

Most files here are boilerplate. The only relevant files are:

- Makefile
- requirements_dev.txt
- benchmarking/tpcds/datagen.py
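For readers unfamiliar with the DuckDB generator mentioned above, here is a minimal sketch of how TPC-DS tables can be produced with DuckDB's `tpcds` extension from Python. It is not the contents of `datagen.py`: the helper names (`dsdgen_statements`, `copy_statement`, `generate`) are hypothetical, and writing one Parquet file per table is an assumption about the output layout. The SQL itself (`INSTALL tpcds`, `LOAD tpcds`, `CALL dsdgen(sf = ...)`, `COPY ... (FORMAT PARQUET)`) is standard DuckDB.

```python
from pathlib import Path


def dsdgen_statements(scale_factor: float) -> list[str]:
    """SQL that installs and loads the tpcds extension, then generates the tables."""
    return ["INSTALL tpcds", "LOAD tpcds", f"CALL dsdgen(sf = {scale_factor})"]


def copy_statement(table: str, out_dir: Path) -> str:
    """SQL that writes one generated table to <out_dir>/<table>.parquet."""
    target = (out_dir / f"{table}.parquet").as_posix()
    return f"COPY {table} TO '{target}' (FORMAT PARQUET)"


def generate(out_dir: str, scale_factor: float) -> None:
    """Generate every TPC-DS table in an in-memory DuckDB and export it."""
    import duckdb  # imported lazily so the helpers above work without duckdb installed

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    con = duckdb.connect()  # in-memory database
    for stmt in dsdgen_statements(scale_factor):
        con.execute(stmt)
    # After dsdgen runs, the generated tables are listed by SHOW TABLES.
    tables = [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    for table in tables:
        con.execute(copy_statement(table, out))
```

With the `duckdb` package installed, `generate("data/tpc-ds", 0.01)` would write a small dataset suitable for quick local tests; installing the extension needs network access on first use.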