[CHORE]: tpc-ds datagen #3103

universalmind303 · 2024-10-22T22:43:05Z

A whole bunch of boilerplate for TPC-DS benchmarking and testing.

I wanted to keep this separate from the other changes since there's not much functionality here: it just adds a dsdgen command to the Makefile to generate TPC-DS datasets. I called it dsdgen because that's what DuckDB calls it, and this uses the DuckDB implementation to generate all of the datasets.

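The answers were copied from [duckdb/duckdb/extension/tpcds/dsdgen/answers](https://github.com/duckdb/duckdb/tree/10c42435f1805ee4415faa5d6da4943e8c98fa55/extension/tpcds/dsdgen/answers)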

Usage:

# defaults to sf=1 and dir=data/tpc-ds
> make dsdgen
> make dsdgen SCALE_FACTOR=<scale_factor> OUTPUT_DIR=<output_dir>
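
For readers unfamiliar with DuckDB's generator, here is a minimal sketch of what a `dsdgen`-backed data generation step could look like. It is not the code from this PR: only the `gen_tpcds(basedir=..., scale_factor=...)` call shape appears in the diff, so the function body, the `dsdgen_statements` helper, and the choice to export Parquet files are all assumptions made for illustration. It relies on DuckDB's `tpcds` extension (`CALL dsdgen(sf = ...)`) and `EXPORT DATABASE`.

```python
from typing import List


def dsdgen_statements(scale_factor: float, output_dir: str) -> List[str]:
    """Build the SQL statements for one generation run (hypothetical helper)."""
    # Escape single quotes so the directory is a valid SQL string literal.
    escaped_dir = output_dir.replace("'", "''")
    return [
        "INSTALL tpcds",  # fetch the TPC-DS extension if it is not installed yet
        "LOAD tpcds",  # make dsdgen() available in this session
        f"CALL dsdgen(sf = {scale_factor})",  # populate the TPC-DS tables
        # Assumption: write every table out as Parquet files under output_dir.
        f"EXPORT DATABASE '{escaped_dir}' (FORMAT PARQUET)",
    ]


def gen_tpcds(basedir: str, scale_factor: float) -> None:
    """Generate TPC-DS tables with DuckDB and export them to basedir (sketch)."""
    import duckdb  # third-party dependency: pip install duckdb

    con = duckdb.connect()  # in-memory database, discarded after the export
    try:
        for statement in dsdgen_statements(scale_factor, basedir):
            con.execute(statement)
    finally:
        con.close()
```

Under these assumptions, calling `gen_tpcds(basedir="data/tpc-ds", scale_factor=1)` would correspond to a plain `make dsdgen`.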

Notes for reviewer

Most files here are boilerplate.

The only relevant files are:

Makefile
requirements_dev.txt
benchmarking/tpcds/datagen.py

codspeed-hq · 2024-10-22T22:53:00Z

CodSpeed Performance Report

Merging #3103 will improve performances by 11.01%

Comparing universalmind303:tpcds (b70f83b) with main (138d078)

Summary

⚡ 1 improvements
✅ 16 untouched benchmarks

Benchmarks breakdown

	Benchmark	`main`	`universalmind303:tpcds`	Change
⚡	`test_count[1 Small File]`	4.4 ms	4 ms	+11.01%

andrewgazelka · 2024-10-22T22:58:16Z

I'm not sure I'm the best person to be reviewing this, but I'm definitely going to look at it because I think it might relate in some ways to what I'm doing with tests.

samster25 · 2024-10-22T23:23:32Z

@universalmind303 I think it makes sense to check in the SQL queries and fixtures, but it would be better to place the answers in a public S3 bucket since they are more artifacts than code.

universalmind303 · 2024-10-22T23:53:51Z

> @universalmind303 I think it makes sense to check in the SQL queries and fixtures, but it would be better to place the answers in a public S3 bucket since they are more artifacts than code.

that makes sense. Will update!

kevinzwang

Looks good to me. As Sammy said, make sure to upload the dbgen outputs to S3, and maybe choose a set of scale factors (e.g. 1, 10, 100, 10000).

kevinzwang · 2024-11-01T00:41:59Z

benchmarking/tpcds/datagen.py

+    parser.add_argument(
+        "--tpch-gen-folder",
+        default="data/tpch-dbgen",
+        help="Path to the folder containing the TPCH dbgen tool and generated data",
+    )
+    parser.add_argument("--scale-factor", default=0.01, help="Scale factor to run on in GB", type=float)
+
+    args = parser.parse_args()
+    num_parts = args.scale_factor
+
+    logger.info(
+        "Generating data at %s with: scale_factor=%s num_parts=%s generate_sqlite_db=%s generate_parquet=%s",
+        args.tpch_gen_folder,
+        args.scale_factor,
+        num_parts,
+    )
+
+    gen_tpcds(basedir=args.tpch_gen_folder, scale_factor=args.scale_factor)


Rename TPCH to TPCDS in these lines
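
A sketch of how the requested rename might look in the argument-parsing code, not the final merged version. The `--tpcds-gen-folder` flag name comes from the Makefile suggestion in this review; the `data/tpc-ds` default is borrowed from the usage note in the PR description, and the `build_parser` and `main` wrappers are hypothetical. The log message is also trimmed so the number of `%s` placeholders matches the number of arguments, which the hunk above does not do.

```python
import argparse
import logging

logger = logging.getLogger(__name__)


def build_parser() -> argparse.ArgumentParser:
    """Argument parser using TPC-DS naming throughout (hypothetical wrapper)."""
    parser = argparse.ArgumentParser(description="Generate TPC-DS data")
    parser.add_argument(
        "--tpcds-gen-folder",
        default="data/tpc-ds",  # assumption: mirrors the Makefile default
        help="Path to the folder that will hold the generated TPC-DS data",
    )
    parser.add_argument(
        "--scale-factor",
        default=0.01,
        type=float,
        help="Scale factor to run on in GB",
    )
    return parser


def main(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # Two placeholders, two arguments.
    logger.info(
        "Generating data at %s with: scale_factor=%s",
        args.tpcds_gen_folder,
        args.scale_factor,
    )
    # The real script would now call:
    # gen_tpcds(basedir=args.tpcds_gen_folder, scale_factor=args.scale_factor)
    return args
```

With this shape, `main(["--scale-factor", "1", "--tpcds-gen-folder", "out"])` parses the same flags that the corrected Makefile target passes.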

kevinzwang · 2024-11-01T00:42:17Z

Makefile

@@ -56,6 +61,10 @@ build-release: check-toolchain .venv  ## Compile and install a faster Daft binar
 test: .venv build  ## Run tests
 	HYPOTHESIS_MAX_EXAMPLES=$(HYPOTHESIS_MAX_EXAMPLES) $(VENV_BIN)/pytest --hypothesis-seed=$(HYPOTHESIS_SEED)

+.PHONY: dsdgen
+dsdgen: .venv ## Generate TPC-DS data
+	$(VENV_BIN)/python benchmarking/tpcds/datagen.py --scale-factor=$(SCALE_FACTOR) --tpch-gen-folder=$(OUTPUT_DIR)


Suggested change

-	$(VENV_BIN)/python benchmarking/tpcds/datagen.py --scale-factor=$(SCALE_FACTOR) --tpch-gen-folder=$(OUTPUT_DIR)
+	$(VENV_BIN)/python benchmarking/tpcds/datagen.py --scale-factor=$(SCALE_FACTOR) --tpcds-gen-folder=$(OUTPUT_DIR)

feat: tpc-ds datagen

4597ecc

github-actions bot added the chore label Oct 22, 2024

universalmind303 requested review from jaychia and andrewgazelka October 22, 2024 22:47

andrewgazelka removed their request for review October 22, 2024 22:56

universalmind303 added 3 commits October 23, 2024 11:42

Merge branch 'main' of https://github.com/Eventual-Inc/Daft into tpcds

c2b344c

Merge branch 'main' of https://github.com/Eventual-Inc/Daft into tpcds

39344ed

remove answers

747ba05

universalmind303 requested a review from samster25 October 30, 2024 16:43

update makefile

6cae8a3

universalmind303 requested review from desmondcheongzx, kevinzwang, raunakab and colin-ho and removed request for jaychia and samster25 October 31, 2024 16:34

kevinzwang reviewed Nov 1, 2024

kevinzwang approved these changes Nov 1, 2024

pr feedback

b70f83b

universalmind303 enabled auto-merge (squash) November 1, 2024 14:02

universalmind303 merged commit 9d4adfb into Eventual-Inc:main Nov 1, 2024
38 checks passed
