[CHORE]: tpc-ds datagen #3103

Merged 6 commits on Nov 1, 2024

universalmind303
Contributor

@universalmind303 universalmind303 commented Oct 22, 2024

A whole bunch of boilerplate for TPC-DS benchmarking and testing.

I wanted to keep this separate from the others since there's not much functionality here: it just adds a dsdgen command to the Makefile to generate TPC-DS datasets. I called it dsdgen because that's what DuckDB calls it, and this uses the DuckDB implementation to generate all of the datasets.

The answers were copied from duckdb/duckdb/extension/tpcds/dsdgen/answers

Usage:

# defaults to sf=1 and dir=data/tpc-ds
> make dsdgen
> make dsdgen SCALE_FACTOR=<scale_factor> OUTPUT_DIR=<output_dir>
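
For readers unfamiliar with the DuckDB generator, here is a minimal sketch of what a dsdgen-style script could look like, assuming the `duckdb` Python package and its `tpcds` extension. It is not the code from this PR: `copy_statement` and `gen_tpcds_sketch` are hypothetical names, and writing one Parquet file per table is an assumption about the output layout.

```python
from pathlib import Path


def copy_statement(table: str, out_dir: str) -> str:
    """Build the SQL that exports one generated table to a Parquet file."""
    target = Path(out_dir) / f"{table}.parquet"
    return f"COPY {table} TO '{target.as_posix()}' (FORMAT PARQUET)"


def gen_tpcds_sketch(out_dir: str = "data/tpc-ds", scale_factor: float = 1) -> list[str]:
    """Generate TPC-DS tables with DuckDB and export them; returns the table names."""
    import duckdb  # imported lazily so copy_statement works without DuckDB installed

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    con = duckdb.connect()  # in-memory database
    con.execute("INSTALL tpcds")  # fetches the extension on first use
    con.execute("LOAD tpcds")
    con.execute(f"CALL dsdgen(sf = {scale_factor})")
    tables = [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    for table in tables:
        con.execute(copy_statement(table, out_dir))
    return tables
```

Calling `gen_tpcds_sketch("data/tpc-ds", 1)` would mirror the defaults mentioned in the usage note above.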

Notes for reviewer

Most files here are boilerplate.

The only relevant files are:

  • Makefile
  • requirements_dev.txt
  • benchmarking/tpcds/datagen.py


codspeed-hq bot commented Oct 22, 2024

CodSpeed Performance Report

Merging #3103 will improve performances by 11.01%

Comparing universalmind303:tpcds (b70f83b) with main (138d078)

Summary

⚡ 1 improvements
✅ 16 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | universalmind303:tpcds | Change |
|---|---|---|---|
| test_count[1 Small File] | 4.4 ms | 4 ms | +11.01% |

@andrewgazelka andrewgazelka removed their request for review October 22, 2024 22:56
@andrewgazelka
Member

I'm not sure I'm the best person to be reviewing this, but I'm definitely going to look at it because I think it might relate in some ways to what I'm doing with tests.

@samster25
Member

@universalmind303 I think it makes sense to check in the SQL queries and fixtures, but it would be better to place the answers in a public S3 bucket, since they are artifacts rather than code.

@universalmind303
Contributor Author

> @universalmind303 I think it makes sense to check in the SQL queries and fixtures, but it would be better to place the answers in a public S3 bucket, since they are artifacts rather than code.

that makes sense. Will update!

@universalmind303 universalmind303 requested review from desmondcheongzx, kevinzwang, raunakab and colin-ho and removed request for jaychia and samster25 October 31, 2024 16:34
Member

@kevinzwang kevinzwang left a comment

Looks good to me. As Sammy said, make sure to upload the dbgen outputs to S3, perhaps for a fixed set of scale factors (e.g. 1, 10, 100, 10000).
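
One way to picture this suggestion is a small planner that lists the commands to run for each scale factor. This is only a sketch: the `make dsdgen` invocation comes from the PR description, while the `aws s3 sync` upload step, the bucket name, and the `sf<N>` directory layout are assumptions made for illustration.

```python
SCALE_FACTORS = (1, 10, 100, 10000)  # the example set from the review comment


def make_command(scale_factor, output_dir):
    """Command line for the `make dsdgen` target described in this PR."""
    return ["make", "dsdgen", f"SCALE_FACTOR={scale_factor}", f"OUTPUT_DIR={output_dir}"]


def upload_command(output_dir, bucket, scale_factor):
    """Hypothetical upload step: `aws s3 sync` into an assumed bucket layout."""
    return ["aws", "s3", "sync", output_dir, f"s3://{bucket}/tpc-ds/sf{scale_factor}/"]


def plan(bucket, base_dir="data/tpc-ds"):
    """List every command in order, without executing anything."""
    commands = []
    for sf in SCALE_FACTORS:
        out = f"{base_dir}/sf{sf}"
        commands.append(make_command(sf, out))
        commands.append(upload_command(out, bucket, sf))
    return commands
```

Each entry returned by `plan("my-example-bucket")` could be passed to `subprocess.run(cmd, check=True)`; keeping planning separate from execution makes it easy to review the commands before generating the large scale factors.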

Comment on lines 23 to 40
parser.add_argument(
    "--tpch-gen-folder",
    default="data/tpch-dbgen",
    help="Path to the folder containing the TPCH dbgen tool and generated data",
)
parser.add_argument("--scale-factor", default=0.01, help="Scale factor to run on in GB", type=float)

args = parser.parse_args()
num_parts = args.scale_factor

logger.info(
    "Generating data at %s with: scale_factor=%s num_parts=%s generate_sqlite_db=%s generate_parquet=%s",
    args.tpch_gen_folder,
    args.scale_factor,
    num_parts,
)

gen_tpcds(basedir=args.tpch_gen_folder, scale_factor=args.scale_factor)
Member

Rename TPCH to TPCDS in these lines
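
A sketch of what the requested rename might look like is below. It is not the merged code: the `data/tpc-ds` default is an assumption taken from the usage note in the PR description, and `parse_args` and `main` are hypothetical helper names. The sketch also trims the log message, since the format string in the snippet above has five `%s` placeholders but only three arguments.

```python
import argparse
import logging

logger = logging.getLogger(__name__)


def parse_args(argv=None):
    """Parse the datagen options, using TPC-DS names throughout."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--tpcds-gen-folder",
        default="data/tpc-ds",  # assumed default, taken from the usage note
        help="Path to the folder that will hold the generated TPC-DS data",
    )
    parser.add_argument("--scale-factor", default=0.01, type=float, help="Scale factor to run on in GB")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # One placeholder per argument, so the log call formats cleanly.
    logger.info(
        "Generating data at %s with: scale_factor=%s",
        args.tpcds_gen_folder,
        args.scale_factor,
    )
    return args
```

With this naming, the Makefile target would pass `--tpcds-gen-folder=$(OUTPUT_DIR)`, matching the suggested change on the Makefile.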

Makefile Outdated
@@ -56,6 +61,10 @@ build-release: check-toolchain .venv ## Compile and install a faster Daft binar
test: .venv build ## Run tests
	HYPOTHESIS_MAX_EXAMPLES=$(HYPOTHESIS_MAX_EXAMPLES) $(VENV_BIN)/pytest --hypothesis-seed=$(HYPOTHESIS_SEED)

.PHONY: dsdgen
dsdgen: .venv ## Generate TPC-DS data
	$(VENV_BIN)/python benchmarking/tpcds/datagen.py --scale-factor=$(SCALE_FACTOR) --tpch-gen-folder=$(OUTPUT_DIR)
Member

Suggested change
- $(VENV_BIN)/python benchmarking/tpcds/datagen.py --scale-factor=$(SCALE_FACTOR) --tpch-gen-folder=$(OUTPUT_DIR)
+ $(VENV_BIN)/python benchmarking/tpcds/datagen.py --scale-factor=$(SCALE_FACTOR) --tpcds-gen-folder=$(OUTPUT_DIR)

@universalmind303 universalmind303 enabled auto-merge (squash) November 1, 2024 14:02
@universalmind303 universalmind303 merged commit 9d4adfb into Eventual-Inc:main Nov 1, 2024
38 checks passed
sagiahrac pushed a commit to sagiahrac/Daft that referenced this pull request Nov 4, 2024
a whole bunch of boilerplate for tpc-ds benchmarking and testing. 

wanted to keep this separate from others as there's not much
functionality here, just adding a `dsdgen` command to the makefile to
generate tpc-ds datasets. I called it `dsdgen` because that's what
duckdb calls it, and this uses the duckdb implementation to generate all
of the datasets.

The answers were copied from
[duckdb/duckdb/extension/tpcds/dsdgen/answers](https://github.com/duckdb/duckdb/tree/10c42435f1805ee4415faa5d6da4943e8c98fa55/extension/tpcds/dsdgen/answers)

Usage:

```sh
# defaults to sf=1 and dir=data/tpc-ds
> make dsdgen
> make dsdgen SCALE_FACTOR=<scale_factor> OUTPUT_DIR=<output_dir>
```

## Notes for reviewer

Most files here are boilerplate. 

The only relevant files are:
- Makefile
- requirements_dev.txt
- benchmarking/tpcds/datagen.py