[CHORE] Write tpch parquet files one at a time #3396

Merged on Dec 1, 2024 (1 commit)

@colin-ho colin-ho commented Nov 21, 2024

When you specify a num_parts parameter when generating TPC-H files, the generator first produces num_parts CSVs, then reads those CSVs and writes them to Parquet using Daft.

However, write_parquet does not preserve the number of input files: even if there are 16 input files, there might be only 1 output file.

The fix here is to read and write 1 file at a time.
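
A minimal sketch of the read-one, write-one loop, not the actual diff in this PR. The names `convert_parts` and `daft_csv_to_parquet` are hypothetical, and the sketch assumes Daft is installed and that each `write_parquet` call adds new files under the target directory. Real TPC-H inputs would likely also need delimiter, header, and schema options on `read_csv`, which are omitted here.

```python
from typing import Callable, Iterable


def daft_csv_to_parquet(csv_path: str, out_dir: str) -> None:
    """Convert a single CSV part to Parquet with Daft (hypothetical helper)."""
    import daft  # assumption: Daft is installed; imported lazily on first use

    # Each call reads exactly one input part and writes it out on its own,
    # so parts are never combined into a single output file.
    daft.read_csv(csv_path).write_parquet(out_dir)


def convert_parts(
    csv_paths: Iterable[str],
    out_dir: str,
    convert_one: Callable[[str, str], None] = daft_csv_to_parquet,
) -> int:
    """Convert each CSV part separately; return how many parts were converted."""
    count = 0
    for csv_path in sorted(csv_paths):
        convert_one(csv_path, out_dir)  # one read and one write per part
        count += 1
    return count
```

Something like `convert_parts(glob.glob("data/*.csv"), "data/parquet")` would then issue one write per generated part (the paths are placeholders). The `convert_one` hook is there only so the loop can be exercised without Daft.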

@github-actions github-actions bot added the chore label Nov 21, 2024

codspeed-hq bot commented Nov 21, 2024

CodSpeed Performance Report

Merging #3396 will not alter performance

Comparing colin/gen-parquet (85fd788) with main (ec39dc0)

Summary

✅ 17 untouched benchmarks

codecov bot commented Nov 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.35%. Comparing base (3394a66) to head (85fd788).
Report is 42 commits behind head on main.

@@            Coverage Diff             @@
##             main    #3396      +/-   ##
==========================================
- Coverage   77.35%   77.35%   -0.01%     
==========================================
  Files         685      685              
  Lines       83631    83637       +6     
==========================================
+ Hits        64695    64697       +2     
- Misses      18936    18940       +4     

see 6 files with indirect coverage changes

graphite-app bot commented Nov 21, 2024

Graphite Automations

"Request reviewers once CI passes" took an action on this PR • (11/21/24)

1 reviewer was added to this PR based on Andrew Gazelka's automation.

@colin-ho colin-ho merged commit 8652eba into main Dec 1, 2024
46 checks passed
@colin-ho colin-ho deleted the colin/gen-parquet branch December 1, 2024 00:09