-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: use ZSTD compressed parquet by default #8158
Changes from 5 commits
1165630
65e8610
2b88a7b
844e78e
f2309ad
f184984
fbacf2c
df93fc5
bda3c03
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -142,12 +142,13 @@ private TableProperties() {} | |
|
||
public static final String PARQUET_COMPRESSION = "write.parquet.compression-codec"; | ||
public static final String DELETE_PARQUET_COMPRESSION = "write.delete.parquet.compression-codec"; | ||
public static final String PARQUET_COMPRESSION_DEFAULT = "gzip"; | ||
public static final String PARQUET_COMPRESSION_DEFAULT = "zstd"; | ||
|
||
public static final String PARQUET_COMPRESSION_LEVEL = "write.parquet.compression-level"; | ||
public static final String DELETE_PARQUET_COMPRESSION_LEVEL = | ||
"write.delete.parquet.compression-level"; | ||
public static final String PARQUET_COMPRESSION_LEVEL_DEFAULT = null; | ||
public static final String PARQUET_COMPRESSION_LEVEL_DEFAULT = | ||
null; // For zstd, it is default to "3" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How valuable is this comment given than it is specific to the underlying codec and may change in the future? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agree, like we don't mention gzip is default to "6". There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Removed. |
||
|
||
public static final String PARQUET_ROW_GROUP_CHECK_MIN_RECORD_COUNT = | ||
"write.parquet.row-group-check-min-record-count"; | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -291,7 +291,7 @@ public void testSelectNestedValues() throws Exception { | |
public void testNestedValues() throws Exception { | ||
createNestedTable(); | ||
|
||
Row leafDoubleCol = Row.of(53L, 3L, 1L, 1L, 0.0D, 0.0D); | ||
Row leafDoubleCol = Row.of(46L, 3L, 1L, 1L, 0.0D, 0.0D); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @szehon-ho @RussellSpitzer do you know what There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea I think it include the size, sorry for lack of comment, lets disable zstd in :
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I can confirm it's the metric of size. Because of changing to zstd, the sizes are reduced overall. I updated it to the new metrics. We can have a followup PR to change those metric related test with uncompressed parquet. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dbtsai , i looked into it. I think we have to put it where we write data files, but its a bigger change. There's a test helper we use called FileHelpers.writeDataFile() and we need to change the code to take in map of properties. Then,
but its ok to do in separate pr if you want. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let’s do it in a followup PR to reduce the changes There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Chat offline with @dbtsai , will make follow up to make expected metrics take from DataFile.fileSizeInBytes() |
||
Row leafLongCol = Row.of(54L, 3L, 1L, null, 0L, 1L); | ||
Row metrics = Row.of(Row.of(leafDoubleCol, leafLongCol)); | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed. Thanks.