Core: use ZSTD compressed parquet by default #8158

Closed · wants to merge 9 commits · changes shown from 5 commits
5 changes: 5 additions & 0 deletions .palantir/revapi.yml
@@ -774,6 +774,11 @@ acceptedBreaks:
  - code: "java.method.removed"
    old: "method org.apache.iceberg.view.ViewBuilder org.apache.iceberg.view.ViewBuilder::withQueryColumnNames(java.util.List<java.lang.String>)"
    justification: "Acceptable break due to updating View APIs and the View Spec"
+ org.apache.iceberg:iceberg-core:
+ - code: "java.field.constantValueChanged"
+   old: "field org.apache.iceberg.TableProperties.PARQUET_COMPRESSION_DEFAULT"
+   new: "field org.apache.iceberg.TableProperties.PARQUET_COMPRESSION_DEFAULT"
+   justification: "{Changing the default compression codec from gzip to zstd}"
Contributor:
Suggested change:
- justification: "{Changing the default compression codec from gzip to zstd}"
+ justification: "Changing the default compression codec from gzip to zstd"

Member Author:
Addressed. Thanks.

apache-iceberg-0.14.0:
org.apache.iceberg:iceberg-api:
- code: "java.class.defaultSerializationChanged"
5 changes: 3 additions & 2 deletions core/src/main/java/org/apache/iceberg/TableProperties.java
@@ -142,12 +142,13 @@ private TableProperties() {}

public static final String PARQUET_COMPRESSION = "write.parquet.compression-codec";
public static final String DELETE_PARQUET_COMPRESSION = "write.delete.parquet.compression-codec";
public static final String PARQUET_COMPRESSION_DEFAULT = "gzip";
public static final String PARQUET_COMPRESSION_DEFAULT = "zstd";

public static final String PARQUET_COMPRESSION_LEVEL = "write.parquet.compression-level";
public static final String DELETE_PARQUET_COMPRESSION_LEVEL =
"write.delete.parquet.compression-level";
- public static final String PARQUET_COMPRESSION_LEVEL_DEFAULT = null;
+ public static final String PARQUET_COMPRESSION_LEVEL_DEFAULT =
+     null; // For zstd, it is default to "3"
Contributor:
How valuable is this comment, given that it is specific to the underlying codec and may change in the future?

Contributor:
Agreed; similarly, we don't mention that gzip defaults to "6".

Member Author:
Removed.


public static final String PARQUET_ROW_GROUP_CHECK_MIN_RECORD_COUNT =
"write.parquet.row-group-check-min-record-count";
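For context, a minimal sketch of how the effective codec can be resolved from these constants, assuming the plain table-properties lookup pattern (the actual wiring in Iceberg's Parquet writer goes through its write context):

    import java.util.Map;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableProperties;

    class CodecLookup {
      // Illustrative lookup: with this change, a table with no explicit
      // codec property now resolves to "zstd" instead of "gzip".
      static String effectiveParquetCodec(Table table) {
        Map<String, String> props = table.properties();
        return props.getOrDefault(
            TableProperties.PARQUET_COMPRESSION, TableProperties.PARQUET_COMPRESSION_DEFAULT);
      }
    }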
4 changes: 2 additions & 2 deletions docs/configuration.md
@@ -55,8 +55,8 @@ Iceberg tables support table properties to configure table behavior, like the de
| write.parquet.page-size-bytes | 1048576 (1 MB) | Parquet page size |
| write.parquet.page-row-limit | 20000 | Parquet page row limit |
| write.parquet.dict-size-bytes | 2097152 (2 MB) | Parquet dictionary page size |
- | write.parquet.compression-codec | gzip | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed |
- | write.parquet.compression-level | null | Parquet compression level |
+ | write.parquet.compression-codec | zstd | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed |
+ | write.parquet.compression-level | null | Parquet compression level (zstd internally uses 3 as default if not set) |
| write.parquet.bloom-filter-enabled.column.col1 | (not set) | Hint to parquet to write a bloom filter for the column: col1 |
| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
| write.avro.compression-codec | gzip | Avro compression codec: gzip(deflate with 9 level), zstd, snappy, uncompressed |
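For users who want to keep the previous behavior after upgrading, the codec can be pinned per table; a minimal sketch using the existing updateProperties() API (the loaded table variable is assumed):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableProperties;

    // Opt out of the new zstd default by pinning gzip explicitly.
    // "table" is assumed to be an already-loaded org.apache.iceberg.Table.
    table.updateProperties().set(TableProperties.PARQUET_COMPRESSION, "gzip").commit();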
@@ -291,7 +291,7 @@ public void testSelectNestedValues() throws Exception {
public void testNestedValues() throws Exception {
createNestedTable();

-   Row leafDoubleCol = Row.of(53L, 3L, 1L, 1L, 0.0D, 0.0D);
+   Row leafDoubleCol = Row.of(46L, 3L, 1L, 1L, 0.0D, 0.0D);
Member Author:
@szehon-ho @RussellSpitzer do you know what leafDoubleCol is? Is it some kind of metric that could change if the compression codec is changed?

Collaborator:
Yeah, I think it includes the size; sorry for the lack of a comment. Let's disable zstd in:

    Table table =
        catalog.createTable(
            TableIdentifier.of(Namespace.of(database()), tableName()),
            PRIMITIVE_SCHEMA,
            PartitionSpec.unpartitioned(),
            ImmutableMap.of());
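A sketch of what pinning the codec there could look like, assuming the codec is fixed via the table-properties map passed to createTable (choosing "uncompressed" here is illustrative):

    // Hypothetical tweak: pin the codec at table creation so size-based
    // metric assertions stop depending on the default compression.
    Table table =
        catalog.createTable(
            TableIdentifier.of(Namespace.of(database()), tableName()),
            PRIMITIVE_SCHEMA,
            PartitionSpec.unpartitioned(),
            ImmutableMap.of(TableProperties.PARQUET_COMPRESSION, "uncompressed"));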

Member Author:
Yeah, I can confirm it's the size metric. Because of the change to zstd, the sizes are reduced overall. I updated the tests to the new metrics. We can have a follow-up PR to switch those metrics-related tests to uncompressed Parquet.

Collaborator (@szehon-ho, Jul 28, 2023):

@dbtsai, I looked into it. I think we have to put it where we write data files, but it's a bigger change. There's a test helper we use called FileHelpers.writeDataFile(), and we need to change the code to take in a map of properties.

Then,

  public static DataFile writeDataFile(Table table, OutputFile out, List<Record> rows, Map<String, String> properties)
      throws IOException {
    FileFormat format = defaultFormat(table.properties());
    GenericAppenderFactory factory = new GenericAppenderFactory(table.schema());
    properties.forEach(factory::set);

But it's OK to do in a separate PR if you want.
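If that signature change lands, a hypothetical call site could pass the codec per test (the four-argument overload is the proposal above, not an existing API):

    // Hypothetical call site once writeDataFile() accepts a properties map.
    DataFile file =
        FileHelpers.writeDataFile(
            table,
            out,
            rows,
            ImmutableMap.of(TableProperties.PARQUET_COMPRESSION, "uncompressed"));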

Member Author:
Let's do it in a follow-up PR to keep this change small.

Collaborator (@szehon-ho, Jul 28, 2023):

Chatted offline with @dbtsai; will make a follow-up to take the expected metrics from DataFile.fileSizeInBytes().
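A rough sketch of that follow-up direction, deriving the expected sizes from the written file's metrics rather than hard-coding them (DataFile.fileSizeInBytes() and DataFile.columnSizes() are existing APIs; the field id used here is illustrative):

    // Hypothetical follow-up: read size expectations off the written DataFile
    // so the test no longer encodes the default codec's output size (53 vs 46).
    long fileSize = dataFile.fileSizeInBytes();          // whole-file size
    long leafDoubleSize = dataFile.columnSizes().get(3); // illustrative field id
    Row leafDoubleCol = Row.of(leafDoubleSize, 3L, 1L, 1L, 0.0D, 0.0D);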

Row leafLongCol = Row.of(54L, 3L, 1L, null, 0L, 1L);
Row metrics = Row.of(Row.of(leafDoubleCol, leafLongCol));

@@ -291,7 +291,7 @@ public void testSelectNestedValues() throws Exception {
public void testNestedValues() throws Exception {
createNestedTable();

-   Row leafDoubleCol = Row.of(53L, 3L, 1L, 1L, 0.0D, 0.0D);
+   Row leafDoubleCol = Row.of(46L, 3L, 1L, 1L, 0.0D, 0.0D);
Row leafLongCol = Row.of(54L, 3L, 1L, null, 0L, 1L);
Row metrics = Row.of(Row.of(leafDoubleCol, leafLongCol));

@@ -291,7 +291,7 @@ public void testSelectNestedValues() throws Exception {
public void testNestedValues() throws Exception {
createNestedTable();

-   Row leafDoubleCol = Row.of(53L, 3L, 1L, 1L, 0.0D, 0.0D);
+   Row leafDoubleCol = Row.of(46L, 3L, 1L, 1L, 0.0D, 0.0D);
Row leafLongCol = Row.of(54L, 3L, 1L, null, 0L, 1L);
Row metrics = Row.of(Row.of(leafDoubleCol, leafLongCol));

@@ -326,7 +326,7 @@ public void testSelectNestedValues() throws Exception {
public void testNestedValues() throws Exception {
createNestedTable();

-   Object[] leafDoubleCol = row(53L, 3L, 1L, 1L, 0.0D, 0.0D);
+   Object[] leafDoubleCol = row(46L, 3L, 1L, 1L, 0.0D, 0.0D);
Object[] leafLongCol = row(54L, 3L, 1L, null, 0L, 1L);
Object[] metrics = row(leafDoubleCol, leafLongCol);

@@ -353,7 +353,7 @@ public void testSelectNestedValues() throws Exception {
public void testNestedValues() throws Exception {
createNestedTable();

-   Object[] leafDoubleCol = row(53L, 3L, 1L, 1L, 0.0D, 0.0D);
+   Object[] leafDoubleCol = row(46L, 3L, 1L, 1L, 0.0D, 0.0D);
Object[] leafLongCol = row(54L, 3L, 1L, null, 0L, 1L);
Object[] metrics = row(leafDoubleCol, leafLongCol);
