Spark 3.5: Support default values in vectorized reads #11815
Conversation
@@ -49,7 +49,6 @@ | |||
import org.apache.parquet.schema.MessageType; | |||
import org.apache.parquet.schema.Type; | |||
import org.apache.spark.sql.vectorized.ColumnarBatch; | |||
import org.junit.jupiter.api.Disabled; |
I switched this to use assumptions like the other tests that are based on `AvroDataTest`. I just wanted to be consistent.
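For context, a minimal sketch of that assumption-based pattern (the `supportsDefaultValues()` hook and the test body here are illustrative, not the PR's actual code):

```java
import static org.assertj.core.api.Assumptions.assumeThat;

import java.io.IOException;
import org.junit.jupiter.api.Test;

public class ExampleVectorizedReaderTest {

  // Hypothetical hook: in AvroDataTest-style suites, subclasses override a
  // capability flag like this. The name is assumed for this sketch.
  protected boolean supportsDefaultValues() {
    return true;
  }

  @Test
  public void testDefaultValues() throws IOException {
    // Skip the test at runtime instead of annotating it with @Disabled,
    // so readers that do support defaults still exercise it.
    assumeThat(supportsDefaultValues()).isTrue();
    // ... build a schema with initial defaults and call writeAndValidate ...
  }
}
```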
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java (outdated review thread, resolved)
+1 pending successful CI run
@@ -248,17 +269,6 @@ public void testMixedTypes() throws IOException {
     writeAndValidate(schema);
   }

-  @Test
-  public void testTimestampWithoutZone() throws IOException {
Removing this test for `TimestampNTZ` by adding the type to `SUPPORTED_PRIMITIVES` (so that it is handled like any other primitive) is what broke the ORC tests. It looks like the problem is that Spark 3.5's `ColumnarRow` doesn't support `TimestampNTZType`. As a temporary workaround, I've added validation code that checks the value by accessing it as a `TimestampType` instead.

This isn't a change to read behavior, just how we access the data to validate it. I expect to be able to remove this workaround in the next Spark version.
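A hedged sketch of what such a validation helper could look like (the helper is illustrative, not the PR's code). Since both Spark timestamp types are physically backed by microseconds since the epoch, the value can be read as a raw long and compared as a `LocalDateTime`:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import org.apache.spark.sql.vectorized.ColumnarRow;

public class TimestampNtzValidation {

  // Reads a TimestampNTZ column from a ColumnarRow without asking the row
  // for TimestampNTZType, which Spark 3.5's ColumnarRow does not support.
  static LocalDateTime readTimestampNtz(ColumnarRow row, int ordinal) {
    // Timestamp columns store microseconds since the epoch; getLong returns
    // the same physical value that TimestampType access would.
    long micros = row.getLong(ordinal);
    return LocalDateTime.ofEpochSecond(
        Math.floorDiv(micros, 1_000_000L),
        (int) Math.floorMod(micros, 1_000_000L) * 1_000,
        ZoneOffset.UTC);
  }
}
```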
yeah I noticed that too and was planning on fixing that in Spark. I've opened https://issues.apache.org/jira/browse/SPARK-50624
This follows on #11803 and adds default value support to vectorized reads.
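As a rough sketch of the core idea (the reader classes and builder method here are assumed names for illustration, not the PR's actual implementation, and this assumes `NestedField.initialDefault()` exposes a field's initial default): when a field in the expected schema has no column in the data file, the builder can back it with a constant vector filled from the default, falling back to nulls only for optional fields without one.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

class DefaultAwareReaderBuilder {
  // Stub reader types for this sketch; the real builder produces
  // Arrow-backed vectorized readers.
  interface VectorReader {}

  static class ConstantVectorReader implements VectorReader {
    ConstantVectorReader(Types.NestedField field, Object value) {
      // Fills every row of each batch with the given constant.
    }
  }

  static class NullVectorReader implements VectorReader {
    NullVectorReader(Types.NestedField field) {}
  }

  List<VectorReader> readersFor(Schema expectedSchema, Schema fileSchema) {
    List<VectorReader> readers = new ArrayList<>();
    for (Types.NestedField field : expectedSchema.columns()) {
      if (fileSchema.findField(field.fieldId()) != null) {
        continue; // normal path: read the column from the file (omitted)
      }
      if (field.initialDefault() != null) {
        // Missing from the file but has a default: return the default for
        // every row instead of nulls.
        readers.add(new ConstantVectorReader(field, field.initialDefault()));
      } else if (field.isOptional()) {
        readers.add(new NullVectorReader(field));
      } else {
        throw new IllegalArgumentException(
            "Missing required field with no default: " + field.name());
      }
    }
    return readers;
  }
}
```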