Spark 3.5: Support default values in vectorized reads #11815
Conversation
@@ -49,7 +49,6 @@ | |||
import org.apache.parquet.schema.MessageType; | |||
import org.apache.parquet.schema.Type; | |||
import org.apache.spark.sql.vectorized.ColumnarBatch; | |||
import org.junit.jupiter.api.Disabled; |
I switched this to use assumptions like the other tests that are based on `AvroDataTest`. I just wanted to be consistent.
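For context, a minimal sketch of that assumption-based pattern (the `supportsDefaultValues()` hook and the test body here are illustrative, not the PR's actual code):

```java
import static org.assertj.core.api.Assumptions.assumeThat;

import java.io.IOException;
import org.junit.jupiter.api.Test;

public class ExampleVectorizedReaderTest {

  // Hypothetical hook: in AvroDataTest-style suites, subclasses override a
  // capability flag like this. The name is assumed for this sketch.
  protected boolean supportsDefaultValues() {
    return true;
  }

  @Test
  public void testDefaultValues() throws IOException {
    // Skip the test at runtime instead of annotating it with @Disabled,
    // so readers that do support defaults still exercise it.
    assumeThat(supportsDefaultValues()).isTrue();
    // ... build a schema with initial defaults and call writeAndValidate ...
  }
}
```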
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java (outdated review thread, resolved)
+1 pending successful CI run
@@ -248,17 +269,6 @@ public void testMixedTypes() throws IOException {
     writeAndValidate(schema);
   }

-  @Test
-  public void testTimestampWithoutZone() throws IOException {
Removing this test for `TimestampNTZ` by adding the type to `SUPPORTED_PRIMITIVES` (so that it is handled like any other primitive) is what broke the ORC tests. It looks like the problem is that Spark 3.5's `ColumnarRow` doesn't support `TimestampNTZType`. As a temporary workaround, I've added validation code that checks the value by accessing it as a `TimestampType` instead.

This isn't a change to read behavior, just how we access the data to validate it. I expect to be able to remove this workaround in the next Spark version.
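A hedged sketch of what such a validation helper could look like (the helper is illustrative, not the PR's code). Since both Spark timestamp types are physically backed by microseconds since the epoch, the value can be read as a raw long and compared as a `LocalDateTime`:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import org.apache.spark.sql.vectorized.ColumnarRow;

public class TimestampNtzValidation {

  // Reads a TimestampNTZ column from a ColumnarRow without asking the row
  // for TimestampNTZType, which Spark 3.5's ColumnarRow does not support.
  static LocalDateTime readTimestampNtz(ColumnarRow row, int ordinal) {
    // Timestamp columns store microseconds since the epoch; getLong returns
    // the same physical value that TimestampType access would.
    long micros = row.getLong(ordinal);
    return LocalDateTime.ofEpochSecond(
        Math.floorDiv(micros, 1_000_000L),
        (int) Math.floorMod(micros, 1_000_000L) * 1_000,
        ZoneOffset.UTC);
  }
}
```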
yeah I noticed that too and was planning on fixing that in Spark. I've opened https://issues.apache.org/jira/browse/SPARK-50624
This follows on #11803 and adds default value support to vectorized reads.
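As a rough sketch of the core idea (the reader classes and builder method here are assumed names for illustration, not the PR's actual implementation, and this assumes `NestedField.initialDefault()` exposes a field's initial default): when a field in the expected schema has no column in the data file, the builder can back it with a constant vector filled from the default, falling back to nulls only for optional fields without one.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

class DefaultAwareReaderBuilder {
  // Stub reader types for this sketch; the real builder produces
  // Arrow-backed vectorized readers.
  interface VectorReader {}

  static class ConstantVectorReader implements VectorReader {
    ConstantVectorReader(Types.NestedField field, Object value) {
      // Fills every row of each batch with the given constant.
    }
  }

  static class NullVectorReader implements VectorReader {
    NullVectorReader(Types.NestedField field) {}
  }

  List<VectorReader> readersFor(Schema expectedSchema, Schema fileSchema) {
    List<VectorReader> readers = new ArrayList<>();
    for (Types.NestedField field : expectedSchema.columns()) {
      if (fileSchema.findField(field.fieldId()) != null) {
        continue; // normal path: read the column from the file (omitted)
      }
      if (field.initialDefault() != null) {
        // Missing from the file but has a default: return the default for
        // every row instead of nulls.
        readers.add(new ConstantVectorReader(field, field.initialDefault()));
      } else if (field.isOptional()) {
        readers.add(new NullVectorReader(field));
      } else {
        throw new IllegalArgumentException(
            "Missing required field with no default: " + field.name());
      }
    }
    return readers;
  }
}
```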