
Spark 3.5: Support default values in vectorized reads #11815

Merged (3 commits) on Dec 19, 2024

Conversation

@rdblue (Contributor, Author) commented on Dec 18, 2024:

This follows on #11803 and adds default value support to vectorized reads.

@@ -49,7 +49,6 @@
 import org.apache.parquet.schema.MessageType;
 import org.apache.parquet.schema.Type;
 import org.apache.spark.sql.vectorized.ColumnarBatch;
-import org.junit.jupiter.api.Disabled;
@rdblue (Contributor, Author) commented:

I switched this to use assumptions like the other tests that are based on AvroDataTest. I just wanted to be consistent.
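For context, a minimal sketch of the assumption-based pattern, using JUnit 5's Assumptions. The supportsDefaultValues() hook and class name below are hypothetical stand-ins to illustrate the idea, not the actual AvroDataTest API:

import static org.junit.jupiter.api.Assumptions.assumeTrue;

import java.io.IOException;
import org.junit.jupiter.api.Test;

class ExampleVectorizedReadTest {
  // hypothetical capability flag; the real base class exposes a similar hook
  protected boolean supportsDefaultValues() {
    return true;
  }

  @Test
  public void testDefaultValues() throws IOException {
    // skip at runtime instead of permanently disabling the test with @Disabled
    assumeTrue(supportsDefaultValues(), "Reader must support default values");
    // ... writeAndValidate(schemaWithDefaults) ...
  }
}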

@nastra (Contributor) left a comment:

+1 pending successful CI run

@@ -248,17 +269,6 @@ public void testMixedTypes() throws IOException {
     writeAndValidate(schema);
   }
 
-  @Test
-  public void testTimestampWithoutZone() throws IOException {
@rdblue (Contributor, Author) commented:

Removing this test for TimestampNTZ by adding the type to SUPPORTED_PRIMITIVES (so that it is handled like any other primitive) is what broke the ORC tests. It looks like the problem is that Spark 3.5's ColumnarRow doesn't support TimestampNTZType. As a temporary workaround, I've added validation code that checks the value by accessing it as a TimestampType instead.

This isn't a change to read behavior, just how we access the data to validate it. I expect to be able to remove this workaround in the next Spark version.
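As an illustration only, a sketch of what that validation workaround might look like; the helper and its names are assumptions, not the actual test code. When the expected type is TimestampNTZ, the value is read from the row as if it were TimestampType, since both types are stored as microseconds since the epoch:

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

class TimestampNtzValidationWorkaround {
  // hypothetical helper: reads a value for comparison in test assertions
  static Object readForValidation(InternalRow row, int ordinal, DataType type) {
    if (row.isNullAt(ordinal)) {
      return null;
    }
    if (DataTypes.TimestampNTZType.sameType(type)) {
      // Spark 3.5's ColumnarRow does not handle TimestampNTZType in its generic
      // get(ordinal, dataType) path (see SPARK-50624), so access the value as
      // TimestampType; both types hold a long of microseconds since epoch.
      return row.get(ordinal, DataTypes.TimestampType);
    }
    return row.get(ordinal, type);
  }
}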

@nastra (Contributor) replied on Dec 19, 2024:

Yeah, I noticed that too and was planning on fixing that in Spark. I've opened https://issues.apache.org/jira/browse/SPARK-50624
