Spark: Test reading default values in Spark #11832

Merged: 5 commits into apache:main, Dec 21, 2024

Conversation

@rdblue (Contributor) commented Dec 19, 2024

This updates Spark's tests for scans and data frame writes to validate default values.

This fixes problems found in testing:

  • ReassignIds was dropping defaults
  • SchemaParser did not support either initial-default or write-default (see the round-trip sketch after this list)
  • SchemaParser did not have a test suite
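
For context, a minimal round-trip sketch of the kind of coverage the new SchemaParser tests add. SchemaParser.toJson/fromJson are Iceberg's existing API, but the exact NestedField builder calls (withInitialDefault, withWriteDefault) shown here are assumptions based on the defaults work this PR builds on:

import static org.assertj.core.api.Assertions.assertThat;

import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;
import org.junit.jupiter.api.Test;

class TestSchemaParserDefaults {
  @Test
  void roundTripPreservesDefaults() {
    // A field that carries both an initial default (for existing rows) and a
    // write default (for new rows that omit the column).
    Schema schema =
        new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional("data")
                .withId(2)
                .ofType(Types.StringType.get())
                .withInitialDefault(Literal.of("unknown"))
                .withWriteDefault(Literal.of("unknown"))
                .build());

    // Serialize and parse back; before this PR the defaults were dropped here.
    Schema parsed = SchemaParser.fromJson(SchemaParser.toJson(schema));
    assertThat(parsed.asStruct()).isEqualTo(schema.asStruct());
  }
}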

This also refactors the data frame writer tests and removes ParameterizedAvroDataTest that was an unnecessary copy of AvroDataTest. To avoid needing the duplicate test suite, this updates the tests to inherit from a base class like the scan tests. Last, there were a few unnecessary tests that have been removed. One was testing basic Spark behavior (no commit if an action fails) and the others were only valid for Spark 2.x.

-    .isInstanceOf(IllegalArgumentException.class)
-    .hasMessage("Missing required field: missing_str");
+    .hasRootCauseInstanceOf(IllegalArgumentException.class)
+    .hasMessageContaining("Missing required field: missing_str");
@rdblue (Author) commented:

This was needed to validate the reader failure in testMissingRequiredWithoutDefault in Spark scans: the failure happens on executors and is wrapped in a SparkException by the time it is rethrown on the driver.
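
For reference, the resulting assertion looks roughly like the following sketch (the read call and the location variable are illustrative, not from this diff; hasRootCauseInstanceOf and hasMessageContaining are standard AssertJ):

import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.spark.SparkException;

// The executor-side IllegalArgumentException surfaces on the driver wrapped in
// a SparkException, so the test asserts on the root cause rather than the
// top-level exception type.
assertThatThrownBy(() -> spark.read().format("iceberg").load(location).collectAsList())
    .isInstanceOf(SparkException.class)
    .hasRootCauseInstanceOf(IllegalArgumentException.class)
    .hasMessageContaining("Missing required field: missing_str");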

@@ -542,44 +539,4 @@ public void testPrimitiveTypeDefaultValues(Type.PrimitiveType type, Object defau

writeAndValidate(writeSchema, readSchema);
}

protected void withSQLConf(Map<String, String> conf, Action action) throws IOException {
@rdblue (Author) commented:

This was unused.

@@ -96,11 +96,20 @@ public static void assertEqualsSafe(Types.StructType struct, List<Record> recs,

 public static void assertEqualsSafe(Types.StructType struct, Record rec, Row row) {
   List<Types.NestedField> fields = struct.fields();
-  for (int i = 0; i < fields.size(); i += 1) {
-    Type fieldType = fields.get(i).type();
+  for (int readPos = 0; readPos < fields.size(); readPos += 1) {
@rdblue (Author) commented on Dec 19, 2024:

These changes mirror what was already done in #11803 and #11811.
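
A hedged sketch of where the loop ends up (the initialDefault accessor is an assumption based on the defaults work in those PRs; the idea is that a column missing from the written record is compared against its initial default):

for (int readPos = 0; readPos < fields.size(); readPos += 1) {
  Types.NestedField field = fields.get(readPos);
  // Look up the field in the schema the record was actually written with.
  org.apache.avro.Schema.Field writtenField = rec.getSchema().getField(field.name());
  // A column that was never written should be read back as its initial default.
  Object expected =
      writtenField != null ? rec.get(writtenField.pos()) : field.initialDefault();
  assertEqualsSafe(field.type(), expected, row.get(readPos));
}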

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public abstract class DataFrameWriteTestBase extends ScanTestBase {
@rdblue (Author) commented:

New base suite for tests of data frame writes, which replaces TestDataFrameWrites and ParameterizedAvroDataTest.

import org.junit.jupiter.api.io.TempDir;

/** An AvroDataScan test that validates data by reading through Spark */
public abstract class ScanTestBase extends AvroDataTest {
@rdblue (Author) commented:

New base class for scan tests (TestAvroScan, TestParquetScan, TestParquetVectorizedScan).
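
Roughly, the pattern such a base class factors out looks like this sketch (writeRecords is a hypothetical stand-in for the format-specific hook each subclass implements; the RandomData call and the location variable are assumptions, not code from this PR):

@Override
protected void writeAndValidate(Schema writeSchema, Schema expectedSchema) throws IOException {
  // Generate records, write them in the format under test, then read the
  // files back through Spark and compare row by row.
  List<Record> expected = RandomData.generateList(writeSchema, 100, 0L);
  writeRecords(writeSchema, expected); // format-specific, implemented by subclasses

  List<Row> rows = spark.read().format("iceberg").load(location.toString()).collectAsList();
  assertThat(rows).hasSameSizeAs(expected);
  for (int i = 0; i < expected.size(); i += 1) {
    assertEqualsSafe(expectedSchema.asStruct(), expected.get(i), rows.get(i));
  }
}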


@Parameters(name = "format = {0}")
public static Collection<String> parameters() {
return Arrays.asList("parquet", "avro", "orc");
@rdblue (Author) commented:

This was broken into DataFrameWriteTestBase and subclasses for each format (a subclass sketch follows this list):

  • TestAvroDataFrameWrite
  • TestParquetDataFrameWrite
  • TestORCDataFrameWrite
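
Each subclass presumably just pins the format, along these lines (the body is an assumption; only the class names appear in this PR, and writeRecords is the same hypothetical hook as in the earlier sketch):

public class TestParquetDataFrameWrite extends DataFrameWriteTestBase {
  @Override
  protected void writeRecords(Schema schema, List<Record> records) throws IOException {
    // Write the records as Parquet with Iceberg's writers; the shared schema
    // and default-value cases are inherited from the base suite.
  }
}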

}

@TestTemplate
public void testWriteWithCustomDataLocation() throws IOException {
@rdblue (Author) commented:

Replaced by DataFrameWriteTestBase#testAlternateLocation.

}

@TestTemplate
public void testNullableWithWriteOption() throws IOException {
@rdblue (Author) commented:

This test assumes Spark 2.x, so it is no longer needed.

}

@TestTemplate
public void testNullableWithSparkSqlOption() throws IOException {
@rdblue (Author) commented:

This test assumes Spark 2.x, so it is no longer needed.

}

@TestTemplate
public void testFaultToleranceOnWrite() throws IOException {
@rdblue (Author) commented:

I dropped this test because it is testing basic Spark behavior and doesn't belong in scan and write tests for specific schemas. I didn't move it anywhere because I don't think it is a valuable test. Spark stage failures throw exceptions and don't commit. I think it was originally trying to check for side-effects, but that isn't necessary in Iceberg.

@Fokko (Contributor) left a comment:

nice

api/src/main/java/org/apache/iceberg/types/Types.java (review thread outdated, resolved)
@rdblue merged commit cd187c5 into apache:main on Dec 21, 2024. 50 checks passed.
@rdblue (Author) commented Dec 21, 2024:

Thanks for the reviews, @Fokko!
