Spark: Test reading default values in Spark #11832

Merged: 5 commits into apache:main, Dec 21, 2024

Conversation

@rdblue (Contributor) commented Dec 19, 2024

This updates Spark's tests for scans and data frame writes to validate default values.

This fixes problems found in testing:

  • ReassignIds was dropping defaults
  • SchemaParser did not support either initial-default or write-default (see the round-trip sketch after this list)
  • SchemaParser did not have a test suite
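
For context, a minimal round-trip sketch of the kind of coverage the new SchemaParser tests add. SchemaParser.toJson/fromJson are Iceberg's existing API, but the exact NestedField builder calls (withInitialDefault, withWriteDefault) shown here are assumptions based on the defaults work this PR builds on:

import static org.assertj.core.api.Assertions.assertThat;

import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;
import org.junit.jupiter.api.Test;

class TestSchemaParserDefaults {
  @Test
  void roundTripPreservesDefaults() {
    // A field that carries both an initial default (for existing rows) and a
    // write default (for new rows that omit the column).
    Schema schema =
        new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional("data")
                .withId(2)
                .ofType(Types.StringType.get())
                .withInitialDefault(Literal.of("unknown"))
                .withWriteDefault(Literal.of("unknown"))
                .build());

    // Serialize and parse back; before this PR the defaults were dropped here.
    Schema parsed = SchemaParser.fromJson(SchemaParser.toJson(schema));
    assertThat(parsed.asStruct()).isEqualTo(schema.asStruct());
  }
}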

This also refactors the data frame writer tests and removes ParameterizedAvroDataTest that was an unnecessary copy of AvroDataTest. To avoid needing the duplicate test suite, this updates the tests to inherit from a base class like the scan tests. Last, there were a few unnecessary tests that have been removed. One was testing basic Spark behavior (no commit if an action fails) and the others were only valid for Spark 2.x.

-    .isInstanceOf(IllegalArgumentException.class)
-    .hasMessage("Missing required field: missing_str");
+    .hasRootCauseInstanceOf(IllegalArgumentException.class)
+    .hasMessageContaining("Missing required field: missing_str");
@rdblue (Author) commented:

This was needed to validate the reader failure in testMissingRequiredWithoutDefault in Spark scans: the failure happens on executors and is wrapped in a SparkException by the time it is rethrown on the driver.
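
For reference, the resulting assertion looks roughly like the following sketch (the read call and the location variable are illustrative, not from this diff; hasRootCauseInstanceOf and hasMessageContaining are standard AssertJ):

import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.spark.SparkException;

// The executor-side IllegalArgumentException surfaces on the driver wrapped in
// a SparkException, so the test asserts on the root cause rather than the
// top-level exception type.
assertThatThrownBy(() -> spark.read().format("iceberg").load(location).collectAsList())
    .isInstanceOf(SparkException.class)
    .hasRootCauseInstanceOf(IllegalArgumentException.class)
    .hasMessageContaining("Missing required field: missing_str");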

@@ -542,44 +539,4 @@ public void testPrimitiveTypeDefaultValues(Type.PrimitiveType type, Object defau

writeAndValidate(writeSchema, readSchema);
}

protected void withSQLConf(Map<String, String> conf, Action action) throws IOException {
@rdblue (Author) commented:

This was unused.

@@ -96,11 +96,20 @@ public static void assertEqualsSafe(Types.StructType struct, List<Record> recs,

 public static void assertEqualsSafe(Types.StructType struct, Record rec, Row row) {
   List<Types.NestedField> fields = struct.fields();
-  for (int i = 0; i < fields.size(); i += 1) {
-    Type fieldType = fields.get(i).type();
+  for (int readPos = 0; readPos < fields.size(); readPos += 1) {
@rdblue (Author) commented on Dec 19, 2024:

These changes mirror what was already done in #11803 and #11811.
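
A hedged sketch of where the loop ends up (the initialDefault accessor is an assumption based on the defaults work in those PRs; the idea is that a column missing from the written record is compared against its initial default):

for (int readPos = 0; readPos < fields.size(); readPos += 1) {
  Types.NestedField field = fields.get(readPos);
  // Look up the field in the schema the record was actually written with.
  org.apache.avro.Schema.Field writtenField = rec.getSchema().getField(field.name());
  // A column that was never written should be read back as its initial default.
  Object expected =
      writtenField != null ? rec.get(writtenField.pos()) : field.initialDefault();
  assertEqualsSafe(field.type(), expected, row.get(readPos));
}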

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public abstract class DataFrameWriteTestBase extends ScanTestBase {
@rdblue (Author) commented:

New base suite for tests of data frame writes, which replaces TestDataFrameWrites and ParameterizedAvroDataTest.

import org.junit.jupiter.api.io.TempDir;

/** An AvroDataScan test that validates data by reading through Spark */
public abstract class ScanTestBase extends AvroDataTest {
@rdblue (Author) commented:

New base class for scan tests (TestAvroScan, TestParquetScan, TestParquetVectorizedScan).
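
Roughly, the pattern such a base class factors out looks like this sketch (writeRecords is a hypothetical stand-in for the format-specific hook each subclass implements; the RandomData call and the location variable are assumptions, not code from this PR):

@Override
protected void writeAndValidate(Schema writeSchema, Schema expectedSchema) throws IOException {
  // Generate records, write them in the format under test, then read the
  // files back through Spark and compare row by row.
  List<Record> expected = RandomData.generateList(writeSchema, 100, 0L);
  writeRecords(writeSchema, expected); // format-specific, implemented by subclasses

  List<Row> rows = spark.read().format("iceberg").load(location.toString()).collectAsList();
  assertThat(rows).hasSameSizeAs(expected);
  for (int i = 0; i < expected.size(); i += 1) {
    assertEqualsSafe(expectedSchema.asStruct(), expected.get(i), rows.get(i));
  }
}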


@Parameters(name = "format = {0}")
public static Collection<String> parameters() {
return Arrays.asList("parquet", "avro", "orc");
@rdblue (Author) commented:

This was broken into DataFrameWriteTestBase and subclasses for each format (a subclass sketch follows this list):

  • TestAvroDataFrameWrite
  • TestParquetDataFrameWrite
  • TestORCDataFrameWrite
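
Each subclass presumably just pins the format, along these lines (the body is an assumption; only the class names appear in this PR, and writeRecords is the same hypothetical hook as in the earlier sketch):

public class TestParquetDataFrameWrite extends DataFrameWriteTestBase {
  @Override
  protected void writeRecords(Schema schema, List<Record> records) throws IOException {
    // Write the records as Parquet with Iceberg's writers; the shared schema
    // and default-value cases are inherited from the base suite.
  }
}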

}

@TestTemplate
public void testWriteWithCustomDataLocation() throws IOException {
@rdblue (Author) commented:

Replaced by DataFrameWriteTestBase#testAlternateLocation.

}

@TestTemplate
public void testNullableWithWriteOption() throws IOException {
@rdblue (Author) commented:

This test assumes Spark 2.x, so it is no longer needed.

}

@TestTemplate
public void testNullableWithSparkSqlOption() throws IOException {
@rdblue (Author) commented:

This test assumes Spark 2.x, so it is no longer needed.

}

@TestTemplate
public void testFaultToleranceOnWrite() throws IOException {
@rdblue (Author) commented:

I dropped this test because it is testing basic Spark behavior and doesn't belong in scan and write tests for specific schemas. I didn't move it anywhere because I don't think it is a valuable test. Spark stage failures throw exceptions and don't commit. I think it was originally trying to check for side-effects, but that isn't necessary in Iceberg.

@Fokko (Contributor) left a comment:

nice

api/src/main/java/org/apache/iceberg/types/Types.java (review thread outdated, resolved)
@rdblue merged commit cd187c5 into apache:main on Dec 21, 2024. 50 checks passed.
@rdblue (Author) commented Dec 21, 2024:

Thanks for the reviews, @Fokko!
