
Python: allow projection of Iceberg fields to pyarrow table schema with names #8144

Closed
wants to merge 1 commit

Conversation


@moriyoshi moriyoshi commented Jul 25, 2023

Fixes #7451.

Design note

In this PR, the following two new keyword arguments are introduced to `Table.to_pyarrow`, `Table.to_pandas`, and the like.

  • match_with_field_name (bool)
    • Setting this to True instructs it to map fields between the data file (pyarrow) schema and the Iceberg schema by name when the mapping by field ids is not feasible. Setting this to False keeps the normal behavior.
  • ignore_unprojectable_fields (bool)
    • Setting this to True instructs it to ignore fields that are present in the data file (pyarrow) schema but absent from the Iceberg schema.
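As a rough illustration of the policy these two flags describe (this helper and its signature are hypothetical, not the PR's actual code), the matching logic could be sketched as:

```python
from typing import Dict, Optional

def resolve_field_id(
    file_field_name: str,
    file_field_id: Optional[int],
    iceberg_ids: Dict[int, str],    # Iceberg schema: field id -> field name
    iceberg_names: Dict[str, int],  # Iceberg schema: field name -> field id
    match_with_field_name: bool,
    ignore_unprojectable_fields: bool,
) -> Optional[int]:
    """Map one data-file field to an Iceberg field id, or None to drop it."""
    # Normal behavior: match by the field id embedded in the data file.
    if file_field_id is not None and file_field_id in iceberg_ids:
        return file_field_id
    # match_with_field_name=True: fall back to matching by name when the
    # id-based mapping is not feasible (e.g. files migrated from Delta Lake).
    if match_with_field_name and file_field_name in iceberg_names:
        return iceberg_names[file_field_name]
    # ignore_unprojectable_fields=True: silently drop fields present in the
    # data file but absent from the Iceberg schema; otherwise fail loudly.
    if ignore_unprojectable_fields:
        return None
    raise ValueError(f"Could not find field with name {file_field_name}")
```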

@moriyoshi moriyoshi force-pushed the feature/impl-7451 branch from 3bcd1a3 to e303f1a Compare July 26, 2023 09:34
@moriyoshi moriyoshi marked this pull request as ready for review July 26, 2023 09:34
@moriyoshi
Author

Tests are ready; the mypy checks fail due to pre-existing oddities.

Contributor

@JonasJ-ap JonasJ-ap left a comment


Thanks for your great contribution @moriyoshi. It is really a nice feature to add!

I conducted some tests using an Iceberg table migrated from Delta Lake (no field ids in the data files), which was created via the following Spark DataFrame:

spark
            .range(0, 5, 1, 5)
            .withColumn("longCol", expr("id"))
            .withColumn("decimalCol", expr("CAST(longCol AS DECIMAL(10, 2))"))
            .withColumn("magic_number", expr("rand(5) * 100"))
            .withColumn("dateCol", date_add(current_date(), 1))
            .withColumn("dateString", expr("CAST(dateCol AS STRING)"))
            .withColumn("random1", expr("CAST(rand(5) * 100 as LONG)"))
            .withColumn("random2", expr("CAST(rand(51) * 100 as LONG)"))
            .withColumn("random3", expr("CAST(rand(511) * 100 as LONG)"))
            .withColumn("random4", expr("CAST(rand(15) * 100 as LONG)"))
            .withColumn("random5", expr("CAST(rand(115) * 100 as LONG)"))
            .withColumn("innerStruct1", expr("STRUCT(random1, random2)"))
            .withColumn("innerStruct2", expr("STRUCT(random3, random4)"))
            .withColumn("structCol1", expr("STRUCT(innerStruct1, innerStruct2)"))
            .withColumn(
                "innerStruct3",
                expr("STRUCT(SHA1(CAST(random5 AS BINARY)), SHA1(CAST(random1 AS BINARY)))"))
            .withColumn(
                "structCol2",
                expr(
                    "STRUCT(innerStruct3, STRUCT(SHA1(CAST(random2 AS BINARY)), SHA1(CAST(random3 AS BINARY))))"))
            .withColumn("arrayCol", expr("ARRAY(random1, random2, random3, random4, random5)"))
            .withColumn("arrayStructCol", expr("ARRAY(innerStruct1, innerStruct1, innerStruct1)"))
            .withColumn("mapCol1", expr("MAP(structCol1, structCol2)"))
            .withColumn("mapCol2", expr("MAP(longCol, dateString)"))
            .withColumn("mapCol3", expr("MAP(dateCol, arrayCol)"))
            .withColumn("structCol3", expr("STRUCT(structCol2, mapCol3, arrayCol)"));

I'd like to highlight some areas where we could potentially improve.

  1. If we choose to filter out some nested fields, pyarrow_to_schema will fail even with ignore_unprojectable_fields = True.
  2. Maybe we can let pyarrow_to_schema take the complete table schema rather than the projected schema. In this way, we can focus on dealing with fields that are missing from the table schema and let
    file_project_schema = prune_columns(file_schema, projected_field_ids, select_full_types=False)
    handle the unselected columns.
  3. If we have a MapType whose key type is nested, pyarrow_to_schema will fail to switch to the correct inner schema when visiting the MapType's value.

Please correct me if I misunderstand something.

@@ -360,6 +361,65 @@ def test_schema_to_pyarrow_schema(table_schema_nested: Schema) -> None:
assert repr(actual) == expected


def test_pyarrow_to_schema(table_schema_simple: Schema, table_schema_nested: Schema) -> None:
Contributor


Do you think it's a good idea to move these two tests to test_pyarrow_visitor.py (which contains the other tests for pyarrow_to_schema)? We currently have too many tests in test_pyarrow.py; I think we may want to stop adding new tests to it and consider refactoring it into different files.

Author


Yes, now that test_pyarrow.py contains some backend-specific tests, it makes more sense to move those visitor-related tests to another file.

) -> Schema:
visitor = _ConvertToIceberg(projected_schema, match_with_field_name, ignore_unprojectable_fields)
ib_schema = visit_pyarrow(schema, visitor)
assert isinstance(ib_schema, StructType)
Contributor


Could you please replace it with a ValueError or something similar? According to comments on other PRs, we try to avoid assert outside tests/.

Author


This is not intended to warn the user about wrong usage. It is a type guard for mypy, and I believe it's valid.
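For readers unfamiliar with the point being made: an assert like this narrows types for mypy rather than validating user input. A minimal, unrelated illustration:

```python
from typing import Union

def bump(value: Union[int, str]) -> int:
    # Type guard: after this assert, mypy narrows `value` to int, so the
    # arithmetic below type-checks without an explicit cast().
    assert isinstance(value, int)
    return value + 1
```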

python/pyiceberg/io/pyarrow.py (outdated; resolved)
python/pyiceberg/io/pyarrow.py (outdated; resolved)
@@ -749,7 +869,16 @@ def _task_to_table(
schema_raw = metadata.get(ICEBERG_SCHEMA)
# TODO: if field_ids are not present, Name Mapping should be implemented to look them up in the table schema,
Contributor

@JonasJ-ap JonasJ-ap Jul 28, 2023


I think this TODO can be removed?

Author


Right, considering the purpose of this PR :)

if not self.match_with_field_name:
raise
if projected_field is None and self.match_with_field_name:
projected_field = self.projected_schema.find_field(field.name)
Contributor


[Question] Do we need the try...except and the case on ignore_unprojectable_fields here?

Contributor


For example, I have a table with schema:

Schema, id=0
                      ├── 1: name: optional struct<3: firstname: optional string, 4: middlename: optional string, 5: lastname: optional
                      │   string>
                      └── 2: address: optional struct<6: current: optional struct<8: state: optional string, 9: city: optional string>, 7:
                          previous: optional struct<10: state: optional string, 11: city: optional string>>

If I only query the name column:

catalog.load_table(table_name).scan(selected_fields=("name", )).to_pandas(match_with_field_name=True, ignore_unprojectable_fields=True)

I get the following error, raised by the find_field here:

ValueError: Could not find field with name address, case_sensitive=True
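A hedged sketch (hypothetical helper, not the PR's code) of how the lookup could honor ignore_unprojectable_fields instead of letting the ValueError escape:

```python
from typing import Dict, Optional

def find_field_or_skip(
    fields: Dict[str, int],  # stand-in for the projected schema's name lookup
    name: str,
    ignore_unprojectable_fields: bool,
) -> Optional[int]:
    """Look up a field by name; return None instead of raising when allowed."""
    if name in fields:
        return fields[name]
    if ignore_unprojectable_fields:
        # The field exists in the data file but was not selected by the scan
        # projection; drop it rather than failing the whole read.
        return None
    raise ValueError(f"Could not find field with name {name}, case_sensitive=True")
```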

Author


This is exactly what was addressed in that comment, and it has been fixed.

self.next = None
if not isinstance(field.type, (pa.StructType, pa.ListType, pa.MapType)):
return
(self.projected_schema, self.next) = self.projected_schema_stack.pop()
Contributor


I think we may want to step self.next forward (key -> value, value -> None) when popping from the stack, since by the after_field phase we have already visited the field that the self.next stored on the stack pointed to.

This can cause an error when we have a MapType whose key and value types are both nested:

19: mapCol1: optional map<struct<47: innerStruct1: optional struct<49: random1: optional long, 50: random2:
                      │   optional long>, 48: innerStruct2: optional struct<51: random3: optional long, 52: random4: optional long>>,
                      │   struct<53: innerStruct3: optional struct<55: col1: optional string, 56: col2: optional string>, 54: col2: optional
                      │   struct<57: col1: optional string, 58: col2: optional string>>>

In this example, the visitor will use the key's projected schema to search for the value's name, resulting in a

ValueError: Could not find field with name innerStruct3, case_sensitive=True
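The suggested fix, reduced to its essence (names here are illustrative, not the actual visitor's): when restoring state popped from the stack, the saved cursor must be advanced past the part that has already been visited.

```python
from typing import Optional

def advance_map_cursor(cursor: Optional[str]) -> Optional[str]:
    """Step the map-traversal cursor forward: 'key' -> 'value', 'value' -> None."""
    if cursor == "key":
        return "value"
    return None

# Simulated after_field for a map: the visitor pushed ("map_schema", "key")
# before descending into the key; on the way back up it must not reuse "key",
# or the value will be resolved against the key's projected schema.
stack = [("map_schema", "key")]
schema, cursor = stack.pop()
cursor = advance_map_cursor(cursor)  # the value is visited next, not the key
```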

Author


The stack is used only when we hit a composite type during the traversal, so self.next should be valid within such a type. If that part causes the problem described above, then either it doesn't work as intended or another part is the culprit, I guess. I've been trying to reproduce the problem with what I believe is the same schema you used to check the behavior, but have had no luck so far.

Author


Finally I managed to reproduce the problem, and it turned out you are absolutely right! Thanks!

if schema_raw is not None
else pyarrow_to_schema(
physical_schema,
projected_schema,
Contributor


I think we may want the full table schema here. pyarrow_to_schema is supposed to simply convert the physical schema to the Iceberg schema without handling column pruning. However, projected_schema only contains the columns selected in a table scan. If we use it during the conversion, we will have to ignore unselected columns, which I think is unnecessary and tricky to implement (and also inconsistent with the behavior when field ids are present).

Author


It doesn't have much to do with pruning. What we need to achieve with ignore_unprojectable_fields here is simply to ignore redundant columns in the actual data, whereas the purpose of pruning is to take away fields that are already known according to the catalog. Those are similar, but have different semantics.

@moriyoshi
Author

Sorry for leaving this off for a while; I've been quite busy. Let me try to answer the questions.

I'd like to highlight some areas where we could potentially improve.

  1. If we choose to filter out some nested field, pyarrow_to_schema will fail even with ignore_unprojectable_fields = True.

This was a bug to be addressed, and I just pushed the fix. Could you take a look?

  2. Maybe we can let pyarrow_to_schema take the complete table schema rather than the projected schema. In this way, we can focus on dealing with fields that are missing from the table schema and let
    file_project_schema = prune_columns(file_schema, projected_field_ids, select_full_types=False)

    handle the unselected columns.

Actually, what pyarrow_to_schema expects to get as projected_schema is the schema in the catalog (the word "projected" isn't a great fit in this context, I think), and if the "complete table schema" refers to the catalog schema, the behavior should be pretty much the same as what is expected here.

  3. If we have a MapType whose key type is nested, pyarrow_to_schema will fail to switch to the correct inner schema when visiting the MapType's value.

This should've been fixed along with 1.

@moriyoshi moriyoshi force-pushed the feature/impl-7451 branch 3 times, most recently from 9eeacbf to 57af4a5 Compare August 30, 2023 12:00
@moriyoshi
Author

@JonasJ-ap Can you have a look at this again?

…th field names when field ids are not available in data files.
@Fokko
Contributor

Fokko commented Oct 2, 2023

Hey @moriyoshi, thanks for creating this PR. Could you re-create it against the https://github.com/apache/iceberg-python repository? We're migrating the Python implementation to its own repo, and I overlooked this PR.

@Fokko Fokko closed this Oct 2, 2023

Successfully merging this pull request may close these issues.

Python: Implement Name Mapping to construct iceberg schema when field ids are not present in Data files