Spark 3.5: Parallelize reading files in add_files procedure #9274

manuzhang · 2023-12-11T15:01:59Z

Currently, only one thread is used to list files when importing a Spark table in the add_files procedure, which can be very slow for a table or partition with many files. This PR adds a listing_parallelism argument to the add_files procedure so that multiple threads can be used to list files.

docs/spark-procedures.md

amogh-jahagirdar

Thanks @manuzhang. I'm good with having an option to parallelize file listing, but I have concerns about some public API compatibility breakages (these util APIs have existed for a while and people are already using them, so changing them makes it harder for folks to upgrade).

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java

data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java

manuzhang · 2023-12-13T06:47:03Z

@amogh-jahagirdar @singhpk234 please check again. I've restored public util APIs

core/src/main/java/org/apache/iceberg/SnapshotProducer.java

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/ProcedureInput.java

...park-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java

docs/spark-procedures.md

amogh-jahagirdar

Apologies for not catching this earlier, @manuzhang. I took a look at the procedure, and I don't think this is really parallelizing the listing. What we're actually doing is parallelizing the file reads. The listing happens here: https://github.com/apache/iceberg/blob/main/data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java#L126. As you can see, there really is no parallelism (other than what the file system does internally).

The parallelism comes into play in the Task, which reads the files after the listing: https://github.com/apache/iceberg/blob/main/data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java#L146

I think we should rename the parameter, and any other references (comments, method parameters) to just parallelism.

The docs can just say parallelism controls the parallelism when reading files.

Let me know what you think, or if I missed something.
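
To illustrate the distinction drawn above, here is a minimal sketch, not the actual TableMigrationUtil code: the file list is assumed to have been built already by a single thread, and only the per-file reads are fanned out to a fixed-size pool. The class `ParallelReadSketch`, the methods `readAll` and `readFileStats`, and the use of path length as a stand-in for real file metadata are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from Iceberg.
final class ParallelReadSketch {

  private ParallelReadSketch() {}

  // Stand-in for the per-file work (for example, reading file metrics).
  // Here it only returns the length of the path string.
  static long readFileStats(String path) {
    return path.length();
  }

  // listedFiles is produced up front by one thread; only the reads run in parallel.
  // Results are returned in the same order as the input list.
  static List<Long> readAll(List<String> listedFiles, int parallelism) {
    if (parallelism < 1) {
      throw new IllegalArgumentException("parallelism must be >= 1");
    }
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      List<Future<Long>> futures = new ArrayList<>();
      for (String path : listedFiles) {
        Callable<Long> task = () -> readFileStats(path);
        futures.add(pool.submit(task));
      }
      List<Long> results = new ArrayList<>();
      for (Future<Long> future : futures) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while reading files", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("failed to read a file", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

In this sketch, raising `parallelism` speeds up only the second phase; the cost of producing `listedFiles` is unchanged, which is why naming the option after listing would be misleading.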

data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java

manuzhang · 2023-12-19T04:28:32Z

@amogh-jahagirdar since there are other usages of "parallelism", I updated all references to the more specific "readFilesParallelism".

amogh-jahagirdar

@manuzhang where are the other references to parallelism? Are you referring to the existing one in SparkTableUtil? https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L545 is the actual listing parallelism, and I'd just rename that variable in the method if that's what you're referring to.

readFilesParallelism seems more verbose than needed.

manuzhang · 2023-12-25T04:35:47Z

@amogh-jahagirdar updated. Thanks for the review. Merry Christmas 🎄

amogh-jahagirdar

Thanks for working through the iterations @manuzhang , I think it looks really good now. Merry Christmas to you too!

amogh-jahagirdar

I just noticed one of the importSparkTable methods still breaks API compatibility. We should address that before merging. Some level of duplication is OK, but we should avoid breaking APIs (that is, adding new parameters to existing methods, removing parameters, changing type signatures, etc.). In this case I think we should be able to define another importSparkTable which takes parallelism. The old importSparkTable API can call the new API with parallelism 1.
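
The overload pattern suggested here can be sketched as follows. This is an illustrative example only: `ImportUtilSketch` and `importTable` are hypothetical names with simplified string arguments, not the real SparkTableUtil signatures.

```java
// Hypothetical sketch: names and parameter types are illustrative, not Iceberg's.
final class ImportUtilSketch {

  private ImportUtilSketch() {}

  // Existing public signature, left untouched so current callers keep compiling.
  // It delegates to the new overload with a parallelism of 1.
  static String importTable(String source, String target) {
    return importTable(source, target, 1);
  }

  // New overload that carries the extra argument.
  // The returned string is only a placeholder for the real import work.
  static String importTable(String source, String target, int parallelism) {
    if (parallelism < 1) {
      throw new IllegalArgumentException("parallelism must be >= 1");
    }
    return source + " -> " + target + " (parallelism=" + parallelism + ")";
  }
}
```

Callers of the two-argument form see no change in behavior, while new callers can opt in to more threads through the three-argument form.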

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java

amogh-jahagirdar

Thanks for the quick turnaround @manuzhang , looks good now. Merging. Thanks @singhpk234 for reviews!

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java

lurnagao-dahua · 2024-08-27T06:08:00Z

Hi, may I ask if there are any plans to port this to Spark 3.3?

manuzhang · 2024-08-27T15:59:57Z

I will submit a PR to back-port to Spark 3.3 and 3.4 soon.

github-actions bot added the spark, data, and docs labels Dec 11, 2023

singhpk234 reviewed Dec 11, 2023

docs/spark-procedures.md (outdated)

amogh-jahagirdar requested changes Dec 11, 2023

manuzhang force-pushed the add_files_parallelism branch 2 times, most recently from c86c7ba to 42def01, December 13, 2023 05:56

github-actions bot added the core label Dec 13, 2023

manuzhang force-pushed the add_files_parallelism branch from 42def01 to 5d8719d, December 13, 2023 05:57

amogh-jahagirdar reviewed Dec 14, 2023

manuzhang force-pushed the add_files_parallelism branch from 5d8719d to 19299a3, December 14, 2023 07:13

manuzhang requested a review from amogh-jahagirdar December 14, 2023 10:37

amogh-jahagirdar requested changes Dec 19, 2023

data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java (outdated)

manuzhang force-pushed the add_files_parallelism branch from 19299a3 to b79634e, December 19, 2023 04:21

manuzhang changed the title ~~Spark 3.5: Parallelize file listing in add_files procedure~~ Spark 3.5: Parallelize reading files in add_files procedure Dec 19, 2023

manuzhang force-pushed the add_files_parallelism branch 2 times, most recently from c6045f2 to b8603a9, December 19, 2023 04:27

manuzhang requested a review from amogh-jahagirdar December 19, 2023 07:20

amogh-jahagirdar reviewed Dec 22, 2023

manuzhang force-pushed the add_files_parallelism branch 3 times, most recently from 0c38098 to 6c7a80d, December 25, 2023 04:34

manuzhang force-pushed the add_files_parallelism branch from 6c7a80d to 7efd874, December 27, 2023 04:21

manuzhang requested a review from amogh-jahagirdar December 27, 2023 06:16

amogh-jahagirdar approved these changes Dec 28, 2023

amogh-jahagirdar requested changes Dec 28, 2023

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java

manuzhang force-pushed the add_files_parallelism branch from 7efd874 to a76bba1, December 28, 2023 08:33

Spark 3.5: Parallelize reading files in add_files procedure

06062ab

manuzhang force-pushed the add_files_parallelism branch from a76bba1 to 06062ab, December 28, 2023 08:35

amogh-jahagirdar approved these changes Dec 28, 2023

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java

amogh-jahagirdar merged commit 22d4e78 into apache:main Dec 28, 2023
41 checks passed

lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024

Spark 3.5: Parallelize reading files in add_files procedure (apache#9274)

37a5dba

geruh pushed a commit to geruh/iceberg that referenced this pull request Jan 26, 2024

Spark 3.5: Parallelize reading files in add_files procedure (apache#9274)

10f9c31

manuzhang mentioned this pull request Mar 25, 2024

Spark 3.5: Parallelize reading files in snapshot and migrate procedures #10037

Merged

devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024

Spark 3.5: Parallelize reading files in add_files procedure (apache#9274)

056a0c8

manuzhang deleted the add_files_parallelism branch June 4, 2024 11:09

manuzhang added a commit to manuzhang/iceberg that referenced this pull request Aug 29, 2024

Spark 3.3, 3.4: Parallelize reading files in add_files procedure

2512315

Back-port of apache#9274

manuzhang mentioned this pull request Aug 29, 2024

Spark 3.3, 3.4: Parallelize reading files in migrate procedures #11043

Merged

amogh-jahagirdar pushed a commit that referenced this pull request Sep 5, 2024

Spark 3.3, 3.4: Parallelize reading files in migrate procedures (#11043)

f508a7e

Back-port of #9274 Back-port of #10037

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Spark 3.3, 3.4: Parallelize reading files in migrate procedures (apache#11043)

ecb3e4e

Back-port of apache#9274 Back-port of apache#10037
