Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark 3.5: Parallelize reading files in add_files procedure #9274

Merged
merged 1 commit into from
Dec 28, 2023

Conversation

manuzhang
Copy link
Contributor

@manuzhang manuzhang commented Dec 11, 2023

Currently, only one thread is used to list files when importing a Spark table in add_files procedure. It can be very slow for a table or a partition with many files. This PR adds an argument listing_parallelism to add_files procedure such that multiple threads can be used to list files.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @manuzhang I'm good with having an option to parallelizee file listing but I have concerns on some public API compatibility breakages (these util APIs have existed for a while, and people are already using them, so changing this makes it harder for folks to upgrade).

@manuzhang manuzhang force-pushed the add_files_parallelism branch 2 times, most recently from c86c7ba to 42def01 Compare December 13, 2023 05:56
@github-actions github-actions bot added the core label Dec 13, 2023
@manuzhang manuzhang force-pushed the add_files_parallelism branch from 42def01 to 5d8719d Compare December 13, 2023 05:57
@manuzhang
Copy link
Contributor Author

manuzhang commented Dec 13, 2023

@amogh-jahagirdar @singhpk234 please check again. I've restored public util APIs

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for not catching this earlier @manuzhang I took a look at the procedure, I don't think this really is parallelizing the listing. Really what we're doing is parallelizing the file reads. The listing happens here: https://github.com/apache/iceberg/blob/main/data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java#L126 as you can see there really is no parallelism (other than what the file system does internally).

The parallelism comes into play in the Task which is reading the files after the listing. https://github.com/apache/iceberg/blob/main/data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java#L146

I think we should rename the parameter, and any other references (comments, method parameters) to just parallelism.

The docs can just say parallelism controls the parallelism when reading files.

Let me know what you think, or if I missed something.

@manuzhang manuzhang force-pushed the add_files_parallelism branch from 19299a3 to b79634e Compare December 19, 2023 04:21
@manuzhang manuzhang changed the title Spark 3.5: Parallelize file listing in add_files procedure Spark 3.5: Parallelize reading files in add_files procedure Dec 19, 2023
@manuzhang manuzhang force-pushed the add_files_parallelism branch 2 times, most recently from c6045f2 to b8603a9 Compare December 19, 2023 04:27
@manuzhang
Copy link
Contributor Author

@amogh-jahagirdar since there are other usages of "parallelism", I updated all references to be more specific "readFilesParallelism".

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@manuzhang where are the other references to parallelism? Are you referring to the existing one in SparkTableUtil? https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L545 is the actual listing parallelism and I'd just rename that variable in the method if that's what you're referring to

readfilesParallelism seems more verbose than needed

@manuzhang manuzhang force-pushed the add_files_parallelism branch 3 times, most recently from 0c38098 to 6c7a80d Compare December 25, 2023 04:34
@manuzhang
Copy link
Contributor Author

@amogh-jahagirdar updated. Thanks for review. Merry Christmas 🎄

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through the iterations @manuzhang , I think it looks really good now. Merry Christmas to you too!

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed one of the importSparkTable methods still breaks API compatibility. We should address that before merging. Some level of duplication is OK but we should avoid breaking APIs (this means adding new parameters, removing parameters, changing type signatures etc.). In this case I think we should be able to define another importSparkTable which uses parallelism. The old importSparkTable API can call the new API with parallelism 1.

@manuzhang manuzhang force-pushed the add_files_parallelism branch from 7efd874 to a76bba1 Compare December 28, 2023 08:33
@manuzhang manuzhang force-pushed the add_files_parallelism branch from a76bba1 to 06062ab Compare December 28, 2023 08:35
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick turnaround @manuzhang , looks good now. Merging. Thanks @singhpk234 for reviews!

@amogh-jahagirdar amogh-jahagirdar merged commit 22d4e78 into apache:main Dec 28, 2023
41 checks passed
lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024
geruh pushed a commit to geruh/iceberg that referenced this pull request Jan 26, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
@manuzhang manuzhang deleted the add_files_parallelism branch June 4, 2024 11:09
@lurnagao-dahua
Copy link
Contributor

Hi,May I ask if there are any plans to port to Spark 3.3?

@manuzhang
Copy link
Contributor Author

I will submit a PR to back-port to Spark 3.3 and 3.4 soon.

manuzhang added a commit to manuzhang/iceberg that referenced this pull request Aug 29, 2024
amogh-jahagirdar pushed a commit that referenced this pull request Sep 5, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants