Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Use FileIO for hadoop table metadata file operations #11690

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

IzzelAliz
Copy link

The s3a filesystem implementation of exists calls are headobject and listobjects, where listobjects do not scale well in some self hosted s3 implementations (minio).

This PR replace exists calls with isFile which is implemented as headobject only in s3a filesystem.

@github-actions github-actions bot added the core label Dec 3, 2024
@@ -237,15 +237,15 @@ Path getMetadataFile(int metadataVersion) throws IOException {
for (TableMetadataParser.Codec codec : TABLE_METADATA_PARSER_CODEC_VALUES) {
Path metadataFile = metadataFilePath(metadataVersion, codec);
FileSystem fs = getFileSystem(metadataFile, conf);
if (fs.exists(metadataFile)) {
if (fs.isFile(metadataFile)) {
Copy link
Contributor

@ebyhr ebyhr Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isFile method isn't marked as a deprecated in this version, but it will be deprecated in newer versions and the current javadoc says:

Note: Avoid using this method. Instead reuse the FileStatus returned by getFileStatus() or listStatus() methods.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s3a has specialized implementation of isFile that only use headobject. If we switch to getFileStatus then this PR become non sense.
What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe we use S3FileIO here for metadata file operations? The S3FileIO seems to perform neat file exist check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely shouldn't try to mix use of HadoopFileIO/FileSystem and S3FileIO since they have different behaviors and configuration. I think the additional call is a product of trying to achieve correct file system semantics on an object store. Unless they remove the deprecation or provide a separate utility to perform this check, I'm not sure there's a lot of alternatives.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean to use S3FileIO here, but just use the base FileIO interface for read/write/check manifest files, and the FileIO's javadoc says it is capable for this.
The s3a's additional call I guess is for the folder semantics, but if iceberg knows manifest files are "file", the listobject can be eliminated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IzzelAliz I would only be concerned about aligning all the client/request configuration with the the way that S3A is configured, which could make that approach difficult.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated them with FileIO. Users who configures a S3FileIO can benefit from it, and the default HadoopFileIO basically behaves the same as the fs.exists.

@IzzelAliz IzzelAliz force-pushed the main branch 2 times, most recently from e54dc61 to ed31fa5 Compare December 25, 2024 09:20
@IzzelAliz IzzelAliz changed the title Core: Use isFile for hadoop table metadata file operations Core: Use FileIO for hadoop table metadata file operations Dec 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants