-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Use FileIO for hadoop table metadata file operations #11690
base: main
Are you sure you want to change the base?
Conversation
@@ -237,15 +237,15 @@ Path getMetadataFile(int metadataVersion) throws IOException { | |||
for (TableMetadataParser.Codec codec : TABLE_METADATA_PARSER_CODEC_VALUES) { | |||
Path metadataFile = metadataFilePath(metadataVersion, codec); | |||
FileSystem fs = getFileSystem(metadataFile, conf); | |||
if (fs.exists(metadataFile)) { | |||
if (fs.isFile(metadataFile)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isFile
method isn't marked as a deprecated in this version, but it will be deprecated in newer versions and the current javadoc says:
Note: Avoid using this method. Instead reuse the FileStatus returned by getFileStatus() or listStatus() methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s3a has specialized implementation of isFile
that only use headobject. If we switch to getFileStatus
then this PR become non sense.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe we use S3FileIO here for metadata file operations? The S3FileIO seems to perform neat file exist check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely shouldn't try to mix use of HadoopFileIO/FileSystem and S3FileIO since they have different behaviors and configuration. I think the additional call is a product of trying to achieve correct file system semantics on an object store. Unless they remove the deprecation or provide a separate utility to perform this check, I'm not sure there's a lot of alternatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mean to use S3FileIO here, but just use the base FileIO interface for read/write/check manifest files, and the FileIO's javadoc says it is capable for this.
The s3a's additional call I guess is for the folder semantics, but if iceberg knows manifest files are "file", the listobject can be eliminated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IzzelAliz I would only be concerned about aligning all the client/request configuration with the the way that S3A is configured, which could make that approach difficult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated them with FileIO. Users who configures a S3FileIO can benefit from it, and the default HadoopFileIO basically behaves the same as the fs.exists.
e54dc61
to
ed31fa5
Compare
isFile
for hadoop table metadata file operations
The s3a filesystem implementation of
exists
calls are headobject and listobjects, where listobjects do not scale well in some self hosted s3 implementations (minio).This PR replace
exists
calls withisFile
which is implemented as headobject only in s3a filesystem.