-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Extend FileIO and add EncryptingFileIO. #9592
Conversation
@ggershinsky, here is what I have so far. It would be great to get your input on it! |
@rdblue Thanks for the PR. I am about to complete applying it to the e2e manifest/list/table encryption. So far, everything works well. I'll send updated patches next week. |
|
||
public InputFile newDecryptingInputFile(String path, long length, ByteBuffer buffer) { | ||
// TODO: is the length correct for the encrypted file? It may be the length of the plaintext | ||
// stream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this is the plaintext stream length (which is recorded in the ContentFile and ManifestFile objects). I have updated the AesGcmInput classes to work with the verified plaintext length values (instead of untrusted file length values) - actually, the code becomes more compact and intuitive. I'll send this patch along with other e2e updates next week.
Everything works with this PR. But there are two options for its integration with other PRs, lets discuss them in the call. |
@@ -201,6 +188,25 @@ public void shouldNotDropDataFilesIfGcNotEnabled() { | |||
.containsAll(metadataLocations); | |||
} | |||
|
|||
private static FileIO createMockFileIO(FileIO wrapped) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests needed the mock object to support the added newInputFile
methods. Rather than continuing to change each test individually, I refactored to have one method that is used throughout to create the mock FileIO
.
@@ -232,80 +237,63 @@ public long getLength() { | |||
@Override | |||
public SeekableInputStream newStream() { | |||
try { | |||
// read from cache if file length is less than or equal to maximum length allowed to | |||
// cache. | |||
if (getLength() <= contentCache.maxContentLength()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check duplicated the one in tryCache
.
import org.apache.iceberg.relocated.com.google.common.collect.Lists; | ||
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
|
||
/** | ||
* Class that provides file-content caching during reading. | ||
* | ||
* <p>The file-content caching is initiated by calling {@link ContentCache#tryCache(FileIO, String, | ||
* long)}. Given a FileIO, a file location string, and file length that is within allowed limit, | ||
* <p>The file-content caching is initiated by calling {@link ContentCache#tryCache(InputFile)}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes here aren't strictly necessary, but clean up several problems I noticed while updating the code to call newInputFile(ManifestFile)
instead of newInputFile(String,long)
. The class leaked the CacheEntry
class in its public API (those methods are now deprecated) and had duplicate checks, overly verbose debug logs, and odd error handling. I've simplified it a bit and made the naming more clear for internal methods and classes.
"ContentCache creation failed. Check that all manifest caching configurations has valid value."); | ||
LOG.debug("FileIO-level cache stats: {}", CONTENT_CACHES.stats()); | ||
return cache.tryCache(io, path, length); | ||
private static InputFile newInputFile(FileIO io, ManifestFile manifest) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified the logic here in addition to passing ManifestFile
through to newInputFile
.
this.lazyInputFiles = | ||
EncryptingFileIO.create(table().io(), table().encryption()) | ||
.bulkDecrypt( | ||
() -> taskGroup.tasks().stream().flatMap(this::referencedFiles).iterator()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes demonstrate how EncryptingFileIO
is cleaner for bulk decryption. This section has been broken by contributors several times in efforts to simplify the logic, but that end up breaking the bulk decrypt behavior.
public class EncryptingFileIO implements FileIO, Serializable { | ||
public static EncryptingFileIO create(FileIO io, EncryptionManager em) { | ||
if (io instanceof EncryptingFileIO) { | ||
return (EncryptingFileIO) io; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check if IO's Encryption Manager matches the one being passed? Just a little worried about users of this API potentially calling this method and having their EM ignored.
nit: Perhaps "wrap" instead of create?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this, but it felt like overkill. It would make little sense to double-wrap encryption and only the outer one would be used. There's also the consideration of where the FileIO
and EncryptionManager
come from and I think nearly all of the time it would be from the same table. The check looked like dead code to me.
I also considered other names, but this is creating a thing with new capabilities rather than wrapping one thing in another. I also considered "combine" but didn't think it was substantially better.
default InputFile newInputFile(DataFile file) { | ||
Preconditions.checkArgument( | ||
file.keyMetadata() == null, | ||
"Cannot decrypt data file: {} (use EncryptingFileIO)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Suggestion on error
Cannot decrypt encrypted data file without configured EncryptingFileIO: {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's clear that the data file is encrypted, so I would omit that addition. And the "without configured EncryptingFileIO" is a stronger statement that isn't quite right. You can decrypt a data file using an EncryptionManager
and newInputFile(String)
but if you want to use this method, you must use EncryptingFileIO
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few nit comments but this makes sense to me
I'm going to merge this to unblock @ggershinsky and we can continue discussing the |
- Testing with Iceberg 1.5.0 RC0 and Nessie 0.77.1 - Remove deprecated usage - InputFile.length is being called since apache/iceberg#9592 - Handle JDBC catalog test failures
- Testing with Iceberg 1.5.0 RC0 and Nessie 0.77.1 - Remove deprecated usage - InputFile.length is being called since apache/iceberg#9592 - Handle JDBC catalog test failures Co-authored-by: Eduard Tudenhoefner <[email protected]>
- Testing with Iceberg 1.5.0 RC4 and Nessie 0.77.1 - Remove deprecated usage - InputFile.length is being called since apache/iceberg#9592 - Handle JDBC catalog test failures Co-authored-by: Eduard Tudenhoefner <[email protected]>
- Testing with Iceberg 1.5.0 RC4 and Nessie 0.77.1 - Remove deprecated usage - InputFile.length is being called since apache/iceberg#9592 - Handle JDBC catalog test failures Co-authored-by: Eduard Tudenhoefner <[email protected]>
- Testing with Iceberg 1.5.0 RC4 and Nessie 0.77.1 - Remove deprecated usage - InputFile.length is being called since apache/iceberg#9592 - Handle JDBC catalog test failures Co-authored-by: Eduard Tudenhoefner <[email protected]>
- Bump Iceberg to 1.5.0 and Nessie to 0.77.1 - Remove deprecated usage - InputFile.length is being called since apache/iceberg#9592 Co-authored-by: Eduard Tudenhoefner <[email protected]>
This adds
EncryptingFileIO
to handle metadata file encryption without breaking existing interfaces that passFileIO
without anEncryptionManager
. It also updatesFileIO
to have methods for openingManifestFile
,DataFile
, andDeleteFile
directly to avoid needing to check whether a givenFileIO
supports encryption. The default implementation inFileIO
fails if metadata files are encrypted.