Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Extend FileIO and add EncryptingFileIO. #9592

Merged
merged 9 commits into from
Feb 19, 2024

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Jan 31, 2024

This adds EncryptingFileIO to handle metadata file encryption without breaking existing interfaces that pass FileIO without an EncryptionManager. It also updates FileIO to have methods for opening ManifestFile, DataFile, and DeleteFile directly to avoid needing to check whether a given FileIO supports encryption. The default implementation in FileIO fails if metadata files are encrypted.

@rdblue
Copy link
Contributor Author

rdblue commented Jan 31, 2024

@ggershinsky, here is what I have so far. It would be great to get your input on it!

@ggershinsky
Copy link
Contributor

@rdblue Thanks for the PR. I am about to complete applying it to the e2e manifest/list/table encryption. So far, everything works well. I'll send updated patches next week.


public InputFile newDecryptingInputFile(String path, long length, ByteBuffer buffer) {
// TODO: is the length correct for the encrypted file? It may be the length of the plaintext
// stream
Copy link
Contributor

@ggershinsky ggershinsky Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is the plaintext stream length (which is recorded in the ContentFile and ManifestFile objects). I have updated the AesGcmInput classes to work with the verified plaintext length values (instead of untrusted file length values) - actually, the code becomes more compact and intuitive. I'll send this patch along with other e2e updates next week.

@ggershinsky
Copy link
Contributor

Everything works with this PR. But there are two options for its integration with other PRs, lets discuss them in the call.

@@ -201,6 +188,25 @@ public void shouldNotDropDataFilesIfGcNotEnabled() {
.containsAll(metadataLocations);
}

private static FileIO createMockFileIO(FileIO wrapped) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests needed the mock object to support the added newInputFile methods. Rather than continuing to change each test individually, I refactored to have one method that is used throughout to create the mock FileIO.

@@ -232,80 +237,63 @@ public long getLength() {
@Override
public SeekableInputStream newStream() {
try {
// read from cache if file length is less than or equal to maximum length allowed to
// cache.
if (getLength() <= contentCache.maxContentLength()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check duplicated the one in tryCache.

import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
* Class that provides file-content caching during reading.
*
* <p>The file-content caching is initiated by calling {@link ContentCache#tryCache(FileIO, String,
* long)}. Given a FileIO, a file location string, and file length that is within allowed limit,
* <p>The file-content caching is initiated by calling {@link ContentCache#tryCache(InputFile)}.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here aren't strictly necessary, but clean up several problems I noticed while updating the code to call newInputFile(ManifestFile) instead of newInputFile(String,long). The class leaked the CacheEntry class in its public API (those methods are now deprecated) and had duplicate checks, overly verbose debug logs, and odd error handling. I've simplified it a bit and made the naming more clear for internal methods and classes.

"ContentCache creation failed. Check that all manifest caching configurations has valid value.");
LOG.debug("FileIO-level cache stats: {}", CONTENT_CACHES.stats());
return cache.tryCache(io, path, length);
private static InputFile newInputFile(FileIO io, ManifestFile manifest) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified the logic here in addition to passing ManifestFile through to newInputFile.

this.lazyInputFiles =
EncryptingFileIO.create(table().io(), table().encryption())
.bulkDecrypt(
() -> taskGroup.tasks().stream().flatMap(this::referencedFiles).iterator());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes demonstrate how EncryptingFileIO is cleaner for bulk decryption. This section has been broken by contributors several times in efforts to simplify the logic, but that end up breaking the bulk decrypt behavior.

public class EncryptingFileIO implements FileIO, Serializable {
public static EncryptingFileIO create(FileIO io, EncryptionManager em) {
if (io instanceof EncryptingFileIO) {
return (EncryptingFileIO) io;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check if IO's Encryption Manager matches the one being passed? Just a little worried about users of this API potentially calling this method and having their EM ignored.

nit: Perhaps "wrap" instead of create?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but it felt like overkill. It would make little sense to double-wrap encryption and only the outer one would be used. There's also the consideration of where the FileIO and EncryptionManager come from and I think nearly all of the time it would be from the same table. The check looked like dead code to me.

I also considered other names, but this is creating a thing with new capabilities rather than wrapping one thing in another. I also considered "combine" but didn't think it was substantially better.

default InputFile newInputFile(DataFile file) {
Preconditions.checkArgument(
file.keyMetadata() == null,
"Cannot decrypt data file: {} (use EncryptingFileIO)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Suggestion on error

Cannot decrypt encrypted data file without configured EncryptingFileIO: {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's clear that the data file is encrypted, so I would omit that addition. And the "without configured EncryptingFileIO" is a stronger statement that isn't quite right. You can decrypt a data file using an EncryptionManager and newInputFile(String) but if you want to use this method, you must use EncryptingFileIO.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few nit comments but this makes sense to me

@rdblue
Copy link
Contributor Author

rdblue commented Feb 19, 2024

I'm going to merge this to unblock @ggershinsky and we can continue discussing the create method changes in a follow up. Thanks, @RussellSpitzer!

@rdblue rdblue merged commit 4c5208a into apache:main Feb 19, 2024
42 checks passed
nastra added a commit to ajantha-bhat/trino that referenced this pull request Feb 20, 2024
ajantha-bhat pushed a commit to ajantha-bhat/trino that referenced this pull request Feb 21, 2024
ajantha-bhat added a commit to ajantha-bhat/trino that referenced this pull request Feb 21, 2024
- Testing with Iceberg 1.5.0 RC0 and Nessie 0.77.1
- Remove deprecated usage
- InputFile.length is being called since apache/iceberg#9592
- Handle JDBC catalog test failures
ajantha-bhat added a commit to ajantha-bhat/trino that referenced this pull request Feb 21, 2024
- Testing with Iceberg 1.5.0 RC0 and Nessie 0.77.1
- Remove deprecated usage
- InputFile.length is being called since apache/iceberg#9592
- Handle JDBC catalog test failures

Co-authored-by: Eduard Tudenhoefner <[email protected]>
ajantha-bhat added a commit to ajantha-bhat/trino that referenced this pull request Feb 27, 2024
- Testing with Iceberg 1.5.0 RC4 and Nessie 0.77.1
- Remove deprecated usage
- InputFile.length is being called since apache/iceberg#9592
- Handle JDBC catalog test failures

Co-authored-by: Eduard Tudenhoefner <[email protected]>
ajantha-bhat added a commit to ajantha-bhat/trino that referenced this pull request Mar 3, 2024
- Testing with Iceberg 1.5.0 RC4 and Nessie 0.77.1
- Remove deprecated usage
- InputFile.length is being called since apache/iceberg#9592
- Handle JDBC catalog test failures

Co-authored-by: Eduard Tudenhoefner <[email protected]>
bitsondatadev pushed a commit to bitsondatadev/iceberg that referenced this pull request Mar 3, 2024
ajantha-bhat added a commit to ajantha-bhat/trino that referenced this pull request Mar 6, 2024
- Testing with Iceberg 1.5.0 RC4 and Nessie 0.77.1
- Remove deprecated usage
- InputFile.length is being called since apache/iceberg#9592
- Handle JDBC catalog test failures

Co-authored-by: Eduard Tudenhoefner <[email protected]>
ajantha-bhat added a commit to ajantha-bhat/trino that referenced this pull request Mar 13, 2024
- Bump Iceberg to 1.5.0 and Nessie to 0.77.1
- Remove deprecated usage
- InputFile.length is being called since apache/iceberg#9592

Co-authored-by: Eduard Tudenhoefner <[email protected]>
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants