
Deliver key metadata to parquet encryption #6762

Closed
wants to merge 2 commits

Conversation

ggershinsky (Contributor):

No description provided.

ggershinsky changed the title from "MR: Deliver key metadata to parquet readers" to "Deliver key metadata to parquet readers" on Feb 7, 2023
github-actions bot added the core label on Feb 20, 2023
@@ -52,6 +55,7 @@

protected CloseableIterable<ColumnarBatch> newBatchIterable(
InputFile inputFile,
ByteBuffer keyMetadata,
rdblue (Contributor):

Should this throw an exception if keyMetadata is non-null and the format is ORC?

- public FileAppender<InternalRow> newAppender(OutputFile file, FileFormat fileFormat) {
+ public FileAppender<InternalRow> newAppender(OutputFile outputFile, FileFormat format) {
+   return newAppender(
+       EncryptedFiles.encryptedOutput(outputFile, (EncryptionKeyMetadata) null), format);
rdblue (Contributor):

Can you add a pass-through factory method to EncryptedFiles rather than constructing one here? I think it would be better to call EncryptedFiles.plainAsEncryptedOutput rather than passing null key metadata.
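A minimal sketch of such a factory (the method name comes from the comment above; the constructor call is an assumption based on the existing BaseEncryptedOutputFile, not the PR's actual code):

  // Hypothetical pass-through factory: wraps a plain output file so callers
  // never construct an EncryptedOutputFile with null key metadata themselves.
  public static EncryptedOutputFile plainAsEncryptedOutput(OutputFile outputFile) {
    return new BaseEncryptedOutputFile(outputFile, EncryptionKeyMetadata.empty());
  }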

rdblue (Contributor) left a review:

Overall this looks reasonable.

rdblue (Contributor) commented May 22, 2023:

@ggershinsky, I think this is close but tests are failing. Can you update it?

github-actions bot added the API label on Jun 8, 2023
ggershinsky force-pushed the deliver-key-metadata branch 2 times, most recently from 2e0c52f to 2bc7789 on June 11, 2023 04:49
ggershinsky force-pushed the deliver-key-metadata branch from 14c1eab to 83e239b on July 24, 2023 05:16
ggershinsky changed the title from "Deliver key metadata to parquet readers" to "Deliver key metadata to parquet encryption" on Jul 31, 2023
BaseEncryptedOutputFile(
OutputFile encryptingOutputFile,
EncryptionKeyMetadata keyMetadata,
OutputFile rawOutputFile) {
rdblue (Contributor), Jul 31, 2023:

Two output files look really suspicious to me. Maybe we would want to use a different implementation of the API to avoid that. Shouldn't this create the encrypting output file rather than having it passed in? Or is that an artifact from how the encryption manager works?

The encryption manager we use doesn't have to produce these files.

ggershinsky (Contributor, author), Aug 2, 2023:

Or is that an artifact from how the encryption manager works?

Correct. This is a result of our previous discussion on encryption manager and Parquet -
#6884 (comment)
"For native Parquet encryption, I think the EncryptedOutputFile and EncryptedInputFile classes would need to be able to return the underlying stream as well, so that encryption can be handled by Parquet."

Also, this class (BaseEncryptedOutputFile) has existed for a while now; it was designed as a simple wrapper around an encrypting stream, its key metadata, and now its underlying stream.

rdblue (Contributor):

I was considering this when looking at #6884. EncryptedOutputFile definitely needs to be able to return the underlying stream for Parquet encryption, but that doesn't mean that BaseEncryptedOutputFile necessarily has to be used here.

Overall, I'm fine with this but it does look odd to have a wrapper that can supply either the raw output file or an encrypted one.

ggershinsky (Contributor, author):

I've reverted the BaseEncryptedOutputFile class. There is a new implementation of the updated EncryptedOutputFile, residing inside the encryption manager that provides the native Parquet encryption.

@@ -49,4 +49,12 @@ static EncryptionKeyMetadata empty() {
ByteBuffer buffer();

EncryptionKeyMetadata copy();

default ByteBuffer encryptionKey() {
return null;
rdblue (Contributor):

Why null instead of UnsupportedOperationException? Won't the caller need to throw an exception?
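A sketch of the failing-fast alternative being suggested (the exception message is assumed):

  // Assumed: throw instead of returning null so misuse surfaces immediately,
  // rather than forcing every caller to null-check and throw on its own.
  default ByteBuffer encryptionKey() {
    throw new UnsupportedOperationException(
        "Encryption key is not supported by " + getClass().getName());
  }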


private EncryptionUtil() {}

public static EncryptionKeyMetadata parseKeyMetadata(ByteBuffer metadataBuffer) {
rdblue (Contributor):

Why doesn't this return KeyMetadata?

return new KeyMetadata(key, aadPrefix);
}

static KeyManagementClient createKmsClient(String kmsImpl) {
rdblue (Contributor):

As I mentioned above, I'd prefer to primarily use a kms-type and fall back to an impl class if needed.

ggershinsky (Contributor, author):

Sure. This method will be called only if kms-type is custom
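For illustration only, a rough sketch of that custom fallback (the reflective loading and no-arg constructor are assumptions, not the PR's code):

  // Hypothetical: instantiate a user-supplied KeyManagementClient by class
  // name when kms-type is set to "custom".
  static KeyManagementClient createKmsClient(String kmsImpl) {
    try {
      return (KeyManagementClient)
          Class.forName(kmsImpl).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate KMS client " + kmsImpl, e);
    }
  }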

@Override
public InputFile decrypt(EncryptedInputFile encrypted) {
- if (encrypted.keyMetadata().buffer() != null) {
+ if (encrypted.keyMetadata() != null && encrypted.keyMetadata().buffer() != null) {
LOG.warn(
"File encryption key metadata is present, but currently using PlaintextEncryptionManager.");
rdblue (Contributor):

Rather than using a class name, can we change this to "but no encryption has been configured"

@@ -118,13 +117,15 @@ public DataWriter<T> newDataWriter(

case PARQUET:
Parquet.DataWriteBuilder parquetBuilder =
- Parquet.writeData(outputFile)
+ Parquet.writeData(file.rawOutputFile())
rdblue (Contributor):

This is a behavior change. I think we need to check whether keyMetadata is a KeyMetadata in order to do this. If it's metadata that can be used for Parquet native encryption then we can use it. Otherwise we should fall back to using encryptingOutputFile() like before.
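A sketch of the suggested check (mirroring the writeDeletes example later in this review; not the PR's final code):

  // Assumed: route to the raw output file only when the key metadata can
  // drive native Parquet encryption; otherwise keep the previous behavior.
  Parquet.DataWriteBuilder parquetBuilder;
  if (file.keyMetadata() instanceof KeyMetadata) {
    parquetBuilder = Parquet.writeData(file.rawOutputFile());
  } else {
    parquetBuilder = Parquet.writeData(file.encryptingOutputFile());
  }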

@@ -194,6 +195,8 @@ public EqualityDeleteWriter<T> newEqualityDeleteWriter(
.withSpec(spec)
.withPartition(partition)
.withKeyMetadata(keyMetadata)
.withFileEncryptionKey(keyMetadata.encryptionKey())
rdblue (Contributor):

Doesn't this also need to use the raw output file?

@@ -261,6 +264,8 @@ public PositionDeleteWriter<T> newPositionDeleteWriter(
.withSpec(spec)
.withPartition(partition)
.withKeyMetadata(keyMetadata)
.withFileEncryptionKey(keyMetadata.encryptionKey())
rdblue (Contributor):

Same here. Why doesn't this use the raw output file?

Also, should we have a Parquet.writeDeletes(EncryptedOutputFile)? It would work like this:

  public static DeleteWriteBuilder writeDeletes(EncryptedOutputFile file) {
    if (file.keyMetadata() instanceof KeyMetadata) {
      KeyMetadata standardKeyMetadata = (KeyMetadata) file.keyMetadata();
      return writeDeletes(file.rawOutputFile())
          .withFileEncryptionKey(standardKeyMetadata.encryptionKey())
          .withAADPrefix(standardKeyMetadata.aadPrefix());
    } else {
      return writeDeletes(file.encryptingOutputFile());
    }
  }

@@ -293,6 +295,13 @@ private CloseableIterable<Record> openDeletes(DeleteFile deleteFile, Schema dele
builder.filter(Expressions.equal(MetadataColumns.DELETE_FILE_PATH.name(), filePath));
}

if (deleteFile.keyMetadata() != null) {
EncryptionKeyMetadata keyMetadata =
EncryptionUtil.parseKeyMetadata(deleteFile.keyMetadata());
rdblue (Contributor):

Cast or parse would be better here, too.

ggershinsky (Contributor, author):

keyMetadata() is always a ByteBuffer here, so it has to be parsed.

rdblue (Contributor):

Thanks for clarifying. That makes sense.

@@ -61,9 +66,12 @@ protected CloseableIterable<ColumnarBatch> newBatchIterable(
SparkDeleteFilter deleteFilter) {
switch (format) {
case PARQUET:
- return newParquetIterable(inputFile, start, length, residual, idToConstant, deleteFilter);
+ return newParquetIterable(
+     inputFile, keyMetadata, start, length, residual, idToConstant, deleteFilter);
rdblue (Contributor):

Can this pass the EncryptedInputFile instead? That would avoid needing to pass both separately.
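One possible shape of the suggested signature, so the file and its key metadata travel together (parameter names are taken from the diff; the body is assumed):

  protected CloseableIterable<ColumnarBatch> newBatchIterable(
      EncryptedInputFile encryptedFile,
      long start,
      long length,
      Expression residual,
      Map<Integer, ?> idToConstant,
      SparkDeleteFilter deleteFilter) {
    // Unbundle the pair only at the call into the Parquet reader.
    return newParquetIterable(
        encryptedFile.encryptedInputFile(),
        encryptedFile.keyMetadata().buffer(),
        start, length, residual, idToConstant, deleteFilter);
  }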

- // decrypt with the batch call to avoid multiple RPCs to a key server, if possible
- Iterable<InputFile> decryptedFiles = table.encryption().decrypt(encryptedFiles::iterator);
+ Stream<InputFile> inputFiles =
+     taskGroup.tasks().stream().flatMap(this::referencedFiles).map(this::toInputFile);
rdblue (Contributor):

This should not change because it breaks usage of the bulk decrypt API. That doesn't matter for Iceberg standard encryption, but it would negatively affect people using their own EncryptionManager.

There's also no need to change how this works. It's fine for the decrypt method to return BaseEncryptedInputFile.encryptedInputFile. That doesn't actually create a decryption stream unless the InputFile is used.

Instead, we can either keep the EncryptedInputFile instances around or we can use a class that handles both APIs.

ggershinsky (Contributor, author):

I'll revert the changes in this class, and move the native decryption logic to the StandardEncryptionManager class.

@@ -58,7 +62,8 @@ protected CloseableIterable<InternalRow> newIterable(
Map<Integer, ?> idToConstant) {
switch (format) {
case PARQUET:
- return newParquetIterable(file, start, length, residual, projection, idToConstant);
+ return newParquetIterable(
+     file, encryptionKeyMetadata, start, length, residual, projection, idToConstant);
rdblue (Contributor):

Prefer passing EncryptedInputFile here as well.

ggershinsky (Contributor, author):

@rdblue Thanks for the review. This PR and the related #6884 and #5544 are updated to address the comments.


/** Underlying output file for native encryption. */
default OutputFile rawOutputFile() {
return null;
rdblue (Contributor):

Should this throw UnsupportedOperationException? That seems like a good idea to me so that we don't get NPE when trying to use this with Parquet.

}

default ByteBuffer aadPrefix() {
return null;
rdblue (Contributor):

Do we handle a null aadPrefix or do we assume it is non-null?

ggershinsky (Contributor, author):

Null is technically possible. It would indeed be safer to throw an UnsupportedOperationException here as well.
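A sketch of the safer defaults agreed on here (the exception messages are assumed):

  /** Underlying output file for native encryption. */
  default OutputFile rawOutputFile() {
    // Assumed: fail fast so a non-native implementation can't silently return null
    throw new UnsupportedOperationException("Raw output file is not provided");
  }

  default ByteBuffer aadPrefix() {
    // Assumed: the same fail-fast treatment for the AAD prefix
    throw new UnsupportedOperationException("AAD prefix is not provided");
  }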

@@ -126,6 +128,13 @@ private CloseableIterable<Record> openFile(FileScanTask task, Schema fileProject
parquet.reuseContainers();
}

if (task.file().keyMetadata() != null) {
EncryptionKeyMetadata keyMetadata =
rdblue (Contributor):

This method in EncryptionUtil returns a package-private implementation, KeyMetadata. That leaks the class outside of the package, where it isn't useful. I think that's why this PR added methods to EncryptionKeyMetadata. I think it's better to make KeyMetadata (or StandardKeyMetadata, as I renamed it in my last PR) public.

ggershinsky (Contributor, author):

SGTM

@@ -118,7 +118,7 @@ public DataWriter<T> newDataWriter(

case PARQUET:
Parquet.DataWriteBuilder parquetBuilder =
- Parquet.writeData(outputFile)
+ Parquet.writeData(file)
rdblue (Contributor), Dec 11, 2023:

Since encryptingOutputFile will create an AesGcmOutputFile, I don't think it should be called unless it is going to be used. I think outputFile should be removed and the branches that pass an OutputFile (ORC and Avro) should call file.encryptingOutputFile() inline.

ggershinsky (Contributor, author):

SGTM

@@ -146,10 +154,12 @@ public EqualityDeleteWriter<Record> newEqDeleteWriter(
"Equality delete row schema shouldn't be null when creating equality-delete writer");

MetricsConfig metricsConfig = MetricsConfig.fromProperties(config);
OutputFile outputFile = file.encryptingOutputFile();
rdblue (Contributor):

Same as above, I think this should be called inline like it was before.

public static WriteBuilder write(EncryptedOutputFile file) {
if (EncryptionUtil.useNativeEncryption(file.keyMetadata())) {
return write(file.rawOutputFile())
.withFileEncryptionKey(file.keyMetadata().encryptionKey())
rdblue (Contributor), Dec 11, 2023:

I think I suggested adding encryptionKey and aadPrefix to the base, but after looking at the castOrParse, I think it would be cleaner to use that here to get an instance of StandardKeyMetadata.

It may also be simpler to use a direct instanceof check here; then you could just cast directly. You can't do that when the type check is hidden behind a helper method that may change.

if (file.keyMetadata() instanceof StandardKeyMetadata) {
  StandardKeyMetadata keyMetadata = (StandardKeyMetadata) file.keyMetadata();
  return write(file.plainOutputFile())
      .withFileEncryptionKey(keyMetadata.encryptionKey())
      .withAADPrefix(keyMetadata.aadPrefix())
} else {
  return write(file.encryptingOutputFile());
}

rdblue (Contributor):

I think this also needs to call withKeyMetadata so that the key metadata instance is set.

After that, it's awkward that the key metadata is set separately from the encryption key and AAD prefix in this case, because the key metadata is not opaque. I think the solution is to do the instanceof check above in withKeyMetadata. That way, if the key metadata is StandardKeyMetadata, it will also configure the AAD prefix and key. If not, then setting them separately makes sense.
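For a builder that does expose withKeyMetadata alongside the key/AAD setters, the suggestion might look roughly like this (the field name and return type are assumptions):

  public DataWriteBuilder withKeyMetadata(EncryptionKeyMetadata metadata) {
    this.keyMetadata = metadata;

    // Assumed: when the opaque metadata is really the standard implementation,
    // also wire up the native Parquet encryption key and AAD prefix.
    if (metadata instanceof StandardKeyMetadata) {
      StandardKeyMetadata standard = (StandardKeyMetadata) metadata;
      withFileEncryptionKey(standard.encryptionKey());
      withAADPrefix(standard.aadPrefix());
    }

    return this;
  }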

ggershinsky (Contributor, author):

+1 to the first suggestion.

As for the second - this WriterBuilder class doesn't have a withKeyMetadata method (since it is a basic writer appender).

}

public static boolean useNativeEncryption(EncryptionKeyMetadata keyMetadata) {
return keyMetadata != null && keyMetadata instanceof KeyMetadata;
rdblue (Contributor):

The null check is redundant and can be removed.
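The simplified check (instanceof already evaluates to false for a null reference):

  public static boolean useNativeEncryption(EncryptionKeyMetadata keyMetadata) {
    return keyMetadata instanceof KeyMetadata;
  }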

@@ -175,15 +181,15 @@ public FileAppender<InternalRow> newAppender(OutputFile file, FileFormat fileFor
.build();

case AVRO:
- return Avro.write(file)
+ return Avro.write(file.encryptingOutputFile())
rdblue (Contributor):

@ggershinsky, it seems to me that with the AES GCM streams set up, Avro encryption would also work, right? In fact, although the StandardEncryptionManager is not used unless the format is Parquet, I think you can still request a format in individual writes. Those would work and use AES GCM stream encryption.

I think we need to change how we prevent Avro and ORC encryption. Instead of doing the check when creating the encryption manager, it should be done here. What I would do is add Avro.write(EncryptedOutputFile) and ORC.write(EncryptedOutputFile) and have them throw UnsupportedOperationException unless the key metadata is null. That will prevent AES GCM from being used until we want to add the feature.

We should also consider whether this will just work for Avro files!
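A sketch of that guard (the overload placement and message are assumptions):

  // Hypothetical Avro.write overload: reject key metadata until AES GCM
  // stream encryption for Avro is intentionally supported.
  public static WriteBuilder write(EncryptedOutputFile file) {
    if (file.keyMetadata() != null && file.keyMetadata().buffer() != null) {
      throw new UnsupportedOperationException("Avro encryption is not yet supported");
    }

    return write(file.encryptingOutputFile());
  }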

ggershinsky (Contributor, author):

SGTM

Commits:
- use key and aadPrefix explicitly
- util
- post-review changes
- package-private KeyMetadata
- fix key metadata method signatures
- fix NPE
- update PositionDeletesRowReader
- use ALL_CAPS
- move spark 3.3 to 3.4, flink 1.16 to 1.17
- update spark source BaseReader
- address review comments
- clean up
- update revapi
- revert visibility limit
- revert TableOperations
- move plaintext manager changes to another pr
- address review comments
- revert BaseEncryptedOutputFile
ggershinsky (Contributor, author):

Moved to main branch base via #9359

ggershinsky deleted the deliver-key-metadata branch on January 8, 2024 05:25