Updating SparkScan to only read Apache DataSketches #11035
@@ -199,28 +199,24 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
         List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
 
         for (BlobMetadata blobMetadata : metadataList) {
-          int id = blobMetadata.fields().get(0);
-          String colName = table.schema().findColumnName(id);
-          NamedReference ref = FieldReference.column(colName);
-
-          Long ndv = null;
           if (blobMetadata
               .type()
               .equals(org.apache.iceberg.puffin.StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1)) {
+            int id = blobMetadata.fields().get(0);
+            String colName = table.schema().findColumnName(id);
+            NamedReference ref = FieldReference.column(colName);
+            Long ndv = null;
             String ndvStr = blobMetadata.properties().get(NDV_KEY);
             if (!Strings.isNullOrEmpty(ndvStr)) {
               ndv = Long.parseLong(ndvStr);
             } else {
               LOG.debug("ndv is not set in BlobMetadata for column {}", colName);
             }
-          } else {
-            LOG.debug("DataSketch blob is not available for column {}", colName);
-          }
-          ColumnStatistics colStats =
-              new SparkColumnStatistics(ndv, null, null, null, null, null, null);
-
-          colStatsMap.put(ref, colStats);
+            ColumnStatistics colStats =
+                new SparkColumnStatistics(ndv, null, null, null, null, null, null);
+            colStatsMap.put(ref, colStats);
+          }
         }
       }
     }

Review comment (on the SparkColumnStatistics construction): Technically we should group the metadata by field first, then extract all of the relevant metadata and create the SparkColumnStatistics instance for the column.

Reply: +1
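A minimal sketch of the grouping the reviewer suggests, assuming the same surrounding estimateStatistics context as the diff above (metadataList, table, colStatsMap, NDV_KEY, Strings, and LOG are the names already used there, plus java.util.Map, java.util.List, and java.util.stream.Collectors); it is illustrative only, not the change made in this PR:

```java
// Sketch only: group blob metadata by the leading field id, then build one
// SparkColumnStatistics per column from whatever blobs that column has.
Map<Integer, List<BlobMetadata>> metadataByField =
    metadataList.stream()
        .collect(Collectors.groupingBy(blobMetadata -> blobMetadata.fields().get(0)));

for (Map.Entry<Integer, List<BlobMetadata>> entry : metadataByField.entrySet()) {
  String colName = table.schema().findColumnName(entry.getKey());
  NamedReference ref = FieldReference.column(colName);

  Long ndv = null;
  for (BlobMetadata blobMetadata : entry.getValue()) {
    if (blobMetadata
        .type()
        .equals(org.apache.iceberg.puffin.StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1)) {
      String ndvStr = blobMetadata.properties().get(NDV_KEY);
      if (!Strings.isNullOrEmpty(ndvStr)) {
        ndv = Long.parseLong(ndvStr);
      } else {
        LOG.debug("ndv is not set in BlobMetadata for column {}", colName);
      }
    }
    // Other blob types would be inspected here as support for them is added.
  }

  colStatsMap.put(ref, new SparkColumnStatistics(ndv, null, null, null, null, null, null));
}
```

Grouping first means a column with several blobs still ends up with a single entry in colStatsMap, and additional statistics can later be merged into the same SparkColumnStatistics without changing the loop structure.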
Review comment: We don't need to store the exact parameter used by Presto as part of Iceberg. We can use it in the test, or even use a dummy identifier to simulate the existence of additional unsupported metadata. Separately, we should reach agreement on the right way to store the data size in the Puffin file across engines.
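A hypothetical sketch of that dummy-identifier idea for a test: commit a statistics file whose blob type is a made-up string and check that SparkScan ignores it. GenericStatisticsFile, GenericBlobMetadata, and UpdateStatistics.setStatistics are existing Iceberg APIs, and the Immutable* collections are the relocated Guava ones used in Iceberg tests, but the path, sizes, property name, and the "some-engine-sketch-v1" type are invented for illustration.

```java
// Hypothetical test fragment: register a blob with a dummy, unsupported type so the
// reader-side code has to skip it. All literal values here are placeholders.
StatisticsFile statisticsFile =
    new GenericStatisticsFile(
        snapshotId,
        "/tmp/stats.puffin", // placeholder path
        100L, // placeholder file size in bytes
        42L, // placeholder footer size in bytes
        ImmutableList.of(
            new GenericBlobMetadata(
                "some-engine-sketch-v1", // dummy blob type, not a StandardBlobTypes value
                snapshotId,
                1L, // source snapshot sequence number
                ImmutableList.of(1), // field id of the column the blob describes
                ImmutableMap.of("data_size", "4096")))); // engine-specific property

table.updateStatistics().setStatistics(snapshotId, statisticsFile).commit();
// A SparkScan over this table should then report no NDV for the column, because the
// only blob present is not an Apache DataSketches theta sketch.
```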
Review comment: I think we should remove this, since Iceberg doesn't support it yet.
Reply: +1, it shouldn't be here. If it is a generic blob type we want to support across engines, we should discuss it on the dev list and vote.