Updating SparkScan to only read Apache DataSketches #11035
Conversation
Hi @huaxingao, @karuppayya, kindly review the PR.
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
  * href="https://datasketches.apache.org/">Apache DataSketches</a> library
  */
 public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";

 public static final String PRESTO_SUM_DATA_SIZE_BYTES_V1 = "presto-sum-data-size-bytes-v1";
We don't need to store the exact parameter used by Presto as part of Iceberg.
We can use it in the test, or even use a dummy identifier to simulate the existence of additional unsupported metadata.
Separately, we should reach agreement on the right way to store the data size in the Puffin file across engines.
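A minimal sketch of that test-side idea, assuming a hypothetical dummy blob type name and that GenericBlobMetadata takes (type, snapshotId, sequenceNumber, fields, properties); the actual test helpers in the PR may differ:

import java.util.List;
import java.util.Map;
import org.apache.iceberg.BlobMetadata;
import org.apache.iceberg.GenericBlobMetadata;

class UnsupportedBlobFixture {
  // Hypothetical identifier: not a real StandardBlobTypes constant, only used
  // by the test to simulate metadata that SparkScan does not understand.
  static final String DUMMY_BLOB_TYPE = "dummy-unsupported-blob-v1";

  // Builds blob metadata for the given snapshot so a test can verify that
  // unsupported blob types are skipped when reading column statistics.
  static List<BlobMetadata> dummyBlob(long snapshotId, long sequenceNumber, int fieldId) {
    return List.of(
        new GenericBlobMetadata(
            DUMMY_BLOB_TYPE, snapshotId, sequenceNumber, List.of(fieldId), Map.of()));
  }
}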
I think we should remove this since Iceberg doesn't support this yet
+1, it shouldn't be here. If it is a generic blob type we want to support across engines, we should discuss this on the dev list and vote.
} else {
  LOG.debug("DataSketch blob is not available for column {}", colName);
}
ColumnStatistics colStats =
Technically, we should group the metadata by field first, then extract all of the relevant metadata and create the SparkColumnStatistics instance for the column.
This is not specifically related to this PR because this was the behaviour before, but we might want to address it as well.
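A rough sketch of the grouping described here; NDV_KEY, the column-name keying, and the SparkColumnStatistics constructor order are assumptions taken from the snippets in this PR, not the exact implementation (it also assumes the code lives alongside the package-private SparkColumnStatistics in org.apache.iceberg.spark.source):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.iceberg.BlobMetadata;
import org.apache.iceberg.Schema;
import org.apache.iceberg.StatisticsFile;
import org.apache.iceberg.puffin.StandardBlobTypes;
import org.apache.iceberg.relocated.com.google.common.base.Strings;
import org.apache.iceberg.relocated.com.google.common.collect.Maps;
import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

class ColumnStatsGrouping {
  private static final String NDV_KEY = "ndv"; // assumed property key of the theta blob

  // Group blobs by field id first, then derive one statistics object per column
  // from all of that column's blobs. Keyed by column name here for simplicity;
  // the real SparkScan code keys by Spark's NamedReference.
  static Map<String, ColumnStatistics> buildColumnStats(StatisticsFile file, Schema schema) {
    Map<Integer, List<BlobMetadata>> blobsByField =
        file.blobMetadata().stream()
            .filter(blob -> blob.fields().size() == 1) // per-column blobs only
            .collect(Collectors.groupingBy(blob -> blob.fields().get(0)));

    Map<String, ColumnStatistics> colStats = Maps.newHashMap();
    blobsByField.forEach(
        (fieldId, blobs) -> {
          Long ndv = null;
          for (BlobMetadata blob : blobs) {
            if (StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1.equals(blob.type())) {
              String ndvStr = blob.properties().get(NDV_KEY);
              if (!Strings.isNullOrEmpty(ndvStr)) {
                ndv = Long.parseLong(ndvStr);
              }
            }
            // other supported blob types could fill in more statistics fields here
          }
          colStats.put(
              schema.findColumnName(fieldId),
              new SparkColumnStatistics(ndv, null, null, null, null, null, null));
        });
    return colStats;
  }
}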
+1
Hi, adding an enhancement to the test case: for the no-stats scenario we were also traversing the expectedNDVs map, which was empty, so the assert was never reached; it was therefore not properly tested whether no statistics were generated, or whether the generated statistics were null.
Hi @karuppayya, @aokolnychyi, @huaxingao, kindly review this PR.
@karuppayya @huaxingao @szehon-ho, can you please help review this?
  ndv = Long.parseLong(ndvStr);
} else {
  LOG.debug("ndv is not set in BlobMetadata for column {}", colName);
}
If the blob type is not APACHE_DATASKETCHES_THETA_V1, shall we add a log message, something like "Blob type XXX is not supported yet"?
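A possible shape for that branch; illustrative only, with blobMetadata, colName, and LOG assumed to come from the surrounding SparkScan code quoted above:

// Skip anything that is not an Apache DataSketches theta blob and leave a trace in the logs.
if (StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1.equals(blobMetadata.type())) {
  // extract the NDV from the theta blob's properties
} else {
  LOG.debug(
      "Blob type {} is not supported yet, skipping it for column {}",
      blobMetadata.type(), colName);
}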
Hi @huaxingao, I have added the logs. Kindly give it a check and let me know if it works.
ColumnStatistics colStats =
    new SparkColumnStatistics(ndv, null, null, null, null, null, null);
    new SparkColumnStatistics(ndv, null, null, null, null, null, null);
nit: 4-space indentation?
Corrected.
Hi @aokolnychyi, could you please help review this PR?
I'll check tomorrow. Sorry for the delay!
Hi @aokolnychyi, could you please help review this PR once more?
if (expectedNDVs.isEmpty()) {
  assertThat(
      columnStats.isEmpty()
          || columnStats.values().iterator().next().distinctCount().isEmpty())
I'm not sure I understand the second check here. Shouldn't we be checking that all columnStats.values().distinctCount are empty?
Hi @RussellSpitzer, the check will not work like this; if we do columnStats.values().distinctCount, it gives the error "Cannot resolve method 'distinctCount' in 'Collection'".
I was using shorthand; I meant that for every value in "values" you should be checking distinctCount.
Pseudocode:
for all values in columnStats.values()
  value.distinctCount.isEmpty
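A sketch of that check with AssertJ, assuming Spark's ColumnStatistics.distinctCount() returns an OptionalLong as in the DSv2 API:

import static org.assertj.core.api.Assertions.assertThat;

// Every column's statistics should report an empty distinct count when no NDVs are expected.
if (expectedNDVs.isEmpty()) {
  assertThat(columnStats.values())
      .allSatisfy(stats -> assertThat(stats.distinctCount()).isEmpty());
}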
Understood, I have updated it. Could you please review again?
if (!Strings.isNullOrEmpty(ndvStr)) {
  ndv = Long.parseLong(ndvStr);
} else {
  LOG.debug("ndv is not set in BlobMetadata for column {}", colName);
Minor change, but I think we should use the actual key string:
"{} .... for column {}", NDV_KEY, colName
Updated
Looks good to me, I'll merge when tests are complete
Thanks @jeesou for the PR, and @aokolnychyi, @karuppayya, @huaxingao, @guykhazma all for reviewing.
Closes #11034