
Updating SparkScan to only read Apache DataSketches #11035

Merged: 10 commits into apache:main on Oct 16, 2024

Conversation

@jeesou (Contributor) commented Aug 28, 2024:

Closes #11034

@jeesou (Contributor, Author) commented Aug 28, 2024:

Hi @huaxingao, @karuppayya, kindly review the PR.

@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/">Apache DataSketches</a> library
*/
public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";

public static final String PRESTO_SUM_DATA_SIZE_BYTES_V1 = "presto-sum-data-size-bytes-v1";
Contributor commented:

We don't need to store the exact parameter used by Presto as part of Iceberg. We can use it in the test, or even use a dummy identifier, to simulate the existence of additional unsupported metadata.

Separately, we should reach agreement on the right way to store the data size in the Puffin file across engines.

Contributor commented:

I think we should remove this since Iceberg doesn't support this yet

Contributor commented:

+1, it shouldn't be here. If it is a generic blob type we want to support across engines, we should discuss this on the dev list and vote.
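The consensus above is that SparkScan should only read blob types Iceberg actually supports and skip everything else. A minimal sketch of that filtering step (the class and method names here are illustrative, not Iceberg's actual API — only the blob-type constant comes from the diff above):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SupportedBlobFilter {
  // The only blob type SparkScan reads after this PR.
  static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";

  // Keep only blob types Iceberg understands; anything engine-specific
  // (e.g. a type written by Presto) is skipped rather than misread.
  static List<String> filterSupported(List<String> blobTypes) {
    return blobTypes.stream()
        .filter(APACHE_DATASKETCHES_THETA_V1::equals)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> types =
        List.of("apache-datasketches-theta-v1", "presto-sum-data-size-bytes-v1");
    // Only the theta-sketch type survives the filter.
    System.out.println(filterSupported(types));
  }
}
```

This mirrors the reviewers' point: an unrecognized type such as a Presto data-size blob is simply left alone, so a test can use any dummy identifier to simulate unsupported metadata.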

} else {
LOG.debug("DataSketch blob is not available for column {}", colName);
}
ColumnStatistics colStats =
Contributor commented:

Technically we should group the metadata by field first, then extract all of the relevant metadata and create the SparkColumnStatistics instance for the column. This is not specific to this PR, because it was the behaviour before, but we might want to address it as well.

Contributor commented:

+1
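The grouping the reviewers describe could be sketched like this (the `BlobMeta` record is a hypothetical stand-in for Puffin's blob metadata, not Iceberg's actual class):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByField {
  // Hypothetical minimal stand-in for a Puffin blob metadata entry;
  // the real metadata carries more fields (type, properties, ...).
  record BlobMeta(int fieldId, String type) {}

  // Group all blobs by the field they describe, so every relevant piece of
  // metadata for a column can later be combined into a single
  // SparkColumnStatistics instance, instead of one instance per blob.
  static Map<Integer, List<BlobMeta>> groupByField(List<BlobMeta> blobs) {
    Map<Integer, List<BlobMeta>> byField = new HashMap<>();
    for (BlobMeta blob : blobs) {
      byField.computeIfAbsent(blob.fieldId(), id -> new ArrayList<>()).add(blob);
    }
    return byField;
  }
}
```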

@jeesou (Contributor, Author) commented Aug 29, 2024:

Hi, I'm adding an enhancement to the test case.

In the no-stats scenario we were iterating over the expectedNDVs map, which was empty, so the assert was never reached: it was never actually verified whether no statistics are generated, or whether the generated statistics were null.

@jeesou (Contributor, Author) commented Sep 5, 2024:

Hi @karuppayya, @aokolnychyi, @huaxingao, kindly review this PR.

@guykhazma (Contributor) commented:

@karuppayya @huaxingao @szehon-ho can you please help review this?

ndv = Long.parseLong(ndvStr);
} else {
LOG.debug("ndv is not set in BlobMetadata for column {}", colName);
}
Contributor commented:

If the blob type is not APACHE_DATASKETCHES_THETA_V1, shall we add a log message, something like "Blob type XXX is not supported yet"?

@jeesou (Contributor, Author) commented:

Hi @huaxingao, I have added the logs; kindly give it a check and let me know if it works.
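Putting the two review points together, the NDV-parsing branch could look roughly like this (a sketch only: the method shape and `System.out` stand-ins are illustrative, with `System.out.printf` standing in for SparkScan's `LOG.debug` calls):

```java
import java.util.Map;
import java.util.OptionalLong;

public class NdvReader {
  static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
  static final String NDV_KEY = "ndv";

  // Read the NDV property only from a theta-sketch blob; any other blob
  // type is reported as unsupported, as suggested in the review above.
  static OptionalLong readNdv(String blobType, Map<String, String> properties) {
    if (!APACHE_DATASKETCHES_THETA_V1.equals(blobType)) {
      System.out.printf("Blob type %s is not supported yet%n", blobType);
      return OptionalLong.empty();
    }
    String ndvStr = properties.get(NDV_KEY);
    if (ndvStr == null || ndvStr.isEmpty()) {
      System.out.printf("%s is not set in BlobMetadata%n", NDV_KEY);
      return OptionalLong.empty();
    }
    return OptionalLong.of(Long.parseLong(ndvStr));
  }
}
```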

ColumnStatistics colStats =
new SparkColumnStatistics(ndv, null, null, null, null, null, null);
new SparkColumnStatistics(ndv, null, null, null, null, null, null);
Contributor commented:

nit: 4-space indentation?

@jeesou (Contributor, Author) commented:

Corrected.

@jeesou jeesou requested a review from huaxingao September 22, 2024 19:58
@jeesou (Contributor, Author) commented Sep 23, 2024:

Hi @aokolnychyi could you please help review this PR.

@aokolnychyi (Contributor) commented:

I'll check tomorrow. Sorry for the delay!

@jeesou (Contributor, Author) commented Oct 8, 2024:

Hi @aokolnychyi could you please help review this PR once.

if (expectedNDVs.isEmpty()) {
assertThat(
columnStats.isEmpty()
|| columnStats.values().iterator().next().distinctCount().isEmpty())
Member commented:

I'm not sure I understand the second check here. Shouldn't we be checking all columnStats.values().distinctCount are empty?

@jeesou (Contributor, Author) commented:

Hi @RussellSpitzer, the check will not work like this; if we do columnStats.values().distinctCount, it gives the error "Cannot resolve method 'distinctCount' in 'Collection'".

@RussellSpitzer (Member) commented Oct 16, 2024:

I was using shorthand; I meant that for every value in "values" you should be checking distinctCount.

Pseudocode:

for all values in columnStatsValues
   value.distinctCount.isEmpty
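The pseudocode above can be sketched in Java like this (the `ColStats` record is a hypothetical stand-in for SparkColumnStatistics, which exposes an optional distinct count; the helper name is illustrative):

```java
import java.util.Map;
import java.util.Optional;

public class NdvAssertions {
  // Hypothetical stand-in for SparkColumnStatistics.
  record ColStats(Optional<Long> distinctCount) {}

  // When no stats are expected, every column's distinct count must be
  // absent — not just the first entry's, as the original test checked.
  static boolean allDistinctCountsEmpty(Map<String, ColStats> columnStats) {
    return columnStats.values().stream()
        .allMatch(stats -> stats.distinctCount().isEmpty());
  }
}
```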

@jeesou (Contributor, Author) commented:

Understood, I have updated it, could you please review again.

if (!Strings.isNullOrEmpty(ndvStr)) {
ndv = Long.parseLong(ndvStr);
} else {
LOG.debug("ndv is not set in BlobMetadata for column {}", colName);
Member commented:

Minor change, but I think we should use the actual key string:
"{} .... for column {}", NDV_KEY, colName

@jeesou (Contributor, Author) commented:

Updated

@jeesou jeesou requested a review from RussellSpitzer October 16, 2024 18:41
@RussellSpitzer (Member) left a review:

Looks good to me, I'll merge when tests are complete

@RussellSpitzer RussellSpitzer merged commit 17f1c4d into apache:main Oct 16, 2024
31 checks passed
@RussellSpitzer (Member) commented:

Thanks @jeesou for the PR, and @aokolnychyi, @karuppayya, @huaxingao, @guykhazma for reviewing.

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Successfully merging this pull request may close these issues.

Stale column stats getting reported when reading puffin files generated by Presto with Spark engine
6 participants