
Updating SparkScan to only read Apache DataSketches #11035

Merged: 10 commits into apache:main on Oct 16, 2024

Conversation

@jeesou (Contributor) commented Aug 28, 2024:

Closes #11034

@jeesou (Contributor, Author) commented Aug 28, 2024:

Hi @huaxingao, @karuppayya, kindly review the PR.

@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/">Apache DataSketches</a> library
*/
public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";

public static final String PRESTO_SUM_DATA_SIZE_BYTES_V1 = "presto-sum-data-size-bytes-v1";
Contributor commented:

We don't need to store the exact parameter used by Presto as part of Iceberg. We can use it in the test, or even use a dummy identifier, to simulate the existence of additional unsupported metadata.

Separately, we should reach agreement on the right way to store the data size in the Puffin file across engines.

Contributor commented:

I think we should remove this since Iceberg doesn't support this yet

Contributor commented:

+1, it shouldn't be here. If it is a generic blob type we want to support across engines, we should discuss this on the dev list and vote.
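The consensus above is that SparkScan should only read blob types Iceberg actually supports and skip everything else. A minimal sketch of that filtering step (the class and method names here are illustrative, not Iceberg's actual API — only the blob-type constant comes from the diff above):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SupportedBlobFilter {
  // The only blob type SparkScan reads after this PR.
  static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";

  // Keep only blob types Iceberg understands; anything engine-specific
  // (e.g. a type written by Presto) is skipped rather than misread.
  static List<String> filterSupported(List<String> blobTypes) {
    return blobTypes.stream()
        .filter(APACHE_DATASKETCHES_THETA_V1::equals)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> types =
        List.of("apache-datasketches-theta-v1", "presto-sum-data-size-bytes-v1");
    // Only the theta-sketch type survives the filter.
    System.out.println(filterSupported(types));
  }
}
```

This mirrors the reviewers' point: an unrecognized type such as a Presto data-size blob is simply left alone, so a test can use any dummy identifier to simulate unsupported metadata.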

} else {
LOG.debug("DataSketch blob is not available for column {}", colName);
}
ColumnStatistics colStats =
Contributor commented:

Technically we should group the metadata by field first, then extract all of the relevant metadata and create the SparkColumnStatistics instance for the column. This is not specific to this PR, because it was the behaviour before, but we might want to address it as well.

Contributor commented:

+1
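The grouping the reviewers describe could be sketched like this (the `BlobMeta` record is a hypothetical stand-in for Puffin's blob metadata, not Iceberg's actual class):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByField {
  // Hypothetical minimal stand-in for a Puffin blob metadata entry;
  // the real metadata carries more fields (type, properties, ...).
  record BlobMeta(int fieldId, String type) {}

  // Group all blobs by the field they describe, so every relevant piece of
  // metadata for a column can later be combined into a single
  // SparkColumnStatistics instance, instead of one instance per blob.
  static Map<Integer, List<BlobMeta>> groupByField(List<BlobMeta> blobs) {
    Map<Integer, List<BlobMeta>> byField = new HashMap<>();
    for (BlobMeta blob : blobs) {
      byField.computeIfAbsent(blob.fieldId(), id -> new ArrayList<>()).add(blob);
    }
    return byField;
  }
}
```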

@jeesou (Contributor, Author) commented Aug 29, 2024:

Hi, I'm adding an enhancement to the test case.

In the no-stats scenario we were iterating over the expectedNDVs map, which was empty, so the assert was never reached: it was never actually verified whether no statistics are generated, or whether the generated statistics were null.

@jeesou (Contributor, Author) commented Sep 5, 2024:

Hi @karuppayya, @aokolnychyi, @huaxingao, kindly review this PR.

@guykhazma (Contributor) commented:

@karuppayya @huaxingao @szehon-ho can you please help review this?

ndv = Long.parseLong(ndvStr);
} else {
LOG.debug("ndv is not set in BlobMetadata for column {}", colName);
}
Contributor commented:

If the blob type is not APACHE_DATASKETCHES_THETA_V1, shall we add a log message, something like "Blob type XXX is not supported yet"?

@jeesou (Contributor, Author) commented:

Hi @huaxingao, I have added the logs; kindly give it a check and let me know if it works.
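Putting the two review points together, the NDV-parsing branch could look roughly like this (a sketch only: the method shape and `System.out` stand-ins are illustrative, with `System.out.printf` standing in for SparkScan's `LOG.debug` calls):

```java
import java.util.Map;
import java.util.OptionalLong;

public class NdvReader {
  static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
  static final String NDV_KEY = "ndv";

  // Read the NDV property only from a theta-sketch blob; any other blob
  // type is reported as unsupported, as suggested in the review above.
  static OptionalLong readNdv(String blobType, Map<String, String> properties) {
    if (!APACHE_DATASKETCHES_THETA_V1.equals(blobType)) {
      System.out.printf("Blob type %s is not supported yet%n", blobType);
      return OptionalLong.empty();
    }
    String ndvStr = properties.get(NDV_KEY);
    if (ndvStr == null || ndvStr.isEmpty()) {
      System.out.printf("%s is not set in BlobMetadata%n", NDV_KEY);
      return OptionalLong.empty();
    }
    return OptionalLong.of(Long.parseLong(ndvStr));
  }
}
```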

ColumnStatistics colStats =
new SparkColumnStatistics(ndv, null, null, null, null, null, null);
new SparkColumnStatistics(ndv, null, null, null, null, null, null);
Contributor commented:

nit: 4-space indentation?

@jeesou (Contributor, Author) commented:

Corrected.

@jeesou jeesou requested a review from huaxingao September 22, 2024 19:58
@jeesou (Contributor, Author) commented Sep 23, 2024:

Hi @aokolnychyi could you please help review this PR.

@aokolnychyi (Contributor) commented:

I'll check tomorrow. Sorry for the delay!

@jeesou (Contributor, Author) commented Oct 8, 2024:

Hi @aokolnychyi could you please help review this PR once.

if (expectedNDVs.isEmpty()) {
assertThat(
columnStats.isEmpty()
|| columnStats.values().iterator().next().distinctCount().isEmpty())
Member commented:

I'm not sure I understand the second check here. Shouldn't we be checking all columnStats.values().distinctCount are empty?

@jeesou (Contributor, Author) commented:

Hi @RussellSpitzer, the check will not work like this; if we do columnStats.values().distinctCount, it gives the error "Cannot resolve method 'distinctCount' in 'Collection'".

@RussellSpitzer (Member) commented Oct 16, 2024:

I was using shorthand; I meant that for every value in "values" you should be checking distinctCount.

Pseudocode:

for all values in columnStatsValues
   value.distinctCount.isEmpty
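The pseudocode above can be sketched in Java like this (the `ColStats` record is a hypothetical stand-in for SparkColumnStatistics, which exposes an optional distinct count; the helper name is illustrative):

```java
import java.util.Map;
import java.util.Optional;

public class NdvAssertions {
  // Hypothetical stand-in for SparkColumnStatistics.
  record ColStats(Optional<Long> distinctCount) {}

  // When no stats are expected, every column's distinct count must be
  // absent — not just the first entry's, as the original test checked.
  static boolean allDistinctCountsEmpty(Map<String, ColStats> columnStats) {
    return columnStats.values().stream()
        .allMatch(stats -> stats.distinctCount().isEmpty());
  }
}
```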

@jeesou (Contributor, Author) commented:

Understood, I have updated it, could you please review again.

if (!Strings.isNullOrEmpty(ndvStr)) {
ndv = Long.parseLong(ndvStr);
} else {
LOG.debug("ndv is not set in BlobMetadata for column {}", colName);
Member commented:

Minor change, but I think we should use the actual key string:
"{} .... for column {}", NDV_KEY, colName

@jeesou (Contributor, Author) commented:

Updated

@jeesou jeesou requested a review from RussellSpitzer October 16, 2024 18:41
@RussellSpitzer (Member) left a review:

Looks good to me, I'll merge when tests are complete

@RussellSpitzer RussellSpitzer merged commit 17f1c4d into apache:main Oct 16, 2024
31 checks passed
@RussellSpitzer (Member) commented:

Thanks @jeesou for the PR, and @aokolnychyi, @karuppayya, @huaxingao, @guykhazma for reviewing.

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Successfully merging this pull request may close these issues.

Stale column stats getting reported when reading puffin files generated by Presto with Spark engine
6 participants