[Docs] Update spark-getting-started docs page to make the example valid #11923
Conversation
MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
before this PR, `updates` does not exist, nor does `t.count` or `u.count`
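For context, a minimal self-contained sketch of what the example needs in order to run: both tables and the count columns must exist first. The bigint column types and the seed rows here are assumptions for illustration, not taken from the final docs.

```sql
-- Hypothetical setup so the MERGE above runs as written; column types
-- and values are assumed for illustration.
CREATE TABLE local.db.target (id bigint, count bigint) USING iceberg;
CREATE TABLE local.db.updates (id bigint, count bigint) USING iceberg;

INSERT INTO local.db.target VALUES (1, 10), (2, 20);
INSERT INTO local.db.updates VALUES (1, 5), (2, 6);

-- Now the statement is valid: `updates`, t.count, and u.count all exist.
MERGE INTO local.db.target t USING (SELECT * FROM local.db.updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count;
```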
docs/docs/spark-getting-started.md
@@ -160,7 +163,7 @@ This type conversion table describes how Spark types are converted to the Iceber
 | map | map | |

 !!! info
-    The table is based on representing conversion during creating table. In fact, broader supports are applied on write. Here're some points on write:
+    The table is based on type conversions during table creation. Broader type conversions are applied on write:
small grammar improvements
nit: the paragraph before mentions the table is for both create and write, while this sentence says it's only based on create.
thanks, updated
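To make the write-time note concrete, here is a small sketch of the kind of broader conversion being described, assuming a hypothetical local.db.promotion_demo table: Spark can widen an integer value into a bigint column on write, even though the conversion table only covers create-time mappings.

```sql
-- Hypothetical table: the int value is widened to bigint on write,
-- a conversion beyond the create-time mapping table above.
CREATE TABLE local.db.promotion_demo (id bigint) USING iceberg;
INSERT INTO local.db.promotion_demo SELECT CAST(1 AS int);
```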
@@ -77,21 +77,24 @@ Once your table is created, insert data using [`INSERT INTO`](spark-writes.md#in

```sql
INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1;
```
This statement does not add much to the simple example here, remove it
@@ -66,25 +64,25 @@ public void testGettingStarted() throws IOException {
    sql(
        "CREATE TABLE updates (id bigint, data string) USING parquet LOCATION '%s'",
        temp.newFolder());
-   sql("INSERT INTO updates VALUES (1, 'x'), (2, 'x'), (4, 'z')");
+   sql("INSERT INTO updates VALUES (1, 'x'), (2, 'y'), (4, 'z')");
to make the example more interesting to users, set unique values of `data` so that the effect of MERGE is clearer in the result
I like the original example since it hits all branches of the MERGE INTO statement. Also, it'd be nice to keep track of table state in the comments.
the example still hits all branches of MERGE:
- id 1 and 2 are updated
- id 3, 10, 11 are unchanged
- id 4 does not match and is inserted

the change here is to provide a unique `data` value for results, as that helps to explain the example in the docs better (see the sketch below)
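A sketch of how that plays out, assuming the reworked example uses (id, data) columns as in the SmokeTest; the rows for ids 3, 10, and 11 and the exact MERGE clauses are illustrative, not quoted from the docs:

```sql
-- target before: (1, 'a'), (2, 'b'), (3, 'c'), (10, 'j'), (11, 'k')  -- rows assumed
-- updates:       (1, 'x'), (2, 'y'), (4, 'z')
MERGE INTO local.db.target t USING (SELECT * FROM local.db.updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.data = u.data   -- ids 1 and 2 are updated
WHEN NOT MATCHED THEN INSERT *;                -- id 4 is inserted; 3, 10, 11 untouched
-- target after:  (1, 'x'), (2, 'y'), (3, 'c'), (4, 'z'), (10, 'j'), (11, 'k')
```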
Hi @kevinjqliu - I saw that you're a committer and recently looked at this doc page in #11845. Could you review this PR?
Thanks for improving the getting started guide! I've added a few comments
@@ -66,25 +64,25 @@ public void testGettingStarted() throws IOException {
    sql(
        "CREATE TABLE updates (id bigint, data string) USING parquet LOCATION '%s'",
nit: for this and the create table statement above, can we change to `USING iceberg` instead?
I tried and this change breaks tests that use SmokeTest with hadoop and hive catalogs
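For reference, the pattern the test keeps is a plain parquet staging table in the Spark session catalog, which resolves the same way whichever Iceberg catalog is under test; a sketch with an illustrative path:

```sql
-- `updates` stays a session-catalog parquet table, so the same statements
-- run whether SmokeTest targets a hadoop or a hive Iceberg catalog.
-- '/tmp/updates' stands in for the temp folder the test generates.
CREATE TABLE updates (id bigint, data string) USING parquet LOCATION '/tmp/updates';
INSERT INTO updates VALUES (1, 'x'), (2, 'y'), (4, 'z');
```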
MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
CREATE TABLE local.db.updates (id bigint, data string) USING iceberg;
INSERT INTO local.db.updates VALUES (1, 'x'), (2, 'y'), (4, 'z');
same as below, let's update the values so it will hit all branches of the MERGE INTO statement.
nit: and also add the values as a comment to track the table state
about merge branches, commented here https://github.com/apache/iceberg/pull/11923/files?diff=unified&w=0#r1906122418
The table states are straightforward until after the MERGE query (1 insert per table). I have added the table state as a comment after MERGE only. Otherwise there is a lot of duplication. Let me know your thoughts.
@kevinjqliu thanks for the review. I replied to your comments and added an updated screenshot in the description.
The Spark Getting Started docs page has intro Spark examples but they reference tables and columns that do not exist. This is one of the first docs pages that new Iceberg users will see ... having a correct example that someone can run is helpful to them.
I found this as I was reading the project's tests and saw this TODO marked in SmokeTest.java:
iceberg/spark/v3.3/spark-runtime/src/integration/java/org/apache/iceberg/spark/SmokeTest.java, lines 42 to 44 at 67e084c
I've updated the docs in line with the test cases, and also made a minor change to the example to make it clearer: each MERGE sets a unique `data` value for each `id`.

Testing