[Docs] Update spark-getting-started docs page to make the example valid #11923

Open · wants to merge 3 commits into base: main
Conversation

@nickdelnano commented Jan 7, 2025

The Spark Getting Started docs page has intro Spark examples, but they reference tables and columns that do not exist. This is one of the first docs pages that new Iceberg users will see, so having a correct example that someone can run is helpful to them.

I found this as I was reading the project's tests and saw this TODO marked in SmokeTest.java:

```java
// Run through our Doc's Getting Started Example
// TODO Update doc example so that it can actually be run, modifications were required for this
// test suite to run
```

I've updated the docs in line with the test cases, and also made a minor change to make the example clearer: each MERGE sets a unique data value for each id.

Testing

  • Ran the tests modified in this PR - they pass
  • Built the site locally as documented here and loaded the modified spark-getting-started page

Comment on lines -86 to -87
```diff
- MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
- WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
```
@nickdelnano (Author): Before this PR, `updates` does not exist, nor do `t.count` or `u.count`.

```diff
@@ -160,7 +163,7 @@ This type conversion table describes how Spark types are converted to the Iceber
  | map | map | |

  !!! info
-     The table is based on representing conversion during creating table. In fact, broader supports are applied on write. Here're some points on write:
+     The table is based on type conversions during table creation. Broader type conversions are applied on write:
```
@nickdelnano (Author): Small grammar improvements.

@kevinjqliu (Contributor): nit: the paragraph before mentions the table is for both create and write, while this sentence says it's only based on create.

@nickdelnano (Author): Thanks, updated.

```diff
@@ -77,21 +77,24 @@ Once your table is created, insert data using [`INSERT INTO`](spark-writes.md#in
  INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
- INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1;
```
@nickdelnano (Author): This statement does not add much to the simple example here, so remove it.

```diff
@@ -66,25 +64,25 @@ public void testGettingStarted() throws IOException {
  sql(
      "CREATE TABLE updates (id bigint, data string) USING parquet LOCATION '%s'",
      temp.newFolder());
- sql("INSERT INTO updates VALUES (1, 'x'), (2, 'x'), (4, 'z')");
+ sql("INSERT INTO updates VALUES (1, 'x'), (2, 'y'), (4, 'z')");
```
@nickdelnano (Author): To make the example more interesting to users, set unique values of data so that the function of MERGE is clearer in the result.

@kevinjqliu (Contributor): I like the original example since it hits all branches of the MERGE INTO statement.
Also, it'd be nice to keep track of the table state in a comment.

@nickdelnano (Author): The example still hits all branches of MERGE:

  • ids 1 and 2 are updated
  • ids 3, 10, and 11 are unchanged
  • id 4 does not match and is inserted

The change here is to provide a unique data value for each row, as that helps to explain the example in the docs better.
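For concreteness, here is a sketch of how the rows hit the branches, assuming the updated docs MERGE sets t.data = u.data for matches and inserts unmatched rows (the exact statement is in the PR diff):

```sql
-- updates rows: (1, 'x'), (2, 'y'), (4, 'z')
MERGE INTO local.db.table t
USING (SELECT * FROM local.db.updates) u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.data = u.data   -- ids 1 and 2 match: data becomes 'x' and 'y'
WHEN NOT MATCHED THEN INSERT *;                -- id 4 has no match: (4, 'z') is inserted
-- ids 3, 10, and 11 have no row in updates, so they are left unchanged
```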

@nickdelnano marked this pull request as ready for review on January 7, 2025, 22:44
@nickdelnano (Author): Hi @kevinjqliu, I saw that you're a committer and recently looked at this doc page in #11845. Could you review this PR?

@kevinjqliu (Contributor) left a comment:

Thanks for improving the getting started guide! I've added a few comments.


```diff
@@ -66,25 +64,25 @@ public void testGettingStarted() throws IOException {
  sql(
      "CREATE TABLE updates (id bigint, data string) USING parquet LOCATION '%s'",
```
@kevinjqliu (Contributor): nit: for this and the CREATE TABLE statement above, can we change to USING iceberg instead?

@nickdelnano (Author): I tried, and this change breaks tests that use SmokeTest with the Hadoop and Hive catalogs.


```diff
- MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
- WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
+ CREATE TABLE local.db.updates (id bigint, data string) USING iceberg;
+ INSERT INTO local.db.updates VALUES (1, 'x'), (2, 'y'), (4, 'z');
```
@kevinjqliu (Contributor): Same as below, let's update the values so they hit all branches of the MERGE INTO statement.

nit: also add the values as a comment to track the table state.

@nickdelnano (Author): About the merge branches, commented here: https://github.com/apache/iceberg/pull/11923/files?diff=unified&w=0#r1906122418

The table states are straightforward until after the MERGE query (one insert per table), so I have added the table state as a comment after MERGE only; otherwise there is a lot of duplication. Let me know your thoughts.
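For example, the state-tracking comment after the MERGE looks roughly like this (the row values assume the inserts sketched above; the final wording is in the diff):

```sql
-- after the MERGE, local.db.table contains:
-- (1, 'x'), (2, 'y'), (3, 'c'), (4, 'z'), plus any other rows inserted earlier
SELECT * FROM local.db.table ORDER BY id;
```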

@nickdelnano (Author) left a comment:

@kevinjqliu thanks for the review. I replied to your comments and added an updated screenshot in the description.




