[Docs] Update spark-getting-started docs page to make the example valid #11923
Conversation
MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
before this PR, `updates` does not exist, nor does `t.count` or `u.count`
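For context, a minimal self-contained sketch of what the example needs in order to run: both tables and the count columns must exist first. The bigint column types and the seed rows here are assumptions for illustration, not taken from the final docs.

```sql
-- Hypothetical setup so the MERGE above runs as written; column types
-- and values are assumed for illustration.
CREATE TABLE local.db.target (id bigint, count bigint) USING iceberg;
CREATE TABLE local.db.updates (id bigint, count bigint) USING iceberg;

INSERT INTO local.db.target VALUES (1, 10), (2, 20);
INSERT INTO local.db.updates VALUES (1, 5), (2, 6);

-- Now the statement is valid: `updates`, t.count, and u.count all exist.
MERGE INTO local.db.target t USING (SELECT * FROM local.db.updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count;
```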
docs/docs/spark-getting-started.md
@@ -160,7 +163,7 @@ This type conversion table describes how Spark types are converted to the Iceber
 | map | map | |

 !!! info
-    The table is based on representing conversion during creating table. In fact, broader supports are applied on write. Here're some points on write:
+    The table is based on type conversions during table creation. Broader type conversions are applied on write:
small grammar improvements
nit: the paragraph before mentions the table is for both create and write, while this sentence says it's only based on create.
thanks, updated
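To make the write-time note concrete, here is a small sketch of the kind of broader conversion being described, assuming a hypothetical local.db.promotion_demo table: Spark can widen an integer value into a bigint column on write, even though the conversion table only covers create-time mappings.

```sql
-- Hypothetical table: the int value is widened to bigint on write,
-- a conversion beyond the create-time mapping table above.
CREATE TABLE local.db.promotion_demo (id bigint) USING iceberg;
INSERT INTO local.db.promotion_demo SELECT CAST(1 AS int);
```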
@@ -77,21 +77,24 @@ Once your table is created, insert data using [`INSERT INTO`](spark-writes.md#in

```sql
INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1;
```
This statement does not add much to the simple example here, remove it
@@ -66,25 +64,25 @@ public void testGettingStarted() throws IOException {
    sql(
        "CREATE TABLE updates (id bigint, data string) USING parquet LOCATION '%s'",
        temp.newFolder());
-   sql("INSERT INTO updates VALUES (1, 'x'), (2, 'x'), (4, 'z')");
+   sql("INSERT INTO updates VALUES (1, 'x'), (2, 'y'), (4, 'z')");
to make the example more interesting to users, set unique values of `data` so that the effect of MERGE is clearer in the result
I like the original example since it hits all branches of the MERGE INTO statement. Also, it'd be nice to keep track of table state in the comments.
the example still hits all branches of MERGE:
- id 1 and 2 are updated
- id 3, 10, 11 are unchanged
- id 4 does not match and is inserted

the change here is to provide a unique `data` value for results, as that helps to explain the example in the docs better (see the sketch below)
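A sketch of how that plays out, assuming the reworked example uses (id, data) columns as in the SmokeTest; the rows for ids 3, 10, and 11 and the exact MERGE clauses are illustrative, not quoted from the docs:

```sql
-- target before: (1, 'a'), (2, 'b'), (3, 'c'), (10, 'j'), (11, 'k')  -- rows assumed
-- updates:       (1, 'x'), (2, 'y'), (4, 'z')
MERGE INTO local.db.target t USING (SELECT * FROM local.db.updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.data = u.data   -- ids 1 and 2 are updated
WHEN NOT MATCHED THEN INSERT *;                -- id 4 is inserted; 3, 10, 11 untouched
-- target after:  (1, 'x'), (2, 'y'), (3, 'c'), (4, 'z'), (10, 'j'), (11, 'k')
```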
Hi @kevinjqliu - I saw that you're a committer and recently looked at this doc page in #11845. Could you review this PR?
Thanks for improving the getting started guide! I've added a few comments
@@ -66,25 +64,25 @@ public void testGettingStarted() throws IOException {
    sql(
        "CREATE TABLE updates (id bigint, data string) USING parquet LOCATION '%s'",
nit: for this and the create table statement above, can we change to `USING iceberg` instead?
I tried and this change breaks tests that use SmokeTest with hadoop and hive catalogs
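For reference, the pattern the test keeps is a plain parquet staging table in the Spark session catalog, which resolves the same way whichever Iceberg catalog is under test; a sketch with an illustrative path:

```sql
-- `updates` stays a session-catalog parquet table, so the same statements
-- run whether SmokeTest targets a hadoop or a hive Iceberg catalog.
-- '/tmp/updates' stands in for the temp folder the test generates.
CREATE TABLE updates (id bigint, data string) USING parquet LOCATION '/tmp/updates';
INSERT INTO updates VALUES (1, 'x'), (2, 'y'), (4, 'z');
```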
MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count
CREATE TABLE local.db.updates (id bigint, data string) USING iceberg;
INSERT INTO local.db.updates VALUES (1, 'x'), (2, 'y'), (4, 'z');
same as below, let's update the values so it will hit all branches of the MERGE INTO statement.
nit: and also add the values as a comment to track the table state
about merge branches, commented here https://github.com/apache/iceberg/pull/11923/files?diff=unified&w=0#r1906122418
The table states are straightforward until after the MERGE query (1 insert per table). I have added the table state as a comment after MERGE only. Otherwise there is a lot of duplication. Let me know your thoughts.
@kevinjqliu thanks for the review. I replied to your comments and added an updated screenshot in the description.
The Spark Getting Started docs page has intro Spark examples but they reference tables and columns that do not exist. This is one of the first docs pages that new Iceberg users will see ... having a correct example that someone can run is helpful to them.
I found this as I was reading the project's tests and saw this TODO marked in SmokeTest.java:
iceberg/spark/v3.3/spark-runtime/src/integration/java/org/apache/iceberg/spark/SmokeTest.java, lines 42 to 44 at 67e084c
I've updated the docs in line with the test cases, and also made a minor change to the example to make it clearer: each MERGE sets a unique `data` value for each `id`.

Testing