[docs] Replace examples of Hadoop catalog with JDBC catalog #11845

Open · wants to merge 11 commits into base: main
Conversation

@kevinjqliu (Contributor) commented Dec 22, 2024

Closes #11284
devlist discussion

This PR replaces examples of the Hadoop catalog with examples of the JDBC catalog, and adds examples of setting up a REST catalog.

Testing

spark-quickstart.md using JDBC catalog

Using spark-sql CLI config:

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.xerial:sqlite-jdbc:3.46.1.3 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=jdbc \
    --conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
    --conf spark.sql.defaultCatalog=local
```

Using spark-defaults.conf file:

```
spark.jars.packages                                  org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.xerial:sqlite-jdbc:3.46.1.3
spark.sql.extensions                                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog                      org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type                 hive
spark.sql.catalog.local                              org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type                         jdbc
spark.sql.catalog.local.uri                          jdbc:sqlite:iceberg_catalog_db.sqlite
spark.sql.catalog.local.warehouse                    warehouse
spark.sql.defaultCatalog                             local
```

```sh
spark-sql --properties-file ./spark-defaults.conf
```
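After creating tables through the JDBC catalog, the SQLite file holds the catalog state. A minimal sketch of what that state looks like, using Python's stdlib `sqlite3` — the `iceberg_tables` schema below is an assumption mirroring what Iceberg's `JdbcCatalog` maintains, and the inserted row is made up for illustration, not taken from this PR:

```python
import sqlite3

# Sketch: mimic the `iceberg_tables` table that Iceberg's JdbcCatalog keeps
# in the configured JDBC database (schema is an assumption for illustration).
conn = sqlite3.connect(":memory:")  # the PR uses iceberg_catalog_db.sqlite on disk
conn.execute(
    """
    CREATE TABLE iceberg_tables (
        catalog_name TEXT NOT NULL,
        table_namespace TEXT NOT NULL,
        table_name TEXT NOT NULL,
        metadata_location TEXT,
        previous_metadata_location TEXT,
        PRIMARY KEY (catalog_name, table_namespace, table_name)
    )
    """
)

# A row like the one a CREATE TABLE in spark-sql would produce (values are hypothetical)
conn.execute(
    "INSERT INTO iceberg_tables VALUES (?, ?, ?, ?, ?)",
    ("local", "db", "t1", "warehouse/db/t1/metadata/00000-abc.metadata.json", None),
)

# The catalog resolves a table by looking up its current metadata file
row = conn.execute(
    "SELECT metadata_location FROM iceberg_tables "
    "WHERE catalog_name = ? AND table_namespace = ? AND table_name = ?",
    ("local", "db", "t1"),
).fetchone()
print(row[0])
```

This is why no Hadoop-style directory listing is needed: the JDBC database is the single source of truth for each table's current metadata location.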

spark-quickstart.md using REST catalog

With spark-sql CLI config:

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.apache.iceberg:iceberg-aws-bundle:1.7.1 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.rest.type=rest \
    --conf spark.sql.catalog.rest.uri=http://localhost:8181 \
    --conf spark.sql.catalog.rest.warehouse=s3://warehouse/ \
    --conf spark.sql.catalog.rest.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.rest.s3.endpoint=http://localhost:9000 \
    --conf spark.sql.catalog.rest.s3.path-style-access=true \
    --conf spark.sql.catalog.rest.s3.access-key-id=admin \
    --conf spark.sql.catalog.rest.s3.secret-access-key=password \
    --conf spark.sql.catalog.rest.client.region=us-east-1 \
    --conf spark.sql.defaultCatalog=rest
```

With spark-defaults.conf file:

```
spark.jars.packages                                  org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.apache.iceberg:iceberg-aws-bundle:1.7.1
spark.sql.extensions                                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog                      org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type                 hive
spark.sql.catalog.rest                               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type                          rest
spark.sql.catalog.rest.uri                           http://localhost:8181
spark.sql.catalog.rest.warehouse                     s3://warehouse/
spark.sql.catalog.rest.io-impl                       org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.rest.s3.endpoint                   http://localhost:9000
spark.sql.catalog.rest.s3.path-style-access          true
spark.sql.catalog.rest.s3.access-key-id              admin
spark.sql.catalog.rest.s3.secret-access-key          password
spark.sql.catalog.rest.client.region                 us-east-1
spark.sql.defaultCatalog                             rest
```

```sh
spark-sql --properties-file ./spark-defaults.conf
```
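The CLI `--conf` flags and the spark-defaults.conf file carry exactly the same key/value pairs in two syntaxes. A small sketch making that equivalence concrete — the helper functions and the trimmed-down config dict are hypothetical, for illustration only:

```python
# Sketch: render one catalog configuration mapping both as spark-sql
# `--conf` flags and as spark-defaults.conf lines (hypothetical helpers).
def to_cli_flags(conf: dict) -> list[str]:
    """Render each key=value pair as a --conf argument for spark-sql."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


def to_defaults_file(conf: dict) -> str:
    """Render the pairs as aligned spark-defaults.conf lines."""
    width = max(len(key) for key in conf) + 4
    return "\n".join(f"{key:<{width}}{value}" for key, value in conf.items())


# A trimmed-down subset of the REST catalog config shown above
rest_conf = {
    "spark.sql.catalog.rest": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.rest.type": "rest",
    "spark.sql.catalog.rest.uri": "http://localhost:8181",
    "spark.sql.defaultCatalog": "rest",
}

print("\n".join(to_cli_flags(rest_conf)))
print(to_defaults_file(rest_conf))
```

Keeping the two forms generated from one mapping is one way to avoid the drift between CLI examples and properties-file examples that this PR is cleaning up.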

Rendered Docs

site/docs/spark-quickstart.md (http://127.0.0.1:8000/spark-quickstart/#adding-catalogs)

Screenshot 2024-12-22 at 2 23 16 PM

docs/docs/spark-getting-started.md (http://127.0.0.1:8000/docs/nightly/spark-getting-started/#adding-catalogs)

Screenshot 2024-12-22 at 2 24 07 PM

site/docs/how-to-release.md (http://127.0.0.1:8000/how-to-release/#verifying-with-spark)

Screenshot 2024-12-22 at 2 24 28 PM

@github-actions bot added the docs label Dec 22, 2024
@kevinjqliu force-pushed the kevinjqliu/getting-started-without-hadoop-catalog branch from 00ca569 to 6fe50e1 on December 22, 2024 18:58
@kevinjqliu force-pushed the kevinjqliu/getting-started-without-hadoop-catalog branch from 496e51e to 63c9a1a on December 22, 2024 21:01
@kevinjqliu (Contributor Author):

Note: there are two "getting started" docs: this one and site/docs/spark-quickstart.md.

@@ -269,42 +273,104 @@ To read a table, simply use the Iceberg table's name.

### Adding A Catalog

Iceberg has several catalog back-ends that can be used to track tables, like JDBC, Hive MetaStore and Glue.
Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. In this guide,
we use JDBC, but you can follow these instructions to configure other catalog types. To learn more, check out
@kevinjqliu (Contributor Author):

Weird that the guide already mentions JDBC here, but the example is still Hadoop.

Comment on lines +29 to +33
- [Configuring JDBC Catalog](#configuring-jdbc-catalog)
- [Configuring REST Catalog](#configuring-rest-catalog)
- [Next steps](#next-steps)
- [Adding Iceberg to Spark](#adding-iceberg-to-spark)
- [Learn More](#learn-more)
@kevinjqliu (Contributor Author):

renders the subsection correctly
Screenshot 2024-12-22 at 1 22 48 PM

--conf spark.sql.catalog.local.warehouse=$PWD/warehouse
```

For example configuring a REST-based catalog, see [Configuring REST Catalog](/spark-quickstart#configuring-rest-catalog)
@kevinjqliu (Contributor Author):

Instead of repeating the REST catalog configuration here, just link to site/docs/spark-quickstart.md. I double-checked the link locally.

--conf spark.sql.catalog.local.type=jdbc \
--conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
--conf spark.sql.defaultCatalog=local
@kevinjqliu (Contributor Author):

add defaultCatalog to match other pages

spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type hive
spark.sql.catalog.local org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type hadoop
spark.sql.catalog.local.warehouse $PWD/warehouse
@kevinjqliu (Contributor Author):

`$PWD` does not expand in spark-defaults.conf; keeping it here would create a folder literally named `$PWD`.
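Since Spark reads property values literally, one hedged workaround is to generate spark-defaults.conf with the working directory already expanded. A minimal sketch — the template dict and its keys are taken from the JDBC config above, but generating the file this way is my suggestion, not something the PR does:

```python
import os

# Sketch: expand the current working directory when generating
# spark-defaults.conf, since Spark will not expand `$PWD` itself.
template = {
    "spark.sql.catalog.local.uri": "jdbc:sqlite:{cwd}/iceberg_catalog_db.sqlite",
    "spark.sql.catalog.local.warehouse": "{cwd}/warehouse",
}

cwd = os.getcwd()
lines = [f"{key}    {value.format(cwd=cwd)}" for key, value in template.items()]
conf_text = "\n".join(lines)
print(conf_text)  # write this out with open("spark-defaults.conf", "w") as needed
```

Relative paths (as the PR uses) also work; the absolute-path variant just makes the config independent of where spark-sql is launched from.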

@kevinjqliu kevinjqliu marked this pull request as ready for review December 22, 2024 22:28
@jbonofre jbonofre self-requested a review December 23, 2024 06:17

=== "CLI"

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }},org.xerial:sqlite-jdbc:3.46.1.3 \
@kevinjqliu (Contributor Author):

Taking on this extra dependency since I don't see any Iceberg-specific package I can use; there is a hive-jdbc package.

Successfully merging this pull request may close these issues.

[Docs] Update Examples to Replace Hadoop Catalog with JDBC Catalog