[docs] Replace examples of Hadoop catalog with JDBC catalog #11845

Open · wants to merge 11 commits into base: main
Conversation

@kevinjqliu (Contributor) commented Dec 22, 2024

Closes #11284
devlist discussion

This PR replaces examples of the Hadoop catalog with examples of the JDBC catalog, and adds examples of setting up a REST catalog.

Testing

spark-quickstart.md using JDBC catalog

Using spark-sql CLI config:

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.xerial:sqlite-jdbc:3.46.1.3 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=jdbc \
    --conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
    --conf spark.sql.defaultCatalog=local
```

Using spark-defaults.conf file:

```
spark.jars.packages                                  org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.xerial:sqlite-jdbc:3.46.1.3
spark.sql.extensions                                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog                      org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type                 hive
spark.sql.catalog.local                              org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type                         jdbc
spark.sql.catalog.local.uri                          jdbc:sqlite:iceberg_catalog_db.sqlite
spark.sql.catalog.local.warehouse                    warehouse
spark.sql.defaultCatalog                             local
```

```sh
spark-sql --properties-file ./spark-defaults.conf
```
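After creating tables through the JDBC catalog, the SQLite file holds the catalog state. A minimal sketch of what that state looks like, using Python's stdlib `sqlite3` — the `iceberg_tables` schema below is an assumption mirroring what Iceberg's `JdbcCatalog` maintains, and the inserted row is made up for illustration, not taken from this PR:

```python
import sqlite3

# Sketch: mimic the `iceberg_tables` table that Iceberg's JdbcCatalog keeps
# in the configured JDBC database (schema is an assumption for illustration).
conn = sqlite3.connect(":memory:")  # the PR uses iceberg_catalog_db.sqlite on disk
conn.execute(
    """
    CREATE TABLE iceberg_tables (
        catalog_name TEXT NOT NULL,
        table_namespace TEXT NOT NULL,
        table_name TEXT NOT NULL,
        metadata_location TEXT,
        previous_metadata_location TEXT,
        PRIMARY KEY (catalog_name, table_namespace, table_name)
    )
    """
)

# A row like the one a CREATE TABLE in spark-sql would produce (values are hypothetical)
conn.execute(
    "INSERT INTO iceberg_tables VALUES (?, ?, ?, ?, ?)",
    ("local", "db", "t1", "warehouse/db/t1/metadata/00000-abc.metadata.json", None),
)

# The catalog resolves a table by looking up its current metadata file
row = conn.execute(
    "SELECT metadata_location FROM iceberg_tables "
    "WHERE catalog_name = ? AND table_namespace = ? AND table_name = ?",
    ("local", "db", "t1"),
).fetchone()
print(row[0])
```

This is why no Hadoop-style directory listing is needed: the JDBC database is the single source of truth for each table's current metadata location.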

spark-quickstart.md using REST catalog

With spark-sql CLI config:

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.apache.iceberg:iceberg-aws-bundle:1.7.1 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.rest.type=rest \
    --conf spark.sql.catalog.rest.uri=http://localhost:8181 \
    --conf spark.sql.catalog.rest.warehouse=s3://warehouse/ \
    --conf spark.sql.catalog.rest.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.rest.s3.endpoint=http://localhost:9000 \
    --conf spark.sql.catalog.rest.s3.path-style-access=true \
    --conf spark.sql.catalog.rest.s3.access-key-id=admin \
    --conf spark.sql.catalog.rest.s3.secret-access-key=password \
    --conf spark.sql.catalog.rest.client.region=us-east-1 \
    --conf spark.sql.defaultCatalog=rest
```

With spark-defaults.conf file:

```
spark.jars.packages                                  org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.apache.iceberg:iceberg-aws-bundle:1.7.1
spark.sql.extensions                                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog                      org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type                 hive
spark.sql.catalog.rest                               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type                          rest
spark.sql.catalog.rest.uri                           http://localhost:8181
spark.sql.catalog.rest.warehouse                     s3://warehouse/
spark.sql.catalog.rest.io-impl                       org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.rest.s3.endpoint                   http://localhost:9000
spark.sql.catalog.rest.s3.path-style-access          true
spark.sql.catalog.rest.s3.access-key-id              admin
spark.sql.catalog.rest.s3.secret-access-key          password
spark.sql.catalog.rest.client.region                 us-east-1
spark.sql.defaultCatalog                             rest
```

```sh
spark-sql --properties-file ./spark-defaults.conf
```
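The CLI `--conf` flags and the spark-defaults.conf file carry exactly the same key/value pairs in two syntaxes. A small sketch making that equivalence concrete — the helper functions and the trimmed-down config dict are hypothetical, for illustration only:

```python
# Sketch: render one catalog configuration mapping both as spark-sql
# `--conf` flags and as spark-defaults.conf lines (hypothetical helpers).
def to_cli_flags(conf: dict) -> list[str]:
    """Render each key=value pair as a --conf argument for spark-sql."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


def to_defaults_file(conf: dict) -> str:
    """Render the pairs as aligned spark-defaults.conf lines."""
    width = max(len(key) for key in conf) + 4
    return "\n".join(f"{key:<{width}}{value}" for key, value in conf.items())


# A trimmed-down subset of the REST catalog config shown above
rest_conf = {
    "spark.sql.catalog.rest": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.rest.type": "rest",
    "spark.sql.catalog.rest.uri": "http://localhost:8181",
    "spark.sql.defaultCatalog": "rest",
}

print("\n".join(to_cli_flags(rest_conf)))
print(to_defaults_file(rest_conf))
```

Keeping the two forms generated from one mapping is one way to avoid the drift between CLI examples and properties-file examples that this PR is cleaning up.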

Rendered Docs

site/docs/spark-quickstart.md (http://127.0.0.1:8000/spark-quickstart/#adding-catalogs)

Screenshot 2024-12-22 at 2 23 16 PM

docs/docs/spark-getting-started.md (http://127.0.0.1:8000/docs/nightly/spark-getting-started/#adding-catalogs)

Screenshot 2024-12-22 at 2 24 07 PM

site/docs/how-to-release.md (http://127.0.0.1:8000/how-to-release/#verifying-with-spark)

Screenshot 2024-12-22 at 2 24 28 PM

@github-actions bot added the docs label Dec 22, 2024
@kevinjqliu force-pushed the kevinjqliu/getting-started-without-hadoop-catalog branch from 00ca569 to 6fe50e1 on December 22, 2024 18:58
@kevinjqliu force-pushed the kevinjqliu/getting-started-without-hadoop-catalog branch from 496e51e to 63c9a1a on December 22, 2024 21:01
@kevinjqliu (Contributor Author):

Note: there are two "getting started" docs: this one and site/docs/spark-quickstart.md.

@@ -269,42 +273,104 @@ To read a table, simply use the Iceberg table's name.

### Adding A Catalog

Iceberg has several catalog back-ends that can be used to track tables, like JDBC, Hive MetaStore and Glue.
Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. In this guide,
we use JDBC, but you can follow these instructions to configure other catalog types. To learn more, check out
@kevinjqliu (Contributor Author):

Weird that the guide already mentions JDBC here, but the example is still Hadoop.

Comment on lines +29 to +33
- [Configuring JDBC Catalog](#configuring-jdbc-catalog)
- [Configuring REST Catalog](#configuring-rest-catalog)
- [Next steps](#next-steps)
- [Adding Iceberg to Spark](#adding-iceberg-to-spark)
- [Learn More](#learn-more)
@kevinjqliu (Contributor Author):

renders the subsection correctly
Screenshot 2024-12-22 at 1 22 48 PM

--conf spark.sql.catalog.local.warehouse=$PWD/warehouse
```

For example configuring a REST-based catalog, see [Configuring REST Catalog](/spark-quickstart#configuring-rest-catalog)
@kevinjqliu (Contributor Author):

Instead of repeating the REST catalog configuration here, just link to site/docs/spark-quickstart.md. I double-checked the link locally.

--conf spark.sql.catalog.local.type=jdbc \
--conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
--conf spark.sql.defaultCatalog=local
@kevinjqliu (Contributor Author):

add defaultCatalog to match other pages

spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type hive
spark.sql.catalog.local org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type hadoop
spark.sql.catalog.local.warehouse $PWD/warehouse
@kevinjqliu (Contributor Author):

`$PWD` does not expand in spark-defaults.conf; keeping it here would create a folder literally named `$PWD`.
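Since Spark reads property values literally, one hedged workaround is to generate spark-defaults.conf with the working directory already expanded. A minimal sketch — the template dict and its keys are taken from the JDBC config above, but generating the file this way is my suggestion, not something the PR does:

```python
import os

# Sketch: expand the current working directory when generating
# spark-defaults.conf, since Spark will not expand `$PWD` itself.
template = {
    "spark.sql.catalog.local.uri": "jdbc:sqlite:{cwd}/iceberg_catalog_db.sqlite",
    "spark.sql.catalog.local.warehouse": "{cwd}/warehouse",
}

cwd = os.getcwd()
lines = [f"{key}    {value.format(cwd=cwd)}" for key, value in template.items()]
conf_text = "\n".join(lines)
print(conf_text)  # write this out with open("spark-defaults.conf", "w") as needed
```

Relative paths (as the PR uses) also work; the absolute-path variant just makes the config independent of where spark-sql is launched from.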

@kevinjqliu kevinjqliu marked this pull request as ready for review December 22, 2024 22:28
@jbonofre jbonofre self-requested a review December 23, 2024 06:17

=== "CLI"

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }},org.xerial:sqlite-jdbc:3.46.1.3 \
@kevinjqliu (Contributor Author):

Taking on this extra dependency since I don't see any Iceberg-specific package I can use; there is a hive-jdbc package.

Successfully merging this pull request may close these issues.

[Docs] Update Examples to Replace Hadoop Catalog with JDBC Catalog