[DOCS] Create separate Iceberg docs
Jay Chia committed Feb 1, 2024
1 parent 59af587 commit 5fcc1cc
Showing 3 changed files with 41 additions and 13 deletions.
@@ -29,15 +29,3 @@ This is accomplished by techniques such as:
1. **Partition pruning:** ignore files whose partition values don't match filter predicates
2. **Schema retrieval:** convert the schema provided by the data catalog into a Daft schema instead of sampling a schema from the data
3. **Metadata execution:** utilize metadata such as row counts to read the bare minimum amount of data necessary from storage
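To make techniques (1) and (3) concrete, here is a minimal pure-Python sketch of how a scan can use catalog metadata to avoid reading data. This is an illustration only, not Daft's internals; the file entries, field names, and `prune` helper are hypothetical examples of the kind of per-file metadata a table format tracks.

```python
# Hypothetical per-file metadata, as a data catalog might expose it
files = [
    {"path": "s3://bucket/t/date=2024-01-01/a.parquet", "partition": {"date": "2024-01-01"}, "row_count": 1000},
    {"path": "s3://bucket/t/date=2024-01-02/b.parquet", "partition": {"date": "2024-01-02"}, "row_count": 2000},
    {"path": "s3://bucket/t/date=2024-01-03/c.parquet", "partition": {"date": "2024-01-03"}, "row_count": 500},
]

def prune(files, column, value):
    """Partition pruning: keep only files whose partition value matches the predicate."""
    return [f for f in files if f["partition"][column] == value]

# Partition pruning: a filter like `date == "2024-01-02"` eliminates two of three files
selected = prune(files, "date", "2024-01-02")

# Metadata execution: a query like COUNT(*) over the filtered table can be
# answered from row counts alone, without reading any data from storage
total_rows = sum(f["row_count"] for f in selected)
```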

Data Catalog Integrations
-------------------------

Apache Iceberg
^^^^^^^^^^^^^^

Apache Iceberg is an open-source table format originally developed at Netflix for large-scale analytical datasets.

To read from the Apache Iceberg table format, use the :func:`daft.read_iceberg` function.

We integrate closely with `PyIceberg <https://py.iceberg.apache.org/>`_ (the official Python implementation for Apache Iceberg) and allow Daft DataFrames to be read from PyIceberg Table objects.
2 changes: 1 addition & 1 deletion docs/source/user_guide/integrations.rst
@@ -3,4 +3,4 @@ Integrations

.. toctree::

integrations/data_catalogs
integrations/iceberg
40 changes: 40 additions & 0 deletions docs/source/user_guide/integrations/iceberg.rst
@@ -0,0 +1,40 @@
Apache Iceberg
==============

`Apache Iceberg <https://iceberg.apache.org/>`_ is an open-source table format originally developed at Netflix for large-scale analytical datasets.

To read from the Apache Iceberg table format, use the :func:`daft.read_iceberg` function.

We integrate closely with `PyIceberg <https://py.iceberg.apache.org/>`_ (the official Python implementation for Apache Iceberg) and allow Daft DataFrames to be read easily from PyIceberg Table objects.

.. code:: python

    # Access a PyIceberg table as per normal
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("my_iceberg_catalog")
    table = catalog.load_table("my_namespace.my_table")

    # Create a Daft DataFrame
    import daft

    df = daft.read_iceberg(table)

Daft currently natively supports:

1. **Distributed Reads:** Daft will fully distribute the I/O of reads over your compute resources (whether on Ray or via multithreading on the local PyRunner)
2. **Skipping filtered data:** Daft uses ``df.where(...)`` filter calls to read only the data that matches your predicates
3. **All Catalogs From PyIceberg:** Daft is natively integrated with PyIceberg, and supports all the catalogs that PyIceberg does!
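The data skipping in (2) can be illustrated with a small pure-Python sketch. Iceberg records per-file column statistics such as min/max values, so a predicate like ``x > 100`` allows a reader to skip any file whose maximum ``x`` cannot satisfy it. This is an illustration of the idea only, with hypothetical file entries, not Daft's actual implementation.

```python
# Hypothetical per-file column statistics, as Iceberg records them
files = [
    {"path": "a.parquet", "min_x": 0,   "max_x": 50},   # cannot match x > 100: skipped
    {"path": "b.parquet", "min_x": 90,  "max_x": 150},  # may contain matching rows
    {"path": "c.parquet", "min_x": 200, "max_x": 300},  # may contain matching rows
]

def files_to_read(files, threshold):
    """Keep only files whose stats overlap the predicate `x > threshold`."""
    return [f["path"] for f in files if f["max_x"] > threshold]

# Only two of the three files need to be read from storage
paths = files_to_read(files, 100)
```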

Selecting a Table
*****************

Daft currently leverages PyIceberg for catalog/table discovery. Please consult `PyIceberg documentation <https://py.iceberg.apache.org/api/#load-a-table>`_ for more details on how to load a table!

Roadmap
*******

Here are Iceberg features that are currently works in progress:

1. Iceberg V2 merge-on-read features
2. Writing back to an Iceberg table (appends, overwrites, upserts)
