[DOCS] Add documentation for read_iceberg
Jay Chia committed Jan 9, 2024
1 parent fd662c1 commit c995b1a
Showing 5 changed files with 114 additions and 23 deletions.
24 changes: 24 additions & 0 deletions daft/io/_iceberg.py
@@ -70,6 +70,30 @@ def read_iceberg(
pyiceberg_table: "PyIcebergTable",
io_config: Optional["IOConfig"] = None,
) -> DataFrame:
"""Create a DataFrame from an Iceberg table.
Example:
>>> import pyiceberg
>>>
>>> pyiceberg_table = pyiceberg.Table(...)
>>> df = daft.read_iceberg(pyiceberg_table)
>>>
>>> # Filters on this dataframe can now be pushed into
>>> # the read operation from Iceberg
>>> df = df.where(df["foo"] > 5)
>>> df.show()
.. NOTE::
This function requires the use of `PyIceberg <https://py.iceberg.apache.org/>`_, Apache Iceberg's
official Python project.
Args:
pyiceberg_table: Iceberg table created using the PyIceberg library
io_config: A custom IOConfig to use when accessing Iceberg object storage data. Defaults to None.
Returns:
DataFrame: a DataFrame with the schema converted from the specified Iceberg table
"""
from daft.iceberg.iceberg_scan import IcebergScanOperator

io_config = (
58 changes: 35 additions & 23 deletions docs/source/api_docs/creation.rst
@@ -20,71 +20,83 @@ Python Objects
from_pylist
from_pydict

Arrow
~~~~~
Files
-----

.. _df-io-files:

Parquet
~~~~~~~

.. _daft-read-parquet:

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

read_parquet

CSV
~~~

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

from_arrow
read_csv

Pandas
~~~~~~
JSON
~~~~

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

from_pandas
read_json

File Paths
~~~~~~~~~~
Data Catalogs
-------------

Apache Iceberg
^^^^^^^^^^^^^^

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

from_glob_path
read_iceberg

Files
-----

.. _df-io-files:
Arrow
~~~~~

Parquet
~~~~~~~
.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

.. _daft-read-parquet:

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

read_parquet
from_arrow

CSV
~~~
Pandas
~~~~~~

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

read_csv
from_pandas

JSON
~~~~
File Paths
~~~~~~~~~~

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_functions

read_json
from_glob_path

Integrations
------------
6 changes: 6 additions & 0 deletions docs/source/user_guide/index.rst
@@ -9,6 +9,7 @@ Daft User Guide
basic_concepts
daft_in_depth
poweruser
integrations
tutorials

Welcome to **Daft**!
@@ -61,6 +62,11 @@ Core Daft concepts all Daft users will find useful to understand deeply.

Become a true Daft Poweruser! This section explores advanced topics to help you configure Daft for specific application environments, improve reliability and optimize for performance.

:doc:`Integrations <integrations>`
**********************************

Learn how to use Daft's integrations with other technologies such as Ray Datasets or Apache Iceberg.

:doc:`Tutorials <tutorials>`
****************************

6 changes: 6 additions & 0 deletions docs/source/user_guide/integrations.rst
@@ -0,0 +1,6 @@
Integrations
============

.. toctree::

integrations/data_catalogs
43 changes: 43 additions & 0 deletions docs/source/user_guide/integrations/data_catalogs.rst
@@ -0,0 +1,43 @@
Data Catalogs
=============

**Data Catalogs** are services that provide access to **Tables** of data. **Tables** are powerful abstractions for large datasets in storage, providing many benefits over naively storing data as just a bunch of CSV/Parquet files.

There are many different **Table Formats** employed by Data Catalogs. These table formats differ in implementation and capabilities, but often provide advantages such as:

1. **Schema:** what data do these files contain?
2. **Partitioning Specification:** how is the data organized?
3. **Statistics/Metadata:** how many rows does each file contain, and what are the min/max values of each file's columns?
4. **ACID compliance:** updates to the table are atomic

.. NOTE::
The names of Table Formats and their Data Catalogs are often used interchangeably.

For example, "Apache Iceberg" often refers to both the Data Catalog and its Table Format.

You can retrieve an **Apache Iceberg Table** from an **Apache Iceberg REST Data Catalog**.

However, some Data Catalogs allow for many different underlying Table Formats. For example, you can request both an **Apache Iceberg Table** or a **Hive Table** from an **AWS Glue Data Catalog**.

Why use Data Catalogs?
----------------------

Daft can effectively leverage the statistics and metadata provided by these Data Catalogs' Tables to dramatically speed up queries.

This is accomplished by techniques such as:

1. **Partition pruning:** ignore files where their partition values don't match filter predicates
2. **Schema retrieval:** convert the schema provided by the data catalog into a Daft schema instead of sampling a schema from the data
3. **Metadata execution:** utilize metadata such as row counts to read the bare minimum amount of data necessary from storage
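The first two techniques above can be sketched in plain Python. This is an illustrative model of per-file metadata pruning, not Daft's actual implementation; all file paths, partition values, and statistics here are hypothetical:

```python
# Per-file metadata as a table format like Iceberg might track it:
# partition value plus min/max statistics for a column "foo".
files = [
    {"path": "part=2023/a.parquet", "partition": 2023, "min_foo": 0, "max_foo": 4},
    {"path": "part=2024/b.parquet", "partition": 2024, "min_foo": 3, "max_foo": 9},
    {"path": "part=2024/c.parquet", "partition": 2024, "min_foo": 10, "max_foo": 20},
]

def prune(files, partition, foo_greater_than):
    """Keep only files that could contain rows matching the predicates."""
    kept = []
    for f in files:
        if f["partition"] != partition:
            # Partition pruning: this file's partition can never match.
            continue
        if f["max_foo"] <= foo_greater_than:
            # Statistics skipping: no row in this file can satisfy foo > threshold.
            continue
        kept.append(f)
    return kept

# For the predicate (partition == 2024 AND foo > 5), only b and c survive,
# so a.parquet is never read from storage.
print([f["path"] for f in prune(files, partition=2024, foo_greater_than=5)])
```

The engine consults only lightweight metadata before touching storage, which is why catalog-backed reads can skip most of a large table's files outright.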

Data Catalog Integrations
-------------------------

Apache Iceberg
^^^^^^^^^^^^^^

Apache Iceberg is an open-source table format originally developed at Netflix for large-scale analytical datasets.

To read from the Apache Iceberg table format, use the :func:`daft.read_iceberg` function.

We integrate closely with `PyIceberg <https://py.iceberg.apache.org/>`_ (the official Python implementation of Apache Iceberg), allowing Daft DataFrames to be read from PyIceberg Table objects.
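A typical end-to-end flow might look like the sketch below. It assumes a reachable, already-configured Iceberg catalog; the catalog name ``my_catalog`` and table identifier ``my_db.my_table`` are hypothetical placeholders for your own deployment, so this fragment will not run without one:

```python
import daft
from pyiceberg.catalog import load_catalog

# Load a PyIceberg catalog and table (names here are hypothetical and
# depend entirely on your Iceberg deployment and catalog configuration).
catalog = load_catalog("my_catalog")
table = catalog.load_table("my_db.my_table")

# Hand the PyIceberg Table to Daft. Filters applied to the resulting
# DataFrame can be pushed down into the Iceberg scan, enabling the
# partition pruning and statistics skipping described above.
df = daft.read_iceberg(table)
df = df.where(df["foo"] > 5)
df.show()
```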
