From 87ca07cb3505a04376c559352720ae5390ac3da4 Mon Sep 17 00:00:00 2001
From: Jay Chia
Date: Fri, 2 Feb 2024 14:52:39 -0800
Subject: [PATCH] [DOCS] Add type conversions between iceberg and daft

---
 .../user_guide/integrations/iceberg.rst | 75 +++++++++++++++++--
 1 file changed, 68 insertions(+), 7 deletions(-)

diff --git a/docs/source/user_guide/integrations/iceberg.rst b/docs/source/user_guide/integrations/iceberg.rst
index be33356b04..19ed131ad7 100644
--- a/docs/source/user_guide/integrations/iceberg.rst
+++ b/docs/source/user_guide/integrations/iceberg.rst
@@ -3,9 +3,19 @@ Apache Iceberg
 
 `Apache Iceberg `_ is an open-sourced table format originally developed at Netflix for large-scale analytical datasets.
 
+Daft currently natively supports:
+
+1. **Distributed Reads:** Daft will fully distribute the I/O of reads over your compute resources (whether on Ray or via multithreading on the local PyRunner)
+2. **Skipping filtered data:** Daft uses ``df.where(...)`` filter calls to read only data that matches your predicates
+3. **All Catalogs From PyIceberg:** Daft is natively integrated with PyIceberg and supports all the catalogs that PyIceberg does
+
+Reading a Table
+***************
+
 To read from the Apache Iceberg table format, use the :func:`daft.read_iceberg` function.
 
 We integrate closely with `PyIceberg `_ (the official Python implementation for Apache Iceberg) and allow the reading of Daft dataframes easily from PyIceberg's Table objects.
+The following snippet loads an example table; for more information, please consult the `PyIceberg Table loading documentation `_.
 
 .. code:: python
 
@@ -15,21 +25,71 @@ We integrate closely with `PyIceberg `_ (the off
     catalog = load_catalog("my_iceberg_catalog")
     table = catalog.load_table("my_namespace.my_table")
 
+After a table is loaded as the ``table`` object, reading it into a DataFrame is straightforward.
+
+.. code:: python
+
     # Create a Daft Dataframe
     import daft
 
     df = daft.read_iceberg(table)
 
-Daft currently natively supports:
+Any subsequent filter operations on the Daft ``df`` DataFrame object will be correctly optimized to take advantage of Iceberg features such as hidden partitioning and file-level statistics for efficient reads.
 
-1. **Distributed Reads:** Daft will fully distribute the I/O of reads over your compute resources (whether Ray or on multithreading on the local PyRunner)
-2. **Skipping filtered data:** Daft uses ``df.where(...)`` filter calls to only read data that matches your predicates
-3. **All Catalogs From PyIceberg:** Daft is natively integrated with PyIceberg, and supports all the catalogs that PyIceberg does!
+.. code:: python
+
+    # Filter which takes advantage of partition pruning capabilities of Iceberg
+    df = df.where(df["partition_key"] < 1000)
+    df.show()
+
+Type System
+***********
+
+Daft and Iceberg have compatible type systems. Here is how types are converted between the two systems.
 
-Selecting a Table
-*****************
+When reading from an Iceberg table into Daft:
 
-Daft currently leverages PyIceberg for catalog/table discovery. Please consult `PyIceberg documentation `_ for more details on how to load a table!
++-----------------------------+------------------------------------------------------------------------------------------+
+| Iceberg                     | Daft                                                                                     |
++=============================+==========================================================================================+
+| **Primitive Types**                                                                                                    |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `boolean`                   | :meth:`daft.DataType.bool() `                                                            |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `int`                       | :meth:`daft.DataType.int32() `                                                           |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `long`                      | :meth:`daft.DataType.int64() `                                                           |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `float`                     | :meth:`daft.DataType.float32() `                                                         |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `double`                    | :meth:`daft.DataType.float64() `                                                         |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `decimal(precision, scale)` | :meth:`daft.DataType.decimal128(precision, scale) `                                      |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `date`                      | :meth:`daft.DataType.date() `                                                            |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `time`                      | :meth:`daft.DataType.int64() `                                                           |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `timestamp`                 | :meth:`daft.DataType.timestamp(timeunit="us", timezone=None) `                           |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `timestamptz`               | :meth:`daft.DataType.timestamp(timeunit="us", timezone="UTC") `                          |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `string`                    | :meth:`daft.DataType.string() `                                                          |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `uuid`                      | :meth:`daft.DataType.binary() `                                                          |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `fixed(L)`                  | :meth:`daft.DataType.binary() `                                                          |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `binary`                    | :meth:`daft.DataType.binary() `                                                          |
++-----------------------------+------------------------------------------------------------------------------------------+
+| **Nested Types**                                                                                                       |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `struct(fields)`            | :meth:`daft.DataType.struct(fields) `                                                    |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `list(child_type)`          | :meth:`daft.DataType.list(child_type) `                                                  |
++-----------------------------+------------------------------------------------------------------------------------------+
+| `map(K, V)`                 | :meth:`daft.DataType.struct({"key": K, "value": V}) `                                    |
++-----------------------------+------------------------------------------------------------------------------------------+
 
 Roadmap
 *******
@@ -38,3 +98,4 @@ Here are features of Iceberg that are works-in-progress.
 
 1. Iceberg V2 merge-on-read features
 2. Writing back to an Iceberg table (appends, overwrites, upserts)
+3. More extensive usage of Iceberg-provided statistics to further optimize queries
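One conversion in the patch's type table deserves a brief note: Iceberg `time` values arrive in Daft as plain int64s rather than a dedicated time type. Iceberg physically stores `time` as microseconds since midnight, so the int64 presumably carries that encoding. The sketch below illustrates that encoding with only the standard library; it is an assumption for illustration, not Daft's internal code.

```python
from datetime import time


def time_to_int64(t: time) -> int:
    """Encode a time as microseconds since midnight.

    Mirrors Iceberg's physical representation of the `time` type;
    assumed (not verified) to match the int64 values Daft reads out.
    """
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond


print(time_to_int64(time(12, 30, 15, 500)))  # 45015000500
```

Decoding is the reverse: divide out microseconds, then seconds and minutes, to recover the clock time.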