From c88da204c463d74b8bc317d5cd3a1a86bb7f9daf Mon Sep 17 00:00:00 2001 From: Sammy Sidhu Date: Mon, 12 Feb 2024 17:37:33 -0800 Subject: [PATCH 1/3] refactor docs --- README.rst | 21 ++++++++++++------- docs/source/_static/dataframe-comp-table.csv | 2 +- docs/source/faq/dataframe_comparison.rst | 4 ++-- docs/source/index.rst | 9 +++++++- .../basic_concepts/introduction.rst | 12 +++++------ .../user_guide/daft_in_depth/datatypes.rst | 2 +- pyproject.toml | 2 +- 7 files changed, 33 insertions(+), 19 deletions(-) diff --git a/README.rst b/README.rst index 824f4a04e5..3638a691ad 100644 --- a/README.rst +++ b/README.rst @@ -4,11 +4,18 @@ `Website `_ • `Docs `_ • `Installation`_ • `10-minute tour of Daft `_ • `Community and Support `_ -Daft: the distributed Python dataframe for complex data +Daft: Distributed dataframes for multimodal data ======================================================= -`Daft `_ is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads. +`Daft `_ is a distributed query engine for large-scale data processing in Python and is implemented in Rust. + +* **Familiar interactive API:** Lazy Python Dataframe for rapid and interactive iteration +* **Focus on the what:** Powerful Query Optimizer that rewrites queries to be as efficient as possible +* **Data Catalog integrations:** Full integration with data catalogs such as Apache Iceberg +* **Rich multimodal type-system:** Supports multimodal types such as Images, URLs, Tensors and more +* **Seamless Interchange**: Built on the `Apache Arrow `_ In-Memory Format +* **Built for the cloud:** `Record-setting `_ I/O performance for integrations with S3 cloud storage **Table of Contents** @@ -21,11 +28,11 @@ Daft: the distributed Python dataframe for complex data About Daft ---------- -The Daft dataframe is a table of data with rows and columns. Columns can contain any Python objects, which allows Daft to support rich complex data types such as images, audio, video and more. +Daft was designed with the following principles in mind: -1. **Any Data**: Beyond the usual strings/numbers/dates, Daft columns can also hold complex multimodal data such as Images, Embeddings and Python objects. Ingestion and basic transformations of complex data is extremely easy and performant in Daft. -2. **Notebook Computing**: Daft is built for the interactive developer experience on a notebook - intelligent caching/query optimizations accelerates your experimentation and data exploration. -3. **Distributed Computing**: Rich complex formats such as images can quickly outgrow your local laptop's computational resources - Daft integrates natively with `Ray `_ for running dataframes on large clusters of machines with thousands of CPUs/GPUs. +1. **Any Data**: Beyond the usual strings/numbers/dates, Daft columns can also hold complex or nested multimodal data such as Images, Embeddings and Python objects efficiently with it's Arrow based memory representation. Ingestion and basic transformations of multimodal data is extremely easy and performant in Daft. +2. **Interactive Computing**: Daft is built for the interactive developer experience through notebooks or REPLs - intelligent caching/query optimizations accelerates your experimentation and data exploration. +3. **Distributed Computing**: Some workloads can quickly outgrow your local laptop's computational resources - Daft integrates natively with `Ray `_ for running dataframes on large clusters of machines with thousands of CPUs/GPUs. Getting Started --------------- @@ -101,7 +108,7 @@ Related Projects ---------------- +---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+ -| Dataframe | Query Optimizer | Complex Types | Distributed | Arrow Backed | Vectorized Execution Engine | Out-of-core | +| Dataframe | Query Optimizer | Multimodal | Distributed | Arrow Backed | Vectorized Execution Engine | Out-of-core | +===================================================+=================+===============+=============+=================+=============================+=============+ | Daft | Yes | Yes | Yes | Yes | Yes | Yes | +---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+ diff --git a/docs/source/_static/dataframe-comp-table.csv b/docs/source/_static/dataframe-comp-table.csv index cfcfa785d9..0a192bd006 100644 --- a/docs/source/_static/dataframe-comp-table.csv +++ b/docs/source/_static/dataframe-comp-table.csv @@ -1,4 +1,4 @@ -Dataframe,Query Optimizer,Complex Types,Distributed,Arrow Backed,Vectorized Execution Engine,Out-of-core +Dataframe,Query Optimizer,Multimodal,Distributed,Arrow Backed,Vectorized Execution Engine,Out-of-core Daft,✅,✅,✅,✅,✅,✅ `Pandas `_,❌,Python object,❌,optional >= 2.0,some(Numpy),❌ `Polars `_,✅,Python object,❌,✅,✅,✅ diff --git a/docs/source/faq/dataframe_comparison.rst b/docs/source/faq/dataframe_comparison.rst index 5a76a2125a..742aacdd25 100644 --- a/docs/source/faq/dataframe_comparison.rst +++ b/docs/source/faq/dataframe_comparison.rst @@ -24,7 +24,7 @@ Pandas/Modin The main drawback of using Pandas is scalability. Pandas is single-threaded and not built for distributed computing. While this is not as much of a problem for purely tabular datasets, when dealing with data such as images/video your data can get very large and expensive to compute very quickly. -Modin is a project that provides "distributed Pandas". If the use-case is tabular, has code that is already written in Pandas but just needs to be scaled up to larger data, Modin may be a good choice. Modin aims to be 100% Pandas API compatible which means that certain operations that are important for performance in the world of complex data such as requesting for certain amount of resources (e.g. GPUs) is not yet possible. +Modin is a project that provides "distributed Pandas". If the use-case is tabular, has code that is already written in Pandas but just needs to be scaled up to larger data, Modin may be a good choice. Modin aims to be 100% Pandas API compatible which means that certain operations that are important for performance in the world of multimodal data such as requesting for certain amount of resources (e.g. GPUs) is not yet possible. Spark Dataframes ---------------- @@ -42,7 +42,7 @@ Spark excels at large scale tabular analytics, with support for running Python c #. Unravel the flattened array again on the other end * **Debugging:** Key features such as exposing print statements or breakpoints from user-defined functions to the user are missing, which make PySpark extremely difficult to develop on. -* **Lack of granular execution control:** with heavy processing of complex data, users often need more control around the execution and scheduling of their work. For example, users may need to ensure that Spark runs a single executor per GPU, but Spark's programming model makes this very difficult. +* **Lack of granular execution control:** with heavy processing of multimodal data, users often need more control around the execution and scheduling of their work. For example, users may need to ensure that Spark runs a single executor per GPU, but Spark's programming model makes this very difficult. * **Compatibility with downstream Machine Learning tasks:** Spark itself is not well suited for performing distributed ML training which is increasingly becoming the domain of frameworks such as Ray and Horovod. Integrating with such a solution is difficult and requires expert tuning of intermediate storage and data engineering solutions. Ray Datasets diff --git a/docs/source/index.rst b/docs/source/index.rst index 1de62eb697..e0ecc7ddfe 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,7 +1,14 @@ Daft Documentation ================== -Daft is a **fast and scalable Python dataframe** for complex data and machine learning workloads. +Daft is a distributed query engine for large-scale data processing in Python and is implemented in Rust. + +* **Familiar interactive API:** Lazy Python Dataframe for rapid and interactive iteration +* **Focus on the what:** Powerful Query Optimizer that rewrites queries to be as efficient as possible +* **Data Catalog integrations:** Full integration with data catalogs such as Apache Iceberg +* **Rich multimodal type-system:** Supports multimodal types such as Images, URLs, Tensors and more +* **Seamless Interchange**: Built on the `Apache Arrow `_ In-Memory Format +* **Built for the cloud:** `Record-setting `_ I/O performance for integrations with S3 cloud storage Installing Daft --------------- diff --git a/docs/source/user_guide/basic_concepts/introduction.rst b/docs/source/user_guide/basic_concepts/introduction.rst index d01d54532b..85fffde600 100644 --- a/docs/source/user_guide/basic_concepts/introduction.rst +++ b/docs/source/user_guide/basic_concepts/introduction.rst @@ -1,12 +1,12 @@ Introduction ============ -Daft is a data processing library that has two main classes: +Daft is a distributed query engine with a DataFrame API. The two key concepts to Daft are: -1. :class:`DataFrame `: a DataFrame consisting of rows and columns of data -2. :class:`Expression `: an expression representing some (delayed) computation to execute on columns of data +1. :class:`DataFrame `: a Table-like structure that represents rows and columns of data +2. :class:`Expression `: a symbolic representation of computation that transforms columns of the DataFrame to a new one. -With Daft, you create :class:`DataFrame ` from a variety of sources (e.g. reading data from files or from Python dictionaries) and use :class:`Expression ` to manipulate data in that DataFrame. Let's take a closer look at these two abstractions! +With Daft, you create :class:`DataFrame ` from a variety of sources (e.g. reading data from files, data catalogs or from Python dictionaries) and use :class:`Expression ` to manipulate data in that DataFrame. Let's take a closer look at these two abstractions! DataFrame --------- @@ -29,8 +29,8 @@ Using this abstraction of a DataFrame, you can run common tabular operations suc Daft DataFrames are: 1. **Distributed:** your data is split into *Partitions* and can be processed in parallel/on different machines -2. **Lazy:** computations are enqueued in a query plan, and only executed when requested -3. **Complex:** columns can contain complex datatypes such as tensors, images and Python objects +2. **Lazy:** computations are enqueued in a query plan and optimized and are only executed when requested +3. **Multimodal:** columns can contain complex datatypes such as tensors, images and Python objects Since Daft is lazy, it can actually execute the query plan on a variety of different backends. By default, it will run computations locally using Python multithreading. However if you need to scale to large amounts of data that cannot be processed on a single machine, using the Ray runner allows Daft to run computations on a `Ray `_ cluster instead. diff --git a/docs/source/user_guide/daft_in_depth/datatypes.rst b/docs/source/user_guide/daft_in_depth/datatypes.rst index 9b57b41776..f2af4ef1f0 100644 --- a/docs/source/user_guide/daft_in_depth/datatypes.rst +++ b/docs/source/user_guide/daft_in_depth/datatypes.rst @@ -70,7 +70,7 @@ See also: Nested ------ -Nested DataTypes wrap other DataTypes, allowing you to compose types into complex datastructures. +Nested DataTypes wrap other DataTypes, allowing you to compose types into complex data structures. Examples: diff --git a/pyproject.toml b/pyproject.toml index 84fdfd2d0b..3575ce769f 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -12,7 +12,7 @@ dependencies = [ "typing-extensions >= 4.0.0; python_version < '3.10'", "pickle5 >= 0.0.12; python_version < '3.8'" ] -description = "A Distributed DataFrame library for large scale complex data processing." +description = "Distributed Dataframes for Multimodal Data" dynamic = ["version"] license = {file = "LICENSE"} maintainers = [ From 1cf5492bde1ef22c6745c1ee2412c5cda4930892 Mon Sep 17 00:00:00 2001 From: Sammy Sidhu Date: Mon, 12 Feb 2024 17:38:40 -0800 Subject: [PATCH 2/3] extra space --- README.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.rst b/README.rst index 3638a691ad..a02a13c129 100644 --- a/README.rst +++ b/README.rst @@ -8,7 +8,7 @@ Daft: Distributed dataframes for multimodal data ======================================================= -`Daft `_ is a distributed query engine for large-scale data processing in Python and is implemented in Rust. +`Daft `_ is a distributed query engine for large-scale data processing in Python and is implemented in Rust. * **Familiar interactive API:** Lazy Python Dataframe for rapid and interactive iteration * **Focus on the what:** Powerful Query Optimizer that rewrites queries to be as efficient as possible From fc6dfb7e9ddb1bf48cf8993d1a1696831989f953 Mon Sep 17 00:00:00 2001 From: Sammy Sidhu Date: Mon, 12 Feb 2024 17:54:19 -0800 Subject: [PATCH 3/3] fixed double and --- docs/source/user_guide/basic_concepts/introduction.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/basic_concepts/introduction.rst b/docs/source/user_guide/basic_concepts/introduction.rst index 85fffde600..2fa1c8fa94 100644 --- a/docs/source/user_guide/basic_concepts/introduction.rst +++ b/docs/source/user_guide/basic_concepts/introduction.rst @@ -29,7 +29,7 @@ Using this abstraction of a DataFrame, you can run common tabular operations suc Daft DataFrames are: 1. **Distributed:** your data is split into *Partitions* and can be processed in parallel/on different machines -2. **Lazy:** computations are enqueued in a query plan and optimized and are only executed when requested +2. **Lazy:** computations are enqueued in a query plan which is then optimized and executed only when requested 3. **Multimodal:** columns can contain complex datatypes such as tensors, images and Python objects Since Daft is lazy, it can actually execute the query plan on a variety of different backends. By default, it will run computations locally using Python multithreading. However if you need to scale to large amounts of data that cannot be processed on a single machine, using the Ray runner allows Daft to run computations on a `Ray `_ cluster instead.