[DOCS] Drop use of "Complex Data" in favor of multimodal (#1875)
samster25 authored Feb 13, 2024
1 parent 4caeed4 commit adac24c
Showing 7 changed files with 33 additions and 19 deletions.
21 changes: 14 additions & 7 deletions README.rst
@@ -4,11 +4,18 @@

 `Website <https://www.getdaft.io>`_ • `Docs <https://www.getdaft.io/projects/docs/>`_ • `Installation`_ • `10-minute tour of Daft <https://www.getdaft.io/projects/docs/en/latest/learn/10-min.html>`_ • `Community and Support <https://github.com/Eventual-Inc/Daft/discussions>`_

-Daft: the distributed Python dataframe for complex data
+Daft: Distributed dataframes for multimodal data
 =======================================================


-`Daft <https://www.getdaft.io>`_ is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads.
+`Daft <https://www.getdaft.io>`_ is a distributed query engine for large-scale data processing in Python and is implemented in Rust.
+
+* **Familiar interactive API:** Lazy Python Dataframe for rapid and interactive iteration
+* **Focus on the what:** Powerful Query Optimizer that rewrites queries to be as efficient as possible
+* **Data Catalog integrations:** Full integration with data catalogs such as Apache Iceberg
+* **Rich multimodal type-system:** Supports multimodal types such as Images, URLs, Tensors and more
+* **Seamless Interchange**: Built on the `Apache Arrow <https://arrow.apache.org/docs/index.html>`_ In-Memory Format
+* **Built for the cloud:** `Record-setting <https://blog.getdaft.io/p/announcing-daft-02-10x-faster-io>`_ I/O performance for integrations with S3 cloud storage

 **Table of Contents**

@@ -21,11 +28,11 @@ Daft: the distributed Python dataframe for complex data
 About Daft
 ----------

-The Daft dataframe is a table of data with rows and columns. Columns can contain any Python objects, which allows Daft to support rich complex data types such as images, audio, video and more.
 Daft was designed with the following principles in mind:

-1. **Any Data**: Beyond the usual strings/numbers/dates, Daft columns can also hold complex multimodal data such as Images, Embeddings and Python objects. Ingestion and basic transformations of complex data is extremely easy and performant in Daft.
-2. **Notebook Computing**: Daft is built for the interactive developer experience on a notebook - intelligent caching/query optimizations accelerates your experimentation and data exploration.
-3. **Distributed Computing**: Rich complex formats such as images can quickly outgrow your local laptop's computational resources - Daft integrates natively with `Ray <https://www.ray.io>`_ for running dataframes on large clusters of machines with thousands of CPUs/GPUs.
+1. **Any Data**: Beyond the usual strings/numbers/dates, Daft columns can also hold complex or nested multimodal data such as Images, Embeddings and Python objects efficiently with its Arrow-based memory representation. Ingestion and basic transformations of multimodal data are extremely easy and performant in Daft.
+2. **Interactive Computing**: Daft is built for the interactive developer experience through notebooks or REPLs - intelligent caching/query optimizations accelerate your experimentation and data exploration.
+3. **Distributed Computing**: Some workloads can quickly outgrow your local laptop's computational resources - Daft integrates natively with `Ray <https://www.ray.io>`_ for running dataframes on large clusters of machines with thousands of CPUs/GPUs.

 Getting Started
 ---------------
@@ -101,7 +108,7 @@ Related Projects
 ----------------

 +---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
-| Dataframe                                         | Query Optimizer | Complex Types | Distributed | Arrow Backed    | Vectorized Execution Engine | Out-of-core |
+| Dataframe                                         | Query Optimizer | Multimodal    | Distributed | Arrow Backed    | Vectorized Execution Engine | Out-of-core |
 +===================================================+=================+===============+=============+=================+=============================+=============+
 | Daft                                              | Yes             | Yes           | Yes         | Yes             | Yes                         | Yes         |
 +---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
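The README hunk above advertises a lazy Python dataframe paired with a query optimizer. As a rough illustration of what "lazy with an optimizer" means, here is a toy sketch in plain Python; all names (`LazyFrame`, `where`, `select`, `collect`) are hypothetical stand-ins, not Daft's actual API:

```python
# Toy sketch of a lazy dataframe: operations are queued in a plan, and a
# trivial "optimizer" rewrites the plan before anything executes.
# Hypothetical names -- this is not Daft's real API or optimizer.

class LazyFrame:
    def __init__(self, rows, plan=None):
        self.rows = rows            # source data: a list of dicts
        self.plan = plan or []      # queued (op, arg) steps; nothing runs yet

    def where(self, pred):
        return LazyFrame(self.rows, self.plan + [("where", pred)])

    def select(self, *cols):
        return LazyFrame(self.rows, self.plan + [("select", cols)])

    def collect(self):
        # Toy "optimizer": fuse consecutive projections, keeping only the
        # last one -- a stand-in for real rewrites like projection pushdown.
        plan = []
        for op in self.plan:
            if op[0] == "select" and plan and plan[-1][0] == "select":
                plan[-1] = op
            else:
                plan.append(op)
        # Only now does anything actually execute.
        rows = self.rows
        for kind, arg in plan:
            if kind == "where":
                rows = [r for r in rows if arg(r)]
            else:  # select
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

df = LazyFrame([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
out = df.where(lambda r: r["a"] > 1).select("a", "b").select("b").collect()
print(out)  # [{'b': 4}]
```

Nothing is computed when `where`/`select` are called; the chain only builds a plan, which is simplified and then run by `collect` — the same shape of behavior the bullets describe, minus all the real engineering.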
2 changes: 1 addition & 1 deletion docs/source/_static/dataframe-comp-table.csv
@@ -1,4 +1,4 @@
-Dataframe,Query Optimizer,Complex Types,Distributed,Arrow Backed,Vectorized Execution Engine,Out-of-core
+Dataframe,Query Optimizer,Multimodal,Distributed,Arrow Backed,Vectorized Execution Engine,Out-of-core
 Daft,✅,✅,✅,✅,✅,✅
 `Pandas <https://github.com/pandas-dev/pandas>`_,❌,Python object,❌,optional >= 2.0,some(Numpy),❌
 `Polars <https://github.com/pola-rs/polars>`_,✅,Python object,❌,✅,✅,✅
4 changes: 2 additions & 2 deletions docs/source/faq/dataframe_comparison.rst
@@ -24,7 +24,7 @@ Pandas/Modin

 The main drawback of using Pandas is scalability. Pandas is single-threaded and not built for distributed computing. While this is not as much of a problem for purely tabular datasets, when dealing with data such as images/video your data can get very large and expensive to compute very quickly.

-Modin is a project that provides "distributed Pandas". If the use-case is tabular, has code that is already written in Pandas but just needs to be scaled up to larger data, Modin may be a good choice. Modin aims to be 100% Pandas API compatible which means that certain operations that are important for performance in the world of complex data such as requesting for certain amount of resources (e.g. GPUs) is not yet possible.
+Modin is a project that provides "distributed Pandas". If the use-case is tabular, has code that is already written in Pandas but just needs to be scaled up to larger data, Modin may be a good choice. Modin aims to be 100% Pandas API compatible, which means that certain operations that are important for performance in the world of multimodal data, such as requesting a certain amount of resources (e.g. GPUs), are not yet possible.

 Spark Dataframes
 ----------------

@@ -42,7 +42,7 @@ Spark excels at large scale tabular analytics, with support for running Python c
 #. Unravel the flattened array again on the other end

 * **Debugging:** Key features such as exposing print statements or breakpoints from user-defined functions to the user are missing, which make PySpark extremely difficult to develop on.
-* **Lack of granular execution control:** with heavy processing of complex data, users often need more control around the execution and scheduling of their work. For example, users may need to ensure that Spark runs a single executor per GPU, but Spark's programming model makes this very difficult.
+* **Lack of granular execution control:** with heavy processing of multimodal data, users often need more control around the execution and scheduling of their work. For example, users may need to ensure that Spark runs a single executor per GPU, but Spark's programming model makes this very difficult.
 * **Compatibility with downstream Machine Learning tasks:** Spark itself is not well suited for performing distributed ML training which is increasingly becoming the domain of frameworks such as Ray and Horovod. Integrating with such a solution is difficult and requires expert tuning of intermediate storage and data engineering solutions.

 Ray Datasets
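The Spark workaround described in the hunk above — flatten a multi-dimensional array to a 1-D array before shipping it through a UDF, then unravel it on the other end — can be sketched in plain Python. The helper names `nested_to_flat` and `flat_to_nested` are hypothetical, purely to show the round-trip the docs are complaining about:

```python
# Sketch of the flatten/unravel dance for passing tensor-like data through
# a system that only understands flat arrays. Hypothetical helpers, not
# part of Spark or Daft.

def nested_to_flat(tensor):
    """Flatten a nested-list 'tensor' into (flat_values, shape)."""
    shape, t = [], tensor
    while isinstance(t, list):       # record the size of each dimension
        shape.append(len(t))
        t = t[0]
    flat = []
    def walk(x):
        if isinstance(x, list):
            for item in x:
                walk(item)
        else:
            flat.append(x)
    walk(tensor)
    return flat, shape

def flat_to_nested(flat, shape):
    """Unravel flat values back into the original nested shape."""
    if len(shape) == 1:
        return list(flat)
    step = len(flat) // shape[0]     # elements per outer slice
    return [flat_to_nested(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]

image = [[1, 2, 3], [4, 5, 6]]       # a tiny 2x3 "image"
flat, shape = nested_to_flat(image)  # ([1, 2, 3, 4, 5, 6], [2, 3])
restored = flat_to_nested(flat, shape)
```

Having to write (and maintain) both halves of this round-trip at every UDF boundary is exactly the boilerplate the docs cite as a drawback.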
9 changes: 8 additions & 1 deletion docs/source/index.rst
@@ -1,7 +1,14 @@
 Daft Documentation
 ==================

-Daft is a **fast and scalable Python dataframe** for complex data and machine learning workloads.
+Daft is a distributed query engine for large-scale data processing in Python and is implemented in Rust.
+
+* **Familiar interactive API:** Lazy Python Dataframe for rapid and interactive iteration
+* **Focus on the what:** Powerful Query Optimizer that rewrites queries to be as efficient as possible
+* **Data Catalog integrations:** Full integration with data catalogs such as Apache Iceberg
+* **Rich multimodal type-system:** Supports multimodal types such as Images, URLs, Tensors and more
+* **Seamless Interchange**: Built on the `Apache Arrow <https://arrow.apache.org/docs/index.html>`_ In-Memory Format
+* **Built for the cloud:** `Record-setting <https://blog.getdaft.io/p/announcing-daft-02-10x-faster-io>`_ I/O performance for integrations with S3 cloud storage

 Installing Daft
 ---------------
12 changes: 6 additions & 6 deletions docs/source/user_guide/basic_concepts/introduction.rst
@@ -1,12 +1,12 @@
 Introduction
 ============

-Daft is a data processing library that has two main classes:
+Daft is a distributed query engine with a DataFrame API. The two key concepts in Daft are:

-1. :class:`DataFrame <daft.DataFrame>`: a DataFrame consisting of rows and columns of data
-2. :class:`Expression <daft.expressions.Expression>`: an expression representing some (delayed) computation to execute on columns of data
+1. :class:`DataFrame <daft.DataFrame>`: a table-like structure that represents rows and columns of data
+2. :class:`Expression <daft.expressions.Expression>`: a symbolic representation of a computation that transforms columns of a DataFrame into new ones

-With Daft, you create :class:`DataFrame <daft.DataFrame>` from a variety of sources (e.g. reading data from files or from Python dictionaries) and use :class:`Expression <daft.expressions.Expression>` to manipulate data in that DataFrame. Let's take a closer look at these two abstractions!
+With Daft, you create a :class:`DataFrame <daft.DataFrame>` from a variety of sources (e.g. reading data from files, data catalogs or Python dictionaries) and use :class:`Expression <daft.expressions.Expression>` to manipulate data in that DataFrame. Let's take a closer look at these two abstractions!

 DataFrame
 ---------

@@ -29,8 +29,8 @@ Using this abstraction of a DataFrame, you can run common tabular operations suc
 Daft DataFrames are:

 1. **Distributed:** your data is split into *Partitions* and can be processed in parallel/on different machines
-2. **Lazy:** computations are enqueued in a query plan, and only executed when requested
-3. **Complex:** columns can contain complex datatypes such as tensors, images and Python objects
+2. **Lazy:** computations are enqueued in a query plan which is then optimized and executed only when requested
+3. **Multimodal:** columns can contain complex datatypes such as tensors, images and Python objects

 Since Daft is lazy, it can actually execute the query plan on a variety of different backends. By default, it will run computations locally using Python multithreading. However if you need to scale to large amounts of data that cannot be processed on a single machine, using the Ray runner allows Daft to run computations on a `Ray <https://www.ray.io/>`_ cluster instead.
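The "Distributed" property in the hunk above — data split into partitions that can be processed in parallel — can be sketched with nothing but the standard library. This is a toy illustration using threads on one machine, not how Daft or Ray actually partition and schedule work; `partition` and `map_partition` are hypothetical names:

```python
# Toy sketch of partitioned, parallel processing: split rows into
# partitions, process each partition independently, recombine.
# Hypothetical helpers -- not Daft's or Ray's real scheduling.
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n):
    """Split rows into at most n roughly equal, contiguous partitions."""
    k = (len(rows) + n - 1) // n
    return [rows[i:i + k] for i in range(0, len(rows), k)]

def map_partition(part):
    # Per-partition work is independent, so each partition could just as
    # well run on a different machine in a real distributed engine.
    return [x * 2 for x in part]

rows = list(range(10))
parts = partition(rows, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(map_partition, parts)   # preserves partition order
out = [x for part in results for x in part]
```

Because `pool.map` returns results in input order, concatenating the processed partitions reproduces the original row order — one reason contiguous partitioning is a convenient default.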
2 changes: 1 addition & 1 deletion docs/source/user_guide/daft_in_depth/datatypes.rst
@@ -70,7 +70,7 @@ See also:
 Nested
 ------

-Nested DataTypes wrap other DataTypes, allowing you to compose types into complex datastructures.
+Nested DataTypes wrap other DataTypes, allowing you to compose types into complex data structures.

 Examples:
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -12,7 +12,7 @@ dependencies = [
 "typing-extensions >= 4.0.0; python_version < '3.10'",
 "pickle5 >= 0.0.12; python_version < '3.8'"
 ]
-description = "A Distributed DataFrame library for large scale complex data processing."
+description = "Distributed Dataframes for Multimodal Data"
 dynamic = ["version"]
 license = {file = "LICENSE"}
 maintainers = [
