Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOCS] Drop use of "Complex Data" in favor of multimodal #1875

Merged
merged 3 commits into from
Feb 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 14 additions & 7 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,18 @@

`Website <https://www.getdaft.io>`_ • `Docs <https://www.getdaft.io/projects/docs/>`_ • `Installation`_ • `10-minute tour of Daft <https://www.getdaft.io/projects/docs/en/latest/learn/10-min.html>`_ • `Community and Support <https://github.com/Eventual-Inc/Daft/discussions>`_

Daft: the distributed Python dataframe for complex data
Daft: Distributed dataframes for multimodal data
=======================================================


`Daft <https://www.getdaft.io>`_ is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads.
`Daft <https://www.getdaft.io>`_ is a distributed query engine for large-scale data processing in Python and is implemented in Rust.

* **Familiar interactive API:** Lazy Python Dataframe for rapid and interactive iteration
* **Focus on the what:** Powerful Query Optimizer that rewrites queries to be as efficient as possible
* **Data Catalog integrations:** Full integration with data catalogs such as Apache Iceberg
* **Rich multimodal type-system:** Supports multimodal types such as Images, URLs, Tensors and more
* **Seamless Interchange**: Built on the `Apache Arrow <https://arrow.apache.org/docs/index.html>`_ In-Memory Format
* **Built for the cloud:** `Record-setting <https://blog.getdaft.io/p/announcing-daft-02-10x-faster-io>`_ I/O performance for integrations with S3 cloud storage

**Table of Contents**

Expand All @@ -21,11 +28,11 @@ Daft: the distributed Python dataframe for complex data
About Daft
----------

The Daft dataframe is a table of data with rows and columns. Columns can contain any Python objects, which allows Daft to support rich complex data types such as images, audio, video and more.
Daft was designed with the following principles in mind:

1. **Any Data**: Beyond the usual strings/numbers/dates, Daft columns can also hold complex multimodal data such as Images, Embeddings and Python objects. Ingestion and basic transformations of complex data is extremely easy and performant in Daft.
2. **Notebook Computing**: Daft is built for the interactive developer experience on a notebook - intelligent caching/query optimizations accelerates your experimentation and data exploration.
3. **Distributed Computing**: Rich complex formats such as images can quickly outgrow your local laptop's computational resources - Daft integrates natively with `Ray <https://www.ray.io>`_ for running dataframes on large clusters of machines with thousands of CPUs/GPUs.
1. **Any Data**: Beyond the usual strings/numbers/dates, Daft columns can also hold complex or nested multimodal data such as Images, Embeddings and Python objects efficiently with it's Arrow based memory representation. Ingestion and basic transformations of multimodal data is extremely easy and performant in Daft.
2. **Interactive Computing**: Daft is built for the interactive developer experience through notebooks or REPLs - intelligent caching/query optimizations accelerates your experimentation and data exploration.
3. **Distributed Computing**: Some workloads can quickly outgrow your local laptop's computational resources - Daft integrates natively with `Ray <https://www.ray.io>`_ for running dataframes on large clusters of machines with thousands of CPUs/GPUs.

Getting Started
---------------
Expand Down Expand Up @@ -101,7 +108,7 @@ Related Projects
----------------

+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| Dataframe | Query Optimizer | Complex Types | Distributed | Arrow Backed | Vectorized Execution Engine | Out-of-core |
| Dataframe | Query Optimizer | Multimodal | Distributed | Arrow Backed | Vectorized Execution Engine | Out-of-core |
+===================================================+=================+===============+=============+=================+=============================+=============+
| Daft | Yes | Yes | Yes | Yes | Yes | Yes |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
Expand Down
2 changes: 1 addition & 1 deletion docs/source/_static/dataframe-comp-table.csv
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
Dataframe,Query Optimizer,Complex Types,Distributed,Arrow Backed,Vectorized Execution Engine,Out-of-core
Dataframe,Query Optimizer,Multimodal,Distributed,Arrow Backed,Vectorized Execution Engine,Out-of-core
Daft,✅,✅,✅,✅,✅,✅
`Pandas <https://github.com/pandas-dev/pandas>`_,❌,Python object,❌,optional >= 2.0,some(Numpy),❌
`Polars <https://github.com/pola-rs/polars>`_,✅,Python object,❌,✅,✅,✅
Expand Down
4 changes: 2 additions & 2 deletions docs/source/faq/dataframe_comparison.rst
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ Pandas/Modin

The main drawback of using Pandas is scalability. Pandas is single-threaded and not built for distributed computing. While this is not as much of a problem for purely tabular datasets, when dealing with data such as images/video your data can get very large and expensive to compute very quickly.

Modin is a project that provides "distributed Pandas". If the use-case is tabular, has code that is already written in Pandas but just needs to be scaled up to larger data, Modin may be a good choice. Modin aims to be 100% Pandas API compatible which means that certain operations that are important for performance in the world of complex data such as requesting for certain amount of resources (e.g. GPUs) is not yet possible.
Modin is a project that provides "distributed Pandas". If the use-case is tabular, has code that is already written in Pandas but just needs to be scaled up to larger data, Modin may be a good choice. Modin aims to be 100% Pandas API compatible which means that certain operations that are important for performance in the world of multimodal data such as requesting for certain amount of resources (e.g. GPUs) is not yet possible.

Spark Dataframes
----------------
Expand All @@ -42,7 +42,7 @@ Spark excels at large scale tabular analytics, with support for running Python c
#. Unravel the flattened array again on the other end

* **Debugging:** Key features such as exposing print statements or breakpoints from user-defined functions to the user are missing, which make PySpark extremely difficult to develop on.
* **Lack of granular execution control:** with heavy processing of complex data, users often need more control around the execution and scheduling of their work. For example, users may need to ensure that Spark runs a single executor per GPU, but Spark's programming model makes this very difficult.
* **Lack of granular execution control:** with heavy processing of multimodal data, users often need more control around the execution and scheduling of their work. For example, users may need to ensure that Spark runs a single executor per GPU, but Spark's programming model makes this very difficult.
* **Compatibility with downstream Machine Learning tasks:** Spark itself is not well suited for performing distributed ML training which is increasingly becoming the domain of frameworks such as Ray and Horovod. Integrating with such a solution is difficult and requires expert tuning of intermediate storage and data engineering solutions.

Ray Datasets
Expand Down
9 changes: 8 additions & 1 deletion docs/source/index.rst
Original file line number Diff line number Diff line change
@@ -1,7 +1,14 @@
Daft Documentation
==================

Daft is a **fast and scalable Python dataframe** for complex data and machine learning workloads.
Daft is a distributed query engine for large-scale data processing in Python and is implemented in Rust.

* **Familiar interactive API:** Lazy Python Dataframe for rapid and interactive iteration
* **Focus on the what:** Powerful Query Optimizer that rewrites queries to be as efficient as possible
* **Data Catalog integrations:** Full integration with data catalogs such as Apache Iceberg
* **Rich multimodal type-system:** Supports multimodal types such as Images, URLs, Tensors and more
* **Seamless Interchange**: Built on the `Apache Arrow <https://arrow.apache.org/docs/index.html>`_ In-Memory Format
* **Built for the cloud:** `Record-setting <https://blog.getdaft.io/p/announcing-daft-02-10x-faster-io>`_ I/O performance for integrations with S3 cloud storage

Installing Daft
---------------
Expand Down
12 changes: 6 additions & 6 deletions docs/source/user_guide/basic_concepts/introduction.rst
Original file line number Diff line number Diff line change
@@ -1,12 +1,12 @@
Introduction
============

Daft is a data processing library that has two main classes:
Daft is a distributed query engine with a DataFrame API. The two key concepts to Daft are:

1. :class:`DataFrame <daft.DataFrame>`: a DataFrame consisting of rows and columns of data
2. :class:`Expression <daft.expressions.Expression>`: an expression representing some (delayed) computation to execute on columns of data
1. :class:`DataFrame <daft.DataFrame>`: a Table-like structure that represents rows and columns of data
2. :class:`Expression <daft.expressions.Expression>`: a symbolic representation of computation that transforms columns of the DataFrame to a new one.

With Daft, you create :class:`DataFrame <daft.DataFrame>` from a variety of sources (e.g. reading data from files or from Python dictionaries) and use :class:`Expression <daft.expressions.Expression>` to manipulate data in that DataFrame. Let's take a closer look at these two abstractions!
With Daft, you create :class:`DataFrame <daft.DataFrame>` from a variety of sources (e.g. reading data from files, data catalogs or from Python dictionaries) and use :class:`Expression <daft.expressions.Expression>` to manipulate data in that DataFrame. Let's take a closer look at these two abstractions!

DataFrame
---------
Expand All @@ -29,8 +29,8 @@ Using this abstraction of a DataFrame, you can run common tabular operations suc
Daft DataFrames are:

1. **Distributed:** your data is split into *Partitions* and can be processed in parallel/on different machines
2. **Lazy:** computations are enqueued in a query plan, and only executed when requested
3. **Complex:** columns can contain complex datatypes such as tensors, images and Python objects
2. **Lazy:** computations are enqueued in a query plan which is then optimized and executed only when requested
3. **Multimodal:** columns can contain complex datatypes such as tensors, images and Python objects

Since Daft is lazy, it can actually execute the query plan on a variety of different backends. By default, it will run computations locally using Python multithreading. However if you need to scale to large amounts of data that cannot be processed on a single machine, using the Ray runner allows Daft to run computations on a `Ray <https://www.ray.io/>`_ cluster instead.

Expand Down
2 changes: 1 addition & 1 deletion docs/source/user_guide/daft_in_depth/datatypes.rst
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ See also:
Nested
------

Nested DataTypes wrap other DataTypes, allowing you to compose types into complex datastructures.
Nested DataTypes wrap other DataTypes, allowing you to compose types into complex data structures.

Examples:

Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ dependencies = [
"typing-extensions >= 4.0.0; python_version < '3.10'",
"pickle5 >= 0.0.12; python_version < '3.8'"
]
description = "A Distributed DataFrame library for large scale complex data processing."
description = "Distributed Dataframes for Multimodal Data"
dynamic = ["version"]
license = {file = "LICENSE"}
maintainers = [
Expand Down
Loading