
Commit

fix whitespace
ccmao1130 committed Dec 19, 2024
1 parent 9c67ef6 commit 187f06e
Showing 23 changed files with 91 additions and 103 deletions.
3 changes: 0 additions & 3 deletions docs-v2/advanced/distributed.md
@@ -4,8 +4,6 @@ By default, Daft runs using your local machine's resources and your operations a

However, Daft has strong integrations with [Ray](https://www.ray.io) which is a distributed computing framework for distributing computations across a cluster of machines. Here is a snippet showing how you can connect Daft to a Ray cluster:

<!-- :material-language-python: -->

=== "🐍 Python"

```python
@@ -72,4 +70,3 @@ You can take the IP address and port and pass it to Daft:

(Showing first 2 of 2 rows)
```
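
A minimal sketch of that flow, assuming an existing Ray cluster and Daft's `daft.context.set_runner_ray` (the address below is a placeholder):

```python
import daft
from daft import context

# Point Daft's runner at an existing Ray cluster (placeholder address).
context.set_runner_ray(address="ray://127.0.0.1:10001")

# Subsequent DataFrame work now runs on the Ray cluster.
df = daft.from_pydict({"x": [1, 2, 3]})
df.show()
```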

14 changes: 7 additions & 7 deletions docs-v2/core_concepts.md
@@ -48,7 +48,7 @@ Let's create our first Dataframe from a Python dictionary of columns.
"C": [True, True, False, False],
"D": [None, None, None, None],
})
    ```

Examine your Dataframe by printing it:

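Pieced together, a runnable version of the snippet above might look like this (the values for columns "A" and "B" are assumed, since only the tail of the dictionary is shown):

```python
import daft

df = daft.from_pydict({
    "A": [1, 2, 3, 4],          # assumed values
    "B": [1.5, 2.5, 3.5, 4.5],  # assumed values
    "C": [True, True, False, False],
    "D": [None, None, None, None],
})

print(df)  # prints the schema and a preview of the rows
```
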
@@ -261,7 +261,7 @@ Notice also that when we printed our DataFrame, Daft displayed its **schema**. E
Daft can display your DataFrame's schema without materializing it. Under the hood, it performs intelligent sampling of your data to determine the appropriate schema, and if you make any modifications to your DataFrame it can infer the resulting types based on the operation.

!!! note "Note"

Under the hood, Daft represents data in the [Apache Arrow](https://arrow.apache.org/) format, which allows it to efficiently represent and work on data using high-performance kernels which are written in Rust.

### Running Computation with Expressions
@@ -299,7 +299,7 @@ The following statement will [`df.show()`](https://www.getdaft.io/projects/docs/
(Showing first 4 of 4 rows)
```

!!! info "Info"

A common pattern is to create a new column using [`DataFrame.with_column`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.with_column.html):

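A short sketch of that pattern, reusing the column names from the example above:

```python
# Derive a new column "E" from the existing columns "A" and "B".
df = df.with_column("E", df["A"] + df["B"])
df.show()
```
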
@@ -1545,7 +1545,7 @@ Writing data will execute your DataFrame and write the results out to the specif

!!! note "Note"

Because Daft is a distributed DataFrame library, by default it will produce multiple files (one per partition) at your specified destination. Writing your dataframe is a **blocking** operation that executes your DataFrame. It will return a new `DataFrame` that contains the filepaths to the written data.
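
For instance, a minimal write call might look like this (the output path is a placeholder):

```python
# Writing is blocking: it executes the DataFrame and returns the written file paths.
written_df = df.write_parquet("output/my_table/")
written_df.show()  # one row per file written
```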

## DataTypes

@@ -1556,7 +1556,7 @@ All elements of a column are of the same dtype, or they can be the special Null
Daft provides simple DataTypes that are ubiquitous in many DataFrames such as numbers, strings and dates - all the way up to more complex types like tensors and images.

!!! tip "Tip"

For a full overview on all the DataTypes that Daft supports, see the [DataType API Reference](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html).


@@ -1709,7 +1709,7 @@ natively integrate with the rest of your Daft query.
df = daft.read_parquet("s3://...")
daft.sql("SELECT * FROM df")
```

We appreciate your patience with us and hope to deliver this crucial feature soon!

### SQL Expressions
@@ -2308,7 +2308,7 @@ Let’s turn the bytes into human-readable images using [`image.decode()`](https

<div class="grid cards" markdown>

<!-- - [**Coming from Spark**] -->
<!-- - [:simple-apachespark: **Coming from Spark**](migration/spark_migration.md) -->
- [:simple-dask: **Coming from Dask**](migration/dask_migration.md)

</div>
8 changes: 4 additions & 4 deletions docs-v2/core_concepts/dataframe.md
@@ -23,7 +23,7 @@ Common data operations that you would perform on DataFrames are:
4. [**Sorting:**](dataframe.md#reordering-rows) Use [`df.sort(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.sort.html#daft.DataFrame.sort) to arrange your data based on values in one or more columns.
5. **Grouping and aggregating:** Use [`df.groupby(...).agg(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.groupby.html#daft.DataFrame.groupby) to summarize your data by groups.

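A brief sketch of the last two operations above (the column names are illustrative):

```python
import daft

df = daft.from_pydict({"group": ["a", "a", "b"], "value": [3, 1, 2]})

sorted_df = df.sort(df["value"], desc=True)           # reorder rows by "value"
agg_df = df.groupby("group").agg(df["value"].sum())   # sum "value" within each group
agg_df.show()
```
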
## Creating a Dataframe
## Creating a DataFrame

!!! tip "See Also"

@@ -42,7 +42,7 @@ Let's create our first Dataframe from a Python dictionary of columns.
"C": [True, True, False, False],
"D": [None, None, None, None],
})
    ```

Examine your Dataframe by printing it:

@@ -255,7 +255,7 @@ Notice also that when we printed our DataFrame, Daft displayed its **schema**. E
Daft can display your DataFrame's schema without materializing it. Under the hood, it performs intelligent sampling of your data to determine the appropriate schema, and if you make any modifications to your DataFrame it can infer the resulting types based on the operation.

!!! note "Note"

Under the hood, Daft represents data in the [Apache Arrow](https://arrow.apache.org/) format, which allows it to efficiently represent and work on data using high-performance kernels which are written in Rust.

## Running Computation with Expressions
@@ -293,7 +293,7 @@ The following statement will [`df.show()`](https://www.getdaft.io/projects/docs/
(Showing first 4 of 4 rows)
```

!!! info "Info"

A common pattern is to create a new column using [`DataFrame.with_column`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.with_column.html):

4 changes: 2 additions & 2 deletions docs-v2/core_concepts/datatypes.md
@@ -7,7 +7,7 @@ All elements of a column are of the same dtype, or they can be the special Null
Daft provides simple DataTypes that are ubiquitous in many DataFrames such as numbers, strings and dates - all the way up to more complex types like tensors and images.

!!! tip "Tip"

For a full overview on all the DataTypes that Daft supports, see the [DataType API Reference](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html).

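As a quick illustration, a sketch of applying an explicit dtype (the column and values are placeholders):

```python
import daft
from daft import DataType

df = daft.from_pydict({"scores": [1, 2, 3]})
df = df.with_column("scores_f32", df["scores"].cast(DataType.float32()))
print(df.schema())  # scores: Int64, scores_f32: Float32
```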

@@ -93,4 +93,4 @@ Daft abstracts away the in-memory representation of your data and provides kerne

For more complex algorithms, you can also drop into a Python UDF to process this data using your custom Python libraries.

Please add suggestions for new DataTypes to our Github Discussions page!
Please add suggestions for new DataTypes to our [Github Discussions page](https://github.com/Eventual-Inc/Daft/discussions)!
7 changes: 2 additions & 5 deletions docs-v2/core_concepts/read_write.md
@@ -8,9 +8,7 @@ Daft can read data from a variety of sources, and write data to many destination

### From Files

DataFrames can be loaded from file(s) on some filesystem, commonly your local filesystem or a remote cloud object store such as AWS S3.

Additionally, Daft can read data from a variety of container file formats, including CSV, line-delimited JSON and Parquet.
DataFrames can be loaded from file(s) on some filesystem, commonly your local filesystem or a remote cloud object store such as AWS S3. Additionally, Daft can read data from a variety of container file formats, including CSV, line-delimited JSON and Parquet.

Daft supports file paths to a single file, a directory of files, and wildcards. It also supports paths to remote object storage such as AWS S3.
=== "🐍 Python"
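
    ```python
    # A sketch (not the collapsed original snippet): reading a single file, a directory, and a glob.
    import daft

    df_file = daft.read_parquet("path/to/file.parquet")
    df_dir = daft.read_csv("path/to/directory/")
    df_glob = daft.read_parquet("s3://my_bucket/my_path/**/*.parquet")
    ```
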
@@ -141,5 +139,4 @@ Writing data will execute your DataFrame and write the results out to the specif

!!! note "Note"

Because Daft is a distributed DataFrame library, by default it will produce multiple files (one per partition) at your specified destination. Writing your dataframe is a **blocking** operation that executes your DataFrame. It will return a new `DataFrame` that contains the filepaths to the written data.

5 changes: 2 additions & 3 deletions docs-v2/core_concepts/sql.md
@@ -41,8 +41,7 @@ Daft's [`daft.sql`](https://www.getdaft.io/projects/docs/en/stable/api_docs/sql.
(Showing first 3 of 3 rows)
```

In the above example, we query the DataFrame called `"my_special_df"` by simply referring to it in the SQL command. This produces a new DataFrame `sql_df` which can
natively integrate with the rest of your Daft query.
In the above example, we query the DataFrame called `"my_special_df"` by simply referring to it in the SQL command. This produces a new DataFrame `sql_df` which can natively integrate with the rest of your Daft query.
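
A compact sketch of that pattern (the data here is illustrative):

```python
import daft

my_special_df = daft.from_pydict({"city": ["NYC", "SF"], "population": [8300000, 800000]})

# Daft resolves "my_special_df" to the in-scope DataFrame of the same name.
sql_df = daft.sql("SELECT city FROM my_special_df WHERE population > 1000000")
sql_df.show()
```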

## Reading data from SQL

@@ -65,7 +64,7 @@ natively integrate with the rest of your Daft query.
df = daft.read_parquet("s3://...")
daft.sql("SELECT * FROM df")
```

We appreciate your patience with us and hope to deliver this crucial feature soon!

## SQL Expressions
2 changes: 0 additions & 2 deletions docs-v2/core_concepts/udf.md
@@ -174,7 +174,6 @@ Running Class UDFs are exactly the same as running their functional cousins.
```

## Resource Requests
-----------------

Sometimes, you may want to request for specific resources for your UDF. For example, some UDFs need one GPU to run as they will load a model onto the GPU.

@@ -212,4 +211,3 @@ UDFs can also be parametrized with new resource requests after being initialized
RunModelWithTwoGPUs(df["images"]),
)
```
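
A sketch of requesting resources at definition time, assuming the `num_gpus` argument on the `@udf` decorator (the model logic is a placeholder):

```python
import daft
from daft import udf

@udf(return_dtype=daft.DataType.int64(), num_gpus=1)
def run_model(images):
    # Load the model onto the GPU and score each image (placeholder logic).
    return [0] * len(images.to_pylist())

df = df.with_column("predictions", run_model(df["images"]))
```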

4 changes: 2 additions & 2 deletions docs-v2/index.md
@@ -53,7 +53,7 @@ This user guide aims to help Daft users master the usage of Daft for all your da
1. [10 minute Quickstart](https://www.getdaft.io/projects/docs/en/stable/10-min.html): Itching to run some Daft code? Hit the ground running with our 10 minute quickstart notebook.

2. [API Documentation](https://www.getdaft.io/projects/docs/en/stable/api_docs/index.html): Searchable documentation and reference material to Daft’s public API.

### Get Started

<div class="grid cards" markdown>
@@ -131,7 +131,7 @@ This user guide aims to help Daft users master the usage of Daft for all your da

## Contribute to Daft

If you're interested in hands-on learning about Daft internals and would like to contribute to Daft, join us [on Github](https://github.com/Eventual-Inc/Daft) 🚀
If you're interested in hands-on learning about Daft internals and would like to contribute to our project, join us [on Github](https://github.com/Eventual-Inc/Daft) 🚀

Take a look at the many issues tagged with `good first issue` in our repo. If there are any that interest you, feel free to chime in on the issue itself or join us in our [Distributed Data Slack Community](https://join.slack.com/t/dist-data/shared_invite/zt-2e77olvxw-uyZcPPV1SRchhi8ah6ZCtg) and send us a message in #daft-dev. Daft team members will be happy to assign any issue to you and provide any guidance if needed!

7 changes: 2 additions & 5 deletions docs-v2/integrations/aws.md
@@ -24,8 +24,7 @@ If instead you wish to have Daft use credentials from the "driver", you may wish

You may also choose to pass these values into your Daft I/O function calls using an [`daft.io.S3Config`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.S3Config.html#daft.io.S3Config) config object.

<!-- add SQL S3Config https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/sql_funcs/daft.sql._sql_funcs.S3Config.html -->

!!! failure "todo(docs): add SQL S3Config https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/sql_funcs/daft.sql._sql_funcs.S3Config.html"

[`daft.set_planning_config`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/configuration_functions/daft.set_planning_config.html#daft.set_planning_config) is a convenient way to set your [`daft.io.IOConfig`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.IOConfig.html#daft.io.IOConfig) as the default config to use on any subsequent Daft method calls.

@@ -44,13 +43,11 @@ You may also choose to pass these values into your Daft I/O function calls using
df = daft.read_parquet("s3://my_bucket/my_path/**/*")
```

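A sketch of building the config described above and registering it as the default (the region and credential values are placeholders):

```python
import daft
from daft.io import IOConfig, S3Config

io_config = IOConfig(
    s3=S3Config(
        region_name="us-west-2",
        key_id="YOUR_ACCESS_KEY_ID",
        access_key="YOUR_SECRET_ACCESS_KEY",
    )
)

# Use this IOConfig as the default for subsequent Daft calls.
daft.set_planning_config(default_io_config=io_config)

df = daft.read_parquet("s3://my_bucket/my_path/**/*")
```
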
Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the `io_config=` keyword argument. This is extremely flexible as you can
pass a different [`daft.io.S3Config`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.S3Config.html#daft.io.S3Config) per function call if you wish!
Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the `io_config=` keyword argument. This is extremely flexible as you can pass a different [`daft.io.S3Config`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.S3Config.html#daft.io.S3Config) per function call if you wish!

=== "🐍 Python"

```python
# Perform some I/O operation but override the IOConfig
df2 = daft.read_csv("s3://my_bucket/my_other_path/**/*", io_config=io_config)
```

4 changes: 1 addition & 3 deletions docs-v2/integrations/azure.md
@@ -49,8 +49,7 @@ You may also choose to pass these values into your Daft I/O function calls using
df = daft.read_parquet("az://my_container/my_path/**/*")
```

Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the `io_config=` keyword argument. This is extremely flexible as you can
pass a different [`daft.io.AzureConfig`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.AzureConfig.html#daft.io.AzureConfig) per function call if you wish!
Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the `io_config=` keyword argument. This is extremely flexible as you can pass a different [`daft.io.AzureConfig`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.AzureConfig.html#daft.io.AzureConfig) per function call if you wish!

=== "🐍 Python"

@@ -79,4 +78,3 @@ If you are connecting to storage in OneLake or another Microsoft Fabric service,

df = daft.read_deltalake('abfss://[WORKSPACE]@onelake.dfs.fabric.microsoft.com/[LAKEHOUSE].Lakehouse/Tables/[TABLE]', io_config=io_config)
```

5 changes: 3 additions & 2 deletions docs-v2/integrations/delta_lake.md
@@ -106,7 +106,7 @@ When reading from a Delta Lake table into Daft:
| `date` | [`daft.DataType.date()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.date) |
| `timestamp` | [`daft.DataType.timestamp(timeunit="us", timezone=None)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.timestamp) |
| `timestampz`| [`daft.DataType.timestamp(timeunit="us", timezone="UTC")`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.timestamp) |
| `string` | [`daft.DataType.string()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.string) |
| `binary` | [`daft.DataType.binary()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.binary) |
| **Nested Types** |
| `struct(fields)` | [`daft.DataType.struct(fields)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.struct) |
@@ -122,6 +122,7 @@ Here are Delta Lake features that are on our roadmap. Please let us know if you
2. Read support for [column mappings](https://docs.delta.io/latest/delta-column-mapping.html) ([issue](https://github.com/Eventual-Inc/Daft/issues/1955)).

3. Writing new Delta Lake tables ([issue](https://github.com/Eventual-Inc/Daft/issues/1967)).
<!-- ^ this needs an update, issue has been closed -->

!!! failure "todo(docs): ^ this needs to be updated, issue is already closed"

4. Writing back to an existing table with appends, overwrites, upserts, or deletes ([issue](https://github.com/Eventual-Inc/Daft/issues/1968)).
4 changes: 2 additions & 2 deletions docs-v2/integrations/hudi.md
@@ -20,7 +20,7 @@ pip install -U "getdaft[hudi]"

## Reading a Table

To read from an Apache Hudi table, use the [`daft.read_hudi`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_hudi.html#daft.read_hudi) function. The following is an example snippet of loading an example table
To read from an Apache Hudi table, use the [`daft.read_hudi`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_hudi.html#daft.read_hudi) function. The following is an example snippet of loading an example table:

=== "🐍 Python"

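    ```python
    # A sketch (not the collapsed original snippet): load a Hudi table from a placeholder path.
    import daft

    df = daft.read_hudi("s3://bucket/path/to/hudi_table")
    df.show()
    ```
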
@@ -53,7 +53,7 @@ When reading from a Hudi table into Daft:
| `date` | [`daft.DataType.date()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.date) |
| `timestamp` | [`daft.DataType.timestamp(timeunit="us", timezone=None)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.timestamp) |
| `timestampz`| [`daft.DataType.timestamp(timeunit="us", timezone="UTC")`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.timestamp) |
| `string` | [`daft.DataType.string()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.string) |
| `binary` | [`daft.DataType.binary()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.binary) |
| **Nested Types** |
| `struct(fields)` | [`daft.DataType.struct(fields)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.struct) |
5 changes: 2 additions & 3 deletions docs-v2/integrations/huggingface.md
@@ -2,10 +2,10 @@

Daft is able to read datasets directly from Hugging Face via the `hf://datasets/` protocol.

Since Hugging Face will [automatically convert](https://huggingface.co/docs/dataset-viewer/en/parquet) all public datasets to parquet format, we can read these datasets using the [`read_parquet`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_parquet.html) method.
Since Hugging Face will [automatically convert](https://huggingface.co/docs/dataset-viewer/en/parquet) all public datasets to parquet format, we can read these datasets using the [`daft.read_parquet()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_parquet.html) method.

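For example, a sketch of reading a public dataset (the dataset id is a placeholder):

```python
import daft

df = daft.read_parquet("hf://datasets/username/dataset_name")
df.show()
```
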
!!! warning "Warning"

This is limited to either public datasets or PRO/ENTERPRISE datasets.

For other file formats, you will need to manually specify the path or glob pattern to the files you want to read, similar to how you would read from a local file system.
@@ -67,4 +67,3 @@ to get around this, you can read all files using a glob pattern *(assuming they
```python
df = daft.read_parquet("hf://datasets/username/my_private_dataset/**/*.parquet", io_config=io_config) # Works
```
