Commit

fix end of line
ccmao1130 committed Dec 19, 2024
1 parent: be595a0 · commit: 54bea1d
Showing 19 changed files with 23 additions and 20 deletions.
3 changes: 2 additions & 1 deletion docs-v2/advanced/distributed.md
@@ -71,4 +71,5 @@ You can take the IP address and port and pass it to Daft:
╰───────╯

(Showing first 2 of 2 rows)
-```
+```
+

2 changes: 1 addition & 1 deletion docs-v2/advanced/memory.md
@@ -63,4 +63,4 @@ There are some options available to you.

5. Increase the number of partitions in your dataframe (hence making each partition smaller) using something like: `df.into_partitions(df.num_partitions() * 2)`

-If your workload continues to experience OOM issues, perhaps Daft could be better estimating the required memory to run certain steps in your workload. Please contact Daft developers on our forums!
+If your workload continues to experience OOM issues, perhaps Daft could be better estimating the required memory to run certain steps in your workload. Please contact Daft developers on our forums!

2 changes: 1 addition & 1 deletion docs-v2/advanced/partitioning.md
@@ -110,4 +110,4 @@ Note that many of these methods will change both the *number of partitions* as w
| Estimated Scan Bytes = 72000000
| Clustering spec = { Num partitions = 3 }
| ...
-```
+```

2 changes: 1 addition & 1 deletion docs-v2/core_concepts.md
@@ -2332,4 +2332,4 @@ Let’s turn the bytes into human-readable images using [`image.decode()`](https
- [:fontawesome-solid-equals: **Partitioning**](advanced/partitioning.md)
- [:material-distribute-vertical-center: **Distributed Computing**](advanced/distributed.md)

-</div>
+</div>

2 changes: 1 addition & 1 deletion docs-v2/install.md
@@ -44,4 +44,4 @@ pip install -U getdaft --pre --extra-index-url https://pypi.anaconda.org/daft-ni
pip install -U https://github.com/Eventual-Inc/Daft/archive/refs/heads/main.zip
```

-Please note that Daft requires the Rust toolchain in order to build from source.
+Please note that Daft requires the Rust toolchain in order to build from source.

3 changes: 2 additions & 1 deletion docs-v2/integrations/aws.md
@@ -52,4 +52,5 @@ pass a different [`daft.io.S3Config`](https://www.getdaft.io/projects/docs/en/st
```python
# Perform some I/O operation but override the IOConfig
df2 = daft.read_csv("s3://my_bucket/my_other_path/**/*", io_config=io_config)
-```
+```
+

3 changes: 2 additions & 1 deletion docs-v2/integrations/azure.md
@@ -78,4 +78,5 @@ If you are connecting to storage in OneLake or another Microsoft Fabric service,
)

df = daft.read_deltalake('abfss://[WORKSPACE]@onelake.dfs.fabric.microsoft.com/[LAKEHOUSE].Lakehouse/Tables/[TABLE]', io_config=io_config)
-```
+```
+

2 changes: 1 addition & 1 deletion docs-v2/integrations/delta_lake.md
@@ -124,4 +124,4 @@ Here are Delta Lake features that are on our roadmap. Please let us know if you
3. Writing new Delta Lake tables ([issue](https://github.com/Eventual-Inc/Daft/issues/1967)).
<!-- ^ this needs an update, issue has been closed -->

-4. Writing back to an existing table with appends, overwrites, upserts, or deletes ([issue](https://github.com/Eventual-Inc/Daft/issues/1968)).
+4. Writing back to an existing table with appends, overwrites, upserts, or deletes ([issue](https://github.com/Eventual-Inc/Daft/issues/1968)).

2 changes: 1 addition & 1 deletion docs-v2/integrations/hudi.md
@@ -73,4 +73,4 @@ Support for more Hudi features are tracked as below:
1. Support incremental query for Copy-on-Write tables [issue](https://github.com/Eventual-Inc/Daft/issues/2153)).
2. Read support for 1.0 table format ([issue](https://github.com/Eventual-Inc/Daft/issues/2152)).
3. Read support (snapshot) for Merge-on-Read tables ([issue](https://github.com/Eventual-Inc/Daft/issues/2154)).
-4. Write support ([issue](https://github.com/Eventual-Inc/Daft/issues/2155)).
+4. Write support ([issue](https://github.com/Eventual-Inc/Daft/issues/2155)).

3 changes: 2 additions & 1 deletion docs-v2/integrations/huggingface.md
@@ -66,4 +66,5 @@ to get around this, you can read all files using a glob pattern *(assuming they

```python
df = daft.read_parquet("hf://datasets/username/my_private_dataset/**/*.parquet", io_config=io_config) # Works
-```
+```
+

2 changes: 1 addition & 1 deletion docs-v2/integrations/iceberg.md
@@ -107,4 +107,4 @@ Here are some features of Iceberg that are works-in-progress:
2. More extensive usage of Iceberg-provided statistics to further optimize queries
3. Copy-on-write and merge-on-read writes

-A more detailed Iceberg roadmap for Daft can be found on [our Github Issues page](https://github.com/Eventual-Inc/Daft/issues/2458).
+A more detailed Iceberg roadmap for Daft can be found on [our Github Issues page](https://github.com/Eventual-Inc/Daft/issues/2458).

2 changes: 1 addition & 1 deletion docs-v2/integrations/ray.md
@@ -87,4 +87,4 @@ ray job submit \

The runtime env parameter specifies that Daft should be installed on the Ray workers. Alternative methods of including Daft in the worker dependencies can be found [here](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html).

-For more information about Ray jobs, see [Ray docs -> Ray Jobs Overview](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html).
+For more information about Ray jobs, see [Ray docs -> Ray Jobs Overview](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html).

2 changes: 1 addition & 1 deletion docs-v2/integrations/sql.md
@@ -157,4 +157,4 @@ You could modify the SQL query to add the filters and projections yourself, but
Here are the SQL features that are on our roadmap. Please let us know if you would like to see support for any of these features!

1. Write support into SQL databases.
-2. Reads via [ADBC (Arrow Database Connectivity)](https://arrow.apache.org/docs/format/ADBC.html).
+2. Reads via [ADBC (Arrow Database Connectivity)](https://arrow.apache.org/docs/format/ADBC.html).

2 changes: 1 addition & 1 deletion docs-v2/integrations/unity_catalog.md
@@ -66,4 +66,4 @@ See also [Delta Lake](delta_lake.md) for more information about how to work with

2. Unity Iceberg integration for reading tables using the Iceberg interface instead of the Delta Lake interface

-Please make issues on the [Daft repository](https://github.com/Eventual-Inc/Daft) if you have any use-cases that Daft does not currently cover!
+Please make issues on the [Daft repository](https://github.com/Eventual-Inc/Daft) if you have any use-cases that Daft does not currently cover!

2 changes: 1 addition & 1 deletion docs-v2/migration/dask_migration.md
@@ -129,4 +129,4 @@ Daft provides a [`read_sql()`](https://www.getdaft.io/projects/docs/en/stable/ap

## Daft combines Python with Rust and Pyarrow for optimal performance

-Daft combines Python with Rust and Pyarrow for optimal performance (see [Benchmarks](../resources/benchmarks/tpch.md)). Under the hood, Table and Series are implemented in Rust on top of the Apache Arrow specification (using the Rust arrow2 library). This architecture means that all the computationally expensive operations on Table and Series are performed in Rust, and can be heavily optimized for raw speed. Python is most useful as a user-facing API layer for ease of use and an interactive data science user experience (see [Architecture](../resources/architecture.md)).
+Daft combines Python with Rust and Pyarrow for optimal performance (see [Benchmarks](../resources/benchmarks/tpch.md)). Under the hood, Table and Series are implemented in Rust on top of the Apache Arrow specification (using the Rust arrow2 library). This architecture means that all the computationally expensive operations on Table and Series are performed in Rust, and can be heavily optimized for raw speed. Python is most useful as a user-facing API layer for ease of use and an interactive data science user experience (see [Architecture](../resources/architecture.md)).

1 change: 0 additions & 1 deletion docs-v2/resources/architecture.md
@@ -90,4 +90,3 @@ Each Partition of a DataFrame is represented as a Table object, which is in turn
Under the hood, Table and Series are implemented in Rust on top of the Apache Arrow specification (using the Rust arrow2 library). We expose Python API bindings for Table using PyO3, which allows our PhysicalPlan to define operations that should be run on each Table.

This architecture means that all the computationally expensive operations on Table and Series are performed in Rust, and can be heavily optimized for raw speed. Python is most useful as a user-facing API layer for ease of use and an interactive data science user experience.
-

2 changes: 1 addition & 1 deletion docs-v2/resources/benchmarks/tpch.md
@@ -150,4 +150,4 @@ For benchmarking Spark we used AWS EMR, the official managed Spark solution prov
| Dask (failed, multiple retries)| 1000 | 4 | 1. s3://daft-public-data/benchmarking/logs/dask.2023_5_0.1tb.4-i32xlarge.q126.log |
| Dask (multiple retries) | 100 | 4 | 1. s3://daft-public-data/benchmarking/logs/dask.2023_5_0.100gb.4-i32xlarge.0.log <br> 2. s3://daft-public-data/benchmarking/logs/dask.2023_5_0.100gb.4-i32xlarge.0.log <br> 3. s3://daft-public-data/benchmarking/logs/dask.2023_5_0.100gb.4-i32xlarge.1.log |
| Modin (failed, multiple retries) | 1000 | 16 | 1. s3://daft-public-data/benchmarking/logs/modin.0_20_1.1tb.16-i32xlarge.0.log <br> 2. s3://daft-public-data/benchmarking/logs/modin.0_20_1.1tb.16-i32xlarge.1.log |
-| Modin (failed, multiple retries) | 100 | 4 | 1. s3://daft-public-data/benchmarking/logs/modin.0_20_1.100gb.4-i32xlarge.log |
+| Modin (failed, multiple retries) | 100 | 4 | 1. s3://daft-public-data/benchmarking/logs/modin.0_20_1.100gb.4-i32xlarge.log |

2 changes: 1 addition & 1 deletion docs-v2/resources/dataframe_comparison.md
@@ -73,4 +73,4 @@ Ray Datasets make it easy to feed data really efficiently into Ray's model train

However, Ray Datasets are not a fully-fledged Dataframe abstraction (and [it is explicit in not being an ETL framework for data science](https://docs.ray.io/en/latest/data/overview.html#ray-data-overview)) which means that it lacks key features in data querying, visualization and aggregations.

-Instead, Ray Data is a perfect destination for processed data from DaFt Dataframes to be sent to with a simple [`df.to_ray_dataset()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.to_ray_dataset.html#daft.DataFrame.to_ray_dataset) call. This is useful as an entrypoint into your model training and inference ecosystem!
+Instead, Ray Data is a perfect destination for processed data from DaFt Dataframes to be sent to with a simple [`df.to_ray_dataset()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.to_ray_dataset.html#daft.DataFrame.to_ray_dataset) call. This is useful as an entrypoint into your model training and inference ecosystem!

4 changes: 2 additions & 2 deletions docs-v2/resources/telemetry.md
@@ -14,9 +14,9 @@ We **do not** sell or buy any of the data that is collected in telemetry.

## What data do we collect?

-To audit what data is collected, please see the implementation of ``AnalyticsClient`` in the ``daft.analytics`` module.
+To audit what data is collected, please see the implementation of `AnalyticsClient` in the `daft.analytics` module.

In short, we collect the following:

1. On import, we track system information such as the runner being used, version of Daft, OS, Python version, etc.
-2. On calls of public methods on the DataFrame object, we track metadata about the execution: the name of the method, the walltime for execution and the class of error raised (if any). Function parameters and stacktraces are not logged, ensuring that user data remains private.
+2. On calls of public methods on the DataFrame object, we track metadata about the execution: the name of the method, the walltime for execution and the class of error raised (if any). Function parameters and stacktraces are not logged, ensuring that user data remains private.
