
Commit

fix whitespace
ccmao1130 committed Dec 19, 2024
1 parent 9c67ef6 commit 187f06e
Showing 23 changed files with 91 additions and 103 deletions.
3 changes: 0 additions & 3 deletions docs-v2/advanced/distributed.md
@@ -4,8 +4,6 @@ By default, Daft runs using your local machine's resources and your operations a

However, Daft has strong integrations with [Ray](https://www.ray.io) which is a distributed computing framework for distributing computations across a cluster of machines. Here is a snippet showing how you can connect Daft to a Ray cluster:

<!-- :material-language-python: -->

=== "🐍 Python"

```python
@@ -72,4 +70,3 @@ You can take the IP address and port and pass it to Daft:

(Showing first 2 of 2 rows)
```
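
A minimal sketch of that flow, assuming an existing Ray cluster and Daft's `daft.context.set_runner_ray` (the address below is a placeholder):

```python
import daft
from daft import context

# Point Daft's runner at an existing Ray cluster (placeholder address).
context.set_runner_ray(address="ray://127.0.0.1:10001")

# Subsequent DataFrame work now runs on the Ray cluster.
df = daft.from_pydict({"x": [1, 2, 3]})
df.show()
```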

14 changes: 7 additions & 7 deletions docs-v2/core_concepts.md
@@ -48,7 +48,7 @@ Let's create our first Dataframe from a Python dictionary of columns.
"C": [True, True, False, False],
"D": [None, None, None, None],
})
    ```

Examine your Dataframe by printing it:

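Pieced together, a runnable version of the snippet above might look like this (the values for columns "A" and "B" are assumed, since only the tail of the dictionary is shown):

```python
import daft

df = daft.from_pydict({
    "A": [1, 2, 3, 4],          # assumed values
    "B": [1.5, 2.5, 3.5, 4.5],  # assumed values
    "C": [True, True, False, False],
    "D": [None, None, None, None],
})

print(df)  # prints the schema and a preview of the rows
```
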
@@ -261,7 +261,7 @@ Notice also that when we printed our DataFrame, Daft displayed its **schema**. E
Daft can display your DataFrame's schema without materializing it. Under the hood, it performs intelligent sampling of your data to determine the appropriate schema, and if you make any modifications to your DataFrame it can infer the resulting types based on the operation.

!!! note "Note"

Under the hood, Daft represents data in the [Apache Arrow](https://arrow.apache.org/) format, which allows it to efficiently represent and work on data using high-performance kernels which are written in Rust.

### Running Computation with Expressions
@@ -299,7 +299,7 @@ The following statement will [`df.show()`](https://www.getdaft.io/projects/docs/
(Showing first 4 of 4 rows)
```

!!! info "Info"

A common pattern is to create a new column using [`DataFrame.with_column`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.with_column.html):

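A short sketch of that pattern, reusing the column names from the example above:

```python
# Derive a new column "E" from the existing columns "A" and "B".
df = df.with_column("E", df["A"] + df["B"])
df.show()
```
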
@@ -1545,7 +1545,7 @@ Writing data will execute your DataFrame and write the results out to the specif

!!! note "Note"

Because Daft is a distributed DataFrame library, by default it will produce multiple files (one per partition) at your specified destination. Writing your dataframe is a **blocking** operation that executes your DataFrame. It will return a new `DataFrame` that contains the filepaths to the written data.
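
For instance, a minimal write call might look like this (the output path is a placeholder):

```python
# Writing is blocking: it executes the DataFrame and returns the written file paths.
written_df = df.write_parquet("output/my_table/")
written_df.show()  # one row per file written
```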

## DataTypes

@@ -1556,7 +1556,7 @@ All elements of a column are of the same dtype, or they can be the special Null
Daft provides simple DataTypes that are ubiquitous in many DataFrames such as numbers, strings and dates - all the way up to more complex types like tensors and images.

!!! tip "Tip"

For a full overview on all the DataTypes that Daft supports, see the [DataType API Reference](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html).


@@ -1709,7 +1709,7 @@ natively integrate with the rest of your Daft query.
df = daft.read_parquet("s3://...")
daft.sql("SELECT * FROM df")
```

We appreciate your patience with us and hope to deliver this crucial feature soon!

### SQL Expressions
@@ -2308,7 +2308,7 @@ Let’s turn the bytes into human-readable images using [`image.decode()`](https

<div class="grid cards" markdown>

<!-- - [**Coming from Spark**] -->
<!-- - [:simple-apachespark: **Coming from Spark**](migration/spark_migration.md) -->
- [:simple-dask: **Coming from Dask**](migration/dask_migration.md)

</div>
8 changes: 4 additions & 4 deletions docs-v2/core_concepts/dataframe.md
@@ -23,7 +23,7 @@ Common data operations that you would perform on DataFrames are:
4. [**Sorting:**](dataframe.md#reordering-rows) Use [`df.sort(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.sort.html#daft.DataFrame.sort) to arrange your data based on values in one or more columns.
5. **Grouping and aggregating:** Use [`df.groupby(...).agg(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.groupby.html#daft.DataFrame.groupby) to summarize your data by groups.

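A brief sketch of the last two operations above (the column names are illustrative):

```python
import daft

df = daft.from_pydict({"group": ["a", "a", "b"], "value": [3, 1, 2]})

sorted_df = df.sort(df["value"], desc=True)           # reorder rows by "value"
agg_df = df.groupby("group").agg(df["value"].sum())   # sum "value" within each group
agg_df.show()
```
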
## Creating a Dataframe
## Creating a DataFrame

!!! tip "See Also"

@@ -42,7 +42,7 @@ Let's create our first Dataframe from a Python dictionary of columns.
"C": [True, True, False, False],
"D": [None, None, None, None],
})
    ```

Examine your Dataframe by printing it:

@@ -255,7 +255,7 @@ Notice also that when we printed our DataFrame, Daft displayed its **schema**. E
Daft can display your DataFrame's schema without materializing it. Under the hood, it performs intelligent sampling of your data to determine the appropriate schema, and if you make any modifications to your DataFrame it can infer the resulting types based on the operation.

!!! note "Note"

Under the hood, Daft represents data in the [Apache Arrow](https://arrow.apache.org/) format, which allows it to efficiently represent and work on data using high-performance kernels which are written in Rust.

## Running Computation with Expressions
@@ -293,7 +293,7 @@ The following statement will [`df.show()`](https://www.getdaft.io/projects/docs/
(Showing first 4 of 4 rows)
```

!!! info "Info"

A common pattern is to create a new column using [`DataFrame.with_column`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.with_column.html):

4 changes: 2 additions & 2 deletions docs-v2/core_concepts/datatypes.md
@@ -7,7 +7,7 @@ All elements of a column are of the same dtype, or they can be the special Null
Daft provides simple DataTypes that are ubiquitous in many DataFrames such as numbers, strings and dates - all the way up to more complex types like tensors and images.

!!! tip "Tip"

For a full overview on all the DataTypes that Daft supports, see the [DataType API Reference](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html).

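As a quick illustration, a sketch of applying an explicit dtype (the column and values are placeholders):

```python
import daft
from daft import DataType

df = daft.from_pydict({"scores": [1, 2, 3]})
df = df.with_column("scores_f32", df["scores"].cast(DataType.float32()))
print(df.schema())  # scores: Int64, scores_f32: Float32
```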

@@ -93,4 +93,4 @@ Daft abstracts away the in-memory representation of your data and provides kerne

For more complex algorithms, you can also drop into a Python UDF to process this data using your custom Python libraries.

Please add suggestions for new DataTypes to our Github Discussions page!
Please add suggestions for new DataTypes to our [Github Discussions page](https://github.com/Eventual-Inc/Daft/discussions)!
7 changes: 2 additions & 5 deletions docs-v2/core_concepts/read_write.md
@@ -8,9 +8,7 @@ Daft can read data from a variety of sources, and write data to many destination

### From Files

DataFrames can be loaded from file(s) on some filesystem, commonly your local filesystem or a remote cloud object store such as AWS S3.

Additionally, Daft can read data from a variety of container file formats, including CSV, line-delimited JSON and Parquet.
DataFrames can be loaded from file(s) on some filesystem, commonly your local filesystem or a remote cloud object store such as AWS S3. Additionally, Daft can read data from a variety of container file formats, including CSV, line-delimited JSON and Parquet.

Daft supports file paths to a single file, a directory of files, and wildcards. It also supports paths to remote object storage such as AWS S3.
=== "🐍 Python"
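
    ```python
    # A sketch (not the collapsed original snippet): reading a single file, a directory, and a glob.
    import daft

    df_file = daft.read_parquet("path/to/file.parquet")
    df_dir = daft.read_csv("path/to/directory/")
    df_glob = daft.read_parquet("s3://my_bucket/my_path/**/*.parquet")
    ```
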
@@ -141,5 +139,4 @@ Writing data will execute your DataFrame and write the results out to the specif

!!! note "Note"

Because Daft is a distributed DataFrame library, by default it will produce multiple files (one per partition) at your specified destination. Writing your dataframe is a **blocking** operation that executes your DataFrame. It will return a new `DataFrame` that contains the filepaths to the written data.

5 changes: 2 additions & 3 deletions docs-v2/core_concepts/sql.md
@@ -41,8 +41,7 @@ Daft's [`daft.sql`](https://www.getdaft.io/projects/docs/en/stable/api_docs/sql.
(Showing first 3 of 3 rows)
```

In the above example, we query the DataFrame called `"my_special_df"` by simply referring to it in the SQL command. This produces a new DataFrame `sql_df` which can
natively integrate with the rest of your Daft query.
In the above example, we query the DataFrame called `"my_special_df"` by simply referring to it in the SQL command. This produces a new DataFrame `sql_df` which can natively integrate with the rest of your Daft query.
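
A compact sketch of that pattern (the data here is illustrative):

```python
import daft

my_special_df = daft.from_pydict({"city": ["NYC", "SF"], "population": [8300000, 800000]})

# Daft resolves "my_special_df" to the in-scope DataFrame of the same name.
sql_df = daft.sql("SELECT city FROM my_special_df WHERE population > 1000000")
sql_df.show()
```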

## Reading data from SQL

@@ -65,7 +64,7 @@ natively integrate with the rest of your Daft query.
df = daft.read_parquet("s3://...")
daft.sql("SELECT * FROM df")
```

We appreciate your patience with us and hope to deliver this crucial feature soon!

## SQL Expressions
2 changes: 0 additions & 2 deletions docs-v2/core_concepts/udf.md
@@ -174,7 +174,6 @@ Running Class UDFs are exactly the same as running their functional cousins.
```

## Resource Requests
-----------------

Sometimes, you may want to request for specific resources for your UDF. For example, some UDFs need one GPU to run as they will load a model onto the GPU.

@@ -212,4 +211,3 @@ UDFs can also be parametrized with new resource requests after being initialized
RunModelWithTwoGPUs(df["images"]),
)
```
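
A sketch of requesting resources at definition time, assuming the `num_gpus` argument on the `@udf` decorator (the model logic is a placeholder):

```python
import daft
from daft import udf

@udf(return_dtype=daft.DataType.int64(), num_gpus=1)
def run_model(images):
    # Load the model onto the GPU and score each image (placeholder logic).
    return [0] * len(images.to_pylist())

df = df.with_column("predictions", run_model(df["images"]))
```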

4 changes: 2 additions & 2 deletions docs-v2/index.md
@@ -53,7 +53,7 @@ This user guide aims to help Daft users master the usage of Daft for all your da
1. [10 minute Quickstart](https://www.getdaft.io/projects/docs/en/stable/10-min.html): Itching to run some Daft code? Hit the ground running with our 10 minute quickstart notebook.

2. [API Documentation](https://www.getdaft.io/projects/docs/en/stable/api_docs/index.html): Searchable documentation and reference material to Daft’s public API.

### Get Started

<div class="grid cards" markdown>
@@ -131,7 +131,7 @@ This user guide aims to help Daft users master the usage of Daft for all your da

## Contribute to Daft

If you're interested in hands-on learning about Daft internals and would like to contribute to Daft, join us [on Github](https://github.com/Eventual-Inc/Daft) 🚀
If you're interested in hands-on learning about Daft internals and would like to contribute to our project, join us [on Github](https://github.com/Eventual-Inc/Daft) 🚀

Take a look at the many issues tagged with `good first issue` in our repo. If there are any that interest you, feel free to chime in on the issue itself or join us in our [Distributed Data Slack Community](https://join.slack.com/t/dist-data/shared_invite/zt-2e77olvxw-uyZcPPV1SRchhi8ah6ZCtg) and send us a message in #daft-dev. Daft team members will be happy to assign any issue to you and provide any guidance if needed!

7 changes: 2 additions & 5 deletions docs-v2/integrations/aws.md
@@ -24,8 +24,7 @@ If instead you wish to have Daft use credentials from the "driver", you may wish

You may also choose to pass these values into your Daft I/O function calls using an [`daft.io.S3Config`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.S3Config.html#daft.io.S3Config) config object.

<!-- add SQL S3Config https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/sql_funcs/daft.sql._sql_funcs.S3Config.html -->

!!! failure "todo(docs): add SQL S3Config https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/sql_funcs/daft.sql._sql_funcs.S3Config.html"

[`daft.set_planning_config`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/configuration_functions/daft.set_planning_config.html#daft.set_planning_config) is a convenient way to set your [`daft.io.IOConfig`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.IOConfig.html#daft.io.IOConfig) as the default config to use on any subsequent Daft method calls.

@@ -44,13 +43,11 @@ You may also choose to pass these values into your Daft I/O function calls using
df = daft.read_parquet("s3://my_bucket/my_path/**/*")
```

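A sketch of building the config described above and registering it as the default (the region and credential values are placeholders):

```python
import daft
from daft.io import IOConfig, S3Config

io_config = IOConfig(
    s3=S3Config(
        region_name="us-west-2",
        key_id="YOUR_ACCESS_KEY_ID",
        access_key="YOUR_SECRET_ACCESS_KEY",
    )
)

# Use this IOConfig as the default for subsequent Daft calls.
daft.set_planning_config(default_io_config=io_config)

df = daft.read_parquet("s3://my_bucket/my_path/**/*")
```
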
Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the `io_config=` keyword argument. This is extremely flexible as you can
pass a different [`daft.io.S3Config`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.S3Config.html#daft.io.S3Config) per function call if you wish!
Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the `io_config=` keyword argument. This is extremely flexible as you can pass a different [`daft.io.S3Config`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.S3Config.html#daft.io.S3Config) per function call if you wish!

=== "🐍 Python"

```python
# Perform some I/O operation but override the IOConfig
df2 = daft.read_csv("s3://my_bucket/my_other_path/**/*", io_config=io_config)
```

4 changes: 1 addition & 3 deletions docs-v2/integrations/azure.md
@@ -49,8 +49,7 @@ You may also choose to pass these values into your Daft I/O function calls using
df = daft.read_parquet("az://my_container/my_path/**/*")
```

Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the `io_config=` keyword argument. This is extremely flexible as you can
pass a different [`daft.io.AzureConfig`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.AzureConfig.html#daft.io.AzureConfig) per function call if you wish!
Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the `io_config=` keyword argument. This is extremely flexible as you can pass a different [`daft.io.AzureConfig`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.AzureConfig.html#daft.io.AzureConfig) per function call if you wish!

=== "🐍 Python"

@@ -79,4 +78,3 @@ If you are connecting to storage in OneLake or another Microsoft Fabric service,

df = daft.read_deltalake('abfss://[WORKSPACE]@onelake.dfs.fabric.microsoft.com/[LAKEHOUSE].Lakehouse/Tables/[TABLE]', io_config=io_config)
```

5 changes: 3 additions & 2 deletions docs-v2/integrations/delta_lake.md
@@ -106,7 +106,7 @@ When reading from a Delta Lake table into Daft:
| `date` | [`daft.DataType.date()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.date) |
| `timestamp` | [`daft.DataType.timestamp(timeunit="us", timezone=None)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.timestamp) |
| `timestampz`| [`daft.DataType.timestamp(timeunit="us", timezone="UTC")`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.timestamp) |
| `string` | [`daft.DataType.string()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.string) |
| `binary` | [`daft.DataType.binary()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.binary) |
| **Nested Types** |
| `struct(fields)` | [`daft.DataType.struct(fields)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.struct) |
@@ -122,6 +122,7 @@ Here are Delta Lake features that are on our roadmap. Please let us know if you
2. Read support for [column mappings](https://docs.delta.io/latest/delta-column-mapping.html) ([issue](https://github.com/Eventual-Inc/Daft/issues/1955)).

3. Writing new Delta Lake tables ([issue](https://github.com/Eventual-Inc/Daft/issues/1967)).
<!-- ^ this needs an update, issue has been closed -->

!!! failure "todo(docs): ^ this needs to be updated, issue is already closed"

4. Writing back to an existing table with appends, overwrites, upserts, or deletes ([issue](https://github.com/Eventual-Inc/Daft/issues/1968)).
4 changes: 2 additions & 2 deletions docs-v2/integrations/hudi.md
@@ -20,7 +20,7 @@ pip install -U "getdaft[hudi]"

## Reading a Table

To read from an Apache Hudi table, use the [`daft.read_hudi`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_hudi.html#daft.read_hudi) function. The following is an example snippet of loading an example table
To read from an Apache Hudi table, use the [`daft.read_hudi`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_hudi.html#daft.read_hudi) function. The following is an example snippet of loading an example table:

=== "🐍 Python"

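    ```python
    # A sketch (not the collapsed original snippet): load a Hudi table from a placeholder path.
    import daft

    df = daft.read_hudi("s3://bucket/path/to/hudi_table")
    df.show()
    ```
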
@@ -53,7 +53,7 @@ When reading from a Hudi table into Daft:
| `date` | [`daft.DataType.date()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.date) |
| `timestamp` | [`daft.DataType.timestamp(timeunit="us", timezone=None)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.timestamp) |
| `timestampz`| [`daft.DataType.timestamp(timeunit="us", timezone="UTC")`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.timestamp) |
| `string` | [`daft.DataType.string()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.string) |
| `binary` | [`daft.DataType.binary()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.binary) |
| **Nested Types** |
| `struct(fields)` | [`daft.DataType.struct(fields)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.struct) |
5 changes: 2 additions & 3 deletions docs-v2/integrations/huggingface.md
@@ -2,10 +2,10 @@

Daft is able to read datasets directly from Hugging Face via the `hf://datasets/` protocol.

Since Hugging Face will [automatically convert](https://huggingface.co/docs/dataset-viewer/en/parquet) all public datasets to parquet format, we can read these datasets using the [`read_parquet`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_parquet.html) method.
Since Hugging Face will [automatically convert](https://huggingface.co/docs/dataset-viewer/en/parquet) all public datasets to parquet format, we can read these datasets using the [`daft.read_parquet()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_parquet.html) method.

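For example, a sketch of reading a public dataset (the dataset id is a placeholder):

```python
import daft

df = daft.read_parquet("hf://datasets/username/dataset_name")
df.show()
```
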
!!! warning "Warning"

This is limited to either public datasets or PRO/ENTERPRISE datasets.

For other file formats, you will need to manually specify the path or glob pattern to the files you want to read, similar to how you would read from a local file system.
@@ -67,4 +67,3 @@ to get around this, you can read all files using a glob pattern *(assuming they
```python
df = daft.read_parquet("hf://datasets/username/my_private_dataset/**/*.parquet", io_config=io_config) # Works
```
