[DOCS] Add feature-of-the-week IOConfigs #1754

Closed
wants to merge 1 commit
2 changes: 1 addition & 1 deletion daft/expressions/expressions.py
@@ -425,7 +425,7 @@ def download(
) -> Expression:
"""Treats each string as a URL, and downloads the bytes contents as a bytes column

- ..NOTE::
+ .. NOTE::
If you are observing excessive S3 issues (such as timeouts, DNS errors or slowdown errors) during URL downloads,
you may wish to reduce the value of ``max_connections`` (defaults to 32) to reduce the amount of load you are placing
on your S3 servers.
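For context on the knob this docstring references: ``max_connections`` is a per-call argument on the same ``url.download()`` expression. A minimal sketch of tuning it down, assuming a hypothetical DataFrame with a column of URLs:

    import daft

    # Hypothetical data: a single column of URLs to fetch
    df = daft.from_pydict({"urls": ["https://example.com/file.bin"]})

    # Lower max_connections from its default of 32 to reduce load on the storage service
    df = df.with_column("contents", df["urls"].url.download(max_connections=8))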
10 changes: 0 additions & 10 deletions docs/source/api_docs/input_output.rst
@@ -8,16 +8,6 @@ Input/Output
Configuration
-------------

- .. NOTE::
-     Daft is currently building out its own native code for reading/writing data. These configuration objects allow
-     users to control behavior when Daft runs native code, but otherwise will have no effect.
-
-     These configurations are currently used in:
-
-     1. :func:`daft.read_parquet`: controls behavior when reading DataFrames Parquet files using the native downloader
-     2. :func:`Expression.url.download() <daft.expressions.expressions.ExpressionUrlNamespace.download>`: controls behavior when downloading bytes from URLs using the native downloader
-     3. :func:`Table.read_parquet <daft.table.table.Table.read_parquet>`: (Advanced usecase!) controls behavior when reading a Daft Table from a Parquet file

.. autosummary::
:nosignatures:
:toctree: doc_gen/io_configs
134 changes: 134 additions & 0 deletions tutorials/feature_of_the_week/io_config.ipynb
@@ -0,0 +1,134 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install getdaft"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature of the Week: Input/Output Configurations (IOConfig)\n",
"\n",
"`IOConfig` is Daft's mechanism for controlling the behavior of data input/output from storage. It is useful for:\n",
"\n",
"1. **Providing credentials** for authenticating with cloud storage services\n",
"2. **Tuning performance** or reducing load on storage services\n",
"\n",
"For a deeper look at `IOConfig`, see: [IOConfig Documentation](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_configs/daft.io.IOConfig.html?highlight=IOConfig)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Default IOConfig Behavior\n",
"\n",
"The default behavior for IOConfig is to automatically detect credentials on your machines."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import daft\n",
"\n",
"# By default, calls to AWS S3 will use credentials retrieved from the machine(s) that they are called from\n",
"#\n",
"# For AWS S3 services, the default mechanism is to look through a chain of possible \"providers\":\n",
"# https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#configuring-credentials\n",
"df = daft.read_csv(\"s3://daft-public-data/file.csv\")\n",
"df.collect()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overriding the IOConfig\n",
"### Setting a Global Override\n",
"\n",
"Oftentimes you may want Daft to just use a certain configuration by default whenever it has to access storage such as S3, GCS or Azure Blob Store.\n",
"\n",
"> **Example:**\n",
">\n",
"> An extremely common use-case is to create a set of temporary credentials once, and share that across all calls to data access happening in Daft.\n",
">\n",
"> The example below demonstrates this with AWS S3's `boto3` Python SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use the boto3 library to generate temporary credentials which can be used for S3 access\n",
"import boto3\n",
"session = boto3.session.Session()\n",
"creds = session.get_credentials()\n",
"\n",
"# Attach temporary credentials to a Daft IOConfig object\n",
"MY_IO_CONFIG = daft.io.IOConfig(\n",
" s3=daft.io.S3Config(\n",
" key_id=creds.access_key,\n",
" access_key=creds.secret_key,\n",
" session_token=creds.token,\n",
" )\n",
")\n",
"\n",
"# Set the default config to `MY_IO_CONFIG` so that it is used in the absence of any overrides\n",
"daft.set_planning_config(default_io_config=MY_IO_CONFIG)"
]
},
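{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once a global default is set, it is picked up automatically by any call that does not pass its own `io_config`. A minimal sketch, assuming a hypothetical private bucket that `MY_IO_CONFIG`'s credentials can access:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This read uses `MY_IO_CONFIG` implicitly, since it was set as the default above\n",
"# NOTE: the bucket path below is a hypothetical placeholder -- substitute one your credentials can access\n",
"df = daft.read_csv(\"s3://my-private-bucket/my-data.csv\")"
]
},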
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overriding IOConfigs per-API call\n",
"\n",
"Daft also allows for more granular per-call overrides through the use of keyword arguments.\n",
"\n",
"This is extremely flexible, allowing you to use a different set of credentials to read from two different locations!\n",
"\n",
"Here we use `daft.read_csv` as an example, but the same `io_config=...` keyword arg also exists for other I/O related functionality such as:\n",
"\n",
"1. `daft.read_parquet`\n",
"2. `daft.read_json`\n",
"3. `Expression.url.download()`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# An \"Anonymous\" IOConfig will access storage **without credentials**, and can only access fully public data\n",
"MY_ANONYMOUS_IO_CONFIG = daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True))\n",
"\n",
"# Read this file using `MY_ANONYMOUS_IO_CONFIG` instead of the overridden global config `MY_IO_CONFIG`\n",
"df1 = daft.read_csv(\"s3://daft-public-data/melbourne-airbnb/melbourne_airbnb.csv\", io_config=MY_ANONYMOUS_IO_CONFIG)"
]
},
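{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the \"two locations, two credentials\" pattern described above: the first read below uses the anonymous config for the public dataset, while the second uses the credentialed global config for a hypothetical private path."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Anonymous config for fully public data\n",
"df_public = daft.read_csv(\"s3://daft-public-data/melbourne-airbnb/melbourne_airbnb.csv\", io_config=MY_ANONYMOUS_IO_CONFIG)\n",
"\n",
"# Credentialed config for private data\n",
"# NOTE: hypothetical placeholder path -- substitute a location your credentials can access\n",
"df_private = daft.read_csv(\"s3://my-private-bucket/my-data.csv\", io_config=MY_IO_CONFIG)"
]
}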
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}