From ed8daafef64141fb13bd0edd0ef1bd859daab317 Mon Sep 17 00:00:00 2001
From: Colin
Date: Fri, 9 Feb 2024 12:11:34 +0800
Subject: [PATCH] [DOCS] Add docs for AWS S3 IO (#1855)

Integration docs on how to use Daft with AWS S3. Thanks @jaychia for paving the way with your azure docs.
---
 docs/source/user_guide/integrations.rst     |  1 +
 docs/source/user_guide/integrations/aws.rst | 53 +++++++++++++++++++++
 2 files changed, 54 insertions(+)
 create mode 100644 docs/source/user_guide/integrations/aws.rst

diff --git a/docs/source/user_guide/integrations.rst b/docs/source/user_guide/integrations.rst
index 027063f582..4a565cfa22 100644
--- a/docs/source/user_guide/integrations.rst
+++ b/docs/source/user_guide/integrations.rst
@@ -5,3 +5,4 @@ Integrations
 
     integrations/iceberg
     integrations/microsoft-azure
+    integrations/aws
diff --git a/docs/source/user_guide/integrations/aws.rst b/docs/source/user_guide/integrations/aws.rst
new file mode 100644
index 0000000000..3620045bf6
--- /dev/null
+++ b/docs/source/user_guide/integrations/aws.rst
@@ -0,0 +1,53 @@
+Amazon Web Services
+===================
+
+Daft can read data from and write data to AWS S3, and natively understands the ``s3://`` URL protocol as referring to data that resides
+in S3.
+
+Authorization/Authentication
+----------------------------
+
+In AWS S3, data is stored under the hierarchy of:
+
+1. Bucket: The container for data storage, which is the top-level namespace for data storage in S3.
+2. Object Key: The unique identifier for a piece of data within a bucket.
+
+URLs to data in S3 come in the form: ``s3://{BUCKET}/{OBJECT_KEY}``.
+
+Rely on Environment
+*******************
+
+You can configure the AWS `CLI `_ to have Daft automatically discover credentials.
+Alternatively, you may specify your credentials in `environment variables `_: ``AWS_ACCESS_KEY_ID``, ``AWS_SECRET_ACCESS_KEY``, and ``AWS_SESSION_TOKEN``.
+
+Note that in a distributed environment such as Ray, Daft picks these credentials up on the worker machines, so each worker machine must be provisioned with them.
+
+If you instead want Daft to use credentials from the "driver", specify your credentials manually.
+
+Manually specify credentials
+****************************
+
+You may also choose to pass these values into your Daft I/O function calls using a :class:`daft.io.S3Config` config object.
+
+:func:`daft.set_planning_config` is a convenient way to set your :class:`daft.io.IOConfig` as the default config to use on any subsequent Daft method calls.
+
+.. code:: python
+
+    import daft
+    from daft.io import IOConfig, S3Config
+    # Supply actual values for these
+    io_config = IOConfig(s3=S3Config(key_id="key_id", session_token="session_token", secret_key="secret_key"))
+
+    # Globally set the default IOConfig for any subsequent I/O calls
+    daft.set_planning_config(default_io_config=io_config)
+
+    # Perform some I/O operation
+    df = daft.read_parquet("s3://my_bucket/my_path/**/*")
+
+Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the ``io_config=`` keyword argument. This is flexible: you can
+pass a different :class:`daft.io.S3Config` to each function call if you wish.
+
+.. code:: python
+
+    # Perform some I/O operation but override the IOConfig
+    df2 = daft.read_csv("s3://my_bucket/my_other_path/**/*", io_config=io_config)