Improve docs for Redshift parallel read
mosabua committed Jan 2, 2025
1 parent c06f1e1 commit 59ac395
Showing 1 changed file with 38 additions and 37 deletions.
75 changes: 38 additions & 37 deletions docs/src/main/sphinx/connector/redshift.md
@@ -64,43 +64,6 @@ documentation](https://docs.aws.amazon.com/redshift/latest/mgmt/jdbc20-configura
```{include} jdbc-authentication.fragment
```

### UNLOAD configuration

This feature enables using Amazon S3 to efficiently transfer data out of Redshift
instead of the default single threaded JDBC based implementation.
The connector automatically triggers the appropriate `UNLOAD` command
on Redshift to extract the output from Redshift to the configured
S3 bucket in the form of Parquet files. These Parquet files are read in parallel
from S3 to improve latency of reading from Redshift tables. The Parquet
files will be removed when Trino finishes executing the query. It is recommended
to define a custom life cycle policy on the S3 bucket used for unloading the
Redshift query results.
This feature is supported only when the Redshift cluster and the configured S3
bucket are in the same AWS region.

The following table describes configuration properties for using
`UNLOAD` command in Redshift connector. `redshift.unload-location` must be set
to use `UNLOAD`.

:::{list-table} UNLOAD configuration properties
:widths: 30, 60
:header-rows: 1

* - Property value
- Description
* - `redshift.unload-location`
- A writeable location in Amazon S3, to be used for temporarily unloading
Redshift query results.
* - `redshift.unload-iam-role`
- Optional. Fully specified ARN of the IAM Role attached to the Redshift cluster.
Provided role will be used in `UNLOAD` command. IAM role must have access to
Redshift cluster and write access to S3 bucket. The default IAM role attached to
Redshift cluster is used when this property is not configured.
:::

Additionally, define appropriate [S3 configurations](/object-storage/file-system-s3)
except `fs.native-s3.enabled`, required to read Parquet files from S3 bucket.

### Multiple Redshift databases or clusters

The Redshift connector can only access a single database within
@@ -255,3 +218,41 @@ FROM

```{include} query-table-function-ordering.fragment
```

## Performance

The connector includes a number of performance improvements, detailed in the
following sections.

### Parallel read via S3

The connector supports the Redshift `UNLOAD` command to transfer data to Parquet
files on S3. Trino can then read the data in parallel, instead of using the
connector's default single-threaded, JDBC-based connection to Redshift.

To enable parallel read, set `redshift.unload-location` to the required S3
location. The Redshift cluster and the configured S3 bucket must use the same
AWS region. Parquet files are removed automatically when the query completes.
Additionally, define a custom life cycle policy on the S3 bucket.

:::{list-table} Parallel read configuration properties
:widths: 30, 60
:header-rows: 1

* - Property value
- Description
* - `redshift.unload-location`
- A writeable location in Amazon S3 in the same AWS region as the Redshift
cluster. Used for temporary storage during query processing using the
`UNLOAD` command from Redshift.
* - `redshift.unload-iam-role`
- Optional. Fully specified ARN of the IAM Role attached to the Redshift
cluster to use for the `UNLOAD` command. The role must have read access to
the Redshift cluster and write access to the S3 bucket. If not set, the
default IAM role attached to the Redshift cluster is used.

:::

Additionally, define the [S3 configuration](/object-storage/file-system-s3)
properties required to read the Parquet files from the S3 bucket, except
`fs.native-s3.enabled`.
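
As a minimal sketch, a catalog properties file with parallel read enabled might
look like the following. The host, credentials, bucket name, prefix, role ARN,
and region are hypothetical placeholders, and the choice of `s3.region` as the
only S3 property is an assumption; the S3 properties you need depend on how
your deployment authenticates to S3.

```properties
connector.name=redshift
connection-url=jdbc:redshift://example.net:5439/database
connection-user=root
connection-password=secret

# Enables parallel read. Hypothetical bucket and prefix; the bucket must be
# writeable and in the same AWS region as the Redshift cluster.
redshift.unload-location=s3://example-unload-bucket/trino-unload/

# Optional. Hypothetical ARN; omit to use the default IAM role attached to
# the Redshift cluster.
redshift.unload-iam-role=arn:aws:iam::123456789012:role/example-redshift-unload

# S3 file system configuration used to read the Parquet files.
# fs.native-s3.enabled is intentionally not set here.
s3.region=us-east-1
```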