Skip to content

Latest commit

 

History

History
43 lines (33 loc) · 1.38 KB

README.md

File metadata and controls

43 lines (33 loc) · 1.38 KB

Downloading large amount of data from Redshift using RPostgreSQL::dbGetQuery can be pretty slow.

This utility allows to quickly download the result of a Redshift SELECT query and create a data frame from it.

Instead of fetching n rows at a time from the database, it:

  • Unloads the result of the SELECT query to compressed csv files on S3
  • Downloads the csv files locally using the aws cli
  • Reads the csv files and creates a data frame from it

Requirements

  • aws cli installed and configured
  • The following R packages: install.packages('uuid', 'RPostgreSQL')
  • A S3 bucket
  • A AWS role associated to the Redshift cluster with write permission on the bucket
  • The AWS access key and secret of a user with read permission on the bucket

Usage

library(RPostgreSQL)
source("db.R")

drv <- dbDriver("PostgreSQL")
con <- dbConnect(
  drv,
  dbname   = "db_name",
  host     = "host",
  port     = "port",
  user     = "user",
  password = "password"
)

sql.query         <- "SELECT * FROM my_table"
s3.bucket         <- "bucket_name"
redshift.iam.role <- "arn:aws:iam::123456:role/RoleName"
aws.access.key    <- "AWS_ACCESS_KEY_ID"
aws.secret.key    <- "AWS_SECRET_ACCESS_KEY"

df <- ReadSqlResults(con, sql.query, aws.access.key, aws.secret.key, s3.bucket, redshift.iam.role)
head(df)