All data is stored under schemas in the projects database and is accessible by the following programs:
To establish a connection to Redshift in DBeaver, first double-click on the server you wish to connect to. In the example below, I'm connecting to Redshift11_projects. A window will then appear asking for your Username and Password. Your username is your user folder name, prefixed with adrf\. Then click OK. You will now have access to your data stored on the Redshift11_projects server.
When users create tables in their PR (Research Project) or TR (Training Project) schema, the table is initially permissioned to the user only. This is analogous to creating a document or file in your U drive: Only you have access to the newly created table.
If you want to allow all individuals in your project workspace to access the table in the PR/TR schema, you will need to grant permission to the table to the rest of the users who have access to the PR or TR schema.
You can do this by running the following code:
GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE schema_name.table_name TO GROUP db_xxxxxx_rw;
Note: In the above code example, replace schema_name with the pr_ or tr_ schema assigned to your workspace, and replace table_name with the name of the table on which you want to grant access. Also, in the group name db_xxxxxx_rw, replace xxxxxx with your project code: the last six characters of your project-based user name, which will start with either a T or a P.
If you want to allow only a single user on your project to access the table, you will need to grant permission to that user. You can do this by running the following code:
GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE schema_name.table_name TO "IAM:first_name.last_name.project_code";
Note: In the above code example, replace schema_name with the pr_ or tr_ schema assigned to your workspace, and replace table_name with the name of the table on which you want to grant access. Also, in "IAM:first_name.last_name.project_code", update first_name.last_name.project_code with the user name of the person to whom you want to grant access.
In the code examples below, the default DSN is Redshift01_projects_DSN.
proc sql;
connect to odbc as mycon
(datasrc=Redshift01_projects_DSN user=adrf\user.name.project password=password);
select * from connection to mycon
(select * from projects.schema.table);
disconnect from mycon;
quit;
Best practices for loading large amounts of data in R
To ensure R can efficiently manage large amounts of data, please add the following lines of code to your R script before any packages are loaded:

options(java.parameters = c("-XX:+UseConcMarkSweepGC", "-Xmx8192m"))
gc()

Redshift R database connectivity changes
When connecting to a Redshift database using R (whether in RStudio or a Jupyter Notebook), we strongly recommend using a JDBC-based connection rather than an ODBC-based connection. Other than how you connect to the database, the rest of your code should remain the same.
Best practices for writing tables to Redshift
When writing tables to Redshift from R, whether from a SQL query or an R data frame, please use the following lines of R code.
To write a table from R to Redshift using a SQL INTO statement, use the function dbSendUpdate():

dbSendUpdate(conn, "SELECT col_1 INTO schema_name.table_name FROM schema_name.old_table_name")
When writing an R data frame to Redshift, use the following code as an example:

qry <- "set search_path to schema_name"
dbSendUpdate(conn, qry)

DBI::dbWriteTable(conn = conn,                     # name of the connection
                  name = "schema_name.table_name", # name of the table to save the data frame to
                  value = df_name,                 # name of the data frame to write to Redshift
                  overwrite = TRUE)                # TRUE to overwrite an existing table, otherwise FALSE

qry <- "GRANT SELECT ON TABLE schema_name.table_name TO group <group_name>;"
dbSendUpdate(conn, qry)

Note: replace schema_name.table_name with the schema and name of the table you wish to write, df_name with the name of the data frame you wish to write to Redshift, and <group_name> with your project's group.
The table below is for connecting to the Redshift11 database:
|  | ODBC Based Connection | JDBC Based Connection (Recommended) |
|---|---|---|
| Libraries | library(odbc) | library(RJDBC) |
| User ID and Password |  | dbusr=Sys.getenv("DBUSER") dbpswd=Sys.getenv("DBPASSWD") |
| Connection | con <- dbConnect(odbc(), "Redshift11_projects_DSN", uid = "adrf\\John.doe.p00002", pwd = 'xxxxxxxxxxxxxx') | conn <- dbConnect(driver, url, dbusr, dbpswd) |
library(RJDBC)

dbusr=Sys.getenv("DBUSER")
dbpswd=Sys.getenv("DBPASSWD")

# Database URL
url <- paste0("jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;",
              "loginToRp=urn:amazon:webservices:govcloud;",
              "ssl=true;",
              "AutoCreate=true;",
              "idp_host=adfs.adrf.net;",
              "idp_port=443;",
              "ssl_insecure=true;",
              "plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider")

# Redshift JDBC Driver Setting
driver <- JDBC("com.amazon.redshift.jdbc42.Driver",
               classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
               identifier.quote = "`")

conn <- dbConnect(driver, url, dbusr, dbpswd)
For the above code to work, please create a file named .Renviron in your user folder (your user folder is something like U:\John.doe.p00002). The .Renviron file should contain the following:

DBUSER='adrf\John.doe.p00002'
DBPASSWD='xxxxxxxxxxxx'

PLEASE replace the user ID and password with your project-workspace-specific user ID and password. This ensures your ID and password are not stored in your R code, so you can easily share your R code with others without sharing your credentials.
The table below is for connecting to the Redshift01 database:
|  | ODBC Based Connection | JDBC Based Connection (Recommended) |
|---|---|---|
| Libraries | library(odbc) | library(RJDBC) |
| User ID and Password |  | dbusr=Sys.getenv("DBUSER") dbpswd=Sys.getenv("DBPASSWD") |
| Connection | con <- dbConnect(odbc(), "Redshift01_projects_DSN", uid = "adrf\\John.doe.p00002", pwd = 'xxxxxxxxxxxxxx') | conn <- dbConnect(driver, url, dbusr, dbpswd) |
library(RJDBC)

dbusr=Sys.getenv("DBUSER")
dbpswd=Sys.getenv("DBPASSWD")

# Database URL
url <- paste0("jdbc:redshift:iam://adrf-redshift01.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;",
              "loginToRp=urn:amazon:webservices:govcloud;",
              "ssl=true;",
              "AutoCreate=true;",
              "idp_host=adfs.adrf.net;",
              "idp_port=443;",
              "ssl_insecure=true;",
              "plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider")

# Redshift JDBC Driver Setting
driver <- JDBC("com.amazon.redshift.jdbc42.Driver",
               classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
               identifier.quote = "`")

conn <- dbConnect(driver, url, dbusr, dbpswd)
For the above code to work, please create a file named .Renviron in your user folder (your user folder is something like U:\John.doe.p00002). The .Renviron file should contain the following:

DBUSER='adrf\John.doe.p00002'
DBPASSWD='xxxxxxxxxxxx'

PLEASE replace the user ID and password with your project-workspace-specific user ID and password. This ensures your ID and password are not stored in your R code, so you can easily share your R code with others without sharing your credentials.
Python:

import pyodbc
import pandas as pd

cnxn = pyodbc.connect('DSN=Redshift01_projects_DSN; UID=adrf\\user.name.project; PWD=password')
df = pd.read_sql("SELECT * FROM projects.schema_name.table_name", cnxn)

Stata:

odbc load, exec("select * from PATH_TO_TABLE") clear user("adrf\user.name.project") password("password") dsn("Redshift01_projects_DSN")
Developing your query. Here's an example workflow to follow when developing a query:

1. Study the column and table metadata, which is accessible via the table definition. Each table definition can be displayed by clicking on the [+] next to the table name.
2. To get a feel for a table's values, SELECT * from the tables you're working with and LIMIT your results (keep the LIMIT applied as you refine your columns), e.g., select * from [table name] LIMIT 1000.
3. Narrow down the columns to the minimal set required to answer your question.
4. Apply any filters to those columns.
5. If you need to aggregate data, aggregate a small number of rows first.
6. Once you have a query returning the results you need, look for sections of the query to save as a Common Table Expression (CTE) to encapsulate that logic, as shown in the sketch after this list.
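As an illustration, here is a minimal sketch of this workflow in R, assuming an existing DBI connection conn; the table schema_name.table_name and the columns col_1 and col_2 are hypothetical names, so adapt them to your own schema.

library(DBI)

# Steps 1-2: peek at the table's values, keeping a LIMIT while exploring
sample_df <- dbGetQuery(conn, "select * from schema_name.table_name LIMIT 1000")

# Steps 3-4: narrow to the minimal columns and apply filters
filtered_df <- dbGetQuery(conn, "
  select col_1, col_2
  from schema_name.table_name
  where col_2 >= 2020
  LIMIT 1000")

# Steps 5-6: aggregate, then encapsulate the reusable logic in a CTE
final_df <- dbGetQuery(conn, "
  with filtered as (
    select col_1, col_2
    from schema_name.table_name
    where col_2 >= 2020
  )
  select col_1, count(*) as n_obs
  from filtered
  group by col_1")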
To provide ADRF users with the ability to draw from sensitive data, results that are exported from the ADRF must meet rigorous standards meant to protect privacy and confidentiality. To ensure that those standards are met, the ADRF Export Review team reviews each request to ensure that it follows formal guidelines that are set by the respective agency providing the data in partnership with the Coleridge Initiative. Prior to moving data into the ADRF from the agency, the Export Review team suggests default guidelines to implement, based on standard statistical approaches in the U.S. government 1, 2 as well as international standards 3, 4, and 5. The Data Steward from the agency supplying the data works with the team to amend these default rules in line with the agency's requirements. If you are unsure about the review guidelines for the data you are using in the ADRF or if you have any questions relating to exports, please reach out to support@coleridgeinitiative.org before submitting an export request.
To learn more about limiting disclosure more generally, please refer to the textbook or view the videos.
The review process can be delayed if the reviewer needs additional information or if the reviewer needs you to make changes to your code or output to meet the ADRF nondisclosure requirements.
The ADRF Export Review process typically involves two main stages:

1. Primary Review: This is an initial, cursory review of your documentation and exports to ensure they do not include micro-data. A primary review can take up to 5 business days, so please plan accordingly when submitting your materials. In cases where the reviewer has questions or requires additional information, the primary review may extend beyond 5 business days.
2. Secondary Review: This is a comprehensive review conducted by an approved Data Steward who has content knowledge of the data permissioned to your workspace. If your submission pertains to multiple data assets, it will require approval by each Data Steward before the material can be exported from the ADRF.
If you've submitted an export request, you can easily check the status of your submission by following these steps:

1. Log into the ADRF.
2. Open the ADRF Export module.
To help you better understand the different stages of the Export Review process, here are the status descriptions you may encounter:
+Your export is currently under primary review. If any issues arise during the primary review, your reviewer will notify you. Upon completion of the primary review, the secondary reviewer(s) will be notified.
+Your export is currently under secondary review. If your submission pertains to multiple data assets, it will require a review by each Data Steward before being approved.
Cell Sizes
Each agency has specific disclosure review guidelines, especially with respect to the minimum allowable cell sizes for tables. Refer to these guidelines when preparing export requests. If you are unsure of what guidelines are in place for the dataset with which you are working in the ADRF, please reach out to support@coleridgeinitiative.org.
For individual-level data, please report the number of observations in each cell. The default rule is to suppress cells with fewer than 10 observations, unless otherwise directed by the guidelines of the agency that provided the data.
If your table includes row or column totals or is dependent on a preceding or subsequent table, reviewers will need to take into account complementary disclosure risks—that is, whether the tables’ totals, or the separate tables when read together, might disclose information about individuals in the data in a way that a single, simpler table would not. Reviewers will work with you by offering guidance on implementing any necessary complementary suppression techniques.
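As a minimal sketch, the default suppression rule above could be applied in R before preparing an export; here df and group_var are hypothetical names for your data frame and grouping column, and your agency's guidelines may set a different threshold.

library(dplyr)

# Tabulate the number of observations in each cell (hypothetical df and group_var)
cell_counts <- df %>%
  count(group_var, name = "n_obs")

# Suppress cells with fewer than 10 observations (the default rule);
# NA marks a suppressed cell in the exported table
cell_counts <- cell_counts %>%
  mutate(n_obs = ifelse(n_obs < 10, NA, n_obs))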
library(tidyverse)
All packages will be installed in your user folder.
To install a specific package version you can specify:
install.packages("remotes")
remotes::install_version("tidyverse", "1.3.2")