diff --git a/.nojekyll b/.nojekyll index a9a16b4..31c6ff0 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -72fc965a \ No newline at end of file +979e106f \ No newline at end of file diff --git a/access.html b/access.html index 900a9cc..75d36e7 100644 --- a/access.html +++ b/access.html @@ -238,7 +238,7 @@

diff --git a/appendix.html b/appendix.html index 650e25f..32231cf 100644 --- a/appendix.html +++ b/appendix.html @@ -298,7 +298,7 @@

Data Access

-

All data is stored under schemas in the projects database and are accessible by the following programs: 

+

All data is stored under schemas in the projects database and is accessible by the following programs:

DBeaver

To establish a connection to Redshift in DBeaver, first double click on the server you wish to connect to. In the example below, I’m connecting to Redshift11_projects. A window will then appear asking for your Username and Password. The Username is your user folder name, preceded by adrf\. Then click OK. You will now have access to your data stored on the Redshift11_projects server.

@@ -341,7 +341,7 @@

SAS Connection

R Connection

Best practices for loading large amounts of data in R

-

To ensure R can efficiently manage large amounts of data, please add the following lines of code to your R script before any packages are loaded: 

+

To ensure R can efficiently manage large amounts of data, please add the following lines of code to your R script before any packages are loaded:

options(java.parameters = c("-XX:+UseConcMarkSweepGC", "-Xmx8192m"))
gc()

Best practices for writing tables to Redshift

When writing an R data frame to Redshift use the following code as an example:

@@ -422,8 +422,8 @@

Stata Connection

Redshift Query Guidelines for Researchers

Developing your query. Here’s an example workflow to follow when developing a query.

    -
  1. Study the column and table metadata, which is accessible via the table definition. Each table definition can be displayed by clicking on the [+] next the table name. 

    -
  2. To get a feel for a table’s values, SELECT * from the tables you’re working with and LIMIT your results (Keep the LIMIT applied as you refine your columns) or use (e.g., select  * from [table name] LIMIT 1000 )

    +
  1. Study the column and table metadata, which is accessible via the table definition. Each table definition can be displayed by clicking on the [+] next to the table name.

    +
  2. To get a feel for a table’s values, SELECT * from the tables you’re working with and LIMIT your results (keep the LIMIT applied as you refine your columns), e.g., select * from [table name] LIMIT 1000.

  3. Narrow down the columns to the minimal set required to answer your question.

  4. Apply any filters to those columns.

  5. If you need to aggregate data, aggregate a small number of rows.

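As a rough sketch of this workflow, using the placeholder table and columns from the tips below (projects.schema_name.table_name, col_A, col_B; the filter value is hypothetical), each step narrows the query before any heavy aggregation:

-- Step 2: explore the table with a hard LIMIT while the query takes shape
SELECT * FROM projects.schema_name.table_name LIMIT 1000;

-- Steps 3-5: keep only the needed columns, filter early, then aggregate the small result
SELECT col_A, COUNT(*) AS n
FROM projects.schema_name.table_name
WHERE col_B = 'SOME_VALUE'   -- hypothetical filter
GROUP BY col_A
LIMIT 1000;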
  @@ -442,18 +442,18 @@

    Tip 2: Always fetch limited data and target accurate results

    -

    Lesser the data retrieved, the faster the query will run. Rather than applying too many filters on the client-side, filter the data as much as possible at the server. This limits the data being sent on the wire and you’ll be able to see the results much faster.  In Amazon Redshift use LIMIT (###) qualifier at the end of the query to limit records.

    -

    SELECT  col_A, col_B, col_C FROM projects.schema_name.table_name  WHERE [apply some filter] LIMIT 1000

    +

    The less data retrieved, the faster the query will run. Rather than applying too many filters on the client side, filter the data as much as possible at the server. This limits the data being sent on the wire and you’ll be able to see the results much faster. In Amazon Redshift, use the LIMIT (###) qualifier at the end of the query to limit records.

    +

    SELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE [apply some filter] LIMIT 1000

    Tip 3: Use wildcard characters wisely

    Wildcard characters can be used as either a prefix or a suffix. Using a leading wildcard (%) in combination with an ending wildcard will search all records for a match anywhere within the selected field.

    Inefficient

    Select col_A, col_B, col_C from projects.schema_name.table_name where col_A like '%BRO%'

    -

    This query will pull the expected results of Brown Sugar, Brownie, Brown Rice and so on. However, it will also pull unexpected results, such as Country Brown, Lamb with Broth, Cream of Broccoli.    

    +

    This query will pull the expected results of Brown Sugar, Brownie, Brown Rice and so on. However, it will also pull unexpected results, such as Country Brown, Lamb with Broth, Cream of Broccoli.

    Efficient

    Select col_A, col_B, col_C from projects.schema_name.table_name where col_B like 'BRO%'.

    -

    This query will pull only the expected results of Brownie, Brown Rice, Brown Sugar and so on. 

    +

    This query will pull only the expected results of Brownie, Brown Rice, Brown Sugar and so on.

    Tip 4: Does my record exist?

    @@ -461,9 +461,9 @@

    Tip 4:

    Efficient

    select col_A from projects.schema_name.table_name A where exists (select 1 from projects.schema_name.table_name B where A.col_A = B.col_A ) order by col_A;

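    For contrast, a COUNT()-based version of the same existence check is sketched below (same placeholder table and columns as above); it scans the whole table even after a match is found:

    Inefficient

    -- COUNT() keeps scanning after the first match is found
    select col_A from projects.schema_name.table_name A where (select count(*) from projects.schema_name.table_name B where A.col_A = B.col_A) > 0 order by col_A;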
    -
    -

    Tip 5: Avoid correlated subqueries

    -

    A correlated subquery depends on the parent or outer query. Since it executes row by row, it decreases the overall speed of the process.  

    +
    +

    Tip 5: Avoid correlated subqueries

    +

    A correlated subquery depends on the parent or outer query. Since it executes row by row, it decreases the overall speed of the process.

    Inefficient

    SELECT col_A, col_B, (SELECT col_C FROM projects.schema_name.table_name_a WHERE col_C = c.rma LIMIT 1) AS new_name FROM projects.schema_name.table_name_b

    Here, the problem is — the inner query is run for each row returned by the outer query. Going over the “table_name_b” table again and again for every row processed by the outer query creates process overhead. Instead, for Amazon Redshift query optimization, use JOIN to solve such problems.

    @@ -472,25 +472,25 @@

    Tip 6: Avoid using Amazon Redshift functions in the WHERE condition

    -

    Often developers use functions or methods with their Amazon Redshift queries. 

    +

    Often developers use functions or methods with their Amazon Redshift queries.

    Inefficient

    -

    SELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE RIGHT(birth_date,4) = '1965' and  LEFT(birth_date,2) = '07'

    -

    Note that even if birth_date has an index, the above query changes the WHERE clause in such a way that this index cannot be used anymore.

    +

    SELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE RIGHT(birth_date,4) = '1965' and LEFT(birth_date,2) = '07'

    +

    Note that even if birth_date has an index, the above query changes the WHERE clause in such a way that this index cannot be used anymore.

    Efficient

    SELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE birth_date between '711965' and '7311965'

    Tip 7: Use WHERE instead of HAVING

    -

    HAVING clause filters the rows after all the rows are selected. It is just like a filter. Do not use the HAVING clause for any other purposes. It is useful when performing group bys and aggregations.

    +

    The HAVING clause filters the rows after all the rows are selected. It is just like a filter. Do not use the HAVING clause for any other purposes; it is useful when performing GROUP BYs and aggregations.

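    As a brief sketch of the contrast (same placeholder table and columns; the filter values are hypothetical), push plain-column conditions into WHERE and keep HAVING for the aggregate:

    Inefficient

    -- HAVING on a plain column: every row is grouped before the filter is applied
    SELECT col_A, sum(col_B) as total_amt FROM projects.schema_name.table_name GROUP BY col_A HAVING col_A = 'key'

    Efficient

    -- WHERE removes rows before grouping; HAVING only tests the aggregate
    SELECT col_A, sum(col_B) as total_amt FROM projects.schema_name.table_name WHERE col_A = 'key' GROUP BY col_A HAVING sum(col_B) > 0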
    Tip 8: Use temp tables when merging large data sets

    -

    Creating local temp tables will limit the number of records in large table joins and merges.  Instead of performing large table joins, one can break out the analysis by performing the analysis in two steps: 1) create a temp table with limiting criteria to create a smaller / filtered result set.  2) join the temp table to the second large table to limit the number of records being fetched and to speed up the query. This is especially useful when there are no indexes on the join columns.

    +

    Creating local temp tables will limit the number of records in large table joins and merges. Instead of performing large table joins, one can break out the analysis by performing the analysis in two steps: 1) create a temp table with limiting criteria to create a smaller / filtered result set. 2) join the temp table to the second large table to limit the number of records being fetched and to speed up the query. This is especially useful when there are no indexes on the join columns.

    Inefficient

    -

    SELECT col_A, col_B, sum(col_C) total FROM projects.schema_name.table_name  pd  INNER JOIN  projects.schema_name.table_name st  ON  pd.col_A=st.col_B WHERE pd.col_C like 'DOG%' GROUP BY pd.col_A, pd.col_B, pd.col_C

    -

    Note that even if joining column col_A has an index, the col_B column does not.  In addition, because the size of some tables can be large, one should limit the size of the join table by first building a smaller filtered #temp table then performing the table joins.  

    +

    SELECT col_A, col_B, sum(col_C) total FROM projects.schema_name.table_name pd INNER JOIN projects.schema_name.table_name st ON pd.col_A=st.col_B WHERE pd.col_C like 'DOG%' GROUP BY pd.col_A, pd.col_B, pd.col_C

    +

    Note that even if joining column col_A has an index, the col_B column does not. In addition, because the size of some tables can be large, one should limit the size of the join table by first building a smaller filtered #temp table then performing the table joins.

    Efficient

    -

    SET search_path = schema_name;  -- this statement sets the default schema/database to projects.schema_name

    +

    SET search_path = schema_name; -- this statement sets the default schema/database to projects.schema_name

    Step 1:

    CREATE TEMP TABLE temp_table (

    col_A varchar(14),

    @@ -500,7 +500,7 @@

    Other Pointers for best database performance

    SELECT columns, not stars. Specify the columns you’d like to include in the results (though it’s fine to use * when first exploring tables — just remember to LIMIT your results).

    -

    Avoid using SELECT DISTINCT. SELECT DISTINCT command in Amazon Redshift used for fetching unique results and remove duplicate rows in the relation. To achieve this task, it basically groups together related rows and then removes them. GROUP BY operation is a costly operation. To fetch distinct rows and remove duplicate rows, use more attributes in the SELECT operation. 

    +

    Avoid using SELECT DISTINCT. The SELECT DISTINCT command in Amazon Redshift is used to fetch unique results and remove duplicate rows from the relation. To achieve this, it basically groups related rows together and then removes the duplicates. GROUP BY is a costly operation. To fetch distinct rows and remove duplicate rows, use more attributes in the SELECT operation.

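    As a sketch of that suggestion (placeholder columns from the other examples), selecting enough attributes can make rows unique without the implicit grouping that DISTINCT performs:

    Avoid

    -- DISTINCT adds an implicit group/de-duplication step
    SELECT DISTINCT col_A FROM projects.schema_name.table_name

    Prefer

    -- including more identifying attributes often makes rows unique on their own
    SELECT col_A, col_B, col_C FROM projects.schema_name.table_name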
    Inner joins vs WHERE clause. Use an inner join for merging two or more tables rather than using the WHERE clause. The WHERE clause creates the CROSS join/CARTESIAN product for merging tables, and the CARTESIAN product of two tables takes a lot of time.

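    A short sketch of the two join styles (hypothetical tables table_name_a and table_name_b, aliased A and B as in the FROM best practice below):

    Avoid

    -- implicit join written in the WHERE clause
    SELECT A.col_A, B.col_B, B.col_C FROM projects.schema_name.table_name_a A, projects.schema_name.table_name_b B WHERE A.col_A = B.col_B

    Prefer

    -- explicit INNER JOIN with an ON condition
    SELECT A.col_A, B.col_B, B.col_C FROM projects.schema_name.table_name_a A INNER JOIN projects.schema_name.table_name_b B ON A.col_A = B.col_B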
    -

    IN versus EXISTS. IN operator is costlier than EXISTS in terms of scans especially when the result of the subquery is a large dataset. We should try to use EXISTS rather than using IN for fetching results with a subquery. 

    +

    IN versus EXISTS. The IN operator is costlier than EXISTS in terms of scans, especially when the result of the subquery is a large dataset. We should try to use EXISTS rather than IN for fetching results with a subquery.

    Avoid

    SELECT col_A , col_B, col_C

    FROM projects.schema_name.table_name

    @@ -540,7 +540,7 @@

    SELECT A.col_A , B.col_B, B.col_C

    FROM projects.schema_name.table_name as A

    JOIN projects.schema_name.table_name B ON A.col_A = B.col_B

    -

    Alias multiple tables. When querying multiple tables, use aliases, and employ those aliases in your select statement, so the database (and your reader) doesn’t need to parse which column belongs to which table. 

    +

    Alias multiple tables. When querying multiple tables, use aliases, and employ those aliases in your select statement, so the database (and your reader) doesn’t need to parse which column belongs to which table.

    Avoid

    SET search_path = schema_name;-- this statement sets the default schema/database to projects.schema_name

    SELECT col_A , col_B, col_C

    @@ -571,11 +571,11 @@

    Amazon Redshift best practices for HAVING

    Only use HAVING for filtering aggregates. Before HAVING, filter out values using a WHERE clause before aggregating and grouping those values.

    -

    SELECT  col_A, sum(col_B) as total_amt

    -

    FROM  projects.schema_name.table_name

    -

    WHERE  col_C = 1617 and col_A='key'

    +

    SELECT col_A, sum(col_B) as total_amt

    +

    FROM projects.schema_name.table_name

    +

    WHERE col_C = 1617 and col_A='key'

    GROUP BY col_A

    -

    HAVING  sum(col_D)> 0

    +

    HAVING sum(col_D)> 0

    Amazon Redshift best practices for UNION

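    A brief sketch (hypothetical tables table_name_a and table_name_b): when duplicates are acceptable, UNION ALL skips the de-duplication pass that UNION performs.

    Avoid

    -- UNION de-duplicates the combined result set, which costs extra work
    SELECT col_A FROM projects.schema_name.table_name_a UNION SELECT col_A FROM projects.schema_name.table_name_b

    Prefer

    -- UNION ALL keeps duplicates and avoids that extra pass
    SELECT col_A FROM projects.schema_name.table_name_a UNION ALL SELECT col_A FROM projects.schema_name.table_name_b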
    @@ -588,27 +588,27 @@

    Troubleshooting Queries

    There are several metrics for calculating the cost of the query in terms of storage, time, CPU utilization. However, these metrics require DBA permissions to execute. Follow up with ADRF support to get additional assistance.

    -

    Using the SVL_QUERY_SUMMARY view: To analyze query summary information by stream, do the following:    

    -

    Step 1: select query, elapsed, substring  from svl_qlog order by query desc limit 5;       

    +

    Using the SVL_QUERY_SUMMARY view: To analyze query summary information by stream, do the following:

    +

    Step 1: select query, elapsed, substring from svl_qlog order by query desc limit 5;

    Step 2: select * from svl_query_summary where query = MyQueryID order by stm, seg, step;

    -

    Execution Plan:  Lastly, an execution plan is a detailed step-by-step processing plan used by the optimizer to fetch the rows. It can be enabled in the database using the following procedure: 

    +

    Execution Plan: Lastly, an execution plan is a detailed step-by-step processing plan used by the optimizer to fetch the rows. It can be enabled in the database using the following procedure:

    1. Click on SQL Editor in the menu bar.

    2. Click on Explain Execution Plan.

    -

    It helps to analyze the major phases in the execution of a query. We can also find out which part of the execution is taking more time and optimize that sub-part. The execution plan shows which tables were accessed, what index scans were performed for fetching the data. If joins are present it shows how these tables were merged. Further, we can see a more detailed analysis view of each sub-operation performed during query execution.  

    +

    It helps to analyze the major phases in the execution of a query. We can also find out which part of the execution is taking more time and optimize that sub-part. The execution plan shows which tables were accessed and what index scans were performed for fetching the data. If joins are present, it shows how these tables were merged. Further, we can see a more detailed analysis view of each sub-operation performed during query execution.

    diff --git a/index.html b/index.html index 5f633b3..4e04dd0 100644 --- a/index.html +++ b/index.html @@ -225,7 +225,7 @@

    ADRF Onboarding Handbook

    Published
    -

    Last Updated on 12 February, 2024

    +

    Last Updated on 13 February, 2024

    diff --git a/search.json b/search.json index db7472c..526f3be 100644 --- a/search.json +++ b/search.json @@ -18,7 +18,7 @@ "href": "access.html#account-registration-and-onboarding-tasks", "title": "1  Obtaining ADRF Access and Account Set Up", "section": "Account Registration and Onboarding Tasks", - "text": "Account Registration and Onboarding Tasks\n\nYou will receive an email invitation to activate your account. The email will come from http://okta.com, so please make sure that it doesn’t get caught in your email spam filter. Follow the steps outlined in the email to set up your password and your multi-factor authentication preferences. Clink on the link below to watch a video walking through the steps.\nAfter activating your account, you will be logged in to the ADRF Applications page. Proceed to the Data Stewardship application by clicking on the icon.\nIn the Data Stewardship application, you will notice a “Tasks” section with a number of items you will need to complete before you can gain access to the project space. Refer to the next section for details about the onboarding process." + "text": "Account Registration and Onboarding Tasks\n\nYou will receive an email invitation to activate your account. The email will come from http://okta.com, so please make sure that it doesn’t get caught in your email spam filter. Follow the steps outlined in the email to set up your password and your multi-factor authentication preferences. Clink on the link below to watch a video walking through the steps.\nAfter activating your account, you will be logged in to the ADRF Applications page. Proceed to the Data Stewardship application by clicking on the icon.\nIn the Management Portal, you will notice a “Onboarding Tasks” section within “Admin Tasks” with a number of items you will need to complete before you can gain access to the project space. Refer to the next section for details about the onboarding process." }, { "objectID": "access.html#obtaining-adrf-access", @@ -235,14 +235,14 @@ "href": "appendix.html#data-access", "title": "12  Redshift querying guide", "section": "Data Access", - "text": "Data Access\nThe data is housed in Redshift. You need to replace the “user.name.project” with your project based username. The project based username is your user folder name in the U:/ drive:\n\n\n\n\n\n\nNote: Your username will be different than in these examples.\n\nThe password needed to access Redshift is the second password entered when logging into the ADRF as shown in the screen below:\n\n\n\n\n\nAll data is stored under schemas in the projects database and are accessible by the following programs: \n\nDBeaver\nTo establish a connection to Redshift in DBeaver, first double click on the server you wish to connect to. In the example below I’m connecting to Redshift11_projects. Then a window will appear asking for your Username and Password. This will be your user folder name and include adrf\\ before the username. Then click OK. You will now have access to your data stored on the Redshift11_projects server.\n\n\n\n\n\nCreating Tables in PR/TR Schema\nWhen users create tables in their PR (Research Project) or TR (Training Project) schema, the table is initially permissioned to the user only. 
This is analogous to creating a document or file in your U drive: Only you have access to the newly created table.\nIf you want to allow all individuals in your project workspace to access the table in the PR/TR schema, you will need to grant permission to the table to the rest of the users who have access to the PR or TR schema.\nYou can do this by running the following code:\nGRANT SELECT, UPDATE, DELETE, INSERT ON TABLE schema_name.table_name TO group db_xxxxxx_rw;\n\nNote: In the above code example replace schma_name with the pr_ or tr_ schema assigned to your workspace and replace table_name with the name of the table on which you want to grant access. Also, in the group name db_xxxxxx_rw, replace xxxxxx with your project code. This is the last 6 characters in your project based user name. This will start with either a T or a P.\n\nIf you want to allow only a single user on your project to access the table, you will need to grant permission to that user. You can do this by running the following code:\nGRANT SELECT, UPDATE, DELETE, INSERT ON TABLE schema_name.table_name to \"IAM:first_name.last_name.project_code\";\n\nNote: In the above code example replace schma_name with the pr_ or tr_ schema assigned to your workspace and replace table_name with the name of the table on which you want to grant access. Also, in \"IAM:first_name.last_name.project_code\" update first_name.last_name.project_code with the user name to whom you want to grant access to.\n\nIf you have any questions, please reach out to us at support@coleridgeinitiative.org\nWhen connecting to the database through SAS, R, Stata, or Python you need to use one of the following DSNs:\n\nRedshift01_projects_DSN\nRedshift11_projects_DSN\n\nIn the code examples below, the default DSN is Redshift01_projects_DSN.\n\n\nSAS Connection\nproc sql;\nconnect to odbc as my con\n(datasrc=Redshift01_projects_DSN user=adrf\\user.name.project password=password);\nselect * from connection to mycon\n(select * form projects.schema.table);\ndisconnect from mycon;\nquit;\n\n\nR Connection\nBest practices for loading large amounts of data in R\nTo ensure R can efficiently manage large amounts of data, please add the following lines of code to your R script before any packages are loaded: \noptions(java.parameters = c(\"-XX:+UseConcMarkSweepGC\", \"-Xmx8192m\")) gc()\nBest practices for writing tables to Redshift\nWhen writing an R data frame to Redshift use the following code as an example:\n# Note: replace the table_name with the name of the data frame you wish to write to Redshift\n\nDBI::dbWriteTable(conn = conn, #name of the connection \nname = \"schema_name.table_name\", #name of table to save df to \nvalue = df_name, #name of df to write to Redshift \noverwrite = TRUE) #if you want to overwrite a current table, otherwise FALSE\n\nqry <- \"GRANT SELECT ON TABLE schema.table_name TO group <group_name>;\"\ndbSendUpdate(conn,qry)\nThe below table is for connecting to RedShift11 Database\nlibrary(RJDBC)\ndbusr=Sys.getenv(\"DBUSER\") \ndbpswd=Sys.getenv(\"DBPASSWD\")\n\n# Database URL\nurl <- paste0(\"jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;\",\n\"loginToRp=urn:amazon:webservices:govcloud;\",\n\"ssl=true;\",\n\"AutoCreate=true;\",\n\"idp_host=adfs.adrf.net;\",\n\"idp_port=443;\",\n\"ssl_insecure=true;\",\n\"plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider\")\n\n# Redshift JDBC Driver Setting\ndriver <- JDBC(\"com.amazon.redshift.jdbc42.Driver\",\nclassPath = 
\"C:\\\\drivers\\\\redshift_withsdk\\\\redshift-jdbc42-2.1.0.12\\\\redshift-jdbc42-2.1.0.12.jar\",\nidentifier.quote=\"`\")\nconn <- dbConnect(driver, url, dbusr, dbpswd)\nFor the above code to work, please create a file name .Renviron in your user folder (user folder is something like i.e. u:\\ John.doe.p00002) And .Renviron file should contain the following:\nDBUSER='adrf\\John.doe.p00002'\nDBPASSWD='xxxxxxxxxxxx'\nPLEASE replace user id and password with your project workspace specific user is and password.\nThis will ensure you don’t have your id and password in R code and then you can easily share your R code with others without sharing your ID and password.\nThe below table is for connecting to RedShift01 Database\nlibrary(RJDBC)\ndbusr=Sys.getenv(\"DBUSER\") \ndbpswd=Sys.getenv(\"DBPASSWD\")\n\n# Database URL\nurl <- paste0(\"jdbc:redshift:iam://adrf-redshift01.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;\",\n\"loginToRp=urn:amazon:webservices:govcloud;\",\n\"ssl=true;\",\n\"AutoCreate=true;\",\n\"idp_host=adfs.adrf.net;\",\n\"idp_port=443;\",\n\"ssl_insecure=true;\",\n\"plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider\")\n\n# Redshift JDBC Driver Setting\ndriver <- JDBC(\"com.amazon.redshift.jdbc42.Driver\",\nclassPath = \"C:\\\\drivers\\\\redshift_withsdk\\\\redshift-jdbc42-2.1.0.12\\\\redshift-jdbc42-2.1.0.12.jar\",\nidentifier.quote=\"`\")\nconn <- dbConnect(driver, url, dbusr, dbpswd)\nFor the above code to work, please create a file name .Renviron in your user folder (user folder is something like i.e. u:\\ John.doe.p00002) And .Renviron file should contain the following:\nDBUSER='adrf\\John.doe.p00002'\nDBPASSWD='xxxxxxxxxxxx'\nPLEASE replace user id and password with your project workspace specific user is and password.\nThis will ensure you don’t have your id and password in R code and then you can easily share your R code with others without sharing your ID and password.\n\n\nPython Connection\nimport pyodbc\nimport pandas as pd\ncnxn = pyodbc.connect('DSN=Redshift01_projects_DSN;\n UID = adrf\\\\user.name.project; PWD = password')\ndf = pd.read_sql(“SELECT * FROM projects.schema_name.table_name”, cnxn)\n\n\nStata Connection\nodbc load, exec(\"select * from PATH_TO_TABLE\") clear dsn(\"Redshift11_projects_DSN\") user(\"adrf\\user.name.project\") password(\"password\")" + "text": "Data Access\nThe data is housed in Redshift. You need to replace the “user.name.project” with your project based username. The project based username is your user folder name in the U:/ drive:\n\n\n\n\n\n\nNote: Your username will be different than in these examples.\n\nThe password needed to access Redshift is the second password entered when logging into the ADRF as shown in the screen below:\n\n\n\n\n\nAll data is stored under schemas in the projects database and are accessible by the following programs:\n\nDBeaver\nTo establish a connection to Redshift in DBeaver, first double click on the server you wish to connect to. In the example below I’m connecting to Redshift11_projects. Then a window will appear asking for your Username and Password. This will be your user folder name and include adrf\\ before the username. Then click OK. You will now have access to your data stored on the Redshift11_projects server.\n\n\n\n\n\nCreating Tables in PR/TR Schema\nWhen users create tables in their PR (Research Project) or TR (Training Project) schema, the table is initially permissioned to the user only. 
This is analogous to creating a document or file in your U drive: Only you have access to the newly created table.\nIf you want to allow all individuals in your project workspace to access the table in the PR/TR schema, you will need to grant permission to the table to the rest of the users who have access to the PR or TR schema.\nYou can do this by running the following code:\nGRANT SELECT, UPDATE, DELETE, INSERT ON TABLE schema_name.table_name TO group db_xxxxxx_rw;\n\nNote: In the above code example replace schma_name with the pr_ or tr_ schema assigned to your workspace and replace table_name with the name of the table on which you want to grant access. Also, in the group name db_xxxxxx_rw, replace xxxxxx with your project code. This is the last 6 characters in your project based user name. This will start with either a T or a P.\n\nIf you want to allow only a single user on your project to access the table, you will need to grant permission to that user. You can do this by running the following code:\nGRANT SELECT, UPDATE, DELETE, INSERT ON TABLE schema_name.table_name to \"IAM:first_name.last_name.project_code\";\n\nNote: In the above code example replace schma_name with the pr_ or tr_ schema assigned to your workspace and replace table_name with the name of the table on which you want to grant access. Also, in \"IAM:first_name.last_name.project_code\" update first_name.last_name.project_code with the user name to whom you want to grant access to.\n\nIf you have any questions, please reach out to us at support@coleridgeinitiative.org\nWhen connecting to the database through SAS, R, Stata, or Python you need to use one of the following DSNs:\n\nRedshift01_projects_DSN\nRedshift11_projects_DSN\n\nIn the code examples below, the default DSN is Redshift01_projects_DSN.\n\n\nSAS Connection\nproc sql;\nconnect to odbc as my con\n(datasrc=Redshift01_projects_DSN user=adrf\\user.name.project password=password);\nselect * from connection to mycon\n(select * form projects.schema.table);\ndisconnect from mycon;\nquit;\n\n\nR Connection\nBest practices for loading large amounts of data in R\nTo ensure R can efficiently manage large amounts of data, please add the following lines of code to your R script before any packages are loaded:\noptions(java.parameters = c(\"-XX:+UseConcMarkSweepGC\", \"-Xmx8192m\")) gc()\nBest practices for writing tables to Redshift\nWhen writing an R data frame to Redshift use the following code as an example:\n# Note: replace the table_name with the name of the data frame you wish to write to Redshift\n\nDBI::dbWriteTable(conn = conn, #name of the connection \nname = \"schema_name.table_name\", #name of table to save df to \nvalue = df_name, #name of df to write to Redshift \noverwrite = TRUE) #if you want to overwrite a current table, otherwise FALSE\n\nqry <- \"GRANT SELECT ON TABLE schema.table_name TO group <group_name>;\"\ndbSendUpdate(conn,qry)\nThe below table is for connecting to RedShift11 Database\nlibrary(RJDBC)\ndbusr=Sys.getenv(\"DBUSER\") \ndbpswd=Sys.getenv(\"DBPASSWD\")\n\n# Database URL\nurl <- paste0(\"jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;\",\n\"loginToRp=urn:amazon:webservices:govcloud;\",\n\"ssl=true;\",\n\"AutoCreate=true;\",\n\"idp_host=adfs.adrf.net;\",\n\"idp_port=443;\",\n\"ssl_insecure=true;\",\n\"plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider\")\n\n# Redshift JDBC Driver Setting\ndriver <- JDBC(\"com.amazon.redshift.jdbc42.Driver\",\nclassPath = 
\"C:\\\\drivers\\\\redshift_withsdk\\\\redshift-jdbc42-2.1.0.12\\\\redshift-jdbc42-2.1.0.12.jar\",\nidentifier.quote=\"`\")\nconn <- dbConnect(driver, url, dbusr, dbpswd)\nFor the above code to work, please create a file name .Renviron in your user folder (user folder is something like i.e. u:\\ John.doe.p00002) And .Renviron file should contain the following:\nDBUSER='adrf\\John.doe.p00002'\nDBPASSWD='xxxxxxxxxxxx'\nPLEASE replace user id and password with your project workspace specific user is and password.\nThis will ensure you don’t have your id and password in R code and then you can easily share your R code with others without sharing your ID and password.\nThe below table is for connecting to RedShift01 Database\nlibrary(RJDBC)\ndbusr=Sys.getenv(\"DBUSER\") \ndbpswd=Sys.getenv(\"DBPASSWD\")\n\n# Database URL\nurl <- paste0(\"jdbc:redshift:iam://adrf-redshift01.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;\",\n\"loginToRp=urn:amazon:webservices:govcloud;\",\n\"ssl=true;\",\n\"AutoCreate=true;\",\n\"idp_host=adfs.adrf.net;\",\n\"idp_port=443;\",\n\"ssl_insecure=true;\",\n\"plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider\")\n\n# Redshift JDBC Driver Setting\ndriver <- JDBC(\"com.amazon.redshift.jdbc42.Driver\",\nclassPath = \"C:\\\\drivers\\\\redshift_withsdk\\\\redshift-jdbc42-2.1.0.12\\\\redshift-jdbc42-2.1.0.12.jar\",\nidentifier.quote=\"`\")\nconn <- dbConnect(driver, url, dbusr, dbpswd)\nFor the above code to work, please create a file name .Renviron in your user folder (user folder is something like i.e. u:\\ John.doe.p00002) And .Renviron file should contain the following:\nDBUSER='adrf\\John.doe.p00002'\nDBPASSWD='xxxxxxxxxxxx'\nPLEASE replace user id and password with your project workspace specific user is and password.\nThis will ensure you don’t have your id and password in R code and then you can easily share your R code with others without sharing your ID and password.\n\n\nPython Connection\nimport pyodbc\nimport pandas as pd\ncnxn = pyodbc.connect('DSN=Redshift01_projects_DSN;\n UID = adrf\\\\user.name.project; PWD = password')\ndf = pd.read_sql(“SELECT * FROM projects.schema_name.table_name”, cnxn)\n\n\nStata Connection\nodbc load, exec(\"select * from PATH_TO_TABLE\") clear dsn(\"Redshift11_projects_DSN\") user(\"adrf\\user.name.project\") password(\"password\")" }, { "objectID": "appendix.html#redshift-query-guidelines-for-researchers", "href": "appendix.html#redshift-query-guidelines-for-researchers", "title": "12  Redshift querying guide", "section": "Redshift Query Guidelines for Researchers", - "text": "Redshift Query Guidelines for Researchers\nDeveloping your query. Here’s an example workflow to follow when developing a query.\n\nStudy the column and table metadata, which is accessible via the table definition. Each table definition can be displayed by clicking on the [+] next the table name. 
\nTo get a feel for a table’s values, SELECT * from the tables you’re working with and LIMIT your results (Keep the LIMIT applied as you refine your columns) or use (e.g., select  * from [table name] LIMIT 1000 )\nNarrow down the columns to the minimal set required to answer your question.\nApply any filters to those columns.\nIf you need to aggregate data, aggregate a small number of rows\nOnce you have a query returning the results you need, look for sections of the query to save as a Common Table Expression (CTE) to encapsulate that logic.\n\n\nDO and DON’T DO BEST PRACTICES:\n\nTip 1: Use SELECT <columns> instead of SELECT *\nSpecify the columns in the SELECT clause instead of using SELECT *. The unnecessary columns place extra load on the database, which slows down not just the single Amazon Redshift, but the whole system.\nInefficient\nSELECT * FROM projects.schema_name.table_name\nThis query fetches all the data stored in the table you choose which might not be required for a particular scenario.\nEfficient\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name\n\n\nTip 2: Always fetch limited data and target accurate results\nLesser the data retrieved, the faster the query will run. Rather than applying too many filters on the client-side, filter the data as much as possible at the server. This limits the data being sent on the wire and you’ll be able to see the results much faster.  In Amazon Redshift use LIMIT (###) qualifier at the end of the query to limit records.\nSELECT  col_A, col_B, col_C FROM projects.schema_name.table_name  WHERE [apply some filter] LIMIT 1000\n\n\nTip 3: Use wildcard characters wisely\nWildcard characters can be either used as a prefix or a suffix. Using leading wildcard (%) in combination with an ending wildcard will search all records for a match anywhere within the selected field.\nInefficient\nSelect col_A, col_B, col_C from projects.schema_name.table_name where col_A like '%BRO%'\nThis query will pull the expected results of Brown Sugar, Brownie, Brown Rice and so on. However, it will also pull unexpected results, such as Country Brown, Lamb with Broth, Cream of Broccoli.    \nEfficient\nSelect col_A, col_B, col_C from projects.schema_name.table_name where col_B like 'BRO%'.\nThis query will pull only the expected results of Brownie, Brown Rice, Brown Sugar and so on. \n\n\nTip 4: Does My record exist?\nNormally, developers use EXISTS() or COUNT() queries for matching a record entry. However, EXISTS() is more efficient as it will exit as soon as finding a matching record; whereas, COUNT() will scan the entire table even if the record is found in the first row.\nEfficient\nselect col_A from projects.schema_name.table_name A where exists (select 1 from projects.schema_name.table_name B where A.col_A = B.col_A ) order by col_A;\n\n\nTip 5: Avoid correlated subqueries\nA correlated subquery depends on the parent or outer query. Since it executes row by row, it decreases the overall speed of the process.  \nInefficient\nSELECT col_A, col_B, (SELECT col_C FROM projects.schema_name.table_name_a WHERE col_C = c.rma LIMIT 1) AS new_name FROM projects.schema_name.table_name_b\nHere, the problem is — the inner query is run for each row returned by the outer query. Going over the “table_name_b” table again and again for every row processed by the outer query creates process overhead. 
Instead, for Amazon Redshift query optimization, use JOIN to solve such problems.\nEfficient\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name c LEFT JOIN projects.schema_name.table_name co ON c.col_A = co.col_B\n\n\nTip 6: Avoid using Amazon Redshift function in the where condition\nOften developers use functions or methods with their Amazon Redshift queries. \nInefficient\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE RIGHT(birth_date,4) = '1965' and  LEFT(birth_date,2) = '07'\nNote that even if birth_date has an index, the above query changes the WHERE clause in such a way that this index cannot be used anymore.\nEfficient\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE birth_date between '711965' and '7311965'\n\n\nTip 7: Use WHERE instead of HAVING\nHAVING clause filters the rows after all the rows are selected. It is just like a filter. Do not use the HAVING clause for any other purposes. It is useful when performing group bys and aggregations.\n\n\nTip 8: Use temp tables when merging large data sets\nCreating local temp tables will limit the number of records in large table joins and merges.  Instead of performing large table joins, one can break out the analysis by performing the analysis in two steps: 1) create a temp table with limiting criteria to create a smaller / filtered result set.  2) join the temp table to the second large table to limit the number of records being fetched and to speed up the query. This is especially useful when there are no indexes on the join columns.\nInefficient\nSELECT col_A, col_B, sum(col_C) total FROM projects.schema_name.table_name  pd  INNER JOIN  projects.schema_name.table_name st  ON  pd.col_A=st.col_B WHERE pd.col_C like 'DOG%' GROUP BY pd.col_A, pd.col_B, pd.col_C\nNote that even if joining column col_A has an index, the col_B column does not.  In addition, because the size of some tables can be large, one should limit the size of the join table by first building a smaller filtered #temp table then performing the table joins.  \nEfficient\nSET search_path = schema_name;  -- this statement sets the default schema/database to projects.schema_name\nStep 1:\nCREATE TEMP TABLE temp_table (\ncol_A varchar(14),\ncol_B varchar(178),\ncol_C varchar(4) );\nStep 2:\nINSERT INTO temp_table SELECT col_A, col_B, col_C\nFROM projects.schema_name.table_name WHERE col_B like 'CAT%';\nStep 3:\nSELECT pd.col_A, pd.col_B, pd.col_C, sum(col_C) as total FROM  temp_table pd  INNER JOIN  projects.schema_name.table_name st  ON pd.col_A=st.col_B  GROUP BY pd.col_A, pd.col_B, pd.col_C;\nDROP TABLE temp_table;\nNote always drop the temp table after the analysis is complete to release data from physical memory.\n\n\n\nOther Pointers for best database performance\nSELECT columns, not stars. Specify the columns you’d like to include in the results (though it’s fine to use * when first exploring tables — just remember to LIMIT your results).\nAvoid using SELECT DISTINCT. SELECT DISTINCT command in Amazon Redshift used for fetching unique results and remove duplicate rows in the relation. To achieve this task, it basically groups together related rows and then removes them. GROUP BY operation is a costly operation. To fetch distinct rows and remove duplicate rows, use more attributes in the SELECT operation. \nInner joins vs WHERE clause. Use inner join for merging two or more tables rather than using the WHERE clause. WHERE clause creates the CROSS join/ CARTESIAN product for merging tables. 
CARTESIAN product of two tables takes a lot of time.\nIN versus EXISTS. IN operator is costlier than EXISTS in terms of scans especially when the result of the subquery is a large dataset. We should try to use EXISTS rather than using IN for fetching results with a subquery. \nAvoid\nSELECT col_A , col_B, col_C\nFROM projects.schema_name.table_name\nWHERE col_A IN\n(SELECT col_B FROM projects.schema_name.table_name WHERE col_B = 'DOG')\nPrefer\nSELECT col_A , col_B, col_C\nFROM projects.schema_name.table_name\nWHERE EXISTS\n(SELECT col_A FROM projects.schema_name.table_name b WHERE\na.col_A = b.col_B and b.col_B = 'DOG')\nQuery optimizers can change the order of the following list, but this general lifecycle of a Amazon Redshift query is good to keep in mind when writing Amazon Redshift.\n\nFROM (and JOIN) get(s) the tables referenced in the query.\nWHERE filters data.\nGROUP BY aggregates data.\nHAVING filters out aggregated data that doesn’t meet the criteria.\nSELECT grabs the columns (then deduplicates rows if DISTINCT is invoked).\nUNION merges the selected data into a result set.\nORDER BY sorts the results.\n\n\n\nAmazon Redshift best practices for FROM\nJoin tables using the ON keyword. Although it’s possible to “join” two tables using a WHERE clause, use an explicit JOIN. The JOIN + ON syntax distinguishes joins from WHERE clauses intended to filter the results.\nSET search_path = schema_name;-- this statement sets the default schema/database to projects.schema_name\nSELECT A.col_A , B.col_B, B.col_C\nFROM projects.schema_name.table_name as A\nJOIN projects.schema_name.table_name B ON A.col_A = B.col_B\nAlias multiple tables. When querying multiple tables, use aliases, and employ those aliases in your select statement, so the database (and your reader) doesn’t need to parse which column belongs to which table. \nAvoid\nSET search_path = schema_name;-- this statement sets the default schema/database to projects.schema_name\nSELECT col_A , col_B, col_C\nFROM dbo.table_name as A\nLEFT JOIN dbo.table_name as B ON A.col_A = B.col_B\nPrefer\nSET search_path = schema_name;-- this statement sets the default schema/database to projects.schema_name\nSELECT A.col_A , B.col_B, B.col_C\nFROM dbo.table_name as A\nLEFT JOIN dbo.table_name as B\nA.col_A = B.col_B\n\n\nAmazon Redshift best practices for WHERE\nFilter with WHERE before HAVING. Use a WHERE clause to filter superfluous rows, so you don’t have to compute those values in the first place. Only after removing irrelevant rows, and after aggregating those rows and grouping them, include a HAVING clause to filter out aggregates.\nAvoid functions on columns in WHERE clauses. Using a function on a column in a WHERE clause can really slow down your query, as the function prevents the database from using an index to speed up the query. Instead of using the index to skip to the relevant rows, the function on the column forces the database to run the function on each row of the table. The concatenation operator || is also a function, so don’t try to concat strings to filter multiple columns. Prefer multiple conditions instead:\nAvoid\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name\nWHERE concat(col_A, col_B) = 'REGULARCOFFEE'\nPrefer\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name\nWHERE col_A ='REGULAR' and col_B = 'COFFEE'\n\n\nAmazon Redshift best practices for GROUP BY\nOrder multiple groupings by descending cardinality. Where possible, GROUP BY columns in order of descending cardinality. 
That is, group by columns with more unique values first (like IDs or phone numbers) before grouping by columns with fewer distinct values (like state or gender).\n\n\nAmazon Redshift best practices for HAVING\nOnly use HAVING for filtering aggregates. Before HAVING, filter out values using a WHERE clause before aggregating and grouping those values.\nSELECT  col_A, sum(col_B) as total_amt\nFROM  projects.schema_name.table_name\nWHERE  col_C = 1617 and col_A='key'\nGROUP BY col_A\nHAVING  sum(col_D)> 0\n\n\nAmazon Redshift best practices for UNION\nPrefer UNION All to UNION. If duplicates are not an issue, UNION ALL won’t discard them, and since UNION ALL isn’t tasked with removing duplicates, the query will be more efficient\n\n\nAmazon Redshift best practices for ORDER BY\nAvoid sorting where possible, especially in subqueries. If you must sort, make sure your subqueries are not needlessly sorting data.\nAvoid\nSELECT col_A, col_B, col_C\nFROM projects.schema_name.table_name\nWHERE col_B IN\n(SELECT col_A  FROM projects.schema_name.table_name\nWHERE col_C = 534905 ORDER BY col_B);\nPrefer\nSELECT col_A, col_B, col_C\nFROM projects.schema_name.table_name\nWHERE col_A IN\n(SELECT col_B  FROM projects.schema_name.table_name\nWHERE col_C = 534905);\n\n\nTroubleshooting Queries\nThere are several metrics for calculating the cost of the query in terms of storage, time, CPU utilization. However, these metrics require DBA permissions to execute. Follow up with ADRF support to get additional assistance.\nUsing the SVL_QUERY_SUMMARY view: To analyze query summary information by stream, do the following:    \nStep 1: select query, elapsed, substring  from svl_qlog order by query desc limit 5;       \nStep 2: select * from svl_query_summary where query = MyQueryID order by stm, seg, step;\nExecution Plan:  Lastly, an execution plan is a detailed step-by-step processing plan used by the optimizer to fetch the rows. It can be enabled in the database using the following procedure: \n\nClick on SQL Editor in the menu bar.\nClick on Explain Execution Plan.\n\nIt helps to analyze the major phases in the execution of a query. We can also find out which part of the execution is taking more time and optimize that sub-part. The execution plan shows which tables were accessed, what index scans were performed for fetching the data. If joins are present it shows how these tables were merged. Further, we can see a more detailed analysis view of each sub-operation performed during query execution." + "text": "Redshift Query Guidelines for Researchers\nDeveloping your query. Here’s an example workflow to follow when developing a query.\n\nStudy the column and table metadata, which is accessible via the table definition. Each table definition can be displayed by clicking on the [+] next the table name.\nTo get a feel for a table’s values, SELECT * from the tables you’re working with and LIMIT your results (Keep the LIMIT applied as you refine your columns) or use (e.g., select * from [table name] LIMIT 1000 )\nNarrow down the columns to the minimal set required to answer your question.\nApply any filters to those columns.\nIf you need to aggregate data, aggregate a small number of rows\nOnce you have a query returning the results you need, look for sections of the query to save as a Common Table Expression (CTE) to encapsulate that logic.\n\n\nDO and DON’T DO BEST PRACTICES:\n\nTip 1: Use SELECT <columns> instead of SELECT *\nSpecify the columns in the SELECT clause instead of using SELECT *. 
The unnecessary columns place extra load on the database, which slows down not just the single Amazon Redshift, but the whole system.\nInefficient\nSELECT * FROM projects.schema_name.table_name\nThis query fetches all the data stored in the table you choose which might not be required for a particular scenario.\nEfficient\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name\n\n\nTip 2: Always fetch limited data and target accurate results\nLesser the data retrieved, the faster the query will run. Rather than applying too many filters on the client-side, filter the data as much as possible at the server. This limits the data being sent on the wire and you’ll be able to see the results much faster. In Amazon Redshift use LIMIT (###) qualifier at the end of the query to limit records.\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE [apply some filter] LIMIT 1000\n\n\nTip 3: Use wildcard characters wisely\nWildcard characters can be either used as a prefix or a suffix. Using leading wildcard (%) in combination with an ending wildcard will search all records for a match anywhere within the selected field.\nInefficient\nSelect col_A, col_B, col_C from projects.schema_name.table_name where col_A like '%BRO%'\nThis query will pull the expected results ofBrown Sugar, Brownie, Brown Riceand so on. However, it will also pull unexpected results, such asCountry Brown, Lamb with Broth, Cream of Broccoli.\nEfficient\nSelect col_A, col_B, col_C from projects.schema_name.table_name where col_B like 'BRO%'.\nThis query will pull only the expected results ofBrownie, Brown Rice, Brown Sugar and so on.\n\n\nTip 4: Does My record exist?\nNormally, developers use EXISTS() or COUNT() queries for matching a record entry. However, EXISTS() is more efficient as it will exit as soon as finding a matching record; whereas, COUNT() will scan the entire table even if the record is found in the first row.\nEfficient\nselect col_A from projects.schema_name.table_name A where exists (select 1 from projects.schema_name.table_name B where A.col_A = B.col_A ) order by col_A;\n\n\nTip 5: Avoidcorrelated subqueries\nA correlated subquery depends on the parent or outer query. Since it executes row by row, it decreases the overall speed of the process.\nInefficient\nSELECT col_A, col_B, (SELECT col_C FROM projects.schema_name.table_name_a WHERE col_C = c.rma LIMIT 1) AS new_name FROM projects.schema_name.table_name_b\nHere, the problem is — the inner query is run for each row returned by the outer query. Going over the “table_name_b” table again and again for every row processed by the outer query creates process overhead. Instead, for Amazon Redshift query optimization, use JOIN to solve such problems.\nEfficient\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name c LEFT JOIN projects.schema_name.table_name co ON c.col_A = co.col_B\n\n\nTip 6: Avoid using Amazon Redshift function in the where condition\nOften developers use functions or methods with their Amazon Redshift queries.\nInefficient\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE RIGHT(birth_date,4) = '1965' and LEFT(birth_date,2) = '07'\nNote that even ifbirth_date has an index, the above query changes the WHERE clause in such a way that this index cannot be used anymore.\nEfficient\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE birth_date between '711965' and '7311965'\n\n\nTip 7: Use WHERE instead of HAVING\nHAVING clause filters the rows after all the rows are selected. 
It is just like a filter. Do not use the HAVING clause for any other purposes.It is useful when performing group bys and aggregations.\n\n\nTip 8: Use temp tables when merging large data sets\nCreating local temp tables will limit the number of records in large table joins and merges. Instead of performing large table joins, one can break out the analysis by performing the analysis in two steps: 1) create a temp table with limiting criteria to create a smaller / filtered result set. 2) join the temp table to the second large table to limit the number of records being fetched and to speed up the query. This is especially useful when there are no indexes on the join columns.\nInefficient\nSELECT col_A, col_B, sum(col_C) total FROM projects.schema_name.table_name pd INNER JOIN projects.schema_name.table_name st ON pd.col_A=st.col_B WHERE pd.col_C like 'DOG%' GROUP BY pd.col_A, pd.col_B, pd.col_C\nNote that even if joining column col_A has an index, the col_B column does not. In addition, because the size of some tables can be large, one should limit the size of the join table by first building a smaller filtered #temp table then performing the table joins.\nEfficient\nSET search_path = schema_name; -- this statement sets the default schema/database to projects.schema_name\nStep 1:\nCREATE TEMP TABLE temp_table (\ncol_A varchar(14),\ncol_B varchar(178),\ncol_C varchar(4) );\nStep 2:\nINSERT INTO temp_table SELECT col_A, col_B, col_C\nFROM projects.schema_name.table_name WHERE col_B like 'CAT%';\nStep 3:\nSELECT pd.col_A, pd.col_B, pd.col_C, sum(col_C) as total FROM temp_table pd INNER JOIN projects.schema_name.table_name st ON pd.col_A=st.col_B GROUP BY pd.col_A, pd.col_B, pd.col_C;\nDROP TABLE temp_table;\nNote always drop the temp table after the analysis is complete to release data from physical memory.\n\n\n\nOther Pointers for best database performance\nSELECT columns, not stars. Specify the columns you’d like to include in the results (though it’s fine to use * when first exploring tables — just remember to LIMIT your results).\nAvoid using SELECT DISTINCT. SELECT DISTINCT command in Amazon Redshift used for fetching unique results and remove duplicate rows in the relation. To achieve this task, it basically groups together related rows and then removes them. GROUP BY operation is a costly operation. To fetch distinct rows and remove duplicate rows, use more attributes in the SELECT operation.\nInner joins vs WHERE clause. Use inner join for merging two or more tables rather than using the WHERE clause. WHERE clause creates the CROSS join/ CARTESIAN product for merging tables. CARTESIAN product of two tables takes a lot of time.\nIN versus EXISTS. IN operator is costlier than EXISTS in terms of scans especially when the result of the subquery is a large dataset. 
We should try to use EXISTS rather than using IN for fetching results with a subquery.\nAvoid\nSELECT col_A , col_B, col_C\nFROM projects.schema_name.table_name\nWHERE col_A IN\n(SELECT col_B FROM projects.schema_name.table_name WHERE col_B = 'DOG')\nPrefer\nSELECT col_A , col_B, col_C\nFROM projects.schema_name.table_name\nWHERE EXISTS\n(SELECT col_A FROM projects.schema_name.table_name b WHERE\na.col_A = b.col_B and b.col_B = 'DOG')\nQuery optimizers can change the order of the following list, but this general lifecycle of a Amazon Redshift query is good to keep in mind when writing Amazon Redshift.\n\nFROM (and JOIN) get(s) the tables referenced in the query.\nWHERE filters data.\nGROUP BY aggregates data.\nHAVING filters out aggregated data that doesn’t meet the criteria.\nSELECT grabs the columns (then deduplicates rows if DISTINCT is invoked).\nUNION merges the selected data into a result set.\nORDER BY sorts the results.\n\n\n\nAmazon Redshift best practices for FROM\nJoin tables using the ON keyword. Although it’s possible to “join” two tables using a WHERE clause, use an explicit JOIN. The JOIN + ON syntax distinguishes joins from WHERE clauses intended to filter the results.\nSET search_path = schema_name;-- this statement sets the default schema/database to projects.schema_name\nSELECT A.col_A , B.col_B, B.col_C\nFROM projects.schema_name.table_name as A\nJOIN projects.schema_name.table_name B ON A.col_A = B.col_B\nAlias multiple tables. When querying multiple tables, use aliases, and employ those aliases in your select statement, so the database (and your reader) doesn’t need to parse which column belongs to which table.\nAvoid\nSET search_path = schema_name;-- this statement sets the default schema/database to projects.schema_name\nSELECT col_A , col_B, col_C\nFROM dbo.table_name as A\nLEFT JOIN dbo.table_name as B ON A.col_A = B.col_B\nPrefer\nSET search_path = schema_name;-- this statement sets the default schema/database to projects.schema_name\nSELECT A.col_A , B.col_B, B.col_C\nFROM dbo.table_name as A\nLEFT JOIN dbo.table_name as B\nA.col_A = B.col_B\n\n\nAmazon Redshift best practices for WHERE\nFilter with WHERE before HAVING. Use a WHERE clause to filter superfluous rows, so you don’t have to compute those values in the first place. Only after removing irrelevant rows, and after aggregating those rows and grouping them, include a HAVING clause to filter out aggregates.\nAvoid functions on columns in WHERE clauses. Using a function on a column in a WHERE clause can really slow down your query, as the function prevents the database from using an index to speed up the query. Instead of using the index to skip to the relevant rows, the function on the column forces the database to run the function on each row of the table. The concatenation operator || is also a function, so don’t try to concat strings to filter multiple columns. Prefer multiple conditions instead:\nAvoid\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name\nWHERE concat(col_A, col_B) = 'REGULARCOFFEE'\nPrefer\nSELECT col_A, col_B, col_C FROM projects.schema_name.table_name\nWHERE col_A ='REGULAR' and col_B = 'COFFEE'\n\n\nAmazon Redshift best practices for GROUP BY\nOrder multiple groupings by descending cardinality. Where possible, GROUP BY columns in order of descending cardinality. 
That is, group by columns with more unique values first (like IDs or phone numbers) before grouping by columns with fewer distinct values (like state or gender).\n\n\nAmazon Redshift best practices for HAVING\nOnly use HAVING for filtering aggregates. Before HAVING, filter out values using a WHERE clause before aggregating and grouping those values.\nSELECT col_A, sum(col_B) as total_amt\nFROM projects.schema_name.table_name\nWHERE col_C = 1617 and col_A='key'\nGROUP BY col_A\nHAVING sum(col_D)> 0\n\n\nAmazon Redshift best practices for UNION\nPrefer UNION All to UNION. If duplicates are not an issue, UNION ALL won’t discard them, and since UNION ALL isn’t tasked with removing duplicates, the query will be more efficient\n\n\nAmazon Redshift best practices for ORDER BY\nAvoid sorting where possible, especially in subqueries. If you must sort, make sure your subqueries are not needlessly sorting data.\nAvoid\nSELECT col_A, col_B, col_C\nFROM projects.schema_name.table_name\nWHERE col_B IN\n(SELECT col_A FROM projects.schema_name.table_name\nWHERE col_C = 534905 ORDER BY col_B);\nPrefer\nSELECT col_A, col_B, col_C\nFROM projects.schema_name.table_name\nWHERE col_A IN\n(SELECT col_B FROM projects.schema_name.table_name\nWHERE col_C = 534905);\n\n\nTroubleshooting Queries\nThere are several metrics for calculating the cost of the query in terms of storage, time, CPU utilization. However, these metrics require DBA permissions to execute. Follow up with ADRF support to get additional assistance.\nUsing the SVL_QUERY_SUMMARY view: To analyze query summary information by stream, do the following:\nStep 1: select query, elapsed, substring from svl_qlog order by query desc limit 5;\nStep 2: select * from svl_query_summary where query = MyQueryID order by stm, seg, step;\nExecution Plan: Lastly, an execution plan is a detailed step-by-step processing plan used by the optimizer to fetch the rows. It can be enabled in the database using the following procedure:\n\nClick on SQL Editor in the menu bar.\nClick on Explain Execution Plan.\n\nIt helps to analyze the major phases in the execution of a query. We can also find out which part of the execution is taking more time and optimize that sub-part. The execution plan shows which tables were accessed, what index scans were performed for fetching the data. If joins are present it shows how these tables were merged. Further, we can see a more detailed analysis view of each sub-operation performed during query execution." }, { "objectID": "appendix.html#aws-sources",