diff --git a/.nojekyll b/.nojekyll
index a9a16b4..31c6ff0 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-72fc965a
\ No newline at end of file
+979e106f
\ No newline at end of file
diff --git a/access.html b/access.html
index 900a9cc..75d36e7 100644
--- a/access.html
+++ b/access.html
@@ -238,7 +238,7 @@
You will receive an email invitation to activate your account. The email will come from http://okta.com, so please make sure that it doesn’t get caught in your email spam filter. Follow the steps outlined in the email to set up your password and your multi-factor authentication preferences. Click on the link below to watch a video walking through the steps.
After activating your account, you will be logged in to the ADRF Applications page. Proceed to the Data Stewardship application by clicking on the icon.
-In the Data Stewardship application, you will notice a “Tasks” section with a number of items you will need to complete before you can gain access to the project space. Refer to the next section for details about the onboarding process.
+In the Management Portal, you will notice an “Onboarding Tasks” section within “Admin Tasks” with a number of items you will need to complete before you can gain access to the project space. Refer to the next section for details about the onboarding process.
All data is stored under schemas in the projects database and is accessible through the following programs:
To establish a connection to Redshift in DBeaver, first double-click on the server you wish to connect to. In the example below I’m connecting to Redshift11_projects. A window will then appear asking for your Username and Password. This will be your user folder name, with adrf\ added before the username. Then click OK. You will now have access to your data stored on the Redshift11_projects server.
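The same connection can also be opened programmatically from R. Below is a minimal sketch, not a definitive recipe: it assumes the RJDBC package and the Amazon Redshift JDBC driver, and the driver path, host name, username, and password are all placeholders to replace with the values for your environment.
# Set options(java.parameters = ...) before loading any packages (see the
# R best practices below), then open a JDBC connection to Redshift.
library(RJDBC)

drv <- JDBC("com.amazon.redshift.jdbc42.Driver",
            "/path/to/redshift-jdbc42.jar")          # placeholder driver path

conn <- dbConnect(drv,
                  "jdbc:redshift://<host>:5439/projects",  # placeholder host
                  "adrf\\your_username",                   # adrf\ prefix, as above
                  "your_password")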
Best practices for loading large amounts of data in R
To ensure R can efficiently manage large amounts of data, please add the following lines of code to your R script before any packages are loaded:
options(java.parameters = c("-XX:+UseConcMarkSweepGC", "-Xmx8192m"))
gc()
Best practices for writing tables to Redshift
When writing an R data frame to Redshift, use the following code as an example:
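A minimal sketch, assuming the RJDBC connection conn opened earlier, a hypothetical data frame my_df, and placeholder schema and table names:
# Sketch only: append the data frame my_df to a placeholder Redshift table
# over the RJDBC connection conn opened earlier.
dbWriteTable(conn, "schema_name.table_name", my_df,
             overwrite = FALSE, append = TRUE)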
@@ -422,8 +422,8 @@
Developing your query. Here’s an example workflow to follow when developing a query.
Study the column and table metadata, which is accessible via the table definition. Each table definition can be displayed by clicking on the [+] next to the table name.
To get a feel for a table’s values, SELECT * from the tables you’re working with and LIMIT your results, keeping the LIMIT applied as you refine your columns (e.g., select * from [table name] LIMIT 1000).
Narrow down the columns to the minimal set required to answer your question.
Apply any filters to those columns.
If you need to aggregate data, aggregate a small number of rows, as in the sketch after this list.
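Putting the workflow together, here is a minimal sketch in R, reusing the hypothetical conn from the connection example above; the table, columns, and filter value are placeholders:
# 1. Peek at the table, keeping a LIMIT in place while refining your columns
peek <- dbGetQuery(conn, "SELECT * FROM projects.schema_name.table_name LIMIT 1000")

# 2. Narrow to the minimal set of columns and filter on the server
filtered <- dbGetQuery(conn, "SELECT col_A, col_B
                              FROM projects.schema_name.table_name
                              WHERE col_B = 'some_value'
                              LIMIT 1000")

# 3. Aggregate a small number of rows on the server rather than in R
agg <- dbGetQuery(conn, "SELECT col_A, COUNT(*) AS n
                         FROM projects.schema_name.table_name
                         WHERE col_B = 'some_value'
                         GROUP BY col_A
                         LIMIT 1000")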
The less data retrieved, the faster the query will run. Rather than applying too many filters on the client side, filter the data as much as possible on the server. This limits the data sent over the wire, and you’ll see the results much faster. In Amazon Redshift, use a LIMIT clause at the end of the query to cap the number of records returned.
SELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE [apply some filter] LIMIT 1000
Wildcard characters can be used as either a prefix or a suffix. Using a leading wildcard (%) in combination with a trailing wildcard will search all records for a match anywhere within the selected field.
Inefficient
Select col_A, col_B, col_C from projects.schema_name.table_name where col_A like '%BRO%'
This query will pull the expected results of Brown Sugar, Brownie, Brown Rice, and so on. However, it will also pull unexpected results, such as Country Brown, Lamb with Broth, and Cream of Broccoli.
Efficient
Select col_A, col_B, col_C from projects.schema_name.table_name where col_B like 'BRO%'
This query will pull only the expected results of Brownie, Brown Rice, Brown Sugar, and so on.
Efficient
select col_A from projects.schema_name.table_name A where exists (select 1 from projects.schema_name.table_name B where A.col_A = B.col_A) order by col_A;
With EXISTS, the subquery can stop as soon as it finds the first matching row, so the server avoids materializing and de-duplicating a full join.