From 976b0476896f1016f738254009028b83763c8894 Mon Sep 17 00:00:00 2001
From: Ty Garber Goals
reproducible and approachable.
The following are important for coding and non-coding projects +alike.
+filename_date_editorinitials
Knowing what type of data can be shared, and with whom, is often
+confusing. Generally, sport data can be shared freely with everyone,
+while commercial data has restrictions on who it can be shared with and
+how, for example under the Magnuson-Stevens Act (MSA).
+Sport data can usually be freely shared with the public, although there
+might be restrictions around sharing charter fishing data under the
+MSA.
+The MSA has to be considered when sharing commercial data; often the
+data has to be aggregated in a way that does not specifically identify fishers.
+In the co-management realm this is rarely an issue, as much of the data
+is aggregated, but for sharing with the public, guidance should be
+requested through WDFW’s Records Office.
+Public requests for treaty data should be directed to the individual
+tribes themselves or WDFW’s Records Office. Tribal data can be shared
+freely with the data’s respective tribe; sharing one tribe’s data with
+another tribe should be done with caution and with the guidance of NWIFC
+staff.
+For any kind of substantial work involving more than one file, use
@@ -394,10 +454,10 @@
When developing a document to report results or findings to a general user, use Rmarkdown or Quarto to create a report that blends R code with explanations and graphics.
-When a script is likely to be re-used (e.g. not a one-off analysis),
-create a commented version with instructions on use. This should be
-stored somewhere accessible. Collin Edwards maintains the
-snippets
github repository for
+
When a script is likely to be re-used (e.g. not a one-off analysis)
+or if it is going to be shared, create a commented version with
+instructions on use. This should be stored somewhere accessible. Collin
+Edwards maintains the snippets
github repository for
this kind of thing, or it could live in a Teams folder. If code is
likely to be useful to the team or others, it may also be appropriate to
incorporate this code into an R package, or develop a new R package for
@@ -474,23 +534,17 @@
When .zip
files are blacklisted by the recipient’s IT
department, an alternative would be the .7z
format from the
-7-zip software.
.zap
) and including instructions to
+change the file name back.
When working across multiple projects, it can be helpful if each
-project has a similar file structure. Ty and Collin will discuss what
-that should be, but good foundation is Ty’s approach, which has a
-data/
and a scripts/
subfolder. It may be
-helpful to also include a standardized readme with basic information
-(when project was started, what goal was, who was working on it). When
-Collin was working in an academic setting, he had a code
-snippet that he ran whenever starting a new project, which created
-his standardized folder structure and auto-populated a few key template
-files. We could think about writing something similar.
Draft file structure? CBE: slightly updated. I like to keep the -raw data in a separate folder from where cleaned / intermediate data -files live.
+project has a similar file structure. Your needs for individual projects
+may vary, but the following project structure is often a good option (or
+at least a good starting point).
project_folder
├── scripts
│ ├── data_clean.R # should save to `cleaned_data/`
@@ -506,6 +560,37 @@ Common project directory structure
│ └── some_spreadsheet.xlsx
├── .gitignore
└── project_folder.Rproj
+The idea with this folder structure is that:
+scripts/ contains R scripts used in the project.
+original_data/ contains the data files provided for this project (but not
+data files that are generated or cleaned in this project).
+cleaned_data/ contains any data files that are generated as part of this
+project (e.g., by cleaning and integrating data from original_data/), which
+can then be used for subsequent analyses in this project. The idea with this
+separation of data is that it makes it easier to ensure that original data
+files are never modified.
+figures/ contains image objects created as a part of this project. Depending
+on the project, this folder may not be used, but sometimes it’s appropriate
+to generate hundreds of figures programmatically (e.g. separate bar plots of
+fishery impacts for each stock).
+results/ contains non-image objects created as a part of this project. For
+example, if a project synthesizes data and produces summary .csv or .xlsx
+files, they would go in results/.
To streamline giving new projects this folder structure, the
+framrsquared package found here has the initialize_project() function.
+By default, this generates the folder structure above; optional arguments
+allow users to specify a different folder structure, copy template files
+for quarto documents, and initialize renv.
To reiterate, this project structure is not mandatory for good +coding. It’s simply a useful option.
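For readers who want to see what such a setup step involves without installing anything, here is a minimal base-R sketch. It is not the framrsquared implementation: `make_project_dirs()` is a hypothetical helper written for illustration, and it only creates the five subfolders described above inside a temporary directory.

```r
## Hypothetical helper (NOT part of framrsquared): create the standard subfolders.
make_project_dirs <- function(project_path,
                              folders = c("scripts", "original_data",
                                          "cleaned_data", "figures", "results")) {
  for (folder in folders) {
    ## recursive = TRUE also creates project_path itself if it is missing
    dir.create(file.path(project_path, folder),
               recursive = TRUE, showWarnings = FALSE)
  }
  ## return the created paths invisibly so they can be checked or re-used
  invisible(file.path(project_path, folders))
}

## Demonstrate in a temporary directory so nothing is written to a real project.
demo_path <- file.path(tempdir(), "project_folder")
created <- make_project_dirs(demo_path)
list.files(demo_path)
```

For real projects, the initialize_project() function mentioned above is likely the better choice, since it also handles template files and renv.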
if(interactive())
allows you to write code that
behaves differently when being compiled for a report than when it’s being
run interactively. This can be useful when developing parameterized
-reports.for maximum compatibility, use dashes rather than spaces or +underscores in file names. \(\LaTeX\), +which is sometimes used as a part of Rmarkdown and Quarto documents, +does not like spaces or underscores. This is most relevant when creating +image files that may be loaded into reports.
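As a small illustration of the `if(interactive())` pattern described above, the sketch below prints a quick diagnostic only when the code is being stepped through by hand. The variable names are placeholders, and the built-in `mtcars` data stands in for project data. Note that `interactive()` is typically `FALSE` when a report is rendered in a separate R process, but can be `TRUE` if the render is launched from the console.

```r
## `mtcars` is a built-in dataset used here as a stand-in for real project data.
dat <- mtcars

if (interactive()) {
  ## Only runs in an interactive session: quick checks that would clutter a report.
  print(summary(dat$mpg))
}

## Record which mode the code ran in (the labels are arbitrary).
run_mode <- if (interactive()) "interactive" else "rendered"
```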
Ensure that your code is reproducible by never saving / loading the environment. Scripts should include code to read in relevant files, and can save key objects for re-use later. In Rstudio, go to @@ -551,43 +642,37 @@
Ensure that figure titles are correct. When copy-pasting
-figure-generation code to make comparable figures for different parts of
-the data (e.g., different stocks or different fisheries), it’s easy to
-accidentally leave old titles in place, leading to confusion. Consider
-using paste()
or glue()
with variable names or
-even r functions so that the figure title auto-updates
-appropriately.
## "fragile" version of plotting an mtcar variable; copy-pasting and plotting a second variable requires careful updating of ggtitle()
-dat.plot <- data |>
- filter(fishery_title == "NT Area 10 Sport")
-ggplot(dat.plot, aes(x = stock, y = AEQ))+
- geom_col()+
- ggtitle("Chinook AEQ of NT Area 10 Sport")+
- coord_flip()
-
-## robust version:
-fishery_plot <- "NT Area 10 Sport" ## define the fishery to plot in one place at the top
-dat.plot <- data |>
- filter(fishery_title == fishery_plot) ## use variable in our filter function
-ggplot(dat.plot, aes(x = stock, y = AEQ))+
- geom_col()+
- ggtitle(paste("Chinook AEQ of", fishery_plot))+ ## use paste and variable name
- coord_flip()
-
-## alternative robust version:
-dat.plot <- data |>
- filter(fishery_title == "NT Area 10 Sport")
-ggplot(dat.plot, aes(x = stock, y = AEQ))+
- geom_col()+
- ggtitle(paste("Chinook AEQ of", dat.plot$fishery_title[1]))+ ## obtain the fishery name directly from dat.plot
- coord_flip()
-library()
rather than
+When loading libraries, use library()
rather than
require()
. Put all library calls at the top of the script,
so that users immediately encounter errors if they have not yet
-installed relevant libraries.
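The difference between the two loading functions can be seen with a package that is not installed. In this sketch, "notARealPackage" is a deliberately fake name, and `tryCatch()` is used only so the demonstration can continue past the error.

```r
## A deliberately fake package name, for illustration only.
fake_pkg <- "notARealPackage"

## require() only warns and returns FALSE, so a script would keep running
## and fail later, at the first call that needs the package.
require_result <- suppressWarnings(
  require(fake_pkg, character.only = TRUE, quietly = TRUE)
)

## library() stops with an error right away; tryCatch() captures it here
## so that the rest of this demo can run.
library_result <- tryCatch(
  {
    library(fake_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
```

Here `require_result` is `FALSE` and `library_result` is `"error"`, which is why `library()` calls at the top of a script surface missing packages immediately.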
To improve transparency, give R scripts a header with your name,
+the date, and a brief explanation of the script’s purpose. To streamline
+this process, consider adding a header
Rstudio
+snippet. We have a template snippet here;
+you can update this with your own name and then add it to your Rstudio’s
+snippets.
When running simulations or other code in which the outcomes of a
+run can differ due to randomness, it can be difficult and frustrating
+for others to attempt to replicate your work (or replicate an error).
+One key tool is to use set.seed()
at the beginning of a
+script. This will ensure that the randomness is repeated exactly every
+time the script is run. Note that since setting the seed prevents
+alternative random outcomes, it is unwise to do so when developing code,
+as your code will only ever represent one set of random
+outcomes.
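A brief sketch of how `set.seed()` makes random draws repeatable. The seed value 42 is arbitrary, and `rnorm()` stands in for whatever simulation code a script actually runs.

```r
set.seed(42)            # arbitrary seed value
first_run <- rnorm(3)

set.seed(42)            # resetting the same seed restarts the random sequence
second_run <- rnorm(3)

identical(first_run, second_run)  # the two runs match exactly

third_run <- rnorm(3)   # no reset here, so these draws continue the sequence
```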
In rare cases, R packages will work only for 32 bit R or only for
+64 bit R (historically, this was an issue for connecting to databases).
+Code that uses these packages will then only run on some computers,
+severely hampering our transparency and code sharing. Because of this,
+these packages should be avoided whenever reasonable. When there is no
+other option, there should be very clear commenting or documentation
+identifying this issue, so that users know immediately whether or not
+they will be able to run the code. If R functions exist for both 32 bit
+and 64 bit R but have different functions or syntax, consider supporting
+both architectures by including an if
statement;
+.Machine$sizeof.pointer
will return 8 in 64-bit R, and 4 in
+32-bit R.
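The architecture check described above might look like the following sketch. The messages are placeholders for whatever architecture-specific code a script needs.

```r
## .Machine$sizeof.pointer is 8 in 64-bit R and 4 in 32-bit R.
r_bits <- if (.Machine$sizeof.pointer == 8) 64 else 32

if (r_bits == 64) {
  ## placeholder: code that only works in 64-bit R would go here
  message("Running 64-bit R.")
} else {
  ## placeholder: the 32-bit alternative would go here
  message("Running 32-bit R.")
}
```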
Although Hadley Wickham recommends snake case for R scripting, R has no official naming convention. When writing a script a naming convention -should be chosen and be consistently used throughout the documents -entirety.
+should be chosen and then used consistently throughout the entire +document.<-
or =
,
although essentially equal, one should be chosen and used exclusively
-throughout the project.
+throughout the project. In Rstudio, [alt][-] is the default hotkey to create
+<-
.
magrittr
package. The ‘native pipe’ |>
was
introduced in R 4.1. These two perform essentially the same function,
but with different placeholders which can lead to various errors in
-scripts when mix. One pipe should be used in the document.
-
-
+scripts when mixed. One pipe should be used throughout the document. In Rstudio,
+[ctrl][shift][m] generates a pipe; you can set which type of pipe is
+generated in Tools > Global Options > Code, and check/uncheck the
+“Use native pipe operator…” box.
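To make the placeholder difference concrete, here is a short sketch using only the native pipe, with the magrittr equivalent shown as a comment so the example runs without extra packages. It assumes R 4.2 or later, since that is when the native `_` placeholder became available. `mtcars` is a built-in dataset used purely for illustration.

```r
## With either pipe, the left-hand side becomes the first argument by default.
sqrt_total <- c(1, 4, 9) |> sqrt() |> sum()

## The native placeholder is `_` and must be passed to a named argument.
fit <- mtcars |> lm(mpg ~ wt, data = _)

## The magrittr equivalent uses `.` as the placeholder and needs library(magrittr):
## fit <- mtcars %>% lm(mpg ~ wt, data = .)
```

Swapping one pipe for the other without also swapping the placeholder is a common source of the errors mentioned above.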
Axes should have clear, interpretable labels.
Colors should be easily distinguishable, including by folks with +common forms of color vision deficiencies.
viridis
package
makes it very easy to create high-contrast accessible graphics.base_size
. A good
-starting point is theme_bw(base_size = 16)
.#| fig-cap: "caption contents go here"
.Text size should be large enough for others to read comfortably.
+In our experience, this always means making the font size seemingly too
+large. When using ggplot, this can easily be achieved by using any of
+the built-in themes and including the optional argument
+base_size
. A good starting point is
+theme_bw(base_size = 16)
.
When there is some kind of nuance or complexity in interpreting a
+plot (e.g., an axis label can be misunderstood), this should be included
+in text associated with the plot. This is easy to do in Quarto or
+Rmarkdown, as we can add caveats or comments right below or above the
+associated R chunk. In quarto reports, an explicit figure caption can be
+added with #| fig-cap: "caption contents go here"
in the
+associated R chunk.
Sometimes we work with timeseries data using day of year as a
numeric (e.g., converting dates to values from 1 to 365). Plotting
results on a doy scale makes them difficult to interpret; instead, we
can use the doy_2md()
function here to translate back. In its simplest form, using this
function just requires including a
scale_x_continuous(labels = doy_2md)
call in your ggplot
-layers.
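The doy_2md() function itself is not reproduced in this document. As a rough stand-in that shows the idea, the hypothetical helper below converts day-of-year values to month-day labels in base R. The name `doy_to_month_day()`, the default non-leap reference year, and the label format are all assumptions made for this sketch.

```r
## Hypothetical stand-in for a day-of-year labelling function (not doy_2md() itself).
doy_to_month_day <- function(doy, year = 2023) {
  ## Day 1 corresponds to January 1 of the (non-leap) reference year.
  dates <- as.Date(doy - 1, origin = paste0(year, "-01-01"))
  ## "%m-%d" gives numeric month-day labels that do not depend on locale.
  format(dates, "%m-%d")
}

doy_labels <- doy_to_month_day(c(1, 32, 365))
doy_labels
```

A function of this shape (numeric vector in, character vector out) could be passed to the `labels` argument of `scale_x_continuous()` in the same way as described above.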
Ensure that figure titles are correct. When copy-pasting
+figure-generation code to make comparable figures for different parts of
+the data (e.g., different stocks or different fisheries), it’s easy to
+accidentally leave old titles in place, leading to confusion. Consider
+using paste()
or glue()
with variable names or
+even R functions so that the figure title auto-updates
+appropriately.
## "fragile" version; copy-pasting and plotting a different fishery requires careful updating of ggtitle()
+dat.plot <- data |>
+ filter(fishery_title == "NT Area 10 Sport")
+ggplot(dat.plot, aes(x = stock, y = AEQ))+
+ geom_col()+
+ ggtitle("Chinook AEQ of NT Area 10 Sport")+
+ coord_flip()
+
+## robust version:
+fishery_plot <- "NT Area 10 Sport" ## define the fishery to plot in one place at the top
+dat.plot <- data |>
+ filter(fishery_title == fishery_plot) ## use variable in our filter function
+ggplot(dat.plot, aes(x = stock, y = AEQ))+
+ geom_col()+
+ ggtitle(paste("Chinook AEQ of", fishery_plot))+ ## use paste and variable name
+ coord_flip()
+
+## alternative robust version:
+dat.plot <- data |>
+ filter(fishery_title == "NT Area 10 Sport")
+ggplot(dat.plot, aes(x = stock, y = AEQ))+
+ geom_col()+
+ ggtitle(paste("Chinook AEQ of", dat.plot$fishery_title[1]))+ ## obtain the fishery name directly from dat.plot
+ coord_flip()
plot
)make_fishery_plot()
, not
+fishery_plot()
). This name should not be the same as common
+R functions (e.g., don’t create a custom plotting function and call it
+plot
)assign()
). If you need a function
-to create several objects, have the function return a list of those
-objects. (Note that file manipulation is an obvious exception to the
-general aim to avoid side-effects in functions; functions can read or
-write)assign()
). If you need a function to create
+several objects, have the function return a list of those objects. (Note
+that file manipulation is a key exception to the general aim to avoid
+side-effects in functions; it is often appropriate to have functions
+read or write files)
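As a sketch of returning several objects in a list rather than creating them as side effects, consider the hypothetical function below. The function name and the catch values are invented for illustration.

```r
## Hypothetical example: the function name starts with a verb, and all results
## are returned in a single named list instead of being created with assign().
summarize_catch <- function(catch) {
  list(
    total = sum(catch),
    mean  = mean(catch),
    n     = length(catch)
  )
}

## Made-up catch values, for illustration only.
catch_summary <- summarize_catch(c(10, 20, 30))
catch_summary$total
```

The caller decides what to name the result and which elements to use, so nothing in the calling environment is overwritten unexpectedly.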
Setting up Git and linking it to Rstudio is an involved task, and we +recommend Happy Git with R as a +resource.
+At its simplest, git is a way to keep track of changes, and merge
+different, non-conflicting changes to the same documents. In this sense,
+you can think of it as a mix between Dropbox and a Google document,
+with more to learn but a lot more control and functionality. We
+recommend Chapter 20 of Happy Git with R for an overview. For simple
+tasks (e.g., working on your own project), a standard workflow is to
+pull (this makes sure your local version of the repository is up to
+date), and work in the repo. At good stopping points or key checkpoints
+in your work (e.g., you completed a specific task, or are stopping work on
+this project for the day), add any new files that were created, commit
+the repository, pull the remote to make sure you are up to date, and
+then push. See “Using git in Rstudio” for the terminal commands for this
+workflow.
+Once an Rproject is linked to a git repository, Rstudio will have a
+git button in the top menu (near the “go to file/function” field).
+Clicking this button, then “Commit”, opens an interface to create and
+control a git commit, then push it to the remote repo. However, this
+interface is often very slow when a project has many files. An
+alternative is to manage the commit in the “terminal” tab of the console
+window. Here you can type git commands, which are typically much faster
+for Rstudio to enact. Here is a typical commit process in the terminal,
+with explanations for each step.
+git add -A
+This adds any new files to tracking (ignoring files that are covered in
+.gitignore), so that they are included in the commit.
+git commit -a -m "commit message"
+This commits the current state of all tracked files. Replace the text in
+quote marks with an appropriate commit message (e.g., “Addressing issue
+#4, modified initialize_project() function”).
+git pull
+This updates your local version with the remote version, in case someone
+else has made changes. If there are changes and they conflict with your
+changes, git will ask you to address those.
+git push
+This updates the remote version with your commit and its associated
+changes.
Populate with links, basics of the work flow - branching and -pull requests - forking and pull requests - happy git with R link
+pull requests - forking and pull requests - Adapt git use example from
+BDS coding practices.
+If you have started making changes to a git repository and realize
+before committing the changes that your work should really be on a new
+branch, you can use the following git command to achieve this:
+git switch -c newbranchname
where “newbranchname” is replaced by an appropriate name for your new +branch.
+