From e47dd5f047cefc39cb89c613ffd92e30d3128d62 Mon Sep 17 00:00:00 2001 From: cedwards-dfw Date: Thu, 26 Sep 2024 17:04:11 -0700 Subject: [PATCH 1/2] minor tweaks to folder outline, added todo --- README.md | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 81a4bca..6dab055 100644 --- a/README.md +++ b/README.md @@ -4,6 +4,10 @@ Our evolving coding best practices document +## Todo + - CBE: read other R best practices docs, adapt parts that make sense to us. + - Populate / link directions on some of the suggestions (e.g., git, renv, etc) + ## General practices For any kind of substantial work involving more than one file, use Rprojects, the `here` package, and `renv` to make scripts easily shareable. The goal is that you can zip up a folder, send it to someone else, and they can run any scripts without making any changes. @@ -25,17 +29,22 @@ When working across multiple projects, it can be helpful if each project has a s Draft file structure? +*CBE: slightly updated. 
I like to keep the raw data in a separate folder from where cleaned / intermediate data files live.* + ``` project_folder ├── scripts -│ ├── data_clean.R +│ ├── data_clean.R # should save to `cleaned data/` │ └── analysis.R -├── data +├── original data │ ├── data.csv │ └── more_data.xlsx +├── cleaned data +│ ├── data_cleaned.csv ├── figures -├── results │ └── some_figure.png +├── results +│ └── some_spreadsheet.xlsx ├── .gitignore └── project_folder.Rproj ``` From 175a32db879306c68d8bcde0a95c18d3a8a1f614 Mon Sep 17 00:00:00 2001 From: cedwards-dfw Date: Fri, 27 Sep 2024 12:51:05 -0700 Subject: [PATCH 2/2] lots of new content, esp style stuff --- README.md | 84 ++++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 74 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 6dab055..4b07210 100644 --- a/README.md +++ b/README.md @@ -8,20 +8,30 @@ Our evolving coding best practices document - CBE: read other R best practices docs, adapt parts that make sense to us. - Populate / link directions on some of the suggestions (e.g., git, renv, etc) -## General practices +# Best practices + +## Overview + +Good coding practices make collaboration easier and faster, and reduce the frequency and consequences of bugs and problems. At least initially, adhering to good practices can feel like it adds unnecessary steps that slow progress. In the long run, however, we find that these practices save time. Further, they increase the transparency of our code, which in turn increases the overall transparency of our work. + +Below, we outline best practices organized into related topics. When it can be done succinctly, we provide explanations for *why* the practices save time. After the guidelines, we include some short tutorials and examples to show *how* to implement some of the less obvious practices. 
+ +## Project management For any kind of substantial work involving more than one file, use Rprojects, the `here` package, and `renv` to make scripts easily shareable. The goal is that you can zip up a folder, send it to someone else, and they can run any scripts without making any changes. When developing a document to report results or findings to a general user, use Rmarkdown or Quarto to create a report that blends R code with explanations and graphics. -When code is likely to be re-used (e.g. not a one-off analysis), create a commented version with instructions on use. This should be stored somewhere accessible. Collin maintains the `snippets` [github repository](https://github.com/FRAMverse/snippets) for this kind of thing, or it could live in a Teams folder. It may also be appropriate to incorporate this code into an R package, or develop a new R package for this code. Converting code to packages is much more involved that storing a code snippet somewhere, but makes it much easier for incorporation into other code. - -#### Using fundamental tools +When a script is likely to be re-used (e.g. not a one-off analysis), create a commented version with instructions on use. This should be stored somewhere accessible. Collin Edwards maintains the `snippets` [github repository](https://github.com/FRAMverse/snippets) for this kind of thing, or it could live in a Teams folder. If code is likely to be useful to the team or others, it may also be appropriate to incorporate this code into an R package, or develop a new R package for this code. Converting code to packages is much more involved than storing a code snippet or re-usable script somewhere, but makes it much easier to incorporate into other code. -*Need to give directions for starting Rprojects, using the here package, using `renv`* +To make scripts easier to re-use, replace hard-coded specifics with variables that are defined at the top of the script. 
For example, if Collin wrote a script to read in the Mortalities table of a FRAM database and plot the landed catch for a specific fishery, he would probably initially write that script using the file name and fishery name wherever he needed it (e.g., `connect_fram_db("FramDBExample.Mdb")` and `data |> filter(fishery_id == 19) |> ...`). To make this script easier to re-use, he could add lines of code near the top of the script, with -- Code that is meant to be shared should not include a hard-coded setwd() or file paths based on the local machine directory structure. Function calls that require file paths should be relative, such that someone with a copy of the project directory can run the script without needing to change those file paths. +```r +file_use <- "FramDBExample.Mdb" +fishery_use <- 19 +``` +and then replace any hard-coded uses of the filename and fishery ID with those variables (e.g., `connect_fram_db(file_use)` and `data |> filter(fishery_id == fishery_use) |> ...`). This makes it very easy to re-use for a different case -- simply update the lines defining `file_use` and `fishery_use`. #### Common project directory structure @@ -72,14 +82,51 @@ con <- DBI::dbConnect(dsn='') ## R Practices -- Ensure that your code is reproducible by never saving / loading the environment. Scripts should include code to read in relevant files, and can save key objects for re-use later. In Rstudio, go to `Tools > Global Options` and in the `General` section, make sure that "Restore .Rdata into workspace on startup" is NOT checked, and make sure that "Save worskpace to .Rdata on exit:" dropdown is set to "Never" +- Ensure that your code is reproducible by never saving / loading the environment. Scripts should include code to read in relevant files, and can save key objects for re-use later. 
In Rstudio, go to `Tools > Global Options` and in the `General` section, make sure that "Restore .Rdata into workspace on startup" is NOT checked, and make sure that the "Save workspace to .Rdata on exit:" dropdown is set to "Never". + +- Code that is meant to be shared should not include a hard-coded setwd() or file paths based on the local machine directory structure. Function calls that require file paths should be relative, such that someone with a copy of the project directory can run the script without needing to change those file paths. + +- Ensure that figure titles are correct. When copy-pasting figure-generation code to make comparable figures for different parts of the data (e.g., different stocks or different fisheries), it's easy to accidentally leave old titles in place, leading to confusion. Consider using `paste()` or `glue()` with variable names or even R functions so that the figure title auto-updates appropriately. + +```r +## "fragile" version: the fishery name is hard-coded in both filter() and ggtitle(); copy-pasting to plot a second fishery requires carefully updating both +dat.plot <- data |> + filter(fishery_title == "NT Area 10 Sport") +ggplot(dat.plot, aes(x = stock, y = AEQ))+ + geom_col()+ + ggtitle("Chinook AEQ of NT Area 10 Sport")+ + coord_flip() + +## robust version: +fishery_plot <- "NT Area 10 Sport" ## define the fishery to plot in one place at the top +dat.plot <- data |> + filter(fishery_title == fishery_plot) ## use variable in our filter function +ggplot(dat.plot, aes(x = stock, y = AEQ))+ + geom_col()+ + ggtitle(paste("Chinook AEQ of", fishery_plot))+ ## use paste and variable name + coord_flip() + +## alternative robust version: +dat.plot <- data |> + filter(fishery_title == "NT Area 10 Sport") +ggplot(dat.plot, aes(x = stock, y = AEQ))+ + geom_col()+ + ggtitle(paste("Chinook AEQ of", dat.plot$fishery_title[1]))+ ## obtain the fishery name directly from dat.plot + coord_flip() +``` + +- When loading libraries, use `library()` rather than 
`require()`. Put all library calls at the top of the script, so that users immediately encounter errors if they have not yet installed relevant libraries. ## Style guide -(Ty's plan, Collin has regrets) +The following are good general practices, but specific style choices are often a matter of taste. Consistency is the most important part -- use the same style throughout your script. -- Snakecase for variable names. E.g. `chinook_landed_catch`. -- `<-` for assignment rather than `=` +- Snakecase for variable names. E.g. `chinook_landed_catch`. *Using separators in variable names makes them easier to read. Using periods as separators becomes ambiguous when dealing with S3 methods.* +- Use `<-` for assignment rather than `=`. Always ensure there is a space before and after the assignment operator. *This helps with visually distinguishing the assignment `x <- 10` and the test `x < -10`.* +- There's no cost to spreading code across more lines. When in doubt, break really long / complex lines into more, shorter lines; create intermediate variables if necessary. When using pipes, put each pipe operation on its own line. +- We recommend using "Code > Reindent Code" (select all, then Ctrl-I) and "Code > Reformat Code" (select all, then Ctrl-Shift-A) to make code easier to read. +- Avoid creating variables that share names with common functions (e.g., use `x_mean <- mean(x)` instead of `mean <- mean(x)`, and `cur_plot <- ggplot(...` instead of `plot <- ggplot(...`). +- Where possible, use names instead of numbers when indexing named vectors, dataframes, or lists (e.g., `mtcars$cyl` or `mtcars[, "cyl"]` rather than `mtcars[, 2]`). ## Visualization @@ -95,6 +142,14 @@ We often need to create graphics to show aspects of the data. There is no one-si - When there is complexity in interpreting a plot, this should be included in text associated with the plot. This is easy to do in Quarto or Rmarkdown, as we can add caveates or comments right below or above the associated R chunk. 
In quarto reports, an explicit figure caption can be added with `#| fig-cap: "caption contents go here"`. - Sometimes we work with timeseries data using day of year as a numeric (e.g., converting dates to values from 1 to 365). Plotting results on a doy scale makes them difficult to interpret; instead, we can use the [doy_2md() function here](https://github.com/FRAMverse/snippets/blob/main/R/doy_2md.R) to translate back. In it's simplest form, using this function just requires including a `scale_x_continuous(labels = doy_2md)` call in your ggplot layers. +## Creating custom functions + +- Functions should have clear names, preferably involving a verb. This name should not be the same as common R functions (e.g., don't create a custom plotting function and call it `plot`). +- Functions should not rely on objects in the global environment; if the function needs an object, ensure that the object is an argument for the function. +- Whenever possible, avoid writing functions that rely on side-effects, particularly creating new variables in the global environment (e.g., with `assign()`). If you need a function to create several objects, have the function return a list of those objects. (Note that file manipulation is an obvious exception to the general aim to avoid side-effects in functions; functions can read or write files.) +- When writing functions to create graphics, the user has much better control if the function creates and returns a ggplot object instead of directly manipulating a graphics window using base R plotting functions. +- For longer scripts, consider separating the code into multiple scripts and using `source()` to call them from a single main script. This can be especially effective for scripts that contain many custom function definitions -- moving the functions to a separate script that gets `source()`ed at the top of the remaining code leads to a primary script that is easy to read, and a companion script that is just the definitions of functions. 
+ ## Version control When multiple people are collaborating on a project, it gets very important to be able to integrate changes in an intelligent way (emailing different versions of a zipped folder back and forth is not a good idea). Git and Github are the best tool for this, and Rstudio now supports using github to manage projects. @@ -117,3 +172,12 @@ When multiple people are collaborating on a project, it gets very important to b #### Other tips - The optional `inst/` folder of a package can hold misc files which are consistently accessible from the package functions. This allows us to develop packages to automate reporting -- we have a .qmd file in the `inst/` folder, and then a package in the function can copy that .qmd file to an appropriate folder (based on arguments), compile an html or pdf from the .qmd file, and then delete the .qmd file. See `TAMMsupport::tamm_report` for an example of this. + +# Appendix: help with implementation + +## Project management + +- link or description to starting Rprojects +- link or explanation for using `here::here()` +- link or explanation for using `renv` +
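
As a starting point for the `here::here()` item above, the following is a minimal sketch rather than a finished tutorial. It assumes the `here` package is installed and reuses the folder and file names from the draft file structure (`original data/data.csv`, `cleaned data/data_cleaned.csv`); the object names `raw_path`, `clean_path`, and `raw` are illustrative only.

```r
library(here)  # assumption: installed beforehand with install.packages("here")

# here() builds file paths relative to the project root (e.g., the folder
# containing the .Rproj file), so no setwd() or machine-specific path is needed.
raw_path <- here("original data", "data.csv")
clean_path <- here("cleaned data", "data_cleaned.csv")

# Guarded so this sketch runs even when the example files are absent.
if (file.exists(raw_path)) {
  raw <- read.csv(raw_path)
  # Create the output folder if it does not exist yet, then save the copy.
  dir.create(dirname(clean_path), showWarnings = FALSE)
  write.csv(raw, clean_path, row.names = FALSE)
}
```

Because both paths are built from the project root, the same lines should work unchanged for anyone who receives a copy of the project folder.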