From 976b0476896f1016f738254009028b83763c8894 Mon Sep 17 00:00:00 2001 From: Ty Garber Date: Mon, 7 Oct 2024 15:06:06 -0700 Subject: [PATCH] data sharing guidance --- README.html | 370 ++++++++++++++++++++++++++++++++++++++-------------- README.md | 24 ++++ 2 files changed, 294 insertions(+), 100 deletions(-) diff --git a/README.html b/README.html index 7ea4812..4969580 100644 --- a/README.html +++ b/README.html @@ -364,7 +364,6 @@

Goals

reproducible and approachable.

@@ -384,6 +383,67 @@

Overview

tutorials and examples to show how to implement some of the less obvious practices.

+
+

Non-coding

+

The following are important for coding and non-coding projects +alike.

+ +
+
+

Data Sharing Guidance

+

Which types of data can be shared, and with whom, is often +confusing. Generally, sport data can be shared freely with everyone, +while commercial data has restrictions on who it can be shared with and how, +for example under the Magnuson-Stevens Act (MSA).

+
+

Sport

+

Sport data can usually be freely shared with the public, although there +might be restrictions on sharing charter fishing data under the +MSA.

+
+
+

Commercial

+

The MSA has to be considered when sharing commercial data; often the +data has to be aggregated in a way that does not identify individual fishers. +In the co-management realm this is rarely an issue, as much of the data +is aggregated, but for sharing with the public, guidance should be +requested through WDFW’s Records Office.

+
+
+

Treaty

+

Public requests for treaty data should be directed to the individual +tribes themselves or WDFW’s Records Office. Tribal data can be shared +freely with the data’s respective tribe; sharing one tribe’s data with +another tribe should be done with caution and with the guidance of NWIFC +staff.

+
+

Project management

For any kind of substantial work involving more than one file, use @@ -394,10 +454,10 @@

Project management

When developing a document to report results or findings to a general user, use Rmarkdown or Quarto to create a report that blends R code with explanations and graphics.

-

When a script is likely to be re-used (e.g. not a one-off analysis), -create a commented version with instructions on use. This should be -stored somewhere accessible. Collin Edwards maintains the -snippets github repository for +

When a script is likely to be re-used (i.e., not a one-off analysis) +or if it is going to be shared, create a commented version with +instructions on use. This should be stored somewhere accessible. Collin +Edwards maintains the snippets GitHub repository for this kind of thing, or it could live in a Teams folder. If code is likely to be useful to the team or others, it may also be appropriate to incorporate this code into an R package, or develop a new R package for @@ -474,23 +534,17 @@

Outside WDFW

be deleted from the 3rd party when the receipt is verified.

When .zip files are blacklisted by the recipient’s IT department, an alternative would be the .7z format from the -7-zip software.

+7-zip software. Sometimes zipped +files can successfully be emailed if the file name is changed to end in +something else (e.g., .zap) and instructions to +change the file name back are included.

Common project directory structure

When working across multiple projects, it can be helpful if each -project has a similar file structure. Ty and Collin will discuss what -that should be, but good foundation is Ty’s approach, which has a -data/ and a scripts/ subfolder. It may be -helpful to also include a standardized readme with basic information -(when project was started, what goal was, who was working on it). When -Collin was working in an academic setting, he had a code -snippet that he ran whenever starting a new project, which created -his standardized folder structure and auto-populated a few key template -files. We could think about writing something similar.

-

Draft file structure? CBE: slightly updated. I like to keep the -raw data in a separate folder from where cleaned / intermediate data -files live.

+project has a similar file structure. Your needs for individual projects +may vary, but the following project structure is often a good option (or +at least a good starting point).

project_folder
 ├── scripts
 │   ├── data_clean.R # should save to `cleaned_data/`
@@ -506,6 +560,37 @@ 

Common project directory structure

│ └── some_spreadsheet.xlsx ├── .gitignore └── project_folder.Rproj
+

The idea with this folder structure is that:

+ +

To streamline giving new projects this folder structure, the +framrsquared package found here has the +initialize_project() function. By default, this generates +the folder structure above; optional arguments allow users to specify a +different folder structure, copy template files for quarto documents, +and initialize renv.

+

To reiterate, this project structure is not mandatory for good +coding. It’s simply a useful option.
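For readers who want to see roughly what a project-initializing helper does, here is a minimal base-R sketch. It is not the framrsquared implementation: the function name `make_project_skeleton()` and the `raw_data` folder name are assumptions for illustration, while `scripts` and `cleaned_data` follow the tree above.

```r
# Hypothetical helper (not part of framrsquared): create a standard set of
# subfolders inside a project directory.
make_project_skeleton <- function(path,
                                  folders = c("scripts", "raw_data", "cleaned_data")) {
  for (folder in folders) {
    # recursive = TRUE also creates `path` itself if it does not exist yet;
    # showWarnings = FALSE keeps re-runs quiet when folders already exist
    dir.create(file.path(path, folder), recursive = TRUE, showWarnings = FALSE)
  }
  invisible(path)
}

# Try it in a temporary location so nothing in a real project is touched
demo_path <- file.path(tempdir(), "project_folder")
make_project_skeleton(demo_path)
list.files(demo_path)
```

Passing a different `folders` vector gives a different layout, which mirrors the idea of optional arguments described above.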

Databases

@@ -531,7 +616,8 @@

Tips

  • using if(interactive()) allows you to write code that behaves differently when being compiled for a report than when it's being run interactively. This can be useful when developing parameterized -reports.
  • +reports, as the parameters will live in the YAML header, which is not +run in interactive mode.
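A minimal sketch of the `if(interactive())` pattern, assuming a parameterized report whose YAML header defines a parameter named `fishery`. Both that parameter name and the helper `resolve_fishery()` are hypothetical and only for illustration.

```r
# Hypothetical helper: pick the fishery to analyze depending on how the code runs.
resolve_fishery <- function(is_interactive = interactive()) {
  if (is_interactive) {
    # Interactive development: the YAML header has not been processed,
    # so fall back to a hard-coded development value.
    "NT Area 10 Sport"
  } else {
    # Rendering the report: `params` is built from the YAML header.
    params$fishery
  }
}
```

In a report chunk this could be used as `fishery_plot <- resolve_fishery()`, so the same code works while developing line by line and when the report is rendered with a different parameter value.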
    @@ -539,6 +625,11 @@

    Tips

    R Practices

    -
    ## "fragile" version of plotting an mtcar variable; copy-pasting and plotting a second variable requires careful updating of ggtitle()
    -dat.plot <- data |> 
    -    filter(fishery_title == "NT Area 10 Sport")
    -ggplot(dat.plot, aes(x = stock, y = AEQ))+
    -   geom_col()+
    -   ggtitle("Chinook AEQ of NT Area 10 Sport")+
    -   coord_flip()
    -   
    -## robust version:
    -fishery_plot <- "NT Area 10 Sport" ## define the fishery to plot in one place at the top
    -dat.plot <- data |> 
    -    filter(fishery_title == fishery_plot) ## use variable in our filter function 
    -ggplot(dat.plot, aes(x = stock, y = AEQ))+
    -   geom_col()+
    -   ggtitle(paste("Chinook AEQ of", fishery_plot))+ ## use paste and variable name
    -   coord_flip()
    -   
    -## alternative robust version:
    -dat.plot <- data |> 
    -    filter(fishery_title == "NT Area 10 Sport")
    -ggplot(dat.plot, aes(x = stock, y = AEQ))+
    -   geom_col()+
    -   ggtitle(paste("Chinook AEQ of", dat.plot$fishery_title[1]))+ ## obtain the fishery name directly from dat.plot
    -   coord_flip()
    -
    @@ -652,8 +737,8 @@

    Naming Conventions

    Although Hadley Wickham recommends snake case for R scripting, R has no official naming convention. When writing a script a naming convention -should be chosen and be consistently used throughout the documents -entirety.

    +should be chosen and then used consistently throughout the entire +document.

    Assignment Operators

    @@ -662,7 +747,8 @@

    Assignment Operators

    well as their directional reversals. The vast majority of your assignments will either be <- or =, although essentially equal, one should be chosen and used exclusively -throughout the project.

    +throughout the project. In RStudio, [alt][-] ([option][-] on a Mac) is the default hotkey to create +<-.

    Pipes

    @@ -670,21 +756,10 @@

    Pipes

    magrittr package. The ‘native pipe’ |> was introduced in R 4.1. These two perform essentially the same function, but with different placeholders which can lead to various errors in -scripts when mix. One pipe should be used in the document.

    - - +scripts when mixed. One pipe should be used in the document. In RStudio, +[ctrl][shift][m] generates a pipe; you can set which type of pipe is +generated in Tools > Global Options > Code by checking or unchecking the +“Use native pipe operator…” box.
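The placeholder difference can be seen in a short sketch using the built-in `mtcars` data. It assumes an R version recent enough to support the native pipe's `_` placeholder (R 4.2 or later), and the magrittr part only runs if that package is installed.

```r
# Native pipe: the placeholder is `_` and must be passed to a named argument
fit_native <- mtcars |> lm(mpg ~ wt, data = _)

# magrittr pipe: the placeholder is `.`
if (requireNamespace("magrittr", quietly = TRUE)) {
  `%>%` <- magrittr::`%>%`
  fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)
}

# Swapping the placeholders does not work: `data = _` after %>% is a syntax
# error, and `data = .` after |> looks for an object named `.`, which
# normally does not exist.
```

Both calls fit the same model; only the placeholder syntax differs, which is why mixing pipes within one script tends to produce confusing errors.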

    @@ -695,9 +770,9 @@

    Visualization

    good practices. The following is technically agnostic to packages, but suggestions are centered on ggplot2-based approaches

    +
    ## "fragile" version; copy-pasting and plotting a different fishery requires careful updating of ggtitle()
    +dat.plot <- data |> 
    +    filter(fishery_title == "NT Area 10 Sport")
    +ggplot(dat.plot, aes(x = stock, y = AEQ))+
    +   geom_col()+
    +   ggtitle("Chinook AEQ of NT Area 10 Sport")+
    +   coord_flip()
    +   
    +## robust version:
    +fishery_plot <- "NT Area 10 Sport" ## define the fishery to plot in one place at the top
    +dat.plot <- data |> 
    +    filter(fishery_title == fishery_plot) ## use variable in our filter function 
    +ggplot(dat.plot, aes(x = stock, y = AEQ))+
    +   geom_col()+
    +   ggtitle(paste("Chinook AEQ of", fishery_plot))+ ## use paste and variable name
    +   coord_flip()
    +   
    +## alternative robust version:
    +dat.plot <- data |> 
    +    filter(fishery_title == "NT Area 10 Sport")
    +ggplot(dat.plot, aes(x = stock, y = AEQ))+
    +   geom_col()+
    +   ggtitle(paste("Chinook AEQ of", dat.plot$fishery_title[1]))+ ## obtain the fishery name directly from dat.plot
    +   coord_flip()

    Creating custom functions

    Package development guidelines

    @@ -778,7 +948,7 @@

    Package development guidelines

  • When in doubt, we should have more packages that are small and focused.
  • Before developing something new, make sure there isn’t an existing -tool we can use (I’m looking at YOU/ME, Collin)
  • +tool we can use (I’m looking at YOU-ME, Collin)

    Tips for clean checks

    diff --git a/README.md b/README.md index 7bb1fbb..4393051 100644 --- a/README.md +++ b/README.md @@ -34,6 +34,30 @@ The following are important for coding and non-coding projects alike. - Ensure that headers are consistent in different files that are meant to be combined. This includes capitalization and the use of spaces. Copy-pasting from a template file is a good way to ensure exactly identical headers in this case. - Ensure that categories have a consistent name in a given column. For example, we have encountered data sheets with a "Yes/No" type column in which "Yes" is a mix of "Y", "Yes", "yes", "yes ", and "yees". When read into R or another language, these will be treated as five different categories instead of one. Consider using data validation in Excel to constrain user inputs to intended values. This can also be used to ensure that fields that should contain numbers do not end up with character strings. - For more suggestions on good spreadsheet practices, see [Broman and Woo 2018](https://doi.org/10.1080/00031305.2017.1375989) + +## Data Sharing Guidance +Which types of data can be shared, and with whom, is often confusing. +Generally, sport data can be shared freely with everyone, while commercial data +has restrictions on who it can be shared with and *how*, for example under the +Magnuson-Stevens Act (MSA). + +### Sport +Sport data can usually be freely shared with the public, although there might be +restrictions on sharing charter fishing data under the MSA. + +### Commercial +The MSA has to be considered when sharing commercial data; often the data has +to be aggregated in a way that does not identify individual fishers. In the co-management +realm this is rarely an issue, as much of the data is aggregated, but for sharing +with the public, guidance should be requested through WDFW's Records Office. + +### Treaty +Public requests for treaty data should be directed to the individual tribes +themselves or WDFW's Records Office. 
Tribal data can be shared freely with the +data's respective tribe; sharing one tribe's data with another tribe should be +done with caution and with the guidance of NWIFC staff. + + ## Project management