Demonstration of an R workflow that implements encryption of shared/common files between collaborators
This repository demonstrates an example R workflow that implements
encryption of shared/common files between collaborators using the
{cyphr}
package. This demonstration uses the vignette found
here.
For this demonstration, the following scenario describes the use case of the example project team.
A small project team of 4 people is collaborating on a research project using cause of death (CoD) data. Given the nature of the data, the team’s ethical responsibilities and commitments to their respective institutions’ regulatory boards include ensuring that the raw CoD data and all of its derivatives are restricted to authorised research project team members. In addition, the raw CoD data and all of its derivatives must be kept encrypted when stored on each authorised team member’s computer. When sharing the data with each other, team members need to ensure that the raw CoD data and all of its derivatives are encrypted in transit and can only be decrypted by authorised research project team members.
Given this, the research project team needs to devise a project workflow that will satisfy the encryption requirements while at the same time allowing access of the data to all authorised research project team members. The research project team is using R for their data management, analysis, and reporting and uses GitHub for versioning.
The most appropriate tool in the R ecosystem that can support the
research project team in fulfilling the requirements for data protection
is the {cyphr}
package.
Following is the recommended/suggested R workflow that will meet the requirements for data protection as per the respective institution’s regulations.
The backbone of this recommended/suggested encryption workflow is the use of personal Secure Shell (SSH) protocol keys. Each authorised research project team member should create their personal SSH keys.
There is plenty of guidance available on the internet on how to do this. This guide is one of the most straightforward explanations of how to create your personal SSH keys.
Best practice when generating your personal SSH keys is to always provide a passphrase to encrypt your private key once it is generated and stored on your computer. Without a passphrase, anyone who gains access to your computer will also be able to use your personal SSH keys.
Note that this step should be done on the command line or terminal and not in the R console.
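Once the keys have been generated on the command line, it can help to confirm from R that a key pair exists where you expect it. The following is a minimal base R sketch that only checks for the files and does not create or read any keys. The helper name has_ssh_keypair() and the file names id_rsa and id_rsa.pub are assumptions made for illustration (they match the default path used later in this demonstration).

```r
# Hypothetical helper (not part of {cyphr}): report whether a directory
# holds both halves of an SSH key pair. The file names are assumptions.
has_ssh_keypair <- function(dir = "~/.ssh", name = "id_rsa") {
  dir <- path.expand(dir)
  private_key <- file.path(dir, name)
  public_key <- file.path(dir, paste0(name, ".pub"))
  file.exists(private_key) && file.exists(public_key)
}

# Check the default location
has_ssh_keypair()
```

If this returns FALSE, either the keys were created in a different directory or under a different file name, in which case pass that location to the helper instead.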
Setting up the project repository for encryption is a one-off step that should be done by the administrator, the research project team lead, or any other research project team member whose role it is to determine who has permission to access the data.
Other members of the research project team will not need to perform this step.
This step is done through the R console (directly or via an IDE, e.g.,
RStudio) and is facilitated by the {cyphr} package, which should
therefore be installed before performing these steps.
For this demonstration, we use a project repository structure where the
raw data will be placed within the data-raw
directory and the
processed raw data will be stored within the data
directory. So, we
will set up the key within the root directory of the project repository
for clarity and convenience when encrypting and decrypting files within
sub-directories.
To create a key, the following command should be issued in R:
cyphr::data_admin_init(".", path_user = path_key_admin)
where path_key_admin
is the path to the personal SSH keys generated by
the admin or research project team lead. For purposes of this
demonstration, let us say that the admin or research project team lead
created their SSH key in the default ~/.ssh
directory. So, the command
above can be issued as follows instead:
cyphr::data_admin_init(".", path_user = "~/.ssh/id_rsa")
or simply
cyphr::data_admin_init(".")
given that the cyphr::data_admin_init()
function will use the default
SSH key path when no path_user
is specified.
When running this command, the admin or the research project team lead
will be asked for the passphrase they created for their personal SSH key
(if they generated a passphrase). If the passphrase matches, then R will
generate a data key for the project repository and set up the project
for encryption. A directory named .cyphr
will be created
in the project root directory (since this is a hidden directory, select
Show hidden files in your file manager settings to see the directory).
This directory should be kept within the project repository and should
be committed to GitHub for versioning.
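The hidden directory can also be checked from R rather than from a file manager. This is a minimal base R sketch; the helper name is_cyphr_initialised() is an assumption made for illustration, and it only tests that the .cyphr directory exists, not that its contents are valid.

```r
# Hypothetical helper (not part of {cyphr}): TRUE if `path` contains
# a .cyphr directory.
is_cyphr_initialised <- function(path = ".") {
  dir.exists(file.path(path, ".cyphr"))
}

# Hidden entries are only listed when all.files = TRUE
list.files(".", all.files = TRUE, no.. = TRUE)

is_cyphr_initialised(".")
```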
Now that the admin or the research project team lead has set up the project repository for encryption, they can add encrypted data to the project.
For this demonstration, we will use the iris
dataset as our example
raw data. We will store an encrypted CSV copy of this dataset in the
data-raw
directory using the following commands:
## Get the admin key ----
admin_key <- cyphr::data_key(".", path_user = path_key_admin)
## Store encrypted raw data in data-raw directory ----
cyphr::encrypt(
  expr = write.csv(iris, "data-raw/iris.csv", row.names = FALSE),
  key = admin_key
)
To check whether the iris.csv
file in the data-raw
directory is
indeed encrypted, we can try to read it:
read.csv("data-raw/iris.csv")
which results in the following error:
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at '<b4>�/M+(<86><e1>�9�<99><aa><8b><cf>6A<f5>F~<bd><ff<9a><f5><92>t5<ef>{`<96><e0><92>iP�<e1><bd>'
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 3 appears to contain embedded nulls
But if we decrypt the file first and then read it:
cyphr::decrypt(
  expr = read.csv("data-raw/iris.csv"),
  key = admin_key
) |>
  head()
we are able to retrieve the data into R.
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
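The same encrypt/decrypt pattern is not limited to CSV files. The sketch below wraps a round trip through an encrypted .rds file using saveRDS() and readRDS(), which {cyphr} also supports. The helper name roundtrip_rds() is an assumption made for illustration, and the demo uses a throwaway symmetric key from the {sodium} package only so that it can run outside the project; in the project itself you would pass the data key (admin_key) instead.

```r
# Hypothetical helper (not part of {cyphr}): write an object to an
# encrypted .rds file and read it straight back with the same key.
roundtrip_rds <- function(x, path, key) {
  cyphr::encrypt(saveRDS(x, path), key)
  cyphr::decrypt(readRDS(path), key)
}

# Demo with a throwaway key (assumption: {cyphr} and {sodium} are installed)
if (requireNamespace("cyphr", quietly = TRUE) &&
    requireNamespace("sodium", quietly = TRUE)) {
  demo_key <- cyphr::key_sodium(sodium::keygen())
  demo_path <- tempfile(fileext = ".rds")
  restored <- roundtrip_rds(head(iris), demo_path, demo_key)
  identical(restored, head(iris))
}
```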
Given that the team uses GitHub for versioning, the next step for the admin or research project team lead is to distribute the project repository as currently structured to their research project team members and authorised collaborators. This will be done by adding them to the GitHub project repository as members/collaborators.
Once added, these collaborators can now clone the repository and get their own copies of the workflow on their own machines. When they clone their own copies, this includes the encryption setup made by the admin or research project team lead.
In order for research collaborators to be able to access and decrypt any encrypted data in the project repository, they will need to have created their own personal SSH key and then request to be added to the project. The data key can only be loaded once the request has been authorised:
## Request to be added ----
cyphr::data_request_access(".", path_user = path_key_collaborator1)
## Get the collaborator1 key (run after the request has been authorised) ----
collaborator1_key <- cyphr::data_key(".", path_user = path_key_collaborator1)
Once the research project team member or collaborator has made a request and pushed the resulting changes in the .cyphr directory to GitHub, the admin or research project team lead can pull them and approve the request as follows:
## View collaborator request ----
cyphr::data_admin_list_requests(".")
## Authorise collaborator request ----
cyphr::data_admin_authorise(".", yes = TRUE, path_user = path_key_admin)
We can check whether the collaborator key has been added by:
cyphr::data_admin_list_keys(".")
which now shows more than 1 key.
- Since authorised collaborators need to use their own keys to encrypt and decrypt, the step that creates the key object must know who the current user is and build the key object from the path to that user’s SSH key on their computer. Possible solutions to this are:
  - Manually create a key object outside of the reproducible workflow, specific to the user (but with the same key object name for all users). Once the key object is generated, all other steps of the workflow can be reproduced (including encryption and decryption).
  - Create a function that identifies who the current user is (maybe based on GitHub user credentials), specifies the path to the SSH keys based on this, and then generates the key object. This approach has the potential to be reproducible throughout.
  - Fix the SSH key location to the default location used by ssh-keygen (~/.ssh/id_rsa) so that the same path can be used to generate each user’s key object. This will likely work for macOS and Linux machines but will most likely not work for Windows machines.
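The second option above (a function that identifies the current user) could be sketched roughly as follows. This is an illustration under stated assumptions rather than a tested solution: it identifies the user by operating system user name via Sys.info() instead of GitHub credentials, and the names key_dirs and find_key_dir() as well as the example user names and paths are all hypothetical.

```r
# Hypothetical lookup of user name -> SSH key directory.
# The names and paths here are made-up example values.
key_dirs <- c(
  alice = "~/.ssh",
  bob = "C:/Users/bob/.ssh"
)

# Hypothetical helper: return the key directory for the current user,
# falling back to a default when the user is not in the lookup.
find_key_dir <- function(user = Sys.info()[["user"]],
                         lookup = key_dirs,
                         default = "~/.ssh") {
  if (user %in% names(lookup)) lookup[[user]] else default
}

path_key_user <- find_key_dir()

# The resulting path could then be used to create the key object, e.g.:
# user_key <- cyphr::data_key(".", path_user = path_key_user)
```

Because every user ends up with a key object of the same name, the rest of the workflow can stay identical across machines; only the lookup needs to be maintained as collaborators are added.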
-