generated from ernestguevarra/template-workflow-s3m
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
189 lines (126 loc) · 8.17 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Improving Nutrition Status for under 5 children in Zambezia and Nampula project baseline and endline survey data analysis and reporting workflow
<!-- badges: start -->
[![test analysis workflow](https://github.com/katilingban/zambezia-nampula-survey/actions/workflows/test-analysis-workflow.yaml/badge.svg)](https://github.com/katilingban/zambezia-nampula-survey/actions/workflows/test-analysis-workflow.yaml)
<!-- badges: end -->
This repository is a [`docker`](https://www.docker.com/get-started)-containerised, [`{targets}`](https://docs.ropensci.org/targets/)-based, [`{renv}`](https://rstudio.github.io/renv/articles/renv.html)-enabled [`R`](https://cran.r-project.org/) workflow developed for the data management, data analysis, and reporting of the **Improving Nutrition Status for under 5 Children in Zambezia and Nampula project baseline and endline survey**.
## Repository Structure
* `data/` contains intermediate data outputs produced by the workflow including an endline survey codebook describing all variables of the endline survey;
* `R/` contains functions created for use in this workflow;
* `reports/` contains literate code for R Markdown reports generated in the workflow;
* `outputs/` contains compiled reports and figures;
* `auth/` contains credentials for Google service account (see below for more information);
* `docs/` contains archived high frequency data checks reports produced during the implementation of the survey;
* `did/` contains R script for performing the difference-in-difference analysis using the results tables produced by the analysis pipeline/workflow;
* `_targets.R` file defines the steps in this workflow's data management and analysis pipeline.
## Reproducibility
### R package dependencies
This project requires `R 4.2.2 (2022-11-10 r83330)`. This project uses the `{renv}` framework to record R package dependencies and versions. Packages and versions used are recorded in `renv.lock` and code used to manage dependencies is in `renv/` and other files in the root project directory. On starting an R session in the working directory, `run renv::restore()` to install R package dependencies.
### System requirements
This project's data management and analysis pipeline was built to utilise multi-threaded parallel computing to make calculations faster and more efficient. This project's pipeline uses 4 computing threads. To run without issues, ensure that you use a machine with at least 5 computing threads when reproducing this pipeline (4 threads for the pipeline and the remaininig threads to run the machines regular processes).
With multi-threaded computation, the entire pipeline completes between 30 minutes to an hour (depending on CPU speed and RAM size). Without multi-threading, the entire pipeline completes between 3-5 hours.
### Data management and analysis
This project uses the `{targets}` package to create its data management and analysis pipeline as defined in the `_targets.R` file.
To execute the data management and processing workflow for baseline and endline survey data, run:
```R
targets::tar_make(baseline_data_weighted)
targets::tar_make(endline_data_weighted)
```
The schematic figure below summarises the steps in the data management and processing workflow:
```{r data_management_workflow, echo = FALSE, message = FALSE, results = 'asis'}
cat(
"```mermaid",
targets::tar_mermaid(
targets_only = TRUE,
names = c(baseline_data_weighted, endline_data_weighted),
legend = FALSE
),
"```",
sep = "\n"
)
```
This produces the following outputs:
* R object for processed baseline data that includes probability weights for use in weighted analysis;
* R object for processed endline data that includes probability weights for use in weighted analysis.
To execute the data analysis workflow for the baseline and endline survey data, run:
```R
targets::tar_make(dplyr::starts_with("baseline"))
targets::tar_make(dplyr::starts_with("endline"))
```
The schematic figure below summarises the steps in the data analysis workflow:
```{r data_analysis_workflow, echo = FALSE, message = FALSE, results = 'asis'}
cat(
"```mermaid",
targets::tar_mermaid(
targets_only = TRUE,
#names = c(dplyr::starts_with("overall")),
#shortcut = TRUE,
allow = dplyr::contains("results"),
legend = FALSE
),
"```",
sep = "\n"
)
```
This pipeline produces the following outputs:
* Baseline results tables in CSV and in XLSX format;
* Baseline results tables subset in CSV format which is used for the difference-in-difference analysis;
* Endline results tables in CSV and in XLSX format;
* Endline results tables subset in CSV format which is used for the difference-in-difference analysis.
These outputs are saved by the pipeline in the `outputs` directory with the following filenames:
* `baseline_results.csv`;
* `baseline_results.xlsx`;
* `baseline_results_subset.csv`;
* `endline_results.csv`;
* `endline_results.xlsx`;
* `endline_results_subset.csv`.
### Difference-in-difference analysis
The **difference-in-difference** analysis was performed using a single `R` script found in the `did/` directory.
To run the **difference-in-difference** analysis, one should first run the data analysis pipeline describe above using the following command in R:
```
targets::tar_make(dplyr::starts_with("baseline"))
targets::tar_make(dplyr::starts_with("endline"))
```
which will produce the necessary results outputs needed for the **difference-in-difference** analysis.
Then, one should run the **difference-in-difference** R script using the following R command:
```
source("did/did.R")
```
The outputs of the **difference-in-difference** R script are:
* an HTML file called `analysisDiD.html`;
* all the plots (in `.png` format) showing the **difference-in-difference** results for each indicator on which the analysis was performed.
These outputs are saved in the `did/` directory after the **difference-in-difference** R script is run.
This repository doesn't contain the outputs of data management and analysis workflows described above. The user is expected to use this code along with the data from **UNICEF Mozambique** (restricted; request access only; see below) to run the workflows and produce the outputs themselves. All outputs are available from **UNICEF Mozambique** on request.
## Encryption
This repository uses [`git-crypt`](https://github.com/AGWA/git-crypt) to enable transparent encryption and decryption of the `.env` file and the `auth/` directory.
The `.env` file contains:
* variables for authenticating with Google Cloud services account setup for this project for accessing the baseline survey data;
* variables for accessing the endline survey data direct from ONA;
* variables for `SMTP_PASSWORD` and `EMAIL_RECIPIENTS` which are used for sending the email updates for the high frequency data checks.
The `auth/` directory contains credentials for Google service account.
Those who would like to reproduce the results of this project will require ability to decrypt the `.env` file and the `auth/` directory.
To be able to work on this repository, a user/collaborator on this project will need to:
* Create their own **PGP (Pretty Good Privacy) public and private keys**; and,
* Share their public key to the authors and request for it to be added to the repository.
Once added, a collaborator can now decrypt the `.env` file and the `auth/` directory after pulling/cloning the repository by running:
```
git-crypt unlock
```
on the terminal.
## Authors
* Mark Myatt
* Ernest Guevarra
## License
The datasets for both baseline and endline survey (not included in this repository) are owned by **UNICEF Mozambique**. Access to these datasets is restricted. Communicate with **UNICEF Mozambique** to request access to the survey datasets.
The code included in this repository is owned by the authors and is licensed under a [GNU General Public License 3 (GPL-3)](https://opensource.org/licenses/GPL-3.0).
## Feedback
Feedback, bug reports and feature requests are welcome; file issues or seek support [here](https://github.com/katilingban/zambezia-nampula-survey/issues).