-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.Rmd
338 lines (265 loc) · 11.5 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
---
output: github_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
options(tibble.print_min = 6, tibble.print_max = 6)
```
# ruler: Rule Your Data
<!-- badges: start -->
[![Travis-CI Build Status](https://travis-ci.org/echasnovski/ruler.svg?branch=master)](https://travis-ci.org/echasnovski/ruler)
[![R-CMD-check](https://github.com/echasnovski/ruler/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/echasnovski/ruler/actions/workflows/R-CMD-check.yaml)
[![Coverage Status](https://codecov.io/gh/echasnovski/ruler/graph/badge.svg)](https://app.codecov.io/github/echasnovski/ruler?branch=master)
[![CRAN](https://www.r-pkg.org/badges/version/ruler?color=blue)](https://cran.r-project.org/package=ruler)
[![Dependencies](https://tinyverse.netlify.com/badge/ruler)](https://CRAN.R-project.org/package=ruler)
[![Downloads](http://cranlogs.r-pkg.org/badges/ruler)](https://cran.r-project.org/package=ruler)
<!-- badges: end -->
`ruler` offers a set of tools for creating tidy data validation reports using
[dplyr](https://dplyr.tidyverse.org) grammar of data manipulation. It is structured to be flexible and extendable in terms of creating rules and using their output.
To fully use this package a solid knowledge of `dplyr` is required. The key idea behind `ruler`'s design is to validate data by modifying regular `dplyr` code with as little overhead as possible.
Some functionality is powered by the [keyholder](https://echasnovski.github.io/keyholder/) package. It is highly recommended to use its supported functions during rule construction. All one- and two-table `dplyr` verbs applied to local data frames are supported and considered the most appropriate way to create rules.
This README is structured as follows:
- __Installation__ shows ways to install package.
- __Example__ shows the basic usage of `ruler` for exploration of obeying user-defined rules and its automatic validation.
- __Overview__ explains basic data and function types with design behind them.
- __Usage__ describes `ruler`'s capabilities in more detail.
- __Other packages for validation and assertions__ lists alternatives for described tasks.
## Installation
You can install current stable version from CRAN with:
```{r cran-installation, eval = FALSE}
install.packages("ruler")
```
Also you can install development version from github with:
```{r gh-installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("echasnovski/ruler")
```
## Example
```{r Example, error = TRUE, purl = FALSE}
# Utilities functions
is_integerish <- function(x) {
all(x == as.integer(x))
}
z_score <- function(x) {
abs(x - mean(x)) / sd(x)
}
# Define rule packs
my_packs <- list(
data_packs(
dims = . %>% summarise(nrow_low = nrow(.) >= 10, nrow_high = nrow(.) <= 15,
ncol_low = ncol(.) >= 20, ncol_high = ncol(.) <= 30)
),
group_packs(
vs_am_num = . %>% group_by(vs, am) %>% summarise(vs_am_low = n() >= 7),
.group_vars = c("vs", "am")
),
col_packs(
enough_col_sum = . %>%
summarise_if(is_integerish, rules(is_enough = sum(.) >= 14))
),
row_packs(
enough_row_sum = . %>%
filter(vs == 1) %>%
transmute(is_enough = rowSums(.) >= 200)
),
cell_packs(
dbl_not_outlier = . %>%
transmute_if(is.numeric, rules(is_not_out = z_score(.) < 1)) %>%
slice(-(1:5))
)
)
# Expose data to rules
mtcars_exposed <- mtcars %>% as_tibble() %>%
expose(my_packs)
# View exposure
mtcars_exposed %>% get_exposure()
# Assert any breaker
invisible(mtcars_exposed %>% assert_any_breaker())
```
## Overview
__Rule__ is a function which converts data unit of interest (data, group,
column, row, cell) to logical value indicating whether this object satisfies
certain condition.
__Rule pack__ is a function which combines several rules into one functional
block. The recommended way of creating rules is by creating packs right away with the use of `dplyr` and [magrittr](https://magrittr.tidyverse.org/)'s
pipe operator.
__Exposing__ data to rules means applying rules to data, collecting results in common format and attaching them to the data as an `exposure` attribute. In this way actual exposure can be done in multiple steps and also be a part of a general data preparation pipeline.
__Exposure__ is a format designed to contain uniform information about validation of different data units. For reproducibility it also saves information about applied packs. Basically exposure is a list with two elements:
1. __Packs info__: a [tibble](https://tibble.tidyverse.org/) with the following structure:
- _name_ \<chr\> : Name of the pack. If not set manually it will be imputed during exposure.
- _type_ \<chr\> : Name of pack type. Indicates which data unit pack checks.
- _fun_ \<list\> : List of rule pack functions.
- _remove_obeyers_ \<lgl\> : Whether rows about obeyers (data units that obey certain rule) were removed from report after applying pack.
2. __Tidy data validation report__: a `tibble` with the following structure:
- _pack_ \<chr\> : Name of rule pack from column 'name' in packs info.
- _rule_ \<chr\> : Name of the rule defined in rule pack.
- _var_ \<chr\> : Name of the variable which validation result is reported. Value '.all' is reserved and interpreted as 'all columns as a whole'. __Note__ that _var_ doesn't always represent the actual column in data frame: for group packs it represents the created group name.
- _id_ \<int\> : Index of the row in tested data frame which validation result is reported. Value 0 is reserved and interpreted as 'all rows as a whole'.
- _value_ \<lgl\> : Whether the described data unit obeys the rule.
There are four basic combinations of `var` and `id` values which define five basic data units:
- `var == '.all'` and `id == 0`: Data as a whole.
- `var != '.all'` and `id == 0`: Group (`var` shouldn't be an actual column name) or column (`var` should be an actual column name) as a whole.
- `var == '.all'` and `id != 0`: Row as a whole.
- `var != '.all'` and `id != 0`: Described cell.
With exposure attached to data one can perform different kinds of actions: exploration, assertion, imputation and so on.
## Usage
### Creating packs
#### Data packs
```{r Data pack}
# List of two rule packs for checking data properties
my_data_packs <- data_packs(
# data_dims is a pack name
data_dims = . %>% summarise(
# ncol and nrow are rule names
ncol = ncol(.) == 12,
nrow = nrow(.) == 32
),
# Data after subsetting should have number of rows in between 10 and 30
# Rules are applied separately
vs_1 = . %>% filter(vs == 1) %>%
summarise(
nrow_low = nrow(.) > 10,
nrow_high = nrow(.) < 30
)
)
```
#### Group packs
```{r Group pack}
# List of one nameless rule pack for checking group property
my_group_packs <- group_packs(
# Name will be imputed during exposure
. %>% group_by(vs, am) %>%
summarise(any_cyl_6 = any(cyl == 6)),
# One should supply grouping variables for correct interpretation of output
.group_vars = c("vs", "am")
)
```
#### Column packs
```{r Column pack}
# rules() defines function predicators with necessary name imputations
# List of two rule pack for checking certain columns' properties
my_col_packs <- col_packs(
sum_bounds = . %>% summarise_at(
# Check only columns with names starting with 'c'
vars(starts_with("c")),
rules(sum_low = sum(.) > 300, sum_high = sum(.) < 400)
),
# In the edge case of checking one column with one rule there is a need
# for forcing inclusion of names in the output of summarise_at().
# This is done with naming argument in vars()
vs_mean = . %>% summarise_at(vars(vs = vs), rules(mean(.) > 0.5))
)
```
#### Row packs
```{r Row packs}
z_score <- function(x) {
(x - mean(x)) / sd(x)
}
# List of one rule pack checking certain rows' property
my_row_packs <- row_packs(
row_mean = . %>% mutate(rowMean = rowMeans(.)) %>%
transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
# Check only rows 10-15
# Values in 'id' column of report will be based on input data (i.e. 10-15)
# and not on output data (1-6)
slice(10:15)
)
```
#### Cell packs
```{r Cell packs}
is_integerish <- function(x) {
all(x == as.integer(x))
}
# List of two cell pack checking certain cells' property
my_cell_packs <- cell_packs(
my_cell_pack_1 = . %>% transmute_if(
# Check only integer-like columns
is_integerish,
rules(is_common = abs(z_score(.)) < 1)
) %>%
# Check only rows 20-30
slice(20:30),
# The same edge case as in column rule pack
vs_side = . %>% transmute_at(vars(vs = "vs"), rules(. > mean(.)))
)
```
### Exposing
By default exposing removes obeyers.
```{r Expose removes obeyers by default}
mtcars %>%
expose(my_data_packs, my_group_packs) %>%
get_exposure()
```
One can leave obeyers by setting `.remove_obeyers` to `FALSE`.
```{r Expose can not remove obeyers}
mtcars %>%
expose(my_data_packs, my_group_packs, .remove_obeyers = FALSE) %>%
get_exposure()
```
By default `expose()` guesses the pack type if 'not-pack' function is supplied. This behaviour has some edge cases but is useful for interactive use.
```{r Expose can guess}
mtcars %>%
expose(
some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.)))
) %>%
get_exposure()
```
To write strict and robust code one can set `.guess` to `FALSE`.
```{r Expose can not guess, error = TRUE, purl = FALSE}
mtcars %>%
expose(
some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.))),
.guess = FALSE
) %>%
get_exposure()
```
### Acting after exposure
General actions are recommended to be done with `act_after_exposure()`. It takes two arguments:
- `.trigger` - a function which takes the data with attached exposure and returns `TRUE` if some action should be made.
- `.actor` - a function which takes the same argument as `.trigger` and performs some action.
If trigger didn't notify then the input data is returned untouched. Otherwise the output of `.actor()` is returned. __Note__ that `act_after_exposure()` is often used for creating side effects (printing, throwing error etc.) and in that case should invisibly return its input (to be able to use it with pipe).
```{r Acting after exposure}
trigger_one_pack <- function(.tbl) {
packs_number <- .tbl %>%
get_packs_info() %>%
nrow()
packs_number > 1
}
actor_one_pack <- function(.tbl) {
cat("More than one pack was applied.\n")
invisible(.tbl)
}
mtcars %>%
expose(my_col_packs, my_row_packs) %>%
act_after_exposure(
.trigger = trigger_one_pack,
.actor = actor_one_pack
) %>%
invisible()
```
`ruler` has function `assert_any_breaker()` which can notify about presence of any breaker in exposure.
```{r Assert any breaker, error = TRUE, purl = FALSE}
mtcars %>%
expose(my_col_packs, my_row_packs) %>%
assert_any_breaker()
```
## Other packages for validation and assertions
More leaned towards assertions:
- [assertr](https://github.com/ropensci/assertr)
- [assertthat](https://github.com/hadley/assertthat)
- [checkmate](https://github.com/mllg/checkmate)
- [ensurer](https://github.com/smbache/ensurer)
- [tester](https://github.com/gastonstat/tester)
- [sealr](https://github.com/uribo/sealr)
More leaned towards validation:
- [naniar](https://github.com/njtierney/naniar)
- [skimr](https://github.com/ropensci/skimr)
- [validate](https://github.com/data-cleaning/validate)