diff --git a/.Rbuildignore b/.Rbuildignore index d245bf8c3..65a0a7f67 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -48,3 +48,5 @@ references.bib ^CRAN-SUBMISSION$ docs ^.dev$ +^vignettes/s. +^vignettes/t. diff --git a/DESCRIPTION b/DESCRIPTION index c56e3f9dd..67e87fd6c 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,7 +1,7 @@ Type: Package Package: datawizard Title: Easy Data Wrangling and Statistical Transformations -Version: 0.11.0.4 +Version: 0.12.3.4 Authors@R: c( person("Indrajeet", "Patil", , "patilindrajeet.science@gmail.com", role = "aut", comment = c(ORCID = "0000-0003-1995-6531", Twitter = "@patilindrajeets")), @@ -21,10 +21,10 @@ Authors@R: c( person("Robert", "Garrett", , "rcg4@illinois.edu", role = "rev") ) Maintainer: Etienne Bacher -Description: A lightweight package to assist in key steps involved in any data - analysis workflow: (1) wrangling the raw data to get it in the needed form, - (2) applying preprocessing steps and statistical transformations, and - (3) compute statistical summaries of data properties and distributions. +Description: A lightweight package to assist in key steps involved in any data + analysis workflow: (1) wrangling the raw data to get it in the needed form, + (2) applying preprocessing steps and statistical transformations, and + (3) compute statistical summaries of data properties and distributions. It is also the data wrangling backend for packages in 'easystats' ecosystem. References: Patil et al. (2022) . 
License: MIT + file LICENSE @@ -33,10 +33,10 @@ BugReports: https://github.com/easystats/datawizard/issues Depends: R (>= 3.6) Imports: - insight (>= 0.20.0), + insight (>= 0.20.3), stats, utils -Suggests: +Suggests: bayestestR, boot, brms, @@ -49,7 +49,6 @@ Suggests: ggplot2 (>= 3.5.0), gt, haven, - htmltools, httr, knitr, lme4, @@ -68,12 +67,13 @@ Suggests: tibble, tidyr, withr -VignetteBuilder: +VignetteBuilder: knitr Encoding: UTF-8 Language: en-US Roxygen: list(markdown = TRUE) -RoxygenNote: 7.3.1 +RoxygenNote: 7.3.2 Config/testthat/edition: 3 Config/testthat/parallel: true Config/Needs/website: easystats/easystatstemplate +Remotes: easystats/insight diff --git a/NEWS.md b/NEWS.md index 0954f1214..8e329b0fb 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,4 +1,43 @@ -# datawizard 0.11.0.1 +# datawizard (development) + +BREAKING CHANGES + +* `data_rename()` now errors when the `replacement` argument contains `NA` values + or empty strings (#539). + +CHANGES + +* The `pattern` argument in `data_rename()` can also be a named vector. In this + case, names are used as values for the `replacement` argument (i.e. `pattern` + can be a character vector using ` = ""`). + +* Minor additions to `reshape_ci()` to work with forthcoming changes in the + `{bayestestR}` package. + +# datawizard 0.12.3 + +CHANGES + +* `demean()` (and `degroup()`) now also work for nested designs, if argument + `nested = TRUE` and `by` specifies more than one variable (#533). + +* Vignettes are no longer provided in the package, they are now only available + on the website. There is only one "Overview" vignette available in the package, + it contains links to the other vignettes on the website. This is because there + are CRAN errors occurring when building vignettes on macOS and we couldn't + determine the cause after multiple patch releases (#534). 
+ +# datawizard 0.12.2 + +* Remove `htmltools` from `Suggests` in an attempt of fixing an error in CRAN + checks due to failures to build a vignette (#528). + +# datawizard 0.12.1 + +This is a patch release to fix one error on CRAN checks occurring because of a +missing package namespace in one of the vignettes. + +# datawizard 0.12.0 BREAKING CHANGES @@ -21,6 +60,10 @@ CHANGES frame, where the first column contains name of the variable for which frequencies were calculated, and the second column contains the frequency table. +* `demean()` (and `degroup()`) now also work for cross-classified designs, or + more generally, for data with multiple grouping or cluster variables (i.e. + `by` can now specify more than one variable). + # datawizard 0.11.0 BREAKING CHANGES @@ -59,8 +102,8 @@ BREAKING CHANGES * The following arguments were deprecated in 0.5.0 and are now removed: - * in `data_to_wide()`: `colnames_from`, `rows_from`, `sep` - * in `data_to_long()`: `colnames_to` + * in `data_to_wide()`: `colnames_from`, `rows_from`, `sep` + * in `data_to_long()`: `colnames_to` * in `data_partition()`: `training_proportion` NEW FUNCTIONS @@ -79,7 +122,7 @@ CHANGES argument, to compute weighted frequency tables. `include_na` allows to include or omit missing values from the table. Furthermore, a `by` argument was added, to compute crosstables (#479, #481). - + # datawizard 0.9.1 CHANGES @@ -130,7 +173,7 @@ CHANGES * `unnormalize()` and `unstandardize()` now work with grouped data (#415). -* `unnormalize()` now errors instead of emitting a warning if it doesn't have the +* `unnormalize()` now errors instead of emitting a warning if it doesn't have the necessary info (#415). BUG FIXES @@ -153,7 +196,7 @@ BUG FIXES * Fixed issue in `data_filter()` where functions containing a `=` (e.g. when naming arguments, like `grepl(pattern, x = a)`) were mistakenly seen as - faulty syntax. + faulty syntax. * Fixed issue in `empty_column()` for strings with invalid multibyte strings. 
For such data frames or files, `empty_column()` or `data_read()` no longer @@ -190,14 +233,14 @@ CHANGES NEW FUNCTIONS -* `rowid_as_column()` to complement `rownames_as_column()` (and to mimic - `tibble::rowid_to_column()`). Note that its behavior is different from +* `rowid_as_column()` to complement `rownames_as_column()` (and to mimic + `tibble::rowid_to_column()`). Note that its behavior is different from `tibble::rowid_to_column()` for grouped data. See the Details section in the docs. * `data_unite()`, to merge values of multiple variables into one new variable. -* `data_separate()`, as counterpart to `data_unite()`, to separate a single +* `data_separate()`, as counterpart to `data_unite()`, to separate a single variable into multiple new variables. * `data_modify()`, to create new variables, or modify or remove existing @@ -220,7 +263,7 @@ BUG FIXES * `center()` and `standardize()` did not work for grouped data frames (of class `grouped_df`) when `force = TRUE`. - + * The `data.frame` method of `describe_distribution()` returns `NULL` instead of an error if no valid variable were passed (for example a factor variable with `include_factors = FALSE`) (#421). @@ -248,12 +291,12 @@ BUG FIXES # datawizard 0.7.0 -BREAKING CHANGES +BREAKING CHANGES * In selection patterns, expressions like `-var1:var3` to exclude all variables between `var1` and `var3` are no longer accepted. The correct expression is `-(var1:var3)`. This is for 2 reasons: - + * to be consistent with the behavior for numerics (`-1:2` is not accepted but `-(1:2)` is); * to be consistent with `dplyr::select()`, which throws a warning and only @@ -265,8 +308,8 @@ NEW FUNCTIONS or more variables into a new variable. * `mean_sd()` and `median_mad()` for summarizing vectors to their mean (or - median) and a range of one SD (or MAD) above and below. - + median) and a range of one SD (or MAD) above and below. 
+ * `data_write()` as counterpart to `data_read()`, to write data frames into CSV, SPSS, SAS, Stata files and many other file types. One advantage over existing functions to write data in other packages is that labelled (numeric) @@ -282,8 +325,8 @@ MINOR CHANGES * `data_rename()` gets a `verbose` argument. * `winsorize()` now errors if the threshold is incorrect (previously, it provided - a warning and returned the unchanged data). The argument `verbose` is now - useless but is kept for backward compatibility. The documentation now contains + a warning and returned the unchanged data). The argument `verbose` is now + useless but is kept for backward compatibility. The documentation now contains details about the valid values for `threshold` (#357). * In all functions that have arguments `select` and/or `exclude`, there is now one warning per misspelled variable. The previous behavior was to have only one @@ -304,7 +347,7 @@ BUG FIXES * Fix unexpected warning in `convert_na_to()` when `select` is a list (#352). * Fixed issue with correct labelling of numeric variables with more than nine unique values and associated value labels. - + # datawizard 0.6.5 @@ -336,7 +379,7 @@ NEW FUNCTIONS * `data_codebook()`: to generate codebooks of data frames. * New functions to deal with duplicates: `data_duplicated()` (keep all duplicates, - including the first occurrence) and `data_unique()` (returns the data, excluding + including the first occurrence) and `data_unique()` (returns the data, excluding all duplicates except one instance of each, based on the selected method). MINOR CHANGES @@ -346,15 +389,15 @@ MINOR CHANGES * The `include_bounds` argument in `normalize()` can now also be a numeric value, defining the limit to the upper and lower bound (i.e. the distance to 1 and 0). - -* `data_filter()` now works with grouped data. + +* `data_filter()` now works with grouped data. 
BUG FIXES * `data_read()` no longer prints message for empty columns when the data actually had no empty columns. - - * `data_to_wide()` now drops columns that are not in `id_cols` (if specified), + + * `data_to_wide()` now drops columns that are not in `id_cols` (if specified), `names_from`, or `values_from`. This is the behaviour observed in `tidyr::pivot_wider()`. # datawizard 0.6.3 @@ -786,4 +829,3 @@ NEW FUNCTIONS # datawizard 0.1.0 * First release. - diff --git a/R/data_rename.R b/R/data_rename.R index b8f213c7f..18f45657b 100644 --- a/R/data_rename.R +++ b/R/data_rename.R @@ -13,11 +13,15 @@ #' @param pattern Character vector. For `data_rename()`, indicates columns that #' should be selected for renaming. Can be `NULL` (in which case all columns #' are selected). For `data_addprefix()` or `data_addsuffix()`, a character -#' string, which will be added as prefix or suffix to the column names. +#' string, which will be added as prefix or suffix to the column names. For +#' `data_rename()`, `pattern` can also be a named vector. In this case, names +#' are used as values for the `replacement` argument (i.e. `pattern` can be a +#' character vector using ` = ""` and argument `replacement` +#' will be ignored then). #' @param replacement Character vector. Indicates the new name of the columns #' selected in `pattern`. Can be `NULL` (in which case column are numbered #' in sequential order). If not `NULL`, `pattern` and `replacement` must be -#' of the same length. +#' of the same length. If `pattern` is a named vector, `replacement` is ignored. #' @param rows Vector of row names. #' @param safe Do not throw error if for instance the variable to be #' renamed/removed doesn't exist. 
@@ -33,12 +37,14 @@ #' head(data_rename(iris, "FakeCol", "length")) # This doesn't #' head(data_rename(iris, c("Sepal.Length", "Sepal.Width"), c("length", "width"))) #' +#' # use named vector to rename +#' head(data_rename(iris, c(length = "Sepal.Length", width = "Sepal.Width"))) +#' #' # Reset names #' head(data_rename(iris, NULL)) #' #' # Change all #' head(data_rename(iris, replacement = paste0("Var", 1:5))) -#' #' @seealso #' - Functions to rename stuff: [data_rename()], [data_rename_rows()], [data_addprefix()], [data_addsuffix()] #' - Functions to reorder or remove columns: [data_reorder()], [data_relocate()], [data_remove()] @@ -66,11 +72,44 @@ data_rename <- function(data, insight::format_error("Argument `pattern` must be of type character.") } + # check if `pattern` has names, and if so, use as "replacement" + if (!is.null(names(pattern))) { + replacement <- names(pattern) + } + # name columns 1, 2, 3 etc. if no replacement if (is.null(replacement)) { replacement <- paste0(seq_along(pattern)) } + # coerce to character + replacement <- as.character(replacement) + + # check if `replacement` has no empty strings and no NA values + invalid_replacement <- is.na(replacement) | !nzchar(replacement) + if (any(invalid_replacement)) { + if (is.null(names(pattern))) { + # when user did not match `pattern` with `replacement` + msg <- c( + "`replacement` is not allowed to have `NA` or empty strings.", + sprintf( + "Following values in `pattern` have no match in `replacement`: %s", + toString(pattern[invalid_replacement]) + ) + ) + } else { + # when user did not name all elements of `pattern` + msg <- c( + "Either name all elements of `pattern` or use `replacement`.", + sprintf( + "Following values in `pattern` were not named: %s", + toString(pattern[invalid_replacement]) + ) + ) + } + insight::format_error(msg) + } + # if duplicated names in replacement, append ".2", ".3", etc. 
to duplicates # ex: c("foo", "foo") -> c("foo", "foo.2") if (anyDuplicated(replacement) > 0L) { diff --git a/R/demean.R b/R/demean.R index bbf7d2dfc..94bfc255f 100644 --- a/R/demean.R +++ b/R/demean.R @@ -12,7 +12,25 @@ #' @param select Character vector (or formula) with names of variables to select #' that should be group- and de-meaned. #' @param by Character vector (or formula) with the name of the variable that -#' indicates the group- or cluster-ID. +#' indicates the group- or cluster-ID. For cross-classified or nested designs, +#' `by` can also identify two or more variables as group- or cluster-IDs. If +#' the data is nested and should be treated as such, set `nested = TRUE`. Else, +#' if `by` defines two or more variables and `nested = FALSE`, a cross-classified +#' design is assumed. Note that `demean()` and `degroup()` can't handle a mix +#' of nested and cross-classified designs in one model. +#' +#' For nested designs, `by` can be: +#' - a character vector with the name of the variable that indicates the +#' levels, ordered from *highest* level to *lowest* (e.g. +#' `by = c("L4", "L3", "L2")`). +#' - a character vector with variable names in the format `by = "L4/L3/L2"`, +#' where the levels are separated by `/`. +#' +#' See also section _De-meaning for cross-classified designs_ and +#' _De-meaning for nested designs_ below. +#' @param nested Logical, if `TRUE`, the data is treated as nested. If `FALSE`, +#' the data is treated as cross-classified. Only applies if `by` contains more +#' than one variable. #' @param center Method for centering. `demean()` always performs #' mean-centering, while `degroup()` can use `center = "median"` or #' `center = "mode"` for median- or mode-centering, and also `"min"` @@ -31,168 +49,208 @@ #' @return #' A data frame with the group-/de-meaned variables, which get the suffix #' `"_between"` (for the group-meaned variable) and `"_within"` (for the -#' de-meaned variable) by default. +#' de-meaned variable) by default.
For cross-classified or nested designs, +#' the name pattern of the group-meaned variables is the name of the centered +#' variable followed by the name of the variable that indicates the related +#' grouping level, e.g. `predictor_L3_between` and `predictor_L2_between`. #' #' @seealso If grand-mean centering (instead of centering within-clusters) -#' is required, see [center()]. See [`performance::check_heterogeneity_bias()`] +#' is required, see [`center()`]. See [`performance::check_heterogeneity_bias()`] #' to check for heterogeneity bias. #' -#' @details -#' -#' \subsection{Heterogeneity Bias}{ -#' Mixed models include different levels of sources of variability, i.e. -#' error terms at each level. When macro-indicators (or level-2 predictors, -#' or higher-level units, or more general: *group-level predictors that -#' **vary** within and across groups*) are included as fixed effects (i.e. -#' treated as covariate at level-1), the variance that is left unaccounted for -#' this covariate will be absorbed into the error terms of level-1 and level-2 -#' (\cite{Bafumi and Gelman 2006; Gelman and Hill 2007, Chapter 12.6.}): -#' \dQuote{Such covariates contain two parts: one that is specific to the -#' higher-level entity that does not vary between occasions, and one that -#' represents the difference between occasions, within higher-level entities} -#' (\cite{Bell et al. 2015}). Hence, the error terms will be correlated with -#' the covariate, which violates one of the assumptions of mixed models -#' (iid, independent and identically distributed error terms). This bias is -#' also called the *heterogeneity bias* (\cite{Bell et al. 2015}). 
To -#' resolve this problem, level-2 predictors used as (level-1) covariates should -#' be separated into their "within" and "between" effects by "de-meaning" and -#' "group-meaning": After demeaning time-varying predictors, \dQuote{at the -#' higher level, the mean term is no longer constrained by Level 1 effects, -#' so it is free to account for all the higher-level variance associated -#' with that variable} (\cite{Bell et al. 2015}). -#' } -#' -#' \subsection{Panel data and correlating fixed and group effects}{ -#' `demean()` is intended to create group- and de-meaned variables -#' for panel regression models (fixed effects models), or for complex -#' random-effect-within-between models (see \cite{Bell et al. 2015, 2018}), -#' where group-effects (random effects) and fixed effects correlate (see -#' \cite{Bafumi and Gelman 2006}). This can happen, for instance, when -#' analyzing panel data, which can lead to *Heterogeneity Bias*. To -#' control for correlating predictors and group effects, it is recommended -#' to include the group-meaned and de-meaned version of *time-varying covariates* -#' (and group-meaned version of *time-invariant covariates* that are on -#' a higher level, e.g. level-2 predictors) in the model. By this, one can -#' fit complex multilevel models for panel data, including time-varying -#' predictors, time-invariant predictors and random effects. -#' } -#' -#' \subsection{Why mixed models are preferred over fixed effects models}{ -#' A mixed models approach can model the causes of endogeneity explicitly -#' by including the (separated) within- and between-effects of time-varying -#' fixed effects and including time-constant fixed effects. Furthermore, -#' mixed models also include random effects, thus a mixed models approach -#' is superior to classic fixed-effects models, which lack information of -#' variation in the group-effects or between-subject effects. 
Furthermore, -#' fixed effects regression cannot include random slopes, which means that -#' fixed effects regressions are neglecting \dQuote{cross-cluster differences -#' in the effects of lower-level controls (which) reduces the precision of -#' estimated context effects, resulting in unnecessarily wide confidence -#' intervals and low statistical power} (\cite{Heisig et al. 2017}). -#' } -#' -#' \subsection{Terminology}{ -#' The group-meaned variable is simply the mean of an independent variable -#' within each group (or id-level or cluster) represented by `by`. -#' It represents the cluster-mean of an independent variable. The regression -#' coefficient of a group-meaned variable is the *between-subject-effect*. -#' The de-meaned variable is then the centered version of the group-meaned -#' variable. De-meaning is sometimes also called person-mean centering or -#' centering within clusters. The regression coefficient of a de-meaned -#' variable represents the *within-subject-effect*. -#' } -#' -#' \subsection{De-meaning with continuous predictors}{ -#' For continuous time-varying predictors, the recommendation is to include -#' both their de-meaned and group-meaned versions as fixed effects, but not -#' the raw (untransformed) time-varying predictors themselves. The de-meaned -#' predictor should also be included as random effect (random slope). In -#' regression models, the coefficient of the de-meaned predictors indicates -#' the within-subject effect, while the coefficient of the group-meaned -#' predictor indicates the between-subject effect. -#' } -#' -#' \subsection{De-meaning with binary predictors}{ -#' For binary time-varying predictors, there are two recommendations. First -#' is to include the raw (untransformed) binary predictor as fixed effect -#' only and the *de-meaned* variable as random effect (random slope). 
-#' The alternative would be to add the de-meaned version(s) of binary -#' time-varying covariates as additional fixed effect as well (instead of -#' adding it as random slope). Centering time-varying binary variables to -#' obtain within-effects (level 1) isn't necessary. They have a sensible -#' interpretation when left in the typical 0/1 format (\cite{Hoffmann 2015, -#' chapter 8-2.I}). `demean()` will thus coerce categorical time-varying -#' predictors to numeric to compute the de- and group-meaned versions for -#' these variables, where the raw (untransformed) binary predictor and the -#' de-meaned version should be added to the model. -#' } -#' -#' \subsection{De-meaning of factors with more than 2 levels}{ -#' Factors with more than two levels are demeaned in two ways: first, these -#' are also converted to numeric and de-meaned; second, dummy variables -#' are created (binary, with 0/1 coding for each level) and these binary -#' dummy-variables are de-meaned in the same way (as described above). -#' Packages like \pkg{panelr} internally convert factors to dummies before -#' demeaning, so this behaviour can be mimicked here. -#' } -#' -#' \subsection{De-meaning interaction terms}{ There are multiple ways to deal -#' with interaction terms of within- and between-effects. A classical approach -#' is to simply use the product term of the de-meaned variables (i.e. -#' introducing the de-meaned variables as interaction term in the model -#' formula, e.g. `y ~ x_within * time_within`). This approach, however, -#' might be subject to bias (see \cite{Giesselmann & Schmidt-Catran 2020}). -#' \cr \cr -#' Another option is to first calculate the product term and then apply the -#' de-meaning to it. This approach produces an estimator \dQuote{that reflects -#' unit-level differences of interacted variables whose moderators vary -#' within units}, which is desirable if *no* within interaction of -#' two time-dependent variables is required. 
\cr \cr -#' A third option, when the interaction should result in a genuine within -#' estimator, is to "double de-mean" the interaction terms -#' (\cite{Giesselmann & Schmidt-Catran 2018}), however, this is currently -#' not supported by `demean()`. If this is required, the `wmb()` -#' function from the \pkg{panelr} package should be used. \cr \cr -#' To de-mean interaction terms for within-between models, simply specify -#' the term as interaction for the `select`-argument, e.g. -#' `select = "a*b"` (see 'Examples'). -#' } -#' -#' \subsection{Analysing panel data with mixed models using lme4}{ -#' A description of how to translate the -#' formulas described in *Bell et al. 2018* into R using `lmer()` -#' from \pkg{lme4} can be found in -#' [this vignette](https://easystats.github.io/parameters/articles/demean.html). -#' } +#' @section Heterogeneity Bias: +#' +#' Mixed models include different levels of sources of variability, i.e. +#' error terms at each level. When macro-indicators (or level-2 predictors, +#' or higher-level units, or more general: *group-level predictors that +#' **vary** within and across groups*) are included as fixed effects (i.e. +#' treated as covariate at level-1), the variance that is left unaccounted for +#' this covariate will be absorbed into the error terms of level-1 and level-2 +#' (_Bafumi and Gelman 2006; Gelman and Hill 2007, Chapter 12.6._): +#' "Such covariates contain two parts: one that is specific to the higher-level +#' entity that does not vary between occasions, and one that represents the +#' difference between occasions, within higher-level entities" (_Bell et al. 2015_). +#' Hence, the error terms will be correlated with the covariate, which violates +#' one of the assumptions of mixed models (iid, independent and identically +#' distributed error terms). This bias is also called the *heterogeneity bias* +#' (_Bell et al. 2015_). 
To resolve this problem, level-2 predictors used as +#' (level-1) covariates should be separated into their "within" and "between" +#' effects by "de-meaning" and "group-meaning": After demeaning time-varying +#' predictors, "at the higher level, the mean term is no longer constrained by +#' Level 1 effects, so it is free to account for all the higher-level variance +#' associated with that variable" (_Bell et al. 2015_). +#' +#' @section Panel data and correlating fixed and group effects: +#' +#' `demean()` is intended to create group- and de-meaned variables for panel +#' regression models (fixed effects models), or for complex +#' random-effect-within-between models (see _Bell et al. 2015, 2018_), where +#' group-effects (random effects) and fixed effects correlate (see +#' _Bafumi and Gelman 2006_). This can happen, for instance, when analyzing +#' panel data, which can lead to *Heterogeneity Bias*. To control for correlating +#' predictors and group effects, it is recommended to include the group-meaned +#' and de-meaned version of *time-varying covariates* (and group-meaned version +#' of *time-invariant covariates* that are on a higher level, e.g. level-2 +#' predictors) in the model. By this, one can fit complex multilevel models for +#' panel data, including time-varying predictors, time-invariant predictors and +#' random effects. +#' +#' @section Why mixed models are preferred over fixed effects models: +#' +#' A mixed models approach can model the causes of endogeneity explicitly +#' by including the (separated) within- and between-effects of time-varying +#' fixed effects and including time-constant fixed effects. Furthermore, +#' mixed models also include random effects, thus a mixed models approach +#' is superior to classic fixed-effects models, which lack information of +#' variation in the group-effects or between-subject effects. 
Furthermore, +#' fixed effects regression cannot include random slopes, which means that +#' fixed effects regressions are neglecting "cross-cluster differences in the +#' effects of lower-level controls (which) reduces the precision of estimated +#' context effects, resulting in unnecessarily wide confidence intervals and +#' low statistical power" (_Heisig et al. 2017_). +#' +#' @section Terminology: +#' +#' The group-meaned variable is simply the mean of an independent variable +#' within each group (or id-level or cluster) represented by `by`. It represents +#' the cluster-mean of an independent variable. The regression coefficient of a +#' group-meaned variable is the *between-subject-effect*. The de-meaned variable +#' is then the centered version of the group-meaned variable. De-meaning is +#' sometimes also called person-mean centering or centering within clusters. +#' The regression coefficient of a de-meaned variable represents the +#' *within-subject-effect*. +#' +#' @section De-meaning with continuous predictors: +#' +#' For continuous time-varying predictors, the recommendation is to include +#' both their de-meaned and group-meaned versions as fixed effects, but not +#' the raw (untransformed) time-varying predictors themselves. The de-meaned +#' predictor should also be included as random effect (random slope). In +#' regression models, the coefficient of the de-meaned predictors indicates +#' the within-subject effect, while the coefficient of the group-meaned +#' predictor indicates the between-subject effect. +#' +#' @section De-meaning with binary predictors: +#' +#' For binary time-varying predictors, there are two recommendations. First +#' is to include the raw (untransformed) binary predictor as fixed effect +#' only and the *de-meaned* variable as random effect (random slope). 
+#' The alternative would be to add the de-meaned version(s) of binary +#' time-varying covariates as additional fixed effect as well (instead of +#' adding it as random slope). Centering time-varying binary variables to +#' obtain within-effects (level 1) isn't necessary. They have a sensible +#' interpretation when left in the typical 0/1 format (_Hoffmann 2015, +#' chapter 8-2.I_). `demean()` will thus coerce categorical time-varying +#' predictors to numeric to compute the de- and group-meaned versions for +#' these variables, where the raw (untransformed) binary predictor and the +#' de-meaned version should be added to the model. +#' +#' @section De-meaning of factors with more than 2 levels: +#' +#' Factors with more than two levels are demeaned in two ways: first, these +#' are also converted to numeric and de-meaned; second, dummy variables +#' are created (binary, with 0/1 coding for each level) and these binary +#' dummy-variables are de-meaned in the same way (as described above). +#' Packages like **panelr** internally convert factors to dummies before +#' demeaning, so this behaviour can be mimicked here. +#' +#' @section De-meaning interaction terms: +#' +#' There are multiple ways to deal with interaction terms of within- and +#' between-effects. +#' +#' - A classical approach is to simply use the product term of the de-meaned +#' variables (i.e. introducing the de-meaned variables as interaction term +#' in the model formula, e.g. `y ~ x_within * time_within`). This approach, +#' however, might be subject to bias (see _Giesselmann & Schmidt-Catran 2020_). +#' +#' - Another option is to first calculate the product term and then apply the +#' de-meaning to it. This approach produces an estimator "that reflects +#' unit-level differences of interacted variables whose moderators vary +#' within units", which is desirable if *no* within interaction of +#' two time-dependent variables is required. 
This is what `demean()` does +#' internally when `select` contains interaction terms. +#' +#' - A third option, when the interaction should result in a genuine within +#' estimator, is to "double de-mean" the interaction terms +#' (_Giesselmann & Schmidt-Catran 2018_), however, this is currently +#' not supported by `demean()`. If this is required, the `wmb()` +#' function from the **panelr** package should be used. +#' +#' To de-mean interaction terms for within-between models, simply specify +#' the term as interaction for the `select`-argument, e.g. `select = "a*b"` +#' (see 'Examples'). +#' +#' @section De-meaning for cross-classified designs: +#' +#' `demean()` can handle cross-classified designs, where the data has two or +#' more groups at the higher (i.e. second) level. In such cases, the +#' `by`-argument can identify two or more variables that represent the +#' cross-classified group- or cluster-IDs. The de-meaned variables for +#' cross-classified designs are simply subtracting all group means from each +#' individual value, i.e. _fully cluster-mean-centering_ (see _Guo et al. 2024_ +#' for details). Note that de-meaning for cross-classified designs is *not* +#' equivalent to de-meaning of nested data structures from models with three or +#' more levels. Set `nested = TRUE` to explicitly assume a nested design. For +#' cross-classified designs, de-meaning is supposed to work for models like +#' `y ~ x + (1|level3) + (1|level2)`, but *not* for models like +#' `y ~ x + (1|level3/level2)`. Note that `demean()` and `degroup()` can't +#' handle a mix of nested and cross-classified designs in one model. +#' +#' @section De-meaning for nested designs: +#' +#' _Brincks et al. (2017)_ have suggested an algorithm to center variables for +#' nested designs, which is implemented in `demean()`. For nested designs, set +#' `nested = TRUE` *and* specify the variables that indicate the different +#' levels in descending order in the `by` argument. 
E.g., +#' `by = c("level4", "level3", "level2")` assumes a model like +#' `y ~ x + (1|level4/level3/level2)`. An alternative notation for the +#' `by`-argument would be `by = "level4/level3/level2"`, similar to the +#' formula notation. +#' +#' @section Analysing panel data with mixed models using lme4: +#' +#' A description of how to translate the formulas described in *Bell et al. 2018* +#' into R using `lmer()` from **lme4** can be found in +#' [this vignette](https://easystats.github.io/parameters/articles/demean.html). #' #' @references #' #' - Bafumi J, Gelman A. 2006. Fitting Multilevel Models When Predictors -#' and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the -#' American Political Science Association. +#' and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the +#' American Political Science Association. #' #' - Bell A, Fairbrother M, Jones K. 2019. Fixed and Random Effects -#' Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074 +#' Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074 #' #' - Bell A, Jones K. 2015. Explaining Fixed Effects: Random Effects -#' Modeling of Time-Series Cross-Sectional and Panel Data. Political Science -#' Research and Methods, 3(1), 133–153. +#' Modeling of Time-Series Cross-Sectional and Panel Data. Political Science +#' Research and Methods, 3(1), 133–153. +#' +#' - Brincks, A. M., Enders, C. K., Llabre, M. M., Bulotsky-Shearer, R. J., +#' Prado, G., and Feaster, D. J. (2017). Centering Predictor Variables in +#' Three-Level Contextual Models. Multivariate Behavioral Research, 52(2), +#' 149–163. https://doi.org/10.1080/00273171.2016.1256753 #' #' - Gelman A, Hill J. 2007. Data Analysis Using Regression and -#' Multilevel/Hierarchical Models. Analytical Methods for Social Research.
+#' Cambridge, New York: Cambridge University Press #' #' - Giesselmann M, Schmidt-Catran, AW. 2020. Interactions in fixed -#' effects regression models. Sociological Methods & Research, 1–28. -#' https://doi.org/10.1177/0049124120914934 +#' effects regression models. Sociological Methods & Research, 1–28. +#' https://doi.org/10.1177/0049124120914934 +#' +#' - Guo Y, Dhaliwal J, Rights JD. 2024. Disaggregating level-specific effects +#' in cross-classified multilevel models. Behavior Research Methods, 56(4), +#' 3023–3057. #' #' - Heisig JP, Schaeffer M, Giesecke J. 2017. The Costs of Simplicity: -#' Why Multilevel Models May Benefit from Accounting for Cross-Cluster -#' Differences in the Effects of Controls. American Sociological Review 82 -#' (4): 796–827. +#' Why Multilevel Models May Benefit from Accounting for Cross-Cluster +#' Differences in the Effects of Controls. American Sociological Review 82 +#' (4): 796–827. #' #' - Hoffman L. 2015. Longitudinal analysis: modeling within-person -#' fluctuation and change. New York: Routledge +#' fluctuation and change. New York: Routledge #' #' @examples #' @@ -223,6 +281,7 @@ demean <- function(x, select, by, + nested = FALSE, suffix_demean = "_within", suffix_groupmean = "_between", add_attributes = TRUE, @@ -238,6 +297,7 @@ demean <- function(x, x = x, select = select, by = by, + nested = nested, center = "mean", suffix_demean = suffix_demean, suffix_groupmean = suffix_groupmean, @@ -247,15 +307,12 @@ demean <- function(x, } - - - - #' @rdname demean #' @export degroup <- function(x, select, by, + nested = FALSE, center = "mean", suffix_demean = "_within", suffix_groupmean = "_between", @@ -274,20 +331,31 @@ degroup <- function(x, center <- match.arg(tolower(center), choices = c("mean", "median", "mode", "min", "max")) if (inherits(select, "formula")) { - # formula to character, remove "~", split at "+" + # formula to character, remove "~", split at "+". 
We don't use `all.vars()` + # here because we want to keep the interaction terms as they are select <- trimws(unlist( strsplit(gsub("~", "", insight::safe_deparse(select), fixed = TRUE), "+", fixed = TRUE), use.names = FALSE )) } + # handle different "by" options if (inherits(by, "formula")) { by <- all.vars(by) } + # we also allow lme4-syntax here: if by = "L4/L3/L2", we assume a nested design + if (length(by) == 1 && grepl("/", by, fixed = TRUE)) { + by <- insight::trim_ws(unlist(strsplit(by, "/", fixed = TRUE), use.names = FALSE)) + nested <- TRUE + } + + # identify interaction terms interactions_no <- select[!grepl("(\\*|\\:)", select)] interactions_yes <- select[grepl("(\\*|\\:)", select)] + # if we have interaction terms that should be de-meaned, calculate the product + # of the terms first, then demean the product if (length(interactions_yes)) { interaction_terms <- lapply(strsplit(interactions_yes, "*", fixed = TRUE), trimws) product <- lapply(interaction_terms, function(i) do.call(`*`, x[, i])) @@ -296,20 +364,22 @@ degroup <- function(x, select <- c(interactions_no, colnames(new_dat)) } - not_found <- setdiff(select, colnames(x)) - - if (length(not_found) && isTRUE(verbose)) { - insight::format_alert( - sprintf( - "%i variables were not found in the dataset: %s\n", - length(not_found), - toString(not_found) - ) + # check if all variables are present + not_found <- setdiff(c(select, by), colnames(x)) + + if (length(not_found)) { + insight::format_error( + paste0( + "Variable", + ifelse(length(not_found) > 1, "s ", " "), + text_concatenate(not_found, enclose = "\""), + ifelse(length(not_found) > 1, " were", " was"), + " not found in the dataset." + ), + .misspelled_string(colnames(x), not_found, "Possibly misspelled or not yet defined?") ) } - select <- intersect(colnames(x), select) - # get data to demean... 
dat <- x[, c(select, by)] @@ -366,37 +436,92 @@ degroup <- function(x, max = function(.gm) max(.gm, na.rm = TRUE), function(.gm) mean(.gm, na.rm = TRUE) ) - x_gm_list <- lapply(select, function(i) { - stats::ave(dat[[i]], dat[[by]], FUN = gm_fun) - }) - names(x_gm_list) <- select - # create de-meaned variables by subtracting the group mean from each individual value + # we allow disaggregating level-specific effects for cross-classified multilevel + # models (see Guo et al. 2024). Two levels should work as proposed by the authors, + # more levels also already work, but need to check the formula from the paper + # and validate results - x_dm_list <- lapply(select, function(i) dat[[i]] - x_gm_list[[i]]) - names(x_dm_list) <- select + if (length(by) == 1) { + # simple case: one level + group_means_list <- lapply(select, function(i) { + stats::ave(dat[[i]], dat[[by]], FUN = gm_fun) + }) + names(group_means_list) <- select + # create de-meaned variables by subtracting the group mean from each individual value + person_means_list <- lapply(select, function(i) dat[[i]] - group_means_list[[i]]) + } else if (nested) { + # nested design: by > 1, nested is explicitly set to TRUE + # We want: + # L3_between = xbar(k) + # L2_between = xbar(j,k) - xbar(k) + # L1_within = x(ijk) - xbar(jk) + # , where + # x(ijk) is the individual value / variable that is measured on level 1 + # xbar(k) <- ave(x_ijk, L3, FUN = mean), the group mean of the variable at highest level + # xbar(jk) <- ave(x_ijk, L3, L2, FUN = mean), the group mean of the variable at second level + group_means_list <- lapply(select, function(i) { + out <- lapply(seq_along(by), function(k) { + dat$higher_levels <- do.call(paste, c(dat[by[1:k]], list(sep = "_"))) + stats::ave(dat[[i]], dat$higher_levels, FUN = gm_fun) + }) + # subtract mean of higher level from lower level + for (j in 2:length(by)) { + out[[j]] <- out[[j]] - out[[j - 1]] + } + names(out) <- paste0(select, "_", by) + out + }) + # create de-meaned variables 
by subtracting the group mean from each individual value + person_means_list <- lapply( + # seq_along(select), + # function(i) dat[[select[i]]] - group_means_list[[i]][[length(by)]] + select, + function(i) { + dat$higher_levels <- do.call(paste, c(dat[by], list(sep = "_"))) + dat[[i]] - stats::ave(dat[[i]], dat$higher_levels, FUN = gm_fun) + } + ) + } else { + # cross-classified design: by > 1 + group_means_list <- lapply(by, function(j) { + out <- lapply(select, function(i) { + stats::ave(dat[[i]], dat[[j]], FUN = gm_fun) + }) + names(out) <- paste0(select, "_", j) + out + }) + # de-meaned variables for cross-classified design is simply subtracting + # all group means from each individual value + person_means_list <- lapply(seq_along(select), function(i) { + sum_group_means <- do.call(`+`, lapply(group_means_list, function(j) j[[i]])) + dat[[select[i]]] - sum_group_means + }) + } + # preserve names + names(person_means_list) <- select # convert to data frame and add suffix to column names - x_gm <- as.data.frame(x_gm_list) - x_dm <- as.data.frame(x_dm_list) + group_means <- as.data.frame(group_means_list) + person_means <- as.data.frame(person_means_list) - colnames(x_dm) <- sprintf("%s%s", colnames(x_dm), suffix_demean) - colnames(x_gm) <- sprintf("%s%s", colnames(x_gm), suffix_groupmean) + colnames(person_means) <- sprintf("%s%s", colnames(person_means), suffix_demean) + colnames(group_means) <- sprintf("%s%s", colnames(group_means), suffix_groupmean) if (isTRUE(add_attributes)) { - x_dm[] <- lapply(x_dm, function(i) { + person_means[] <- lapply(person_means, function(i) { attr(i, "within-effect") <- TRUE i }) - x_gm[] <- lapply(x_gm, function(i) { + group_means[] <- lapply(group_means, function(i) { attr(i, "between-effect") <- TRUE i }) } - cbind(x_gm, x_dm) + cbind(group_means, person_means) } diff --git a/R/reshape_ci.R b/R/reshape_ci.R index 99a670a2d..dcfc729a8 100644 --- a/R/reshape_ci.R +++ b/R/reshape_ci.R @@ -43,15 +43,20 @@ reshape_ci <- function(x, 
ci_type = "CI") { # Reshape if (length(unique(x$CI)) > 1) { if ("Parameter" %in% names(x)) { + idvar <- "Parameter" remove_parameter <- FALSE - } else { + } else if (is.null(attr(x, "idvars"))) { + idvar <- "Parameter" x$Parameter <- NA remove_parameter <- TRUE + } else { + idvar <- attr(x, "idvars") + remove_parameter <- FALSE } x <- stats::reshape( x, - idvar = "Parameter", + idvar = idvar, timevar = "CI", direction = "wide", v.names = c(ci_low, ci_high), diff --git a/R/standardize.models.R b/R/standardize.models.R index 6f5a1dfa8..a92ffe243 100644 --- a/R/standardize.models.R +++ b/R/standardize.models.R @@ -197,7 +197,7 @@ standardize.default <- function(x, ## ---- STANDARDIZE! ---- - w <- insight::get_weights(x, na_rm = TRUE) + w <- insight::get_weights(x, remove_na = TRUE) data_std <- standardize(data[do_standardize], robust = robust, diff --git a/cran-comments.md b/cran-comments.md index 58de89d2a..2c30d1287 100644 --- a/cran-comments.md +++ b/cran-comments.md @@ -4,7 +4,15 @@ ## revdepcheck results -We checked 17 reverse dependencies, comparing R CMD check results across CRAN and dev versions of this package. +We checked 18 reverse dependencies, comparing R CMD check results across CRAN and dev versions of this package. * We saw 0 new problems * We failed to check 0 packages + +## Other comments + +This is a patch release that should (hopefully) fix a failure occurring on macOS +when building vignettes. This only happens on macOS with R 4.3. We tried to +reproduce this locally and in CI with the same setup, but we couldn't. Hence, we +removed all vignettes (except for one "Overview"), they are now only available +on the website. 
diff --git a/inst/WORDLIST b/inst/WORDLIST index a3dd80b42..eda7dc71c 100644 --- a/inst/WORDLIST +++ b/inst/WORDLIST @@ -2,24 +2,31 @@ Analysing Asparouhov BMC Bafumi +Brincks +Bulotsky CMD Carle Catran Crosstables +Dhaliwal +Disaggregating DOI De Dom EFC +Enders EUROFAMCARE Fairbrother GLMM Gelman Giesecke Giesselmann +Guo Heisig Herrington Hoffmann Joanes +Llabre Lumley MADs Mattan @@ -79,6 +86,7 @@ midhinge modelbased modelling nd +panelr partialization patilindrajeets platykurtic diff --git a/man/coef_var.Rd b/man/coef_var.Rd index 0f0965076..92274ca59 100644 --- a/man/coef_var.Rd +++ b/man/coef_var.Rd @@ -79,14 +79,10 @@ This means that CV is \strong{NOT} invariant to shifting, but it is to scaling: \if{html}{\out{
<div class="sourceCode r">}}\preformatted{sandwiches <- c(0, 4, 15, 0, 0, 5, 2, 7) coef_var(sandwiches) #> [1] 1.239094 -}\if{html}{\out{</div>}} -\if{html}{\out{<div class="sourceCode r">}}\preformatted{ coef_var(sandwiches / 2) # same #> [1] 1.239094 -}\if{html}{\out{</div>}} -\if{html}{\out{<div class="sourceCode r">}}\preformatted{ coef_var(sandwiches + 4) # different! 0 is no longer meaningful! #> [1] 0.6290784 }\if{html}{\out{</div>
}} diff --git a/man/data_rename.Rd b/man/data_rename.Rd index f1f4de938..a45095805 100644 --- a/man/data_rename.Rd +++ b/man/data_rename.Rd @@ -46,7 +46,11 @@ data_rename_rows(data, rows = NULL) \item{pattern}{Character vector. For \code{data_rename()}, indicates columns that should be selected for renaming. Can be \code{NULL} (in which case all columns are selected). For \code{data_addprefix()} or \code{data_addsuffix()}, a character -string, which will be added as prefix or suffix to the column names.} +string, which will be added as prefix or suffix to the column names. For +\code{data_rename()}, \code{pattern} can also be a named vector. In this case, names +are used as values for the \code{replacement} argument (i.e. \code{pattern} can be a +character vector using \verb{ = ""} and argument \code{replacement} +will be ignored then).} \item{select}{Variables that will be included when performing the required tasks. Can be either @@ -104,7 +108,7 @@ functions (see 'Details'), this argument may be used as workaround.} \item{replacement}{Character vector. Indicates the new name of the columns selected in \code{pattern}. Can be \code{NULL} (in which case column are numbered in sequential order). If not \code{NULL}, \code{pattern} and \code{replacement} must be -of the same length.} +of the same length. 
If \code{pattern} is a named vector, \code{replacement} is ignored.} \item{safe}{Do not throw error if for instance the variable to be renamed/removed doesn't exist.} @@ -134,12 +138,14 @@ head(data_rename(iris, "Sepal.Length", "length")) head(data_rename(iris, "FakeCol", "length")) # This doesn't head(data_rename(iris, c("Sepal.Length", "Sepal.Width"), c("length", "width"))) +# use named vector to rename +head(data_rename(iris, c(length = "Sepal.Length", width = "Sepal.Width"))) + # Reset names head(data_rename(iris, NULL)) # Change all head(data_rename(iris, replacement = paste0("Var", 1:5))) - } \seealso{ \itemize{ diff --git a/man/demean.Rd b/man/demean.Rd index d03a1010b..8a9a49308 100644 --- a/man/demean.Rd +++ b/man/demean.Rd @@ -10,6 +10,7 @@ demean( x, select, by, + nested = FALSE, suffix_demean = "_within", suffix_groupmean = "_between", add_attributes = TRUE, @@ -21,6 +22,7 @@ degroup( x, select, by, + nested = FALSE, center = "mean", suffix_demean = "_within", suffix_groupmean = "_between", @@ -33,6 +35,7 @@ detrend( x, select, by, + nested = FALSE, center = "mean", suffix_demean = "_within", suffix_groupmean = "_between", @@ -48,7 +51,28 @@ detrend( that should be group- and de-meaned.} \item{by}{Character vector (or formula) with the name of the variable that -indicates the group- or cluster-ID.} +indicates the group- or cluster-ID. For cross-classified or nested designs, +\code{by} can also identify two or more variables as group- or cluster-IDs. If +the data is nested and should be treated as such, set \code{nested = TRUE}. Else, +if \code{by} defines two or more variables and \code{nested = FALSE}, a cross-classified +design is assumed. Note that \code{demean()} and \code{degroup()} can't handle a mix +of nested and cross-classified designs in one model. + +For nested designs, \code{by} can be: +\itemize{ +\item a character vector with the name of the variable that indicates the +levels, ordered from \emph{highest} level to \emph{lowest} (e.g. 
+\code{by = c("L4", "L3", "L2")}). +\item a character vector with variable names in the format \code{by = "L4/L3/L2"}, +where the levels are separated by \code{/}. +} + +See also section \emph{De-meaning for cross-classified designs} and +\emph{De-meaning for nested designs} below.} + +\item{nested}{Logical, if \code{TRUE}, the data is treated as nested. If \code{FALSE}, +the data is treated as cross-classified. Only applies if \code{by} contains more +than one variable.} \item{suffix_demean, suffix_groupmean}{String value, will be appended to the names of the group-meaned and de-meaned variables of \code{x}. By default, @@ -72,7 +96,10 @@ or \code{"max"}.} \value{ A data frame with the group-/de-meaned variables, which get the suffix \code{"_between"} (for the group-meaned variable) and \code{"_within"} (for the -de-meaned variable) by default. +de-meaned variable) by default. For cross-classified or nested designs, +the name pattern of the group-meaned variables is the name of the centered +variable followed by the name of the variable that indicates the related +grouping level, e.g. \code{predictor_L3_between} and \code{predictor_L2_between}. } \description{ \code{demean()} computes group- and de-meaned versions of a variable that can be @@ -81,46 +108,50 @@ used in regression analysis to model the between- and within-subject effect. \code{demean()} always uses mean-centering, \code{degroup()} can also use the mode or median for centering. } -\details{ -\subsection{Heterogeneity Bias}{ +\section{Heterogeneity Bias}{ + + Mixed models include different levels of sources of variability, i.e. error terms at each level. When macro-indicators (or level-2 predictors, or higher-level units, or more general: \emph{group-level predictors that \strong{vary} within and across groups}) are included as fixed effects (i.e. 
treated as covariate at level-1), the variance that is left unaccounted for this covariate will be absorbed into the error terms of level-1 and level-2 -(\cite{Bafumi and Gelman 2006; Gelman and Hill 2007, Chapter 12.6.}): -\dQuote{Such covariates contain two parts: one that is specific to the -higher-level entity that does not vary between occasions, and one that -represents the difference between occasions, within higher-level entities} -(\cite{Bell et al. 2015}). Hence, the error terms will be correlated with -the covariate, which violates one of the assumptions of mixed models -(iid, independent and identically distributed error terms). This bias is -also called the \emph{heterogeneity bias} (\cite{Bell et al. 2015}). To -resolve this problem, level-2 predictors used as (level-1) covariates should -be separated into their "within" and "between" effects by "de-meaning" and -"group-meaning": After demeaning time-varying predictors, \dQuote{at the -higher level, the mean term is no longer constrained by Level 1 effects, -so it is free to account for all the higher-level variance associated -with that variable} (\cite{Bell et al. 2015}). +(\emph{Bafumi and Gelman 2006; Gelman and Hill 2007, Chapter 12.6.}): +"Such covariates contain two parts: one that is specific to the higher-level +entity that does not vary between occasions, and one that represents the +difference between occasions, within higher-level entities" (\emph{Bell et al. 2015}). +Hence, the error terms will be correlated with the covariate, which violates +one of the assumptions of mixed models (iid, independent and identically +distributed error terms). This bias is also called the \emph{heterogeneity bias} +(\emph{Bell et al. 2015}). 
To resolve this problem, level-2 predictors used as +(level-1) covariates should be separated into their "within" and "between" +effects by "de-meaning" and "group-meaning": After demeaning time-varying +predictors, "at the higher level, the mean term is no longer constrained by +Level 1 effects, so it is free to account for all the higher-level variance +associated with that variable" (\emph{Bell et al. 2015}). } -\subsection{Panel data and correlating fixed and group effects}{ -\code{demean()} is intended to create group- and de-meaned variables -for panel regression models (fixed effects models), or for complex -random-effect-within-between models (see \cite{Bell et al. 2015, 2018}), -where group-effects (random effects) and fixed effects correlate (see -\cite{Bafumi and Gelman 2006}). This can happen, for instance, when -analyzing panel data, which can lead to \emph{Heterogeneity Bias}. To -control for correlating predictors and group effects, it is recommended -to include the group-meaned and de-meaned version of \emph{time-varying covariates} -(and group-meaned version of \emph{time-invariant covariates} that are on -a higher level, e.g. level-2 predictors) in the model. By this, one can -fit complex multilevel models for panel data, including time-varying -predictors, time-invariant predictors and random effects. +\section{Panel data and correlating fixed and group effects}{ + + +\code{demean()} is intended to create group- and de-meaned variables for panel +regression models (fixed effects models), or for complex +random-effect-within-between models (see \emph{Bell et al. 2015, 2018}), where +group-effects (random effects) and fixed effects correlate (see +\emph{Bafumi and Gelman 2006}). This can happen, for instance, when analyzing +panel data, which can lead to \emph{Heterogeneity Bias}. 
To control for correlating +predictors and group effects, it is recommended to include the group-meaned +and de-meaned version of \emph{time-varying covariates} (and group-meaned version +of \emph{time-invariant covariates} that are on a higher level, e.g. level-2 +predictors) in the model. By this, one can fit complex multilevel models for +panel data, including time-varying predictors, time-invariant predictors and +random effects. } -\subsection{Why mixed models are preferred over fixed effects models}{ +\section{Why mixed models are preferred over fixed effects models}{ + + A mixed models approach can model the causes of endogeneity explicitly by including the (separated) within- and between-effects of time-varying fixed effects and including time-constant fixed effects. Furthermore, @@ -128,24 +159,28 @@ mixed models also include random effects, thus a mixed models approach is superior to classic fixed-effects models, which lack information of variation in the group-effects or between-subject effects. Furthermore, fixed effects regression cannot include random slopes, which means that -fixed effects regressions are neglecting \dQuote{cross-cluster differences -in the effects of lower-level controls (which) reduces the precision of -estimated context effects, resulting in unnecessarily wide confidence -intervals and low statistical power} (\cite{Heisig et al. 2017}). +fixed effects regressions are neglecting "cross-cluster differences in the +effects of lower-level controls (which) reduces the precision of estimated +context effects, resulting in unnecessarily wide confidence intervals and +low statistical power" (\emph{Heisig et al. 2017}). } -\subsection{Terminology}{ +\section{Terminology}{ + + The group-meaned variable is simply the mean of an independent variable -within each group (or id-level or cluster) represented by \code{by}. -It represents the cluster-mean of an independent variable. 
The regression -coefficient of a group-meaned variable is the \emph{between-subject-effect}. -The de-meaned variable is then the centered version of the group-meaned -variable. De-meaning is sometimes also called person-mean centering or -centering within clusters. The regression coefficient of a de-meaned -variable represents the \emph{within-subject-effect}. +within each group (or id-level or cluster) represented by \code{by}. It represents +the cluster-mean of an independent variable. The regression coefficient of a +group-meaned variable is the \emph{between-subject-effect}. The de-meaned variable +is then the centered version of the group-meaned variable. De-meaning is +sometimes also called person-mean centering or centering within clusters. +The regression coefficient of a de-meaned variable represents the +\emph{within-subject-effect}. } -\subsection{De-meaning with continuous predictors}{ +\section{De-meaning with continuous predictors}{ + + For continuous time-varying predictors, the recommendation is to include both their de-meaned and group-meaned versions as fixed effects, but not the raw (untransformed) time-varying predictors themselves. The de-meaned @@ -155,7 +190,9 @@ the within-subject effect, while the coefficient of the group-meaned predictor indicates the between-subject effect. } -\subsection{De-meaning with binary predictors}{ +\section{De-meaning with binary predictors}{ + + For binary time-varying predictors, there are two recommendations. First is to include the raw (untransformed) binary predictor as fixed effect only and the \emph{de-meaned} variable as random effect (random slope). @@ -163,51 +200,91 @@ The alternative would be to add the de-meaned version(s) of binary time-varying covariates as additional fixed effect as well (instead of adding it as random slope). Centering time-varying binary variables to obtain within-effects (level 1) isn't necessary. 
They have a sensible -interpretation when left in the typical 0/1 format (\cite{Hoffmann 2015, +interpretation when left in the typical 0/1 format (\emph{Hoffmann 2015, chapter 8-2.I}). \code{demean()} will thus coerce categorical time-varying predictors to numeric to compute the de- and group-meaned versions for these variables, where the raw (untransformed) binary predictor and the de-meaned version should be added to the model. } -\subsection{De-meaning of factors with more than 2 levels}{ +\section{De-meaning of factors with more than 2 levels}{ + + Factors with more than two levels are demeaned in two ways: first, these are also converted to numeric and de-meaned; second, dummy variables are created (binary, with 0/1 coding for each level) and these binary dummy-variables are de-meaned in the same way (as described above). -Packages like \pkg{panelr} internally convert factors to dummies before +Packages like \strong{panelr} internally convert factors to dummies before demeaning, so this behaviour can be mimicked here. } -\subsection{De-meaning interaction terms}{ There are multiple ways to deal -with interaction terms of within- and between-effects. A classical approach -is to simply use the product term of the de-meaned variables (i.e. -introducing the de-meaned variables as interaction term in the model -formula, e.g. \code{y ~ x_within * time_within}). This approach, however, -might be subject to bias (see \cite{Giesselmann & Schmidt-Catran 2020}). -\cr \cr -Another option is to first calculate the product term and then apply the -de-meaning to it. This approach produces an estimator \dQuote{that reflects +\section{De-meaning interaction terms}{ + + +There are multiple ways to deal with interaction terms of within- and +between-effects. +\itemize{ +\item A classical approach is to simply use the product term of the de-meaned +variables (i.e. introducing the de-meaned variables as interaction term +in the model formula, e.g. 
\code{y ~ x_within * time_within}). This approach, +however, might be subject to bias (see \emph{Giesselmann & Schmidt-Catran 2020}). +\item Another option is to first calculate the product term and then apply the +de-meaning to it. This approach produces an estimator "that reflects unit-level differences of interacted variables whose moderators vary -within units}, which is desirable if \emph{no} within interaction of -two time-dependent variables is required. \cr \cr -A third option, when the interaction should result in a genuine within +within units", which is desirable if \emph{no} within interaction of +two time-dependent variables is required. This is what \code{demean()} does +internally when \code{select} contains interaction terms. +\item A third option, when the interaction should result in a genuine within estimator, is to "double de-mean" the interaction terms -(\cite{Giesselmann & Schmidt-Catran 2018}), however, this is currently +(\emph{Giesselmann & Schmidt-Catran 2018}), however, this is currently not supported by \code{demean()}. If this is required, the \code{wmb()} -function from the \pkg{panelr} package should be used. \cr \cr +function from the \strong{panelr} package should be used. +} + To de-mean interaction terms for within-between models, simply specify -the term as interaction for the \code{select}-argument, e.g. -\code{select = "a*b"} (see 'Examples'). +the term as interaction for the \code{select}-argument, e.g. \code{select = "a*b"} +(see 'Examples'). } -\subsection{Analysing panel data with mixed models using lme4}{ -A description of how to translate the -formulas described in \emph{Bell et al. 2018} into R using \code{lmer()} -from \pkg{lme4} can be found in -\href{https://easystats.github.io/parameters/articles/demean.html}{this vignette}. +\section{De-meaning for cross-classified designs}{ + + +\code{demean()} can handle cross-classified designs, where the data has two or +more groups at the higher (i.e. second) level. 
In such cases, the +\code{by}-argument can identify two or more variables that represent the +cross-classified group- or cluster-IDs. The de-meaned variables for +cross-classified designs are simply subtracting all group means from each +individual value, i.e. \emph{fully cluster-mean-centering} (see \emph{Guo et al. 2024} +for details). Note that de-meaning for cross-classified designs is \emph{not} +equivalent to de-meaning of nested data structures from models with three or +more levels. Set \code{nested = TRUE} to explicitly assume a nested design. For +cross-classified designs, de-meaning is supposed to work for models like +\code{y ~ x + (1|level3) + (1|level2)}, but \emph{not} for models like +\code{y ~ x + (1|level3/level2)}. Note that \code{demean()} and \code{degroup()} can't +handle a mix of nested and cross-classified designs in one model. } + +\section{De-meaning for nested designs}{ + + +\emph{Brincks et al. (2017)} have suggested an algorithm to center variables for +nested designs, which is implemented in \code{demean()}. For nested designs, set +\code{nested = TRUE} \emph{and} specify the variables that indicate the different +levels in descending order in the \code{by} argument. E.g., +\verb{by = c("level4", "level3", "level2")} assumes a model like +\code{y ~ x + (1|level4/level3/level2)}. An alternative notation for the +\code{by}-argument would be \code{by = "level4/level3/level2"}, similar to the +formula notation. } + +\section{Analysing panel data with mixed models using lme4}{ + + +A description of how to translate the formulas described in \emph{Bell et al. 2018} +into R using \code{lmer()} from \strong{lme4} can be found in +\href{https://easystats.github.io/parameters/articles/demean.html}{this vignette}. +} + \examples{ data(iris) @@ -244,12 +321,19 @@ Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074 \item Bell A, Jones K. 2015. 
Explaining Fixed Effects: Random Effects Modeling of Time-Series Cross-Sectional and Panel Data. Political Science Research and Methods, 3(1), 133–153. +\item Brincks, A. M., Enders, C. K., Llabre, M. M., Bulotsky-Shearer, R. J., +Prado, G., and Feaster, D. J. (2017). Centering Predictor Variables in +Three-Level Contextual Models. Multivariate Behavioral Research, 52(2), +149–163. https://doi.org/10.1080/00273171.2016.1256753 \item Gelman A, Hill J. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge, New York: Cambridge University Press \item Giesselmann M, Schmidt-Catran, AW. 2020. Interactions in fixed effects regression models. Sociological Methods & Research, 1–28. https://doi.org/10.1177/0049124120914934 +\item Guo Y, Dhaliwal J, Rights JD. 2024. Disaggregating level-specific effects +in cross-classified multilevel models. Behavior Research Methods, 56(4), +3023–3057. \item Heisig JP, Schaeffer M, Giesecke J. 2017. The Costs of Simplicity: Why Multilevel Models May Benefit from Accounting for Cross-Cluster Differences in the Effects of Controls. American Sociological Review 82 diff --git a/man/text_format.Rd b/man/text_format.Rd index 87f045193..5f246731f 100644 --- a/man/text_format.Rd +++ b/man/text_format.Rd @@ -63,7 +63,11 @@ text elements will not be enclosed.} \item{pattern}{Character vector. For \code{data_rename()}, indicates columns that should be selected for renaming. Can be \code{NULL} (in which case all columns are selected). For \code{data_addprefix()} or \code{data_addsuffix()}, a character -string, which will be added as prefix or suffix to the column names.} +string, which will be added as prefix or suffix to the column names. For +\code{data_rename()}, \code{pattern} can also be a named vector. In this case, names +are used as values for the \code{replacement} argument (i.e. 
\code{pattern} can be a +character vector using \verb{ = ""} and argument \code{replacement} +will be ignored then).} } \value{ A character string. diff --git a/pkgdown/_pkgdown.yaml b/pkgdown/_pkgdown.yaml index d52994e16..6e6feb5b2 100644 --- a/pkgdown/_pkgdown.yaml +++ b/pkgdown/_pkgdown.yaml @@ -125,6 +125,11 @@ reference: - nhanes_sample articles: + - title: Overview of vignettes + navbar: ~ + contents: + - overview_of_vignettes + - title: Data Preparation desc: | Articles explaining utility of 'datawizard' for data wrangling diff --git a/tests/testthat/_snaps/demean.md b/tests/testthat/_snaps/demean.md index 7f12d263d..a1c2da4a3 100644 --- a/tests/testthat/_snaps/demean.md +++ b/tests/testthat/_snaps/demean.md @@ -23,13 +23,13 @@ Code head(x) Output - Sepal.Length_between Species_between binary_between Species_setosa_between - 1 5.925000 0.850000 0.375 0.4250000 - 2 5.925000 0.850000 0.375 0.4250000 - 3 5.925000 0.850000 0.375 0.4250000 - 4 5.862222 1.133333 0.400 0.2888889 - 5 5.925000 0.850000 0.375 0.4250000 - 6 5.862222 1.133333 0.400 0.2888889 + Sepal.Length_between binary_between Species_between Species_setosa_between + 1 5.925000 0.375 0.850000 0.4250000 + 2 5.925000 0.375 0.850000 0.4250000 + 3 5.925000 0.375 0.850000 0.4250000 + 4 5.862222 0.400 1.133333 0.2888889 + 5 5.925000 0.375 0.850000 0.4250000 + 6 5.862222 0.400 1.133333 0.2888889 Species_versicolor_between Species_virginica_between Sepal.Length_within 1 0.3000000 0.2750000 -0.8250000 2 0.3000000 0.2750000 -1.0250000 @@ -37,13 +37,13 @@ 4 0.2888889 0.4222222 -1.2622222 5 0.3000000 0.2750000 -0.9250000 6 0.2888889 0.4222222 -0.4622222 - Species_within binary_within Species_setosa_within Species_versicolor_within - 1 -0.850000 -0.375 0.5750000 -0.3000000 - 2 -0.850000 0.625 0.5750000 -0.3000000 - 3 -0.850000 -0.375 0.5750000 -0.3000000 - 4 -1.133333 0.600 0.7111111 -0.2888889 - 5 -0.850000 0.625 0.5750000 -0.3000000 - 6 -1.133333 -0.400 0.7111111 -0.2888889 + binary_within Species_within 
Species_setosa_within Species_versicolor_within + 1 -0.375 -0.850000 0.5750000 -0.3000000 + 2 0.625 -0.850000 0.5750000 -0.3000000 + 3 -0.375 -0.850000 0.5750000 -0.3000000 + 4 0.600 -1.133333 0.7111111 -0.2888889 + 5 0.625 -0.850000 0.5750000 -0.3000000 + 6 -0.400 -1.133333 0.7111111 -0.2888889 Species_virginica_within 1 -0.2750000 2 -0.2750000 diff --git a/tests/testthat/test-center.R b/tests/testthat/test-center.R index 7bff1ebc9..e7e347848 100644 --- a/tests/testthat/test-center.R +++ b/tests/testthat/test-center.R @@ -169,8 +169,7 @@ test_that("center, factors (grouped data)", { poorman::ungroup() %>% poorman::pull(Species) - manual <- iris %>% - poorman::pull(Species) + manual <- poorman::pull(iris, Species) expect_identical(datawizard, manual) }) diff --git a/tests/testthat/test-data_rename.R b/tests/testthat/test-data_rename.R index a8d003b59..e01c42f8b 100644 --- a/tests/testthat/test-data_rename.R +++ b/tests/testthat/test-data_rename.R @@ -14,6 +14,10 @@ test_that("data_rename works with one or several replacements", { ), c("length", "width", "Petal.Length", "Petal.Width", "Species") ) + expect_named( + data_rename(test, c(length = "Sepal.Length", width = "Sepal.Width")), + c("length", "width", "Petal.Length", "Petal.Width", "Species") + ) }) test_that("data_rename returns a data frame", { @@ -24,11 +28,26 @@ test_that("data_rename returns a data frame", { test_that("data_rename: pattern must be of type character", { expect_error( data_rename(test, pattern = 1), - regexp = "Argument `pattern` must be of type character." + regexp = "Argument `pattern` must be of type character" ) expect_error( data_rename(test, pattern = TRUE), - regexp = "Argument `pattern` must be of type character." 
+ regexp = "Argument `pattern` must be of type character" + ) +}) + +test_that("data_rename: replacement not allowed to have NA or empty strings", { + expect_error( + data_rename(test, pattern = c(test = "Species", "Sepal.Length")), + regexp = "Either name all elements of `pattern`" + ) + expect_error( + data_rename( + test, + pattern = c("Species", "Sepal.Length"), + replacement = c("foo", NA_character_) + ), + regexp = "`replacement` is not allowed" ) }) @@ -42,7 +61,9 @@ test_that("data_rename uses indices when no replacement", { test_that("data_rename works when too many names in 'replacement'", { expect_message( - x <- data_rename(test, replacement = paste0("foo", 1:6)), + { + x <- data_rename(test, replacement = paste0("foo", 1:6)) + }, "There are more names in" ) expect_identical(dim(test), dim(x)) @@ -51,7 +72,9 @@ test_that("data_rename works when too many names in 'replacement'", { test_that("data_rename works when not enough names in 'replacement'", { expect_message( - x <- data_rename(test, replacement = paste0("foo", 1:2)), + { + x <- data_rename(test, replacement = paste0("foo", 1:2)) + }, "There are more names in" ) expect_identical(dim(test), dim(x)) diff --git a/tests/testthat/test-demean.R b/tests/testthat/test-demean.R index 566bd6097..6e169f9c0 100644 --- a/tests/testthat/test-demean.R +++ b/tests/testthat/test-demean.R @@ -57,8 +57,174 @@ test_that("demean shows message if some vars don't exist", { ) set.seed(123) - expect_message( + expect_error( demean(dat, select = "foo", by = "ID"), regexp = "not found" ) }) + + +# see issue #520 +test_that("demean for cross-classified designs (by > 1)", { + skip_if_not_installed("poorman") + + data(efc, package = "datawizard") + dat <- na.omit(efc) + dat$e42dep <- factor(dat$e42dep) + dat$c172code <- factor(dat$c172code) + + x2a <- dat %>% + data_group(e42dep) %>% + data_modify( + c12hour_e42dep = mean(c12hour) + ) %>% + data_ungroup() %>% + data_group(c172code) %>% + data_modify( + c12hour_c172code = 
mean(c12hour) + ) %>% + data_ungroup() %>% + data_modify( + c12hour_within = c12hour - c12hour_e42dep - c12hour_c172code + ) + + out <- degroup( + dat, + select = "c12hour", + by = c("e42dep", "c172code"), + suffix_demean = "_within" + ) + + expect_equal( + out$c12hour_e42dep_between, + x2a$c12hour_e42dep, + tolerance = 1e-4, + ignore_attr = TRUE + ) + expect_equal( + out$c12hour_within, + x2a$c12hour_within, + tolerance = 1e-4, + ignore_attr = TRUE + ) + + x2a <- dat %>% + data_group(e42dep) %>% + data_modify( + c12hour_e42dep = mean(c12hour, na.rm = TRUE), + neg_c_7_e42dep = mean(neg_c_7, na.rm = TRUE) + ) %>% + data_ungroup() %>% + data_group(c172code) %>% + data_modify( + c12hour_c172code = mean(c12hour, na.rm = TRUE), + neg_c_7_c172code = mean(neg_c_7, na.rm = TRUE) + ) %>% + data_ungroup() %>% + data_modify( + c12hour_within = c12hour - c12hour_e42dep - c12hour_c172code, + neg_c_7_within = neg_c_7 - neg_c_7_e42dep - neg_c_7_c172code + ) + + out <- degroup( + dat, + select = c("c12hour", "neg_c_7"), + by = c("e42dep", "c172code"), + suffix_demean = "_within" + ) + + expect_equal( + out$c12hour_e42dep_between, + x2a$c12hour_e42dep, + tolerance = 1e-4, + ignore_attr = TRUE + ) + expect_equal( + out$neg_c_7_c172code_between, + x2a$neg_c_7_c172code, + tolerance = 1e-4, + ignore_attr = TRUE + ) + expect_equal( + out$neg_c_7_within, + x2a$neg_c_7_within, + tolerance = 1e-4, + ignore_attr = TRUE + ) + expect_equal( + out$c12hour_within, + x2a$c12hour_within, + tolerance = 1e-4, + ignore_attr = TRUE + ) +}) + + +test_that("demean, sanity checks", { + data(efc, package = "datawizard") + dat <- na.omit(efc) + dat$e42dep <- factor(dat$e42dep) + dat$c172code <- factor(dat$c172code) + + expect_error( + degroup( + dat, + select = c("c12hour", "neg_c_8"), + by = c("e42dep", "c172code"), + suffix_demean = "_within" + ), + regex = "Variable \"neg_c_8\" was not found" + ) + expect_error( + degroup( + dat, + select = c("c12hour", "neg_c_8"), + by = c("e42dep", "c173code"), + 
suffix_demean = "_within" + ), + regex = "Variables \"neg_c_8\" and \"c173code\" were not found" + ) +}) + + +test_that("demean for nested designs (by > 1), nested = TRUE", { + data(efc, package = "datawizard") + dat <- na.omit(efc) + dat$e42dep <- factor(dat$e42dep) + dat$c172code <- factor(dat$c172code) + + x_ijk <- dat$c12hour + xbar_k <- ave(x_ijk, dat$e42dep, FUN = mean) + xbar_jk <- ave(x_ijk, dat$e42dep, dat$c172code, FUN = mean) + + L3_between <- xbar_k + L2_between <- xbar_jk - xbar_k + L1_within <- x_ijk - xbar_jk + + out <- degroup( + dat, + select = "c12hour", + by = c("e42dep", "c172code"), + nested = TRUE, + suffix_demean = "_within" + ) + + expect_equal( + out$c12hour_within, + L1_within, + tolerance = 1e-4, + ignore_attr = TRUE + ) + expect_equal( + out$c12hour_e42dep_between, + L3_between, + tolerance = 1e-4, + ignore_attr = TRUE + ) + expect_equal( + out$c12hour_c172code_between, + L2_between, + tolerance = 1e-4, + ignore_attr = TRUE + ) +}) diff --git a/vignettes/overview_of_vignettes.Rmd b/vignettes/overview_of_vignettes.Rmd new file mode 100644 index 000000000..033234607 --- /dev/null +++ b/vignettes/overview_of_vignettes.Rmd @@ -0,0 +1,37 @@ +--- +title: "Overview of Vignettes" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Overview of Vignettes} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r message=FALSE, warning=FALSE, include=FALSE} +library(knitr) +knitr::opts_chunk$set( + echo = TRUE, + collapse = TRUE, + warning = FALSE, + message = FALSE, + comment = "#>", + eval = TRUE +) +``` + +All package vignettes are available at [https://easystats.github.io/datawizard/](https://easystats.github.io/datawizard/). 
+ +## Function Overview + +* [Function Reference](https://easystats.github.io/datawizard/reference/index.html) + + +## Data Preparation + +* [Coming from 'tidyverse'](https://easystats.github.io/datawizard/articles/tidyverse_translation.html) +* [A quick summary of selection syntax in `{datawizard}`](https://easystats.github.io/datawizard/articles/selection_syntax.html) + + +## Statistical Transformations + +* [Data Standardization](https://easystats.github.io/datawizard/articles/standardize_data.html) diff --git a/vignettes/selection_syntax.Rmd b/vignettes/selection_syntax.Rmd index 9b501ebd5..3c0953f65 100644 --- a/vignettes/selection_syntax.Rmd +++ b/vignettes/selection_syntax.Rmd @@ -15,8 +15,7 @@ knitr::opts_chunk$set( pkgs <- c( "datawizard", - "dplyr", - "htmltools" + "dplyr" ) if (!all(vapply(pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1L)))) { @@ -27,18 +26,10 @@ if (!all(vapply(pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1L)) ```{r load, echo=FALSE, message=FALSE} library(datawizard) library(dplyr) -library(htmltools) set.seed(123) iris <- iris[sample(nrow(iris), 10), ] row.names(iris) <- NULL - -row <- function(...) { - div( - class = "custom_note", - ... - ) -} ``` ```{css, echo=FALSE} @@ -127,18 +118,26 @@ data_select(iris, contains("pal", "ec")) data_select(iris, regex("^Sep|ies")) ``` -```{r echo=FALSE} -row("Note: these functions are not exported by `datawizard` but are detected and -applied internally. This means that they won't be detected by autocompletion -when we write them.") -``` -```{r echo=FALSE} -row("Note #2: because these functions are not exported, they will not create -conflicts with the ones that come from the `tidyverse` and that have the same name. -So we can still use `dplyr` and its friends, it won't change anything for selection -in `datawizard` functions!") -``` + + + +
+

+ Note: these functions are not exported by `datawizard` but are detected and + applied internally. This means that they won't be detected by autocompletion + when we write them. +

+
+ +
+

+ Note #2: because these functions are not exported, they will not create + conflicts with the ones that come from the `tidyverse` and that have the same + name. Therefore, we can still use `dplyr` and its friends; doing so won't change + anything for selection in `datawizard` functions! +

+
# Excluding variables diff --git a/vignettes/tidyverse_translation.Rmd b/vignettes/tidyverse_translation.Rmd index b03402468..ae4b339b3 100644 --- a/vignettes/tidyverse_translation.Rmd +++ b/vignettes/tidyverse_translation.Rmd @@ -1,6 +1,6 @@ --- title: "Coming from 'tidyverse'" -output: +output: rmarkdown::html_vignette: toc: true vignette: > @@ -9,7 +9,7 @@ vignette: > %\VignetteEngine{knitr::rmarkdown} --- -```{r message=FALSE, warning=FALSE, include=FALSE, eval = TRUE} +```{r setup, message=FALSE, warning=FALSE, include=FALSE, eval = TRUE} library(knitr) options(knitr.kable.NA = "") knitr::opts_chunk$set( @@ -21,57 +21,71 @@ knitr::opts_chunk$set( pkgs <- c( "dplyr", - "datawizard", "tidyr" ) +all_deps_available <- all(vapply(pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1L))) -# since we explicitely put eval = TRUE for some chunks, we can't rely on -# knitr::opts_chunk$set(eval = FALSE) at the beginning of the script. So we make -# a logical that is FALSE only if deps are not installed (cf easystats/easystats#317) -evaluate_chunk <- TRUE - -if (!all(vapply(pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1L)))) { - evaluate_chunk <- FALSE +if (all_deps_available) { + library(datawizard) + library(dplyr) + library(tidyr) } + +# Since we explicitly put `eval = TRUE` for some chunks, we can't rely on +# `knitr::opts_chunk$set(eval = FALSE)` at the beginning of the script. +# Therefore, we introduce a logical that is `FALSE` only if all suggested +# dependencies are not installed (cf easystats/easystats#317) +evaluate_chunk <- all_deps_available && getRversion() >= "4.1.0" ``` This vignette can be referred to by citing the following: Patil et al., (2022). datawizard: An R Package for Easy Data Preparation and Statistical Transformations. 
*Journal of Open Source Software*, *7*(78), 4684, https://doi.org/10.21105/joss.04684 -```{css, echo=FALSE, eval = evaluate_chunk} +```{css, echo=FALSE, eval = TRUE} .datawizard, .datawizard > .sourceCode { background-color: #e6e6ff; } .tidyverse, .tidyverse > .sourceCode { background-color: #d9f2e5; } +.custom_note { + border-left: solid 5px hsl(220, 100%, 30%); + background-color: hsl(220, 100%, 95%); + padding: 5px; + margin-bottom: 10px +} ``` # Introduction -`{datawizard}` package aims to make basic data wrangling easier than +`{datawizard}` package aims to make basic data wrangling easier than with base R. The data wrangling workflow it supports is similar to the one supported by the tidyverse package combination of `{dplyr}` and `{tidyr}`. However, one of its main features is that it has a very few dependencies: `{stats}` and `{utils}` -(included in base R) and `{insight}`, which is the core package of the _easystats_ -ecosystem. This package grew organically to simultaneously satisfy the +(included in base R) and `{insight}`, which is the core package of the _easystats_ +ecosystem. This package grew organically to simultaneously satisfy the "0 non-base hard dependency" principle of _easystats_ and the data wrangling needs -of the constituent packages in this ecosystem. - -One drawback of this genesis is that not all features of the `{tidyverse}` -packages are supported since only features that were necessary for _easystats_ -ecosystem have been implemented. Some of these missing features (such as `summarize` -or the pipe operator `%>%`) are made available in other dependency-free packages, -such as [`{poorman}`](https://github.com/nathaneastwood/poorman/). It is also -important to note that `{datawizard}` was designed to avoid namespace collisions +of the constituent packages in this ecosystem. It is also +important to note that `{datawizard}` was designed to avoid namespace collisions with `{tidyverse}` packages. 
-In this article, we will see how to go through basic data wrangling steps with -`{datawizard}`. We will also compare it to the `{tidyverse}` syntax for achieving the same. +In this article, we will see how to go through basic data wrangling steps with +`{datawizard}`. We will also compare it to the `{tidyverse}` syntax for achieving the same. This way, if you decide to make the switch, you can easily find the translations here. This vignette is largely inspired from `{dplyr}`'s [Getting started vignette](https://dplyr.tidyverse.org/articles/dplyr.html). + + + +
+

+ Note: In this vignette, we use the native pipe operator, `|>`, which was + introduced in R 4.1. Users of R version 3.6 or 4.0 should replace the native + pipe with magrittr's pipe (`%>%`) so that the examples work. +

+
+ ```{r, eval = evaluate_chunk} library(dplyr) library(tidyr) @@ -83,23 +97,23 @@ efc <- head(efc) # Workhorses -Before we look at their *tidyverse* equivalents, we can first have a look at +Before we look at their *tidyverse* equivalents, we can first have a look at `{datawizard}`'s key functions for data wrangling: -| Function | Operation | -| :---------------- | :------------------------------------------------ | -| `data_filter()` | [to select only certain observations](#filtering) | -| `data_select()` | [to select only a few variables](#selecting) | -| `data_modify()` | [to create variables or modify existing ones](#modifying) | -| `data_arrange()` | [to sort observations](#sorting) | -| `data_extract()` | [to extract a single variable](#extracting) | -| `data_rename()` | [to rename variables](#renaming) | -| `data_relocate()` | [to reorder a data frame](#relocating) | -| `data_to_long()` | [to convert data from wide to long](#reshaping) | -| `data_to_wide()` | [to convert data from long to wide](#reshaping) | -| `data_join()` | [to join two data frames](#joining) | -| `data_unite()` | [to concatenate several columns into a single one](#uniting) | -| `data_separate()` | [to separate a single column into multiple columns](#separating) | +| Function | Operation | +| :---------------- | :--------------------------------------------------------------- | +| `data_filter()` | [to select only certain observations](#filtering) | +| `data_select()` | [to select only a few variables](#selecting) | +| `data_modify()` | [to create variables or modify existing ones](#modifying) | +| `data_arrange()` | [to sort observations](#sorting) | +| `data_extract()` | [to extract a single variable](#extracting) | +| `data_rename()` | [to rename variables](#renaming) | +| `data_relocate()` | [to reorder a data frame](#relocating) | +| `data_to_long()` | [to convert data from wide to long](#reshaping) | +| `data_to_wide()` | [to convert data from long to wide](#reshaping) | +| 
`data_join()` | [to join two data frames](#joining) | +| `data_unite()` | [to concatenate several columns into a single one](#uniting) | +| `data_separate()` | [to separate a single column into multiple columns](#separating) | Note that there are a few functions in `{datawizard}` that have no strict equivalent in `{dplyr}` or `{tidyr}` (e.g `data_rotate()`), and so we won't discuss them in @@ -113,7 +127,7 @@ Before we look at them individually, let's first have a look at the summary tabl | :---------------- | :------------------------------------------------------------------ | | `data_filter()` | `dplyr::filter()`, `dplyr::slice()` | | `data_select()` | `dplyr::select()` | -| `data_modify()` | `dplyr::mutate()` | +| `data_modify()` | `dplyr::mutate()` | | `data_arrange()` | `dplyr::arrange()` | | `data_extract()` | `dplyr::pull()` | | `data_rename()` | `dplyr::rename()` | @@ -123,8 +137,8 @@ Before we look at them individually, let's first have a look at the summary tabl | `data_join()` | `dplyr::inner_join()`, `dplyr::left_join()`, `dplyr::right_join()`, | | | `dplyr::full_join()`, `dplyr::anti_join()`, `dplyr::semi_join()` | | `data_peek()` | `dplyr::glimpse()` | -| `data_unite()` | `tidyr::unite()` | -| `data_separate()` | `tidyr::separate()` | +| `data_unite()` | `tidyr::unite()` | +| `data_separate()` | `tidyr::separate()` | ## Filtering {#filtering} @@ -136,14 +150,14 @@ Before we look at them individually, let's first have a look at the summary tabl ```{r filter, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_filter( skin_color == "light", eye_color == "brown" ) # or -starwars %>% +starwars |> data_filter( skin_color == "light" & eye_color == "brown" @@ -155,7 +169,7 @@ starwars %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> filter( skin_color == "light", eye_color == "brown" @@ -176,9 +190,9 @@ starwars <- head(starwars) ## Selecting {#selecting} 
-`data_select()` is the equivalent of `dplyr::select()`. +`data_select()` is the equivalent of `dplyr::select()`. The main difference between these two functions is that `data_select()` uses two -arguments (`select` and `exclude`) and requires quoted column names if we want to +arguments (`select` and `exclude`) and requires quoted column names if we want to select several variables, while `dplyr::select()` accepts any unquoted column names. :::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} @@ -187,7 +201,7 @@ select several variables, while `dplyr::select()` accepts any unquoted column na ```{r select1, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_select(select = c("hair_color", "skin_color", "eye_color")) ``` ::: @@ -196,7 +210,7 @@ starwars %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> select(hair_color, skin_color, eye_color) ``` ::: @@ -212,7 +226,7 @@ starwars %>% ```{r select2, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_select(select = -ends_with("color")) ``` ::: @@ -221,7 +235,7 @@ starwars %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> select(-ends_with("color")) ``` ::: @@ -240,7 +254,7 @@ here and quoting them won't work. Should we comment on that? 
--> ```{r select3, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_select(select = -(hair_color:eye_color)) ``` ::: @@ -249,7 +263,7 @@ starwars %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> select(!(hair_color:eye_color)) ``` ::: @@ -266,7 +280,7 @@ starwars %>% ```{r select4, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_select(exclude = regex("color$")) ``` ::: @@ -275,7 +289,7 @@ starwars %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> select(-contains("color$")) ``` ::: @@ -292,7 +306,7 @@ starwars %>% ```{r select5, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_select(select = is.numeric) ``` ::: @@ -301,7 +315,7 @@ starwars %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> select(where(is.numeric)) ``` ::: @@ -316,8 +330,8 @@ You can find a list of all the select helpers with `?data_select`. ## Modifying {#modifying} -`data_modify()` is a wrapper around `base::transform()` but has several additional -benefits: +`data_modify()` is a wrapper around `base::transform()` but has several additional +benefits: * it allows us to use newly created variables in the following expressions; * it works with grouped data; @@ -325,8 +339,8 @@ benefits: * it accepts expressions as character vectors so that it is easy to program with it -This last point is also the main difference between `data_modify()` and -`dplyr::mutate()`. +This last point is also the main difference between `data_modify()` and +`dplyr::mutate()`. 
:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} @@ -334,7 +348,7 @@ This last point is also the main difference between `data_modify()` and ```{r modify1, class.source = "datawizard"} # ---------- datawizard ----------- -efc %>% +efc |> data_modify( c12hour_c = center(c12hour), c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE), @@ -347,7 +361,7 @@ efc %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -efc %>% +efc |> mutate( c12hour_c = center(c12hour), c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE), @@ -400,7 +414,7 @@ such as `starts_with()` in `data_arrange()`. :::{} ```{r arrange1, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_arrange(c("hair_color", "height")) ``` ::: @@ -409,7 +423,7 @@ starwars %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> arrange(hair_color, height) ``` ::: @@ -419,14 +433,14 @@ starwars %>% ```{r arrange1, eval = evaluate_chunk, echo = FALSE} ``` -You can also sort variables in descending order by putting a `"-"` in front of +You can also sort variables in descending order by putting a `"-"` in front of their name, like below: :::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{} ```{r arrange2, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_arrange(c("-hair_color", "-height")) ``` ::: @@ -435,7 +449,7 @@ starwars %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> arrange(desc(hair_color), -height) ``` ::: @@ -448,15 +462,15 @@ starwars %>% ## Extracting {#extracting} -Although we mostly work on data frames, it is sometimes useful to extract a single -column as a vector. This can be done with `data_extract()`, which reproduces the +Although we mostly work on data frames, it is sometimes useful to extract a single +column as a vector. 
This can be done with `data_extract()`, which reproduces the behavior of `dplyr::pull()`: :::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{} ```{r extract1, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_extract(gender) ``` ::: @@ -465,7 +479,7 @@ starwars %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> pull(gender) ``` ::: @@ -479,7 +493,7 @@ We can also specify several variables in `select`. In this case, `data_extract() is equivalent to `data_select()`: ```{r eval = evaluate_chunk} -starwars %>% +starwars |> data_extract(select = contains("color")) ``` @@ -488,9 +502,9 @@ starwars %>% ## Renaming {#renaming} -`data_rename()` is the equivalent of `dplyr::rename()` but the syntax between the +`data_rename()` is the equivalent of `dplyr::rename()` but the syntax between the two is different. While `dplyr::rename()` takes new-old pairs of column -names, `data_rename()` requires a vector of column names to rename, and then +names, `data_rename()` requires a vector of column names to rename, and then a vector of new names for these columns that must be of the same length. :::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} @@ -499,7 +513,7 @@ a vector of new names for these columns that must be of the same length. ```{r rename1, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_rename( pattern = c("sex", "hair_color"), replacement = c("Sex", "Hair Color") @@ -511,7 +525,7 @@ starwars %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> rename( Sex = sex, "Hair Color" = hair_color @@ -524,14 +538,14 @@ starwars %>% ```{r rename1, eval = evaluate_chunk, echo = FALSE} ``` -The way `data_rename()` is designed makes it easy to apply the same modifications -to a vector of column names. 
For example, we can remove underscores and use +The way `data_rename()` is designed makes it easy to apply the same modifications +to a vector of column names. For example, we can remove underscores and use TitleCase with the following code: ```{r rename2} to_rename <- names(starwars) -starwars %>% +starwars |> data_rename( pattern = to_rename, replacement = tools::toTitleCase(gsub("_", " ", to_rename, fixed = TRUE)) @@ -541,16 +555,16 @@ starwars %>% ```{r rename2, eval = evaluate_chunk, echo = FALSE} ``` -It is also possible to add a prefix or a suffix to all or a subset of variables -with `data_addprefix()` and `data_addsuffix()`. The argument `select` accepts +It is also possible to add a prefix or a suffix to all or a subset of variables +with `data_addprefix()` and `data_addsuffix()`. The argument `select` accepts all select helpers that we saw above with `data_select()`: ```{r rename3} -starwars %>% +starwars |> data_addprefix( pattern = "OLD.", select = contains("color") - ) %>% + ) |> data_addsuffix( pattern = ".NEW", select = -contains("color") @@ -566,7 +580,7 @@ Sometimes, we want to relocate one or a small subset of columns in the dataset. Rather than typing many names in `data_select()`, we can use `data_relocate()`, which is the equivalent of `dplyr::relocate()`. Just like `data_select()`, we can specify a list of variables we want to relocate with `select` and `exclude`. -Then, the arguments `before` and `after`^[Note that we use `before` and `after` +Then, the arguments `before` and `after`^[Note that we use `before` and `after` whereas `dplyr::relocate()` uses `.before` and `.after`.] 
specify where the selected columns should be relocated: @@ -576,32 +590,32 @@ be relocated: ```{r relocate1, class.source = "datawizard"} # ---------- datawizard ----------- -starwars %>% +starwars |> data_relocate(sex:homeworld, before = "height") ``` ::: - + ::: {} ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -starwars %>% +starwars |> relocate(sex:homeworld, .before = height) ``` ::: - + :::: ```{r relocate1, eval = evaluate_chunk, echo = FALSE} ``` In addition to column names, `before` and `after` accept column indices. Finally, -one can use `before = -1` to relocate the selected columns just before the last +one can use `before = -1` to relocate the selected columns just before the last column, or `after = -1` to relocate them after the last column. ```{r eval = evaluate_chunk} # ---------- datawizard ----------- -starwars %>% +starwars |> data_relocate(sex:homeworld, after = -1) ``` @@ -611,10 +625,10 @@ starwars %>% ### Longer Reshaping data from wide to long or from long to wide format can be done with -`data_to_long()` and `data_to_wide()`. These functions were designed to match -`tidyr::pivot_longer()` and `tidyr::pivot_wider()` arguments, so that the only -thing to do is to change the function name. However, not all of -`tidyr::pivot_longer()` and `tidyr::pivot_wider()` features are available yet. +`data_to_long()` and `data_to_wide()`. These functions were designed to match +`tidyr::pivot_longer()` and `tidyr::pivot_wider()` arguments, so that the only +thing to do is to change the function name. However, not all of +`tidyr::pivot_longer()` and `tidyr::pivot_wider()` features are available yet. We will use the `relig_income` dataset, as in the [`{tidyr}` vignette](https://tidyr.tidyverse.org/articles/pivot.html). @@ -623,11 +637,11 @@ relig_income ``` -We would like to reshape this dataset to have 3 columns: religion, count, and -income. The column "religion" doesn't need to change, so we exclude it with -`-religion`. 
Then, each remaining column corresponds to an income category. -Therefore, we want to move all these column names to a single column called -"income". Finally, the values corresponding to each of these columns will be +We would like to reshape this dataset to have 3 columns: religion, count, and +income. The column "religion" doesn't need to change, so we exclude it with +`-religion`. Then, each remaining column corresponds to an income category. +Therefore, we want to move all these column names to a single column called +"income". Finally, the values corresponding to each of these columns will be reshaped to be in a single new column, called "count". :::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} @@ -636,7 +650,7 @@ reshaped to be in a single new column, called "count". ```{r pivot1, class.source = "datawizard"} # ---------- datawizard ----------- -relig_income %>% +relig_income |> data_to_long( -religion, names_to = "income", @@ -649,7 +663,7 @@ relig_income %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -relig_income %>% +relig_income |> pivot_longer( !religion, names_to = "income", @@ -676,7 +690,7 @@ billboard ```{r pivot2, class.source = "datawizard"} # ---------- datawizard ----------- -billboard %>% +billboard |> data_to_long( cols = starts_with("wk"), names_to = "week", @@ -690,7 +704,7 @@ billboard %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -billboard %>% +billboard |> pivot_longer( cols = starts_with("wk"), names_to = "week", @@ -721,7 +735,7 @@ fish_encounters ```{r pivot3, class.source = "datawizard"} # ---------- datawizard ----------- -fish_encounters %>% +fish_encounters |> data_to_wide( names_from = "station", values_from = "seen", @@ -734,7 +748,7 @@ fish_encounters %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -fish_encounters %>% +fish_encounters |> pivot_wider( names_from = station, values_from = seen, @@ -754,12 +768,12 
@@ fish_encounters %>% -In `{datawizard}`, joining datasets is done with `data_join()` (or its alias -`data_merge()`). Contrary to `{dplyr}`, this unique function takes care of all +In `{datawizard}`, joining datasets is done with `data_join()` (or its alias +`data_merge()`). Contrary to `{dplyr}`, this unique function takes care of all types of join, which are then specified inside the function with the argument `join` (by default, `join = "left"`). -Below, we show how to perform the four most common joins: full, left, right and +Below, we show how to perform the four most common joins: full, left, right and inner. We will use the datasets `band_members`and `band_instruments` provided by `{dplyr}`: :::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} @@ -789,7 +803,7 @@ band_instruments ```{r join1, class.source = "datawizard"} # ---------- datawizard ----------- -band_members %>% +band_members |> data_join(band_instruments, join = "full") ``` ::: @@ -798,7 +812,7 @@ band_members %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -band_members %>% +band_members |> full_join(band_instruments) ``` ::: @@ -818,7 +832,7 @@ band_members %>% ```{r join2, class.source = "datawizard"} # ---------- datawizard ----------- -band_members %>% +band_members |> data_join(band_instruments, join = "left") ``` ::: @@ -827,7 +841,7 @@ band_members %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -band_members %>% +band_members |> left_join(band_instruments) ``` ::: @@ -844,7 +858,7 @@ band_members %>% ```{r join3, class.source = "datawizard"} # ---------- datawizard ----------- -band_members %>% +band_members |> data_join(band_instruments, join = "right") ``` ::: @@ -853,7 +867,7 @@ band_members %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -band_members %>% +band_members |> right_join(band_instruments) ``` ::: @@ -873,7 +887,7 @@ band_members %>% ```{r join4, class.source = 
"datawizard"} # ---------- datawizard ----------- -band_members %>% +band_members |> data_join(band_instruments, join = "inner") ``` ::: @@ -882,7 +896,7 @@ band_members %>% ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -band_members %>% +band_members |> inner_join(band_instruments) ``` ::: @@ -916,7 +930,7 @@ test ```{r unite1, class.source = "datawizard"} # ---------- datawizard ----------- -test %>% +test |> data_unite( new_column = "date", select = c("year", "month", "day"), @@ -924,12 +938,12 @@ test %>% ) ``` ::: - + ::: {} ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -test %>% +test |> unite( col = "date", year, month, day, @@ -937,7 +951,7 @@ test %>% ) ``` ::: - + :::: ```{r unite1, eval = evaluate_chunk, echo = FALSE} @@ -949,7 +963,7 @@ test %>% ```{r unite2, class.source = "datawizard"} # ---------- datawizard ----------- -test %>% +test |> data_unite( new_column = "date", select = c("year", "month", "day"), @@ -958,12 +972,12 @@ test %>% ) ``` ::: - + ::: {} ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -test %>% +test |> unite( col = "date", year, month, day, @@ -972,7 +986,7 @@ test %>% ) ``` ::: - + :::: ```{r unite2, eval = evaluate_chunk, echo = FALSE} @@ -999,26 +1013,26 @@ test ```{r separate1, class.source = "datawizard"} # ---------- datawizard ----------- -test %>% +test |> data_separate( select = "date_arrival", new_columns = c("Year", "Month", "Day") ) ``` ::: - + ::: {} ```{r, class.source = "tidyverse"} # ---------- tidyverse ----------- -test %>% +test |> separate( date_arrival, into = c("Year", "Month", "Day") ) ``` ::: - + :::: ```{r separate1, eval = evaluate_chunk, echo = FALSE} @@ -1028,7 +1042,7 @@ test %>% Unlike `tidyr::separate()`, you can separate multiple columns in one step with `data_separate()`. 
```{r eval = evaluate_chunk} -test %>% +test |> data_separate( new_columns = list( date_arrival = c("Arr_Year", "Arr_Month", "Arr_Day"), @@ -1040,9 +1054,9 @@ test %>% # Other useful functions -`{datawizard}` contains other functions that are not necessarily included in -`{dplyr}` or `{tidyr}` or do not directly modify the data. Some of them are -inspired from the package `janitor`. +`{datawizard}` contains other functions that are not necessarily included in +`{dplyr}` or `{tidyr}` or do not directly modify the data. Some of them are +inspired from the package `janitor`. ## Work with rownames @@ -1053,12 +1067,12 @@ We can convert a column in rownames and move rownames to a new column with mtcars <- head(mtcars) mtcars -mtcars2 <- mtcars %>% +mtcars2 <- mtcars |> rownames_as_column(var = "model") mtcars2 -mtcars2 %>% +mtcars2 |> column_as_rownames(var = "model") ``` @@ -1068,7 +1082,7 @@ mtcars2 %>% The main difference is when we use it with grouped data. While `tibble::rowid_to_column()` uses one distinct rowid for every row in the dataset, `rowid_as_column()` creates one id for every row *in each group*. Therefore, two rows in different groups -can have the same row id. +can have the same row id. This means that `rowid_as_column()` is closer to using `n()` in `mutate()`, like the following: @@ -1081,16 +1095,16 @@ test <- data.frame( ) test -test %>% - data_group(group) %>% +test |> + data_group(group) |> tibble::rowid_to_column() -test %>% - data_group(group) %>% +test |> + data_group(group) |> rowid_as_column() -test %>% - data_group(group) %>% +test |> + data_group(group) |> mutate(id = seq_len(n())) ``` @@ -1107,11 +1121,11 @@ x <- data.frame( X_2 = c(NA, "Title2", 4:6) ) x -x2 <- x %>% +x2 <- x |> row_to_colnames(row = 2) x2 -x2 %>% +x2 |> colnames_to_row() ```
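
The two column-name helpers above can also be called without a pipe, which makes the intermediate objects easier to inspect. The following is a minimal sketch, assuming `{datawizard}` is installed; the object names `raw`, `promoted`, and `restored` are illustrative and do not appear in the vignette, and the data frame mirrors the `x` used above.

```r
# Sketch only: assumes {datawizard} is installed.
# `raw`, `promoted` and `restored` are hypothetical names used for illustration.
library(datawizard)

raw <- data.frame(
  X_1 = c(NA, "Title", 1:3),
  X_2 = c(NA, "Title2", 4:6)
)

# Use the values in the second row as column names.
promoted <- row_to_colnames(raw, row = 2)
colnames(promoted)

# Move the column names back into the data as a row.
restored <- colnames_to_row(promoted)
restored
```

Comparing `nrow(raw)`, `nrow(promoted)`, and `nrow(restored)` is a quick way to see how each step changes the number of rows.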