From a55625d484e042e725d3185d7f6be529d782a6ad Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 17 Dec 2024 22:01:35 +0100 Subject: [PATCH 01/43] Add Kish-method to `rescale_weights` --- R/rescale_weights.R | 120 ++++++++++++++++++++++++++--------------- man/rescale_weights.Rd | 54 +++++++++++-------- 2 files changed, 111 insertions(+), 63 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index ec0c75616..f74705312 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -2,63 +2,73 @@ #' @name rescale_weights #' #' @description Most functions to fit multilevel and mixed effects models only -#' allow to specify frequency weights, but not design (i.e. sampling or -#' probability) weights, which should be used when analyzing complex samples -#' and survey data. `rescale_weights()` implements an algorithm proposed -#' by \cite{Asparouhov (2006)} and \cite{Carle (2009)} to rescale design -#' weights in survey data to account for the grouping structure of multilevel -#' models, which then can be used for multilevel modelling. +#' allow to specify frequency weights, but not design (i.e. sampling or +#' probability) weights, which should be used when analyzing complex samples +#' and survey data. `rescale_weights()` implements two algorithms, one proposed +#' by \cite{Asparouhov (2006)} and \cite{Carle (2009)} and one proposed by +#' \cite{Kish 1965}, to rescale design weights in survey data to account for the +#' grouping structure of multilevel models, which then can be used for +#' multilevel modelling. #' #' @param data A data frame. #' @param by Variable names (as character vector, or as formula), indicating -#' the grouping structure (strata) of the survey data (level-2-cluster -#' variable). It is also possible to create weights for multiple group -#' variables; in such cases, each created weighting variable will be suffixed -#' by the name of the group variable. +#' the grouping structure (strata) of the survey data (level-2-cluster +#' variable). It is also possible to create weights for multiple group +#' variables; in such cases, each created weighting variable will be suffixed +#' by the name of the group variable. #' @param probability_weights Variable indicating the probability (design or -#' sampling) weights of the survey data (level-1-weight). +#' sampling) weights of the survey data (level-1-weight). #' @param nest Logical, if `TRUE` and `by` indicates at least two -#' group variables, then groups are "nested", i.e. groups are now a -#' combination from each group level of the variables in `by`. +#' group variables, then groups are "nested", i.e. groups are now a +#' combination from each group level of the variables in `by`. +#' @param method `"carle"` or `"kish"`. #' #' @return `data`, including the new weighting variables: `pweights_a` -#' and `pweights_b`, which represent the rescaled design weights to use -#' in multilevel models (use these variables for the `weights` argument). +#' and `pweights_b`, which represent the rescaled design weights to use +#' in multilevel models (use these variables for the `weights` argument). #' #' @details +#' - `method = "carle"` #' -#' Rescaling is based on two methods: For `pweights_a`, the sample weights -#' `probability_weights` are adjusted by a factor that represents the proportion -#' of group size divided by the sum of sampling weights within each group. The -#' adjustment factor for `pweights_b` is the sum of sample weights within each -#' group divided by the sum of squared sample weights within each group (see -#' Carle (2009), Appendix B). In other words, `pweights_a` "scales the weights -#' so that the new weights sum to the cluster sample size" while `pweights_b` -#' "scales the weights so that the new weights sum to the effective cluster -#' size". -#' -#' Regarding the choice between scaling methods A and B, Carle suggests that -#' "analysts who wish to discuss point estimates should report results based on -#' weighting method A. For analysts more interested in residual between-group -#' variance, method B may generally provide the least biased estimates". In -#' general, it is recommended to fit a non-weighted model and weighted models -#' with both scaling methods and when comparing the models, see whether the -#' "inferential decisions converge", to gain confidence in the results. -#' -#' Though the bias of scaled weights decreases with increasing group size, -#' method A is preferred when insufficient or low group size is a concern. -#' -#' The group ID and probably PSU may be used as random effects (e.g. nested -#' design, or group and PSU as varying intercepts), depending on the survey -#' design that should be mimicked. +#' Rescaling is based on two methods: For `pweights_a`, the sample weights +#' `probability_weights` are adjusted by a factor that represents the +#' proportion of group size divided by the sum of sampling weights within each +#' group. The adjustment factor for `pweights_b` is the sum of sample weights +#' within each group divided by the sum of squared sample weights within each +#' group (see Carle (2009), Appendix B). In other words, `pweights_a` "scales +#' the weights so that the new weights sum to the cluster sample size" while +#' `pweights_b` "scales the weights so that the new weights sum to the +#' effective cluster size". +#' +#' Regarding the choice between scaling methods A and B, Carle suggests that +#' "analysts who wish to discuss point estimates should report results based +#' on weighting method A. For analysts more interested in residual +#' between-group variance, method B may generally provide the least biased +#' estimates". In general, it is recommended to fit a non-weighted model and +#' weighted models with both scaling methods and when comparing the models, +#' see whether the "inferential decisions converge", to gain confidence in the +#' results. +#' +#' Though the bias of scaled weights decreases with increasing group size, +#' method A is preferred when insufficient or low group size is a concern. +#' +#' The group ID and probably PSU may be used as random effects (e.g. nested +#' design, or group and PSU as varying intercepts), depending on the survey +#' design that should be mimicked. +#' +#' - `method = "kish"` +#' +#' to do... #' #' @references +#' - Asparouhov T. (2006). General Multi-Level Modeling with Sampling +#' Weights. Communications in Statistics - Theory and Methods 35: 439-460 +#' #' - Carle A.C. (2009). Fitting multilevel models in complex survey data #' with design weights: Recommendations. BMC Medical Research Methodology #' 9(49): 1-13 #' -#' - Asparouhov T. (2006). General Multi-Level Modeling with Sampling -#' Weights. Communications in Statistics - Theory and Methods 35: 439-460 +#' - Kish ... #' #' @examples #' if (require("lme4")) { @@ -87,7 +97,7 @@ #' ) #' } #' @export -rescale_weights <- function(data, by, probability_weights, nest = FALSE) { +rescale_weights <- function(data, by, probability_weights, nest = FALSE, method = "carle") { if (inherits(by, "formula")) { by <- all.vars(by) } @@ -107,6 +117,32 @@ rescale_weights <- function(data, by, probability_weights, nest = FALSE) { # sort id data_tmp$.bamboozled <- seq_len(nrow(data_tmp)) + switch(method, + carle = .rescale_weights_carle(nest, probability_weights, data_tmp, data, by, weight_non_na), + .rescale_weights_kish(probability_weights, data_tmp, data, weight_non_na) + ) +} + + +# rescale weights, method Carle ---------------------------- + +.rescale_weights_kish <- function(probability_weights, data_tmp, data, weight_non_na) { + weights <- mean(data_tmp[[probability_weights]]) + # design effect according to Kish + deff <- mean(weights^2) / (mean(weights)^2) + # rescale weights, so their mean is 1 + z_weights <- ((weights + 1) - mean(weights) ) / stats::sd(weights) + # divide weights by design effect + data$pweight <- NA_real_ + data$pweight[weight_non_na] <- z_weights / deff + # return result + data +} + + +# rescale weights, method Carle ---------------------------- + +.rescale_weights_carle <- function(nest, probability_weights, data_tmp, data, by, weight_non_na) { if (nest && length(by) < 2) { insight::format_warning( sprintf( diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index d9651decb..90e0727eb 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -4,7 +4,7 @@ \alias{rescale_weights} \title{Rescale design weights for multilevel analysis} \usage{ -rescale_weights(data, by, probability_weights, nest = FALSE) +rescale_weights(data, by, probability_weights, nest = FALSE, method = "carle") } \arguments{ \item{data}{A data frame.} @@ -21,6 +21,8 @@ sampling) weights of the survey data (level-1-weight).} \item{nest}{Logical, if \code{TRUE} and \code{by} indicates at least two group variables, then groups are "nested", i.e. groups are now a combination from each group level of the variables in \code{by}.} + +\item{method}{\code{"carle"} or \code{"kish"}.} } \value{ \code{data}, including the new weighting variables: \code{pweights_a} @@ -31,29 +33,34 @@ in multilevel models (use these variables for the \code{weights} argument). Most functions to fit multilevel and mixed effects models only allow to specify frequency weights, but not design (i.e. sampling or probability) weights, which should be used when analyzing complex samples -and survey data. \code{rescale_weights()} implements an algorithm proposed -by \cite{Asparouhov (2006)} and \cite{Carle (2009)} to rescale design -weights in survey data to account for the grouping structure of multilevel -models, which then can be used for multilevel modelling. +and survey data. \code{rescale_weights()} implements two algorithms, one proposed +by \cite{Asparouhov (2006)} and \cite{Carle (2009)} and one proposed by +\cite{Kish 1965}, to rescale design weights in survey data to account for the +grouping structure of multilevel models, which then can be used for +multilevel modelling. } \details{ +\itemize{ +\item \code{method = "carle"} + Rescaling is based on two methods: For \code{pweights_a}, the sample weights -\code{probability_weights} are adjusted by a factor that represents the proportion -of group size divided by the sum of sampling weights within each group. The -adjustment factor for \code{pweights_b} is the sum of sample weights within each -group divided by the sum of squared sample weights within each group (see -Carle (2009), Appendix B). In other words, \code{pweights_a} "scales the weights -so that the new weights sum to the cluster sample size" while \code{pweights_b} -"scales the weights so that the new weights sum to the effective cluster -size". +\code{probability_weights} are adjusted by a factor that represents the +proportion of group size divided by the sum of sampling weights within each +group. The adjustment factor for \code{pweights_b} is the sum of sample weights +within each group divided by the sum of squared sample weights within each +group (see Carle (2009), Appendix B). In other words, \code{pweights_a} "scales +the weights so that the new weights sum to the cluster sample size" while +\code{pweights_b} "scales the weights so that the new weights sum to the +effective cluster size". Regarding the choice between scaling methods A and B, Carle suggests that -"analysts who wish to discuss point estimates should report results based on -weighting method A. For analysts more interested in residual between-group -variance, method B may generally provide the least biased estimates". In -general, it is recommended to fit a non-weighted model and weighted models -with both scaling methods and when comparing the models, see whether the -"inferential decisions converge", to gain confidence in the results. +"analysts who wish to discuss point estimates should report results based +on weighting method A. For analysts more interested in residual +between-group variance, method B may generally provide the least biased +estimates". In general, it is recommended to fit a non-weighted model and +weighted models with both scaling methods and when comparing the models, +see whether the "inferential decisions converge", to gain confidence in the +results. Though the bias of scaled weights decreases with increasing group size, method A is preferred when insufficient or low group size is a concern. @@ -61,6 +68,10 @@ method A is preferred when insufficient or low group size is a concern. The group ID and probably PSU may be used as random effects (e.g. nested design, or group and PSU as varying intercepts), depending on the survey design that should be mimicked. +\item \code{method = "kish"} + +to do... +} } \examples{ if (require("lme4")) { @@ -91,10 +102,11 @@ if (require("lme4")) { } \references{ \itemize{ +\item Asparouhov T. (2006). General Multi-Level Modeling with Sampling +Weights. Communications in Statistics - Theory and Methods 35: 439-460 \item Carle A.C. (2009). Fitting multilevel models in complex survey data with design weights: Recommendations. BMC Medical Research Methodology 9(49): 1-13 -\item Asparouhov T. (2006). General Multi-Level Modeling with Sampling -Weights. Communications in Statistics - Theory and Methods 35: 439-460 +\item Kish ... } } From 6da102ae3ab639551dc6d70d1820249a9fc64a00 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 17 Dec 2024 22:03:26 +0100 Subject: [PATCH 02/43] fix --- R/rescale_weights.R | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index f74705312..d18bdac0e 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -127,7 +127,7 @@ rescale_weights <- function(data, by, probability_weights, nest = FALSE, method # rescale weights, method Carle ---------------------------- .rescale_weights_kish <- function(probability_weights, data_tmp, data, weight_non_na) { - weights <- mean(data_tmp[[probability_weights]]) + weights <- data_tmp[[probability_weights]] # design effect according to Kish deff <- mean(weights^2) / (mean(weights)^2) # rescale weights, so their mean is 1 From 469a88122da827b6987c384c48b22a3d3a2b18ff Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 17 Dec 2024 23:25:05 +0100 Subject: [PATCH 03/43] fix --- R/rescale_weights.R | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index d18bdac0e..8f73b948a 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -131,7 +131,7 @@ rescale_weights <- function(data, by, probability_weights, nest = FALSE, method # design effect according to Kish deff <- mean(weights^2) / (mean(weights)^2) # rescale weights, so their mean is 1 - z_weights <- ((weights + 1) - mean(weights) ) / stats::sd(weights) + z_weights <- weights * (1 / mean(weights)) # divide weights by design effect data$pweight <- NA_real_ data$pweight[weight_non_na] <- z_weights / deff From c08b07a1afd48004b0e28e5eaac69b50b6d9c5d6 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 17 Dec 2024 23:26:13 +0100 Subject: [PATCH 04/43] lintr --- R/rescale_weights.R | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 8f73b948a..162863609 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -127,11 +127,11 @@ rescale_weights <- function(data, by, probability_weights, nest = FALSE, method # rescale weights, method Carle ---------------------------- .rescale_weights_kish <- function(probability_weights, data_tmp, data, weight_non_na) { - weights <- data_tmp[[probability_weights]] + p_weights <- data_tmp[[probability_weights]] # design effect according to Kish - deff <- mean(weights^2) / (mean(weights)^2) + deff <- mean(p_weights^2) / (mean(p_weights)^2) # rescale weights, so their mean is 1 - z_weights <- weights * (1 / mean(weights)) + z_weights <- p_weights * (1 / mean(p_weights)) # divide weights by design effect data$pweight <- NA_real_ data$pweight[weight_non_na] <- z_weights / deff From 3cfc1285139bd517936040484937021b8c8e6542 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 17 Dec 2024 23:28:47 +0100 Subject: [PATCH 05/43] fix --- R/rescale_weights.R | 6 +++++- man/rescale_weights.Rd | 8 +++++++- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 162863609..12b839cc2 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -97,7 +97,11 @@ #' ) #' } #' @export -rescale_weights <- function(data, by, probability_weights, nest = FALSE, method = "carle") { +rescale_weights <- function(data, + by = NULL, + probability_weights = NULL, + nest = FALSE, + method = "carle") { if (inherits(by, "formula")) { by <- all.vars(by) } diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 90e0727eb..30f62fe6c 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -4,7 +4,13 @@ \alias{rescale_weights} \title{Rescale design weights for multilevel analysis} \usage{ -rescale_weights(data, by, probability_weights, nest = FALSE, method = "carle") +rescale_weights( + data, + by = NULL, + probability_weights = NULL, + nest = FALSE, + method = "carle" +) } \arguments{ \item{data}{A data frame.} From 46e7b6cd31559cb41c6f5b895274ed9f9fb3af8c Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 17 Dec 2024 23:34:30 +0100 Subject: [PATCH 06/43] fix --- R/rescale_weights.R | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 12b839cc2..b7e5eedbd 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -118,9 +118,6 @@ rescale_weights <- function(data, data_tmp <- data } - # sort id - data_tmp$.bamboozled <- seq_len(nrow(data_tmp)) - switch(method, carle = .rescale_weights_carle(nest, probability_weights, data_tmp, data, by, weight_non_na), .rescale_weights_kish(probability_weights, data_tmp, data, weight_non_na) @@ -137,8 +134,8 @@ rescale_weights <- function(data, # rescale weights, so their mean is 1 z_weights <- p_weights * (1 / mean(p_weights)) # divide weights by design effect - data$pweight <- NA_real_ - data$pweight[weight_non_na] <- z_weights / deff + data$pweights <- NA_real_ + data$pweights[weight_non_na] <- z_weights / deff # return result data } @@ -147,6 +144,13 @@ rescale_weights <- function(data, # rescale weights, method Carle ---------------------------- .rescale_weights_carle <- function(nest, probability_weights, data_tmp, data, by, weight_non_na) { + # sort id + data_tmp$.bamboozled <- seq_len(nrow(data_tmp)) + + if (is.null(by)) { + insight::format_error("Argument `by` must be specified. Please provide one or more variable names in `by` that indicate the grouping structure (strata) of the survey data (level-2-cluster variable).") # nolint + } + if (nest && length(by) < 2) { insight::format_warning( sprintf( From ff727394c9551fc3662b67a8128a10d80145f05a Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 17 Dec 2024 23:36:59 +0100 Subject: [PATCH 07/43] fix --- R/rescale_weights.R | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index b7e5eedbd..7b7715038 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -151,6 +151,10 @@ rescale_weights <- function(data, insight::format_error("Argument `by` must be specified. Please provide one or more variable names in `by` that indicate the grouping structure (strata) of the survey data (level-2-cluster variable).") # nolint } + if (!by %in% colnames(data_tmp)) { + insight::format_error("The variable specified in `by` was not found in the data. Maybe misspelled?") # nolint + } + if (nest && length(by) < 2) { insight::format_warning( sprintf( From 80d9746e4ded39a72f5edb10f72cd7bc5187782e Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 08:20:36 +0100 Subject: [PATCH 08/43] docs --- R/rescale_weights.R | 32 ++++++++++++++++++++++++++------ 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 7b7715038..476916dc6 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -58,7 +58,11 @@ #' #' - `method = "kish"` #' -#' to do... +#' Rescaling is based on scaling the sample weights so the mean value is 1, +#' which means the sum of all weights equals the sample size. Next, the design +#' effect (_Kish 1965_) is calculated, which is the mean of the squared weights +#' divided by the squared mean of the weights. The scales sample weights are +#' then divided by the design effect. #' #' @references #' - Asparouhov T. (2006). General Multi-Level Modeling with Sampling @@ -68,7 +72,7 @@ #' with design weights: Recommendations. BMC Medical Research Methodology #' 9(49): 1-13 #' -#' - Kish ... +#' - Kish, L. (1965) Survey Sampling. London: Wiley. #' #' @examples #' if (require("lme4")) { @@ -118,19 +122,35 @@ rescale_weights <- function(data, data_tmp <- data } + fun_args <- list( + nest = nest, + probability_weights = probability_weights, + data_tmp = data_tmp, + data = data, + by = by, + weight_non_na = weight_non_na + ) + switch(method, - carle = .rescale_weights_carle(nest, probability_weights, data_tmp, data, by, weight_non_na), - .rescale_weights_kish(probability_weights, data_tmp, data, weight_non_na) + carle = do.call(.rescale_weights_carle, fun_args), + do.call(.rescale_weights_kish, fun_args) ) } -# rescale weights, method Carle ---------------------------- +# rescale weights, method Kish ---------------------------- -.rescale_weights_kish <- function(probability_weights, data_tmp, data, weight_non_na) { +.rescale_weights_kish <- function(nest, probability_weights, data_tmp, data, by, weight_non_na) { p_weights <- data_tmp[[probability_weights]] + # design effect according to Kish deff <- mean(p_weights^2) / (mean(p_weights)^2) + + # n_per_group <- as.vector(table(data_tmp[[by]])) + # b_bar <- mean(n_per_group) + # icc <- 0.05 + # deff <- nrow(data_tmp) * (mean(p_weights^2) / (mean(p_weights)^2)) * (1 + (b_bar - 1) * icc) + # rescale weights, so their mean is 1 z_weights <- p_weights * (1 / mean(p_weights)) # divide weights by design effect From 5463eb3c41e5a40aca9d84e8f2247c9405c580b7 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 08:20:41 +0100 Subject: [PATCH 09/43] docs --- man/rescale_weights.Rd | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 30f62fe6c..7b6be3984 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -76,7 +76,11 @@ design, or group and PSU as varying intercepts), depending on the survey design that should be mimicked. \item \code{method = "kish"} -to do... +Rescaling is based on scaling the sample weights so the mean value is 1, +which means the sum of all weights equals the sample size. Next, the design +effect (\emph{Kish 1965}) is calculated, which is the mean of the squared weights +divided by the squared mean of the weights. The scales sample weights are +then divided by the design effect. } } \examples{ @@ -113,6 +117,6 @@ Weights. Communications in Statistics - Theory and Methods 35: 439-460 \item Carle A.C. (2009). Fitting multilevel models in complex survey data with design weights: Recommendations. BMC Medical Research Methodology 9(49): 1-13 -\item Kish ... +\item Kish, L. (1965) Survey Sampling. London: Wiley. } } From 456166b6ef8640a0f61b70c8935f510041ae9d7e Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 08:51:46 +0100 Subject: [PATCH 10/43] remove experimental code for now --- R/rescale_weights.R | 7 ------- 1 file changed, 7 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 476916dc6..0c8269baf 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -142,15 +142,8 @@ rescale_weights <- function(data, .rescale_weights_kish <- function(nest, probability_weights, data_tmp, data, by, weight_non_na) { p_weights <- data_tmp[[probability_weights]] - # design effect according to Kish deff <- mean(p_weights^2) / (mean(p_weights)^2) - - # n_per_group <- as.vector(table(data_tmp[[by]])) - # b_bar <- mean(n_per_group) - # icc <- 0.05 - # deff <- nrow(data_tmp) * (mean(p_weights^2) / (mean(p_weights)^2)) * (1 + (b_bar - 1) * icc) - # rescale weights, so their mean is 1 z_weights <- p_weights * (1 / mean(p_weights)) # divide weights by design effect From dd8ca23efc91e494354cb1cd1c3c147ee788da02 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 09:08:01 +0100 Subject: [PATCH 11/43] docs --- R/rescale_weights.R | 13 ++++++++++--- man/rescale_weights.Rd | 8 +++++--- 2 files changed, 15 insertions(+), 6 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 0c8269baf..340cdf2b4 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -23,9 +23,11 @@ #' combination from each group level of the variables in `by`. #' @param method `"carle"` or `"kish"`. #' -#' @return `data`, including the new weighting variables: `pweights_a` -#' and `pweights_b`, which represent the rescaled design weights to use -#' in multilevel models (use these variables for the `weights` argument). +#' @return `data`, including the new weighting variable(s). For +#' `method = "carle"`, new columns `pweights_a` and `pweights_b` are returned, +#' and for `method = "klish"`, the returned data contains a column `pweights`. +#' These represent the rescaled design weights to use in multilevel models (use +#' these variables for the `weights` argument). #' #' @details #' - `method = "carle"` @@ -110,6 +112,11 @@ rescale_weights <- function(data, by <- all.vars(by) } + # check for existing variable names + if (any(c("pweights_a", "pweights_b", "pweights") %in% colnames(data))) { + insight::format_warning("The variable name for the rescaled weights already exists in the data. Existing columns will be overwritten.") + } + # check if weight has missings. we need to remove them first, # and add back weights to correct cases later diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 7b6be3984..25d5a8958 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -31,9 +31,11 @@ combination from each group level of the variables in \code{by}.} \item{method}{\code{"carle"} or \code{"kish"}.} } \value{ -\code{data}, including the new weighting variables: \code{pweights_a} -and \code{pweights_b}, which represent the rescaled design weights to use -in multilevel models (use these variables for the \code{weights} argument). +\code{data}, including the new weighting variable(s). For +\code{method = "carle"}, new columns \code{pweights_a} and \code{pweights_b} are returned, +and for \code{method = "klish"}, the returned data contains a column \code{pweights}. +These represent the rescaled design weights to use in multilevel models (use +these variables for the \code{weights} argument). } \description{ Most functions to fit multilevel and mixed effects models only From 33f8482ef8d47ac91b6ad963a81d912af4351a88 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 09:26:38 +0100 Subject: [PATCH 12/43] fix, add tests --- R/rescale_weights.R | 10 +++++++-- tests/testthat/test-rescale_weights.R | 30 +++++++++++++++++++++++++++ 2 files changed, 38 insertions(+), 2 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 340cdf2b4..dd8e0724a 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -171,8 +171,14 @@ rescale_weights <- function(data, insight::format_error("Argument `by` must be specified. Please provide one or more variable names in `by` that indicate the grouping structure (strata) of the survey data (level-2-cluster variable).") # nolint } - if (!by %in% colnames(data_tmp)) { - insight::format_error("The variable specified in `by` was not found in the data. Maybe misspelled?") # nolint + if (!all(by %in% colnames(data_tmp))) { + dont_exist <- by[which(!by %in% colnames(data_tmp))] + insight::format_error(paste0( + "The following variable(s) specified in `by` don't exist in the dataset: ", + text_concatenate(dont_exist), "." + ), + .misspelled_string(colnames(data_tmp), dont_exist, "Possibly misspelled?") + ) } if (nest && length(by) < 2) { diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index 504157180..bf94387e5 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -40,3 +40,33 @@ test_that("rescale_weights nested works as expected", { ) ) }) + + +test_that("rescale_weights errors and warnings", { + data(nhanes_sample) + expect_error( + rescale_weights( + data = head(nhanes_sample, n = 30), + by = c("a", "SDMVSTRA", "c"), + probability_weights = "WTINT2YR" + ), + regex = "The following" + ) + expect_error( + rescale_weights( + data = head(nhanes_sample, n = 30), + by = NULL, + probability_weights = "WTINT2YR" + ), + regex = "must be specified" + ) + nhanes_sample$pweights_a <- 1 + expect_warning( + rescale_weights( + data = head(nhanes_sample, n = 30), + by = "SDMVSTRA", + probability_weights = "WTINT2YR" + ), + regex = "The variable name" + ) +}) From 8bd8ebe42dfec1a343e8e8d43c79f2a24b8bb7e6 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 09:40:33 +0100 Subject: [PATCH 13/43] fix, add tests --- R/rescale_weights.R | 12 ++++++++++-- tests/testthat/test-rescale_weights.R | 19 ++++++++++++++----- 2 files changed, 24 insertions(+), 7 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index dd8e0724a..5ffa73331 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -114,7 +114,7 @@ rescale_weights <- function(data, # check for existing variable names if (any(c("pweights_a", "pweights_b", "pweights") %in% colnames(data))) { - insight::format_warning("The variable name for the rescaled weights already exists in the data. Existing columns will be overwritten.") + insight::format_warning("The variable name for the rescaled weights already exists in the data. Returned columns will be renamed into unique names.") } # check if weight has missings. we need to remove them first, @@ -203,7 +203,15 @@ rescale_weights <- function(data, }) } - do.call(cbind, list(data, out)) + make_unique_names <- any(vapply(out, function(i) any(colnames(i) %in% colnames(data)), logical(1))) + # add weights to data frame + out <- do.call(cbind, list(data, out)) + # check if we have to rename columns + if (make_unique_names) { + colnames(out) <- make.unique(colnames(out), sep = "_") + } + + out } diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index bf94387e5..760f0c982 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -62,11 +62,20 @@ test_that("rescale_weights errors and warnings", { ) nhanes_sample$pweights_a <- 1 expect_warning( - rescale_weights( - data = head(nhanes_sample, n = 30), - by = "SDMVSTRA", - probability_weights = "WTINT2YR" - ), + { + out <- rescale_weights( + data = head(nhanes_sample, n = 30), + by = "SDMVSTRA", + probability_weights = "WTINT2YR" + ) + }, regex = "The variable name" ) + expect_named( + out, + c( + "total", "age", "RIAGENDR", "RIDRETH1", "SDMVPSU", "SDMVSTRA", + "WTINT2YR", "pweights_a", "pweights_a_1", "pweights_b" + ) + ) }) From 9b093ff28543a06dd31921d16d49766adfab9e28 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 09:43:17 +0100 Subject: [PATCH 14/43] add tests --- tests/testthat/test-rescale_weights.R | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index 760f0c982..a1eaf49f8 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -4,6 +4,14 @@ test_that("rescale_weights works as expected", { expect_snapshot(head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR"))) expect_snapshot(head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR"))) + + expect_snapshot(head(rescale_weights(nhanes_sample, probability_weights = "WTINT2YR", method = "kish"))) + + out <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR") + expect_equal(sum(out$pweights_a), 2992, tolerance = 1e-3) + expect_equal(sum(out$pweights_b), 2244.71451, tolerance = 1e-3) + out <- rescale_weights(nhanes_sample, probability_weights = "WTINT2YR", method = "kish") + expect_equal(sum(out$pweights_b), 2162.53961, tolerance = 1e-3) }) From 80b7e3a4cbe592a871f25403de54386efc4b47a9 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 09:44:51 +0100 Subject: [PATCH 15/43] add tests --- tests/testthat/_snaps/rescale_weights.md | 15 +++++++++++++++ tests/testthat/test-rescale_weights.R | 2 +- 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/tests/testthat/_snaps/rescale_weights.md b/tests/testthat/_snaps/rescale_weights.md index 5de6d489a..48185e7cc 100644 --- a/tests/testthat/_snaps/rescale_weights.md +++ b/tests/testthat/_snaps/rescale_weights.md @@ -31,6 +31,21 @@ 5 0.3119698 0.3060151 0.2152722 6 0.5155503 0.5858662 0.4121388 +--- + + Code + head(rescale_weights(nhanes_sample, probability_weights = "WTINT2YR", method = "kish")) + Output + # A tibble: 6 x 8 + total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR pweights + + 1 1 2.2 1 3 2 31 97594. 1.40 + 2 7 2.08 2 3 1 29 39599. 0.566 + 3 3 1.48 2 1 2 42 26620. 0.381 + 4 4 1.32 2 4 2 33 34999. 0.500 + 5 1 2 2 1 1 41 14746. 0.211 + 6 6 2.2 2 4 1 38 28232. 0.404 + # rescale_weights nested works as expected Code diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index a1eaf49f8..c604c7dc5 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -11,7 +11,7 @@ test_that("rescale_weights works as expected", { expect_equal(sum(out$pweights_a), 2992, tolerance = 1e-3) expect_equal(sum(out$pweights_b), 2244.71451, tolerance = 1e-3) out <- rescale_weights(nhanes_sample, probability_weights = "WTINT2YR", method = "kish") - expect_equal(sum(out$pweights_b), 2162.53961, tolerance = 1e-3) + expect_equal(sum(out$pweights), 2162.53961, tolerance = 1e-3) }) From 2b2412024afab98474496270cc1811d24ef0f6c4 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 09:49:21 +0100 Subject: [PATCH 16/43] styler --- R/rescale_weights.R | 3 ++- R/select_nse.R | 11 +++++------ 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 5ffa73331..e74bad09e 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -173,7 +173,8 @@ rescale_weights <- function(data, if (!all(by %in% colnames(data_tmp))) { dont_exist <- by[which(!by %in% colnames(data_tmp))] - insight::format_error(paste0( + insight::format_error( + paste0( "The following variable(s) specified in `by` don't exist in the dataset: ", text_concatenate(dont_exist), "." ), diff --git a/R/select_nse.R b/R/select_nse.R index a085a4ce3..766809b6c 100644 --- a/R/select_nse.R +++ b/R/select_nse.R @@ -199,12 +199,11 @@ # small helper, to avoid duplicated code .action_if_not_found <- function( - x, - columns, - matches, - verbose, - ifnotfound -) { + x, + columns, + matches, + verbose, + ifnotfound) { msg <- paste0( "Following variable(s) were not found: ", toString(x[is.na(matches)]) From 3c0af2cae2f9be584612effe2cbaad94c969cef2 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 09:55:29 +0100 Subject: [PATCH 17/43] lintr, wordlist --- R/rescale_weights.R | 2 +- inst/WORDLIST | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index e74bad09e..7bf320a0e 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -114,7 +114,7 @@ rescale_weights <- function(data, # check for existing variable names if (any(c("pweights_a", "pweights_b", "pweights") %in% colnames(data))) { - insight::format_warning("The variable name for the rescaled weights already exists in the data. Returned columns will be renamed into unique names.") + insight::format_warning("The variable name for the rescaled weights already exists in the data. Returned columns will be renamed into unique names.") # nolint } # check if weight has missings. we need to remove them first, diff --git a/inst/WORDLIST b/inst/WORDLIST index a8b4ff08d..176d19b6a 100644 --- a/inst/WORDLIST +++ b/inst/WORDLIST @@ -26,6 +26,7 @@ Heisig Herrington Hoffmann Joanes +Kish Llabre Lumley MADs From f4666422bc2e98c6a64ae5febed21f28965280bc Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 10:08:13 +0100 Subject: [PATCH 18/43] docs --- R/rescale_weights.R | 66 ++++++++++++++++++++++++++++-------------- man/rescale_weights.Rd | 66 +++++++++++++++++++++++++++++------------- 2 files changed, 91 insertions(+), 41 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 7bf320a0e..bc10ca249 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -76,31 +76,55 @@ #' #' - Kish, L. (1965) Survey Sampling. London: Wiley. #' -#' @examples -#' if (require("lme4")) { -#' data(nhanes_sample) -#' head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")) +#' @examplesIf all(insight::check_if_installed(c("lme4", "parameters"), quietly = TRUE)) +#' data(nhanes_sample) +#' head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")) #' -#' # also works with multiple group-variables -#' head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR")) +#' # also works with multiple group-variables +#' head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR")) #' -#' # or nested structures. -#' x <- rescale_weights( -#' data = nhanes_sample, -#' by = c("SDMVSTRA", "SDMVPSU"), -#' probability_weights = "WTINT2YR", -#' nest = TRUE -#' ) -#' head(x) +#' # or nested structures. +#' x <- rescale_weights( +#' data = nhanes_sample, +#' by = c("SDMVSTRA", "SDMVPSU"), +#' probability_weights = "WTINT2YR", +#' nest = TRUE +#' ) +#' head(x) #' -#' nhanes_sample <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR") +#' \donttest{ +#' # compare different methods, using multilevel-Poisson regression #' -#' glmer( -#' total ~ factor(RIAGENDR) * (log(age) + factor(RIDRETH1)) + (1 | SDMVPSU), -#' family = poisson(), -#' data = nhanes_sample, -#' weights = pweights_a -#' ) +#' d <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR") +#' result1 <- lme4::glmer( +#' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), +#' family = poisson(), +#' data = d, +#' weights = pweights_a +#' ) +#' result2 <- lme4::glmer( +#' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), +#' family = poisson(), +#' data = d, +#' weights = pweights_b +#' ) +#' +#' d <- rescale_weights( +#' nhanes_sample, +#' probability_weights = "WTINT2YR", +#' method = "kish" +#' ) +#' result3 <- lme4::glmer( +#' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), +#' family = poisson(), +#' data = d, +#' weights = pweights +#' ) +#' parameters::compare_parameters( +#' list(result1, result2, result3), +#' exponentiate = TRUE, +#' column_names = c("Carle (A)", "Carle (B)", "Kish") +#' ) #' } #' @export rescale_weights <- function(data, diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 25d5a8958..45dd460b1 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -86,31 +86,57 @@ then divided by the design effect. } } \examples{ -if (require("lme4")) { - data(nhanes_sample) - head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")) +\dontshow{if (all(insight::check_if_installed(c("lme4", "parameters"), quietly = TRUE))) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf} +data(nhanes_sample) +head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")) - # also works with multiple group-variables - head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR")) +# also works with multiple group-variables +head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR")) - # or nested structures. - x <- rescale_weights( - data = nhanes_sample, - by = c("SDMVSTRA", "SDMVPSU"), - probability_weights = "WTINT2YR", - nest = TRUE - ) - head(x) +# or nested structures. +x <- rescale_weights( + data = nhanes_sample, + by = c("SDMVSTRA", "SDMVPSU"), + probability_weights = "WTINT2YR", + nest = TRUE +) +head(x) + +\donttest{ +# compare different methods, using multilevel-Poisson regression - nhanes_sample <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR") +d <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR") +result1 <- lme4::glmer( + total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), + family = poisson(), + data = d, + weights = pweights_a +) +result2 <- lme4::glmer( + total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), + family = poisson(), + data = d, + weights = pweights_b +) - glmer( - total ~ factor(RIAGENDR) * (log(age) + factor(RIDRETH1)) + (1 | SDMVPSU), - family = poisson(), - data = nhanes_sample, - weights = pweights_a - ) +d <- rescale_weights( + nhanes_sample, + probability_weights = "WTINT2YR", + method = "kish" +) +result3 <- lme4::glmer( + total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), + family = poisson(), + data = d, + weights = pweights +) +parameters::compare_parameters( + list(result1, result2, result3), + exponentiate = TRUE, + column_names = c("Carle (A)", "Carle (B)", "Kish") +) } +\dontshow{\}) # examplesIf} } \references{ \itemize{ From 8214df01029c5da943ebd04dcbbfe79d0b30e4c1 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 11:18:09 +0100 Subject: [PATCH 19/43] fix --- R/rescale_weights.R | 18 +++++++++++++++--- man/rescale_weights.Rd | 6 ++++-- tests/testthat/_snaps/rescale_weights.md | 16 +++++++--------- tests/testthat/test-rescale_weights.R | 22 ++++++++++++++++++++++ 4 files changed, 48 insertions(+), 14 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index bc10ca249..5113a281e 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -15,13 +15,15 @@ #' the grouping structure (strata) of the survey data (level-2-cluster #' variable). It is also possible to create weights for multiple group #' variables; in such cases, each created weighting variable will be suffixed -#' by the name of the group variable. +#' by the name of the group variable. Argument `by` only applies to the default +#' rescaling-method (`method = "carle"`), not to `method = "kish"`. #' @param probability_weights Variable indicating the probability (design or #' sampling) weights of the survey data (level-1-weight). #' @param nest Logical, if `TRUE` and `by` indicates at least two #' group variables, then groups are "nested", i.e. groups are now a #' combination from each group level of the variables in `by`. -#' @param method `"carle"` or `"kish"`. +#' @param method String, indicating which rescale-method is used for rescaling +#' weights. Can be either `"carle"` (default) or `"kish"`. See 'Details'. #' #' @return `data`, including the new weighting variable(s). For #' `method = "carle"`, new columns `pweights_a` and `pweights_b` are returned, @@ -137,10 +139,16 @@ rescale_weights <- function(data, } # check for existing variable names - if (any(c("pweights_a", "pweights_b", "pweights") %in% colnames(data))) { + if ((method == "carle" && any(c("pweights_a", "pweights_b") %in% colnames(data))) || + (method == "kish" && "pweights" %in% colnames(data))) { insight::format_warning("The variable name for the rescaled weights already exists in the data. Returned columns will be renamed into unique names.") # nolint } + # need probability_weights + if (is.null(probability_weights)) { + insight::format_error("The argument `probability_weights` is missing, but required to rescale weights.") + } + # check if weight has missings. we need to remove them first, # and add back weights to correct cases later @@ -172,6 +180,10 @@ rescale_weights <- function(data, # rescale weights, method Kish ---------------------------- .rescale_weights_kish <- function(nest, probability_weights, data_tmp, data, by, weight_non_na) { + # check argument + if (!is.null(by)) { + insight::format_warning("The `by` argument is not used for `method = \"kish\" and will be ignored.") + } p_weights <- data_tmp[[probability_weights]] # design effect according to Kish deff <- mean(p_weights^2) / (mean(p_weights)^2) diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 45dd460b1..41e932990 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -19,7 +19,8 @@ rescale_weights( the grouping structure (strata) of the survey data (level-2-cluster variable). It is also possible to create weights for multiple group variables; in such cases, each created weighting variable will be suffixed -by the name of the group variable.} +by the name of the group variable. Argument \code{by} only applies to the default +rescaling-method (\code{method = "carle"}), not to \code{method = "kish"}.} \item{probability_weights}{Variable indicating the probability (design or sampling) weights of the survey data (level-1-weight).} @@ -28,7 +29,8 @@ sampling) weights of the survey data (level-1-weight).} group variables, then groups are "nested", i.e. groups are now a combination from each group level of the variables in \code{by}.} -\item{method}{\code{"carle"} or \code{"kish"}.} +\item{method}{String, indicating which rescale-method is used for rescaling +weights. Can be either \code{"carle"} (default) or \code{"kish"}. See 'Details'.} } \value{ \code{data}, including the new weighting variable(s). For diff --git a/tests/testthat/_snaps/rescale_weights.md b/tests/testthat/_snaps/rescale_weights.md index 48185e7cc..ecdefdd06 100644 --- a/tests/testthat/_snaps/rescale_weights.md +++ b/tests/testthat/_snaps/rescale_weights.md @@ -36,15 +36,13 @@ Code head(rescale_weights(nhanes_sample, probability_weights = "WTINT2YR", method = "kish")) Output - # A tibble: 6 x 8 - total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR pweights - - 1 1 2.2 1 3 2 31 97594. 1.40 - 2 7 2.08 2 3 1 29 39599. 0.566 - 3 3 1.48 2 1 2 42 26620. 0.381 - 4 4 1.32 2 4 2 33 34999. 0.500 - 5 1 2 2 1 1 41 14746. 0.211 - 6 6 2.2 2 4 1 38 28232. 0.404 + total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR pweights + 1 1 2.20 1 3 2 31 97593.68 1.3952529 + 2 7 2.08 2 3 1 29 39599.36 0.5661343 + 3 3 1.48 2 1 2 42 26619.83 0.3805718 + 4 4 1.32 2 4 2 33 34998.53 0.5003582 + 5 1 2.00 2 1 1 41 14746.45 0.2108234 + 6 6 2.20 2 4 1 38 28232.10 0.4036216 # rescale_weights nested works as expected diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index c604c7dc5..9c2415626 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -1,5 +1,7 @@ test_that("rescale_weights works as expected", { data(nhanes_sample) + # convert tibble into data frame, so check-hard GHA works + nhanes_sample <- as.data.frame(nhanes_sample) expect_snapshot(head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR"))) @@ -17,6 +19,8 @@ test_that("rescale_weights works as expected", { test_that("rescale_weights nested works as expected", { data(nhanes_sample) + # convert tibble into data frame, so check-hard GHA works + nhanes_sample <- as.data.frame(nhanes_sample) expect_snapshot( rescale_weights( @@ -60,6 +64,14 @@ test_that("rescale_weights errors and warnings", { ), regex = "The following" ) + expect_error( + rescale_weights( + data = head(nhanes_sample, n = 30), + by = "SDMVSTRA", + probability_weights = NULL + ), + regex = "is missing, but required" + ) expect_error( rescale_weights( data = head(nhanes_sample, n = 30), @@ -68,6 +80,16 @@ test_that("rescale_weights errors and warnings", { ), regex = "must be specified" ) + expect_warning( + rescale_weights( + data = head(nhanes_sample, n = 30), + by = "SDMVSTRA", + probability_weights = "WTINT2YR", + method = "kish" + ), + regex = "is not used" + ) + nhanes_sample$pweights_a <- 1 expect_warning( { From 5ad1bcbd3ed3d78df5b3f86f0c3a594b15dc8578 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 11:23:53 +0100 Subject: [PATCH 20/43] docs, tests, rename into rescaled_weights --- DESCRIPTION | 2 +- NEWS.md | 12 +- R/rescale_weights.R | 54 ++++---- man/rescale_weights.Rd | 30 ++--- tests/testthat/_snaps/rescale_weights.md | 159 ++++++++++++----------- tests/testthat/test-rescale_weights.R | 10 +- 6 files changed, 141 insertions(+), 126 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index 68cfb6741..034c823ed 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,7 +1,7 @@ Type: Package Package: datawizard Title: Easy Data Wrangling and Statistical Transformations -Version: 0.13.0.19 +Version: 0.13.0.20 Authors@R: c( person("Indrajeet", "Patil", , "patilindrajeet.science@gmail.com", role = "aut", comment = c(ORCID = "0000-0003-1995-6531")), diff --git a/NEWS.md b/NEWS.md index 35e549ffa..82513340f 100644 --- a/NEWS.md +++ b/NEWS.md @@ -4,20 +4,28 @@ BREAKING CHANGES AND DEPRECATIONS * *datawizard* now requires R >= 4.0 (#515). -* Argument `drop_na` in `data_match()` is deprecated now. Please use +* Argument `drop_na` in `data_match()` is deprecated now. Please use `remove_na` instead. * In `data_rename()` (#567): - argument `pattern` is deprecated. Use `select` instead. - - argument `safe` is deprecated. The function now errors when `select` + - argument `safe` is deprecated. The function now errors when `select` contains unknown column names. - when `replacement` is `NULL`, an error is now thrown (previously, column indices were used as new names). - if `select` (previously `pattern`) is a named vector, then all elements must be named, e.g. `c(length = "Sepal.Length", "Sepal.Width")` errors. +* The name of the rescaled weights variables in `rescale_weights()` have been + renamed. `pweights_a` and `pweights_b` are now named `rescaled_weights_a` + and `rescaled_weights_b`. + CHANGES +* `rescale_weights()` gets a `method` argument, to choose method to rescale + weights. Options are `"carle"` (the default) and `"kish"`, a newly added + method to rescale weights. + * The `select` argument, which is available in different functions to select variables, can now also be a character vector with quoted variable names, including a colon to indicate a range of several variables (e.g. `"cyl:gear"`). diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 5113a281e..de5d1874e 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -26,23 +26,23 @@ #' weights. Can be either `"carle"` (default) or `"kish"`. See 'Details'. #' #' @return `data`, including the new weighting variable(s). For -#' `method = "carle"`, new columns `pweights_a` and `pweights_b` are returned, -#' and for `method = "klish"`, the returned data contains a column `pweights`. -#' These represent the rescaled design weights to use in multilevel models (use -#' these variables for the `weights` argument). +#' `method = "carle"`, new columns `rescaled_weights_a` and `rescaled_weights_b` +#' are returned, and for `method = "klish"`, the returned data contains a column +#' `rescaled_weights`. These represent the rescaled design weights to use in +#' multilevel models (use these variables for the `weights` argument). #' #' @details #' - `method = "carle"` #' -#' Rescaling is based on two methods: For `pweights_a`, the sample weights -#' `probability_weights` are adjusted by a factor that represents the +#' Rescaling is based on two methods: For `rescaled_weights_a`, the sample +#' weights `probability_weights` are adjusted by a factor that represents the #' proportion of group size divided by the sum of sampling weights within each -#' group. The adjustment factor for `pweights_b` is the sum of sample weights -#' within each group divided by the sum of squared sample weights within each -#' group (see Carle (2009), Appendix B). In other words, `pweights_a` "scales -#' the weights so that the new weights sum to the cluster sample size" while -#' `pweights_b` "scales the weights so that the new weights sum to the -#' effective cluster size". +#' group. The adjustment factor for `rescaled_weights_b` is the sum of sample +#' weights within each group divided by the sum of squared sample weights +#' within each group (see Carle (2009), Appendix B). In other words, +#' `rescaled_weights_a` "scales the weights so that the new weights sum to the +#' cluster sample size" while `rescaled_weights_b` "scales the weights so that +#' the new weights sum to the effective cluster size". #' #' Regarding the choice between scaling methods A and B, Carle suggests that #' "analysts who wish to discuss point estimates should report results based @@ -102,13 +102,13 @@ #' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), #' family = poisson(), #' data = d, -#' weights = pweights_a +#' weights = rescaled_weights_a #' ) #' result2 <- lme4::glmer( #' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), #' family = poisson(), #' data = d, -#' weights = pweights_b +#' weights = rescaled_weights_b #' ) #' #' d <- rescale_weights( @@ -120,7 +120,7 @@ #' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), #' family = poisson(), #' data = d, -#' weights = pweights +#' weights = rescaled_weights #' ) #' parameters::compare_parameters( #' list(result1, result2, result3), @@ -139,8 +139,8 @@ rescale_weights <- function(data, } # check for existing variable names - if ((method == "carle" && any(c("pweights_a", "pweights_b") %in% colnames(data))) || - (method == "kish" && "pweights" %in% colnames(data))) { + if ((method == "carle" && any(c("rescaled_weights_a", "rescaled_weights_b") %in% colnames(data))) || + (method == "kish" && "rescaled_weights" %in% colnames(data))) { insight::format_warning("The variable name for the rescaled weights already exists in the data. Returned columns will be renamed into unique names.") # nolint } @@ -190,8 +190,8 @@ rescale_weights <- function(data, # rescale weights, so their mean is 1 z_weights <- p_weights * (1 / mean(p_weights)) # divide weights by design effect - data$pweights <- NA_real_ - data$pweights[weight_non_na] <- z_weights / deff + data$rescaled_weights <- NA_real_ + data$rescaled_weights[weight_non_na] <- z_weights / deff # return result data } @@ -277,12 +277,12 @@ rescale_weights <- function(data, w_b <- x[[probability_weights]] * x$sum_weights_by_group / x$sum_squared_weights_by_group out <- data.frame( - pweights_a = rep(NA_real_, times = n), - pweights_b = rep(NA_real_, times = n) + rescaled_weights_a = rep(NA_real_, times = n), + rescaled_weights_b = rep(NA_real_, times = n) ) - out$pweights_a[weight_non_na] <- w_a - out$pweights_b[weight_non_na] <- w_b + out$rescaled_weights_a[weight_non_na] <- w_a + out$rescaled_weights_b[weight_non_na] <- w_b out } @@ -325,12 +325,12 @@ rescale_weights <- function(data, w_b <- x[[probability_weights]] * x$sum_weights_by_group / x$sum_squared_weights_by_group out <- data.frame( - pweights_a = rep(NA_real_, times = n), - pweights_b = rep(NA_real_, times = n) + rescaled_weights_a = rep(NA_real_, times = n), + rescaled_weights_b = rep(NA_real_, times = n) ) - out$pweights_a[weight_non_na] <- w_a - out$pweights_b[weight_non_na] <- w_b + out$rescaled_weights_a[weight_non_na] <- w_a + out$rescaled_weights_b[weight_non_na] <- w_b out } diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 41e932990..65b4ec6cd 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -34,10 +34,10 @@ weights. Can be either \code{"carle"} (default) or \code{"kish"}. See 'Details'. } \value{ \code{data}, including the new weighting variable(s). For -\code{method = "carle"}, new columns \code{pweights_a} and \code{pweights_b} are returned, -and for \code{method = "klish"}, the returned data contains a column \code{pweights}. -These represent the rescaled design weights to use in multilevel models (use -these variables for the \code{weights} argument). +\code{method = "carle"}, new columns \code{rescaled_weights_a} and \code{rescaled_weights_b} +are returned, and for \code{method = "klish"}, the returned data contains a column +\code{rescaled_weights}. These represent the rescaled design weights to use in +multilevel models (use these variables for the \code{weights} argument). } \description{ Most functions to fit multilevel and mixed effects models only @@ -53,15 +53,15 @@ multilevel modelling. \itemize{ \item \code{method = "carle"} -Rescaling is based on two methods: For \code{pweights_a}, the sample weights -\code{probability_weights} are adjusted by a factor that represents the +Rescaling is based on two methods: For \code{rescaled_weights_a}, the sample +weights \code{probability_weights} are adjusted by a factor that represents the proportion of group size divided by the sum of sampling weights within each -group. The adjustment factor for \code{pweights_b} is the sum of sample weights -within each group divided by the sum of squared sample weights within each -group (see Carle (2009), Appendix B). In other words, \code{pweights_a} "scales -the weights so that the new weights sum to the cluster sample size" while -\code{pweights_b} "scales the weights so that the new weights sum to the -effective cluster size". +group. The adjustment factor for \code{rescaled_weights_b} is the sum of sample +weights within each group divided by the sum of squared sample weights +within each group (see Carle (2009), Appendix B). In other words, +\code{rescaled_weights_a} "scales the weights so that the new weights sum to the +cluster sample size" while \code{rescaled_weights_b} "scales the weights so that +the new weights sum to the effective cluster size". Regarding the choice between scaling methods A and B, Carle suggests that "analysts who wish to discuss point estimates should report results based @@ -112,13 +112,13 @@ result1 <- lme4::glmer( total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), family = poisson(), data = d, - weights = pweights_a + weights = rescaled_weights_a ) result2 <- lme4::glmer( total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), family = poisson(), data = d, - weights = pweights_b + weights = rescaled_weights_b ) d <- rescale_weights( @@ -130,7 +130,7 @@ result3 <- lme4::glmer( total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), family = poisson(), data = d, - weights = pweights + weights = rescaled_weights ) parameters::compare_parameters( list(result1, result2, result3), diff --git a/tests/testthat/_snaps/rescale_weights.md b/tests/testthat/_snaps/rescale_weights.md index ecdefdd06..4ad736b27 100644 --- a/tests/testthat/_snaps/rescale_weights.md +++ b/tests/testthat/_snaps/rescale_weights.md @@ -3,13 +3,20 @@ Code head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")) Output - total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR pweights_a pweights_b - 1 1 2.20 1 3 2 31 97593.68 1.5733612 1.2005159 - 2 7 2.08 2 3 1 29 39599.36 0.6231745 0.5246593 - 3 3 1.48 2 1 2 42 26619.83 0.8976966 0.5439111 - 4 4 1.32 2 4 2 33 34998.53 0.7083628 0.5498944 - 5 1 2.00 2 1 1 41 14746.45 0.4217782 0.3119698 - 6 6 2.20 2 4 1 38 28232.10 0.6877550 0.5155503 + total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR rescaled_weights_a + 1 1 2.20 1 3 2 31 97593.68 1.5733612 + 2 7 2.08 2 3 1 29 39599.36 0.6231745 + 3 3 1.48 2 1 2 42 26619.83 0.8976966 + 4 4 1.32 2 4 2 33 34998.53 0.7083628 + 5 1 2.00 2 1 1 41 14746.45 0.4217782 + 6 6 2.20 2 4 1 38 28232.10 0.6877550 + rescaled_weights_b + 1 1.2005159 + 2 0.5246593 + 3 0.5439111 + 4 0.5498944 + 5 0.3119698 + 6 0.5155503 --- @@ -36,13 +43,13 @@ Code head(rescale_weights(nhanes_sample, probability_weights = "WTINT2YR", method = "kish")) Output - total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR pweights - 1 1 2.20 1 3 2 31 97593.68 1.3952529 - 2 7 2.08 2 3 1 29 39599.36 0.5661343 - 3 3 1.48 2 1 2 42 26619.83 0.3805718 - 4 4 1.32 2 4 2 33 34998.53 0.5003582 - 5 1 2.00 2 1 1 41 14746.45 0.2108234 - 6 6 2.20 2 4 1 38 28232.10 0.4036216 + total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR rescaled_weights + 1 1 2.20 1 3 2 31 97593.68 1.3952529 + 2 7 2.08 2 3 1 29 39599.36 0.5661343 + 3 3 1.48 2 1 2 42 26619.83 0.3805718 + 4 4 1.32 2 4 2 33 34998.53 0.5003582 + 5 1 2.00 2 1 1 41 14746.45 0.2108234 + 6 6 2.20 2 4 1 38 28232.10 0.4036216 # rescale_weights nested works as expected @@ -50,66 +57,66 @@ rescale_weights(data = head(nhanes_sample, n = 30), by = c("SDMVSTRA", "SDMVPSU"), probability_weights = "WTINT2YR", nest = TRUE) Output - total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR pweights_a - 1 1 2.20 1 3 2 31 97593.679 1.0000000 - 2 7 2.08 2 3 1 29 39599.363 0.5502486 - 3 3 1.48 2 1 2 42 26619.834 0.9512543 - 4 4 1.32 2 4 2 33 34998.530 0.6766764 - 5 1 2.00 2 1 1 41 14746.454 0.7147710 - 6 6 2.20 2 4 1 38 28232.100 1.0000000 - 7 350 1.60 1 3 2 33 93162.431 1.8012419 - 8 NA 1.48 2 3 1 29 82275.986 1.1432570 - 9 3 2.28 2 4 1 41 24726.391 1.1985056 - 10 30 0.84 1 3 2 35 39895.048 1.0000000 - 11 70 1.24 1 4 2 33 27002.703 0.5220817 - 12 5 1.68 2 1 2 39 18792.034 0.3866720 - 13 60 2.20 1 3 2 30 76894.563 1.0000000 - 14 2 1.48 2 3 1 29 82275.986 1.1432570 - 15 8 2.36 2 3 2 39 78406.811 1.6133280 - 16 3 2.04 2 3 2 36 98200.912 1.0000000 - 17 1 2.08 1 3 1 40 87786.091 1.0000000 - 18 7 1.00 1 3 2 32 90803.158 1.2693642 - 19 9 2.28 2 3 2 34 45002.917 1.0000000 - 20 2 1.24 2 3 1 29 82275.986 1.1432570 - 21 4 2.28 2 3 1 34 91437.145 1.4088525 - 22 3 1.04 1 1 2 42 29348.027 1.0487457 - 23 4 1.12 1 1 1 34 38366.567 0.5911475 - 24 1 1.52 2 1 1 42 6622.334 1.0000000 - 25 22 2.24 1 4 1 41 22420.209 1.0867233 - 26 7 1.00 2 3 2 41 65529.204 1.0000000 - 27 5 0.92 2 4 1 30 27089.745 1.0000000 - 28 15 1.04 1 3 2 32 52265.570 0.7306358 - 29 3 0.80 1 3 1 33 64789.307 1.0000000 - 30 1 1.00 1 3 1 29 73404.222 1.0199804 - pweights_b - 1 1.0000000 - 2 0.5226284 - 3 0.9489993 - 4 0.5107078 - 5 0.6854605 - 6 1.0000000 - 7 1.3594509 - 8 1.0858702 - 9 1.1493587 - 10 1.0000000 - 11 0.3940306 - 12 0.2809766 - 13 1.0000000 - 14 1.0858702 - 15 1.1723308 - 16 1.0000000 - 17 1.0000000 - 18 1.1834934 - 19 1.0000000 - 20 1.0858702 - 21 1.2070771 - 22 1.0462596 - 23 0.5064835 - 24 1.0000000 - 25 1.0421602 - 26 1.0000000 - 27 1.0000000 - 28 0.6812093 - 29 1.0000000 - 30 0.9687816 + total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR rescaled_weights_a + 1 1 2.20 1 3 2 31 97593.679 1.0000000 + 2 7 2.08 2 3 1 29 39599.363 0.5502486 + 3 3 1.48 2 1 2 42 26619.834 0.9512543 + 4 4 1.32 2 4 2 33 34998.530 0.6766764 + 5 1 2.00 2 1 1 41 14746.454 0.7147710 + 6 6 2.20 2 4 1 38 28232.100 1.0000000 + 7 350 1.60 1 3 2 33 93162.431 1.8012419 + 8 NA 1.48 2 3 1 29 82275.986 1.1432570 + 9 3 2.28 2 4 1 41 24726.391 1.1985056 + 10 30 0.84 1 3 2 35 39895.048 1.0000000 + 11 70 1.24 1 4 2 33 27002.703 0.5220817 + 12 5 1.68 2 1 2 39 18792.034 0.3866720 + 13 60 2.20 1 3 2 30 76894.563 1.0000000 + 14 2 1.48 2 3 1 29 82275.986 1.1432570 + 15 8 2.36 2 3 2 39 78406.811 1.6133280 + 16 3 2.04 2 3 2 36 98200.912 1.0000000 + 17 1 2.08 1 3 1 40 87786.091 1.0000000 + 18 7 1.00 1 3 2 32 90803.158 1.2693642 + 19 9 2.28 2 3 2 34 45002.917 1.0000000 + 20 2 1.24 2 3 1 29 82275.986 1.1432570 + 21 4 2.28 2 3 1 34 91437.145 1.4088525 + 22 3 1.04 1 1 2 42 29348.027 1.0487457 + 23 4 1.12 1 1 1 34 38366.567 0.5911475 + 24 1 1.52 2 1 1 42 6622.334 1.0000000 + 25 22 2.24 1 4 1 41 22420.209 1.0867233 + 26 7 1.00 2 3 2 41 65529.204 1.0000000 + 27 5 0.92 2 4 1 30 27089.745 1.0000000 + 28 15 1.04 1 3 2 32 52265.570 0.7306358 + 29 3 0.80 1 3 1 33 64789.307 1.0000000 + 30 1 1.00 1 3 1 29 73404.222 1.0199804 + rescaled_weights_b + 1 1.0000000 + 2 0.5226284 + 3 0.9489993 + 4 0.5107078 + 5 0.6854605 + 6 1.0000000 + 7 1.3594509 + 8 1.0858702 + 9 1.1493587 + 10 1.0000000 + 11 0.3940306 + 12 0.2809766 + 13 1.0000000 + 14 1.0858702 + 15 1.1723308 + 16 1.0000000 + 17 1.0000000 + 18 1.1834934 + 19 1.0000000 + 20 1.0858702 + 21 1.2070771 + 22 1.0462596 + 23 0.5064835 + 24 1.0000000 + 25 1.0421602 + 26 1.0000000 + 27 1.0000000 + 28 0.6812093 + 29 1.0000000 + 30 0.9687816 diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index 9c2415626..1a1c1f296 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -10,10 +10,10 @@ test_that("rescale_weights works as expected", { expect_snapshot(head(rescale_weights(nhanes_sample, probability_weights = "WTINT2YR", method = "kish"))) out <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR") - expect_equal(sum(out$pweights_a), 2992, tolerance = 1e-3) - expect_equal(sum(out$pweights_b), 2244.71451, tolerance = 1e-3) + expect_equal(sum(out$rescaled_weights_a), 2992, tolerance = 1e-3) + expect_equal(sum(out$rescaled_weights_b), 2244.71451, tolerance = 1e-3) out <- rescale_weights(nhanes_sample, probability_weights = "WTINT2YR", method = "kish") - expect_equal(sum(out$pweights), 2162.53961, tolerance = 1e-3) + expect_equal(sum(out$rescaled_weights), 2162.53961, tolerance = 1e-3) }) @@ -90,7 +90,7 @@ test_that("rescale_weights errors and warnings", { regex = "is not used" ) - nhanes_sample$pweights_a <- 1 + nhanes_sample$rescaled_weights_a <- 1 expect_warning( { out <- rescale_weights( @@ -105,7 +105,7 @@ test_that("rescale_weights errors and warnings", { out, c( "total", "age", "RIAGENDR", "RIDRETH1", "SDMVPSU", "SDMVSTRA", - "WTINT2YR", "pweights_a", "pweights_a_1", "pweights_b" + "WTINT2YR", "rescaled_weights_a", "rescaled_weights_a_1", "rescaled_weights_b" ) ) }) From 6415437f3b69783aa2300c85dd04a28461457fc2 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 11:46:40 +0100 Subject: [PATCH 21/43] docs --- R/rescale_weights.R | 6 ++++++ man/rescale_weights.Rd | 6 ++++++ 2 files changed, 12 insertions(+) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index de5d1874e..61d2f2738 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -68,6 +68,12 @@ #' divided by the squared mean of the weights. The scales sample weights are #' then divided by the design effect. #' +#' Some tests on real-world survey-data suggest that, in comparison to the +#' Carle-method, the Kish-method comes closer to estimates from a regular +#' survey-design using the **survey** package. Note that these tests are not +#' representative and it is recommended to check your results against a +#' standard survey-design. +#' #' @references #' - Asparouhov T. (2006). General Multi-Level Modeling with Sampling #' Weights. Communications in Statistics - Theory and Methods 35: 439-460 diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 65b4ec6cd..907f7314d 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -85,6 +85,12 @@ which means the sum of all weights equals the sample size. Next, the design effect (\emph{Kish 1965}) is calculated, which is the mean of the squared weights divided by the squared mean of the weights. The scales sample weights are then divided by the design effect. + +Some tests on real-world survey-data suggest that, in comparison to the +Carle-method, the Kish-method comes closer to estimates from a regular +survey-design using the \strong{survey} package. Note that these tests are not +representative and it is recommended to check your results against a +standard survey-design. } } \examples{ From 6778a51da207447aa2a7f7b2a90dc15421563acb Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 13:47:37 +0100 Subject: [PATCH 22/43] Update R/rescale_weights.R Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com> --- R/rescale_weights.R | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 61d2f2738..c69aad52c 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -6,7 +6,7 @@ #' probability) weights, which should be used when analyzing complex samples #' and survey data. `rescale_weights()` implements two algorithms, one proposed #' by \cite{Asparouhov (2006)} and \cite{Carle (2009)} and one proposed by -#' \cite{Kish 1965}, to rescale design weights in survey data to account for the +#' \cite{Kish (1965)}, to rescale design weights in survey data to account for the #' grouping structure of multilevel models, which then can be used for #' multilevel modelling. #' From 749cb2f4a551b8d474e2a4737c4740325bdc8fa4 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 13:47:51 +0100 Subject: [PATCH 23/43] Update NEWS.md Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com> --- NEWS.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/NEWS.md b/NEWS.md index 82513340f..57aadff56 100644 --- a/NEWS.md +++ b/NEWS.md @@ -23,8 +23,7 @@ BREAKING CHANGES AND DEPRECATIONS CHANGES * `rescale_weights()` gets a `method` argument, to choose method to rescale - weights. Options are `"carle"` (the default) and `"kish"`, a newly added - method to rescale weights. + weights. Options are `"carle"` (the default) and `"kish"`. * The `select` argument, which is available in different functions to select variables, can now also be a character vector with quoted variable names, From d085d8cd2a7ffc827be27b2d31886ad0b3fbe297 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 13:50:06 +0100 Subject: [PATCH 24/43] Update R/rescale_weights.R Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com> --- R/rescale_weights.R | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index c69aad52c..dd8ee8af9 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -214,7 +214,7 @@ rescale_weights <- function(data, } if (!all(by %in% colnames(data_tmp))) { - dont_exist <- by[which(!by %in% colnames(data_tmp))] + dont_exist <- setdiff(by, colnames(data_tmp)) insight::format_error( paste0( "The following variable(s) specified in `by` don't exist in the dataset: ", From 98695f2feb843b58507515505b9f482e9dadf541 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 18 Dec 2024 13:57:45 +0100 Subject: [PATCH 25/43] address comments --- R/data_arrange.R | 2 +- R/rescale_weights.R | 2 ++ man/rescale_weights.Rd | 2 +- tests/testthat/test-rescale_weights.R | 8 ++++++++ 4 files changed, 12 insertions(+), 2 deletions(-) diff --git a/R/data_arrange.R b/R/data_arrange.R index 0cff97bd8..07a7f5f56 100644 --- a/R/data_arrange.R +++ b/R/data_arrange.R @@ -42,7 +42,7 @@ data_arrange.default <- function(data, select = NULL, safe = TRUE) { select <- gsub("^-", "", select) # check for variables that are not in data - dont_exist <- select[which(!select %in% names(data))] + dont_exist <- setdiff(select, colnames(data)) if (length(dont_exist) > 0) { if (safe) { diff --git a/R/rescale_weights.R b/R/rescale_weights.R index dd8ee8af9..77135b14f 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -140,6 +140,8 @@ rescale_weights <- function(data, probability_weights = NULL, nest = FALSE, method = "carle") { + method <- insight::validate_argument(method, c("carle", "kish")) + if (inherits(by, "formula")) { by <- all.vars(by) } diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 907f7314d..e467f235b 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -45,7 +45,7 @@ allow to specify frequency weights, but not design (i.e. sampling or probability) weights, which should be used when analyzing complex samples and survey data. \code{rescale_weights()} implements two algorithms, one proposed by \cite{Asparouhov (2006)} and \cite{Carle (2009)} and one proposed by -\cite{Kish 1965}, to rescale design weights in survey data to account for the +\cite{Kish (1965)}, to rescale design weights in survey data to account for the grouping structure of multilevel models, which then can be used for multilevel modelling. } diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index 1a1c1f296..00c20bcdf 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -89,6 +89,14 @@ test_that("rescale_weights errors and warnings", { ), regex = "is not used" ) + expect_warning( + rescale_weights( + data = head(nhanes_sample, n = 30), + probability_weights = "WTINT2YR", + method = "dish" + ), + regex = "Invalid option for argument" + ) nhanes_sample$rescaled_weights_a <- 1 expect_warning( From 164aea8205104adcc0a9610ac2883e1eafbfbadc Mon Sep 17 00:00:00 2001 From: Daniel Date: Thu, 19 Dec 2024 09:02:42 +0100 Subject: [PATCH 26/43] fix test --- tests/testthat/test-rescale_weights.R | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index 00c20bcdf..bea16cd7d 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -89,7 +89,7 @@ test_that("rescale_weights errors and warnings", { ), regex = "is not used" ) - expect_warning( + expect_error( rescale_weights( data = head(nhanes_sample, n = 30), probability_weights = "WTINT2YR", From c055e3b137454ecd9e1e94c0d84017d5632a669b Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 12:41:33 +0100 Subject: [PATCH 27/43] address comments (docs) --- R/rescale_weights.R | 28 ++++++++++++++++------------ man/rescale_weights.Rd | 27 ++++++++++++++++----------- 2 files changed, 32 insertions(+), 23 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 77135b14f..10370a692 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -2,14 +2,15 @@ #' @name rescale_weights #' #' @description Most functions to fit multilevel and mixed effects models only -#' allow to specify frequency weights, but not design (i.e. sampling or -#' probability) weights, which should be used when analyzing complex samples -#' and survey data. `rescale_weights()` implements two algorithms, one proposed -#' by \cite{Asparouhov (2006)} and \cite{Carle (2009)} and one proposed by -#' \cite{Kish (1965)}, to rescale design weights in survey data to account for the -#' grouping structure of multilevel models, which then can be used for -#' multilevel modelling. -#' +#' allow the user to specify frequency weights, but not design (i.e., sampling +#' or probability) weights, which should be used when analyzing complex samples +#' (e.g., probability samples). `rescale_weights()` implements two algorithms, +#' one proposed by \cite{Asparouhov (2006)} and \cite{Carle (2009)} and one +#' proposed by by \cite{Asparouhov (2006)} and \cite{Carle (2009)}, to rescale +#' design weights in survey data to account for the grouping structure of +#' multilevel models, and and one based on the design effect proposed by +#' \cite{Kish (1965)}, to rescale weights by the design effect to account for +#' additional sampling error introduced by weighting. #' @param data A data frame. #' @param by Variable names (as character vector, or as formula), indicating #' the grouping structure (strata) of the survey data (level-2-cluster @@ -27,7 +28,7 @@ #' #' @return `data`, including the new weighting variable(s). For #' `method = "carle"`, new columns `rescaled_weights_a` and `rescaled_weights_b` -#' are returned, and for `method = "klish"`, the returned data contains a column +#' are returned, and for `method = "kish"`, the returned data contains a column #' `rescaled_weights`. These represent the rescaled design weights to use in #' multilevel models (use these variables for the `weights` argument). #' @@ -64,9 +65,12 @@ #' #' Rescaling is based on scaling the sample weights so the mean value is 1, #' which means the sum of all weights equals the sample size. Next, the design -#' effect (_Kish 1965_) is calculated, which is the mean of the squared weights -#' divided by the squared mean of the weights. The scales sample weights are -#' then divided by the design effect. +#' effect (_Kish 1965_) is calculated, which is the mean of the squared +#' weights divided by the squared mean of the weights. The scales sample +#' weights are then divided by the design effect. This method is most +#' appropriate when weights are based on additional variables beyond the +#' grouping variables in the model (e.g., other demographic characteristics), +#' but may also be useful in other contexts. #' #' Some tests on real-world survey-data suggest that, in comparison to the #' Carle-method, the Kish-method comes closer to estimates from a regular diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index e467f235b..912f163a9 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -35,19 +35,21 @@ weights. Can be either \code{"carle"} (default) or \code{"kish"}. See 'Details'. \value{ \code{data}, including the new weighting variable(s). For \code{method = "carle"}, new columns \code{rescaled_weights_a} and \code{rescaled_weights_b} -are returned, and for \code{method = "klish"}, the returned data contains a column +are returned, and for \code{method = "kish"}, the returned data contains a column \code{rescaled_weights}. These represent the rescaled design weights to use in multilevel models (use these variables for the \code{weights} argument). } \description{ Most functions to fit multilevel and mixed effects models only -allow to specify frequency weights, but not design (i.e. sampling or -probability) weights, which should be used when analyzing complex samples -and survey data. \code{rescale_weights()} implements two algorithms, one proposed -by \cite{Asparouhov (2006)} and \cite{Carle (2009)} and one proposed by -\cite{Kish (1965)}, to rescale design weights in survey data to account for the -grouping structure of multilevel models, which then can be used for -multilevel modelling. +allow the user to specify frequency weights, but not design (i.e., sampling +or probability) weights, which should be used when analyzing complex samples +(e.g., probability samples). \code{rescale_weights()} implements two algorithms, +one proposed by \cite{Asparouhov (2006)} and \cite{Carle (2009)} and one +proposed by by \cite{Asparouhov (2006)} and \cite{Carle (2009)}, to rescale +design weights in survey data to account for the grouping structure of +multilevel models, and and one based on the design effect proposed by +\cite{Kish (1965)}, to rescale weights by the design effect to account for +additional sampling error introduced by weighting. } \details{ \itemize{ @@ -82,9 +84,12 @@ design that should be mimicked. Rescaling is based on scaling the sample weights so the mean value is 1, which means the sum of all weights equals the sample size. Next, the design -effect (\emph{Kish 1965}) is calculated, which is the mean of the squared weights -divided by the squared mean of the weights. The scales sample weights are -then divided by the design effect. +effect (\emph{Kish 1965}) is calculated, which is the mean of the squared +weights divided by the squared mean of the weights. The scales sample +weights are then divided by the design effect. This method is most +appropriate when weights are based on additional variables beyond the +grouping variables in the model (e.g., other demographic characteristics), +but may also be useful in other contexts. Some tests on real-world survey-data suggest that, in comparison to the Carle-method, the Kish-method comes closer to estimates from a regular From dcfc24685e66f6e82f3a5f69699eb6ef0d42c792 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 12:45:22 +0100 Subject: [PATCH 28/43] implement by --- R/rescale_weights.R | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 10370a692..88069c225 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -192,9 +192,19 @@ rescale_weights <- function(data, # rescale weights, method Kish ---------------------------- .rescale_weights_kish <- function(nest, probability_weights, data_tmp, data, by, weight_non_na) { - # check argument - if (!is.null(by)) { - insight::format_warning("The `by` argument is not used for `method = \"kish\" and will be ignored.") + # sort id + data_tmp$.bamboozled <- seq_len(nrow(data_tmp)) + + # check by argument + if (!is.null(by) && !all(by %in% colnames(data_tmp))) { + dont_exist <- setdiff(by, colnames(data_tmp)) + insight::format_error( + paste0( + "The following variable(s) specified in `by` don't exist in the dataset: ", + text_concatenate(dont_exist), "." + ), + .misspelled_string(colnames(data_tmp), dont_exist, "Possibly misspelled?") + ) } p_weights <- data_tmp[[probability_weights]] # design effect according to Kish @@ -204,6 +214,11 @@ rescale_weights <- function(data, # divide weights by design effect data$rescaled_weights <- NA_real_ data$rescaled_weights[weight_non_na] <- z_weights / deff + + # restore original order + x <- x[order(x$.bamboozled), ] + x$.bamboozled <- NULL + # return result data } From f56e84cd502a81b36a7222f95adb65f94abb4b94 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 13:08:52 +0100 Subject: [PATCH 29/43] implement `by` --- R/rescale_weights.R | 42 +++++++++++++++++---------- man/rescale_weights.Rd | 12 ++++---- tests/testthat/test-rescale_weights.R | 8 ++--- 3 files changed, 37 insertions(+), 25 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 88069c225..1c0e2c366 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -107,7 +107,7 @@ #' \donttest{ #' # compare different methods, using multilevel-Poisson regression #' -#' d <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR") +#' d <- rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA") #' result1 <- lme4::glmer( #' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), #' family = poisson(), @@ -123,7 +123,7 @@ #' #' d <- rescale_weights( #' nhanes_sample, -#' probability_weights = "WTINT2YR", +#' "WTINT2YR", #' method = "kish" #' ) #' result3 <- lme4::glmer( @@ -140,8 +140,8 @@ #' } #' @export rescale_weights <- function(data, - by = NULL, probability_weights = NULL, + by = NULL, nest = FALSE, method = "carle") { method <- insight::validate_argument(method, c("carle", "kish")) @@ -205,22 +205,34 @@ rescale_weights <- function(data, ), .misspelled_string(colnames(data_tmp), dont_exist, "Possibly misspelled?") ) + } else { + # if `by` = NULL, we create a dummy group + by <- "tmp_klish_by" + data_tmp[[by]] <- 1 } - p_weights <- data_tmp[[probability_weights]] - # design effect according to Kish - deff <- mean(p_weights^2) / (mean(p_weights)^2) - # rescale weights, so their mean is 1 - z_weights <- p_weights * (1 / mean(p_weights)) - # divide weights by design effect - data$rescaled_weights <- NA_real_ - data$rescaled_weights[weight_non_na] <- z_weights / deff - # restore original order - x <- x[order(x$.bamboozled), ] - x$.bamboozled <- NULL + # split into groups, and calculate weights + out <- lapply(split(data_tmp, data_tmp$by), function(group_data) { + p_weights <- group_data[[probability_weights]] + # design effect according to Kish + deff <- mean(p_weights^2) / (mean(p_weights)^2) + # rescale weights, so their mean is 1 + z_weights <- p_weights * (1 / mean(p_weights)) + # divide weights by design effect + group_data$rescaled_weights <- z_weights / deff + group_data + }) + + # bind data + result <- do.call(rbind, out) + + # restore original order, remove dummy variables + result <- result[order(result$.bamboozled), ] + result$.bamboozled <- NULL + result$tmp_klish_by # return result - data + result } diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 912f163a9..5c631b90c 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -6,8 +6,8 @@ \usage{ rescale_weights( data, - by = NULL, probability_weights = NULL, + by = NULL, nest = FALSE, method = "carle" ) @@ -15,6 +15,9 @@ rescale_weights( \arguments{ \item{data}{A data frame.} +\item{probability_weights}{Variable indicating the probability (design or +sampling) weights of the survey data (level-1-weight).} + \item{by}{Variable names (as character vector, or as formula), indicating the grouping structure (strata) of the survey data (level-2-cluster variable). It is also possible to create weights for multiple group @@ -22,9 +25,6 @@ variables; in such cases, each created weighting variable will be suffixed by the name of the group variable. Argument \code{by} only applies to the default rescaling-method (\code{method = "carle"}), not to \code{method = "kish"}.} -\item{probability_weights}{Variable indicating the probability (design or -sampling) weights of the survey data (level-1-weight).} - \item{nest}{Logical, if \code{TRUE} and \code{by} indicates at least two group variables, then groups are "nested", i.e. groups are now a combination from each group level of the variables in \code{by}.} @@ -118,7 +118,7 @@ head(x) \donttest{ # compare different methods, using multilevel-Poisson regression -d <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR") +d <- rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA") result1 <- lme4::glmer( total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), family = poisson(), @@ -134,7 +134,7 @@ result2 <- lme4::glmer( d <- rescale_weights( nhanes_sample, - probability_weights = "WTINT2YR", + "WTINT2YR", method = "kish" ) result3 <- lme4::glmer( diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index bea16cd7d..c02991809 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -3,16 +3,16 @@ test_that("rescale_weights works as expected", { # convert tibble into data frame, so check-hard GHA works nhanes_sample <- as.data.frame(nhanes_sample) - expect_snapshot(head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR"))) + expect_snapshot(head(rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA"))) - expect_snapshot(head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR"))) + expect_snapshot(head(rescale_weights(nhanes_sample, "WTINT2YR", c("SDMVSTRA", "SDMVPSU")))) expect_snapshot(head(rescale_weights(nhanes_sample, probability_weights = "WTINT2YR", method = "kish"))) - out <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR") + out <- rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA") expect_equal(sum(out$rescaled_weights_a), 2992, tolerance = 1e-3) expect_equal(sum(out$rescaled_weights_b), 2244.71451, tolerance = 1e-3) - out <- rescale_weights(nhanes_sample, probability_weights = "WTINT2YR", method = "kish") + out <- rescale_weights(nhanes_sample, "WTINT2YR", method = "kish") expect_equal(sum(out$rescaled_weights), 2162.53961, tolerance = 1e-3) }) From 67b9ddc05f8e17efd8067984dcb6cb4518595377 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 13:09:17 +0100 Subject: [PATCH 30/43] typo --- R/rescale_weights.R | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 1c0e2c366..104b21987 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -207,7 +207,7 @@ rescale_weights <- function(data, ) } else { # if `by` = NULL, we create a dummy group - by <- "tmp_klish_by" + by <- "tmp_kish_by" data_tmp[[by]] <- 1 } @@ -229,7 +229,7 @@ rescale_weights <- function(data, # restore original order, remove dummy variables result <- result[order(result$.bamboozled), ] result$.bamboozled <- NULL - result$tmp_klish_by + result$tmp_kish_by # return result result From 4512769856cd10195a68acb9ea55a1f9acafa7c8 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 13:12:10 +0100 Subject: [PATCH 31/43] update examples --- R/rescale_weights.R | 6 +++--- man/rescale_weights.Rd | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 104b21987..5aa2acb1f 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -90,16 +90,16 @@ #' #' @examplesIf all(insight::check_if_installed(c("lme4", "parameters"), quietly = TRUE)) #' data(nhanes_sample) -#' head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")) +#' head(rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA")) #' #' # also works with multiple group-variables -#' head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR")) +#' head(rescale_weights(nhanes_sample, "WTINT2YR", c("SDMVSTRA", "SDMVPSU"))) #' #' # or nested structures. #' x <- rescale_weights( #' data = nhanes_sample, -#' by = c("SDMVSTRA", "SDMVPSU"), #' probability_weights = "WTINT2YR", +#' by = c("SDMVSTRA", "SDMVPSU"), #' nest = TRUE #' ) #' head(x) diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 5c631b90c..b78baeb52 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -101,16 +101,16 @@ standard survey-design. \examples{ \dontshow{if (all(insight::check_if_installed(c("lme4", "parameters"), quietly = TRUE))) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf} data(nhanes_sample) -head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")) +head(rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA")) # also works with multiple group-variables -head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR")) +head(rescale_weights(nhanes_sample, "WTINT2YR", c("SDMVSTRA", "SDMVPSU"))) # or nested structures. x <- rescale_weights( data = nhanes_sample, - by = c("SDMVSTRA", "SDMVPSU"), probability_weights = "WTINT2YR", + by = c("SDMVSTRA", "SDMVPSU"), nest = TRUE ) head(x) From 02283ee177cd428cd4fd92425454bfd22ffa522f Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 13:14:22 +0100 Subject: [PATCH 32/43] docs --- R/rescale_weights.R | 8 ++++++-- man/rescale_weights.Rd | 3 +-- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 5aa2acb1f..59b8d43b6 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -16,8 +16,7 @@ #' the grouping structure (strata) of the survey data (level-2-cluster #' variable). It is also possible to create weights for multiple group #' variables; in such cases, each created weighting variable will be suffixed -#' by the name of the group variable. Argument `by` only applies to the default -#' rescaling-method (`method = "carle"`), not to `method = "kish"`. +#' by the name of the group variable. #' @param probability_weights Variable indicating the probability (design or #' sampling) weights of the survey data (level-1-weight). #' @param nest Logical, if `TRUE` and `by` indicates at least two @@ -146,10 +145,15 @@ rescale_weights <- function(data, method = "carle") { method <- insight::validate_argument(method, c("carle", "kish")) + # convert formulas to strings if (inherits(by, "formula")) { by <- all.vars(by) } + if (inherits(probability_weights, "formula")) { + probability_weights <- all.vars(probability_weights) + } + # check for existing variable names if ((method == "carle" && any(c("rescaled_weights_a", "rescaled_weights_b") %in% colnames(data))) || (method == "kish" && "rescaled_weights" %in% colnames(data))) { diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index b78baeb52..05aa0e73e 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -22,8 +22,7 @@ sampling) weights of the survey data (level-1-weight).} the grouping structure (strata) of the survey data (level-2-cluster variable). It is also possible to create weights for multiple group variables; in such cases, each created weighting variable will be suffixed -by the name of the group variable. Argument \code{by} only applies to the default -rescaling-method (\code{method = "carle"}), not to \code{method = "kish"}.} +by the name of the group variable.} \item{nest}{Logical, if \code{TRUE} and \code{by} indicates at least two group variables, then groups are "nested", i.e. groups are now a From 71795b5074a54f6387306e3c55e03f62f8e1e73d Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 16:13:55 +0100 Subject: [PATCH 33/43] fix --- R/rescale_weights.R | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 59b8d43b6..be7487f3b 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -209,7 +209,7 @@ rescale_weights <- function(data, ), .misspelled_string(colnames(data_tmp), dont_exist, "Possibly misspelled?") ) - } else { + } else if (is.null(by)) { # if `by` = NULL, we create a dummy group by <- "tmp_kish_by" data_tmp[[by]] <- 1 From 725b631afd2b8d6b397c422979b321379cb9f35a Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 16:14:24 +0100 Subject: [PATCH 34/43] fix --- R/rescale_weights.R | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index be7487f3b..86f4a6835 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -216,7 +216,7 @@ rescale_weights <- function(data, } # split into groups, and calculate weights - out <- lapply(split(data_tmp, data_tmp$by), function(group_data) { + out <- lapply(split(data_tmp, data_tmp[by]), function(group_data) { p_weights <- group_data[[probability_weights]] # design effect according to Kish deff <- mean(p_weights^2) / (mean(p_weights)^2) From 5fc7686e22090004624f2494c5b6b204fd945559 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 16:40:26 +0100 Subject: [PATCH 35/43] fix --- R/rescale_weights.R | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 86f4a6835..41de1ff05 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -232,8 +232,7 @@ rescale_weights <- function(data, # restore original order, remove dummy variables result <- result[order(result$.bamboozled), ] - result$.bamboozled <- NULL - result$tmp_kish_by + result$.bamboozled <- result$tmp_kish_by <- NULL # return result result From cff6780517bc7b766ef8ecd1710548be1846e133 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 16:40:37 +0100 Subject: [PATCH 36/43] desc --- DESCRIPTION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index b8a6c3134..720c5afe1 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,7 +1,7 @@ Type: Package Package: datawizard Title: Easy Data Wrangling and Statistical Transformations -Version: 0.13.0.21 +Version: 0.13.0.22 Authors@R: c( person("Indrajeet", "Patil", , "patilindrajeet.science@gmail.com", role = "aut", comment = c(ORCID = "0000-0003-1995-6531")), From 2d9d111b571793ad56b53d0f74d320a4f197f67c Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 16:56:27 +0100 Subject: [PATCH 37/43] fix, add tests --- NEWS.md | 2 ++ R/rescale_weights.R | 14 +++++++++----- man/rescale_weights.Rd | 6 ++++-- tests/testthat/_snaps/rescale_weights.md | 4 ++-- tests/testthat/test-rescale_weights.R | 8 +++++--- 5 files changed, 22 insertions(+), 12 deletions(-) diff --git a/NEWS.md b/NEWS.md index 5d8ec5e8a..b870855fa 100644 --- a/NEWS.md +++ b/NEWS.md @@ -16,6 +16,8 @@ BREAKING CHANGES AND DEPRECATIONS - if `select` (previously `pattern`) is a named vector, then all elements must be named, e.g. `c(length = "Sepal.Length", "Sepal.Width")` errors. +* Order of arguments `by` and `probability_weights` in `rescale_weights()` has + changed, because for `method = "kish"`, the `by` argument is optional. * The name of the rescaled weights variables in `rescale_weights()` have been renamed. `pweights_a` and `pweights_b` are now named `rescaled_weights_a` diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 41de1ff05..6a2ddc57d 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -16,14 +16,16 @@ #' the grouping structure (strata) of the survey data (level-2-cluster #' variable). It is also possible to create weights for multiple group #' variables; in such cases, each created weighting variable will be suffixed -#' by the name of the group variable. +#' by the name of the group variable. This argument is required for +#' `method = "carle"`, but optional for `method = "kish"`. #' @param probability_weights Variable indicating the probability (design or #' sampling) weights of the survey data (level-1-weight). #' @param nest Logical, if `TRUE` and `by` indicates at least two #' group variables, then groups are "nested", i.e. groups are now a #' combination from each group level of the variables in `by`. #' @param method String, indicating which rescale-method is used for rescaling -#' weights. Can be either `"carle"` (default) or `"kish"`. See 'Details'. +#' weights. Can be either `"carle"` (default) or `"kish"`. See 'Details'. If +#' `method = "carle"`, the `by` argument is required. #' #' @return `data`, including the new weighting variable(s). For #' `method = "carle"`, new columns `rescaled_weights_a` and `rescaled_weights_b` @@ -230,12 +232,14 @@ rescale_weights <- function(data, # bind data result <- do.call(rbind, out) - # restore original order, remove dummy variables + # restore original order result <- result[order(result$.bamboozled), ] - result$.bamboozled <- result$tmp_kish_by <- NULL + # add back rescaled weights to original data, but account for missing observations + data$rescaled_weights <- NA_real_ + data$rescaled_weights[weight_non_na] <- result$rescaled_weights # return result - result + data } diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 05aa0e73e..51f1cbaf1 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -22,14 +22,16 @@ sampling) weights of the survey data (level-1-weight).} the grouping structure (strata) of the survey data (level-2-cluster variable). It is also possible to create weights for multiple group variables; in such cases, each created weighting variable will be suffixed -by the name of the group variable.} +by the name of the group variable. This argument is required for +\code{method = "carle"}, but optional for \code{method = "kish"}.} \item{nest}{Logical, if \code{TRUE} and \code{by} indicates at least two group variables, then groups are "nested", i.e. groups are now a combination from each group level of the variables in \code{by}.} \item{method}{String, indicating which rescale-method is used for rescaling -weights. Can be either \code{"carle"} (default) or \code{"kish"}. See 'Details'.} +weights. Can be either \code{"carle"} (default) or \code{"kish"}. See 'Details'. If +\code{method = "carle"}, the \code{by} argument is required.} } \value{ \code{data}, including the new weighting variable(s). For diff --git a/tests/testthat/_snaps/rescale_weights.md b/tests/testthat/_snaps/rescale_weights.md index 4ad736b27..3c9c12cde 100644 --- a/tests/testthat/_snaps/rescale_weights.md +++ b/tests/testthat/_snaps/rescale_weights.md @@ -1,7 +1,7 @@ # rescale_weights works as expected Code - head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")) + head(rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA")) Output total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR rescaled_weights_a 1 1 2.20 1 3 2 31 97593.68 1.5733612 @@ -21,7 +21,7 @@ --- Code - head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR")) + head(rescale_weights(nhanes_sample, "WTINT2YR", c("SDMVSTRA", "SDMVPSU"))) Output total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR pweight_a_SDMVSTRA 1 1 2.20 1 3 2 31 97593.68 1.5733612 diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index c02991809..1c204057e 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -14,6 +14,8 @@ test_that("rescale_weights works as expected", { expect_equal(sum(out$rescaled_weights_b), 2244.71451, tolerance = 1e-3) out <- rescale_weights(nhanes_sample, "WTINT2YR", method = "kish") expect_equal(sum(out$rescaled_weights), 2162.53961, tolerance = 1e-3) + out <- rescale_weights(nhanes_sample, "WTINT2YR", by = "SDMVPSU", method = "kish") + expect_equal(sum(out$rescaled_weights), 2163.3657, tolerance = 1e-3) }) @@ -80,14 +82,14 @@ test_that("rescale_weights errors and warnings", { ), regex = "must be specified" ) - expect_warning( + expect_error( rescale_weights( data = head(nhanes_sample, n = 30), - by = "SDMVSTRA", + by = "abc", probability_weights = "WTINT2YR", method = "kish" ), - regex = "is not used" + regex = "The following variable(s)" ) expect_error( rescale_weights( From 05efe33f8fcdf1d9d71fe11404b9cc533a3ae2ef Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 16:58:02 +0100 Subject: [PATCH 38/43] examples --- R/rescale_weights.R | 16 ++++++++++++++-- man/rescale_weights.Rd | 16 ++++++++++++++-- 2 files changed, 28 insertions(+), 4 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 6a2ddc57d..b1c273e53 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -133,10 +133,22 @@ #' data = d, #' weights = rescaled_weights #' ) +#' d <- rescale_weights( +#' nhanes_sample, +#' "WTINT2YR", +#' "SDMVSTRA", +#' method = "kish" +#' ) +#' result4 <- lme4::glmer( +#' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), +#' family = poisson(), +#' data = d, +#' weights = rescaled_weights +#' ) #' parameters::compare_parameters( -#' list(result1, result2, result3), +#' list(result1, result2, result3, result4), #' exponentiate = TRUE, -#' column_names = c("Carle (A)", "Carle (B)", "Kish") +#' column_names = c("Carle (A)", "Carle (B)", "Kish", "Kish (grouped)") #' ) #' } #' @export diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 51f1cbaf1..fe746a67d 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -144,10 +144,22 @@ result3 <- lme4::glmer( data = d, weights = rescaled_weights ) +d <- rescale_weights( + nhanes_sample, + "WTINT2YR", + "SDMVSTRA", + method = "kish" +) +result4 <- lme4::glmer( + total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU), + family = poisson(), + data = d, + weights = rescaled_weights +) parameters::compare_parameters( - list(result1, result2, result3), + list(result1, result2, result3, result4), exponentiate = TRUE, - column_names = c("Carle (A)", "Carle (B)", "Kish") + column_names = c("Carle (A)", "Carle (B)", "Kish", "Kish (grouped)") ) } \dontshow{\}) # examplesIf} From 126ee64ede68254403822a276cf0ccaed01d77df Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 17:02:00 +0100 Subject: [PATCH 39/43] docs --- R/rescale_weights.R | 3 ++- man/rescale_weights.Rd | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index b1c273e53..9205d5f03 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -19,7 +19,8 @@ #' by the name of the group variable. This argument is required for #' `method = "carle"`, but optional for `method = "kish"`. #' @param probability_weights Variable indicating the probability (design or -#' sampling) weights of the survey data (level-1-weight). +#' sampling) weights of the survey data (level-1-weight), provided as character +#' string or formula. #' @param nest Logical, if `TRUE` and `by` indicates at least two #' group variables, then groups are "nested", i.e. groups are now a #' combination from each group level of the variables in `by`. diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index fe746a67d..52e56c377 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -16,7 +16,8 @@ rescale_weights( \item{data}{A data frame.} \item{probability_weights}{Variable indicating the probability (design or -sampling) weights of the survey data (level-1-weight).} +sampling) weights of the survey data (level-1-weight), provided as character +string or formula.} \item{by}{Variable names (as character vector, or as formula), indicating the grouping structure (strata) of the survey data (level-2-cluster From edce1bfa209b5c01a640c2afe786250a0aaa733d Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 17:07:58 +0100 Subject: [PATCH 40/43] docs --- R/rescale_weights.R | 2 +- man/rescale_weights.Rd | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index 9205d5f03..aa0d260be 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -68,7 +68,7 @@ #' Rescaling is based on scaling the sample weights so the mean value is 1, #' which means the sum of all weights equals the sample size. Next, the design #' effect (_Kish 1965_) is calculated, which is the mean of the squared -#' weights divided by the squared mean of the weights. The scales sample +#' weights divided by the squared mean of the weights. The scaled sample #' weights are then divided by the design effect. This method is most #' appropriate when weights are based on additional variables beyond the #' grouping variables in the model (e.g., other demographic characteristics), diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 52e56c377..646ae1da4 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -87,7 +87,7 @@ design that should be mimicked. Rescaling is based on scaling the sample weights so the mean value is 1, which means the sum of all weights equals the sample size. Next, the design effect (\emph{Kish 1965}) is calculated, which is the mean of the squared -weights divided by the squared mean of the weights. The scales sample +weights divided by the squared mean of the weights. The scaled sample weights are then divided by the design effect. This method is most appropriate when weights are based on additional variables beyond the grouping variables in the model (e.g., other demographic characteristics), From 1f9b8b7f707f27c189013d0a1d961dfa35b73072 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 17:12:17 +0100 Subject: [PATCH 41/43] tests --- R/rescale_weights.R | 23 +++++++++++++++-------- man/rescale_weights.Rd | 17 +++++++++-------- tests/testthat/test-rescale_weights.R | 10 ++++++++++ 3 files changed, 34 insertions(+), 16 deletions(-) diff --git a/R/rescale_weights.R b/R/rescale_weights.R index aa0d260be..35e1bba98 100644 --- a/R/rescale_weights.R +++ b/R/rescale_weights.R @@ -21,18 +21,20 @@ #' @param probability_weights Variable indicating the probability (design or #' sampling) weights of the survey data (level-1-weight), provided as character #' string or formula. -#' @param nest Logical, if `TRUE` and `by` indicates at least two -#' group variables, then groups are "nested", i.e. groups are now a -#' combination from each group level of the variables in `by`. +#' @param nest Logical, if `TRUE` and `by` indicates at least two group +#' variables, then groups are "nested", i.e. groups are now a combination from +#' each group level of the variables in `by`. This argument is not used when +#' `method = "kish"`. #' @param method String, indicating which rescale-method is used for rescaling #' weights. Can be either `"carle"` (default) or `"kish"`. See 'Details'. If #' `method = "carle"`, the `by` argument is required. #' -#' @return `data`, including the new weighting variable(s). For -#' `method = "carle"`, new columns `rescaled_weights_a` and `rescaled_weights_b` -#' are returned, and for `method = "kish"`, the returned data contains a column -#' `rescaled_weights`. These represent the rescaled design weights to use in -#' multilevel models (use these variables for the `weights` argument). +#' @return +#' `data`, including the new weighting variable(s). For `method = "carle"`, new +#' columns `rescaled_weights_a` and `rescaled_weights_b` are returned, and for +#' `method = "kish"`, the returned data contains a column `rescaled_weights`. +#' These represent the rescaled design weights to use in multilevel models (use +#' these variables for the `weights` argument). #' #' @details #' - `method = "carle"` @@ -214,6 +216,11 @@ rescale_weights <- function(data, # sort id data_tmp$.bamboozled <- seq_len(nrow(data_tmp)) + # `nest` is currently ignored + if (isTRUE(nest)) { + insight::format_warning("Argument `nest` is ignored for `method = \"kish\"`.") + } + # check by argument if (!is.null(by) && !all(by %in% colnames(data_tmp))) { dont_exist <- setdiff(by, colnames(data_tmp)) diff --git a/man/rescale_weights.Rd b/man/rescale_weights.Rd index 646ae1da4..85ffae0b9 100644 --- a/man/rescale_weights.Rd +++ b/man/rescale_weights.Rd @@ -26,20 +26,21 @@ variables; in such cases, each created weighting variable will be suffixed by the name of the group variable. This argument is required for \code{method = "carle"}, but optional for \code{method = "kish"}.} -\item{nest}{Logical, if \code{TRUE} and \code{by} indicates at least two -group variables, then groups are "nested", i.e. groups are now a -combination from each group level of the variables in \code{by}.} +\item{nest}{Logical, if \code{TRUE} and \code{by} indicates at least two group +variables, then groups are "nested", i.e. groups are now a combination from +each group level of the variables in \code{by}. This argument is not used when +\code{method = "kish"}.} \item{method}{String, indicating which rescale-method is used for rescaling weights. Can be either \code{"carle"} (default) or \code{"kish"}. See 'Details'. If \code{method = "carle"}, the \code{by} argument is required.} } \value{ -\code{data}, including the new weighting variable(s). For -\code{method = "carle"}, new columns \code{rescaled_weights_a} and \code{rescaled_weights_b} -are returned, and for \code{method = "kish"}, the returned data contains a column -\code{rescaled_weights}. These represent the rescaled design weights to use in -multilevel models (use these variables for the \code{weights} argument). +\code{data}, including the new weighting variable(s). For \code{method = "carle"}, new +columns \code{rescaled_weights_a} and \code{rescaled_weights_b} are returned, and for +\code{method = "kish"}, the returned data contains a column \code{rescaled_weights}. +These represent the rescaled design weights to use in multilevel models (use +these variables for the \code{weights} argument). } \description{ Most functions to fit multilevel and mixed effects models only diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index 1c204057e..2800024cb 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -91,6 +91,16 @@ test_that("rescale_weights errors and warnings", { ), regex = "The following variable(s)" ) + expect_warning( + rescale_weights( + data = head(nhanes_sample, n = 30), + by = "SDMVSTRA", + probability_weights = "WTINT2YR", + nest = TRUE, + method = "kish" + ), + regex = "is ignored" + ) expect_error( rescale_weights( data = head(nhanes_sample, n = 30), From 475117e726c74526d96d7ddbaabe1378ae90557e Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 17:17:44 +0100 Subject: [PATCH 42/43] add tests --- tests/testthat/_snaps/rescale_weights.md | 75 ++++++++++++++++++++++++ tests/testthat/test-rescale_weights.R | 16 ++++- 2 files changed, 90 insertions(+), 1 deletion(-) diff --git a/tests/testthat/_snaps/rescale_weights.md b/tests/testthat/_snaps/rescale_weights.md index 3c9c12cde..f4a18f3bf 100644 --- a/tests/testthat/_snaps/rescale_weights.md +++ b/tests/testthat/_snaps/rescale_weights.md @@ -51,6 +51,81 @@ 5 1 2.00 2 1 1 41 14746.45 0.2108234 6 6 2.20 2 4 1 38 28232.10 0.4036216 +--- + + Code + rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA") + Output + total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR rescaled_weights_a + 1 1 2.20 1 3 2 31 97593.68 1.0000000 + 2 7 2.08 2 3 1 29 39599.36 0.5819119 + 3 3 1.48 2 1 2 42 NA NA + 4 4 1.32 2 4 2 33 34998.53 0.6766764 + 5 1 2.00 2 1 1 41 14746.45 0.7471696 + 6 6 2.20 2 4 1 38 28232.10 1.0000000 + 7 350 1.60 1 3 2 33 93162.43 1.8012419 + 8 NA 1.48 2 3 1 29 82275.99 1.2090441 + 9 3 2.28 2 4 1 41 24726.39 1.2528304 + 10 30 0.84 1 3 2 35 NA NA + 11 70 1.24 1 4 2 33 27002.70 0.5220817 + 12 5 1.68 2 1 2 39 18792.03 1.0000000 + 13 60 2.20 1 3 2 30 76894.56 1.0000000 + 14 2 1.48 2 3 1 29 NA NA + 15 8 2.36 2 3 2 39 NA NA + 16 3 2.04 2 3 2 36 98200.91 1.0000000 + 17 1 2.08 1 3 1 40 87786.09 1.0000000 + 18 7 1.00 1 3 2 32 90803.16 1.0000000 + 19 9 2.28 2 3 2 34 NA NA + 20 2 1.24 2 3 1 29 82275.99 1.2090441 + rescaled_weights_b + 1 1.0000000 + 2 0.5351412 + 3 NA + 4 0.5107078 + 5 0.7022777 + 6 1.0000000 + 7 1.3594509 + 8 1.1118681 + 9 1.1775572 + 10 NA + 11 0.3940306 + 12 1.0000000 + 13 1.0000000 + 14 NA + 15 NA + 16 1.0000000 + 17 1.0000000 + 18 1.0000000 + 19 NA + 20 1.1118681 + +--- + + Code + rescale_weights(nhanes_sample, "WTINT2YR", method = "kish") + Output + total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR rescaled_weights + 1 1 2.20 1 3 2 31 97593.68 1.2734329 + 2 7 2.08 2 3 1 29 39599.36 0.5167049 + 3 3 1.48 2 1 2 42 NA NA + 4 4 1.32 2 4 2 33 34998.53 0.4566718 + 5 1 2.00 2 1 1 41 14746.45 0.1924164 + 6 6 2.20 2 4 1 38 28232.10 0.3683813 + 7 350 1.60 1 3 2 33 93162.43 1.2156126 + 8 NA 1.48 2 3 1 29 82275.99 1.0735629 + 9 3 2.28 2 4 1 41 24726.39 0.3226377 + 10 30 0.84 1 3 2 35 NA NA + 11 70 1.24 1 4 2 33 27002.70 0.3523397 + 12 5 1.68 2 1 2 39 18792.03 0.2452044 + 13 60 2.20 1 3 2 30 76894.56 1.0033444 + 14 2 1.48 2 3 1 29 NA NA + 15 8 2.36 2 3 2 39 NA NA + 16 3 2.04 2 3 2 36 98200.91 1.2813563 + 17 1 2.08 1 3 1 40 87786.09 1.1454605 + 18 7 1.00 1 3 2 32 90803.16 1.1848281 + 19 9 2.28 2 3 2 34 NA NA + 20 2 1.24 2 3 1 29 82275.99 1.0735629 + # rescale_weights nested works as expected Code diff --git a/tests/testthat/test-rescale_weights.R b/tests/testthat/test-rescale_weights.R index 2800024cb..7678e0816 100644 --- a/tests/testthat/test-rescale_weights.R +++ b/tests/testthat/test-rescale_weights.R @@ -19,6 +19,20 @@ test_that("rescale_weights works as expected", { }) +test_that("rescale_weights works as expected", { + data(nhanes_sample) + # convert tibble into data frame, so check-hard GHA works + nhanes_sample <- as.data.frame(nhanes_sample)[1:20, ] + + # add NAs + set.seed(123) + nhanes_sample$WTINT2YR[sample.int(nrow(nhanes_sample), 5)] <- NA + + expect_snapshot(rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA")) + expect_snapshot(rescale_weights(nhanes_sample, "WTINT2YR", method = "kish")) +}) + + test_that("rescale_weights nested works as expected", { data(nhanes_sample) # convert tibble into data frame, so check-hard GHA works @@ -89,7 +103,7 @@ test_that("rescale_weights errors and warnings", { probability_weights = "WTINT2YR", method = "kish" ), - regex = "The following variable(s)" + regex = "The following variable" ) expect_warning( rescale_weights( From 227c99b0db9c4754e1f6cde1741f607a2187c409 Mon Sep 17 00:00:00 2001 From: Daniel Date: Tue, 31 Dec 2024 17:21:01 +0100 Subject: [PATCH 43/43] typo --- NEWS.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/NEWS.md b/NEWS.md index b870855fa..57d2ddfe5 100644 --- a/NEWS.md +++ b/NEWS.md @@ -29,7 +29,7 @@ BREAKING CHANGES AND DEPRECATIONS was also fixed now. * `demean()` (and `degroup()`) gets an `append` argument that defaults to `TRUE`, - to append the centered variabled to the original data frame, instead of + to append the centered variables to the original data frame, instead of returning the de- and group-meaned variables only. Use `append = FALSE` to for the previous default behaviour (i.e. only returning the newly created variables).