diff --git a/NAMESPACE b/NAMESPACE
index 54f41307..bfa83d9c 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -15,6 +15,7 @@ export(euclidean_dist)
 export(get_matches)
 export(mahalanobis_dist)
 export(match.data)
+export(match_data)
 export(matchit)
 export(robust_mahalanobis_dist)
 export(scaled_euclidean_dist)
diff --git a/NEWS.md b/NEWS.md
index e9c64e0f..183d0489 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -12,6 +12,8 @@ output:
 * Fixed a bug when matching with a nonzero `ratio` where subclass membership was incorrectly calculated. Thanks to Simon Loewe (@simon-lowe) for originally pointing it out. (#207, #208)
+* `match.data()` has been renamed to `match_data()`, but `match.data()` will remain as an alias for backward compatibility.
+
 * Fixed a bug with printing.
 * Documentation fixes.
diff --git a/R/add_s.weights.R b/R/add_s.weights.R
index 9f4c08f8..a5b5ebf9 100644
--- a/R/add_s.weights.R
+++ b/R/add_s.weights.R
@@ -9,7 +9,7 @@
 #' an effect to the correct population. Without adding sampling weights to the
 #' `matchit` object, balance assessment tools (i.e., [summary.matchit()]
 #' and [plot.matchit()]) will not calculate balance statistics correctly, and
-#' the weights produced by [match.data()] and [get_matches()] will not
+#' the weights produced by [match_data()] and [get_matches()] will not
 #' incorporate the sampling weights.
 #'
 #' @param m a `matchit` object; the output of a call to [matchit()],
@@ -28,7 +28,7 @@
 #'
 #' @author Noah Greifer
 #'
-#' @seealso [matchit()]; [match.data()]
+#' @seealso [matchit()]; [match_data()]
 #'
 #' @examples
 #'
diff --git a/R/match.data.R b/R/match_data.R
similarity index 88%
rename from R/match.data.R
rename to R/match_data.R
index 9e5dac10..52f9409a 100644
--- a/R/match.data.R
+++ b/R/match_data.R
@@ -1,13 +1,13 @@
 #' Construct a matched dataset from a `matchit` object
-#' @name match.data
-#' @aliases match.data get_matches
+#' @name match_data
+#' @aliases match_data match.data get_matches
 #'
 #' @description
-#' `match.data()` and `get_matches()` create a data frame with
+#' `match_data()` and `get_matches()` create a data frame with
 #' additional variables for the distance measure, matching weights, and
 #' subclasses after matching. This dataset can be used to estimate treatment
 #' effects after matching or subclassification. `get_matches()` is most
-#' useful after matching with replacement; otherwise, `match.data()` is
+#' useful after matching with replacement; otherwise, `match_data()` is
 #' more flexible. See Details below for the difference between them.
 #'
 #' @param object a `matchit` object; the output of a call to [matchit()].
@@ -28,32 +28,33 @@
 #' frame output. Default is `"subclass"`.
 #' @param id a string containing the name that should be given to the variable
 #' containing the unit IDs in the data frame output. Default is `"id"`.
-#' Only used with `get_matches()`; for `match.data()`, the units IDs
+#' Only used with `get_matches()`; for `match_data()`, the units IDs
 #' are stored in the row names of the returned data frame.
 #' @param data a data frame containing the original dataset to which the
 #' computed output variables (`distance`, `weights`, and/or
-#' `subclass`) should be appended. If empty, `match.data()` and
+#' `subclass`) should be appended.
If empty, `match_data()` and
 #' `get_matches()` will attempt to find the dataset using the environment
 #' of the `matchit` object, which can be unreliable; see Notes.
 #' @param include.s.weights `logical`; whether to multiply the estimated
 #' weights by the sampling weights supplied to `matchit()`, if any.
 #' Default is `TRUE`. If `FALSE`, the weights in the
-#' `match.data()` or `get_matches()` output should be multiplied by
+#' `match_data()` or `get_matches()` output should be multiplied by
 #' the sampling weights before being supplied to the function estimating the
 #' treatment effect in the matched data.
 #' @param drop.unmatched `logical`; whether the returned data frame should
 #' contain all units (`FALSE`) or only units that were matched (i.e., have
 #' a matching weight greater than zero) (`TRUE`). Default is `TRUE`
 #' to drop unmatched units.
+#' @param \dots arguments passed to `match_data()`.
 #'
 #' @details
-#' `match.data()` creates a dataset with one row per unit. It will be
+#' `match_data()` creates a dataset with one row per unit. It will be
 #' identical to the dataset supplied except that several new columns will be
 #' added containing information related to the matching. When
 #' `drop.unmatched = TRUE`, the default, units with weights of zero, which
 #' are those units that were discarded by common support or the caliper or were
 #' simply not matched, will be dropped from the dataset, leaving only the
-#' subset of matched units. The idea is for the output of `match.data()`
+#' subset of matched units. The idea is for the output of `match_data()`
 #' to be used as the dataset input in calls to `glm()` or similar to
 #' estimate treatment effects in the matched sample. It is important to include
 #' the weights in the estimation of the effect and its standard error. The
@@ -63,9 +64,9 @@
 #' `matchit` object, which does not occur with matching with replacement,
 #' in which case `get_matches()` should be used.
See
 #' `vignette("estimating-effects")` for information on how to use
-#' `match.data()` output to estimate effects.
+#' `match_data()` output to estimate effects. `match.data()` is an alias for `match_data()`.
 #'
-#' `get_matches()` is similar to `match.data()`; the primary
+#' `get_matches()` is similar to `match_data()`; the primary
 #' difference occurs when matching is performed with replacement, i.e., when
 #' units do not belong to a single matched pair. In this case, the output of
 #' `get_matches()` will be a dataset that contains one row per unit for
@@ -78,10 +79,10 @@
 #' created (named using the `id` argument) to identify when the same unit
 #' is present in multiple rows. This dataset structure allows for the inclusion
 #' of both subclass membership and repeated use of units, unlike the output of
-#' `match.data()`, which lacks subclass membership when matching is done
+#' `match_data()`, which lacks subclass membership when matching is done
 #' with replacement. A `match.matrix` component of the `matchit`
 #' object must be present to use `get_matches()`; in some forms of
-#' matching, it is absent, in which case `match.data()` should be used
+#' matching, it is absent, in which case `match_data()` should be used
 #' instead. See `vignette("estimating-effects")` for information on how to
 #' use `get_matches()` output to estimate effects after matching with
 #' replacement.
@@ -90,11 +91,11 @@
 #' A data frame containing the data supplied in the `data` argument or in the
 #' original call to `matchit()` with the computed
 #' output variables appended as additional columns, named according the
-#' arguments above. For `match.data()`, the `group` and
+#' arguments above. For `match_data()`, the `group` and
 #' `drop.unmatched` arguments control whether only subsets of the data are
-#' returned. See Details above for how `match.data()` and
+#' returned. See Details above for how `match_data()` and
 #' `get_matches()` differ.
Note that `get_matches` sorts the data by
-#' subclass and treatment status, unlike `match.data()`, which uses the
+#' subclass and treatment status, unlike `match_data()`, which uses the
 #' order of the data.
 #'
 #' The returned data frame will contain the variables in the original data set
@@ -113,11 +114,11 @@
 #' reused in matching with replacement.}
 #'
 #' These columns will take on the name supplied to the corresponding arguments
-#' in the call to `match.data()` or `get_matches()`. See Examples for
+#' in the call to `match_data()` or `get_matches()`. See Examples for
 #' an example of rename the `distance` column to `"prop.score"`.
 #'
 #' If `data` or the original dataset supplied to `matchit()` was a
-#' `data.table` or `tbl`, the `match.data()` output will have
+#' `data.table` or `tbl`, the `match_data()` output will have
 #' the same class, but the `get_matches()` output will always be a base R
 #' `data.frame`.
 #'
@@ -126,11 +127,11 @@
 #' class is important when using [`rbind()`][rbind.matchdata] to
 #' append matched datasets.
 #'
-#' @note The most common way to use `match.data()` and
+#' @note The most common way to use `match_data()` and
 #' `get_matches()` is by supplying just the `matchit` object, e.g.,
-#' as `match.data(m.out)`. A data set will first be searched in the
+#' as `match_data(m.out)`. A data set will first be searched in the
 #' environment of the `matchit` formula, then in the calling environment
-#' of `match.data()` or `get_matches()`, and finally in the
+#' of `match_data()` or `get_matches()`, and finally in the
 #' `model` component of the `matchit` object if a propensity score
 #' was estimated.
 #'
@@ -142,13 +143,13 @@
 #' occur when `matchit()` was run within an [lapply()] or
 #' `purrr::map()` call. The solution, which is recommended in all cases,
 #' is simply to supply the original dataset to the `data` argument of
-#' `match.data()`, e.g., as `match.data(m.out, data = original_data)`, as demonstrated in the Examples.
+#' `match_data()`, e.g., as `match_data(m.out, data = original_data)`, as demonstrated in the Examples.
 #'
 #' @seealso
 #'
 #' [matchit()]; [rbind.matchdata()]
 #'
-#' `vignette("estimating-effects")` for uses of `match.data()` and
+#' `vignette("estimating-effects")` for uses of `match_data()` and
 #' `get_matches()` in estimating treatment effects.
 #'
 #' @examples
@@ -161,7 +162,7 @@
 #' data = lalonde, replace = TRUE,
 #' caliper = .05, ratio = 4)
 #'
-#' m.data1 <- match.data(m.out1, data = lalonde,
+#' m.data1 <- match_data(m.out1, data = lalonde,
 #' distance = "prop.score")
 #' dim(m.data1) #one row per matched unit
 #' head(m.data1, 10)
@@ -173,7 +174,7 @@
 #'
 #' @export
-match.data <- function(object, group = "all", distance = "distance", weights = "weights", subclass = "subclass",
+match_data <- function(object, group = "all", distance = "distance", weights = "weights", subclass = "subclass",
 data = NULL, include.s.weights = TRUE, drop.unmatched = TRUE) {
 chk::chk_is(object, "matchit")
@@ -266,19 +267,25 @@ match.data <- function(object, group = "all", distance = "distance", weights = "
 }
 #' @export
-#' @rdname match.data
+#' @rdname match_data
+match.data <- function(...) {
+ match_data(...)
+}
+
+#' @export
+#' @rdname match_data
 get_matches <- function(object, distance = "distance", weights = "weights", subclass = "subclass", id = "id", data = NULL, include.s.weights = TRUE) {
 chk::chk_is(object, "matchit")
 if (is_null(object$match.matrix)) {
- .err("a match.matrix component must be present in the matchit object, which does not occur with all types of matching. Use `match.data()` instead")
+ .err("a match.matrix component must be present in the matchit object, which does not occur with all types of matching.
Use `match_data()` instead")
 }
- #Get initial data using match.data; note weights and subclass will be removed,
+ #Get initial data using match_data; note weights and subclass will be removed,
 #including them here just checks their names don't clash
- m.data <- match.data(object, group = "all", distance = distance,
+ m.data <- match_data(object, group = "all", distance = distance,
 weights = weights, subclass = subclass, data = data, include.s.weights = FALSE, drop.unmatched = TRUE)
diff --git a/R/matchit.R b/R/matchit.R
index 91fe1860..d4b3f0ee 100644
--- a/R/matchit.R
+++ b/R/matchit.R
@@ -258,7 +258,7 @@
 #' If sampling weights are included through the
 #' `s.weights` argument, they will be included in the `matchit()`
 #' output object but not incorporated into the matching weights.
-#' [match.data()], which extracts the matched set from a `matchit` object,
+#' [match_data()], which extracts the matched set from a `matchit` object,
 #' combines the matching weights and sampling weights.
 #'
 #' @return When `method` is something other than `"subclass"`, a
diff --git a/R/rbind.matchdata.R b/R/rbind.matchdata.R
index db6a3ba5..5a3c7091 100644
--- a/R/rbind.matchdata.R
+++ b/R/rbind.matchdata.R
@@ -1,13 +1,13 @@
 #' Append matched datasets together
 #'
 #' These functions are [rbind()] methods for objects resulting from calls to
-#' [match.data()] and [get_matches()]. They function nearly identically to
+#' [match_data()] and [get_matches()]. They function nearly identically to
 #' `rbind.data.frame()`; see Details for how they differ.
 #'
 #' @aliases rbind.matchdata rbind.getmatches
 #'
 #' @param \dots Two or more `matchdata` or `getmatches` objects the
-#' output of calls to [match.data()] and [get_matches()], respectively.
+#' output of calls to [match_data()] and [get_matches()], respectively.
 #' Supplied objects must either be all `matchdata` objects or all
 #' `getmatches` objects.
 #' @param deparse.level Passed to [rbind()].
@@ -37,7 +37,7 @@
 #' `rbind.getmatches()` and `rbind.matchdata()` are identical.
 #'
 #' @author Noah Greifer
-#' @seealso [match.data()], [rbind()]
+#' @seealso [match_data()], [rbind()]
 #'
 #' See `vignettes("estimating-effects")` for details on using
 #' `rbind()` for effect estimation after subsetting the data.
@@ -50,17 +50,17 @@
 #' m.out_b <- matchit(treat ~ age + educ + married +
 #' nodegree + re74 + re75,
 #' data = subset(lalonde, race == "black"))
-#' md_b <- match.data(m.out_b)
+#' md_b <- match_data(m.out_b)
 #'
 #' m.out_h <- matchit(treat ~ age + educ + married +
 #' nodegree + re74 + re75,
 #' data = subset(lalonde, race == "hispan"))
-#' md_h <- match.data(m.out_h)
+#' md_h <- match_data(m.out_h)
 #'
 #' m.out_w <- matchit(treat ~ age + educ + married +
 #' nodegree + re74 + re75,
 #' data = subset(lalonde, race == "white"))
-#' md_w <- match.data(m.out_w)
+#' md_w <- match_data(m.out_w)
 #'
 #' #Bind the datasets together
 #' md_all <- rbind(md_b, md_h, md_w)
@@ -118,11 +118,11 @@ rbind.matchdata <- function(..., deparse.level = 1) {
 setdiff(names(md_list[[d]]), unlist(lapply(attr_list, `[`, d)))
 })
- for (d in seq_along(md_list)[-1]) {
- if (length(other_col_list[[d]]) != length(other_col_list[[1]]) ||
- !all(other_col_list[[d]] %in% other_col_list[[1]])) {
+ for (d in seq_along(md_list)[-1L]) {
+ if (length(other_col_list[[d]]) != length(other_col_list[[1L]]) ||
+ !all(other_col_list[[d]] %in% other_col_list[[1L]])) {
 .err(sprintf("the %s inputs must come from the same dataset",
- switch(type, "matchdata" = "`match.data()`", "`get_matches()`")))
+ switch(type, "matchdata" = "`match_data()`", "`get_matches()`")))
 }
 }
@@ -149,7 +149,7 @@ rbind.matchdata <- function(..., deparse.level = 1) {
 #Put all columns in the same order
 if (d > 1) {
- md_list[[d]] <- md_list[[d]][names(md_list[[1]])]
+ md_list[[d]] <- md_list[[d]][names(md_list[[1L]])]
 }
 class(md_list[[d]]) <- setdiff(class(md_list[[d]]), type)
diff --git a/R/summary.matchit.R b/R/summary.matchit.R
index
5dd2334a..8a64800e 100644 --- a/R/summary.matchit.R +++ b/R/summary.matchit.R @@ -724,7 +724,7 @@ print.summary.matchit.subclass <- function(x, digits = max(3, getOption("digits" return(X) } - #Attempt to extract data from matchit object; same as match.data() + #Attempt to extract data from matchit object; same as match_data() data.found <- FALSE for (i in 1:4) { if (i == 2L) { diff --git a/_pkgdown.yml b/_pkgdown.yml index 2378f74f..1fd2277c 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -14,7 +14,7 @@ reference: - plot.matchit - title: Extracting Matched Data - contents: - - match.data + - match_data - get_matches - rbind.matchdata - title: Datasets diff --git a/man/add_s.weights.Rd b/man/add_s.weights.Rd index 8d16b6fd..2c809b58 100644 --- a/man/add_s.weights.Rd +++ b/man/add_s.weights.Rd @@ -33,7 +33,7 @@ of the propensity score) but sampling weights are required for generalizing an effect to the correct population. Without adding sampling weights to the \code{matchit} object, balance assessment tools (i.e., \code{\link[=summary.matchit]{summary.matchit()}} and \code{\link[=plot.matchit]{plot.matchit()}}) will not calculate balance statistics correctly, and -the weights produced by \code{\link[=match.data]{match.data()}} and \code{\link[=get_matches]{get_matches()}} will not +the weights produced by \code{\link[=match_data]{match_data()}} and \code{\link[=get_matches]{get_matches()}} will not incorporate the sampling weights. 
} \examples{ @@ -62,7 +62,7 @@ summary(m.out, improvement = FALSE) } \seealso{ -\code{\link[=matchit]{matchit()}}; \code{\link[=match.data]{match.data()}} +\code{\link[=matchit]{matchit()}}; \code{\link[=match_data]{match_data()}} } \author{ Noah Greifer diff --git a/man/match.data.Rd b/man/match_data.Rd similarity index 84% rename from man/match.data.Rd rename to man/match_data.Rd index e9bfd34e..fc951713 100644 --- a/man/match.data.Rd +++ b/man/match_data.Rd @@ -1,11 +1,12 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/match.data.R -\name{match.data} +% Please edit documentation in R/match_data.R +\name{match_data} +\alias{match_data} \alias{match.data} \alias{get_matches} \title{Construct a matched dataset from a \code{matchit} object} \usage{ -match.data( +match_data( object, group = "all", distance = "distance", @@ -16,6 +17,8 @@ match.data( drop.unmatched = TRUE ) +match.data(...) + get_matches( object, distance = "distance", @@ -50,14 +53,14 @@ frame output. Default is \code{"subclass"}.} \item{data}{a data frame containing the original dataset to which the computed output variables (\code{distance}, \code{weights}, and/or -\code{subclass}) should be appended. If empty, \code{match.data()} and +\code{subclass}) should be appended. If empty, \code{match_data()} and \code{get_matches()} will attempt to find the dataset using the environment of the \code{matchit} object, which can be unreliable; see Notes.} \item{include.s.weights}{\code{logical}; whether to multiply the estimated weights by the sampling weights supplied to \code{matchit()}, if any. Default is \code{TRUE}. 
If \code{FALSE}, the weights in the -\code{match.data()} or \code{get_matches()} output should be multiplied by +\code{match_data()} or \code{get_matches()} output should be multiplied by the sampling weights before being supplied to the function estimating the treatment effect in the matched data.} @@ -66,20 +69,22 @@ contain all units (\code{FALSE}) or only units that were matched (i.e., have a matching weight greater than zero) (\code{TRUE}). Default is \code{TRUE} to drop unmatched units.} +\item{\dots}{arguments passed to \code{match_data()}.} + \item{id}{a string containing the name that should be given to the variable containing the unit IDs in the data frame output. Default is \code{"id"}. -Only used with \code{get_matches()}; for \code{match.data()}, the units IDs +Only used with \code{get_matches()}; for \code{match_data()}, the units IDs are stored in the row names of the returned data frame.} } \value{ A data frame containing the data supplied in the \code{data} argument or in the original call to \code{matchit()} with the computed output variables appended as additional columns, named according the -arguments above. For \code{match.data()}, the \code{group} and +arguments above. For \code{match_data()}, the \code{group} and \code{drop.unmatched} arguments control whether only subsets of the data are -returned. See Details above for how \code{match.data()} and +returned. See Details above for how \code{match_data()} and \code{get_matches()} differ. Note that \code{get_matches} sorts the data by -subclass and treatment status, unlike \code{match.data()}, which uses the +subclass and treatment status, unlike \code{match_data()}, which uses the order of the data. 
The returned data frame will contain the variables in the original data set @@ -98,11 +103,11 @@ belong to the same unit since the same unit may appear multiple times if reused in matching with replacement.} These columns will take on the name supplied to the corresponding arguments -in the call to \code{match.data()} or \code{get_matches()}. See Examples for +in the call to \code{match_data()} or \code{get_matches()}. See Examples for an example of rename the \code{distance} column to \code{"prop.score"}. If \code{data} or the original dataset supplied to \code{matchit()} was a -\code{data.table} or \code{tbl}, the \code{match.data()} output will have +\code{data.table} or \code{tbl}, the \code{match_data()} output will have the same class, but the \code{get_matches()} output will always be a base R \code{data.frame}. @@ -112,21 +117,21 @@ class is important when using \code{\link[=rbind.matchdata]{rbind()}} to append matched datasets. } \description{ -\code{match.data()} and \code{get_matches()} create a data frame with +\code{match_data()} and \code{get_matches()} create a data frame with additional variables for the distance measure, matching weights, and subclasses after matching. This dataset can be used to estimate treatment effects after matching or subclassification. \code{get_matches()} is most -useful after matching with replacement; otherwise, \code{match.data()} is +useful after matching with replacement; otherwise, \code{match_data()} is more flexible. See Details below for the difference between them. } \details{ -\code{match.data()} creates a dataset with one row per unit. It will be +\code{match_data()} creates a dataset with one row per unit. It will be identical to the dataset supplied except that several new columns will be added containing information related to the matching. 
When \code{drop.unmatched = TRUE}, the default, units with weights of zero, which are those units that were discarded by common support or the caliper or were simply not matched, will be dropped from the dataset, leaving only the -subset of matched units. The idea is for the output of \code{match.data()} +subset of matched units. The idea is for the output of \code{match_data()} to be used as the dataset input in calls to \code{glm()} or similar to estimate treatment effects in the matched sample. It is important to include the weights in the estimation of the effect and its standard error. The @@ -136,9 +141,9 @@ will only be included if there is a \code{subclass} component in the \code{matchit} object, which does not occur with matching with replacement, in which case \code{get_matches()} should be used. See \code{vignette("estimating-effects")} for information on how to use -\code{match.data()} output to estimate effects. +\code{match_data()} output to estimate effects. \code{match.data()} is an alias for \code{match_data()}. -\code{get_matches()} is similar to \code{match.data()}; the primary +\code{get_matches()} is similar to \code{match_data()}; the primary difference occurs when matching is performed with replacement, i.e., when units do not belong to a single matched pair. In this case, the output of \code{get_matches()} will be a dataset that contains one row per unit for @@ -151,20 +156,20 @@ Unmatched units are dropped. An additional column with unit IDs will be created (named using the \code{id} argument) to identify when the same unit is present in multiple rows. This dataset structure allows for the inclusion of both subclass membership and repeated use of units, unlike the output of -\code{match.data()}, which lacks subclass membership when matching is done +\code{match_data()}, which lacks subclass membership when matching is done with replacement. 
A \code{match.matrix} component of the \code{matchit} object must be present to use \code{get_matches()}; in some forms of -matching, it is absent, in which case \code{match.data()} should be used +matching, it is absent, in which case \code{match_data()} should be used instead. See \code{vignette("estimating-effects")} for information on how to use \code{get_matches()} output to estimate effects after matching with replacement. } \note{ -The most common way to use \code{match.data()} and +The most common way to use \code{match_data()} and \code{get_matches()} is by supplying just the \code{matchit} object, e.g., -as \code{match.data(m.out)}. A data set will first be searched in the +as \code{match_data(m.out)}. A data set will first be searched in the environment of the \code{matchit} formula, then in the calling environment -of \code{match.data()} or \code{get_matches()}, and finally in the +of \code{match_data()} or \code{get_matches()}, and finally in the \code{model} component of the \code{matchit} object if a propensity score was estimated. @@ -176,7 +181,7 @@ dataset used to construct the matched dataset will not be found. This can occur when \code{matchit()} was run within an \code{\link[=lapply]{lapply()}} or \code{purrr::map()} call. The solution, which is recommended in all cases, is simply to supply the original dataset to the \code{data} argument of -\code{match.data()}, e.g., as \code{match.data(m.out, data = original_data)}, as demonstrated in the Examples. +\code{match_data()}, e.g., as \code{match_data(m.out, data = original_data)}, as demonstrated in the Examples. 
} \examples{ @@ -188,7 +193,7 @@ m.out1 <- matchit(treat ~ age + educ + married + data = lalonde, replace = TRUE, caliper = .05, ratio = 4) -m.data1 <- match.data(m.out1, data = lalonde, +m.data1 <- match_data(m.out1, data = lalonde, distance = "prop.score") dim(m.data1) #one row per matched unit head(m.data1, 10) @@ -202,6 +207,6 @@ head(g.matches1, 10) \seealso{ \code{\link[=matchit]{matchit()}}; \code{\link[=rbind.matchdata]{rbind.matchdata()}} -\code{vignette("estimating-effects")} for uses of \code{match.data()} and +\code{vignette("estimating-effects")} for uses of \code{match_data()} and \code{get_matches()} in estimating treatment effects. } diff --git a/man/matchit.Rd b/man/matchit.Rd index e06dc078..54be48d0 100644 --- a/man/matchit.Rd +++ b/man/matchit.Rd @@ -358,7 +358,7 @@ units in that treatment group (i.e., to have an average of 1). If sampling weights are included through the \code{s.weights} argument, they will be included in the \code{matchit()} output object but not incorporated into the matching weights. -\code{\link[=match.data]{match.data()}}, which extracts the matched set from a \code{matchit} object, +\code{\link[=match_data]{match_data()}}, which extracts the matched set from a \code{matchit} object, combines the matching weights and sampling weights. } diff --git a/man/rbind.matchdata.Rd b/man/rbind.matchdata.Rd index 2977b5af..a47905e2 100644 --- a/man/rbind.matchdata.Rd +++ b/man/rbind.matchdata.Rd @@ -11,7 +11,7 @@ } \arguments{ \item{\dots}{Two or more \code{matchdata} or \code{getmatches} objects the -output of calls to \code{\link[=match.data]{match.data()}} and \code{\link[=get_matches]{get_matches()}}, respectively. +output of calls to \code{\link[=match_data]{match_data()}} and \code{\link[=get_matches]{get_matches()}}, respectively. Supplied objects must either be all \code{matchdata} objects or all \code{getmatches} objects.} @@ -27,7 +27,7 @@ original data object. 
} \description{ These functions are \code{\link[=rbind]{rbind()}} methods for objects resulting from calls to -\code{\link[=match.data]{match.data()}} and \code{\link[=get_matches]{get_matches()}}. They function nearly identically to +\code{\link[=match_data]{match_data()}} and \code{\link[=get_matches]{get_matches()}}. They function nearly identically to \code{rbind.data.frame()}; see Details for how they differ. } \details{ @@ -55,17 +55,17 @@ data("lalonde") m.out_b <- matchit(treat ~ age + educ + married + nodegree + re74 + re75, data = subset(lalonde, race == "black")) -md_b <- match.data(m.out_b) +md_b <- match_data(m.out_b) m.out_h <- matchit(treat ~ age + educ + married + nodegree + re74 + re75, data = subset(lalonde, race == "hispan")) -md_h <- match.data(m.out_h) +md_h <- match_data(m.out_h) m.out_w <- matchit(treat ~ age + educ + married + nodegree + re74 + re75, data = subset(lalonde, race == "white")) -md_w <- match.data(m.out_w) +md_w <- match_data(m.out_w) #Bind the datasets together md_all <- rbind(md_b, md_h, md_w) @@ -75,7 +75,7 @@ levels(md_all$subclass) } \seealso{ -\code{\link[=match.data]{match.data()}}, \code{\link[=rbind]{rbind()}} +\code{\link[=match_data]{match_data()}}, \code{\link[=rbind]{rbind()}} See \code{vignettes("estimating-effects")} for details on using \code{rbind()} for effect estimation after subsetting the data. 
diff --git a/vignettes/MatchIt.Rmd b/vignettes/MatchIt.Rmd
index 08fda507..7d0cb0a5 100644
--- a/vignettes/MatchIt.Rmd
+++ b/vignettes/MatchIt.Rmd
@@ -190,15 +190,15 @@ How treatment effects are estimated depends on what form of matching was perform
 [^est]: In some cases, the coefficient on the treatment variable in the outcome model can be used as the effect estimate, but g-computation always yields a valid effect estimate regardless of the form of the outcome model and its use is the same regardless of the outcome model type or matching method (with some slight variations), so we always recommend performing g-computation after fitting the outcome model. G-computation is explained in detail in `vignette("estimating-effects")`.
-Because full matching was successful at balancing the covariates, we'll demonstrate here how to estimate a treatment effect after performing such an analysis. First, we'll extract the matched dataset from the `matchit` object using `match.data()`. This dataset only contains the matched units and adds columns for `distance`, `weights`, and `subclass` (described previously). `r if (use == "none") notice`
+Because full matching was successful at balancing the covariates, we'll demonstrate here how to estimate a treatment effect after performing such an analysis. First, we'll extract the matched dataset from the `matchit` object using `match_data()`. This dataset only contains the matched units and adds columns for `distance`, `weights`, and `subclass` (described previously). `r if (use == "none") notice`
 ```{r, eval = (use != "none")}
-m.data <- match.data(m.out2)
+m.data <- match_data(m.out2)
 head(m.data)
 ```
-We can then model the outcome in this dataset using the standard regression functions in R, like `lm()` or `glm()`, being sure to include the matching weights (stored in the `weights` variable of the `match.data()` output) in the estimation[^3].
Finally, we use `marginaleffects::avg_comparisons()` to perform g-computation to estimate the ATT. We recommend using cluster-robust standard errors for most analyses, with pair membership as the clustering variable; `avg_comparisons()` makes this straightforward.
+We can then model the outcome in this dataset using the standard regression functions in R, like `lm()` or `glm()`, being sure to include the matching weights (stored in the `weights` variable of the `match_data()` output) in the estimation[^3]. Finally, we use `marginaleffects::avg_comparisons()` to perform g-computation to estimate the ATT. We recommend using cluster-robust standard errors for most analyses, with pair membership as the clustering variable; `avg_comparisons()` makes this straightforward.
 [^3]: With 1:1 nearest neighbor matching without replacement, excluding the matching weights does not change the estimates. For all other forms of matching, they are required, so we recommend always including them for consistency.
diff --git a/vignettes/estimating-effects.Rmd b/vignettes/estimating-effects.Rmd
index 63714d0e..18dcd4e4 100644
--- a/vignettes/estimating-effects.Rmd
+++ b/vignettes/estimating-effects.Rmd
@@ -213,7 +213,7 @@ mF <- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
 mF
 #Extract matched data
-md <- match.data(mF)
+md <- match_data(mF)
 head(md)
 ```
@@ -222,7 +222,7 @@ Typically one would assess balance and ensure that this matching specification w
 We perform all analyses using the matched dataset, `md`, which, for matching methods that involve dropping units, contains only the units retained in the sample.
-First, we fit a model for the outcome given the treatment and (optionally) the covariates. It's usually a good idea to include treatment-covariate interactions, which we do below, but this is not always necessary, especially when excellent balance has been achieved.
You can also include the propensity score (usually labeled `distance` in the `match.data()` output), which can add some robustness, especially when modeled flexibly (e.g., with polynomial terms or splines) [@austinDoublePropensityscoreAdjustment2017]; see [here](https://stats.stackexchange.com/a/580174/116195) for an example. +First, we fit a model for the outcome given the treatment and (optionally) the covariates. It's usually a good idea to include treatment-covariate interactions, which we do below, but this is not always necessary, especially when excellent balance has been achieved. You can also include the propensity score (usually labeled `distance` in the `match_data()` output), which can add some robustness, especially when modeled flexibly (e.g., with polynomial terms or splines) [@austinDoublePropensityscoreAdjustment2017]; see [here](https://stats.stackexchange.com/a/580174/116195) for an example. ```{r} #Linear model with covariates @@ -265,7 +265,7 @@ When matching for the ATE (including [coarsened] exact matching, full matching, When matching with replacement (i.e., nearest neighbor or genetic matching with `replace = TRUE`), effect and SE estimation need to account for control unit multiplicity (i.e., repeated use) and within-pair correlations [@hill2006; @austin2020a]. Although @abadie2008 demonstrated analytically that bootstrap SEs may be invalid for matching with replacement, simulation work by @hill2006 and @bodory2020 has found that bootstrap SEs are adequate and generally slightly conservative. See the section "Using Bootstrapping to Estimate Confidence Intervals" for instructions on using the bootstrap and an example that use matching with replacement. -Because control units do not belong to unique pairs, there is no pair membership in the `match.data()` output. 
One can simply change `vcov = ~subclass` to `vcov = "HC3"` in the calls to `comparisons()` and `predictions()` to use robust SEs instead of cluster-robust SEs, as recommended by @hill2006. There is some evidence for an alternative approach that incorporates pair membership and adjusts for reuse of control units, though this has only been studied for survival outcomes [@austin2020a]. This adjustment involves using two-way cluster-robust SEs with pair membership and unit ID as the clustering variables. For continuous and binary outcomes, this involves the following two changes: 1) replace `match.data()` with `get_matches()`, which produces a dataset with one row per unit per pair, meaning control units matched to multiple treated units will appear multiple times in the dataset; 2) set `vcov = ~subclass + id` in the calls to `avg_comparisons()` and `avg_predictions()`. For survival outcomes, a special procedure must be used; see the section on survival outcomes below. +Because control units do not belong to unique pairs, there is no pair membership in the `match_data()` output. One can simply change `vcov = ~subclass` to `vcov = "HC3"` in the calls to `comparisons()` and `predictions()` to use robust SEs instead of cluster-robust SEs, as recommended by @hill2006. There is some evidence for an alternative approach that incorporates pair membership and adjusts for reuse of control units, though this has only been studied for survival outcomes [@austin2020a]. This adjustment involves using two-way cluster-robust SEs with pair membership and unit ID as the clustering variables. For continuous and binary outcomes, this involves the following two changes: 1) replace `match_data()` with `get_matches()`, which produces a dataset with one row per unit per pair, meaning control units matched to multiple treated units will appear multiple times in the dataset; 2) set `vcov = ~subclass + id` in the calls to `avg_comparisons()` and `avg_predictions()`. 
For survival outcomes, a special procedure must be used; see the section on survival outcomes below. #### Matching without pairing @@ -288,7 +288,7 @@ mS <- matchit(A ~ X1 + X2 + X3 + X4 + X5 + method = "subclass", estimand = "ATT") #Extract matched data -md <- match.data(mS) +md <- match_data(mS) fitS <- lm(Y_C ~ subclass * (A * (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9)), @@ -354,7 +354,7 @@ coxph(Surv(Y_S) ~ A, data = md, robust = TRUE, The `coef` column contains the log HR, and `exp(coef)` contains the HR. Remember to always use the `robust se` for the SE of the log HR. The displayed z-test p-value results from using the robust SE. -For matching with replacement, a special procedure described by @austin2020a can be necessary for valid inference. According to the results of their simulation studies, when the treatment prevalence is low (\<30%), a SE that does not involve pair membership (i.e., the `match.data()` approach, as demonstrated above) is sufficient. When treatment prevalence is higher, the SE that ignores pair membership may be too low, and the authors recommend using a custom SE estimator that uses information about both multiplicity and pairing. +For matching with replacement, a special procedure described by @austin2020a can be necessary for valid inference. According to the results of their simulation studies, when the treatment prevalence is low (\<30%), a SE that does not involve pair membership (i.e., the `match_data()` approach, as demonstrated above) is sufficient. When treatment prevalence is higher, the SE that ignores pair membership may be too low, and the authors recommend using a custom SE estimator that uses information about both multiplicity and pairing. Doing so must be done manually for survival models using `get_matches()` and several calls to `coxph()` as demonstrated in the appendix of @austin2020a. 
We demonstrate this below: @@ -407,7 +407,7 @@ boot_fun <- function(data, i) { replace = TRUE) #Extract matched dataset - md <- match.data(m, data = boot_data) + md <- match_data(m, data = boot_data) #Fit outcome model fit <- glm(Y_B ~ A * (X1 + X2 + X3 + X4 + X5 + @@ -465,7 +465,7 @@ mNN <- matchit(A ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data = d) mNN -md <- match.data(mNN) +md <- match_data(mNN) ``` Next, we'll write the function that takes in cluster membership and the sampled indices and returns an estimate. @@ -553,7 +553,7 @@ Although it is straightforward to assess balance overall using `summary()`, it i If we are satisfied with balance, we can then model the outcome with an interaction between the treatment and the moderator. ```{r} -mdP <- match.data(mP) +mdP <- match_data(mP) fitP <- lm(Y_C ~ A * X5, data = mdP, weights = weights) ``` @@ -597,13 +597,13 @@ There are a few common mistakes that should be avoided. It is important not only ### 1. Failing to include weights -Several methods involve weights that are to be used in estimating the treatment effect. With full matching and stratification matching (when analyzed using MMWS), the weights do the entire work of balancing the covariates across the treatment groups. Omitting weights essentially ignores the entire purpose of matching. Some cases are less obvious. When performing matching with replacement and estimating the treatment effect using the `match.data()` output, weights must be included to ensure control units matched to multiple treated units are weighted accordingly. Similarly, when performing k:1 matching where not all treated units receive k matches, weights are required to account for the differential weight of the matched control units. The only time weights can be omitted after pair matching is when performing 1:1 matching without replacement. 
Including weights even in this scenario will not affect the analysis and it can be good practice to always include weights to prevent this error from occurring. There are some scenarios where weights are not useful because the conditioning occurs through some other means, such as when using the direct subclass strategy rather than MMWS for estimating marginal effects after stratification. +Several methods involve weights that are to be used in estimating the treatment effect. With full matching and stratification matching (when analyzed using MMWS), the weights do the entire work of balancing the covariates across the treatment groups. Omitting weights essentially ignores the entire purpose of matching. Some cases are less obvious. When performing matching with replacement and estimating the treatment effect using the `match_data()` output, weights must be included to ensure control units matched to multiple treated units are weighted accordingly. Similarly, when performing k:1 matching where not all treated units receive k matches, weights are required to account for the differential weight of the matched control units. The only time weights can be omitted after pair matching is when performing 1:1 matching without replacement. Including weights even in this scenario will not affect the analysis and it can be good practice to always include weights to prevent this error from occurring. There are some scenarios where weights are not useful because the conditioning occurs through some other means, such as when using the direct subclass strategy rather than MMWS for estimating marginal effects after stratification. ### 2. Failing to use robust or cluster-robust standard errors Robust SEs are required when using weights to estimate the treatment effect. The model-based SEs resulting from weighted least squares or maximum likelihood are inaccurate when using matching weights because they assume weights are frequency weights rather than probability weights. 
Cluster-robust SEs account for both the matching weights and pair membership and should be used when appropriate. Sometimes, researchers use functions in the `survey` package to estimate robust SEs, especially with inverse probability weighting; this is a valid way to compute robust SEs and will give similar results to `sandwich::vcovHC()`.[^10] -[^10]: To use `survey` to adjust for pair membership, one can use the following code to specify the survey design to be used with `svyglm()`: `svydesign(ids = ~subclass, weights = ~weights, data = md)` where `md` is the output of `match.data()`. After `svyglm()`, `comparisons()` can be used, and the `vcov` argument does not need to be specified. +[^10]: To use `survey` to adjust for pair membership, one can use the following code to specify the survey design to be used with `svyglm()`: `svydesign(ids = ~subclass, weights = ~weights, data = md)` where `md` is the output of `match_data()`. After `svyglm()`, `comparisons()` can be used, and the `vcov` argument does not need to be specified. ### 3. Interpreting conditional effects as marginal effects diff --git a/vignettes/matching-methods.Rmd b/vignettes/matching-methods.Rmd index 73c04736..a5f86df0 100644 --- a/vignettes/matching-methods.Rmd +++ b/vignettes/matching-methods.Rmd @@ -165,7 +165,7 @@ Anti-exact matching adds a restriction such that a treated and control unit with ### Matching with replacement (`replace`) -Nearest neighbor matching and genetic matching have the option of matching with or without replacement, and this is controlled by the `replace` argument. Matching without replacement means that each control unit is matched to only one treated unit, while matching with replacement means that control units can be reused and matched to multiple treated units. 
Matching without replacement carries certain statistical benefits in that weights for each unit can be omitted or are more straightforward to include and dependence between units depends only on pair membership. However, it is not asymptotically consistent unless the propensity scores for all treated units are below .5 and there are many more control units than treated units [@savjeInconsistencyMatchingReplacement2022]. Special standard error estimators are sometimes required for estimating effects after matching with replacement [@austin2020a], and methods for accounting for uncertainty are not well understood for non-continuous outcomes. Matching with replacement will tend to yield better balance though, because the problem of "running out" of close control units to match to treated units is avoided, though the reuse of control units will decrease the effect sample size, thereby worsening precision [@austin2013b]. (This problem occurs in the Lalonde dataset used in `vignette("MatchIt")`, which is why nearest neighbor matching without replacement is not very effective there.) After matching with replacement, control units are assigned to more than one subclass, so the `get_matches()` function should be used instead of `match.data()` after matching with replacement if subclasses are to be used in follow-up analyses; see `vignette("estimating-effects")` for details. +Nearest neighbor matching and genetic matching have the option of matching with or without replacement, and this is controlled by the `replace` argument. Matching without replacement means that each control unit is matched to only one treated unit, while matching with replacement means that control units can be reused and matched to multiple treated units. Matching without replacement carries certain statistical benefits in that weights for each unit can be omitted or are more straightforward to include and dependence between units depends only on pair membership. 
However, it is not asymptotically consistent unless the propensity scores for all treated units are below .5 and there are many more control units than treated units [@savjeInconsistencyMatchingReplacement2022]. Special standard error estimators are sometimes required for estimating effects after matching with replacement [@austin2020a], and methods for accounting for uncertainty are not well understood for non-continuous outcomes. Matching with replacement will tend to yield better balance though, because the problem of "running out" of close control units to match to treated units is avoided, though the reuse of control units will decrease the effect sample size, thereby worsening precision [@austin2013b]. (This problem occurs in the Lalonde dataset used in `vignette("MatchIt")`, which is why nearest neighbor matching without replacement is not very effective there.) After matching with replacement, control units are assigned to more than one subclass, so the `get_matches()` function should be used instead of `match_data()` after matching with replacement if subclasses are to be used in follow-up analyses; see `vignette("estimating-effects")` for details. The `reuse.max` argument can also be used with `method = "nearest"` to control how many times each control unit can be reused as a match. Setting `reuse.max = 1` is equivalent to requiring matching without replacement (i.e., because each control can be used only once). Other values allow control units to be matched more than once, though only up to the specified number of times. Higher values will tend to improve balance at the cost of precision. 
diff --git a/vignettes/sampling-weights.Rmd b/vignettes/sampling-weights.Rmd index e7c8ccc3..974a21f7 100644 --- a/vignettes/sampling-weights.Rmd +++ b/vignettes/sampling-weights.Rmd @@ -147,12 +147,12 @@ Note that had we not added sampling weights to `mF`, the matching specification ## Estimating the Effect -Estimating the treatment effect after matching is straightforward when using sampling weights. Effects are estimated in the same way as when sampling weights are excluded, except that the matching weights must be multiplied by the sampling weights for use in the outcome model to yield accurate, generalizable estimates. `match.data()` and `get_matches()` do this automatically, so the weights produced by these functions already are a product of the matching weights and the sampling weights. Note this will only be true if sampling weights are incorporated into the `matchit` object. With `avg_comparisons()`, only the sampling weights should be included when estimating the treatment effect. +Estimating the treatment effect after matching is straightforward when using sampling weights. Effects are estimated in the same way as when sampling weights are excluded, except that the matching weights must be multiplied by the sampling weights for use in the outcome model to yield accurate, generalizable estimates. `match_data()` and `get_matches()` do this automatically, so the weights produced by these functions already are a product of the matching weights and the sampling weights. Note this will only be true if sampling weights are incorporated into the `matchit` object. With `avg_comparisons()`, only the sampling weights should be included when estimating the treatment effect. Below we estimate the effect of `A` on `Y_C` in the matched and sampling weighted sample, adjusting for the covariates to improve precision and decrease bias. 
```{r, eval = eval_est} -md_F_s <- match.data(mF_s) +md_F_s <- match_data(mF_s) fit <- lm(Y_C ~ A * (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9), data = md_F_s, @@ -166,7 +166,7 @@ avg_comparisons(fit, wts = "SW") ``` -Note that `match.data()` and `get_weights()` have the option `include.s.weights`, which, when set to `FALSE`, makes it so the returned weights do not incorporate the sampling weights and are simply the matching weights. Because one might to forget to multiply the two sets of weights together, it is easier to just use the default of `include.s.weights = TRUE` and ignore the sampling weights in the rest of the analysis (because they are already included in the returned weights). +Note that `match_data()` and `get_weights()` have the option `include.s.weights`, which, when set to `FALSE`, makes it so the returned weights do not incorporate the sampling weights and are simply the matching weights. Because one might to forget to multiply the two sets of weights together, it is easier to just use the default of `include.s.weights = TRUE` and ignore the sampling weights in the rest of the analysis (because they are already included in the returned weights). ## Code to Generate Data used in Examples