diff --git a/Makefile b/Makefile
index e8c9ac672..bef0a9e58 100644
--- a/Makefile
+++ b/Makefile
@@ -60,6 +60,7 @@ deploy: ## Deploy book to Github website
	Rscript -e "source('book/utils/utils.R');get_quarto_yaml(dev = FALSE)"
	make html
	rsync -a book/_book/* ./
+	rsync -a book/data ./
	rm -rf book Makefile _quarto.yml utils.R
	git add .
	git commit -m "Update book"
@@ -81,6 +82,7 @@ deploydev: ## Deploy dev book to Github website
	Rscript -e "source('book/utils/utils.R');get_quarto_yaml(dev = TRUE)"
	make htmldev
	rsync -a book/_book/* ./dev/
+	rsync -a book/data ./
	rm -rf book Makefile _quarto.yml utils.R
	git add .
	git commit -m "Update book"
diff --git a/NEWS.md b/NEWS.md
index ffca77095..d9a929c59 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -13,6 +13,7 @@ Support new models:
 
 Misc:
 
+* `inferences(method = "simulation")` now uses the original point estimate rather than the mean of the simulation distribution. Issue #851.
 * Better documentation and error messages for `newdata=NULL`
 * Some performance improvements for `predictions()` and `marginalmeans()` (#880, #882, @etiennebacher).
 
diff --git a/book/articles/experiments.qmd b/book/articles/experiments.qmd
index 537aafcd9..a82459396 100644
--- a/book/articles/experiments.qmd
+++ b/book/articles/experiments.qmd
@@ -24,17 +24,20 @@ theme_set(theme_minimal())
 
 A 2×2 factorial design is a type of experimental design that allows researchers to understand the effects of two independent variables (each with two levels) on a single dependent variable. The design is popular among academic researchers as well as in industry when running A/B tests.
 
-To illustrate how to analyze these designs with `marginaleffects`, we will use the `mtcars` dataset. We'll analyze fuel efficiency, `mpg` (miles per gallon), as a function of `am` (transmission type) and `vs` (engine shape).
-
-`vs` is an indicator variable for if the car has a straight engine (1 = straight engine, 0 = V-shaped). `am` is an indicator variable for if the car has manual transmission (1 = manual transmission, 0=automatic transmission). There are then four types of cars (1 type for each of the four combinations of binary indicators).
+In this notebook, we illustrate how to analyze these designs with [the `marginaleffects` package for `R`](https://marginaleffects.com). As we will see, `marginaleffects` includes many convenient functions for analyzing both experimental and observational data, and for plotting our results.
 
 ### Fitting a Model
 
+We will use the `mtcars` dataset. We'll analyze fuel efficiency, `mpg` (miles per gallon), as a function of `am` (transmission type) and `vs` (engine shape).
+
+`vs` is an indicator variable for whether the car has a straight engine (1 = straight engine, 0 = V-shaped). `am` is an indicator variable for whether the car has a manual transmission (1 = manual transmission, 0 = automatic transmission). There are then four types of cars (one type for each of the four combinations of binary indicators).
+
 Let's start by creating a model for fuel efficiency. For simplicity, we'll use linear regression and model the interaction between `vs` and `am`.
 ```{r}
 library(tidyverse)
 library(marginaleffects)
+library(modelsummary)
 
 ## See ?mtcars for variable definitions
 fit <- lm(mpg ~ vs + am + vs:am, data=mtcars) # equivalent to ~ vs*am
@@ -52,22 +55,29 @@ plot_predictions(fit, by = c("vs", "am"))
 
 ### Evaluating Effects From The Model Summary
 
-Since this model is fairly simple the estimated differences between any of the four possible combinations of `vs` and `am` can be read from `summary(fit)` with a little arithmetic. The estimated model is
+Since this model is fairly simple, the estimated differences between any of the four possible combinations of `vs` and `am` can be read from the regression table, which we create using the `modelsummary` package:
 
-$$ \mbox{mpg} = 20.743 + 5.693 \cdot \mbox{vs} + 4.700 \cdot \mbox{am} + 2.929 \cdot \mbox{vs} \cdot \mbox{am} \>. $$
+```{r}
+modelsummary(fit, gof_map = c("r.squared", "nobs"))
+```
 
-The estimated differences in fuel efficiency are:
+We can express the same results in the form of a linear equation:
 
-* 5.693 mpg between straight engines and V-shaped engines when the car has automatic transmission.
-* 4.700 mpg between manual transmissions and automatic transmissions when the car has a V-shaped engine.
-* 7.629 mpg between manual transmissions and automatic transmissions when the car has a straight engine.
-* 13.322 mpg between manual transmissions with straight engines and automatic transmissions with V-shaped engines.
+$$ \mbox{mpg} = 15.050 + 5.693 \cdot \mbox{vs} + 4.700 \cdot \mbox{am} + 2.929 \cdot \mbox{vs} \cdot \mbox{am}. $$
 
-Reading off these differences from the model summary becomes more difficult as more variables are added (not to mention obtaining their estimated standard errors becomes nightmarish). To make the process easier on ourselves, we can leverage the `avg_comparisons` function to get the same estimates and their uncertainty.
+With a little arithmetic, we can compute estimated differences in fuel efficiency between different groups:
+
+* 4.700 mpg between `am=1` and `am=0`, when `vs=0`.
+* 5.693 mpg between `vs=1` and `vs=0`, when `am=0`.
+* 7.629 mpg between `am=1` and `am=0`, when `vs=1`.
+* 8.621 mpg between `vs=1` and `vs=0`, when `am=1`.
+* 13.322 mpg between a car with `am=1` and `vs=1`, and a car with `am=0` and `vs=0`.
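+
+As a quick sanity check (an editorial sketch, not part of the original vignette), each of these differences can be recovered from the fitted coefficients; for instance, the `am` contrast when `vs=1` combines the main effect and the interaction:
+
+```{r}
+b <- coef(fit)
+b["am"]                        # 4.700: am=1 vs am=0, when vs=0
+b["vs"]                        # 5.693: vs=1 vs vs=0, when am=0
+b["am"] + b["vs:am"]           # 7.629: am=1 vs am=0, when vs=1
+b["vs"] + b["vs:am"]           # 8.621: vs=1 vs vs=0, when am=1
+b["vs"] + b["am"] + b["vs:am"] # 13.322: cross-contrast
+```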
+
+Reading off these differences from the model summary is relatively straightforward in very simple cases like this one. However, it becomes more difficult as more variables are added to the model, not to mention that obtaining estimated standard errors becomes nightmarish. To make the process easier, we can leverage the `avg_comparisons()` function from the `marginaleffects` package to compute the appropriate quantities and standard errors.
 
 ### Using `avg_comparisons` To Estimate All Differences
 
-Note that the dot in the grey rectangle is the estimated fuel efficiency when `vs=0` and `am=0` (that is, for an automatic transmission car with V-shaped engine).
+The grey rectangle in the graph below marks the estimated fuel efficiency when `vs=0` and `am=0`, that is, for an automatic transmission car with a V-shaped engine.
 
 ```{r, echo = FALSE}
 plot_predictions(fit, by = c("vs", "am")) +
@@ -78,14 +88,17 @@ plot_predictions(fit, by = c("vs", "am")) +
     fill = "black")
 ```
 
-Let's use `avg_comparisons` to get the difference between straight engines and V-shaped engines when the car has automatic transmission. The call to `avg_comparisons` is shown below and results in the same estimate we made directly from the model. The contrast corresponding to this estimate is shown in the plot below.
+Let's use `avg_comparisons` to get the difference between straight engines and V-shaped engines when the car has automatic transmission. In this call, the `variables` argument indicates that we want to estimate the effect of a change of 1 unit in the `vs` variable. The `newdata = datagrid(am = 0)` argument determines the values of the covariates at which we want to evaluate the contrast.
 
 ```{r}
 avg_comparisons(fit,
-  newdata = datagrid(am = 0),
-  variables = "vs")
+  variables = "vs",
+  newdata = datagrid(am = 0))
 ```
 
+As expected, the results produced by `avg_comparisons()` are exactly the same as those we read from the model summary table. The contrast that we just computed corresponds to the change illustrated by the arrow in this plot:
+
 ```{r, echo = FALSE}
 plot_predictions(fit, by = c("vs", "am")) +
   annotate(
@@ -94,12 +107,12 @@ plot_predictions(fit, by = c("vs", "am")) +
     arrow = arrow(type = "closed", length = unit(0.02, "npc")))
 ```
 
-The next difference is between manual transmissions and automatic transmissions when the car has a V-shaped engine. Again, the call to `avg_comparisons` is shown below, and the corresponding contrast is indicated in the plot below using an arrow.
+The next difference that we compute is between manual transmissions and automatic transmissions when the car has a V-shaped engine. Again, the call to `avg_comparisons` is shown below, and the corresponding contrast is indicated in the plot below using an arrow.
 
 ```{r}
 avg_comparisons(fit,
-  newdata = datagrid(vs = 0),
-  variables = "am")
+  variables = "am",
+  newdata = datagrid(vs = 0))
 ```
 
 ```{r, echo = FALSE}
@@ -110,12 +123,12 @@ plot_predictions(fit, by = c("vs", "am")) +
     arrow = arrow(type = "closed", length = unit(0.02, "npc")))
 ```
 
-The third difference we estimated was between manual transmissions and automatic transmissions when the car has a straight engine. The model call and contrast are
+The third difference we estimated was between manual transmissions and automatic transmissions when the car has a straight engine. The model call and contrast are:
 
 ```{r}
 avg_comparisons(fit,
-  newdata = datagrid(vs = 1),
-  variables = "am")
+  variables = "am",
+  newdata = datagrid(vs = 1))
 ```
 
 ```{r, echo = FALSE}
@@ -126,11 +139,10 @@ plot_predictions(fit, by = c("vs", "am")) +
     arrow = arrow(type = "closed", length = unit(0.02, "npc")))
 ```
 
-And the last difference and contrast between manual transmissions with straight engines and automatic transmissions with V-shaped engines is
+Finally, we estimate the difference between manual transmissions with straight engines and automatic transmissions with V-shaped engines. We call this a "cross-contrast" because we are measuring the difference between two groups that differ on two explanatory variables at the same time. To compute this contrast, we use the `cross` argument of `avg_comparisons`:
 
 ```{r}
 avg_comparisons(fit,
-  newdata = datagrid("vs", "am"),
   variables = c("am", "vs"),
   cross = TRUE)
 ```
@@ -147,8 +159,10 @@ plot_predictions(fit, by = c("vs", "am")) +
 
 The 2x2 design is a very popular design, and when using a linear model, the estimated differences between groups can be directly read off from the model summary, if not with a little arithmetic. However, when using models with a non-identity link function, or when seeking to obtain the standard errors for estimated differences, things become considerably more difficult.
 
 This vignette showed how to use `avg_comparisons` to specify contrasts of interests and obtain standard errors for those differences. The approach used applies to all generalized linear models and effects can be further stratified using the `by` argument (although this is not shown in this vignette.)
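+
+As a brief illustration of that `by` argument (an editorial sketch, not part of the original vignette), we could estimate the `am` contrast separately for each engine shape in a single call:
+
+```{r}
+#| eval: false
+# average comparisons for `am`, stratified by the levels of `vs`
+avg_comparisons(fit,
+  variables = "am",
+  by = "vs")
+```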
+
 ## Regression adjustment
+
 Many analysts who conduct and analyze experiments wish to use regression adjustment with a linear regression model to improve the precision of their estimate of the treatment effect. Unfortunately, regression adjustment can introduce small-sample bias and other undesirable properties (Freedman 2008). Lin (2013) proposes a simple strategy to fix these problems in sufficiently large samples:
 
 1. Center all predictors by subtracting each of their means.
diff --git a/book/articles/fig/mrt.jpg b/book/articles/fig/mrt.jpg
new file mode 100755
index 000000000..b06e75736
Binary files /dev/null and b/book/articles/fig/mrt.jpg differ
diff --git a/book/articles/mrp.qmd b/book/articles/mrp.qmd
new file mode 100755
index 000000000..2fd9854c8
--- /dev/null
+++ b/book/articles/mrp.qmd
@@ -0,0 +1,249 @@
+---
+title: "MrP"
+---
+
+```{r, include = FALSE}
+options(width = 120)
+## this vignette is in .Rbuildignore because lme4 is not available on old CRAN
+## test machines.
+
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  fig.width = 9,
+  fig.asp = .4,
+  out.width = "100%",
+  warning = FALSE,
+  message = FALSE,
+  comment = "#>"
+)
+library(marginaleffects)
+library(modelsummary)
+library(brms)
+library(ggplot2)
+library(ggridges)
+library(gt)
+library(tidyverse)
+theme_set(theme_minimal())
+```
+
+```{r}
+#| include: false
+#| cache: true
+library(marginaleffects)
+library(countrycode)
+library(tidyverse)
+library(modelsummary)
+library(gt)
+library(brms)
+set.seed(1024)
+
+cities <- c("New York City, NY", "Los Angeles, CA", "Chicago, IL", "Houston, TX", "Phoenix, AZ", "Philadelphia, PA", "San Antonio, TX", "San Diego, CA", "Dallas, TX", "San Jose, CA", "Austin, TX", "Jacksonville, FL", "Fort Worth, TX", "Columbus, OH", "San Francisco, CA", "Charlotte, NC", "Indianapolis, IN", "Seattle, WA", "Denver, CO", "Washington, DC", "Boston, MA", "Nashville, TN", "El Paso, TX", "Detroit, MI", "Memphis, TN", "Portland, OR", "Oklahoma City, OK", "Las Vegas, NV", "Louisville, KY", "Baltimore, MD", "Milwaukee, WI", "Albuquerque, NM", "Tucson, AZ", "Fresno, CA", "Sacramento, CA", "Mesa, AZ", "Atlanta, GA", "Kansas City, MO", "Colorado Springs, CO", "Miami, FL")
+cities <- rev(sort(cities))
+education <- c("High school or less", "Post-secondary")
+age <- c("18-54", "55+")
+stratification <- expand.grid(
+  city = cities,
+  education = education,
+  age = age) |>
+  mutate(
+    population_share = runif(n(), min = .25, max = .75),
+    population_share = population_share / sum(population_share),
+    .by = "city") |>
+  arrange(city)
+N <- 1000
+survey <- data.frame(
+  city = sample(cities, N, replace = TRUE),
+  age = sample(age, N, replace = TRUE),
+  education = sample(education, N, replace = TRUE)
+)
+survey <- data.frame(
+  respondent = sprintf("%04d", 1:N),
+  survey)
+M <- model.matrix(~., survey)
+b <- runif(ncol(M))
+survey$meat_substitute <- as.numeric(cut(M %*% b, breaks = 7))
+
+mod <- brm(
+  meat_substitute ~ age + education + (1 | city),
+  data = survey,
+  backend = "cmdstanr")
+```
+
+Data analysts often want to learn about a population using samples that are not representative of that population. Consider a few examples:
+
+* _Market research:_ A supermarket chain wants to assess consumer preferences in each of the markets where it operates, but it would be too expensive to collect distinct representative samples for many cities.
+* _Political polling:_ A newspaper conducts a nationally representative survey in the lead-up to a presidential election, and wishes to compute state-by-state estimates of voting intentions.
+* _Online surveys:_ A researcher conducts a poll online, but the respondents are younger and more highly educated than the general population.
+
+This notebook introduces [Multilevel Regression with Poststratification (MrP)](https://en.wikipedia.org/wiki/Multilevel_regression_with_poststratification), a popular strategy which can be used to limit the distortions in unrepresentative data, or to estimate subgroup characteristics on the basis of data gathered at a different level of aggregation. MrP can be deployed in a wide range of contexts (see this [paper](https://www.monicaalexander.com/pdf/mrp_chapter.pdf) and this [blog post](https://www.monicaalexander.com/posts/2019-08-07-mrp/) by the always excellent [Monica Alexander](https://www.monicaalexander.com/)).
+
+As we will see below, MrP is super easy to implement using the `marginaleffects` package for `R`. `marginaleffects` also offers tremendous benefits to analysts, including a consistent user interface to over 80 classes of statistical models, as well as many post-estimation and hypothesis testing tools. To illustrate these benefits, we will consider a hypothetical example with simulated data.^[See the bottom of this page for the simulation code.]
+
+![MrP, not Mister T.](fig/mrt.jpg)
+
+## MrP for surveys and market research
+
+Imagine that a national supermarket chain plans to introduce a line of meat substitutes in some of its stores. To guide marketing and distribution efforts, the company would like to know the share of the population in each city that is interested in tasting meat substitutes.
+
+The company conducts a telephone survey of 1000 randomly selected adults from 40 large American cities. For each survey respondent, we record the city of residence, age, level of education, and their level of interest in tasting meat substitutes. The variable we focus on is "interest in meat substitutes," measured on a scale of 1 to 7 where 7 means "very interested" and 1 means "not at all interested". Our ultimate goal is to estimate the average of this 7-point scale for each city.
+
+The (simulated) data that we will use is stored in an `R` data frame called `survey`. We can use the `nrow()` function to confirm the sample size, and the `datasummary_df()` function from the `modelsummary` package to display the first few rows of data:
+
+```{r}
+library(marginaleffects)
+library(modelsummary)
+library(tidyverse)
+library(ggridges)
+library(brms)
+
+nrow(survey)
+
+datasummary_df(survey[1:5, ])
+```
+
+This dataset includes 1000 observations: one per survey respondent. Unfortunately, it is not straightforward to compute precise city-by-city estimates, because although the number of respondents is large overall, the number of respondents within each of the 40 cities is relatively small. Moreover, the company's sampling strategy does not guarantee that subsamples are representative of each city's population. MrP can help us circumvent these problems in two steps:
+
+1. *Multilevel regression (Mr)*: Estimate a multilevel regression at the individual level, with random intercepts for cities.
+2. *Poststratification (P)*: Adjust the predictions of the model based on the demographic characteristics of each city, as formalized just below.
+
+In the rest of this notebook, we show that the `marginaleffects` package makes it very easy to apply these steps.
+
+## "Mr" for "Multilevel regression"
+
+The first step of MrP is to use individual-level data to estimate a model that predicts the value of the variable of interest. One of the great benefits of using `marginaleffects` for MrP is that this package is agnostic to the type of model used to make predictions. Analysts can use almost any model they like, and the workflow described below would remain the same.
+
+One popular choice for MrP is to estimate a multilevel regression model with random intercepts for each of the geographic regions of interest. To do so, analysts could use many different `R` packages, including [`lme4`](https://cran.r-project.org/package=lme4), [`glmmTMB`](https://cran.r-project.org/package=glmmTMB), or [`brms`](https://cran.r-project.org/package=brms). In this notebook, we use the `brms::brm()` function to estimate a Bayesian multilevel model, with the `age` and `education` variables as fixed components, and random intercepts at the city level:
+
+```{r}
+#| eval: false
+mod <- brm(meat_substitute ~ age + education + (1 | city), data = survey)
+```
+
+We can visualize the model estimates using the `modelplot()` function from the `modelsummary` package:
+
+```{r}
+#| warning: false
+#| fig-asp: .3
+#| cache: false
+modelplot(mod)
+```
+
+We are now ready for the second MrP step: poststratification.
+
+## "P" for "Poststratification"
+
+In the second MrP step, we use data from the US Census (or a similar source) to create a "poststratification table." This table includes one row for each combination of the predictor values in our model. In our model, the `age` variable has 2 categories ("18-54" and "55+"); the `education` variable has 2 categories ("High school or less" and "Post-secondary"); and the `city` variable has 40 unique entries. Therefore, the poststratification table must have $2 \times 2 \times 40 = 160$ entries.
+
+Crucially, the poststratification dataset must also include a column with the population share of each combination of predictor values. Consider the table used by our hypothetical supermarket chain. This table includes 160 rows:
+
+```{r}
+nrow(stratification)
+```
+
+And here are the entries for the city of Tucson, AZ:
+
+```{r}
+tucson <- subset(stratification, city == "Tucson, AZ")
+datasummary_df(tucson)
+```
+
+According to these (simulated) data, the share of Tucson residents who are between 18 and 54 years old and have a high school education or less is about `r sprintf("%.0f%%", tucson$population_share[1] * 100)`.
+
+We can use the `predictions()` function from the `marginaleffects` package to predict the value of the `meat_substitute` variable for each of the four categories of residents in Tucson:
+
+```{r}
+predictions(mod, newdata = tucson)
+```
+
+The MrP estimate for this city is simply the weighted average of predicted values, where weights are the population shares of each category of residents. In this context, we have:
+
+```{r}
+#| include: false
+p <- predictions(mod, newdata = tucson)
+a <- sprintf("%.2f \\times %.2f", p$population_share, p$estimate)
+a <- paste(a, collapse = " + ")
+a <- paste(a, "=", sprintf("%.2f", sum(p$population_share * p$estimate)))
+```
+
+$$`r a`$$
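+
+Equivalently (an editorial sketch; `weighted.mean()` is base R, and `p` holds the Tucson predictions computed above, including the `population_share` column carried over from `newdata`):
+
+```{r}
+#| eval: false
+p <- predictions(mod, newdata = tucson)
+weighted.mean(p$estimate, w = p$population_share)
+```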
+
+Instead of computing estimates by hand for each city, we can use the `by` and `wts` arguments of the `predictions()` function to do everything everywhere all at once:
+
+```{r}
+#| cache: true
+p <- predictions(            # Compute predictions,
+  model = mod,               # using the multilevel regression model `mod`,
+  newdata = stratification,  # for each row of the `stratification` table.
+  by = "city",               # Then, take the weighted average of predictions by city,
+  wts = "population_share")  # using demographic weights.
+p
+```
+
+Since we estimated a Bayesian model in the "Mr" step, we can now use the `posterior_draws()` function to extract draws from the posterior distribution of the MrP estimates. This allows us to compute credible intervals for each city, and to draw many fancy plots like this one:
+
+```{r}
+#| fig-asp: 1.5
+#| message: false
+#| warning: false
+p |>
+  # extract draws from the posterior distribution
+  posterior_draws() |>
+  # sort cities by interest in meat substitutes
+  arrange(estimate) |>
+  mutate(city = factor(city, levels = rev(unique(city)))) |>
+  # plot the results
+  ggplot(aes(x = draw, y = city)) +
+  geom_density_ridges() +
+  theme_minimal() +
+  theme(panel.grid = element_blank()) +
+  labs(
+    x = "Average interest in meat substitutes",
+    y = NULL,
+    title = "Estimated interest in meat substitutes by city",
+    subtitle = "Multilevel Regression and Poststratification",
+    caption = "Source: Simulated data")
+```
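+
+If we prefer a numerical summary to a plot, the same draws can be collapsed into point estimates and 95% credible intervals per city (an editorial sketch; it assumes the long format returned by `posterior_draws()`, with one `draw` value per posterior sample and city, and uses `dplyr`):
+
+```{r}
+#| eval: false
+p |>
+  posterior_draws() |>
+  summarize(
+    estimate = mean(draw),           # posterior mean
+    conf.low = quantile(draw, .025), # lower bound of 95% credible interval
+    conf.high = quantile(draw, .975),# upper bound
+    .by = city)
+```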
+
+## Data simulation
+
+All the data used on this page were simulated using this code:
+
+```{r}
+#| eval: false
+library(marginaleffects)
+library(countrycode)
+library(tidyverse)
+library(modelsummary)
+library(gt)
+library(brms)
+set.seed(1024)
+
+cities <- c("New York City, NY", "Los Angeles, CA", "Chicago, IL", "Houston, TX", "Phoenix, AZ", "Philadelphia, PA", "San Antonio, TX", "San Diego, CA", "Dallas, TX", "San Jose, CA", "Austin, TX", "Jacksonville, FL", "Fort Worth, TX", "Columbus, OH", "San Francisco, CA", "Charlotte, NC", "Indianapolis, IN", "Seattle, WA", "Denver, CO", "Washington, DC", "Boston, MA", "Nashville, TN", "El Paso, TX", "Detroit, MI", "Memphis, TN", "Portland, OR", "Oklahoma City, OK", "Las Vegas, NV", "Louisville, KY", "Baltimore, MD", "Milwaukee, WI", "Albuquerque, NM", "Tucson, AZ", "Fresno, CA", "Sacramento, CA", "Mesa, AZ", "Atlanta, GA", "Kansas City, MO", "Colorado Springs, CO", "Miami, FL")
+cities <- rev(sort(cities))
+education <- c("High school or less", "Post-secondary")
+age <- c("18-54", "55+")
+stratification <- expand.grid(
+  city = cities,
+  education = education,
+  age = age) |>
+  mutate(
+    population_share = runif(n(), min = .25, max = .75),
+    population_share = population_share / sum(population_share),
+    .by = "city") |>
+  arrange(city)
+N <- 1000
+survey <- data.frame(
+  city = sample(cities, N, replace = TRUE),
+  age = sample(age, N, replace = TRUE),
+  education = sample(education, N, replace = TRUE)
+)
+survey <- data.frame(
+  respondent = sprintf("%04d", 1:N),
+  survey)
+M <- model.matrix(~., survey)
+b <- runif(ncol(M))
+survey$meat_substitute <- as.numeric(cut(M %*% b, breaks = 7))
+```
diff --git a/book/utils/_quarto.yml b/book/utils/_quarto.yml
index 939289284..a8623f5b8 100644
--- a/book/utils/_quarto.yml
+++ b/book/utils/_quarto.yml
@@ -49,6 +49,7 @@ book:
       - articles/logit.qmd
       - articles/lme4.qmd
      - articles/multiple_imputation.qmd
+      - articles/mrp.qmd
       - articles/svalues.qmd
     - part: Misc
       collapse-level: 1