# A causal inference framework for selection bias^[This chapter is a pre-copyedited, author-produced version of an article accepted for publication in Public Opinion Quarterly following peer review. The version of record: Mercer, Andrew W., Frauke Kreuter, Scott Keeter, and Elizabeth A. Stuart. 2017. “Theory and Practice in Nonprobability Surveys: Parallels Between Causal Inference and Survey Inference.” Public Opinion Quarterly 81 (S1): 250–71 is available online at: https://doi.org/10.1093/poq/nfw060] {#ch2}
\chaptermark{A causal inference framework}
The growing use of surveys that do not use traditional probability sampling has provoked both interest and concern from the survey community. Rising data collection costs coupled with declining response rates have highlighted the appeal of lower cost, nonprobability surveys that can be fielded rapidly online. However, respondent self-selection into these surveys renders design-based methods of survey inference inapplicable, and raises concerns about the potential for biased results.
Selection bias refers to systematic differences between a statistical estimate and the true population parameter caused by problems with the composition of the sample (rather than errors in measurement). Traditionally, survey researchers think of selection bias as resulting from noncoverage – when the sampling frame omits portions of the target population – or nonresponse – when selected units do not complete the survey. These concepts are tied to a process of starting with a complete population and randomly selecting a subset. These categories may prove limiting when applied in a nonprobability context. Many nonprobability surveys do not originate from anything resembling a sampling frame. Even the idea of a sample as a finite set of units, some of which may fail to respond, does not apply to many nonprobability surveys. For nonprobability surveys, the processes that lead to a respondent being included in a sample are numerous, potentially arbitrary, and may not resemble the traditional probability-based survey process at all.
Rather than evaluate nonprobability surveys using concepts designed for a different inferential framework and different data collection practices, we propose a more general framework that emphasizes the characteristics of the realized sample, regardless of how it was generated. The underpinnings of this framework are not new, but come from research into the estimation of causal effects from experimental and non-experimental data. In fields such as epidemiology, political science, and economics, where randomized experiments are frequently not possible and observational studies are commonplace, research has focused on identifying the conditions under which valid statistical inferences about causal effects can be made using observational data. In the causal context, the parameter of interest is a contrast between experimental treatments, whereas surveys measure a broad range of estimates including means, totals, correlations and other measures of association. Despite these differences, the conditions that produce selection bias in causal analyses also apply in a survey context.
Others have noted similarities between causal inference and survey inference. @little2002 apply many of these same concepts to experiments, observational studies, survey nonresponse, and imputation. @groves2006a uses a causal framework to describe when nonresponse will produce bias in survey estimates. @keiding2016 review the many objectives and challenges shared by both epidemiological studies and surveys, and suggest that both fields could benefit from sharing methodologies.
Drawing on this work, we identify three components that determine whether or not nonrandom selection could lead to biased results:
- Exchangeability – Are all confounding variables known and measured for all sampled units?
- Positivity – Does the sample include all of the necessary kinds of units in the target population, or are certain groups with distinct characteristics missing?
- Composition – Does the sample distribution match the target population with respect to the confounding variables, or can it be adjusted to match?
In this paper, we first describe how this framework applies in the familiar context of randomized experiments and probability-based surveys before demonstrating how it extends to cover observational studies and nonprobability surveys. Second, we demonstrate the mechanics by which each component can produce bias in survey estimates by way of a simplified example. Finally, through the lens of this framework, we provide a critical review of current practices in online, nonprobability data collection and their implications for selection bias.
## Randomization and unbiased inference in experiments and surveys
\sectionmark{Randomization and unbiased inference}
Questions about causal effects are usually framed in terms of potential outcomes or counterfactuals [@rubin1974]. A patient’s outcome may be different if he is given Treatment A or Treatment B. Prior to choosing a treatment, either outcome is possible, but we observe only the results under the treatment that is actually provided to the patient. We can never observe what would have happened if a different treatment had been applied. The causal effect is the difference between the two potential outcomes. Although we can never observe both outcomes on a single individual, we can compare the average outcome for people who receive Treatment A to that of people who receive Treatment B to make inferences about which treatment is better. When treatments are assigned randomly, we can be reasonably confident that observed differences in the outcomes across treatment conditions are due to the treatments themselves and not some other difference between the two groups. When treatments are not assigned randomly, these assessments are more difficult. For instance, if patients who receive Treatment A tend to do worse, but Treatment A is usually given to sicker patients, it is difficult to know if the difference is due to the treatment or due to the fact that the patients who received it were in worse shape to begin with. The baseline level of sickness is known as a confounder. Confounders are variables associated with both the choice of treatment and the outcome of interest, and are the primary source of selection bias in causal analyses.
The parallels between causal inference and survey inference are substantial. A probability-based survey is essentially a randomized experiment where the pool of subjects is the set of units on the sampling frame and the treatment is selection into the survey. Unlike experiments where we observe outcomes on both treated and untreated subjects, in surveys we observe outcomes only on the selected units, with the expectation that there should be no difference between selected and non-selected units. The conditions under which causal effects can be estimated without selection bias are analogous to the conditions that produce unbiased estimates in surveys. Before discussing nonprobability surveys, we will first examine how these conditions are met in the context of randomized experiments and probability-based surveys.
### Strong Ignorability – Exchangeability and Positivity
@rosenbaum1983a devised the notion of strong ignorability to describe the conditions under which inferences about causal effects can be estimated without selection bias for a given sample. Strong ignorability consists of two requirements. The first, known as “exchangeability” [@greenland1986; @greenland2009], “ignorability”, “no unobserved confounding,” or “no hidden bias” [@rosenbaum2002], requires the mechanism by which subjects are assigned a treatment to be independent of the measured outcome either unconditionally or conditional upon observed covariates. Unconditional exchangeability is analogous to the notion of data that is missing completely at random (MCAR), whereas conditional exchangeability corresponds to data missing at random (MAR) [@little2002]. When unobserved confounders are present, it is not possible to isolate the effect of the treatment from the effect of the confounder without additional assumptions.
Second, it must be possible for any subject to have received any of the treatments. This requirement is called positivity because it requires that all subjects have a positive probability of receiving each treatment. If certain types of subjects can receive only the treatment or only the control, it is not possible to learn about causal effects for those subjects, and the treatment and control groups will have systematic differences that cannot be resolved. In practice, we generally require not just a positive probability but also enough cases to produce sufficiently precise statistical estimates [@hernan2006; @petersen2012].
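In the survey setting these two conditions can be stated compactly. Writing $S$ for selection into the sample, $Y$ for the survey outcome, and $X$ for the observed covariates (notation introduced here only for exposition), conditional exchangeability and positivity amount to

$$
Y \perp S \mid X
\qquad \text{and} \qquad
\Pr(S = 1 \mid X = x) > 0 \ \text{for all } x \text{ in the target population.}
$$

In the treatment-assignment setting, the analogous statements replace $Y$ with the potential outcomes and additionally require $\Pr(S = 1 \mid X = x) < 1$, since every subject must be able to receive either condition.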
In experiments, random treatment assignment guarantees that on average, the exchangeability and positivity conditions will be met. Randomization ensures exchangeability by preventing any relationship between treatment assignment and unobserved variables and ensures positivity because any subject has a chance of receiving any treatment. In probability-based surveys, random selection functions in much the same way. By randomly selecting a sample from the entire population, there can be no unobserved variables systematically associated with selection, and all members of the population have a chance of being included.
### Composition
For experiments, the composition of treatment groups with respect to potential confounders is important in two respects. First, the distribution of potential confounders in the treatment group needs to match the distribution in the control group. Random treatment assignment guarantees that this will occur naturally on average, and this equivalence between treatment groups is implied whenever unconditional exchangeability holds. Second, the composition of the experimental sample affects the degree to which findings can be generalized to an external population.
Strong ignorability guarantees only that the results of an experiment are generalizable to the group of subjects included in an experiment; in other words, it ensures “internal validity” but does not necessarily imply “external validity” [@shadish2002]. It is rare for samples in randomized trials, which have historically prioritized internal validity, to match a larger population. Because of this there has been a growing literature on methods to allow the generalization of experimental results to target populations, including reweighting strategies that aim to equate the experimental sample and the population with respect to observed characteristics [@cole2010; @kern2016; @stuart2015]. @pearl2014a refer to the transportability of empirical findings from one sample to a separate target population. They note that generalization requires one to know the distribution of the outcome conditional upon treatment and any confounders, as well as the joint distribution of the confounding variables in the target population. Put simply, to generalize beyond the experimental sample to a target population, the sample needs to look like (or be made to look like) the target population with respect to the distribution of confounding variables.
The situation for surveys is somewhat less complex than for causal analyses. Whereas experiments must be concerned with the comparability of treatment and control as well as sample and population, surveys need be concerned only about sample and population. It is understood that the composition of a sample will match that of the population when all units have an equal probability of selection, implying unconditional exchangeability. When probabilities of selection are unequal but known for every unit in the frame, the situation is equivalent to conditional exchangeability, and weighting observations by the inverse of the probability of selection yields unbiased population estimates [@horvitz1952]. In either case, random selection ensures that on average the sample will match the target population on the distribution of any variables measured on the survey.
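As a concrete illustration of the unequal-probability case, the short R sketch below (with a purely hypothetical population, outcome, and selection probabilities) over-samples one group and shows that weighting by the inverse of the known selection probability recovers the population proportion while the unweighted sample mean does not.

```r
# Illustrative sketch, not from the original article: inverse-probability
# weighting with known, unequal selection probabilities (Horvitz-Thompson style).
set.seed(1)

N <- 100000
pop <- data.frame(
  age_group = sample(c("young", "old"), N, replace = TRUE, prob = c(0.6, 0.4))
)
pop$y <- rbinom(N, 1, ifelse(pop$age_group == "old", 0.6, 0.4))  # hypothetical outcome

# Known selection probabilities: older people are deliberately over-sampled
pop$p_sel <- ifelse(pop$age_group == "old", 0.02, 0.005)
sampled   <- pop[runif(N) < pop$p_sel, ]

mean(pop$y)                                       # true population proportion
mean(sampled$y)                                   # unweighted estimate: biased upward
weighted.mean(sampled$y, w = 1 / sampled$p_sel)   # inverse-probability weighting recovers it
```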
## Extending the Framework to Non-Random Samples
For causal analyses and surveys, random treatment assignment and respondent selection provide a powerful mechanism for producing the conditions necessary for unbiased estimation of causal effects and population parameters. However, these conditions are guaranteed only when randomization is 100% successful. In practice, this is rarely the case. In experiments, subjects drop out of trials or are lost to follow-up. In surveys, the sampling frames may not perfectly cover the target population, and nonresponse means that some share of sampled units is never observed. When such problems occur, the usual response is to perform statistical adjustments to correct any imbalance. In experiments, methods such as matching or propensity score weighting can be used to adjust for imbalances between experimental treatment groups [see @imbens2015 Part VI]. In probability surveys, corrections involve nonresponse weighting adjustments for which a variety of techniques exist [see @kalton2003; @valliant2013].
When we perform these adjustments to randomized experiments or probability surveys, we are no longer relying solely on randomization to produce unbiased estimates. Rather, these adjusted estimates are conditional upon a model that assumes that positivity and exchangeability hold and that the adjustment reconstructs the correct sample composition for the confounding covariates. Even if we perform no adjustment, we are implicitly assuming a model where the correlation between missingness and the outcome of interest is zero, or unconditional exchangeability.
In the causal world, it is recognized that as long as exchangeability and positivity hold, it is possible to make unbiased inferences about causal effects from non-experimental data [@greenland1986; @greenland2009; @rosenbaum1983a; @rubin1974; @rubin1978]. Quasi-experimental designs such as regression discontinuity and instrumental variables models are techniques that can be used to identify causal effects from non-experimental data when the appropriate conditions are met [@angrist2009; @west2008]. Methods such as matching, marginal structural models and structural nested models have been developed to estimate causal effects from observational data and have been proven to produce unbiased estimates when their underlying assumptions are met [@cole2003; @robins1999; @robins1999c; @stuart2010a]. However, for all of these techniques, one can never be certain if the exchangeability and positivity conditions have been met. Therefore, the bar for accepting results from non-experimental data is much higher than for randomized experiments.
The same is true for surveys that do not use probability sampling. When units are not randomly selected from the target population, researchers must rely on statistical models to generalize back to the target population. Probability-based surveys with undercoverage or nonresponse must also specify a model that relates the observed units to the unobserved [@brick2013; @valliant2000]. For probability samples the initial design performs most of the work in ensuring exchangeability, positivity, and correct sample composition. Statistical models are employed during estimation to correct what are hopefully minor biases. In contrast, nonprobability samples cannot rely on randomization to help meet these requirements, and instead must rely on models at all stages of the survey process from sample selection to estimation. As in causal analyses, researchers can never know with certainty that these requirements have been met.
## Mechanics of Selection Bias in Surveys
In this section we focus specifically on the survey context and demonstrate through a simplified example the mechanics behind each of the components in this framework to show how they can introduce bias into survey estimates.
### Exchangeability
Suppose we have a sample intended for estimating what share of the population will vote for the Democratic and Republican candidates in an election, and that we have measured each respondent’s candidate preference and age. Let us also assume that some feature of the recruitment process over-represents older people but that there are no additional unmeasured confounders. Because older people tend to vote Republican more than young people, an estimate of the overall vote using this sample would be biased in favor of the Republican candidate. However, because inclusion depends only on age, estimates of the vote within the younger and older subgroups would still be correct. In this case, the sampled individuals are exchangeable with non-sampled individuals within the same age group. When sampled observations are conditionally exchangeable, subgroups are internally unbiased with respect to the outcome of interest, even if some groups are over- or under-represented relative to their share of the target population. Because there are no additional confounders, the overall proportion of the sample voting Democratic would be biased, but measures of the relationship between age and vote preference would be unbiased prior to any adjustment.
However, if inclusion in the sample depends on an unmeasured characteristic related to the survey outcome, the distribution of the outcome variable within the observed subgroups will no longer match that of the target population. In our voting example, suppose our sample also over-represents respondents who live in big cities but, unlike age, this has not been measured. Because urban dwellers tend to vote Democratic, the Democratic vote share among both young and old respondents will be too high. In this case, young and old respondents in our sample are not exchangeable with their non-sampled counterparts because they are more urban, making urbanicity an unmeasured confounder. The bias in favor of the Democrat due to an excess of city dwellers could actually offset some of the Republican bias produced by having too many older respondents. In this scenario, the estimated vote for the full sample could be close to the true population value while subgroup estimates would be biased. Note that the crucial aspect of exchangeability is not which cases are included in the sample but what characteristics have been measured. If we knew which cases were urban and which were rural, we could adjust by both age and urbanicity to recover the correct sample composition.
In practice, the biases need not cancel out. The unobserved variable could have opposite effects for young and old respondents, or there could be different unobserved variables affecting different subgroups. Because the confounding variables are unobserved, it is impossible to know from the data alone whether or not the exchangeability requirement is met.
The associations that produce bias need not be direct. If we took this same sample but measured something such as eye color, which is not directly related to either age or urbanicity, we might still obtain biased estimates if eye color is associated with race. Over-representing urban respondents likely also means over-representing racial groups that live in urban areas, which could in turn affect the distribution of observed eye colors. The reverse is also true. Variables that are not confounders themselves but are closely correlated with confounders may help reduce bias by serving as proxies during adjustment.
### Positivity
The positivity requirement states that even if we know and have measured all potential confounders, all of the subgroups defined by confounding variables must also be represented in the sample [@hernan2006]. Groups that are underrepresented but present can be weighted up. However, it is not possible to weight up groups that were not surveyed. Returning to our example where inclusion depends on age and urbanicity, suppose that there are no older, urban respondents included in the sample. Even if we were able to record both age and urbanicity, there is no adjustment we can perform that will make up for the absence of older, urban respondents, although subgroup estimates for those groups that were observed would remain unbiased. On the other hand, if older and younger city dwellers are the same with respect to their voting preference, the absence of older urbanites would not introduce bias because younger urbanites could stand in for them in the sample with no change to the estimate. When a group is entirely missing from the set of observed units, the researcher requires a theoretical justification for believing that the missing group is not systematically different from other, superficially similar groups that were surveyed.
### Composition
In our example, we have assumed that our sample composition does not match the target population on age and urbanicity. If it can be adjusted to match the distribution in the target population, our estimates of the vote will be unbiased. We have already alluded to the simplest approach, which is to weight each group to be proportional to its share of the target population.
Sample composition can be managed by design as well as through post-hoc adjustment. Random selection yields the correct sample composition in expectation, though individual samples will not match exactly. If the confounders are known in advance, purposive methods such as quota sampling, where we pre-determine the number of interviews required in each group, can be used to produce an exact sample match [@gittelman2015].
Managing sample composition through design or adjustment rather than random selection requires the researcher to be confident that all confounders are truly known and measured. When exchangeability or positivity does not hold, bias will not be eliminated and may even be magnified. In our example, if we adjust only for age but not urbanicity, we would eliminate the pro-Republican bias caused by an older sample but not the pro-Democratic bias due to an excess of urban respondents. The biases no longer offset each other and the adjusted estimate would be more biased toward the Democrats than it was before weighting.
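The following R simulation makes this concrete. All parameters are hypothetical and chosen only to mimic the structure of the running example: selection depends on both age and urbanicity, but when only age is used in weighting, the adjusted estimate ends up further from the truth than the unadjusted one, while adjusting on both confounders removes the bias.

```r
# Hypothetical simulation of the voting example: selection depends on age and
# urbanicity; adjusting on age alone can magnify rather than reduce the bias.
set.seed(2)

N <- 200000
pop <- data.frame(
  old   = rbinom(N, 1, 0.4),
  urban = rbinom(N, 1, 0.3)
)
# Democratic vote depends on both age and urbanicity (made-up coefficients)
pop$dem <- rbinom(N, 1, plogis(-0.2 - 0.8 * pop$old + 1.0 * pop$urban))

# Inclusion over-represents both older and urban respondents
p_incl <- plogis(-5 + 1.2 * pop$old + 1.2 * pop$urban)
samp   <- pop[runif(N) < p_incl, ]

# Weights that force the age distribution alone to match the population
w_age <- ifelse(samp$old == 1,
                mean(pop$old) / mean(samp$old),
                (1 - mean(pop$old)) / (1 - mean(samp$old)))

# Weights that match the joint age-by-urbanicity distribution
cell_pop  <- prop.table(table(pop$old,  pop$urban))
cell_samp <- prop.table(table(samp$old, samp$urban))
idx       <- cbind(samp$old + 1, samp$urban + 1)
w_both    <- cell_pop[idx] / cell_samp[idx]

c(true          = mean(pop$dem),
  unadjusted    = mean(samp$dem),        # offsetting biases leave this close to the truth
  age_only      = weighted.mean(samp$dem, w_age),   # adjustment makes things worse
  age_and_urban = weighted.mean(samp$dem, w_both))  # adjusting on both removes the bias
```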
## Current Practices for Managing Bias in Online, Nonprobability Surveys
\sectionmark{Current Practices for Managing Bias}
We can use this framework to consider current practices in fielding nonprobability web surveys and producing statistical estimates from the resulting samples. We distinguish between recruitment, whereby an individual becomes eligible for inclusion in one or more surveys (e.g., joining a panel) and sampling, the process by which an individual is selected for a particular survey after recruitment. After reviewing these two features of the data collection process, we discuss alternative approaches to post-survey adjustment and estimation.
### Recruitment {#recruitment}
The most common form of recruitment involves inviting individuals to join opt-in panels, which are lists maintained by sample providers of individuals who have agreed to participate in surveys on an ongoing basis. Individuals can become empanelled in a variety of ways, such as directly through a panel website, clicking on banner advertisements, or when corporations grant panel vendors access to members of their customer loyalty programs. Panels provide an opportunity to collect a large amount of profile information on their members that can be used in both sampling and adjustment. Maintaining respondent profiles across many dimensions can aid in providing exchangeability only if the correct variables are measured. On the other hand, some fear that panel conditioning and attrition may mean that panel members may become less reflective of their non-empanelled counterparts over time, potentially reducing exchangeability [@callegaro2008; @callegaro2015; @couper2000].
The main alternative to panels is river sampling, in which potential respondents are recruited via similar sources, but are directed to a one-off survey rather than asked to join a long-term panel [@callegaro2014]. River sampling avoids panel attrition and conditioning, but provides no profile data on respondents in advance. Respondent characteristics must be obtained at the time of the survey, limiting the number of characteristics that can be measured. Some online survey providers have begun using a mixture of panel and river respondents [e.g. @lorch2010a; @young2012].
Both panels and river sampling face an immediate threat to the positivity requirement because individuals who do not use the internet cannot participate. Studies conducted on the Pew Research Center’s American Trends Panel and the Dutch LISS panel, two probability-based panels that take steps to cover individuals without internet access, found that the exclusion of non-internet individuals produced only small differences in most survey estimates. However, for outcomes pertaining to technology use, differences in estimates could be large. The Pew Research study also found that indicators of socioeconomic status differed considerably for some subgroups such as the elderly or racial minorities [@eckman2016; @keeter2015].
Obtaining a diverse array of potential respondents is crucial to the success of any recruitment method. @pettit2015 demonstrated that respondents recruited via different websites can exhibit dramatically different demographic distributions. Respondents recruited from different sources likely vary on other characteristics as well; for instance, individuals recruited via a website dedicated to video games could differ from those recruited from websites devoted to personal finance with respect to variables such as interest in retirement planning or their use of leisure time. Recruiting from a diverse set of sources necessarily improves the probability of meeting the positivity requirement; however, it also increases the complexity of the recruitment process, potentially creating a trade-off between positivity and exchangeability. As the number of sources increases, it may become more difficult to know which characteristics distinguish between individuals recruited from different sources.
To date, the great majority of research into nonprobability surveys has relied on data from online panels. Many of these studies have compared different panels to one another and found that while some nonprobability surveys compare favorably to probability-based surveys, the same survey fielded on different panels can result in dramatically different results [@callegaro2014a; @craig2013; @erens2014; @kennedy2016; @schnorf2014; @yeager2011]. However, none of these studies were designed to evaluate alternative methods of panel recruitment or isolate the design features that produce such varying results.
Very little research has directly compared panels to river sampling. One such analysis found that after weighting for demographic characteristics, panel respondents were largely similar to river respondents, although panelists were more likely to be registered to vote and more likely to use Twitter. River respondents were closer to the chosen benchmark on both measures [@clark2015]. A study performed as part of the Foundations of Quality 2 (FOQ2) initiative compared the demographic composition of surveys using panels and river sampling. It found that on average, the river samples yielded demographic compositions similar to non-river samples, and required somewhat less extreme weighting when adjusted to match demographics not used in the sampling process [@bremer2013]. Unfortunately, there was no evaluation of differences in non-demographic estimates.
At present, there is not enough research to recommend one recruitment method over the other. The availability of profile data on panels offers flexibility and control for the purposes of sampling and adjustment, but the limited empirical research discussed previously does suggest some possible advantages to river samples. Other practices such as profiling, sampling or quota design may also be more important than the recruitment process.
### Sampling {#sampling}
Nonprobability surveys generally rely on purposive selection to achieve the desired sample composition while data collection is ongoing. This is commonly achieved through quotas, where the researcher pre-specifies a particular distribution across one or more variables. Usually these are cells defined by a cross-classification of demographic characteristics such as gender by age, with each cell requiring a specified number of completed interviews in that category. The end result is a sample that matches the pre-specified distribution across the chosen variables. The use of quotas relies on the assumption that individuals who comprise each quota cell are exchangeable with non-sampled individuals who share those characteristics. If that assumption is met, the sample will have the correct composition on the confounding variables, allowing for the estimation of means and proportions that generalize to the target population.
Most contemporary web surveys that employ quotas define the cells across no more than a handful of demographic variables. However, there is a growing consensus that basic demographic variables such as age, sex, race, and education are insufficient for achieving exchangeability. A recent study using the FOQ2 data compared three progressively more stringent sets of demographic quotas. Across a range of benchmarks, the application of more stringent quotas did nothing to reduce bias, and post-survey weighting actually increased the average bias for all but five out of seventeen sample providers. The study also evaluated three quota schemes that incorporated additional, non-demographic variables; however, their success was mixed. (The details of the methods employed were not specified to avoid identifying the sample providers [@gittelman2015].) This finding is consistent with research in causal inference suggesting that demographics alone are generally insufficient for eliminating bias in observational studies [@cook2008].
If traditional quota methods are insufficient for producing strong ignorability, sampling methods that allow researchers to control both more and different dimensions may improve the ability to condition on a more appropriate set of potential confounders. The best documented of these methods is implemented by YouGov on surveys conducted using its panel in the United States. YouGov first draws a random sample of cases from a high quality data source, such as the American Community Survey (ACS) Public Use Microdata Sample, that is believed to reflect the true joint distribution for a large number of variables in the target population. This subsample is referred to as a synthetic sampling frame (SSF) and serves as a template for the eventual survey sample. Each panelist who completes the survey is matched to a case in the SSF with similar characteristics using a distance measure such as Euclidean distance. When every record in the SSF has been matched with a suitably similar respondent, the survey is complete [@rivers2007].
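A minimal sketch of this matching step is shown below, assuming a greedy nearest-neighbor rule on standardized numeric covariates; the function and variable names are illustrative, and the vendor's actual implementation differs in many respects.

```r
# Minimal sketch of matching respondents to a synthetic sampling frame (SSF).
# The greedy nearest-neighbour rule and the assumption of numeric matching
# variables are illustrative, not the vendor's documented procedure.
match_to_ssf <- function(ssf, respondents, vars) {
  # Standardize the (numeric) matching variables on the SSF's scale
  ref    <- lapply(ssf[vars], function(x) c(m = mean(x), s = sd(x)))
  std    <- function(df) mapply(function(x, r) (x - r["m"]) / r["s"], df[vars], ref)
  ssf_z  <- std(ssf)
  resp_z <- std(respondents)

  # Assumes at least as many respondents as SSF template cases
  matched   <- integer(nrow(ssf))
  available <- rep(TRUE, nrow(respondents))
  for (i in seq_len(nrow(ssf))) {
    d <- colSums((t(resp_z) - ssf_z[i, ])^2)  # squared Euclidean distance to SSF case i
    d[!available] <- Inf                      # each respondent can be matched only once
    matched[i] <- which.min(d)
    available[matched[i]] <- FALSE
  }
  respondents[matched, ]  # one matched respondent per SSF template case
}
```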
Because a limited number of covariates are available on any single survey such as the ACS, it is possible to impute additional variables onto the SSF using models built with other data sources. This was the approach taken on the 2008 Cooperative Congressional Election Study which augmented an SSF drawn from the ACS with estimates of voter registration and turnout from the Current Population Survey Voting and Registration Supplement, and of internet use, religion and interest in politics from Pew Research Center surveys. The resulting survey sample produced estimates of the presidential vote that closely matched national exit polls and the American National Election Studies [@ansolabehere2013].
This approach is appealing in its capacity to flexibly match the target population on a larger number of covariates than is possible with traditional quota methods. For this approach to succeed, the composition of matching variables in the SSF must accurately match the target population, and any models used to combine datasets must be correctly specified. More importantly, the matching variables must be the correct variables for ensuring conditional exchangeability, and the panel must be able to supply respondents that are close matches to each case in the SSF. If there are remaining confounders that are not accounted for, resulting survey estimates will be biased. One side-benefit of this approach is that problems with positivity should be immediately apparent if there are portions of the SSF for which no clear matching respondents can be found.
Another approach to sampling on a higher number of dimensions is the use of propensity score matching to construct quota cells. Under this approach, a probability survey that is assumed to accurately reflect the target population is fielded in parallel with a nonprobability survey. Using a set of common covariates collected on each survey, a propensity model is estimated by combining the two samples and predicting the probability that each respondent belongs to the probability survey. When subsequent online surveys are fielded, the propensity model is used to calculate a propensity score for each respondent as they are screened for the new survey. Quotas are set not on particular respondent characteristics but are based on quintiles of the propensity score distribution [@terhanian2012].
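The R sketch below illustrates the general idea, under the assumption of a simple logistic propensity model and quintile cut points taken from the reference sample; the data frames, covariates, and helper function are placeholders for illustration rather than the published procedure.

```r
# Hedged sketch of propensity-score quota cells. The reference and opt-in data
# frames, the covariate list, and the plain logistic model are assumptions made
# for illustration; the published procedure may differ in its details.
build_propensity_quotas <- function(reference, optin, covariates) {
  combined <- rbind(
    cbind(reference[covariates], in_reference = 1),
    cbind(optin[covariates],     in_reference = 0)
  )
  fit <- glm(in_reference ~ ., data = combined, family = binomial)

  # Quintile cut points taken from the reference sample's score distribution
  ref_scores <- predict(fit, newdata = reference, type = "response")
  breaks <- quantile(ref_scores, probs = seq(0, 1, by = 0.2))
  breaks[c(1, 6)] <- c(0, 1)   # ensure any possible score falls in some cell

  list(
    model = fit,
    # Assign an incoming respondent to one of five quota cells
    assign_cell = function(new_respondent) {
      p <- predict(fit, newdata = new_respondent, type = "response")
      cut(p, breaks = breaks, labels = 1:5, include.lowest = TRUE)
    }
  )
}

# Hypothetical usage, assuming data frames with a shared set of covariates:
# quotas <- build_propensity_quotas(reference_survey, optin_screening,
#                                   c("age", "educ", "urban"))
# quotas$assign_cell(incoming_respondent)
```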
As with the SSF used in sample matching, much hinges on how well the parallel reference survey matches the target population. If the reference survey suffers from its own nonresponse or coverage bias, those biases will be transferred into the nonprobability survey. On the other hand, the researcher could tailor the contents of the baseline surveys to include any variables believed to be necessary to ensure conditional exchangeability. Under other approaches, researchers are limited to covariates that are available from preexisting data sources. This method performed well in a simulation; however, the data used to construct the propensity model were the same data used to generate the simulated survey. The evaluation also generated only a single simulated dataset [@terhanian2012]. As such, it is difficult to know how this technique performs on new samples and over repeated applications. Dividing the propensity score into quintiles will result in a loss of information contained in the full distribution of propensity scores, though it is also possible that quintiles provide a sufficient foundation of balance and positivity that can be further refined through post-survey adjustment. Additional research comparing this approach with the matching approach described above would be valuable, particularly if the same survey and set of covariates can be used.
Another, less understood component of the sampling process for many nonprobability surveys is the use of routers. Most nonprobability survey vendors have many surveys fielding simultaneously. When a router is employed, rather than draw separate samples for each survey, respondents are invited to participate in an unspecified survey. The actual survey taken is determined dynamically based on the characteristics of the respondent and the needs of active surveys with respect to quotas or screening criteria. This makes for a more efficient use of sample, but means the sample for any one survey depends on what other surveys are in the field simultaneously. If there are ample respondents and few competing surveys, routers may pose little threat of bias. On the other hand, the presence of surveys focused on rare groups may mean that individuals belonging to those groups are not routed to other surveys. In such an event, the routing process becomes a confounder that would be difficult to observe and account for.
The only empirical study evaluating routers compared the effects of three different routing methods against a non-routed control and found that all four conditions produced similar estimates. In a set of simulations, the authors did find that routing could produce bias for questions that are highly correlated with the selection criteria for other surveys in the field. This study evaluated routing under a narrow set of conditions that the authors recognize may not generalize to many circumstances observed in practice [@brigham2014]. Additional experiments and simulations testing alternative algorithms and scenarios, or observational studies comparing router performance over time for different vendors would be of substantial benefit.
### Post-survey Adjustment
Because it may not be feasible to achieve the desired sample composition through sampling alone, post-survey adjustment is still needed. Most of the research on adjusting nonprobability samples has focused on adapting the methods used to perform nonresponse adjustment with probability samples. Calibration and propensity score weighting are the two most common approaches to weighting.
Calibration methods directly adjust the composition of the sample to match a known distribution of variables in the target population. The simplest form of calibration is post-stratification, in which the sample is divided into mutually exclusive cells that are weighted up or down such that the proportion of each cell in the sample matches the corresponding proportion in the target population. Whereas post-stratification requires knowledge of the joint distribution of the stratification variables in the target population, other calibration methods such as raking and generalized regression estimation require only knowledge of the marginal distribution of any adjustment variables [@deville1992; @kalton2003]. Calibration methods generally require that the outcome be a linear function of the calibration variables, and may not perform well in the presence of nonlinear relationships between the outcome and adjustment variables or unmodeled interactions [@valliant2000].
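As an illustration of raking to known marginal distributions, the sketch below uses the R `survey` package with a small simulated sample and made-up population margins; the variable names and totals are placeholders, not recommended control totals.

```r
# Raking sketch with the R 'survey' package; the sample, the adjustment
# variables, and the population margins below are all made up for illustration.
library(survey)
set.seed(3)

samp <- data.frame(
  age_group = factor(sample(c("18-34", "35-64", "65+"), 500, replace = TRUE,
                            prob = c(0.5, 0.35, 0.15))),
  educ      = factor(sample(c("HS or less", "Some college", "BA+"), 500, replace = TRUE)),
  y         = rbinom(500, 1, 0.45)
)
samp$base_wt <- 1   # equal starting weights

design <- svydesign(ids = ~1, weights = ~base_wt, data = samp)

# Known marginal population totals for each adjustment variable (hypothetical)
age_margin  <- data.frame(age_group = c("18-34", "35-64", "65+"),
                          Freq = c(30000, 50000, 20000))
educ_margin <- data.frame(educ = c("HS or less", "Some college", "BA+"),
                          Freq = c(40000, 30000, 30000))

raked <- rake(design,
              sample.margins     = list(~age_group, ~educ),
              population.margins = list(age_margin, educ_margin))

svymean(~y, raked)   # estimate of the outcome after calibration to the margins
```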
Propensity score weighting involves combining a nonprobability sample with a parallel probability or gold-standard data source as a reference sample. A model predicting sample membership is fit to these combined data, and observations in the nonprobability sample are weighted by the inverse of their probability of appearing in the nonprobability sample [@lee2006; @taylor2000; @terhanian2000; @valliant2011]. @valliant2011 demonstrated that for propensity score adjustment to be effective, the propensity model must incorporate any nonresponse adjustment and bias correction that has been applied to the reference sample. Otherwise, those biases will be transferred to the nonprobability sample.
Given the same set of covariates, generalized regression estimation (GREG) has been found to perform comparably to propensity score weighting, suggesting that a parallel reference survey may be unnecessary when the requisite population totals are available [@valliant2011]. Propensity score weighting can more easily accommodate nonlinear associations and interactions between confounding variables. If there are a large number of confounders or it is unknown which of the observed covariates are confounders, machine learning methods such as boosting or random forests can fit high dimensional propensity models if a suitable reference sample with common covariates is available [@buskirk2015; @lee2010].
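A simple version of this workflow is sketched below with simulated reference and opt-in samples and a logistic propensity model; following the description above, opt-in cases are weighted by the inverse of their estimated probability of membership in the nonprobability sample, although some implementations use the odds instead.

```r
# Sketch of propensity-score weighting against a reference sample; the two
# simulated data frames and the logistic model are illustrative assumptions.
set.seed(4)
covs <- c("age", "educ_years", "urban")

reference <- data.frame(age = rnorm(1000, 45, 15), educ_years = rnorm(1000, 13, 2.5),
                        urban = rbinom(1000, 1, 0.30))
optin     <- data.frame(age = rnorm(1500, 38, 14), educ_years = rnorm(1500, 14, 2.5),
                        urban = rbinom(1500, 1, 0.45))

combined <- rbind(cbind(reference[covs], nonprob = 0),
                  cbind(optin[covs],     nonprob = 1))
fit <- glm(nonprob ~ age + educ_years + urban, data = combined, family = binomial)

# Weight opt-in cases by the inverse of their estimated probability of belonging
# to the nonprobability sample (some implementations use the odds (1 - p) / p)
p_nonprob <- predict(fit, newdata = optin, type = "response")
optin$psw <- 1 / p_nonprob
optin$psw <- optin$psw / mean(optin$psw)   # rescale so the weights average to 1

summary(optin$psw)   # inspect the weight distribution for extreme values
```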
Some have explored matching as an alternative to weighting for post-survey adjustment of nonprobability surveys. Traditionally, matching is used in causal inference to adjust for differences in composition between treatment groups (see @stuart2010 for a review). With matching, the idea is to create groups containing one or more observations from both a reference sample and a nonprobability sample that are similar on a set of auxiliary variables believed to be associated with selection. Groups in the nonprobability sample are then weighted so that their distribution matches the distribution in the reference sample. For example, a reference sample might be divided into cells based on a set of covariates or a propensity score, while cases in the nonprobability sample falling in the matching cells would be weighted so that the proportion in each cell matches the proportion in the reference sample. In this sense, matching is very similar to propensity score weighting or poststratification, with one important exception. In many applications, observations for which there is no acceptable match are removed from the final dataset. When this happens, information is lost, and inference is only possible for those portions of the samples that overlap. On the other hand, identifying a lack of overlap forces researchers to evaluate the validity of the positivity assumption in ways that other methods may not. Unlike standard weighting methods that will generally produce a weight for every observation (even if some are quite large), matching software often automatically identifies those observations in a reference sample for which no counterparts exist in the nonprobability sample (e.g. the MatchIt package for the R statistical software platform [@ho2011]). @dutwin2017 found that raking to basic demographics was more effective at reducing bias than matching on a more extensive set of demographics; however, a two-stage process of matching followed by raking reduced bias more than raking alone.
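The sketch below shows one way such a matching step might be set up with the MatchIt package, using simulated data and a caliper so that reference cases without an acceptable opt-in counterpart are left unmatched; the data, covariates, and settings are illustrative, not a recommended configuration.

```r
# One possible setup for matching a nonprobability sample to a reference sample
# with the MatchIt package; the data, covariates, and caliper are illustrative.
library(MatchIt)
set.seed(5)

reference <- data.frame(in_ref = 1, age = rnorm(800, 46, 16),
                        educ_years = rnorm(800, 13, 2.5), urban = rbinom(800, 1, 0.30))
optin     <- data.frame(in_ref = 0, age = rnorm(1200, 37, 13),
                        educ_years = rnorm(1200, 14, 2.5), urban = rbinom(1200, 1, 0.45))
combined  <- rbind(reference, optin)

# Match each reference case to the nearest opt-in respondent on the estimated
# propensity score; reference cases with no match inside the caliper are flagged
m <- matchit(in_ref ~ age + educ_years + urban, data = combined,
             method = "nearest", caliper = 0.2)

summary(m)               # balance diagnostics and counts of unmatched cases
matched <- match.data(m) # matched data for subsequent weighting or estimation
```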
A final approach to post-survey estimation is multilevel regression and poststratification (MRP). In traditional poststratification, a sample is divided into mutually exclusive cells, each of which is weighted to be proportional to its representation in the target population. As the number of cells becomes large, the number of observations in each cell becomes small and estimates become unstable. MRP enables poststratification using a large number of cells by fitting a multilevel model that pools information about cells sharing similar characteristics and allows for the estimation of cell means even when cells are sparse. A weighted mean is then constructed using the estimated cell means [@ghitza2013; @lax2009; @park2004].
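A compact MRP sketch in R is shown below, using simulated data, three illustrative grouping variables, and a made-up poststratification frame; real applications typically cross many more variables and often use fully Bayesian estimation.

```r
# Compact MRP sketch with the lme4 package. The sample, the grouping variables,
# and the poststratification counts are simulated placeholders.
library(lme4)
set.seed(6)

age_groups  <- c("18-34", "35-64", "65+")
educ_levels <- c("HS or less", "Some college", "BA+")
regions     <- paste0("region_", 1:8)

samp <- data.frame(
  age_group = sample(age_groups, 2000, replace = TRUE, prob = c(0.55, 0.35, 0.10)),
  educ      = sample(educ_levels, 2000, replace = TRUE),
  region    = sample(regions, 2000, replace = TRUE)
)
samp$y <- rbinom(2000, 1, plogis(-0.4 + 0.6 * (samp$age_group == "65+") +
                                   0.3 * (samp$region %in% regions[1:4])))

# Poststratification frame: every cell with its (made-up) population count
psframe   <- expand.grid(age_group = age_groups, educ = educ_levels, region = regions)
psframe$N <- sample(2000:12000, nrow(psframe), replace = TRUE)

# Multilevel model pools information across cells, allowing predictions even
# for cells that are sparse or empty in the sample
fit <- glmer(y ~ (1 | age_group) + (1 | educ) + (1 | region),
             data = samp, family = binomial)
psframe$cell_pred <- predict(fit, newdata = psframe, type = "response",
                             allow.new.levels = TRUE)

# Population estimate: predicted cell means weighted by population cell counts
mrp_estimate <- sum(psframe$N * psframe$cell_pred) / sum(psframe$N)
mrp_estimate
```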
MRP performed well when used to predict 2012 presidential election results using a survey conducted via the Microsoft Xbox platform whose sample composition differed radically from the population of voters and for which unadjusted estimates were wildly inaccurate [@wang2014a]. Unweighted, the sample was 93% male, only 1% was 65 years old or older, and it showed Barack Obama losing badly to Mitt Romney. On the surface, it seems unlikely that such a survey could produce accurate estimates. However, the Xbox study enjoyed two benefits not available to many other studies. The first is a very large sample size (345,858 unique respondents), which means that even groups that are dramatically underrepresented in the sample in relative terms still have enough observations in absolute terms to avoid problems with positivity. The 1% of the sample aged 65 or older still yields roughly 3,400 observations – more than enough cases to produce stable estimates for that subgroup. The second is that the authors had a very powerful set of covariates, including party identification and ideology, making it much more likely that the exchangeability requirement was satisfied for the purpose of predicting partisan voting behavior.
Another study using only demographic covariates met with less success. It compared MRP-based estimates of presidential approval and country direction to estimates from the Pew Research Center’s probability-based telephone surveys over the same time period. For the share of the population that thinks the country is on the right track, the MRP estimates were no different from the estimates obtained using a simple post-stratification adjustment, and lower than the telephone-based estimates. On the other hand, presidential approval changed dramatically, moving from an underestimate to an overestimate relative to the comparison telephone survey [@petrin2015]. Although the telephone survey benchmarks are themselves estimates and have their own biases, if the goal of adjustment was to match that particular benchmark, neither MRP nor traditional post-stratification was successful.
Each of these approaches to estimation comes with advantages and disadvantages. When control totals are available for the confounders and their relationship with the survey outcome is linear, calibration methods are quite powerful and easy to apply. Propensity score methods provide a great deal of flexibility at the cost of requiring an auxiliary dataset with a shared set of covariates. It is less clear if matching offers substantial benefits over propensity score weighting or calibration. For approaches that produce weights, there is some indication that methods applied in combination may offer an improvement over the use of a single method [@brick2015; @dutwin2017; @lee2009; @mercer2018]. MRP may be most efficient at extracting information from smaller datasets, but at the cost of computational complexity and the fact that a separate model is required for each outcome variable. Additional research directly comparing adjustment methods to one another would be valuable in helping researchers choose the most appropriate tool.
All of these methods will fail if the exchangeability and positivity requirements are not met, or if the model specification does not correctly replicate the target composition on the confounding variables. If exchangeability and positivity are met, the best method is the one that can most closely mirror the correct sample composition using the available data and information. If exchangeability and positivity are not met, there is no a priori reason to believe that any of these methods will perform better than any other.
### Variable Selection
Given the centrality of exchangeability and positivity in achieving unbiased estimates from nonprobability surveys, what variables should practitioners measure and utilize in sampling and adjustment? A number of researchers have attempted to find sets of variables that can reliably serve to achieve at least partial exchangeability for a broad range of survey topics. These include so-called “webographics,” early adopter characteristics and other behavioral and attitudinal factors intended to differentiate between survey participants and the broader population [@disogra2011; @fahimi2015; @schonlau2004; @schonlau2007]. While such general-purpose variables may fill a need, their effect will be limited unless they are correlated with the outcome to be measured.
Researchers will be best served if they can identify a likely set of theoretically grounded confounders prior to data collection, and use these as the starting point for a research design. For example, in studies of U.S. politics, many outcome variables of interest will be related to respondents’ underlying political engagement and partisanship. These may be effective confounders to use in sampling and adjustment. In the absence of strong theory regarding the survey topic, achieving exchangeability will prove extremely challenging. Researchers must also be confident that the variables they have identified can account for any indirect confounding resulting from idiosyncrasies associated with recruitment or sampling. Although some vendors consider sampling practices proprietary, vendors must be fully transparent about any variables used in the selection process to ensure that researchers are aware of any potential for confounding.
## Discussion
Whereas the emphasis in probability-based surveys has traditionally been to develop processes that minimize confounding, the emphasis suggested here is to first identify likely confounders and design the data collection and analysis so that they are measured and actively accounted for. To be clear, this is more a shift in emphasis than a full-scale departure. Probability-based surveys generally seek to measure and account for specific characteristics that are associated with bias, and we have discussed how data collection practices may introduce or mitigate confounding in nonprobability surveys.
Grounding this framework in causal inference suggests that there may be other techniques from that field that can be applied in a survey context. Testing the sensitivity of findings to unmeasured confounding is another common practice in causal inference whose adoption would likely benefit the survey field [@rosenbaum2005c]. Unlike probability surveys where the maximum range of bias is bounded by the size of the nonresponding sample, selection bias is unbounded and non-identifiable in nonprobability surveys. Although some methods such as pattern mixture models have been developed to evaluate selection bias under such constraints, they are not widely used in practice [@andridge2011]. Other techniques that do not rely on assumptions about the probability of selection may also prove useful for nonprobability surveys [e.g. @manski2007; @robins1999b]. Additionally, the use of causal diagrams and other methods of identifying confounders represent another worthwhile area for future research [e.g. @myers2011; @pearl2009a; @steiner2010].
Finally, it is one thing to know in principle that exchangeability, positivity and composition must be achieved in order to avoid selection bias in nonprobability survey estimates. It is another thing to achieve them successfully in practice. Even when the subject matter is well known and many likely confounders are identified, it may prove difficult to have complete confidence that there is not some yet unknown factor quietly introducing bias into survey estimates. Nevertheless, by making explicit a set of assumptions that to date have been largely implicit, the notions of exchangeability, positivity and composition provide a framework by which to evaluate and critique specific research findings and improve methodological practice.