# Applications: Foundations {#foundations-applications}
```{r, include = FALSE}
source("_common.R")
```
## Recap: Foundations {#foundations-sec-summary}
In the Foundations of inference chapters, we have provided three different methods for statistical inference.
We will continue to build on all three of the methods throughout the text, and by the end, you should have an understanding of the similarities and differences between them.
Meanwhile, it is important to note that the methods are designed to mimic the variability in data, and we know that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure \@ref(fig:randsampValloc)).
In Table \@ref(tab:foundations-summary), we have summarized some of the ways the inferential procedures feature specific sources of variability.
We hope that you refer back to the table often as you dive more deeply into inferential ideas in future chapters.
```{r foundations-summary}
inference_method_summary_table %>%
filter(question != "What are the technical conditions?") %>%
kbl(linesep = "", booktabs = TRUE,
caption = "Summary and comparison of randomization, bootstrapping, and mathematical models as inferential statistical methods.",
col.names = c("Question", "Randomization", "Bootstrapping", "Mathematical models")) %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position"),
full_width = TRUE) %>%
column_spec(1, width = "15em") %>%
add_header_above(c("", "Answer" = 3))
```
You might have noticed that the word *distribution* is used throughout this part (and will continue to be used in future chapters).
A distribution always describes variability, but sometimes it is worth reflecting on *what* is varying.
Typically the distribution either describes how the observations vary or how a statistic varies.
But even when describing how a statistic varies, there is a further consideration with respect to the study design, e.g., does the statistic vary from random sample to random sample or does it vary from random allocation to random allocation?
The methods presented in this text (and used in science generally) are typically used interchangeably across ideas of random samples or random allocations of the treatment.
Often, the two different analysis methods will give equivalent conclusions.
The most important thing to consider is how to contextualize the conclusion in terms of the problem.
See Figure \@ref(fig:randsampValloc) to confirm that your conclusions are appropriate.
Below, we synthesize the different types of distributions discussed throughout the text.
Reading through the different definitions and solidifying your understanding will help as you come across these distributions in future chapters, and you can always return here to refresh your understanding of the differences between them.
::: {.important data-latex=""}
**Distributions.**
- A **data distribution** describes the shape, center, and variability of the **observed data**.
This can also be referred to as the **sample distribution** but we'll avoid that phrase as it sounds too much like sampling distribution, which is different.
- A **population distribution** describes the shape, center, and variability of the entire **population of data**.
Except in very rare circumstances of very small, very well-defined populations, this is never observed.
- A **sampling distribution** describes the shape, center, and variability of all possible values of a **sample statistic** from samples of a given sample size from a given population.
Since the population is never observed, it's never possible to observe the true sampling distribution either.
However, when certain conditions hold, the Central Limit Theorem tells us what the sampling distribution is.
- A **randomization distribution** describes the shape, center, and variability of all possible values of a **sample statistic** from random allocations of the treatment variable.
We computationally generate the randomization distribution, though usually, it's not feasible to generate the full distribution of all possible values of the sample statistic, so we instead generate a large number of them.
Almost always, by randomly allocating the treatment variable, the randomization distribution describes the null hypothesis, i.e., it is centered at the null hypothesized value of the parameter.
- A **bootstrap distribution** describes the shape, center, and variability of all possible values of a **sample statistic** from resamples of the observed data.
We computationally generate the bootstrap distribution, though usually, it's not feasible to generate all possible resamples of the observed data, so we instead generate a large number of them.
Since bootstrap distributions are generated by randomly resampling from the observed data, they are centered at the sample statistic.
Bootstrap distributions are most often used for estimation, i.e., we base confidence intervals off of them.
:::
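To make the last two definitions concrete, here is a minimal sketch of how each distribution might be generated with the **infer** package. The data frame `dat`, its response `y`, its treatment variable `group`, and the group levels `"treatment"` and `"control"` are hypothetical placeholders, not objects defined in this book.
```{r distribution-generation-sketch, eval = FALSE}
library(infer)

# Bootstrap distribution: resample the observed data with replacement.
# Centered at the sample statistic; used to build confidence intervals.
boot_dist <- dat %>%
  specify(response = y) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Randomization distribution: re-allocate the treatment labels.
# Centered at the null value; used to evaluate hypotheses.
rand_dist <- dat %>%
  specify(response = y, explanatory = group) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("treatment", "control"))
```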
\clearpage
## Case study: Malaria vaccine {#case-study-malaria-vaccine}
In this case study, we consider a new malaria vaccine called PfSPZ.
In the malaria study, volunteer patients were randomized into one of two experiment groups: 14 patients received an experimental vaccine and 6 patients received a placebo vaccine.
Nineteen weeks later, all 20 patients were exposed to a drug-sensitive strain of the malaria parasite; the motivation for using a drug-sensitive strain here was ethical, allowing any infections to be treated effectively.
::: {.data data-latex=""}
The [`malaria`](http://openintrostat.github.io/openintro/reference/malaria.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
The results are summarized in Table \@ref(tab:malaria-vaccine-20-ex-summary), where 9 of the 14 treatment patients remained free of signs of infection while all 6 patients in the control group showed some baseline signs of infection.
```{r malaria-vaccine-20-ex-summary}
malaria %>%
count(treatment, outcome, .drop = FALSE) %>%
pivot_wider(names_from = outcome, values_from = n) %>%
adorn_totals(where = c("row", "col")) %>%
kbl(linesep = "", booktabs = TRUE, caption = "Summary results for the malaria vaccine experiment.") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position"),
full_width = FALSE)
```
::: {.guidedpractice data-latex=""}
Is this an observational study or an experiment?
What implications does the study type have on what can be inferred from the results?[^foundations-applications-1]
:::
[^foundations-applications-1]: The study is an experiment, as patients were randomly assigned to an experiment group.
Since this is an experiment, the results can be used to evaluate a causal relationship between the malaria vaccine and whether patients showed signs of an infection.
\vspace{-5mm}
### Variability within data
In this study, a smaller proportion of patients who received the vaccine showed signs of an infection (35.7% versus 100%).
However, the sample is very small, and it is unclear whether the difference provides *convincing evidence* that the vaccine is effective.
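These rates can be verified directly from the data; a quick sketch, assuming the **tidyverse** is loaded as in the rest of this book:
```{r observed-infection-rates, eval = FALSE}
# Infection rate in each group: 5/14 = 0.357 (vaccine), 6/6 = 1 (placebo)
malaria %>%
  group_by(treatment) %>%
  summarize(infection_rate = mean(outcome == "infection"))
```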
::: {.workedexample data-latex=""}
Statisticians and data scientists are sometimes called upon to evaluate the strength of evidence.
When looking at the rates of infection for patients in the two groups in this study, what comes to mind as we try to determine whether the data show convincing evidence of a real difference?
------------------------------------------------------------------------
The observed infection rates (35.7% for the treatment group versus 100% for the control group) suggest the vaccine may be effective.
However, we cannot be sure if the observed difference represents the vaccine's efficacy or if there is no treatment effect and the observed difference is just from random chance.
Generally, there is a little fluctuation in sample data, and we wouldn't expect the sample proportions to be *exactly* equal, even if the truth were that the infection rates were independent of getting the vaccine.
Additionally, with such small samples, perhaps it's common to observe such large differences due to chance alone when we randomly split a group!
:::
This example is a reminder that the observed outcomes in the data sample may not perfectly reflect the true relationships between variables since there is **random noise**.
While the observed difference in rates of infection is large, the sample size for the study is small, making it unclear if this observed difference represents efficacy of the vaccine or whether it is simply due to chance.
We label these two competing claims, $H_0$ and $H_A$:
- $H_0$: **Independence model.** The variables are independent.
They have no relationship, and the observed difference between the proportion of patients who developed an infection in the two groups, 64.3%, was due to chance.
- $H_A$: **Alternative model.** The variables are *not* independent.
The difference in infection rates of 64.3% was not due to chance.
Here (because an experiment was done), if the difference in infection rate is not due to chance, it was the vaccine that affected the rate of infection.
What would it mean if the independence model, which says the vaccine had no influence on the rate of infection, is true?
It would mean 11 patients were going to develop an infection *no matter which group they were randomized into*, and 9 patients would not develop an infection *no matter which group they were randomized into*.
That is, if the vaccine did not affect the rate of infection, the difference in the infection rates was due to chance alone in how the patients were randomized.
Now consider the alternative model: infection rates were influenced by whether a patient received the vaccine or not.
If this was true, and especially if this influence was substantial, we would expect to see some difference in the infection rates of patients in the groups.
We choose between these two competing claims by assessing if the data conflict so much with $H_0$ that the independence model cannot be deemed reasonable.
If this is the case, and the data support $H_A,$ then we will reject the notion of independence and conclude the vaccine was effective.
### Simulating the study
We're going to implement **simulation** under the setting where we will pretend we know that the malaria vaccine being tested does *not* work.
Ultimately, we want to understand if the large difference we observed in the data is common in these simulations that represent independence.
If it is common, then maybe the difference we observed was purely due to chance.
If it is very uncommon, then the possibility that the vaccine was helpful seems more plausible.
Table \@ref(tab:malaria-vaccine-20-ex-summary) shows that 11 patients developed infections and 9 did not.
For our simulation, we will suppose the infections were independent of the vaccine and we were able to *rewind* back to when the researchers randomized the patients in the study.
If we happened to randomize the patients differently, we may get a different result in this hypothetical world where the vaccine does not influence the infection rate.
Let's complete another **randomization** using a simulation.
In this **simulation**, we take 20 notecards to represent the 20 patients, where we write down "infection" on 11 cards and "no infection" on 9 cards.
In this hypothetical world, we believe each patient that got an infection was going to get it regardless of which group they were in, so let's see what happens if we randomly assign the patients to the treatment and control groups again.
We thoroughly shuffle the notecards and deal 14 into one pile and 6 into another, representing the simulated treatment and control groups.
Finally, we tabulate the results, which are shown in Table \@ref(tab:malaria-vaccine-20-ex-summary-rand-1).
```{r malaria-vaccine-20-ex-summary-rand-1}
# One hypothetical re-randomization of the 20 patients (matching OS4)
malaria_rand <- tibble(
  treatment = c(rep("vaccine", 7), rep("placebo", 4),
                rep("vaccine", 7), rep("placebo", 2)),
  outcome = c(rep("infection", 11),
              rep("no infection", 9))
)
malaria_rand %>%
count(treatment, outcome, .drop = FALSE) %>%
pivot_wider(names_from = outcome, values_from = n) %>%
adorn_totals(where = c("row", "col")) %>%
kbl(linesep = "", booktabs = TRUE, caption = "Simulation results, where any difference in infection rates is purely due to chance.") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position"),
full_width = FALSE)
```
::: {.guidedpractice data-latex=""}
How does this compare to the observed 64.3% difference in the actual data?[^foundations-applications-2]
:::
[^foundations-applications-2]: $4 / 6 - 7 / 14 = 0.167$ or about 16.7% in favor of the vaccine.
This difference due to chance is much smaller than the difference observed in the actual groups.
### Independence between treatment and outcome
We computed one possible difference under the independence model in the previous Guided Practice, which represents one difference due to chance, assuming there is no vaccine effect.
While in this first simulation, we physically dealt out notecards to represent the patients, it is more efficient to perform the simulation using a computer.
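A single computer repetition can be as simple as shuffling the treatment labels and recomputing the group infection rates; a minimal sketch:
```{r one-shuffle-sketch, eval = FALSE}
# One simulated re-randomization: shuffle which patients are labeled
# vaccine/placebo, then recompute the infection rate in each group
malaria %>%
  mutate(treatment = sample(treatment)) %>%
  group_by(treatment) %>%
  summarize(infection_rate = mean(outcome == "infection"))
```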
Repeating the simulation on a computer, we get another difference due to chance: $$ \frac{2}{6} - \frac{9}{14} = -0.310 $$
And another: $$ \frac{3}{6} - \frac{8}{14} = -0.071$$
And so on until we repeat the simulation enough times to create a *distribution of differences that could have occurred if the null hypothesis was true*.
Figure \@ref(fig:malaria-rand-dot-plot) shows a stacked plot of the differences found from 100 simulations, where each dot represents a simulated difference between the infection rates (control rate minus treatment rate).
```{r malaria-rand-dot-plot, fig.cap = "(ref:malaria-rand-dot-plot-cap)"}
set.seed(19)
malaria %>%
specify(response = outcome, explanatory = treatment, success = "infection") %>%
hypothesize(null = "independence") %>%
generate(reps = 100, type = "permute") %>%
calculate(stat = "diff in props", order = c("placebo", "vaccine")) %>%
# simplify by rounding
mutate(stat = round(stat, 3)) %>%
ggplot(aes(x = stat)) +
geom_dotplot(binwidth = 0.1, dotsize = 0.2) +
labs(y = NULL, x = "Difference in infection rates") +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
gghighlight(stat >= 0.643)
```
(ref:malaria-rand-dot-plot-cap) A stacked dot plot of differences from 100 simulations produced under the independence model, $H_0,$ where in these simulations infections are unaffected by the vaccine. Two of the 100 simulations had a difference of at least 64.3%, the difference observed in the study.
Note that the distribution of these simulated differences is centered around 0.
We simulated these differences assuming that the independence model was true, and under this condition, we expect the difference to be near zero with some random fluctuation, where *near* is pretty generous in this case since the sample sizes are so small in this study.
::: {.workedexample data-latex=""}
How often would you observe a difference of at least 64.3% (0.643) according to Figure \@ref(fig:malaria-rand-dot-plot)?
Often, sometimes, rarely, or never?
------------------------------------------------------------------------
It appears that a difference of at least 64.3% due to chance alone would only happen about 2% of the time according to Figure \@ref(fig:malaria-rand-dot-plot).
Such a low probability indicates a rare event.
:::
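Rather than eyeballing the figure, this proportion can be computed from the simulated differences; a sketch using **infer**'s `get_p_value()`, regenerating the 100 permutations shown in Figure \@ref(fig:malaria-rand-dot-plot):
```{r rare-event-proportion, eval = FALSE}
# Proportion of simulated differences at least as large as 0.643
malaria %>%
  specify(response = outcome, explanatory = treatment, success = "infection") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "diff in props", order = c("placebo", "vaccine")) %>%
  get_p_value(obs_stat = 0.643, direction = "greater")
```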
The difference of 64.3% being a rare event suggests two possible interpretations of the results of the study:
- $H_0$: **Independence model.** The vaccine has no effect on infection rate, and we just happened to observe a difference that would only occur on a rare occasion.
- $H_A$: **Alternative model.** The vaccine has an effect on infection rate, and the difference we observed was actually due to the vaccine being effective at combating malaria, which explains the large difference of 64.3%.
Based on the simulations, we have two options.
(1) We conclude that the study results do not provide strong evidence against the independence model.
That is, we do not have sufficiently strong evidence to conclude the vaccine had an effect in this clinical setting.
(2) We conclude the evidence is sufficiently strong to reject $H_0$ and assert that the vaccine was useful.
When we conduct formal studies, usually we reject the notion that we just happened to observe a rare event.
So in the vaccine case, we reject the independence model in favor of the alternative.
That is, we are concluding the data provide strong evidence that the vaccine provides some protection against malaria in this clinical setting.
One field of statistics, statistical inference, is built on evaluating whether such differences are due to chance.
In statistical inference, data scientists evaluate which model is most reasonable given the data.
Errors do occur, just like rare events, and we might choose the wrong model.
While we do not always choose correctly, statistical inference gives us tools to control and evaluate how often decision errors occur.
\clearpage
## Interactive R tutorials {#foundations-tutorials}
Navigate the concepts you've learned in this chapter in R using the following self-paced tutorials.
All you need is your browser to get started!
::: {.alltutorials data-latex=""}
[Tutorial 4: Foundations of inference](https://openintrostat.github.io/ims-tutorials/04-foundations/)\
```{asis, echo = knitr::is_latex_output()}
https://openintrostat.github.io/ims-tutorials/04-foundations
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 4 - Lesson 1: Sampling variability](https://openintro.shinyapps.io/ims-04-foundations-01/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-04-foundations-01
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 4 - Lesson 2: Randomization test](https://openintro.shinyapps.io/ims-04-foundations-02/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-04-foundations-02
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 4 - Lesson 3: Errors in hypothesis testing](https://openintro.shinyapps.io/ims-04-foundations-03/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-04-foundations-03
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 4 - Lesson 4: Parameters and confidence intervals](https://openintro.shinyapps.io/ims-04-foundations-04/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-04-foundations-04
```
:::
```{asis, echo = knitr::is_latex_output()}
You can also access the full list of tutorials supporting this book at\
[https://openintrostat.github.io/ims-tutorials](https://openintrostat.github.io/ims-tutorials).
```
```{asis, echo = knitr::is_html_output()}
You can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).
```
## R labs {#foundations-labs}
Further apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.
::: {.singlelab data-latex=""}
[Sampling distributions - Does science benefit you?](https://www.openintro.org/go?id=ims-r-lab-foundations-1)\
```{asis, echo = knitr::is_latex_output()}
https://www.openintro.org/go?id=ims-r-lab-foundations-1
```
:::
::: {.singlelab data-latex=""}
[Confidence intervals - Climate change](https://www.openintro.org/go?id=ims-r-lab-foundations-2)\
```{asis, echo = knitr::is_latex_output()}
https://www.openintro.org/go?id=ims-r-lab-foundations-2
```
:::
```{asis, echo = knitr::is_latex_output()}
You can also access the full list of labs supporting this book at\
[https://www.openintro.org/go?id=ims-r-labs](https://www.openintro.org/go?id=ims-r-labs).
```
```{asis, echo = knitr::is_html_output()}
You can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).
```