---
title: "Hypothesis Testing"
author: "Alex Hayes"
date: "`r Sys.Date()`"
output: pdf_document
urlcolor: blue
header-includes:
  - \usepackage{float}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
theme_set(theme_bw())
```
<!-- - Add explicit procedures for hypothesis testing -->
<!-- - Add explicit procedures for calculating power -->
## Hypothesis Testing
Recall: a *statistic* $T(X)$ is a function from a random sample into the real line. Since statistics are functions of random samples, they are themselves random variables.
Today we're interested in the following question: is a true parameter value of $\theta_0$ consistent with the data in our observed sample?
We call this the *null hypothesis* and write
\begin{align}
H_0 &: \theta = \theta_0
\end{align}
where this means that the true (population) value of a parameter $\theta$ is equal to some value $\theta_0$.
What do we do next? We *assume* that $\theta = \theta_0$ in the population, and then check if this assumption is compatible with our observed data. The population with $\theta = \theta_0$ corresponds to a probability distribution, which we call the *null distribution*.
Let's make this concrete. Suppose that we observe data $2, 3, 7$ and we know that our data comes from a normal distribution with known variance $\sigma^2 = 2$. Realistically, we won't know $\sigma^2$, or that our data is normal, but we'll work with these assumptions for now and relax them later.
Let's suppose we're interested in the population mean. Let's guess that the population mean is 8. In this case we would write the null hypothesis as $H_0 : \mu = 8$. This is a ridiculous guess for the population mean given our data, but it'll illustrate our point. Our null distribution is then $\mathrm{Normal}(8, 2)$.
Now that we have a null distribution, we need to dream up a *test statistic*. In this class, you'll always be given a test statistic. For now we'll use the Z statistic.
\[
Z = {\bar x - \mu_0 \over \mathrm{se}\left(\bar x \right)} = {\bar x - \mu_0 \over {\sigma \over \sqrt n}} = {4 - 8 \over \sqrt{\frac 23}} \approx -4.9
\]
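The arithmetic above can be checked in R. This is a minimal sketch using only base R; the variable names (`x`, `mu_0`, `sigma`) are chosen here for illustration. The statistic comes out negative because the sample mean of 4 lies below the hypothesized mean of 8, and its magnitude is about 4.9.

```{r, echo = TRUE}
x <- c(2, 3, 7)      # observed sample
mu_0 <- 8            # mean under the null hypothesis
sigma <- sqrt(2)     # known standard deviation (sigma^2 = 2)

se <- sigma / sqrt(length(x))   # standard error of the sample mean
z <- (mean(x) - mu_0) / se      # Z statistic
z
```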
Test statistics are chosen to have two important properties:
1. They need to relate to the population parameter we're interested in measuring
2. We need to know their sampling distributions
Sampling distributions you say! Why do test statistics have sampling distributions? Because we're just taking a function of a random sample.
For this example, we know that
$$Z \sim \mathrm{Normal}(0, 1)$$
and now we ask how probable a statistic at least this extreme is *given that we have assumed the null distribution is true*.
The idea is that if this probability is very small, then our null distribution is implausible: we shouldn't observe highly unlikely statistics. This means that hypothesis testing is a form of *falsification testing*.
```{r fig.height = 3.5}
# Illustrative z-score used only to draw the picture
z_score <- 1.6

tibble(
  x = seq(-4, 4, 0.005),
  density = dnorm(x),
  # density inside the two-sided rejection region, zero elsewhere
  reject = if_else(abs(x) >= z_score, density, 0)
) %>%
  ggplot(aes(x = x, y = density)) +
  geom_ribbon(
    mapping = aes(ymin = 0, ymax = reject),
    fill = "red",
    alpha = 0.5
  ) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = z_score, color = "red") +
  # annotate() draws the label once instead of once per row of data
  annotate(
    "text",
    x = z_score, y = 0.27, label = "\nobserved z-score",
    angle = 90, color = "red"
  ) +
  labs(
    title = "Area of sampling distribution corresponding to p-value",
    subtitle = "Sampling distribution of Z under the null distribution"
  ) +
  theme(legend.position = "none")
```
For the example above, we are interested in the probability of observing a more extreme test statistic given the null distribution, which in this case is:
$$P(|Z| > 4.9) = P(Z < -4.9) + P(Z > 4.9) \approx 9.6 \cdot 10^{-7}$$
This probability is called a *p-value*. Since it's very small, we conclude that the null hypothesis is not realistic. In other words, the population mean is statistically distinguishable from 8 (whether or not it is practically distinguishable from 8 is entirely another matter).
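As a quick check, the two-sided p-value can be computed with `pnorm()`, the standard normal CDF in base R. This is a sketch under the same assumptions as the example (normal data, known variance); the variable names are illustrative.

```{r, echo = TRUE}
z <- (4 - 8) / sqrt(2 / 3)   # Z statistic from the example

# two-sided p-value: probability mass in both tails beyond |z|
p_value <- 2 * pnorm(-abs(z))
p_value
```

Using `-abs(z)` means the same line works whether the observed statistic is positive or negative.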
This is the gist of hypothesis testing. Of course there's a bunch of other associated nonsense that obscures the basic idea, which we'll dive into next.
### Things that can go wrong
#### False positives
We need to be concerned about rejecting the null hypothesis when the null hypothesis is true. This is called a *false positive* or a Type I error.
If the null hypothesis is true, and we calculate a statistic like we did above, we still expect to see a p-value of $9.6 \cdot 10^{-7}$ or smaller about $9.6 \cdot 10^{-5}$ percent of the time. For p-values this small, that isn't a concern, but let's consider a different null hypothesis of $\mu_0 = 3.9$. Now
\[
Z = {\bar x - \mu_0 \over {\sigma \over \sqrt n}} = {4 - 3.9 \over \sqrt \frac 23} \approx 0.12
\]
and our corresponding p-value is
$$P(|Z| > 0.12) = P(Z < -0.12) + P(Z > 0.12) \approx 0.9$$
and we see that this is quite probable! We should definitely not reject the null hypothesis!
This leads us to a new question: when *should* we reject the null hypothesis? A standard choice is to set an acceptable probability for a false positive $\alpha$. One arbitrary but common choice is to set $\alpha = 0.05$, which means we are okay with a ${1 \over 20}$ chance of a false positive. We should then reject the null hypothesis when the p-value is less than $\alpha$. This is often called "rejecting the null hypothesis at significance level $\alpha$". More formally, we might write
$$P(\text{reject} \; H_0 | H_0 \; \text{true}) = \alpha$$
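One way to see this identity in action is a small simulation: draw many samples for which the null hypothesis really is true, test each one, and count how often we reject. The sketch below reuses the setup of the running example ($\mu_0 = 8$, $\sigma^2 = 2$, $n = 3$); the seed and the number of simulations are arbitrary choices made here.

```{r, echo = TRUE}
set.seed(1)            # arbitrary seed, for reproducibility
alpha <- 0.05
mu_0 <- 8
sigma <- sqrt(2)
n <- 3
n_sims <- 10000        # number of simulated samples (arbitrary)

# draw samples for which the null hypothesis is true and test each one
p_values <- replicate(n_sims, {
  x <- rnorm(n, mean = mu_0, sd = sigma)
  z <- (mean(x) - mu_0) / (sigma / sqrt(n))
  2 * pnorm(-abs(z))
})

# proportion of false positives; should be close to alpha
false_positive_rate <- mean(p_values < alpha)
false_positive_rate
```

The proportion of rejections should land near $\alpha$, up to simulation noise.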
Exercise: explain why the XKCD joke on the following page is funny.
```{r, fig.align = "center"}
knitr::include_graphics("figures/significant.png")
```
#### False negatives
On the other hand, we may also fail to reject the null hypothesis when the null hypothesis is in fact false. We might just not have enough data to reject the null, for example. We call this a *false negative* or a Type II error, and write its probability as $\beta$. The *power* of a test is the probability of correctly rejecting a false null hypothesis:
$$\text{Power} = P(\text{reject} \; H_0 | H_0 \; \text{false}) = 1 - \beta$$
To achieve a power of $1 - \beta$ for a one sample Z-test, you need
$$n \approx \left({\sigma \cdot (z_{\alpha / 2} + z_\beta) \over \mu_0 - \mu_A}\right)^2$$
where $\mu_A$ is the true mean and $\mu_0$ is the proposed mean. We'll do an exercise later that will help you see where this comes from.
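The sample size formula translates directly into R. In this sketch the function name `z_test_sample_size` and its arguments are hypothetical, and $z_{\alpha/2}$ and $z_\beta$ are taken to be upper critical values of the standard normal. Because the whole expression is squared, using lower quantiles for both gives the same answer.

```{r, echo = TRUE}
# sample size for a two-sided one-sample Z-test (approximate)
z_test_sample_size <- function(mu_0, mu_a, sigma, alpha = 0.05, power = 0.8) {
  z_alpha <- qnorm(1 - alpha / 2)   # critical value for the two-sided test
  z_beta <- qnorm(power)            # power = 1 - beta
  n <- (sigma * (z_alpha + z_beta) / (mu_0 - mu_a))^2
  ceiling(n)                        # round up to a whole observation
}

# illustrative values only
n_needed <- z_test_sample_size(mu_0 = 15, mu_a = 15.2, sigma = 1.8)
n_needed
```

Note how the required $n$ grows with the square of $\sigma$ and shrinks with the square of the difference $\mu_0 - \mu_A$.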
### Sampling distributions
Suppose that $X_i$ are independent and identically distributed $\mathrm{Normal}(\mu, \sigma^2)$.
\renewcommand{\arraystretch}{2}
\begin{figure}[h]
\begin{center}
\begin{tabular}{llll}
parameters & null hypothesis & test statistic & null distribution \\
\hline
$\mu$ unknown, $\sigma^2$ known & $H_0: \mu = \mu_0$ & $\displaystyle {\bar x - \mu_0 \over {\sigma \over \sqrt n}}$ & $\mathrm{Normal}(0, 1)$ \\
$\mu, \sigma^2$ unknown & $H_0: \mu = \mu_0$ & $\displaystyle {\bar x - \mu_0 \over {s \over \sqrt n}}$ & $t_{n-1}$
\end{tabular}
\end{center}
\end{figure}
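The two rows of the table differ only in whether $\sigma$ is known or replaced by the sample standard deviation $s$. As a rough illustration, here are both statistics for the sample $2, 3, 7$ and $H_0 : \mu = 8$, using base R only; the variable names are chosen here for illustration.

```{r, echo = TRUE}
x <- c(2, 3, 7)
mu_0 <- 8
n <- length(x)

# sigma^2 known (here 2): Z statistic, compared against Normal(0, 1)
z <- (mean(x) - mu_0) / (sqrt(2) / sqrt(n))
p_z <- 2 * pnorm(-abs(z))

# sigma^2 unknown: replace sigma with the sample standard deviation s,
# and compare against a t distribution with n - 1 degrees of freedom
t_stat <- (mean(x) - mu_0) / (sd(x) / sqrt(n))
p_t <- 2 * pt(-abs(t_stat), df = n - 1)

c(z = z, p_z = p_z, t = t_stat, p_t = p_t)
```

The built-in `t.test(x, mu = mu_0)` carries out the second calculation and should report the same statistic and p-value.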
## Examples
### Z-test
A company claims battery lifetimes are normally distributed with $\mu = 40$ and $\sigma = 5$ hours. We are curious if the claim about the mean is reasonable, and collect a random sample of 100 batteries. The sample mean is 39.8. What is the p-value of a Z-test for $H_0 : \mu = 40$?
We begin by calculating a Z-score
$$Z = {\bar x - \mu_0 \over {\sigma \over \sqrt n}} = {39.8 - 40 \over {5 \over \sqrt{100}}} = -0.4$$
and then we calculate, using the fact that $Z \sim \mathrm{Normal}(0, 1)$,
$$P(Z < -0.4) + P(Z > 0.4) \approx 0.69$$
We might also be interested in a *one-sided* test, where $H_A : \mu < 40$. In this case only values of $Z$ below the observed $-0.4$ count as more extreme, and the p-value is
$$P(Z < -0.4) \approx 0.34$$
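Both p-values for the battery example can be reproduced with `pnorm()`. This is a minimal sketch; the variable names are illustrative.

```{r, echo = TRUE}
x_bar <- 39.8   # sample mean
mu_0 <- 40      # claimed mean under the null
sigma <- 5      # claimed standard deviation, treated as known
n <- 100        # sample size

z <- (x_bar - mu_0) / (sigma / sqrt(n))   # approximately -0.4

p_two_sided <- 2 * pnorm(-abs(z))   # H_A: mu != 40
p_one_sided <- pnorm(z)             # H_A: mu < 40

c(two_sided = p_two_sided, one_sided = p_one_sided)
```

Because the observed statistic falls on the side the one-sided alternative points to, the one-sided p-value is exactly half of the two-sided one.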
### Power for Z-test
Suppose a powdered medicine is supposed to have a mean particle diameter of $\mu = 15$ micrometers, and the standard deviation of diameters stays steady around 1.8 micrometers. The company would like to have high power to detect mean diameters 0.2 micrometers away from 15. When $n = 100$, what is the power of the test if the true $\mu$ is 15.2 micrometers? Assume the company is interested in controlling type I error at an $\alpha = 0.05$ level.
We will reject the null when our Z score is less than $z_{\alpha / 2}$ or greater than $z_{1 - \alpha / 2}$, that is, when the Z score is less than -1.96 or greater than 1.96. Recall that the Z score is ${\bar x - \mu_0 \over {\sigma \over \sqrt n}}$, which we can rearrange in terms of $\bar x$ to see that we will reject the null when $\bar x < 14.65$ or $\bar x > 15.35$.
Now we are interested in the probability of being in this rejection region *when the alternative hypothesis $\mu_A = 15.2$ is true*.
<!-- this is magical why are we interested in this -->
$$P(\bar x > 15.35 | \mu = 15.2) + P(\bar x < 14.65 | \mu = 15.2)$$
and we know that $\bar x \sim \mathrm{Normal} \left(15.2, 1.8^2 / 100\right)$ when $\mu = 15.2$, so this equals
$$0.198 + 0.001 \approx 0.199$$
So we have only a power of about 20 percent. This is quite low.
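The power calculation can be sketched in R by first finding the rejection region for $\bar x$ under the null and then asking how much probability the alternative places on it. The variable names are illustrative, and the cutoffs are kept unrounded, which is why the result matches 0.199 rather than what the rounded 14.65 and 15.35 would give.

```{r, echo = TRUE}
mu_0 <- 15      # mean under the null
mu_a <- 15.2    # true mean under the alternative
sigma <- 1.8    # known standard deviation
n <- 100
alpha <- 0.05

se <- sigma / sqrt(n)   # standard error of the sample mean

# rejection region for x-bar under the null
lower <- mu_0 + qnorm(alpha / 2) * se
upper <- mu_0 + qnorm(1 - alpha / 2) * se

# probability of landing in the rejection region when mu = mu_a
power <- pnorm(lower, mean = mu_a, sd = se) +
  pnorm(upper, mean = mu_a, sd = se, lower.tail = FALSE)
power
```

Rerunning this with a larger `n` shows how quickly the power improves as the standard error shrinks.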