# (PART) Inferential modeling {.unnumbered}
```{r, include = FALSE}
source("_common.R")
```
# Inference for linear regression with a single predictor {#inf-model-slr}
\chaptermark{Inference for regression with a single predictor}
::: {.chapterintro data-latex=""}
We now bring together ideas of inferential analyses with the descriptive models seen in Chapter \@ref(model-slr).
In particular, we will use the least squares regression line to test whether there is a relationship between two continuous variables.
Additionally, we will build confidence intervals which quantify the uncertainty in the slope of the linear regression line.
While the setting now focuses on predicting a numeric response variable (for linear models) or a binary response variable (for logistic models), we continue to ask questions about the variability of the model from sample to sample.
The sampling variability will inform the conclusions about the population that can be drawn.
Many of the inferential ideas are remarkably similar to those covered in previous chapters.
The technical conditions for linear models are typically assessed graphically, although independence of observations continues to be of utmost importance.
We encourage the reader to think broadly about the models at hand without putting too much dependence on the exact p-values that are reported from the statistical software.
Inference on models with multiple explanatory variables can suffer from data snooping, which can result in false positive claims.
We provide some guidance and hope the reader will further their statistical learning after working through the material in this text.
:::
```{r include=FALSE}
terms_chp_24 <- c("inference with single precictor regression")
```
## Case study: Sandwich store
### Observed data
We start the chapter with a hypothetical example describing the linear relationship between dollars spent advertising for a chain sandwich restaurant and monthly revenue.
The hypothetical example serves the purpose of illustrating how a linear model varies from sample to sample.
Because we have made up the example and the data (and the entire population), we can take many, many samples from the population to visualize the variability.
Note that in real life, we always have exactly one sample (that is, one dataset), and through the inference process, we imagine what might have happened had we taken a different sample.
The change from sample to sample leads to an understanding of how the single observed dataset is different from the population of values, which is typically the fundamental goal of inference.
Consider the following hypothetical population of all of the sandwich stores of a particular chain seen in Figure \@ref(fig:sandpop).
In this made-up world, the CEO actually has all the relevant data, which is why they can plot it here.
The CEO is omniscient and can write down the population model which describes the true population relationship between the advertising dollars and revenue.
There appears to be a linear relationship between advertising dollars and revenue (both in \$1,000).
```{r sandpop, fig.cap = "Revenue as a linear model of advertising dollars for a population of sandwich stores, in thousands of dollars."}
set.seed(4747)
popsize <- 1000
ad <- rnorm(popsize, 4, 1)
rev <- 12 + 4.7 * ad + rnorm(popsize, 0, 8)
sandwich <- tibble(ad, rev)
ggplot(sandwich, aes(x = ad, y = rev)) +
geom_point(alpha = 0.5, size = 2) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Hypothetical population"
) +
coord_cartesian(ylim = c(0, 65))
```
You may remember from Chapter \@ref(model-slr) that the population model is: $$y = \beta_0 + \beta_1 x + \varepsilon.$$
Again, the omniscient CEO (with the full population information) can write down the true population model as: $$\texttt{expected revenue} = 11.23 + 4.8 \times \texttt{advertising}.$$
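As a minimal sketch (using the `sandwich` population simulated in the chunk above and the `tidy()` helper from **broom**), the CEO's population fit can be reproduced by regressing revenue on advertising for all 1,000 stores; the estimates should closely match the coefficients quoted above.
```{r, eval = FALSE}
# Least squares fit to the entire (hypothetical) population of stores.
# The estimate column should closely match the coefficients quoted above.
lm(rev ~ ad, data = sandwich) %>%
  tidy()
```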
### Variability of the statistic
Unfortunately, in our scenario, the CEO is not willing to part with the full set of data, but they will allow potential franchise buyers to see a small sample of the data in order to help the potential buyer decide whether to set up a new franchise.
The CEO is willing to give each potential franchise buyer a random sample of data from 20 stores.
As with any numerical characteristic which describes a subset of the population, the estimated slope of a sample will vary from sample to sample.
Consider the linear model which describes revenue (in \$1,000) based on advertising dollars (in \$1,000).
The least squares regression model uses the data to find a sample linear fit: $$\hat{y} = b_0 + b_1 x.$$
A random sample of 20 stores shows a different least squares regression line depending on which observations are selected.
A subset of 20 stores shows a similar positive trend between advertising and revenue (compared to what we saw in Figure \@ref(fig:sandpop), which described the population) despite having fewer observations on the plot.
```{r echo = FALSE}
set.seed(470)
sandwich2 <- sandwich %>%
sample_n(size = 20)
sandwich3 <- sandwich %>%
sample_n(size = 20)
sandwich_many <- sandwich %>%
rep_sample_n(size = 20, replace = FALSE, reps = 50)
```
```{r fig.cap = "A random sample of 20 stores from the entire population. A linear trend between advertising and revenue continues to be observed."}
ggplot(sandwich2, aes(x = ad, y = rev)) +
geom_point(size = 3, fill = IMSCOL["green", "full"], color = "#FFFFFF", shape = 22) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE, color = IMSCOL["green", "full"]) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Random sample of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
A second sample of size 20 also shows a positive trend!
```{r fig.cap = "A different random sample of 20 stores from the entire population. Again, a linear trend between advertising and revenue is observed."}
ggplot(sandwich3, aes(x = ad, y = rev)) +
geom_point(size = 3, fill = IMSCOL["gray", "full"], color = "#FFFFFF", shape = 23) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE, color = IMSCOL["gray", "full"]) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Random sample of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
But the lines are slightly different!
```{r fig.cap = "The linear models from the two different random samples are quite similar, but they are not the same line."}
ggplot() +
geom_point(data = sandwich2, aes(x = ad, y = rev),
size = 3, shape = 22,
fill = IMSCOL["green", "full"], color = "#FFFFFF") +
geom_smooth(data = sandwich2, aes(x = ad, y = rev),
method = "lm", se = FALSE, fullrange = TRUE,
color = IMSCOL["green", "full"]) +
geom_point(data = sandwich3, aes(x = ad, y = rev),
size = 3, shape = 23,
fill = IMSCOL["gray", "full"], color = "#FFFFFF") +
geom_smooth(data = sandwich3, aes(x = ad, y = rev),
method = "lm", se = FALSE, fullrange = TRUE,
color = IMSCOL["gray", "full"]) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Two random samples of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
That is, there is **variability** in the regression line from sample to sample.
The concept of the sampling variability is something you've seen before, but in this chapter, you will focus on the variability of the line, often measured through the variability of a single statistic: **the slope of the line**.
```{r include=FALSE}
terms_chp_24 <- c(terms_chp_24, "variability of the slope")
```
```{r slopes, fig.cap = "If repeated samples of size 20 are taken from the entire population, each linear model will be slightly different. The red line provides the linear fit to the entire population."}
ggplot() +
geom_smooth(
data = sandwich_many, aes(x = ad, y = rev, group = replicate),
method = "lm", se = FALSE, fullrange = TRUE,
color = IMSCOL["gray", "f2"], size = 0.5
) +
geom_smooth(
data = sandwich, aes(x = ad, y = rev), method = "lm", se = FALSE,
fullrange = TRUE, color = IMSCOL["red", "full"]
) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Many random samples of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
You might notice in Figure \@ref(fig:slopes) that the $\hat{y}$ values given by the lines are much more consistent in the middle of the dataset than at the ends.
The reason is that the data itself anchors the lines in such a way that the line must pass through the center of the data cloud.
The effect of the fan-shaped lines is that predicted revenue for advertising close to \$4,000 will be much more precise than the revenue predictions made for \$1,000 or \$7,000 of advertising.
The distribution of slopes (for samples of size $n=20$) can be seen in a histogram, as in Figure \@ref(fig:sand20lm).
```{r sand20lm, fig.cap="Variability of slope estimates taken from many different samples of stores, each of size 20."}
sandwich_many_lm <- sandwich_many %>%
group_by(replicate) %>%
do(tidy(lm(rev ~ ad, data = .))) %>%
filter(term == "ad")
ggplot(sandwich_many_lm, aes(x = estimate)) +
geom_histogram(binwidth = 0.5) +
labs(
x = "Slope estimate",
y = "Count",
title = "Chain sandwich store",
subtitle = "Many random samples of 20 stores"
)
```
Recall that the example described in this introduction is hypothetical.
That is, we created an entire population in order to demonstrate how the slope of a line would vary from sample to sample.
The tools in this textbook are designed to evaluate a single sample of data.
With actual studies, we do not have repeated samples, so we cannot rely on repeated sampling to visualize the variability in slopes.
We have seen variability in samples throughout this text, so it should not come as a surprise that different samples will produce different linear models.
However, it is nice to visually consider the linear models produced by different samples.
Additionally, as with measuring the variability of previous statistics (e.g., $\overline{X}_1 - \overline{X}_2$ or $\hat{p}_1 - \hat{p}_2$), the histogram of the sample statistics can provide information related to inferential considerations.
In the following sections, the distribution (i.e., histogram) of $b_1$ (the estimated slope coefficient) will be constructed in the same three ways that, by now, may be familiar to you.
First (in Section \@ref(randslope)), the distribution of $b_1$ when $\beta_1 = 0$ is constructed by randomizing (permuting) the response variable.
Next (in Section \@ref(bootbeta1)), we can bootstrap the data by taking random samples of size $n$ from the original dataset.
And last (in Section \@ref(mathslope)), we use mathematical tools to describe the variability using the $t$-distribution that was first encountered in Section \@ref(one-mean-math).
## Randomization test for the slope {#randslope}
Consider data on 100 randomly selected births gathered originally from the US Department of Health and Human Services.
Some of the variables are plotted in Figure \@ref(fig:babyweight).
The scientific research interest at hand will be in determining the linear relationship between weight of baby at birth (in lbs) and number of weeks of gestation.
The dataset is quite rich and deserves exploring, but for this example, we will focus only on the weight of the baby.
::: {.data data-latex=""}
The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
We will work with a random sample of 100 observations from these data.
:::
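As a quick sketch (assuming the **openintro** package is loaded along with the tidyverse), the available variables can be previewed before drawing the sample of 100 observations:
```{r, eval = FALSE}
# Preview the variables in the full births14 data before sampling.
library(openintro)
glimpse(births14)
```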
```{r include=FALSE}
terms_chp_24 <- c(terms_chp_24, "randomization test for the slope")
```
```{r babyweight, fig.cap = "Weight of baby at birth (in lbs) as plotted by four other birth variables (mother's weight gain, mother's age, number of hospital visits, and weeks gestation)."}
set.seed(47)
births14_100 <- births14 %>%
drop_na() %>%
sample_n(100)
p_gained <- ggplot(births14_100) +
geom_jitter(aes(y = weight, x = gained)) +
xlab("weight gained by mother") +
ylab("weight of baby")
p_mage <- ggplot(births14_100) +
geom_jitter(aes(y = weight, x = mage)) +
xlab("mother's age") +
ylab("weight of baby")
p_visits <- ggplot(births14_100) +
geom_jitter(aes(y = weight, x = visits)) +
xlab("number of hospital visits during pregnancy") +
ylab("weight of baby")
p_weeks <- ggplot(births14_100) +
geom_jitter(aes(x = weeks, y = weight)) +
xlab("weeks of gestation") +
ylab("weight of baby")
p_gained + p_mage + p_visits + p_weeks +
plot_layout(ncol = 2)
```
As you have seen previously, statistical inference typically relies on setting a null hypothesis which is hoped to be subsequently rejected.
In the linear model setting, we might hope to have a linear relationship between `weeks` and `weight` in settings where `weeks` gestation is known and `weight` of baby needs to be predicted.
The relevant hypotheses for the linear model setting can be written in terms of the population slope parameter.
Here the population refers to a larger population of births in the US.
- $H_0: \beta_1= 0$, there is no linear relationship between `weight` and `weeks`.
- $H_A: \beta_1 \ne 0$, there is some linear relationship between `weight` and `weeks`.
Recall that for the randomization test, we permute one variable to eliminate any existing relationship between the variables.
That is, we set the null hypothesis to be true, and we measure the natural variability in the data due to sampling but **not** due to variables being correlated.
Figure \@ref(fig:permweightScatter) shows the observed data and a scatterplot of one permutation of the `weight` variable.
The careful observer can see that each of the observed values for `weight` (and for `weeks`) exists in both the original data plot and the permuted `weight` plot, but the `weight` and `weeks` gestation values are no longer matched for a given birth.
That is, each `weight` value is randomly assigned to a new `weeks` gestation.
```{r permweightScatter, fig.cap = "(ref:permweightScatter-cap)", fig.asp = 0.5}
set.seed(4747)
p1 <- ggplot(births14_100) +
geom_point(aes(x = weeks, y = weight)) +
labs(
x = "Length of gestation (weeks)",
y = "Weight of baby (pounds)",
title = "Original data"
)
p2 <- ggplot(births14_100) +
geom_point(aes(x = weeks, y = sample(weight))) +
labs(
x = "Length of gestation (weeks)",
y = "Permuted weight of baby (pounds)",
title = "Permuted data"
)
p1 + p2
```
(ref:permweightScatter-cap) Original (left) and permuted (right) data. The permutation removes the linear relationship between `weight` and `weeks`. Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true).
By repeatedly permuting the response variable, any pattern in the linear model that is observed is due only to random chance (and not an underlying relationship).
The randomization test compares the slopes calculated from the permuted response variable with the observed slope.
If the observed slope is inconsistent with the slopes from permuting, we can conclude that there is some underlying relationship (and that the slope is not merely due to random chance).
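A minimal sketch of one such permutation (using the `births14_100` sample created for Figure \@ref(fig:babyweight)) shuffles the response, refits the line, and records the slope; repeating this many times builds the null distribution of slopes.
```{r, eval = FALSE}
# One permutation "by hand": shuffle weight, keep weeks fixed,
# refit the least squares line, and keep the slope on the permuted data.
births14_100 %>%
  mutate(weight_permuted = sample(weight)) %>%
  lm(weight_permuted ~ weeks, data = .) %>%
  tidy() %>%
  filter(term == "weeks")
```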
### Observed data
We will continue to use the births data to investigate the linear relationship between `weight` and `weeks` gestation.
Note that the least squares model (see Chapter \@ref(model-slr)) describing the relationship is given in Table \@ref(tab:ls-births).
The columns in Table \@ref(tab:ls-births) are further described in Section \@ref(mathslope).
```{r ls-births}
lm(weight ~ weeks, data = births14_100) %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
kbl(linesep = "", booktabs = TRUE,
caption = "The least squares estimates of the intercept and slope are given in the estimate column. The observed slope is 0.335.",
digits = 2, align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position")) %>%
column_spec(1, width = "10em", monospace = TRUE) %>%
column_spec(2:5, width = "5em")
```
### Variability of the statistic
After permuting the data, the least squares estimate of the line can be computed.
Repeated permutations and slope calculations describe the variability in the line (i.e., in the slope) due only to the natural variability and not due to a relationship between `weight` and `weeks` gestation.
Figure \@ref(fig:permweekslm) shows two different permutations of `weight` and the resulting linear models.
```{r permweekslm, fig.cap = "(ref:permweekslm-cap)", fig.asp = 0.5}
set.seed(470)
p1 <- ggplot(births14_100, aes(x = weeks, y = sample(weight))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
labs(
y = "Permuted weight of baby (pounds)",
x = "Length of gestation (weeks)",
title = "First permutation of \nweight"
)
p2 <- ggplot(births14_100, aes(x = weeks, y = sample(weight))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
labs(
y = "Permuted weight of baby (pounds)",
x = "Length of gestation (weeks)",
title = "Second permutation of \nweight"
)
p1 + p2
```
(ref:permweekslm-cap) Two different permutations of the `weight` variable with slightly different least squares regression lines.
As you can see, sometimes the slope of the permuted data is positive, sometimes it is negative.
Because the randomization happens under the condition of no underlying relationship (because the response variable is completely mixed with the explanatory variable), we expect to see the center of the randomized slope distribution to be zero.
### Observed statistic vs. null statistics
```{r nulldistBirths, fig.cap = "(ref:nulldistBirths-cap)"}
perm_slope <- births14_100 %>%
specify(weight ~ weeks) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "slope")
obs_slope <- births14_100 %>%
specify(weight ~ weeks) %>%
calculate(stat = "slope") %>%
pull()
ggplot(data = perm_slope, aes(x = stat)) +
geom_histogram() +
geom_vline(xintercept = obs_slope, color = IMSCOL["red", "full"]) +
labs(x = "Randomly generated slopes", y = "Count")
```
(ref:nulldistBirths-cap) Histogram of slopes given different permutations of the `weight` variable. The vertical red line is at the observed value of the slope, 0.335.
As we can see from Figure \@ref(fig:nulldistBirths), a slope estimate as extreme as the observed slope estimate (the red line) never occurred across the many repeated permutations of the `weight` variable.
That is, if indeed there were no linear relationship between `weight` and `weeks`, the natural variability of the slopes would produce estimates between approximately -0.15 and +0.15.
We reject the null hypothesis.
Therefore, we believe that the slope observed on the original data is not just due to natural variability and indeed, there is a linear relationship between `weight` of baby and `weeks` gestation for births in the US.
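A sketch of how this comparison could be formalized with the **infer** package, reusing the `perm_slope` and `obs_slope` objects created in the chunk above:
```{r, eval = FALSE}
# Compare the observed slope to the permutation (null) distribution.
# If no permuted slope is as extreme as the observed slope, the p-value
# is reported as 0 and should be interpreted as p < 1/1000.
perm_slope %>%
  get_p_value(obs_stat = obs_slope, direction = "two-sided")
```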
## Bootstrap confidence interval for the slope {#bootbeta1}
As we have seen in previous chapters, we can use bootstrapping to estimate the sampling distribution of the statistic of interest (here, the slope) without the null assumption of no relationship (which was the condition in the randomization test).
Because interest is now in creating a CI, there is no null hypothesis, so there won't be any reason to permute either of the variables.
```{r include=FALSE}
terms_chp_24 <- c(terms_chp_24, "bootstrap CI for the slope")
```
### Observed data
Returning to the births data, we may want to consider the relationship between `mage` (mother's age) and `weight`.
Is `mage` a good predictor of `weight`?
And if so, what is the relationship?
That is, what is the slope that models average `weight` of baby as a function of `mage` (mother's age)?
The linear model regressing `weight` on `mage` is provided in Table \@ref(tab:ls-births-mage).
```{r echo = FALSE}
set.seed(4747)
births4 <- births14_100 %>%
sample_n(size = 100, replace = TRUE)
births5 <- births14_100 %>%
sample_n(size = 100, replace = TRUE)
births_many_BS <- births14_100 %>%
rep_sample_n(size = 100, replace = TRUE, reps = 50)
```
```{r magePlot, fig.cap = "(ref:magePlot-cap)"}
ggplot(births14_100) +
geom_point(aes(x = mage, y = weight)) +
geom_smooth(aes(x = mage, y = weight),
method = "lm", se = FALSE,
fullrange = TRUE,
) +
labs(x = "Mother's age", y = "Weight of baby")
```
(ref:magePlot-cap) Original data: `weight` of baby as a linear model of mother's age. Notice that the relationship between `mage` and `weight` is not as strong as the relationship we saw previously between `weeks` and `weight`.
```{r ls-births-mage}
lm(weight ~ mage, data = births14_100) %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
kbl(linesep = "", booktabs = TRUE,
caption = "The least squares estimates of the intercept and slope are given in the estimate column. The observed slope is 0.036",
digits = 2, align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position")) %>%
column_spec(1, width = "10em", monospace = TRUE) %>%
column_spec(2:5, width = "5em")
```
### Variability of the statistic
Because the focus is not on a null distribution, we sample with replacement $n=100$ observations from the original dataset.
Recall that with bootstrapping the resample always has the same number of observations as the original dataset in order to mimic the process of taking a sample from the population.
When sampling in the linear model case, consider each observation to be a single dot.
If the dot is resampled, both the `weight` and the `mage` measurement are observed.
The measurements are linked to the dot (i.e., to the birth in the sample).
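A minimal sketch of a single bootstrap resample makes this concrete: rows are drawn with replacement (keeping each `mage`/`weight` pair intact), and the line is refit to the resampled data.
```{r, eval = FALSE}
# One bootstrap resample: draw 100 rows with replacement from the
# original sample, keeping each (mage, weight) pair together,
# then refit the least squares line.
births14_100 %>%
  sample_n(size = nrow(births14_100), replace = TRUE) %>%
  lm(weight ~ mage, data = .) %>%
  tidy() %>%
  filter(term == "mage")
```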
```{r birth2BS, fig.cap = "Original and one bootstrap sample of the births data. Note that it is difficult to differentiate the two plots, as (within a single bootstrap sample) the observations which have been resampled twice are plotted as points on top of one another. The red circles represent points in the original data which were not included in the bootstrap sample. The blue circles represent a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample. The green circles represent a particular structure in the data which is observed in both the original and bootstrap samples.", fig.asp = 0.5}
p1 <- ggplot(births14_100) +
geom_point(aes(x = mage, y = weight), alpha = 0.4) +
geom_smooth(aes(x = mage, y = weight),
method = "lm", se = FALSE,
fullrange = TRUE
) +
xlab("mother's age") +
ylab("weight of baby") +
ggtitle("Original births data.") +
geom_point(x = 29, y = 9, color = IMSCOL["green", "full"], pch = 1, size = 8) +
geom_point(x = 20, y = 6, color = IMSCOL["red", "full"], pch = 1, size = 8) +
geom_point(x = 39, y = 9, pch = 1, size = 8)
p2 <- ggplot(births5) +
geom_point(aes(x = mage, y = weight), alpha = 0.4) +
geom_smooth(aes(x = mage, y = weight),
method = "lm", se = FALSE,
fullrange = TRUE
) +
xlab("mother's age") +
ylab("weight of baby") +
ggtitle("Bootstrap sample from \nthe births data.") +
geom_point(x = 29, y = 9, color = IMSCOL["green", "full"], pch = 1, size = 8) +
geom_point(x = 20, y = 6, color = IMSCOL["red", "full"], pch = 1, size = 8) +
geom_point(x = 39, y = 9, pch = 1, size = 8)
p1 + p2
```
Figure \@ref(fig:birth2BS) shows the original data as compared with a single bootstrap sample, resulting in (slightly) different linear models.
The red circles represent points in the original data which were not included in the bootstrap sample.
The blue circles represent a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample.
The green circles represent a particular structure in the data which is observed in both the original and bootstrap samples.
By repeatedly resampling, we can see dozens of bootstrapped slopes on the same plot in Figure \@ref(fig:birthBS).
```{r birthBS, fig.cap = "Repeated bootstrap resamples of size 100 are taken from the original data. Each of the bootstrapped linear models is slightly different."}
ggplot(births_many_BS, aes(x = mage, y = weight, group = replicate)) +
geom_smooth(method = "lm", se = FALSE, color = IMSCOL["blue", "full"], fullrange = TRUE, alpha = 0.8) +
xlab("mother's age") +
ylab("weight of baby")
```
Recall that in order to create a confidence interval for the slope, we need to find the range of values that the statistic (here the slope) takes on from different bootstrap samples.
Figure \@ref(fig:mageBSslopes) is a histogram of the relevant bootstrapped slopes.
We can see that a 95% bootstrap percentile interval for the true population slope is given by (-0.01, 0.081).
We are 95% confident that for the model describing the population of births, relating mother's age and `weight` of baby, a one unit increase in `mage` (in years) will be associated with a change in predicted average baby `weight` of between -0.01 and 0.081 pounds (notice that the CI overlaps zero!).
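As a sketch, the same kind of percentile interval can be computed directly with the **infer** package; the endpoints will vary slightly from run to run unless a seed is set.
```{r, eval = FALSE}
# Bootstrap the slope 1000 times and take the middle 95% of the estimates.
births14_100 %>%
  specify(weight ~ mage) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") %>%
  get_confidence_interval(level = 0.95, type = "percentile")
```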
```{r mageBSslopes, fig.cap = "(ref:mageBSslopes-cap)"}
set.seed(47)
births_many_BS_1000 <- births14_100 %>%
rep_sample_n(size = 100, replace = TRUE, reps = 1000)
births_many_lm_BS <- births_many_BS_1000 %>%
group_by(replicate) %>%
do(tidy(lm(weight ~ mage, data = .))) %>%
filter(term == "mage")
lower <- round(quantile(births_many_lm_BS$estimate, 0.025), 3)
upper <- round(quantile(births_many_lm_BS$estimate, 0.975), 3)
ggplot(births_many_lm_BS, aes(x = estimate)) +
geom_histogram() +
annotate("segment", x = lower, xend = lower, y = 0, yend = 25, linetype = "dashed") +
annotate("segment", x = upper, xend = upper, y = 0, yend = 25, linetype = "dashed") +
annotate("text", x = lower, y = 35, label = "2.5 percentile\n-0.01", size = 5) +
annotate("text", x = upper, y = 35, label = "97.5 percentile\n0.081", size = 5) +
labs(
x = "Bootstrapped values of the slope modeling weight on mother's age",
y = NULL
) +
theme(axis.text.y = element_blank())
```
(ref:mageBSslopes-cap) The original births data on `weight` and `mage` is bootstrapped 1,000 times. The histogram provides a sense for the variability of the slope of the linear model slope from sample to sample.
::: {.workedexample data-latex=""}
Using Figure \@ref(fig:mageBSslopes), calculate the bootstrap estimate for the standard error of the slope.
Using the bootstrap standard error, find a 95% bootstrap SE confidence interval for the true population slope, and interpret the interval in context.
------------------------------------------------------------------------
Notice that most of the bootstrapped slopes fall between -0.01 and +0.08 (a range of 0.09).
Using the empirical rule (that with bell-shaped distributions, most observations are within two standard errors of the center), the standard error of the slopes is approximately 0.0225.
The normal cutoff for a 95% confidence interval is $z^\star = 1.96$ which leads to a confidence interval of $b_1 \pm 1.96 \cdot SE \rightarrow 0.036 \pm 1.96 \cdot 0.0225 \rightarrow (-0.0081, 0.0801).$ The bootstrap SE confidence interval is almost identical to the bootstrap percentile interval.
In context, we are 95% confident that for the model describing the population of births, relating mother's age and `weight` of baby, a one unit increase in `mage` (in years) will be associated with a change in predicted average baby `weight` of between -0.0081 and 0.0801 pounds.
A sketch of this calculation in R follows this example.
:::
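A sketch of the same calculation in R, using the `births_many_lm_BS` bootstrap slopes created for Figure \@ref(fig:mageBSslopes); the standard deviation of the bootstrapped slopes should be close to the empirical-rule approximation of 0.0225.
```{r, eval = FALSE}
# Bootstrap SE interval: estimate +/- 1.96 * SE, where SE is the
# standard deviation of the bootstrapped slope estimates.
boot_se <- sd(births_many_lm_BS$estimate)
boot_se
0.036 + c(-1, 1) * 1.96 * boot_se
```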
## Mathematical model for testing the slope {#mathslope}
When certain technical conditions apply, it is convenient to use mathematical approximations to test and estimate the slope parameter.
The approximations will build on the t-distribution which was described in Chapter \@ref(inference-one-mean).
The mathematical model is often correct and is usually easy to implement computationally.
The validity of the technical conditions will be considered in detail in Section \@ref(tech-cond-linmod).
In this section, we discuss uncertainty in the estimates of the slope and y-intercept for a regression line.
Just as we identified standard errors for point estimates in previous chapters, we first discuss standard errors for these new estimates.
### Observed data
**Midterm elections and unemployment**
Elections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S. Presidential election.
The set of House elections occurring during the middle of a Presidential term are called midterm elections.
In America's two-party system (the vast majority of House members through history have been either Republicans or Democrats), one political theory suggests the higher the unemployment rate, the worse the President's party will do in the midterm elections.
In 2020 there were 232 Democrats, 198 Republicans, and 1 Libertarian in the House.
To assess the validity of this claim, we can compile historical data and look for a connection.
We consider every midterm election from 1898 to 2018, with the exception of those elections during the Great Depression.
The House of Representatives is made up of 435 voting members.
::: {.data data-latex=""}
The [`midterms_house`](http://openintrostat.github.io/openintro/reference/midterms_house.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
Figure \@ref(fig:unemploymentAndChangeInHouse) shows these data and the least-squares regression line:
$$
\begin{aligned}
&\texttt{percent change in House seats for President's party} \\
&\qquad\qquad= -7.36 - 0.89 \times \texttt{(unemployment rate)}
\end{aligned}
$$
We consider the percent change in the number of seats of the President's party (e.g., percent change in the number of seats for Republicans in 2018) against the unemployment rate.
Examining the data, there are no clear deviations from linearity or substantial outliers (see Section \@ref(resids) for a discussion on using residuals to visualize how well a linear model fits the data).
While the data are collected sequentially, a separate analysis was used to check for any apparent correlation between successive observations; no such correlation was found.
```{r unemploymentAndChangeInHouse, fig.cap="The percent change in House seats for the President's party in each election from 1898 to 2018 plotted against the unemployment rate. The two points for the Great Depression have been removed, and a least squares regression line has been fit to the data."}
years_to_label <- c(2019, 2003, 1995, 1983, 2011, 1899)
midterms_house_with_labels <- midterms_house %>%
mutate(
label_top = word(potus, -1),
label_bottom = year - 1
)
midterms_house_with_labels %>%
filter(!(year %in% c(1935, 1939))) %>%
ggplot(aes(x = unemp, y = house_change)) +
geom_point(aes(color = party, shape = party), size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
scale_color_openintro() +
scale_shape_manual(values = c(16, 17)) +
scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1), limits = c(3.5, 12.1), breaks = c(4, 8, 12)) +
scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1), limits = c(-31, 11), breaks = c(-30, -20, -10, 0, 10)) +
labs(
x = "Percent unemployed",
y = "Percent change in House seats for the President's party",
color = "Party", shape = "Party"
) +
geom_text(data = midterms_house_with_labels %>% filter(year %in% years_to_label),
aes(x = unemp, y = house_change + 2, label = paste0(label_top, "\n", label_bottom)), color = IMSCOL["black", "full"]
) +
theme(
legend.position = c(0.8, 0.8),
legend.background = element_rect(color = "white")
)
```
::: {.guidedpractice data-latex=""}
The data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively.
Do you agree that they should be removed for this investigation?
Why or why not?[^inf-model-slr-1]
:::
[^inf-model-slr-1]: The answer to this question relies on the idea that statistical data analysis is somewhat of an art.
That is, in many situations, there is no "right" answer.
As you do more and more analyses on your own, you will come to recognize the nuanced understanding which is needed for a particular dataset.
In terms of the Great Depression, we will provide two contrasting considerations.
Each of these points would have very high leverage on any least-squares regression line, and years with such high unemployment may not help us understand what would happen in other years where the unemployment is only modestly high.
On the other hand, these are exceptional cases, and we would be discarding important information if we exclude them from a final analysis.
There is a negative slope in the line shown in Figure \@ref(fig:unemploymentAndChangeInHouse).
However, this slope (and the y-intercept) are only estimates of the parameter values.
We might wonder, is this convincing evidence that the "true" linear model has a negative slope?
That is, do the data provide strong evidence that the political theory is accurate, with the unemployment rate serving as a useful predictor of how the President's party fares in the midterm elections?
We can frame this investigation into a statistical hypothesis test:
- $H_0$: $\beta_1 = 0$. The true linear model has slope zero.
- $H_A$: $\beta_1 \neq 0$. The true linear model has a slope different than zero. The unemployment is predictive of whether the President's party wins or loses seats in the House of Representatives.
We would reject $H_0$ in favor of $H_A$ if the data provide strong evidence that the true slope parameter is different than zero.
To assess the hypotheses, we identify a standard error for the estimate, compute an appropriate test statistic, and identify the p-value.
### Variability of the statistic
Just like other point estimates we have seen before, we can compute a standard error and test statistic for $b_1$.
We will generally label the test statistic using a $T$, since it follows the $t$-distribution.
We will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course.
Table \@ref(tab:midtermUnempRegTable) shows software output for the least squares regression line in Figure \@ref(fig:unemploymentAndChangeInHouse).
The row labeled `unemp` includes all relevant information about the slope estimate (i.e., the coefficient of the unemployment variable).
```{r include=FALSE}
terms_chp_24 <- c(terms_chp_24, "t-distribution for slope")
```
```{r midtermUnempRegTable}
d <- midterms_house
th <- !d$year %in% c(1935, 1939)
lm(house_change ~ unemp, d[th, ]) %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
kbl(linesep = "", booktabs = TRUE,
caption = "Output from statistical software for the regression line modeling the midterm election losses for the President's party as a response to unemployment.",
digits = 2, align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position")) %>%
column_spec(1, width = "10em", monospace = TRUE) %>%
column_spec(2:5, width = "5em")
```
::: {.workedexample data-latex=""}
What do the first and second columns of Table \@ref(tab:midtermUnempRegTable) represent?
------------------------------------------------------------------------
The entries in the first column represent the least squares estimates, $b_0$ and $b_1$, and the values in the second column correspond to the standard errors of each estimate.
Using the estimates, we could write the equation for the least squares regression line as
$$ \hat{y} = -7.36 - 0.89 x $$
where $\hat{y}$ in this case represents the predicted change in the number of seats for the president's party, and $x$ represents the unemployment rate.
:::
We previously used a $t$-test statistic for hypothesis testing in the context of numerical data.
Regression is very similar.
In the hypotheses we consider, the null value for the slope is 0, so we can compute the test statistic using the T score formula:
$$
T \ = \ \frac{\text{estimate} - \text{null value}}{\text{SE}} = \ \frac{-0.89 - 0}{0.835} = \ -1.07
$$
This corresponds to the third column of Table \@ref(tab:midtermUnempRegTable).
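As a sketch, the same arithmetic can be reproduced in R using the `d` and `th` objects defined in the chunk above; the degrees of freedom for a simple linear regression are $n - 2$.
```{r, eval = FALSE}
# Recompute the test statistic and two-sided p-value for the slope "by hand".
t_stat <- (-0.89 - 0) / 0.835
df <- sum(th) - 2              # n - 2 degrees of freedom
2 * pt(-abs(t_stat), df = df)  # should be close to the p-value in the table
```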
::: {.workedexample data-latex=""}
Use Table \@ref(tab:midtermUnempRegTable) to determine the p-value for the hypothesis test.
------------------------------------------------------------------------
The last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate: 0.2961.
That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President's party in the House of Representatives in midterm elections.
:::
### Observed statistic vs. null statistics
As the final step in a mathematical hypothesis test for the slope, we use the information provided to make a conclusion about whether the data could have come from a population where the true slope was zero (i.e., $\beta_1 = 0$).
Before evaluating the formal hypothesis claim, sometimes it is important to check your intuition.
Based on everything we have seen in the examples above describing the variability of a line from sample to sample, ask yourself if the linear relationship given by the data could have come from a population in which the slope was truly zero.
::: {.workedexample data-latex=""}
Examine Figure \@ref(fig:elmhurstScatterWLine), which relates the Elmhurst College aid and student family income.
Are you convinced that the slope is meaningfully different from zero?
That is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?
------------------------------------------------------------------------
While the relationship between the variables is not perfect, there is an evident decreasing trend in the data.
This suggests the hypothesis test will reject the null claim that the slope is zero.
:::
::: {.data data-latex=""}
The [`elmhurst`](http://openintrostat.github.io/openintro/reference/elmhurst.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
The tools in this section help you go beyond a visual interpretation of the linear relationship and toward a formal mathematical claim about whether the slope estimate is meaningfully different from 0, suggesting that the true population slope is different from 0.
```{r rOutputForIncomeAidLSRLineInInferenceSection}
elmhurst %>%
mutate(
gift_aid = gift_aid * 1000,
family_income = family_income * 1000
) %>%
lm(gift_aid ~ family_income, .) %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
kbl(linesep = "", booktabs = TRUE,
caption = "Summary of least squares fit for the Elmhurst College data, where we are predicting the gift aid by the university based on the family income of students.",
digits = 2, align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position")) %>%
column_spec(1, width = "17em", monospace = TRUE) %>%
column_spec(2:5, width = "5em")
```
::: {.guidedpractice data-latex=""}
Table \@ref(tab:rOutputForIncomeAidLSRLineInInferenceSection) shows statistical software output from fitting the least squares regression line shown in Figure \@ref(fig:elmhurstScatterWLine).
Use the output to formally evaluate the following hypotheses.[^inf-model-slr-2]
- $H_0$: The true coefficient for family income is zero.
- $H_A$: The true coefficient for family income is not zero.
:::
[^inf-model-slr-2]: We look in the second row corresponding to the family income variable.
We see the point estimate of the slope of the line is -0.0431, the standard error of this estimate is 0.0108, and the $t$-test statistic is $T = -3.98$.
The p-value corresponds exactly to the two-sided test we are interested in: 0.0002.
The p-value is so small that we reject the null hypothesis and conclude that family income and financial aid at Elmhurst College for freshman entering in the year 2011 are negatively correlated and the true slope parameter is indeed less than 0, just as we believed in our analysis of Figure \@ref(fig:elmhurstScatterWLine).
::: {.important data-latex=""}
**Inference for regression.**
We usually rely on statistical software to identify point estimates, standard errors, test statistics, and p-values in practice.
However, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met.
See Section \@ref(tech-cond-linmod).
:::
\clearpage
## Mathematical model, interval for the slope
### Observed data
Similar to how we can conduct a hypothesis test for a model coefficient using regression output, we can also construct a confidence interval for that coefficient.
::: {.workedexample data-latex=""}
Compute the 95% confidence interval for the coefficient using the regression output from Table \@ref(tab:rOutputForIncomeAidLSRLineInInferenceSection).
------------------------------------------------------------------------
The point estimate is -0.0431 and the standard error is $SE = 0.0108$.
When constructing a confidence interval for a model coefficient, we generally use a $t$-distribution.
The degrees of freedom for the distribution are noted in the regression output, $df = 48$, allowing us to identify $t_{48}^{\star} = 2.01$ for use in the confidence interval.
We can now construct the confidence interval in the usual way:
$$
\begin{aligned}
\text{point estimate} &\pm t_{48}^{\star} \times SE \\
-0.0431 &\pm 2.01 \times 0.0108 \\
(-0.0648 &, -0.0214)
\end{aligned}
$$
We are 95% confident that with each dollar increase in family income, the university's gift aid is predicted to decrease on average by \$0.0214 to \$0.0648.
:::
### Variability of the statistic
::: {.important data-latex=""}
**Confidence intervals for coefficients.**
Confidence intervals for model coefficients (e.g., the intercept or the slope) can be computed using the $t$-distribution:
$$ b_i \ \pm\ t_{df}^{\star} \times SE_{b_{i}} $$
where $t_{df}^{\star}$ is the appropriate $t$-value corresponding to the confidence level with the model's degrees of freedom.
:::
On the topic of intervals in this book, we have focused exclusively on confidence intervals for model parameters.
However, there are other types of intervals that may be of interest, including prediction intervals for a response value and confidence intervals for a mean response value in the context of regression.
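As a sketch of how these intervals can be obtained in R (the object name `elmhurst_fit` is hypothetical, and the model is fit on the raw `elmhurst` data, where both variables are recorded in \$1000s):
```{r, eval = FALSE}
# Interval estimates from a fitted simple linear regression.
elmhurst_fit <- lm(gift_aid ~ family_income, data = elmhurst)
qt(0.975, df = 48)        # t-star used in the worked example above
confint(elmhurst_fit)     # 95% confidence intervals for the coefficients
predict(elmhurst_fit,
        newdata = data.frame(family_income = 80),
        interval = "prediction")  # prediction interval for a hypothetical $80k family income
```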
\clearpage
## Checking model conditions {#tech-cond-linmod}
In the previous sections, we used randomization and bootstrapping to perform inference when the mathematical model was not valid due to violations of the technical conditions.
In this section, we'll provide details for when the mathematical model is appropriate and a discussion of technical conditions needed for the randomization and bootstrapping procedures.
```{r include=FALSE}
terms_chp_24 <- c(terms_chp_24, "technical conditions linear regression")
```
### What are the technical conditions for the mathematical model?
When fitting a least squares line, we generally require the following conditions (a residual plot sketch for checking them appears after this list):
- **Linearity.** The data should show a linear trend.
If there is a nonlinear trend (e.g., first panel of Figure \@ref(fig:whatCanGoWrongWithLinearModel)), an advanced regression method from another book or later course should be applied.
- **Independent observations.** Be cautious about applying regression to data which are sequential observations in time, such as a stock price each day.
Such data may have an underlying structure that should be considered in a model and analysis.
An example of a dataset where successive observations are not independent is shown in the fourth panel of Figure \@ref(fig:whatCanGoWrongWithLinearModel).
There are also other instances where correlations within the data are important, which is further discussed in Chapter \@ref(inf-model-mlr).
- **Nearly normal residuals.** Generally, the residuals must be nearly normal.
When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points, which we'll talk about more in Section \@ref(outliers-in-regression).
An example of a residual that would be a potential concern is shown in the second panel of Figure \@ref(fig:whatCanGoWrongWithLinearModel), where one observation is clearly much further from the regression line than the others.
- **Constant or equal variability.** The variability of points around the least squares line remains roughly constant.
An example of non-constant variability is shown in the third panel of Figure \@ref(fig:whatCanGoWrongWithLinearModel), which represents the most common pattern observed when this condition fails: the variability of $y$ is larger when $x$ is larger.
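A sketch of how such residual plots can be produced with `augment()` from the **broom** package (the object name `births_weeks_fit` is hypothetical, refitting the model from earlier in the chapter):
```{r, eval = FALSE}
# Residuals vs. fitted values for the weight ~ weeks model; look for
# curvature, outliers, and non-constant spread around the dashed line.
births_weeks_fit <- lm(weight ~ weeks, data = births14_100)
augment(births_weeks_fit) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")
```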
```{r whatCanGoWrongWithLinearModel, fig.cap = "Four examples showing when the methods in this chapter are insufficient to apply to the data. The top set of graphs represents the $x$ and $y$ relationship. The bottom set of graphs is a residual plot. First panel: linearity fails. Second panel: there are outliers, most especially one point that is very far away from the line. Third panel: the variability of the errors is related to the value of $x$. Fourth panel: a time series dataset is shown, where successive observations are highly correlated.", out.width = "100%", fig.width=10, fig.asp = 0.5}
par_og <- par(no.readonly = TRUE) # save original par
par(mar = rep(0.5, 4))
source("helpers/helper-makeTubeAdv.R")
pch <- 20
cex <- 1.75
col <- IMSCOL["blue", "f2"]
layout(
matrix(1:8, 2),
rep(1, 4),
c(2, 1)
)
these <- simulated_scatter$group == 20
x <- simulated_scatter$x[these]
y <- simulated_scatter$y[these]
plot(x, y,
axes = FALSE,
pch = pch,
cex = cex,
col = "#00000000"
)
box()
makeTube(x, y,
type = "quad",
addLine = FALSE,
col = IMSCOL["lgray", "f2"]
)
points(x, y,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"]
)
g <- lm(y ~ x)
abline(g)
yR <- range(g$residuals)
yR <- yR + c(-1, 1) * diff(yR) / 10
plot(x, g$residuals,
xlab = "", ylab = "",
axes = FALSE,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"],
ylim = yR
)
abline(h = 0, lty = 2)
box()
these <- simulated_scatter$group == 21
x <- simulated_scatter$x[these]
y <- simulated_scatter$y[these]
plot(x, y,
axes = FALSE,
pch = pch,
cex = cex,
col = "#00000000"
)
box()
makeTube(x, y,
addLine = FALSE,
col = IMSCOL["lgray", "f2"]
)
points(x, y,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"]
)
g <- lm(y ~ x)
abline(g)
yR <- range(g$residuals)
yR <- yR + c(-1, 1) * diff(yR) / 10
plot(x, g$residuals,
xlab = "", ylab = "",
axes = FALSE,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"],
ylim = yR
)
abline(h = 0, lty = 2)
box()
these <- simulated_scatter$group == 22
x <- simulated_scatter$x[these]
y <- simulated_scatter$y[these]
plot(x, y,
axes = FALSE,
pch = pch,
cex = cex,
col = "#00000000"
)
box()
makeTubeAdv(x, y,
type = "l",
variance = "l",
bw = 0.03,
Z = 1.7,
col = IMSCOL["lgray", "f2"]
)
points(x, y,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"]
)
g <- lm(y ~ x)
abline(g)
yR <- range(g$residuals)
yR <- yR + c(-1, 1) * diff(yR) / 10
plot(x, g$residuals,
axes = FALSE,
xlab = "", ylab = "",
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"],
ylim = yR
)
abline(h = 0, lty = 2)
box()
these <- simulated_scatter$group == 23
x <- simulated_scatter$x[these]
y <- simulated_scatter$y[these]
plot(x, y,
axes = FALSE,
pch = pch,
cex = cex,
col = "#00000000"
)
box()
makeTube(x, y,
addLine = FALSE,
col = IMSCOL["lgray", "f2"]
)
points(x, y,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"]
)
g <- lm(y ~ x)
abline(g)
yR <- range(g$residuals)
yR <- yR + c(-1, 1) * diff(yR) / 10
plot(x, g$residuals,
axes = FALSE,
xlab = "", ylab = "",
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"],
ylim = yR
)
abline(h = 0, lty = 2)
box()
makeTubeAdv(x, y, col = IMSCOL["lgray", "f2"])
par(par_og) # restore original par
```
::: {.guidedpractice data-latex=""}
Should we have concerns about applying least squares regression to the Elmhurst data in Figure \@ref(fig:elmhurstScatterW2Lines)?[^inf-model-slr-3]
:::
[^inf-model-slr-3]: The trend appears to be linear, the data fall around the line with no obvious outliers, the variance is roughly constant.
These are also not time series observations.
Least squares regression can be applied to these data.
The technical conditions are often remembered using the **LINE** mnemonic.
The linearity, normality, and equality of variance conditions usually can be assessed through residual plots, as seen in Figure \@ref(fig:whatCanGoWrongWithLinearModel).
A careful consideration of the experimental design should be undertaken to confirm that the observed values are indeed independent.
- L: **linear** model
- I: **independent** observations
- N: points are **normally** distributed around the line