# (PART) Inferential modeling {.unnumbered}
```{r, include = FALSE}
source("_common.R")
```
# Inference for linear regression with a single predictor {#inf-model-slr}
\chaptermark{Inference for regression with a single predictor}
::: {.chapterintro data-latex=""}
We now bring together ideas of inferential analyses with the descriptive models seen in Chapter \@ref(model-slr).
In particular, we will use the least squares regression line to test whether there is a relationship between two continuous variables.
Additionally, we will build confidence intervals which quantify the uncertainty in the slope of the linear regression line.
While the setting now focuses on predicting a numeric response variable (for linear models) or a binary response variable (for logistic models), we continue to ask questions about the variability of the model from sample to sample.
The sampling variability will inform the conclusions about the population that can be drawn.
Many of the inferential ideas are remarkably similar to those covered in previous chapters.
The technical conditions for linear models are typically assessed graphically, although independence of observations continues to be of utmost importance.
We encourage the reader to think broadly about the models at hand without putting too much dependence on the exact p-values that are reported from the statistical software.
Inference on models with multiple explanatory variables can suffer from data snooping, which can result in false positive claims.
We provide some guidance and hope the reader will further their statistical learning after working through the material in this text.
:::
```{r include=FALSE}
terms_chp_24 <- c("inference with single precictor regression")
```
## Case study: Sandwich store
### Observed data
We start the chapter with a hypothetical example describing the linear relationship between dollars spent advertising for a chain sandwich restaurant and monthly revenue.
The hypothetical example serves the purpose of illustrating how a linear model varies from sample to sample.
Because we have made up the example and the data (and the entire population), we can take many, many samples from the population to visualize the variability.
Note that in real life, we always have exactly one sample (that is, one dataset), and through the inference process, we imagine what might have happened had we taken a different sample.
The change from sample to sample leads to an understanding of how the single observed dataset is different from the population of values, which is typically the fundamental goal of inference.
Consider the following hypothetical population of all of the sandwich stores of a particular chain seen in Figure \@ref(fig:sandpop).
In this made-up world, the CEO actually has all the relevant data, which is why they can plot it here.
The CEO is omniscient and can write down the population model which describes the true population relationship between the advertising dollars and revenue.
There appears to be a linear relationship between advertising dollars and revenue (both in \$1,000).
```{r sandpop, fig.cap = "Revenue as a linear model of advertising dollars for a population of sandwich stores, in thousands of dollars."}
set.seed(4747)
popsize <- 1000
ad <- rnorm(popsize, 4, 1)
rev <- 12 + 4.7 * ad + rnorm(popsize, 0, 8)
sandwich <- tibble(ad, rev)
ggplot(sandwich, aes(x = ad, y = rev)) +
geom_point(alpha = 0.5, size = 2) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Hypothetical population"
) +
coord_cartesian(ylim = c(0, 65))
```
You may remember from Chapter \@ref(model-slr) that the population model is: $$y = \beta_0 + \beta_1 x + \varepsilon.$$
Again, the omniscient CEO (with the full population information) can write down the true population model as: $$\texttt{expected revenue} = 11.23 + 4.8 \times \texttt{advertising}.$$
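As a minimal sketch (using the `sandwich` population simulated in the chunk above and the `tidy()` helper from **broom**), the CEO's population fit can be reproduced by regressing revenue on advertising for all 1,000 stores; the estimates should closely match the coefficients quoted above.
```{r, eval = FALSE}
# Least squares fit to the entire (hypothetical) population of stores.
# The estimate column should closely match the coefficients quoted above.
lm(rev ~ ad, data = sandwich) %>%
  tidy()
```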
### Variability of the statistic
Unfortunately, in our scenario, the CEO is not willing to part with the full set of data, but they will allow potential franchise buyers to see a small sample of the data in order to help the potential buyer decide whether to set up a new franchise.
The CEO is willing to give each potential franchise buyer a random sample of data from 20 stores.
As with any numerical characteristic which describes a subset of the population, the estimated slope of a sample will vary from sample to sample.
Consider the linear model which describes revenue (in \$1,000) based on advertising dollars (in \$1,000).
The least squares regression model uses the data to find a sample linear fit: $$\hat{y} = b_0 + b_1 x.$$
A random sample of 20 stores shows a different least squares regression line depending on which observations are selected.
A subset of 20 stores shows a similar positive trend between advertising and revenue (compared to what we saw in Figure \@ref(fig:sandpop), which described the population) despite having fewer observations on the plot.
```{r echo = FALSE}
set.seed(470)
sandwich2 <- sandwich %>%
sample_n(size = 20)
sandwich3 <- sandwich %>%
sample_n(size = 20)
sandwich_many <- sandwich %>%
rep_sample_n(size = 20, replace = FALSE, reps = 50)
```
```{r fig.cap = "A random sample of 20 stores from the entire population. A linear trend between advertising and revenue continues to be observed."}
ggplot(sandwich2, aes(x = ad, y = rev)) +
geom_point(size = 3, fill = IMSCOL["green", "full"], color = "#FFFFFF", shape = 22) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE, color = IMSCOL["green", "full"]) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Random sample of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
A second sample of size 20 also shows a positive trend!
```{r fig.cap = "A different random sample of 20 stores from the entire population. Again, a linear trend between advertising and revenue is observed."}
ggplot(sandwich3, aes(x = ad, y = rev)) +
geom_point(size = 3, fill = IMSCOL["gray", "full"], color = "#FFFFFF", shape = 23) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE, color = IMSCOL["gray", "full"]) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Random sample of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
But the lines are slightly different!
```{r fig.cap = "The linear models from the two different random samples are quite similar, but they are not the same line."}
ggplot() +
geom_point(data = sandwich2, aes(x = ad, y = rev),
size = 3, shape = 22,
fill = IMSCOL["green", "full"], color = "#FFFFFF") +
geom_smooth(data = sandwich2, aes(x = ad, y = rev),
method = "lm", se = FALSE, fullrange = TRUE,
color = IMSCOL["green", "full"]) +
geom_point(data = sandwich3, aes(x = ad, y = rev),
size = 3, shape = 23,
fill = IMSCOL["gray", "full"], color = "#FFFFFF") +
geom_smooth(data = sandwich3, aes(x = ad, y = rev),
method = "lm", se = FALSE, fullrange = TRUE,
color = IMSCOL["gray", "full"]) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Two random samples of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
That is, there is **variability** in the regression line from sample to sample.
The concept of the sampling variability is something you've seen before, but in this chapter, you will focus on the variability of the line, often measured through the variability of a single statistic: **the slope of the line**.
```{r include=FALSE}
terms_chp_24 <- c(terms_chp_24, "variability of the slope")
```
```{r slopes, fig.cap = "If repeated samples of size 20 are taken from the entire population, each linear model will be slightly different. The red line provides the linear fit to the entire population."}
ggplot() +
geom_smooth(
data = sandwich_many, aes(x = ad, y = rev, group = replicate),
method = "lm", se = FALSE, fullrange = TRUE,
color = IMSCOL["gray", "f2"], size = 0.5
) +
geom_smooth(
data = sandwich, aes(x = ad, y = rev), method = "lm", se = FALSE,
fullrange = TRUE, color = IMSCOL["red", "full"]
) +
scale_x_continuous(labels = label_dollar(suffix = "K")) +
scale_y_continuous(labels = label_dollar(suffix = "K")) +
labs(
x = "Amount spent on advertising",
y = "Revenue",
title = "Chain sandwich store",
subtitle = "Many random samples of 20 stores"
) +
coord_cartesian(ylim = c(0, 65))
```
You might notice in Figure \@ref(fig:slopes) that the $\hat{y}$ values given by the lines are much more consistent in the middle of the dataset than at the ends.
The reason is that the data itself anchors the lines in such a way that the line must pass through the center of the data cloud.
The effect of the fan-shaped lines is that predicted revenue for advertising close to \$4,000 will be much more precise than the revenue predictions made for \$1,000 or \$7,000 of advertising.
The distribution of slopes (for samples of size $n=20$) can be seen in a histogram, as in Figure \@ref(fig:sand20lm).
```{r sand20lm, fig.cap="Variability of slope estimates taken from many different samples of stores, each of size 20."}
sandwich_many_lm <- sandwich_many %>%
group_by(replicate) %>%
do(tidy(lm(rev ~ ad, data = .))) %>%
filter(term == "ad")
ggplot(sandwich_many_lm, aes(x = estimate)) +
geom_histogram(binwidth = 0.5) +
labs(
x = "Slope estimate",
y = "Count",
title = "Chain sandwich store",
subtitle = "Many random samples of 20 stores"
)
```
Recall that the example described in this introduction is hypothetical.
That is, we created an entire population in order to demonstrate how the slope of a line would vary from sample to sample.
The tools in this textbook are designed to evaluate a single sample of data.
With actual studies, we do not have repeated samples, so we cannot rely on repeated sampling to visualize the variability in slopes.
We have seen variability in samples throughout this text, so it should not come as a surprise that different samples will produce different linear models.
However, it is nice to visually consider the linear models produced by different samples.
Additionally, as with measuring the variability of previous statistics (e.g., $\overline{X}_1 - \overline{X}_2$ or $\hat{p}_1 - \hat{p}_2$), the histogram of the sample statistics can provide information related to inferential considerations.
In the following sections, the distribution (i.e., histogram) of $b_1$ (the estimated slope coefficient) will be constructed in the same three ways that, by now, may be familiar to you.
First (in Section \@ref(randslope)), the distribution of $b_1$ when $\beta_1 = 0$ is constructed by randomizing (permuting) the response variable.
Next (in Section \@ref(bootbeta1)), we can bootstrap the data by taking random samples of size $n$ from the original dataset.
And last (in Section \@ref(mathslope)), we use mathematical tools to describe the variability using the $t$-distribution that was first encountered in Section \@ref(one-mean-math).
## Randomization test for the slope {#randslope}
Consider data on 100 randomly selected births gathered originally from the US Department of Health and Human Services.
Some of the variables are plotted in Figure \@ref(fig:babyweight).
The scientific research interest at hand will be in determining the linear relationship between weight of baby at birth (in lbs) and number of weeks of gestation.
The dataset is quite rich and deserves exploring, but for this example, we will focus only on the weight of the baby.
::: {.data data-latex=""}
The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
We will work with a random sample of 100 observations from these data.
:::
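As a quick sketch (assuming the **openintro** package is loaded along with the tidyverse), the available variables can be previewed before drawing the sample of 100 observations:
```{r, eval = FALSE}
# Preview the variables in the full births14 data before sampling.
library(openintro)
glimpse(births14)
```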
```{r include=FALSE}
terms_chp_24 <- c(terms_chp_24, "randomization test for the slope")
```
```{r babyweight, fig.cap = "Weight of baby at birth (in lbs) as plotted by four other birth variables (mother's weight gain, mother's age, number of hospital visits, and weeks gestation)."}
set.seed(47)
births14_100 <- births14 %>%
drop_na() %>%
sample_n(100)
p_gained <- ggplot(births14_100) +
geom_jitter(aes(y = weight, x = gained)) +
xlab("weight gained by mother") +
ylab("weight of baby")
p_mage <- ggplot(births14_100) +
geom_jitter(aes(y = weight, x = mage)) +
xlab("mother's age") +
ylab("weight of baby")
p_visits <- ggplot(births14_100) +
geom_jitter(aes(y = weight, x = visits)) +
xlab("number of hospital visits during pregnancy") +
ylab("weight of baby")
p_weeks <- ggplot(births14_100) +
geom_jitter(aes(x = weeks, y = weight)) +
xlab("weeks of gestation") +
ylab("weight of baby")
p_gained + p_mage + p_visits + p_weeks +
plot_layout(ncol = 2)
```
As you have seen previously, statistical inference typically relies on setting a null hypothesis which is hoped to be subsequently rejected.
In the linear model setting, we might hope to have a linear relationship between `weeks` and `weight` in settings where `weeks` gestation is known and `weight` of baby needs to be predicted.
The relevant hypotheses for the linear model setting can be written in terms of the population slope parameter.
Here the population refers to a larger population of births in the US.
- $H_0: \beta_1= 0$, there is no linear relationship between `weight` and `weeks`.
- $H_A: \beta_1 \ne 0$, there is some linear relationship between `weight` and `weeks`.
Recall that for the randomization test, we permute one variable to eliminate any existing relationship between the variables.
That is, we set the null hypothesis to be true, and we measure the natural variability in the data due to sampling but **not** due to variables being correlated.
Figure \@ref(fig:permweightScatter) shows the observed data and a scatterplot of one permutation of the `weight` variable.
The careful observer can see that each of the observed values for `weight` (and for `weeks`) exists in both the original data plot and the permuted `weight` plot, but the `weight` and `weeks` gestation values are no longer matched for a given birth.
That is, each `weight` value is randomly assigned to a new `weeks` gestation.
```{r permweightScatter, fig.cap = "(ref:permweightScatter-cap)", fig.asp = 0.5}
set.seed(4747)
p1 <- ggplot(births14_100) +
geom_point(aes(x = weeks, y = weight)) +
labs(
x = "Length of gestation (weeks)",
y = "Weight of baby (pounds)",
title = "Original data"
)
p2 <- ggplot(births14_100) +
geom_point(aes(x = weeks, y = sample(weight))) +
labs(
x = "Length of gestation (weeks)",
y = "Permuted weight of baby (pounds)",
title = "Permuted data"
)
p1 + p2
```
(ref:permweightScatter-cap) Original (left) and permuted (right) data. The permutation removes the linear relationship between `weight` and `weeks`. Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true).
By repeatedly permuting the response variable, any pattern in the linear model that is observed is due only to random chance (and not an underlying relationship).
The randomization test compares the slopes calculated from the permuted response variable with the observed slope.
If the observed slope is inconsistent with the slopes from permuting, we can conclude that there is some underlying relationship (and that the slope is not merely due to random chance).
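A minimal sketch of one such permutation (using the `births14_100` sample created for Figure \@ref(fig:babyweight)) shuffles the response, refits the line, and records the slope; repeating this many times builds the null distribution of slopes.
```{r, eval = FALSE}
# One permutation "by hand": shuffle weight, keep weeks fixed,
# refit the least squares line, and keep the slope on the permuted data.
births14_100 %>%
  mutate(weight_permuted = sample(weight)) %>%
  lm(weight_permuted ~ weeks, data = .) %>%
  tidy() %>%
  filter(term == "weeks")
```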
### Observed data
We will continue to use the births data to investigate the linear relationship between `weight` and `weeks` gestation.
Note that the least squares model (see Chapter \@ref(model-slr)) describing the relationship is given in Table \@ref(tab:ls-births).
The columns in Table \@ref(tab:ls-births) are further described in Section \@ref(mathslope).
```{r ls-births}
lm(weight ~ weeks, data = births14_100) %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
kbl(linesep = "", booktabs = TRUE,
caption = "The least squares estimates of the intercept and slope are given in the estimate column. The observed slope is 0.335.",
digits = 2, align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position")) %>%
column_spec(1, width = "10em", monospace = TRUE) %>%
column_spec(2:5, width = "5em")
```
### Variability of the statistic
After permuting the data, the least squares estimate of the line can be computed.
Repeated permutations and slope calculations describe the variability in the line (i.e., in the slope) due only to the natural variability and not due to a relationship between `weight` and `weeks` gestation.
Figure \@ref(fig:permweekslm) shows two different permutations of `weight` and the resulting linear models.
```{r permweekslm, fig.cap = "(ref:permweekslm-cap)", fig.asp = 0.5}
set.seed(470)
p1 <- ggplot(births14_100, aes(x = weeks, y = sample(weight))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
labs(
y = "Permuted weight of baby (pounds)",
x = "Length of gestation (weeks)",
title = "First permutation of \nweight"
)
p2 <- ggplot(births14_100, aes(x = weeks, y = sample(weight))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
labs(
y = "Permuted weight of baby (pounds)",
x = "Length of gestation (weeks)",
title = "Second permutation of \nweight"
)
p1 + p2
```
(ref:permweekslm-cap) Two different permutations of the `weight` variable with slightly different least squares regression lines.
As you can see, sometimes the slope of the permuted data is positive, sometimes it is negative.
Because the randomization happens under the condition of no underlying relationship (because the response variable is completely mixed with the explanatory variable), we expect to see the center of the randomized slope distribution to be zero.
### Observed statistic vs. null statistics
```{r nulldistBirths, fig.cap = "(ref:nulldistBirths-cap)"}
perm_slope <- births14_100 %>%
specify(weight ~ weeks) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "slope")
obs_slope <- births14_100 %>%
specify(weight ~ weeks) %>%
calculate(stat = "slope") %>%
pull()
ggplot(data = perm_slope, aes(x = stat)) +
geom_histogram() +
geom_vline(xintercept = obs_slope, color = IMSCOL["red", "full"]) +
labs(x = "Randomly generated slopes", y = "Count")
```
(ref:nulldistBirths-cap) Histogram of slopes given different permutations of the `weight` variable. The vertical red line is at the observed value of the slope, 0.335.
As we can see from Figure \@ref(fig:nulldistBirths), a slope estimate as extreme as the observed slope estimate (the red line) never occurred across the many repeated permutations of the `weight` variable.
That is, if indeed there were no linear relationship between `weight` and `weeks`, the natural variability of the slopes would produce estimates between approximately -0.15 and +0.15.
We reject the null hypothesis.
Therefore, we believe that the slope observed on the original data is not just due to natural variability and indeed, there is a linear relationship between `weight` of baby and `weeks` gestation for births in the US.
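A sketch of how this comparison could be formalized with the **infer** package, reusing the `perm_slope` and `obs_slope` objects created in the chunk above:
```{r, eval = FALSE}
# Compare the observed slope to the permutation (null) distribution.
# If no permuted slope is as extreme as the observed slope, the p-value
# is reported as 0 and should be interpreted as p < 1/1000.
perm_slope %>%
  get_p_value(obs_stat = obs_slope, direction = "two-sided")
```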
## Bootstrap confidence interval for the slope {#bootbeta1}
As we have seen in previous chapters, we can use bootstrapping to estimate the sampling distribution of the statistic of interest (here, the slope) without the null assumption of no relationship (which was the condition in the randomization test).
Because interest is now in creating a CI, there is no null hypothesis, so there won't be any reason to permute either of the variables.
```{r include=FALSE}
terms_chp_24 <- c(terms_chp_24, "bootstrap CI for the slope")
```
### Observed data
Returning to the births data, we may want to consider the relationship between `mage` (mother's age) and `weight`.
Is `mage` a good predictor of `weight`?
And if so, what is the relationship?
That is, what is the slope that models average `weight` of baby as a function of `mage` (mother's age)?
The linear model regressing `weight` on `mage` is provided in Table \@ref(tab:ls-births-mage).
```{r echo = FALSE}
set.seed(4747)
births4 <- births14_100 %>%
sample_n(size = 100, replace = TRUE)
births5 <- births14_100 %>%
sample_n(size = 100, replace = TRUE)
births_many_BS <- births14_100 %>%
rep_sample_n(size = 100, replace = TRUE, reps = 50)
```
```{r magePlot, fig.cap = "(ref:magePlot-cap)"}
ggplot(births14_100) +
geom_point(aes(x = mage, y = weight)) +
geom_smooth(aes(x = mage, y = weight),
method = "lm", se = FALSE,
fullrange = TRUE,
) +
labs(x = "Mother's age", y = "Weight of baby")
```
(ref:magePlot-cap) Original data: `weight` of baby as a linear model of mother's age. Notice that the relationship between `mage` and `weight` is not as strong as the relationship we saw previously between `weeks` and `weight`.
```{r ls-births-mage}
lm(weight ~ mage, data = births14_100) %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
kbl(linesep = "", booktabs = TRUE,
caption = "The least squares estimates of the intercept and slope are given in the estimate column. The observed slope is 0.036",
digits = 2, align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position")) %>%
column_spec(1, width = "10em", monospace = TRUE) %>%
column_spec(2:5, width = "5em")
```
### Variability of the statistic
Because the focus is not on a null distribution, we sample with replacement $n=100$ observations from the original dataset.
Recall that with bootstrapping the resample always has the same number of observations as the original dataset in order to mimic the process of taking a sample from the population.
When sampling in the linear model case, consider each observation to be a single dot.
If the dot is resampled, both the `weight` and the `mage` measurement are observed.
The measurements are linked to the dot (i.e., to the birth in the sample).
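A minimal sketch of a single bootstrap resample makes this concrete: rows are drawn with replacement (keeping each `mage`/`weight` pair intact), and the line is refit to the resampled data.
```{r, eval = FALSE}
# One bootstrap resample: draw 100 rows with replacement from the
# original sample, keeping each (mage, weight) pair together,
# then refit the least squares line.
births14_100 %>%
  sample_n(size = nrow(births14_100), replace = TRUE) %>%
  lm(weight ~ mage, data = .) %>%
  tidy() %>%
  filter(term == "mage")
```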
```{r birth2BS, fig.cap = "Original and one bootstrap sample of the births data. Note that it is difficult to differentiate the two plots, as (within a single bootstrap sample) the observations which have been resampled twice are plotted as points on top of one another. The red circles represent points in the original data which were not included in the bootstrap sample. The blue circles represent a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample. The green circles represent a particular structure in the data which is observed in both the original and bootstrap samples.", fig.asp = 0.5}
p1 <- ggplot(births14_100) +
geom_point(aes(x = mage, y = weight), alpha = 0.4) +
geom_smooth(aes(x = mage, y = weight),
method = "lm", se = FALSE,
fullrange = TRUE
) +
xlab("mother's age") +
ylab("weight of baby") +
ggtitle("Original births data.") +
geom_point(x = 29, y = 9, color = IMSCOL["green", "full"], pch = 1, size = 8) +
geom_point(x = 20, y = 6, color = IMSCOL["red", "full"], pch = 1, size = 8) +
geom_point(x = 39, y = 9, pch = 1, size = 8)
p2 <- ggplot(births5) +
geom_point(aes(x = mage, y = weight), alpha = 0.4) +
geom_smooth(aes(x = mage, y = weight),
method = "lm", se = FALSE,
fullrange = TRUE
) +
xlab("mother's age") +
ylab("weight of baby") +
ggtitle("Bootstrap sample from \nthe births data.") +
geom_point(x = 29, y = 9, color = IMSCOL["green", "full"], pch = 1, size = 8) +
geom_point(x = 20, y = 6, color = IMSCOL["red", "full"], pch = 1, size = 8) +
geom_point(x = 39, y = 9, pch = 1, size = 8)
p1 + p2
```
Figure \@ref(fig:birth2BS) shows the original data as compared with a single bootstrap sample, resulting in (slightly) different linear models.
The red circles represent points in the original data which were not included in the bootstrap sample.
The blue circles represent a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample.
The green circles represent a particular structure in the data which is observed in both the original and bootstrap samples.
By repeatedly resampling, we can see dozens of bootstrapped slopes on the same plot in Figure \@ref(fig:birthBS).
```{r birthBS, fig.cap = "Repeated bootstrap resamples of size 100 are taken from the original data. Each of the bootstrapped linear models is slightly different."}
ggplot(births_many_BS, aes(x = mage, y = weight, group = replicate)) +
geom_smooth(method = "lm", se = FALSE, color = IMSCOL["blue", "full"], fullrange = TRUE, alpha = 0.8) +
xlab("mother's age") +
ylab("weight of baby")
```
Recall that in order to create a confidence interval for the slope, we need to find the range of values that the statistic (here the slope) takes on from different bootstrap samples.
Figure \@ref(fig:mageBSslopes) is a histogram of the relevant bootstrapped slopes.
We can see that a 95% bootstrap percentile interval for the true population slope is given by (-0.01, 0.081).
We are 95% confident that for the model describing the population of births, relating mother's age and `weight` of baby, a one unit increase in `mage` (in years) will be associated with a change in predicted average baby `weight` of between -0.01 and 0.081 pounds (notice that the CI overlaps zero!).
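As a sketch, the same kind of percentile interval can be computed directly with the **infer** package; the endpoints will vary slightly from run to run unless a seed is set.
```{r, eval = FALSE}
# Bootstrap the slope 1000 times and take the middle 95% of the estimates.
births14_100 %>%
  specify(weight ~ mage) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") %>%
  get_confidence_interval(level = 0.95, type = "percentile")
```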
```{r mageBSslopes, fig.cap = "(ref:mageBSslopes-cap)"}
set.seed(47)
births_many_BS_1000 <- births14_100 %>%
rep_sample_n(size = 100, replace = TRUE, reps = 1000)
births_many_lm_BS <- births_many_BS_1000 %>%
group_by(replicate) %>%
do(tidy(lm(weight ~ mage, data = .))) %>%
filter(term == "mage")
lower <- round(quantile(births_many_lm_BS$estimate, 0.025), 3)
upper <- round(quantile(births_many_lm_BS$estimate, 0.975), 3)
ggplot(births_many_lm_BS, aes(x = estimate)) +
geom_histogram() +
annotate("segment", x = lower, xend = lower, y = 0, yend = 25, linetype = "dashed") +
annotate("segment", x = upper, xend = upper, y = 0, yend = 25, linetype = "dashed") +
annotate("text", x = lower, y = 35, label = "2.5 percentile\n-0.01", size = 5) +
annotate("text", x = upper, y = 35, label = "97.5 percentile\n0.081", size = 5) +
labs(
x = "Bootstrapped values of the slope modeling weight on mother's age",
y = NULL
) +
theme(axis.text.y = element_blank())
```
(ref:mageBSslopes-cap) The original births data on `weight` and `mage` is bootstrapped 1,000 times. The histogram provides a sense for the variability of the slope of the linear model slope from sample to sample.
::: {.workedexample data-latex=""}
Using Figure \@ref(fig:mageBSslopes), calculate the bootstrap estimate for the standard error of the slope.
Using the bootstrap standard error, find a 95% bootstrap SE confidence interval for the true population slope, and interpret the interval in context.
------------------------------------------------------------------------
Notice that most of the bootstrapped slopes fall between -0.01 and +0.08 (a range of 0.09).
Using the empirical rule (that with bell-shaped distributions, most observations are within two standard errors of the center), the standard error of the slopes is approximately 0.0225.
The normal cutoff for a 95% confidence interval is $z^\star = 1.96$ which leads to a confidence interval of $b_1 \pm 1.96 \cdot SE \rightarrow 0.036 \pm 1.96 \cdot 0.0225 \rightarrow (-0.0081, 0.0801).$ The bootstrap SE confidence interval is almost identical to the bootstrap percentile interval.
In context, we are 95% confident that for the model describing the population of births, relating mother's age and `weight` of baby, a one unit increase in `mage` (in years) will be associated with a change in predicted average baby `weight` of between -0.0081 and 0.0801 pounds.
A sketch of this calculation in R follows this example.
:::
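A sketch of the same calculation in R, using the `births_many_lm_BS` bootstrap slopes created for Figure \@ref(fig:mageBSslopes); the standard deviation of the bootstrapped slopes should be close to the empirical-rule approximation of 0.0225.
```{r, eval = FALSE}
# Bootstrap SE interval: estimate +/- 1.96 * SE, where SE is the
# standard deviation of the bootstrapped slope estimates.
boot_se <- sd(births_many_lm_BS$estimate)
boot_se
0.036 + c(-1, 1) * 1.96 * boot_se
```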
## Mathematical model for testing the slope {#mathslope}
When certain technical conditions apply, it is convenient to use mathematical approximations to test and estimate the slope parameter.
The approximations will build on the t-distribution which was described in Chapter \@ref(inference-one-mean).
The mathematical model is often correct and is usually easy to implement computationally.
The validity of the technical conditions will be considered in detail in Section \@ref(tech-cond-linmod).
In this section, we discuss uncertainty in the estimates of the slope and y-intercept for a regression line.
Just as we identified standard errors for point estimates in previous chapters, we first discuss standard errors for these new estimates.
### Observed data
**Midterm elections and unemployment**
Elections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S. Presidential election.
The set of House elections occurring during the middle of a Presidential term are called midterm elections.
In America's two-party system (the vast majority of House members through history have been either Republicans or Democrats), one political theory suggests the higher the unemployment rate, the worse the President's party will do in the midterm elections.
In 2020 there were 232 Democrats, 198 Republicans, and 1 Libertarian in the House.
To assess the validity of this claim, we can compile historical data and look for a connection.
We consider every midterm election from 1898 to 2018, with the exception of those elections during the Great Depression.
The House of Representatives is made up of 435 voting members.
::: {.data data-latex=""}
The [`midterms_house`](http://openintrostat.github.io/openintro/reference/midterms_house.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
Figure \@ref(fig:unemploymentAndChangeInHouse) shows these data and the least-squares regression line:
$$
\begin{aligned}
&\texttt{percent change in House seats for President's party} \\
&\qquad\qquad= -7.36 - 0.89 \times \texttt{(unemployment rate)}
\end{aligned}
$$
We consider the percent change in the number of seats of the President's party (e.g., percent change in the number of seats for Republicans in 2018) against the unemployment rate.
Examining the data, there are no clear deviations from linearity or substantial outliers (see Section \@ref(resids) for a discussion on using residuals to visualize how well a linear model fits the data).
While the data are collected sequentially, a separate analysis was used to check for any apparent correlation between successive observations; no such correlation was found.
```{r unemploymentAndChangeInHouse, fig.cap="The percent change in House seats for the President's party in each election from 1898 to 2018 plotted against the unemployment rate. The two points for the Great Depression have been removed, and a least squares regression line has been fit to the data."}
years_to_label <- c(2019, 2003, 1995, 1983, 2011, 1899)
midterms_house_with_labels <- midterms_house %>%
mutate(
label_top = word(potus, -1),
label_bottom = year - 1
)
midterms_house_with_labels %>%
filter(!(year %in% c(1935, 1939))) %>%
ggplot(aes(x = unemp, y = house_change)) +
geom_point(aes(color = party, shape = party), size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
scale_color_openintro() +
scale_shape_manual(values = c(16, 17)) +
scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1), limits = c(3.5, 12.1), breaks = c(4, 8, 12)) +
scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1), limits = c(-31, 11), breaks = c(-30, -20, -10, 0, 10)) +
labs(
x = "Percent unemployed",
y = "Percent change in House seats for the President's party",
color = "Party", shape = "Party"
) +
geom_text(data = midterms_house_with_labels %>% filter(year %in% years_to_label),
aes(x = unemp, y = house_change + 2, label = paste0(label_top, "\n", label_bottom)), color = IMSCOL["black", "full"]
) +
theme(
legend.position = c(0.8, 0.8),
legend.background = element_rect(color = "white")
)
```
::: {.guidedpractice data-latex=""}
The data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively.
Do you agree that they should be removed for this investigation?
Why or why not?[^inf-model-slr-1]
:::
[^inf-model-slr-1]: The answer to this question relies on the idea that statistical data analysis is somewhat of an art.
That is, in many situations, there is no "right" answer.
As you do more and more analyses on your own, you will come to recognize the nuanced understanding which is needed for a particular dataset.
In terms of the Great Depression, we will provide two contrasting considerations.
Each of these points would have very high leverage on any least-squares regression line, and years with such high unemployment may not help us understand what would happen in other years where the unemployment is only modestly high.
On the other hand, these are exceptional cases, and we would be discarding important information if we exclude them from a final analysis.
There is a negative slope in the line shown in Figure \@ref(fig:unemploymentAndChangeInHouse).
However, this slope (and the y-intercept) are only estimates of the parameter values.
We might wonder, is this convincing evidence that the "true" linear model has a negative slope?
That is, do the data provide strong evidence that the political theory is accurate, with the unemployment rate serving as a useful predictor of how the President's party fares in the midterm elections?
We can frame this investigation into a statistical hypothesis test:
- $H_0$: $\beta_1 = 0$. The true linear model has slope zero.
- $H_A$: $\beta_1 \neq 0$. The true linear model has a slope different than zero. The unemployment is predictive of whether the President's party wins or loses seats in the House of Representatives.
We would reject $H_0$ in favor of $H_A$ if the data provide strong evidence that the true slope parameter is different than zero.
To assess the hypotheses, we identify a standard error for the estimate, compute an appropriate test statistic, and identify the p-value.
### Variability of the statistic
Just like other point estimates we have seen before, we can compute a standard error and test statistic for $b_1$.
We will generally label the test statistic using a $T$, since it follows the $t$-distribution.
We will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course.
Table \@ref(tab:midtermUnempRegTable) shows software output for the least squares regression line in Figure \@ref(fig:unemploymentAndChangeInHouse).
The row labeled `unemp` includes all relevant information about the slope estimate (i.e., the coefficient of the unemployment variable).
```{r include=FALSE}
terms_chp_24 <- c(terms_chp_24, "t-distribution for slope")
```
```{r midtermUnempRegTable}
d <- midterms_house
th <- !d$year %in% c(1935, 1939)
lm(house_change ~ unemp, d[th, ]) %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
kbl(linesep = "", booktabs = TRUE,
caption = "Output from statistical software for the regression line modeling the midterm election losses for the President's party as a response to unemployment.",
digits = 2, align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position")) %>%
column_spec(1, width = "10em", monospace = TRUE) %>%
column_spec(2:5, width = "5em")
```
::: {.workedexample data-latex=""}
What do the first and second columns of Table \@ref(tab:midtermUnempRegTable) represent?
------------------------------------------------------------------------
The entries in the first column represent the least squares estimates, $b_0$ and $b_1$, and the values in the second column correspond to the standard errors of each estimate.
Using the estimates, we could write the equation for the least squares regression line as
$$ \hat{y} = -7.36 - 0.89 x $$
where $\hat{y}$ in this case represents the predicted change in the number of seats for the president's party, and $x$ represents the unemployment rate.
:::
We previously used a $t$-test statistic for hypothesis testing in the context of numerical data.
Regression is very similar.
In the hypotheses we consider, the null value for the slope is 0, so we can compute the test statistic using the T score formula:
$$
T \ = \ \frac{\text{estimate} - \text{null value}}{\text{SE}} = \ \frac{-0.89 - 0}{0.835} = \ -1.07
$$
This corresponds to the third column of Table \@ref(tab:midtermUnempRegTable).
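As a sketch, the same arithmetic can be reproduced in R using the `d` and `th` objects defined in the chunk above; the degrees of freedom for a simple linear regression are $n - 2$.
```{r, eval = FALSE}
# Recompute the test statistic and two-sided p-value for the slope "by hand".
t_stat <- (-0.89 - 0) / 0.835
df <- sum(th) - 2              # n - 2 degrees of freedom
2 * pt(-abs(t_stat), df = df)  # should be close to the p-value in the table
```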
::: {.workedexample data-latex=""}
Use Table \@ref(tab:midtermUnempRegTable) to determine the p-value for the hypothesis test.
------------------------------------------------------------------------
The last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate: 0.2961.
That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President's party in the House of Representatives in midterm elections.
:::
### Observed statistic vs. null statistics
As the final step in a mathematical hypothesis test for the slope, we use the information provided to make a conclusion about whether the data could have come from a population where the true slope was zero (i.e., $\beta_1 = 0$).
Before evaluating the formal hypothesis claim, sometimes it is important to check your intuition.
Based on everything we have seen in the examples above describing the variability of a line from sample to sample, ask yourself if the linear relationship given by the data could have come from a population in which the slope was truly zero.
::: {.workedexample data-latex=""}
Examine Figure \@ref(fig:elmhurstScatterWLine), which relates the Elmhurst College aid and student family income.
Are you convinced that the slope is meaningfully different from zero?
That is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?
------------------------------------------------------------------------
While the relationship between the variables is not perfect, there is an evident decreasing trend in the data.
This suggests the hypothesis test will reject the null claim that the slope is zero.
:::
::: {.data data-latex=""}
The [`elmhurst`](http://openintrostat.github.io/openintro/reference/elmhurst.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
The tools in this section help you go beyond a visual interpretation of the linear relationship and toward a formal mathematical claim about whether the slope estimate is meaningfully different from 0, suggesting that the true population slope is different from 0.
```{r rOutputForIncomeAidLSRLineInInferenceSection}
elmhurst %>%
mutate(
gift_aid = gift_aid * 1000,
family_income = family_income * 1000
) %>%
lm(gift_aid ~ family_income, .) %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
kbl(linesep = "", booktabs = TRUE,
caption = "Summary of least squares fit for the Elmhurst College data, where we are predicting the gift aid by the university based on the family income of students.",
digits = 2, align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position")) %>%
column_spec(1, width = "17em", monospace = TRUE) %>%
column_spec(2:5, width = "5em")
```
::: {.guidedpractice data-latex=""}
Table \@ref(tab:rOutputForIncomeAidLSRLineInInferenceSection) shows statistical software output from fitting the least squares regression line shown in Figure \@ref(fig:elmhurstScatterWLine).
Use the output to formally evaluate the following hypotheses.[^inf-model-slr-2]
- $H_0$: The true coefficient for family income is zero.
- $H_A$: The true coefficient for family income is not zero.
:::
[^inf-model-slr-2]: We look in the second row corresponding to the family income variable.
We see the point estimate of the slope of the line is -0.0431, the standard error of this estimate is 0.0108, and the $t$-test statistic is $T = -3.98$.
The p-value corresponds exactly to the two-sided test we are interested in: 0.0002.
The p-value is so small that we reject the null hypothesis and conclude that family income and financial aid at Elmhurst College for freshman entering in the year 2011 are negatively correlated and the true slope parameter is indeed less than 0, just as we believed in our analysis of Figure \@ref(fig:elmhurstScatterWLine).
::: {.important data-latex=""}
**Inference for regression.**
We usually rely on statistical software to identify point estimates, standard errors, test statistics, and p-values in practice.
However, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met.
See Section \@ref(tech-cond-linmod).
:::
\clearpage
## Mathematical model, interval for the slope
### Observed data
Similar to how we can conduct a hypothesis test for a model coefficient using regression output, we can also construct a confidence interval for that coefficient.
::: {.workedexample data-latex=""}
Compute the 95% confidence interval for the coefficient using the regression output from Table \@ref(tab:rOutputForIncomeAidLSRLineInInferenceSection).
------------------------------------------------------------------------
The point estimate is -0.0431 and the standard error is $SE = 0.0108$.
When constructing a confidence interval for a model coefficient, we generally use a $t$-distribution.
The degrees of freedom for the distribution are noted in the regression output, $df = 48$, allowing us to identify $t_{48}^{\star} = 2.01$ for use in the confidence interval.
We can now construct the confidence interval in the usual way:
$$
\begin{aligned}
\text{point estimate} &\pm t_{48}^{\star} \times SE \\
-0.0431 &\pm 2.01 \times 0.0108 \\
(-0.0648 &, -0.0214)
\end{aligned}
$$
We are 95% confident that with each dollar increase in family income, the university's gift aid is predicted to decrease on average by \$0.0214 to \$0.0648.
:::
### Variability of the statistic
::: {.important data-latex=""}
**Confidence intervals for coefficients.**
Confidence intervals for model coefficients (e.g., the intercept or the slope) can be computed using the $t$-distribution:
$$ b_i \ \pm\ t_{df}^{\star} \times SE_{b_{i}} $$
where $t_{df}^{\star}$ is the appropriate $t$-value corresponding to the confidence level with the model's degrees of freedom.
:::
On the topic of intervals in this book, we have focused exclusively on confidence intervals for model parameters.
However, there are other types of intervals that may be of interest, including prediction intervals for a response value and confidence intervals for a mean response value in the context of regression.
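As a sketch of how these intervals can be obtained in R (the object name `elmhurst_fit` is hypothetical, and the model is fit on the raw `elmhurst` data, where both variables are recorded in \$1000s):
```{r, eval = FALSE}
# Interval estimates from a fitted simple linear regression.
elmhurst_fit <- lm(gift_aid ~ family_income, data = elmhurst)
qt(0.975, df = 48)        # t-star used in the worked example above
confint(elmhurst_fit)     # 95% confidence intervals for the coefficients
predict(elmhurst_fit,
        newdata = data.frame(family_income = 80),
        interval = "prediction")  # prediction interval for a hypothetical $80k family income
```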
\clearpage
## Checking model conditions {#tech-cond-linmod}
In the previous sections, we used randomization and bootstrapping to perform inference when the mathematical model was not valid due to violations of the technical conditions.
In this section, we'll provide details for when the mathematical model is appropriate and a discussion of technical conditions needed for the randomization and bootstrapping procedures.
```{r include=FALSE}
terms_chp_24 <- c(terms_chp_24, "technical conditions linear regression")
```
### What are the technical conditions for the mathematical model?
When fitting a least squares line, we generally require the following conditions (a residual plot sketch for checking them appears after this list):
- **Linearity.** The data should show a linear trend.
If there is a nonlinear trend (e.g., first panel of Figure \@ref(fig:whatCanGoWrongWithLinearModel)), an advanced regression method from another book or later course should be applied.
- **Independent observations.** Be cautious about applying regression to data which are sequential observations in time, such as a stock price each day.
Such data may have an underlying structure that should be considered in a model and analysis.
An example of a dataset where successive observations are not independent is shown in the fourth panel of Figure \@ref(fig:whatCanGoWrongWithLinearModel).
There are also other instances where correlations within the data are important, which is further discussed in Chapter \@ref(inf-model-mlr).
- **Nearly normal residuals.** Generally, the residuals must be nearly normal.
When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points, which we'll talk about more in Section \@ref(outliers-in-regression).
An example of a residual that would be a potential concern is shown in the second panel of Figure \@ref(fig:whatCanGoWrongWithLinearModel), where one observation is clearly much further from the regression line than the others.
- **Constant or equal variability.** The variability of points around the least squares line remains roughly constant.
An example of non-constant variability is shown in the third panel of Figure \@ref(fig:whatCanGoWrongWithLinearModel), which represents the most common pattern observed when this condition fails: the variability of $y$ is larger when $x$ is larger.
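A sketch of how such residual plots can be produced with `augment()` from the **broom** package (the object name `births_weeks_fit` is hypothetical, refitting the model from earlier in the chapter):
```{r, eval = FALSE}
# Residuals vs. fitted values for the weight ~ weeks model; look for
# curvature, outliers, and non-constant spread around the dashed line.
births_weeks_fit <- lm(weight ~ weeks, data = births14_100)
augment(births_weeks_fit) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")
```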
```{r whatCanGoWrongWithLinearModel, fig.cap = "Four examples showing when the methods in this chapter are insufficient to apply to the data. The top set of graphs represents the $x$ and $y$ relationship. The bottom set of graphs is a residual plot. First panel: linearity fails. Second panel: there are outliers, most especially one point that is very far away from the line. Third panel: the variability of the errors is related to the value of $x$. Fourth panel: a time series dataset is shown, where successive observations are highly correlated.", out.width = "100%", fig.width=10, fig.asp = 0.5}
par_og <- par(no.readonly = TRUE) # save original par
par(mar = rep(0.5, 4))
source("helpers/helper-makeTubeAdv.R")
pch <- 20
cex <- 1.75
col <- IMSCOL["blue", "f2"]
layout(
matrix(1:8, 2),
rep(1, 4),
c(2, 1)
)
these <- simulated_scatter$group == 20
x <- simulated_scatter$x[these]
y <- simulated_scatter$y[these]
plot(x, y,
axes = FALSE,
pch = pch,
cex = cex,
col = "#00000000"
)
box()
makeTube(x, y,
type = "quad",
addLine = FALSE,
col = IMSCOL["lgray", "f2"]
)
points(x, y,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"]
)
g <- lm(y ~ x)
abline(g)
yR <- range(g$residuals)
yR <- yR + c(-1, 1) * diff(yR) / 10
plot(x, g$residuals,
xlab = "", ylab = "",
axes = FALSE,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"],
ylim = yR
)
abline(h = 0, lty = 2)
box()
these <- simulated_scatter$group == 21
x <- simulated_scatter$x[these]
y <- simulated_scatter$y[these]
plot(x, y,
axes = FALSE,
pch = pch,
cex = cex,
col = "#00000000"
)
box()
makeTube(x, y,
addLine = FALSE,
col = IMSCOL["lgray", "f2"]
)
points(x, y,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"]
)
g <- lm(y ~ x)
abline(g)
yR <- range(g$residuals)
yR <- yR + c(-1, 1) * diff(yR) / 10
plot(x, g$residuals,
xlab = "", ylab = "",
axes = FALSE,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"],
ylim = yR
)
abline(h = 0, lty = 2)
box()
these <- simulated_scatter$group == 22
x <- simulated_scatter$x[these]
y <- simulated_scatter$y[these]
plot(x, y,
axes = FALSE,
pch = pch,
cex = cex,
col = "#00000000"
)
box()
makeTubeAdv(x, y,
type = "l",
variance = "l",
bw = 0.03,
Z = 1.7,
col = IMSCOL["lgray", "f2"]
)
points(x, y,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"]
)
g <- lm(y ~ x)
abline(g)
yR <- range(g$residuals)
yR <- yR + c(-1, 1) * diff(yR) / 10
plot(x, g$residuals,
axes = FALSE,
xlab = "", ylab = "",
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"],
ylim = yR
)
abline(h = 0, lty = 2)
box()
these <- simulated_scatter$group == 23
x <- simulated_scatter$x[these]
y <- simulated_scatter$y[these]
plot(x, y,
axes = FALSE,
pch = pch,
cex = cex,
col = "#00000000"
)
box()
makeTube(x, y,
addLine = FALSE,
col = IMSCOL["lgray", "f2"]
)
points(x, y,
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"]
)
g <- lm(y ~ x)
abline(g)
yR <- range(g$residuals)
yR <- yR + c(-1, 1) * diff(yR) / 10
plot(x, g$residuals,
axes = FALSE,
xlab = "", ylab = "",
pch = pch,
cex = cex,
col = IMSCOL["blue", "f1"],
ylim = yR
)
abline(h = 0, lty = 2)
box()
makeTubeAdv(x, y, col = IMSCOL["lgray", "f2"])
par(par_og) # restore original par
```
::: {.guidedpractice data-latex=""}
Should we have concerns about applying least squares regression to the Elmhurst data in Figure \@ref(fig:elmhurstScatterW2Lines)?[^inf-model-slr-3]
:::
[^inf-model-slr-3]: The trend appears to be linear, the data fall around the line with no obvious outliers, the variance is roughly constant.
These are also not time series observations.
Least squares regression can be applied to these data.
The technical conditions are often remembered using the **LINE** mnemonic.
The linearity, normality, and equality of variance conditions usually can be assessed through residual plots, as seen in Figure \@ref(fig:whatCanGoWrongWithLinearModel).
A careful consideration of the experimental design should be undertaken to confirm that the observed values are indeed independent.
- L: **linear** model
- I: **independent** observations
- N: points are **normally** distributed around the line