# Study design {#sec-data-design}
```{r}
#| include: false
source("_common.R")
```
::: {.chapterintro data-latex=""}
Before digging into the details of working with data, we stop to think about how data come to be.
That is, if the data are to be used to make broad and complete conclusions, then it is important to understand who or what the data represent.
One important aspect of data provenance is sampling.
Knowing how the observational units were selected from a larger entity will allow for generalizations back to the population from which the data were randomly selected.
Additionally, by understanding the structure of the study, causal relationships can be separated from those that are merely associations.
A good question to ask oneself before working with the data at all is, "How were these observations collected?"
You will learn a lot about the data by understanding their source.
:::
## Sampling principles and strategies {#sec-sampling-principles-strategies}
The first step in conducting research is to identify topics or questions that are to be investigated.
A clearly laid out research question is helpful in identifying what subjects or cases should be studied and what variables are important.
It is also important to consider *how* data are collected so that the data are reliable and help achieve the research goals.
### Populations and samples
Consider the following three research questions:
1. What is the average mercury content in swordfish in the Atlantic Ocean?
2. Over the last five years, what is the average time to complete a degree for Duke undergrads?
3. Does a new drug reduce the number of deaths in patients with severe heart disease?
Each research question refers to a target **population**\index{population}.
In the first question, the target population is all swordfish in the Atlantic Ocean, and each fish represents a case.
Oftentimes, it is not feasible to collect data for every case in a population.
Collecting data for an entire population is called a **census**\index{census}.
A census is often difficult because it is too expensive to collect data for the entire population, but it can also be difficult or impossible to identify the entire population of interest!
Instead, a sample is taken.
A **sample**\index{sample} is the data you have.
Ideally, a sample is a small fraction of the population.
For instance, 60 swordfish (or some other number) in the population might be selected, and this sample data may be used to provide an estimate of the population average and to answer the research question.
```{r}
#| include: false
terms_chp_2 <- c("population", "census", "sample")
```
::: {.guidedpractice data-latex=""}
For the second and third questions above, identify the target population and what represents an individual case.[^02-data-design-1]
:::
[^02-data-design-1]: The question *"Over the last five years, what is the average time to complete a degree for Duke undergrads?"* is only relevant to students who complete their degree; the average cannot be computed using a student who never finished their degree.
Thus, only Duke undergrads who graduated in the last five years represent cases in the population under consideration.
Each such student is an individual case.
For the question *"Does a new drug reduce the number of deaths in patients with severe heart disease?"*, a person with severe heart disease represents a case.
The population includes all people with severe heart disease.
### Parameters and statistics
In most statistical analysis procedures, the research question at hand boils down to understanding a numerical summary.
The number (or set of numbers) may be a quantity you are already familiar with (like the average) or it may be something you learn through this text (like the slope and intercept from a least squares model, provided in @sec-least-squares-regression).
A numerical summary can be calculated on either the sample of observations or the entire population.
However, measuring every unit in the population is usually prohibitive.
So, a "typical" numerical summary is calculated from a sample.
Yet, even if we never compute it, we can still conceptualize a population-level summary, such as the average income of all adults in Argentina.
We use specific terms in order to differentiate when a number is being calculated on a sample of data (**sample statistic**\index{sample statistic}) and when it is being calculated or considered for calculation on the entire population (**population parameter**\index{population parameter}).
The terms statistic and parameter are useful for communicating claims and models and will be used extensively in later chapters which delve into making inference on populations.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "population parameter", "sample statistic")
```
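To make the distinction concrete, here is a small sketch in R (not part of the original example; the incomes are simulated, not real) that contrasts a population parameter with a sample statistic computed from a random sample of that population.

```{r}
#| eval: false
# A hedged sketch: simulate a hypothetical population of 1,000,000 incomes,
# then compare the population parameter with a sample statistic.
set.seed(123)
population_income <- rexp(1e6, rate = 1 / 40000) # made-up incomes

mean(population_income)  # population parameter (usually unknown in practice)

sample_income <- sample(population_income, size = 100)
mean(sample_income)      # sample statistic (what we can actually compute)
```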
### Anecdotal evidence
Consider the following possible responses to the three research questions:
1. A man on the news got mercury poisoning from eating swordfish, so the average mercury concentration in swordfish must be dangerously high.
2. I met two students who took more than 7 years to graduate from Duke, so it must take longer to graduate at Duke than at many other colleges.
3. My friend's dad had a heart attack and died after they gave him a new heart disease drug, so the drug must not work.
Each conclusion is based on data.
However, there are two problems.
First, the data only represent one or two cases.
Second, and more importantly, it is unclear whether these cases are actually representative of the population.
Data collected in this haphazard fashion are called **anecdotal evidence**\index{anecdotal evidence}.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "anecdotal evidence")
```
::: {.important data-latex=""}
**Anecdotal evidence.**
Be careful of data collected in a haphazard fashion.
Such evidence may be true and verifiable, but it may only represent extraordinary cases and therefore not be a good representation of the population.
:::
Anecdotal evidence typically is composed of unusual cases that we recall based on their striking characteristics.
For instance, we are more likely to remember the two people we met who took 7 years to graduate than the six others who graduated in four years.
Instead of looking at the most unusual cases, we should examine a sample of many cases that better represent the population.
\clearpage
### Sampling from a population
\index{sampling!random} \index{bias}
We might try to estimate the time to graduation for Duke undergraduates in the last five years by collecting a sample of graduates.
All graduates in the last five years represent the *population*\index{population}, and graduates who are selected for review are collectively called the *sample*\index{sample}.
In general, we always seek to *randomly* select a sample from a population.
The most basic type of random selection is equivalent to how raffles are conducted.
For example, in selecting graduates, we could write each graduate's name on a raffle ticket and draw 10 tickets.
The selected names would represent a random sample of 10 graduates.
```{r}
#| label: fig-pop-to-sample
#| fig-cap: |
#| 10 graduates are randomly selected from the population to
#| be included in the sample.
#| fig-asp: 0.43
#| fig-width: 8.0
#| fig-alt: |
#| A large circle contains many dots which indicate all the graduates.
#| A smaller circle contains a few of the dots (i.e., graduates) which have
#| been randomly selected from the larger circle.
#| out-width: 70%
set.seed(1234)
par_og <- par(no.readonly = TRUE) # save original par
par(mar = rep(0, 4))
plot(c(0, 2), c(0, 1.1), type = "n", axes = FALSE, xlab = "", ylab = "")
temp <- seq(0, 2 * pi, 2 * pi / 100)
x <- 0.5 + 0.5 * cos(temp)
y <- 0.5 + 0.5 * sin(temp)
lines(x, y)
s <- matrix(runif(1000), ncol = 2)
S <- matrix(NA, 350, 2)
j <- 0
for (i in 1:nrow(s)) {
if (sum((s[i, ] - 0.5)^2) < 0.23) {
j <- j + 1
S[j, ] <- s[i, ]
}
}
points(S, col = IMSCOL["blue", "f2"], pch = 20)
text(0.5, 1, "all graduates", pos = 3, cex = 1.3)
set.seed(50)
N <- sample(j, 25)
lines((x - 0.5) / 2 + 1.5, (y - 0.5) / 2 + 0.5, pch = 20)
SS <- (S[N, ] - 0.5) / 2 + 0.5
these <- c(2, 5, 10, 12, 20, 21, 22, 23, 1, 8)
points(SS[these, 1] + 1, SS[these, 2], col = IMSCOL["red", "f1"], pch = 20, cex = 1.5)
text(1.5, 0.75, "sample", pos = 3, cex = 1.3)
for (i in these) {
arrows(S[N[i], 1], S[N[i], 2],
SS[i, 1] + 1 - 0.03, SS[i, 2],
length = 0.08, col = IMSCOL["black", "full"], lwd = 1.5
)
}
par(par_og) # restore original par
```
::: {.workedexample data-latex=""}
Suppose we ask a student who happens to be majoring in nutrition to select several graduates for the study.
Which students do you think they might pick?
Do you think their sample would be representative of all graduates?
------------------------------------------------------------------------
They might pick a disproportionate number of graduates from health-related fields, as shown in @fig-pop-to-sub-sample-graduates.
When selecting samples by hand, we run the risk of picking a **biased** sample\index{sample bias}, even if our bias is unintended.
:::
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "sample bias")
```
```{r}
#| label: fig-pop-to-sub-sample-graduates
#| fig-cap: |
#| Asked to pick a sample of graduates, a nutrition major might inadvertently
#| pick a disproportionate number of graduates from health-related majors.
#| fig-asp: 0.43
#| fig-width: 8.0
#| fig-alt: |
#| A large circle contains many dots which indicate all the graduates, but some
#| of the dots have been greyed out where others are dark dots from which the sample
#| is taken. A smaller circle contains a few of the dots (i.e., graduates) which have
#| been selected from the biased group of dark dots in the large circle.
#| out-width: 70%
par_og <- par(no.readonly = TRUE) # save original par
par(mar = rep(0, 4))
plot(c(0, 2), c(0, 1.1), type = "n", axes = FALSE, xlab = "", ylab = "")
temp <- seq(0, 2 * pi, 2 * pi / 100)
x <- 0.5 + 0.5 * cos(temp)
y <- 0.5 + 0.5 * sin(temp)
lines(x, y)
s <- matrix(runif(1000), ncol = 2)
S <- matrix(NA, 350, 2)
j <- 0
sub <- rep(FALSE, 1000)
for (i in 1:nrow(s)) {
if (sum((s[i, ] - 0.5)^2) < 0.23) {
j <- j + 1
S[j, ] <- s[i, ]
}
if (sum((s[i, ] - c(0.05, 0.18) - 0.5)^2) < 0.07) {
sub[j] <- TRUE
}
}
points(S, col = IMSCOL["blue", 4 - 2 * sub], pch = 20)
text(0.5, 1, "all graduates", pos = 3, cex = 1.3)
lines(
(x - 0.5) * 2 * sqrt(0.07) + 0.55,
(y - 0.5) * 2 * sqrt(0.07) + 0.68
)
set.seed(7)
N <- sample((1:j)[sub], 25)
lines((x - 0.5) / 2 + 1.5,
(y - 0.5) / 2 + 0.5,
pch = 20
)
SS <- (S[N, ] - 0.5) / 2 + 0.5
these <- c(2, 5, 10, 12, 20, 21, 22, 23, 1, 8)
points(SS[these, 1] + 1, SS[these, 2], col = IMSCOL["red", "f1"], pch = 20, cex = 1.5)
text(1.5, 0.75, "sample", pos = 3, cex = 1.3)
for (i in these) {
arrows(S[N[i], 1], S[N[i], 2],
SS[i, 1] + 1 - 0.03, SS[i, 2],
length = 0.08,
col = IMSCOL["black", "full"],
lwd = 1.5
)
}
rect(0.143, 0.2, 0.952, 0.301,
border = "#00000000",
col = "#FFFFFF88"
)
rect(0.236, 0.301, 0.858, 0.403,
border = "#00000000",
col = "#FFFFFF88"
)
text(0.55, 0.5 + 0.18 - sqrt(0.07),
"graduates from\nhealth-related fields",
pos = 1, cex = 1.3
)
par(par_og) # restore original par
```
If someone was permitted to pick and choose exactly which graduates were included in the sample, it is entirely possible that the sample would overrepresent that person's interests, which may be entirely unintentional.
This introduces **bias** into a sample\index{bias}\index{sampling!bias}.
Sampling randomly helps address this problem.
The most basic random sample is called a **simple random sample**\index{simple random sample}\index{sampling!simple random} and is equivalent to drawing names out of a hat to select cases.
This means that each case in the population has an equal chance of being included and the cases in the sample are not related to each other.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "bias", "simple random sample")
```
\clearpage
The act of taking a simple random sample helps minimize bias.
However, bias can crop up in other ways.
Even when people are picked at random, e.g., for surveys, caution must be exercised if the **non-response rate**\index{non-response rate} is high.
For instance, if only 30% of the people randomly sampled for a survey actually respond, then it is unclear whether the results are **representative**\index{representative sample}\index{sampling!representative} of the entire population.
This **non-response bias**\index{bias!non-response} can skew results.
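As a rough, hypothetical illustration of how non-response can distort a result, the simulation below lets support for a proposal differ between people who respond and people who do not; even a perfectly random sample then produces a biased estimate when only respondents are counted. The numbers are invented for illustration.

```{r}
#| eval: false
# Hypothetical simulation of non-response bias.
set.seed(123)
n_pop <- 100000
# Suppose 40% of the population supports a proposal ...
supports <- rbinom(n_pop, size = 1, prob = 0.40)
# ... but supporters are more likely to answer the survey (60% vs. 20%).
respond_prob <- ifelse(supports == 1, 0.60, 0.20)

sampled   <- sample(n_pop, size = 1000)  # a random sample of 1,000 people
responded <- rbinom(1000, size = 1, prob = respond_prob[sampled]) == 1

mean(supports[sampled])             # close to the true 40%
mean(supports[sampled][responded])  # biased upward (~67%): supporters answer more often
```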
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "non-response rate", "representative", "non-response bias")
```
```{r}
#| label: fig-survey-sample
#| fig-cap: |
#| Due to the possibility of non-response, survey studies may only reach a certain
#| group within the population. It is difficult, and oftentimes impossible,
#| to completely fix this problem.
#| fig-asp: 0.43
#| fig-width: 8.0
#| fig-alt: |
#| A large circle contains many dots which indicate the population of interest,
#| but some of the dots have been greyed out where others are dark dots from which
#| the sample is taken (where the grey dots are potentially due to non-response bias). A
#| smaller circle contains a few of the dots which have been selected from the group
#| of dark dots in the large circle who were individuals willing to respond to the
#| survey.
#| out-width: 70%
par_og <- par(no.readonly = TRUE) # save original par
par(mar = rep(0, 4))
plot(c(0, 2),
c(0, 1.1),
type = "n",
axes = FALSE,
xlab = "",
ylab = ""
)
temp <- seq(0, 2 * pi, 2 * pi / 100)
x <- 0.5 + 0.5 * cos(temp)
y <- 0.5 + 0.5 * sin(temp)
lines(x, y)
s <- matrix(runif(700), ncol = 2)
S <- matrix(NA, 350, 2)
j <- 0
sub <- rep(FALSE, 1000)
for (i in 1:nrow(s)) {
if (sum((s[i, ] - 0.5)^2) < 0.23) {
j <- j + 1
S[j, ] <- s[i, ]
}
if (sum((s[i, ] - c(-0.15, 0.05) - 0.5)^2) < 0.115) {
sub[j] <- TRUE
}
}
points(S, col = IMSCOL["blue", 4 - 2 * sub], pch = 20)
text(0.5, 1, "population of interest", pos = 3, cex = 1.3)
lines(
(x - 0.5) * 2 * sqrt(0.115) + 0.35,
(y - 0.5) * 2 * sqrt(0.115) + 0.55
)
set.seed(7)
N <- sample((1:j)[sub], 25)
lines((x - 0.5) / 2 + 1.5,
(y - 0.5) / 2 + 0.5,
pch = 20
)
SS <- (S[N, ] - 0.5) / 2 + 0.5
these <- c(2, 5, 6, 7, 15)
points(SS[these, 1] + 1,
SS[these, 2],
col = IMSCOL["red", "f1"],
pch = 20,
cex = 1.5
)
text(1.5, 0.75, "sample", pos = 3, cex = 1.3)
for (i in these) {
arrows(S[N[i], 1],
S[N[i], 2],
SS[i, 1] + 1 - 0.03,
SS[i, 2],
length = 0.08,
col = IMSCOL["black", "full"],
lwd = 1.5
)
}
rect(0.145, 0.195, 0.775, 0.11,
border = "#00000000",
col = "#FFFFFF88"
)
rect(0.31, 0.018, 0.605, 0.11,
border = "#00000000",
col = "#FFFFFF88"
)
text(0.46, 0.5 + 0.06 - sqrt(0.115),
"population actually\nsampled",
pos = 1,
cex = 1
)
par(par_og) # restore original par
```
Another common downfall is a **convenience sample**\index{convenience sample}\index{sampling!convenience}, where individuals who are easily accessible are more likely to be included in the sample.
For instance, if a political survey is done by stopping people walking in the Bronx, this will not represent all of New York City.
It is often difficult to discern what sub-population a convenience sample represents.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "convenience sample")
```
::: {.guidedpractice data-latex=""}
We can easily access ratings for products, sellers, and companies through websites.
These ratings are based only on those people who go out of their way to provide a rating.
If 50% of online reviews for a product are negative, do you think this means that 50% of buyers are dissatisfied with the product?
Why?[^02-data-design-2]
:::
[^02-data-design-2]: Answers will vary.
From our own anecdotal experiences, we believe people tend to rant more about products that fell below expectations than rave about those that perform as expected.
For this reason, we suspect there is a negative bias in product ratings on sites like Amazon.
However, since our experiences may not be representative, we also keep an open mind.
### Four sampling methods {#sec-samp-methods}
Almost all statistical methods are based on the notion of implied randomness.
If data are not collected in a random framework from a population, these statistical methods -- the estimates and errors associated with the estimates -- are not reliable.
Here we consider four random sampling techniques: simple, stratified, cluster, and multistage sampling.
@fig-simple-stratified and @fig-cluster-multistage provide graphical representations of these techniques.
**Simple random sampling**\index{simple random sample}\index{sampling!simple random} is probably the most intuitive form of random sampling.
Consider the salaries of Major League Baseball (MLB) players, where each player is a member of one of the league's 30 teams.
To take a simple random sample of 120 baseball players and their salaries, we could write the names of that season's several hundred players onto slips of paper, drop the slips into a bucket, shake the bucket around until we are sure the names are all mixed up, then draw out slips until we have the sample of 120 players.
In general, a sample is referred to as "simple random" if each case in the population has an equal chance of being included in the final sample *and* knowing that a case is included in a sample does not provide useful information about which other cases are included.
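In R, a simple random sample is easy to mimic with `sample()`. The sketch below uses a hypothetical roster (the names, teams, and salaries are made up, not real MLB data) to draw 120 players so that every player has the same chance of selection.

```{r}
#| eval: false
# Hedged sketch: a hypothetical roster of 750 players across 30 teams.
set.seed(123)
players <- data.frame(
  name   = paste("Player", 1:750),
  team   = rep(paste("Team", 1:30), each = 25),
  salary = round(rlnorm(750, meanlog = 15, sdlog = 1)) # made-up salaries
)

# Simple random sample: each player is equally likely to be chosen.
srs <- players[sample(nrow(players), size = 120), ]
mean(srs$salary)  # a sample statistic estimating the league-wide average salary
```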
```{r}
#| label: fig-simple-stratified
#| fig-cap: |
#| Examples of simple random and stratified sampling. In the top panel, simple
#| random sampling was used to randomly select the 18 cases (denoted in red). In the
#| bottom panel, stratified sampling was used: cases were first grouped into strata,
#| then simple random sampling was employed to randomly select 3 cases within each
#| stratum.
#| fig-width: 10.0
#| out-width: 80%
#| fig-alt: |
#| The top box shows a population of dots (i.e., individuals) where a handful
#| of the dots have been sampled randomly. The bottom box shows the same population
#| of dots but grouped in such a way that there are six strata. From each stratum
#| three dots (i.e., individuals) are randomly selected.
source("helpers/helper-sampling.R") # source helper
par_og <- par(no.readonly = TRUE) # save original par
par(mar = rep(0.5, 4), mfrow = c(2, 1)) # no margin, 2 figures
build_srs(n = 18, N = 108) # build figure
build_stratified(N = 108) # build figure
par(par_og) # restore original par
```
**Stratified sampling**\index{stratified sample}\index{sampling!stratified} is a divide-and-conquer sampling strategy.
The population is divided into groups called **strata**\index{sampling!strata}.
The strata are chosen so that similar cases are grouped together, then a second sampling method, usually simple random sampling, is employed within each stratum.
In the baseball salary example, each of the 30 teams could represent a stratum, since some teams have a lot more money (up to 4 times as much!).
Then we might randomly sample 4 players from each team for our sample of 120 players.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "simple random sampling", "stratified sampling", "strata")
```
**Stratified sampling** is especially useful when the cases in each stratum are very similar with respect to the outcome of interest.
The downside is that analyzing data from a stratified sample is a more complex task than analyzing data from a simple random sample.
The analysis methods introduced in this book would need to be extended to analyze data collected using stratified sampling.
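Continuing the baseball illustration, a stratified sample can be sketched by treating each team as a stratum and drawing the same number of players from every stratum. The code below assumes the dplyr package and reuses a made-up roster, so it only illustrates the mechanics.

```{r}
#| eval: false
# Hedged sketch of stratified sampling with dplyr (hypothetical roster).
library(dplyr)
set.seed(123)
players <- tibble(
  name   = paste("Player", 1:750),
  team   = rep(paste("Team", 1:30), each = 25),
  salary = round(rlnorm(750, meanlog = 15, sdlog = 1))
)

stratified <- players |>
  group_by(team) |>      # each team is a stratum
  slice_sample(n = 4) |> # simple random sample of 4 within each stratum
  ungroup()

nrow(stratified)  # 30 teams x 4 players = 120
```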
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "stratified sampling")
```
::: {.workedexample data-latex=""}
Why would it be good for cases within each stratum to be very similar?
------------------------------------------------------------------------
We might get a more stable estimate for the subpopulation in a stratum if the cases are very similar, leading to more precise estimates within each group.
When we combine these estimates into a single estimate for the full population, that population estimate will tend to be more precise since each individual group estimate is itself more precise.
:::
In a **cluster sample**\index{cluster sample}\index{sampling!cluster}, we break up the population into many groups, called **clusters**.
Then we sample a fixed number of clusters and include all observations from each of those clusters in the sample.
A **multistage sample**\index{multistage sample}\index{sampling!multistage} is like a cluster sample, but rather than keeping all observations in each cluster, we would collect a random sample within each selected cluster.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "cluster sampling", "cluster", "multistage sample")
```
```{r}
#| label: fig-cluster-multistage
#| fig-cap: |
#| Examples of cluster and multistage sampling. In the top panel, cluster sampling
#| was used: data were binned into nine clusters, three of these clusters were sampled,
#| and all observations within these three clusters were included in the sample. In
#| the bottom panel, multistage sampling was used, which differs from cluster sampling
#| only in that we randomly select a subset of each cluster to be included in the sample
#| rather than measuring every case in each sampled cluster.
#| fig-width: 10.0
#| out-width: 80%
#| fig-alt: |
#| In the top figure, dots are grouped into clusters, three clusters are selected,
#| and every dot (i.e., all individuals) from each of the three clusters are sampled.
#| In the bottom figure, dots are again grouped into clusters and three clusters
#| are selected. However, random sampling is applied so that a random sample
#| from each of the three selected clusters is taken.
source("helpers/helper-sampling.R") # source helper
par_og <- par(no.readonly = TRUE) # save original par
par(mar = rep(0.5, 4), mfrow = c(2, 1)) # no margin, 2 figures
build_cluster() # build figure
build_multistage() # build figure
par(par_og) # restore original par
```
Sometimes cluster or multistage sampling can be more economical than the alternative sampling techniques.
Also, unlike stratified sampling, these approaches are most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves do not look very different from one another.
For example, if neighborhoods represent the clusters, then cluster or multistage sampling works best when the populations inside each neighborhood are very diverse.
A downside of these methods is that more advanced techniques are typically required to analyze the data, though the methods in this book can be extended to handle such data.
::: {.workedexample data-latex=""}
Suppose we are interested in estimating the malaria rate in a densely tropical portion of rural Indonesia.
We learn that there are 30 villages in that part of the Indonesian jungle, each more or less like the next, but the distances between the villages are substantial.
We want to test 150 individuals for malaria.
What sampling method should we use?
------------------------------------------------------------------------
A simple random sample would likely draw individuals from all 30 villages, which could make data collection expensive.
Stratified sampling would be a challenge since it is unclear how we would build strata of similar individuals.
However, cluster sampling or multistage sampling seem like very good ideas.
With multistage sampling, we could randomly select half of the villages, then randomly select 10 people from each.
This could reduce data collection costs substantially in comparison to a simple random sample, and the cluster sample would still yield reliable information, even if we would need to analyze the data with more advanced methods than those introduced in this book.
:::
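The multistage plan from the worked example can be written out directly: sample villages first (the clusters), then sample people within each selected village. The sketch below assumes the dplyr package; the village sizes are invented purely for illustration.

```{r}
#| eval: false
# Hedged sketch of multistage sampling for the malaria example (made-up data).
library(dplyr)
set.seed(123)
residents <- tibble(
  village = rep(paste("Village", 1:30), each = 100), # 30 villages, 100 people each
  person  = 1:3000
)

# Stage 1: randomly select 15 of the 30 villages (the clusters).
chosen_villages <- sample(unique(residents$village), size = 15)

# Stage 2: randomly select 10 people within each chosen village.
multistage <- residents |>
  filter(village %in% chosen_villages) |>
  group_by(village) |>
  slice_sample(n = 10) |>
  ungroup()

nrow(multistage)  # 15 villages x 10 people = 150 individuals to test
```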
\clearpage
## Experiments {#sec-experiments}
Studies where the researchers assign treatments to cases are called **experiments**\index{experiment}\index{study!experiment}.
When this assignment includes randomization, e.g., using a coin flip to decide which treatment a patient receives, it is called a **randomized experiment**\index{randomized experiment}\index{study!randomized experiment}.
Randomized experiments are fundamentally important when trying to show a causal connection between two variables.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "experiment", "randomized experiment")
```
### Principles of experimental design {#sec-principles-experimental-design}
1. **Controlling.** Researchers assign treatments to cases, and they do their best to **control**\index{control} any other differences in the groups[^02-data-design-3]. For example, when patients take a drug in pill form, some patients take the pill with only a sip of water while others may have it with an entire glass of water. To control for the effect of water consumption, a doctor may instruct every patient to drink a 12-ounce glass of water with the pill.
[^02-data-design-3]: This is a different concept than a *control group*, which we discuss in the second principle and in @sec-reducing-bias-human-experiments.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "control")
```
2. **Randomization.** Researchers randomize patients into treatment groups to account for variables that cannot be controlled.
For example, some patients may be more susceptible to a disease than others due to their dietary habits.
In this example, dietary habit is a **confounding variable**[^02-data-design-4], which is defined as a variable that is associated with both the explanatory and response variables.
Randomizing patients into the treatment or control group helps even out such differences.
::: {.important data-latex=""}
**Confounding variable.**
A **confounding variable**\index{confounding variable} is one that is associated with both the explanatory and response variables.
Because it is associated with both variables, it prevents the study from concluding that the explanatory variable caused the response variable.
Consider a silly example with total ice-cream sales as the explanatory variable and number of boating accidents as the response variable (which may seem highly correlated).
Outside temperature is associated with both variables, and therefore we cannot conclude that high ice-cream sales cause more boating accidents.
Confounding variables may or may not be measured as part of the study.
Regardless, drawing cause-and-effect conclusions is difficult in an observational study because of the ever-present possibility of confounding variables.
:::
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "confounding variable")
```
3. **Replication.**\index{replication} The more cases researchers observe, the more accurately they can estimate the effect of the explanatory variable on the response.
In a single study, we **replicate** by collecting a sufficiently large sample.
What is considered sufficiently large varies from experiment to experiment, but at a minimum we want to have multiple subjects (experimental units) per treatment group.
Another way of achieving replication is replicating an entire study to verify an earlier finding.
The term **replication crisis**\index{replication crisis} refers to the ongoing methodological crisis in which past findings from scientific studies in several disciplines have failed to be replicated.
**Pseudoreplication**\index{pseudoreplication} occurs when individual observations under different treatments are heavily dependent on each other.
For example, suppose you have 50 subjects in an experiment where you're taking blood pressure measurements at 10 time points throughout the course of the study.
By the end, you will have 50 $\times$ 10 = 500 measurements.
Reporting that you have 500 observations would be considered pseudoreplication, as the blood pressure measurements of a given individual are not independent of each other.
Pseudoreplication often happens when the wrong entity is replicated, and the reported sample sizes are exaggerated.
[^02-data-design-4]: Also called a **lurking variable**, **confounding factor**, or a **confounder**.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "replication", "pseudoreplication", "replication crisis")
```
\index{replication} \index{lurking variable}
4. **Blocking.**\index{blocking} Researchers sometimes know or suspect that variables, other than the treatment, influence the response. Under these circumstances, they may first group individuals based on this variable into **blocks** and then randomize cases within each block to the treatment groups. This strategy is often referred to as **blocking**. For instance, if we are looking at the effect of a drug on heart attacks, we might first split patients in the study into low-risk and high-risk blocks, then randomly assign half the patients from each block to the control group and the other half to the treatment group, as shown in @fig-blocking. This strategy ensures that each treatment group has the same number of low-risk patients and the same number of high-risk patients. A short code sketch after the figure illustrates one way to carry out this kind of blocked randomization.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "blocking")
```
```{r}
#| label: fig-blocking
#| fig-cap: |
#| Blocking for patient risk. Patients are first divided into low-risk and
#| high-risk blocks, then patients in each block are evenly randomized into
#| the treatment groups. This strategy ensures equal representation of
#| patients in each treatment group from both risk categories.
#| warning: false
#| fig-width: 6.0
#| fig-asp: 0.97
#| fig-alt: |
#| Before randomly allocating, the red low risk patients and blue high risk
#| patients are split into two separate groups. Subsequently, half of the red low risk
#| patients are randomly chosen to receive the treatment, and half of the blue high
#| risk patients are randomly chosen to receive the treatment.
#| out-width: 90%
set.seed(12345)
source("helpers/helper-blocking.R") # source helper
par_og <- par(no.readonly = TRUE) # save original par
par(mar = rep(0, 4)) # no margins
build_blocking() # build figure
par(par_og) # restore original par
```
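One way to carry out the blocked randomization pictured in @fig-blocking is sketched below. The patients are hypothetical and the code assumes the dplyr package; it simply blocks on risk and then randomizes to treatment and control within each block.

```{r}
#| eval: false
# Hedged sketch of blocked randomization (hypothetical patients).
library(dplyr)
set.seed(123)
patients <- tibble(
  id   = 1:40,
  risk = rep(c("low-risk", "high-risk"), each = 20)
)

assigned <- patients |>
  group_by(risk) |> # block on risk
  mutate(group = sample(rep(c("treatment", "control"), length.out = n()))) |>
  ungroup()

table(assigned$risk, assigned$group) # 10 patients per risk level in each group
```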
It is important to incorporate the first three experimental design principles into any study, and this book describes applicable methods for analyzing data from such experiments.
Blocking is a slightly more advanced technique, and statistical methods in this book may be extended to analyze data collected using blocking.
### Reducing bias in human experiments {#sec-reducing-bias-human-experiments}
Randomized experiments have long been considered to be the gold standard for data collection, but they do not ensure an unbiased perspective into the cause-and-effect relationship in all cases.
Human studies are perfect examples where bias can unintentionally arise.
Here we reconsider a study where a new drug was used to treat heart attack patients.
In particular, researchers wanted to know if the drug reduced deaths in patients.
These researchers designed a randomized experiment because they wanted to draw causal conclusions about the drug's effect.
Study volunteers[^02-data-design-5] were randomly placed into two study groups.
One group, the **treatment group**\index{treatment group}, received the drug.
The other group, called the **control group**\index{control group}, did not receive any drug treatment.
[^02-data-design-5]: Human subjects are often called **patients**, **volunteers**, or **study participants**.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "treatment group", "control group")
```
\clearpage
Put yourself in the place of a person in the study.
If you are in the treatment group, you are given a fancy new drug that you anticipate will help you.
On the other hand, a person in the other group does not receive the drug and sits idly, hoping her participation does not increase her risk of death.
These perspectives suggest there are actually two effects in this study: the one of interest is the effectiveness of the drug, and the second is an emotional effect of (not) taking the drug, which is difficult to quantify.
Researchers aren't usually interested in the emotional effect, which might bias the study.
To circumvent this problem, researchers do not want patients to know which group they are in.
When researchers keep the patients uninformed about their treatment, the study is said to be **blind**\index{blind}.
But there is one problem: if a patient does not receive a treatment, they will know they're in the control group.
A solution to this problem is to give a fake treatment to patients in the control group.
This is called a **placebo**\index{placebo}\index{experiment!placebo}, and an effective placebo is the key to making a study truly blind.
A classic example of a placebo is a sugar pill that is made to look like the actual treatment pill.
However, offering such a fake treatment may not be ethical in certain experiments.
For example, in medical experiments, typically the control group must get the current standard of care.
Oftentimes, a placebo results in a slight but real improvement in patients.
This effect has been dubbed the **placebo effect**\index{placebo effect}.
\index{blinding}
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "blind", "placebo", "placebo effect")
```
The patients are not the only ones who should be blinded: doctors and researchers can unintentionally bias a study.
When a doctor knows a patient has been given the real treatment, they might inadvertently give that patient more attention or care than a patient that they know is on the placebo.
To guard against this bias, which again has been found to have a measurable effect in some instances, most modern studies employ a **double-blind**\index{double-blind} setup where doctors or researchers who interact with patients are, just like the patients, unaware of who is or is not receiving the treatment.[^02-data-design-6]
[^02-data-design-6]: There are always some researchers involved in the study who do know which patients are receiving which treatment.
However, they do not interact with the study's patients and do not tell the blinded health care professionals who is receiving which treatment.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "double-blind")
```
::: {.guidedpractice data-latex=""}
Look back to the study in @sec-case-study-stents-strokes where researchers were testing whether stents were effective at reducing strokes in at-risk patients.
Is this an experiment?
Was the study blinded?
Was it double-blinded?[^02-data-design-7]
:::
[^02-data-design-7]: The researchers assigned the patients into their treatment groups, so this study was an experiment.
However, the patients could distinguish what treatment they received because a stent is a surgical procedure.
There is no equivalent surgical placebo, so this study was not blind.
The study could not be double-blind since it was not blind.
::: {.guidedpractice data-latex=""}
For the study in @sec-case-study-stents-strokes, could the researchers have employed a placebo?
If so, what would that placebo have looked like?[^02-data-design-8]
:::
[^02-data-design-8]: Ultimately, can we make patients think they received treatment from a surgery?
In fact, we can, and some experiments use a **sham surgery**.
In a sham surgery, the patient does undergo surgery, but the patient does not receive the full treatment, though they will still get a placebo effect.
You may have many questions about the ethics of sham surgeries to create a placebo.
These questions may have even arisen in your mind in the general experiment context, where a possibly helpful treatment is withheld from individuals in the control group; the main difference is that a sham surgery tends to create additional risk, while withholding a treatment only maintains a person's existing risk.
There are always multiple viewpoints of experiments and placebos, and rarely is it obvious which is ethically "correct".
For instance, is it ethical to use a sham surgery when it creates a risk to the patient?
However, if we do not use sham surgeries, we may promote the use of a costly treatment that has no real effect; if this happens, money and other resources will be diverted away from other treatments that are known to be helpful.
Ultimately, this is a difficult situation where we cannot perfectly protect both the patients who have volunteered for the study and the patients who may benefit (or not) from the treatment in the future.
\clearpage
## Observational studies {#sec-observational-studies}
Studies where no treatment has been explicitly applied (or explicitly withheld) are called **observational studies**\index{study!observational}\index{observational study}.
For instance, the studies on the loan data and county data described in @sec-data-basics would both be considered observational, as they rely on **observational data**.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "observational study")
```
Making causal conclusions based on experiments is often reasonable, since we can randomly assign the explanatory variable(s), i.e., the treatments.
However, making the same causal conclusions based on observational data can be treacherous and is not recommended.
Thus, observational studies are generally only sufficient to show associations or form hypotheses that can be later checked with experiments.
Suppose an observational study tracked sunscreen use and skin cancer, and it was found that the more sunscreen someone used, the more likely the person was to have skin cancer.
Does this mean sunscreen *causes* skin cancer?
No!
Some previous research tells us that using sunscreen actually reduces skin cancer risk, so maybe there is another variable that can explain this hypothetical association between sunscreen usage and skin cancer, as shown in @fig-sun-causes-cancer.
One important piece of information that is absent is sun exposure.
If someone is out in the sun all day, they are more likely to use sunscreen *and* more likely to get skin cancer.
Exposure to the sun is unaccounted for in the simple observational investigation.
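A quick simulation makes the point concrete. The numbers below are entirely hypothetical: sunscreen use and skin cancer risk are each driven only by sun exposure, yet the two still end up associated.

```{r}
#| eval: false
# Hedged sketch: a confounder induces an association without any causal link.
set.seed(123)
n <- 10000
sun_exposure     <- rnorm(n)                # time in the sun (standardized, made up)
use_sunscreen    <- sun_exposure + rnorm(n) # driven by sun exposure only
skin_cancer_risk <- sun_exposure + rnorm(n) # also driven by sun exposure only

cor(use_sunscreen, skin_cancer_risk) # positive, despite no causal link between them

# Holding sun exposure fixed removes the association (coefficient near 0).
summary(lm(skin_cancer_risk ~ use_sunscreen + sun_exposure))$coefficients
```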
```{r}
#| label: fig-sun-causes-cancer
#| fig-align: center
#| fig-asp: 0.4
#| fig-cap: |
#| Sun exposure may be the root cause of both sunscreen use and skin cancer.
#| fig-alt: |
#| Three boxes are shown in a triangle arrangement representing: sun exposure,
#| using sunscreen, and skin cancer. A solid arrow connects sun exposure as a causal
#| mechanism to using sunscreen; a solid arrow also connects sun exposure as a causal
#| mechanism to skin cancer. A questioning arrow indicates that the causal effect
#| of using sunscreen on skin cancer is unknown.
#| out-width: 60%
par_og <- par(no.readonly = TRUE) # save original par
par(mar = rep(0, 4))
plot(c(-0.05, 1.2),
c(0.39, 1),
type = "n",
axes = FALSE
)
text(0.59, 0.89, "sun exposure", cex = 1)
rect(0.4, 0.8, 0.78, 1)
text(0.3, 0.49, "use sunscreen", cex = 1)
rect(0.1, 0.4, 0.48, 0.6)
arrows(0.49, 0.78, 0.38, 0.62,
length = 0.08, lwd = 1.5
)
text(0.87, 0.5, "skin cancer", cex = 1)
rect(0.71, 0.4, 1.01, 0.6)
arrows(0.67, 0.78, 0.8, 0.62,
length = 0.08, lwd = 1.5
)
arrows(0.5, 0.5, 0.69, 0.5,
length = 0.08, col = IMSCOL["gray", "f1"]
)
text(0.595, 0.565, "?",
cex = 1.5, col = IMSCOL["gray", "full"]
)
par(par_og) # restore original par
```
In this example, sun exposure is a confounding variable.
The presence of confounding variables is what inhibits the ability of observational studies to make causal claims.
While one method to justify making causal conclusions from observational studies is to exhaust the search for confounding variables, there is no guarantee that all confounding variables can be examined or measured.
::: {.guidedpractice data-latex=""}
@fig-county-multi-unit-homeownership shows a negative association between the homeownership rate and the percentage of housing units that are in multi-unit structures in a county.
However, it is unreasonable to conclude that there is a causal relationship between the two variables.
Suggest a variable that might explain the negative relationship.[^02-data-design-10]
:::
[^02-data-design-10]: Answers will vary.
Population density may be important.
If a county is very dense, then this may require a larger percentage of residents to live in housing units that are in multi-unit structures.
Additionally, the high density may contribute to increases in property value, making homeownership unfeasible for many residents.
Observational studies come in two forms: prospective and retrospective studies.
A **prospective study**\index{study!prospective}\index{prospective study} identifies individuals and collects information as events unfold.
For instance, medical researchers may identify and follow a group of patients over many years to assess the possible influences of behavior on cancer risk.
One example of such a study is The Nurses' Health Study.
Started in 1976 and expanded in 1989, the Nurses' Health Study has collected data on over 275,000 nurses and is still enrolling participants.
This prospective study recruits registered nurses and then collects data from them using questionnaires.
**Retrospective studies**\index{study!retrospective}\index{retrospective study} collect data after events have taken place, e.g., researchers may review past events in medical records.
Some datasets may contain both prospectively and retrospectively collected variables, such as medical studies which gather information on participants' lives before they enter the study and subsequently collect data on participants throughout the study.
```{r}
#| include: false
terms_chp_2 <- c(terms_chp_2, "prospective study", "retrospective study")
```
\clearpage
## Chapter review {#sec-chp2-review}
### Summary
A proficient analyst will have a good sense of the types of data they are working with and how to visualize the data in order to gain a complete understanding of the variables.
Equally important, however, is the data source.
In this chapter, we have discussed randomized experiments and taking good, random, representative samples from a population.
When we discuss inferential methods (starting in @sec-foundations-randomization), the conclusions that can be drawn will be dependent on how the data were collected.
@fig-randsampValloc summarizes how sampling and assignment methods relate to the scope of inference.[^02-data-design-11]
Regularly revisiting @fig-randsampValloc will be important when making conclusions from a given data analysis.
[^02-data-design-11]: Derived from similar figures in @ISCAM and @sleuth.
```{r}
#| label: fig-randsampValloc
#| out-width: 96%
#| fig-cap: |
#| Analysis conclusions should be made carefully according to how the data
#| were collected. Very few datasets come from the top left box because
#| ethics usually require that treatments be randomly assigned only to
#| volunteers. Both representative (ideally random) sampling and
#| experiments (random assignment of treatments) are important for
#| determining which statistical conclusions can be made about populations.
#| fig-alt: |
#| A two by two table describing the scenarios of random sample or not and
#| random allocation or not. Selecting randomly from a population allows for
#| generalization back to the population. Randomly allocating in an experiment
#| allows for establishing causation.
#| fig-pos: H
knitr::include_graphics("images/randsampValloc.png")
```
### Terms
The terms introduced in this chapter are presented in @tbl-terms-chp-2.
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
You should be able to easily spot them as **bolded text**.
```{r}
#| label: tbl-terms-chp-2
#| tbl-cap: Terms introduced in this chapter.
#| tbl-pos: H
make_terms_table(terms_chp_2)
```
\clearpage
## Exercises {#sec-chp2-exercises}
Answers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-02].
::: {.exercises data-latex=""}
{{< include exercises/_02-ex-data-design.qmd >}}
:::