# Measurement Properties of the CDI {#psychometrics}
```{r psycho-prep_data, child="_psychometrics.Rmd", eval=FALSE}
```
Many researchers are initially shocked to hear that one of the most important methods for studying child language is parent report. Aren't parents extremely biased observers? Yet, as we argued in Chapter \@ref(intro-practical), alternative methods like naturalistic observation or lab experiments can also be biased, and are quite costly to deploy at scale. Thus, the goal of this chapter is to revisit the strengths and weaknesses of parent report in depth, since the remainder of our book depends on the use of CDI data.
Our goal is to assess the psychometric utility of the CDI. Many studies provide evidence for reliability in the form of concurrent and longitudinal correlations between CDI scores, and for validity in the form of correlations between the CDI and other language measures; some of the most prominent of these studies are cited below, and a number of others are reviewed in @fenson2007. We also address some issues that have received a little less attention, however. In the first part of the chapter, we discuss the limitations of the CDI (and the design features that address these limitations); in the second part, we use longitudinal data to examine the test-retest reliability of the CDI; and in the third part, we present evidence for the measurement properties of the CDI (including comprehension questions) from a psychometric perspective.
## Strengths and limitations of parent report
Although the standardization of parent report using the CDI contributes to the availability of large amounts of data in a comparable format, there are significant limitations to the parent report methodology that are important to understand [@tomasello1994; @feldman2000]. To begin to do so, it is useful to reflect on what it means when a parent reports that their child "understands" or "understands and says" a word.
```{r psycho-parentsketch, fig.cap="The intuitive structure of parent report.", fig.width=4}
knitr::include_graphics("images/parent_report_sketch.png")
```
In an ideal world, the parent's responses would be an unbiased reflection of their observations of their child's language development. But parent reports are almost certainly less transparent than this ideal. Figure \@ref(fig:psycho-parentsketch) shows a caricature of the process of parent report. A given report could depend on direct recall of a specific instance in which the child actually produced or showed comprehension of the word. For example, when asked if their child produces the word *ball*, a parent is likely recalling situations in which their child has used the word *ball* correctly, and then reporting on the success or failure of this process of recollection. Of course, this judgment clearly depends on the parent's ability to accurately judge that the child intended to say the word *ball*, that the child's target word form was *ball*, and that the child has some meaning for the word form *ball* that at least approximates the expected meaning. But, in addition to these factors, parents probably draw on their general assessment of the difficulty of the word and on their overall assessment of the child's linguistic abilities. As even this simple sketch shows, parent report judgments are based on a fairly complex set of factors. Hence, there are legitimate concerns about the ability of parents to provide detailed and specific knowledge about their children's language. We discuss specific concerns below.
First, parents are imperfect observers. Most parents do not have specialized training in language development, and may not be sensitive to subtle aspects of language structure and use. Further, a natural pride in the child and a failure to critically test their impressions may cause parents to overestimate the child's ability; conversely, frustration in the case of delayed language may lead to underestimates. Parent report is most likely to be accurate under three general conditions: (1) when assessment is limited to current behaviors, (2) when assessment is focused on emergent behaviors, and (3) when a recognition format is used. Each of these conditions acts to reduce demands on the respondent's memory. In addition, parents are likely to be better able to report on their child's language at the present time than at times past and better able to report specific items when their child is actively learning those particular items (e.g., reporting on names for animals after a trip to the zoo).
Following (3), a key design principle of the CDI is that parents are better able to choose from a list of items that are likely candidates (recognition) than to generate the list themselves (recall). Although recall-based assessment sounds implausibly bad, it is surprising how often it is still used (or, even worse, replaced by the single global question "Does your child know at least 50 words?" that is so common in pediatric assessments).
CDI forms are also designed around commonsense age limitations on parent report. In typically-developing samples, the assumption is that parents can track their child's comprehension vocabulary to about 16--18 months, after which the number of words a child can understand is thought to be too large to monitor. For productive vocabulary, the assumption is that specific word productions can be monitored until about 2.5--3 years, after which the number of words a child can say becomes too large. Different instrument developers make different choices about the ceiling of CDI-type forms but relatively few have considered CDI-type parent report for measuring older children's vocabularies [but cf. @libertus2015].
Second, parent reports likely suffer from a number of biases that interact with sub-portions of the forms and the ages of the target children. For example, parents may have more difficulty reporting on children's comprehension or production of function words (e.g., *so*, *then*, *if*) than of content words (e.g., *baby*, *house*), perhaps because function words are more abstract and less referential. Estimates for function words may then rely more on parents' judgments of the words' general difficulty than on actual observations. We return to this question below in our psychometric analyses.
In addition, asking parents to reflect on their child's language abilities may be particularly difficult for early vocabulary and especially for early comprehension. As @tomasello1994 point out, for the youngest children, especially 8--10 month olds, vocabulary comprehension scores can be surprisingly high, possibly reflecting a lack of clarity in what the term "understands" means for parents of children at this young age (cf. Chapter \@ref(methods-and-data), subsection on "difficult data"). On the other hand, more recent evidence has suggested that children in this age range do plausibly have some comprehension skill even if it is somewhat fragmentary [@tincoff1999;@tincoff2012;@bergelson2012;@bergelson2013;@bergelson2015]. Thus, the degree to which very early comprehension reports are artifactual -- or were actually ahead of the research literature -- is unknown. (Resolving this question will require detailed studies of the correspondence between parent reports and experimental data for individual children). Below we assess some of the measurement properties of comprehension items, but we are unable to resolve the issue fully.
One study that bears on the earliest production data is @schneider2015, who compiled a number of sources of data on children's first words. Surprisingly, that study found relatively few differences for the age and topic distribution of this very salient milestone across datasets collected via a number of different methods, including concurrent (CDI) and retrospective report. The age at which a first word was reported was also relatively similar between CDI data and the concurrent diary reports of a sample of psycholinguists (though some CDI data appeared to be shifted a little bit earlier such that more parents were reporting first words in the 7--9 month period). Thus, there was convergence across different reporting methods in parents' report on first word production. Parent report could be flawed here, but the specific CDI format may not be to blame.
Third, there is some evidence that variability in reporting biases may be moderated by factors such as SES [@feldman2000; @fenson2000; @feldman2005]. Some studies suggest that parents from some SES groups may be more likely to underestimate their children's abilities [@roberts1999], while others report that parents from lower-SES groups may overestimate children's abilities, especially comprehension at younger ages [@goldfield1990; @feldman2000]. Later studies, however, have shown that for children over 2 years, patterns of validity were consistent across lower- and higher-SES groups [@feldman2005; @reese2000]. Thus, SES differences could reflect valid delays in children's language development that parallel those obtained with other methods, such as naturalistic observation or standardized tests [e.g., @hammer2010].
Fourth, as discussed in Chapter \@ref(intro-practical), the items on the original CDI instruments were chosen to be a representative sample of vocabulary for the appropriate age and language [@fenson1994]. The checklists contain some words that most children, including the youngest, are able to understand or produce; some words that are understood or produced by the "average" child; and some that only relatively advanced children will understand or produce. This structure ensures that the list has the psychometric property of capturing individual differences in vocabulary both across younger and older children and across children of different developmental levels. Validity of the CDIs has been demonstrated in reference to both standardized tests and naturalistic language sampling [see Chapter 4 of @fenson2007].
The checklists were not originally constructed with the assumption that responses on individual items would be reliable and valid, however. (Indeed, as we show below, not all words have ideal psychometric properties -- e.g., "mittens"). While item-level responses provide useful information about patterns of words that children are likely to understand or produce, responses on the vocabulary checklist do not necessarily license the conclusion that a child would respond appropriately when asked "can you say ______?" by an experimenter in a confrontation naming task. Nonetheless, if parents' observations at the item level reflect any signal -- even in the context of significant influence from other factors -- then this signal should be observable by aggregating together data from many children. Thus, the item-level analyses we present in Chapter \@ref(items-prediction) (for example) are not predicated on an assumption of high item-level reliability for individual children.
Fifth, while the lengths of the vocabulary checklists on the CDIs may give the impression that they yield an estimate of the child's full vocabulary, in fact, the vocabulary size estimates only reflect a child's relative standing compared to other children assessed with the same list of words. Such estimates should not be misconstrued as a comprehensive estimate of the child's vocabulary knowledge, as CDI scores likely understate the size of a child's "true" vocabulary substantially, especially for older children.
<!-- Given variation across forms in the procedure for selecting items, raw score (or even percentile) comparisons across languages is therefore problematic (see Chapter \@ref(vocabulary)). -->
<!-- Moreover, while it is tempting to ask parents to indicate additional words that their child might be able to understand or say, users should be aware that including those items may introduce bias in the main CDI estimate unless they are presented after the primary CDI. -->
Sixth, when a parent reports on a word on the vocabulary checklist, there is no information about the actual form of the word used, and hence, these vocabulary estimates can say little about phonological development (e.g. segmental vs. suprasegmental approaches to the analysis of speech). Parents are instructed that they should check that a child can produce a word even if it is pronounced in the child's "special way," and only approximates the adult form. Thus, throughout this book we refrain from analyzing the phonological forms of words reported on CDI instruments (with the exception of Chapter \@ref(items-prediction), in which we use word length in the correct adult form as a predictor of production).
Finally, we also gain no information from parent report about the frequency with which children use a particular word in their spontaneous speech, nor can we know the range of contexts in which individual lexical items are used (e.g., is that word used productively vs. in a memorized chunk of speech). Thus, the vocabulary size that is captured by the CDIs reflects the number of different word types that the child is able to understand or produce, with little information about nuances in meaning that might be reflected in actual usage.
Despite these limitations, when used appropriately, the CDI instruments yield reliable and valid estimates of total vocabulary size. Because the instruments were designed to minimize bias by targeting current behaviors and asking parents about highly salient features of their child's abilities, they have proven to be an important tool in the field. Dozens of studies demonstrate concurrent and predictive relations with naturalistic and observational measures, in both typically-developing and at-risk populations [e.g., @dale1996; @thal1999; @marchman2002]. In addition, a variety of recent work has shown that individual item-level responses can yield exciting new insights, for example about the growth patterns of semantic networks when aggregated across children [@hills2009; @hills2010]. Such analyses have the potential to be even more powerful when applied to larger samples and across languages.
## Longitudinal stability of CDI measurements
A classic test of the reliability of a psychometric instrument is its test-retest correlation. Assessing this correlation for CDIs with a single reporter is a bit impractical, however, since -- unlike, say, a math test with objective answers and different question forms -- this procedure would involve asking a caregiver to fill out the exact same survey twice in a row, and presumably they would remember many of their answers. An alternative possibility would be to measure the same child via multiple caregivers. This procedure was followed by @de-houwer2005, who found that caregivers varied substantially from one another in their responding; but plausibly this variation is due not only to parent bias but also to the different contexts in which caregivers interact with children (e.g., one caregiver takes the child to the zoo more often, another plays kitchen at home).
Avoiding the issues with these procedures, we instead examine correlations in CDI measurements across developmental time. There are only a small number of deeply longitudinal corpora in Wordbank, so we limit our investigation to two languages: Norwegian and English. Furthermore, the largest set of longitudinal data covers the WS form, so we restrict our analyses to these data for simplicity. Within each of these datasets, the modal number of observations per child is two, but some children have more than 10 CDIs available.
In this type of analysis, differences between a particular individual's measurements could arise for two primary reasons: first, measurement error (parent forgetfulness, mistakes, etc.), and second, true developmental change (learning new words). Since vocabulary typically increases over time, we can look at the relative magnitudes of CDI scores via correlations; this is our first analysis. Our second analysis normalizes these absolute differences by extracting percentile ranks, and finds that this procedure in fact increases longitudinal correlations. Because there are two sources of differences between measurements, when correlations are low, we cannot tell whether 1) children's relative linguistic abilities are shifting with respect to one another or 2) we are observing measurement error. But when correlations are high, we can conclude both that measurement error is low *and* that developmental stability is relatively high. It turns out that this latter situation is the case. As we will discuss in more detail in Chapter \@ref(vocabulary), there is substantial variability between children in vocabulary size. The current analysis suggests that this variability is quite stable longitudinally.
```{r psycho-style_long_data}
longitudinal_admins <- admins %>%
  mutate(langform = paste(language, form, sep = .inst_sep)) %>%
  group_by(langform, original_id) %>%
  count() %>%
  filter(n > 1)

n_long_ws <- admins %>%
  filter(original_id %in% longitudinal_admins$original_id,
         language %in% c("Norwegian", "English (American)"),
         form == "WS") %>%
  group_by(original_id, language, source_name) %>%
  mutate(n_admins = n()) %>%
  filter(n_admins > 1)

n_long_wg <- admins %>%
  filter(original_id %in% longitudinal_admins$original_id,
         language %in% c("Norwegian", "English (American)"),
         form == "WG") %>%
  group_by(original_id, language, source_name) %>%
  mutate(n_admins = n()) %>%
  filter(n_admins > 1)

ms_ws <- n_long_ws %>%
  group_by(language, age) %>%
  summarise(production = median(production))

ms_wg <- n_long_wg %>%
  group_by(age) %>%
  summarise(production = median(production))
```
Figure \@ref(fig:psycho-style-long-data-spaghetti) shows the trajectories of children (individual colors) who were measured more than ten times; it includes Norwegian data only, due to data sparsity issues in English. These trajectories appear quite stable; the ranking of individuals does not appear to change much over the course of several years. This general conclusion -- longitudinal stability of language ability as well as limited measurement error -- is ratified by other studies using different datasets, for example @bornstein2012, who found substantial stability (_r_ = 0.84) between latent constructs inferred from early language at 20 months and later language measured at 48 months.
```{r psycho-style-long-data-spaghetti, fig.cap="Vocabulary size as a function of age for children with more than 10 administrations (color indicates child)."}
ggplot(filter(n_long_ws, n_admins > 10,
              language == "Norwegian"),
       aes(x = age, y = production, col = fct_reorder(original_id, production))) +
  geom_line() +
  # facet_wrap(~language) +
  .scale_colour_numerous(guide = FALSE) +
  ylab("Words produced") +
  xlab("Age (months)") +
  scale_x_continuous(breaks = .ages)
```
```{r psycho-style_long_data_ecdf}
n_cross_ws <- admins %>%
  filter(language %in% c("Norwegian", "English (American)"),
         form == "WS") %>%
  group_by(original_id, language) %>%
  mutate(n_admins = n()) %>%
  filter(n_admins == 1)

# gets percentiles for longitudinal data based on the cross-sectional sample
get_empirical_percentiles <- function(df) {
  # assumes ages are uniform in this sample
  this_age <- df$age[1]
  this_lang <- df$language[1]
  cross_data <- filter(n_cross_ws,
                       age == this_age,
                       language == this_lang)
  Fn <- ecdf(cross_data$production)
  df$percentile <- Fn(df$production)
  return(df)
}

n_long_ws <- n_long_ws %>%
  split(list(.$age, .$language), drop = TRUE) %>%
  map_df(get_empirical_percentiles)
```
One way to operationalize the question of stability is to ask how children's percentile ranks tend to change over time. We examine this question qualitatively by showing the longitudinal trajectory of individual children's empirical percentile ranks based on the full normative sample for that language.^[We could use a model-based method (e.g., the `gcrq` method used in the Wordbank app and in Chapters \@ref(vocabulary) and \@ref(demographics)), but in practice we have enough data in each of these languages that the empirical method should perform well.] As shown in Figure \@ref(fig:psycho-style-long-data-norwegian-ecdf), these ranks are visually quite stable.
```{r psycho-style-long-data-norwegian-ecdf, fig.cap="Vocabulary percentile as a function of age for children with more than 10 administrations (color indicates child)."}
ggplot(filter(n_long_ws, n_admins > 10,
              language == "Norwegian"),
       aes(x = age, y = percentile, col = fct_reorder(original_id, production))) +
  geom_line() +
  .scale_colour_numerous(guide = FALSE) +
  xlab("Age (months)") +
  ylab("Empirical percentile") +
  scale_x_continuous(breaks = .ages)
```
The transformation to percentile ranks allows us to assess the correlation between a child's percentile rank at time 1 and their rank at time 2, as a function of the gap between the two measurements. Because of sparsity, we bin children into two-month age bins and eliminate age bins with fewer than 50 children, then calculate between-bin correlations in percentiles. Figure \@ref(fig:psycho-style-pairwise-long-cors) shows this analysis, which reveals that percentile ranks are quite stable. Regardless of the age of the children, across a 2--4 month age gap the two percentiles are correlated at better than 0.8.
Longitudinal stability declines to around 0.5 at a maximal remove of 16 months, but this decline should be taken with a grain of salt. First, a 16-month gap amounts to a doubling of the child's age, so stability might be expected to be lower. Second, many children who are measured longitudinally across a 16-month gap will be expected to move from the floor of the form to the ceiling, compromising measurement accuracy. To test this last hypothesis, we evaluated the longitudinal stability of correlations using the same analysis as above, but varying whether we used raw scores or percentiles. The percentile method substantially increased correlations.^[We also used latent abilities derived from a 4-parameter IRT model as below. While the IRT-derived ability parameters showed a consistent improvement in longitudinal correlations over the use of raw scores, percentiles realized a further gain over the IRT parameters in this case.]
```{r psycho-style-pairwise-long-cors, fig.height=4, fig.cap="Correlations between vocabulary percentiles at multiple age points as a function of the age difference between them."}
age_binsize <- 2

long_cors <- n_long_ws %>%
  unite("id", c("original_id", "source_name")) %>%
  mutate(age = round(age / age_binsize) * age_binsize) %>% # round age into two-month bins
  select(id, language, age, percentile) %>%
  group_by(id, age) %>%
  sample_n(size = 1) %>% # keep one measurement per child per age bin
  ungroup()

long_cor_ns <- long_cors %>%
  split(.$language) %>%
  map_df(function(df) {
    language <- df$language[1]
    df %>%
      select(-language) %>%
      widyr::pairwise_count(age, id) %>%
      mutate(language = language) %>%
      rename(age1 = item1,
             age2 = item2)
  })

long_cor_pairs <- long_cors %>%
  spread(age, percentile) %>%
  split(.$language) %>%
  map_df(function(df) {
    language <- df$language[1]
    cor_mat <- select(df, -language, -id) %>%
      cor(use = "pairwise.complete.obs")
    as_data_frame(cor_mat) %>%
      mutate(age2 = rownames(cor_mat)) %>%
      gather(age1, cor, -age2) %>%
      mutate(language = language,
             age1 = as.numeric(age1),
             age2 = as.numeric(age2),
             dist = age2 - age1) %>%
      filter(dist > 0)
  }) %>%
  left_join(long_cor_ns)

ggplot(filter(long_cor_pairs, n >= 50),
       aes(x = dist, y = cor, col = age1)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n), method = "lm") + # weight fits by bin size
  facet_wrap(~language) +
  ylab("Correlation") +
  xlab("Measurement gap (months)") +
  .scale_colour_continuous(name = "Starting age") +
  scale_size_continuous(name = "N") +
  ylim(0, 1)
```
In sum, the variability between children that we observe in the CDI is quite stable longitudinally. It declines over time, but some of this decline may simply be due to the unavoidable limitations of CDI forms with respect to floor and ceiling effects.
## Psychometric modeling
In this next section, we examine the psychometric properties of the CDI through the lens of psychometric models and Item Response Theory (IRT). In brief, IRT provides a set of models for estimating the measurement properties of tests consisting of multiple items. These models assume that individuals vary on some latent trait, and that each item in a test measures this latent trait to some, possibly variable, extent [see @baker2001 for a detailed introduction]. IRT models are a useful tool for constructing and evaluating CDI instruments, as they can help to identify items that perform poorly in estimating underlying ability. For example, @weber2018 used IRT to identify poorly-performing items in a new CDI instrument for Wolof (a language spoken in Senegal). IRT can also be used in the construction of computer-adaptive tests; this method has recently been applied to the CDI [@makransky2016; cf. @mayor2018].
IRT models vary in their parameterization. In the simplest (Rasch) IRT model, each item has a difficulty parameter that controls how likely a test-taker with a particular ability is to get a correct answer. In contrast, in a two-parameter model, each item also has a discrimination parameter that controls how much response probabilities vary with varying abilities. Good items will tend to have high discrimination parameters across a range of difficulties so as to identify test-takers at a range of abilities. Three- and four-parameter models add parameters for estimating lower and upper bounds of responding for individual items.
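For concreteness, the two-parameter logistic (2PL) model expresses the probability that a child with latent ability $\theta$ produces item $i$ as

$$P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}},$$

where $a_i$ is the item's discrimination and $b_i$ its difficulty (the Rasch model is the special case in which all $a_i$ are equal). The four-parameter model adds a lower bound $g_i$ and an upper bound $u_i$ on response probabilities:

$$P_i(\theta) = g_i + (u_i - g_i) \, \frac{1}{1 + e^{-a_i(\theta - b_i)}}.$$

Setting $g_i = 0$ and $u_i = 1$ recovers the 2PL. (Note that the `mirt` package used below parameterizes items with an easiness intercept $d_i$ rather than a difficulty $b_i$, which is why difficulty appears as $-d$ in the figures that follow.)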
We examine IRT models as a window into the psychometric properties of the CDI. In the first subsection, we explore latent factor scores using the English WS data. In the second subsection, we examine individual items and find generally positive measurement properties, although with some items at ceiling (included via carry-over from the Words & Gestures form). In the third subsection, we look at differences between comprehension and production in the WG form. In the fourth subsection, we look at the properties of the instrument by word category in both WS and WG.
Overall, the conclusions of our analysis are that:
* Latent factor scores may have some advantages relative to raw scores in capturing individuals' abilities, but for the purposes of the analyses we perform in the main body of the book, they may carry some risks as well; hence, we do not adopt them more generally.
* In general, CDI WS items tend to perform well, but from a purely psychometric perspective there are a number of items that could be removed from the English WS form because their measurement properties are not ideal.
* Comprehension items, in general, tend to have less discrimination than production items, suggesting that they are less clear indicators of children's underlying abilities.
* Function words tend to have lower discrimination than other items, but the lexical class differences are not huge and do not interact with whether they are measured using production vs. comprehension.
These analyses generally ratify the conclusion that the measurement properties of the CDI are good, even for function words and for comprehension measures. These questions may carry slightly less signal about the specifics of a child's vocabulary and load more heavily on a parent's general estimation of the child's linguistic ability, but they do carry signal that relates to other responses. Further, when the English CDI departs from good measurement practice, it generally does so for completeness (e.g., including *mom* and *dad* because these words are important to parents, even though they do not show good measurement properties).
```{r psycho-load_irt_data_summary_ws}
# eng_ws <- read_feather("data/psychometrics/eng_ws_raw_data.feather")
base::load("data/psychometrics/eng_ws_raw_data.Rds")

d_ws <- eng_ws %>%
  mutate(produces = value == "produces") %>%
  filter(!is.na(category)) %>%
  select(data_id, produces, age, production, sex, definition)

base::load("data/psychometrics/eng_ws_mods_2pl.Rds")

d_ws_summary <- d_ws %>%
  group_by(data_id, sex, age) %>%
  summarise(production = production[1]) %>%
  right_join(fscores_2pl %>%
               mutate(data_id = as.numeric(data_id))) %>%
  filter(!is.na(sex))
```
### Measurement properties of individual WS items
A first question that we can ask using a fitted IRT model is how well individual items relate to children's overall latent abilities. Practically speaking, in these analyses, we use the `mirt` package [@chalmers2012;@chalmers2016] to estimate the parameters of a four-parameter IRT model. As described above, the two-parameter model includes difficulty and discrimination parameters for each item. The four-parameter model supplements the standard two-parameter model with two parameters corresponding to floor and ceiling performance for a particular item. Items with high rates of guessing or universal acceptance across test takers would tend to have abnormal values on these bounds.
We fit Rasch, two-, three-, and four-parameter models to the English WS data and performed a set of model comparisons. On all metrics -- AIC, BIC, and direct likelihood comparison -- the 2PL model handily outperformed the Rasch model, suggesting that not every item has the same discrimination. Similarly, the 3PL outperformed the 4PL on all metrics, suggesting that adding an upper bound parameter did not increase model fit. On the other hand, the 2PL and 3PL were close in fit, with AIC and log likelihood favoring the 3PL but BIC favoring the 2PL. In all but one of the analyses below, we adopt the 2PL for simplicity. In an exploratory analysis, we examine upper and lower bounds from the 4PL, because the estimated upper bounds help us reason about those items that are not yet universally known by the older children in our sample.
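For readers interested in the mechanics, the chunk below sketches this model-fitting pipeline. It is a minimal reconstruction rather than the exact code behind the stored `.Rds` fits: it assumes that `d_ws` (constructed above) holds one row per child-word pair with a logical `produces` indicator, and it uses `mirt`'s standard interface for fitting and comparing unidimensional models.

```{r psycho-irt-sketch, eval=FALSE}
# A minimal sketch (not the code used to produce the stored fits).
library(mirt)

# Reshape to one row per child and one 0/1 column per word.
resp_matrix <- d_ws %>%
  select(data_id, definition, produces) %>%
  mutate(produces = as.numeric(produces)) %>%
  tidyr::pivot_wider(names_from = definition, values_from = produces) %>%
  tibble::column_to_rownames("data_id")

# Fit unidimensional models of increasing complexity.
mod_rasch <- mirt(resp_matrix, model = 1, itemtype = "Rasch")
mod_2pl <- mirt(resp_matrix, model = 1, itemtype = "2PL")
mod_3pl <- mirt(resp_matrix, model = 1, itemtype = "3PL")
mod_4pl <- mirt(resp_matrix, model = 1, itemtype = "4PL")

# Model comparisons via AIC, BIC, and likelihood ratio.
anova(mod_rasch, mod_2pl)
anova(mod_2pl, mod_3pl)
anova(mod_3pl, mod_4pl)

# Item parameters (a1 = discrimination, d = easiness, g/u = bounds)
# and per-child latent ability estimates.
coefs_2pl <- tibble::as_tibble(coef(mod_2pl, simplify = TRUE)$items,
                               rownames = "definition")
fscores_2pl <- tibble::as_tibble(fscores(mod_2pl, method = "EAP"),
                                 rownames = "data_id")
```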
```{r psycho-item-individual-ws, fig.cap="Item characteristic curves for a set of individual items from the English WS sample.", fig.height=4.5}
thetas <- seq(-6, 6, .1)

# 4PL item response function: lower bound g, upper bound u
irt4pl <- function(a, d, g, u, theta = seq(-6, 6, .1)) {
  p <- g + (u - g) * boot::inv.logit(a * (theta + d))
  return(p)
}

# 2PL item response function
irt2pl <- function(a, d, theta = seq(-6, 6, .1)) {
  p <- boot::inv.logit(a * (theta + d))
  return(p)
}

examples <- c("table", "mommy*", "trash", "yesterday")

iccs <- coefs_2pl %>%
  filter(definition %in% examples) %>%
  split(.$definition) %>%
  map_df(function(d) {
    return(data_frame(definition = d$definition,
                      theta = thetas,
                      p = irt2pl(d$a1, d$d, thetas)))
  })

ggplot(iccs,
       aes(x = theta, y = p)) +
  geom_line() +
  facet_wrap(~definition) +
  xlab("Ability") +
  ylab("Probability of production")
```
We begin by examining some individual item curves from the 2PL fits. Figure \@ref(fig:psycho-item-individual-ws) shows four representative item characteristic curves. Each plots the probability of production across a range of latent ability scores. *Mommy* is produced by children at all ability levels and is relatively uninformative about ability. In contrast, *table* and *trash* are both of moderate difficulty, but *table* is more informative because it has a steeper slope. Finally, *yesterday* is more difficult overall -- in our sample, many high-vocabulary children still did not produce this word. Generally, items with steeper slopes are considered more diagnostic of ability and hence more desirable.
```{r psycho-items-ws, fig.cap="Words (points), plotted by their difficulty and discrimination parameters, as recovered by the 2-parameter IRT model (see text). Outliers are labeled."}
ggplot(coefs_2pl,
       aes(x = a1, y = -d)) +
  geom_point(alpha = .3) +
  ggrepel::geom_text_repel(data = filter(coefs_2pl,
                                         -d < -3.8 | -d > 5.3 | a1 > 4 | a1 < 1),
                           aes(label = definition), size = 3, family = .font) +
  xlab("Discrimination") +
  ylab("Difficulty")
```
<!-- Our goal in this first analysis is simply to examine parameter estimates across individuals and items. -->
We now examine these properties across the whole instrument. Figure \@ref(fig:psycho-items-ws) shows item discrimination and difficulty across the full set of items, with outlying items labeled. Difficulty refers to the latent ability necessary for a child to produce an item, on average. Discrimination refers to how well an item discriminates between children of lower and higher ability (as judged by their performance on other items). For example, the word *table* is spoken by just about half of the children in the sample. Hence, asking whether a child says *table* is a good way to guess whether they are in the top or bottom half of the distribution.
In contrast, visual inspection shows a tail of items with limited discrimination and low difficulty (e.g., *mommy*, *daddy*, *uh oh*). These are items that are produced by nearly all of the children in the sample -- they do not discriminate because they are passed by essentially everyone. If the only goal of the instrument were discrimination of different ability levels, they could likely be removed. But, as discussed above, these items tend to be included for completeness. Including these items also helps with compatibility between instruments, since the WS instrument is a strict superset of the WG instrument, which is used with younger children and for which there would presumably be more variability in a word like *uh oh*. On the upper part of the plot, we also see a large cluster of words that are quite difficult (e.g., *country*, *would*, *were*); these items show some useful discrimination, but presumably only for high-ability children.
```{r psycho-bounds-plot, fig.cap="Words (points), plotted now by their lower and upper bound parameters from the 4-parameter IRT model."}
base::load("data/psychometrics/eng_ws_mods_4pl.Rds")
ggplot(coefs_4pl,
aes(x = g, y = u)) +
geom_point(alpha = .3) +
ggrepel::geom_text_repel(data = filter(coefs_4pl,
abs(g) > .4 | u < .75),
aes(label = definition), size = 3, family = .font) +
xlab("Lower bound (high base rate)") +
ylab("Upper bound (not known by many)")
```
Turning now to an exploratory analysis using the 4PL model, we examine the recovered upper and lower bounds estimated for particular words, as shown in Figure \@ref(fig:psycho-bounds-plot). While overall the 4PL model does not improve fit, these parameters are useful because they show the subsets of words that are known by only a small number of children (low ceiling) or by almost all children (high floor). Examining those with a very low ceiling, we see items that are likely to be quite idiosyncratic, for a variety of reasons. For example, *babysitter*, *camping*, and *basement* likely vary by children's home experiences (further mediated by access to resources, parenting practices, and circumstances). In contrast, genital items (e.g., *vagina*, or the version of this item used in the child's family) vary by gender (see Chapter \@ref(items-demographics)). Examining those items with a very high floor shows early-learned words like *mommy*. These words are similar to the words with very low discrimination discussed above. Because they are known by essentially all children, the four-parameter model may have fit them as having a high chance level with essentially no discrimination ability.
One way to think about these analyses is that they show that the CDI has not only a large core of words with good measurement properties but also some other words that do not contribute as substantially and add length without adding much signal. If the goal of the CDI were only to provide psychometric estimates of vocabulary size, these would be good candidates for deletion. But because CDIs are also used for other purposes -- such as the analyses we present in subsequent chapters -- a larger set of items can be useful. We return to this general set of issues in Chapter \@ref(conclusion-beyond-cdi).
### Production and comprehension
```{r psycho-load-irt-data-wg, fig.cap="Histograms of words' difficulty and discrimination parameters, for comprehension and production."}
base::load("data/psychometrics/eng_wg_mods_2pl.Rds")
coefs_2pl_wg <- bind_rows(coefs_2pl_wg_produces %>%
mutate(measure = "Produces"),
coefs_2pl_wg_understands %>%
mutate(measure = "Understands"))
wg_comp_prod <- coefs_2pl_wg %>%
select(a1, d, measure) %>%
gather(parameter, value, a1, d) %>%
mutate(parameter = fct_recode(parameter,
Discrimination = "a1",
Difficulty = "d") %>%
relevel("Difficulty"),
measure = fct_relevel(measure, "Understands"))
ggplot(wg_comp_prod,
aes(x = value)) +
geom_histogram(binwidth = .5) +
facet_grid(measure ~ parameter) +
# xlim(-5,5) +
xlab("Parameter value") +
ylab("Number of words")
wg_comp_prod_summary <- wg_comp_prod %>%
group_by(measure, parameter) %>%
summarise(value = mean(value))
```
We next use IRT to estimate whether there are differences between production and comprehension, using WG data. To do so, we fit 2PL models to the WG data and examine the distribution of item parameters. In general, a good item distribution will have a range of difficulties, so as to be sensitive to differences between children at a variety of levels. These items should also have relatively high discrimination, so that answers to individual items provide relatively more information.
Figure \@ref(fig:psycho-load-irt-data-wg) shows discrimination and difficulty parameter value distributions for WG production and comprehension. Difficulty is much higher (negative values) for production relative to comprehension, reflecting the expected asymmetry of production coming "after" (being more difficult than) comprehension. With respect to comprehension, several trends are visible. First, comprehension questions largely have positive discrimination parameters. Thus, these questions on the whole carry signal about children's latent linguistic ability. There appear to be more items with low or even negative discrimination parameters than in production, however, indicating more items that are not measuring ability appropriately (perhaps because they are difficult for all children or because they are too hard to assess). Mean discrimination is substantially lower for comprehension relative to production (`r roundp(wg_comp_prod_summary$value[wg_comp_prod_summary$parameter == "Discrimination" & wg_comp_prod_summary$measure == "Understands"], 1)` vs. `r roundp(wg_comp_prod_summary$value[wg_comp_prod_summary$parameter == "Discrimination" & wg_comp_prod_summary$measure == "Produces"], 1)`).
Overall, this pattern is consistent with the hypothesis that production behavior is a clearer signal of children's underlying knowledge than assumed comprehension. Why? Perhaps parents are better reporters of production than comprehension, and hence these items are more discriminative of true behavior. The source of error in this case would be parents' mistaken beliefs that their child understands a word. Or perhaps comprehension is a fundamentally more variable construct, such that behavior consistent with understanding an individual word could be due to partial knowledge. Here the source of error is variance in how well children know the meanings of words. We cannot distinguish between these two models, but they have different underlying implications for the CDI.
### Lexical category effects on item performance
One question that we have often speculated about is whether there are special psychometric issues with particular word classes. For example, do parents struggle especially to identify whether children produce or understand function words?
```{r psycho-lexcat, fig.height=6.5, fig.cap="Lexical class effects on difficulty and discrimination for Words and Sentences. The top plot shows individual words plotted by their parameter values, with color representing the lexical class of the words. The bottom plot shows discrimination information in the form of a histogram."}
coefs_2pl <- coefs_2pl %>%
  left_join(items %>%
              filter(language == "English (American)",
                     form == "WS")) %>%
  mutate(lexical_class_label = lexical_class %>% factor() %>%
           fct_relabel(~.x %>% label_caps() %>% as.character()))

class_summary <- coefs_2pl %>%
  group_by(lexical_class, lexical_class_label) %>%
  summarise(sd_a1 = sd(a1, na.rm = TRUE),
            a1 = mean(a1))

a <- ggplot(coefs_2pl,
            aes(x = a1, y = -d, col = lexical_class_label)) +
  geom_point(alpha = .3) +
  ggrepel::geom_text_repel(data = filter(coefs_2pl,
                                         a1 < 1 | a1 > 3.8 | -d > 5 | -d < -2.5),
                           aes(label = definition), size = 2, family = .font,
                           show.legend = FALSE) +
  .scale_colour_discrete(name = "Lexical class") +
  xlab("Discrimination") +
  ylab("Difficulty")

b <- ggplot(coefs_2pl,
            aes(x = a1, fill = lexical_class_label)) +
  geom_histogram() +
  .scale_fill_discrete(name = "Lexical class") +
  xlab("Discrimination") +
  ylab("Number of words") +
  xlim(0, 4)

gridExtra::grid.arrange(a, b)
```
Using 2PL parameter fits, Figure \@ref(fig:psycho-lexcat) shows WS item difficulty and discrimination (as above) and the histogram of discrimination, broken down by lexical class (color). Many of the easy, non-discriminating items are found in the "other" section. In contrast, the hardest items tend to be function words. Function words tend to have similar discrimination on average (`r roundp(class_summary$a1[class_summary$lexical_class == "function_words"], 1)`) compared with nouns (`r roundp(class_summary$a1[class_summary$lexical_class == "nouns"], 1)`), and modestly lower discrimination than adjectives (`r roundp(class_summary$a1[class_summary$lexical_class == "adjectives"], 1)`) and especially verbs (`r roundp(class_summary$a1[class_summary$lexical_class == "verbs"], 1)`). The situation is not dire: all classes have a mean discrimination parameter above one. Thus, although function words are not the most discriminative items on the CDI WS, these items still appear to encode valid signal about children's abilities.
```{r psycho-lexcat-summary, fig.cap="Mean discrimination values for individual words in production and comprehension measures from the Words and Gestures form (error bars show SD).", fig.height=3.5}
coefs_2pl_wg <- coefs_2pl_wg %>%
  left_join(items %>%
              filter(language == "English (American)",
                     form == "WG"))

wg_comp_prod <- coefs_2pl_wg %>%
  select(a1, d, measure, lexical_class) %>%
  gather(parameter, value, a1, d) %>%
  mutate(parameter = fct_recode(parameter,
                                Discrimination = "a1",
                                Difficulty = "d") %>%
           relevel("Difficulty"))

class_summary <- wg_comp_prod %>%
  group_by(lexical_class, measure, parameter) %>%
  summarise(mean = mean(value),
            sd = sd(value)) %>%
  ungroup() %>%
  mutate(lexical_class = lexical_class %>% factor() %>%
           fct_relabel(~.x %>% label_caps() %>% as.character()) %>%
           fct_rev(),
         measure = fct_relevel(measure, "Understands"))

# ggplot(wg_comp_prod,
#        aes(x = value, fill = lexical_class)) +
#   geom_histogram(binwidth = .5) +
#   facet_grid(measure ~ parameter) +
#   xlim(-5, 5)

# wg_comp_prod_summary <- wg_comp_prod %>%
#   group_by(measure, parameter) %>%
#   summarise(value = mean(value))

ggplot(filter(class_summary, parameter == "Discrimination"),
       aes(y = lexical_class, x = mean)) +
  ggstance::geom_pointrangeh(aes(xmin = mean - sd, xmax = mean + sd)) +
  facet_grid(measure ~ .) +
  geom_vline(xintercept = 0, linetype = .refline, colour = .grey) +
  # .scale_colour_discrete(guide = FALSE) +
  xlab("Discrimination") +
  ylab("") +
  theme(panel.grid.major.y = .coef_line)
```
In our last analysis, we turn to the WG data. Figure \@ref(fig:psycho-lexcat-summary) shows the mean discrimination parameter values (error bars show SD). The only major trend is that there is a moderate level of discrimination for all classes except "other" (which includes items like *mommy* and *daddy* and a variety of animal sounds and social routines). One hypothesis about this finding is that, especially early on, parents are very generous in their interpretation of whether their child understands these specific words.
In sum, we do not find evidence that function words are particularly low-performing items from a psychometric perspective -- even in comprehension assessments! Rather, there are some low-performing items spread across all categories of the CDI form, and many of these likely perform poorly for the reasons described above -- especially difficulty in interpretation of very early behavior and variability in home experience.
### IRT models: Conclusions
One question regarding IRT-derived ability parameters for individual children is whether they should be used in place of percentiles or raw scores for some of the measurement problems we encounter throughout the rest of the book. Although these latent ability scores might in principle be better reflections of children's vocabulary than other measures, we do not find strong evidence to support that conclusion. For example, in the analysis above, we compared longitudinal correlations derived from raw scores, percentiles, and IRT ability parameters. While IRT parameters yielded higher correlations than raw scores, empirical percentiles performed better still (at least for Norwegian and English, two languages for which we have large amounts of data).
Furthermore, there are other negatives associated with swapping an imperfect but straightforward measure (raw and percentile scores) for a model-derived measure (latent ability). Interpretation clearly suffers if we use the model-derived measure, since readers will not be able to map scores back to actual behavior in terms of the checklist. In addition, model estimation issues across instruments introduce further difficulties in interpretation. Most obviously, model estimates with smaller datasets may vary in unpredictable ways; similarly, a greater presence of poorly-performing items in certain datasets may lead to systematic issues in the latent estimates for those datasets. In the absence of clear solutions to these model-fitting problems, we choose the route of using the "sumscore" [@borsboom2006], while acknowledging its limitations.
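To make the comparison concrete, both scores are simple to compute from the same data; here is a minimal sketch, reusing the hypothetical `resp_matrix` and `mod_2pl` objects from the IRT sketch above.

```{r psycho-sumscore-sketch, eval=FALSE}
# Sumscore: the raw CDI score, i.e., the number of words reported produced.
sumscores <- rowSums(resp_matrix, na.rm = TRUE)

# Model-derived alternative: EAP latent ability estimates from the 2PL fit.
abilities <- mirt::fscores(mod_2pl, method = "EAP")[, "F1"]

# Agreement between the two orderings of children.
cor(sumscores, abilities, method = "spearman")
```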
## Conclusions
In this chapter, we examined the measurement properties of the CDI from three perspectives. From a theoretical perspective, we reviewed why the design features of the CDI make it a reasonable tool for measuring child language, even if there are opportunities for error and bias throughout. (Of course, one of these design features is the style of administration for a particular study, and a poorly-administered form will yield a dataset with lower reliability and greater bias.) Then, we took advantage of the deep longitudinal data available for two languages and showed quite strong longitudinal correlations between CDI administrations. This pattern indicates that early language is a stable construct across development [@bornstein2012]. It also signals that measurement error between CDI administrations appears to be limited, at least when the span of time between administrations is not too great. Finally, we used item response theory to examine the measurement properties of individual items. While the CDI includes some items with limited measurement value (if all the user cares about is a single ability score), most items show good psychometric properties. This analysis also revealed that comprehension questions and questions about function words do not appear to be particularly worse than other items, contrary to previous speculations. In sum, the CDI appears to be a reliable instrument for measuring children's early language, with measurement properties that support a range of further analyses.