# Consistency in Early Vocabulary {#items-consistency}
Which words do children learn first? In spite of tremendous individual variation in rate of development [see Chapter \@ref(vocabulary); @fenson1994; @hart1995], the first words that children utter are reported to be quite consistent. We examine this claim both qualitatively and quantitatively, focusing initially on the first ten words and then zooming out to place the claim in typological and developmental context.

## Introduction and methods

Drawing on diary data, a number of early studies noted the similarities in children's first words across languages [e.g., @clark1973; @slobin1970; see also @schneider2015]. This observation formed the basis for a number of theories of usage (including Clark's influential semantic feature hypothesis). While we return briefly to the question of *why* we see similarity across languages in the contents of early vocabulary, in this chapter we seek to establish a firmer empirical understanding of both the similarities and differences in early-learned words across languages.

Our approach primarily follows the lead of a systematic examination of the early vocabularies of children learning English, Mandarin, and Cantonese [@tardif2008]. Tardif and colleagues found that children's first 10 words in all three languages tended to be about important people in their life (_mom_, _dad_), social routines (_hi_, _uh oh_), animals (_dog_, _duck_), and foods (_milk_, _banana_). Here we attempt to generalize this analysis, asking more broadly whether words tend to be learned in the same order across languages.

One challenge is that the precise words that children learn in different languages are (of course) language-specific. We would really like to ask whether the *concepts* that are being talked about are the same -- or at least similar. As detailed in Chapter \@ref(intro-practical), the items on each language's form are adaptations and not translations: They are intended to capture the spirit of the items on the English form rather than to replicate them exactly. Thus, not all words appear on all forms. In addition, conceptual mappings across languages are also subject to cross-cultural variation (the "tortilla problem" discussed earlier). In what follows, we acknowledge this caveat, but assume for simplicity that *dog*, *chien*, and *perro* name (roughly) the same concept. Our approach is thus to take advantage of cases in which translation equivalents appear on multiple forms and to examine variability in how quickly these words are acquired across languages. This analysis is in some sense a "rough draft" of our more systematic quantitative approach in Chapter \@ref(items-prediction), focusing on first words specifically.
```{r items-load_consistency_data, eval=FALSE}
langs <- get_instruments() %>%
select(language, form) %>%
filter(form %in% c("WS", "WG")) %>%
group_by(language) %>%
filter(n() == 2)
items <- map2(langs$language, langs$form, get_item_data) %>%
bind_rows() %>%
filter(!is.na(uni_lemma), type == "word")
items_nested <- items %>%
group_by(language, form) %>%
nest()
item_data <- map_dfr(1:nrow(items_nested),
~get_instrument_data(language = items_nested[.x, "language"] %>% pull(),
form = items_nested[.x, "form"] %>% pull(),
items = items_nested[[.x, "data"]]$item_id,
administrations = T))
```
```{r items-get_aoas, eval = FALSE}
understand_items <- item_data %>%
filter(form == "WG") %>%
mutate(value = if_else(is.na(value), "", value)) %>%
left_join(items %>% filter(form == "WG"),
by = c("num_item_id", "language", "form")) %>%
mutate(value = nchar(value) > 0) %>%
group_by(language, uni_lemma, age, data_id) %>%
summarise(value = any(value)) %>%
left_join(items %>% filter(form == "WG") %>%
group_by(language, uni_lemma) %>%
distinct(num_item_id) %>%
slice(1), by = c("language", "uni_lemma")) %>%
mutate(value = if_else(value, "understands", ""))
understand_aoas <- understand_items %>%
group_by(language) %>%
nest() %>%
mutate(aoas = map(data, ~fit_aoa(.x, measure = "understands", method = "glm",
age_max = 30))) %>%
select(-data) %>%
unnest() %>%
left_join(items %>% filter(form == "WG") %>%
group_by(language, uni_lemma) %>%
distinct(num_item_id) %>%
slice(1), by = c("language", "num_item_id", "uni_lemma"))
produce_items <- items %>%
filter(form == "WG") %>%
mutate(value = if_else(is.na(value), "", value)) %>%
distinct(language, uni_lemma) %>%
left_join(items, by = c("language", "uni_lemma")) %>%
left_join(item_data, by = c("language", "num_item_id", "form")) %>%
mutate(value = value == "produces") %>%
ungroup() %>%
arrange(form) %>%
group_by(language, uni_lemma, age, data_id) %>%
summarise(num_item_id = first(num_item_id), value = any(value)) %>% #within subj
group_by(uni_lemma) %>%
mutate(num_item_id = first(num_item_id)) %>% #across forms
mutate(value = if_else(value, "produces", ""))
produce_aoas <- produce_items %>%
group_by(language) %>%
nest() %>%
mutate(aoas = map(data, ~fit_aoa(.x, measure = "produces", method = "glm",
age_max = 36))) %>%
select(-data) %>%
unnest() %>%
left_join(items %>% filter(form == "WG"), by = "language") %>%
select(language, num_item_id, aoa, uni_lemma)
all_aoas <- understand_aoas %>%
mutate(measure = "understands") %>%
bind_rows(produce_aoas %>% mutate(measure = "produces")) %>%
filter(!is.na(aoa))
write_feather(all_aoas,"data/items-consistency/aoas.feather")
```
```{r items-load-aoas}
all_aoas <- read_feather("data/items-consistency/aoas.feather")
```
To estimate the similarity of each item's trajectory, we use a single measure of its difficulty: age of acquisition (AoA) -- the age at which 50% of children in each language are estimated to have acquired it (Appendix \@ref(appendix-aoa)). We analyzed consistency in both comprehension and production, using Words & Gestures forms to estimate age of acquisition in comprehension, and stitching across Words & Gestures and Words & Sentences forms to estimate age of acquisition in production. Because of this strategy of combining forms, we were restricted to the `r n_distinct(all_aoas$language)` languages for which data for both forms were available.
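To make the 50% definition concrete, the following base-R sketch fits a logistic regression of acquisition on age for a single simulated word and solves for the age at which the fitted probability crosses 0.5. The data, the variable names, and the "true" AoA of 13 months are assumptions made for illustration; this is not the implementation behind the estimates reported in this chapter.

```r
# Simulated reports for one hypothetical word: one row per child.
set.seed(1)
age <- rep(8:18, each = 50)        # child ages in months
true_aoa <- 13                     # assumed value used only to simulate data
produces <- rbinom(length(age), 1, plogis(0.8 * (age - true_aoa)))

# Logistic regression of the binary outcome on age.
fit <- glm(produces ~ age, family = binomial)

# The fitted probability equals 0.5 where intercept + slope * age = 0,
# so the estimated AoA is -intercept / slope.
aoa <- -coef(fit)[["(Intercept)"]] / coef(fit)[["age"]]
aoa
```

Because the estimate depends only on where the fitted curve crosses 0.5, it can be computed even for words that few children in the sample know, although such estimates are extrapolations and should be treated cautiously.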
```{r items-uni-lemma-completeness}
IN_LANGS <- 8
lemma_completeness <- all_aoas %>%
group_by(measure, uni_lemma) %>%
summarise(in_langs = sum(!is.na(aoa)))
cutoffs <- lemma_completeness %>%
group_by(measure, in_langs) %>%
summarise(n = n()) %>%
arrange(measure, desc(in_langs)) %>%
mutate(prop = n/sum(n)) %>%
mutate(cum_prop = cumsum(prop),
cum_n = cumsum(n))
cutoff_n <- cutoffs %>%
filter(measure == "produces", in_langs == IN_LANGS) %>%
slice(1) %>%
pull(cum_n)
cutoffs_prod <- cutoffs %>% filter(measure == "produces")
```
In total, we estimated ages of acquisition for `r all_aoas %>% distinct(uni_lemma) %>% nrow()` words across the `r n_distinct(all_aoas$language)` languages. Unfortunately, not every word appeared on all forms. Figure \@ref(fig:items-uni-lemma-completeness-figure) shows the proportion of words that appear on at least a given number of languages' forms. For our consistency analysis, we considered only the `r cutoff_n` words that appeared in at least `r IN_LANGS` of the `r n_distinct(all_aoas$language)` languages.
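The inclusion rule can be illustrated with a toy example in base R. The data frame, its values, and the cutoff of 2 are hypothetical and chosen only to keep the example small; the chapter's own cutoff is `IN_LANGS`.

```r
# Hypothetical toy data: one row per (language, word) with an AoA estimate.
toy_aoas <- data.frame(
  language  = c("A", "B", "C", "A", "B", "A"),
  uni_lemma = c("dog", "dog", "dog", "milk", "milk", "sky"),
  aoa       = c(14, 15, 13, 16, 17, 25)
)
min_langs <- 2  # illustrative cutoff

# Number of languages with a non-missing AoA for each word.
n_langs <- tapply(!is.na(toy_aoas$aoa), toy_aoas$uni_lemma, sum)

# Words that meet the cutoff and would enter the consistency analysis.
kept <- names(n_langs)[n_langs >= min_langs]

# Proportion of words found in at least k languages, for k = 1, 2, 3.
at_least <- sapply(1:3, function(k) mean(n_langs >= k))
```

Here `kept` contains _dog_ and _milk_ but not _sky_, and `at_least` falls as the required number of languages increases, which is the trade-off between coverage and comparability that the cutoff has to balance.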
```{r items-uni-lemma-completeness-figure, fig.width = 6, fig.height = 4, fig.cap = glue("The proportion of words found on at least each of a number of languages' CDI forms (e.g. all words appear on at least one form, {filter(cutoffs_prod, in_langs == 2)$n} words appear on at least 2 forms, and so on, with {filter(cutoffs_prod, in_langs == max(in_langs))$n} words appearing on all {max(cutoffs_prod$in_langs)} languages' forms). The dotted line shows the cutoff value we chose ({IN_LANGS}).")}
ggplot(cutoffs_prod, aes(x = in_langs, y = cum_prop)) +
geom_line() +
geom_segment(aes(x = IN_LANGS, xend = IN_LANGS, y = 0,
yend = filter(cutoffs, measure == "produces",
in_langs == IN_LANGS) %>% pull(cum_prop)),
linetype = .refline, colour = .grey) +
geom_segment(aes(x = 1, xend = IN_LANGS,
y = filter(cutoffs, measure == "produces",
in_langs == IN_LANGS) %>% pull(cum_prop),
yend = filter(cutoffs, measure == "produces",
in_langs == IN_LANGS) %>% pull(cum_prop)),
linetype = .refline, colour = .grey) +
scale_x_continuous(expand = c(0, 0), breaks = seq(2, 14, 2)) +
scale_y_continuous(expand = c(0, 0)) +
labs(x = "Number of forms", y = "Proportion of words") +
theme(plot.margin = margin(t = 10))
```
## The first 10 words
```{r items-top10-compute}
top10_long <- all_aoas %>%
inner_join(filter(lemma_completeness, in_langs >= IN_LANGS)) %>%
group_by(measure,language) %>%
arrange(aoa) %>%
slice(1:10)
top10 <- top10_long %>%
select(-aoa, -num_item_id, -in_langs) %>%
mutate(order = 1:n()) %>%
spread(language, uni_lemma)
in_langs <- top10_long %>%
group_by(measure, uni_lemma, in_langs) %>%
select(measure, uni_lemma, in_langs) %>%
distinct() %>%
arrange(measure, desc(in_langs), uni_lemma)
in_lang_props <- in_langs %>%
group_by(measure, in_langs) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n))
```
Following @tardif2008, we begin by examining the first 10 words acquired by children across the `r n_distinct(all_aoas$language)` languages we measured (Tables \@ref(tab:items-top10-production) and \@ref(tab:items-top10-comprehension)). Similar words appeared in the top 10 across languages, especially in children's earliest productions. In production, `r in_lang_props %>% filter(measure == "produces", in_langs == 15) %>% pull(n)` of the `r in_lang_props %>% filter(measure == "produces") %>% pull(n) %>% sum()` words appeared in the top ten earliest words of every language (`r in_lang_props %>% filter(measure == "produces", in_langs == 15) %>% pull(prop) %>% roundp()`), and all but `r in_lang_props %>% filter(measure == "produces", in_langs < 10) %>% pull(n)` appeared in at least ten languages (`r (1 - in_lang_props %>% filter(measure == "produces", in_langs < 10) %>% pull(prop)) %>% roundp()`). In comprehension, `r in_lang_props %>% filter(measure == "understands", in_langs == 15) %>% pull(n)` of the `r in_lang_props %>% filter(measure == "understands") %>% pull(n) %>% sum()` words appeared in the top ten earliest words of every language (`r in_lang_props %>% filter(measure == "understands", in_langs == 15) %>% pull(prop) %>% roundp()`), and all but `r in_lang_props %>% filter(measure == "understands", in_langs < 10) %>% pull(n)` appeared in at least ten languages (`r (1 - in_lang_props %>% filter(measure == "understands", in_langs < 10) %>% pull(prop)) %>% roundp()`). These words consist primarily of important family members (_mommy_, _daddy_, _grandma_), social routines (_hi_, _bye_, _peekaboo_), and sounds (_yum yum_, _vroom_, _woof woof_).
```{r items-top10-production, results="asis"}
MIN_HIGHLIGHT_PRODUCTION <- 6
top10_measure <- function(meas, min_highlight) {
top10 %>%
ungroup() %>%
filter(measure == meas) %>%
select(-measure) %>%
pivot_longer(cols = -order, names_to = "language", values_to = "word") %>%
group_by(word) %>%
mutate(n = n()) %>%
ungroup() %>%
mutate(word = word %>% str_remove(" \\(.*\\)$") %>% str_remove("own "),
word = cell_spec(word, bold = n >= min_highlight)) %>%
select(-n) %>%
pivot_wider(names_from = "language", values_from = "word") %>%
select(-order)
}
top10_table <- function(meas, action, min_highlight) {
df <- top10_measure(meas, min_highlight)
kable(df, escape = FALSE, col.names = rep("", ncol(df)),
caption = glue("The 10 earliest words that children {action} in each language. Bolded words appear in at least {min_highlight} languages.")) %>%
kable_styling(latex_options = "scale_down", font_size = 10) %>%
add_header_above(str_replace(colnames(df), " ", "\n"), align = "l", line = FALSE) %>%
landscape()
}
top10_table("produces", "produce", MIN_HIGHLIGHT_PRODUCTION)
```
```{r items-top10-comprehension, dependson="items-top10-production"}
MIN_HIGHLIGHT_COMPREHENSION <- 6
top10_table("understands", "understand", MIN_HIGHLIGHT_COMPREHENSION)
```
These tables strongly corroborate the conclusions of @tardif2008: similar words appear among the first 10 across languages, and the similarities are especially prominent in children's earliest productions.

Unfortunately, we cannot determine if the greater consistency found in early production is a real regularity in children's lexical development, or is instead a measurement artifact arising from the greater difficulty of reporting on a child's comprehension (see Chapter \@ref(psychometrics)).^[This finding is *prima facie* inconsistent with another recent analysis comparing variability in comprehension and production vocabularies [@mayor2014]. This analysis noted that comprehension vocabularies tend to be less idiosyncratic across children -- rather than across languages -- than production vocabularies.] It may be that early communicative needs drive the first words children produce to be even more similar than the first words they comprehend.
<!-- ## Global cross-linguistic similarity -->
```{r items-understands-and-produces, width = 4, height = 4}
mean_aoas <- all_aoas %>%
group_by(measure, uni_lemma) %>%
summarise(n = sum(!is.na(aoa)),
aoa = mean(aoa, na.rm = T)) %>%
filter(n >= IN_LANGS) %>%
arrange(measure, aoa) %>%
group_by(measure) %>%
mutate(order = 1:n())
wide_means <- mean_aoas %>%
spread(measure, aoa) %>%
group_by(uni_lemma) %>%
summarise_at(vars(n, produces, understands), mean, na.rm=T)
measure_correlation <- cor.test(wide_means$produces,
wide_means$understands, use = "pairwise")
```
```{r items-understands-and-produces-plots, fig.cap = "Average age of acquisition in comprehension and production for each measured word. Dashed line provides a reference with slope = 1 (identical age of acquisition)."}
ggplot(wide_means, aes(x = produces, y = understands, label = uni_lemma)) +
geom_smooth(method = "lm", aes(alpha = .1), se = F, color = "lightgray") +
geom_text(position=position_jitter(width=.25,height=.25), size = 2,
family = .font) +
labs(x = "Production", y = "Comprehension") +
theme(legend.position = "none") +
geom_abline(lty = 2)
```
Despite these differences between comprehension and production, words that are reported to be acquired early in one measure are also generally reported to be acquired early in the other. Figure \@ref(fig:items-understands-and-produces-plots) shows the relationship between the mean age of acquisition in production and the mean age of acquisition in comprehension for each of these `r cutoff_n` words across the `r n_distinct(all_aoas$language)` languages. The correlation between the two measures was quite high: _r_ = `r roundp(measure_correlation$estimate)` (_p_ `r print_pvalue(measure_correlation$p.value)`). Thus, apparent inconsistencies in first words for comprehension may be more a function of measurement error in comprehension than of any systematic difference.

Taken together, these analyses suggest that children's earliest words, and by inference the processes that underpin them, are highly similar across languages. The source of this similarity is hard to pin down, however. One possibility is that the difficulty of learning a word is determined predominantly by the complexity of the concept it denotes, so that variability in linguistic features (e.g., phonological and syntactic complexity) and cultural features (e.g., styles of parental interaction with children) plays a relatively small role in determining the difficulty of learning a word [@gentner2001]. Alternatively, the primary driver of difficulty could be linguistic, with the dimensions along which languages vary being orthogonal to the dimensions that make words hard to learn. For instance, verbs may be more difficult than nouns because they are relational: knowing nouns helps children learn verbs more than knowing verbs helps them learn nouns [@gleitman1990]. In this case, the linguistically relevant dimensions would be relatively invariant across languages [@snedeker2007]. Finally, it is worth noting that because the words on the CDI are not a random sample of words in each language, these correlations may overestimate the degree of cross-linguistic similarity, even though they are consistent with earlier diary studies.

In Chapter \@ref(items-prediction) we begin to take up these questions using predictive models. Before taking this step, however, we consider cross-linguistic ordering more holistically. In the remainder of the chapter, we address this problem from two directions: (1) is similarity in order of acquisition for two languages related to the degree of similarity between the two languages, and (2) does similarity in order of acquisition change over development?

## Acquisition similarity and linguistic similarity
```{r items-load-asjp-distances, eval = F}
# Download file from https://github.com/ddediu/lgfam-newick/blob/master/input/distances/ASJP/asjp16-dists.RData?raw=true
load("asjp16-dists.RData")
asjp <- asjp16.dm %>%
as_data_frame()
isocodes <- ISOcodes::ISO_639_3 %>%
select(Id,eng) %>%
rename(iso = Id, language = eng)
tidy_asjp <- asjp %>%
mutate(code1 = names(asjp)) %>%
gather(code2, distance, -code1)
languages <- data_frame(language = unique(all_aoas$language)) %>%
mutate(trim_language = gsub("\\s*\\([^\\)]+\\)", "", language)) %>%
left_join(isocodes, by = c("trim_language" = "language")) %>%
mutate(iso = if_else(trim_language == "Spanish", "spa",
if_else(trim_language == "Kiswahili", "swh",
if_else(trim_language == "Hebrew", "hbo", iso)))) %>%
select(-trim_language)
asjp_dist <- tidy_asjp %>%
left_join(languages, by = c("code1" = "iso")) %>%
filter(!is.na(language)) %>%
rename(language1 = language) %>%
select(-code1) %>%
left_join(languages, by = c("code2" = "iso")) %>%
filter(!is.na(language)) %>%
rename(language2 = language) %>%
select(-code2)
write_feather(asjp_dist, "data/items-consistency/asjp_dist.feather")
```
```{r items-aoa-consistency}
cor_reorder <- function(df) {
mat <- df %>%
ungroup() %>%
select(-measure) %>%
spread(language2, correlation) %>%
as.data.frame() %>%
column_to_rownames("language1") %>%
data.matrix()
corOrder <- corrplot::corrMatOrder(mat, order = "hclust")
df %>%
mutate(language1 = factor(language1, levels = unique(df$language1)[corOrder]),
language2 = factor(language2, levels = unique(df$language2)[corOrder]))
}
cors <- all_aoas %>%
group_by(measure) %>%
nest() %>%
mutate(cors = map(data, ~widyr::pairwise_cor(.x, language, uni_lemma, aoa))) %>%
select(-data) %>%
unnest(cols = c(cors)) %>%
complete(measure, item1, item2, fill = list(correlation = 1)) %>%
rename(language1 = item1, language2 = item2) %>%
split(.$measure) %>%
map_dfr(cor_reorder) %>%
ungroup() %>%
mutate(measure = fct_relevel(measure, "understands"))
wide_method_cor <- cors %>% spread(measure, correlation) %>%
filter(language1 != language2)
pairwise_cor_test <- cor.test(wide_method_cor$produces, wide_method_cor$understands)
```
Unfortunately, the `r n_distinct(all_aoas$language)` languages in our analyses are a small and non-representative sample of the world's languages, and thus we do not have sufficient power to detect typological features of language that might be responsible for differences in the similarity of acquisition across languages [@piantadosi2014]. Nonetheless, the languages do come from different language families and vary in their phylogenetic distance. We leverage this variability to ask whether the similarity between two languages is related to similarity in how quickly words for the same concepts are learned in those two languages.

Rather than averaging ages of acquisition across all languages, we consider, for each pair of languages, the correlation between the ages of acquisition of the `r cutoff_n` words in the two languages. Figure \@ref(fig:items-consistency-heatmap) shows these pairwise correlations for production as a matrix in which each cell shows a single pairwise correlation. This correlation matrix appears to contain a good deal of structure, with languages that are from the same language family (e.g., Norwegian and Danish) showing higher correlations in their ages of acquisition for the same concepts. Perhaps unsurprisingly given the high average correlation between production and comprehension, pairwise correlations were nearly identical for production and comprehension (_r_ = `r roundp(pairwise_cor_test$estimate)`, _p_ `r print_pvalue(pairwise_cor_test$p.value)`); we omit the comprehension matrices for length. Figure \@ref(fig:items-consistency-dendro) shows a dendrogram produced by hierarchically clustering these pairwise correlations.
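As a small illustration of this procedure, the base-R sketch below computes pairwise correlations between languages from a concept-by-language matrix of AoA estimates and then clusters the languages on the resulting correlation profiles. The language names, concepts, and values are invented for the example, and missing entries stand in for concepts that are absent from a form.

```r
# Hypothetical AoA estimates (months): rows are concepts, columns are languages.
aoa <- cbind(
  lang_a = c(12, 14, 18, 20, 24, NA),
  lang_b = c(13, 14, 19, 21, 23, 26),
  lang_c = c(20, 12, 25, 15, 18, 22)
)
rownames(aoa) <- c("mommy", "bye", "dog", "milk", "shoe", "red")

# Pearson correlation for each pair of languages, using only the concepts
# for which both languages in the pair have an estimate.
lang_cors <- cor(aoa, use = "pairwise.complete.obs")

# Hierarchical clustering on distances between rows of the correlation
# matrix, so languages with similar correlation profiles merge first.
hc <- hclust(dist(lang_cors))
```

In this toy case `lang_a` and `lang_b` have nearly the same acquisition order and are merged first, while `lang_c` joins last. Clustering on `dist()` of the correlation matrix mirrors the `make_hc()` helper in this chapter; clustering on `1 - correlation` would be another reasonable choice.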
```{r items-consistency-heatmap, fig.height=7.5, out.width="80%", fig.cap = "Correlation matrix showing pairwise correlations in words' age of acquisition. Languages that are more similar have more similar acquisition orders."}
ggplot(filter(cors, measure == "produces"),
aes(x = language1, y = language2, fill = correlation)) +
# facet_grid(measure ~ ., labeller = label_caps) +
coord_fixed() +
geom_tile() +
geom_text(aes(label = roundp(correlation)), family = .font) +
scale_x_discrete(expand = expand_scale()) +
scale_y_discrete(expand = expand_scale()) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
axis.title = element_blank())
```
```{r items-dendro-setup}
typology <- yaml::read_yaml("data/items-consistency/typology.yaml") %>%
tibble(label = names(.), family = unlist(.)) %>%
select(label, family)
make_hc <- function(df) {
mat <- df %>%
spread(language2, correlation) %>%
as.data.frame() %>%
column_to_rownames("language1") %>%
data.matrix()
hclust(dist(mat)) %>%
ggdendro::dendro_data(., type = "triangle")
}
```
```{r items-consistency-dendro, fig.cap = "Dendrograms of the similarity in the ages of words' first production cross-linguistically."}
dendros <- cors %>%
group_by(measure) %>%
nest() %>%
mutate(hc = map(data, make_hc)) %>%
select(-data)
dendro_segments <- dendros %>%
mutate(segment = map(hc, ~.x %>% ggdendro::segment())) %>%
select(-hc) %>%
unnest(cols = segment)
dendro_labels <- dendros %>%
mutate(segment = map(hc, ~.x %>% ggdendro::label())) %>%
select(-hc) %>%
unnest(cols = segment) %>%
mutate(label = label %>% str_remove(" \\(.*\\)")) %>%
left_join(typology, by = "label")
plt <- ggplot(dendro_segments) +
facet_grid(. ~ measure, scales = "free",
labeller = as_labeller(label_caps)) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_text(aes(x = x, y = y - 0.02, label = label, colour = family),
data = dendro_labels, hjust = 0, family = .font) +
coord_flip() +
scale_x_reverse() +
scale_y_reverse() +
# scale_colour_manual(values = lang_colours_aoa, guide = FALSE) +
.scale_color_discrete(name = "Language family",
guide = guide_legend(title.position = "top",
title.hjust = 0.5,
override.aes = list(size = 0),
keyheight = unit(0, "lines"))) +
expand_limits(y = -1) +
theme_get() +
theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.position = "bottom",
legend.text = element_blank(),
legend.title = element_text(size = rel(0.8), margin = margin(b = -8)),
legend.margin = margin(t = 0),
panel.border = element_blank())
lgnd <- typology %>%
distinct(family) %>%
mutate(x = scale(1:n())) %>%
ggplot(aes(x = x, y = 0)) +
geom_text(aes(label = family, colour = family), family = .font, size = 3) +
.scale_colour_discrete(guide = FALSE) +
lims(x = c(-2, 2)) +
theme_void()
cowplot::plot_grid(plt, lgnd, ncol = 1, rel_heights = c(20, 1))
```
```{r items-asjp-distance}
asjp_dist <- read_feather("data/items-consistency/asjp_dist.feather")
asjp_tests <- cors %>%
left_join(asjp_dist) %>%
filter(language1 < language2) %>%
group_by(measure) %>%
nest() %>%
mutate(cor = map(data, ~cor.test(.x$distance, .x$correlation))) %>%
mutate(estimate = map(cor, ~.x$estimate),
p.value = map(cor, ~.x$p.value)) %>%
select(-data, -cor) %>%
unnest(cols = c(estimate, p.value))
```
These dendrograms show high similarity within the North Germanic, Slavic, and Romance language families. Some relationships resist straightforward linguistic explanations (e.g., the relationship of Quebec French to other languages). These may be due to non-uniform sparsity of data across these languages, or may instead reflect interesting cultural or other sources of variability. Despite these cases, the order in which words are acquired appears to reflect, to a high degree, the structure of the languages that children learning these words speak. To confirm this observation quantitatively, we borrowed an established measure of linguistic similarity: the lexical similarity of words for the same meaning [@wichmann2010].

Using a set of 40 words for meanings common to all of the world's languages, @holman2008 used a string-edit distance metric to recover linguistic similarity estimates that correlated highly with geographic distance and with several typological systems. This method is appealing for our purposes as it is relatively agnostic as to the processes of language contact and change that have produced modern-day languages and instead tracks the similarity of word forms themselves. The language distance measures produced by this method were highly correlated with pairwise correlations in acquisition trajectories for both production (_r_ = `r filter(asjp_tests, measure == "produces") %>% pull(estimate) %>% roundp()`, _p_ `r filter(asjp_tests, measure == "produces") %>% pull(p.value) %>% print_pvalue()`) and comprehension (_r_ = `r filter(asjp_tests, measure == "understands") %>% pull(estimate) %>% roundp()`, _p_ `r filter(asjp_tests, measure == "understands") %>% pull(p.value) %>% print_pvalue()`).
```{r items-language-levs, eval = F}
# clean_words(c("dog", "dog / cat", "dog (animal)", "(a) dog", "dog*", "dog(go)", "(a)dog", " dog ", "Cat"))
clean_words <- function(word_set) {
word_set %>%
# dog / doggo
strsplit("/") %>% flatten_chr() %>%
# dog (animal) | (a) dog
strsplit(" \\(.*\\)|\\(.*\\) ") %>% flatten_chr() %>%
# dog* | dog? | dog! | ¡dog! | dog's
gsub("[*?!¡']", "", .) %>%
# dog(go) | (a)dog
map_if(
# if "dog(go)"
~grepl("\\(.*\\)", .x),
# replace with "dog" and "doggo"
~c(sub("\\(.*\\)", "", .x),
sub("(.*)\\((.*)\\)", "\\1\\2", .x))
) %>%
flatten_chr() %>%
# trim
gsub("^ +| +$", "", .) %>%
keep(nchar(.) > 0) %>%
tolower() %>%
unique() %>%
first()
}
item_aoas <- all_aoas %>%
filter(measure == "produces") %>%
left_join(items, by = c("language", "uni_lemma")) %>%
select(language, aoa, uni_lemma, definition) %>%
group_by(language, aoa, uni_lemma) %>%
slice(1) %>%
mutate(definition = clean_words(definition))
pairwise_levs <- function(langs, definitions) {
lang1 <- first(langs)
lang2 <- last(langs)
lang1_aoas <- definitions %>%
filter(language == lang1) %>%
rename(lang1 = language, def1 = definition) %>%
ungroup() %>%
select(lang1, uni_lemma, def1)
lang2_aoas <- definitions %>%
filter(language == lang2) %>%
rename(lang2 = language, def2 = definition) %>%
ungroup() %>%
select(lang2, uni_lemma, def2)
left_join(lang1_aoas, lang2_aoas, by = "uni_lemma") %>%
filter(!is.na(lang1) ,!is.na(lang2)) %>%
mutate(length = pmax(nchar(def1), nchar(def2))) %>%
mutate(dist = stringdist::stringdist(def1, def2, method= "lv")/length) %>%
summarise(dist = mean(dist, na.rm = T)) %>%
mutate(language1 = lang1, language2 = lang2)
}
lang_pairs <- combn(unique(item_aoas$language), 2, simplify = F)
levs <- map_dfr(lang_pairs, ~pairwise_levs(.x, item_aoas))
lev_char_cors <- cors %>%
left_join(levs) %>%
filter(language1 < language2)
```
```{r items-ipa-levs, eval = F}
lang_codes <- list(
"Croatian" = "hr",
"Danish" = "da",
"English (American)" = "en-us",
"French (French)" = "fr",
"French (Quebecois)" = "fr",
"Italian" = "it",
"Norwegian" = "no",
"Russian" = "ru",
"Spanish (Mexican)" = "es",
"Swedish" = "sv",
"Turkish" = "tr",
"Kiswahili" = "sw",
"Slovak" = "sk"
)
get_ipa <- function(word, lang) {
lang_code <- lang_codes[[lang]]
system2("espeak", args = c("--ipa=3", "-v", lang_code, "-q", glue("'{word}'")),
stdout = TRUE) %>%
gsub("^ ", "", .) %>%
gsub("[ˈˌ]", "", .)
}
get_phons <- function(words, lang) {
words %>% map_chr(function(word) word %>% get_ipa(lang))
}
phons <- item_aoas %>%
filter(!language %in% c("Hebrew", "Korean")) %>%
group_by(language) %>%
nest() %>%
mutate(phons = map2(data, language, ~get_phons(.x$clean_definition, .y))) %>%
unnest()
ipas <- phons %>%
mutate(ipas = str_split(phons, "[ _]+")) %>%
unnest() %>%
filter(nchar(ipas) > 0)
chars <- ipas %>%
distinct(ipas)
ascii_chars <- paste0("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
"0123456789", " !\"#$%&'()*+,./:;<=>?@[\\^_`{|}~-", "]")
latin_extended <- paste0("ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿ")
mapping_chars <- paste0(ascii_chars, latin_extended) %>%
str_split("") %>%
unlist() %>%
data_frame(mapping = .) %>%
distinct() %>%
slice(1:nrow(chars))
mapping_table <- bind_cols(chars, mapping_chars)
mapped_ipas <- ipas %>%
left_join(mapping_table) %>%
group_by(language, phons, aoa, uni_lemma, definition, clean_definition) %>%
summarise(mapping = paste0(mapping, collapse = "")) %>%
ungroup() %>%
select(-definition) %>%
rename(definition = mapping)
lang_ipa_pairs <- combn(unique(mapped_ipas$language), 2, simplify = F)
ipa_levs <- map_dfr(lang_ipa_pairs, ~pairwise_levs(.x, mapped_ipas))
lev_ipa_cors <- cors %>%
left_join(ipa_levs) %>%
filter(language1 < language2)
all_lev_cors <- lev_char_cors %>%
rename(char = dist) %>%
left_join(lev_ipa_cors,
by = c("measure", "language1", "language2", "correlation")) %>%
rename(ipa = dist)
write_feather(all_lev_cors, "data/items-consistency/lev_cors.feather")
```
```{r items-lev-cors}
lev_cors <- read_feather("data/items-consistency/lev_cors.feather")
lev_tests <- lev_cors %>%
gather(lev_type, distance, char, ipa) %>%
group_by(measure, lev_type) %>%
nest() %>%
mutate(cor = map(data, ~cor.test(.x$distance, .x$correlation))) %>%
# Extract scalars directly so no unnesting of list-columns is needed
mutate(estimate = map_dbl(cor, ~.x$estimate),
p.value = map_dbl(cor, ~.x$p.value)) %>%
select(-data, -cor)
```
We also applied this same analysis to the CDI words themselves. For each pair of languages, we computed the average normalized @levenshtein1966 distance between the two languages' words for each of the `r cutoff_n` common words in our analyses.^[Levenshtein distance is the minimum number of insertions, deletions, or substitutions required to transform one string into another. For instance, the distance between the Italian and Norwegian words for dog (_cane_ and _hund_) is `r stringdist::stringdist("cane","hund", method = "lv")`. We computed this measure for each word in each pair of languages, and then divided it by the number of characters in the longer of the two words to get the edit distance per character (`r stringdist::stringdist("cane","hund", method = "lv") /nchar("hund")` for _cane_ and _hund_).] This measure was even more highly correlated with pairwise acquisition trajectories than similarity computed using the 40 words identified by @holman2008, with relatively high correlations for both production (_r_ = `r filter(lev_tests, measure == "produces", lev_type == "char") %>% pull(estimate) %>% roundp()`, _p_ `r filter(lev_tests, measure == "produces", lev_type == "char") %>% pull(p.value) %>% print_pvalue()`) and comprehension (_r_ = `r filter(lev_tests, measure == "understands", lev_type == "char") %>% pull(estimate) %>% roundp()`, _p_ `r filter(lev_tests, measure == "understands", lev_type == "char") %>% pull(p.value) %>% print_pvalue()`).
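As a rough illustration of the normalization described in the footnote, the sketch below computes the per-character edit distance for a single word pair. It assumes base R's `utils::adist()` as a stand-in for the `stringdist` call used in the analysis, and `norm_lev` is a hypothetical helper name, not a function from the chapter's code.

```r
# Normalized Levenshtein distance: raw edit distance divided by the
# number of characters in the longer of the two words.
# `norm_lev` is an illustrative helper, not part of the chapter's pipeline.
norm_lev <- function(a, b) {
  # adist() returns a matrix of edit distances; pair the inputs elementwise
  raw <- mapply(function(x, y) drop(utils::adist(x, y)), a, b,
                USE.NAMES = FALSE)
  raw / pmax(nchar(a), nchar(b))
}

norm_lev("cane", "hund")  # 3 edits / 4 characters = 0.75
norm_lev("dog", "dog")    # identical words give 0
```

Dividing by the longer word's length keeps the measure between 0 and 1, so long and short words contribute on the same scale when distances are averaged over concepts.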
Because this analysis likely overestimates the dissimilarity of languages written in different scripts (every word receives a normalized Levenshtein distance of 1 in this case), we replicated this analysis at the phonemic level. We used `eSpeak` to compute phonetic transcriptions of each word and repeated the same analysis on the distance between words' phonetic units in the International Phonetic Alphabet [IPA; @decker1999]. These correlations between IPA distance and pairwise age of acquisition trajectories were again reliable, although slightly attenuated, for both production (_r_ = `r filter(lev_tests, measure == "produces", lev_type == "ipa") %>% pull(estimate) %>% roundp()`, _p_ `r filter(lev_tests, measure == "produces", lev_type == "ipa") %>% pull(p.value) %>% print_pvalue()`) and comprehension (_r_ = `r filter(lev_tests, measure == "understands", lev_type == "ipa") %>% pull(estimate) %>% roundp()`, _p_ `r filter(lev_tests, measure == "understands", lev_type == "ipa") %>% pull(p.value) %>% print_pvalue()`). The robustness of these correlations across methods suggests that, alongside the high degree of general cross-linguistic similarity in the order in which words are acquired, the dissimilarities between languages likely reflect differences in the word forms of the target languages being learned.
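To make the idea of edit distance over phonetic units concrete, here is a minimal sketch that runs the standard dynamic-programming Levenshtein recurrence directly on vectors of tokens. This is a swapped-in technique for illustration: the analysis itself maps each IPA unit to a single character and reuses the string-based distance. The function name `token_lev` and the ASCII tokens (such as `"tS"`) are hypothetical placeholders, not actual `eSpeak` output.

```r
# Levenshtein distance over token vectors (e.g., phones) rather than
# characters, so a multi-character unit such as "tS" counts as one symbol.
token_lev <- function(a, b) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- 0:n  # cost of deleting every token of `a`
  d[1, ] <- 0:m  # cost of inserting every token of `b`
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,  # deletion
                             d[i + 1, j] + 1,  # insertion
                             d[i, j] + cost)   # substitution or match
    }
  }
  d[n + 1, m + 1]
}

# Hypothetical two-unit words that differ in their first unit
a <- c("tS", "a")
b <- c("t", "a")
token_lev(a, b) / max(length(a), length(b))  # one substitution over two units
```

Working on tokens, or equivalently on a one-character-per-unit recoding, means that the normalization counts phonetic units rather than the number of characters used to write them.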
Because the languages we studied here are far from a reliable, representative sample of the world's languages, the correlation between linguistic similarity and acquisition order similarity is hard to interpret definitively [@naroll1965]. Languages whose word forms are similar are also likely to be spoken in communities with similar cultural beliefs around parenting, similar household organization and incomes, and other shared non-linguistic features. Nonetheless, these analyses suggest that in addition to early communicative need (which may be quite similar cross-linguistically), language- and culture-specific features govern the order of acquisition. In the following section, we take on the relationship between early communicative need and linguistic variability directly, asking whether acquisition orders are equally cross-linguistically similar over development, or whether they instead diverge or converge as children learn more words.
## Consistency across development
In the next analysis, we ask whether similarities in ages of acquisition are constant over the course of acquisition, or whether the similarity across languages changes over development. If variability in acquisition trajectories across languages reflects variability in those languages, we might expect that children's trajectories diverge over the course of language acquisition as the structure of their target language or their cultural milieu plays a stronger role in guiding which words are easy or important to learn. Put more simply: our analyses of the first 10 words above show striking similarity in the earliest words. Does this similarity decrease for the next 300 words?
```{r items-step-corrs}
step_cor <- function(max_order, meas, mean_aoas) {
uni_lemmas <- mean_aoas %>%
filter(measure == meas) %>%
slice(1:max_order) %>%
pull(uni_lemma)
all_aoas %>%
ungroup() %>%
filter(measure == meas, uni_lemma %in% uni_lemmas) %>%
select(-measure, -num_item_id) %>%
pairwise_cor(language, uni_lemma, aoa, use = "pairwise") %>%
group_by(item1) %>%
summarise(correlation = mean(correlation)) %>%
mutate(n = max_order, measure = meas)
}
MIN_ITEMS <- 5
```
```{r items-compute-corrs, eval=FALSE}
understands_step_cors <- map_dfr(MIN_ITEMS:max(filter(mean_aoas,
measure == "understands") %>% pull(order)),
function(x) step_cor(x,"understands", mean_aoas))
produces_step_cors <- map_dfr(MIN_ITEMS:max(filter(mean_aoas,
measure == "produces") %>% pull(order)),
function(x) step_cor(x,"produces", mean_aoas))
empirical_step_cors <- bind_rows(understands_step_cors, produces_step_cors) %>%
group_by(measure, n) %>%
summarise(correlation = mean(correlation, na.rm = T))
make_random_step_cors <- function() {
random_aoas <- mean_aoas %>%
group_by(measure) %>%
sample_frac(1)
understands_step_cors_shuffle <- map_dfr(MIN_ITEMS:max(filter(mean_aoas,
measure == "understands") %>% pull(order)),
function(x) step_cor(x,"understands", random_aoas))
produces_step_cors_shuffle <- map_dfr(MIN_ITEMS:max(filter(mean_aoas,
measure == "produces") %>% pull(order)),
function(x) step_cor(x,"produces", random_aoas))
bind_rows(understands_step_cors_shuffle,
produces_step_cors_shuffle)
}
random_step_cors <- replicate(100, make_random_step_cors(), simplify = F) %>%
bind_rows(.id = "sample") %>%
group_by(measure, n, sample) %>%
summarise(correlation = mean(correlation)) %>%
summarise(ci_upper = quantile(correlation, .975),
ci_lower = quantile(correlation, .025),
correlation = mean(correlation))
write_feather(random_step_cors, "data/items-consistency/random_step_cors.feather")
write_feather(empirical_step_cors, "data/items-consistency/empirical_step_cors.feather")
```
To measure change in cross-linguistic consistency over development, we extend the age-of-acquisition correlation approach we have used throughout this chapter. For each concept that appeared in at least `r IN_LANGS` languages, we computed its average age of acquisition, in both comprehension and production, across all languages in whose CDIs it appeared. We then ordered these words from the earliest learned word on average (*`r mean_aoas %>% filter(measure == "produces") %>% filter(measure == "produces", aoa == min(aoa)) %>% pull(uni_lemma)`*) to the latest learned word (*`r (mean_aoas %>% filter(measure == "produces") %>% filter(measure == "produces", aoa == max(aoa)) %>% pull(uni_lemma))`*). We then computed the average cross-linguistic correlation in age of acquisition for increasingly large sets of words, starting with the first `r MIN_ITEMS` words and ending with all `r mean_aoas %>% filter(measure == "produces") %>% distinct(uni_lemma) %>% nrow()` words. If the correlation increases over acquisition, we can infer that acquisition trajectories become more similar as more words are learned; that is, the hardest-to-learn words are learned more similarly across languages. In contrast, if the correlation decreases, we can infer that children start out learning similar concepts regardless of their native language, but that linguistic and cultural variability plays a greater role in the learning of later words.
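The stepwise procedure can be sketched on simulated data as follows. This is a base-R illustration under stated assumptions, not the chapter's pipeline (which uses `pairwise_cor` on the Wordbank estimates): the matrix `aoa`, the helper `mean_pairwise_cor`, and all simulated values are hypothetical, and the resulting numbers carry no empirical meaning.

```r
set.seed(1)

# Simulated ages of acquisition (illustrative only): rows are words,
# columns are languages; each word has a shared difficulty plus noise.
n_words <- 50
n_langs <- 4
shared <- runif(n_words, 12, 30)
aoa <- sapply(seq_len(n_langs), function(l) shared + rnorm(n_words, sd = 3))
colnames(aoa) <- paste0("lang", seq_len(n_langs))

# Order words from earliest to latest mean age of acquisition
word_order <- order(rowMeans(aoa))

# Mean correlation between all pairs of languages over the first k words
mean_pairwise_cor <- function(k, word_order) {
  cors <- cor(aoa[word_order[seq_len(k)], ], use = "pairwise.complete.obs")
  mean(cors[upper.tri(cors)])
}

min_items <- 5
step_cors <- data.frame(
  n = min_items:n_words,
  correlation = sapply(min_items:n_words, mean_pairwise_cor,
                       word_order = word_order)
)
head(step_cors)
```

Each row of `step_cors` corresponds to one point on the curve: the average between-language correlation computed over the `n` earliest-learned words.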
```{r items-plot-cors, fig.height=4.5, fig.cap = "Cross-linguistic correlation in words' ages of acquisition over the course of language development. Colored lines show empirical correlations; the gray area shows a 95 percent confidence interval for a randomly shuffled baseline. Especially in production, cross-linguistic similarity declines over the course of language development."}
empirical_step_cors <-
read_feather("data/items-consistency/empirical_step_cors.feather") %>%
mutate(measure = fct_relevel(measure, "understands"))
random_step_cors <-
read_feather("data/items-consistency/random_step_cors.feather") %>%
mutate(measure = fct_relevel(measure, "understands"))
ggplot(random_step_cors, aes(x = n, y = correlation)) +
facet_wrap(~measure, labeller = label_caps) +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = .1) +
geom_line() +
geom_line(aes(color = measure), data = empirical_step_cors) +
labs(x = "Mean acquisition order number", y = "Cross-linguistic correlation") +
theme(legend.position = "none") +
.scale_colour_discrete()
```
Figure \@ref(fig:items-plot-cors) shows these correlations for both comprehension and production over the course of acquisition. In addition, the gray shaded region shows a 95% confidence interval for a random baseline in which the concepts were ordered randomly, rather than in average acquisition order. This baseline is important to control for changes in measurement error that arise from changing numbers of concepts in the correlation. For both comprehension and production, the trajectories are reliably above the shuffled baseline. This trend is much more apparent for the earliest words in production, mirroring our qualitative sense from the analysis of the first 10 words above. Further, both trajectories clearly decrease over the course of acquisition.
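The role of the shuffled baseline can be illustrated with a small simulation. The sketch below is a base-R approximation under stated assumptions (simulated data, hypothetical names such as `shuffled_run` and `baseline`), not the code that produced the figure. It shows that the baseline interval is wide when few concepts enter the correlation and collapses once every concept is included, because the full set is the same regardless of ordering.

```r
set.seed(42)

# Simulated ages of acquisition (illustrative only): rows are words,
# columns are languages.
n_words <- 40
n_langs <- 4
shared <- runif(n_words, 12, 30)
aoa <- sapply(seq_len(n_langs), function(l) shared + rnorm(n_words, sd = 4))

# Mean correlation between all pairs of languages over a set of words
mean_pairwise_cor <- function(rows) {
  cors <- cor(aoa[rows, , drop = FALSE])
  mean(cors[upper.tri(cors)])
}

ks <- 5:n_words

# One shuffled run: random word order, correlations over growing prefixes
shuffled_run <- function() {
  ord <- sample(n_words)
  sapply(ks, function(k) mean_pairwise_cor(ord[seq_len(k)]))
}

# 100 shuffles; rows of `runs` index set size, columns index shuffles
runs <- replicate(100, shuffled_run())
baseline <- data.frame(
  n = ks,
  ci_lower = apply(runs, 1, quantile, probs = 0.025),
  ci_upper = apply(runs, 1, quantile, probs = 0.975)
)
baseline$width <- baseline$ci_upper - baseline$ci_lower

baseline[c(1, nrow(baseline)), ]  # interval at the smallest and largest n
```

Comparing the empirical curve against this interval, rather than against a fixed value, separates genuine changes in consistency from the extra sampling noise that comes with small sets of concepts.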
These results confirm that there is substantially more similarity in the earliest learned words than in later learned words cross-linguistically, especially in production. This pattern of results is consistent with an account in which cross-linguistically shared communicative needs are a strong driver of the earliest acquired words. After these needs are met by the initial vocabulary, language-specific factors (variability in the forms, frequencies, and contexts of use for words) may play a larger role in the order of children's acquisition.
## Conclusions
Children in all languages and cultures learn language, but the languages they learn vary, and the cultures into which they are born may have quite different practices around both language and cognitive development. Nonetheless, the order in which children learn the words for specific concepts in their own language shows a substantial degree of cross-linguistic similarity. Further, dissimilarities are well explained by measurable linguistic dissimilarity. This cross-linguistic similarity in concepts decreases over the course of acquisition. While the first ten words acquired in each language were highly consistent, later words were substantially more different.
As we noted in the introduction, the general observation of cross-linguistic similarity in early vocabulary has been taken as evidence for a wide variety of different theoretical claims. Our view is that these results indicate a shared core of concepts -- e.g., social routines, important people, and some early foods and household animals -- that are perhaps especially important for communication independent of their linguistic realization.
We acknowledge, however, that there are likely many reasons for consistency of early words. One intriguing suggestion is that the phonological forms of words used with children actually evolve (or are adapted by parents) to be easier for children to say. One version of this hypothesis comes from @jakobson1962, who hypothesized that parents adapt the word forms for *mother* and *father* to be easy for children to say or even to babble. Thus, the sound convergence across languages in the forms of words for these concepts (which is quite substantial) is due to convergence in what sounds are easy for children to say. This same mechanism could operate over other important early vocabulary as well, though note that this account already presupposes some notion of cognitive importance!
Regardless of the precise reason for this phenomenon, the similarity in early vocabulary is undeniable [ratifying suggestions by @clark1973 and others]. As acquisition unfolds, however, the features that make languages (and cultures) different from one another play an ever-increasing role in driving vocabulary development. In Chapter \@ref(items-demographics), we explore demographic differences in acquisition that help to explain why two children learning the same language may acquire different words at different rates.