
# Consistency in Early Vocabulary {#items-consistency}

Which words do children learn first? In spite of tremendous individual variation in rate of development [see Chapter \@ref(vocabulary); @fenson1994; @hart1995], the first words that children utter are reported to be quite consistent. We examine this claim both qualitatively and quantitatively, focusing on the first ten words initially and then zooming out to examine this claim in typological and developmental context.  

## Introduction and methods

Drawing on diary studies, a number of early researchers noted similarities in children's first words across languages [e.g., @clark1973; @slobin1970; see also @schneider2015]. This observation formed the basis for a number of theories of usage (e.g., Clark's influential semantic feature hypothesis). While we return briefly to the question of *why* we see similarity across languages in the contents of early vocabulary, in this chapter we seek to establish a firmer empirical understanding of both the similarities and differences in early-learned words across languages.

Our approach primarily follows the lead of a systematic examination of the early vocabularies of children learning English, Mandarin, and Cantonese [@tardif2008]. Tardif and colleagues found that children's first 10 words in all three languages tended to be about important people in their life (_mom_, _dad_), social routines (_hi_, _uh oh_), animals (_dog_, _duck_), and foods (_milk_, _banana_). Here we attempt to generalize this analysis, asking more broadly whether words tend to be learned in the same order across languages.

One challenge is that the precise words that children learn in different languages are (of course) language-specific. We would really like to ask whether the *concepts* that are being talked about are the same -- or at least similar. As detailed in Chapter \@ref(intro-practical), the items on each language's form are adaptations and not translations: They are intended to capture the spirit of the items on the English form rather than to replicate them exactly. Thus, not all words appear on all forms. In addition, conceptual mappings across languages are subject to cross-cultural variation (the "tortilla problem" we discuss earlier). In what follows, we acknowledge this caveat, but assume for simplicity that *dog*, *chien*, and *perro* name (roughly) the same concept. Our approach is thus to take advantage of cases in which translation equivalents appear on multiple forms and to examine variability in how quickly these words are acquired across languages. This analysis is in some sense a "rough draft" of our more systematic quantitative approach in Chapter \@ref(items-prediction), focusing on first words specifically.

```{r items-load_consistency_data, eval=FALSE}
langs <- get_instruments() %>%
  select(language, form) %>%
  filter(form %in% c("WS", "WG")) %>%
  group_by(language) %>%
  filter(n() == 2) 


items <- map2(langs$language, langs$form, get_item_data) %>%
  bind_rows() %>%
  filter(!is.na(uni_lemma), type == "word")

items_nested <- items %>%
  group_by(language, form) %>%
  nest()

item_data <- map_dfr(1:nrow(items_nested), 
                 ~get_instrument_data(language = items_nested[.x, "language"] %>% pull(), 
                                      form = items_nested[.x, "form"] %>% pull(), 
                                      items = items_nested[[.x, "data"]]$item_id, 
                                      administrations = T))
```


```{r items-get_aoas, eval = FALSE}
understand_items <- item_data %>%
  filter(form == "WG") %>%
  mutate(value = if_else(is.na(value), "", value)) %>%
  left_join(items %>% filter(form == "WG"),
            by = c("num_item_id", "language", "form")) %>%
  mutate(value = nchar(value) > 0) %>%
  group_by(language, uni_lemma, age, data_id) %>%
  summarise(value = any(value))  %>% 
  left_join(items %>% filter(form == "WG") %>% 
              group_by(language, uni_lemma) %>% 
              distinct(num_item_id) %>%
              slice(1),  by = c("language", "uni_lemma")) %>%
  mutate(value = if_else(value, "understands", ""))

understand_aoas <- understand_items %>%
  group_by(language) %>%
  nest() %>%
  mutate(aoas = map(data, ~fit_aoa(.x, measure = "understands", method = "glm",
                                   age_max = 30))) %>%
  select(-data) %>%
  unnest() %>%
  left_join(items %>% filter(form == "WG") %>% 
              group_by(language, uni_lemma) %>% 
              distinct(num_item_id) %>%
              slice(1), by = c("language", "num_item_id", "uni_lemma"))

produce_items <- items %>%
  filter(form == "WG") %>%
  distinct(language, uni_lemma) %>%
  left_join(items, by = c("language", "uni_lemma")) %>%
  left_join(item_data, by = c("language", "num_item_id", "form")) %>%
  # `value` only exists after joining item_data; treat missing responses as
  # "not produced" here rather than before the join
  mutate(value = !is.na(value) & value == "produces") %>%
  ungroup() %>%
  arrange(form) %>%
  group_by(language, uni_lemma, age, data_id) %>%
  summarise(num_item_id = first(num_item_id), value = any(value)) %>% #within subj
  group_by(uni_lemma) %>%
  mutate(num_item_id = first(num_item_id)) %>% #across forms
  mutate(value = if_else(value, "produces", ""))

produce_aoas <- produce_items %>%
  group_by(language) %>%
  nest() %>%
  mutate(aoas = map(data, ~fit_aoa(.x, measure = "produces", method = "glm",
                                   age_max = 36))) %>%
  select(-data) %>%
  unnest() %>%
  left_join(items %>% filter(form == "WG"), by = "language") %>%
  select(language, num_item_id, aoa, uni_lemma)

all_aoas <- understand_aoas %>% 
  mutate(measure = "understands") %>%
  bind_rows(produce_aoas %>% mutate(measure = "produces")) %>%
  filter(!is.na(aoa))

write_feather(all_aoas,"data/items-consistency/aoas.feather")
```

```{r items-load-aoas}
all_aoas <- read_feather("data/items-consistency/aoas.feather")
```

To estimate the similarity of each item's trajectory, we use a single measure of its difficulty: age of acquisition (AoA) -- the age at which 50% of children in each language are estimated to have acquired it (Appendix \@ref(appendix-aoa)). We analyzed consistency in both comprehension and production, using Words & Gestures forms to estimate age of acquisition in comprehension, and stitching across Words & Gestures and Words & Sentences forms to estimate age of acquisition in production. Because of this strategy of combining forms, we were restricted to the `r n_distinct(all_aoas$language)` languages for which data for both forms were available.
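
As a minimal sketch of what the 50% criterion means (not the procedure documented in the appendix), the chunk below fits a logistic regression of production on age and solves for the age at which the fitted curve crosses 0.5. The data are invented for illustration, and the names `toy`, `fit`, and `aoa` are hypothetical.

```{r items-sketch-aoa, eval = FALSE, echo = TRUE}
# Toy data (made up): 100 hypothetical children per age bin, with the
# proportion producing a word rising along a logistic curve.
toy <- data.frame(age = 8:18, n_children = 100)
toy$n_produces <- round(toy$n_children * plogis(-6 + 0.5 * toy$age))

# Logistic regression of production on age, using success/failure counts.
fit <- glm(cbind(n_produces, n_children - n_produces) ~ age,
           family = binomial, data = toy)

# The fitted curve crosses 50% where intercept + slope * age = 0.
aoa <- unname(-coef(fit)["(Intercept)"] / coef(fit)["age"])
aoa  # close to 12 months for these toy data
```

Because the estimate depends only on the ratio of intercept to slope, a word with a shallow developmental curve can have the same AoA as a word with a steep one; AoA summarizes when a word is learned, not how sharply.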

```{r items-uni-lemma-completeness}
IN_LANGS <- 8

lemma_completeness <- all_aoas %>%
  group_by(measure, uni_lemma) %>%
  summarise(in_langs = sum(!is.na(aoa))) 

cutoffs <- lemma_completeness %>%
  group_by(measure, in_langs) %>%
  summarise(n = n()) %>%
  arrange(measure, desc(in_langs)) %>%
  mutate(prop = n/sum(n)) %>%
  mutate(cum_prop = cumsum(prop),
         cum_n = cumsum(n))

cutoff_n <- cutoffs %>%
  filter(measure == "produces", in_langs == IN_LANGS) %>%
  slice(1) %>%
  pull(cum_n)

cutoffs_prod <- cutoffs %>% filter(measure == "produces")
```

In total, we estimated ages of acquisition for `r all_aoas %>% distinct(uni_lemma) %>% nrow()` words spread across the `r n_distinct(all_aoas$language)` languages. Unfortunately, not every word appeared on all forms. Figure \@ref(fig:items-uni-lemma-completeness-figure) shows the proportion of words that appear on at least a given number of languages' forms. For our consistency analysis, we considered only the `r cutoff_n` words that appeared in at least `r IN_LANGS` of the `r n_distinct(all_aoas$language)` languages. 

```{r items-uni-lemma-completeness-figure, fig.width = 6, fig.height = 4, fig.cap = glue("The proportion of words found on at least a given number of languages' CDI forms (e.g., all words appear on at least one form, {filter(cutoffs_prod, in_langs == 2)$cum_n} words appear on at least 2 forms, and so on, with {filter(cutoffs_prod, in_langs == max(in_langs))$cum_n} words appearing on all {max(cutoffs_prod$in_langs)} languages' forms). The dotted line shows the cutoff value we chose ({IN_LANGS}).")}
ggplot(cutoffs_prod, aes(x = in_langs, y = cum_prop)) +
  geom_line() +
  geom_segment(aes(x = IN_LANGS, xend = IN_LANGS, y = 0,
                   yend = filter(cutoffs,  measure == "produces", 
                                 in_langs == IN_LANGS) %>% pull(cum_prop)),
               linetype = .refline, colour = .grey) +
  geom_segment(aes(x = 1, xend = IN_LANGS, 
                   y = filter(cutoffs,  measure == "produces", 
                                 in_langs == IN_LANGS) %>% pull(cum_prop),
                   yend = filter(cutoffs,  measure == "produces", 
                                 in_langs == IN_LANGS) %>% pull(cum_prop)),
                   linetype = .refline, colour = .grey) +
  scale_x_continuous(expand = c(0, 0), breaks = seq(2, 14, 2)) +
  scale_y_continuous(expand = c(0, 0)) +
  labs(x = "Number of forms", y = "Proportion of words") +
  theme(plot.margin = margin(t = 10))
```

## The first 10 words

```{r items-top10-compute}
top10_long <- all_aoas %>%
  inner_join(filter(lemma_completeness, in_langs >= IN_LANGS)) %>%
  group_by(measure,language) %>%
  arrange(aoa) %>%
  slice(1:10)

top10 <- top10_long %>%
  select(-aoa, -num_item_id, -in_langs) %>%
  mutate(order = 1:n()) %>%
  spread(language, uni_lemma) 

in_langs <- top10_long %>% 
  group_by(measure, uni_lemma, in_langs) %>% 
  select(measure, uni_lemma, in_langs) %>%
  distinct() %>%
  arrange(measure, desc(in_langs), uni_lemma) 

in_lang_props <- in_langs %>%
  group_by(measure, in_langs) %>%
  summarise(n = n()) %>%
  mutate(prop = n/sum(n))
```

Following @tardif2008, we begin by examining the first 10 words acquired by children across the `r n_distinct(all_aoas$language)` languages we measured (Tables \@ref(tab:items-top10-production) and \@ref(tab:items-top10-comprehension)). Similar words appeared in the top 10 across languages, especially in children's earliest productions. In production, `r in_lang_props %>% filter(measure == "produces", in_langs == 15) %>% pull(n)` of the `r in_lang_props %>% filter(measure == "produces") %>% pull(n) %>% sum()` words appeared in the top ten earliest words of every language (`r in_lang_props %>% filter(measure == "produces", in_langs == 15) %>% pull(prop) %>% roundp()`), and all but `r in_lang_props %>% filter(measure == "produces", in_langs < 10) %>% pull(n)` appeared in at least ten languages (`r (1 - in_lang_props %>% filter(measure == "produces", in_langs < 10) %>% pull(prop)) %>% roundp()`). In comprehension, `r in_lang_props %>% filter(measure == "understands", in_langs == 15) %>% pull(n)` of the `r in_lang_props %>% filter(measure == "understands") %>% pull(n) %>% sum()` words appeared in the top ten earliest words of every language (`r in_lang_props %>% filter(measure == "understands", in_langs == 15) %>% pull(prop) %>% roundp()`), and all but `r in_lang_props %>% filter(measure == "understands", in_langs < 10) %>% pull(n)` appeared in at least ten languages (`r (1 - in_lang_props %>% filter(measure == "understands", in_langs < 10) %>% pull(prop)) %>% roundp()`). These words consist primarily of important family members (_mommy_, _daddy_, _grandma_), social routines (_hi_, _bye_, _peekaboo_), and sounds (_yum yum_, _vroom_, _woof woof_). 

```{r items-top10-production, results="asis"}
MIN_HIGHLIGHT_PRODUCTION <- 6

top10_measure <- function(meas, min_highlight) {
  top10 %>%
    ungroup() %>%
    filter(measure == meas) %>%
    select(-measure) %>%
    pivot_longer(cols = -order, names_to = "language", values_to = "word") %>%
    group_by(word) %>%
    mutate(n = n()) %>%
    ungroup() %>%
    mutate(word = word %>% str_remove(" \\(.*\\)$") %>% str_remove("own "),
           word = cell_spec(word, bold = n >= min_highlight)) %>%
    select(-n) %>%
    pivot_wider(names_from = "language", values_from = "word") %>%
    select(-order)
}

top10_table <- function(meas, action, min_highlight) {
  df <- top10_measure(meas, min_highlight)
  kable(df, escape = FALSE, col.names = rep("", ncol(df)),
        caption = glue("The 10 earliest words that children {action} in each language. Bolded words appear in at least {min_highlight} languages.")) %>%
  kable_styling(latex_options = "scale_down", font_size = 10) %>%
  add_header_above(str_replace(colnames(df), " ", "\n"), align = "l", line = FALSE) %>%
  landscape()
}
top10_table("produces", "produce", MIN_HIGHLIGHT_PRODUCTION)
```

```{r items-top10-comprehension, dependson="items-top10-production"}
MIN_HIGHLIGHT_COMPREHENSION <- 6
top10_table("understands", "understand", MIN_HIGHLIGHT_COMPREHENSION)
```

These results strongly corroborate the conclusions of @tardif2008: similar words appeared in the top 10 across languages, and the similarities were especially prominent in children's earliest productions.

Unfortunately, we cannot determine if the greater consistency found in early production is a real regularity about children's lexical development, or is instead a measurement artifact arising from the greater difficulty of reporting on a child's comprehension (see Chapter \@ref(psychometrics)).^[This finding is *prima facie* inconsistent with another recent analysis comparing variability in comprehension and production vocabularies [@mayor2014]. This analysis noted that comprehension vocabularies tend to be less idiosyncratic across children -- rather than across languages -- than production vocabularies.] It may be that early communicative needs drive the first words children produce to be even more similar than the first words they comprehend.

<!-- ## Global cross-linguistic similarity  -->

```{r items-understands-and-produces, width = 4, height = 4}
mean_aoas <- all_aoas %>%
  group_by(measure, uni_lemma) %>%
  summarise(n = sum(!is.na(aoa)),
            aoa = mean(aoa, na.rm = T)) %>%
  filter(n >= IN_LANGS) %>%
  arrange(measure, aoa) %>%
  group_by(measure) %>%
  mutate(order = 1:n())

wide_means <- mean_aoas %>%
  spread(measure, aoa) %>%
  group_by(uni_lemma) %>%
  summarise_at(vars(n, produces, understands), mean, na.rm=T)

measure_correlation <- cor.test(wide_means$produces, 
                                wide_means$understands, use = "pairwise")

```


```{r items-understands-and-produces-plots, fig.cap = "Average age of acquisition in comprehension and production for each measured word. Dashed line provides a reference with slope = 1 (identical age of acquisition)."}
ggplot(wide_means, aes(x = produces, y = understands, label = uni_lemma)) + 
  geom_smooth(method = "lm", aes(alpha = .1), se = F, color = "lightgray") +
  geom_text(position=position_jitter(width=.25,height=.25), size = 2,
            family = .font) +
  labs(x = "Production", y = "Comprehension") +
  theme(legend.position = "none") + 
  geom_abline(lty = 2)
```

Despite these differences between comprehension and production, words that are reported to be acquired early in one measure are also generally reported to be acquired early in the other. Figure \@ref(fig:items-understands-and-produces-plots) shows the relationship between the mean age of acquisition in production and the mean age of acquisition in comprehension for each of these `r cutoff_n` words across the `r n_distinct(all_aoas$language)` languages. The correlation between the two measures was quite high: _r_ = `r roundp(measure_correlation$estimate)` (_p_ `r print_pvalue(measure_correlation$p.value)`). Thus, apparent inconsistencies in first words for comprehension may be more a function of measurement errors in comprehension than any systematic difference. 

Taken together, these analyses suggest that children's earliest words, and by inference the processes that underpin them, are highly similar across languages. The source of this similarity is hard to pin down, however. One possibility is that the difficulty of learning a word is determined predominantly by the complexity of the concept denoted by that word, and thus that variability in linguistic (e.g., phonological and syntactic complexity) and cultural (e.g., styles of parental interaction with children) features plays a relatively small role in determining the difficulty of learning a word [@gentner2001]. Alternatively, the primary driver of difficulty could be linguistic, but the dimensions of linguistic variability could be orthogonal to the difficulty of learning. For instance, verbs may be more difficult than nouns because they are relational, and thus learning nouns helps with learning verbs more than learning verbs helps with learning nouns [@gleitman1990]. In this case, the linguistically relevant dimensions would be relatively invariant across languages [@snedeker2007]. Finally, it is worth noting that because the words on the CDI are not a random sample of words in each language, these correlations may overestimate the degree of cross-linguistic similarity, even though they are consistent with earlier diary studies.

In Chapter \@ref(items-prediction) we begin to take up these questions using predictive models. Prior to taking this step, however, we consider cross-linguistic ordering more holistically. In the remainder of the chapter, we address this problem from two directions: (1) Is similarity in order of acquisition for two languages related to the degree of similarity between the two languages? and (2) Does similarity in order of acquisition change over development?

## Acquisition similarity and linguistic similarity

```{r items-load-asjp-distances, eval = F}
# Download file from https://github.com/ddediu/lgfam-newick/blob/master/input/distances/ASJP/asjp16-dists.RData?raw=true
load("asjp16-dists.RData")

asjp <- asjp16.dm %>%
  as_data_frame()

isocodes <- ISOcodes::ISO_639_3 %>%
  select(Id,eng) %>%
  rename(iso = Id, language = eng)

tidy_asjp <- asjp %>%
  mutate(code1 = names(asjp)) %>%
  gather(code2, distance, -code1)


languages <- data_frame(language = unique(all_aoas$language)) %>%
  mutate(trim_language = gsub("\\s*\\([^\\)]+\\)", "", language)) %>%
  left_join(isocodes, by = c("trim_language" = "language")) %>%
  mutate(iso = if_else(trim_language == "Spanish", "spa",
                       if_else(trim_language == "Kiswahili", "swh",
                               if_else(trim_language == "Hebrew", "hbo", iso)))) %>%
  select(-trim_language)

asjp_dist <- tidy_asjp %>%
  left_join(languages, by = c("code1" = "iso")) %>%
  filter(!is.na(language)) %>%
  rename(language1 = language) %>%
  select(-code1) %>%
  left_join(languages, by = c("code2" = "iso")) %>%
  filter(!is.na(language)) %>%
  rename(language2 = language) %>%
  select(-code2)

write_feather(asjp_dist, "data/items-consistency/asjp_dist.feather")
```


```{r items-aoa-consistency}
cor_reorder <- function(df) {
  
  mat <- df %>%
    ungroup() %>%
    select(-measure) %>%
    spread(language2, correlation) %>%
    as.data.frame() %>%
    column_to_rownames("language1") %>%
    data.matrix()
  
  corOrder <- corrplot::corrMatOrder(mat, order = "hclust")
  
  df %>%
    mutate(language1 = factor(language1, levels = unique(df$language1)[corOrder]),
           language2 = factor(language2, levels = unique(df$language2)[corOrder]))
}

cors <- all_aoas %>%
  group_by(measure) %>%
  nest() %>%
  mutate(cors = map(data, ~widyr::pairwise_cor(.x, language, uni_lemma, aoa))) %>%
  select(-data) %>%
  unnest(cols = c(cors)) %>%
  complete(measure, item1, item2, fill = list(correlation = 1)) %>%
  rename(language1 = item1, language2 = item2) %>%
  split(.$measure) %>%
  map_dfr(cor_reorder) %>%
  ungroup() %>%
  mutate(measure = fct_relevel(measure, "understands"))

wide_method_cor <- cors %>% spread(measure, correlation) %>% 
  filter(language1 != language2)

pairwise_cor_test <- cor.test(wide_method_cor$produces, wide_method_cor$understands)
```

Unfortunately, the `r n_distinct(all_aoas$language)` languages in our analyses are both a small and non-representative sample of the world's languages, and thus our analyses do not have sufficient power to detect typological features of language that might be responsible for differences in the similarity of acquisition across languages [@piantadosi2014]. Nonetheless, the languages do come from different language families, and do vary in their phylogenetic distance. We leverage this variability to ask whether the similarity between two languages is related to similarity in how quickly words for the same concepts are learned in those two languages.

Instead of averaging age of acquisition across all languages, we consider the pairwise correlations between languages in the ages of acquisition of the `r cutoff_n` words. Figure \@ref(fig:items-consistency-heatmap) shows these pairwise correlations for production as a matrix in which each cell shows a single pairwise correlation. This correlation matrix appears to contain a considerable amount of structure, with languages that are from the same language family (e.g., Norwegian and Danish) showing higher correlations in their ages of acquisition for the same concepts. Perhaps unsurprisingly given the high average correlation between production and comprehension, pairwise correlations were nearly identical for production and comprehension (_r_ = `r roundp(pairwise_cor_test$estimate)`, _p_ `r print_pvalue(pairwise_cor_test$p.value)`); we omit the comprehension matrices for length. Figure \@ref(fig:items-consistency-dendro) shows a dendrogram produced by hierarchically clustering these pairwise correlations. 
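
To make the pairwise analysis concrete, the chunk below is a small base R sketch on invented data. The four languages (`lang_a` to `lang_d`) and their AoA values are hypothetical. The sketch also clusters on `1 - correlation`, which is one common choice and an assumption here; the chapter's own dendrograms are built from Euclidean distances between rows of the correlation matrix.

```{r items-sketch-pairwise-cors, eval = FALSE, echo = TRUE}
# Toy AoA matrix (made up): rows are concepts, columns are hypothetical
# languages; NA marks a word missing from one language's form.
aoa_mat <- cbind(
  lang_a = c(12, 14, 15, 18, 20, 22),
  lang_b = c(12.5, 13.5, 15.5, 18, 21, 21.5),
  lang_c = c(13, 17, 14, 20, 19, 24),
  lang_d = c(14, 16, NA, 21, 18, 25)
)
rownames(aoa_mat) <- c("mommy", "daddy", "hi", "dog", "milk", "banana")

# Pairwise correlations between languages, using every word that both
# languages in a pair share.
toy_cors <- cor(aoa_mat, use = "pairwise.complete.obs")

# Hierarchical clustering with 1 - correlation as the dissimilarity.
toy_hc <- hclust(as.dist(1 - toy_cors))
round(toy_cors, 2)
```

Using pairwise complete observations lets each pair of languages contribute all the words they share, at the cost that different cells of the matrix rest on slightly different word sets.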

```{r items-consistency-heatmap, fig.height=7.5, out.width="80%", fig.cap = "Correlation matrix showing pairwise correlations in words' age of acquisition. Languages that are more similar have more similar acquisition orders."}
ggplot(filter(cors, measure == "produces"),
       aes(x = language1, y = language2, fill = correlation)) +
  # facet_grid(measure ~ ., labeller = label_caps) + 
  coord_fixed() +
  geom_tile() + 
  geom_text(aes(label = roundp(correlation)), family = .font) +
  scale_x_discrete(expand = expand_scale()) +
  scale_y_discrete(expand = expand_scale()) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        axis.title = element_blank()) 
```

```{r items-dendro-setup}
typology <- yaml::read_yaml("data/items-consistency/typology.yaml") %>%
  tibble(label = names(.), family = unlist(.)) %>%
  select(label, family)

make_hc <- function(df) {
   mat <- df %>%
    spread(language2, correlation) %>%
    as.data.frame() %>%
    column_to_rownames("language1") %>%
    data.matrix()
   
   hclust(dist(mat)) %>% 
     ggdendro::dendro_data(., type = "triangle")
}
```


```{r items-consistency-dendro, fig.cap = "Dendrograms of the similarity in the ages of words' first production cross-linguistically."}
dendros <- cors %>%
  group_by(measure) %>%
  nest() %>%
  mutate(hc = map(data, make_hc)) %>%
  select(-data) 

dendro_segments <- dendros %>%
  mutate(segment = map(hc, ~.x %>% ggdendro::segment())) %>%
  select(-hc) %>%
  unnest(cols = segment)

dendro_labels <- dendros %>%
  mutate(segment = map(hc, ~.x %>% ggdendro::label())) %>%
  select(-hc) %>%
  unnest(cols = segment) %>%
  mutate(label = label %>% str_remove(" \\(.*\\)")) %>%
  left_join(typology, by = "label")

plt <- ggplot(dendro_segments) +
  facet_grid(. ~ measure, scales = "free",
             labeller = as_labeller(label_caps)) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_text(aes(x = x, y = y - 0.02, label = label, colour = family),
            data = dendro_labels, hjust = 0, family = .font) +
  coord_flip() +
  scale_x_reverse() +
  scale_y_reverse() +
  # scale_colour_manual(values = lang_colours_aoa, guide = FALSE) +
  .scale_color_discrete(name = "Language family",
                        guide = guide_legend(title.position = "top",
                                             title.hjust = 0.5,
                                             override.aes = list(size = 0),
                                             keyheight = unit(0, "lines"))) +
  expand_limits(y = -1) +
  theme_get() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.position = "bottom",
        legend.text = element_blank(),
        legend.title = element_text(size = rel(0.8), margin = margin(b = -8)),
        legend.margin = margin(t = 0),
        panel.border = element_blank())

lgnd <- typology %>% 
  distinct(family) %>% 
  mutate(x = scale(1:n())) %>%
  ggplot(aes(x = x, y = 0)) +
    geom_text(aes(label = family, colour = family), family = .font, size = 3) +
    .scale_colour_discrete(guide = FALSE) +
    lims(x = c(-2, 2)) +
    theme_void()

cowplot::plot_grid(plt, lgnd, ncol = 1, rel_heights = c(20, 1))
```

```{r items-asjp-distance}
asjp_dist <- read_feather("data/items-consistency/asjp_dist.feather")

asjp_tests <- cors %>%
  left_join(asjp_dist) %>%
  filter(language1 < language2) %>%
  group_by(measure) %>%
  nest() %>%
  mutate(cor = map(data, ~cor.test(.x$distance, .x$correlation))) %>%
  mutate(estimate = map(cor, ~.x$estimate),
         p.value = map(cor, ~.x$p.value)) %>%
  select(-data, -cor) %>%
  unnest(cols = c(estimate, p.value))

```

These dendrograms show high similarity within the North Germanic, Slavic, and Romance language families. Some relationships resist straightforward linguistic explanations (e.g., the relationship of Quebec French to other languages). These may be due to non-uniform sparsity of data across these languages, or may instead reflect interesting cultural or other sources of variability. Despite these cases, the order in which words are acquired appears to reflect, to a high degree, the structure of the languages that these children are learning. To confirm this observation quantitatively, we borrowed an established measure of linguistic similarity: the lexical similarity of words for the same meaning [@wichmann2010].  

Using a set of 40 words for meanings common to all of the world's languages, @holman2008 were able to use a string-edit distance metric to recover linguistic similarity estimates that correlated highly with geographic distance and also with several typological systems. This method is appealing for our purposes as it is relatively agnostic as to the processes of language contact and change that have produced modern-day languages and instead tracks the similarity of word forms themselves. The language distance measures produced by this method were highly correlated with pairwise correlations in acquisition trajectories for both production (_r_ = `r filter(asjp_tests, measure == "produces") %>% pull(estimate) %>% roundp()`, _p_ `r filter(asjp_tests, measure == "produces") %>% pull(p.value) %>% print_pvalue()`) and comprehension (_r_ = `r filter(asjp_tests, measure == "understands") %>% pull(estimate) %>% roundp()`, _p_ `r filter(asjp_tests, measure == "understands") %>% pull(p.value) %>% print_pvalue()`). 
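
The comparison between linguistic distance and acquisition similarity can be sketched as follows. Both matrices below are invented, and the names `toy_aoa_cor` and `toy_ling_dist` are hypothetical. The idea is simply to take each unordered pair of languages once (the lower triangle of each matrix) and correlate the two sets of values.

```{r items-sketch-distance-cor, eval = FALSE, echo = TRUE}
toy_langs <- c("lang_a", "lang_b", "lang_c", "lang_d")

# Made-up pairwise correlations in AoA between hypothetical languages.
toy_aoa_cor <- matrix(c(1.00, 0.90, 0.60, 0.50,
                        0.90, 1.00, 0.70, 0.55,
                        0.60, 0.70, 1.00, 0.80,
                        0.50, 0.55, 0.80, 1.00),
                      nrow = 4, dimnames = list(toy_langs, toy_langs))

# Made-up linguistic distances between the same languages.
toy_ling_dist <- matrix(c(0.00, 0.20, 0.80, 0.90,
                          0.20, 0.00, 0.70, 0.85,
                          0.80, 0.70, 0.00, 0.30,
                          0.90, 0.85, 0.30, 0.00),
                        nrow = 4, dimnames = list(toy_langs, toy_langs))

# Keep each unordered pair once, then correlate distance with similarity.
toy_pairs <- lower.tri(toy_aoa_cor)
toy_test <- cor.test(toy_ling_dist[toy_pairs], toy_aoa_cor[toy_pairs])
toy_test$estimate  # negative here: more distant pairs are less similar
```

Note that the pairs are not independent of one another (each language contributes to several pairs), so the _p_-value from a plain correlation test of this kind is best read as descriptive.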

```{r items-language-levs, eval = F}
# clean_words(c("dog", "dog / cat", "dog (animal)", "(a) dog", "dog*", "dog(go)", "(a)dog", " dog ", "Cat"))
clean_words <- function(word_set) {
  word_set %>%
    
    # dog / doggo
    strsplit("/") %>% flatten_chr() %>%
    
    # dog (animal) | (a) dog
    strsplit(" \\(.*\\)|\\(.*\\) ") %>% flatten_chr() %>%
    
    # dog* | dog? | dog! | ¡dog! | dog's
    gsub("[*?!¡']", "", .) %>%
    
    # dog(go) | (a)dog
    map_if(
      # if "dog(go)"
      ~grepl("\\(.*\\)", .x),
      # replace with "dog" and "doggo"
      ~c(sub("\\(.*\\)", "", .x),
         sub("(.*)\\((.*)\\)", "\\1\\2", .x))
    ) %>%
    flatten_chr() %>%
  
    # trim
    gsub("^ +| +$", "", .) %>%
    
    keep(nchar(.) > 0) %>%
    tolower() %>%
    unique() %>% 
    first()
   
}


item_aoas <- all_aoas %>%
  filter(measure == "produces") %>%
  left_join(items, by = c("language", "uni_lemma")) %>%
  select(language, aoa, uni_lemma, definition) %>%
  group_by(language, aoa, uni_lemma) %>%
  slice(1) %>%
  mutate(definition = clean_words(definition))


pairwise_levs <- function(langs, definitions) {
  lang1 <- first(langs)
  lang2 <- last(langs)
  
  
  lang1_aoas <- definitions %>%
    filter(language == lang1) %>%
    rename(lang1 = language, def1 = definition) %>%
    ungroup() %>%
    select(lang1, uni_lemma, def1)
  
  lang2_aoas <- definitions %>%
    filter(language == lang2) %>%
    rename(lang2 = language, def2 = definition) %>%
    ungroup() %>%
    select(lang2, uni_lemma, def2)
  
  left_join(lang1_aoas, lang2_aoas, by = "uni_lemma") %>%
   filter(!is.na(lang1) ,!is.na(lang2)) %>%
   mutate(length = pmax(nchar(def1), nchar(def2))) %>%
   mutate(dist = stringdist::stringdist(def1, def2, method= "lv")/length) %>%
   summarise(dist = mean(dist, na.rm = T)) %>%
   mutate(language1 = lang1, language2 = lang2)
}

lang_pairs <- combn(unique(item_aoas$language), 2, simplify = F)

levs <- map_dfr(lang_pairs, ~pairwise_levs(.x, item_aoas))
  
lev_char_cors <- cors %>%
  left_join(levs) %>%
  filter(language1 < language2)
```


```{r items-ipa-levs, eval = F}
lang_codes <- list(
  "Croatian" = "hr",
  "Danish" = "da",
  "English (American)" = "en-us",
  "French (French)" = "fr",
  "French (Quebecois)" = "fr",
  "Italian" = "it",
  "Norwegian" = "no",
  "Russian" = "ru",
  "Spanish (Mexican)" = "es",
  "Swedish" = "sv",
  "Turkish" = "tr",
  "Kiswahili" = "sw",
  "Slovak" = "sk"
)

get_ipa <- function(word, lang) {
  lang_code <- lang_codes[[lang]]
  system2("espeak", args = c("--ipa=3", "-v", lang_code, "-q", glue("'{word}'")),
          stdout = TRUE) %>%
    gsub("^ ", "", .) %>%
    gsub("[ˈˌ]", "", .)
}
get_phons <- function(words, lang) {
  words %>% map_chr(function(word) word %>% get_ipa(lang))
}

phons <- item_aoas %>%
  filter(!language %in% c("Hebrew", "Korean")) %>%
  group_by(language) %>%
  nest() %>%
  mutate(phons = map2(data, language, ~get_phons(.x$clean_definition, .y))) %>%
  unnest()

ipas <- phons %>%
  mutate(ipas = str_split(phons, "[ _]+")) %>%
  unnest() %>%
  filter(nchar(ipas) > 0)

chars <- ipas %>%
  distinct(ipas)

ascii_chars <- paste0("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz", 
        "0123456789", " !\"#$%&'()*+,./:;<=>?@[\\^_`{|}~-", "]")

latin_extended <- paste0("ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿ")

mapping_chars <- paste0(ascii_chars, latin_extended) %>%
  str_split("") %>%
  unlist() %>%
  data_frame(mapping = .) %>%
  distinct() %>%
  slice(1:nrow(chars))

mapping_table <- bind_cols(chars, mapping_chars)

mapped_ipas <- ipas  %>%
  left_join(mapping_table) %>%
  group_by(language, phons, aoa, uni_lemma, definition, clean_definition) %>%
  summarise(mapping = paste0(mapping, collapse = "")) %>%
  ungroup() %>%
  select(-definition) %>%
  rename(definition = mapping)

lang_ipa_pairs <- combn(unique(mapped_ipas$language), 2, simplify = F)

ipa_levs <- map_dfr(lang_ipa_pairs, ~pairwise_levs(.x, mapped_ipas))

lev_ipa_cors <- cors %>%
  left_join(ipa_levs) %>%
  filter(language1 < language2)

all_lev_cors <- lev_char_cors %>%
  rename(char = dist) %>%
  left_join(lev_ipa_cors, 
            by = c("measure", "language1", "language2", "correlation")) %>%
  rename(ipa = dist)

write_feather(all_lev_cors, "data/items-consistency/lev_cors.feather")
```


```{r items-lev-cors}
lev_cors <- read_feather("data/items-consistency/lev_cors.feather")


lev_tests <- lev_cors %>%
  gather(lev_type, distance, char, ipa) %>%
  group_by(measure, lev_type) %>%
  nest() %>%
  mutate(cor = map(data, ~cor.test(.x$distance, .x$correlation))) %>%
  mutate(estimate = map(cor, ~.x$estimate),
         p.value = map(cor, ~.x$p.value)) %>%
  select(-data, -cor) %>%
  unnest(cols = c(estimate, p.value))
```

We also applied this same analysis to the words on the CDI themselves. For each pair of languages, we computed the average normalized @levenshtein1966 distance between the two languages' words for each of the `r cutoff_n` common words in our analyses.^[Levenshtein distance is the minimum number of insertions, deletions, or substitutions required to transform one string into another. For instance, the distance between the Italian and Norwegian words for dog (_cane_ and _hund_) is `r stringdist::stringdist("cane","hund", method = "lv")`. We computed this measure pairwise for all words, and then divided it by the number of characters in the longer word in order to get the edit distance per character (`r stringdist::stringdist("cane","hund", method = "lv") /nchar("hund")` for _cane_ and _hund_).] This measure was even more highly correlated with pairwise acquisition trajectories than similarity computed using the 40 words identified by @holman2008, with relatively high correlations for both production (_r_ = `r filter(lev_tests, measure == "produces", lev_type == "char") %>% pull(estimate) %>% roundp()`, _p_ `r filter(lev_tests, measure == "produces", lev_type == "char") %>% pull(p.value) %>% print_pvalue()`) and comprehension (_r_ = `r filter(lev_tests, measure == "understands", lev_type == "char") %>% pull(estimate) %>% roundp()`, _p_ `r filter(lev_tests, measure == "understands", lev_type == "char") %>% pull(p.value) %>% print_pvalue()`). 
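
The normalization can be illustrated with a short sketch. The helper `norm_lev` is hypothetical, and it uses base R's `adist()` as a stand-in for the `stringdist` package used in our analysis; both compute the standard Levenshtein distance for plain strings like these.

```{r items-sketch-levenshtein, eval = FALSE, echo = TRUE}
# Hypothetical helper: Levenshtein distance divided by the length of the
# longer word, using base R's adist() rather than stringdist.
norm_lev <- function(word1, word2) {
  raw <- as.numeric(utils::adist(word1, word2))
  raw / max(nchar(word1), nchar(word2))
}

norm_lev("cane", "hund")  # 3 edits over 4 characters
norm_lev("dog", "dog")    # identical strings give 0
```

Dividing by the longer word's length bounds the measure between 0 and 1, so that long and short words contribute on the same scale when distances are averaged over words.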

Because this analysis likely overestimates the dissimilarity of languages written in different scripts -- as every word receives a normalized Levenshtein distance of 1 in this case -- we replicated this analysis at the phonemic level. We used `eSpeak` to compute phonetic transcriptions of each word and repeated the same analysis on distance between words' phonetic units in the International Phonetic Alphabet [IPA; @decker1999]. These correlations between IPA distance and pairwise age of acquisition trajectories were again reliable although slightly attenuated for both production (_r_ = `r filter(lev_tests, measure == "produces", lev_type == "ipa") %>% pull(estimate) %>% roundp()`, _p_ `r filter(lev_tests, measure == "produces", lev_type == "ipa") %>% pull(p.value) %>% print_pvalue()`) and comprehension (_r_ = `r filter(lev_tests, measure == "understands", lev_type == "ipa") %>% pull(estimate) %>% roundp()`, _p_ `r filter(lev_tests, measure == "understands", lev_type == "ipa") %>% pull(p.value) %>% print_pvalue()`). The robustness of these correlations across a variety of methods suggests that, in addition to the high degree of general cross-linguistic similarity in the order in which words are acquired, the dissimilarities between languages' acquisition orders likely reflect differences in the word forms of the target languages being learned.

Because the languages we studied here are far from a reliable, representative sample of the world's languages, the correlation between linguistic similarity and acquisition order similarity is hard to interpret definitively [@naroll1965]. Languages in which word forms are similar are also likely to have similar cultural beliefs around parenting, similar household organization and incomes, and other shared non-linguistic features. Nonetheless, these analyses suggest that in addition to early communicative need -- which may be quite similar cross-linguistically -- language- and culture-specific features govern the order of acquisition. In the following section, we take on the relationship between early communicative need and linguistic variability directly, asking whether acquisition orders are equally cross-linguistically similar over development, or whether they instead diverge or converge as children learn more words.

## Consistency across development

In the next analysis, we ask whether similarities in ages of acquisition are constant over the course of acquisition, or whether the similarity across languages changes over development. If variability in acquisition trajectories across languages reflects variability in those languages, we might expect that children's trajectories diverge over the course of language acquisition as the structure of their target language or their cultural milieu plays a stronger role in guiding which words are easy or important to learn. Put more simply: our analyses of the first 10 words above show striking similarity in the earliest words. Does this similarity decrease for the next 300 words?

```{r items-step-corrs}
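# Mean cross-linguistic correlation in AoA over the first `max_order` concepts.
# `mean_aoas` is assumed to be sorted in the desired order within each measure;
# `all_aoas` (per-language AoAs) is read from the enclosing environment.
step_cor <- function(max_order, meas, mean_aoas) {

  # Concepts included at this step: the first `max_order` rows for this measure
  uni_lemmas <- mean_aoas %>%
    filter(measure == meas) %>%
    slice(1:max_order) %>%
    pull(uni_lemma)

  all_aoas %>%
    ungroup() %>%
    filter(measure == meas, uni_lemma %in% uni_lemmas) %>%
    select(-measure, -num_item_id) %>%
    # Correlate each pair of languages over the concepts they share
    pairwise_cor(language, uni_lemma, aoa, use = "pairwise") %>%
    # Average each language's correlations with all other languages
    group_by(item1) %>%
    summarise(correlation = mean(correlation)) %>%
    mutate(n = max_order, measure = meas)
}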

MIN_ITEMS <- 5

```


```{r items-compute-corrs, eval=FALSE}
understands_step_cors <- map_dfr(MIN_ITEMS:max(filter(mean_aoas, 
                                           measure == "understands") %>% pull(order)), 
                              function(x) step_cor(x,"understands", mean_aoas))

produces_step_cors <- map_dfr(MIN_ITEMS:max(filter(mean_aoas, 
                                        measure == "produces") %>% pull(order)), 
                           function(x) step_cor(x,"produces", mean_aoas))


# Average over languages to get one correlation per measure and set size
empirical_step_cors <- bind_rows(understands_step_cors, produces_step_cors) %>%
  group_by(measure, n) %>%
  summarise(correlation = mean(correlation, na.rm = TRUE))

make_random_step_cors <- function() {
  
  random_aoas <- mean_aoas %>%
    group_by(measure) %>%
    sample_frac(1)
  
  understands_step_cors_shuffle <- map_dfr(MIN_ITEMS:max(filter(mean_aoas, 
                                             measure == "understands") %>% pull(order)), 
                                function(x) step_cor(x,"understands", random_aoas))
  
  produces_step_cors_shuffle <- map_dfr(MIN_ITEMS:max(filter(mean_aoas, 
                                          measure == "produces") %>% pull(order)), 
                             function(x) step_cor(x,"produces", random_aoas))
  
  bind_rows(understands_step_cors_shuffle, 
            produces_step_cors_shuffle)
}


# Shuffled baseline: 100 random concept orderings, summarised as the mean and
# a 95% interval across shuffles for each measure and set size
random_step_cors <- replicate(100, make_random_step_cors(), simplify = FALSE) %>%
  bind_rows(.id = "sample") %>% 
  group_by(measure, n, sample) %>%
  summarise(correlation = mean(correlation)) %>%
  summarise(ci_upper = quantile(correlation, .975),
            ci_lower = quantile(correlation, .025),
            correlation = mean(correlation))

write_feather(random_step_cors, "data/items-consistency/random_step_cors.feather")
write_feather(empirical_step_cors, "data/items-consistency/empirical_step_cors.feather")
```

In order to measure change in cross-linguistic consistency over development, we extend the age of acquisition-correlation approach we have used throughout this chapter. For each concept that appeared in at least `r IN_LANGS` languages, we computed its average age of acquisition across all languages in whose CDIs it appeared in both comprehension and production. We then ordered these words from the earliest learned word on average (*`r mean_aoas %>% filter(measure == "produces") %>% filter(measure == "produces", aoa == min(aoa)) %>% pull(uni_lemma)`*) to the latest learned word (*`r (mean_aoas %>% filter(measure == "produces") %>% filter(measure == "produces", aoa == max(aoa)) %>% pull(uni_lemma))`*). We then computed the average cross-linguistic correlation in age of acquisition for increasingly large sets of words, from the first `r MIN_ITEMS` words up to all `r mean_aoas %>% filter(measure == "produces") %>% distinct(uni_lemma) %>% nrow()` words. If the correlation increases over acquisition, we can infer that acquisition trajectories become more similar as more words are learned; that is, the hardest-to-learn words are learned more similarly across languages. In contrast, if the correlation decreases, we can infer that children start out learning similar concepts regardless of their native language, but that linguistic and cultural variability plays a greater role in the learning of later words.
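
The logic of this cumulative analysis can be illustrated with a small base R sketch on simulated data. Everything here is an assumption made for illustration: the toy matrix `aoa` (concepts in rows, languages in columns), the language labels, and the helper names `mean_pairwise_cor()` and `cumulative_cors()` are hypothetical, and the sketch averages over unique language pairs rather than reproducing the chapter's `step_cor()` code.

```r
set.seed(1)

# Toy data (simulated, not CDI data): rows are concepts, columns are languages
n_concepts <- 40
languages <- c("lang_a", "lang_b", "lang_c")
shared_aoa <- runif(n_concepts, min = 10, max = 30)  # shared component, in months
aoa <- sapply(languages, function(l) shared_aoa + rnorm(n_concepts, sd = 3))

# Mean correlation across all pairs of languages for a set of concepts
mean_pairwise_cor <- function(m) {
  cors <- cor(m, use = "pairwise.complete.obs")
  mean(cors[upper.tri(cors)])
}

# Order concepts by mean AoA, then correlate over the first n concepts
cumulative_cors <- function(aoa, min_items = 5) {
  ord <- order(rowMeans(aoa, na.rm = TRUE))
  ns <- min_items:nrow(aoa)
  data.frame(
    n = ns,
    correlation = sapply(ns, function(n) {
      mean_pairwise_cor(aoa[ord[seq_len(n)], , drop = FALSE])
    })
  )
}

toy_cors <- cumulative_cors(aoa)
head(toy_cors)
```

Each row of `toy_cors` gives the mean pairwise correlation for the `n` earliest-learned concepts, so a curve that falls with `n` would correspond to decreasing cross-linguistic similarity for later-learned words.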

```{r items-plot-cors, fig.height=4.5, fig.cap = "Cross-linguistic correlation in words' ages of acquisition over the course of language development. Colored lines show empirical correlations; the gray area shows a 95 percent confidence interval for a randomly shuffled baseline. Especially in production, cross-linguistic similarity declines over the course of language development."}

empirical_step_cors <-
  read_feather("data/items-consistency/empirical_step_cors.feather") %>%
  mutate(measure = fct_relevel(measure, "understands"))

random_step_cors <- 
  read_feather("data/items-consistency/random_step_cors.feather") %>%
  mutate(measure = fct_relevel(measure, "understands"))

ggplot(random_step_cors, aes(x = n, y = correlation)) +
  facet_wrap(~measure, labeller = label_caps) +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = .1) +
  geom_line() +
  geom_line(aes(color = measure), data = empirical_step_cors) +
  labs(x = "Mean acquisition order number", y = "Cross-linguistic correlation") +
  theme(legend.position = "none") + 
  .scale_colour_discrete()
```

Figure \@ref(fig:items-plot-cors) shows these correlations for both comprehension and production over the course of acquisition. In addition, the gray shaded region shows a 95% confidence interval for a random baseline in which the concepts were ordered randomly, rather than in average acquisition order. This baseline is important to control for changes in measurement error that arise from changing numbers of concepts in the correlation. For both comprehension and production, the trajectories are reliably above the shuffled baseline. This trend is much more apparent for the earliest words in production, mirroring our qualitative sense from the analysis of the first 10 words above. Further, both trajectories clearly decrease over the course of acquisition. 
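
To make the shuffled baseline concrete, here is a minimal base R sketch on simulated data. It rests on assumptions introduced only for illustration: the toy `aoa` matrix, the helper names `mean_pairwise_cor()` and `cors_for_order()`, and the use of 100 random orderings summarised by 2.5% and 97.5% quantiles.

```r
set.seed(2)

# Toy data (simulated, not CDI data): 40 concepts by 3 languages
n_concepts <- 40
shared_aoa <- runif(n_concepts, min = 10, max = 30)
aoa <- sapply(c("lang_a", "lang_b", "lang_c"),
              function(l) shared_aoa + rnorm(n_concepts, sd = 3))

# Mean correlation across all pairs of languages for a set of concepts
mean_pairwise_cor <- function(m) {
  cors <- cor(m)
  mean(cors[upper.tri(cors)])
}

# Correlations for the first n concepts under a given ordering `ord`
cors_for_order <- function(aoa, ord, min_items = 5) {
  sapply(min_items:nrow(aoa), function(n) {
    mean_pairwise_cor(aoa[ord[seq_len(n)], , drop = FALSE])
  })
}

# Each column is one random ordering; each row is one set size
shuffled <- replicate(100, cors_for_order(aoa, sample(nrow(aoa))))

baseline <- data.frame(
  n = 5:nrow(aoa),
  ci_lower = apply(shuffled, 1, quantile, probs = 0.025),
  ci_upper = apply(shuffled, 1, quantile, probs = 0.975),
  correlation = rowMeans(shuffled)
)
```

In this sketch the interval is widest for small sets, where few concepts enter each correlation, and collapses to a single value at the full set, because every ordering then includes all concepts.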

These results confirm that there is substantially more similarity in the earliest learned words than in later learned words cross-linguistically, especially in production. This pattern of results is consistent with an account in which cross-linguistically shared communicative needs are a strong driver of the earliest acquired words. After these needs are met by the initial vocabulary, language-specific factors -- variability in the forms, frequencies, and contexts of use for words -- may play a larger role in the order of children's acquisition.

## Conclusions

Children in all languages and cultures learn language, but the languages they learn vary, and the cultures into which they are born may have quite different practices around both language and cognitive development. Nonetheless, the order in which children learn the words for specific concepts in their own language shows a substantial degree of cross-linguistic similarity. Further, dissimilarities are well explained by measurable linguistic dissimilarity. This cross-linguistic similarity in concepts decreases over the course of acquisition. While the first ten words acquired in each language were highly consistent, later words were substantially more different.

As we noted in the introduction, the general observation of cross-linguistic similarity in early vocabulary has been taken as evidence for a wide variety of different theoretical claims. Our view is that these results indicate a shared core of concepts -- e.g., social routines, important people, and some early foods and household animals -- that are perhaps especially important for communication independent of their linguistic realization.

We acknowledge, however, that there are likely many reasons for consistency of early words. One intriguing suggestion is that the phonological forms of words used with children actually evolve (or are adapted by parents) to be easier for children to say. One version of this hypothesis comes from @jakobson1962, who hypothesized that parents adapt the word forms for *mother* and *father* to be easy for children to say or even to babble. Thus, the sound convergence across languages in the forms of words for these concepts (which is quite substantial) is due to convergence in what sounds are easy for children to say. This same mechanism could operate over other important early vocabulary as well, though note that this account already presupposes some notion of cognitive importance!

Regardless of the precise reason for this phenomenon, the similarity in early vocabulary is undeniable [ratifying suggestions by @clark1973 and others]. As acquisition unfolds, however, the features that make languages (and cultures) different from one another play an ever-increasing role in driving vocabulary development. In Chapter \@ref(items-demographics), we explore demographic differences in acquisition that help to explain why two children learning the same language may acquire different words at different rates.