forked from phoible/phoible.github.io
-
Notifications
You must be signed in to change notification settings - Fork 0
/
_faq.Rmd
671 lines (550 loc) · 28.6 KB
/
_faq.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
---
title: "PHOIBLE frequently asked questions"
author: Steven Moran & Daniel McCloy
layout: default
bibliography: bib/faq.bibtex
output:
md_document:
variant: gfm
preserve_yaml: true
toc: true
toc_depth: 2
html_document:
toc: true
theme: default
csl: bib/phoible.csl
---
```{r opts, echo=FALSE}
knitr::opts_chunk$set(fig.path = "images/faq/")
```
# Introduction
This FAQ answers questions regarding the editorial principles and design
decisions that went into the creation of PHOIBLE. We appreciate and welcome
feedback regarding these FAQs via
[our issue tracker](https://github.com/phoible/dev/issues)
or by contacting the editors directly.
When relevant, we provide [R](https://www.r-project.org/) code snippets to
elucidate the questions raised in this FAQ. To run these code snippets
requires the following
[R packages](https://cran.r-project.org/web/packages/available_packages_by_name.html):
```{r load_libraries, warning=FALSE, message=FALSE}
library(readr)
library(stringr)
library(dplyr)
library(knitr)
library(ggplot2)
```
This document was rendered with `r R.version.string` and package versions
`r c("dplyr", "readr", "stringr", "knitr", "ggplot2") %>% sapply(function(pkg) as.character(packageVersion(pkg))) %>% str_c(names(.), ., sep=": ", collapse=", ")`.
## How do I get the data?
You can get the most recent "official" release from our
[download page](https://phoible.org/download), get the most current version
(with bugfixes or new additions since last release) from
[GitHub](https://github.com/phoible/dev/blob/master/data/phoible.csv?raw=true),
or use the following code snippet to download the most current version from
GitHub directly within `R`:
```{r load_data}
url_ <- "https://github.com/phoible/dev/blob/master/data/phoible.csv?raw=true"
col_types <- cols(InventoryID='i', Marginal='l', .default='c')
phoible <- read_csv(url(url_), col_types=col_types)
```
# Inventories, language codes, doculects, and sources
## How are PHOIBLE inventories created?
For the most part, every phonological inventory in PHOIBLE is based on
one-and-only-one language description (usually a research article, book
chapter, dissertation, or descriptive grammar). The technical term for this in
comparative linguistics is "doculect" (from "documented lect"), in which
[lect](https://en.wikipedia.org/wiki/Variety_(linguistics)) means a specific
form of a language or dialect, i.e. an instance of documentation of an instance
of linguistic behavior at a particular time and place [@CysouwGood2013].
A brief explanation and some history of why linguists use the term "doculect",
which has gained broad acceptance in light of the issues of language
identification and the use of "language codes", is given in
[this blog post](https://dlc.hypotheses.org/623) by Michael Cysouw.
Contributors to PHOIBLE start with a doculect, extract the contrastive phonemes
and allophones, and (if necessary) adapt the authors' choice of symbols to
align with PHOIBLE's [symbol guidelines](http://phoible.github.io/conventions/).
If the authors have not provided ISO 639-3 and Glottolog codes, these are
determined before adding the inventory to PHOIBLE.
Each inventory is then given a unique numeric ID.
Doculects are tracked in PHOIBLE using BibTeX keys.
## Why do some phonological inventories combine more than one doculect?
An exception to the "one doculect per inventory" rule arises for inventories
that were originally part of a curated phonological database such as
UPSID [@Maddieson1984;@MaddiesonPrecoda1990] or SPA [@CrothersEtAl1979]. In
those collections, inventories were often based on multiple descriptions of
linguistic behavior, written by different linguists; those descriptions were
believed to be describing the same language, and disagreements between the
descriptions were adjudicated by the experts who compiled the collection.
We can quickly see how many of PHOIBLE's inventories are based on multiple
doculects by looking at the mapping table between PHOIBLE inventory IDs and
BibTeX keys:
```{r multiple_doculect_invs}
url_ <- "https://github.com/phoible/dev/blob/master/mappings/InventoryID-Bibtex.csv?raw=true"
id_to_bibtex_mapping <- read_csv(url(url_), col_types=cols(InventoryID='i',
.default='c'))
id_to_bibtex_mapping %>%
group_by(InventoryID) %>%
tally(name="Number of doculects consulted") %>%
group_by(`Number of doculects consulted`) %>%
count(name="Number of inventories") %>%
kable()
```
Clearly, the majority of inventories in PHOIBLE represent a phonological
description from a single doculect. But it seems strange that a single
phonological inventory in PHOIBLE could be based on 11 different doculects.
Let's examine it:
```{r eleven_doculect_inv}
id_to_bibtex_mapping %>%
group_by(InventoryID) %>%
tally(name="Number of doculects consulted") %>%
filter(`Number of doculects consulted` == 11) %>%
pull(InventoryID) ->
this_inventory_id
phoible %>%
filter(InventoryID == this_inventory_id) %>%
distinct(Source, LanguageName, Glottocode, ISO6393) %>%
kable()
```
As we can see, this inventory represents a description of
[Hausa](https://glottolog.org/resource/languoid/id/haus1257) and was added to
PHOIBLE from the UPSID database [@Maddieson1984;@MaddiesonPrecoda1990]. To
understand why this UPSID entry consulted 11 different sources, consider first
that Hausa is typologically interesting (e.g., it has both ejective and
implosive phonation mechanisms) and has tens of millions of speakers, making it
relatively well-studied (the [Glottolog](https://glottolog.org/) reference
catalog has [more than 1400 references](https://glottolog.org/resource/languoid/id/haus1257)
related to Hausa).
Second, note that Maddieson's work on UPSID involved "typologizing"
phonological inventories from different doculects, so that they were comparable
across all entries in his database [cf. @Hyman2008]. Maddieson's work was
groundbreaking at the time because he was the first typologist to generate a
stratified language sample aimed at being genealogically balanced, i.e. for
each language family he chose one representative language. This allowed
Maddieson to make statements about the cross-linguistic distribution of
contrastive speech sounds with some level of statistical confidence. In fact,
much about what we know about the distribution of the sounds of the world's
languages is due to Maddieson's original language sample and his meticulous
curation of the data.
## Where do the language codes in PHOIBLE come from?
Every phonological inventory in PHOIBLE has a unique numeric inventory ID.
Since most PHOIBLE inventories (aside from some UPSID or SPA ones, as mentioned
above) are based on a single document, it is fairly straightforward to link
each PHOIBLE inventory to [the Glottolog](https://glottolog.org/), which
provides links between linguistic description documents and unique identifiers
for dialects, languages, and groupings of dialects and languages at various
levels [the preferred term for any of these levels of specificity / grouping is
"languoid"; see @CysouwGood2013]. Thus in PHOIBLE each inventory typically
corresponds to a single languoid, and in most cases that
languoid is a "leaf node" in the Glottolog tree, i.e., it represents a
particular dialect known to be used at a particular place and time (rather than
a group of dialects, a language, or a language family). However, in the few
cases where multiple document sources were consulted for an inventory, it may
not be possible to link that inventory to a unique Glottolog leaf node.
In such cases, the inventory in PHOIBLE is linked to the lowest possible
Glottolog node that dominates the leaf nodes of each source document.
For example, inventory 298 describes the Dan language, and was ingested from
UPSID and ultimately based on a single journal article. At the time UPSID was
compiled, Dan and Kla-Dan were considered a single language (isocode `daf`) but
in 2013 a proposal was accepted to assign them separate isocodes (`dnj` and
`lda`, respectively). Since it is unknown whether the consultants for the
doculect were speakers of what we would now call "Dan" or "Kla-Dan", the
inventory is labeled with the old isocode `daf` and linked to the corresponding
non-leaf node in the glottolog (`dann1241`).
In addition to providing Glottolog languoid codes ("glottocodes") for each
inventory, PHOIBLE also includes
[ISO 639-3](https://en.wikipedia.org/wiki/ISO_639-3) language codes
("isocodes") for each inventory. The link between glottocodes and isocodes is
maintained and provided by Glottolog. This situation can result in two possible
problems:
* When there are
[updates to ISO 639-3](https://iso639-3.sil.org/code_changes/submitting_change_requests),
the Glottolog may not update its mapping immediately, so the two can get out
of sync temporarily. Such problems are typically resolved with new version
releases of Glottolog.
* In some cases, the editors of the Glottolog do not agree with the language
classification choices of ISO 639-3 (see again the above-mentioned
[blog post by Cysouw](https://dlc.hypotheses.org/623)). This disagreement
results in cases where PHOIBLE must choose whether to accept the official
ISO 639-3 code assignment, or use the isocode that the Glottolog associates
with the glottocode for that inventory. In such cases, PHOIBLE policy is to
report the ISO 639-3 version of the isocode. Consequently, there are a few
languoids for which PHOIBLE and the Glottolog will report different isocodes
for the same glottocode. If this is a problem for your analysis, you can
always download a glottocode-isocode mapping from the Glottolog and merge it
into the PHOIBLE dataset before performing your isocode-based analyses.
## Missing isocodes
There are cases of languages reported in the Glottolog for which there exists
no isocode. For example, the Vach-Vasjugan variety of
[Khanty](https://en.wikipedia.org/wiki/Khanty_language) is a Uralic language
classified in the Glottolog as
[vach1239](https://glottolog.org/resource/languoid/id/vach1239). However,
[SIL](https://www.sil.org/) only assigns an isocode for one variety of Northern
Khanty (isocode `kca`, glottocode `khan1273`), and because SIL is the
Registration Authority of [ISO 639-3](https://iso639-3.sil.org/), Vach-Vasjugan
Khanty has not been assigned its own isocode distinct from Northern Khanty.
When no isocode exists for a particular phonological inventory in PHOIBLE, as
in the Vach-Vasjugan example above, PHOIBLE follows the recommended practice of
using the isocode `mis` ("missing") to denote that the language is not included
in the ISO 639-3 standard. In PHOIBLE these are these inventories with missing
isocodes:
```{r missing_isocodes}
phoible %>%
filter(ISO6393 == "mis") %>%
distinct(InventoryID, LanguageName, ISO6393, Glottocode) %>%
kable()
```
Many of these inventories come from
[Erich Round's](https://languages-cultures.uq.edu.au/profile/1160/erich-round)
[contribution of Australian phonemic inventories](https://zenodo.org/record/3464333#.XyK5qxMzY3E)
to PHOIBLE. Unfortunately, some of these languages are extinct and have no
representation in the [Ethnologue](https://www.ethnologue.com/), and hence, no
code assigned in ISO 639-3.
For users, it is important to note that multiple phonological inventories in
PHOIBLE may have the same isocode (e.g. "mis") or the same glottocode (e.g.,
in cases of two different descriptions of the same lect). In essence, this
means that any programmatic code that groups by isocode or glottocode risks
combining inventories from different doculects into a single apparent inventory.
This could lead to incorrect results (e.g., if the goal is to count the number
of phonemes). Therefore, **most analyses of inventory properties should be done
on the level of inventory IDs** rather than isocodes or glottocodes. See also
the [section on filtering and sampling](#filtsamp) below.
## Why do some languages have multiple entries in PHOIBLE?
It is not uncommon that phonological descriptions of a particular language's
speech sounds have different sets of contrastive phonemes when analyzed by
different linguists (or sometimes even by the same linguist throughout their
career). For example,
[Kabardian](https://en.wikipedia.org/wiki/Kabardian_language)
is represented in PHOIBLE by
[five distinct inventories](https://phoible.org/languages/kaba1278).
```{r kabardian}
phoible %>%
filter(ISO6393 == "kbd") %>%
group_by(InventoryID) %>%
summarise(`Number of phonemes`=n(),
Phonemes=str_c(Phoneme, collapse=" "),
.groups="drop") %>%
kable()
```
The differences among them can be due to a variety of reasons, but the main
reason is that these phonological descriptions represent different doculects
(i.e., different instances of linguistic behavior at different places, times,
or with different speakers). This should probably not surprise most linguists,
since it has long been known that phoneme analysis is a non-deterministic
process [@Chao1934;@Hockett1963]. See Moran [-@Moran2012, ch. 2.3.3] for a
general discussion, and Hyman [-@Hyman2008, p. 99] for a detailed discussion of
Kabardian in particular.
In light of the above discussion, it
should come as no surprise that PHOIBLE can contain multiple inventories for
"the same" languoid, depending on what kind of languoid you're interested in.
All it takes is the existence of two or more descriptive documents associated
with the same group of speakers, where "group of speakers" is ultimately
determined by your desired level of granularity. To give a concrete example,
one researcher may be interested in comparing lects at the "language" level,
and so might wish to treat all inventories of "English" as duplicates for the
purposes of her analysis (regardless of any differences in regional dialect or
sociolect represented in the original doculects and encoded in the phonological
inventory). That researcher might *filter* or *sample* PHOIBLE's inventories to
include only one inventory for each isocode (how she chooses to implement that
filter is a separate question; see "How can I filter or sample inventories?"
for examples). Other researchers may not care about
"duplicates" in that sense, and may choose to include all inventories in their
analysis (or, they may filter the dataset to include only inventories with a
particular feature of interest such as breathy-voiced vowels).
Below is a summary of the number of isocodes that are represented by multiple
inventories in PHOIBLE:
```{r stemplot_inv_per_iso}
offset <- 25
phoible %>%
group_by(ISO6393) %>%
summarise(y=n_distinct(InventoryID), .groups="drop") %>%
count(y) ->
counts
counts %>%
ggplot(aes(y=as.factor(y), x=n)) +
geom_point() +
geom_text(aes(x=n+offset, label=n), hjust=0) +
geom_segment(aes(xend=0, yend=as.factor(y))) +
xlim(NA, max(counts$n) + 2*offset) +
theme_light() +
labs(title="Prevalence of multiple inventories per ISO code",
x="Number of ISO codes having N inventories",
y="N inventories")
```
So most ISO 639-3 codes have only 1 inventory (the long, bottom line).
Here is the same representation, for glottocodes instead of isocodes:
```{r stemplot_inv_per_glotto}
offset <- 25
phoible %>%
group_by(Glottocode) %>%
summarise(y=n_distinct(InventoryID), .groups="drop") %>%
count(y) ->
counts
counts %>%
ggplot(aes(y=as.factor(y), x=n)) +
geom_point() +
geom_text(aes(x=n+offset, label=n), hjust=0) +
geom_segment(aes(xend=0, yend=as.factor(y))) +
xlim(NA, max(counts$n) + 2*offset) +
theme_light() +
labs(title="Prevalence of multiple inventories per glottocode",
x="Number of glottocodes having N inventories",
y="N inventories")
```
Again, most glottocodes are represented by just one or two inventories in
PHOIBLE. Let's see the few glottocodes that have the most inventories:
```{r glottocodes_with_most_invs}
phoible %>%
group_by(Glottocode) %>%
summarise(Names=str_c(unique(LanguageName), collapse=", "),
`Number of inventories`=n_distinct(InventoryID),
.groups="drop") %>%
arrange(desc(`Number of inventories`)) %>%
filter(`Number of inventories` > 5) %>%
kable()
```
## Why are multiple inventories sometimes linked to the same document?
Occasionally, a single document may provide information about multiple
phonological inventories. For example, a dissertation that describes and
compares three related dialects or languages spoken in a particular region. In
that case, three phonological inventories in PHOIBLE corresponding to three
doculects might all be linked to the same document, but each inventory is still
linked to only one document (it just happens to be *the same document* for
those three inventories). One example is @Terrill1998, a grammar of
[Biri](https://en.wikipedia.org/wiki/Biri_language), which contains a chapter
describing its dialects, [many of which appear in
PHOIBLE](https://phoible.org/languages/biri1256).
## What are the different "sources" in PHOIBLE?
PHOIBLE contains inventories from various
[contributors](https://phoible.org/contributors). These contributions are
grouped into so-called "sources", denoted by abbreviations. Here they
are in the chronological order that they were added to PHOIBLE:
* [SPA](https://github.com/phoible/dev/tree/master/raw-data/SPA): The Stanford Phonology Archive [@CrothersEtAl1979]
* [UPSID](https://github.com/phoible/dev/tree/master/raw-data/UPSID): The [UCLA Phonological Segment Inventory Database](http://web.phonetik.uni-frankfurt.de/upsid.html) [@Maddieson1984;@MaddiesonPrecoda1990]
* [AA](https://github.com/phoible/dev/tree/master/raw-data/AA): [Alphabets of Africa](http://sumale.vjf.cnrs.fr/phono/index.htm) [@hartell1993;@chanard2006]
* [PH](https://github.com/phoible/dev/tree/master/raw-data/PH): Data drawn from journal articles, theses, and published grammars, added by members of the Linguistic Phonetics Laboratory at the University of Washington [@Moran2012]
* [GM](https://github.com/phoible/dev/tree/master/raw-data/GM): Data from African and Southeast Asian languages
* [RA](https://github.com/phoible/dev/tree/master/raw-data/RA): Common Linguistic Features in Indian Languages [@ramaswami1999]
* [SAPHON](https://github.com/phoible/dev/tree/master/raw-data/SAPHON): [South American Phonological Inventory Database](http://linguistics.berkeley.edu/~saphon/en/) [@saphon]
* [UZ](https://github.com/phoible/dev/tree/master/raw-data/UZ): Data drawn from journal articles, theses, and published grammars, added by the phoible developers while at the Department of Comparative Linguistics at the University of Zurich
* [EA](https://github.com/phoible/dev/tree/master/raw-data/EA): The database of [Eurasian phonological inventories](http://eurasianphonology.info/) (beta version) [@nikolaev_etal2015]
* [ER](https://github.com/phoible/dev/tree/master/raw-data/ER): Australian phonemic inventories [@Round2019]
The acronyms above link to the GitHub page for each data source, which provides
information about the source and how it was aggregated into PHOIBLE. Some
sources are quite specialized; for example, UPSID contains a quota sample,
i.e., one language per genealogical grouping (see Section "How can I filter or
sample inventories?"); AA contains descriptions of only African languages;
RA represents languges of India; SAPHON represents languages of South America;
GM represents languages of Africa and Asia; EA represents languages of Eurasia;
ER represents languages of Australia. Other sources like PH and UZ were added
mainly to fill in the typological gaps left by the more specialized sources
(e.g., to add language isolates, or to increase coverage of poorly-represented
geographic areas or language families). Here is a table showing the number of
inventories per source:
```{r invs_per_source}
phoible %>%
group_by(Source) %>%
summarise(`Number of inventories`=n_distinct(InventoryID),
.groups="drop") %>%
arrange(desc(`Number of inventories`)) %>%
kable()
```
Note that the same languoid may be reported in different sources as encoded in
different doculects. Here are the lects included in the most sources:
```{r lects_in_most_sources}
phoible %>%
group_by(Glottocode) %>%
distinct(Glottocode, Source) %>%
summarise(`Found in how many sources?`=n(),
`Which ones?`=str_c(Source, collapse=", "),
.groups="drop") %>%
filter(`Found in how many sources?` > 3) %>%
kable()
```
# Filtering and sampling {#filtsamp}
Different research questions will require including / excluding certain
inventories from PHOIBLE. These sections describe how to *filter* and *sample*
the PHOIBLE data based on various criteria.
## Random sampling: one inventory per isocode/glottocode
If multiple inventories per isocode/glottocode are problematic for your
analysis or research question, one approach is to select one inventory from each
isocode/glottocode via random sampling:
```{r sample_one_per_code}
phoible %>%
distinct(InventoryID, Glottocode) %>%
group_by(Glottocode) %>%
sample_n(1) %>%
pull(InventoryID) ->
inventory_ids_sampled_one_per_glottocode
phoible %>%
distinct(InventoryID, ISO6393) %>%
group_by(ISO6393) %>%
sample_n(1) %>%
pull(InventoryID) ->
inventory_ids_sampled_one_per_isocode
message("Picking one inventory per glottocode reduces PHOIBLE from ",
n_distinct(phoible$InventoryID), " inventories\nto ",
length(inventory_ids_sampled_one_per_glottocode),
" inventories. Picking one per ISO 639-3 code yields ",
length(inventory_ids_sampled_one_per_isocode), " inventories.")
```
You can then apply your sample like this:
```{r how_to_apply_a_sample}
phoible %>%
filter(InventoryID %in% inventory_ids_sampled_one_per_glottocode) ->
my_sample
```
## Filtering by data source
Another approach is to only use only inventories from a data source that already
provides a one-inventory-per-language sample. For example, UPSID represents a
"quota" sample (one language per family, for some definition of "family"). To do
this, you can filter the PHOIBLE data by the `Source` column:
```{r prove_upsid_one_inv_per_iso}
phoible %>%
filter(Source == "upsid") ->
upsid
# show that there is exactly one inventory per ISO 639-3 code:
upsid %>%
group_by(ISO6393) %>%
summarise(n_inventories_per_isocode=n_distinct(InventoryID),
.groups="drop") %>%
pull(n_inventories_per_isocode) %>%
all(. == 1)
```
<!--
However, note that the state of knowledge of linguistic genealogical groupings
has changed since UPSID was published. A quota sample that reflects current
understanding can also be achieved. Here, we choose one leaf node from within
each top-level Glottolog family node that has representation in PHOIBLE:
```{r quota_sample_glottolog}
# TODO: EXAMPLE OF QUOTA SAMPLE USING GLOTTOLOG
# NB: every Isolate will be included
```
-->
## Filtering and sampling based on inventory properties
Another approach is to select inventories based on properties of the
inventories themselves, such as whether they include information about
allophones, contrastive tone, etc. For example, one might wish to include
phonological inventories from sources other than UPSID, when available, since
it does not include allophones in its inventories.
```{r preferred_source_ordering}
# get lists of all sources, and sources that include allophones
phoible %>% distinct(Source) %>% pull() -> all_sources
phoible %>%
filter(!is.na(Allophones)) %>%
distinct(Source) %>%
pull() ->
sources_with_allophones
# make a vector encoding allophone absence/presence as 0/1
all_sources %in% sources_with_allophones %>%
as.integer() %>%
setNames(all_sources) ->
source_weights
# example 1: one language per isocode, only keep if includes allophones
phoible %>%
distinct(InventoryID, ISO6393, Source) %>%
filter(source_weights[Source] == 1) %>%
group_by(ISO6393) %>%
slice_sample(n=1) %>%
pull(InventoryID) ->
sample_of_inventory_ids_with_allophones
# example 2: one language per isocode, *preferentially* pick ones w/ allophones
new_weights <- source_weights + 1e-9 # so that all weights are non-zero
phoible %>%
distinct(InventoryID, ISO6393, Source) %>%
group_by(ISO6393) %>%
slice_sample(n=1, weight_by=new_weights[Source]) %>%
pull(InventoryID) ->
sample_of_inventory_ids_with_preference_for_having_allophones
message("Sampling one inventory per ISO code while *requiring* allophones yielded ",
length(sample_of_inventory_ids_with_allophones),
" inventories; merely *preferring* allophones yielded ",
length(sample_of_inventory_ids_with_preference_for_having_allophones),
" inventories.")
```
You can then extract your sample using `filter()` as seen above:
```{r how_to_apply_a_sample_2}
phoible %>% filter(InventoryID %in% sample_of_inventory_ids_with_allophones) ->
my_sample
```
If you're inclined to be a "phoneme splitter", you might prefer to pick the
largest inventory for a given isocode:
```{r sample_by_inventory_size}
phoible %>%
group_by(InventoryID) %>%
summarise(n_phonemes=n(), isocode=unique(ISO6393), .groups="drop") %>%
group_by(isocode) %>%
arrange(desc(n_phonemes), .by_group=TRUE) %>%
slice_head(n=1) %>%
pull(InventoryID) ->
inventory_ids_of_biggest_inventories
```
... and again, extracting your sample using `filter()`:
```{r how_to_apply_a_sample_3}
phoible %>% filter(InventoryID %in% inventory_ids_of_biggest_inventories) ->
my_sample
```
# Integrating geographic information
One way to look at the geographic coverage of PHOIBLE is to merge its data with
information about languages and dialects as provided by
[the Glottolog](https://glottolog.org/meta/downloads). Here we use only the
PHOIBLE index (the mapping from InventoryID to Glottocode, without any Phoneme
information) and merge it with the Glottolog geographic and genealogical data:
```{r merge_glottolog_geo}
url_ <- "https://raw.githubusercontent.com/phoible/dev/master/mappings/InventoryID-LanguageCodes.csv"
phoible_index <- read_csv(url(url_), col_types=cols(InventoryID='i', .default='c'))
url_ <- "https://cdstar.shh.mpg.de/bitstreams/EAEA0-18EC-5079-0173-0/languages_and_dialects_geo.csv"
glottolog <- read_csv(url(url_), col_types=cols(latitude='d', longitude='d', .default='c'))
phoible_geo <- left_join(phoible_index, glottolog,
by=c("Glottocode"="glottocode"))
# show the merged data
phoible_geo %>% head() %>% kable()
```
We can then easily see how many languages there are in PHOIBLE for each
macroarea:
```{r iso_by_macroarea}
phoible_geo %>%
distinct(ISO6393, macroarea) %>%
group_by(macroarea) %>%
tally(name="Number of unique isocodes") %>%
kable()
```
Or, we can count the total number of inventories per macroarea:
```{r invs_by_macroarea}
phoible_geo %>%
group_by(macroarea) %>%
tally(name="Number of inventories") %>%
kable()
```
Note that there are some langoids/glottocodes for which geographic information
is unavailable (their macroarea is `NA`). Also note that each languoid is given
a single latitude-longitude coordinate pair (i.e., there is no information
about the *spatial extent* of a languoid's use).
Finally, let's look at the global distribution of languages represented in
PHOIBLE:
```{r world_map}
ggplot(data=phoible_geo, aes(x=longitude, y=latitude)) +
borders("world", colour="gray50", fill="gray50") +
geom_point(alpha=0.5, size=1, colour="orange")
```
Of course this does not show all of the data points for languages that are
*not* in PHOIBLE!
# Phonological features in PHOIBLE
In addition to phoneme inventories, PHOIBLE includes distinctive feature data
for every phoneme in every inventory. The feature system was created by the
PHOIBLE developers to be descriptively adequate cross-linguistically. In other
words, if two phonemes differ in their graphemic representation, then they
should necessarily differ in their featural representation as well (regardless
of whether those two phonemes coexist in any known doculect). The feature
system is loosely based on the feature system in @Hayes2009 with some additions
drawn from @MoisikEsling2011.
Note that the feature system is potentially subject to change as new languages
are added in subsequent editions of PHOIBLE, and/or as errors are found and
corrected. More information about the PHOIBLE feature set can be found
[here](https://github.com/phoible/dev/tree/master/raw-data/FEATURES).
# How is PHOIBLE used in academic research and/or industry?
The data in PHOIBLE have been used in many published research papers pertaining
to a variety of scientific fields and industrial applications. For a sampling,
view Google Scholar's list of
[papers citing PHOIBLE](https://scholar.google.com/scholar?oi=bibs&hl=en&cites=576981116309388928&as_sdt=5).
# References