-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathREADME.Rmd
297 lines (229 loc) · 9.48 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
---
output: github_document
---
```{r rmd-setup, include = FALSE}
library(elections.dtree)
library(prefio)
knitr::opts_chunk$set(fig.path = "man/figures/")
set.seed(1)
```
# [elections.dtree](https://fleverest.github.io/elections.dtree): Audit ranked voting elections with Dirichlet-trees
Perform ballot-polling Bayesian audits for ranked voting elections using a Dirichlet-tree prior distribution.
<!-- badges: start -->
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/elections.dtree)](https://cran.r-project.org/package=elections.dtree)
[![pkgdown](https://github.com/fleverest/elections.dtree/workflows/pkgdown/badge.svg)](https://fleverest.github.io/elections.dtree/)
[![R-CMD-check](https://github.com/fleverest/elections.dtree/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/fleverest/elections.dtree/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/fleverest/elections.dtree/branch/main/graph/badge.svg)](https://app.codecov.io/gh/fleverest/elections.dtree?branch=main)
<!-- badges: end -->
## Installation
#### CRAN
To install the latest minor release of `elections.dtree` from CRAN:
```{r, eval = FALSE}
install.packages("elections.dtree")
```
#### GitHub
To install the development release of `elections.dtree` from GitHub:
```{r, eval = FALSE}
# install.packages("remotes")
remotes::install_github("fleverest/elections.dtree")
```
## About the project
#### Why?
Bayesian audits of elections typically employ a Dirichlet prior, conjugate to
a multinomial distribution for the observed ballots. For ranked voting systems,
the number of ballot types can grow factorially. In the case of instant-runoff
voting (IRV), a popular type of ranked voting system, the number of possible
ballot types is `n!` with `n` candidates (assuming all candidates are ranked;
it is even greater if partially completed ballots are permitted).
The Dirichlet distribution is a simple and effective choice if the number of
ballot types is small, but becomes problematic when this number gets large. As
`n` grows, the prior concentration parameters (defined by `a0` in our
implementation) need to be on the order of `1 / n!` to ensure the prior is not
overly informative. If `n` is large enough, this may be smaller than the
available precision. Also, the fact that this varies by `n` is inconvenient.
A more practical alternative is given by the Dirichlet-tree distribution,
which we implement in this package.
The Dirichlet-tree distribution consists of nested Dirichlet distributions,
arranged hierarchically in a tree structure. The structure represents the
preference ordering of the candidates. Branches coming out of each node
correspond to choices for which candidate to select as the next preferred, and
nodes represent a ranking of candidates (a complete ranking for leaf nodes,
and an incomplete ranking for internal nodes). We place a Dirichlet
distribution at each node, to model the conditional split of preferences at
that node. The structure as a whole then defines a Dirichlet-tree
distribution. Just like the Dirichlet, it is conjugate to a multinomial
distribution. Also, the Dirichlet-tree is a generalisation, including
a Dirichlet as a special case.
Using the Dirichlet-tree as a prior distribution allows it to scale efficiently
to large `n`, and does not require setting the concentration parameters (`a0`)
to values that depend on `n` to perform well.
depend on `n` to perform well.
#### How?
In this package, the tree structure is implemented such that the nodes are only
initialised when they appear in the observed ballot data (i.e. when at least
one of the ballots includes a preference sequence that is represented by that
node). This allows the memory-complexity to be `O(n*m)`, where `m` is the
number of ballots observed in the audit process.
Without such lazy evaluation, the memory-complexity is necessarily `O(n!)`.
Hence, our implementation of the Dirichlet distribution (based on a reducible
Dirichlet-tree structure) enables a larger set of candidates than would be
possible using traditional methods.
To sample unseen ballots without loading all nodes into memory, this repository
implements a recursive strategy that generates samples starting from any
uninitialized point in the tree. This works well for IRV ballot structures,
since the permutation-tree structure is easily navigated given a target ballot.
Currently, only IRV elections have been implemented, but other ranked voting
systems could be supported by implementing the corresponding social choice
function.
In order to support different tree structures or other elections which deal
with high cardinality, a similar recursive strategy for sampling should be
developed (and anyone designing a new tree structure should consider this
carefully).
## Usage
#### Load some data
Let's start off by downloading some data from
[PrefLib](https://www.preflib.org/) using
[prefio](https://cran.r-project.org/package=prefio). In this example, we select
the Wakehurst 2023 dataset containing the ballots cast in Wakehurst for the 2023
NSW Legislative Assembly Election. Other elections for the NSW Legislative
Assembly can be found [here](https://www.preflib.org/dataset/00058).
```{r}
wakehurst <- prefio::read_preflib("nswla/00058-00000273.soi", from_preflib = TRUE)
head(wakehurst)
```
We can see that in this election, voters could specify as few as one candidate.
Since the underlying structure for `prefio::preferences` is a ranking matrix,
we can easily find the length of the longest ballot as follows:
```{r}
max(wakehurst$preferences, na.rm = TRUE)
```
In this election, the maximum ballot length was not bounded.
If we inspect the first-preferences, we can see that while `"WILLIAMS Toby"`
receives the most first-preference ballots, `"REGAN Michael"` wins the election
with the distributed ballots from the IRV process:
```{r}
wakehurst[, 1, by.rank = TRUE]
```
```{r}
social_choice(
as.preferences(wakehurst),
sc_function = "plurality"
)
```
```{r}
social_choice(
as.preferences(wakehurst),
sc_function = "irv"
)
```
#### Specifying the prior
To work with `dirichlet_tree`s in R, there are two interfaces at your disposal.
The first is a standard S3 interface which all R users will be familiar with.
The second is an R6 interface which can be useful when you would like to chain
commands. Other than the ability to chain commands, either interface is
equivalent.
Here, we initialise a new Dirichlet-tree for the Wakehurst 2023 election, requiring
exactly at least one preference specified per formal ballot and using a prior
`a0` parameter equal to `1.5`.
```{r}
# S3 interface
dtree <- dirtree(
candidates = names(wakehurst$preferences),
min_depth = 1,
a0 = 1.5
)
# R6 interface (equivalent to S3)
dtree <- dirichlet_tree$new(
candidates = names(wakehurst$preferences),
min_depth = 1,
a0 = 1.5
)
dtree
```
#### Observing data
The data observed during the auditing process should be formatted as a
`prefio::preferences` object. Currently, our `wakehurst` object has class
`prefio::aggregated_preferences`. We will also shuffle the ballots to imitate
a real auditing scenario where we sample the ballots at random. We will only
use 1000 ballots for ease of computation. To convert it to the appropriate
format and shuffle we can do the following:
```{r}
ballots <- sample(prefio::as.preferences(wakehurst))[1:1000]
```
Then to observe our first batch of say 10 ballots, we can update the model to obtain our posterior:
```{r}
# S3
update(dtree, ballots[1:10])
# R6
dtree$update(ballots[1:10])
# R6 using chained commands
dtree <- dirichlet_tree$new(
candidates = names(wakehurst$preferences),
min_depth = 1,
a0 = 1.5
)$update(ballots[1:10])
dtree
```
#### Bayesian inference using the posterior Dirichlet-tree
To conduct Bayesian inference using the posterior we have just obtained, we
can employ monte-carlo simulation to simulate the unseen ballots and determine
the election outcome many times. This can be used to estimate the probability
that any particular candidate goes on to win the election under the posterior
distribution.
```{r}
# S3
ps <- sample_posterior(dtree, n_elections = 1000, n_ballots = length(ballots))
# R6
ps <- dtree$sample_posterior(n_elections = 1000, n_ballots = length(ballots))
# R6 using chained commands
ps <- dirichlet_tree$new(
candidates = names(wakehurst$preferences),
min_depth = 1,
a0 = 1.5
)$update(
ballots[1:10]
)$sample_posterior(
n_elections = 100,
n_ballots = length(ballots)
)
ps
```
We can also take one realisation from the posterior distribution to examine it
directly, rather than automatically compute simulated election outcomes and
aggregating the results in a single step. To simulate the unobserved
`N = length(ballots) - 10` ballots from the posterior predictive distribution,
we can run the following command:
```{r}
# S3
simulated <- sample_predictive(dtree, n_ballots = length(ballots) - 10)
# R6
simulated <- dtree$sample_predictive(n_ballots = length(ballots) - 10)
# R6 using chained commands
simulated <- dirichlet_tree$new(
candidates = names(wakehurst$preferences),
min_depth = 1,
a0 = 1.5
)$update(
ballots[1:10]
)$sample_predictive(n_ballots = length(ballots) - 10)
head(simulated)
```
Then we can compute the (IRV) election outcome for this simulation (plus the
observed data) explicitly like this:
```{r}
social_choice(
rbind(
ballots[1:10],
prefio::as.preferences(simulated)
),
sc_function = "irv"
)
```
Finally, we can reset the model to the original prior distribution:
```{r}
# S3
reset(dtree)
# R6
dtree$reset()
dtree
```