# Data {#sec-data}
Before we start on inference, we need to think about how our models fit to data, and how we can get data into them. This brings us very close to the interface we need in order to work with [monty](https://mrc-ide.github.io/monty), which is the focus of the next section of the book.
```{r}
#| include: false
source("common.R")
set.seed(42)
```
```{r}
library(odin2)
library(dust2)
```
## The SIR model revisited
In a couple of chapters we'll explore fitting a stochastic SIR model to data, so we'll start with the expressions from the model in @sec-stochastic-sir.
```{r}
sir <- odin({
  update(S) <- S - n_SI
  update(I) <- I + n_SI - n_IR
  update(R) <- R + n_IR
  update(incidence) <- incidence + n_SI

  p_SI <- 1 - exp(-beta * I / N * dt)
  p_IR <- 1 - exp(-gamma * dt)
  n_SI <- Binomial(S, p_SI)
  n_IR <- Binomial(I, p_IR)

  initial(S) <- N - I0
  initial(I) <- I0
  initial(R) <- 0
  initial(incidence, zero_every = 1) <- 0

  N <- parameter(1000)
  I0 <- parameter(10)
  beta <- parameter(0.2)
  gamma <- parameter(0.1)
})
```
Assuming our time unit is one day, we can run the model in time steps of a quarter of a day and plot incidence at daily intervals:
```{r}
pars <- list(beta = 1, gamma = 0.6)
sys <- dust_system_create(sir, pars, dt = 0.25)
dust_system_set_state_initial(sys)
t <- seq(0, 20)
y <- dust_system_simulate(sys, t)
y <- dust_unpack_state(sys, y)
plot(t, y$incidence, type = "p", xlab = "Time", ylab = "Incidence")
```
## The data cometh
We need a data set to fit to; we'll use the data set [`data/incidence.csv`](data/incidence.csv), which you can download.
```{r}
d <- read.csv("data/incidence.csv")
head(d)
tail(d)
```
We have a column for time (`time`) and one for observed cases (`cases`), spanning the time range 0 to 20.
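Before fitting anything, it's worth a quick look at the shape of the series we are trying to match:
```{r}
plot(cases ~ time, d, pch = 19, col = "red",
     xlab = "Time", ylab = "Cases")
```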
## Comparison to data {#sec-simple-sir-data}
The next step is to tell our model about this data:
```{r}
sir <- odin({
  update(S) <- S - n_SI
  update(I) <- I + n_SI - n_IR
  update(R) <- R + n_IR
  update(incidence) <- incidence + n_SI

  p_SI <- 1 - exp(-beta * I / N * dt)
  p_IR <- 1 - exp(-gamma * dt)
  n_SI <- Binomial(S, p_SI)
  n_IR <- Binomial(I, p_IR)

  initial(S) <- N - I0
  initial(I) <- I0
  initial(R) <- 0
  initial(incidence, zero_every = 1) <- 0

  N <- parameter(1000)
  I0 <- parameter(10)
  beta <- parameter(0.2)
  gamma <- parameter(0.1)

  cases <- data()
  cases ~ Poisson(incidence)
})
```
```
The last two lines here are doing the work for us:
First, `cases <- data()` says that `cases` is a special **data** variable. It will vary over time (there are different observations of `cases` at different times) and it will come from the data rather than from the model dynamics.
Second, `cases ~ Poisson(incidence)` describes the per-data-point [likelihood calculation](https://en.wikipedia.org/wiki/Likelihood_function); the syntax may be familiar to you if you have read [Richard McElreath's *Statistical Rethinking*](https://xcelab.net/rm/) or used any number of Bayesian statistical frameworks.
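Concretely, each observation contributes the Poisson log-density of the observed `cases` given the modelled `incidence`:

$$
\log \Pr(\mathrm{cases} \mid \mathrm{incidence}) = \mathrm{cases} \log(\mathrm{incidence}) - \mathrm{incidence} - \log(\mathrm{cases}!)
$$

which is exactly what `dpois(cases, incidence, log = TRUE)` computes, as we will verify below.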
::: {.callout-note}
A data variable cannot have the same name as a state variable. Where one directly corresponds to the other, you may need to choose names carefully so that you can remember which refers to which.
:::
The generator will advertise that this system can be compared to data:
```{r}
sir
```
Once we have our new model, we can see how the data comparison works.
```{r}
sys <- dust_system_create(sir, list(), n_particles = 10)
dust_system_set_state_initial(sys)
dust_system_run_to_time(sys, d$time[[1]])
dust_system_compare_data(sys, d[1, ])
```
This runs the model up to the time of the first observation (time `r d$time[[1]]`), without returning anything to R, and then uses `dust_system_compare_data` with the first row of data. The result is a vector of 10 log-likelihoods, one per particle.
::: {.callout-note}
This is probably the first mention of a "particle", and it is from this that the name `dust` is derived. As a nod to the particle filter, which will be the focus of the next section, we refer to each realisation of the dynamical system as a "particle".
:::
To demystify this a little we can perform this calculation manually. First we extract the state from the system, unpacking it to make it easier to work with:
```{r}
s <- dust_unpack_state(sys, dust_system_state(sys))
s
```
We can then compute the log-likelihood manually, and bind it together with the calculation from dust to compare:
```{r}
cbind(dpois(d$cases[[1]], s$incidence, log = TRUE),
      dust_system_compare_data(sys, d[1, ]))
```
```{r}
#| include: false
# Error on build if these are in fact not the same!
stopifnot(isTRUE(all.equal(
  dpois(d$cases[[1]], s$incidence, log = TRUE),
  dust_system_compare_data(sys, d[1, ]))))
```
As you can see, these are the same; the expression
```r
cases ~ Poisson(incidence)
```
has done the same sort of thing as we just did manually, asking "what is the (log) probability of observing `cases` cases if the underlying rate of incidence is `incidence`?". Note that this can be `-Inf` (a probability of 0) where the modelled incidence is zero.
```{r}
dpois(10, 0, log = TRUE)
```
This is fine, so long as not all particles are impossible; an impossible particle simply drops out of the calculation, but if every particle is impossible the likelihood of the whole series will be zero.
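Putting these pieces together, we could walk the system through the whole data set ourselves, accumulating each particle's log-likelihood as we go. This is a sketch using only the functions shown above; note that there is no resampling here, which is the key ingredient that the particle filter below adds:
```{r}
sys2 <- dust_system_create(sir, list(), n_particles = 10)
dust_system_set_state_initial(sys2)
ll <- numeric(10)
for (i in seq_len(nrow(d))) {
  # Advance all particles to the time of observation i, then score them
  dust_system_run_to_time(sys2, d$time[[i]])
  ll <- ll + dust_system_compare_data(sys2, d[i, ])
}
ll
```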
## The particle filter {#sec-data-filter}
We include a simple bootstrap particle filter in `dust` for estimating the marginal likelihood: the probability of the entire data series conditional on our model, marginalised (averaged) over all stochastic realisations. Because our model is stochastic this is an **estimate** of the likelihood, but the estimate becomes less noisy as the number of particles increases.
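If you haven't met a particle filter before, here is a minimal sketch of the bootstrap algorithm for a toy model (a latent Gaussian random walk observed through a Poisson, invented here purely for illustration; this is not dust's implementation, and it ignores edge cases such as all weights being zero):
```{r}
# Toy model: x[t] = x[t - 1] + N(0, 0.5); y[t] ~ Poisson(exp(x[t]))
y <- rpois(20, exp(cumsum(rnorm(20, 0, 0.5)) + 2))

n <- 500
x <- rnorm(n, 2, 1) # initial particle cloud
ll <- 0
for (t in seq_along(y)) {
  x <- x + rnorm(n, 0, 0.5)                   # propagate each particle
  w <- dpois(y[t], exp(x))                    # weight by the data likelihood
  ll <- ll + log(mean(w))                     # accumulate the marginal likelihood
  x <- sample(x, n, replace = TRUE, prob = w) # resample in proportion to weight
}
ll
```
Each step propagates the particles with the model's own dynamics, weights them by how well they explain the observation, and resamples so that plausible particles are duplicated and implausible ones are dropped; the mean weight at each step estimates that step's contribution to the likelihood. dust implements the same idea for our model: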
```{r}
filter <- dust_filter_create(sir, time_start = 0, data = d, n_particles = 200)
```
Here, we've created a filter whose time series starts at 0, using our data set `d` and 200 particles to start with. We run the filter with `dust_likelihood_run`, which returns a log-likelihood:
```{r}
dust_likelihood_run(filter, list())
```
Each time we run the filter, though, we'll get a different answer:
```{r}
dust_likelihood_run(filter, list())
```
For example, running 100 times:
```{r}
ll <- replicate(100, dust_likelihood_run(filter, list()))
mean(ll)
var(ll)
```
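This noise is easy to see directly by plotting the spread of the 100 estimates:
```{r}
hist(ll, xlab = "Log-likelihood estimate", main = "")
```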
If we increase the number of particles, this variance will decrease, at the cost of taking longer to run:
```{r}
filter2 <- dust_filter_create(sir, time_start = 0, data = d,
                              n_particles = 2000)
ll2 <- replicate(100, dust_likelihood_run(filter2, list()))
mean(ll2)
var(ll2)
```
The log-likelihoods from different numbers of particles are not directly comparable, though: the filter gives an unbiased estimate of the likelihood itself, so the log of that estimate is biased downwards (by Jensen's inequality), and the size of that bias depends on the number of particles.
You can extract trajectories from the filter after running it; this gives a *tree* showing the ancestry of the particles remaining in the sample. Because saving trajectories incurs a runtime cost, you must opt in with `save_trajectories = TRUE`:
```{r}
ll <- dust_likelihood_run(filter, list(), save_trajectories = TRUE)
h <- dust_likelihood_last_trajectories(filter)
```
The trajectories will be an array with 3 dimensions representing (in turn) state, particle and time:
```{r}
dim(h)
```
We can plot the trajectories of our incidence here:
```{r}
matplot(d$time, t(dust_unpack_state(filter, h)$incidence), type = "l",
        lty = 1, col = "#00000044", ylim = c(0, 75),
        xlab = "Time", ylab = "Incidence")
points(cases ~ time, d, pch = 19, col = "red")
```
This model does not fit the data very well! The points don't lie close to the modelled trajectories. We can also see that everything before time ~10 has collapsed onto a single ancestral trajectory, which will increase the variance of the estimator.
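We can put a rough number on that degeneracy by counting the distinct incidence values across particles at each time point (a rough proxy for the number of surviving ancestral lines, since separate lineages can coincide on the same value by chance):
```{r}
n_distinct <- apply(dust_unpack_state(filter, h)$incidence, 2,
                    function(x) length(unique(x)))
plot(d$time, n_distinct, type = "s", xlab = "Time",
     ylab = "Distinct incidence trajectories")
```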
If we found better parameters, things would look much better. We happen to know that with `beta = 1` and `gamma = 0.6` the model fits well:
```{r}
pars <- list(beta = 1, gamma = 0.6)
ll <- dust_likelihood_run(filter, pars, save_trajectories = TRUE)
h <- dust_likelihood_last_trajectories(filter)
matplot(d$time, t(dust_unpack_state(filter, h)$incidence), type = "l",
        lty = 1, col = "#00000044", ylim = c(0, 75),
        xlab = "Time", ylab = "Incidence")
points(cases ~ time, d, pch = 19, col = "red")
```
As the model fit improves and the log-likelihood increases, the variance of the estimator also decreases, because the model struggles less to describe the data:
```{r}
ll <- replicate(100, dust_likelihood_run(filter, pars))
mean(ll)
var(ll)
```
Moving from the poorly fitting parameters to the better-fitting ones is the process of **inference**, which will be the focus of most of the rest of this book.