# Data {#sec-data}
Before we start on inference, we need to think about how our models fit to data, and how we can get data into them. This brings us very close to the interface we need in order to work with [monty](https://mrc-ide.github.io/monty), which is the focus of the next section of the book.
```{r}
#| include: false
source("common.R")
set.seed(42)
```
```{r}
library(odin2)
library(dust2)
```
## The SIR model revisited
In a couple of chapters we'll explore fitting a stochastic SIR model to data, so we'll start with the expressions from the model in @sec-stochastic-sir.
```{r}
sir <- odin({
  update(S) <- S - n_SI
  update(I) <- I + n_SI - n_IR
  update(R) <- R + n_IR
  update(incidence) <- incidence + n_SI

  p_SI <- 1 - exp(-beta * I / N * dt)
  p_IR <- 1 - exp(-gamma * dt)
  n_SI <- Binomial(S, p_SI)
  n_IR <- Binomial(I, p_IR)

  initial(S) <- N - I0
  initial(I) <- I0
  initial(R) <- 0
  initial(incidence, zero_every = 1) <- 0

  N <- parameter(1000)
  I0 <- parameter(10)
  beta <- parameter(0.2)
  gamma <- parameter(0.1)
})
```
Assuming our time unit is one day, we can run the model in time steps of a quarter of a day and plot incidence at daily intervals:
```{r}
pars <- list(beta = 1, gamma = 0.6)
sys <- dust_system_create(sir, pars, dt = 0.25)
dust_system_set_state_initial(sys)
t <- seq(0, 20)
y <- dust_system_simulate(sys, t)
y <- dust_unpack_state(sys, y)
plot(t, y$incidence, type = "p", xlab = "Time", ylab = "Incidence")
```
## The data cometh
We need a data set to fit to; we'll use the data set [`data/incidence.csv`](data/incidence.csv), which you can download.
```{r}
d <- read.csv("data/incidence.csv")
head(d)
tail(d)
```
We have a column for time (`time`) and one for observed cases (`cases`), spanning the time range 0 to 20.
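Before fitting anything, it's worth a quick look at the shape of the series we are trying to match:
```{r}
plot(cases ~ time, d, pch = 19, col = "red",
     xlab = "Time", ylab = "Cases")
```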
## Comparison to data {#sec-simple-sir-data}
The next step is to tell our model about this data:
```{r}
sir <- odin({
  update(S) <- S - n_SI
  update(I) <- I + n_SI - n_IR
  update(R) <- R + n_IR
  update(incidence) <- incidence + n_SI

  p_SI <- 1 - exp(-beta * I / N * dt)
  p_IR <- 1 - exp(-gamma * dt)
  n_SI <- Binomial(S, p_SI)
  n_IR <- Binomial(I, p_IR)

  initial(S) <- N - I0
  initial(I) <- I0
  initial(R) <- 0
  initial(incidence, zero_every = 1) <- 0

  N <- parameter(1000)
  I0 <- parameter(10)
  beta <- parameter(0.2)
  gamma <- parameter(0.1)

  cases <- data()
  cases ~ Poisson(incidence)
})
```
```
The last two lines here are doing the work for us:
First, `cases <- data()` says that `cases` is a special **data** variable. It will vary over time (there are different observations of `cases` at different times) and it will come from the data rather than from the model dynamics.
Second, `cases ~ Poisson(incidence)` describes the per-data-point [likelihood calculation](https://en.wikipedia.org/wiki/Likelihood_function); the syntax may be familiar to you if you have read [Richard McElreath's *Statistical Rethinking*](https://xcelab.net/rm/) or used any number of Bayesian statistical frameworks.
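Concretely, each observation contributes the Poisson log-density of the observed `cases` given the modelled `incidence`:

$$
\log \Pr(\mathrm{cases} \mid \mathrm{incidence}) = \mathrm{cases} \log(\mathrm{incidence}) - \mathrm{incidence} - \log(\mathrm{cases}!)
$$

which is exactly what `dpois(cases, incidence, log = TRUE)` computes, as we will verify below.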
::: {.callout-note}
A data variable cannot have the same name as a state variable. Where one directly corresponds to the other, you may need to choose names carefully so that you can remember which refers to which.
:::
The generator will advertise that this system can be compared to data:
```{r}
sir
```
Once we have our new model, we can see how the data comparison works.
```{r}
sys <- dust_system_create(sir, list(), n_particles = 10)
dust_system_set_state_initial(sys)
dust_system_run_to_time(sys, d$time[[1]])
dust_system_compare_data(sys, d[1, ])
```
This runs the model up to the time of the first observation (time `r d$time[[1]]`), without returning anything to R, and then uses `dust_system_compare_data` with the first row of data. The result is a vector of 10 log-likelihoods, one per particle.
::: {.callout-note}
This is probably the first mention of a "particle", and it is from this that the name `dust` is derived. As a nod to the particle filter, which will be the focus of the next section, we refer to each realisation of the dynamical system as a "particle".
:::
To demystify this a little we can perform this calculation manually. First we extract the state from the system, unpacking it to make it easier to work with:
```{r}
s <- dust_unpack_state(sys, dust_system_state(sys))
s
```
We can then compute the log-likelihood manually, and bind it together with the calculation from dust to compare:
```{r}
cbind(dpois(d$cases[[1]], s$incidence, log = TRUE),
      dust_system_compare_data(sys, d[1, ]))
```
```{r}
#| include: false
# Error on build if these are in fact not the same!
stopifnot(isTRUE(all.equal(
  dpois(d$cases[[1]], s$incidence, log = TRUE),
  dust_system_compare_data(sys, d[1, ]))))
```
As you can see, these are the same; the expression
```r
cases ~ Poisson(incidence)
```
has done the same sort of thing as we just did manually, asking "what is the (log) probability of observing `cases` cases if the underlying rate of incidence is `incidence`?". Note that this can be `-Inf` (a probability of 0) where the modelled incidence is zero.
```{r}
dpois(10, 0, log = TRUE)
```
This is fine, so long as not all particles are impossible; an impossible particle simply drops out of the calculation, but if every particle is impossible the likelihood of the whole series will be zero.
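Putting these pieces together, we could walk the system through the whole data set ourselves, accumulating each particle's log-likelihood as we go. This is a sketch using only the functions shown above; note that there is no resampling here, which is the key ingredient that the particle filter below adds:
```{r}
sys2 <- dust_system_create(sir, list(), n_particles = 10)
dust_system_set_state_initial(sys2)
ll <- numeric(10)
for (i in seq_len(nrow(d))) {
  # Advance all particles to the time of observation i, then score them
  dust_system_run_to_time(sys2, d$time[[i]])
  ll <- ll + dust_system_compare_data(sys2, d[i, ])
}
ll
```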
## The particle filter {#sec-data-filter}
We include a simple bootstrap particle filter in `dust` for estimating the marginal likelihood: the probability of the entire data series conditional on our model, marginalised (averaged) over all stochastic realisations. Because our model is stochastic this is an **estimate** of the likelihood, but the estimate becomes less noisy as the number of particles increases.
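If you haven't met a particle filter before, here is a minimal sketch of the bootstrap algorithm for a toy model (a latent Gaussian random walk observed through a Poisson, invented here purely for illustration; this is not dust's implementation, and it ignores edge cases such as all weights being zero):
```{r}
# Toy model: x[t] = x[t - 1] + N(0, 0.5); y[t] ~ Poisson(exp(x[t]))
y <- rpois(20, exp(cumsum(rnorm(20, 0, 0.5)) + 2))

n <- 500
x <- rnorm(n, 2, 1) # initial particle cloud
ll <- 0
for (t in seq_along(y)) {
  x <- x + rnorm(n, 0, 0.5)                   # propagate each particle
  w <- dpois(y[t], exp(x))                    # weight by the data likelihood
  ll <- ll + log(mean(w))                     # accumulate the marginal likelihood
  x <- sample(x, n, replace = TRUE, prob = w) # resample in proportion to weight
}
ll
```
Each step propagates the particles with the model's own dynamics, weights them by how well they explain the observation, and resamples so that plausible particles are duplicated and implausible ones are dropped; the mean weight at each step estimates that step's contribution to the likelihood. dust implements the same idea for our model: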
```{r}
filter <- dust_filter_create(sir, time_start = 0, data = d, n_particles = 200)
```
Here, we've created a filter whose time series starts at 0, using our data set `d` and 200 particles to start with. We run the filter with `dust_likelihood_run`, which returns a log-likelihood:
```{r}
dust_likelihood_run(filter, list())
```
Each time we run the filter, though, we'll get a different answer:
```{r}
dust_likelihood_run(filter, list())
```
For example, running 100 times:
```{r}
ll <- replicate(100, dust_likelihood_run(filter, list()))
mean(ll)
var(ll)
```
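This noise is easy to see directly by plotting the spread of the 100 estimates:
```{r}
hist(ll, xlab = "Log-likelihood estimate", main = "")
```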
If we increase the number of particles, this variance will decrease, at the cost of taking longer to run:
```{r}
filter2 <- dust_filter_create(sir, time_start = 0, data = d,
                              n_particles = 2000)
ll2 <- replicate(100, dust_likelihood_run(filter2, list()))
mean(ll2)
var(ll2)
```
The log-likelihoods from different numbers of particles are not directly comparable, though: the filter gives an unbiased estimate of the likelihood itself, so the log of that estimate is biased downwards (by Jensen's inequality), and the size of that bias depends on the number of particles.
You can extract trajectories from the filter after running it; this gives a *tree* showing the ancestry of the particles remaining in the sample. Because saving trajectories incurs a runtime cost, you must opt in with `save_trajectories = TRUE`:
```{r}
ll <- dust_likelihood_run(filter, list(), save_trajectories = TRUE)
h <- dust_likelihood_last_trajectories(filter)
```
The trajectories will be an array with 3 dimensions representing (in turn) state, particle and time:
```{r}
dim(h)
```
We can plot the trajectories of our incidence here:
```{r}
matplot(d$time, t(dust_unpack_state(filter, h)$incidence), type = "l",
        lty = 1, col = "#00000044", ylim = c(0, 75),
        xlab = "Time", ylab = "Incidence")
points(cases ~ time, d, pch = 19, col = "red")
```
This model does not fit the data very well! The points don't lie close to the modelled trajectories. We can also see that everything before time ~10 has collapsed onto a single ancestral trajectory, which will increase the variance of the estimator.
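We can put a rough number on that degeneracy by counting the distinct incidence values across particles at each time point (a rough proxy for the number of surviving ancestral lines, since separate lineages can coincide on the same value by chance):
```{r}
n_distinct <- apply(dust_unpack_state(filter, h)$incidence, 2,
                    function(x) length(unique(x)))
plot(d$time, n_distinct, type = "s", xlab = "Time",
     ylab = "Distinct incidence trajectories")
```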
If we found better parameters, things would look much better. We happen to know that with `beta = 1` and `gamma = 0.6` the model fits well:
```{r}
pars <- list(beta = 1, gamma = 0.6)
ll <- dust_likelihood_run(filter, pars, save_trajectories = TRUE)
h <- dust_likelihood_last_trajectories(filter)
matplot(d$time, t(dust_unpack_state(filter, h)$incidence), type = "l",
        lty = 1, col = "#00000044", ylim = c(0, 75),
        xlab = "Time", ylab = "Incidence")
points(cases ~ time, d, pch = 19, col = "red")
```
As the model fit improves and the log-likelihood increases, the variance of the estimator also decreases, because the model struggles less to describe the data:
```{r}
ll <- replicate(100, dust_likelihood_run(filter, pars))
mean(ll)
var(ll)
```
Moving from the poorly fitting parameters to the better-fitting ones is the process of **inference**, which will be the focus of most of the rest of this book.