-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.tex.Rmd
259 lines (198 loc) · 16.7 KB
/
README.tex.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
---
title: "NUNC Vignette"
author: "Gaetano Romano, Edward Austin, Idris Eckley, and Paul Fernhead"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
fig_width: 7
fig_height: 5
vignette: >
%\VignetteIndexEntry{NUNC Vignette}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r include=FALSE}
library(NUNC)
```
## Introduction to NUNC
NUNC is an acronym for Non-Parametric Unbounded Changepoint detection and is designed to perform the detection of a changes in the distribution of data that has been drawn from an unknown distribution. Furthermore, the algorithm runs in a sliding window. This allows for sequential changepoint detection to take place, along with affording NUNC the ability to deal with non-stationarity in the stream of data. The primary purpose of NUNC is to detect changes in non-stationary or complex distributions of data, where existing parametric detection techniques may fail. The second purpose of NUNC is to perform the analysis online. This means that the data observed by time $t$ is processed before the next data point arrives, and requires that the memory requirements of the algorithm do not overflow as the algorithm is left running and $t \rightarrow \infty$.
The NUNC package contains functions that implement both offline sequential changepoint detection from a pre-recorded dataset, and online changepoint detection where the input is a data generating process rather than a fixed set of data. Furthermore, the package contains implementations of all three variants of the NUNC algorithm. The NUNC algorithms are R wrappers for C++ code, providing fast changepoint detection performance. Furthermore, the online implementation of NUNC utilises the Boost library, and a circular buffer structure, to allow for a stream of data observed in real time to be monitored for a change in distribution.
So that NUNC is able to identify changes in distribution for data that has been drawn from an unknown distribution, a non-parametric approach is taken. This means that NUNC specifies the distribution based only on the observed data, and this distribution changes as different data is observed, rather than relying on classical assumptions such as Gaussianity. As will be discussed in the mathematical background Section, NUNC compares the empirical CDF for the pre and post change data and, if the difference is significant enough, then a changepoint is identified in the data. The comparison takes place using a cost function, and this cost function is formulated by aggregating the difference between the two CDFs at K different quantiles.
An example implementation of NUNC for an online data generating process is presented below. Further examples for both NUNC online, and NUNC offline, are presented later.
```{r eval = TRUE}
set.seed(10)
onlinedata <- c(rnorm(100, 1, 10), rcauchy(100, 12, .1))
f <- function() {
out <- onlinedata[i]
i <<- i + 1
out
}
i <- 1
NUNC_run <- NUNC(f, w = 25, beta = 8, K = 6, method = "local")
#f is the data generating process, w is the window size, beta is the threshold and K the number of quantiles
NUNC_run$changepoint #changepoint location
NUNC_run$t #time detected (edge of sliding window)
```
```{r ani1, fig.show='animate', interval = 0.05, cache = FALSE, echo = FALSE}
set.seed(10)
onlinedata <- c(rnorm(100, 1, 3), rcauchy(100, 12, .1))
for (i in 1:200) {
plot(x = 1:i, y = onlinedata[1:i], xlim=c(1,200), ylim=c(-30, 35))
if(i >= NUNC_run$t){
abline(v = NUNC_run$changepoint, col = "blue")
abline(v = NUNC_run$t, col = "green")
}
legend(x = "topleft",c("Changepoint", "Detection"),
col = c("blue", "green"), lty = 1)
}
```
This vignette is divided into three parts. The first explores the mathematical background for NUNC, the second presents the NUNC package and the functions it contains, and the final section uses NUNC to analyse a dataset drawn from an accelerometer.
## Mathematical Background for NUNC
NUNC is based upon the comparison of the empirical CDF for independent data points $x_1, ..., x_t$ drawn from a distribution $F_{1:t}$. The empirical CDF, $\hat{F}_{1:t}$, provides an estimate for the distribution of the data, and is given by
$$
\hat{F}_{1:t}(q) = \frac{1}{t} \left\{ \sum_{j=1}^{t} \mathbb{I}(x_j < q) + 0.5 \times \mathbb{I}(x_j = q) \right\},
$$
for some quantile $q$. Using the independence assumption, it can be shown that $t\hat{F}(q) \sim \mathrm{Binom}(t, F(q))$, and a likelihood for a set of data points can then be computed. We write the log-likelihood of the segment $x_{\tau_1+1}, \ldots, x_{\tau_2}$ as
$$
\mathcal{L}(x_{\tau_1+1:\tau_2}; q) = (\tau_2 - \tau_1) \left[\hat{F}_{\tau_1+1:\tau_2}(q) \log(\hat{F}_{\tau_1+1:\tau_2}(q)) - (1-\hat{F}_{\tau_1+1:\tau_2}(q))\log(1-\hat{F}_{\tau_1+1:\tau_2}(q)) \right].
$$
This likelihood can then be used to define a statistic for the detection of a change at a single quantile of the distribution based on the likelihood ratio test. This statistic is defined as
$$
\max_{1\leq \tau \leq t} \mathcal{L}(x_{1:\tau}; q) + \mathcal{L}(x_{\tau+1:t}; q) - \mathcal{L}(x_{1:t}; q),
$$
and then in order to increase power we aggregate this statistic across $K$ quantiles. This gives the following test statistic
$$
C_K(x_{1:t}) = \max_{1 \leq \tau \leq t} \frac{1}{K} \sum_{k=1}^K 2 \left[ \mathcal{L}(x_{1:\tau}; q_k) + \mathcal{L}(x_{\tau+1:t}; q_k) - \mathcal{L}(x_{1:t}; q_k) \right],
$$
and it is this test statistic that is used by NUNC to determine if a change occurs.
NUNC Local is based on this test statistic, although restricts the search to a point inside the sliding window of size W. Furthermore, a point is only identified as a change if the test statistic takes a significant enough value, and this significance is measured by the exceedance of a threshold denoted by $\beta$. The stopping rule for for a change at time $t - W + 1 \leq \tau \leq t$ is
$$
\max_{t-W+1 \leq \tau \leq t} \sum_{k=1}^K 2 \left[ \mathcal{L}(x_{t-W+1:\tau}; q_k) + \mathcal{L}(x_{\tau+1:t}; q_k) - \mathcal{L}(x_{t-W+1:t}; q_k) \right] \geq K\beta.
$$
Using this test statistic, NUNC Local can be seen as identifying changes in the distribution of the data inside the sliding window.
One drawback of NUNC Local is that $W$ checks are performed inside each sliding window, despite the test statistic being correlated between nearby points and the power of the test being reduced at the edges of the window. To handle this, a subset of the points in the window can be tested instead. This is implemented in the package using the argument \code{max_checks} in the NUNC functions.
In contrast to NUNC Local, NUNC Global does not search for changes inside the sliding window but instead compares the data in the window to the historic data that has been seen so far. In order to store the historic information in a memory efficient manner we again fix $K$ quantiles and update the longrun CDF, denoted by $z^{(t)}_W(\cdot)$, each time a point leaves the sliding window. This update step is denoted by the recursive equation
$$
z^{(t+1)}_W(q) = \frac{1}{t-W+1} \bigg [ (t-W) z^{(t)}_W(q) + F_{t-W+1:t-W+1}(q).
\bigg ].
$$
This can then be used to compute the likelihood functions for the data, and a change detection test performed in a similar manner to as defined above.
One question that remains is how many quantiles to use, and what value they should take. In line with Haynes (2017) we propose that $K = \lceil a \log(W) \rceil$, and select the value of the quantiles from the sample of data using the formula
$$
q_k = \hat{F}^{-1} \left(1 + \left(2W+1\right) \exp \left[ \frac{c}{K}(2k-1) \right] \right)^{-1}.
$$
The advantage of selecting the quantiles using this formula is that is gives more weight to the tails of the distribution than would be offered by using quantiles that are evenly spaced.
By selecting the $K$ quantiles using the sample of data, the need to understand the underlying distribution for the data is avoided. Despite this, in some circumstances NUNC may be used to monitor a process with a baseline from a known distribution. In this case, it may be advantageous to specify the quantiles from this known distribution, rather than estimating them from a sample of $W$ points. The reason for this is that the estimates of the tail quantiles may be poor in situations where $W$ is not large enough. It is this issue that NUNC Semi-Parametric addresses, by allowing the user to specify the number of quantiles to use through use of the \code{quantiles} argument.
Having discussed the mathematics underpinning NUNC, we can now explore how the NUNC package can be used to detect changes in the distribution of data. We first present a series of examples that detail the different functions in the NUNC package. After this, we analyse an accelerometer dataset that is contained in the package.
## Detecting Changepoints Using NUNC
Three different forms of NUNC are provided in the NUNC package: NUNC Local, NUNC Global, and NUNC Semi-Parametric. NUNC Local tests for changes in the distribution of the data inside the sliding window, whereas NUNC Global tests for changes in the distribution of the data by asking whether the observations in the window are drawn from the same distribution as the historic observations. In both these methods, the test uses quantiles of the empirical CDF that have been selected using the observed data. In some cases, however, it is argued that there will be a strong belief as to what the distribution of the data is, and so NUNC Semi-Parametric allows for the user to specify the quantiles that are used.
In this section we will explore the different variants, and present code for implementing the algorithms on several different examples. We begin by examining the offline variants of NUNC. For NUNC Local we consider a change in a set of mixture distributions:
``` {r eval = TRUE}
set.seed(1)
x_pre <- c(rnorm(500), rnorm(200, 1, 4), rnorm(300, 2, 2))
x_pre <- sample(x_pre, 1000)
x_post <- c(rnorm(100), rnorm(200, -3, 1), rnorm(200, 3, 2), rnorm(500, -1, 1))
x_post <- sample(x_pre, 1000)
x <- c(x_pre, x_post)
#we use [1:3] to supress the data output
NUNCoffline(data = x, w = 200, beta = 12, K = 4*log(200), method = "local")[1:3]
```
We see that NUNCoffline has returned three outputs: \code{t}, the time the change is detected; \code{changepoint}, the location of the change in the window; and \code{method}, which is the method used. We see that a similar analysis can be obtained in a shorter period of time using the \code{max_checks} argument. This is specified as an additional argument using the \code{params} input: this is a \code{list()} containing the argument \code{max_checks =} and can be written as follows:
```{r eval = TRUE}
A <- Sys.time()
NUNCoffline(data = x, w = 200, beta = 12, K = 4*log(200), method = "local")$changepoint
B <- Sys.time()
NUNCoffline(data = x, w = 200, beta = 12, K = 4*log(200), method = "local",
params = list(max_checks = 30))$changepoint
C <- Sys.time()
paste("NUNC Local has a", B-A)
paste("NUNC Local on a subset of points in the window has a ", C-B)
```
We next consider analysing a set of data using NUNC Global. We illustrate this using an example of data that is heavy tailed:
``` {r, eval = TRUE}
set.seed(2)
x_pre <- rcauchy(1000)
x_post <- rcauchy(1000, 1)
x <- c(x_pre, x_post)
NUNCoffline(data = x, w = 100, beta = 5, K = 4*log(100), method = "global")[1:4]
```
We see that the input structure is similar, however the output structure also contains an argument called \code{zVals}; this is the value of the longrun CDF at the $K$ quantiles for the data.
The third form of NUNC is NUNC Semi-Parametric. We explore how this can provide better results than NUNC Local in cases where the quantiles for the data are pre-specified correctly by examining a scenario where there is a high variance in the distribution for the data under the null hypothesis. This scenario makes estimation of the quantiles from a small sample of data difficult.
As in the \code{max_check} case, the additional \code{quantiles} argument is specified using the \code{params} argument. We see in the following example we obtain a correct changepoint when using NUNC Semi-Parametric, whereas an incorrect changepoint is returned using NUNC Local.
``` {r eval = TRUE}
set.seed(3)
x_pre <- rnorm(1000, 0, 10)
x_post <- rnorm(1000, 10, 4)
x <- c(x_pre, x_post)
NUNCoffline(data = x, w = 50, beta = 6, K = 4*log(50), method = "local")[1:3]
quantiles <- qnorm(seq(0.05, 0.99, length = ceiling(4*log(50))), 0, 10)
NUNCoffline(data = x, w = 50, beta = 6, K = 4*log(50), method = "semi-param",
params = list(quantiles = quantiles))[1:4]
```
Online analysis of a data stream can also be performed using NUNC, and this is illustrated by the following examples using the \code{NUNC} function. This takes a similar set of inputs to the \code{NUNCoffline} function, however rather than take a \code{data =} argument, we instead use {f =}, where \code{f} is a data generating function that returns a single number at each time point.
```{r eval = TRUE}
set.seed(4)
onlinedata <- c(rnorm(300, 1, 10), rcauchy(1200, 12, .1))
f <- function() {
out <- onlinedata[i]
i <<- i + 1
#uncomment and run with x11() or windows() to see real time plotting
#plot(tail(onlinedata[1:i], 80), ylab = "test data", type = "l")
out
}
i <- 1
NUNC(f, 50, beta = 10, K = 15, params = list(max_checks = 3), method = "local")
i <- 1
NUNC(f, 50, beta = 10, K = 15)
```
In the above example, the data is fed into the function one point at a time and analysis performed. If the code is run with the plotting aspect of the function uncommented, a user will also be able to see the stream of data plotted whilst it is being observed.
## Analysing Accelerometer Data Using NUNC
Now that a user is familiar with the different functions contained in the NUNC package, we can explore a set of data created using an accelerometer. This dataset is contained within the package, and contains 100 instances where the movement of a Dualshock4 controller has been recorded. The movement is captured by a change in the y axis value recorded by the motion sensor in the controller.
```{r}
data(accelerometer)
```
Each entry in the list is a data frame containing three variables:
\describe{
\item{y}{A vector of length 2000 indicating the y axis position recorded by the controller.}
\item{changepoint}{The time the action occurs.}
\item{action}{The action that is performed.}
}
The four actions that can take place are: picking up the controller, sitting with the controller and moving it, sliding the controller along a surface, and shaking the controller. In every example, the movement of the controller changes the distribution of y axis values.
We can explore an example of this change in distribution:
``` {r, eval = TRUE, echo = FALSE, fig.width = 8.5}
par(mfrow = c(1,2))
plot(accelerometer[[52]]$y, xlab = "Time (milliseconds)", ylab = "y axis",
main = "Sliding Action")
abline(v = accelerometer[[52]]$changepoint, col = "red", lty = 2)
plot(density(accelerometer[[51]]$y[1:accelerometer[[52]]$changepoint]), col = "blue",
main = "Comparison of Distributions \n for Sliding Data")
lines(density(accelerometer[[51]]$y[(accelerometer[[51]]$changepoint +
1):length(accelerometer[[52]]$y)]), col = "red")
legend(x = 1540, y = 0.020, c("Pre-Change", "Post-Change"), col = c("blue", "red"), lty = 1)
```
We can now use NUNC to detect changes in the distribution of the data. For example
```{r, eval = TRUE}
accelerometer[[52]]$changepoint
NUNCoffline(data = accelerometer[[52]]$y, w = 100, beta = 17, K = 10, method = "local",
params = list(max_checks = 30))$changepoint
NUNCoffline(data = accelerometer[[52]]$y, w = 100, beta = 17, K = 10, method= "global")$changepoint
```
In both instances of NUNC, there is some delay between the emergence and the detection of the change. This is to be expected given the sequential nature of the algorithm. We also observe the better performance of NUNC Global, which again is to be expected given the fact that the pre-change distribution is stationary.
We are then able to plot the output of this analysis using the \code{NUNCplot} function, with the changepoint location indicated by the red line:
```{r echo = FALSE, eval = TRUE, fig.width = 8.5}
par(mfrow = c(1,2))
local_run <- NUNCoffline(data = accelerometer[[52]]$y, w = 100, beta = 17, K = 10,
method = "local", params = list(max_checks = 30))
NUNCplot(local_run, xlab = "Time", ylab = "Data", main = "NUNC Local")
abline(v = accelerometer[[52]]$changepoint, col = "red", lty = 2)
global_run <- NUNCoffline(data = accelerometer[[52]]$y, w = 100, beta = 17, K = 10,
method= "global")
NUNCplot(global_run, xlab = "Time", ylab = "Data", main = "NUNC Global")
abline(v = accelerometer[[52]]$changepoint, col = "red", lty = 2)
```