-
Notifications
You must be signed in to change notification settings - Fork 0
/
part1Ver4.Rmd
425 lines (305 loc) · 11.5 KB
/
part1Ver4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
---
title: 'Goals scored in a soccer match: fitting distributions (Ver 2)'
author: "Magdiel Ablan"
date: "07/02/2019"
always_allow_html: true
output:
github_document:
toc: true
toc_depth: 2
---
## Introduction
This is part 1 of the project about goals scored in a football/soccer match.
The objetive is to analyze the fit of 3 different distributions against observed
data. The data are:
* Home Goals
* Away Goals
* Total Goals
For each of these variables 3 distributions will be fit:
* Poisson
* Negative Binomial
* Zero-inflated Poisson
## Data preparation
First, we need to load all the required libraries:
```{r setup, echo=TRUE, message=FALSE}
# To read xlsx files:
library(readxl)
# To fit ditributions:
library(fitdistrplus)
# Add to R zip related distributions
library(VGAM)
# Legend for the plots
plot.legend <- c("Poisson", "Negative binomial","ZIP")
plot.legend2 <- c("Poisson", "Negative binomial")
# To print nicer tables
library(kableExtra)
# To know how to print tables
outputFormat = knitr::opts_knit$get("rmarkdown.pandoc.to")
```
We now read the data. The `raw.xlsx` file contains only the columns of data
corresponding at the goal variables.
```{r}
raw <- read_excel("data/raw.xlsx")
head(raw)
summary(raw)
n <- length(raw$HG)
```
There are a total of `r n ` soccer games in the file. Summary stats look as
expected.
## Home goals (HG)
`HG.sum` is table that summarises the number of home goals from all the matches
```{r}
HG.sum <-table(raw$HG)
HG.sum
HG.goals <-as.numeric(names(HG.sum))
```
The density and CDF distribution are:
```{r}
plotdist(raw$HG, histo = TRUE, discrete=TRUE)
```
We now fit the poisson distribution to this data. The only parameter of the
poisson is the mean, lambda:
```{r}
HG.pois <- fitdist(raw$HG,"pois")
HG.pois
```
Likewise, we fit the negative binomial, using the alternative parametrization
with size (target for number of successful trials) and mean (mu):
```{r}
HG.nbinom <- fitdist(raw$HG,"nbinom")
HG.nbinom
```
Finally, we fit the ZIP distribution using the functions provided in the `countreg`
package. Since this is not one of the starndard distributions provided in the
`fitdistrplus` package, we need to provide starting values for their parameters
$\pi$ and $\lambda$. These are momentum estimators based on the mean and variance
of the data
```{r}
m <- mean(raw$HG)
s <- var(raw$HG)
lambdaI <- (s + m^2)/m - 1
lambdaI
piI <- (s-m)/(s+m^2-m)
piI
```
Provided with these initial values we continue with the `fitdistr`function as
before:
```{r}
HG.zip <-fitdist(raw$HG,"zipois",start=list(lambda=lambdaI,pstr0=piI))
HG.zip
```
The maximum likelihood estimators returned by the `fitdistr` are very similar
to the previous ones. Also, as seen shortly, provide a good fit of the data so
we can safely ignore the warning.
We now compare graphically the fit of these three distributions:
```{r}
par(mfrow=c(2,2))
denscomp(list(HG.pois, HG.nbinom, HG.zip), fittype = "o" ,legendtext = plot.legend,main="HG: Histogram
and theoretical densities")
cdfcomp(list(HG.pois, HG.nbinom,HG.zip), legendtext = plot.legend,main="HG: Empirical
and theoretical CDFs")
ppcomp(list(HG.pois, HG.nbinom,HG.zip), legendtext = plot.legend,main="HG: P-P plot")
qqcomp(list(HG.pois, HG.nbinom,HG.zip), legendtext = plot.legend,main="HG: Q-Q plot")
```
All the distribution seem very close to the data. However, a zoom in the first
graph shows an improvement with the ZIP adjusted model specially for the values
of zero and one:
```{r}
denscomp(list(HG.pois, HG.nbinom,HG.zip), fittype = "o" ,legendtext = plot.legend,main="HG: Histogram
and theoretical densities")
```
More formally, we can run goodness of fit statistics to evaluate which one has
the best fit:
```{r}
gofstat(list(HG.pois, HG.nbinom,HG.zip),fitnames=c("Poisson","Negative binomial","ZIP"))
```
These results suggest that both the negative binomial and the ZIP distribution
are good models for the data.
We now repeat this analysis for the two other variables in the data set, away team
goals and total goals.
## Away goals (AG)
`AG.sum` is table that summarises the number of away goals from all the matches
```{r}
AG.sum <-table(raw$AG)
AG.sum
AG.goals <-as.numeric(names(AG.sum))
```
The density and CDF distribution are:
```{r}
plotdist(raw$AG, histo = TRUE, discrete=TRUE)
```
We now fit the poisson distribution to this data. The only parameter of the
poisson is the mean, lambda:
```{r}
AG.pois <- fitdist(raw$AG,"pois")
AG.pois
```
Likewise, we fit the negative binomial, using the alternative parametrization
with size (target for number of successful trials) and mean (mu):
```{r}
AG.nbinom <- fitdist(raw$AG,"nbinom")
AG.nbinom
```
Finally, we fit the ZIP distribution. But before, we need to calculate initial
values for the parameters:
```{r}
m <- mean(raw$AG)
s <- var(raw$AG)
lambdaI <- (s + m^2)/m - 1
lambdaI
piI <- (s-m)/(s+m^2-m)
piI
```
```{r}
AG.zip <-fitdist(raw$AG,"zipois",start=list(lambda=lambdaI,pstr0=piI))
AG.zip
```
As before, only a slight modification with respect to previous values. We can
safely ignore the warnings produced.
We now compare graphically the fit of these distributions:
```{r}
par(mfrow=c(2,2))
denscomp(list(AG.pois, AG.nbinom,AG.zip), fittype = "o" ,legendtext = plot.legend,main="AG: Histogram
and theoretical densities")
cdfcomp(list(AG.pois, AG.nbinom,AG.zip), legendtext = plot.legend,main="AG: Empirical
and theoretical CDFs")
ppcomp(list(AG.pois, AG.nbinom,AG.zip), legendtext = plot.legend,main="AG: P-P plot")
qqcomp(list(AG.pois, AG.nbinom,AG.zip), legendtext = plot.legend,main="AG: Q-Q plot")
```
Let's zoom in the first graph to see better:
```{r}
denscomp(list(AG.pois, AG.nbinom,AG.zip), fittype = "o" ,legendtext = plot.legend,main="AG: Histogram
and theoretical densities")
```
More formally, we can run goodness of fit statistics to evaluate which one has
the best fit:
```{r}
gofstat(list(AG.pois, AG.nbinom,AG.zip),fitnames=c("Poisson","Negative binomial","ZIP"))
```
This time the ZIP distribution seem to be the best model for the data.
## Total goals (TOT)
`TOT.sum` is table that summarises the number of away goals from all the matches
```{r}
TOT.sum <-table(raw$TOT)
TOT.sum
TOT.goals <-as.numeric(names(TOT.sum))
```
The density and CDF distribution are:
```{r}
plotdist(raw$TOT, histo = TRUE, discrete=TRUE)
```
We now fit the poisson distribution to this data. The only parameter of the
poisson is the mean, lambda:
```{r}
TOT.pois <- fitdist(raw$TOT,"pois")
TOT.pois
```
Likewise, we fit the negative binomial, using the alternative parametrization
with size (target for number of successful trials) and mean (mu):
```{r}
TOT.nbinom <- fitdist(raw$TOT,"nbinom")
TOT.nbinom
```
The size parameter is very large, specially as compared to the values obtained
for the home and away goal distribution.
The density plot for total goals does not suggest that the ZIP distribution can
be a good model for this data. In fact, when we try to find initial values for
the parameters this is what we get:
```{r}
m <- mean(raw$TOT)
s <- var(raw$TOT)
lambdaI <- (s + m^2)/m - 1
lambdaI
piI <- (s-m)/(s+m^2-m)
piI
```
$\pi$ cannot be a negative number. Therefore, for the total goal distribution
we only fit Poisson and negative binomial models.
We now compare graphically the fit of these two distributions:
```{r}
par(mfrow=c(2,2))
denscomp(list(TOT.pois, TOT.nbinom), fittype = "o" ,legendtext = plot.legend2,main="TOT: Histogram
and theoretical densities")
cdfcomp(list(TOT.pois, TOT.nbinom), legendtext = plot.legend2,main="TOT: Empirical
and theoretical CDFs")
ppcomp(list(TOT.pois, TOT.nbinom), legendtext = plot.legend2,main="TOT: P-P plot")
qqcomp(list(TOT.pois, TOT.nbinom), legendtext = plot.legend2,main="TOT: Q-Q plot")
```
Let's zoom in the first graph to see better:
```{r}
denscomp(list(TOT.pois, TOT.nbinom), fittype = "o" ,legendtext = plot.legend2,main="TOT: Histogram
and theoretical densities")
```
More formally, we can run goodness of fit statistics to evaluate which one has
the best fit:
```{r}
gofstat(list(TOT.pois, TOT.nbinom),fitnames=c("Poisson","Negative binomial"))
```
This shows an almost technical tie between the two. Based on the
chi-square test, the distribution of best fit would be the negative binomial.
However, the rather large size value might be an indication that perhaps the
Poisson distribution is a better model for this variable.
## Expected goals for the Poisson distribution
This is the table for expected goals for the Poisson distribution using the data
on `r n` games and the parameters estimated in the previous sections.
```{r}
max.goals <- max(HG.goals,AG.goals,TOT.goals)
# pois.dist has the expected goal count for poisson
pois.dist <- data.frame(goals=0:max.goals,HG=numeric(max.goals+1),
AG=numeric(max.goals+1),TOT = numeric(max.goals+1))
for (i in 0:(max.goals)) {
j <- i+1
pois.dist$HG[j] <- dpois(pois.dist$goals[j],HG.pois$estimate["lambda"])*n
pois.dist$AG[j] <- dpois(pois.dist$goals[j],AG.pois$estimate["lambda"])*n
pois.dist$TOT[j] <- dpois(pois.dist$goals[j],TOT.pois$estimate["lambda"])*n
}
pois.dist
```
## Expected goals for the negative binomial distribution
This is the table for expected goals for the negative binomial distribution using
the data on `r n` games and the parameters estimated in the previous sections.
```{r}
nbinom.dist <- data.frame(goals=0:max.goals,HG=numeric(max.goals+1),
AG=numeric(max.goals+1),TOT = numeric(max.goals+1))
for (i in 0:max.goals) {
j <- i+1
nbinom.dist$HG[j] <- dnbinom(nbinom.dist$goals[j],size = HG.nbinom$estimate["size"],
mu=HG.nbinom$estimate["mu"])*n
nbinom.dist$AG[j] <- dnbinom(nbinom.dist$goals[j],size = AG.nbinom$estimate["size"],
mu=AG.nbinom$estimate["mu"])*n
nbinom.dist$TOT[j] <- dnbinom(nbinom.dist$goals[j],size = TOT.nbinom$estimate["size"],
mu=TOT.nbinom$estimate["mu"])*n
}
nbinom.dist
```
## Expected goals for the ZIP distribution
This is the table for expected goals for the ZIP distribution using the data
on `r n` games and the parameters estimated in the previous sections.
```{r}
zip.dist <- data.frame(goals=0:max.goals,HG=numeric(max.goals+1),
AG=numeric(max.goals+1))
for (i in 0:max.goals) {
j <- i+1
zip.dist$HG[j] <- dzipois(zip.dist$goals[j],lambda = HG.zip$estimate["lambda"],
pstr0=HG.zip$estimate["pstr0"])*n
zip.dist$AG[j] <- dzipois(zip.dist$goals[j],lambda = AG.zip$estimate["lambda"],
pstr0=AG.zip$estimate["pstr0"])*n
}
zip.dist
```
## By way of conclusion
The following table summarizes the fit of the given models:
```{r}
lambdas<-round(c(HG.pois$estimate["lambda"],AG.pois$estimate["lambda"],TOT.pois$estimate["lambda"]),2)
mus <- round(c(HG.nbinom$estimate["mu"],AG.nbinom$estimate["mu"],TOT.nbinom$estimate["mu"]),2)
sizes <-round(c(HG.nbinom$estimate["size"],AG.nbinom$estimate["size"],TOT.nbinom$estimate["size"]),2)
pis <- round(c(HG.zip$estimate["pstr0"],AG.zip$estimate["pstr0"],NA),2)
Zlambdas<-round(c(HG.zip$estimate["lambda"],AG.zip$estimate["lambda"],NA),2)
best<-c("Negative binomial or ZIP","ZIP","Poisson")
sum.tab <- data.frame(lambda=lambdas,mu=mus,size=sizes,pi=pis,zlambda=Zlambdas,best=best)
row.names(sum.tab)<-c("Home","Away","Total")
kable(sum.tab,col.names=c("$\\lambda$", "$\\mu$", "n", "$\\pi$","$\\lambda$","Best model"),
booktabs=T,escape=T) %>%
kable_styling() %>%
add_header_above(c(" " = 1, "Poisson" = 1, "Negative binomial" = 2, "ZIP" = 2," "=1),align="c")
```