---
title: "Presidential Election Predictions 2016"
author: "Jo Hardin"
date: "August 16, 2016"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r message=FALSE, echo=FALSE, warning=FALSE}
require(XML)
require(dplyr)
require(tidyr)
require(readr)
require(mosaic)
require(RCurl)
require(ggplot2)
require(lubridate)
require(RJSONIO)
```
## ASA's Prediction Competition
In this election year, the American Statistical Association (ASA) has put together a competition for students to predict the exact percentages for the winner of the 2016 presidential election. They are offering cash prizes for the entry that gets closest to the national vote percentage and that best predicts the winners for each state and the District of Columbia. For more details see:
http://thisisstatistics.org/electionprediction2016/
To get you started, I've written an analysis of data scraped from fivethirtyeight.com. The analysis uses weighted means and a formula for the standard error (SE) of a weighted mean. For your analysis, you might consider a similar analysis on the state data (what assumptions would you make for a new weight function?). Or you might try some kind of model, either a generalized linear model or a Bayesian analysis with an informed prior. The world is your oyster!
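As one illustration of the Bayesian route, a conjugate Beta-Binomial update of a single candidate's share of the two-party vote could look like the sketch below. Every number in it (the prior, the poll size, the poll result) is invented purely for illustration and is not part of the scraped data, and the chunk is not evaluated here.
```{r eval=FALSE}
# Hypothetical sketch of a Beta-Binomial update for one candidate's share of
# the two-party vote.  All numbers here are invented for illustration.
prior_alpha <- 52; prior_beta <- 48      # informed prior: roughly 52%, modest certainty
poll_n <- 800                            # sample size of a single made-up poll
poll_yes <- 420                          # respondents supporting the candidate
post_alpha <- prior_alpha + poll_yes
post_beta  <- prior_beta + poll_n - poll_yes
# posterior mean and a 95% credible interval for the vote share
post_alpha / (post_alpha + post_beta)
qbeta(c(0.025, 0.975), post_alpha, post_beta)
```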
## Getting the Data
Thanks to the Internet, there is a lot of polling data which is publicly accessible. For the competition, you are welcome to get your data from anywhere. However, I'm going to take mine from 538. http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/ (Other good sources of data are http://www.realclearpolitics.com/epolls/latest_polls/ and http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton and http://www.gallup.com/products/170987/gallup-analytics.aspx)
Note the date indicated above: that is when this R Markdown file was written and when the data were scraped from 538. If you run the Markdown file on a different day, you are likely to get different results because the polls are constantly being updated.
Because the original data were scraped as a JSON file, it gets pulled into R as a list of lists. The data wrangling used to convert it into a tidy format is available from the source code at https://github.com/hardin47/prediction2016/blob/master/predblog.Rmd.
```{r}
# Scrape the 538 national polls page and pull the embedded JSON out of the
# <script> tag that defines the race model.
url <- "http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/"
doc <- htmlParse(url, useInternalNodes = TRUE)
sc <- xpathSApply(doc, "//script[contains(., 'race.model')]",
                  function(x) c(xmlValue(x), xmlAttrs(x)[["href"]]))
jsobj <- gsub(".*race.stateData = (.*);race.pathPrefix.*", "\\1", sc)
data <- fromJSON(jsobj)
allpolls <- data$polls
# unlist the whole thing, padding shorter entries so rows line up
indx <- sapply(allpolls, length)
pollsdf <- as.data.frame(do.call(rbind, lapply(allpolls, `length<-`, max(indx))))
```
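If you are curious what the nested list looks like before it is flattened, a quick peek at one element is useful. This is only an inspection sketch (not evaluated here); the exact fields depend on what 538 is serving on the day you scrape.
```{r eval=FALSE}
# Peek at one element of the nested list before flattening it
str(allpolls[[1]], max.level = 1)
names(allpolls[[1]])
```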
```{r echo=FALSE}
#unlisting the weights
pollswt <- as.data.frame(t(as.data.frame(do.call(cbind, lapply(pollsdf$weight, data.frame,
stringsAsFactors=FALSE)))))
names(pollswt) <- c("wtpolls", "wtplus", "wtnow")
row.names(pollswt) <- NULL
pollsdf <- cbind(pollsdf, pollswt)
#unlisting the voting
indxv <- sapply(pollsdf$votingAnswers, length)
pollsvot <- as.data.frame(do.call(rbind, lapply(pollsdf$votingAnswers,
`length<-`, max(indxv))))
pollsvot1 <- rbind(as.data.frame(do.call(rbind, lapply(pollsvot$V1, data.frame,
stringsAsFactors=FALSE))))
pollsvot2 <- rbind(as.data.frame(do.call(rbind, lapply(pollsvot$V2, data.frame,
stringsAsFactors=FALSE))))
pollsvot1 <- cbind(polltype = rownames(pollsvot1), pollsvot1,
polltypeA = gsub('[0-9]+', '', rownames(pollsvot1)),
polltype1 = extract_numeric(rownames(pollsvot1)))
pollsvot1$polltype1 <- ifelse(is.na(pollsvot1$polltype1), 1, pollsvot1$polltype1 + 1)
pollsvot2 <- cbind(polltype = rownames(pollsvot2), pollsvot2,
polltypeA = gsub('[0-9]+', '', rownames(pollsvot2)),
polltype1 = extract_numeric(rownames(pollsvot2)))
pollsvot2$polltype1 <- ifelse(is.na(pollsvot2$polltype1), 1, pollsvot2$polltype1 + 1)
pollsdf <- pollsdf %>%
mutate(population = unlist(population),
sampleSize = as.numeric(unlist(sampleSize)),
pollster = unlist(pollster),
startDate = ymd(unlist(startDate)),
endDate = ymd(unlist(endDate)),
pollsterRating = unlist(pollsterRating)) %>%
select(population, sampleSize, pollster, startDate, endDate, pollsterRating,
wtpolls, wtplus, wtnow)
allpolldata <- cbind(rbind(pollsdf[rep(seq_len(nrow(pollsdf)), each=3),],
pollsdf[rep(seq_len(nrow(pollsdf)), each=3),]),
rbind(pollsvot1, pollsvot2))
allpolldata <- allpolldata %>%
arrange(polltype1, choice)
```
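A quick look at the wrangled data frame is a useful sanity check before plotting. This is only an inspection sketch (not evaluated here); the column names come from the wrangling above.
```{r eval=FALSE}
# A few quick checks on the wrangled data frame
dim(allpolldata)
head(allpolldata %>% select(pollster, endDate, polltypeA, choice, pct, wtnow))
```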
## A Quick Visualization
Before coming up with a prediction for the vote percentages in the 2016 US Presidential Race, it is worth looking at the data. The data are in a tidy form, so ggplot2 is the right tool for visualizing them.
```{r}
ggplot(subset(allpolldata, ((polltypeA == "now") & (endDate > ymd("2016-08-01")))),
aes(y=adj_pct, x=endDate, color=choice)) +
geom_line() + geom_point(aes(size=wtnow)) +
labs(title = "Vote percentage by date and poll weight\n",
y = "Percent Vote if Election Today", x = "Poll Date",
color = "Candidate", size="538 Poll\nWeight")
```
## A Quick Analysis
Let's try to think about the percentage of votes that each candidate will get based on the *now cast* polling percentages. We'd like to weight the votes based on what 538 thinks (hey, they've been doing this longer than I have!), the sample size, and the number of days since the poll closed.
$$ w = \frac{w_{538}}{\text{days since poll}} \sqrt{\text{sample size}}$$
Using my weight, I'll calculate a weighted average and a weighted SE for the predicted percent of votes. (The formula for the variance of a weighted mean is taken from Cochran (1977) and cited in Gatz and Smith (1995).) The weights can be used to calculate the average or the running average for the *now cast* polling percentages.
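To make the weight concrete, here is what it works out to for a single poll, using made-up inputs (a 538 weight of 1.2, 900 respondents, a poll that closed 3 days ago); the chunk is not evaluated here.
```{r eval=FALSE}
# The weight for a single poll, with made-up inputs purely for illustration
wt538 <- 1.2        # hypothetical 538 "now" weight
sampleSize <- 900   # hypothetical number of respondents
dayssince <- 3      # hypothetical days since the poll closed
w <- wt538 * sqrt(sampleSize) / dayssince
w  # 1.2 * 30 / 3 = 12
```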
```{r echo=FALSE}
# code found at http://stats.stackexchange.com/questions/25895/computing-standard-error-in-weighted-mean-estimation
# cited from http://www.cs.tufts.edu/~nr/cs257/archive/donald-gatz/weighted-standard-error.pdf
# Donald F. Gatz and Luther Smith, "THE STANDARD ERROR OF A WEIGHTED MEAN CONCENTRATION-I. BOOTSTRAPPING VS OTHER METHODS"
# Computes the variance of a weighted mean following the Cochran (1977) definition
weighted.var.se <- function(x, w, na.rm = FALSE) {
  if (na.rm) { w <- w[i <- !is.na(x)]; x <- x[i] }
  n <- length(w)
  xWbar <- weighted.mean(x, w, na.rm = na.rm)
  wbar <- mean(w)
  out <- n / ((n - 1) * sum(w)^2) *
    (sum((w * x - wbar * xWbar)^2) -
       2 * xWbar * sum((w - wbar) * (w * x - wbar * xWbar)) +
       xWbar^2 * sum((w - wbar)^2))
  return(out)
}
```
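As a sanity check on the function (not run as part of the analysis): when all of the weights are equal, the Cochran formula should reduce to the usual variance of a sample mean, var(x)/n.
```{r eval=FALSE}
# Sanity check: with equal weights, the Cochran formula reduces to var(x)/n
set.seed(47)
x <- rnorm(25, mean = 45, sd = 3)
weighted.var.se(x, rep(1, 25))
var(x) / 25
```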
```{r}
allpolldata2 <- allpolldata %>%
filter(wtnow > 0) %>%
filter(polltypeA == "now") %>%
mutate(dayssince = as.numeric(today() - endDate)) %>%
mutate(wt = wtnow * sqrt(sampleSize) / dayssince) %>%
mutate(votewt = wt*pct) %>%
group_by(choice) %>%
arrange(choice, -dayssince) %>%
mutate(cum.mean.wt = cumsum(votewt) / cumsum(wt)) %>%
mutate(cum.mean = cummean(pct))
```
### Plotting the Cumulative Mean / Weighted Mean
In tidy format, the data are ready to plot. Note that the cumulative mean is much smoother than the cumulative weighted mean because the weights are much heavier toward the later polls.
```{r}
ggplot(subset(allpolldata2, ( endDate > ymd("2016-01-01"))),
aes(y=cum.mean, x=endDate, color=choice)) +
geom_line() + geom_point(aes(size=wt)) +
labs(title = "Cumulative Mean Vote Percentage\n",
y = "Cumulative Percent Vote if Election Today", x = "Poll Date",
color = "Candidate", size="Calculated Weight")
ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))),
aes(y=cum.mean.wt, x=endDate, color=choice)) +
geom_line() + geom_point(aes(size=wt)) +
labs(title = "Cumulative Weighted Mean Vote Percentage\n",
y = "Cumulative Weighted Percent Vote if Election Today", x = "Poll Date",
color = "Candidate", size="Calculated Weight")
```
Additionally, the weighted average and the SE of the average (given by Cochran (1977)) can be computed for each candidate. Using the formula, we have our prediction of the final percentage of the popular vote for each major candidate!
```{r}
pollsummary <- allpolldata2 %>%
select(choice, pct, wt, votewt, sampleSize, dayssince) %>%
group_by(choice) %>%
summarise(mean.vote = weighted.mean(pct, wt, na.rm=TRUE),
std.vote = sqrt(weighted.var.se(pct, wt, na.rm=TRUE)))
pollsummary
```
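For a rough sense of the uncertainty around these point predictions, approximate mean ± 2 SE intervals are easy to add. This is only a sketch (not evaluated here): it treats the polls as independent and the weighted mean as approximately normal, neither of which is quite true.
```{r eval=FALSE}
# Rough +/- 2 SE intervals around the weighted means (a crude approximation)
pollsummary %>%
  mutate(lower = mean.vote - 2 * std.vote,
         upper = mean.vote + 2 * std.vote)
```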
## Other people's advice
> Prediction is very difficult, especially about the future. - Niels Bohr
Along with good data sources, you should also be able to find information about prediction and modeling. I've provided a few resources to get you started.
* Andrew Gelman: http://andrewgelman.com/2016/08/17/29654/
* Sam Wang: http://election.princeton.edu/2016/08/21/sharpening-the-forecast/
* Fivethirtyeight: http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/
* Christensen and Florence: http://www.amstat.org/misc/tasarticle.pdf and https://tofu.byu.edu/electionpollproject/