forked from GDSL-UL/san
-
Notifications
You must be signed in to change notification settings - Fork 0
/
08-multilevel-02.qmd
267 lines (183 loc) · 14.9 KB
/
08-multilevel-02.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
# Multilevel Modelling - Part 2 {#sec-chp8}
This chapter explains varying slopes and draws on the following references:
The content of this chapter is based on:
- @gelman2006data provides an excellent and intuitive explanation of multilevel modelling and data analysis in general. Read Part 2A for a really good explanation of multilevel models.
- @bristol2020 is an useful online resource on multilevel modelling and is free!
## Dependencies
This chapter uses the following libraries which are listed in the @sec-dependencies in Chapter 1:
```{r, message = FALSE}
# Data manipulation, transformation and visualisation
library(tidyverse)
# Nice tables
library(kableExtra)
# Simple features (a standardised way to encode vector data ie. points, lines, polygons)
library(sf)
# Spatial objects conversion
library(sp)
# Thematic maps
library(tmap)
# Colour palettes
library(viridis)
# Fitting multilevel models
library(lme4)
# Tools for extracting information generated by lme4
library(merTools)
# Exportable regression tables
library(jtools)
```
## Data
For this chapter, we will data for Liverpool from England's 2011 Census. The original source is the [Office of National Statistics](https://www.nomisweb.co.uk/home/census2001.asp) and the dataset comprises a number of selected variables capturing demographic, health and socio-economic of the local resident population at four geographic levels: Output Area (OA), Lower Super Output Area (LSOA), Middle Super Output Area (MSOA) and Local Authority District (LAD). The variables include population counts and percentages. For a description of the variables, see the readme file in the mlm data folder.[^08-multilevel-02-1]
[^08-multilevel-02-1]: Read the file in R by executing `read_tsv("data/mlm/readme.txt")` . Ensure the library `readr` is installed before running `read_tsv`.
Let us read the data:
```{r, results = 'hide'}
# clean workspace
rm(list=ls())
# read data
oa_shp <- st_read("data/mlm/OA.shp")
```
## Conceptual Overview
So far, we have estimated varying-intercept models; that is, when the intercept ($\beta_{0}$) is allowed to vary by group (eg. geographical area) - as shown in Fig. 1(a). The strength of the relationship between $y$ (i.e. unemployment rate) and $x$ (long-term illness) has been assumed to be the same across groups (i.e. MSOAs), as captured by the regression slope ($\beta_{1}$). Yet it can also vary by group as shown in Fig. 1(b), or we can observe group variability for both intercepts and slopes as represented in Fig. 1(c).
![Fig. 1. Linear regression model with (a) varying intercepts, (b) varying slopes, and (c) both. Source: @gelman2006data p.238.](figs/ch6/fig11.1_Gelman_Hill.png)
### Exploratory Analysis: Varying Slopes
Let's then explore if there is variation in the relationship between unemployment rate and the share of population in long-term illness. We do this by selecting the 8 MSOAs containing OAs with the highest unemployment rates in Liverpool.
```{r}
# Sort data
oa_shp <- oa_shp %>% arrange(-unemp)
oa_shp[1:9, c("msoa_cd", "unemp")]
# Select MSOAs
s_t8 <- oa_shp %>% dplyr::filter(
as.character(msoa_cd) %in% c(
"E02001354",
"E02001369",
"E02001366",
"E02001365",
"E02001370",
"E02001390",
"E02001368",
"E02001385")
)
```
And then we generate a set of scatter plots and draw regression lines for each MSOA.
```{r}
ggplot(s_t8, aes(x = lt_ill, y = unemp)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ msoa_cd, nrow = 2) +
ylab("Unemployment rate") +
xlab("Long-term Illness (%)") +
theme_classic()
```
We can observe great variability in the relationship between unemployment rates and the percentage of population in long-term illness. A strong and positive relationship exists in MSOA `E02001366` (Tuebrook and Stoneycroft), while it is negative in MSOA `E02001370` (Everton) and neutral in MSOA `E02001390` (Princes Park & Riverside). This visual inspection suggests that accounting for differences in the way unmployment rates relate to long-term illness is important. Contextual factors may differ across MSOAs in systematic ways.
## Estimating Varying Intercept and Slopes Models
A way to capture for these group differences in the relationship between unemployment rates and long-term illness is to allow the relevant slope to vary by group (i.e. MSOA). We can do this estimating the following model:
OA-level:
$$y_{ij} = \beta_{0j} + \beta_{1j}x_{ij} + e_{ij}$$
MSOA-level:
$$\beta_{0j} = \beta_{0} + u_{0j}$$ $$\beta_{1j} = \beta_{1} + u_{1j} $$ Replacing the first equation into the second generates:
$$y_{ij} = (\beta_{0} + u_{0j}) + (\beta_{1} + u_{1j})x_{ij} + e_{ij}$$ where, as in the previous Chapter, $y$ the proportion of unemployed population in OA $i$ within MSOA $j$; $\beta_{0}$ is the fixed intercept (averaging over all MSOAs); $u_{0j}$ represents the MSOA-level residuals, or *random effects*, of the intercept; $e_{ij}$ is the individual-level residuals; and, $x_{ij}$ represents the percentage of long-term illness population. *But* now we have a varying slope represented by $\beta_{1}$ and $u_{1j}$: $\beta_{1}$ is estimated average slope - fixed part of the model; and, $u_{1j}$ is the estimated group-level errors of the slope.
To estimate such model, we add `lt_ill` in the bracket with a `+` sign between `1` and `|` i.e. `(1 + lt_ill | msoa_cd)`.
```{r}
#| warning: false
# attach df
attach(oa_shp)
# change to proportion
oa_shp$lt_ill <- lt_ill/100
# specify a model equation
eq6 <- unemp ~ lt_ill + (1 + lt_ill | msoa_cd)
model6 <- lmer(eq6, data = oa_shp)
# estimates
summary(model6)
```
In this model, the estimated standard deviation of the unexplained within-MSOA variation is 0.04974, and the estimated standard deviation of the MSOA intercepts is 0.05855. But, additionally, we also have estimates of standard deviation of the MSOA slopes (0.17154) and correlation between MSOA-level residuals for the intercept and slope (-0.73). While the former measures the extent of average deviation in the slopes across MSOAs, the latter indicates that the intercept and slope MSOA-level residuals are negatively associated; that is, MSOAs with large slopes have relatively smaller intercepts and *vice versa*. We will come back to this in Section [Interpreting Correlations Between Group-level Intercepts and Slopes].
Similarly, the correlation of fixed effects indicates a negative relationship between the intercept and slope of the average regression model; that is, as the average model intercept tends to increase, the average strength of the relationship between unemployment rate and long-term illness decreases and *vice versa*.
We then explore the estimated average coefficients (*fixed effects*):
```{r}
fixef(model6)
```
yields an estimated regression line in an average LSOA: $y = 0.04764998 + 0.30125916x$. The fixed intercept indicates that the average unemployment rate is 0.05 if the percentage of population with long-term illness is zero.The fixed slope indicates that the average relationship between unemployment rate and long-term illness is positive across MSOAs i.e. as the percentage of population with long-term illness increases by 1 percentage point, the unemployment rate increases by 0.3.
We look the estimated MSOA-level errors (*random effects*):
```{r}
ranef_m6 <- ranef(model6)
head(ranef_m6$msoa_cd, 5)
```
Recall these estimates indicate the extent of deviation of the MSOA-specific intercept and slope from the estimated model average captured by the fixed model component.
We can also regain the estimated intercept and slope for each county by adding the estimated MSOA-level errors to the estimated average coefficients; or by executing:
```{r}
#coef(model6)
```
We are normally more interested in identifying the extent of deviation and its significance. To this end, we create a caterpillar plot:
```{r}
# plot
plotREsim(REsim(model6))
```
These plots reveal some interesting patterns. First, only one MSOA, containing wards such as Tuebrook and Stoneycroft, Anfield & Everton, seems to have a statistically significantly different intercept, or average unemployment rate. Confidence intervals overlap zero for all other 60 MSOAs. Despite this, note that when a slope is allowed to vary by group, it generally makes sense for the intercept to also vary. Second, significant variability exists in the association between unemployment rate and long-term illness across MSOAs. Ten MSOAs display a significant positive association, while 12 exhibit a significantly negative relationship. Third, these results reveal that geographical differences in the relationship between unemployment rate and long-term illness can explain the significant differences in average unemployment rates in the varying intercept only model.
Let's try to get a better understanding of the varying relationship between unemployment rate and long-term illness by mapping the relevant MSOA-level errors.
```{r}
#| warning: false
# read data
msoa_shp <- st_read("data/mlm/MSOA.shp")
# create a dataframe for MSOA-level random effects
re_msoa_m6 <- REsim(model6) %>% filter(groupFctr == "msoa_cd") %>%
filter(term == "lt_ill")
str(re_msoa_m6)
# merge data
msoa_shp <- merge(x = msoa_shp, y = re_msoa_m6, by.x = "MSOA_CD", by.y = "groupID")
```
```{r}
# ensure geometry is valid
msoa_shp = sf::st_make_valid(msoa_shp)
# create a map
legend_title = expression("MSOA-level residuals")
map_msoa = tm_shape(msoa_shp) +
tm_fill(col = "median", title = legend_title, palette = magma(256, begin = 0, end = 1), style = "cont") +
tm_borders(col = "white", lwd = .01) +
tm_compass(type = "arrow", position = c("right", "top") , size = 4) +
tm_scale_bar(breaks = c(0,1,2), text.size = 0.5, position = c("center", "bottom"))
map_msoa
```
The map indicates that the relationship between unemployment rate and long-term illness is tends to stronger and positive in northern MSOAs; that is, the percentage of population with long-term illness explains a greater share of the variation in unemployment rates in these locations. As expected, a greater share of population in long-term illness is associated with higher local unemployment. In contrast, the relationship between unemployment rate and long-term illness tends to operate in the reverse direction in north-east and middle-southern MSOAs. In these MSOAs, OAs tend to have a higher unemployment rate relative the share of population in long-term illness. You can confirm this examining the data for specific MSOA executing:
```{r}
oa_shp %>% dplyr::select(msoa_cd, ward_nm, unemp, lt_ill) %>%
filter(as.character(msoa_cd) == "E02001370")
```
Now try adding a group-level predictor and an individual-level predictor to the model. Unsure, look at @sec-indlevel and @sec-grouplevel in @sec-chp7.
## Interpreting Correlations Between Group-level Intercepts and Slopes
Correlations of random effects are confusing to interpret. Key for their appropriate interpretation is to recall they refer to group-level residuals i.e. deviation of intercepts and slopes from the average model intercept and slope. A strong *negative* correlation indicates that groups with high intercepts have relatively low slopes, and *vice versa*. A strong *positive* correlation indicates that groups with high intercepts have relatively high slopes, and *vice versa*. A correlation close to *zero* indicate little or no systematic between intercepts and slopes. Note that a high correlation between intercepts and slopes is not a problem, but it makes the interpretation of the estimated intercepts more challenging. For this reason, a suggestion is to center predictors ($x's$); that is, substract their average value ($z = x - \bar{x}$). For a more detailed discussion, see @bristol2020.
To illustrate this, let's reestimate our model adding an individual-level predictor: the share of population with no educational qualification.
```{r}
# centering to the mean
oa_shp$z_no_qual <- no_qual/100 - mean(no_qual/100)
oa_shp$z_lt_ill <- lt_ill - mean(lt_ill)
# specify a model equation
eq7 <- unemp ~ z_lt_ill + z_no_qual + (1 + z_lt_ill | msoa_cd)
model7 <- lmer(eq7, data = oa_shp)
# estimates
summary(model7)
```
How do you interpret the random effect correlation?
## Model building
Now we know how to estimate multilevel regression models in *R*. The question that remains is: *When does multilevel modeling make a difference?* The short answer is: when there is little group-level variation. When there is very little group-level variation, the multilevel modelling reduces to classical linear regression estimates *with no group indicators*. Inversely, when group-level coefficients vary greatly (compared to their standard errors of estimation), multilevel modelling reduces to classical regression *with group indicators* @gelman2006data.
*How do you go about building a model?*
We generally start simple by fitting simple linear regressions and then work our way up to a full multilevel model - see @gelman2006data p. 270.
*How many groups are needed?*
As an absolute minimum, more than two groups are required. With only one or two groups, a multilevel model reduces to a linear regression model.
*How many observations per group?*
Two observations per group is sufficient to fit a multilevel model.
### Model Comparison
*How we assess different candidate models?* We can use the function `anova()` and assess various statistics: The Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), Loglik and Deviance. Generally, we look for lower scores for all these indicators. We can also refer to the *Chisq* statistic below. It tests the hypothesis of whether additional predictors improve model fit. Particularly it tests the *Null Hypothesis* whether the coefficients of the additional predictors equal 0. It does so comparing the deviance statistic and determining if changes in the deviance are statistically significant. Note that a major limitation of the deviance test is that it is for nested models i.e. a model being compared must be nested in the other. Below we compare our two models. The results indicate that adding an individual-level predictor (i.e. the share of population with no qualification) provides a model with better.
```{r}
anova(model6, model7)
```
## Questions
We will continue to use the COVID-19 dataset. Please see @sec-chp11 for details on the data.
```{r}
sdf <- st_read("data/assignment_2_covid/covid19_eng.gpkg")
```
Using these data, you are required to address the following challenges:
1. Fit a varying-slope model. Let one slope to vary by region. Think carefully your choice.
2. Fit a varying-intercept and varying-slope model.
3. Compare the results for models fitted in 1 and 2. Which is better? Why?
Use the same explanatory variables used for the @sec-chp7 challenge, so you can compare the model results from this chapter.
Analyse and discuss:
1. the varying slope estimate(s) from your model(s) (to what extent does the relationship between your dependent and independent variables vary across groups / areas? are they statistically significantly different?).
2. differences between your varying intercept and varying slope models.