update to side notes
Francisco Rowe committed Apr 19, 2024
1 parent b228b53 commit cb57873
Showing 29 changed files with 728 additions and 663 deletions.
2 changes: 1 addition & 1 deletion 06-spatial-econometrics.qmd
@@ -187,7 +187,7 @@ As we have just discussed, SH is about effects of phenomena that are *explicitly

**Spatial Weights**

There are several ways to introduce spatial dependence in an econometric framework, with varying degrees of econometric sophistication [see @anselin2003spatial for a good overview]. Common to all of them, however, is the way space is formally encapsulated: through *spatial weights matrices (*$W$)[^06-spatial-econometrics-2]. These are $N \times N$ matrices with zero diagonals in which every cell $w_{ij}$ holds a value that represents the degree of spatial connectivity/interaction between observations $i$ and $j$. If they are not connected at all, $w_{ij}=0$; otherwise $w_{ij}>0$ and we call $i$ and $j$ neighbors. The exact value in the latter case depends on the criterion we use to define neighborhood relations. These matrices also tend to be row-standardized so that each row sums to one.
There are several ways to introduce spatial dependence in an econometric framework, with varying degrees of econometric sophistication [see @anselin2003spatial for a good overview]. Common to all of them, however, is the way space is formally encapsulated: through *spatial weights matrices (*$W$)[^06-spatial-econometrics-2]. These are $N \times N$ matrices with zero diagonals in which every cell $w_{ij}$ holds a value that represents the degree of spatial connectivity/interaction between observations $i$ and $j$. If they are not connected at all, $w_{ij}=0$; otherwise $w_{ij}>0$ and we call $i$ and $j$ neighbors. The exact value in the latter case depends on the criterion we use to define neighborhood relations. These matrices also tend to be row-standardized so that each row sums to one.

[^06-spatial-econometrics-2]: If you need to refresh your knowledge on spatial weight matrices, [Block E](https://darribas.org/gds_course/content/bE/concepts_E.html) of @darribas_gds_course and [Chapter 4](https://geographicdata.science/book/notebooks/04_spatial_weights.html) of @reyABwolf provide a good explanation of the theory around spatial weights, and the [Spatial Weights](https://fcorowe.github.io/intro-gds/03-spatial_weights.html) Section of @rowe2022a illustrates the use of R to compute different types of spatial weight matrices.
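
The properties of $W$ described above (zero diagonal, positive entries for neighbors, rows summing to one) can be sketched in a few lines of base R. This is an illustrative toy, not code from the chapter: the four-unit layout, the binary contiguity rule and the names `W`, `W_std`, `y` and `Wy` are all assumptions made for the example.

```r
# Hypothetical 4-unit example: units arranged in a line, 1 - 2 - 3 - 4.
# Binary contiguity: w_ij = 1 if i and j share a border, 0 otherwise.
W <- matrix(0, nrow = 4, ncol = 4)
W[cbind(1:3, 2:4)] <- 1   # unit i is a neighbour of unit i + 1
W <- W + t(W)             # make the neighbour relation symmetric

# Row-standardise: divide each row by its number of neighbours.
W_std <- W / rowSums(W)

diag(W_std)      # all zeros: no unit is its own neighbour
rowSums(W_std)   # every row sums to one

# Spatial lag of a variable y: the average of y over each unit's neighbours.
y  <- c(10, 20, 30, 40)
Wy <- as.vector(W_std %*% y)
Wy
```

With row-standardised weights the product `W_std %*% y` is simply a neighbourhood average, which is one reason this standardisation is so common.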

8 changes: 5 additions & 3 deletions 07-multilevel-01.qmd
@@ -188,11 +188,13 @@ summary(model2)

We can estimate a three-level model by replacing `(1 | lsoa_cd)` with `(1 | msoa_cd/lsoa_cd)` to allow the intercept to also vary by MSOA and to account for the nesting of LSOAs within MSOAs. In multilevel modelling, these types of models are formally known as *nested random effects* models, and they differ from another set of models known as *crossed random effects* models.

::: column-margin ::: callout-note A crossed random effect model in our example would be expressed as follows:
::: column-margin
::: callout-note
A crossed random effect model in our example would be expressed as follows:

`unemp ~ 1 + (1 | lsoa_cd) + (1 | msoa_cd)`

::: ::: column-margin
:::
:::
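
To make the difference between the nested and crossed specifications more concrete, the sketch below writes both formulas as plain R objects and uses base R's `terms()` to show how the `/` operator expands. Nothing is fitted here; the object names `nested`, `crossed` and `grouping_terms` are hypothetical, and the variable names are taken from the formulas quoted in the text.

```r
# Nested specification from the text: LSOAs within MSOAs.
nested  <- unemp ~ 1 + (1 | msoa_cd/lsoa_cd)
# Crossed specification from the margin note.
crossed <- unemp ~ 1 + (1 | lsoa_cd) + (1 | msoa_cd)

# In R's formula language `a/b` is shorthand for `a + a:b`.
# terms() makes that expansion visible for the grouping part alone.
grouping_terms <- attr(terms(~ msoa_cd/lsoa_cd), "term.labels")
grouping_terms
```

Read this way, the nested model has one random intercept per MSOA and one per LSOA-within-MSOA combination. If LSOA codes are unique across MSOAs, the two specifications should define the same groups, so the distinction matters most when lower-level labels are reused across higher-level units.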

```{r}
# specify a model equation
36 changes: 25 additions & 11 deletions 09-gwr.qmd
@@ -45,15 +45,15 @@ For this chapter, we will use data on:

- resident population characteristics from the 2011 census, available from the [Office for National Statistics](https://www.nomisweb.co.uk/home/census2001.asp); and,

- 2019 Index of Multiple Deprivation (IMD) data from [GOV.UK](https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019) and published by the Ministry of Housing, Communities & Local Government.
- 2019 Index of Multiple Deprivation (IMD) data from [GOV.UK](https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019) published by the Ministry of Housing, Communities & Local Government.

The data used for this Chapter are organised at the ONS Upper Tier Local Authority (UTLA) level - also known as [Counties and Unitary Authorities](https://geoportal.statistics.gov.uk). They are the geographical units used to report COVID-19 data.
The data used for this Chapter are organised at the ONS Upper Tier Local Authority (UTLA) level - also known as [Counties and Unitary Authorities](https://geoportal.statistics.gov.uk). They were the geographical units used to report COVID-19 data.

If you use the dataset utilised in this chapter, make sure to cite this book. For a full list of the variables included in the data set used in this Chapter, see the readme file in the gwr data folder.[^09-gwr-1]

[^09-gwr-1]: Read the file in R by executing `read_tsv("data/gwr/readme.txt")`. Ensure the library readr is installed before running `read_tsv`.

Let's read the data:
Let us read the data:

```{r, results = 'hide'}
# clean workspace
@@ -178,7 +178,14 @@ The results indicate that the incidence of COVID-19 is significantly and positiv

The results also reveal high collinearity between particular pairs of variables, notably between the share of crowded housing and of nonwhite ethnic population, the share of crowded housing and of elderly population, the share of overcrowded housing and of administrative & support workers, and the share of elderly population and of population suffering from long-term illness. A more refined analysis of multicollinearity is needed. Various diagnostics for multicollinearity in a regression framework exist, including matrix condition numbers (CNs), predictor variance inflation factors (VIFs) and variance decomposition proportions (VDPs). Rules of thumb (CNs \> 30, VIFs \> 10 and VDPs \> 0.5) to indicate worrying levels of collinearity can be found in @belsley2005regression. To avoid problems of multicollinearity, a simple strategy is often to remove highly correlated predictors. The difficulty is in deciding which predictor(s) to remove, especially when all are considered important. Keep this in mind when specifying your model.
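
As a rough illustration of two of the diagnostics mentioned above, the sketch below computes VIFs and a condition number in base R. It uses simulated predictors rather than the chapter's data, and every name in it (`x1`, `x2`, `x3`, `vif`, `cn`) is made up for the example. The condition number is computed here on a design matrix with columns scaled to unit length, which is one common convention and an assumption on our part.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately collinear with x1
x3 <- rnorm(n)                       # roughly independent of the others
X  <- cbind(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the rest.
vif <- sapply(seq_len(ncol(X)), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
names(vif) <- colnames(X)

# Equivalent shortcut: the diagonal of the inverse correlation matrix.
vif_alt <- diag(solve(cor(X)))

# Condition number of the design matrix (intercept included) after scaling
# each column to unit length: ratio of largest to smallest singular value.
Xd <- cbind(intercept = 1, X)
Xs <- sweep(Xd, 2, sqrt(colSums(Xd^2)), "/")
sv <- svd(Xs)$d
cn <- max(sv) / min(sv)

round(vif, 2)
round(cn, 2)
```

In this toy setting the two collinear predictors should show clearly larger VIFs than the independent one, which is the pattern the rules of thumb are meant to flag.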

> Challenge 2: Analyse the relationship of all the variables executing `pairs(df_sel)`. How accurate would a linear regression be in capturing the relationships for our set of variables?
::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task**

Analyse the relationship of all the variables executing `pairs(df_sel)`. How accurate would a linear regression be in capturing the relationships for our set of variables?
:::
:::


### Global Regression Results

@@ -206,7 +213,7 @@ The regression results indicate a positive relationship exists between the share

The $R^{2}$ value for the OLS regression is 0.393, indicating that our model explains only 39% of the variance in the rate of COVID-19 infection. This leaves 61% of the variance unexplained. Some of this unexplained variance can be because we have only included two explanatory variables in our model, but also because the OLS regression model assumes that the relationships in the model are constant over space; that is, it assumes a stationary process. Hence, an OLS regression model is considered to capture global relationships. However, relationships may vary over space. Suppose, for instance, that there are intrinsic behavioural variations across England and that people have adhered more strictly to self-isolation and social distancing measures in some areas than in others, or that ethnic minorities are less exposed to contracting COVID-19 in certain parts of England. If such variations in associations exist over space, our estimated OLS model will be a misspecification of reality because it assumes these relationships to be constant.

To better understand this potential misspecification, we investigate the model residuals, which show high variability (see below). The distribution is non-random, displaying large positive residuals in the metropolitan areas of London, Liverpool, Newcastle (in light colours) and the Lake District and large negative residuals across much of England (in black). This conforms to the spatial pattern of confirmed COVID-19 cases with high concentration in a limited number of metropolitan areas (see above). While our residual map reveals that there is a problem with the OLS model, it does not indicate which, if any, of the parameters in the model might exhibit spatial nonstationarity. A simple way of examining if the relationships being modelled in our global OLS model are likely to be stationary over space would be to estimate a separate OLS model for each UTLA in England. But this would require higher resolution data, i.e. data within each UTLA, and we only have one data point per UTLA. -@Fotheringham_et_al_2002_book (2002, p.40-44) discuss alternative approaches and their limitations.
To better understand this potential misspecification, we investigate the model residuals, which show high variability (see below). The distribution is non-random, displaying large positive residuals in the metropolitan areas of London, Liverpool, Newcastle (in light colours) and the Lake District and large negative residuals across much of England (in black). This conforms to the spatial pattern of confirmed COVID-19 cases with high concentration in a limited number of metropolitan areas (see above). While our residual map reveals that there is a problem with the OLS model, it does not indicate which, if any, of the parameters in the model might exhibit spatial nonstationarity. A simple way of examining if the relationships being modelled in our global OLS model are likely to be stationary over space would be to estimate a separate OLS model for each UTLA in England. But this would require higher resolution data, i.e. data within each UTLA, and we only have one data point per UTLA. @Fotheringham_et_al_2002_book [p.40-44] discuss alternative approaches and their limitations.

```{r}
utla_shp$res_m1 <- residuals(model1)
@@ -253,7 +260,7 @@ Two general set of kernel functions can be distinguished: continuous kernels and

### Selecting a Bandwidth

Let's now implement a GWR model. The first key step is to define the optimal bandwidth. We first illustrate the use of a fixed spatial kernel.
Let us implement a GWR model. The first key step is to define the optimal bandwidth. We first illustrate the use of a fixed spatial kernel.

#### Fixed Bandwidth

@@ -273,7 +280,7 @@ fbw <- gwr.sel(eq1,
fbw
```

The result indicates that the optimal bandwidth is 39.79 km. This means that neighbouring UTLAs within a fixed radius of 39.79 km will be used to estimate local regressions. To estimate a GWR, we execute the code below, in which the optimal bandwidth above is passed to the argument `bandwidth`.
The result indicates that the optimal bandwidth is 29.30 km. This means that neighbouring UTLAs within a fixed radius of 29.30 km will be used to estimate local regressions. To estimate a GWR, we execute the code below, in which the optimal bandwidth above is passed to the argument `bandwidth`.
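
To give some intuition for what a fixed bandwidth does, the sketch below builds distance-decay weights around a single regression point and fits one local regression by weighted least squares in base R. It is a simplified illustration under stated assumptions, not the procedure run in this chapter: the coordinates and variables are simulated, the helper `gauss_w()` is hypothetical, and a Gaussian kernel is assumed; whether that matches the kernel used in the chapter's call depends on arguments not shown in this excerpt.

```r
# Gaussian distance-decay weights: 1 at the focal point, falling towards 0.
gauss_w <- function(d, h) exp(-0.5 * (d / h)^2)

set.seed(42)
n      <- 150
coords <- cbind(x = runif(n, 0, 300), y = runif(n, 0, 300))  # made-up km grid
z      <- rnorm(n)
# Simulated outcome whose slope on z drifts from west to east.
yv     <- 1 + (0.5 + coords[, "x"] / 300) * z + rnorm(n, sd = 0.2)

h     <- 29.30   # fixed bandwidth reported in the text, in km
focal <- 1       # index of the regression point
d     <- sqrt(rowSums(sweep(coords, 2, coords[focal, ])^2))
w     <- gauss_w(d, h)

# A local regression at the focal point is a weighted least squares fit.
local_fit <- lm(yv ~ z, weights = w)
coef(local_fit)
```

Repeating this fit with each observation in turn as the focal point yields one set of coefficients per location, which is the core idea behind GWR. Note that with a Gaussian kernel observations beyond the bandwidth still receive a small non-zero weight, so the bandwidth acts as a decay scale rather than a hard cut-off.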

```{r, warning=FALSE}
# fit a gwr based on fixed bandwidth
@@ -289,7 +296,7 @@ fb_gwr <- gwr(eq1,
fb_gwr
```

We will skip the interpretation of the results for now and consider them in the next section. Now, we want to focus on the overall model fit and will map the results of the $R^{2}$ for the estimated local regressions. To do this, we extract the model results stored in a Spatial Data Frame (SDF) and add them to our spatial data frame `utla_shp`. Note that the Quasi-global $R^{2}$ is very high (0.77) indicating a high in-sample prediction accuracy.
We will skip the interpretation of the results for now and consider them in the next section. Now, we want to focus on the overall model fit and will map the results of the $R^{2}$ for the estimated local regressions. To do this, we extract the model results stored in a Spatial Data Frame (SDF) and add them to our spatial data frame `utla_shp`. Note that the Quasi-global $R^{2}$ is very high indicating a high in-sample prediction accuracy.

```{r}
# write gwr output into a data frame
@@ -375,7 +382,7 @@ The map reveals notable improvements in local estimates for UTLAs within West an

### Interpretation

The key strength of GWR models is in identifying patterns of spatial variation in the associations between pairs of variables. The results reveal how these coefficients vary across the 150 UTLAs of England. To examine this variability, let's first focus on the adaptive GWR output reported in Section 8.6.4.2. The output includes a summary of GWR coefficient estimates at various data points. The last column reports the global estimates, which are the same as the coefficients from the OLS regression we fitted at the start of our analysis. For our variable nonwhite ethnic population, the GWR output reveals that local coefficients range from a minimum value of -148.41 to a maximum value of 1076.84, indicating that a one percentage point increase in the share of nonwhite ethnic population is associated with a reduction of 148.41 in the number of cumulative confirmed cases of COVID-19 per 100,000 people in some UTLAs and an increase of 1076.84 in others. For half of the UTLAs in the dataset, as the share of nonwhite ethnic population increases by one percentage point, the rate of COVID-19 will increase between 106.29 and 291.24 cases; that is, the inter-quartile range between the 1st Qu and the 3rd Qu. To analyse the spatial structure, we next map the estimated coefficients obtained from the adaptive kernel GWR.
The key strength of GWR models is in identifying patterns of spatial variation in the associations between pairs of variables. The results reveal how these coefficients vary across the 150 UTLAs of England. To examine this variability, let's first focus on the adaptive GWR output reported in Section 8.6.4.2. The output includes a summary of GWR coefficient estimates at various data points. The last column reports the global estimates, which are the same as the coefficients from the OLS regression we fitted at the start of our analysis. For our variable nonwhite ethnic population, the GWR output reveals that local coefficients range from a minimum value of -121.87 to a maximum value of 1162.12, indicating that a one percentage point increase in the share of nonwhite ethnic population is associated with a reduction of 121.87 in the number of cumulative confirmed cases of COVID-19 per 100,000 people in some UTLAs and an increase of 1162.12 in others. For half of the UTLAs in the dataset, as the share of nonwhite ethnic population increases by one percentage point, the rate of COVID-19 will increase between 106.82 and 283.74 cases; that is, the inter-quartile range between the 1st Qu and the 3rd Qu. To analyse the spatial structure, we next map the estimated coefficients obtained from the adaptive kernel GWR.

```{r}
# Ethnic
@@ -433,9 +440,16 @@ map_sig + tm_shape(reg_shp) + # add region boundaries
table(utla_shp$t_ethnic_cat)
```

For the share of nonwhite population, 67% of all local coefficients are statistically significant and these are largely in the South of England. Coefficients in the North tend to be insignificant. However, outliers exist in both regions. In the South, nonsignificant coefficients are observed in the metropolitan areas of London, Birmingham and Nottingham, while significant coefficients exist in the areas of Newcastle and Middlesbrough in the North.
For the share of nonwhite population, 70% of all local coefficients are statistically significant and these are largely in the South of England. Coefficients in the North tend to be insignificant. However, outliers exist in both regions. In the South, nonsignificant coefficients are observed in the metropolitan areas of London, Birmingham and Nottingham, while significant coefficients exist in the areas of Newcastle and Middlesbrough in the North.
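
The share of significant local coefficients quoted above comes from a simple calculation: divide each local estimate by its standard error and compare the resulting t value with a critical value. The sketch below shows that arithmetic on five invented areas. The numbers and the names `coef_ethnic`, `se_ethnic`, `t_cat` and `share_sig` are hypothetical, and the threshold of 1.96 is the conventional 5% two-sided value, assumed here for illustration.

```r
# Hypothetical local estimates and standard errors for five areas.
coef_ethnic <- c(250, 40, -30, 310, 120)
se_ethnic   <- c( 60, 55,  45,  90,  50)

# Local t values: estimate divided by its standard error.
t_ethnic <- coef_ethnic / se_ethnic

# Classify each area using the 1.96 threshold.
t_cat <- cut(t_ethnic,
             breaks = c(-Inf, -1.96, 1.96, Inf),
             labels = c("sig. negative", "not sig.", "sig. positive"))

table(t_cat)

# Proportion of areas with a statistically significant coefficient.
share_sig <- mean(abs(t_ethnic) > 1.96)
share_sig
```

Because many local tests are carried out at once, a share computed this way is best read as descriptive; some coefficients would be expected to cross the threshold by chance alone.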

::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task**

Compute the t values for the intercept and estimated coefficient for long-term illness and create maps of their statistical significance. How many UTLAs report statistically significant coefficients?
:::
:::

> Challenge 3 Compute the t values for the intercept and estimated coefficient for long-term illness and create maps of their statistical significance. How many UTLAs report statistically significant coefficients?

### Collinearity in GWR

Binary file added 09-gwr_files/figure-pdf/unnamed-chunk-10-1.pdf
Binary file not shown.
Binary file added 09-gwr_files/figure-pdf/unnamed-chunk-13-1.pdf
Binary file not shown.
Binary file added 09-gwr_files/figure-pdf/unnamed-chunk-16-1.pdf
Binary file not shown.
Binary file added 09-gwr_files/figure-pdf/unnamed-chunk-17-1.pdf
Binary file not shown.
Binary file added 09-gwr_files/figure-pdf/unnamed-chunk-18-1.pdf
Binary file not shown.
Binary file added 09-gwr_files/figure-pdf/unnamed-chunk-3-1.pdf
Binary file not shown.
Binary file added 09-gwr_files/figure-pdf/unnamed-chunk-4-1.pdf
Binary file not shown.
Binary file added 09-gwr_files/figure-pdf/unnamed-chunk-7-1.pdf
Binary file not shown.
4 changes: 2 additions & 2 deletions docs/03-data-wrangling.html

Large diffs are not rendered by default.

Binary file modified docs/05-flows_files/figure-html/unnamed-chunk-17-1.png
Binary file modified docs/05-flows_files/figure-html/unnamed-chunk-18-1.png
Binary file modified docs/05-flows_files/figure-html/unnamed-chunk-22-1.png
2 changes: 1 addition & 1 deletion docs/06-spatial-econometrics.html
@@ -1888,7 +1888,7 @@
<span id="cb32-209"><a href="#cb32-209" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb32-210"><a href="#cb32-210" aria-hidden="true" tabindex="-1"></a>**Spatial Weights**</span>
<span id="cb32-211"><a href="#cb32-211" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb32-212"><a href="#cb32-212" aria-hidden="true" tabindex="-1"></a>There are several ways to introduce spatial dependence in an econometric framework, with varying degrees of econometric sophistication <span class="co">[</span><span class="ot">see @anselin2003spatial for a good overview</span><span class="co">]</span>. Common to all of them, however, is the way space is formally encapsulated: through *spatial weights matrices (*$W$)<span class="ot">[^06-spatial-econometrics-2]</span>. These are $N \times N$ matrices with zero diagonals in which every cell $w_{ij}$ holds a value that represents the degree of spatial connectivity/interaction between observations $i$ and $j$. If they are not connected at all, $w_{ij}=0$; otherwise $w_{ij}&gt;0$ and we call $i$ and $j$ neighbors. The exact value in the latter case depends on the criterion we use to define neighborhood relations. These matrices also tend to be row-standardized so that each row sums to one. </span>
<span id="cb32-212"><a href="#cb32-212" aria-hidden="true" tabindex="-1"></a>There are several ways to introduce spatial dependence in an econometric framework, with varying degrees of econometric sophistication <span class="co">[</span><span class="ot">see @anselin2003spatial for a good overview</span><span class="co">]</span>. Common to all of them, however, is the way space is formally encapsulated: through *spatial weights matrices (*$W$)<span class="ot">[^06-spatial-econometrics-2]</span>. These are $N \times N$ matrices with zero diagonals in which every cell $w_{ij}$ holds a value that represents the degree of spatial connectivity/interaction between observations $i$ and $j$. If they are not connected at all, $w_{ij}=0$; otherwise $w_{ij}&gt;0$ and we call $i$ and $j$ neighbors. The exact value in the latter case depends on the criterion we use to define neighborhood relations. These matrices also tend to be row-standardized so that each row sums to one.</span>
<span id="cb32-213"><a href="#cb32-213" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb32-214"><a href="#cb32-214" aria-hidden="true" tabindex="-1"></a><span class="ot">[^06-spatial-econometrics-2]: </span>If you need to refresh your knowledge on spatial weight matrices, <span class="co">[</span><span class="ot">Block E</span><span class="co">](https://darribas.org/gds_course/content/bE/concepts_E.html)</span> of @darribas_gds_course and <span class="co">[</span><span class="ot">Chapter 4</span><span class="co">](https://geographicdata.science/book/notebooks/04_spatial_weights.html)</span> of @reyABwolf provide a good explanation of the theory around spatial weights, and the <span class="co">[</span><span class="ot">Spatial Weights</span><span class="co">](https://fcorowe.github.io/intro-gds/03-spatial_weights.html)</span> Section of @rowe2022a illustrates the use of R to compute different types of spatial weight matrices.</span>
<span id="cb32-215"><a href="#cb32-215" aria-hidden="true" tabindex="-1"></a></span>