diff --git a/_quarto.yml b/_quarto.yml index 00ba1ea..c03607e 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -55,6 +55,8 @@ format: reference-location: margin mermaid: theme: neutral + pdf: + documentclass: scrreprt editor: visual diff --git a/docs/01-overview.html b/docs/01-overview.html index 689e258..0c7c8a1 100644 --- a/docs/01-overview.html +++ b/docs/01-overview.html @@ -416,8 +416,8 @@

1.5 Assessment

The final module mark is composed of two computational essays. Together they are designed to cover all the material introduced during the semester. A computational essay is an essay whose narrative is supported by code and computational results that are included in the essay itself. Each teaching week, you will be required to address a set of questions relating to the module content covered in that week, and to use the material you produce for this purpose to build your computational essay.

-

Assignment 1 (50%) refers to the set of questions at the end of Chapter 4, Chapter 5 and Chapter 6. You are required to use your responses to build your computational essay. Each chapter provides more specific guidance on the tasks and discussion that you are required to consider in your assignment.

-

Assignment 2 (50%) refers to the set of questions at the end of Chapter 7, Chapter 8, Chapter 9 and Chapter 10. You are required to use your responses to build your computational essay. Each chapter provides more specific guidance on the tasks and discussion that you are required to consider in your assignment.

+

Assignment 1 (50%) refers to the set of questions at the end of sec-chp4, sec-chp5 and sec-chp6. You are required to use your responses to build your computational essay. Each chapter provides more specific guidance on the tasks and discussion that you are required to consider in your assignment.

+

Assignment 2 (50%) refers to the set of questions at the end of sec-chp7, sec-chp8, sec-chp9 and sec-chp10. You are required to use your responses to build your computational essay. Each chapter provides more specific guidance on the tasks and discussion that you are required to consider in your assignment.

1.5.1 Format Requirements

Both assignments will have the same requirements:

diff --git a/docs/02-spatial_data.html b/docs/02-spatial_data.html index 9ffb1b5..2729b8c 100644 --- a/docs/02-spatial_data.html +++ b/docs/02-spatial_data.html @@ -288,13 +288,13 @@

Different classifications of spatial data types exist. Knowing the structure of the data at hand is important because specific analytical methods are more appropriate for particular data types. We will use a particular classification involving four data types: lattice/areal data, point data, flow data and trajectory data (Fig. 1). This is not an exhaustive list, but it is helpful to motivate the analytical and modelling methods that we cover in this book.

-
Fig. 1. Data Types. Area / Lattice data source: Önnerfors et al. (2019). Point data source: Tao et al. (2018). Flow data source: Rowe and Patias (2020). Trajectory data source: Kwan and Lee (2004).
+
Fig. 1. Data Types. Area / Lattice data source: Önnerfors et al. (2019). Point data source: Tao et al. (2018). Flow data source: Rowe and Patias (2020). Trajectory data source: Kwan and Lee (2004).

Lattice/Areal Data. These data correspond to records of attribute values (such as population counts) for a fixed geographical area. They may comprise regular shapes (such as grids or pixels) or irregular shapes (such as states, counties or travel-to-work areas). Raster data are a common source of regular lattice/areal data, while censuses are probably the most common form of irregular lattice/areal data. Point data within an area can be aggregated to produce lattice/areal data.

Point Data. These data refer to records of the geographic location of a discrete event, or the number of occurrences of a geographical process at a given location. As displayed in Fig. 1, examples include the geographic location of bus stops in a city, or the number of boarding passengers at each bus stop.

Flow Data. These data refer to records of measurements for a pair of geographic point locations or a pair of areas. These data capture the linkage or spatial interaction between two locations. Migration flows between a place of origin and a place of destination are an example of this type of data.

-

Trajectory Data. These data record geographic locations of moving objects at various points in time. A trajectory is composed of a single string of data recording the geographic location of an object at various points in time, and each record in the string contains a time stamp. These data are complex and can be classified into explicit trajectory data and implicit trajectory data. The former are well-structured data that record positions of objects continuously and intensively at uniform time intervals, such as GPS data. The latter are less structured and record data in relatively time point intervals, including sensor-based, network-based and signal-based data (Kong et al. 2018).

-

In this course, we cover analytical and modelling approaches for point, lattice/areal and flow data. While we do not explicitly analyse trajectory data, several of the analytical approaches described in this book can be extended to incorporate time, and can be applied to model these types of data. In Chapter 10, we describe approaches to analyse and model spatio-temporal data. These same methods can be applied to trajectory data.

+

Trajectory Data. These data record geographic locations of moving objects at various points in time. A trajectory is composed of a single string of data recording the geographic location of an object at various points in time, and each record in the string contains a time stamp. These data are complex and can be classified into explicit trajectory data and implicit trajectory data. The former are well-structured data that record positions of objects continuously and intensively at uniform time intervals, such as GPS data. The latter are less structured and record data in relatively time point intervals, including sensor-based, network-based and signal-based data (Kong et al. 2018).

+

In this course, we cover analytical and modelling approaches for point, lattice/areal and flow data. While we do not explicitly analyse trajectory data, several of the analytical approaches described in this book can be extended to incorporate time, and can be applied to model these types of data. In sec-chp10, we describe approaches to analyse and model spatio-temporal data. These same methods can be applied to trajectory data.

2.2 Hierarchical Structure of Data

The hierarchical organisation is a key feature of spatial data. Smaller geographical units are organised within larger geographical units. You can find the hierarchical representation of UK Statistical Geographies on the Office for National Statistics website. In the bottom part of the output below, we can observe a spatial data frame for Liverpool displaying the hierarchical structure of census data (from the smallest to the largest): Output Areas (OAs), Lower Super Output Areas (LSOAs), Middle Super Output Areas (MSOAs) and Local Authority Districts (LADs). This hierarchical structure entails that units in smaller geographies are nested within units in larger geographies, and that smaller units can be aggregated to produce large units.

@@ -320,34 +320,34 @@

Major challenges exist when working with spatial data. Below we explore some of the key longstanding problems data scientists often face when working with geographical data.

2.3.1 Modifiable Area Unit Problem (MAUP)

-

The Modifiable Area Unit Problem (MAUP) represents a challenge that has troubled geographers for decades (Openshaw 1981). Two aspects of the MAUP are normally recognised in empirical analysis: scale and zonation. Fig. 2 illustrates these issues.

+

The Modifiable Area Unit Problem (MAUP) represents a challenge that has troubled geographers for decades (Openshaw 1981). Two aspects of the MAUP are normally recognised in empirical analysis: scale and zonation. Fig. 2 illustrates these issues.

  • Scale refers to the idea that a geographical area can be divided into geographies with differing numbers of spatial units.

  • Zonation refers to the idea that a geographical area can be divided into the same number of units in a variety of ways.

-
Fig. 2. MAUP effect. (a) scale effect; and, (b) zonation effect. Source: Loidl et al. (2016).
+
Fig. 2. MAUP effect. (a) scale effect; and, (b) zonation effect. Source: Loidl et al. (2016).
-

The MAUP is a critical issue as it can impact our analysis and thus any conclusions we can infer from our results (e.g. Fotheringham and Wong 1991). There is no agreed systematic approach on how to handle the effects of the MAUP. Some have suggested performing analyses at different existing geographical scales, assessing the consistency of the results and identifying potential sources of change. The issue with such an approach is that results from analysis at different scales are likely to differ because distinct dimensions of a geographic process may be captured at different scales. For example, in migration studies, smaller geographies may be more suitable to capture residential mobility over short distances, while large geographies may be more suitable to capture long-distance migration. It is well documented that these types of moves are driven by different factors. While residential mobility tends to be driven by housing-related reasons, long-distance migration is more closely related to employment-related motives (Niedomysl 2011).

-

An alternative approach is to use the smallest geographical system available and create random aggregations at various geographical scales, to directly quantify the extent of scale and zonation effects. This approach has shown promising results in applications to study internal migration flows (Stillwell, Daras, and Bell 2018). Another approach involves the production of “meaningful” or functional geographies that can more appropriately capture the process of interest. There is an active area of work defining functional labour markets (Casado-Díaz, Martínez-Bernabéu, and Rowe 2017), urban areas (Arribas-Bel, Garcia-López, and Viladecans-Marsal 2021) and various forms of geodemographic classifications (Singleton and Spielman 2013; Patias, Rowe, and Cavazzi 2019). However, it is recognised that none of the existing approaches resolves the effects of the MAUP, and it has recently been suggested that the most plausible ‘solution’ would be to ignore the MAUP (Wolf et al. 2020).

+

The MAUP is a critical issue as it can impact our analysis and thus any conclusions we can infer from our results (e.g. Fotheringham and Wong 1991). There is no agreed systematic approach on how to handle the effects of the MAUP. Some have suggested performing analyses at different existing geographical scales, assessing the consistency of the results and identifying potential sources of change. The issue with such an approach is that results from analysis at different scales are likely to differ because distinct dimensions of a geographic process may be captured at different scales. For example, in migration studies, smaller geographies may be more suitable to capture residential mobility over short distances, while large geographies may be more suitable to capture long-distance migration. It is well documented that these types of moves are driven by different factors. While residential mobility tends to be driven by housing-related reasons, long-distance migration is more closely related to employment-related motives (Niedomysl 2011).

+

An alternative approach is to use the smallest geographical system available and create random aggregations at various geographical scales, to directly quantify the extent of scale and zonation effects. This approach has shown promising results in applications to study internal migration flows (Stillwell, Daras, and Bell 2018). Another approach involves the production of “meaningful” or functional geographies that can more appropriately capture the process of interest. There is an active area of work defining functional labour markets (Casado-Díaz, Martínez-Bernabéu, and Rowe 2017), urban areas (Arribas-Bel, Garcia-López, and Viladecans-Marsal 2021) and various forms of geodemographic classifications (Singleton and Spielman 2013; Patias, Rowe, and Cavazzi 2019). However, it is recognised that none of the existing approaches resolves the effects of the MAUP, and it has recently been suggested that the most plausible ‘solution’ would be to ignore the MAUP (Wolf et al. 2020).

2.3.2 Ecological Fallacy

-

Ecological fallacy is an error in the interpretation of statistical data based on aggregate information. Specifically it refers to inferences made about the nature of specific individuals based solely on statistics aggregated for a given group. It is about thinking that relationships observed for groups necessarily hold for individuals. A key example is Robinson (1950) who illustrates this problem exploring the difference between ecological correlations and individual correlations. He looked at the relationship between country of birth and literacy. Robinson (1950) used the percent of foreign-born population and percent of literate population for the 48 states in the United States in 1930. The ecological correlation based on these data was 0.53. This suggests a positive association between foreign birth and literacy, and could be interpreted as foreign born individuals being more likely to be literate than native-born individuals. Yet, the correlation based on individual data was negative -0.11 which indicates the opposite. The main point emerging from this example is to carefully interpret analysis based on spatial data and avoid making inferences about individuals from these data.

+

Ecological fallacy is an error in the interpretation of statistical data based on aggregate information. Specifically it refers to inferences made about the nature of specific individuals based solely on statistics aggregated for a given group. It is about thinking that relationships observed for groups necessarily hold for individuals. A key example is Robinson (1950) who illustrates this problem exploring the difference between ecological correlations and individual correlations. He looked at the relationship between country of birth and literacy. Robinson (1950) used the percent of foreign-born population and percent of literate population for the 48 states in the United States in 1930. The ecological correlation based on these data was 0.53. This suggests a positive association between foreign birth and literacy, and could be interpreted as foreign born individuals being more likely to be literate than native-born individuals. Yet, the correlation based on individual data was negative -0.11 which indicates the opposite. The main point emerging from this example is to carefully interpret analysis based on spatial data and avoid making inferences about individuals from these data.
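The gap between ecological and individual correlations can be reproduced with a small simulation. The sketch below uses entirely hypothetical data (it does not use Robinson's figures), and all names and parameter values (`n_groups`, `group_centre`, the slopes 2 and -1.5) are assumptions chosen for illustration: between groups the outcome rises with the group average of `x`, while within each group it falls with `x`.

```r
# Hypothetical simulation of the ecological fallacy (not Robinson's data)
set.seed(1)

n_groups <- 20       # number of areas (assumed)
n_per_group <- 500   # individuals per area (assumed)
group <- rep(seq_len(n_groups), each = n_per_group)

# Each area has its own average level of x
group_centre <- seq(-1, 1, length.out = n_groups)
x <- rnorm(length(group), mean = group_centre[group], sd = 1)

# Between areas, y increases with the area average of x;
# within an area, y decreases with an individual's deviation from that average
y <- 2 * group_centre[group] -
  1.5 * (x - group_centre[group]) +
  rnorm(length(group), sd = 1)

# Correlation computed on individuals
individual_cor <- cor(x, y)

# Correlation computed on area averages (the "ecological" correlation)
ecological_cor <- cor(
  as.numeric(tapply(x, group, mean)),
  as.numeric(tapply(y, group, mean))
)

individual_cor
ecological_cor
```

In this set-up the ecological correlation is positive while the individual correlation is negative, so reading the area-level result as a statement about individuals would give the wrong sign.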

2.3.3 Spatial Dependence

-

Spatial dependence refers to the spatial relationship of a variable’s values for a pair of locations at a certain distance apart, so that these values are more similar (or less similar) than expected for randomly associated pairs of observations (Anselin 1988). For example, we could think of observed patterns of ethnic segregation in an area as a result of spillover effects of pre-existing patterns of ethnic segregation in neighbouring areas. Chapter 5 illustrates approaches to explicitly incorporate spatial dependence in regression analysis.

+

Spatial dependence refers to the spatial relationship of a variable’s values for a pair of locations at a certain distance apart, so that these values are more similar (or less similar) than expected for randomly associated pairs of observations (Anselin 1988). For example, we could think of observed patterns of ethnic segregation in an area as a result of spillover effects of pre-existing patterns of ethnic segregation in neighbouring areas. sec-chp5 illustrates approaches to explicitly incorporate spatial dependence in regression analysis.

2.3.4 Spatial Heterogeneity

-

Spatial heterogeneity refers to the uneven distribution of a variable’s values across space. Concentrations of deprivation or unemployment across an area are good examples of spatial heterogeneity. We illustrate various ways to visualise, explore and measure the spatial distribution of data in multiple chapters. We also discuss potential modelling approaches to capture spatial heterogeneity in Chapter 5, Chapter 7 and Chapter 10.

+

Spatial heterogeneity refers to the uneven distribution of a variable’s values across space. Concentrations of deprivation or unemployment across an area are good examples of spatial heterogeneity. We illustrate various ways to visualise, explore and measure the spatial distribution of data in multiple chapters. We also discuss potential modelling approaches to capture spatial heterogeneity in sec-chp5, sec-chp7 and sec-chp10.

2.3.5 Spatial nonstationarity

-

Spatial nonstationarity refers to variations in the relationship between an outcome variable and a set of predictor variables across space. In a modelling context, it relates to a situation in which a simple “global” model is inappropriate to explain the relationships between a set of variables. The geographical nature of the model must be modified to reflect local structural relationships within the data. For example, ethnic segregation has been positively associated with employment outcomes in some countries, pointing to networks in pre-existing communities facilitating access to the local labour market. Conversely, ethnic segregation has also been negatively associated with employment outcomes, pointing to a lack of integration into the broader local community. We illustrate various modelling approaches to capture spatial nonstationarity in Chapter 8 and Chapter 9.

+

Spatial nonstationarity refers to variations in the relationship between an outcome variable and a set of predictor variables across space. In a modelling context, it relates to a situation in which a simple “global” model is inappropriate to explain the relationships between a set of variables. The geographical nature of the model must be modified to reflect local structural relationships within the data. For example, ethnic segregation has been positively associated with employment outcomes in some countries, pointing to networks in pre-existing communities facilitating access to the local labour market. Conversely, ethnic segregation has also been negatively associated with employment outcomes, pointing to a lack of integration into the broader local community. We illustrate various modelling approaches to capture spatial nonstationarity in sec-chp8 and sec-chp9.

-
-

To get familiar with good practices in writing your code in R, we recommend the chapters Workflow: basics and Workflow: scripts and projects from the R for Data Science book by Wickham, Çetinkaya-Rundel, and Grolemund (2023).

+

To get familiar with good practices in writing your code in R, we recommend the chapters Workflow: basics and Workflow: scripts and projects from the R for Data Science book by Wickham, Çetinkaya-Rundel, and Grolemund (2023).

https://r4ds.hadley.nz/workflow-basics.html

@@ -449,7 +449,7 @@
  • Save the script: File > Save As, select your required destination folder, and enter any filename that you like, provided that it ends with the file extension .R
  • -

    R Notebooks and Quarto Documents are Markdown-based options that combine descriptive text with code chunks that can be executed independently and interactively, with output visible immediately beneath each code chunk - see Xie, Allaire, and Grolemund (2019). A Quarto Document is an improved version of the original R Notebook. Quarto Documents require a package called Quarto. Quarto does not have a dependency on or requirement for R. Quarto is multilingual, beginning with R, Python, Javascript, and Julia. The concept is that Quarto will work even for languages that do not yet exist. This book was originally written in R Notebook but later transitioned to Quarto Documents.

    +

    R Notebooks and Quarto Documents are Markdown-based options that combine descriptive text with code chunks that can be executed independently and interactively, with output visible immediately beneath each code chunk - see Xie, Allaire, and Grolemund (2019). A Quarto Document is an improved version of the original R Notebook. Quarto Documents require a package called Quarto. Quarto does not have a dependency on or requirement for R. Quarto is multilingual, beginning with R, Python, Javascript, and Julia. The concept is that Quarto will work even for languages that do not yet exist. This book was originally written in R Notebook but later transitioned to Quarto Documents.

    To create an R Notebook, you need to:

    • Open a new script file: File > New File > R Notebook @@ -873,7 +873,7 @@
      census <- census %>% 
         mutate( per_ghealth = ghealth / pop )
      -

      Note we used a pipe operator %>%, which helps make the code more efficient and readable - for more details, see Grolemund and Wickham (2019). When using the pipe operator, remember to first indicate the data frame before %>%.

      +

      Note we used a pipe operator %>%, which helps make the code more efficient and readable - for more details, see Grolemund and Wickham (2019). When using the pipe operator, remember to first indicate the data frame before %>%.

      Note also the use of a variable name before the = sign in brackets to indicate the name of the new variable after mutate.

    3.9.2 Selecting Variables

    @@ -948,7 +948,7 @@

    3.10 Using Spatial Data Frames

    -

    A core area of the module is learning to work with spatial data in R. R has various purpose-built packages for manipulating spatial data and applying spatial analysis techniques. Many are available on CRAN, including sf (Pebesma 2018, 2022a), stars (Pebesma 2022b), terra, s2 (Dunnington, Pebesma, and Rubak 2023), lwgeom (Pebesma 2023), gstat (Pebesma 2004; Pebesma and Graeler 2022), spdep (Bivand 2022), spatialreg (Bivand and Piras 2022), spatstat (Baddeley, Rubak, and Turner 2015; Baddeley, Turner, and Rubak 2022), tmap (Tennekes 2018, 2022), mapview (Appelhans et al. 2022) and more. A key package in this ecosystem is sf (Pebesma and Bivand 2023). The sf package provides a table format for simple features, where feature geometries are stored in a list-column. It appeared in 2016 and was developed to move spatial data analysis in R closer to standards-based approaches seen in industry and open source projects, to build upon more modern versions of the open source geospatial software stack, and to allow for integration of R spatial software with the tidyverse (Wickham et al. 2019), particularly ggplot2, dplyr, and tidyr. Hence, this book relies heavily on sf for the manipulation and analysis of the data.

    +

    A core area of the module is learning to work with spatial data in R. R has various purpose-built packages for manipulating spatial data and applying spatial analysis techniques. Many are available on CRAN, including sf (Pebesma 2018, 2022a), stars (Pebesma 2022b), terra, s2 (Dunnington, Pebesma, and Rubak 2023), lwgeom (Pebesma 2023), gstat (Pebesma 2004; Pebesma and Graeler 2022), spdep (Bivand 2022), spatialreg (Bivand and Piras 2022), spatstat (Baddeley, Rubak, and Turner 2015; Baddeley, Turner, and Rubak 2022), tmap (Tennekes 2018, 2022), mapview (Appelhans et al. 2022) and more. A key package in this ecosystem is sf (Pebesma and Bivand 2023). The sf package provides a table format for simple features, where feature geometries are stored in a list-column. It appeared in 2016 and was developed to move spatial data analysis in R closer to standards-based approaches seen in industry and open source projects, to build upon more modern versions of the open source geospatial software stack, and to allow for integration of R spatial software with the tidyverse (Wickham et al. 2019), particularly ggplot2, dplyr, and tidyr. Hence, this book relies heavily on sf for the manipulation and analysis of the data.

    @@ -961,7 +961,7 @@
    -

    Lovelace, Nowosad, and Muenchow (2024) provide a helpful overview of the R spatial package ecosystem and its evolution.

    +

    Lovelace, Nowosad, and Muenchow (2024) provide a helpful overview of the R spatial package ecosystem and its evolution.

    To read our spatial data, we use the st_read function. We read a shapefile containing data at Output Area (OA) level for Liverpool. These data illustrate the hierarchical structure of spatial data.

    @@ -1184,7 +1184,7 @@ -

    To master ggplot, see Wickham (2009).

    +

    To master ggplot, see Wickham (2009).

    Using tmap

    Similar to ggplot2, tmap is based on the idea of a ‘grammar of graphics’ which involves a separation between the input data and aesthetics (i.e. the way data are visualised). Each data set can be mapped in various different ways, including location as defined by its geometry, colour and other features. The basic building block is tm_shape() (which defines input data), followed by one or more layer elements such as tm_fill() and tm_dots().

    @@ -1220,8 +1220,8 @@
    Warning: In view mode, scale bar breaks are ignored.
    -
    - +
    +
    @@ -1239,7 +1239,7 @@

    3.10.3 Comparing geographies

    -

    If you recall, one of the key issues of working with spatial data is the modifiable area unit problem (MAUP) - see (spatial_data?). To get a sense of the effects of MAUP, we analyse differences in the spatial patterns of the ethnic population in Liverpool between Middle Layer Super Output Areas (MSOAs) and OAs. So we map these geographies together.

    +

    If you recall, one of the key issues of working with spatial data is the modifiable area unit problem (MAUP) - see (spatial_data?). To get a sense of the effects of MAUP, we analyse differences in the spatial patterns of the ethnic population in Liverpool between Middle Layer Super Output Areas (MSOAs) and OAs. So we map these geographies together.

    @@ -1309,7 +1309,7 @@
    -
    Appelhans, Tim, Florian Detsch, Christoph Reudenbach, and Stefan Woellauer. 2022. Mapview: Interactive Viewing of Spatial Data in r. https://github.com/r-spatial/mapview.
    diff --git a/docs/04-points.html b/docs/04-points.html index c49a0eb..35f0cad 100644 --- a/docs/04-points.html +++ b/docs/04-points.html @@ -318,14 +318,14 @@

    This chapter is based on the following references, which are great follow-ups on the topic:

    • -Lovelace, Nowosad, and Muenchow (2019) offer a great introduction.
    • -
    • Chapter 6 of Brunsdon and Comber (2015), in particular subsections 6.3 and 6.7.
    • +Lovelace, Nowosad, and Muenchow (2019) offer a great introduction. +
    • Chapter 6 of Brunsdon and Comber (2015), in particular subsections 6.3 and 6.7.
    • -Bivand, Pebesma, and Gómez-Rubio (2013) provides an in-depth treatment of spatial data in R.
    • +Bivand, Pebesma, and Gómez-Rubio (2013) provides an in-depth treatment of spatial data in R.

    4.1 Dependencies

    -

    We will rely on the following libraries in this section, all of them included in Section 1.4.1:

    +

    We will rely on the following libraries in this section, all of them included in sec-dependencies:

    # data manipulation, transformation and visualisation
     library(tidyverse)
    @@ -524,10 +524,10 @@
     4.5 Spatial Interpolation
     

    The previous section demonstrates how to visualize the distribution of a set of spatial objects represented as points. In particular, given a bunch of house locations, it shows how one can effectively visualize their distribution over space and get a sense of the density of occurrences. Such visualization, because it is based on KDE, is based on a smooth continuum, rather than on a discrete approach (as a choropleth would do, for example).

    Often, however, we are not particularly interested in learning about the density of occurrences, but about the distribution of a given value attached to each location. Think for example of weather stations and temperature: the location of the stations is no secret and rarely changes, so it is not of particular interest to visualize the density of stations; what we are usually interested in instead is how temperature is distributed over space, given we only measure it in a few places. One could argue the example we have been working with so far, house prices in AirBnb, fits into this category as well: although where a house is advertised may be of relevance, more often we are interested in finding out what the “surface of price” looks like. Rather than where are most houses being advertised? we usually want to know where the most expensive or most affordable houses are located.

    -

    In cases where we are interested in creating a surface of a given value, rather than a simple density surface of occurrences, KDE cannot help us. In these cases, what we are interested in is spatial interpolation, a family of techniques that aim at exactly that: creating continuous surfaces for a particular phenomenon (e.g. temperature, house prices) given only a finite sample of observations. Spatial interpolation is a large field of research that is still being actively developed and that can involve a substantial amount of mathematical complexity in order to obtain the most accurate estimates possible1. In this chapter, we will introduce the simplest possible way of interpolating values, hoping this will give you a general understanding of the methodology and, if you are interested, you can check out further literature. For example, Banerjee, Carlin, and Gelfand (2014) or Cressie (2015) are hard but good overviews.

    +

    In cases where we are interested in creating a surface of a given value, rather than a simple density surface of occurrences, KDE cannot help us. In these cases, what we are interested in is spatial interpolation, a family of techniques that aim at exactly that: creating continuous surfaces for a particular phenomenon (e.g. temperature, house prices) given only a finite sample of observations. Spatial interpolation is a large field of research that is still being actively developed and that can involve a substantial amount of mathematical complexity in order to obtain the most accurate estimates possible1. In this chapter, we will introduce the simplest possible way of interpolating values, hoping this will give you a general understanding of the methodology and, if you are interested, you can check out further literature. For example, Banerjee, Carlin, and Gelfand (2014) or Cressie (2015) are hard but good overviews.

    1 There is also an important economic incentive to do this: some of the most popular applications are in the oil and gas or mining industries. In fact, the very creator of this technique, Danie G. Krige, was a mining engineer. His name is usually used to nickname spatial interpolation as kriging.

    4.5.1 Inverse Distance Weight (IDW) interpolation

    -

    The technique we will cover here is called Inverse Distance Weighting, or IDW for convenience. Brunsdon and Comber (2015) offer a good description:

    +

    The technique we will cover here is called Inverse Distance Weighting, or IDW for convenience. Brunsdon and Comber (2015) offer a good description:

    In the inverse distance weighting (IDW) approach to interpolation, to estimate the value of \(z\) at location \(x\) a weighted mean of nearby observations is taken […]. To accommodate the idea that observations of \(z\) at points closer to \(x\) should be given more importance in the interpolation, greater weight is given to these points […]

    — Page 204
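The weighted mean described in the quote can be sketched in a few lines of base R. This is a minimal illustration on made-up points, not the implementation used later in the chapter; the function name `idw_predict`, the toy coordinates and the choice of `power = 2` are all assumptions made for the example.

```r
# Minimal IDW sketch in base R (toy data; all names are illustrative).
idw_predict <- function(obs_xy, obs_z, new_xy, power = 2) {
  apply(new_xy, 1, function(p) {
    # Euclidean distance from the target location to every observation
    d <- sqrt((obs_xy[, 1] - p[1])^2 + (obs_xy[, 2] - p[2])^2)
    # If the target coincides with an observation, return its value directly
    if (any(d == 0)) return(obs_z[which(d == 0)[1]])
    w <- 1 / d^power          # closer observations get larger weights
    sum(w * obs_z) / sum(w)   # weighted mean of the observed values
  })
}

# Four observations on the corners of a unit square
obs_xy <- cbind(x = c(0, 1, 0, 1), y = c(0, 0, 1, 1))
obs_z  <- c(10, 20, 30, 40)

# Targets: the centre, an observed corner, and a point near that corner
new_xy <- rbind(c(0.5, 0.5), c(0, 0), c(0.1, 0.1))
est <- idw_predict(obs_xy, obs_z, new_xy)
est
```

At the centre all four observations are equally distant, so the estimate is their plain average; near the corner at (0, 0) the estimate is pulled towards that corner's value. A larger `power` makes the surface more local.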

    @@ -704,7 +704,7 @@ -

    The results contain the estimated mean, median and standard deviation for the intercept within each group (e.g. LSOA). The mean estimates are similar to those obtained from ranef with some small differences due to rounding.

    @@ -1026,9 +1026,9 @@
     $ groupFctr: chr "msoa_cd" "msoa_cd" "msoa_cd" "msoa_cd" ...
     $ groupID : chr "E02001347" "E02001348" "E02001349" "E02001350" ...
     $ term : chr "(Intercept)" "(Intercept)" "(Intercept)" "(Intercept)" ...
    - $ mean : num -0.01044 -0.02588 -0.0287 0.00515 0.02168 ...
    - $ median : num -0.01136 -0.02427 -0.02754 0.00583 0.02144 ...
    - $ sd : num 0.0323 0.0319 0.0332 0.0324 0.0186 ...
    + $ mean : num -0.01325 -0.02661 -0.03065 0.00537 0.02327 ...
    + $ median : num -0.01406 -0.02572 -0.02964 0.00511 0.02398 ...
    + $ sd : num 0.0305 0.0354 0.031 0.0349 0.0162 ...
    # merge data
     msoa_shp <- merge(x = msoa_shp, y = re_msoa, by.x = "MSOA_CD", by.y = "groupID")
    @@ -1429,7 +1429,7 @@ -

    8.5 Interpreting Correlations Between Group-level Intercepts and Slopes

    -

    Correlations of random effects can be confusing to interpret. The key to interpreting them appropriately is to recall that they refer to group-level residuals, i.e. deviations of intercepts and slopes from the average model intercept and slope. A strong negative correlation indicates that groups with high intercepts have relatively low slopes, and vice versa. A strong positive correlation indicates that groups with high intercepts have relatively high slopes, and vice versa. A correlation close to zero indicates little or no systematic relationship between intercepts and slopes. Note that a high correlation between intercepts and slopes is not a problem, but it makes the interpretation of the estimated intercepts more challenging. For this reason, a suggestion is to center predictors (\(x's\)); that is, subtract their average value (\(z = x - \bar{x}\)). For a more detailed discussion, see Multilevel Modelling (n.d.).

    +

    Correlations of random effects can be confusing to interpret. The key to interpreting them appropriately is to recall that they refer to group-level residuals, i.e. deviations of intercepts and slopes from the average model intercept and slope. A strong negative correlation indicates that groups with high intercepts have relatively low slopes, and vice versa. A strong positive correlation indicates that groups with high intercepts have relatively high slopes, and vice versa. A correlation close to zero indicates little or no systematic relationship between intercepts and slopes. Note that a high correlation between intercepts and slopes is not a problem, but it makes the interpretation of the estimated intercepts more challenging. For this reason, a suggestion is to center predictors (\(x's\)); that is, subtract their average value (\(z = x - \bar{x}\)). For a more detailed discussion, see Multilevel Modelling (n.d.).
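As a rough illustration of why centering helps, the sketch below simulates grouped data and fits a separate `lm()` per group. Note the assumptions: this is a no-pooling analogue rather than a multilevel model, the data are simulated, and the helper `group_coefs` is a hypothetical name introduced only for this example. The true group slopes are generated independently of the group levels, yet the uncentered intercepts (evaluated at \(x = 0\), far outside the data) end up almost perfectly negatively correlated with the slopes.

```r
set.seed(1)
n_groups <- 30
n_per    <- 50
g <- rep(seq_len(n_groups), each = n_per)

# Group-specific level (at x = 50) and slope, drawn independently
alpha <- rnorm(n_groups, mean = 10, sd = 1)
beta  <- rnorm(n_groups, mean = 2,  sd = 1)

# Predictor whose values sit far away from zero
x <- rnorm(n_groups * n_per, mean = 50, sd = 5)
y <- alpha[g] + beta[g] * (x - 50) + rnorm(length(x), sd = 1)

# z is the predictor centered on its overall mean
d <- data.frame(g = g, x = x, y = y, z = x - mean(x))

# Fit one regression per group and collect (intercept, slope) pairs
group_coefs <- function(formula, data) {
  t(sapply(split(data, data$g), function(s) coef(lm(formula, data = s))))
}

raw <- group_coefs(y ~ x, d)  # intercepts refer to x = 0
cen <- group_coefs(y ~ z, d)  # intercepts refer to the average x

cor_raw <- cor(raw[, 1], raw[, 2])
cor_cen <- cor(cen[, 1], cen[, 2])
c(uncentered = cor_raw, centered = cor_cen)
```

With the centered predictor, each intercept describes a group at the average value of \(x\), so the artificial intercept-slope correlation largely disappears and the intercepts become easier to interpret.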

    To illustrate this, let’s reestimate our model adding an individual-level predictor: the share of population with no educational qualification.

    # centering to the mean
    @@ -640,9 +640,9 @@
     

    How do you interpret the random effect correlation?

    8.6 Model building

    -

    Now we know how to estimate multilevel regression models in R. The question that remains is: When does multilevel modelling make a difference? The short answer is: when there is little group-level variation. When there is very little group-level variation, multilevel modelling reduces to classical linear regression estimates with no group indicators. Inversely, when group-level coefficients vary greatly (compared to their standard errors of estimation), multilevel modelling reduces to classical regression with group indicators (Gelman and Hill 2006).

    +

    Now we know how to estimate multilevel regression models in R. The question that remains is: When does multilevel modelling make a difference? The short answer is: when there is little group-level variation. When there is very little group-level variation, multilevel modelling reduces to classical linear regression estimates with no group indicators. Inversely, when group-level coefficients vary greatly (compared to their standard errors of estimation), multilevel modelling reduces to classical regression with group indicators (Gelman and Hill 2006).

    How do you go about building a model?

    -

    We generally start simple by fitting simple linear regressions and then work our way up to a full multilevel model - see Gelman and Hill (2006) p. 270.

    +

    We generally start simple by fitting simple linear regressions and then work our way up to a full multilevel model - see Gelman and Hill (2006) p. 270.

    How many groups are needed?

    As an absolute minimum, more than two groups are required. With only one or two groups, a multilevel model reduces to a linear regression model.

    How many observations per group?

    @@ -669,7 +669,7 @@

    8.7 Questions

    -

    We will continue to use the COVID-19 dataset. Please see Chapter 11 for details on the data.

    +

    We will continue to use the COVID-19 dataset. Please see sec-chp11 for details on the data.

    sdf <- st_read("data/assignment_2_covid/covid19_eng.gpkg")
    @@ -689,7 +689,7 @@
  • Fit a varying-intercept and varying-slope model.

  • Compare the results for models fitted in 1 and 2. Which is better? Why?

  • -

    Use the same explanatory variables used for the Chapter 7 challenge, so you can compare the model results from this chapter.

    +

    Use the same explanatory variables used for the sec-chp7 challenge, so you can compare the model results from this chapter.

    Analyse and discuss:

    1. the varying slope estimate(s) from your model(s) (to what extent does the relationship between your dependent and independent variables vary across groups / areas? are they statistically significantly different?).
    2. @@ -699,7 +699,7 @@ -

      The regression results indicate a positive relationship exists between the share of nonwhite population and an increased risk of COVID-19 infection. A one percentage point increase in the share of nonwhite population returns a 271 rise in the cumulative count of COVID-19 infection per 100,000 people, everything else constant. The results also reveal a positive (albeit statistically insignificant) relationship between the share of population suffering from long-term illness and an increased risk of COVID-19 infection, after controlling for the share of nonwhite population, thereby confirming our suspicion about the limitations of correlation coefficients; that is, once differences in the share of nonwhite population are taken into account, the association between the share of population suffering from long-term illness and an increased risk of COVID-19 infection becomes positive. We also test for multicollinearity. The VIFs are below 10 indicating that multicollinearity is not highly problematic.

      The \(R^{2}\) value for the OLS regression is 0.393, indicating that our model explains only 39% of the variance in the rate of COVID-19 infection. This leaves 61% of the variance unexplained. Some of this unexplained variance may arise because we have only included two explanatory variables in our model, but also because the OLS regression model assumes that the relationships in the model are constant over space; that is, it assumes a stationary process. Hence, an OLS regression model is considered to capture global relationships. However, relationships may vary over space. Suppose, for instance, that there are intrinsic behavioural variations across England and that people have adhered more strictly to self-isolation and social distancing measures in some areas than in others, or that ethnic minorities are less exposed to contracting COVID-19 in certain parts of England. If such variations in associations exist over space, our estimated OLS model will be a misspecification of reality because it assumes these relationships to be constant.

      -

      To better understand this potential misspecification, we investigate the model residuals, which show high variability (see below). The distribution is non-random, displaying large positive residuals in the metropolitan areas of London, Liverpool, Newcastle (in light colours) and the Lake District, and large negative residuals across much of England (in black). This conforms to the spatial pattern of confirmed COVID-19 cases with high concentration in a limited number of metropolitan areas (see above). While our residual map reveals that there is a problem with the OLS model, it does not indicate which, if any, of the parameters in the model might exhibit spatial nonstationarity. A simple way of examining whether the relationships being modelled in our global OLS model are likely to be stationary over space would be to estimate a separate OLS model for each UTLA in England. But this would require higher-resolution data, i.e. data within each UTLA, and we only have one data point per UTLA. Fotheringham, Brunsdon, and Charlton (2002, p.40-44) discuss alternative approaches and their limitations.

      +

      To better understand this potential misspecification, we investigate the model residuals, which show high variability (see below). The distribution is non-random, displaying large positive residuals in the metropolitan areas of London, Liverpool, Newcastle (in light colours) and the Lake District, and large negative residuals across much of England (in black). This conforms to the spatial pattern of confirmed COVID-19 cases with high concentration in a limited number of metropolitan areas (see above). While our residual map reveals that there is a problem with the OLS model, it does not indicate which, if any, of the parameters in the model might exhibit spatial nonstationarity. A simple way of examining whether the relationships being modelled in our global OLS model are likely to be stationary over space would be to estimate a separate OLS model for each UTLA in England. But this would require higher-resolution data, i.e. data within each UTLA, and we only have one data point per UTLA. Fotheringham, Brunsdon, and Charlton (2002, p.40-44) discuss alternative approaches and their limitations.

      utla_shp$res_m1 <- residuals(model1)
       
      @@ -594,14 +594,14 @@
       

      Graphically, GWR involves fitting a spatial kernel to the data as shown in Fig. 1. For a given regression point \(X\), the weight (\(W\)) of a data point is at a maximum at the location of the regression point. The weight decreases gradually as the distance between the two points increases. A regression model is thus calibrated locally by moving the regression point across the area under study. For each location, the data are weighted differently so that the resulting estimates are unique to a particular location.

      -
      Fig. 1. GWR with fixed spatial kernel. Source: Fotheringham et al. (2002, 45).
      +
      Fig. 1. GWR with fixed spatial kernel. Source: Fotheringham et al. (2002, 45).
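The idea of calibrating a regression locally can be sketched with base R alone, by passing distance-based weights to `lm()`. This is a simplified illustration on simulated data, not the GWR implementation used in this chapter; the Gaussian weighting formula, the bandwidth of 0.15 and the helper name `local_fit` are assumptions made for the example.

```r
set.seed(42)
n <- 200
coords <- cbind(runif(n), runif(n))   # random locations in a unit square
x <- rnorm(n)

# Simulated spatially varying relationship: the slope grows from west to east
slope <- 1 + 2 * coords[, 1]
y <- slope * x + rnorm(n, sd = 0.2)

# Weighted least squares centred on one regression point
local_fit <- function(point, bandwidth) {
  d <- sqrt((coords[, 1] - point[1])^2 + (coords[, 2] - point[2])^2)
  w <- exp(-0.5 * (d / bandwidth)^2)  # weight is 1 at the regression point
  coef(lm(y ~ x, weights = w))
}

west <- local_fit(c(0.1, 0.5), bandwidth = 0.15)
east <- local_fit(c(0.9, 0.5), bandwidth = 0.15)
rbind(west, east)
```

Moving the regression point from west to east changes which observations dominate the fit, so the estimated slope for `x` differs between the two locations, which is what a single global OLS model cannot capture.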

      9.6.1 Fixed or Adaptive Kernel

      A key issue is to decide between two options of spatial kernels: a fixed kernel or an adaptive kernel. Intuitively, a fixed kernel involves using a fixed bandwidth to define a region around all regression points as displayed in Fig. 1. The extent of the kernel is determined by the distance to a given regression point, with the kernel being identical at any point in space. An adaptive kernel involves using varying bandwidth to define a region around regression points as displayed in Fig. 2. The extent of the kernel is determined by the number of nearest neighbours from a given regression point. The kernels have larger bandwidths where the data are sparse.

      -
      Fig. 2. GWR with adaptive spatial kernel. Source: Fotheringham et al. (2002, 47).
      +
      Fig. 2. GWR with adaptive spatial kernel. Source: Fotheringham et al. (2002, 47).
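The difference between the two kernel types can be illustrated with a small base R sketch. The point pattern is simulated, and the fixed bandwidth of 1 and the choice of 15 nearest neighbours are arbitrary assumptions for the example.

```r
set.seed(7)
# A dense cluster of 80 points near the origin plus 20 sparsely scattered points
coords <- rbind(
  cbind(rnorm(80, 0, 0.2), rnorm(80, 0, 0.2)),
  cbind(runif(20, 2, 10), runif(20, 2, 10))
)

# Distances from one regression point to all data points
dist_to <- function(point) {
  sqrt((coords[, 1] - point[1])^2 + (coords[, 2] - point[2])^2)
}

dense_pt  <- c(0, 0)
sparse_pt <- c(6, 6)

# Fixed kernel: same distance everywhere, varying number of data points
fixed_bw <- 1
n_fixed_dense  <- sum(dist_to(dense_pt)  <= fixed_bw)
n_fixed_sparse <- sum(dist_to(sparse_pt) <= fixed_bw)

# Adaptive kernel: same number of neighbours everywhere, varying distance
k <- 15
adaptive_bw <- function(point) sort(dist_to(point))[k]
bw_dense  <- adaptive_bw(dense_pt)
bw_sparse <- adaptive_bw(sparse_pt)

c(n_fixed_dense = n_fixed_dense, n_fixed_sparse = n_fixed_sparse)
c(bw_dense = bw_dense, bw_sparse = bw_sparse)
```

With a fixed bandwidth, the regression point in the sparse area has very few observations to work with, whereas the adaptive bandwidth widens there to reach the same number of neighbours.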

      9.6.2 Optimal Bandwidth

      @@ -611,7 +611,7 @@

      Choosing an optimal bandwidth involves a compromise between bias and precision. For example, a larger bandwidth will involve using a larger number of observations to fit a local regression, and hence result in reduced variance (or increased precision) but higher bias of estimates. On the other hand, too small a bandwidth involves using a very small number of observations, resulting in increased variance but small bias. An optimal bandwidth offers a compromise between bias and variance.

      9.6.3 Shape of Spatial Kernel

      -

      Two general sets of kernel functions can be distinguished: continuous kernels and kernels with compact support. Continuous kernels are used to weight all observations in the study area and include uniform, Gaussian and Exponential kernel functions. Kernels with compact support are used to assign a nonzero weight to observations within a certain distance and a zero weight beyond it. The shape of the kernel has been reported to cause small changes to resulting estimates (Brunsdon, Fotheringham, and Charlton 1998).

      +

      Two general sets of kernel functions can be distinguished: continuous kernels and kernels with compact support. Continuous kernels are used to weight all observations in the study area and include uniform, Gaussian and Exponential kernel functions. Kernels with compact support are used to assign a nonzero weight to observations within a certain distance and a zero weight beyond it. The shape of the kernel has been reported to cause small changes to resulting estimates (Brunsdon, Fotheringham, and Charlton 1998).
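To make the distinction concrete, the sketch below compares one continuous kernel (Gaussian) with one kernel with compact support (bisquare). The functional forms are commonly used definitions and are written out here as an assumption; check the documentation of the GWR package you use for its exact parameterisation.

```r
# Continuous kernel: every observation receives a positive weight
gaussian_kernel <- function(d, b) exp(-0.5 * (d / b)^2)

# Compact support: weights fall to exactly zero at and beyond the bandwidth b
bisquare_kernel <- function(d, b) ifelse(d < b, (1 - (d / b)^2)^2, 0)

d <- c(0, 0.5, 1, 2, 5)  # distances from the regression point
b <- 1                   # bandwidth

w_gauss <- gaussian_kernel(d, b)
w_bisq  <- bisquare_kernel(d, b)

data.frame(distance = d, gaussian = w_gauss, bisquare = w_bisq)
```

Both kernels give a weight of 1 at the regression point. The Gaussian weights decay smoothly but never reach zero, while the bisquare weights drop observations at or beyond the bandwidth from the local regression altogether.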

      9.6.4 Selecting a Bandwidth

      Let’s now implement a GWR model. The first key step is to define the optimal bandwidth. We first illustrate the use of a fixed spatial kernel.

      @@ -820,10 +820,10 @@
    -

    Analysing the map for long-term illness, a clear North-South divide can be identified. In the North, we observe the expected positive relationship between COVID-19 and long-term illness, i.e. as the share of the local population suffering from long-term illness rises, the cumulative number of positive COVID-19 cases is expected to increase. In the South, we observe the inverse pattern, i.e. as the share of the local population suffering from long-term illness rises, the cumulative number of positive COVID-19 cases is expected to drop. This pattern is counterintuitive but may be explained by the wider socio-economic disparities between the North and the South of England. The North is usually characterised by a persistent concentration of more disadvantaged neighbourhoods than the South, where affluent households have tended to cluster for the last 40 years (Patias, Rowe, and Arribas-Bel 2021).

    +

    Analysing the map for long-term illness, a clear North-South divide can be identified. In the North, we observe the expected positive relationship between COVID-19 and long-term illness, i.e. as the share of the local population suffering from long-term illness rises, the cumulative number of positive COVID-19 cases is expected to increase. In the South, we observe the inverse pattern, i.e. as the share of the local population suffering from long-term illness rises, the cumulative number of positive COVID-19 cases is expected to drop. This pattern is counterintuitive but may be explained by the wider socio-economic disparities between the North and the South of England. The North is usually characterised by a persistent concentration of more disadvantaged neighbourhoods than the South, where affluent households have tended to cluster for the last 40 years (Patias, Rowe, and Arribas-Bel 2021).

    9.6.7 Assessing statistical significance

    -

    While the maps above offer valuable insights into the spatial patterning of relationships, they do not identify whether these associations are statistically significant. They may not be. Roughly, if a coefficient estimate has an absolute value of t greater than 1.96 and the sample is sufficiently large, then it is statistically significant. Our sample has only 150 observations, so we are more conservative and consider a coefficient to be statistically significant if it has an absolute value of t larger than 2. Note also that p-values could be computed - see Lu et al. (2014).

    +

    While the maps above offer valuable insights into the spatial patterning of relationships, they do not identify whether these associations are statistically significant. They may not be. Roughly, if a coefficient estimate has an absolute value of t greater than 1.96 and the sample is sufficiently large, then it is statistically significant. Our sample has only 150 observations, so we are more conservative and consider a coefficient to be statistically significant if it has an absolute value of t larger than 2. Note also that p-values could be computed - see Lu et al. (2014).

    # compute t statistic
     utla_shp$t_ethnic = ab_gwr_out$ethnic / ab_gwr_out$ethnic_se
    @@ -864,10 +864,10 @@
     
     

    9.6.8 Collinearity in GWR

    -

    An important final note is: collinearity tends to be problematic in GWR models. It can be present in the data subsets used to estimate local coefficients even when not observed globally (Wheeler and Tiefelsdorf 2005). Collinearity can be highly problematic in the case of compositional, categorical and ordinal predictors, and may result in exact local collinearity, making the search for an optimal bandwidth impossible. A recent paper suggests potential ways forward (Comber et al. 2022).

    +

    An important final note is: collinearity tends to be problematic in GWR models. It can be present in the data subsets used to estimate local coefficients even when not observed globally (Wheeler and Tiefelsdorf 2005). Collinearity can be highly problematic in the case of compositional, categorical and ordinal predictors, and may result in exact local collinearity, making the search for an optimal bandwidth impossible. A recent paper suggests potential ways forward (Comber et al. 2022).

    9.7 Questions

    -

    We will continue to use the COVID-19 dataset. Please see Chapter 11 for details on the data.

    +

    We will continue to use the COVID-19 dataset. Please see sec-chp11 for details on the data.

    sdf <- st_read("data/assignment_2_covid/covid19_eng.gpkg")
    @@ -895,7 +895,7 @@ -

    10.5.2 Exploratory Analysis

    In addition to visualising data, we often want to obtain numerical summaries of the data. Again, innovative ways to reduce the inherent dimensionality of the data and examine dependence structures and potential relationships in time and space are needed. We consider visualisations of empirical spatial and temporal means, dependence structures and some basic time-series analysis.

    @@ -709,7 +709,7 @@

    10.5.2.2 Dependence

    Spatial Dependence

    -

    As we know, spatial dependence refers to the spatial relationship of a variable’s values for pairs of locations a certain distance apart, such that they are more similar (or less similar) than expected for randomly associated pairs of observations. Patterns of spatial dependence may change over time. In the case of a disease outbreak, patterns of spatial dependence can change very quickly as new cases emerge and social distancing measures are implemented. Chapter 6 illustrates how to measure spatial dependence in the context of spatial data.

    +

    As we know, spatial dependence refers to the spatial relationship of a variable’s values for pairs of locations a certain distance apart, such that they are more similar (or less similar) than expected for randomly associated pairs of observations. Patterns of spatial dependence may change over time. In the case of a disease outbreak, patterns of spatial dependence can change very quickly as new cases emerge and social distancing measures are implemented. sec-chp6 illustrates how to measure spatial dependence in the context of spatial data.

    Challenge 1: Measure how spatial dependence changes over time. Hint: compute Moran’s I on the rate of new COVID-19 cases (i.e. n_covid19_r in the covid19 data frame) at multiple time points.
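To show the mechanics behind the hint, here is a minimal base R sketch of Moran’s I evaluated at several time points. Everything in it is a toy assumption: four areas arranged on a line, a binary contiguity matrix, and made-up rates. It does not use the covid19 data frame, for which you would build spatial weights from the UTLA geometries as in the earlier chapter on spatial dependence.

```r
# Moran's I for a numeric vector x and a spatial weights matrix w
morans_i <- function(x, w) {
  z <- x - mean(x)   # deviations from the mean
  n <- length(x)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Four areas on a line: each area neighbours the next one (binary weights)
w <- matrix(0, 4, 4)
for (i in 1:3) {
  w[i, i + 1] <- 1
  w[i + 1, i] <- 1
}

# Toy rates: rows are areas, columns are time points
rates <- cbind(
  t1 = c(1, 2, 3, 4),  # smooth gradient: neighbours are similar
  t2 = c(1, 4, 2, 3),
  t3 = c(4, 1, 4, 1)   # alternating pattern: neighbours are dissimilar
)

# One Moran's I value per time point
i_by_time <- apply(rates, 2, morans_i, w = w)
i_by_time
```

In this toy example, Moran’s I moves from positive at `t1` (1/3) to strongly negative at `t3` (-1), which is the kind of change over time the challenge asks you to track with the real data.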

    @@ -759,7 +759,7 @@ -

    For a good introduction to time-series analysis in R, refer to Hyndman and Athanasopoulos (2018) and DataCamp.

    +

    For a good introduction to time-series analysis in R, refer to Hyndman and Athanasopoulos (2018) and DataCamp.

    10.6 Spatio-Temporal Data Modelling

    Having gained some understanding of the spatio-temporal patterns of COVID-19 spread through data exploration, we are ready to start further examining structural relationships between the rate of new infections and local contextual factors via regression modelling across UTLAs. Specifically, we consider the number of new cases per 100,000 people to capture the rate of new infections and only one contextual factor; that is, the share of population suffering from long-term sickness or disability. We will consider some basic statistical models, in the form of linear regression and generalized linear models, to account for spatio-temporal dependencies in the data. Note that we do not consider more complex structures based on hierarchical models or spatio-temporal weighted regression models, which would be the natural next step.

    @@ -796,7 +796,7 @@

    These basis functions are incorporated as independent variables in the regression model. Additionally, we also include the share of population suffering from long-term illness, as we know it is highly correlated with the cumulative number of COVID-19 cases. The share of population suffering from long-term illness is incorporated as a spatially-variant, temporally-invariant covariate, given that it relies on 2011 census data.

    10.6.2 Fitting Spatio-Temporal Models

    -

    As indicated at the start of this Chapter, we use the FRK framework developed by Cressie and Johannesson (2008). It provides a scalable approach, relies on the use of a spatial random effects model (with which we have some familiarity) and can be easily implemented in R using the FRK package (Zammit-Mangion and Cressie 2017). In this framework, spatially correlated errors can be decomposed using a linear combination of spatial basis functions, effectively addressing issues of spatio-temporal dependence and nonstationarity. The specification of spatio-temporal basis functions is a key component of the model, and they can be generated automatically or by the user via the FRK package. We will use the automatically generated functions. While, as we will notice, they are difficult to interpret, user-generated functions require a greater understanding of the spatio-temporal structure of COVID-19, which is beyond the scope of this Chapter.

    +

    As indicated at the start of this Chapter, we use the FRK framework developed by Cressie and Johannesson (2008). It provides a scalable approach, relies on the use of a spatial random effects model (with which we have some familiarity) and can be easily implemented in R using the FRK package (Zammit-Mangion and Cressie 2017). In this framework, spatially correlated errors can be decomposed using a linear combination of spatial basis functions, effectively addressing issues of spatio-temporal dependence and nonstationarity. The specification of spatio-temporal basis functions is a key component of the model, and they can be generated automatically or by the user via the FRK package. We will use the automatically generated functions. While, as we will notice, they are difficult to interpret, user-generated functions require a greater understanding of the spatio-temporal structure of COVID-19, which is beyond the scope of this Chapter.

    Prepare Data

    The first step is to create a data frame with the variables that we will consider for the analysis. We first remove the geometries to convert covid19_spt from a simple feature object to a data frame and then compute the share of the population with long-term illness.

    @@ -1076,7 +1076,7 @@

    Note that you may need to install the huxtable package.

    10.6.2.1 Model Comparison

    -

    To compare regression models based on different specifications and assumptions, like those reported above, you may want to consider the cross-validation approach used in Chapter 5. An alternative approach if you would like to get a quick sense of model fit is to explore the correlation between observed and predicted values of the dependent variable. For our models, we can achieve this by executing:

    +

    To compare regression models based on different specifications and assumptions, like those reported above, you may want to consider the cross-validation approach used in sec-chp5. An alternative approach if you would like to get a quick sense of model fit is to explore the correlation between observed and predicted values of the dependent variable. For our models, we can achieve this by executing:

    # computing predictions for all models
     lm_cnt <- predict(lm_m)
    @@ -1118,7 +1118,7 @@
     Option 4 Check for collinearity. Collinearity is likely to be an issue given the way basis functions are created. Checking for collinearity will of course not improve the fit of the existing model, but it is important to remove collinear terms if statistical inference is a key goal, which in this case it is. Over to you now!
     

    10.7 Questions

    -

    We will continue to use the COVID-19 dataset. Please see Chapter 11 for details on the data.

    +

    We will continue to use the COVID-19 dataset. Please see sec-chp11 for details on the data.

    sdf <- st_read("data/assignment_2_covid/covid19_eng.gpkg")
    @@ -1146,7 +1146,7 @@ -