-
Notifications
You must be signed in to change notification settings - Fork 19
/
11-datasets.qmd
103 lines (73 loc) · 4.35 KB
/
11-datasets.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
# Data Sets {#sec-chp11}
```{r}
#| warning: false
library(sf)
```
## Madrid AirBnb
This dataset contains a pre-processed set of properties advertised on the AirBnb website within the region of Madrid (Spain), together with house characteristics.
### Availability {.unnumbered}
The dataset is stored on a Geopackage that can be found, within the structure of this project, under:
```{r}
path <- "data/assignment_1_madrid/madrid_abb.gpkg"
```
```{r}
db <- st_read(path)
```
### Variables {.unnumbered}
For each of the `r length(db)` properties, the following characteristics are available:
- `price`: \[string\] Price with currency
- `price_usd`: \[int\] Price expressed in USD
- `log1pm_price_usd`: \[float\] Log of the price
- `accommodates`: \[integer\] Number of people the property accommodates
- `bathrooms`: \[integer\] Number of bathrooms
- `bedrooms`: \[integer\] Number of bedrooms
- `beds`: \[integer\] Number of beds
- `neighbourhood`: \[string\] Name of the neighbourhood the property is located in
- `room_type`: \[string\] Type of room offered (shared, private, entire home, hotel room)
- `property_type`: \[string\] Type of property advertised (apartment, house, hut, etc.)
- `WiFi`: \[binary\] Takes `1` if the property has WiFi, `0` otherwise
- `Coffee`: \[binary\] Takes `1` if the property has a coffee maker, `0` otherwise
- `Gym`: \[binary\] Takes `1` if the property has access to a gym, `0` otherwise
- `Parking`: \[binary\] Takes `1` if the property offers parking, `0` otherwise
- `km_to_retiro`: \[float\] Euclidean distance from the property to the El Retiro park
- `geom`: \[geometry\] Point geometry
### Projection {.unnumbered}
The location of each property is stored as point geometries and expressed in longitude and latitude coordinates:
```{r}
st_crs(db)
```
### Source & Pre-processing {.unnumbered}
The data are sourced from [Inside Airbnb](http://insideairbnb.com/). A Jupyter notebook in Python (available at `data/assignment_1_madrid/clean_data.ipynb`) details the process from the original file available from source to the data in `madrid_abb.gpkg`.
## England COVID-19
This dataset contains:
- daily COVID-19 confirmed cases from 1st January, 2020 to 2nd February, 2021 from the [GOV.UK dashboard](https://coronavirus.data.gov.uk);
- resident population characteristics from the 2011 census, available from the [Office of National Statistics](https://www.nomisweb.co.uk/home/census2001.asp); and,
- 2019 Index of Multiple Deprivation (IMD) data from [GOV.UK](https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019) and published by the Ministry of Housing, Communities & Local Government.
The data are at the Upper Tier Local Authority District (UTLAD) level - also known as [Counties and Unitary Authorities](https://geoportal.statistics.gov.uk).
### Availability {.unnumbered}
The dataset is stored on a Geopackage:
```{r}
sdf <- st_read("data/assignment_2_covid/covid19_eng.gpkg")
```
### Variables {.unnumbered}
The data set contains 508 variables:
- `objectid`: \[integer\] unit identifier
- `ctyua19cd`: \[integer\] Upper Tier Local Authority District (or Counties and Unitary Authorities) identifier
- `ctyua19nm`: \[character\] Upper Tier Local Authority District (or Counties and Unitary Authorities) name
- `Region`: \[character\] Region name
- `long`: \[numeric\] longitude
- `lat`: \[numeric\] latitude
- `st_areasha`: \[numeric\] area in hectare
- `X2020.01.31` to `X2021.02.05`: \[numeric\] Daily COVID-19 cases from 31st January, 2020 to 5th February, 2021
- `IMD...Average.score` - `IMD.2019...Local.concentration`: \[numeric\] IMD indicators - for details see [File 11: upper-tier local authority summaries](https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019).
- `Residents`: \[numeric\] Total resident population
- `Households`: \[numeric\] Total households
- `Dwellings`: \[numeric\] Total dwellings
- `Household_Spaces`: \[numeric\] Total household spaces
- `Aged_16plus` to `Other_industry`: \[numeric\] comprise 114 variables relating to various population and household attributes of the resident population. A description of all these variables can be found [here](data/assignment_2_covid/census_vars.csv)
- `geom`: \[geometry\] Point geometry
### Projection {.unnumbered}
Details of the coordinate reference system:
```{r}
st_crs(sdf)
```