---
title: "Kevin's TidyTuesdays"
output:
  html_document:
    code_folding: hide
    toc: yes
    toc_depth: 3
    toc_float: yes
editor_options:
  chunk_output_type: console
---
```{r, message = FALSE, echo = FALSE}
knitr::opts_chunk$set(warning = FALSE,
                      message = FALSE)
library(tidyverse)
library(plotly)
library(ggsci)
library(directlabels)
library(geofacet)
theme_set(theme_classic(18) +
            theme(legend.position = "bottom"))
```
# Introduction
This webpage was originally designed to host all of my work for the [tidytuesday](https://github.com/rfordatascience/tidytuesday) challenge.
However, as a data scientist, I often need a quick workspace to host my other statistical analyses. Since my analytical interests are not limited to tidytuesday datasets, I also use this repo to host a whole range of RMarkdown HTML reports.
This webpage is deployed using GitHub Pages and GitHub Actions, and is divided into years.
# 2021-03-23 UNvotes
The code is available [here](2021-03-23-UNvotes/UNvotes.html).
This dataset contains the voting records of countries in the United Nations. The main questions I was interested in answering were: which countries had similar voting patterns in the past, and how have those patterns shifted over the years?
The main method used here is time series clustering on the percentage of times that a country has voted yes across all issues raised at the UN.
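As a minimal sketch of the quantity being clustered, the percentage of yes votes per country and year could be computed as below. The toy data and the column names `country`, `year` and `vote` are assumptions made for illustration only; they are not taken from the actual analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical toy voting records: one row per country per vote
votes <- tribble(
  ~country, ~year, ~vote,
  "A",      2000,  "yes",
  "A",      2000,  "no",
  "A",      2001,  "yes",
  "B",      2000,  "abstain",
  "B",      2000,  "yes",
  "B",      2001,  "yes",
  "B",      2001,  "yes"
)

# Percentage of "yes" votes for each country in each year;
# "no" and "abstain" both count as non-yes
pct_yes <- votes %>%
  group_by(country, year) %>%
  summarise(pct_yes = 100 * mean(vote == "yes"), .groups = "drop")

pct_yes
```

Each country then contributes one curve of `pct_yes` over `year`, which is the kind of input a functional clustering method works on.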
What I learnt:
+ `funFEM` can be a great package for functional clustering. However, setting up the basis functions and the number of clusters requires a bit of exploration, as there will be degeneracy if the two numbers are too close to each other.
+ Doing a similar exercise in Python made me realise how consistent the `tidyverse` is when wrangling data.
Main findings:
+ The United States had an increasingly higher percentage of non-yes votes (i.e. voting either no or abstain). See cluster 6.
+ Former Soviet states voted quite consistently in the past, but have since diverged. See cluster 4.
+ A number of European countries showed very similar voting patterns. Interestingly, Australia and New Zealand are also in this group. See clusters 3 and 8.
+ South American countries are clustered together in cluster 5 and Middle Eastern countries (along with China) are in cluster 12.
<img src="2021-03-23-UNvotes/UNvotes_clustering.png" width="1000" height="500" />
# 2021-04-06 Deforestation
The code is available [here](2021-04-06-deforestation/2021-04-06-deforestation.html).
The [original tidytuesday dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-06/readme.md) consists of 5 data sets. I have chosen to use only one, the `forest` data, because it contains the deforestation measurements that I was most interested in.
In this data, deforestation is recorded as `net_forest_conversion`, in units of hectares.
The main visualisation package used here is `plotly`.
Some interesting points in producing this plot:
+ Again, `plotly` was as reliable as always in making interactive visualisations. However, I found the following two documentation pages confusing.
  + https://plotly.com/r/animations/ is a much better example of how to introduce sliders into the plot. Despite the title, https://plotly.com/r/sliders/ is much more advanced than its name suggests, and it is more a page on very fine customisation.
+ `tidyr::complete` is a very well designed function. In wrangling the data, there was an issue with the forest data, in that some countries' deforestation values for some years were not recorded. These missing entries meant that I had to complete the data.
The solution was simple:
```
forest %>%
  tidyr::complete(expand(., nesting(entity, code), year),
                  fill = list(net_forest_conversion = NA))
```
In this code chunk:
+ `complete` is the main action to be performed on the data
+ Each column named in `expand(...)` will then have its unique values expanded (in the sense of a Cartesian set product).
+ `nesting` looks at both the `entity` and `code` columns (the country name and the country's 3-letter ISO code) and extracts the unique combinations that appear in the data. Its placement within the `expand(...)` function was the key here, because each `entity` can only have one `code` and vice versa. The use of `nesting` keeps both columns in the final output.
+ The `fill = ...` argument then specifies the value used to fill the missing entries in each named column.
+ Similar functions exist in Python, but as this [post](https://stackoverflow.com/questions/44287445/pandas-or-python-equivalent-of-tidyr-complete) suggests, the process is much more manual and requires a specific combination of functions.
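To make the role of `nesting` concrete, here is a small sketch on toy data. The values are made up for illustration, and the call passes `nesting(entity, code)` directly to `complete` rather than going through `expand(...)`, which is a slightly simpler form than the chunk above.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: Fiji has no row for the year 2000
toy <- tribble(
  ~entity,     ~code, ~year, ~net_forest_conversion,
  "Australia", "AUS", 1990,  -10,
  "Australia", "AUS", 2000,   20,
  "Fiji",      "FJI", 1990,    5
)

# With nesting(): only the (entity, code) pairs seen in the data are
# crossed with year, giving 2 pairs x 2 years = 4 rows
completed <- complete(toy, nesting(entity, code), year,
                      fill = list(net_forest_conversion = NA_real_))

# Without nesting(): entity, code and year are all crossed, giving
# 2 x 2 x 2 = 8 rows, including impossible pairs such as
# ("Australia", "FJI")
crossed <- complete(toy, entity, code, year)

nrow(completed)
nrow(crossed)
```

The only new row in `completed` is Fiji in 2000, with `net_forest_conversion` set to `NA`.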
**Click and drag the globe to interact with this map**.
```{r}
subforest = read_csv(file = "2021-04-06-deforestation/subforest.csv")
fig <- plot_ly(
  subforest,
  type = 'choropleth',
  locations = ~code,
  z = ~net_forest_conversion_log10,
  text = ~entity,
  frame = ~year,
  colors = "RdYlGn") %>%
  layout(
    geo = list(projection = list(type = "orthographic")),
    showlegend = FALSE)
fig
```
Main findings:
+ Australia is pretty interesting in that, prior to 2010, its forest area was decreasing, but it has since turned that in the positive direction.
+ China had consistently high increases in forest area. This is something that has been [published in Nature](https://www.nature.com/articles/s41467-019-13798-8). This fact might sound surprising to some as China enjoyed rapid economic growth and industrialisation in the past few decades.
# 2021-06-29 Animal rescue in London
The code is available [here](2021-06-29-animal-rescues/2021-06-29-animal-rescues.html).
This dataset records the animals rescued around the city of London. The main aim of this analysis is to visualise the spatial distribution of the animal types around the city.
The main visualisation packages used here were `plotly` and Mapbox.
Some interesting points in producing this plot:
+ I would highly recommend Mapbox for its elegant look at relatively little coding effort. Even though Mapbox requires you to register, I find that it offers a generous number of free queries.
+ The `plotly` (yyds!) package in `R` offers a good interface to Mapbox, and it is what I chose for my final visualisation.
+ I would also highly recommend the `mapdeck` [(link)](https://symbolixau.github.io/mapdeck/index.html) package. It has a slightly different look to what `plotly` offers but could be just as good.
**Click the legend to interact with the map**, e.g. you can double click "fox" to only display foxes around the city.
```{r}
subdata = readRDS(file = "2021-06-29-animal-rescues/sub_animal_rescues.rds")
# plotly reads the Mapbox access token from the MAPBOX_TOKEN environment variable
Sys.setenv('MAPBOX_TOKEN' = Sys.getenv('MAPBOX_AUTH'))
fig = subdata %>%
  plot_mapbox(lat = ~latitude, lon = ~longitude,
              split = ~animal_type,
              size = 2,
              mode = 'scattermapbox',
              # 'name' shows the trace name, i.e. the animal type from `split`
              hoverinfo = 'name')
fig = fig %>%
  layout(title = 'Animals rescued around London',
         font = list(color = 'white'),
         plot_bgcolor = '#191A1A',
         paper_bgcolor = '#191A1A',
         mapbox = list(style = 'dark',
                       zoom = 7,
                       center = list(
                         lat = median(subdata$latitude),
                         lon = median(subdata$longitude))),
         legend = list(orientation = 'h',
                       font = list(size = 8)),
         margin = list(l = 25, r = 25,
                       b = 25, t = 25,
                       pad = 2))
fig
```
Main findings:
+ Birds were the only type of animal concentrated in the middle of the city. One explanation is that they fly around and, when they need rescue, the incidents are reported in places where there are more people. Alternatively, birds may be in more danger in the city due to the high-rise buildings.
+ Most interestingly, foxes were found throughout the city, and a good number of them appeared in the middle of the city too. I personally find that interesting, because I have never seen a fox in city streets or parks before.
+ As expected, larger animals like horses and deer were rescued on the outskirts of the city.
+ Cats and dogs were rescued all around the city, likely because these were domestic pets that ran into danger.
# 2021-07-24 US droughts
The code is available [here](2021-07-20-droughts/2021-07-20-droughts.html).
This data is a bit hard to evaluate because there are different levels of drought:
+ None: no drought
+ D0: abnormally dry
+ D1: moderate drought
+ D2: severe drought
+ D3: extreme drought
+ D4: exceptional drought
The main variables indicating droughts are:
+ area_pct: Percent of state currently in that drought category
+ area_total: Total land area (sq miles) of state currently in that drought category
+ pop_pct: Population percent of total state population in that drought category
+ pop_total: Population total of that state in that drought category
The data is recorded on a weekly scale between July 2001 and July 2021. Note that these percentages do not sum to 100%.
For now, I have looked only at extreme drought (D3) and the percentage of area impacted. For plotting purposes, I also averaged the percentage on a monthly scale. It is obvious that the areas impacted are spatially correlated.
```{r}
subdata2 = read_rds(file = "2021-07-20-droughts/subset_drought_data.rds")
fig <- plot_ly(
  data = subdata2 %>%
    dplyr::filter(drought_lvl == "D3"),
  type = 'choropleth',
  locationmode = 'USA-states',
  locations = ~state_abb,
  z = ~avg_area_pct,
  frame = ~time,
  color = ~avg_area_pct,
  zauto = FALSE,
  zmin = 0,
  zmax = 100) %>%
  layout(
    title = "Extreme droughts in the US",
    geo = list(
      scope = 'usa'),
    showlegend = FALSE) %>%
  colorbar(title = "Avg. % of area impacted")
fig
```
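The monthly averaging mentioned earlier in this section could look roughly like the sketch below. The weekly toy data, the `week_start` column and the numbers are assumptions made for illustration; only `state_abb`, `drought_lvl`, `area_pct`, `year`, `month` and `avg_area_pct` mirror names used on this page, and the real preprocessing may differ.

```r
library(dplyr)
library(tibble)

# Hypothetical weekly records for one state and one drought level
weekly <- tibble(
  state_abb   = "CA",
  drought_lvl = "D3",
  week_start  = as.Date(c("2021-06-01", "2021-06-08",
                          "2021-07-06", "2021-07-13")),
  area_pct    = c(20, 30, 40, 60)
)

# Average the weekly percentages within each calendar month
monthly <- weekly %>%
  mutate(year  = as.integer(format(week_start, "%Y")),
         month = as.integer(format(week_start, "%m"))) %>%
  group_by(state_abb, drought_lvl, year, month) %>%
  summarise(avg_area_pct = mean(area_pct), .groups = "drop")

monthly
```

In this toy example, June averages to 25 and July to 50.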
## `geofacet`
I think `geofacet` is a great package for visualisations. It offers a twist on the usual `facet_wrap` by placing each facet according to the geographical location of the state being plotted. When using this package with `ggplotly`, I would recommend using `theme_minimal` to hide the excess facets (due to how the package is set up).
```{r, fig.height = 15, fig.width = 25}
p = subdata2 %>%
  dplyr::filter(drought_lvl == "D3") %>%
  ggplot(aes(x = month,
             y = avg_area_pct,
             colour = factor(year),
             label = year)) +
  geom_point(size = 2) +
  geom_dl(method = list("last.points"), position = position_nudge(x = 0.3), size = 3) +
  geom_line(aes(group = year)) +
  scale_x_discrete(expand = c(0, 0, 0, 4),
                   breaks = c("Jan", "Mar", "May", "Jul", "Sep", "Dec")) +
  scale_y_continuous(limits = c(0, 100),
                     n.breaks = 5,
                     labels = scales::percent_format(accuracy = 1, scale = 1)) +
  guides(colour = "none") +
  facet_geo(~state_abb, grid = "us_state_grid2") +
  labs(title = "Extreme droughts in the US",
       x = "",
       y = "Avg. % of area impacted") +
  theme_minimal(18) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
p
```
## California
```{r}
subdata2 %>%
  dplyr::filter(drought_lvl == "D3",
                state_abb == "CA") %>%
  ggplot(aes(x = month,
             y = avg_area_pct,
             colour = factor(year),
             label = year)) +
  geom_point(size = 2) +
  geom_dl(method = list("last.points"), position = position_nudge(x = 0.3), size = 3) +
  geom_line(aes(group = year)) +
  scale_x_discrete(expand = c(0, 0, 0, 4)) +
  scale_y_continuous(limits = c(0, 100),
                     n.breaks = 5,
                     labels = scales::percent_format(accuracy = 1, scale = 1)) +
  guides(colour = "none") +
  labs(x = "",
       y = "Avg. % of area impacted",
       title = "Extreme droughts in California") +
  theme_classic(18) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```