---
title: "Introduction to R and RStudio"
subtitle: "Session - {janitor} clean data"
---
## Specific packages to clean data
Packages like {janitor} provide functions that handle many common data-cleaning tasks, such as:
::: incremental
- Remove blank rows and columns
- Change Excel serial dates to real dates
- Standardise column names and remove spaces
:::
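As a minimal sketch of the date conversion listed above, {janitor}'s `excel_numeric_to_date()` turns an Excel serial number into a `Date`. The serial number here is an illustrative value, not taken from the data used in this session, and the default (Windows Excel) date system is assumed.

```{r}
library(janitor)

# 44562 is an illustrative Excel serial number (assumed, not from the session data)
serial_date <- 44562

# Convert using the default "modern" (Windows Excel) date system
real_date <- excel_numeric_to_date(serial_date)
real_date
```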
## Example of cleaning column headers
Get the data used in the following slides:
```{r}
library(tidyverse) # Provides read_csv(), mutate() and add_row()
by_ethnicity <- read_csv("https://www.ethnicity-facts-figures.service.gov.uk/culture-and-community/digital/internet-use/latest/downloads/by-ethnicity.csv")
```
## Changing the column names
`clean_names()` removes spaces and changes the `%` symbol to the word `percent`:
```{r}
library(janitor)
by_ethnicity |>
  clean_names()
```
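To see the renaming without downloading the file, the sketch below uses `make_clean_names()`, the {janitor} counterpart of `clean_names()` for a plain character vector. The headers are copied from the table in the "Getting duplicates" slide, and the names `raw_names` and `cleaned` are made up for illustration.

```{r}
library(janitor)

# Headers copied from the tribble in the "Getting duplicates" slide
raw_names <- c("Ethnicity", "%", "estimated.number.(thousands)")

# make_clean_names() applies the cleaning rules to a character vector
cleaned <- make_clean_names(raw_names)
cleaned
```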
## Example of removing blank rows and columns
```{r}
# Add in blank row and column
by_ethnicity_blank <- by_ethnicity |>
  mutate(blank_column = NA) |> # Blank column
  add_row() # Blank row

# Remove rows and columns that are entirely empty
by_ethnicity_blank |>
  remove_empty(which = c("rows", "cols"))
```
## Getting duplicates
Code often removes duplicates, but sometimes you'll want to see every duplicated row:
```{r}
#| eval: true
duplicates <- tibble::tribble(
  ~Ethnicity, ~`%`, ~`estimated.number.(thousands)`,
  "All", 90.8, 48098,
  "All", 90.8, 48098,
  "All", 90.8, 48098,
  "Bangladeshi", 91.9, 354,
  "Chinese", 98.6, 265,
  "Indian", 90.4, 1077,
  "Pakistani", 91.1, 767,
  "Asian other", 95.6, 620,
  "Black", 92.8, 1376,
  "Mixed", 96, 547,
  "White", 90.5, 42296,
  "Other", 94.5, 796
)
```
. . .
```{r}
#| eval: true
library(janitor)
duplicates |>
  get_dupes()
```
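For contrast with the usual approach of dropping duplicates, the sketch below runs `dplyr::distinct()` and `get_dupes()` on a cut-down copy of the table above. The object names `small`, `deduplicated` and `dupes` are illustrative only.

```{r}
library(dplyr)
library(janitor)

# A cut-down copy of the duplicates table: "All" appears three times
small <- tibble::tribble(
  ~Ethnicity,    ~`%`, ~`estimated.number.(thousands)`,
  "All",         90.8, 48098,
  "All",         90.8, 48098,
  "All",         90.8, 48098,
  "Bangladeshi", 91.9,   354
)

# distinct() removes the repeats and keeps one copy of each row
deduplicated <- small |>
  distinct()

# get_dupes() keeps only the repeated rows and adds a dupe_count column
dupes <- small |>
  get_dupes()
```

With no columns named, `get_dupes()` compares whole rows; passing column names, for example `get_dupes(Ethnicity)`, checks for duplicates on those columns only.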
## End session