session-janitor.qmd

---
title: "Introduction to R and Rstudio"
subtitle: "Session - {janitor} clean data"
---

## Specific packages to clean data

Packages like {janitor} have functions to do a lot of the cleaning required for data like:

::: incremental
-   Remove blank rows and columns
-   Change Excel serial dates to read dates
-   Standardise column names and remove spaces
:::

## Example of cleaning column headers

Getting the data from following slides 

```{r}

by_ethnicity <- read_csv("https://www.ethnicity-facts-figures.service.gov.uk/culture-and-community/digital/internet-use/latest/downloads/by-ethnicity.csv")
```

## Changing the column names

Removes spaces and changes the `%` to a word

```{r}
library(janitor)

by_ethnicity |> 
  clean_names()
```

## Example of removing blank rows and columns

```{r}
# Add in blank row and column

by_ethnicity_blank <- by_ethnicity |> 
  mutate(blank_column = NA) |> # Blank column
  add_row() # Blank row

by_ethnicity_blank |> 
  remove_empty(which = c("rows", "cols"))
```

## Getting duplicates

Often code removes duplicates but sometimes you'll want to see all the duplicated information:

```{r}
#| eval: true

duplicates <- tibble::tribble(
     ~Ethnicity, ~`%`, ~`estimated.number.(thousands)`,
          "All", 90.8,                           48098,
          "All", 90.8,                           48098,
          "All", 90.8,                           48098,
  "Bangladeshi", 91.9,                             354,
      "Chinese", 98.6,                             265,
       "Indian", 90.4,                            1077,
    "Pakistani", 91.1,                             767,
  "Asian other", 95.6,                             620,
        "Black", 92.8,                            1376,
        "Mixed",   96,                             547,
        "White", 90.5,                           42296,
        "Other", 94.5,                             796
  )

```

. . .

```{r}
#| eval: true

library(janitor)
duplicates |> 
  get_dupes()
```

## End session