05-regex.Rmd

# Regular expressions {#regex}

```{r, include=FALSE}
source("_common.R", local = knitr::knit_global())
```

*Regular expressions* are patterns that are used to match combinations of
characters in a string. Before we begin, just a cautionary note that if your
regular expressions are becoming too complex, perhaps it is time to step back
and think about whether it is necessary and if they can be represented multiple
expressions that are easier to understand.

## Prerequisites

The functions `str_view()` and `str_view_all()` from the `{stringr}` package
(part of `{tidyverse}`) will be used to learn regular expressions interactively.
`str_view()` shows the first match while `str_view_all()` shows all the matches.

```{r, message=FALSE}
library(tidyverse)
```

## Basic matches

The most basic form of matching is to match exact strings

```{r}
x <- c("abc ABC\n123. !?\\(){}")
cat(x)
```

```{r}
str_view(x, "abc")
```

```{r}
str_view(x, "123")
```

## Character classes {#regex-classes}

Character classes allow you to specify a list of characters for matching.

A bracket `[...]` can be used to specify a list of character. Therefore, it will
match any characters that was specified within the brackets.

```{r}
str_view_all(x, "[bcde]")
```

If a caret `^` is added to the start of the list of characters, it will match
any characters that are NOT in the list.

```{r}
str_view_all(x, "[^bcde]")
```

You can also specify a range expression using a hyphen `-` between two
characters.

```{r}
str_view_all(x, "[a-zA-Z]")
```

You can also specify character classes using pre-defined names of these classes
with a bracket expression.

|    Regex    | What it matches     |
|:-----------:|:--------------------|
| `[:alnum:]` | letters and numbers |
| `[:alpha:]` | letters             |
| `[:upper:]` | uppercase letters   |
| `[:lower:]` | lowercase letters   |
| `[:digit:]` | numbers             |
| `[:punct:]` | punctuations        |
| `[:space:]` | space characters    |

Try out some of the named classes using the function `str_view_all()`

```{r}
str_view_all(x, "[:alnum:]")
```

```{r}
str_view_all(x, "[:punct:]")
```

There are also special metacharacters you can use to match entire classes of
characters.

| Regex | What it matches                                          |
|:-----:|:---------------------------------------------------------|
|  `.`  | any character except newline `"\n"`                      |
| `\d`  | digit                                                    |
| `\D`  | non-digit                                                |
| `\s`  | whitespace                                               |
| `\S`  | non-whitespace                                           |
| `\t`  | tab                                                      |
| `\n`  | newline                                                  |
| `\w`  | "word" i.e., letters (a-z and A-Z), digits (0-9) or (\_) |
| `\W`  | non-"word"                                               |

Note that to include a `\` in a regular expression, you need to escape it using
`\\`. This is explained in the next subsection on escaping.

```{r}
str_view_all(x, ".")
```

```{r}
str_view_all(x, "\\d")
```

```{r}
str_view_all(x, "\\D")
```

## Escaping {#regex-escape}

In regular expressions, the backslash `\` is used as an *escape* character and
is used to "escape" any special characters that comes after the backslash.
However, in R, the same backslashes `\` are also used as an *escape* character
in strings. For example, the string `"abc ABC 123.\n!?\\(){}"` is used to
represent the characters `abc ABC 123.\n!?\(){}`. You will see that there is any
additional backslash `\` in the string representation. This is because `\` is a
special character for strings in R. Therefore, to represent a backslash `\` in a
string, another backslash needs to be added to escape the special representation
of a `\` in strings. This means that to create a string containing "", you need
to write `"\\"`.

```{r}
x <- c("abc ABC\n123. !?\\(){}")
cat(x)
```

Therefore, to create any regular expressions that contains a backslash, you
would need to use a string that contains another backslash `\` to escape the
backslash `\` that forms a part of the regular expression.

For example, how would you create a regular expression to match the character
`"."` if it is defined to match any character except newline. You would need to
escape it with `\.`. However, the backslash is a special character in a string.
Therefore you need the string `"\\."` to represent the regular expression `\.`.
The same logic is applied when representing metacharacters such as `\d`, `\D`,
`\w`, `\W`, etc. You need to use "\\" to represent `\` in regular expressions.

```{r}
str_view_all(x, ".")
```

```{r}
str_view_all(x, "\\.")
```

To represent a backslash `\` as a regular expression, two levels of escape would
be required! To elaborate, to represent a `\` as a regular expression, you would
need to escape it by creating the regular expression `\\`. To represent each of
these `\` you need to use a string, which also requires you to add an additional
`\` to escape it. Therefore, to match a `\` you need to write `\\\\`.

```{r}
str_view(x, "\\\\")
```

## Anchors

Regular expressions will match any part of a string unless you use anchors to
specify positions such as the start or end of the string. Instead of characters,
anchors are used to specify position.

| Regex | Position            |
|:-----:|:--------------------|
|  `^`  | start of string     |
|  `$`  | end of string       |
| `\b`  | word boundaries     |
| `\B`  | non-word boundaries |

You can use `^` to match the start of a string and `$` to match the end of a
string. For a complete match, you could anchor a regular expression with
`^...$`.

```{r}
str_view_all(c("apple", "apple pie", "juicy apple"), "apple")
```

```{r}
str_view_all(c("apple", "apple pie", "juicy apple"), "^apple")
```

```{r}
str_view_all(c("apple", "apple pie", "juicy apple"), "apple$")
```

```{r}
str_view_all(c("apple", "apple pie", "juicy apple"), "^apple$")
```

`\b` is used to match a position known as word boundary. You can use `\b` to
match the start or end of a word. You can think of `\B` as the inverse of `\b`
and basically it matches any position that `\b` does not.

```{r}
str_view_all(c("apple", "pineapple", "juicy apple"), "\\bapple\\b")
```

```{r}
str_view_all(c("apple", "pineapple", "juicy apple"), "apple\\b")
```

```{r}
str_view_all(c("apple", "pineapple", "juicy apple"), "\\Bapple\\B")
```

```{r}
str_view_all(c("apple", "pineapple", "juicy apple"), "\\Bapple")
```

## Quantifies

You can use quantifiers to specify the number of times a pattern matches.

|  Regex  | Number of times                      |
|:-------:|:-------------------------------------|
|   `+`   | one or more                          |
|   `*`   | zero or more                         |
|   `?`   | zero or one                          |
|  `{m}`  | exactly `m`                          |
| `{m,}`  | at least `m` (`m` or more)           |
| `{m,n}` | between `m` and `n` (both inclusive) |

```{r}
str_view(c("a", "abb", "abbb"), "ab+")
```

```{r}
str_view(c("a", "abb", "abbb"), "ab*")
```

```{r}
str_view(c("a", "abb", "abbb"), "ab?")
```

```{r}
str_view(c("a", "abb", "abbb"), "ab{1}")
```

```{r}
str_view(c("a", "abb", "abbb"), "ab{1,}")
```

```{r}
str_view(c("a", "abb", "abbb"), "ab{1,2}")
```

By default, quantifies are applied to a single character. You can use `(...)` to
apply quantifies to more than one character.

```{r}
str_view(c("ab", "abab", "ababab"), "ab+")
```

```{r}
str_view(c("ab", "abab", "ababab"), "(ab)+")
```

```{=html}
<!--## Applications TODO

Combining what you have learned about regular expressions with data transformation skills picked up in the previous chapter can bring about many useful applications. 


If your regular expression is 

For example, the format of the outputs from building energy simulation models are often not in a tidy format -->
```