# Regular expressions {#regex}
```{r, include=FALSE}
source("_common.R", local = knitr::knit_global())
```
*Regular expressions* are patterns used to match combinations of characters in
a string. A cautionary note before we begin: if your regular expressions are
becoming too complex, step back and consider whether that complexity is
necessary, or whether the pattern can be expressed as multiple simpler
expressions that are easier to understand.
## Prerequisites
The functions `str_view()` and `str_view_all()` from the `{stringr}` package
(part of `{tidyverse}`) will be used to learn regular expressions interactively.
`str_view()` shows the first match while `str_view_all()` shows all the matches.
```{r, message=FALSE}
library(tidyverse)
```
## Basic matches
The most basic form of matching is to match an exact string.
```{r}
x <- c("abc ABC\n123. !?\\(){}")
cat(x)
```
```{r}
str_view(x, "abc")
```
```{r}
str_view(x, "123")
```
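Outside of interactive viewing, the same literal patterns can be checked programmatically. The following is a minimal sketch, assuming `{stringr}` is installed; the variable names (`has_abc`, `all_abc`, and so on) are illustrative and not part of the original example.

```{r}
library(stringr)

x <- c("abc ABC\n123. !?\\(){}")

# str_detect() returns TRUE if the pattern occurs anywhere in the string
has_abc <- str_detect(x, "abc")
has_abd <- str_detect(x, "abd")

# Matching is case-sensitive: only the lowercase "abc" is found, not "ABC"
first_abc <- str_extract(x, "abc")
all_abc <- str_extract_all(x, "abc")[[1]]
```

Here `has_abc` is `TRUE`, `has_abd` is `FALSE`, and `all_abc` contains a single match because `"ABC"` differs in case.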
## Character classes {#regex-classes}
Character classes allow you to specify a set of characters for matching.
Square brackets `[...]` enclose a list of characters, and the expression
matches any single character in that list.
```{r}
str_view_all(x, "[bcde]")
```
If a caret `^` is added to the start of the list of characters, it will match
any character that is not in the list.
```{r}
str_view_all(x, "[^bcde]")
```
You can also specify a range expression using a hyphen `-` between two
characters.
```{r}
str_view_all(x, "[a-zA-Z]")
```
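The three forms of bracket expression above (a list, a negated list, and a range) can also be compared by extracting or counting their matches. This is a rough sketch, assuming `{stringr}` is installed; the variable names are hypothetical.

```{r}
library(stringr)

x <- c("abc ABC\n123. !?\\(){}")

# List: only "b" and "c" occur in x ("d" and "e" do not)
in_list <- str_extract_all(x, "[bcde]")[[1]]

# Negated list: every character except "b", "c", "d", "e" (newline included)
n_not_in_list <- str_count(x, "[^bcde]")

# Range: all lowercase and uppercase ASCII letters
letters_only <- str_extract_all(x, "[a-zA-Z]")[[1]]
```

The string has 20 characters, two of which are in the list, so the negated class matches the remaining 18.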
You can also specify character classes using pre-defined names of these classes
with a bracket expression.
| Regex | What it matches |
|:-----------:|:--------------------|
| `[:alnum:]` | letters and numbers |
| `[:alpha:]` | letters |
| `[:upper:]` | uppercase letters |
| `[:lower:]` | lowercase letters |
| `[:digit:]` | digits |
| `[:punct:]` | punctuation |
| `[:space:]` | space characters |
Try out some of the named classes using the function `str_view_all()`:
```{r}
str_view_all(x, "[:alnum:]")
```
```{r}
str_view_all(x, "[:punct:]")
```
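Named classes can also be placed inside an outer bracket expression, which makes it possible to combine several of them. The sketch below assumes `{stringr}` is installed and uses illustrative variable names.

```{r}
library(stringr)

x <- c("abc ABC\n123. !?\\(){}")

# A single named class inside a bracket expression
digits <- str_extract_all(x, "[[:digit:]]")[[1]]

# Two named classes combined: uppercase letters or digits
upper_or_digit <- str_extract_all(x, "[[:upper:][:digit:]]")[[1]]

# Count whitespace characters (two spaces and one newline)
n_space <- str_count(x, "[[:space:]]")
```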
There are also special metacharacters you can use to match entire classes of
characters.
| Regex | What it matches |
|:-----:|:---------------------------------------------------------|
| `.` | any character except newline `"\n"` |
| `\d` | digit |
| `\D` | non-digit |
| `\s` | whitespace |
| `\S` | non-whitespace |
| `\t` | tab |
| `\n` | newline |
| `\w` | "word" character, i.e., a letter (a-z and A-Z), a digit (0-9) or an underscore (`_`) |
| `\W` | non-"word" |
Note that when a regular expression is written as an R string, each `\` must
be typed as `\\`. This is explained in the next section on escaping.
```{r}
str_view_all(x, ".")
```
```{r}
str_view_all(x, "\\d")
```
```{r}
str_view_all(x, "\\D")
```
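Counting matches is a quick way to check what each metacharacter covers. The following sketch assumes `{stringr}` is installed; the variable names are illustrative only.

```{r}
library(stringr)

x <- c("abc ABC\n123. !?\\(){}")

# "." matches every character except the newline
n_any <- str_count(x, ".")

# "\\d" matches each digit; "\\w" matches each "word" character
digits <- str_extract_all(x, "\\d")[[1]]
n_word <- str_count(x, "\\w")

# "\\s" matches whitespace, including the newline
n_whitespace <- str_count(x, "\\s")
```

Of the 20 characters in `x`, `.` matches 19 (all but the newline), `\w` matches the 9 letters and digits, and `\s` matches the two spaces and the newline.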
## Escaping {#regex-escape}
In regular expressions, the backslash `\` is used as an *escape* character: it
"escapes" the special character that comes after it. However, in R, the
backslash `\` is also the *escape* character in strings. For example, the
string `"abc ABC\n123. !?\\(){}"` represents the characters `abc ABC`, a
newline, and `123. !?\(){}`. Notice that the string representation contains an
additional backslash `\`. This is because `\` is a special character for
strings in R. Therefore, to represent a backslash `\` in a string, another
backslash needs to be added to escape its special meaning. This means that to
create a string containing `\`, you need to write `"\\"`.
```{r}
x <- c("abc ABC\n123. !?\\(){}")
cat(x)
```
Therefore, to create a regular expression that contains a backslash, you need
a string in which each backslash `\` of the regular expression is itself
escaped with another backslash `\`.
For example, how would you create a regular expression to match the character
`.`, given that `.` is defined to match any character except newline? You would
need to escape it with `\.`. However, the backslash is a special character in a
string. Therefore, you need the string `"\\."` to represent the regular
expression `\.`. The same logic applies to metacharacters such as `\d`, `\D`,
`\w`, `\W`, etc. You need to write `\\` in the string to represent `\` in the
regular expression.
```{r}
str_view_all(x, ".")
```
```{r}
str_view_all(x, "\\.")
```
To match a literal backslash `\`, two levels of escaping are required. First,
the regular expression must escape the backslash, giving `\\`. Second, each of
these two backslashes must itself be escaped in the R string. Therefore, to
match a `\` you need to write `"\\\\"`.
```{r}
str_view(x, "\\\\")
```
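To see the two levels of escaping separately, it can help to print the pattern string with `writeLines()`, which shows the characters the regular expression engine receives. This is a minimal sketch, assuming `{stringr}` is installed; the variable names are hypothetical.

```{r}
library(stringr)

# The R string "\\." holds two characters: a backslash and a dot
n_chars <- nchar("\\.")

# writeLines() prints the string contents, i.e. the regular expression itself
writeLines("\\.")    # the regex \.
writeLines("\\\\")   # the regex \\

x <- c("abc ABC\n123. !?\\(){}")

# Extract the literal dot and the literal backslash from x
dot <- str_extract(x, "\\.")
backslash <- str_extract(x, "\\\\")

# The extracted backslash is one character, even though R prints it as "\\"
n_backslash <- nchar(backslash)
```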
## Anchors
Regular expressions will match any part of a string unless you use anchors.
Rather than matching characters, anchors match positions, such as the start or
end of the string.
| Regex | Position |
|:-----:|:--------------------|
| `^` | start of string |
| `$` | end of string |
| `\b` | word boundaries |
| `\B` | non-word boundaries |
You can use `^` to match the start of a string and `$` to match the end of a
string. For a complete match, you could anchor a regular expression with
`^...$`.
```{r}
str_view_all(c("apple", "apple pie", "juicy apple"), "apple")
```
```{r}
str_view_all(c("apple", "apple pie", "juicy apple"), "^apple")
```
```{r}
str_view_all(c("apple", "apple pie", "juicy apple"), "apple$")
```
```{r}
str_view_all(c("apple", "apple pie", "juicy apple"), "^apple$")
```
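The effect of the `^` and `$` anchors can also be expressed as logical tests. The following sketch assumes `{stringr}` is installed; `fruits` and the other variable names are illustrative.

```{r}
library(stringr)

fruits <- c("apple", "apple pie", "juicy apple")

# "^apple" requires the match to begin at the start of the string
starts_with_apple <- str_detect(fruits, "^apple")

# "apple$" requires the match to finish at the end of the string
ends_with_apple <- str_detect(fruits, "apple$")

# "^apple$" requires the whole string to be exactly "apple"
exactly_apple <- str_subset(fruits, "^apple$")
```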
`\b` matches a position known as a word boundary. You can use `\b` to match
the start or end of a word. `\B` is the inverse of `\b`: it matches any
position that `\b` does not.
```{r}
str_view_all(c("apple", "pineapple", "juicy apple"), "\\bapple\\b")
```
```{r}
str_view_all(c("apple", "pineapple", "juicy apple"), "apple\\b")
```
```{r}
str_view_all(c("apple", "pineapple", "juicy apple"), "\\Bapple\\B")
```
```{r}
str_view_all(c("apple", "pineapple", "juicy apple"), "\\Bapple")
```
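Word boundaries are useful when a replacement should only affect whole words. This is a small sketch, assuming `{stringr}` is installed; the replacement word `"pear"` and the variable names are made up for illustration.

```{r}
library(stringr)

fruits <- c("apple", "pineapple", "juicy apple")

# "apple" as a whole word: boundaries are required on both sides
whole_word <- str_detect(fruits, "\\bapple\\b")

# "apple" preceded by a non-boundary, i.e. starting inside a longer word
inside_word <- str_detect(fruits, "\\Bapple")

# Replace only the whole-word occurrences; "pineapple" is left untouched
replaced <- str_replace_all(fruits, "\\bapple\\b", "pear")
```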
## Quantifiers
You can use quantifiers to specify the number of times a pattern matches.
| Regex | Number of times |
|:-------:|:-------------------------------------|
| `+` | one or more |
| `*` | zero or more |
| `?` | zero or one |
| `{m}` | exactly `m` |
| `{m,}` | at least `m` (`m` or more) |
| `{m,n}` | between `m` and `n` (both inclusive) |
```{r}
str_view(c("a", "abb", "abbb"), "ab+")
```
```{r}
str_view(c("a", "abb", "abbb"), "ab*")
```
```{r}
str_view(c("a", "abb", "abbb"), "ab?")
```
```{r}
str_view(c("a", "abb", "abbb"), "ab{1}")
```
```{r}
str_view(c("a", "abb", "abbb"), "ab{1,}")
```
```{r}
str_view(c("a", "abb", "abbb"), "ab{1,2}")
```
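The matched text for each quantifier can be compared side by side with `str_extract()`, which returns `NA` where there is no match. The sketch below assumes `{stringr}` is installed; the variable names are illustrative.

```{r}
library(stringr)

s <- c("a", "abb", "abbb")

# Quantifiers are greedy: they match as many "b" characters as allowed
one_or_more  <- str_extract(s, "ab+")      # at least one "b" required
zero_or_more <- str_extract(s, "ab*")      # "b" is optional, any number
zero_or_one  <- str_extract(s, "ab?")      # at most one "b"
exactly_one  <- str_extract(s, "ab{1}")    # exactly one "b"
at_least_one <- str_extract(s, "ab{1,}")   # same as "ab+"
one_to_two   <- str_extract(s, "ab{1,2}")  # one or two "b"
```

Note that `"a"` only matches the patterns that allow zero occurrences of `b`, namely `ab*` and `ab?`.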
By default, a quantifier applies to the single character that precedes it. You
can use `(...)` to apply a quantifier to more than one character.
```{r}
str_view(c("ab", "abab", "ababab"), "ab+")
```
```{r}
str_view(c("ab", "abab", "ababab"), "(ab)+")
```
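The difference between quantifying a single character and quantifying a group shows up clearly in the extracted text. This is a minimal sketch, assuming `{stringr}` is installed; the variable names are hypothetical.

```{r}
library(stringr)

s <- c("ab", "abab", "ababab")

# Without a group, "+" applies only to "b", so the match stops at the first "ab"
single_char <- str_extract(s, "ab+")

# With a group, "+" applies to the whole sequence "ab"
grouped <- str_extract(s, "(ab)+")

# A group can take any quantifier, e.g. exactly two repetitions of "ab"
exactly_two <- str_extract(s, "(ab){2}")
```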
```{=html}
<!--## Applications TODO
Combining what you have learned about regular expressions with data transformation skills picked up in the previous chapter can bring about many useful applications.
If your regular expression is
For example, the format of the outputs from building energy simulation models are often not in a tidy format -->
```