---
output:
github_document:
html_preview: false
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "##",
fig.path = "README-"
)
```
[![CRAN Version](https://www.r-pkg.org/badges/version/spacyr)](https://CRAN.R-project.org/package=spacyr) ![Downloads](https://cranlogs.r-pkg.org/badges/spacyr) [![Travis-CI Build Status](https://travis-ci.org/quanteda/spacyr.svg?branch=master)](https://travis-ci.org/quanteda/spacyr) [![Appveyor Build status](https://ci.appveyor.com/api/projects/status/jqt2atp1wqtxy5xd/branch/master?svg=true)](https://ci.appveyor.com/project/kbenoit/spacyr/branch/master) [![codecov.io](https://codecov.io/github/quanteda/spacyr/coverage.svg?branch=master)][1]
[1]: https://codecov.io/gh/quanteda/spacyr/branch/master
# spacyr: an R wrapper for spaCy
This package is an R wrapper to the spaCy "industrial strength natural language processing" Python library from http://spacy.io.
## Installing the package
1. Install miniconda
The easiest way to install spaCy and **spacyr** is through the auto-installation function in the **spacyr** package. This function uses a conda environment, so some version of conda must be installed on your system. You can install miniconda from https://conda.io/miniconda.html (choose the 64-bit version for your system).
If you already have any version of conda, you can skip this step. You can check by entering `conda --version` in a terminal.
2. Install the **spacyr** R package:
* From GitHub:
To install the latest package from source, you can simply run the following.
```{r, eval = FALSE}
devtools::install_github("quanteda/spacyr", build_vignettes = FALSE)
```
* From CRAN:
```{r, eval = FALSE}
install.packages("spacyr")
```
3. Install spaCy in a conda environment
* On Windows, you need to run R as an administrator for the installation to work properly. To do so, right-click the RStudio (or R desktop) icon and select "Run as administrator" when launching R.
* To install spaCy, you can simply run
```{r, eval = FALSE}
library("spacyr")
spacy_install()
```
This will install the latest version of spaCy (and its required packages) and the English language model. After installation, you can initialize spaCy in R with
```{r, eval = FALSE}
spacy_initialize()
```
This will return the following message if spaCy was installed with this method.
```{r, eval = FALSE}
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.0.11, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
```
4. (optional) Add more language models
For spaCy installed by `spacy_install()`, **spacyr** provides a helper function to install additional language models. For instance, to install the German language model:
```{r, eval = FALSE}
spacy_download_langmodel("de")
```
(Again, Windows users have to run this command as an administrator. Otherwise, creating the symlink to the language model will fail.)
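As a rough sketch of the conda check in step 1, the same test can be run from R before calling `spacy_install()`. This uses only base R and assumes conda is on the `PATH`, which may not hold for every installation (on Windows, for example, conda is often reachable only from its own prompt).

```r
# Look for a conda executable on the PATH; Sys.which() returns "" when none is found.
conda_path <- unname(Sys.which("conda"))
has_conda <- nzchar(conda_path)

if (has_conda) {
  # Print the installed conda version, like `conda --version` in a terminal.
  system2(conda_path, "--version")
} else {
  message("No conda found on the PATH; install miniconda before calling spacy_install().")
}
```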
## Comments and feedback
We welcome your comments and feedback. Please file issues on the [issues](https://github.com/quanteda/spacyr/issues) page, and/or send us comments at [email protected] and [email protected].
## A walkthrough of **spacyr**
### Starting a **spacyr** session
To give R access to the underlying Python functionality, **spacyr** must open a connection to Python by being initialized within your R session.
We provide a function for this, `spacy_initialize()`, which attempts to make this process as painless as possible by searching your system for Python executables and testing which of them have spaCy installed. For power users (such as those with multiple installations of Python), it is possible to specify the path manually through the `python_executable` argument, which also makes initialization faster. (You will need to set this to the location of the Python executable on your system.)
```{r}
library("spacyr")
spacy_initialize()
```
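As an alternative to the plain call above, the `python_executable` argument could be used roughly as follows. The path is a hypothetical placeholder, not a recommended location, and the guard only keeps the snippet from failing when the path or the package is missing.

```r
# Hypothetical path: replace it with the Python executable that has spaCy on your system.
py_path <- "/usr/local/bin/python3"

if (file.exists(py_path) && requireNamespace("spacyr", quietly = TRUE)) {
  # Naming the executable directly skips the search over Python installations.
  spacyr::spacy_initialize(python_executable = py_path)
} else {
  message("Edit py_path to point at a spaCy-enabled Python before running this.")
}
```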
### Tokenizing and tagging texts
`spacy_parse()` is **spacyr**'s main function. It calls spaCy both to tokenize and to tag the texts. It provides two options for part-of-speech tagging, plus options to return word lemmas, named entities, and dependency parses. It returns a `data.frame` corresponding to the emerging [*text interchange format*](https://github.com/ropensci/tif) for token data.frames.
The approach to tokenizing taken by spaCy is inclusive: it includes all tokens without restrictions, including punctuation characters and symbols.
Example:
```{r}
txt <- c(d1 = "spaCy excels at large-scale information extraction tasks.",
         d2 = "Mr. Smith goes to North Carolina.")
# process documents and obtain a data.frame
parsedtxt <- spacy_parse(txt)
parsedtxt
```
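Because punctuation is kept as tokens, a common follow-up is to filter it out of the returned data.frame. The sketch below uses a hand-written stand-in rather than real `spacy_parse()` output, so it runs without spaCy. Its values are illustrative only, and it assumes the `pos` column marks punctuation as `PUNCT`.

```r
# Hand-written stand-in for a spacy_parse() result; values are illustrative only.
parsed <- data.frame(
  doc_id      = "d2",
  sentence_id = 1L,
  token_id    = 1:7,
  token       = c("Mr.", "Smith", "goes", "to", "North", "Carolina", "."),
  pos         = c("PROPN", "PROPN", "VERB", "ADP", "PROPN", "PROPN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Drop punctuation tokens, which the tokenizer keeps by default.
words_only <- parsed[parsed$pos != "PUNCT", ]

# Count the remaining tokens per document.
tokens_per_doc <- table(words_only$doc_id)
tokens_per_doc
```

The same subsetting applies to a real result, since it is an ordinary data.frame.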
Two fields are available for part-of-speech tags. The `pos` field is the [Universal tagset for parts-of-speech](http://universaldependencies.org/u/pos/all.html), a general scheme that most users will find serves their needs and that also provides equivalencies across languages. **spacyr** also provides a more detailed tagset, defined in each spaCy language model. For English, this is the [OntoNotes 5 version of the Penn Treebank tag set](https://spacy.io/docs/usage/pos-tagging#pos-tagging-english).
```{r}
spacy_parse(txt, tag = TRUE, entity = FALSE, lemma = FALSE)
```
For the German language model, the Universal tagset (`pos`) remains the same, but the detailed tagset (`tag`) is the [TIGER Treebank](https://spacy.io/docs/usage/pos-tagging#pos-tagging-german) scheme.
### Extracting entities
**spacyr** can extract entities, either named or ["extended"](https://spacy.io/docs/usage/entity-recognition#entity-types).
```{r}
parsedtxt <- spacy_parse(txt, lemma = FALSE)
entity_extract(parsedtxt)
```
```{r}
entity_extract(parsedtxt, type = "all")
```
Or, convert multi-word entities into single "tokens":
```{r}
entity_consolidate(parsedtxt)
```
### Dependency parsing
Detailed parsing of syntactic dependencies is possible with the `dependency = TRUE` option:
```{r}
spacy_parse(txt, dependency = TRUE, lemma = FALSE, pos = FALSE)
```
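With `dependency = TRUE`, the result gains `head_token_id` and `dep_rel` columns. As a minimal sketch of how they fit together, the snippet below attaches each token's head word. The data.frame is a hand-written stand-in with illustrative values, not real parser output, and it assumes a single sentence so that `token_id` values are unique.

```r
# Hand-written stand-in for spacy_parse(txt, dependency = TRUE); values are illustrative only.
dep <- data.frame(
  doc_id        = "d2",
  sentence_id   = 1L,
  token_id      = 1:7,
  token         = c("Mr.", "Smith", "goes", "to", "North", "Carolina", "."),
  head_token_id = c(2L, 3L, 3L, 3L, 6L, 4L, 3L),
  dep_rel       = c("compound", "nsubj", "ROOT", "prep", "compound", "pobj", "punct"),
  stringsAsFactors = FALSE
)

# Look up the head word of each token by matching head_token_id against token_id.
dep$head_token <- dep$token[match(dep$head_token_id, dep$token_id)]

# Show every non-root token next to its relation and head.
dep[dep$dep_rel != "ROOT", c("token", "dep_rel", "head_token")]
```

For multi-sentence or multi-document results, the lookup would need to be done within each `doc_id` and `sentence_id` group, since token ids restart in each sentence.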
### Using other language models
By default, **spacyr** loads an English language model. You can also load spaCy's other [language models](https://spacy.io/docs/usage/models) or use one of the [language models with alpha support](https://spacy.io/docs/api/language-models#alpha-support) by specifying the `model` option when calling `spacy_initialize()`. We have successfully tested the following language models with spaCy version 2.0.1.
```{r echo = FALSE}
knitr::kable(data.frame(Language = c("German", "Spanish", "Portuguese", "French", "Italian", "Dutch"),
ModelName = c("`de`", "`es`", "`pt`", "`fr`", "`it`", "`nl`")) )
```
This is an example of parsing German texts.
```{r}
## first finalize spaCy if it is already loaded
spacy_finalize()
spacy_initialize(model = "de")
txt_german <- c(R = "R ist eine freie Programmiersprache für statistische Berechnungen und Grafiken. Sie wurde von Statistikern für Anwender mit statistischen Aufgaben entwickelt.",
python = "Python ist eine universelle, üblicherweise interpretierte höhere Programmiersprache. Sie will einen gut lesbaren, knappen Programmierstil fördern.")
results_german <- spacy_parse(txt_german, dependency = TRUE, lemma = FALSE, tag = TRUE)
results_german
```
Note that the additional language models must first be installed in spaCy. The German language model, for example, can be installed (`python -m spacy download de`) before you call `spacy_initialize()`.
### When you finish
A background spaCy process is started when you run `spacy_initialize()`. Because of the size of spaCy's language models, this takes up a lot of memory (typically 1.5GB). When you no longer need the Python connection, you can close it (and terminate the process) by calling the `spacy_finalize()` function.
```{r, eval = FALSE}
spacy_finalize()
```
By calling `spacy_initialize()` again, you can restart the backend spaCy.
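One way to make sure the memory is always released is to pair the two calls inside a function, so that finalization runs even if parsing fails. This is only a sketch: `parse_and_release()` is a hypothetical helper, not part of **spacyr**, and it assumes no spaCy session is already open when it is called.

```r
# Hypothetical helper (not part of spacyr): parse texts, then always release spaCy.
parse_and_release <- function(texts, ...) {
  spacyr::spacy_initialize()
  # on.exit() runs when the function exits, whether normally or through an error.
  on.exit(spacyr::spacy_finalize(), add = TRUE)
  spacyr::spacy_parse(texts, ...)
}

# Usage, assuming spaCy is installed:
# parsedtxt <- parse_and_release("Mr. Smith goes to North Carolina.", lemma = FALSE)
```

The trade-off is that the language model is reloaded on every call, so this suits occasional parsing better than repeated calls in a loop.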
### Permanently setting the default Python
If you want **spacyr** to skip searching for a Python installation with spaCy, you can permanently set the path to the spaCy-enabled Python in an R startup file (on Mac/Linux, the file is `~/.Rprofile`), which is read every time a new R session is launched. You can save this setting when you call `spacy_initialize()`:
```{r, eval = FALSE}
spacy_initialize(save_profile = TRUE)
```
Once this is appropriately set up, the message from `spacy_initialize()` changes to something like:
```
## spacy python option is already set, spacyr will use:
## condaenv = "spacy_condaenv"
## successfully initialized (spaCy Version: 2.0.11, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
```
To ignore the permanently set options, you can initialize spaCy with `refresh_settings = TRUE`.
## Using **spacyr** with other packages
### **quanteda**
Some of the token- and type-related standard methods from [**quanteda**](https://github.com/quanteda/quanteda) also work on the new tagged token objects:
```{r}
require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
docnames(parsedtxt)
ndoc(parsedtxt)
ntoken(parsedtxt)
ntype(parsedtxt)
```
### Conformity to the *Text Interchange Format*
The [Text Interchange Format](https://github.com/ropensci/tif) is an emerging standard structure for text package objects in R, such as corpus and token objects. `spacy_parse()` can take a TIF corpus data.frame or character object as valid input. Moreover, the data.frames returned by `spacy_parse()` and `entity_consolidate()` conform to the TIF tokens standard for data.frame tokens objects. This makes it easier to use **spacyr** with any text analysis package for R that works with TIF standard objects.
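For illustration, a TIF-style corpus data.frame can be built in base R as below. The checks are a simplified reading of the TIF corpus layout (a character `doc_id` column first, a character `text` column second, unique ids) and are not the validator from the **tif** package.

```r
# A small corpus in the TIF data.frame layout: doc_id first, text second.
corpus_df <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("spaCy excels at large-scale information extraction tasks.",
             "Mr. Smith goes to North Carolina."),
  stringsAsFactors = FALSE
)

# Simplified layout checks (an approximation, not the official TIF validator).
is_tif_like <- identical(names(corpus_df)[1:2], c("doc_id", "text")) &&
  is.character(corpus_df$doc_id) &&
  is.character(corpus_df$text) &&
  !anyDuplicated(corpus_df$doc_id)
is_tif_like

# With spaCy initialized, this data.frame could then be parsed:
# parsedtxt <- spacy_parse(corpus_df)
```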