README.Rmd

---
output:
  github_document:
    html_preview: false
---


```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "README-"
)
```

[![CRAN Version](https://www.r-pkg.org/badges/version/spacyr)](https://CRAN.R-project.org/package=spacyr) ![Downloads](https://cranlogs.r-pkg.org/badges/spacyr) [![Travis-CI Build Status](https://travis-ci.org/quanteda/spacyr.svg?branch=master)](https://travis-ci.org/quanteda/spacyr) [![Appveyor Build status](https://ci.appveyor.com/api/projects/status/jqt2atp1wqtxy5xd/branch/master?svg=true)](https://ci.appveyor.com/project/kbenoit/spacyr/branch/master) [![codecov.io](https://codecov.io/github/quanteda/spacyr/coverage.svg?branch=master)][1]

[1]: https://codecov.io/gh/quanteda/spacyr/branch/master


# spacyr: an R wrapper for spaCy

This package is an R wrapper to the spaCy "industrial strength natural language processing" Python library from http://spacy.io.

## Installing the package

1.  Install miniconda

    The easiest way to install spaCy and **spacyr** is through an auto-installation function in **spacyr** package. This function utilizes a conda environment and therefore, some version of conda has to be installed in the system. You can install miniconda from https://conda.io/miniconda.html (Choose 64-bit version for your system). 
    
    If you have any version of conda, you can skip this step. You can check it by entering `conda --version` in Console.

2.  Install the **spacyr** R package:

    * From GitHub:

      To install the latest package from source, you can simply run the following.
    ```{r, eval = FALSE}
    devtools::install_github("quanteda/spacyr", build_vignettes = FALSE)
    ```
    
    * From CRAN:
    ```{r, eval = FALSE}
    install.packages("spacyr")
    ```

3.  Install spaCy in a conda environment

    * For Windows, you need to run R as an administrator to make installation work properly. To do so, right click Rstudio (or R desktop icon) and select "Run as administrator" when launching R. 
    
    * To install spaCy, you can simply run

    ```{r, eval = FALSE}
    library("spacyr")
    spacy_install()
    ```
      
      This will install the latest version of spaCy (and its required packages) and English language model. After installation, you can initialize spacy in R with 
      
    ```{r, eval = FALSE}
    spacy_initialize()
    ```
    
    This will return the following message if spaCy was installed with this method.
    
    ```{r, eval = FALSE}
    ## Found 'spacy_condaenv'. spacyr will use this environment
    ## successfully initialized (spaCy Version: 2.0.11, language model: en)
    ## (python options: type = "condaenv", value = "spacy_condaenv")
    ```
      
      
4. (optional) Add more language models

    For spaCy installed by `spacy_install()`, **spacyr** provides a useful helper function to install additional language models. For instance, to install Gernman language model
    
    ```{r, eval = FALSE}
    spacy_download_langmodel("de")
    ```
    
    (Again, Windows users have to run this command as an administrator. Otherwise, sim-link to language model will fail.)


## Comments and feedback

We welcome your comments and feedback.  Please file issues on the [issues](https://github.com/quanteda/spacyr/issues) page, and/or send us comments at kbenoit@lse.ac.uk and A.Matsuo@lse.ac.uk.

## A walkthrough of **spacyr**

### Starting a **spacyr** session

To allow R to access the underlying Python functionality, it must open a connection by being initialized within your R session.  

We provide a function for this, `spacy_initialize()`, which attempts to make this process as painless as possible by searching your system for Python executables, and testing which have spaCy installed.  For power users (such as those with multiple installations of Python), it is possible to specify the path manually through the `python_executable` argument, which also makes initialization faster.  (You will need to change the value on your system of the Python executable.)

```{r}
library("spacyr")
spacy_initialize()
```

### Tokenizing and tagging texts

The `spacy_parse()` is **spacyr**'s main function.  It calls spaCy both to tokenize and tag the texts.  It provides two options for part of speech tagging, plus options to return word lemmas, entity recognition, and dependency parsing.  It returns a `data.frame` corresponding to the emerging [*text interchange format*](https://github.com/ropensci/tif) for token data.frames. 

The approach to tokenizing taken by spaCy is inclusive: it includes all tokens without restrictions, including punctuation characters and symbols.

Example:
```{r}
txt <- c(d1 = "spaCy excels at large-scale information extraction tasks.",
         d2 = "Mr. Smith goes to North Carolina.")

# process documents and obtain a data.table
parsedtxt <- spacy_parse(txt)
parsedtxt
```

Two fields are available for part-of-speech tags.  The `pos` field returned is the [Universal tagset for parts-of-speech](http://universaldependencies.org/u/pos/all.html), a general scheme that most users will find serves their needs, and also that provides equivalencies across langages.   **spacyr** also provides a more detailed tagset, defined in each spaCy language model.  For English, this is the [OntoNotes 5 version of the Penn Treebank tag set](https://spacy.io/docs/usage/pos-tagging#pos-tagging-english).
```{r}
spacy_parse(txt, tag = TRUE, entity = FALSE, lemma = FALSE)
```

For the German language model, the Universal tagset (`pos`) remains the same, but the detailed tagset (`tag`) is the [TIGER Treebank](https://spacy.io/docs/usage/pos-tagging#pos-tagging-german) scheme.


### Extracting entities

**spacyr** can extract entities, either named or ["extended"](https://spacy.io/docs/usage/entity-recognition#entity-types).
```{r}
parsedtxt <- spacy_parse(txt, lemma = FALSE)
entity_extract(parsedtxt)
```

```{r}
entity_extract(parsedtxt, type = "all")
```

Or, convert multi-word entities into single "tokens":
```{r}
entity_consolidate(parsedtxt)
```


### Dependency parsing

Detailed parsing of syntactic dependencies is possible with the `dependency = TRUE` option:
```{r}
spacy_parse(txt, dependency = TRUE, lemma = FALSE, pos = FALSE)
```


### Using other language models

By default, **spacyr** loads an English language model. You also can load SpaCy's other [language models](https://spacy.io/docs/usage/models) or use one of the [language models with alpha support](https://spacy.io/docs/api/language-models#alpha-support) by specifying the `model` option when calling `spacy_initialize()`. We have sucessfully tested following language models with spacy version 2.0.1.  

```{r echo = FALSE}
knitr::kable(data.frame(Language = c("German", "Spanish", "Portuguese",  "French", "Italian", "Dutch"),
           ModelName = c("`de`", "`es`", "`pt`", "`fr`", "`it`", "`nl`")) )
```


This is an example of parsing German texts.
```{r}
## first finalize the spacy if it's loaded
spacy_finalize()
spacy_initialize(model = "de")

txt_german <- c(R = "R ist eine freie Programmiersprache für statistische Berechnungen und Grafiken. Sie wurde von Statistikern für Anwender mit statistischen Aufgaben entwickelt.",
               python = "Python ist eine universelle, üblicherweise interpretierte höhere Programmiersprache. Sie will einen gut lesbaren, knappen Programmierstil fördern.")
results_german <- spacy_parse(txt_german, dependency = TRUE, lemma = FALSE, tag = TRUE)
results_german
```
Note that the additional language models must first be installed in spaCy.  The German language model, for example, can be installed (`python -m spacy download de`) before you call `spacy_initialize()`.

### When you finish

A background process of spaCy is initiated when you ran `spacy_initialize()`. Because of the size of language models of spaCy, this takes up a lot of memory (typically 1.5GB). When you do not need the Python connection any longer, you can finalize the python connection (and terminate the process) by calling the `spacy_finalize()` function.

```{r, eval = FALSE}
spacy_finalize()
```
By calling `spacy_initialize()` again, you can restart the backend spaCy.

### Permanently seting the default Python

If you want to skip **spacyr** searching for Python intallation with spaCy, you can do so by permanently setting the path to the spaCy-enabled Python by specifying it in an R-startup file (For Mac/Linux, the file is `~/.Rprofile`), which is read every time a new `R` is launched. You can set the option permanently when you call  `spacy_initialize`:

```{R eval = FALSE}
spacy_initialize(save_profile = TRUE)
```

Once this is appropriately set up, the message from `spacy_initialize()` changes to something like:
```
## spacy python option is already set, spacyr will use:
##	condaenv = "spacy_condaenv"
## successfully initialized (spaCy Version: 2.0.11, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
```

To ignore the permanently set options, you can initialize spacy with `refresh_settings = TRUE`. 

## Using **spacyr** with other packages

### **quanteda**

Some of the token- and type-related standard methods from [**quanteda**](https://github.com/quanteda/quanteda) also work on the new tagged token objects:
```{r}
require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
docnames(parsedtxt)
ndoc(parsedtxt)
ntoken(parsedtxt)
ntype(parsedtxt)
```

### Conformity to the *Text Interchange Format*

The [Text Interchange Format](https://github.com/ropensci/tif) is an emerging standard structure for text package objects in R, such as corpus and token objects.  `spacy_initialize()` can take a TIF corpus data.frame or character object as a valid input.  Moreover, the data.frames returned by `spacy_parse()` and `entity_consolidate()` conform to the TIF tokens standard for data.frame tokens objects.  This will make it easier to use with any text analysis package for R that works with TIF standard objects.