spacyr wishlist #109

amatsuo · 2018-05-05T06:38:32Z

Hello @kbenoit and other users of spacyr

This is a comprehensive wishlist of spacyr updates inspired by our discussion with @honnibal and @ines. We will implement some of them in future, but is there anything you are particularly intereted in?

Something likely to be implemented

Turn off the part of pipes at the time of execution
Noun phrases extractions
Having a "just tokenization" option
Size of batch in pipe for performance tuning

Something nice to have but not sure how many users need it

Allowing to add user-defined attributes
Functionality to add word embeddings to the model

Just a wish

https://explosion.ai/demos/
- consider the use of matcher in spacyr (construction of JSON is necessary)
parse-tree navigations (maybe a different package)

The text was updated successfully, but these errors were encountered:

aourednik · 2018-05-14T13:54:24Z

Having a "just tokenization" option with lemmatization would be great.
Currently trying to use

parsed <- spacy_parse(my_corpus, pos=FALSE, entity=FALSE, dependency=FALSE)
parsed$token <- parsed$lemma
my_tokens <- as.tokens(parsed)

The first line yields a memory overload on a large my_corpus, while tokens(my_corpus) is fast, with no memory problem. I don't to what extent this is due to inherent memory use of spaCy, though.

Could spacyr somehow be included as an option with the tokens function? Like this: ?

my_tokens <- tokens(txtc,
  what='word',
  remove_numbers = TRUE, 
  remove_punct = TRUE, 
  remove_separators=TRUE, 
  remove_symbols = TRUE,
  include_docvars = TRUE,
  lemmatize = "spacy_parse"
)

kbenoit · 2018-05-14T15:06:08Z

Not a bad idea. @amatsuo maybe add:

spacy_tokenize(x, what = c("word", "sentence"), 
  remove_numbers = FALSE, remove_punct = FALSE,
  remove_symbols = FALSE, remove_separators = TRUE,
  remove_twitter = FALSE, remove_hyphens = FALSE, 
  remove_url = FALSE, value = c("list", "data.frame")

where the last one returns one of the two TIF formats for tokens? This is as close to the quanteda::tokens() as possible and with spacy_tokenize(x, value = "list") %>% as.tokens() provides the options of going straight to a quanteda tokens class using the spaCy tokeniser.

We could also add to spacy_parse() a new option for sentence = TRUE that would remove the sentence_id return field, and number tokens consecutively within document. So if all options are FALSE, it's the same as spacy_tokenize(x, what = "word", value = "data.frame") -- indeed, that function could call this version of spacy_parse().

dmklotz · 2018-05-15T16:14:45Z

Definitely would be interested in noun phrase extractions.

amatsuo · 2018-06-05T10:08:19Z

Hi @dmklotz

I opened an issue for noun-phrase extraction (#117). Please provide your thoughts there.

amatsuo · 2018-06-06T13:31:07Z

@aourednik and @kbenoit

I have implemented spacy_tokenize in tokenize-function branch. Please try and give some feedback to me.

Some options are left out: remove_symbols, remove_hyphens, remove_twitter. In my opinion, these options are about text-preprocessing before handing texts to spaCy NLP. At the moment, spacyr does not import stringi and I don't see much reason to use gsub() in 2018 for potentially large-scale text processing.

cecilialee · 2018-06-07T08:06:44Z

Is it possible to train a new model with spacyr at the moment?

kbenoit · 2018-06-07T08:19:53Z

@cecilialee No, for training a new language model you would need to do that in Python using the spaCy instructions. We unlikely to add this facility to spacyr in the foreseeable future.

cecilialee · 2018-06-11T08:17:27Z

@kbenoit Sure. Then if I've trained a model with python, how can I use (initialize) that model with spacyr?

amatsuo · 2018-06-11T10:44:40Z

@cecilialee

The model argument of spacyr_initialize is handed to the model name argument of spacy.load('**'). So you should be able to use the name of the model you saved in python when you call spacy_initialize.

aourednik · 2018-08-30T13:09:45Z

@amatsuo Is there a simple way to install the full tokenize-function branch version of spacyr in R ?

kbenoit · 2018-08-30T13:31:12Z

@aourednik that would be

devtools::install_github("quanteda/spacyr", ref = "tokenize-function")

aourednik · 2018-08-30T13:41:15Z

Great thanks for these developments!
By the way, this has more to do with Quanteda in general than with spacyr, but since we are speaking of lemmatization, I was wondering if it would it be feasible to implement a udpipe lemmatizer in the totokens() function ? Or something like udpipe_tokenize() taking a Quanteda corpus as argument and returning lemmatized tokens? UDPipe is reported to perform better, though slower, lemmatization for French, Italian and Spanish than SpaCy.
For now, I can get a list of lists of tokens like this (below) but having a Quanteda toknes object would allow me to remain within the Quanteda framework.

library("udpipe")
# dl <- udpipe_download_model(language = "french") # necessary only when not yet downloaded
udmodel_french <- udpipe_load_model(file = "french-ud-2.0-170801.udpipe")
#txtc is my quanteda corpus
txtudpipetokens <- lapply(head(texts(txtc)), function(x) {
  udp <- udpipe_annotate(udmodel_french, x)
  return(as.data.table(udp)$lemma)
  }
)

cf. https://github.com/bnosac/udpipe @amatsuo @jwijffels

kbenoit · 2018-08-30T13:44:12Z

Glad it's working for you! We should be finished with the integration of the tokenize-function branch next week. When that's completed, it will be very easy to use spacyr for tokenisation or lemmatising.

On integration with udpipe, that's probably better done in that package. @jwijffels we'd be happy to assist with this.

aourednik · 2018-08-30T14:07:05Z

@amatsuo @kbenoit I have tried out:

devtools::install_github("quanteda/spacyr", ref = "tokenize-function")
parsed <- spacy_tokenize(corpus_sample(txtc,10))
#Error in spacy_tokenize(corpus_sample(txtc, 10)) : 
#  could not find function "spacy_tokenize"
source("https://raw.githubusercontent.com/quanteda/spacyr/tokenize-function/R/spacy_tokenize.R")
parsed <- spacy_tokenize(corpus_sample(txtc,10))
#Error in UseMethod("spacy_tokenize") : 
#  no applicable method for 'spacy_tokenize' applied to an object of class "c('corpus', 'list')"

Session info -------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.423)           
 language en_US                       
 collate  en_US.UTF-8                 
 tz       Europe/Zurich               
 date     2018-08-30                  

Packages -----------------------------------------------------------------------------------------------------
 package      * version date       source                          
 assertthat     0.2.0   2017-04-11 CRAN (R 3.4.1)                  
 backports      1.1.2   2017-12-13 CRAN (R 3.4.3)                  
 base         * 3.4.4   2018-03-16 local                           
 base64enc      0.1-3   2015-07-28 CRAN (R 3.4.2)                  
 bindr          0.1.1   2018-03-13 CRAN (R 3.4.3)                  
 bindrcpp       0.2.2   2018-03-29 CRAN (R 3.4.4)                  
 checkmate      1.8.5   2017-10-24 CRAN (R 3.4.3)                  
 codetools      0.2-15  2016-10-05 CRAN (R 3.3.1)                  
 colorspace     1.3-2   2016-12-14 CRAN (R 3.4.0)                  
 compiler       3.4.4   2018-03-16 local                           
 crayon         1.3.4   2017-09-16 CRAN (R 3.4.3)                  
 curl           3.2     2018-03-28 CRAN (R 3.4.4)                  
 data.table   * 1.11.4  2018-05-27 CRAN (R 3.4.4)                  
 datasets     * 3.4.4   2018-03-16 local                           
 devtools       1.13.6  2018-06-27 CRAN (R 3.4.4)                  
 digest         0.6.16  2018-08-22 CRAN (R 3.4.4)                  
 doMC         * 1.3.5   2017-12-12 CRAN (R 3.4.3)                  
 dplyr          0.7.6   2018-06-29 CRAN (R 3.4.4)                  
 evaluate       0.11    2018-07-17 CRAN (R 3.4.4)                  
 fastmatch      1.1-1   2017-11-21 local                           
 forcats      * 0.3.0   2018-02-19 CRAN (R 3.4.4)                  
 foreach      * 1.4.4   2017-12-12 CRAN (R 3.4.3)                  
 ggplot2      * 3.0.0   2018-07-03 CRAN (R 3.4.4)                  
 git2r          0.23.0  2018-07-17 CRAN (R 3.4.4)                  
 glue           1.3.0   2018-07-17 CRAN (R 3.4.4)                  
 graphics     * 3.4.4   2018-03-16 local                           
 grDevices    * 3.4.4   2018-03-16 local                           
 grid           3.4.4   2018-03-16 local                           
 gtable         0.2.0   2016-02-26 CRAN (R 3.4.0)                  
 htmlTable    * 1.12    2018-05-26 CRAN (R 3.4.4)                  
 htmltools      0.3.6   2017-04-28 CRAN (R 3.4.2)                  
 htmlwidgets    1.2     2018-04-19 CRAN (R 3.4.4)                  
 httr           1.3.1   2017-08-20 CRAN (R 3.4.2)                  
 igraph       * 1.1.2   2017-07-21 CRAN (R 3.4.2)                  
 iterators    * 1.0.10  2018-07-13 CRAN (R 3.4.4)                  
 jsonlite       1.5     2017-06-01 CRAN (R 3.4.2)                  
 knitr          1.20    2018-02-20 CRAN (R 3.4.3)                  
 labeling       0.3     2014-08-23 CRAN (R 3.4.0)                  
 lattice        0.20-35 2017-03-25 CRAN (R 3.3.3)                  
 lazyeval       0.2.1   2017-10-29 CRAN (R 3.4.2)                  
 lubridate      1.7.4   2018-04-11 CRAN (R 3.4.4)                  
 magrittr       1.5     2014-11-22 CRAN (R 3.4.0)                  
 Matrix         1.2-14  2018-04-09 CRAN (R 3.4.4)                  
 memoise        1.1.0   2017-04-21 CRAN (R 3.4.3)                  
 methods      * 3.4.4   2018-03-16 local                           
 munsell        0.5.0   2018-06-12 CRAN (R 3.4.4)                  
 parallel     * 3.4.4   2018-03-16 local                           
 pillar         1.3.0   2018-07-14 CRAN (R 3.4.4)                  
 pkgconfig      2.0.2   2018-08-16 CRAN (R 3.4.4)                  
 plyr           1.8.4   2016-06-08 CRAN (R 3.4.0)                  
 purrr          0.2.5   2018-05-29 CRAN (R 3.4.4)                  
 qdapRegex      0.7.2   2017-04-09 CRAN (R 3.4.2)                  
 quanteda     * 1.3.4   2018-07-15 CRAN (R 3.4.4)                  
 R2HTML       * 2.3.2   2016-06-23 CRAN (R 3.4.3)                  
 R6             2.2.2   2017-06-17 CRAN (R 3.4.1)                  
 RColorBrewer   1.1-2   2014-12-07 CRAN (R 3.4.1)                  
 Rcpp           0.12.18 2018-07-23 CRAN (R 3.4.4)                  
 RcppParallel   4.4.1   2018-07-19 CRAN (R 3.4.4)                  
 readtext     * 0.71    2018-05-10 CRAN (R 3.4.4)                  
 rlang          0.2.2   2018-08-16 CRAN (R 3.4.4)                  
 rlist        * 0.4.6.1 2016-04-04 CRAN (R 3.4.4)                  
 rmarkdown      1.10    2018-06-11 CRAN (R 3.4.4)                  
 rprojroot      1.3-2   2018-01-03 CRAN (R 3.4.3)                  
 rstudioapi     0.7     2017-09-07 CRAN (R 3.4.3)                  
 scales       * 1.0.0   2018-08-09 CRAN (R 3.4.4)                  
 spacyr         0.9.91  2018-08-30 Github (quanteda/spacyr@240b6ef)
 stats        * 3.4.4   2018-03-16 local                           
 stopwords      0.9.0   2017-12-14 CRAN (R 3.4.3)                  
 stringi        1.2.4   2018-07-20 CRAN (R 3.4.4)                  
 stringr      * 1.3.1   2018-05-10 CRAN (R 3.4.4)                  
 textclean    * 0.9.3   2018-07-23 CRAN (R 3.4.4)                  
 tibble         1.4.2   2018-01-22 CRAN (R 3.4.3)                  
 tidyselect     0.2.4   2018-02-26 CRAN (R 3.4.3)                  
 tools          3.4.4   2018-03-16 local                           
 udpipe       * 0.6.1   2018-07-30 CRAN (R 3.4.4)                  
 utils        * 3.4.4   2018-03-16 local                           
 withr          2.1.2   2018-03-15 CRAN (R 3.4.4)                  
 yaml           2.2.0   2018-07-25 CRAN (R 3.4.4)

amatsuo · 2018-08-31T05:39:26Z

It seems that you forgot to load the package by library(spacyr).

jwijffels · 2018-08-31T12:21:59Z

If you just want to get the lemma's in French using udpipe and put it into the quanteda corpus structure. I think this is just this (example below just takes nouns & proper nouns).

library(udpipe)
library(quanteda)
udmodel <- udpipe_load_model("french-ud-2.0-170801.udpipe")
## assuming that txtc is a quanteda corpus
x <- udpipe_annotate(udmodel, x = texts(txtc), doc_id = docnames(txtc), parser = "none")
x <- as.data.frame(x)
x <- subset(x, upos %in% c('NOUN', 'PROPN'))
txtc$tokens <- split(x$lemma, x$doc_id)

Why do you think such code would have to be put into the udpipe R package?

aourednik · 2018-09-04T08:59:53Z

@amatsuo Yes, my mistake, forgot to reload package, the first error was due this, sorry. Now I am getting only the second error on my machine (same Session info as before) :

> class(txtc)
[1] "corpus" "list"  
> txtc
Corpus consisting of 35,701 documents and 5 docvars.
> devtools::install_github("quanteda/spacyr", ref = "tokenize-function",force=TRUE)
Downloading GitHub repo quanteda/spacyr@tokenize-function
from URL https://api.github.com/repos/quanteda/spacyr/zipball/tokenize-function
Installing spacyr
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL  \
  '/tmp/Rtmp3TNiYi/devtoolsc003f381001/quanteda-spacyr-240b6ef'  \
  --library='/home/andre/R/x86_64-pc-linux-gnu-library/3.4' --install-tests 

* installing *source* package ‘spacyr’ ...
** R
** data
*** moving datasets to lazyload DB
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (spacyr)
Reloading installed spacyr
unloadNamespace("spacyr") not successful, probably because another loaded package depends on it.Forcing unload. If you encounter problems, please restart R.

Attaching package: ‘spacyr’

The following object is masked from ‘package:quanteda’:

    spacy_parse

> library("spacyr")
> parsed <- spacy_tokenize(corpus_sample(txtc,10))
Error in UseMethod("spacy_tokenize") : 
  no applicable method for 'spacy_tokenize' applied to an object of class "c('corpus', 'list')"

aourednik · 2018-09-04T11:22:41Z

@jwijffels Many thanks for the code! It currently returns a named list of character vectors containing lemmatized tokens, which comes much closer to what I (and most probably other users of both Quanteda and udpipe) would need. The best, though, would be having a udpipe function return a Quanteda object of class tokens. A tokens object is normally generated by tokens() or by the new spacy_tokenize() discussed here. The tokens object can be easily turned to a document-feature-matrix with dfm() that allows, for instance, fast dictionary lookup with dfm_lookup().
My concrete use-case is lexicon-based sentiment analysis and emotion mining.

jwijffels · 2018-09-04T12:23:38Z

quanteda's tokens element of the corpus seems to be a list of terms with the class tokenizedTexts. If you want this, just wrap the code that I showed above in as.tokenizedTexts which is part of quanteda.

library(udpipe)
library(quanteda)
udmodel <- udpipe_load_model("french-ud-2.0-170801.udpipe")
## assuming that txtc is a quanteda corpus
x <- udpipe_annotate(udmodel, x = texts(txtc), doc_id = docnames(txtc), parser = "none")
x <- as.data.frame(x)
txtc$tokens <- as.tokenizedTexts(split(x$lemma, x$doc_id))

If you want to use udpipe, to get a DTM/document-feature-matrix of adjectives for sentiment analysis, you can just use the code below and proceed with e.g. dfm_lookup if you need it.

## For sentiment analysis, with udpipe, just take the adjectives and get a dtm
x <- subset(x, upos %in% c('ADJ'))
dtm <- document_term_frequencies(x, document = "doc_id", term = "lemma")
dtm <- document_term_matrix(dtm)

ChengYJon · 2018-11-17T04:54:34Z

This has been super useful! Thank you!

Are there any plans to implement spacy's neural coreference functions into R?

kasperwelbers · 2019-04-04T11:12:58Z

@ChengYJon I was also looking to use the neuralcoref pipeline component, so I took a stab at it in this fork

There is some hassle though (as explained in the README), because neuralcoref currently doesn't seem to work with spacy > 2.0.12. Simply downgrading spacy in turn resulted in other compatibility issues, so for me a clean conda install was required. Until these compatibility issues are resolved it's quite cumbersome.

ChengYJon · 2019-04-26T17:09:18Z

@kasperwelbers Thank you so much for this. I kept having to switch between Python and R. I'll try this fork out and let you know if I'm able to recreate the process.

fkrauer · 2019-06-11T08:23:36Z

If it isn't already incorporated (I haven't found anything), I'd love to have a "start" and "end" character for each token. Otherwise they cannot be uniquely identified in the running text.

amatsuo · 2019-06-11T08:43:12Z

@fkrauer Thank you for the post.

I am not sure what that means by start and end.

Could you elaborate it a bit more? Or could you show us a desirable output?

fkrauer · 2019-06-11T09:05:25Z

I mean the character position of each token with respect to the original text. For example:

text <- "This is a dummy text."
output <- spacy_parse(text)

> output
token	start	end
This	1	4
is	6	7
a	9	9
dummy	11	15
text	17	20
.	21	21

The count starts with 1 at the first character, and all characters are counted (also whitespaces). coreNLP (R wrapper for Stanford's CoreNLP) has this feature, which is very useful, when you have to map the original text back onto the tokens or compare different NLP algorithms.

amatsuo · 2019-06-11T11:23:00Z

I see. It's not implemented in spacyr, but you could do something like this.

library(spacyr)
library(tidyverse)

txt <- c(doc1 = "spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. Independent research in 2015 found spaCy to be the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using.",
         doc2 = "spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python's awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.")
out <- spacy_parse(txt, additional_attributes = c("idx"), entity = FALSE,
                   lemma = FALSE, pos = FALSE)

out %>%
    mutate(start = idx - idx[1] + 1) %>%
    mutate(end = start + nchar(token) - 1)

What the code does is:

run spacy_parse with additional attribute of idx, which returns the character offset of the token in the document.
calculate start and end.

The head of output is:

##    doc_id sentence_id token_id       token idx start end
## 1    doc1           1        1       spaCy   0     1   5
## 2    doc1           1        2      excels   6     7  12
## 3    doc1           1        3          at  13    14  15
## 4    doc1           1        4       large  16    17  21
## 5    doc1           1        5           -  21    22  22
## 6    doc1           1        6       scale  22    23  27
## 7    doc1           1        7 information  28    29  39
## 8    doc1           1        8  extraction  40    41  50
## 9    doc1           1        9       tasks  51    52  56
## 10   doc1           1       10           .  56    57  57
## 11   doc1           2        1          It  58    59  60
## 12   doc1           2        2          's  60    61  62
## 13   doc1           2        3     written  63    64  70
## 14   doc1           2        4        from  71    72  75
## 15   doc1           2        5         the  76    77  79
## 16   doc1           2        6      ground  80    81  86
## 17   doc1           2        7          up  87    88  89
## 18   doc1           2        8          in  90    91  92
## 19   doc1           2        9   carefully  93    94 102
## 20   doc1           2       10      memory 103   104 109

I am not sure whether we should provide this as a functionality of spacy_parse yet, but could be.

fkrauer · 2019-06-11T11:35:53Z

I have written a for loop with a stringr::str_locate(), but your solution is much quicker, thank you.

mshariful · 2020-08-24T10:44:50Z

spacyr_initialize

Hi,
I have saved an updated Spacy NER model in 'c\updated_model'. The folder 'updated_model' contains
'tagger', 'parser', 'ner', and 'vocab' folders together with two files 'meta,json' and 'tokenizer'. I can easily
load and use this updated model in python by simply using
spacy.load( 'c\updated_model')
How do I load it in Spacyr? I tried
spacy_initialize(model='c\updated_model')
I did not get any error but it seems spacyr uses the default 'de' model. How do I make sure, spacyr uses my updated model?

TIA
Sharif

amatsuo added the wishlist label May 5, 2018

amatsuo mentioned this issue Jun 8, 2018

Tokenize and noun-phrase extraction #119

Merged

3 tasks

kbenoit mentioned this issue Aug 25, 2020

Add spacy_install_langmodel() #200

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spacyr wishlist #109

spacyr wishlist #109

amatsuo commented May 5, 2018

aourednik commented May 14, 2018 •

edited by kbenoit

Loading

kbenoit commented May 14, 2018 •

edited

Loading

dmklotz commented May 15, 2018

amatsuo commented Jun 5, 2018

amatsuo commented Jun 6, 2018

cecilialee commented Jun 7, 2018

kbenoit commented Jun 7, 2018

cecilialee commented Jun 11, 2018

amatsuo commented Jun 11, 2018

aourednik commented Aug 30, 2018

kbenoit commented Aug 30, 2018

aourednik commented Aug 30, 2018 •

edited

Loading

kbenoit commented Aug 30, 2018

aourednik commented Aug 30, 2018

amatsuo commented Aug 31, 2018

jwijffels commented Aug 31, 2018

aourednik commented Sep 4, 2018

aourednik commented Sep 4, 2018

jwijffels commented Sep 4, 2018 •

edited

Loading

ChengYJon commented Nov 17, 2018

kasperwelbers commented Apr 4, 2019

ChengYJon commented Apr 26, 2019

fkrauer commented Jun 11, 2019

amatsuo commented Jun 11, 2019

fkrauer commented Jun 11, 2019

amatsuo commented Jun 11, 2019 •

edited

Loading

fkrauer commented Jun 11, 2019

mshariful commented Aug 24, 2020 •

edited

Loading

spacyr wishlist #109

spacyr wishlist #109

Comments

amatsuo commented May 5, 2018

aourednik commented May 14, 2018 • edited by kbenoit Loading

kbenoit commented May 14, 2018 • edited Loading

dmklotz commented May 15, 2018

amatsuo commented Jun 5, 2018

amatsuo commented Jun 6, 2018

cecilialee commented Jun 7, 2018

kbenoit commented Jun 7, 2018

cecilialee commented Jun 11, 2018

amatsuo commented Jun 11, 2018

aourednik commented Aug 30, 2018

kbenoit commented Aug 30, 2018

aourednik commented Aug 30, 2018 • edited Loading

kbenoit commented Aug 30, 2018

aourednik commented Aug 30, 2018

amatsuo commented Aug 31, 2018

jwijffels commented Aug 31, 2018

aourednik commented Sep 4, 2018

aourednik commented Sep 4, 2018

jwijffels commented Sep 4, 2018 • edited Loading

ChengYJon commented Nov 17, 2018

kasperwelbers commented Apr 4, 2019

ChengYJon commented Apr 26, 2019

fkrauer commented Jun 11, 2019

amatsuo commented Jun 11, 2019

fkrauer commented Jun 11, 2019

amatsuo commented Jun 11, 2019 • edited Loading

fkrauer commented Jun 11, 2019

mshariful commented Aug 24, 2020 • edited Loading

aourednik commented May 14, 2018 •

edited by kbenoit

Loading

kbenoit commented May 14, 2018 •

edited

Loading

aourednik commented Aug 30, 2018 •

edited

Loading

jwijffels commented Sep 4, 2018 •

edited

Loading

amatsuo commented Jun 11, 2019 •

edited

Loading

mshariful commented Aug 24, 2020 •

edited

Loading