spacyr wishlist #109

Open
amatsuo opened this issue May 5, 2018 · 28 comments

@amatsuo
Collaborator

amatsuo commented May 5, 2018

Hello @kbenoit and other users of spacyr

This is a comprehensive wishlist of spacyr updates inspired by our discussion with @honnibal and @ines. We will implement some of them in the future, but is there anything you are particularly interested in?

Something likely to be implemented

  • Turn off parts of the pipeline at execution time
  • Noun phrase extraction
  • Having a "just tokenization" option
  • Size of batch in pipe for performance tuning

Something nice to have but not sure how many users need it

  • Allowing users to add user-defined attributes
  • Functionality to add word embeddings to the model

Just a wish

  • https://explosion.ai/demos/
    • consider the use of matcher in spacyr (construction of JSON is necessary)
  • parse-tree navigations (maybe a different package)
@aourednik

aourednik commented May 14, 2018

Having a "just tokenization" option with lemmatization would be great.
Currently trying to use

parsed <- spacy_parse(my_corpus, pos=FALSE, entity=FALSE, dependency=FALSE)
parsed$token <- parsed$lemma
my_tokens <- as.tokens(parsed)

The first line exhausts memory on a large my_corpus, while tokens(my_corpus) is fast, with no memory problem. I don't know to what extent this is due to the inherent memory use of spaCy, though.

Could spacyr somehow be included as an option in the tokens() function? Like this?

my_tokens <- tokens(txtc,
  what='word',
  remove_numbers = TRUE, 
  remove_punct = TRUE, 
  remove_separators=TRUE, 
  remove_symbols = TRUE,
  include_docvars = TRUE,
  lemmatize = "spacy_parse"
)

@kbenoit
Collaborator

kbenoit commented May 14, 2018

Not a bad idea. @amatsuo maybe add:

spacy_tokenize(x, what = c("word", "sentence"), 
  remove_numbers = FALSE, remove_punct = FALSE,
  remove_symbols = FALSE, remove_separators = TRUE,
  remove_twitter = FALSE, remove_hyphens = FALSE, 
  remove_url = FALSE, value = c("list", "data.frame"))

where the last one returns one of the two TIF formats for tokens? This is as close to quanteda::tokens() as possible, and spacy_tokenize(x, value = "list") %>% as.tokens() provides the option of going straight to a quanteda tokens class using the spaCy tokeniser.

We could also add to spacy_parse() a new option for sentence = TRUE that would remove the sentence_id return field, and number tokens consecutively within document. So if all options are FALSE, it's the same as spacy_tokenize(x, what = "word", value = "data.frame") -- indeed, that function could call this version of spacy_parse().
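
For readers unfamiliar with the two TIF (Text Interchange Format) token shapes mentioned above, here is a minimal base R sketch of how they relate. The helper name tokens_list_to_df is hypothetical and not part of spacyr or quanteda; the sketch assumes the list form is a named list of character vectors and the data.frame form has doc_id and token columns.

```r
# Hypothetical helper (not part of spacyr or quanteda): convert a named list
# of character vectors into a data.frame with one row per token.
tokens_list_to_df <- function(toks) {
  stopifnot(is.list(toks), !is.null(names(toks)))
  data.frame(
    doc_id = rep(names(toks), lengths(toks)),
    token = unlist(toks, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# Two toy documents, already tokenised by hand for illustration.
toks <- list(doc1 = c("Hello", "world", "."), doc2 = c("Second", "text"))
toks_df <- tokens_list_to_df(toks)
print(toks_df)
```

Going the other way is a one-liner, split(toks_df$token, toks_df$doc_id), which is why offering both return values from a single function would cost little.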

@dmklotz

dmklotz commented May 15, 2018

Definitely would be interested in noun phrase extractions.

@amatsuo
Collaborator Author

amatsuo commented Jun 5, 2018

Hi @dmklotz

I opened an issue for noun-phrase extraction (#117). Please provide your thoughts there.

@amatsuo
Collaborator Author

amatsuo commented Jun 6, 2018

@aourednik and @kbenoit

I have implemented spacy_tokenize in the tokenize-function branch. Please try it and give me some feedback.

Some options are left out: remove_symbols, remove_hyphens, remove_twitter. In my opinion, these options are about text-preprocessing before handing texts to spaCy NLP. At the moment, spacyr does not import stringi and I don't see much reason to use gsub() in 2018 for potentially large-scale text processing.

@cecilialee

Is it possible to train a new model with spacyr at the moment?

@kbenoit
Collaborator

kbenoit commented Jun 7, 2018

@cecilialee No, for training a new language model you would need to do that in Python using the spaCy instructions. We are unlikely to add this facility to spacyr in the foreseeable future.

@cecilialee

@kbenoit Sure. Then if I've trained a model with python, how can I use (initialize) that model with spacyr?

@amatsuo
Collaborator Author

amatsuo commented Jun 11, 2018

@cecilialee

The model argument of spacy_initialize is passed as the model name argument to spacy.load('**'). So you should be able to use the name of the model you saved in Python when you call spacy_initialize.

@aourednik

@amatsuo Is there a simple way to install the full tokenize-function branch version of spacyr in R?

@kbenoit
Collaborator

kbenoit commented Aug 30, 2018

@aourednik that would be

devtools::install_github("quanteda/spacyr", ref = "tokenize-function")

@aourednik

aourednik commented Aug 30, 2018

Great thanks for these developments!
By the way, this has more to do with quanteda in general than with spacyr, but since we are speaking of lemmatization, I was wondering if it would be feasible to implement a udpipe lemmatizer in the tokens() function? Or something like udpipe_tokenize() taking a quanteda corpus as argument and returning lemmatized tokens? UDPipe is reported to perform better, though slower, lemmatization for French, Italian and Spanish than spaCy.
For now, I can get a list of lists of tokens like this (below), but having a quanteda tokens object would allow me to remain within the quanteda framework.

library("udpipe")
# dl <- udpipe_download_model(language = "french") # necessary only when not yet downloaded
udmodel_french <- udpipe_load_model(file = "french-ud-2.0-170801.udpipe")
#txtc is my quanteda corpus
txtudpipetokens <- lapply(head(texts(txtc)), function(x) {
  udp <- udpipe_annotate(udmodel_french, x)
  # as.data.frame() avoids needing data.table just to read the lemma column
  as.data.frame(udp)$lemma
})

cf. https://github.com/bnosac/udpipe @amatsuo @jwijffels

@kbenoit
Collaborator

kbenoit commented Aug 30, 2018

Glad it's working for you! We should be finished with the integration of the tokenize-function branch next week. When that's completed, it will be very easy to use spacyr for tokenisation or lemmatising.

On integration with udpipe, that's probably better done in that package. @jwijffels we'd be happy to assist with this.

@aourednik

@amatsuo @kbenoit I have tried out:

devtools::install_github("quanteda/spacyr", ref = "tokenize-function")
parsed <- spacy_tokenize(corpus_sample(txtc,10))
#Error in spacy_tokenize(corpus_sample(txtc, 10)) : 
#  could not find function "spacy_tokenize"
source("https://raw.githubusercontent.com/quanteda/spacyr/tokenize-function/R/spacy_tokenize.R")
parsed <- spacy_tokenize(corpus_sample(txtc,10))
#Error in UseMethod("spacy_tokenize") : 
#  no applicable method for 'spacy_tokenize' applied to an object of class "c('corpus', 'list')"
Session info -------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.423)           
 language en_US                       
 collate  en_US.UTF-8                 
 tz       Europe/Zurich               
 date     2018-08-30                  

Packages -----------------------------------------------------------------------------------------------------
 package      * version date       source                          
 assertthat     0.2.0   2017-04-11 CRAN (R 3.4.1)                  
 backports      1.1.2   2017-12-13 CRAN (R 3.4.3)                  
 base         * 3.4.4   2018-03-16 local                           
 base64enc      0.1-3   2015-07-28 CRAN (R 3.4.2)                  
 bindr          0.1.1   2018-03-13 CRAN (R 3.4.3)                  
 bindrcpp       0.2.2   2018-03-29 CRAN (R 3.4.4)                  
 checkmate      1.8.5   2017-10-24 CRAN (R 3.4.3)                  
 codetools      0.2-15  2016-10-05 CRAN (R 3.3.1)                  
 colorspace     1.3-2   2016-12-14 CRAN (R 3.4.0)                  
 compiler       3.4.4   2018-03-16 local                           
 crayon         1.3.4   2017-09-16 CRAN (R 3.4.3)                  
 curl           3.2     2018-03-28 CRAN (R 3.4.4)                  
 data.table   * 1.11.4  2018-05-27 CRAN (R 3.4.4)                  
 datasets     * 3.4.4   2018-03-16 local                           
 devtools       1.13.6  2018-06-27 CRAN (R 3.4.4)                  
 digest         0.6.16  2018-08-22 CRAN (R 3.4.4)                  
 doMC         * 1.3.5   2017-12-12 CRAN (R 3.4.3)                  
 dplyr          0.7.6   2018-06-29 CRAN (R 3.4.4)                  
 evaluate       0.11    2018-07-17 CRAN (R 3.4.4)                  
 fastmatch      1.1-1   2017-11-21 local                           
 forcats      * 0.3.0   2018-02-19 CRAN (R 3.4.4)                  
 foreach      * 1.4.4   2017-12-12 CRAN (R 3.4.3)                  
 ggplot2      * 3.0.0   2018-07-03 CRAN (R 3.4.4)                  
 git2r          0.23.0  2018-07-17 CRAN (R 3.4.4)                  
 glue           1.3.0   2018-07-17 CRAN (R 3.4.4)                  
 graphics     * 3.4.4   2018-03-16 local                           
 grDevices    * 3.4.4   2018-03-16 local                           
 grid           3.4.4   2018-03-16 local                           
 gtable         0.2.0   2016-02-26 CRAN (R 3.4.0)                  
 htmlTable    * 1.12    2018-05-26 CRAN (R 3.4.4)                  
 htmltools      0.3.6   2017-04-28 CRAN (R 3.4.2)                  
 htmlwidgets    1.2     2018-04-19 CRAN (R 3.4.4)                  
 httr           1.3.1   2017-08-20 CRAN (R 3.4.2)                  
 igraph       * 1.1.2   2017-07-21 CRAN (R 3.4.2)                  
 iterators    * 1.0.10  2018-07-13 CRAN (R 3.4.4)                  
 jsonlite       1.5     2017-06-01 CRAN (R 3.4.2)                  
 knitr          1.20    2018-02-20 CRAN (R 3.4.3)                  
 labeling       0.3     2014-08-23 CRAN (R 3.4.0)                  
 lattice        0.20-35 2017-03-25 CRAN (R 3.3.3)                  
 lazyeval       0.2.1   2017-10-29 CRAN (R 3.4.2)                  
 lubridate      1.7.4   2018-04-11 CRAN (R 3.4.4)                  
 magrittr       1.5     2014-11-22 CRAN (R 3.4.0)                  
 Matrix         1.2-14  2018-04-09 CRAN (R 3.4.4)                  
 memoise        1.1.0   2017-04-21 CRAN (R 3.4.3)                  
 methods      * 3.4.4   2018-03-16 local                           
 munsell        0.5.0   2018-06-12 CRAN (R 3.4.4)                  
 parallel     * 3.4.4   2018-03-16 local                           
 pillar         1.3.0   2018-07-14 CRAN (R 3.4.4)                  
 pkgconfig      2.0.2   2018-08-16 CRAN (R 3.4.4)                  
 plyr           1.8.4   2016-06-08 CRAN (R 3.4.0)                  
 purrr          0.2.5   2018-05-29 CRAN (R 3.4.4)                  
 qdapRegex      0.7.2   2017-04-09 CRAN (R 3.4.2)                  
 quanteda     * 1.3.4   2018-07-15 CRAN (R 3.4.4)                  
 R2HTML       * 2.3.2   2016-06-23 CRAN (R 3.4.3)                  
 R6             2.2.2   2017-06-17 CRAN (R 3.4.1)                  
 RColorBrewer   1.1-2   2014-12-07 CRAN (R 3.4.1)                  
 Rcpp           0.12.18 2018-07-23 CRAN (R 3.4.4)                  
 RcppParallel   4.4.1   2018-07-19 CRAN (R 3.4.4)                  
 readtext     * 0.71    2018-05-10 CRAN (R 3.4.4)                  
 rlang          0.2.2   2018-08-16 CRAN (R 3.4.4)                  
 rlist        * 0.4.6.1 2016-04-04 CRAN (R 3.4.4)                  
 rmarkdown      1.10    2018-06-11 CRAN (R 3.4.4)                  
 rprojroot      1.3-2   2018-01-03 CRAN (R 3.4.3)                  
 rstudioapi     0.7     2017-09-07 CRAN (R 3.4.3)                  
 scales       * 1.0.0   2018-08-09 CRAN (R 3.4.4)                  
 spacyr         0.9.91  2018-08-30 Github (quanteda/spacyr@240b6ef)
 stats        * 3.4.4   2018-03-16 local                           
 stopwords      0.9.0   2017-12-14 CRAN (R 3.4.3)                  
 stringi        1.2.4   2018-07-20 CRAN (R 3.4.4)                  
 stringr      * 1.3.1   2018-05-10 CRAN (R 3.4.4)                  
 textclean    * 0.9.3   2018-07-23 CRAN (R 3.4.4)                  
 tibble         1.4.2   2018-01-22 CRAN (R 3.4.3)                  
 tidyselect     0.2.4   2018-02-26 CRAN (R 3.4.3)                  
 tools          3.4.4   2018-03-16 local                           
 udpipe       * 0.6.1   2018-07-30 CRAN (R 3.4.4)                  
 utils        * 3.4.4   2018-03-16 local                           
 withr          2.1.2   2018-03-15 CRAN (R 3.4.4)                  
 yaml           2.2.0   2018-07-25 CRAN (R 3.4.4)  

@amatsuo
Collaborator Author

amatsuo commented Aug 31, 2018

It seems that you forgot to load the package by library(spacyr).

@jwijffels

If you just want to get the lemmas in French using udpipe and put them into the quanteda corpus structure, I think it is just this (the example below takes only nouns and proper nouns).

library(udpipe)
library(quanteda)
udmodel <- udpipe_load_model("french-ud-2.0-170801.udpipe")
## assuming that txtc is a quanteda corpus
x <- udpipe_annotate(udmodel, x = texts(txtc), doc_id = docnames(txtc), parser = "none")
x <- as.data.frame(x)
x <- subset(x, upos %in% c('NOUN', 'PROPN'))
txtc$tokens <- split(x$lemma, x$doc_id)

Why do you think such code would have to be put into the udpipe R package?

@aourednik

@amatsuo Yes, my mistake, I forgot to reload the package; the first error was due to this, sorry. Now I am getting only the second error on my machine (same session info as before):

> class(txtc)
[1] "corpus" "list"  
> txtc
Corpus consisting of 35,701 documents and 5 docvars.
> devtools::install_github("quanteda/spacyr", ref = "tokenize-function",force=TRUE)
Downloading GitHub repo quanteda/spacyr@tokenize-function
from URL https://api.github.com/repos/quanteda/spacyr/zipball/tokenize-function
Installing spacyr
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL  \
  '/tmp/Rtmp3TNiYi/devtoolsc003f381001/quanteda-spacyr-240b6ef'  \
  --library='/home/andre/R/x86_64-pc-linux-gnu-library/3.4' --install-tests 

* installing *source* package 'spacyr' ...
** R
** data
*** moving datasets to lazyload DB
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (spacyr)
Reloading installed spacyr
unloadNamespace("spacyr") not successful, probably because another loaded package depends on it. Forcing unload. If you encounter problems, please restart R.

Attaching package: 'spacyr'

The following object is masked from 'package:quanteda':

    spacy_parse

> library("spacyr")
> parsed <- spacy_tokenize(corpus_sample(txtc,10))
Error in UseMethod("spacy_tokenize") : 
  no applicable method for 'spacy_tokenize' applied to an object of class "c('corpus', 'list')"
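
The message above is what R's S3 dispatch reports when a generic has no method for any class of its argument. The sketch below reproduces it with a toy generic. toy_tokenize and fake_corpus are hypothetical names, the whitespace split is only a stand-in for spaCy, and it is an assumption (not confirmed in this thread) that the branch at this commit only had a method for character vectors. If that is the case, passing the texts themselves rather than the corpus object would be one workaround.

```r
# Toy generic standing in for spacy_tokenize(); all names are hypothetical.
toy_tokenize <- function(x, ...) UseMethod("toy_tokenize")

# Only a character method is defined (an assumption about the real function).
toy_tokenize.character <- function(x, ...) {
  strsplit(x, "[[:space:]]+")
}

# A list carrying the class vector reported in the error message.
fake_corpus <- structure(
  list(documents = c(d1 = "a b c", d2 = "d e")),
  class = c("corpus", "list")
)

# Dispatch fails: there is no method for "corpus" or "list", and no default.
res <- tryCatch(toy_tokenize(fake_corpus),
                error = function(e) conditionMessage(e))
print(res)

# Handing over the underlying character vector dispatches as expected.
toks <- toy_tokenize(fake_corpus$documents)
print(toks)
```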

@aourednik

@jwijffels Many thanks for the code! It currently returns a named list of character vectors containing lemmatized tokens, which comes much closer to what I (and most probably other users of both quanteda and udpipe) would need. The best, though, would be to have a udpipe function return a quanteda object of class tokens. A tokens object is normally generated by tokens() or by the new spacy_tokenize() discussed here. The tokens object can easily be turned into a document-feature matrix with dfm(), which allows, for instance, fast dictionary lookup with dfm_lookup().
My concrete use-case is lexicon-based sentiment analysis and emotion mining.
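
The named list of character vectors mentioned above can be illustrated with a small base R sketch. The annotation table here is hand-written and stands in for as.data.frame(udpipe_annotate(...)); the rows are invented for illustration and only the doc_id, lemma and upos columns are assumed.

```r
# Invented rows standing in for a udpipe annotation data.frame.
anno <- data.frame(
  doc_id = c("text1", "text1", "text1", "text2", "text2"),
  lemma = c("chat", "manger", "souris", "chien", "dormir"),
  upos = c("NOUN", "VERB", "NOUN", "NOUN", "VERB"),
  stringsAsFactors = FALSE
)

# Keep nouns and proper nouns only, then split the lemmas by document:
# the result is a named list of character vectors, one element per document.
nouns <- subset(anno, upos %in% c("NOUN", "PROPN"))
lemma_list <- split(nouns$lemma, nouns$doc_id)
str(lemma_list)
```

One thing to watch: split() orders the list by the sorted values of doc_id, so with real document names the order may differ from docnames(txtc), and documents that lose all their tokens in the subset drop out of the list entirely.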

@jwijffels

jwijffels commented Sep 4, 2018

quanteda's tokens element of the corpus seems to be a list of terms with the class tokenizedTexts. If you want this, just wrap the code that I showed above in as.tokenizedTexts which is part of quanteda.

library(udpipe)
library(quanteda)
udmodel <- udpipe_load_model("french-ud-2.0-170801.udpipe")
## assuming that txtc is a quanteda corpus
x <- udpipe_annotate(udmodel, x = texts(txtc), doc_id = docnames(txtc), parser = "none")
x <- as.data.frame(x)
txtc$tokens <- as.tokenizedTexts(split(x$lemma, x$doc_id))

If you want to use udpipe to get a DTM/document-feature matrix of adjectives for sentiment analysis, you can just use the code below and proceed with e.g. dfm_lookup if you need it.

## For sentiment analysis, with udpipe, just take the adjectives and get a dtm
x <- subset(x, upos %in% c('ADJ'))
dtm <- document_term_frequencies(x, document = "doc_id", term = "lemma")
dtm <- document_term_matrix(dtm)

@ChengYJon

This has been super useful! Thank you!

Are there any plans to implement spacy's neural coreference functions into R?

@kasperwelbers

@ChengYJon I was also looking to use the neuralcoref pipeline component, so I took a stab at it in this fork

There is some hassle though (as explained in the README), because neuralcoref currently doesn't seem to work with spacy > 2.0.12. Simply downgrading spacy in turn resulted in other compatibility issues, so for me a clean conda install was required. Until these compatibility issues are resolved it's quite cumbersome.

@ChengYJon

@kasperwelbers Thank you so much for this. I kept having to switch between Python and R. I'll try this fork out and let you know if I'm able to recreate the process.

@fkrauer

fkrauer commented Jun 11, 2019

If it isn't already incorporated (I haven't found anything), I'd love to have a "start" and "end" character for each token. Otherwise they cannot be uniquely identified in the running text.

@amatsuo
Collaborator Author

amatsuo commented Jun 11, 2019

@fkrauer Thank you for the post.

I am not sure what you mean by start and end.

Could you elaborate a bit more? Or could you show us the desired output?

@fkrauer

fkrauer commented Jun 11, 2019

I mean the character position of each token with respect to the original text. For example:

text <- "This is a dummy text."
output <- spacy_parse(text)

> output
token	start	end
This	1	4
is	6	7
a	9	9
dummy	11	15
text	17	20
.	21	21

The count starts at 1 with the first character, and all characters are counted (including whitespace). coreNLP (an R wrapper for Stanford's CoreNLP) has this feature, which is very useful when you have to map the original text back onto the tokens or compare different NLP algorithms.

@amatsuo
Collaborator Author

amatsuo commented Jun 11, 2019

I see. It's not implemented in spacyr, but you could do something like this.

library(spacyr)
library(tidyverse)

txt <- c(doc1 = "spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. Independent research in 2015 found spaCy to be the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using.",
         doc2 = "spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python's awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.")
out <- spacy_parse(txt, additional_attributes = c("idx"), entity = FALSE,
                   lemma = FALSE, pos = FALSE)

out %>%
    mutate(start = idx - idx[1] + 1) %>%
    mutate(end = start + nchar(token) - 1) 

What the code does is:

  1. run spacy_parse with additional attribute of idx, which returns the character offset of the token in the document.
  2. calculate start and end.

The head of output is:

##    doc_id sentence_id token_id       token idx start end
## 1    doc1           1        1       spaCy   0     1   5
## 2    doc1           1        2      excels   6     7  12
## 3    doc1           1        3          at  13    14  15
## 4    doc1           1        4       large  16    17  21
## 5    doc1           1        5           -  21    22  22
## 6    doc1           1        6       scale  22    23  27
## 7    doc1           1        7 information  28    29  39
## 8    doc1           1        8  extraction  40    41  50
## 9    doc1           1        9       tasks  51    52  56
## 10   doc1           1       10           .  56    57  57
## 11   doc1           2        1          It  58    59  60
## 12   doc1           2        2          's  60    61  62
## 13   doc1           2        3     written  63    64  70
## 14   doc1           2        4        from  71    72  75
## 15   doc1           2        5         the  76    77  79
## 16   doc1           2        6      ground  80    81  86
## 17   doc1           2        7          up  87    88  89
## 18   doc1           2        8          in  90    91  92
## 19   doc1           2        9   carefully  93    94 102
## 20   doc1           2       10      memory 103   104 109

I am not sure whether we should provide this as a functionality of spacy_parse yet, but could be.
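
The start and end arithmetic can also be checked without spaCy. The sketch below uses the short example text from earlier in this thread; the idx values are written out by hand as 0-based character offsets, which is an assumption for illustration and not actual spacy_parse() output. Since the first offset in a document is 0, idx + 1 gives the same result as idx - idx[1] + 1.

```r
# Tokens and 0-based character offsets for one short text, written out by
# hand so the sketch runs without spaCy.
text <- "This is a dummy text."
out <- data.frame(
  token = c("This", "is", "a", "dummy", "text", "."),
  idx = c(0, 5, 8, 10, 16, 20),
  stringsAsFactors = FALSE
)

# idx counts from 0, while R string positions count from 1.
out$start <- out$idx + 1
out$end <- out$start + nchar(out$token) - 1

# Round trip: cutting the text at start/end gives back each token.
out$check <- substring(text, out$start, out$end)
print(out)
```

The round-trip column is a cheap sanity check worth keeping when mapping tokens back onto the running text, because any mismatch between check and token shows that the offsets and the text are out of step.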

@fkrauer

fkrauer commented Jun 11, 2019

I had written a for loop with stringr::str_locate(), but your solution is much quicker, thank you.

@mshariful

mshariful commented Aug 24, 2020

spacyr_initialize

Hi,
I have saved an updated spaCy NER model in 'c\updated_model'. The folder 'updated_model' contains
'tagger', 'parser', 'ner', and 'vocab' folders together with two files, 'meta.json' and 'tokenizer'. I can easily
load and use this updated model in python by simply using
spacy.load( 'c\updated_model')
How do I load it in spacyr? I tried
spacy_initialize(model='c\updated_model')
I did not get any error, but it seems spacyr uses the default 'de' model. How do I make sure spacyr uses my updated model?

TIA
Sharif
