spacyr wishlist #109
Having a "just tokenization" option with lemmatization would be great.

```r
parsed <- spacy_parse(my_corpus, pos = FALSE, entity = FALSE, dependency = FALSE)
parsed$token <- parsed$lemma
my_tokens <- as.tokens(parsed)
```

The first line causes a memory overload on a large `my_corpus`, while `tokens(my_corpus)` is fast, with no memory problem. I don't know to what extent this is due to the inherent memory use of spaCy, though. Could quanteda's `tokens()` perhaps gain a `lemmatize` option, along these lines?

```r
my_tokens <- tokens(txtc,
                    what = "word",
                    remove_numbers = TRUE,
                    remove_punct = TRUE,
                    remove_separators = TRUE,
                    remove_symbols = TRUE,
                    include_docvars = TRUE,
                    lemmatize = "spacy_parse")
```
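As an aside, a minimal sketch (not from the original post) of one way to work around the memory overload: parse the corpus in batches and bind the results, so spaCy never holds the whole corpus at once. The batch size of 1,000 documents is an arbitrary assumption.

```r
library(quanteda)
library(spacyr)

# split the texts into batches of ~1000 documents and parse each separately
txts <- texts(my_corpus)
batches <- split(txts, ceiling(seq_along(txts) / 1000))
parsed <- do.call(rbind, lapply(batches, function(x)
  spacy_parse(x, pos = FALSE, entity = FALSE, dependency = FALSE)))
```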
Not a bad idea. @amatsuo maybe add:

```r
spacy_tokenize(x, what = c("word", "sentence"),
               remove_numbers = FALSE, remove_punct = FALSE,
               remove_symbols = FALSE, remove_separators = TRUE,
               remove_twitter = FALSE, remove_hyphens = FALSE,
               remove_url = FALSE, value = c("list", "data.frame"))
```

where the last argument returns one of the two TIF formats for tokens? This is as close to the `tokens()` signature as makes sense. We could also add this to …
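For reference, a sketch (illustrative, not from the thread) of the two TIF-style shapes that `value` would select between: a named list of character vectors, and a one-row-per-token data.frame with `doc_id` and `token` columns.

```r
# value = "list": named list, one character vector per document
list(d1 = c("spaCy", "is", "fast"),
     d2 = c("so", "is", "quanteda"))

# value = "data.frame": one row per token
data.frame(doc_id = c("d1", "d1", "d1"),
           token  = c("spaCy", "is", "fast"),
           stringsAsFactors = FALSE)
```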
Definitely would be interested in noun phrase extraction.
@aourednik and @kbenoit I have implemented `spacy_tokenize()` in the `tokenize-function` branch. Some options are left out: …
Is it possible to train a new model with spacyr at the moment?
@cecilialee No; training a new language model has to be done in Python, following the spaCy instructions. We are unlikely to add this facility to spacyr in the foreseeable future.
@kbenoit Sure. Then if I've trained a model with Python, how can I use (initialize) that model with spacyr?
The `model` argument of `spacy_initialize()` should handle that: pass the name of your trained model when initializing.
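For example, a minimal sketch (the model name here is hypothetical, and it assumes the custom model is installed where spaCy can find it):

```r
library(spacyr)
# initialize spaCy with a custom trained model instead of a default one
spacy_initialize(model = "my_custom_model")
```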
@amatsuo Is there a simple way to install the full `tokenize-function` branch version of spacyr in R?
@aourednik that would be `devtools::install_github("quanteda/spacyr", ref = "tokenize-function")`.
Great, thanks for these developments!

```r
library("udpipe")
library("data.table")  # for as.data.table() below

# dl <- udpipe_download_model(language = "french")  # necessary only when not yet downloaded
udmodel_french <- udpipe_load_model(file = "french-ud-2.0-170801.udpipe")

# txtc is my quanteda corpus
txtudpipetokens <- lapply(head(texts(txtc)), function(x) {
  udp <- udpipe_annotate(udmodel_french, x)
  return(as.data.table(udp)$lemma)
})
```
Glad it's working for you! We should be finished with the integration of the `tokenize-function` branch shortly. On integration with udpipe, that's probably better done in that package; @jwijffels, we'd be happy to assist with this.
@amatsuo @kbenoit I have tried out:

```r
devtools::install_github("quanteda/spacyr", ref = "tokenize-function")
parsed <- spacy_tokenize(corpus_sample(txtc, 10))
# Error in spacy_tokenize(corpus_sample(txtc, 10)) :
#   could not find function "spacy_tokenize"

source("https://raw.githubusercontent.com/quanteda/spacyr/tokenize-function/R/spacy_tokenize.R")
parsed <- spacy_tokenize(corpus_sample(txtc, 10))
# Error in UseMethod("spacy_tokenize") :
#   no applicable method for 'spacy_tokenize' applied to an object of class "c('corpus', 'list')"
```

Session info:

```
setting value
version R version 3.4.4 (2018-03-15)
system x86_64, linux-gnu
ui RStudio (1.1.423)
language en_US
collate en_US.UTF-8
tz Europe/Zurich
date 2018-08-30
```

Packages:

```
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.4.1)
backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
base * 3.4.4 2018-03-16 local
base64enc 0.1-3 2015-07-28 CRAN (R 3.4.2)
bindr 0.1.1 2018-03-13 CRAN (R 3.4.3)
bindrcpp 0.2.2 2018-03-29 CRAN (R 3.4.4)
checkmate 1.8.5 2017-10-24 CRAN (R 3.4.3)
codetools 0.2-15 2016-10-05 CRAN (R 3.3.1)
colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
compiler 3.4.4 2018-03-16 local
crayon 1.3.4 2017-09-16 CRAN (R 3.4.3)
curl 3.2 2018-03-28 CRAN (R 3.4.4)
data.table * 1.11.4 2018-05-27 CRAN (R 3.4.4)
datasets * 3.4.4 2018-03-16 local
devtools 1.13.6 2018-06-27 CRAN (R 3.4.4)
digest 0.6.16 2018-08-22 CRAN (R 3.4.4)
doMC * 1.3.5 2017-12-12 CRAN (R 3.4.3)
dplyr 0.7.6 2018-06-29 CRAN (R 3.4.4)
evaluate 0.11 2018-07-17 CRAN (R 3.4.4)
fastmatch 1.1-1 2017-11-21 local
forcats * 0.3.0 2018-02-19 CRAN (R 3.4.4)
foreach * 1.4.4 2017-12-12 CRAN (R 3.4.3)
ggplot2 * 3.0.0 2018-07-03 CRAN (R 3.4.4)
git2r 0.23.0 2018-07-17 CRAN (R 3.4.4)
glue 1.3.0 2018-07-17 CRAN (R 3.4.4)
graphics * 3.4.4 2018-03-16 local
grDevices * 3.4.4 2018-03-16 local
grid 3.4.4 2018-03-16 local
gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
htmlTable * 1.12 2018-05-26 CRAN (R 3.4.4)
htmltools 0.3.6 2017-04-28 CRAN (R 3.4.2)
htmlwidgets 1.2 2018-04-19 CRAN (R 3.4.4)
httr 1.3.1 2017-08-20 CRAN (R 3.4.2)
igraph * 1.1.2 2017-07-21 CRAN (R 3.4.2)
iterators * 1.0.10 2018-07-13 CRAN (R 3.4.4)
jsonlite 1.5 2017-06-01 CRAN (R 3.4.2)
knitr 1.20 2018-02-20 CRAN (R 3.4.3)
labeling 0.3 2014-08-23 CRAN (R 3.4.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.3.3)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
lubridate 1.7.4 2018-04-11 CRAN (R 3.4.4)
magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
Matrix 1.2-14 2018-04-09 CRAN (R 3.4.4)
memoise 1.1.0 2017-04-21 CRAN (R 3.4.3)
methods * 3.4.4 2018-03-16 local
munsell 0.5.0 2018-06-12 CRAN (R 3.4.4)
parallel * 3.4.4 2018-03-16 local
pillar 1.3.0 2018-07-14 CRAN (R 3.4.4)
pkgconfig 2.0.2 2018-08-16 CRAN (R 3.4.4)
plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
purrr 0.2.5 2018-05-29 CRAN (R 3.4.4)
qdapRegex 0.7.2 2017-04-09 CRAN (R 3.4.2)
quanteda * 1.3.4 2018-07-15 CRAN (R 3.4.4)
R2HTML * 2.3.2 2016-06-23 CRAN (R 3.4.3)
R6 2.2.2 2017-06-17 CRAN (R 3.4.1)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.1)
Rcpp 0.12.18 2018-07-23 CRAN (R 3.4.4)
RcppParallel 4.4.1 2018-07-19 CRAN (R 3.4.4)
readtext * 0.71 2018-05-10 CRAN (R 3.4.4)
rlang 0.2.2 2018-08-16 CRAN (R 3.4.4)
rlist * 0.4.6.1 2016-04-04 CRAN (R 3.4.4)
rmarkdown 1.10 2018-06-11 CRAN (R 3.4.4)
rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
rstudioapi 0.7 2017-09-07 CRAN (R 3.4.3)
scales * 1.0.0 2018-08-09 CRAN (R 3.4.4)
spacyr 0.9.91 2018-08-30 Github (quanteda/spacyr@240b6ef)
stats * 3.4.4 2018-03-16 local
stopwords 0.9.0 2017-12-14 CRAN (R 3.4.3)
stringi 1.2.4 2018-07-20 CRAN (R 3.4.4)
stringr * 1.3.1 2018-05-10 CRAN (R 3.4.4)
textclean * 0.9.3 2018-07-23 CRAN (R 3.4.4)
tibble 1.4.2 2018-01-22 CRAN (R 3.4.3)
tidyselect 0.2.4 2018-02-26 CRAN (R 3.4.3)
tools 3.4.4 2018-03-16 local
udpipe * 0.6.1 2018-07-30 CRAN (R 3.4.4)
utils * 3.4.4 2018-03-16 local
withr 2.1.2 2018-03-15 CRAN (R 3.4.4)
yaml 2.2.0 2018-07-25 CRAN (R 3.4.4)
```
It seems that you forgot to load the package with `library("spacyr")`.
If you just want to get the lemmas in French using udpipe and put them into the quanteda corpus structure, I think it is just this (the example below takes only nouns and proper nouns).
Why do you think such code would have to be put into the udpipe R package?
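The original example was not preserved in this thread; the following is a minimal reconstruction under the same assumptions as the earlier comment (the French model file, and `txtc` a quanteda corpus): annotate the texts, keep nouns and proper nouns, and collect lemmas per document.

```r
library(udpipe)
library(quanteda)

udmodel_french <- udpipe_load_model(file = "french-ud-2.0-170801.udpipe")

# annotate all texts, keeping quanteda's document names as doc_id
anno <- as.data.frame(udpipe_annotate(udmodel_french,
                                      x = texts(txtc),
                                      doc_id = docnames(txtc)))

# keep only nouns and proper nouns
anno <- subset(anno, upos %in% c("NOUN", "PROPN"))

# named list of lemma vectors, one element per document
lemmas <- split(anno$lemma, anno$doc_id)
```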
@amatsuo Yes, my mistake, I forgot to reload the package; the first error was due to this, sorry. Now I am getting only the second error on my machine (same session info as before):

```r
> class(txtc)
[1] "corpus" "list"
> txtc
Corpus consisting of 35,701 documents and 5 docvars.
> devtools::install_github("quanteda/spacyr", ref = "tokenize-function", force = TRUE)
Downloading GitHub repo quanteda/spacyr@tokenize-function
from URL https://api.github.com/repos/quanteda/spacyr/zipball/tokenize-function
Installing spacyr
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL \
  '/tmp/Rtmp3TNiYi/devtoolsc003f381001/quanteda-spacyr-240b6ef' \
  --library='/home/andre/R/x86_64-pc-linux-gnu-library/3.4' --install-tests
* installing *source* package ‘spacyr’ ...
** R
** data
*** moving datasets to lazyload DB
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (spacyr)
Reloading installed spacyr
unloadNamespace("spacyr") not successful, probably because another loaded package depends on it. Forcing unload. If you encounter problems, please restart R.

Attaching package: ‘spacyr’

The following object is masked from ‘package:quanteda’:

    spacy_parse

> library("spacyr")
> parsed <- spacy_tokenize(corpus_sample(txtc, 10))
Error in UseMethod("spacy_tokenize") :
  no applicable method for 'spacy_tokenize' applied to an object of class "c('corpus', 'list')"
```
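A possible workaround (not confirmed in the thread, and assuming the branch at this point only defined a method for character vectors): pass the document texts rather than the corpus object itself.

```r
library(quanteda)
# pass a named character vector of texts instead of the corpus object
parsed <- spacy_tokenize(texts(corpus_sample(txtc, 10)))
```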
@jwijffels Many thanks for the code! It currently returns a named list of character vectors containing lemmatized tokens, which comes much closer to what I (and most probably other users of both quanteda and udpipe) need. The best, though, would be a udpipe function that returns a quanteda object of class tokens. A tokens object is normally generated by `quanteda::tokens()`.
quanteda's tokens element of the corpus seems to be a list of terms with the class "tokens". If you want to use the udpipe output in that form, converting the named list of lemmas should be enough; see the sketch below.
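A minimal sketch of that conversion with `quanteda::as.tokens()` (the data here is illustrative, standing in for the `lemmas` list built above):

```r
library(quanteda)

# a named list of lemma vectors, one element per document
lemmas <- list(doc1 = c("voir", "chat"),
               doc2 = c("grand", "maison"))

toks <- as.tokens(lemmas)  # quanteda tokens object, usable with dfm() etc.
```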
This has been super useful, thank you! Are there any plans to implement spaCy's neural coreference functions in R?
@ChengYJon I was also looking to use the neuralcoref pipeline component, so I took a stab at it in this fork. There is some hassle, though (as explained in the README), because neuralcoref currently doesn't seem to work with spaCy > 2.0.12. Simply downgrading spaCy in turn resulted in other compatibility issues, so for me a clean conda install was required. Until these compatibility issues are resolved it's quite cumbersome.
@kasperwelbers Thank you so much for this. I kept having to switch between Python and R. I'll try this fork out and let you know if I'm able to recreate the process.
If it isn't already incorporated (I haven't found anything), I'd love to have a "start" and "end" character for each token. Otherwise tokens cannot be uniquely identified in the running text.
@fkrauer Thank you for the post. I am not sure what you mean by a "start" and "end" character. Could you elaborate a bit more, or show us the desired output?
I mean the character position of each token with respect to the original text. For example, in the text "The cat sat", the token "cat" would have start 5 and end 7. The count starts with 1 at the first character, and all characters are counted (including whitespace). coreNLP (the R wrapper for Stanford's CoreNLP) has this feature, which is very useful when you have to map the tokens back onto the original text or compare different NLP algorithms.
I see. It's not implemented in `spacy_parse()` directly, but spaCy's `idx` token attribute (the 0-based character offset of each token) can be requested through `additional_attributes`:

```r
library(spacyr)
library(tidyverse)

txt <- c(doc1 = "spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. Independent research in 2015 found spaCy to be the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using.",
         doc2 = "spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python's awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.")

out <- spacy_parse(txt, additional_attributes = c("idx"), entity = FALSE,
                   lemma = FALSE, pos = FALSE)
out %>%
  mutate(start = idx - idx[1] + 1) %>%
  mutate(end = start + nchar(token) - 1)
```

What the code does is: request spaCy's 0-based `idx` offset for each token, shift it to a 1-based `start` position (the first `idx` is 0), and compute `end` as `start` plus the token length minus one.
The head of the output is:

```
##    doc_id sentence_id token_id       token idx start end
## 1    doc1           1        1       spaCy   0     1   5
## 2    doc1           1        2      excels   6     7  12
## 3    doc1           1        3          at  13    14  15
## 4    doc1           1        4       large  16    17  21
## 5    doc1           1        5           -  21    22  22
## 6    doc1           1        6       scale  22    23  27
## 7    doc1           1        7 information  28    29  39
## 8    doc1           1        8  extraction  40    41  50
## 9    doc1           1        9       tasks  51    52  56
## 10   doc1           1       10           .  56    57  57
## 11   doc1           2        1          It  58    59  60
## 12   doc1           2        2          's  60    61  62
## 13   doc1           2        3     written  63    64  70
## 14   doc1           2        4        from  71    72  75
## 15   doc1           2        5         the  76    77  79
## 16   doc1           2        6      ground  80    81  86
## 17   doc1           2        7          up  87    88  89
## 18   doc1           2        8          in  90    91  92
## 19   doc1           2        9   carefully  93    94 102
## 20   doc1           2       10      memory 103   104 109
```

I am not sure whether we should provide this as a functionality of `spacy_parse()` yet, but it could be.
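A quick check (a sketch building on the code above) that the offsets line up: slicing the original text with `substr()` should recover each token.

```r
res <- out %>%
  mutate(start = idx - idx[1] + 1) %>%
  mutate(end = start + nchar(token) - 1)

# first token of doc1: should print "spaCy"
substr(txt[["doc1"]], res$start[1], res$end[1])
```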
I had written a for loop with `stringr::str_locate()`, but your solution is much quicker, thank you.
Hi, … TIA
Hello @kbenoit and other users of spacyr,

This is a comprehensive wishlist of spacyr updates, inspired by our discussion with @honnibal and @ines. We will implement some of them in the future, but is there anything you are particularly interested in?

Something likely to be implemented
…

Something nice to have, but not sure how many users need it
…

Just a wish
…