Tokenize and noun-phrase extraction #119

amatsuo · 2018-06-08T10:33:47Z

This PR includes the implementation of two new functions discussed in #109 and #117 :

spacy_tokenize for tokenizing documents either to list or to data.frame
spacy_extract_nounphrase for noun-phrase extraction

@kbenoit
We need followings before merging

Tidy up documentation
Implement tests for these two functions
Increment the version number

codecov-io · 2018-06-08T11:04:26Z

Codecov Report

Merging #119 into master will decrease coverage by 1%.
The diff coverage is 35.29%.

@@            Coverage Diff             @@
##           master     #119      +/-   ##
==========================================
- Coverage    42.2%   41.19%   -1.01%     
==========================================
  Files           9       11       +2     
  Lines         699      852     +153     
==========================================
+ Hits          295      351      +56     
- Misses        404      501      +97

Impacted Files	Coverage Δ
R/parse-extractor-functions.R	`65.21% <0%> (-23.02%)`	⬇️
R/spacy_extract_nounphrases.R	`0% <0%> (ø)`
R/spacy_parse.R	`76.92% <18.75%> (-15.02%)`	⬇️
R/spacy_tokenize.R	`77.27% <77.27%> (ø)`
R/spacy_initialize.R	`44.44% <0%> (+1.16%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 186eb12...586a55f. Read the comment docs.

- Rename `type` to `output` - Some linting

Many of these are not working - but that's part of test-driven development.

kbenoit · 2018-10-07T15:37:03Z

We have some interesting functionality issues with remove_separators, with respect to repeated separators and tabs and newlines in particular. The quanteda::tokens() behaviour seems better to me.

> txt <- "One space  two spaces one\ttab\t\ttwo one\nnewline\n\ntwo."
> spacy_tokenize(txt, remove_separators = FALSE)
$t
 [1] "One"     " "       "space"   " "       " "       "two"     " "       "spaces"  " "      
[10] "one"     "\t"      "tab"     "\t\t"    "two"     " "       "one"     "\n"      "newline"
[19] "\n\n"    "two"     "."      

> spacy_tokenize(txt, remove_separators = TRUE)
$t
 [1] "One"     "space"   " "       "two"     "spaces"  "one"     "\t"      "tab"     "\t\t"   
[10] "two"     "one"     "\n"      "newline" "\n\n"    "two"     "."      

> quanteda::tokens(txt, remove_separators = FALSE)
tokens from 1 document.
text1 :
 [1] "One"     " "       "space"   " "       " "       "two"     " "       "spaces"  " "      
[10] "one"     "\t"      "tab"     "\t"      "\t"      "two"     " "       "one"     "\n"     
[19] "newline" "\n"      "\n"      "two"     "."      

> quanteda::tokens(txt, remove_separators = TRUE)
tokens from 1 document.
text1 :
 [1] "One"     "space"   "two"     "spaces"  "one"     "tab"     "two"     "one"     "newline"
[10] "two"     "."

kbenoit

See the tests I added. They are breaking but we should work on the code until they pass. If we decide the tests are inappropriate, we should discuss that before changing them.

Other changes: I fixed a bug in the padding condition in the Python code, and renamed the separator argument.

What are the merits of adding arguments to match the quanteda behaviour of remove_hyphens, remove_twitter, and remove_symbols? (The last we could easily do on the final R side.)

remove_twitter in particular behaves very differently for the spacyr version:

> spacy_tokenize("I am @kenbenoit on Twitter #quanteda.")
$t
[1] "I"          "am"         "@kenbenoit" "on"         "Twitter"    "#"         
[7] "quanteda"   "."         

> spacy_tokenize("I am @kenbenoit on Twitter #quanteda.", remove_punct = TRUE)
$t
[1] "I"          "am"         "@kenbenoit" "on"         "Twitter"    "quanteda"

remove_hyphens comparison:

> txt <- "Jacob Rees-Mogg is a floccinaucinihilipilificator"
> spacy_tokenize(txt)
$t
[1] "Jacob"  "Rees"   "-"      "Mogg"   "is"     "a"      "floccinaucinihilipilificator"

> tokens(txt, remove_hyphens = TRUE)
tokens from 1 document.
text1 :
[1] "Jacob"  "Rees"   "-"      "Mogg"   "is"     "a"      "floccinaucinihilipilificator"

> tokens(txt, remove_hyphens = FALSE)
tokens from 1 document.
text1 :
[1] "Jacob"     "Rees-Mogg" "is"        "a"         "floccinaucinihilipilificator"

amatsuo · 2018-10-22T19:00:06Z

@kbenoit
Question:

At the moment, following two lines:

txt <- "This: £ = GBP! 15% not! > 20 percent?"
spacy_tokenize(txt, remove_punct = TRUE, padding = FALSE) %>% dput

Returns:

list(text1 = c("This", "£", "=", "GBP", "15", "not", ">", "20", "percent"))

That's different from the test expectation. Do you think we should remove characters that may or may not be counted as punctuations such as "£"?

Merge remote-tracking branch 'origin/master' into tokenize-function # Conflicts: # inst/python/initialize_spacyPython.py

…into tokenize-function

kbenoit · 2018-11-02T21:16:57Z

There is a discrepancy in what spaCy considers to be a SYM and the Unicode category classification:

> spacy_tokenize("Contains symbols £ ±", remove_symbols = TRUE)
$text1
[1] "Contains" "symbols"  "±"       

> spacy_parse("Contains symbols £ ±")
  doc_id sentence_id token_id    token   lemma  pos  entity
1  text1           1        1 Contains contain NOUN        
2  text1           1        2  symbols  symbol VERB        
3  text1           1        3        £       £  SYM        
4  text1           1        4        ±       ± NOUN MONEY_B
> stringi::stri_detect_charclass(c("£", "±"), "\\p{S}")
[1] TRUE TRUE

kbenoit · 2018-11-05T10:58:16Z

R/spacy_parse.R

@@ -46,6 +47,7 @@ spacy_parse <- function(x,
                        lemma = TRUE,
                        entity = TRUE, 
                        dependency = FALSE,
+                        noun_phrase = FALSE,


let's remove the underscore, since none of the other arguments have it. So just nounphrase.

amatsuo · 2018-11-05T11:58:09Z

I implemented the first version of noun_phrase option in spacy_parse(). At the moment it works like below. Since the noun phrases od not have the categories like entities, different set of fields are returned.

@kbenoit, what do you think?

Code:

library(spacyr)
txt <- c(doc1 = "Natural Language Processing is a branch of computer science that employs various Artificial Intelligence (AI) techniques to process content written in natural language. NLP-enhanced wikis can support users in finding, developing and organizing knowledge contained inside the wiki repository. ", 
  doc2 = "Paul earned a postgraduate degree from MIT.")
spacy_parse(txt, noun_phrase = TRUE)

Output:

   doc_id sentence_id token_id        token        lemma   pos entity
1    doc1           1        1      Natural      natural PROPN  ORG_B
2    doc1           1        2     Language     language PROPN  ORG_I
3    doc1           1        3   Processing   processing PROPN  ORG_I
4    doc1           1        4           is           be  VERB       
5    doc1           1        5            a            a   DET       
6    doc1           1        6       branch       branch  NOUN       
7    doc1           1        7           of           of   ADP       
8    doc1           1        8     computer     computer  NOUN       
9    doc1           1        9      science      science  NOUN       
10   doc1           1       10         that         that   ADJ       
11   doc1           1       11      employs       employ  VERB       
12   doc1           1       12      various      various   ADJ       
13   doc1           1       13   Artificial   artificial PROPN  ORG_B
14   doc1           1       14 Intelligence intelligence PROPN  ORG_I
15   doc1           1       15            (            ( PUNCT  ORG_I
16   doc1           1       16           AI           ai PROPN  ORG_I
17   doc1           1       17            )            ) PUNCT       
18   doc1           1       18   techniques    technique  NOUN       
19   doc1           1       19           to           to  PART       
20   doc1           1       20      process      process  VERB       
21   doc1           1       21      content      content  NOUN       
22   doc1           1       22      written        write  VERB       
23   doc1           1       23           in           in   ADP       
24   doc1           1       24      natural      natural   ADJ       
25   doc1           1       25     language     language  NOUN       
26   doc1           1       26            .            . PUNCT       
27   doc1           2        1          NLP          nlp PROPN  ORG_B
28   doc1           2        2            -            - PUNCT       
29   doc1           2        3     enhanced      enhance  VERB       
30   doc1           2        4        wikis         wiki  NOUN       
31   doc1           2        5          can          can  VERB       
32   doc1           2        6      support      support  VERB       
33   doc1           2        7        users         user  NOUN       
34   doc1           2        8           in           in   ADP       
35   doc1           2        9      finding         find  VERB       
36   doc1           2       10            ,            , PUNCT       
37   doc1           2       11   developing      develop  VERB       
38   doc1           2       12          and          and CCONJ       
39   doc1           2       13   organizing     organize  VERB       
40   doc1           2       14    knowledge    knowledge  NOUN       
41   doc1           2       15    contained      contain  VERB       
42   doc1           2       16       inside       inside   ADP       
43   doc1           2       17          the          the   DET       
44   doc1           2       18         wiki         wiki  NOUN       
45   doc1           2       19   repository   repository  NOUN       
46   doc1           2       20            .            . PUNCT       
47   doc2           1        1         Paul         paul PROPN  ORG_B
48   doc2           1        2       earned         earn  VERB       
49   doc2           1        3            a            a   DET       
50   doc2           1        4 postgraduate postgraduate  NOUN       
51   doc2           1        5       degree       degree  NOUN       
52   doc2           1        6         from         from   ADP       
53   doc2           1        7          MIT          mit PROPN  ORG_B
54   doc2           1        8            .            . PUNCT       
                                       noun_phrase noun_phrase_root_text
1                      Natural Language Processing            Processing
2                      Natural Language Processing            Processing
3                      Natural Language Processing            Processing
4                                             <NA>                  <NA>
5                                         a branch                branch
6                                         a branch                branch
7                                             <NA>                  <NA>
8                                 computer science               science
9                                 computer science               science
10                                            <NA>                  <NA>
11                                            <NA>                  <NA>
12 various Artificial Intelligence (AI) techniques            techniques
13 various Artificial Intelligence (AI) techniques            techniques
14 various Artificial Intelligence (AI) techniques            techniques
15 various Artificial Intelligence (AI) techniques            techniques
16 various Artificial Intelligence (AI) techniques            techniques
17 various Artificial Intelligence (AI) techniques            techniques
18 various Artificial Intelligence (AI) techniques            techniques
19                                            <NA>                  <NA>
20                                            <NA>                  <NA>
21                                         content               content
22                                            <NA>                  <NA>
23                                            <NA>                  <NA>
24                                natural language              language
25                                natural language              language
26                                            <NA>                  <NA>
27                              NLP-enhanced wikis                 wikis
28                              NLP-enhanced wikis                 wikis
29                              NLP-enhanced wikis                 wikis
30                              NLP-enhanced wikis                 wikis
31                                            <NA>                  <NA>
32                                            <NA>                  <NA>
33                                           users                 users
34                                            <NA>                  <NA>
35                                            <NA>                  <NA>
36                                            <NA>                  <NA>
37                                            <NA>                  <NA>
38                                            <NA>                  <NA>
39                                            <NA>                  <NA>
40                                       knowledge             knowledge
41                                            <NA>                  <NA>
42                                            <NA>                  <NA>
43                             the wiki repository            repository
44                             the wiki repository            repository
45                             the wiki repository            repository
46                                            <NA>                  <NA>
47                                            Paul                  Paul
48                                            <NA>                  <NA>
49                           a postgraduate degree                degree
50                           a postgraduate degree                degree
51                           a postgraduate degree                degree
52                                            <NA>                  <NA>
53                                             MIT                   MIT
54                                            <NA>                  <NA>
   noun_phrase_length start_token_id root_token_id
1                   3              1             3
2                   3              1             3
3                   3              1             3
4                  NA             NA            NA
5                   2              5             6
6                   2              5             6
7                  NA             NA            NA
8                   2              8             9
9                   2              8             9
10                 NA             NA            NA
11                 NA             NA            NA
12                  7             12            18
13                  7             12            18
14                  7             12            18
15                  7             12            18
16                  7             12            18
17                  7             12            18
18                  7             12            18
19                 NA             NA            NA
20                 NA             NA            NA
21                  1             21            21
22                 NA             NA            NA
23                 NA             NA            NA
24                  2             24            25
25                  2             24            25
26                 NA             NA            NA
27                  4              1             4
28                  4              1             4
29                  4              1             4
30                  4              1             4
31                 NA             NA            NA
32                 NA             NA            NA
33                  1              7             7
34                 NA             NA            NA
35                 NA             NA            NA
36                 NA             NA            NA
37                 NA             NA            NA
38                 NA             NA            NA
39                 NA             NA            NA
40                  1             14            14
41                 NA             NA            NA
42                 NA             NA            NA
43                  3             17            19
44                  3             17            19
45                  3             17            19
46                 NA             NA            NA
47                  1              1             1
48                 NA             NA            NA
49                  3              3             5
50                  3              3             5
51                  3              3             5
52                 NA             NA            NA
53                  1              7             7
54                 NA             NA            NA

kbenoit · 2018-11-05T14:55:04Z

I think it should operate just as entity does, by marking the start and end of the noun phrase, and then using an extract or consolidate function to extract or combine them. The problem with the format above is that it repeats the noun phrases across its components.

So with entity:

txt3 <- "We analyzed the Supreme Court using natural language processing." 
spacy_parse(txt3, entity = TRUE, nounphrase = FALSE)
#    doc_id sentence_id token_id      token      lemma   pos entity
# 1   text1           1        1         We     -PRON-  PRON       
# 2   text1           1        2   analyzed    analyze  VERB       
# 3   text1           1        3        the        the   DET  ORG_B
# 4   text1           1        4    Supreme    supreme PROPN  ORG_I
# 5   text1           1        5      Court      court PROPN  ORG_I
# 6   text1           1        6      using        use  VERB       
# 7   text1           1        7    natural    natural   ADJ       
# 8   text1           1        8   language   language  NOUN       
# 9   text1           1        9 processing processing  NOUN       
# 10  text1           1       10          .          . PUNCT

I think nounphrase = TRUE ought to return:

spacy_parse(txt3, entity = FALSE, nounphrase = TRUE)
#    doc_id sentence_id token_id      token      lemma   pos nounphrase
# 1   text1           1        1         We     -PRON-  PRON       
# 2   text1           1        2   analyzed    analyze  VERB       
# 3   text1           1        3        the        the   DET      np_beg
# 4   text1           1        4    Supreme    supreme PROPN      np_mid
# 5   text1           1        5      Court      court PROPN      np_end
# 6   text1           1        6      using        use  VERB       
# 7   text1           1        7    natural    natural   ADJ       
# 8   text1           1        8   language   language  NOUN       
# 9   text1           1        9 processing processing  NOUN       
# 10  text1           1       10          .          . PUNCT

then we use similar code to the entity functions extract and consolidate to define nounphrase_extract() and nounphrase_consolidate(). Either of the two consolidate functions would remove the tags from the other - e.g.

spacy_parse(txt3, entity = TRUE, nounphrase = TRUE) %>%
    entity_consolidate()

would remove the nounphrase column altogether (and vice-versa).

kbenoit · 2018-11-05T15:01:06Z

Need to add some tests as well...

This makes it more consistent with our (non) use of underscores in other parse arguments.

amatsuo · 2018-11-05T16:02:58Z

I understand that we want to have consistency.

However, in that case, we need to sacrifice the re-constructability of original noun phrases. For instance, in the example above, various Artificial Intelligence (AI) techniques is one of the noun phrases. Concatenations with spaces will provide spaces after/before the parentheses. The problem is the spacy_parse output does not have trailing space info.

One possibility is to include a flag for trailing spaces as a field. This will mean a loop over tokens has to be run. Maybe worth it, though.

amatsuo · 2018-11-05T16:18:02Z

And, also the information about the root token is important for some purposes. So the desirable output might be:

spacy_parse(txt3, entity = FALSE, nounphrase = TRUE)
#    doc_id sentence_id token_id      token      lemma   pos nounphrase
# 1   text1           1        1         We     -PRON-  PRON       
# 2   text1           1        2   analyzed    analyze  VERB       
# 3   text1           1        3        the        the   DET      beg
# 4   text1           1        4    Supreme    supreme PROPN      mid
# 5   text1           1        5      Court      court PROPN      end_root
# 6   text1           1        6      using        use  VERB       
# 7   text1           1        7    natural    natural   ADJ       
# 8   text1           1        8   language   language  NOUN       
# 9   text1           1        9 processing processing  NOUN       
# 10  text1           1       10          .          . PUNCT

np_ is jsut taking up memory, so we don't need it. Also, we might possibly add the trailing space info.

amatsuo · 2018-11-05T17:07:55Z

I implemented a new version of this option in noun-phrase-v2 branch. Thoughts?

Output:

txt <- c(doc1 = "Natural Language Processing is a branch of computer science that employs various Artificial Intelligence (AI) techniques to process content written in natural language. NLP-enhanced wikis can support users in finding, developing and organizing knowledge contained inside the wiki repository. ", 
  doc2 = "Paul earned a postgraduate degree from MIT.")
  (spacy_parse(txt, nounphrase = TRUE))
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.0.16, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
##    doc_id sentence_id token_id        token        lemma   pos entity
## 1    doc1           1        1      Natural      natural PROPN  ORG_B
## 2    doc1           1        2     Language     language PROPN  ORG_I
## 3    doc1           1        3   Processing   processing PROPN  ORG_I
## 4    doc1           1        4           is           be  VERB       
## 5    doc1           1        5            a            a   DET       
## 6    doc1           1        6       branch       branch  NOUN       
## 7    doc1           1        7           of           of   ADP       
## 8    doc1           1        8     computer     computer  NOUN       
## 9    doc1           1        9      science      science  NOUN       
## 10   doc1           1       10         that         that   ADJ       
## 11   doc1           1       11      employs       employ  VERB       
## 12   doc1           1       12      various      various   ADJ       
## 13   doc1           1       13   Artificial   artificial PROPN  ORG_B
## 14   doc1           1       14 Intelligence intelligence PROPN  ORG_I
## 15   doc1           1       15            (            ( PUNCT  ORG_I
## 16   doc1           1       16           AI           ai PROPN  ORG_I
## 17   doc1           1       17            )            ) PUNCT       
## 18   doc1           1       18   techniques    technique  NOUN       
## 19   doc1           1       19           to           to  PART       
## 20   doc1           1       20      process      process  VERB       
## 21   doc1           1       21      content      content  NOUN       
## 22   doc1           1       22      written        write  VERB       
## 23   doc1           1       23           in           in   ADP       
## 24   doc1           1       24      natural      natural   ADJ       
## 25   doc1           1       25     language     language  NOUN       
## 26   doc1           1       26            .            . PUNCT       
## 27   doc1           2        1          NLP          nlp PROPN  ORG_B
## 28   doc1           2        2            -            - PUNCT       
## 29   doc1           2        3     enhanced      enhance  VERB       
## 30   doc1           2        4        wikis         wiki  NOUN       
## 31   doc1           2        5          can          can  VERB       
## 32   doc1           2        6      support      support  VERB       
## 33   doc1           2        7        users         user  NOUN       
## 34   doc1           2        8           in           in   ADP       
## 35   doc1           2        9      finding         find  VERB       
## 36   doc1           2       10            ,            , PUNCT       
## 37   doc1           2       11   developing      develop  VERB       
## 38   doc1           2       12          and          and CCONJ       
## 39   doc1           2       13   organizing     organize  VERB       
## 40   doc1           2       14    knowledge    knowledge  NOUN       
## 41   doc1           2       15    contained      contain  VERB       
## 42   doc1           2       16       inside       inside   ADP       
## 43   doc1           2       17          the          the   DET       
## 44   doc1           2       18         wiki         wiki  NOUN       
## 45   doc1           2       19   repository   repository  NOUN       
## 46   doc1           2       20            .            . PUNCT       
## 47   doc2           1        1         Paul         paul PROPN  ORG_B
## 48   doc2           1        2       earned         earn  VERB       
## 49   doc2           1        3            a            a   DET       
## 50   doc2           1        4 postgraduate postgraduate  NOUN       
## 51   doc2           1        5       degree       degree  NOUN       
## 52   doc2           1        6         from         from   ADP       
## 53   doc2           1        7          MIT          mit PROPN  ORG_B
## 54   doc2           1        8            .            . PUNCT       
##    nounphrase whitespace
## 1         beg       TRUE
## 2         mid       TRUE
## 3    end_root       TRUE
## 4        <NA>       TRUE
## 5         beg       TRUE
## 6    end_root       TRUE
## 7        <NA>       TRUE
## 8         beg       TRUE
## 9    end_root       TRUE
## 10       <NA>       TRUE
## 11       <NA>       TRUE
## 12        beg       TRUE
## 13        mid       TRUE
## 14        mid       TRUE
## 15        mid      FALSE
## 16        mid      FALSE
## 17        mid       TRUE
## 18   end_root       TRUE
## 19       <NA>       TRUE
## 20       <NA>       TRUE
## 21   beg_root       TRUE
## 22       <NA>       TRUE
## 23       <NA>       TRUE
## 24        beg       TRUE
## 25   end_root      FALSE
## 26       <NA>       TRUE
## 27        beg      FALSE
## 28        mid      FALSE
## 29        mid       TRUE
## 30   end_root       TRUE
## 31       <NA>       TRUE
## 32       <NA>       TRUE
## 33   beg_root       TRUE
## 34       <NA>       TRUE
## 35       <NA>      FALSE
## 36       <NA>       TRUE
## 37       <NA>       TRUE
## 38       <NA>       TRUE
## 39       <NA>       TRUE
## 40   beg_root       TRUE
## 41       <NA>       TRUE
## 42       <NA>       TRUE
## 43        beg       TRUE
## 44        mid       TRUE
## 45   end_root      FALSE
## 46       <NA>       TRUE
## 47   beg_root       TRUE
## 48       <NA>       TRUE
## 49        beg       TRUE
## 50        mid       TRUE
## 51   end_root       TRUE
## 52       <NA>       TRUE
## 53   beg_root      FALSE
## 54       <NA>      FALSE

amatsuo · 2018-11-06T10:46:41Z

moved to #134 (comment)

kbenoit · 2018-11-08T16:24:25Z

Strangely I am seeing the following when running a local check:

N  checking R code for possible problems (3.1s)
   get_noun_phrases: no visible global function definition for ‘:=’
   get_noun_phrases: no visible binding for global variable ‘start_id’
   get_noun_phrases: no visible binding for global variable ‘root_id’
   nounphrase_extract.spacyr_parsed: no visible binding for global
     variable ‘nounphrase_id’
   nounphrase_extract.spacyr_parsed: no visible binding for global
     variable ‘token_space’
   nounphrase_extract.spacyr_parsed: no visible binding for global
     variable ‘token’
   spacy_extract_nounphrases.character: no visible binding for global
     variable ‘start_id’
   spacy_extract_nounphrases.character: no visible binding for global
     variable ‘root_id’
   spacy_parse.character: no visible binding for global variable ‘w_id’
   spacy_parse.character: no visible global function definition for ‘.’
   spacy_parse.character: no visible binding for global variable ‘root_id’
   spacy_parse.character: no visible binding for global variable ‘.N’
   spacy_parse.character: no visible binding for global variable
     ‘whitespace’
   Undefined global functions or variables:
     . .N := nounphrase_id root_id start_id token token_space w_id
     whitespace

and also the /inst/doc/ files ware getting deleted (I have not committed the deletions):

amatsuo added 18 commits June 4, 2018 15:24

check docname duplication

9749204

spacy_tokenize version 1

6cdddf4

multithread is implemented

88355da

implement remove_punct option in spacy_tokenize

56ceb75

add remove_url and remove_numbers

37983c2

implement padding option

53212dd

option to return data.frame

6fd5fb8

implement remove_whitespace_separators

45f55c6

bug fix. turn off multithread off when length(text) == 1

55d72d3

implement tokenize sentence option

4bbf189

add tokenizer benchmark result

4ec983a

update documentation

dc6cd1a

update documentation

1d5b222

implement extract_nounphrases list

295a5ce

implement extract_nounphrases data.frame

ec32380

implement python version check

35d178a

update documentation

872304c

Fix build issues

f495fbe

amatsuo requested a review from kbenoit June 8, 2018 10:33

kbenoit added 5 commits July 16, 2018 17:53

- Improve documentation

240b6ef

- Rename `type` to `output` - Some linting

rename and reorder remove_separators

1b17bfc

Fix reversed padding condition

fd70b31

Roxygen updates

c06fbcb

Add tests for spacy_tokenize()

09de74b

Many of these are not working - but that's part of test-driven development.

kbenoit reviewed Oct 7, 2018

View reviewed changes

amatsuo and others added 2 commits October 22, 2018 15:42

bug fix for the docnames

a0409a6

Add space to end

6e1c9e3

amatsuo and others added 6 commits October 26, 2018 12:07

implement remove_separators option to sentence tokenizer

dad8eb2

implement remove_symbols option in spacy_tokenize

50235b2

test for remove_symbols

d71a8e9

prepare merging origin/master

bb23abc

Merge remote-tracking branch 'origin/master' into tokenize-function # Conflicts: # inst/python/initialize_spacyPython.py

fix version check bugs

9322a00

Merge branch 'tokenize-function' of https://github.com/quanteda/spacyr …

3d99055

…into tokenize-function

kbenoit added 2 commits November 2, 2018 21:24

Remove mention of quanteda corpus input

152a30d

Add description

6e096a9

kbenoit mentioned this pull request Nov 3, 2018

Add nounphrase_consolidate() ? #134

Closed

amatsuo added 2 commits November 3, 2018 16:56

fix bugs in docnames for noun phrase extraction

7ce2e28

implement noun_phrase option in spacy_parse

f9608b6

kbenoit reviewed Nov 5, 2018

View reviewed changes

amatsuo added 2 commits November 5, 2018 11:55

more noun_phrase fields returned

b86ddf8

change returning field names

dc9f57a

Add imports, remove underscore in noun_phrase

1ad0ea8

This makes it more consistent with our (non) use of underscores in other parse arguments.

amatsuo mentioned this pull request Nov 6, 2018

Add a noun phrase field to spacy_parse() output #126

Closed

Expand argument description for nounphrase

586a55f

kbenoit merged commit 0e21059 into master Nov 12, 2018

This was referenced Nov 12, 2018

Revert "Tokenize and noun-phrase extraction" #135

Merged

Tokenize function #136

Closed

Tokenize function pull request #137

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tokenize and noun-phrase extraction #119

Tokenize and noun-phrase extraction #119

amatsuo commented Jun 8, 2018 •

edited

Loading

codecov-io commented Jun 8, 2018 •

edited

Loading

kbenoit commented Oct 7, 2018

kbenoit left a comment •

edited

Loading

amatsuo commented Oct 22, 2018

kbenoit commented Nov 2, 2018

kbenoit Nov 5, 2018

amatsuo commented Nov 5, 2018

kbenoit commented Nov 5, 2018 •

edited

Loading

kbenoit commented Nov 5, 2018

amatsuo commented Nov 5, 2018

amatsuo commented Nov 5, 2018

amatsuo commented Nov 5, 2018

amatsuo commented Nov 6, 2018 •

edited

Loading

kbenoit commented Nov 8, 2018 •

edited

Loading

Tokenize and noun-phrase extraction #119

Tokenize and noun-phrase extraction #119

Conversation

amatsuo commented Jun 8, 2018 • edited Loading

codecov-io commented Jun 8, 2018 • edited Loading

Codecov Report

kbenoit commented Oct 7, 2018

kbenoit left a comment • edited Loading

Choose a reason for hiding this comment

amatsuo commented Oct 22, 2018

kbenoit commented Nov 2, 2018

kbenoit Nov 5, 2018

Choose a reason for hiding this comment

amatsuo commented Nov 5, 2018

kbenoit commented Nov 5, 2018 • edited Loading

kbenoit commented Nov 5, 2018

amatsuo commented Nov 5, 2018

amatsuo commented Nov 5, 2018

amatsuo commented Nov 5, 2018

amatsuo commented Nov 6, 2018 • edited Loading

kbenoit commented Nov 8, 2018 • edited Loading

amatsuo commented Jun 8, 2018 •

edited

Loading

codecov-io commented Jun 8, 2018 •

edited

Loading

kbenoit left a comment •

edited

Loading

kbenoit commented Nov 5, 2018 •

edited

Loading

amatsuo commented Nov 6, 2018 •

edited

Loading

kbenoit commented Nov 8, 2018 •

edited

Loading